Smart News Aggregator – Cut Through the Noise with AI

Thanks For Commenting On Our Post!
We’re excited to share this comprehensive guide with you. This resource includes best practices and real-world implementation strategies that we use at slashdev when building apps for clients worldwide.
What’s Inside This Guide:
- Why traditional news feeds fail you – and how AI curation fixes it
- The smart aggregation architecture – RSS parsing, sentiment analysis, and deduplication
- Production-ready Python code – feedparser, NLP models, and intelligent ranking
- Step-by-step deployment – from local testing to automated daily digests
- Key learnings – building context-aware content filters that actually work
Overview:
Your news feed is designed to keep you scrolling, not informed. Every outlet competes for clicks with sensational headlines, duplicate stories get recycled across dozens of sources, and the algorithm shows you what drives engagement – not what matters to you.
Here’s the uncomfortable truth: you’re not consuming news. You’re consuming noise optimized for ad revenue.
The Real Problem
A typical news consumer sees:
- 73% duplicate stories from different outlets saying the same thing
- Clickbait headlines engineered to trigger emotional reactions
- No context about why a story matters or how biased the source is
- Doomscroll design that surfaces negativity because fear keeps you engaged
You know this. You’ve felt it. That exhaustion after 30 minutes of “catching up on news” and realizing you learned nothing useful.
Smart aggregation flips this model. Instead of feeding you everything, it filters intelligently based on relevance, bias detection, sentiment analysis, and topic clustering.
The Solution: AI-Powered Curation
Traditional RSS readers just collect articles. Smart aggregators understand them.
Here’s what we’re building:
- RSS parsing that pulls from trusted sources (BBC, Reuters, TechCrunch, Hacker News)
- Sentiment analysis to flag overly negative or sensational headlines
- Topic clustering to group related stories and eliminate duplicates
- Bias detection to identify partisan language and one-sided framing
- Relevance ranking based on your actual interests, not engagement metrics
The result? A personalized dashboard that shows 10 meaningful stories instead of 100 mediocre ones.
How It Actually Works
The architecture is surprisingly straightforward (a compact code sketch of the full flow follows the four steps below):
Step 1: Collection. Feedparser pulls RSS feeds from your configured sources every hour. You’re not scraping websites – you’re using the standardized feeds they already publish.
Step 2: Analysis. Each article runs through an NLP pipeline:
- Extract headline and summary text
- Analyze sentiment (positive/negative/neutral scores)
- Detect emotional manipulation keywords
- Generate embeddings for semantic similarity
Step 3: Filtering. Articles get scored on multiple dimensions:
- Relevance to your topic preferences (tech, business, science)
- Sentiment balance (not too doom-y, not too promotional)
- Source credibility (you set the trust levels)
- Uniqueness (deduplicated against recent stories)
Step 4: Presentation. Your dashboard shows the top-ranked articles with:
- Clean summaries (no clickbait rewrites)
- Source attribution and bias indicators
- Related story clusters
- Reading time estimates
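To make the flow concrete before the full code, here is a compact sketch of how the four steps chain together. It uses the same class and function names as the implementations in the Practical Code Examples section below, so treat it as a preview rather than additional machinery:
# pipeline_sketch.py – minimal preview of the four-step flow (full code in the next section)
from news_fetcher import NewsFetcher, SOURCES        # Step 1: collection
from sentiment_analyzer import SentimentAnalyzer     # Step 2: analysis
from topic_clusterer import TopicClusterer           # Step 3: filtering & deduplication

fetcher = NewsFetcher(SOURCES)
analyzer = SentimentAnalyzer()
clusterer = TopicClusterer()

articles = fetcher.fetch_all(hours_back=24)                           # collect the last 24 hours
analyzed = [(a, analyzer.analyze(a)) for a in articles]               # sentiment, clickbait, bias
keepers = [a for a, result in analyzed if result['is_trustworthy']]   # drop manipulative items
clusters = clusterer.cluster_articles(keepers)                        # group near-duplicate stories
top_stories = clusterer.get_best_from_clusters(clusters)              # Step 4: what you actually read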
Why This Matters Now
We’re drowning in information but starving for insight. The average person sees 5,000 marketing messages per day. News has become just another marketing channel.
This aggregator isn’t about filtering out perspectives you disagree with—it’s about removing manipulation, redundancy, and noise. You’ll still see multiple viewpoints on important stories. You just won’t see 47 identical takes on the same press release.
Practical Code Examples
1. RSS Feed Parser with Multi-Source Support
# news_fetcher.py
import feedparser
from datetime import datetime, timedelta
from typing import List, Dict
import hashlib

class NewsSource:
    def __init__(self, name: str, url: str, category: str, credibility: float):
        self.name = name
        self.url = url
        self.category = category
        self.credibility = credibility  # 0.0 to 1.0 score

SOURCES = [
    NewsSource("BBC News", "http://feeds.bbci.co.uk/news/rss.xml", "world", 0.9),
    NewsSource("Reuters", "https://www.reutersagency.com/feed/", "world", 0.95),
    NewsSource("TechCrunch", "https://techcrunch.com/feed/", "tech", 0.75),
    NewsSource("Hacker News", "https://hnrss.org/frontpage", "tech", 0.8),
    NewsSource("MIT Tech Review", "https://www.technologyreview.com/feed/", "tech", 0.85),
]

class Article:
    def __init__(self, title: str, link: str, summary: str, source: NewsSource,
                 published: datetime):
        self.title = title
        self.link = link
        self.summary = summary
        self.source = source
        self.published = published
        self.content_hash = self._generate_hash()

    def _generate_hash(self) -> str:
        """Generate unique hash for deduplication"""
        content = f"{self.title}{self.summary}".lower()
        return hashlib.md5(content.encode()).hexdigest()

class NewsFetcher:
    def __init__(self, sources: List[NewsSource]):
        self.sources = sources
        self.seen_hashes = set()

    def fetch_all(self, hours_back: int = 24) -> List[Article]:
        """Fetch articles from all sources"""
        all_articles = []
        cutoff_time = datetime.now() - timedelta(hours=hours_back)
        for source in self.sources:
            try:
                articles = self._fetch_from_source(source, cutoff_time)
                all_articles.extend(articles)
                print(f"✓ Fetched {len(articles)} from {source.name}")
            except Exception as e:
                print(f"✗ Error fetching {source.name}: {e}")
        return all_articles

    def _fetch_from_source(self, source: NewsSource,
                           cutoff: datetime) -> List[Article]:
        """Fetch and parse articles from a single source"""
        feed = feedparser.parse(source.url)
        articles = []
        for entry in feed.entries:
            # Parse publish date; fall back to "now" when the feed omits it
            published = datetime(*entry.published_parsed[:6]) \
                if getattr(entry, 'published_parsed', None) else datetime.now()
            if published < cutoff:
                continue
            # Extract content
            title = entry.get('title', 'No title')
            link = entry.get('link', '')
            summary = entry.get('summary', entry.get('description', ''))
            article = Article(title, link, summary, source, published)
            # Deduplicate across sources by content hash
            if article.content_hash not in self.seen_hashes:
                articles.append(article)
                self.seen_hashes.add(article.content_hash)
        return articles

# Usage
if __name__ == "__main__":
    fetcher = NewsFetcher(SOURCES)
    articles = fetcher.fetch_all(hours_back=24)
    print(f"\nTotal unique articles: {len(articles)}")
2. Sentiment Analysis and Bias Detection
# sentiment_analyzer.py
from transformers import pipeline
import re
from typing import Dict, List

class SentimentAnalyzer:
    def __init__(self):
        # Using DistilBERT for efficiency
        self.sentiment_model = pipeline(
            "sentiment-analysis",
            model="distilbert-base-uncased-finetuned-sst-2-english"
        )
        # Emotional manipulation keywords
        self.clickbait_patterns = [
            r'\b(shocking|unbelievable|insane|mind-blowing)\b',
            r'\b(you won\'t believe|wait until you see)\b',
            r'\b(destroys|slams|blasts|crushes)\b',
            r'\d+ (things|reasons|ways|secrets)',
            r'this one (trick|weird trick)',
        ]
        # Bias indicator words
        self.bias_keywords = {
            'left': ['progressive', 'social justice', 'equity', 'systemic'],
            'right': ['traditional', 'freedom', 'patriot', 'liberty'],
            'sensational': ['unprecedented', 'explosive', 'bombshell', 'crisis']
        }

    def analyze(self, article) -> Dict:
        """Comprehensive sentiment and bias analysis"""
        text = f"{article.title}. {article.summary}"
        # Sentiment score
        sentiment_result = self.sentiment_model(text[:512])[0]
        sentiment_score = (sentiment_result['score']
                           if sentiment_result['label'] == 'POSITIVE'
                           else -sentiment_result['score'])
        # Clickbait detection
        clickbait_score = self._detect_clickbait(article.title)
        # Bias detection
        bias_indicators = self._detect_bias(text)
        # Overall manipulation score
        manipulation_score = (clickbait_score * 0.6) + (abs(sentiment_score) * 0.4)
        return {
            'sentiment_score': sentiment_score,  # -1 to 1
            'sentiment_label': ('positive' if sentiment_score > 0.1
                                else 'negative' if sentiment_score < -0.1
                                else 'neutral'),
            'clickbait_score': clickbait_score,  # 0 to 1
            'bias_indicators': bias_indicators,
            'manipulation_score': manipulation_score,  # 0 to 1
            'is_trustworthy': manipulation_score < 0.5
        }

    def _detect_clickbait(self, title: str) -> float:
        """Detect clickbait patterns in headlines"""
        title_lower = title.lower()
        matches = sum(1 for pattern in self.clickbait_patterns
                      if re.search(pattern, title_lower, re.IGNORECASE))
        return min(matches * 0.3, 1.0)

    def _detect_bias(self, text: str) -> Dict[str, int]:
        """Count bias indicator keywords"""
        text_lower = text.lower()
        indicators = {}
        for bias_type, keywords in self.bias_keywords.items():
            count = sum(1 for keyword in keywords if keyword in text_lower)
            if count > 0:
                indicators[bias_type] = count
        return indicators

# Usage (guarded so importing this module doesn't load the model twice)
if __name__ == "__main__":
    analyzer = SentimentAnalyzer()
    # result = analyzer.analyze(article)
3. Topic Clustering and Deduplication
# topic_clusterer.py
from sentence_transformers import SentenceTransformer
from sklearn.cluster import DBSCAN
from sklearn.metrics.pairwise import cosine_similarity
import numpy as np
from typing import List, Dict

class TopicClusterer:
    def __init__(self):
        # Lightweight embedding model
        self.model = SentenceTransformer('all-MiniLM-L6-v2')

    def cluster_articles(self, articles: List, similarity_threshold: float = 0.75) -> Dict:
        """Group similar articles into topic clusters"""
        if not articles:
            return {}
        # Generate embeddings
        texts = [f"{a.title}. {a.summary}" for a in articles]
        embeddings = self.model.encode(texts)
        # Cluster using DBSCAN on cosine distance
        clustering = DBSCAN(
            eps=1 - similarity_threshold,
            min_samples=2,
            metric='cosine'
        ).fit(embeddings)
        # Organize into clusters
        clusters = {}
        for idx, label in enumerate(clustering.labels_):
            if label == -1:  # Noise points are treated as unique articles
                label = f"unique_{idx}"
            if label not in clusters:
                clusters[label] = []
            clusters[label].append(articles[idx])
        return self._rank_clusters(clusters, embeddings, clustering.labels_)

    def _rank_clusters(self, clusters: Dict, embeddings: np.ndarray,
                       labels: np.ndarray) -> Dict:
        """Rank articles within each cluster by quality"""
        ranked_clusters = {}
        for cluster_id, cluster_articles in clusters.items():
            if len(cluster_articles) == 1:
                ranked_clusters[cluster_id] = cluster_articles
                continue
            # Get embeddings for this cluster
            indices = [i for i, label in enumerate(labels)
                       if label == cluster_id or f"unique_{i}" == cluster_id]
            cluster_embeddings = embeddings[indices]
            # Calculate centroid
            centroid = np.mean(cluster_embeddings, axis=0)
            # Rank by similarity to centroid + source credibility
            scores = []
            for i, article in enumerate(cluster_articles):
                similarity = cosine_similarity(
                    [cluster_embeddings[i]],
                    [centroid]
                )[0][0]
                credibility = article.source.credibility
                score = (similarity * 0.6) + (credibility * 0.4)
                scores.append((score, article))
            # Sort by score descending (key avoids comparing Article objects on ties)
            ranked_clusters[cluster_id] = [
                article for _, article in sorted(scores, key=lambda pair: pair[0], reverse=True)
            ]
        return ranked_clusters

    def get_best_from_clusters(self, clusters: Dict, max_per_cluster: int = 1) -> List:
        """Extract top articles from each cluster"""
        best_articles = []
        for cluster_articles in clusters.values():
            best_articles.extend(cluster_articles[:max_per_cluster])
        return best_articles

# Usage (guarded so importing this module doesn't load the model twice)
if __name__ == "__main__":
    clusterer = TopicClusterer()
    # clusters = clusterer.cluster_articles(articles)
    # best = clusterer.get_best_from_clusters(clusters)
4. Complete News Aggregator Pipeline
# main.py
from news_fetcher import NewsFetcher, SOURCES
from sentiment_analyzer import SentimentAnalyzer
from topic_clusterer import TopicClusterer
from datetime import datetime
from typing import List
import json

class SmartNewsAggregator:
    def __init__(self, user_preferences: dict = None):
        self.fetcher = NewsFetcher(SOURCES)
        self.analyzer = SentimentAnalyzer()
        self.clusterer = TopicClusterer()
        self.preferences = user_preferences or {
            'categories': ['tech', 'world'],
            'max_negative_sentiment': -0.5,
            'min_credibility': 0.7,
            'max_articles': 15
        }

    def generate_daily_digest(self) -> dict:
        """Main pipeline: fetch → analyze → cluster → rank"""
        print("📰 Fetching articles...")
        articles = self.fetcher.fetch_all(hours_back=24)

        print("🔍 Analyzing content...")
        analyzed_articles = []
        for article in articles:
            analysis = self.analyzer.analyze(article)
            # Filter based on preferences
            if not self._meets_criteria(article, analysis):
                continue
            article.analysis = analysis
            analyzed_articles.append(article)
        print(f"✓ {len(analyzed_articles)} articles passed filters")

        print("🧩 Clustering topics...")
        clusters = self.clusterer.cluster_articles(analyzed_articles)
        best_articles = self.clusterer.get_best_from_clusters(clusters)

        # Final ranking
        ranked = self._final_ranking(best_articles)[:self.preferences['max_articles']]
        return self._format_digest(ranked)

    def _meets_criteria(self, article, analysis: dict) -> bool:
        """Check if article meets user preferences"""
        # Category filter
        if article.source.category not in self.preferences['categories']:
            return False
        # Sentiment filter
        if analysis['sentiment_score'] < self.preferences['max_negative_sentiment']:
            return False
        # Credibility filter
        if article.source.credibility < self.preferences['min_credibility']:
            return False
        # Quality filter
        if not analysis['is_trustworthy']:
            return False
        return True

    def _final_ranking(self, articles: List) -> List:
        """Final scoring based on multiple factors"""
        scored = []
        for article in articles:
            score = (
                article.source.credibility * 0.3 +
                (1 - article.analysis['manipulation_score']) * 0.3 +
                (article.analysis['sentiment_score'] + 1) / 2 * 0.2 +  # Normalize to 0-1
                0.2  # Recency bonus (could be time-based)
            )
            scored.append((score, article))
        # Sort by score descending (key avoids comparing Article objects on ties)
        return [article for _, article in sorted(scored, key=lambda pair: pair[0], reverse=True)]

    def _format_digest(self, articles: List) -> dict:
        """Format for output/display"""
        return {
            'generated_at': datetime.now().isoformat(),
            'total_articles': len(articles),
            'articles': [
                {
                    'title': a.title,
                    'source': a.source.name,
                    'link': a.link,
                    'summary': a.summary[:200] + '...',
                    'published': a.published.isoformat(),
                    'sentiment': a.analysis['sentiment_label'],
                    'credibility': a.source.credibility,
                    'quality_score': round(1 - a.analysis['manipulation_score'], 2)  # higher is better
                }
                for a in articles
            ]
        }

# Run the aggregator
if __name__ == "__main__":
    preferences = {
        'categories': ['tech', 'world'],
        'max_negative_sentiment': -0.3,
        'min_credibility': 0.75,
        'max_articles': 10
    }
    aggregator = SmartNewsAggregator(preferences)
    digest = aggregator.generate_daily_digest()

    print(f"\n📊 Your Smart News Digest ({digest['total_articles']} articles)")
    print("=" * 60)
    for i, article in enumerate(digest['articles'], 1):
        print(f"\n{i}. {article['title']}")
        print(f"   Source: {article['source']} | Sentiment: {article['sentiment']}")
        print(f"   {article['link']}")

    # Save to JSON
    with open('news_digest.json', 'w') as f:
        json.dump(digest, f, indent=2)
    print("\n✓ Digest saved to news_digest.json")
How to Run:
Initial Setup (One Time)
1. Install Dependencies
pip install feedparser transformers sentence-transformers scikit-learn torch
2. Pre-download Required Models (optional; they also download automatically on first use)
python -c "from transformers import pipeline; pipeline('sentiment-analysis')"
python -c "from sentence_transformers import SentenceTransformer; SentenceTransformer('all-MiniLM-L6-v2')"
Running the Aggregator
Basic Usage:
python main.py
With Custom Preferences:
python
>>> from main import SmartNewsAggregator
>>> prefs = {'categories': ['tech'], 'max_articles': 5}
>>> aggregator = SmartNewsAggregator(prefs)
>>> digest = aggregator.generate_daily_digest()
Automated Daily Digest (cron job):
# Add to crontab (run daily at 7 AM)
0 7 * * * cd /path/to/project && python main.py
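One practical caveat, assuming you installed the dependencies in a virtual environment: cron runs with a minimal environment, so point the entry at that environment's interpreter explicitly and capture output in a log. The paths below are placeholders to adjust for your setup:
# Same schedule, but with an explicit interpreter path and a log file
0 7 * * * cd /path/to/project && /path/to/project/venv/bin/python main.py >> /path/to/project/digest.log 2>&1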
Testing Individual Components
Test RSS Fetching:
python news_fetcher.py
Test Sentiment Analysis:
python
>>> from sentiment_analyzer import SentimentAnalyzer
>>> from news_fetcher import Article, NewsSource
>>> from datetime import datetime
>>> analyzer = SentimentAnalyzer()
>>> # Create an illustrative test source and article, then analyze
>>> source = NewsSource("Test Source", "https://example.com/feed", "tech", 0.8)
>>> article = Article("Shocking: you won't believe this one trick", "https://example.com/post", "A short test summary.", source, datetime.now())
>>> analyzer.analyze(article)
Test Clustering:
python topic_clusterer.py
Output Format
The aggregator generates news_digest.json:
{
  "generated_at": "2024-12-11T09:00:00",
  "total_articles": 10,
  "articles": [...]
}
You can build a simple web dashboard or email digest using this JSON output.
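As one example of consuming that JSON, here is a minimal sketch of a plain-text email digest. The SMTP host, credentials, and addresses are placeholders you would supply:
# email_digest.py – hypothetical sketch: send news_digest.json as a plain-text email
import json
import smtplib
from email.message import EmailMessage

with open('news_digest.json') as f:
    digest = json.load(f)

# One numbered entry per article: title, source, and link
body = "\n\n".join(
    f"{i}. {a['title']} ({a['source']})\n{a['link']}"
    for i, a in enumerate(digest['articles'], 1)
)

msg = EmailMessage()
msg['Subject'] = f"Smart News Digest – {digest['total_articles']} stories"
msg['From'] = 'digest@example.com'      # placeholder sender
msg['To'] = 'you@example.com'           # placeholder recipient
msg.set_content(body)

with smtplib.SMTP('smtp.example.com', 587) as server:   # placeholder SMTP server
    server.starttls()
    server.login('digest@example.com', 'app-password')  # placeholder credentials
    server.send_message(msg)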
Key Concepts
You’ve now built a production-ready news aggregator using four core AI techniques:
- RSS feed parsing with multi-source support for standardized content collection
- Transformer-based sentiment analysis for detecting emotional manipulation and clickbait
- Semantic embeddings with DBSCAN clustering for intelligent deduplication and topic grouping
- Multi-factor ranking that balances credibility, quality, and relevance
Implementing this system teaches you practical NLP pipeline design, content filtering strategies, and how to build context-aware recommendation systems – skills that apply to any content curation or information filtering challenge. The key is not just filtering content, but learning to evaluate information quality algorithmically and to build systems that prioritize signal over noise.
About slashdev.io
At slashdev.io, we’re a global software engineering company specializing in building production web and mobile applications. We combine cutting-edge LLM technologies (Claude Code, Gemini, Grok, ChatGPT) with traditional tech stacks like ReactJS, Laravel, iOS, and Flutter to deliver exceptional results.
What sets us apart:
- Expert developers at $50/hour
- AI-powered development workflows for enhanced productivity
- Full-service engineering support, not just code
- Experience building real production applications at scale
Whether you’re building your next app or need expert developers to join your team, we provide ongoing developer relationships that go beyond one-time assessments.
Need Development Support?
Building something ambitious? We’d love to help. Our team specializes in turning ideas into production-ready applications using the latest AI-powered development techniques combined with solid engineering fundamentals.
