Build an AI Agent to Automate Your Research
If you’re a student or working in AI, you probably feel overwhelmed by the amount of information out there. The answers exist, but sorting through and making sense of them takes a lot of time. Imagine if you could create an AI agent to handle that first round of research for you. In this guide, we’ll build an AI agent to automate research that takes your question, searches the web, reads the top results, and gives you the most relevant passages along with a short summary.
How will our AI Agent for Research Work?
The agent we’re building is a fantastic example of a simple Retrieval-Augmented system. The core idea isn’t just to find pages with the right keywords (like a simple Ctrl+F), but to find passages with the right meaning.
We will use vector embeddings. Think of an embedding as a coordinate of meaning in a vast, high-dimensional space. The sentence-transformers library provides models that are experts at turning any piece of text into a list of numbers (a vector) that represents its location in that meaning space.
Our agent’s entire job is to:
- Turn your query into a vector.
- Search the web, scrape the text, and turn all the content into more vectors.
- Find the text vectors whose coordinates are closest to your query’s coordinates.
That’s it. It’s just very clever math, and you can build it in about 100 lines of Python.
Now, Let’s Build an AI Agent to Automate Your Research
First, create a new Python file and install the dependencies:
pip install ddgs requests beautifulsoup4 sentence-transformers numpy
Step 1: Import Libraries and Configure Parameters
The top of your file will define all imports and key settings:
import re
import urllib.parse
from ddgs import DDGS # package name is 'ddgs' (duckduckgo_search renamed)
import requests
from bs4 import BeautifulSoup
from sentence_transformers import SentenceTransformer
import numpy as np
import time
Next, configure constants like search results, summary size, and embedding model:
SEARCH_RESULTS = 6 # How many URLs to check
PASSAGES_PER_PAGE = 4 # How many passages to pull from each URL
EMBEDDING_MODEL = "sentence-transformers/all-MiniLM-L6-v2" # Fast, high-quality model
TOP_PASSAGES = 5 # How many relevant passages to use for the summary
SUMMARY_SENTENCES = 3 # How many sentences in the final summary
TIMEOUT = 8 # How long to wait for a webpage to load
Step 2: Search and Fetch Web Pages
We’ll use DuckDuckGo Search for free, API-less search results:
def unwrap_ddg(url):
"""If DuckDuckGo returns a redirect wrapper, extract the real URL."""
try:
parsed = urllib.parse.urlparse(url)
if "duckduckgo.com" in parsed.netloc:
qs = urllib.parse.parse_qs(parsed.query)
uddg = qs.get("uddg")
if uddg:
return urllib.parse.unquote(uddg[0])
except Exception:
pass
return url
def search_web(query, max_results=SEARCH_RESULTS):
"""Search the web and return a list of URLs."""
urls = []
with DDGS() as ddgs:
for r in ddgs.text(query, max_results=max_results):
url = r.get("href") or r.get("url")
if not url:
continue
url = unwrap_ddg(url) # Clean up DDG redirect links
urls.append(url)
return urls
Then, fetch and clean the page with requests and BeautifulSoup:
def fetch_text(url, timeout=TIMEOUT):
"""Fetch and clean text content from a URL."""
headers = {"User-Agent": "Mozilla/5.0 (research-agent)"}
try:
r = requests.get(url, timeout=timeout, headers=headers, allow_redirects=True)
if r.status_code != 200:
return ""
ct = r.headers.get("content-type", "")
if "html" not in ct.lower(): # Skip non-HTML content
return ""
soup = BeautifulSoup(r.text, "html.parser")
# Remove all annyoing tags
for tag in soup(["script", "style", "noscript", "header", "footer", "svg", "iframe", "nav", "aside"]):
tag.extract()
# Get all paragraph text
paragraphs = [p.get_text(" ", strip=True) for p in soup.find_all("p")]
text = " ".join([p for p in paragraphs if p])
if text.strip():
# Clean up whitespace
return re.sub(r"\s+", " ", text).strip()
# --- Fallback logic if <p> tags fail ---
meta = soup.find("meta", attrs={"name": "description"}) or soup.find("meta", attrs={"property": "og:description"})
if meta and meta.get("content"):
return meta["content"].strip()
if soup.title and soup.title.string:
return soup.title.string.strip()
except Exception:
return "" # Fail silently
return ""
Step 3: Chunk, Embed, and Rank Passages
We’ll break long articles into smaller passages and embed them using SentenceTransformer:
def chunk_passages(text, max_words=120):
"""Split long text into smaller passages."""
words = text.split()
if not words:
return []
chunks = []
i = 0
while i < len(words):
chunk = words[i : i + max_words]
chunks.append(" ".join(chunk))
i += max_words
return chunks
def split_sentences(text):
"""A simple sentence splitter."""
parts = re.split(r'(?<=[.!?])\s+', text)
return [p.strip() for p in parts if p.strip()]
class ShortResearchAgent:
def __init__(self, embed_model=EMBEDDING_MODEL):
print(f"Loading embedder: {embed_model}...")
# This downloads the model on first run
self.embedder = SentenceTransformer(embed_model)
def run(self, query):
start = time.time()
# 1. Search
urls = search_web(query)
print(f"Found {len(urls)} urls.")
# 2. Fetch & Chunk
docs = []
for u in urls:
txt = fetch_text(u)
if not txt:
continue
chunks = chunk_passages(txt, max_words=120)
for c in chunks[:PASSAGES_PER_PAGE]:
docs.append({"url": u, "passage": c})
if not docs:
print("No documents fetched.")
return {"query": query, "passages": [], "summary": ""}
Now, initialize the embedding model and compute cosine similarity:
# 3. Embed (Turn text into numbers)
texts = [d["passage"] for d in docs]
emb_texts = self.embedder.encode(texts, convert_to_numpy=True, show_progress_bar=False)
q_emb = self.embedder.encode([query], convert_to_numpy=True)[0]
# 4. Rank (Find similarity)
def cosine(a, b):
return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-10)
sims = [cosine(e, q_emb) for e in emb_texts]
top_idx = np.argsort(sims)[::-1][:TOP_PASSAGES]
top_passages = [{"url": docs[i]["url"], "passage": docs[i]["passage"], "score": float(sims[i])} for i in top_idx]
Step 4: Generate a Mini Summary
Now, extract and rank sentences within top passages to form a concise summary:
# 5. Summarize (Extractive)
sentences = []
for tp in top_passages:
for s in split_sentences(tp["passage"]):
sentences.append({"sent": s, "url": tp["url"]})
if not sentences:
summary = "No summary could be generated."
else:
sent_texts = [s["sent"] for s in sentences]
sent_embs = self.embedder.encode(sent_texts, convert_to_numpy=True, show_progress_bar=False)
sent_sims = [cosine(e, q_emb) for e in sent_embs]
top_sent_idx = np.argsort(sent_sims)[::-1][:SUMMARY_SENTENCES]
chosen = [sentences[idx] for idx in top_sent_idx]
# De-duplicate and format
seen = set()
lines = []
for s in chosen:
key = s["sent"].lower()[:80] # Check first 80 chars for duplication
if key in seen:
continue
seen.add(key)
lines.append(f"{s['sent']} (Source: {s['url']})")
summary = " ".join(lines)
elapsed = time.time() - start
return {"query": query, "passages": top_passages, "summary": summary, "time": elapsed}
Step 5: Run the Agent
Finally, run the AI agent with any query:
if __name__ == "__main__":
agent = ShortResearchAgent()
q = "What causes urban heat islands and how can cities reduce them?"
print(f"Running query: {q}\n")
out = agent.run(q)
print("\nTop passages:")
for p in out["passages"]:
print(f"- score {p['score']:.3f} src {p['url']}\n {p['passage'][:200]}...\n")
print("--- Extractive summary ---")
print(out["summary"])
print("--------------------------")
print(f"\nDone in {out['time']:.1f}s")

Final Words
You have just built the core logic of RAG, the architecture behind many of the most powerful GenAI systems today. First, you built the Retriever (Search, Fetch, Chunk); next, you built the Ranker (Embed, Cosine Similarity); and then, you built a simple Generator (the extractive summarizer).
The journey from a simple script to a powerful AI system is just a series of small, understandable steps. You just took the first and most important one.