Build an AI Agent to Automate Your Research

← Previous Article All Articles Next Article →

If you’re a student or working in AI, you probably feel overwhelmed by the amount of information out there. The answers exist, but sorting through and making sense of them takes a lot of time. Imagine if you could create an AI agent to handle that first round of research for you. In this guide, we’ll build an AI agent to automate research that takes your question, searches the web, reads the top results, and gives you the most relevant passages along with a short summary.

How will our AI Agent for Research Work?

The agent we’re building is a fantastic example of a simple Retrieval-Augmented system. The core idea isn’t just to find pages with the right keywords (like a simple Ctrl+F), but to find passages with the right meaning.

We will use vector embeddings. Think of an embedding as a coordinate of meaning in a vast, high-dimensional space. The sentence-transformers library provides models that are experts at turning any piece of text into a list of numbers (a vector) that represents its location in that meaning space.

Our agent’s entire job is to:

  1. Turn your query into a vector.
  2. Search the web, scrape the text, and turn all the content into more vectors.
  3. Find the text vectors whose coordinates are closest to your query’s coordinates.

That’s it. It’s just very clever math, and you can build it in about 100 lines of Python.

Now, Let’s Build an AI Agent to Automate Your Research

First, create a new Python file and install the dependencies:

pip install ddgs requests beautifulsoup4 sentence-transformers numpy 

Step 1:  Import Libraries and Configure Parameters

The top of your file will define all imports and key settings:

import re
import urllib.parse
from ddgs import DDGS          # package name is 'ddgs' (duckduckgo_search renamed)
import requests
from bs4 import BeautifulSoup
from sentence_transformers import SentenceTransformer
import numpy as np
import time

Next, configure constants like search results, summary size, and embedding model:

SEARCH_RESULTS = 6        # How many URLs to check
PASSAGES_PER_PAGE = 4     # How many passages to pull from each URL
EMBEDDING_MODEL = "sentence-transformers/all-MiniLM-L6-v2" # Fast, high-quality model
TOP_PASSAGES = 5          # How many relevant passages to use for the summary
SUMMARY_SENTENCES = 3     # How many sentences in the final summary
TIMEOUT = 8               # How long to wait for a webpage to load

Step 2: Search and Fetch Web Pages

We’ll use DuckDuckGo Search for free, API-less search results:

def unwrap_ddg(url):
    """If DuckDuckGo returns a redirect wrapper, extract the real URL."""
    try:
        parsed = urllib.parse.urlparse(url)
        if "duckduckgo.com" in parsed.netloc:
            qs = urllib.parse.parse_qs(parsed.query)
            uddg = qs.get("uddg")
            if uddg:
                return urllib.parse.unquote(uddg[0])
    except Exception:
        pass
    return url

def search_web(query, max_results=SEARCH_RESULTS):
    """Search the web and return a list of URLs."""
    urls = []
    with DDGS() as ddgs:
        for r in ddgs.text(query, max_results=max_results):
            url = r.get("href") or r.get("url")
            if not url:
                continue
            url = unwrap_ddg(url) # Clean up DDG redirect links
            urls.append(url)
    return urls

Then, fetch and clean the page with requests and BeautifulSoup:

def fetch_text(url, timeout=TIMEOUT):
    """Fetch and clean text content from a URL."""
    headers = {"User-Agent": "Mozilla/5.0 (research-agent)"}
    try:
        r = requests.get(url, timeout=timeout, headers=headers, allow_redirects=True)
        if r.status_code != 200:
            return ""
        ct = r.headers.get("content-type", "")
        if "html" not in ct.lower(): # Skip non-HTML content
            return ""
        
        soup = BeautifulSoup(r.text, "html.parser")
        
        # Remove all annyoing tags
        for tag in soup(["script", "style", "noscript", "header", "footer", "svg", "iframe", "nav", "aside"]):
            tag.extract()
            
        # Get all paragraph text
        paragraphs = [p.get_text(" ", strip=True) for p in soup.find_all("p")]
        text = " ".join([p for p in paragraphs if p])
        
        if text.strip():
            # Clean up whitespace
            return re.sub(r"\s+", " ", text).strip()
            
        # --- Fallback logic if <p> tags fail ---
        meta = soup.find("meta", attrs={"name": "description"}) or soup.find("meta", attrs={"property": "og:description"})
        if meta and meta.get("content"):
            return meta["content"].strip()
        if soup.title and soup.title.string:
            return soup.title.string.strip()
            
    except Exception:
        return "" # Fail silently
    return ""

Step 3: Chunk, Embed, and Rank Passages

We’ll break long articles into smaller passages and embed them using SentenceTransformer:

def chunk_passages(text, max_words=120):
    """Split long text into smaller passages."""
    words = text.split()
    if not words:
        return []
    chunks = []
    i = 0
    while i < len(words):
        chunk = words[i : i + max_words]
        chunks.append(" ".join(chunk))
        i += max_words
    return chunks

def split_sentences(text):
    """A simple sentence splitter."""
    parts = re.split(r'(?<=[.!?])\s+', text)
    return [p.strip() for p in parts if p.strip()]
  
class ShortResearchAgent:
    def __init__(self, embed_model=EMBEDDING_MODEL):
        print(f"Loading embedder: {embed_model}...")
        # This downloads the model on first run
        self.embedder = SentenceTransformer(embed_model)

    def run(self, query):
        start = time.time()
        
        # 1. Search
        urls = search_web(query)
        print(f"Found {len(urls)} urls.")
        
        # 2. Fetch & Chunk
        docs = []
        for u in urls:
            txt = fetch_text(u)
            if not txt:
                continue
            chunks = chunk_passages(txt, max_words=120)
            for c in chunks[:PASSAGES_PER_PAGE]:
                docs.append({"url": u, "passage": c})
        
        if not docs:
            print("No documents fetched.")
            return {"query": query, "passages": [], "summary": ""}

Now, initialize the embedding model and compute cosine similarity:

# 3. Embed (Turn text into numbers)
        texts = [d["passage"] for d in docs]
        emb_texts = self.embedder.encode(texts, convert_to_numpy=True, show_progress_bar=False)
        q_emb = self.embedder.encode([query], convert_to_numpy=True)[0]
        
        # 4. Rank (Find similarity)
        def cosine(a, b): 
            return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-10)
            
        sims = [cosine(e, q_emb) for e in emb_texts]
        top_idx = np.argsort(sims)[::-1][:TOP_PASSAGES]
        top_passages = [{"url": docs[i]["url"], "passage": docs[i]["passage"], "score": float(sims[i])} for i in top_idx]

Step 4: Generate a Mini Summary

Now, extract and rank sentences within top passages to form a concise summary:

        # 5. Summarize (Extractive)
        sentences = []
        for tp in top_passages:
            for s in split_sentences(tp["passage"]):
                sentences.append({"sent": s, "url": tp["url"]})
        
        if not sentences:
            summary = "No summary could be generated."
        else:
            sent_texts = [s["sent"] for s in sentences]
            sent_embs = self.embedder.encode(sent_texts, convert_to_numpy=True, show_progress_bar=False)
            sent_sims = [cosine(e, q_emb) for e in sent_embs]
            
            top_sent_idx = np.argsort(sent_sims)[::-1][:SUMMARY_SENTENCES]
            chosen = [sentences[idx] for idx in top_sent_idx]

            # De-duplicate and format
            seen = set()
            lines = []
            for s in chosen:
                key = s["sent"].lower()[:80] # Check first 80 chars for duplication
                if key in seen:
                    continue
                seen.add(key)
                lines.append(f"{s['sent']} (Source: {s['url']})")
            summary = " ".join(lines)

        elapsed = time.time() - start
        return {"query": query, "passages": top_passages, "summary": summary, "time": elapsed}

Step 5: Run the Agent

Finally, run the AI agent with any query:

if __name__ == "__main__":
    agent = ShortResearchAgent()
    q = "What causes urban heat islands and how can cities reduce them?"
    
    print(f"Running query: {q}\n")
    out = agent.run(q)
    
    print("\nTop passages:")
    for p in out["passages"]:
        print(f"- score {p['score']:.3f} src {p['url']}\n  {p['passage'][:200]}...\n")
        
    print("--- Extractive summary ---")
    print(out["summary"])
    print("--------------------------")
    print(f"\nDone in {out['time']:.1f}s")
Output

Final Words

You have just built the core logic of RAG, the architecture behind many of the most powerful GenAI systems today. First, you built the Retriever (Search, Fetch, Chunk); next, you built the Ranker (Embed, Cosine Similarity); and then, you built a simple Generator (the extractive summarizer).

The journey from a simple script to a powerful AI system is just a series of small, understandable steps. You just took the first and most important one.