Build Your Personal AI Data Analyst
We have all been there. You have a messy CSV file, a deadline, and a head full of questions. But instead of finding answers, you find yourself trying to remember the exact syntax to reshape a Pandas DataFrame or change the colour of a Matplotlib bar chart. What if you could just ask your data questions? Today, I’ll show you how to build your own personal AI data analyst that uses an LLM to analyse data and give you a final answer.
Build Your Own Personal AI Data Analyst
Many “Chat with your Data” tools make a critical mistake: they try to make the AI answer the question directly. LLMs are bad at arithmetic. Ask one to calculate the average of 10,000 rows and it will confidently hallucinate a number.
The solution is not to ask the AI to do the math. Ask the AI to write the code that does the math.
Here is the architecture we are building:
- Ingest: Load any flat file (CSV, Excel, JSON).
- Detect: Algorithmically understand the column types (Numeric vs. Categorical).
- Reason: Convert natural language (“Show me the outliers”) into Python code (df.loc[z_score > 3]).
- Execute: Run that code in a sandboxed environment and return the result.
This approach combines the linguistic power of AI with the computational precision of Python.
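To make this concrete, here is the pattern in miniature. Everything below is illustrative (the sales column is a placeholder): the model’s job is to produce the one-line pandas snippet, and Python’s job is to compute the exact answer:

import pandas as pd

df = pd.DataFrame({"sales": [120.0, 95.5, 143.2]})

# what we want the LLM to produce: code, not an answer
generated = "result = df['sales'].mean()"

ns = {"df": df}
exec(generated, ns)   # Python does the arithmetic
print(ns["result"])   # 119.566..., computed exactly, not predicted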
Prerequisites
To follow along, you need a basic Python environment. You will need the following libraries:
pip install streamlit pandas matplotlib numpy scipy openpyxl
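If you want to keep dependencies isolated (optional, but tidy), create a virtual environment before installing:

python -m venv .venv
source .venv/bin/activate  # on Windows: .venv\Scripts\activate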
If you want to use the local LLM features for custom queries, install Ollama and pull a model like Llama 3:
ollama pull llama3.1
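Before wiring it into the app, you can confirm the model responds from the terminal:

ollama run llama3.1 "Reply with one word: ready"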
Building The Brain of Our Personal AI Data Analyst
Create a file named analyst.py. This file is the engine room: it handles all the heavy lifting. We will break the code down into logical chunks.
Step 1: Robust Data Loading
Data comes in messy formats. This section ensures that whether the user uploads a pure CSV or a binary stream, we can read it:
import io
import os
import re
import subprocess
import sys
import tempfile
import textwrap
from pathlib import Path

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Optional duckdb import (not required)
try:
    import duckdb
except Exception:
    duckdb = None
# ----------------- Load data -----------------
def _looks_like_csv(raw_bytes: bytes) -> bool:
    try:
        sample = raw_bytes[:1024].decode(errors="ignore")
    except Exception:
        return False
    return "," in sample and "\n" in sample

def load_data(file_or_path) -> pd.DataFrame:
    """
    Accepts Streamlit UploadedFile, path string/Path, or file-like object.
    Returns a pandas DataFrame.
    """
    if isinstance(file_or_path, (str, Path)):
        p = Path(file_or_path)
        s = p.suffix.lower()
        if s == ".csv":
            return pd.read_csv(p)
        if s in {".xls", ".xlsx"}:
            return pd.read_excel(p)
        if s == ".json":
            return pd.read_json(p)
        return pd.read_csv(p)
    # file-like (UploadedFile)
    name = getattr(file_or_path, "name", None)
    suffix = Path(name).suffix.lower() if name else None
    raw = file_or_path.read()
    if isinstance(raw, str):
        raw = raw.encode("utf-8")
    bio = io.BytesIO(raw)
    if suffix == ".csv" or (suffix is None and _looks_like_csv(raw)):
        bio.seek(0)
        return pd.read_csv(bio)
    if suffix in {".xls", ".xlsx"}:
        bio.seek(0)
        return pd.read_excel(bio)
    if suffix == ".json":
        bio.seek(0)
        return pd.read_json(bio)
    # fallback: try CSV first, then JSON
    bio.seek(0)
    try:
        return pd.read_csv(bio)
    except Exception:
        bio.seek(0)
        return pd.read_json(bio)
A good tool doesn’t complain about file formats. By checking both the file suffix and the raw bytes, we make the tool just work for the end user.
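As a quick sanity check, you can exercise load_data directly; the file name below is just a placeholder for any local file:

df = load_data("sales.csv")  # hypothetical file name
print(df.shape)
print(df.dtypes)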
Step 2: Deterministic Intelligence (Prompt Suggestions)
Before we even touch an LLM, we can be smart. We scan the DataFrame columns. If we see numbers, we suggest histograms. If we see dates, we suggest a time series:
def _detect_column_types(df: pd.DataFrame):
    numeric = df.select_dtypes(include=[np.number]).columns.tolist()
    datetime = []
    # try to infer datetime columns
    for c in df.columns:
        if np.issubdtype(df[c].dtype, np.datetime64):
            datetime.append(c)
        else:
            # try to parse a small sample as dates
            try:
                sample = df[c].dropna().astype(str).iloc[:20]
                parsed = pd.to_datetime(sample, errors="coerce")
                if parsed.notna().sum() >= max(1, min(5, len(sample) // 2)):
                    datetime.append(c)
            except Exception:
                pass
    # categoricals: low-cardinality, non-numeric, non-datetime
    categorical = [c for c in df.columns
                   if c not in numeric + datetime and df[c].nunique(dropna=True) <= 50]
    return {"numeric": numeric, "datetime": datetime, "categorical": categorical}
def suggest_prompts(df: pd.DataFrame, max_suggestions: int = 8):
    """
    Return a list of helpful, ready-to-run prompt strings for the dataset.
    Deterministic and works without any LLM.
    """
    types = _detect_column_types(df)
    numeric = types["numeric"]
    datetime = types["datetime"]
    categorical = types["categorical"]
    suggestions = []
    # Basic summary
    suggestions.append("Summarize the dataset in 5 bullet points (rows, columns, missing values, numeric columns, top categorical).")
    # Top value queries
    if categorical:
        col = categorical[0]
        suggestions.append(f"Show the top 10 counts for the categorical column '{col}'.")
    # Numeric summaries
    if numeric:
        suggestions.append("Show summary statistics (count, mean, std, min, 25%, 50%, 75%, max) for numeric columns.")
        col = numeric[0]
        suggestions.append(f"Create a histogram of the numeric column '{col}'.")
        if len(numeric) >= 2:
            suggestions.append(f"Create a scatter plot comparing '{numeric[0]}' (x) vs '{numeric[1]}' (y).")
        suggestions.append(f"Show the top 10 rows sorted by '{col}' descending.")
    # Time series
    if datetime:
        dcol = datetime[0]
        # choose a numeric column for aggregation if one exists
        ag = numeric[0] if numeric else None
        if ag:
            suggestions.append(f"Create a time series of monthly sum of '{ag}' using the datetime column '{dcol}'.")
        else:
            suggestions.append(f"Show counts per month using the datetime column '{dcol}'.")
    # Correlation
    if len(numeric) >= 2:
        suggestions.append("Show the correlation matrix heatmap for numeric columns.")
    # Anomalies
    suggestions.append("Find rows that look like anomalies using z-score > 3 on numeric columns and show top 20.")
    # limit suggestions
    return suggestions[:max_suggestions]
This is Low-Tech AI. It feels intelligent to the user because the system anticipates their needs, but it’s actually just solid software engineering.
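You can see this determinism for yourself with a toy DataFrame (illustrative data, not from the article):

import pandas as pd

toy = pd.DataFrame({
    "region": ["north", "south", "north"],
    "revenue": [120.0, 95.5, 143.2],
    "date": ["2024-01-05", "2024-02-11", "2024-03-02"],
})
print(_detect_column_types(toy))
# roughly: {'numeric': ['revenue'], 'datetime': ['date'], 'categorical': ['region']}
for s in suggest_prompts(toy):
    print("-", s)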
Step 3: The Translator
This is the safety layer. If the user asks for a standard summary or a correlation matrix, we don’t need an LLM. We can map that request directly to trusted, bug-free code:
def prompt_to_code(prompt: str, df: pd.DataFrame):
    """
    Convert known prompt templates into runnable python code strings.
    If the prompt is custom/unrecognized, return None (so the UI can send it to the LLM instead).
    """
    p = prompt.strip().lower()
    # Summary
    if p.startswith("summarize the dataset"):
        code = textwrap.dedent("""
            # produce a short summary as printed text
            info = []
            info.append(f"Rows: {len(df)}, Columns: {len(df.columns)}")
            info.append("Column types: " + ", ".join([f"{c}:{str(df[c].dtype)[:10]}" for c in df.columns[:10]]))
            miss = df.isnull().sum().sort_values(ascending=False).head(10)
            info.append("Top missing: " + ", ".join([f"{idx}:{val}" for idx, val in miss.items() if val > 0]))
            numeric = df.select_dtypes(include=['number']).columns.tolist()
            info.append(f"Numeric columns count: {len(numeric)}")
            # join into concise bullets
            result = "\\n".join(["- " + i for i in info])
        """)
        return code
    # Top counts for a categorical column
    if "top 10 counts for the categorical column" in p or ("top 10 counts" in p and "'" in p):
        # try to extract the column name between quotes
        m = re.search(r"'([^']+)'", prompt) or re.search(r'"([^"]+)"', prompt)
        col = m.group(1) if m else None
        if col:
            code = textwrap.dedent(f"""
                # top 10 counts for '{col}'
                result = df['{col}'].value_counts(dropna=False).head(10).reset_index()
                result.columns = ['value', 'count']
            """)
            return code
    # Summary statistics for numeric columns
    if "summary statistics" in p or "describe" in p:
        code = textwrap.dedent("""
            result = df.select_dtypes(include=['number']).describe().T
        """)
        return code
    # Histogram
    if p.startswith("create a histogram of the numeric column") or "histogram of the numeric column" in p:
        m = re.search(r"'([^']+)'", prompt)
        col = m.group(1) if m else None
        if col:
            # plotting code that uses plt; the execution engine will save the figure
            code = textwrap.dedent(f"""
                # histogram for '{col}'
                plt.figure(figsize=(6, 4))
                df['{col}'].dropna().astype(float).hist(bins=30)
                plt.title('Histogram of {col}')
                plt.xlabel('{col}')
                plt.ylabel('count')
                result_img_path = None
            """)
            return code
    # Scatter plot
    if "scatter plot comparing" in p and "vs" in p:
        m = re.search(r"'([^']+)' \(x\) vs '([^']+)' \(y\)", prompt)
        if m:
            xcol, ycol = m.group(1), m.group(2)
            # pass ax=plt.gca() so pandas draws on the figure we just created
            code = textwrap.dedent(f"""
                plt.figure(figsize=(6, 4))
                df.plot.scatter(x='{xcol}', y='{ycol}', ax=plt.gca())
                plt.title('{ycol} vs {xcol}')
                result_img_path = None
            """)
            return code
    # Top N rows sorted by a column
    if p.startswith("show the top 10 rows sorted by"):
        m = re.search(r"by '([^']+)'", prompt)
        if m:
            col = m.group(1)
            code = textwrap.dedent(f"""
                result = df.sort_values('{col}', ascending=False).head(10).reset_index(drop=True)
            """)
            return code
    # Time series: monthly sum
    if "monthly sum" in p and "using the datetime column" in p:
        m = re.search(r"sum of '([^']+)' using the datetime column '([^']+)'", prompt)
        if m:
            ag, dcol = m.group(1), m.group(2)
            code = textwrap.dedent(f"""
                tmp = df.copy()
                tmp['{dcol}'] = pd.to_datetime(tmp['{dcol}'], errors='coerce')
                res = tmp.dropna(subset=['{dcol}'])
                res = res.set_index('{dcol}').resample('M')['{ag}'].sum().reset_index()
                result = res
            """)
            return code
    # Counts per month (datetime only)
    if "counts per month using the datetime column" in p:
        m = re.search(r"datetime column '([^']+)'", prompt)
        dcol = m.group(1) if m else None
        if dcol:
            code = textwrap.dedent(f"""
                tmp = df.copy()
                tmp['{dcol}'] = pd.to_datetime(tmp['{dcol}'], errors='coerce')
                result = tmp.dropna(subset=['{dcol}']).set_index('{dcol}').resample('M').size().reset_index(name='count')
            """)
            return code
    # Correlation heatmap
    if "correlation matrix heatmap" in p or "correlation heatmap" in p:
        code = textwrap.dedent("""
            corr = df.select_dtypes(include=['number']).corr()
            plt.figure(figsize=(6, 5))
            plt.imshow(corr, cmap='viridis', aspect='auto')
            plt.colorbar()
            plt.xticks(range(len(corr)), corr.columns, rotation=90)
            plt.yticks(range(len(corr)), corr.columns)
            plt.title('Correlation matrix')
            result_img_path = None
        """)
        return code
    # Anomaly detection using z-score
    if "anomalies" in p and "z-score" in p:
        code = textwrap.dedent("""
            from scipy import stats
            num = df.select_dtypes(include=['number']).dropna()
            if num.shape[1] == 0:
                result = pd.DataFrame()
            else:
                z = np.abs(stats.zscore(num))
                mask = (z > 3).any(axis=1)
                # index back into the original frame via num's index,
                # since dropna() may have removed rows
                result = df.loc[num.index[mask]].head(20).reset_index(drop=True)
        """)
        return code
    # Unknown / custom prompts -> return None so the UI can fall back to the LLM
    return None
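For example, a recognised prompt comes back as a ready-to-run code string, while anything unrecognised returns None so the UI can fall back to the LLM (a quick illustration):

print(prompt_to_code("Show summary statistics (count, mean, std, min, 25%, 50%, 75%, max) for numeric columns.", df))
# -> "result = df.select_dtypes(include=['number']).describe().T"
print(prompt_to_code("What drives customer churn?", df))
# -> None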
Step 4: The Execution Engine
This is the key part of our code. We take the string of code, execute it, and capture the output (text or image):
def run_code(df: pd.DataFrame, code: str):
    """
    Execute a code string in a dedicated namespace.
    Returns a dict:
      - {"type": "text", "output": ...}
      - {"type": "dataframe", "df": pandas.DataFrame}
      - {"type": "image", "path": path_to_png}
    Execution conventions:
      - If the code sets a variable `result` to a DataFrame or string, we return it.
      - If the code uses matplotlib to plot, we save the current figure to a temp PNG and return the image.
      - If the code raises, we return the error text.
    """
    # prepare the namespace; passing it as globals means list comprehensions
    # inside the executed code can still see `df`, `pd`, etc.
    ns = {"pd": pd, "np": np, "df": df, "plt": plt}
    # capture prints
    old_stdout = sys.stdout
    stdout_buf = io.StringIO()
    sys.stdout = stdout_buf
    try:
        exec(code, ns)
        # if the code set result_img_path explicitly, prefer it
        if ns.get("result_img_path"):
            return {"type": "image", "path": ns["result_img_path"]}
        # otherwise, if plotting occurred (plt has a current figure), save it
        figs = plt.get_fignums()
        if figs:
            with tempfile.NamedTemporaryFile(delete=False, suffix=".png") as f:
                plt.savefig(f.name, bbox_inches="tight", dpi=150)
            plt.close("all")
            return {"type": "image", "path": f.name}
        # check for a result variable
        if "result" in ns:
            res = ns["result"]
            if isinstance(res, pd.DataFrame):
                return {"type": "dataframe", "df": res}
            return {"type": "text", "output": str(res)}
        # otherwise, return captured stdout
        out = stdout_buf.getvalue().strip()
        if out:
            return {"type": "text", "output": out}
        return {"type": "text", "output": "Execution finished. No result produced."}
    except Exception as e:
        return {"type": "text", "output": f"Execution error: {e}"}
    finally:
        sys.stdout = old_stdout
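You can test the engine without the UI by feeding it a hand-written snippet (a minimal sketch; any code that sets result or draws with plt will do):

out = run_code(df, "result = df.describe().T")
if out["type"] == "dataframe":
    print(out["df"])
else:
    print(out)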
Step 5: The LLM Connector
If the user asks a unique question (like “Calculate the moving average of Sales”), the deterministic rules won’t work. We send the schema to Ollama (a local LLM runner) and ask it to write the Python code:
def ask_llm(prompt: str, model: str = "llama3.1", timeout: int = 60) -> str:
    """
    Send a prompt to local Ollama via the CLI. Returns stdout text.
    If ollama is not installed or fails, returns an error string starting with [LLM-...].
    We expect the model to return code inside ```python blocks.
    """
    try:
        proc = subprocess.run(
            ["ollama", "run", model],
            input=prompt.encode("utf-8"),
            stdout=subprocess.PIPE,
            stderr=subprocess.PIPE,
            timeout=timeout,
        )
        out = proc.stdout.decode("utf-8", errors="replace")
        err = proc.stderr.decode("utf-8", errors="replace")
        if not out and err:
            return f"[LLM-error] {err}"
        return out
    except FileNotFoundError:
        return "[LLM-missing] ollama not found on PATH."
    except Exception as e:
        return f"[LLM-failed] {e}"
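The app extracts the generated code with a simple string split (shown in the next section). If you want something slightly more defensive, a small regex helper like this one (my own addition, not part of the app) works too:

def extract_python_block(llm_output: str):
    # return the first ```python ... ``` code block, or None if absent
    m = re.search(r"```python\s*(.*?)```", llm_output, flags=re.DOTALL)
    return m.group(1).strip() if m else None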
Building The Interface for Our Personal AI Data Analyst
Now, create a file named app.py. We will use Streamlit for the frontend, which lets us build a data dashboard in pure Python without writing any HTML or CSS:
import streamlit as st
from analyst import load_data, suggest_prompts, prompt_to_code, run_code, ask_llm
import pandas as pd
st.set_page_config(page_title="Personal AI Data Analyst", layout="wide")
st.title("🧠 Personal AI Data Analyst — Interactive Dashboard")
st.sidebar.header("Settings")
use_llm = st.sidebar.checkbox("Use local LLM (ollama) for custom prompts", value=False)
llm_model = st.sidebar.text_input("LLM model name (ollama)", value="llama3.1")
st.sidebar.markdown("If you don't have `ollama` installed, leave this off and use built-in prompts.")
uploaded = st.file_uploader("Upload CSV, Excel, or JSON", type=["csv","xls","xlsx","json"])
if uploaded is None:
    st.info("Upload a CSV / XLSX / JSON to get started. Suggestions will appear automatically.")
    st.stop()
# Load data
try:
    df = load_data(uploaded)
except Exception as e:
    st.error(f"Failed to load file: {e}")
    st.stop()
st.success("File loaded.")
with st.expander("Preview data (first 100 rows)"):
    st.dataframe(df.head(100))
# Generate suggestions
suggestions = suggest_prompts(df)
st.markdown("## Suggested analyses (pick one or write your own)")
col1, col2 = st.columns([3, 1])
with col1:
    selected = st.selectbox("Choose a suggested prompt", options=suggestions)
    custom = st.text_area("Or write a custom prompt (leave blank to use the selected suggestion)", height=80)
with col2:
    st.markdown("**Quick actions**")
    if st.button("Show suggestions again"):
        st.write(suggestions)
# Determine final prompt
final_prompt = custom.strip() if custom and custom.strip() else selected
st.markdown("### Final prompt")
st.write(final_prompt)
# Run button
if st.button("Run analysis"):
    with st.spinner("Running..."):
        # First try deterministic conversion
        code = prompt_to_code(final_prompt, df)
        if code:
            res = run_code(df, code)
        else:
            # No deterministic code found. If the user requested the LLM, send the prompt.
            if use_llm:
                # craft a system instruction that asks for python in a ```python``` block that uses df, pd, plt
                system = (
                    "You are a helpful data analyst and will respond with Python code only.\n"
                    "You must return code inside a ```python ... ``` block. The DataFrame is named `df`.\n"
                    "Use pandas for data manipulation and matplotlib for charts. Do not import heavy libs.\n"
                    "If returning a chart, produce matplotlib code that draws the figure (no show()) and nothing else.\n"
                )
                raw = system + "\n# User prompt: " + final_prompt
                llm_out = ask_llm(raw, model=llm_model)
                if llm_out.startswith("[LLM-"):
                    st.warning("LLM unavailable or returned an error. Falling back to built-in behavior is not possible for this custom prompt.")
                    st.write(llm_out)
                    st.stop()
                # attempt to extract the python block
                if "```python" in llm_out:
                    try:
                        code = llm_out.split("```python")[1].split("```")[0]
                        res = run_code(df, code)
                    except Exception as e:
                        st.error(f"Failed to execute code from LLM: {e}")
                        st.write(llm_out)
                        st.stop()
                else:
                    st.error("LLM did not return a python code block. Showing raw LLM output:")
                    st.write(llm_out)
                    st.stop()
            else:
                st.error("This is a custom prompt that the app cannot deterministically convert to code. Enable 'Use local LLM' in the sidebar to let a local model generate Python, or edit your prompt to match one of the suggested patterns.")
                st.stop()
    # Display result (still inside the button block, so `res` is defined)
    if res["type"] == "text":
        st.markdown("#### Output (text)")
        st.text(res["output"])
    elif res["type"] == "dataframe":
        st.markdown("#### Output (table)")
        st.dataframe(res["df"])
        # Provide CSV download
        csv = res["df"].to_csv(index=False).encode("utf-8")
        st.download_button("Download result as CSV", data=csv, file_name="result.csv", mime="text/csv")
    elif res["type"] == "image":
        st.markdown("#### Output (chart)")
        st.image(res["path"], use_column_width=True)
    else:
        st.write("Unknown result type", res)
Now, open your terminal and run:
streamlit run app.py
A local web server will start, and your default browser will open the application. Upload a CSV file (like the Titanic dataset or your monthly expenses) and watch the system generate questions for you automatically. You can then run any suggested analysis, or write a custom prompt and let the local LLM generate the code for it.
Why This Matters for You
In the world of Data Science, there is a fear that AI will replace analysts. This project demonstrates why that is unlikely. AI doesn’t know what makes sense contextually; it just predicts tokens. By building this tool, you aren’t being replaced; you are becoming the architect.
By building this Personal AI Data Analyst, you are moving from the person who types the code to the person who designs the system that writes the code. This shift allows you to focus on the why of the data, the business logic, the bias, and the implications, rather than the syntax of the plot.
I hope you liked this article on how to build your own personal AI data analyst that uses an LLM to analyse data. Follow me on Instagram for many more resources.