Automate Data Cleaning using Python


If you’ve worked on any real-world data project, you already know the truth: most of the time isn’t spent building models; it’s spent cleaning messy data. As a Data Engineer or Data Scientist, you’re often handed raw, inconsistent, duplicate-ridden datasets, and cleaning them from scratch every time is not only repetitive but also a waste of time. That’s why automating your data cleaning steps is one of the most valuable productivity boosts you can build. So, in this article, I’ll take you through how to write a reusable function to automate data cleaning using Python.

Automate Data Cleaning using Python

Let’s start by importing the Python libraries we need to write a function that automates data cleaning for any dataset:
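
import pandas as pd
import numpy as np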

That’s it. No more dependencies. Just clean, reliable Pandas and NumPy.

Now, we will write a Python function to:

  1. Standardize column names
  2. Remove duplicate rows
  3. Clean messy text values
  4. Handle missing values (even disguised ones!)
  5. Flag weird stuff like constant columns and outliers
  6. Prepare your data for modelling by converting text columns smartly

Here’s how to write such a function:

import pandas as pd
import numpy as np

def clean_dataframe(df: pd.DataFrame, verbose: bool = True) -> pd.DataFrame:
    df = df.copy()

    # log helper
    def log(msg):
        if verbose:
            print(f"[INFO] {msg}")

    # 1. standardize column names
    df.columns = df.columns.str.strip().str.lower().str.replace(' ', '_')
    log("Standardized column names.")

    # 2. remove exact duplicates
    dup_count = df.duplicated().sum()
    if dup_count > 0:
        df.drop_duplicates(inplace=True)
        log(f"Removed {dup_count} duplicate rows.")

    # 3. trim and lowercase all string (object) values
    for col in df.select_dtypes(include='object'):
        df[col] = df[col].astype(str).str.strip().str.lower()
    log("Standardized string columns (lowercase + trimmed).")

    # 4. detect missing values (including blanks and placeholders)
    placeholder_values = ['n/a', 'na', '--', '-', 'none', 'null', '', 'nan']
    df.replace(placeholder_values, np.nan, inplace=True)
    null_report = df.isnull().sum()
    null_report = null_report[null_report > 0]
    if not null_report.empty:
        log(f"Missing values found in columns:\n{null_report}")

    # 5. flag constant columns
    constant_cols = [col for col in df.columns if df[col].nunique() == 1]
    if constant_cols:
        log(f"Constant columns (consider removing): {constant_cols}")

    # 6. flag high cardinality categorical columns
    high_card_cols = [col for col in df.select_dtypes(include='object') if df[col].nunique() > 100]
    if high_card_cols:
        log(f"High-cardinality columns (consider encoding strategies): {high_card_cols}")

    # 7. detect numeric outliers using IQR
    num_cols = df.select_dtypes(include=np.number).columns
    outlier_report = {}
    for col in num_cols:
        q1, q3 = df[col].quantile([0.25, 0.75])
        iqr = q3 - q1
        lower = q1 - 1.5 * iqr
        upper = q3 + 1.5 * iqr
        outliers = df[(df[col] < lower) | (df[col] > upper)][col].count()
        if outliers > 0:
            outlier_report[col] = outliers
    if outlier_report:
        log(f"Potential numeric outliers detected:\n{outlier_report}")

    # 8. convert applicable columns to category
    for col in df.select_dtypes(include='object'):
        n_unique = df[col].nunique()
        if n_unique < len(df) * 0.05:
            df[col] = df[col].astype('category')
    log("Converted suitable object columns to category dtype.")

    log("Data cleaning complete.")
    return df

Here’s what this Python function does:

  1. Copies the data to keep your original safe.
  2. Standardizes column names (lowercase, no spaces).
  3. Removes duplicate rows if any.
  4. Cleans string columns (trims and lowercases all text).
  5. Handles missing values (turns placeholders like "--", "N/A", and empty strings into NaN).
  6. Reports missing values in each column.
  7. Flags constant columns (same value in all rows).
  8. Identifies high-cardinality text columns (too many unique values).
  9. Detects numeric outliers using the IQR method (flagging values below Q1 − 1.5×IQR or above Q3 + 1.5×IQR).
  10. Converts low-unique text columns to category type.
  11. Logs every step if verbose=True.
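
Before running it on a real dataset, here’s a quick sanity check on a small toy DataFrame (made up purely for illustration) so you can see the duplicate removal, placeholder handling, and column renaming in action:

import pandas as pd

toy = pd.DataFrame({
    ' Customer Name ': ['  Alice ', 'BOB', 'BOB', 'n/a'],
    'Loan Amount': [1000, 2500, 2500, 3000],
    'Status': ['active', 'active', 'active', 'active']
})

cleaned = clean_dataframe(toy)
print(cleaned.columns.tolist())           # ['customer_name', 'loan_amount', 'status']
print(cleaned['customer_name'].tolist())  # ['alice', 'bob', nan]

The duplicate 'BOB' row is dropped, the messy column names are standardized, and the 'n/a' placeholder becomes a proper NaN, while 'status' gets flagged as a constant column in the logs.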

Now, let’s use this function on a dataset (download it here):

import pandas as pd
df = pd.read_csv('/content/loan-recovery.csv')
clean_df = clean_dataframe(df)

Output:

[INFO] Standardized column names.
[INFO] Standardized string columns (lowercase + trimmed).
[INFO] High-cardinality columns (consider encoding strategies): ['borrower_id', 'loan_id']
[INFO] Potential numeric outliers detected:
{'outstanding_loan_amount': np.int64(3), 'monthly_emi': np.int64(35), 'num_missed_payments': np.int64(21), 'collection_attempts': np.int64(40)}
[INFO] Converted suitable object columns to category dtype.
[INFO] Data cleaning complete.
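
And if you want to run it silently, say inside a scheduled pipeline, just switch off the logging with the verbose flag:

clean_df = clean_dataframe(df, verbose=False)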

You can extend this function into full data engineering projects by:

  • Imputing missing values (see the sketch after this list)
  • Exporting logs to a file
  • Integrating into ETL workflows or Airflow DAGs
  • Saving cleaned data to a data lake or S3
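
For the imputation idea, here’s a minimal sketch assuming median fills for numeric columns and mode fills for text/category columns; the impute_missing helper is my own illustration, not part of the function above:

import pandas as pd
import numpy as np

def impute_missing(df: pd.DataFrame) -> pd.DataFrame:
    df = df.copy()
    # fill numeric NaNs with the column median
    for col in df.select_dtypes(include=np.number):
        df[col] = df[col].fillna(df[col].median())
    # fill text/category NaNs with the most frequent value
    for col in df.select_dtypes(include=['object', 'category']):
        mode = df[col].mode()
        if not mode.empty:
            df[col] = df[col].fillna(mode[0])
    return df

imputed_df = impute_missing(clean_df)

Median and mode are just safe defaults here; in a production pipeline you’d usually pick an imputation strategy per column.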

Summary

So, in this article, we explored how to automate data cleaning using Python by building a reusable function with Pandas and NumPy. This function helps standardize column names, remove duplicates, clean text, handle missing values (even hidden ones), detect outliers, flag constant and high-cardinality columns, and prepare text data for modelling, all while logging each step.