Exploratory Data Analysis using SQL

Data Scientists use SQL for Exploratory Data Analysis (EDA) to query, filter, aggregate, and extract specific subsets of data efficiently when working directly with large databases, especially in relational databases like MySQL or PostgreSQL. If you want to learn how to perform Exploratory Data Analysis using SQL, this article will guide you. In this article, I’ll solve a problem using SQL to demonstrate how to perform Exploratory Data Analysis.

Getting Started: Storing Data in Your SQL Server

To start with Exploratory Data Analysis using SQL, you must first store your data in an SQL server. Download the data we are using from here and follow these steps:

  1. Step 1: Log into your SQL server and create a database to store your data using a command: CREATE DATABASE your_database_name;
  2. Step 2: Define the table schema based on your data structure using a CREATE TABLE statement, specifying column names, data types, and constraints.
  3. Step 3: Use SQL tools like LOAD DATA INFILE (MySQL) or database interfaces like pgAdmin (PostgreSQL) to import your data. Below is an example for the data we are using:
CREATE TABLE instagram_data (
    Post_ID VARCHAR(255),
    Account_ID VARCHAR(255),
    Account_username VARCHAR(255),
    Account_name VARCHAR(255),
    Description TEXT,
    Duration_secs INT,
    Publish_time DATETIME,
    Permalink VARCHAR(255),
    Post_type VARCHAR(255),
    Data_comment VARCHAR(255),
    Date VARCHAR(50),
    Impressions INT,
    Reach INT,
    Likes INT,
    Shares INT,
    Follows INT,
    Comments INT,
    Saves INT,
    Plays INT
);

LOAD DATA INFILE '/path_to_your_csv_file.csv'
INTO TABLE instagram_data
FIELDS TERMINATED BY ',' 
ENCLOSED BY '"'
LINES TERMINATED BY '\n'
IGNORE 1 ROWS
(Post_ID, Account_ID, Account_username, Account_name, Description, Duration_secs, Publish_time, Permalink, Post_type, Data_comment, Date, Impressions, Reach, Likes, Shares, Follows, Comments, Saves, Plays);

After completing these steps, you can begin Exploratory Data Analysis using SQL.

Exploratory Data Analysis using SQL

Now, let’s get started with Exploratory Data Analysis using SQL. First, we will fetch the first five rows of the dataset:

SELECT * 
FROM instagram_data 
LIMIT 5;

The output will show you the first five rows of the data just like df.head() in pandas.

Next, let’s have a look at the summary statistics of the data:

SELECT 
    COUNT(*) AS Total_Posts,
    AVG(Impressions) AS Avg_Impressions,
    MAX(Likes) AS Max_Likes,
    MIN(Likes) AS Min_Likes
FROM instagram_data;
Total_Posts   Avg_Impressions    Max_Likes    Min_Likes
------------ ----------------- ----------- -----------
100 18000.45 1200 50

In the above SQL query, COUNT(*) counts total rows, AVG(Impressions) calculates the average impressions, MAX(Likes) and MIN(Likes) retrieve the maximum and minimum likes.

Identifying Missing Data

Now, let’s see if the data has any missing values:

SELECT 
    COUNT(*) AS Total_Rows,
    SUM(CASE WHEN Plays IS NULL THEN 1 ELSE 0 END) AS Missing_Plays,
    SUM(CASE WHEN Data_comment IS NULL THEN 1 ELSE 0 END) AS Missing_Data_Comment
FROM instagram_data;
Total_Rows   Missing_Plays   Missing_Data_Comment
----------- -------------- --------------------
100 30 50

In the above query, SUM(CASE … WHEN NULL) counts the number of missing values in each column.

Analyzing Distribution

Next, let’s have a look at the post type distribution:

SELECT 
    Post_type, 
    COUNT(*) AS Frequency
FROM instagram_data
GROUP BY Post_type;
Post_type      Frequency
-------------- ----------
IG carousel 60
IG image 40
IG reel 20

In the above SQL query, GROUP BY Post_type groups rows by post type and COUNT(*) calculates the frequency for each type.

Analyzing Engagement Metrics

Now, let’s have a look at the engagement metrics:

SELECT 
    Post_ID,
    Likes / NULLIF(Impressions, 0) AS Likes_Per_Impression,
    Saves / NULLIF(Impressions, 0) AS Saves_Per_Impression
FROM instagram_data
LIMIT 5;
Post_ID         Likes_Per_Impression   Saves_Per_Impression
--------------- --------------------- ---------------------
1.799668e+16 0.0367 0.0441
1.800604e+16 0.0343 0.0313
1.829417e+16 0.0035 0.0045
1.809139e+16 0.0319 0.0321
1.796449e+16 0.0410 0.0586

This query calculates two engagement metrics for each post: the ratio of likes to impressions (Likes_Per_Impression) and the ratio of saves to impressions (Saves_Per_Impression). The NULLIF(Impressions, 0) function ensures that division by zero is avoided by returning NULL if Impressions is zero. The query retrieves the Post_ID along with these calculated metrics for the first five rows of the instagram_data table, providing insights into how effectively each post converts impressions into likes and saves.

Analyzing Trends over Time

Next, let’s have a look at the trends over time:

SELECT 
    DATE(Publish_time) AS Date,
    SUM(Impressions) AS Total_Impressions,
    SUM(Likes) AS Total_Likes
FROM instagram_data
GROUP BY DATE(Publish_time)
ORDER BY Date ASC;
Date         Total_Impressions   Total_Likes
----------- ------------------ -----------
2024-01-01 30000 1500
2024-01-02 25000 1200
2024-03-28 18000 800
2024-03-29 50000 2300

This query calculates the total impressions and likes for each day by grouping the posts based on their Publish_time date. The DATE(Publish_time) function extracts just the date part (excluding the time) from the Publish_time column. Using SUM(Impressions) and SUM(Likes), it aggregates the total impressions and likes for each date. The results are then sorted in ascending order of the date using ORDER BY Date ASC, providing a daily summary of engagement metrics.

Identifying Outliers

The last step in this EDA is to detect outliers:

SELECT * 
FROM instagram_data
WHERE Impressions > (
    SELECT AVG(Impressions) + 2 * STDDEV(Impressions) FROM instagram_data
) 
OR Impressions < (
    SELECT AVG(Impressions) - 2 * STDDEV(Impressions) FROM instagram_data
);
Post_ID         Impressions   Likes
--------------- ------------ ------
1.829417e+16 500000 900
1.794940e+16 419927 1011

This query identifies outlier posts based on their impressions. It calculates the mean (AVG(Impressions)) and standard deviation (STDDEV(Impressions)) of the Impressions column, then selects posts with impressions greater than two standard deviations above or below the mean. By comparing each post’s impressions to these thresholds, the query isolates posts that significantly overperform or underperform relative to the dataset’s average.

Summary

This is how to perform Exploratory Data Analysis using SQL. Data Scientists rely on SQL for EDA to query, filter, aggregate, and extract specific subsets of data efficiently when working directly with large databases, especially in relational databases like MySQL or PostgreSQL. I hope you found this article on performing Exploratory Data Analysis using SQL helpful.