SQL Data Cleaning Methods
Data cleaning is an essential step in the process of solving a Data Science problem to ensure the data’s quality and reliability. If you are learning Data Science and want to understand how to clean your data using SQL, this article is for you. In this article, I’ll take you through a detailed practical guide on SQL data cleaning methods you should know for Data Science.
SQL Data Cleaning Methods
Below are some essential SQL data cleaning methods you should know for Data Science:
- Identifying Missing Data
- Removing Duplicate Records
- Handling Missing Data
- Standardizing Data
- Correcting Data Entry Errors
Let’s understand all these SQL data cleaning methods in detail one by one.
Identifying Missing Data
Before cleaning, you must identify missing or NULL values. You can use the IS NULL condition to find missing data. For example, look at the table given below:
| SaleID | ProductName | SaleAmount | SaleDate |
|---|---|---|---|
| 1 | Laptop | 1000 | 2024-08-01 |
| 2 | NULL | 500 | 2024-08-02 |
| 3 | Phone | NULL | 2024-08-03 |
| 4 | Tablet | 300 | NULL |
Here’s how to identify missing data from the above table:
SELECT *
FROM Sales
WHERE ProductName IS NULL OR SaleAmount IS NULL OR SaleDate IS NULL;
| SaleID | ProductName | SaleAmount | SaleDate |
|---|---|---|---|
| 2 | NULL | 500 | 2024-08-02 |
| 3 | Phone | NULL | 2024-08-03 |
| 4 | Tablet | 300 | NULL |
Removing Duplicate Records
Duplicate records can skew analysis. Use DISTINCT or ROW_NUMBER() with PARTITION BY to identify and remove duplicates. For example, look at the table given below:
| OrderID | CustomerID | OrderDate |
|---|---|---|
| 1 | 101 | 2024-08-01 |
| 2 | 102 | 2024-08-02 |
| 3 | 101 | 2024-08-01 |
| 4 | 103 | 2024-08-03 |
Here’s how to identify duplicate records from the table:
SELECT OrderID, CustomerID, OrderDate,
COUNT(*) OVER (PARTITION BY CustomerID, OrderDate) AS DuplicateCount
FROM Orders;
| OrderID | CustomerID | OrderDate | DuplicateCount |
|---|---|---|---|
| 1 | 101 | 2024-08-01 | 2 |
| 3 | 101 | 2024-08-01 | 2 |
| 2 | 102 | 2024-08-02 | 1 |
| 4 | 103 | 2024-08-03 | 1 |
And, here’s how to remove the duplicate records:
WITH CTE AS (
SELECT OrderID, CustomerID, OrderDate,
ROW_NUMBER() OVER (PARTITION BY CustomerID, OrderDate ORDER BY OrderID) AS RowNum
FROM Orders
)
DELETE FROM Orders
WHERE OrderID IN (SELECT OrderID FROM CTE WHERE RowNum > 1);
| OrderID | CustomerID | OrderDate |
|---|---|---|
| 1 | 101 | 2024-08-01 |
| 2 | 102 | 2024-08-02 |
| 4 | 103 | 2024-08-03 |
Handling Missing Data
There are several strategies for handling missing data, including deletion, imputation, or replacement with default values. Look at the table given below, it has missing values in all the rows except the first row:
| SaleID | ProductName | SaleAmount | SaleDate |
|---|---|---|---|
| 1 | Laptop | 1000 | 2024-08-01 |
| 2 | NULL | 500 | 2024-08-02 |
| 3 | Phone | NULL | 2024-08-03 |
| 4 | Tablet | 300 | NULL |
Here’s how to delete rows with missing data:
DELETE FROM Sales
WHERE ProductName IS NULL OR SaleAmount IS NULL OR SaleDate IS NULL;
| SaleID | ProductName | SaleAmount | SaleDate |
|---|---|---|---|
| 1 | Laptop | 1000 | 2024-08-01 |
And if you want to replace missing values with a default value, here’s an example:
UPDATE Sales
SET ProductName = 'Unknown'
WHERE ProductName IS NULL;
UPDATE Sales
SET SaleAmount = 0
WHERE SaleAmount IS NULL;
UPDATE Sales
SET SaleDate = '2024-01-01'
WHERE SaleDate IS NULL;
| SaleID | ProductName | SaleAmount | SaleDate |
|---|---|---|---|
| 1 | Laptop | 1000 | 2024-08-01 |
| 2 | Unknown | 500 | 2024-08-02 |
| 3 | Phone | 0 | 2024-08-03 |
| 4 | Tablet | 300 | 2024-01-01 |
Standardizing Data
Standardizing data involves converting data to a consistent format. For example, converting all text to lowercase or uppercase. Look at the example table below, the format of the country column is inconsistent:
| CustomerID | CustomerName | Country |
|---|---|---|
| 1 | John Doe | USA |
| 2 | jane SMITH | usA |
| 3 | Michael Brown | Canada |
| 4 | linda JOHNSON | CANADA |
Here’s how to correct the inconsistencies in our data:
UPDATE Customers
SET Country = UPPER(Country);
| CustomerID | CustomerName | Country |
|---|---|---|
| 1 | John Doe | USA |
| 2 | jane SMITH | USA |
| 3 | Michael Brown | CANADA |
| 4 | linda JOHNSON | CANADA |
Correcting Data Entry Errors
Data entry errors, like misspelt words or incorrect values, can be corrected using SQL. For example, look at the table given below, the spelling of Phone is incorrect in the table:
| ProductID | ProductName | Price |
|---|---|---|
| 1 | Laptop | 1000 |
| 2 | Phonne | 500 |
| 3 | Tablet | 300 |
| 4 | Desktop | 800 |
Here’s how you can correct it:
UPDATE Products
SET ProductName = 'Phone'
WHERE ProductName = 'Phonne';
| ProductID | ProductName | Price |
|---|---|---|
| 1 | Laptop | 1000 |
| 2 | Phone | 500 |
| 3 | Tablet | 300 |
| 4 | Desktop | 800 |
Summary
Below are some essential SQL data cleaning methods you should know for Data Science:
- Identifying Missing Data: Use IS NULL to find missing data.
- Removing Duplicate Records: Use DISTINCT or ROW_NUMBER() to identify and remove duplicates.
- Handling Missing Data: Delete, impute, or replace missing data with default values.
- Standardizing Data: Convert data to a consistent format.
- Correcting Data Entry Errors: Use UPDATE to correct misspelled words or incorrect values.