Hypothesis Testing for Data Scientists with Python
As a Data Scientist, you’re often tasked with determining whether a difference in outcomes or a trend in the data is significant, or simply the result of random variation. This is where hypothesis testing becomes essential. It provides a structured, statistical framework to validate assumptions, compare groups, and make confident, data-driven decisions. So, in this article, I’ll take you through a practical guide to Hypothesis Testing for Data Scientists with Python.
Hypothesis Testing for Data Scientists with Python: Getting Started
We’ve been given a dataset of 1000 employees, with information on:
- Age, Department, Education, Experience
- Whether they attended a training program
- Their performance scores (scaled from 0 to 100)
We want to evaluate whether the training program improved performance, on average, compared to employees who didn’t attend the training. You can find the dataset here.
Step 1: Define the Hypotheses
In hypothesis testing, we start by stating two opposing claims:
- Null Hypothesis (H₀): There is no difference in average performance scores between trained and untrained employees.
- Alternative Hypothesis (H₁): Trained employees have a higher average performance score than untrained employees.
This is a one-tailed test, as we’re specifically interested in improvement.
Now, before the second step, we will import the dataset:
import pandas as pd
df = pd.read_csv('/content/Employee_Training_and_Performance_Dataset.csv')
df.head()
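If you don't have the CSV at hand, you can still follow along with a synthetic stand-in. The sketch below generates a frame with the two columns used in this article; the generating parameters (means, spread, group split) are purely illustrative assumptions, not properties of the real dataset:

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the employee dataset (illustrative values only).
rng = np.random.default_rng(42)
n = 1000
attended = rng.choice(['Yes', 'No'], size=n)
# Give attendees a modest boost so the example mirrors the article's finding,
# then clip scores to the 0-100 scale described above.
scores = np.where(attended == 'Yes',
                  rng.normal(72, 10, n),
                  rng.normal(65, 10, n)).clip(0, 100)
df = pd.DataFrame({'TrainingAttended': attended,
                   'PerformanceScore': scores})
print(df.head())
```

The remaining steps run unchanged on this stand-in, though the exact statistics will of course differ from the article's outputs.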

Step 2: Prepare the Groups
Next, we will split the dataset into two groups based on whether employees attended the training:
group_yes = df[df['TrainingAttended'] == 'Yes']['PerformanceScore']
group_no = df[df['TrainingAttended'] == 'No']['PerformanceScore']
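Before any formal test, a quick sanity check on the split (group sizes and raw means) is worthwhile. A self-contained sketch using the article's column names with toy values:

```python
import pandas as pd

# Toy frame with the article's column names (values are illustrative).
df = pd.DataFrame({
    'TrainingAttended': ['Yes', 'Yes', 'No', 'No', 'Yes', 'No'],
    'PerformanceScore': [78, 85, 62, 70, 90, 65],
})
group_yes = df[df['TrainingAttended'] == 'Yes']['PerformanceScore']
group_no = df[df['TrainingAttended'] == 'No']['PerformanceScore']

# Group sizes and means: a large gap in sizes or an unexpected direction
# of the mean difference is worth investigating before testing.
print(group_yes.size, group_no.size)
print(group_yes.mean(), group_no.mean())
```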
Step 3: Check for Normality
Most parametric tests, including the t-test, assume that the data is approximately normally distributed. So, we will use the Shapiro-Wilk test to verify this for both groups, drawing a subsample from each because the test becomes overly sensitive at large sample sizes. If the p-value is above 0.05, we fail to reject the assumption of normality:
from scipy import stats
# Shapiro-Wilk flags even tiny deviations at large n, so cap the subsample at 300
sample_size = min(len(group_yes), len(group_no), 300)
shapiro_yes = stats.shapiro(group_yes.sample(sample_size, random_state=1))
shapiro_no = stats.shapiro(group_no.sample(sample_size, random_state=1))
print("Shapiro Test (Training = Yes):", shapiro_yes)
print("Shapiro Test (Training = No):", shapiro_no)
Shapiro Test (Training = Yes): ShapiroResult(statistic=np.float64(0.9947190464566062), pvalue=np.float64(0.3910129582664982))
Shapiro Test (Training = No): ShapiroResult(statistic=np.float64(0.99501435026432), pvalue=np.float64(0.44369527154076494))
Both groups are approximately normal, so we can proceed with the t-test.
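As a side note, had either Shapiro p-value fallen below 0.05, a common fallback is the non-parametric Mann-Whitney U test, which drops the normality assumption. A minimal sketch with synthetic skewed data (not the employee dataset):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
# Skewed (exponential) samples, where a t-test's normality assumption fails;
# the first sample is shifted upward so a real difference exists.
a = rng.exponential(scale=2.0, size=200) + 1.0
b = rng.exponential(scale=2.0, size=200)

# One-sided Mann-Whitney U test: is `a` stochastically greater than `b`?
u_stat, p_val = stats.mannwhitneyu(a, b, alternative='greater')
print(p_val)
```

With a shift this large relative to the spread, the p-value comes out well below 0.05.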
Step 4: Check for Equal Variance
Before running a t-test, we need to determine whether the two groups have equal variances. We use Levene’s Test for this:
levene = stats.levene(group_yes, group_no)
print("Levene’s Test:", levene)
Levene’s Test: LeveneResult(statistic=np.float64(3.6987757209752585), pvalue=np.float64(0.05473666933558896))
This p-value is just above 0.05, which makes it a borderline result. To be cautious, we assume unequal variances and use Welch’s t-test, which remains valid whether or not the variances are equal.
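To see why Welch’s version is the safer choice, here is a small simulation sketch (synthetic data, not the employee dataset). When the null hypothesis is true but variances and group sizes differ, the pooled (Student’s) t-test rejects far more often than its nominal 5% rate, while Welch’s test stays close to it:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)
n_sims, alpha = 2000, 0.05
fp_student = fp_welch = 0
for _ in range(n_sims):
    # H0 is true: identical means, but unequal variances and group sizes.
    a = rng.normal(50, 20, size=10)
    b = rng.normal(50, 2, size=200)
    if stats.ttest_ind(a, b, equal_var=True).pvalue < alpha:
        fp_student += 1
    if stats.ttest_ind(a, b, equal_var=False).pvalue < alpha:
        fp_welch += 1

# False-positive rates: Student's is badly inflated, Welch's is near 5%.
print(fp_student / n_sims, fp_welch / n_sims)
```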
Step 5: Perform Welch’s T-Test
Now, we will perform the actual hypothesis test:
t_stat, p_val = stats.ttest_ind(group_yes, group_no, equal_var=False)
print("T-test statistic:", t_stat)
print("T-test p-value:", p_val)
T-test statistic: 9.187893626181372
T-test p-value: 2.8582551803382495e-19
A p-value this small means that, if the training truly had no effect, a difference at least this large would be extremely unlikely to occur by chance. (Note that ttest_ind is two-tailed by default; since our alternative hypothesis is one-sided and the t-statistic is positive, passing alternative='greater' would simply halve this already tiny p-value.)
Since the p-value is far below 0.05, we reject the null hypothesis. We now have strong statistical evidence that employees who attended the training perform significantly better, on average, than those who did not.
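A significant p-value tells us the difference is unlikely under the null hypothesis, but not how large it is in practical terms. A common companion measure is Cohen’s d; this is an addition to the article’s workflow, sketched here on toy data rather than the employee dataset:

```python
import numpy as np

def cohens_d(x, y):
    """Cohen's d: mean difference divided by the pooled standard deviation."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    nx, ny = len(x), len(y)
    pooled_var = ((nx - 1) * x.var(ddof=1) +
                  (ny - 1) * y.var(ddof=1)) / (nx + ny - 2)
    return (x.mean() - y.mean()) / np.sqrt(pooled_var)

# Toy groups whose means differ by about one pooled SD, so d should be near 1
# (conventionally a "large" effect).
rng = np.random.default_rng(3)
trained = rng.normal(75, 10, 500)
untrained = rng.normal(65, 10, 500)
print(cohens_d(trained, untrained))
```

Reporting an effect size alongside the p-value makes the conclusion easier to act on: it answers "how much better?", not just "is it better?".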
Here’s a visual comparison of both groups:
import plotly.express as px
fig = px.box(
    df,
    x='TrainingAttended',
    y='PerformanceScore',
    title='Performance Score by Training Attendance',
    labels={
        'TrainingAttended': 'Training Attended',
        'PerformanceScore': 'Performance Score'
    },
    color='TrainingAttended',
    points='all',
)
fig.update_layout(
    plot_bgcolor='rgba(0,0,0,0)',
    paper_bgcolor='white',
    margin=dict(l=40, r=40, t=80, b=60),
    showlegend=False
)
fig.show()
