Building Synthetic Medical Records using GANs
If you’ve been following the Generative AI wave, you’ve probably seen AI generate images, text, and even code. You can use the same technology to create realistic synthetic datasets for healthcare, finance, and more. So, in this article, I’ll walk you through building GANs (Generative Adversarial Networks) from scratch to generate synthetic medical records using Python.
Building Synthetic Medical Records Using GANs
For building synthetic medical records using GANs, we’ll break it down into the following steps:
- Data Preprocessing
- GAN Architecture
- Training Loop
- Evaluating Model Performance
- Generating Synthetic Medical Records
Make sure to download the dataset from here.
Data Preprocessing: Where most GAN projects fail before they start
GANs understand numbers, not text, not categorical labels. So our first job is to:
- Convert categorical features into one-hot encoded vectors
- Scale numerical values between -1 and 1 (because our Generator uses Tanh activation)
import pandas as pd
df = pd.read_csv("/content/Follow-up_Records.csv")
print(df.head())
patient_id visit_date age_years weight_kg bmi systolic_bp_mmHg \
0 P-2025-001 2024-02-15 52 83.7 28.3 138
1 P-2025-001 2024-03-15 52 83.4 28.2 147
2 P-2025-001 2024-04-15 52 83.1 28.1 140
3 P-2025-001 2024-05-15 52 83.0 28.1 136
4 P-2025-001 2024-06-15 52 82.6 27.9 133
diastolic_bp_mmHg heart_rate_bpm body_temp_C fasting_glucose_mg_dL ... \
0 86 80 36.8 137 ...
1 89 80 37.0 140 ...
2 84 76 36.8 122 ...
3 88 77 36.8 112 ...
4 88 78 36.8 101 ...
diet_quality_score_0_100 sleep_hours exercise_sessions_per_week \
0 62 6.6 3
1 61 6.8 2
2 65 7.0 3
3 66 7.8 1
4 54 7.0 2
alcohol_units_per_week smoking_cigs_per_day \
0 0 0
1 1 0
2 3 0
3 0 0
4 3 0
clinical_notes neuropathy retinopathy \
0 Baseline visit: poor glycemic control; lifesty... 0 0
1 Routine follow-up. 0 0
2 Reports tingling in feet at night; B12 checked... 0 0
3 Routine follow-up. 0 0
4 SGLT2 inhibitor added due to persistent hyperg... 0 0
hypoglycemia uti
0 0 0
1 0 0
2 0 1
3 0 0
4 0 0
[5 rows x 34 columns]
from sklearn.preprocessing import MinMaxScaler, OneHotEncoder
import numpy as np
# split columns by dtype: numeric vs everything treated as categorical
num_cols = df.select_dtypes(include=['int64', 'float64']).columns
cat_cols = df.select_dtypes(include=['object']).columns
# one-hot encode categorical features into 0/1 columns
encoder = OneHotEncoder(sparse_output=False)
cat_encoded = encoder.fit_transform(df[cat_cols])
# scale numeric features to [-1, 1] to match the Generator's Tanh output
scaler = MinMaxScaler(feature_range=(-1, 1))
num_scaled = scaler.fit_transform(df[num_cols])
# combine processed data: numeric block first, then one-hot block
data_processed = np.hstack((num_scaled, cat_encoded))
If you skip proper preprocessing, your GAN will either:
- Fail to learn patterns (mode collapse)
- Generate nonsensical outputs
Scaling to [-1, 1] matters because Tanh can only output values in that range; any data outside it is impossible for the Generator to reproduce.
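Before moving on, a quick check on the processed arrays catches these problems early. A minimal sketch using the arrays created above:
# confirm the processed matrix has the expected shape
print("processed shape:", data_processed.shape)
# the numeric block must sit inside the Tanh output range [-1, 1]
assert num_scaled.min() >= -1.0 and num_scaled.max() <= 1.0
# the one-hot block should contain only 0s and 1s
assert set(np.unique(cat_encoded)) <= {0.0, 1.0}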
GAN Architecture: Two networks playing a game
A GAN has a:
- Generator: Starts with random noise, learns to produce realistic samples
- Discriminator: Tries to tell real from fake data
Here’s how to build both networks:
import torch
import torch.nn as nn

data_dim = data_processed.shape[1]  # total features
latent_dim = 64  # size of random noise input

# generator
class Generator(nn.Module):
    def __init__(self):
        super(Generator, self).__init__()
        self.model = nn.Sequential(
            nn.Linear(latent_dim, 128),
            nn.LeakyReLU(0.2),
            nn.Linear(128, 256),
            nn.LeakyReLU(0.2),
            nn.Linear(256, data_dim),
            nn.Tanh()  # output in range [-1, 1]
        )

    def forward(self, z):
        return self.model(z)

# discriminator
class Discriminator(nn.Module):
    def __init__(self):
        super(Discriminator, self).__init__()
        self.model = nn.Sequential(
            nn.Linear(data_dim, 256),
            nn.LeakyReLU(0.2),
            nn.Linear(256, 128),
            nn.LeakyReLU(0.2),
            nn.Linear(128, 1),
            nn.Sigmoid()  # probability of real/fake
        )

    def forward(self, x):
        return self.model(x)
While building GANs, always make sure to:
- Use LeakyReLU in hidden layers to avoid “dead” neurons.
- Match Generator’s output range (Tanh) with your data scaling (see the quick check after this list).
- Keep the architecture small at first; big models overfit small datasets quickly.
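To confirm the second point in practice, pass a batch of noise through an untrained Generator and inspect the output. A quick sketch, not part of the training pipeline:
# the Generator must emit one value per feature, all inside [-1, 1]
g = Generator()
z = torch.randn(4, latent_dim)
sample = g(z)
print(sample.shape)  # expected: torch.Size([4, data_dim])
print(sample.min().item(), sample.max().item())  # expected: within [-1, 1]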
Training Loop: Where the Model Learns
Now, we will train both networks in turns:
- Discriminator: Learns to classify real vs fake correctly
- Generator: Learns to fool the Discriminator
from torch.utils.data import DataLoader, TensorDataset

# convert data to PyTorch tensors
real_data = torch.tensor(data_processed, dtype=torch.float32)
dataset = TensorDataset(real_data)
loader = DataLoader(dataset, batch_size=16, shuffle=True)

# initialize models
generator = Generator()
discriminator = Discriminator()

# optimizers
lr = 0.0002
optim_G = torch.optim.Adam(generator.parameters(), lr=lr)
optim_D = torch.optim.Adam(discriminator.parameters(), lr=lr)

# loss
criterion = nn.BCELoss()

epochs = 2000
for epoch in range(epochs):
    for real_batch, in loader:
        batch_size = real_batch.size(0)

        # labels for real and fake data
        real_labels = torch.ones((batch_size, 1))
        fake_labels = torch.zeros((batch_size, 1))

        # train discriminator
        z = torch.randn(batch_size, latent_dim)
        fake_data = generator(z)
        real_loss = criterion(discriminator(real_batch), real_labels)
        fake_loss = criterion(discriminator(fake_data.detach()), fake_labels)
        d_loss = (real_loss + fake_loss) / 2
        optim_D.zero_grad()
        d_loss.backward()
        optim_D.step()

        # train generator
        z = torch.randn(batch_size, latent_dim)
        fake_data = generator(z)
        g_loss = criterion(discriminator(fake_data), real_labels)  # want fake to be scored as real
        optim_G.zero_grad()
        g_loss.backward()
        optim_G.step()

    if epoch % 200 == 0:
        print(f"Epoch [{epoch}/{epochs}] D_loss: {d_loss.item():.4f} G_loss: {g_loss.item():.4f}")
Epoch [0/2000] D_loss: 0.6956 G_loss: 0.6612
Epoch [200/2000] D_loss: 0.1672 G_loss: 2.9219
Epoch [400/2000] D_loss: 0.6135 G_loss: 1.5207
Epoch [600/2000] D_loss: 0.3837 G_loss: 2.6847
Epoch [800/2000] D_loss: 0.8646 G_loss: 1.4029
Epoch [1000/2000] D_loss: 0.3813 G_loss: 1.6410
Epoch [1200/2000] D_loss: 0.3263 G_loss: 2.6179
Epoch [1400/2000] D_loss: 0.0688 G_loss: 3.0163
Epoch [1600/2000] D_loss: 0.2220 G_loss: 1.4931
Epoch [1800/2000] D_loss: 0.0581 G_loss: 3.5975
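Notice that the losses oscillate instead of converging smoothly; that is normal for adversarial training. But when D_loss sinks toward zero (as around epochs 1400 and 1800 above), the Discriminator is overpowering the Generator. A commonly used remedy, not applied in the run above, is to build the optimizers with a lower beta1 for Adam:
# optional stabilization: GAN work commonly sets beta1 = 0.5
optim_G = torch.optim.Adam(generator.parameters(), lr=lr, betas=(0.5, 0.999))
optim_D = torch.optim.Adam(discriminator.parameters(), lr=lr, betas=(0.5, 0.999))
This slows down how aggressively each optimizer chases its moving target, which often reduces the oscillation between the two networks.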
Here are some common mistakes that beginners make in this step:
- Training the Generator more than the Discriminator (can destabilize training)
- Using the wrong activation (e.g., ReLU in the last Generator layer without scaling data)
- Batch size too large (small datasets train better with small batches)
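Evaluating Model Performance: Comparing real and synthetic distributions
Before generating records for real use, check whether the synthetic distribution resembles the real one. The sketch below is one minimal way to do this (it reuses generator, scaler, num_cols, and df from the previous steps); comparing per-column means and standard deviations is coarse, but it surfaces mode collapse quickly:
# draw a large synthetic sample and undo the scaling on numeric columns
z = torch.randn(500, latent_dim)
samples = generator(z).detach().numpy()
num_fake = scaler.inverse_transform(samples[:, :len(num_cols)])
# compare real vs synthetic statistics column by column
report = pd.DataFrame({
    "real_mean": df[num_cols].mean(),
    "fake_mean": num_fake.mean(axis=0),
    "real_std": df[num_cols].std(),
    "fake_std": num_fake.std(axis=0),
})
print(report.round(2))
If the synthetic means track the real ones but the standard deviations are much smaller, the Generator is producing overly similar samples, a classic symptom of mode collapse.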
Generating Synthetic Medical Records
Once trained, we can sample new patient records from random noise:
# generate new synthetic data
z = torch.randn(10, latent_dim)  # 10 synthetic samples
synthetic_data_scaled = generator(z).detach().numpy()
# inverse transform: numeric columns come first in data_processed,
# so the same split order applies here; the encoder maps each
# one-hot block back to its most activated category
num_synthetic = scaler.inverse_transform(synthetic_data_scaled[:, :len(num_cols)])
cat_synthetic = encoder.inverse_transform(synthetic_data_scaled[:, len(num_cols):])
# combine into a dataframe
synthetic_df = pd.DataFrame(num_synthetic, columns=num_cols)
synthetic_df[cat_cols] = cat_synthetic
print(synthetic_df)
age_years weight_kg bmi systolic_bp_mmHg diastolic_bp_mmHg \
0 52.017231 81.608612 27.656334 128.573212 83.058716
1 52.049911 81.312317 27.621420 127.594505 82.952423
2 52.004284 82.087952 27.929657 132.905823 86.329712
3 52.999413 81.105415 27.438124 130.629898 73.565216
4 52.921329 81.529053 27.588375 131.838181 75.753464
5 52.000004 81.658142 27.952627 126.800835 90.225380
6 52.001060 81.389557 27.754536 126.812881 86.642319
7 52.949970 81.282715 27.489239 130.044556 75.864662
8 52.994469 81.226418 27.434101 128.948669 74.581017
9 52.999737 80.793098 27.320147 122.675903 73.543121
heart_rate_bpm body_temp_C fasting_glucose_mg_dL \
0 76.080704 36.872677 89.162071
1 76.967606 36.932449 90.351837
2 76.076080 36.841934 86.890717
3 74.126099 36.732826 105.403008
4 74.325439 36.778103 96.714897
5 79.380981 36.918476 85.431534
6 78.033195 36.897865 85.991028
7 74.698959 36.814651 102.886810
8 74.389000 36.800858 102.727158
9 74.411812 36.774315 107.315430
postprandial_glucose_mg_dL hba1c_percent ... alcohol_units_per_week \
0 129.606934 7.381011 ... 1.700157
1 136.598312 7.326695 ... 1.963187
2 124.384651 7.747846 ... 1.313976
3 171.038132 6.749748 ... 0.067310
4 142.759796 6.931028 ... 0.275825
5 122.884781 7.793033 ... 1.902780
6 124.722404 7.453174 ... 2.455485
7 158.861588 6.892792 ... 0.434503
8 164.242416 6.800313 ... 0.281971
9 180.745453 6.673672 ... 0.202333
smoking_cigs_per_day neuropathy retinopathy hypoglycemia uti \
0 0.001285 0.000759 0.001573 0.100375 0.017240
1 0.005959 0.004435 0.005770 0.477865 0.020934
2 0.000674 0.000674 0.001165 0.017366 0.011973
3 0.000537 0.000106 0.000128 0.001489 0.000276
4 0.001492 0.000539 0.000662 0.005973 0.003403
5 0.000028 0.000010 0.000051 0.146468 0.001701
6 0.000439 0.000241 0.000527 0.302272 0.009886
7 0.005161 0.001687 0.001972 0.061553 0.005250
8 0.002669 0.000815 0.001331 0.019851 0.003891
9 0.000336 0.000034 0.000046 0.151734 0.000156
patient_id visit_date medications \
0 P-2025-001 2024-06-15 Metformin 1000 mg BID, Ramipril 5 mg QD, Atorv...
1 P-2025-001 2024-06-15 Metformin 1000 mg BID, Ramipril 5 mg QD, Atorv...
2 P-2025-001 2024-10-15 Metformin 1000 mg BID, Ramipril 5 mg QD, Atorv...
3 P-2025-001 2025-02-15 Metformin 1000 mg BID, Ramipril 5 mg QD, Atorv...
4 P-2025-001 2025-02-15 Metformin 1000 mg BID, Ramipril 5 mg QD, Atorv...
5 P-2025-001 2024-08-15 Metformin 1000 mg BID, Ramipril 5 mg QD, Atorv...
6 P-2025-001 2024-06-15 Metformin 1000 mg BID, Ramipril 5 mg QD, Atorv...
7 P-2025-001 2025-02-15 Metformin 1000 mg BID, Ramipril 5 mg QD, Atorv...
8 P-2025-001 2025-02-15 Metformin 1000 mg BID, Ramipril 5 mg QD, Atorv...
9 P-2025-001 2025-02-15 Metformin 1000 mg BID, Ramipril 5 mg QD, Atorv...
clinical_notes
0 Routine follow-up.
1 Routine follow-up.
2 Routine follow-up.
3 Routine follow-up.
4 Routine follow-up.
5 Routine follow-up.
6 Routine follow-up.
7 Routine follow-up.
8 Routine follow-up.
9 Mild symptomatic hypoglycemia post-exercise; s...
[10 rows x 34 columns]
You now have realistic-looking synthetic data for experiments. Be careful with the label “privacy-safe”, though: GANs can memorize rare training records, so verify that generated rows do not closely reproduce real patients before sharing them. The data can also be used to augment training datasets for better ML model performance, as sketched below.
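For instance, here is a minimal augmentation sketch in the preprocessed feature space (the sample size of 200 is an arbitrary choice):
# stack real and synthetic rows into one training matrix
z = torch.randn(200, latent_dim)
extra = generator(z).detach().numpy()
augmented = np.vstack((data_processed, extra))
print(augmented.shape)  # original row count + 200 synthetic rows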
Final Words
Building GANs for real-world datasets like medical records is less about exotic model design and more about:
- Data preprocessing discipline
- Matching architecture to data
- Careful training to avoid collapse
- Post-processing to make data usable
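That last point deserves a concrete example. The raw Generator output leaves binary flags as soft values (look at the neuropathy and uti columns above), so a light post-processing pass makes the records usable. A sketch, assuming the synthetic_df built earlier; the 0.5 threshold and the clipping bounds are choices, not rules:
# round soft indicator columns back to hard 0/1 flags
flag_cols = ["neuropathy", "retinopathy", "hypoglycemia", "uti"]
synthetic_df[flag_cols] = (synthetic_df[flag_cols] > 0.5).astype(int)
# clip vitals to the range observed in the real data
for col in ["systolic_bp_mmHg", "diastolic_bp_mmHg", "heart_rate_bpm"]:
    synthetic_df[col] = synthetic_df[col].clip(df[col].min(), df[col].max())
print(synthetic_df.head())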