Building Synthetic Medical Records using GANs
If you’ve been following the Generative AI wave, you’ve probably seen AI generate images, text, and even code. You can use the same technology to create realistic synthetic datasets for healthcare, finance, and more. So, in this article, I’ll walk you through building GANs (Generative Adversarial Networks) from scratch to generate synthetic medical records using Python.
Building Synthetic Medical Records Using GANs
For building synthetic medical records using GANs, we’ll break it down into the following steps:
- Data Preprocessing
- GAN Architecture
- Training Loop
- Evaluating Model Performance
- Generating Synthetic Medical Records
Make sure to download the dataset from here.
Data Preprocessing: Where most GAN projects fail before they start
GANs understand numbers, not text, not categorical labels. So our first job is to:
- Convert categorical features into one-hot encoded vectors
- Scale numerical values between -1 and 1 (because our Generator uses Tanh activation)
import pandas as pd
df = pd.read_csv("/content/Follow-up_Records.csv")
print(df.head())
patient_id visit_date age_years weight_kg bmi systolic_bp_mmHg \
0 P-2025-001 2024-02-15 52 83.7 28.3 138
1 P-2025-001 2024-03-15 52 83.4 28.2 147
2 P-2025-001 2024-04-15 52 83.1 28.1 140
3 P-2025-001 2024-05-15 52 83.0 28.1 136
4 P-2025-001 2024-06-15 52 82.6 27.9 133
diastolic_bp_mmHg heart_rate_bpm body_temp_C fasting_glucose_mg_dL ... \
0 86 80 36.8 137 ...
1 89 80 37.0 140 ...
2 84 76 36.8 122 ...
3 88 77 36.8 112 ...
4 88 78 36.8 101 ...
diet_quality_score_0_100 sleep_hours exercise_sessions_per_week \
0 62 6.6 3
1 61 6.8 2
2 65 7.0 3
3 66 7.8 1
4 54 7.0 2
alcohol_units_per_week smoking_cigs_per_day \
0 0 0
1 1 0
2 3 0
3 0 0
4 3 0
clinical_notes neuropathy retinopathy \
0 Baseline visit: poor glycemic control; lifesty... 0 0
1 Routine follow-up. 0 0
2 Reports tingling in feet at night; B12 checked... 0 0
3 Routine follow-up. 0 0
4 SGLT2 inhibitor added due to persistent hyperg... 0 0
hypoglycemia uti
0 0 0
1 0 0
2 0 1
3 0 0
4 0 0
[5 rows x 34 columns]
from sklearn.preprocessing import MinMaxScaler, OneHotEncoder
import numpy as np
# split columns by dtype: numeric vs everything treated as categorical
num_cols = df.select_dtypes(include=['int64', 'float64']).columns
cat_cols = df.select_dtypes(include=['object']).columns
# one-hot encode categorical features into 0/1 columns
encoder = OneHotEncoder(sparse_output=False)
cat_encoded = encoder.fit_transform(df[cat_cols])
# scale numeric features to [-1, 1] to match the Generator's Tanh output
scaler = MinMaxScaler(feature_range=(-1, 1))
num_scaled = scaler.fit_transform(df[num_cols])
# combine processed data: numeric block first, then one-hot block
data_processed = np.hstack((num_scaled, cat_encoded))
If you skip proper preprocessing, your GAN will either:
- Fail to learn patterns (mode collapse)
- Generate nonsensical outputs
Scaling to [-1, 1] matters because Tanh can only output values in that range; any data outside it is impossible for the Generator to reproduce.
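Before moving on, a quick check on the processed arrays catches these problems early. A minimal sketch using the arrays created above:
# confirm the processed matrix has the expected shape
print("processed shape:", data_processed.shape)
# the numeric block must sit inside the Tanh output range [-1, 1]
assert num_scaled.min() >= -1.0 and num_scaled.max() <= 1.0
# the one-hot block should contain only 0s and 1s
assert set(np.unique(cat_encoded)) <= {0.0, 1.0}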
GAN Architecture: Two networks playing a game
A GAN has a:
- Generator: Starts with random noise, learns to produce realistic samples
- Discriminator: Tries to tell real from fake data
Here’s how to build both networks:
import torch
import torch.nn as nn

data_dim = data_processed.shape[1]  # total features
latent_dim = 64  # size of random noise input

# generator
class Generator(nn.Module):
    def __init__(self):
        super(Generator, self).__init__()
        self.model = nn.Sequential(
            nn.Linear(latent_dim, 128),
            nn.LeakyReLU(0.2),
            nn.Linear(128, 256),
            nn.LeakyReLU(0.2),
            nn.Linear(256, data_dim),
            nn.Tanh()  # output in range [-1, 1]
        )

    def forward(self, z):
        return self.model(z)

# discriminator
class Discriminator(nn.Module):
    def __init__(self):
        super(Discriminator, self).__init__()
        self.model = nn.Sequential(
            nn.Linear(data_dim, 256),
            nn.LeakyReLU(0.2),
            nn.Linear(256, 128),
            nn.LeakyReLU(0.2),
            nn.Linear(128, 1),
            nn.Sigmoid()  # probability of real/fake
        )

    def forward(self, x):
        return self.model(x)
While building GANs, always make sure to:
- Use LeakyReLU in hidden layers to avoid “dead” neurons.
- Match Generator’s output range (Tanh) with your data scaling (see the quick check after this list).
- Keep the architecture small at first; big models overfit small datasets quickly.
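To confirm the second point in practice, pass a batch of noise through an untrained Generator and inspect the output. A quick sketch, not part of the training pipeline:
# the Generator must emit one value per feature, all inside [-1, 1]
g = Generator()
z = torch.randn(4, latent_dim)
sample = g(z)
print(sample.shape)  # expected: torch.Size([4, data_dim])
print(sample.min().item(), sample.max().item())  # expected: within [-1, 1]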
Training Loop: Where the Model Learns
Now, we will train both networks in turns:
- Discriminator: Learns to classify real vs fake correctly
- Generator: Learns to fool the Discriminator
from torch.utils.data import DataLoader, TensorDataset

# convert data to PyTorch tensors
real_data = torch.tensor(data_processed, dtype=torch.float32)
dataset = TensorDataset(real_data)
loader = DataLoader(dataset, batch_size=16, shuffle=True)

# initialize models
generator = Generator()
discriminator = Discriminator()

# optimizers
lr = 0.0002
optim_G = torch.optim.Adam(generator.parameters(), lr=lr)
optim_D = torch.optim.Adam(discriminator.parameters(), lr=lr)

# loss
criterion = nn.BCELoss()

epochs = 2000
for epoch in range(epochs):
    for real_batch, in loader:
        batch_size = real_batch.size(0)

        # labels for real and fake data
        real_labels = torch.ones((batch_size, 1))
        fake_labels = torch.zeros((batch_size, 1))

        # train discriminator
        z = torch.randn(batch_size, latent_dim)
        fake_data = generator(z)
        real_loss = criterion(discriminator(real_batch), real_labels)
        fake_loss = criterion(discriminator(fake_data.detach()), fake_labels)
        d_loss = (real_loss + fake_loss) / 2
        optim_D.zero_grad()
        d_loss.backward()
        optim_D.step()

        # train generator
        z = torch.randn(batch_size, latent_dim)
        fake_data = generator(z)
        g_loss = criterion(discriminator(fake_data), real_labels)  # want fake to be scored as real
        optim_G.zero_grad()
        g_loss.backward()
        optim_G.step()

    if epoch % 200 == 0:
        print(f"Epoch [{epoch}/{epochs}] D_loss: {d_loss.item():.4f} G_loss: {g_loss.item():.4f}")
Epoch [0/2000] D_loss: 0.6956 G_loss: 0.6612
Epoch [200/2000] D_loss: 0.1672 G_loss: 2.9219
Epoch [400/2000] D_loss: 0.6135 G_loss: 1.5207
Epoch [600/2000] D_loss: 0.3837 G_loss: 2.6847
Epoch [800/2000] D_loss: 0.8646 G_loss: 1.4029
Epoch [1000/2000] D_loss: 0.3813 G_loss: 1.6410
Epoch [1200/2000] D_loss: 0.3263 G_loss: 2.6179
Epoch [1400/2000] D_loss: 0.0688 G_loss: 3.0163
Epoch [1600/2000] D_loss: 0.2220 G_loss: 1.4931
Epoch [1800/2000] D_loss: 0.0581 G_loss: 3.5975
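Notice that the losses oscillate instead of converging smoothly; that is normal for adversarial training. But when D_loss sinks toward zero (as around epochs 1400 and 1800 above), the Discriminator is overpowering the Generator. A commonly used remedy, not applied in the run above, is to build the optimizers with a lower beta1 for Adam:
# optional stabilization: GAN work commonly sets beta1 = 0.5
optim_G = torch.optim.Adam(generator.parameters(), lr=lr, betas=(0.5, 0.999))
optim_D = torch.optim.Adam(discriminator.parameters(), lr=lr, betas=(0.5, 0.999))
This slows down how aggressively each optimizer chases its moving target, which often reduces the oscillation between the two networks.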
Here are some common mistakes that beginners make in this step:
- Training the Generator more than the Discriminator (can destabilize training)
- Using the wrong activation (e.g., ReLU in the last Generator layer without scaling data)
- Batch size too large (small datasets train better with small batches)
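Evaluating Model Performance: Comparing real and synthetic distributions
Before generating records for real use, check whether the synthetic distribution resembles the real one. The sketch below is one minimal way to do this (it reuses generator, scaler, num_cols, and df from the previous steps); comparing per-column means and standard deviations is coarse, but it surfaces mode collapse quickly:
# draw a large synthetic sample and undo the scaling on numeric columns
z = torch.randn(500, latent_dim)
samples = generator(z).detach().numpy()
num_fake = scaler.inverse_transform(samples[:, :len(num_cols)])
# compare real vs synthetic statistics column by column
report = pd.DataFrame({
    "real_mean": df[num_cols].mean(),
    "fake_mean": num_fake.mean(axis=0),
    "real_std": df[num_cols].std(),
    "fake_std": num_fake.std(axis=0),
})
print(report.round(2))
If the synthetic means track the real ones but the standard deviations are much smaller, the Generator is producing overly similar samples, a classic symptom of mode collapse.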
Generating Synthetic Medical Records
Once trained, we can sample new patient records from random noise:
# generate new synthetic data
z = torch.randn(10, latent_dim)  # 10 synthetic samples
synthetic_data_scaled = generator(z).detach().numpy()
# inverse transform: numeric columns come first in data_processed,
# so the same split order applies here; the encoder maps each
# one-hot block back to its most activated category
num_synthetic = scaler.inverse_transform(synthetic_data_scaled[:, :len(num_cols)])
cat_synthetic = encoder.inverse_transform(synthetic_data_scaled[:, len(num_cols):])
# combine into a dataframe
synthetic_df = pd.DataFrame(num_synthetic, columns=num_cols)
synthetic_df[cat_cols] = cat_synthetic
print(synthetic_df)
age_years weight_kg bmi systolic_bp_mmHg diastolic_bp_mmHg \
0 52.017231 81.608612 27.656334 128.573212 83.058716
1 52.049911 81.312317 27.621420 127.594505 82.952423
2 52.004284 82.087952 27.929657 132.905823 86.329712
3 52.999413 81.105415 27.438124 130.629898 73.565216
4 52.921329 81.529053 27.588375 131.838181 75.753464
5 52.000004 81.658142 27.952627 126.800835 90.225380
6 52.001060 81.389557 27.754536 126.812881 86.642319
7 52.949970 81.282715 27.489239 130.044556 75.864662
8 52.994469 81.226418 27.434101 128.948669 74.581017
9 52.999737 80.793098 27.320147 122.675903 73.543121
heart_rate_bpm body_temp_C fasting_glucose_mg_dL \
0 76.080704 36.872677 89.162071
1 76.967606 36.932449 90.351837
2 76.076080 36.841934 86.890717
3 74.126099 36.732826 105.403008
4 74.325439 36.778103 96.714897
5 79.380981 36.918476 85.431534
6 78.033195 36.897865 85.991028
7 74.698959 36.814651 102.886810
8 74.389000 36.800858 102.727158
9 74.411812 36.774315 107.315430
postprandial_glucose_mg_dL hba1c_percent ... alcohol_units_per_week \
0 129.606934 7.381011 ... 1.700157
1 136.598312 7.326695 ... 1.963187
2 124.384651 7.747846 ... 1.313976
3 171.038132 6.749748 ... 0.067310
4 142.759796 6.931028 ... 0.275825
5 122.884781 7.793033 ... 1.902780
6 124.722404 7.453174 ... 2.455485
7 158.861588 6.892792 ... 0.434503
8 164.242416 6.800313 ... 0.281971
9 180.745453 6.673672 ... 0.202333
smoking_cigs_per_day neuropathy retinopathy hypoglycemia uti \
0 0.001285 0.000759 0.001573 0.100375 0.017240
1 0.005959 0.004435 0.005770 0.477865 0.020934
2 0.000674 0.000674 0.001165 0.017366 0.011973
3 0.000537 0.000106 0.000128 0.001489 0.000276
4 0.001492 0.000539 0.000662 0.005973 0.003403
5 0.000028 0.000010 0.000051 0.146468 0.001701
6 0.000439 0.000241 0.000527 0.302272 0.009886
7 0.005161 0.001687 0.001972 0.061553 0.005250
8 0.002669 0.000815 0.001331 0.019851 0.003891
9 0.000336 0.000034 0.000046 0.151734 0.000156
patient_id visit_date medications \
0 P-2025-001 2024-06-15 Metformin 1000 mg BID, Ramipril 5 mg QD, Atorv...
1 P-2025-001 2024-06-15 Metformin 1000 mg BID, Ramipril 5 mg QD, Atorv...
2 P-2025-001 2024-10-15 Metformin 1000 mg BID, Ramipril 5 mg QD, Atorv...
3 P-2025-001 2025-02-15 Metformin 1000 mg BID, Ramipril 5 mg QD, Atorv...
4 P-2025-001 2025-02-15 Metformin 1000 mg BID, Ramipril 5 mg QD, Atorv...
5 P-2025-001 2024-08-15 Metformin 1000 mg BID, Ramipril 5 mg QD, Atorv...
6 P-2025-001 2024-06-15 Metformin 1000 mg BID, Ramipril 5 mg QD, Atorv...
7 P-2025-001 2025-02-15 Metformin 1000 mg BID, Ramipril 5 mg QD, Atorv...
8 P-2025-001 2025-02-15 Metformin 1000 mg BID, Ramipril 5 mg QD, Atorv...
9 P-2025-001 2025-02-15 Metformin 1000 mg BID, Ramipril 5 mg QD, Atorv...
clinical_notes
0 Routine follow-up.
1 Routine follow-up.
2 Routine follow-up.
3 Routine follow-up.
4 Routine follow-up.
5 Routine follow-up.
6 Routine follow-up.
7 Routine follow-up.
8 Routine follow-up.
9 Mild symptomatic hypoglycemia post-exercise; s...
[10 rows x 34 columns]
You now have realistic-looking synthetic data for experiments. Be careful with the label “privacy-safe”, though: GANs can memorize rare training records, so verify that generated rows do not closely reproduce real patients before sharing them. The data can also be used to augment training datasets for better ML model performance, as sketched below.
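For instance, here is a minimal augmentation sketch in the preprocessed feature space (the sample size of 200 is an arbitrary choice):
# stack real and synthetic rows into one training matrix
z = torch.randn(200, latent_dim)
extra = generator(z).detach().numpy()
augmented = np.vstack((data_processed, extra))
print(augmented.shape)  # original row count + 200 synthetic rows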
Final Words
Building GANs for real-world datasets like medical records is less about exotic model design and more about:
- Data preprocessing discipline
- Matching architecture to data
- Careful training to avoid collapse
- Post-processing to make data usable
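That last point deserves a concrete example. The raw Generator output leaves binary flags as soft values (look at the neuropathy and uti columns above), so a light post-processing pass makes the records usable. A sketch, assuming the synthetic_df built earlier; the 0.5 threshold and the clipping bounds are choices, not rules:
# round soft indicator columns back to hard 0/1 flags
flag_cols = ["neuropathy", "retinopathy", "hypoglycemia", "uti"]
synthetic_df[flag_cols] = (synthetic_df[flag_cols] > 0.5).astype(int)
# clip vitals to the range observed in the real data
for col in ["systolic_bp_mmHg", "diastolic_bp_mmHg", "heart_rate_bpm"]:
    synthetic_df[col] = synthetic_df[col].clip(df[col].min(), df[col].max())
print(synthetic_df.head())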