Fine-Tuning LLMs Guide


Fine-tuning large language models (LLMs) involves adapting a pre-trained model to perform well on a specific task or to reflect a specialized domain of language. Fine-tuning is essential when the model's general knowledge needs refinement to meet the precision required in a specific field or task. In this article, I'll take you through a practical guide to fine-tuning LLMs with Python.

Fine-Tuning LLMs Guide

Fine-tuning is the process of taking a pre-trained model and further training it on a specialized dataset to adapt it for a specific task. In traditional Machine Learning, training typically starts from scratch with a model initialized with random parameters. The model gradually learns by updating these parameters to minimize errors on the dataset. However, fine-tuning large language models (LLMs) begins with a model that has already learned general language patterns from extensive pre-training on vast, diverse datasets. This gives the model a foundational understanding of language that can be tailored by fine-tuning on a smaller, more focused dataset to capture domain-specific nuances.

Fine-tuning is ideal when we need a model to perform well in a particular field or when you need the model to generate text that aligns closely with specialized terminology or style (e.g., legal or medical text). Conversely, using LLMs directly without fine-tuning is effective when a task is broad, has a general purpose, or benefits from the diversity of the original pre-training data, such as casual conversation, creative writing, or answering general knowledge questions.

Fine-tuning requires additional time and resources, so it’s best reserved for tasks where the model’s performance noticeably improves by specializing in a specific domain.

Fine-Tuning LLMs with Python: A Practical Guide

Now, let's see how to fine-tune LLMs practically using Python. In this guide, I'll use a lightweight LLM and a small dataset to explain the process of fine-tuning, so you can follow the whole workflow on modest computational resources.

Step 1: Installation and Initial Setup

Install the necessary libraries and set up the environment:

!pip install transformers datasets

The transformers library, provided by Hugging Face, contains pre-trained models and tools for building and fine-tuning various Natural Language Processing (NLP) models. The datasets library is used to load popular datasets conveniently, which makes it easy to prepare data for training and fine-tuning models. Run this installation command at the beginning to set up these libraries.
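Once the install finishes, you can sanity-check it from Python using only the standard library. This is a quick check, not part of the fine-tuning pipeline itself:

```python
import importlib.metadata

# record the installed version of each required package
versions = {}
for pkg in ("transformers", "datasets"):
    try:
        versions[pkg] = importlib.metadata.version(pkg)
    except importlib.metadata.PackageNotFoundError:
        versions[pkg] = None  # flag anything that failed to install

print(versions)
```

If either value prints as None, re-run the pip command above before continuing.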

Step 2: Loading and Sampling the Dataset

Load a dataset suitable for fine-tuning:

from datasets import load_dataset

# load IMDb dataset and take a small sample
dataset = load_dataset("imdb", split="train[:1%]")
print(dataset[0])
{'text': 'I rented I AM CURIOUS-YELLOW from my video store because of all the controversy that surrounded it when it was first released in 1967. I also heard that at first it was seized by U.S. customs if it ever tried to enter this country, therefore being a fan of films considered "controversial" I really had to see this for myself.
...
(no pun intended) of Swedish cinema. But really, this film doesn\'t have much of a plot.', 'label': 0}

Here, we load the IMDb movie reviews dataset, often used in NLP tasks for sentiment analysis. By specifying train[:1%], we only load 1% of the training set, which is beneficial for quick experimentation and avoids using excessive computational resources. The print(dataset[0]) command checks that the data is loaded correctly.

Step 3: Data Preprocessing

Prepare data by cleaning the text and ensuring consistent formatting:

def preprocess(batch):
    batch['text'] = [text.replace('\n', ' ') for text in batch['text']]
    return batch

# apply preprocessing to the dataset
dataset = dataset.map(preprocess, batched=True)

In this function, we replace newline characters in each review with spaces. This step is useful because some models may not handle newline characters well, especially if they were trained mostly on single-line inputs. dataset.map(preprocess, batched=True) applies this preprocessing function to the entire dataset, batch by batch, which improves efficiency.
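You can check the function on a hand-made batch before touching the real dataset. With batched=True, datasets passes batches as dictionaries of lists, which is exactly the shape the toy sample below mimics:

```python
def preprocess(batch):
    # replace newline characters with spaces in every review of the batch
    batch['text'] = [text.replace('\n', ' ') for text in batch['text']]
    return batch

# a toy batch in the same dict-of-lists shape that datasets passes in
sample = {'text': ["Great film.\nLoved it.", "Terrible\npacing overall."]}
cleaned = preprocess(sample)
print(cleaned['text'])  # newlines are gone, one flat string per review
```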

Step 4: Initializing the Model and Tokenizer

Load a pre-trained model and tokenizer for fine-tuning:

from transformers import AutoTokenizer, AutoModelForCausalLM

model_name = "distilgpt2"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

tokenizer.pad_token = tokenizer.eos_token

Here, we loaded distilgpt2, a lightweight version of GPT-2, which is suitable for causal language modelling tasks. AutoTokenizer and AutoModelForCausalLM automatically download and set up the tokenizer and model architecture for the specified model. Setting the pad_token to eos_token ensures consistent padding in sequences, which is necessary for batch processing.

Step 5: Tokenizing the Data

Convert text into tokens the model can understand:

def tokenize_function(examples):
    tokenized = tokenizer(examples['text'], padding="max_length", truncation=True, max_length=128)
    tokenized['labels'] = tokenized['input_ids'].copy()  # set labels to be the same as input_ids
    return tokenized

tokenized_data = dataset.map(tokenize_function, batched=True)

This function tokenizes each text input by converting it into integer IDs that the model can process. Using padding="max_length" and truncation=True ensures each tokenized sequence has a fixed length of 128 tokens, which keeps memory usage predictable and allows efficient batching. Setting labels as a copy of input_ids prepares the dataset for causal language modelling, where the model learns to predict the next token in a sequence.
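The padding and truncation behaviour can be illustrated without loading a real tokenizer. The helper below is a simplified stand-in for illustration only, not the Hugging Face implementation; 50256 is GPT-2's end-of-sequence ID, reused here for padding just as the pad_token = eos_token line in Step 4 does:

```python
def pad_or_truncate(ids, max_length, pad_id=50256):
    # cut the sequence at max_length, then right-pad shorter ones with pad_id
    ids = ids[:max_length]
    return ids + [pad_id] * (max_length - len(ids))

padded = pad_or_truncate([11, 22, 33], max_length=5)
clipped = pad_or_truncate([1, 2, 3, 4, 5, 6, 7], max_length=5)
print(padded)   # [11, 22, 33, 50256, 50256]
print(clipped)  # [1, 2, 3, 4, 5]
```

Either way, every sequence in a batch ends up exactly max_length tokens long, which is what makes stacking them into a single tensor possible.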

Step 6: Configuring Training Parameters

The next step in the fine-tuning process is to set up hyperparameters for model training:

from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="./results",
    evaluation_strategy="epoch",
    per_device_train_batch_size=4,
    per_device_eval_batch_size=4,
    num_train_epochs=1,
    logging_dir='./logs',
    logging_steps=10,
    save_total_limit=1
)

The TrainingArguments class is used to define the hyperparameters and settings for training. Key parameters include:

  • output_dir: Directory to save model checkpoints.
  • evaluation_strategy="epoch": Evaluate the model at the end of each epoch.
  • per_device_train_batch_size and per_device_eval_batch_size: Number of samples processed per device in each batch during training and evaluation, respectively.
  • num_train_epochs=1: Train the model for a single epoch.
  • logging_steps: How often to log training information.
  • save_total_limit=1: Limits the saved checkpoints to avoid storage overload.
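With these settings, it is easy to estimate how many optimizer steps one epoch will take. Assuming the 1% IMDb sample holds 250 reviews (IMDb's training split has 25,000) and the 80/20 split used in the next step:

```python
import math

n_total = 250                 # assumed size of the 1% IMDb sample
n_train = int(0.8 * n_total)  # 80% of it goes to training: 200 examples
batch_size = 4                # per_device_train_batch_size from above

steps_per_epoch = math.ceil(n_train / batch_size)
print(steps_per_epoch)  # 50
```

Fifty steps on a lightweight model is enough for a quick demonstration, which is exactly the point of this guide.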

Step 7: Splitting the Dataset

Now, divide the dataset into training and evaluation sets:

tokenized_data = tokenized_data.shuffle(seed=42)
split_index = int(0.8 * len(tokenized_data))
train_data = tokenized_data.select(range(split_index))
eval_data = tokenized_data.select(range(split_index, len(tokenized_data)))

Here, we shuffle the dataset (a fixed seed makes the split reproducible) and then slice it into 80% training data and 20% evaluation data. The shuffle must happen exactly once, before slicing: shuffling separately for each split would produce two different orderings and let examples leak between the training and evaluation sets. The held-out 20% lets us assess the model's performance on data it did not train on.
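The key property of a correct split is that the two halves do not overlap. A toy demonstration with plain Python lists (not the datasets API) makes the shuffle-once-then-slice pattern concrete:

```python
import random

data = list(range(100))
random.seed(42)       # fix the seed so the split is reproducible
random.shuffle(data)  # shuffle ONCE, then slice the same ordering

split = int(0.8 * len(data))
train, evaluation = data[:split], data[split:]

print(len(train), len(evaluation))   # 80 20
print(set(train) & set(evaluation))  # set() -- no leakage between splits
```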

Step 8: Setting Up the Trainer & Fine-Tuning the Model

Now, the next step is to initialize the Trainer, which configures the training process for fine-tuning:

from transformers import Trainer
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_data,
    eval_dataset=eval_data
)

The Trainer class in transformers simplifies the training process by automating tasks like gradient updates and model evaluation. It uses training_args for hyperparameters and takes the train_data and eval_data datasets to structure the training and validation process.

Now comes the fine-tuning step itself. Start training the model on the custom dataset:

trainer.train()

This command initiates the fine-tuning process. The train() function performs multiple forward and backward passes through the data, which updates the model’s weights to minimize prediction errors based on the IMDb dataset. Fine-tuning will allow the pre-trained distilgpt2 model to adjust to the specific language and style of movie reviews.
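The forward/backward loop inside train() can be illustrated at its smallest scale: compute a loss, take its gradient, and nudge the parameter against that gradient. The one-parameter toy below (fitting w so that w * x matches y) is purely illustrative; a real LLM applies the same idea to millions of parameters via backpropagation:

```python
# minimize the squared error (w * x - y)^2 with manual gradient descent
w, x, y, lr = 0.0, 2.0, 6.0, 0.05
for _ in range(100):
    grad = 2 * x * (w * x - y)  # derivative of the loss with respect to w
    w -= lr * grad              # the "backward pass": step against the gradient

print(round(w, 4))  # 3.0, since the loss is minimized at w = y / x
```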

Step 9: Save & Test the Fine-tuned Model

Save the model and tokenizer for future use:

model.save_pretrained("./fine_tuned_model")
tokenizer.save_pretrained("./fine_tuned_model")

Once training is completed, saving the model ensures that the fine-tuned parameters can be reused without re-running the entire process. The save_pretrained function saves both the model weights and the tokenizer configuration to a directory.

Now, let’s generate text based on a prompt to evaluate the model:

prompt = "The script"
inputs = tokenizer(prompt, return_tensors="pt")

output = model.generate(
    inputs['input_ids'],
    attention_mask=inputs['attention_mask'],
    max_length=15,
    pad_token_id=tokenizer.eos_token_id
)
print(tokenizer.decode(output[0], skip_special_tokens=True))
The script is a bit too long, but it's a good one.

In this final section, we provide a sample prompt ("The script") to test the model's generative capabilities. By default, generate() uses greedy decoding, appending the most likely next token at each step (pass do_sample=True to sample from the model's learned distribution instead). By decoding and printing the output, you can observe how well the fine-tuned model generates text that aligns with the style of the IMDb reviews.
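The idea behind greedy decoding can be sketched with a toy lookup table standing in for the model's next-token predictions. The table and its entries are invented for illustration only:

```python
# toy "model": maps the last token to its single most likely successor
next_token = {"The": "script", "script": "is", "is": "clever", "clever": "."}

token, output = "The", ["The"]
while token in next_token and len(output) < 10:
    token = next_token[token]  # greedy step: always take the top prediction
    output.append(token)

print(" ".join(output))  # The script is clever .
```

A real model produces a probability distribution over the whole vocabulary at each step; greedy decoding simply takes its argmax, exactly as this table lookup does.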

So, this is how we can fine-tune LLMs. Now, use this knowledge to solve a real-world problem, such as code generation, by fine-tuning an LLM on real-world code files from GitHub.

Summary