Fine-tuning Gemma 27B Model

In my previous blog, I explored the importance of fine-tuning Large Language Models (LLMs) and explained the do-it-for-me approach. In this blog post, I'll guide you through the do-it-yourself method for fine-tuning LLMs.

By the end of this article, you'll have:

  • A clear understanding of when DIY fine-tuning makes financial sense (with real cost comparisons)
  • Step-by-step instructions for setting up RunPod with the right GPU configuration
  • Practical guidance on preparing your environment and installing necessary tools
  • A complete workflow for fine-tuning Gemma 27B using Axolotl, from configuration to training
  • Instructions for quantizing your model to run on consumer hardware
  • Troubleshooting tips based on real-world experience

Let's dive in!

Why DIY Fine-Tuning?

If this is your first time attempting to fine-tune any LLM, I would encourage you to first get your feet wet using a managed service like OpenPipe or Together AI. These services provide a user-friendly interface and handle the complexities of GPU management, model selection, and training configurations. That said, here are the reasons why you might want to consider the DIY approach:

  1. Cost Considerations: Managed services can be expensive, especially for larger models. I will discuss this in more detail below.
  2. Flexibility: You have complete control over the training process, including picking a model that is not yet available as a managed service.
  3. Learning Experience: The DIY approach provides a hands-on learning experience that can be invaluable for your understanding and might prove useful in your career.

Cost Considerations

Let's look at some real-world examples of the cost of fine-tuning LLMs using managed services. The dataset that I used for fine-tuning the Llama 3.1 8B model had around 7K records. Each record contains an input (system prompt + user prompt) and an output (the expected LLM response). If we convert this data into tokens and add up all the input and output tokens, we get roughly 100M input tokens and 10M output tokens. Considering that this is a summarization task, it makes sense that the output is 1/10th of the input. That is 110M tokens in total. If we set aside 10% for testing and use the rest for training, we are left with roughly 99M tokens for training.

But why are we counting tokens in the dataset? Because that is the factor that determines the cost of training. The cost of training an 8B parameter model is $0.48 / 1M tokens (this is OpenPipe's price). That cost jumps to $2.90 / 1M tokens for higher-parameter models.

So, if we were to train the same dataset on Gemma 27B, the cost would be roughly $2.90 * 99 * 2 ≈ $574. I am multiplying the tokens by 2 because I am assuming that we use 2 epochs for training (essentially, we are training the model twice on the same dataset). You should also assume that you might have to repeat the training process a few times to get the right model. So, the cost can easily go up to $1,000 or more. Now that is serious money for a hobby project, and you can understand why I wanted to explore the DIY approach.
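If it helps, here is the same back-of-the-envelope arithmetic as a small Python sketch. The token counts and per-million-token prices are the ones quoted above; treat them as placeholders and plug in your own numbers:

# Rough cost estimate for managed fine-tuning: every epoch re-processes the full training set
def managed_finetune_cost(training_tokens_millions, price_per_million_tokens, epochs=2):
    return training_tokens_millions * price_per_million_tokens * epochs

# ~99M training tokens at $2.90 / 1M tokens for a 27B-class model, 2 epochs
print(f"${managed_finetune_cost(99, 2.90, 2):.2f}")  # -> $574.20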

Side note: If you want a quick estimate of the total cost of fine-tuning for your use case, I found the interactive pricing calculator from Together AI to be very useful.

Together AI Pricing Calculator

Cost considerations for DIY fine-tuning

The primary cost of DIY fine-tuning is the GPU rental. If we look at the cost of renting a GPU on RunPod (which is considered very price competitive), prices range from $0.40 per hour (for an RTX 4000 Ada GPU) to $8 per hour (for a B200 GPU with 160 GB VRAM). So, how would you go about deciding which GPU to rent? Here are the factors to consider:

Key Considerations for picking a GPU for Fine-Tuning:

  1. VRAM Capacity: Determines the maximum size of the model and the batch size you can use. Running out of VRAM is a common bottleneck.
  2. Memory Bandwidth: How quickly data (model parameters, activations, gradients) can be moved between the GPU's memory and its compute units. High bandwidth is essential for keeping the powerful cores fed, especially with large models.
  3. Compute Performance (FLOPS/TOPS): Raw processing power. Modern fine-tuning heavily relies on Tensor Cores for mixed-precision training (FP16, BF16, TF32, and increasingly FP8 on newer architectures like Hopper, Ada Lovelace, and Blackwell).
  4. Architecture: Newer architectures (Blackwell > Hopper > Ada Lovelace) generally offer better performance per watt, more efficient Tensor Cores, and support for newer data formats (like FP8).

So, VRAM is the limiting factor for most models. For a large model (like the 27B Gemma), you will need at least 48GB of VRAM to train it. The other factors mainly affect performance, i.e. how long training takes, which also matters because you are paying for the GPU rental by the hour.

In addition to the parameter count of the base model, another factor that affects the amount of VRAM required is the context length. Longer contexts require more VRAM, but with techniques like sequence parallelism you can spread that memory load across multiple GPUs. Side note: sequence parallelism divides the input sequence into smaller chunks, with each GPU processing one chunk. This approach reduces the VRAM requirement on each individual GPU.
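To build some intuition for where the VRAM goes, here is a deliberately rough Python sketch. The 4-bit weight footprint is simple arithmetic; everything else (activations, KV cache, gradients, optimizer state) depends heavily on the model and your settings, so it is left as a parameter rather than guessed:

# Rough intuition only: 4-bit (QLoRA) base weights are ~0.5 bytes per parameter
# and are replicated on every GPU; sequence parallelism only splits the
# sequence-dependent memory (activations / KV) across GPUs.
GB = 1e9

def qlora_weight_footprint_gb(num_params):
    return num_params * 0.5 / GB

def tokens_per_gpu(context_len, sequence_parallel_degree):
    return context_len // sequence_parallel_degree

print(qlora_weight_footprint_gb(27e9))   # ~13.5 GB per GPU just for the frozen 4-bit weights
print(tokens_per_gpu(102_400, 4))        # each of 4 GPUs holds activations for ~25.6K tokens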

Let's look at this with an example. With L40 GPUs (Ada Lovelace, 48GB of VRAM each), I was able to start training a 27B parameter model with a context length of 22K if I split the input across 4 GPUs. The estimated time for completing the training was around 14 hours. Given the cost of $0.86 per hour for the L40, this translates to a total cost of $0.86 * 4 * 14 = $48.16. But the resulting model would be limited to a context length of 22K, while the OpenPipe cost above was for training a model with a much longer context length (around 128K).

So, let's look at the cost of training a 27B parameter model with a context length of 128K. I looked at using the H200 GPU (with 141GB of VRAM), which is priced at $3.99 per hour. With 4 H200 GPUs, I was able to train the model with a context length of more than 100K. Since these are much more capable GPUs, the estimated training time was around 5 hours. So, the total cost for training a 27B parameter model with a context length of 128K would be $3.99 * 4 * 5 = $79.80. Not bad, considering that the cost of do-it-for-me was roughly 7 times that amount. So, even if I don't get it right the first time, I can still afford to repeat the training process a few times before I reach the cost of do-it-for-me. Also, since we are seeing newer open models being released frequently, I can always switch to a newer model if I find one that is better suited for my use case.
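The rental math is simple enough to sanity-check in a few lines; the rates and durations below are from my runs and will differ for yours:

# GPU rental cost = hourly rate x number of GPUs x training hours
def rental_cost(rate_per_hour, num_gpus, hours):
    return rate_per_hour * num_gpus * hours

print(rental_cost(0.86, 4, 14))   # 4x L40, 22K context    -> 48.16
print(rental_cost(3.99, 4, 5))    # 4x H200, >100K context -> 79.80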

The economics of DIY fine-tuning become much more compelling if you pick a smaller model to start with (e.g. the Llama 3.2 1B model). At that size, I can finish the training on a consumer GPU like the RTX 4090 with 24 GB of VRAM, which I happen to have. So, why pick the larger Gemma 27B model?

Why Gemma-3 27B?

The Gemma 27B model is considered a state-of-the-art (as of April 2025) open model that has been shown to outperform many other models on various tasks, and it supports much larger context lengths. Specifically, for the HN Companion project, I had collected more than 14K records of Hacker News homepage discussions and their summaries. The average context length of the discussions was around 11K tokens, but really long discussions can go up to 100K tokens or more. And the need for AI assistance to summarize a discussion increases with its length. So, I was looking for a model that can handle long context lengths. I also wanted to get comfortable with the process of fine-tuning LLMs with large context lengths.
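If you are deciding whether your own use case needs long context, it is worth measuring the token-length distribution of your dataset before picking a model and a sequence length. A minimal sketch, assuming a train.jsonl in the chat messages format described later (the file path and field names are illustrative):

# Measure the token-length distribution of a chat-format JSONL dataset.
# Assumes each line looks like {"messages": [{"role": ..., "content": ...}, ...]}
import json
import statistics
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("google/gemma-3-27b-it")  # needs HF access to Gemma

lengths = []
with open("train.jsonl") as f:
    for line in f:
        record = json.loads(line)
        text = " ".join(m["content"] for m in record["messages"])
        lengths.append(len(tokenizer(text)["input_ids"]))

print("mean:", statistics.mean(lengths))
print("p95:", sorted(lengths)[int(0.95 * len(lengths))])
print("max:", max(lengths))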

Why RunPod?

RunPod is a cloud GPU rental service that offers competitive pricing and a user-friendly interface for managing GPU instances. This space is crowded, with many other options, but I found that most solutions are targeted at enterprise customers, where access to GPUs with higher VRAM is available only if you commit to weeks or months. RunPod, on the other hand, is billed by the second. You buy credits in advance, and you can use them for any GPU instance. This is a great option for hobbyists and small projects, as it allows you to experiment with different GPU types and configurations without long-term commitments.

RunPod GPU inventory

Step-by-Step RunPod Setup

We will go through the process of setting up a RunPod account, creating a pod, configuring the environment, and installing the necessary software for fine-tuning the Gemma 27B model. The process with any other GPU infrastructure-as-a-service (IaaS) provider should be similar, but I will be using RunPod as the example.

Account Setup and SSH Configuration

After creating your RunPod account, you will need to set up SSH access to your pod. Go to settings and look for the 'SSH Public Keys' section. Copy and paste your public key into the text box. If you don't have a key pair yet, you can generate one using the instructions here.

Pod Creation and Storage Configuration

Once your SSH key is set up, you can create a pod. Go to the pod creation page and select the GPU type you want to use. Pick a GPU with enough VRAM for your model. If you need to fine-tune with larger context lengths, you will need to provision a pod with multiple GPUs. In my case, I provisioned a pod with 4 H200 GPUs, each with 141GB of VRAM (564GB in total). RunPod allows you to create a pod with B200 GPUs as well, but I found that the software I was using (Axolotl) and its dependencies (PyTorch) did not yet support the B200's newer instruction set in current builds.

Storage Configuration: The default pod container storage is 20GB, which is not enough for the model and training data. RunPod lets you attach additional persistent disk space that is mounted into the container. If you start and stop the pod, the data on the persistent disk is retained. For my pod, I attached 200GB of persistent disk space.

RunPod Pod Creation template

Once these settings are configured, you can deploy the pod. It will take a few minutes to provision the pod.

Deploy pod on demand

Once the pod is deployed, you should see a 'Connect' button; this brings up a modal with the SSH command details. Pick the 'SSH over exposed TCP' option, and use that to connect to the pod.

SSH connection details

Great! You are now connected to your pod 🎉. You can use the command line to install the necessary software and configure the environment for fine-tuning the model.

Software Installation and Configuration

The first step is to configure the environment and install the necessary software and dependencies for fine-tuning the model. This includes:

  1. Setting up the required environment variables
  2. Installing the necessary libraries and dependencies
  3. Starting and managing the training process

Environment Variables: Since we want Hugging Face downloads (models and cache) to go to the attached persistent disk (which is mounted at /workspace), we need to set up a few environment variables.

# Make sure that the Hugging Face cache is set to the persistent disk (which is mounted to /workspace)
export HF_HOME="/workspace/huggingface"
# Get the Hugging Face token from your account settings
export HF_USERNAME=<your_hugging_face_username>
export HF_TOKEN=<your_hugging_face_token>
# I also set up the following environment variable to avoid running out of memory
export PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True

You also need to upload your training and test data to the pod. You can use the scp command to copy files from your local machine to the pod, with the following syntax: scp -P <port> -i <path_to_your_private_key> <path_to_your_file> <username>@<pod_ip>:<destination_path>. Keep in mind that you will get a different IP and port for each pod you create; you can find them in the SSH connection details. Following the connection details from the image above, the command to copy the training data would look like this:

# Copy the training data to the pod (assuming you have created the `hn-finetune` folder in `/workspace` directory)
scp -P 15924 -i ~/.ssh/id_ed25519 gemma-5k-mixed-prompt-30Kcontext.jsonl root@103.196.86.37:/workspace/hn-finetune/train.jsonl
# Copy the test data to the pod
scp -P 15924 -i ~/.ssh/id_ed25519 gemma-test-500-mixed-prompt-30Kcontext.jsonl root@103.196.86.37:/workspace/hn-finetune/test.jsonl
# The default image does not include basic utilities like nano and tmux, so install them manually
apt update
apt install tmux nano
# At this point, also check the installed CUDA runtime version
nvcc --version
# check the CUDA driver version
nvidia-smi

Installing the necessary libraries and dependencies: In my previous attempts at fine-tuning LLMs, I had used autotrain-advanced from Hugging Face 🤗. I found that it is not updated as frequently as I would like, and it did not have out-of-the-box support for the Gemma 27B model. So I picked Axolotl for this project.

Following is a brief explanation of Axolotl and its features:

Axolotl is a tool designed to streamline post-training for various AI models. Post-training refers to any modifications or additional training performed on pre-trained models - including full model fine-tuning, parameter-efficient tuning (like LoRA and QLoRA), supervised fine-tuning (SFT), instruction tuning, and alignment techniques.

Installing Axolotl and dependencies: Use the following commands to install Axolotl and its dependencies:

# create a virtual environment
python3 -m venv .venv
source ./.venv/bin/activate
# update pip and install dependencies
python3 -m pip install --upgrade pip setuptools wheel ninja
# install pytorch
pip3 install torch --index-url https://download.pytorch.org/whl/cu126
# install axolotl and its dependencies
pip3 install --no-build-isolation axolotl[flash-attn,deepspeed,ring-flash-attn]

Once you have reached this point, you should have a working environment with all the necessary libraries and dependencies installed. Congrats! You are now ready to start the fine-tuning process.

Training Configuration

Axolotl adopts a config-driven approach, allowing users to define their training and evaluation settings in a structured manner. To help you get started, Axolotl provides a set of pre-defined configurations for popular models; you can find the list here. Since I was targeting Gemma, I chose this template as the base for my configuration.

I ended up with the following configuration file, which I saved as gemma-27b-qlora.yml in the hn-finetune folder.

base_model: google/gemma-3-27b-it
model_type: AutoModelForCausalLM
tokenizer_type: AutoTokenizer
# gemma3 doesn't seem to play nice with ddp
ddp_find_unused_parameters: true
load_in_8bit: false
load_in_4bit: true
strict: false
# huggingface repo
chat_template: gemma3
datasets:
  - path: ./train.jsonl
    type: chat_template
    field_messages: messages
test_datasets:
  - path: ./test.jsonl
    split: train
    type: chat_template
    field_messages: messages
val_set_size: 0.0
output_dir: ./outputs/out
adapter: qlora
lora_r: 32
lora_alpha: 16
lora_dropout: 0.05
lora_target_linear: true
# Since we are using the gemma model (which is multimodal), we need to set the target modules for the model to train only for text
# I had missed this in my training
lora_target_modules: 'language_model.model.layers.[\d]+.(mlp|cross_attn|self_attn).(up|down|gate|q|k|v|o)_proj'
sequence_len: 102400
sample_packing: true
eval_sample_packing: false
pad_to_sequence_len: true
gradient_accumulation_steps: 4
micro_batch_size: 1
num_epochs: 2
optimizer: adamw_bnb_8bit
lr_scheduler: cosine
learning_rate: 0.0002
bf16: auto
tf32: true
gradient_checkpointing: true
gradient_checkpointing_kwargs:
  use_reentrant: false
resume_from_checkpoint:
logging_steps: 1
flash_attention: true
warmup_ratio: 0.1
evals_per_epoch:
saves_per_epoch: 1
weight_decay: 0.0
special_tokens:
# Sequence parallelism: split each sequence across the 4 GPUs I am using
sequence_parallel_degree: 4
heads_k_stride: 1

Let's go through the key parameters in this configuration file - you can also refer to the documentation for more details.

  • base_model: google/gemma-3-27b-it -> this is the base model that we are using. Axolotl will download the model from Hugging Face and use it for fine-tuning.
  • datasets: -> this points to the training and test datasets. The path parameter is the path to the dataset file, and the type parameter is the type of dataset (in this case, a chat template). For details about the dataset format, refer to my previous blog post; an illustrative record is also sketched right after this list.
  • num_epochs: 2 -> the number of epochs for training. I found that 2 epochs was a good starting point for fine-tuning the model.
  • gradient_accumulation_steps: 4 -> gradients from 4 micro-batches are accumulated before each optimizer step, giving a larger effective batch size without increasing the memory footprint.
  • sequence_parallel_degree: 4 -> the number of GPUs we are using for training; the input sequence is split across them.
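To make the datasets entry above concrete, here is an illustrative sketch of what a single record in train.jsonl looks like for the chat_template / messages format. The content strings are placeholders, not my actual prompts (those are described in the previous post):

# Illustrative record for a chat_template dataset (field_messages: messages).
# The content strings are placeholders.
import json

record = {
    "messages": [
        {"role": "system", "content": "You are a summarizer for Hacker News discussions."},
        {"role": "user", "content": "<full discussion thread goes here>"},
        {"role": "assistant", "content": "<expected summary goes here>"},
    ]
}

with open("train.jsonl", "a") as f:
    f.write(json.dumps(record) + "\n")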

Training Process

You can start the training job using the following command:

# Start the training job from the configuration file - gemma-27b-qlora.yml
axolotl train gemma-27b-qlora.yml

This will start the training process, and you should see the training logs in the console. The training will take a few hours to complete, depending on the size of the model and the number of GPUs you are using; in my case, it took just over 5 hours.

Console log of training process

Monitoring training metrics: In the image above, you can see the training metrics for the model. At the bottom, you can see the training time (time taken so far and estimated time remaining). In this case, we see: 30/292 [33:04<4:48:31, 66.08s/it]. This indicates that we have completed 30 iterations out of 292, the estimated time remaining is 4 hours and 48 minutes, and the average time per iteration is 66.08 seconds. You can also see the training loss (0.7995 in the screenshot) and the learning rate for each iteration. The training loss is a measure of how well the model is performing on the training data. It will jump around a lot in the beginning, but it should stabilize as training progresses; in my experience you should see something close to 0.5 towards the end of the training run.
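The progress line also lets you sanity-check the total run time early on: multiply the seconds per iteration by the total number of iterations. A quick sketch using the numbers from my run:

# Estimate total training time from the progress line "30/292 [... 66.08s/it]"
secs_per_iter = 66.08
total_iters = 292
print(f"{secs_per_iter * total_iters / 3600:.1f} hours")  # ~5.4 hours, matching the run above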

The training process also creates a set of checkpoints in the output directory (./outputs/out in this case). You can use these checkpoints to resume the training process if it fails, or to evaluate the model at different stages of training.

Model Merging

The output of the training process is a set of LoRA adapter weights stored in the output directory. You can use the following command to merge these weights with the base model.

# Merge the model weights using the configuration file - gemma-27b-qlora.yml
axolotl merge-lora gemma-27b-qlora.yml

This will merge the adapter weights into the base model and create a new model in the output directory (./outputs/out/merged in this case). You can use this model for inference or further fine-tuning.
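Besides the axolotl inference command shown in the next section, you can also load the merged model directly with the transformers library for a quick smoke test. A minimal sketch, assuming the merged weights are in ./outputs/out/merged and you have enough GPU memory to load the 27B model in bf16; depending on your transformers version, the multimodal Gemma 3 checkpoint may require the Gemma 3-specific classes instead of AutoModelForCausalLM:

# Quick smoke test of the merged model (a sketch, not a full evaluation).
# Requires accelerate for device_map="auto".
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

path = "./outputs/out/merged"
tokenizer = AutoTokenizer.from_pretrained(path)
model = AutoModelForCausalLM.from_pretrained(path, torch_dtype=torch.bfloat16, device_map="auto")

messages = [{"role": "user", "content": "Summarize this Hacker News discussion: ..."}]
inputs = tokenizer.apply_chat_template(messages, add_generation_prompt=True, return_tensors="pt").to(model.device)
output = model.generate(inputs, max_new_tokens=256)
print(tokenizer.decode(output[0][inputs.shape[-1]:], skip_special_tokens=True))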

Evaluation of the Model

Axolotl provides a set of commands for running inference on the trained model. For more details on other commands, refer to the documentation.

# Run inference on the model using the configuration file - gemma-27b-qlora.yml
# Use the `--gradio` flag to start a gradio server - this will allow you to test the model using a web interface
axolotl inference gemma-27b-qlora.yml --gradio

I have uploaded the model that I trained to Hugging Face. You can find the model here.

Quantization

Hopefully, you now have a working model that you are happy with. But the model is still too large to run on a consumer GPU (with up to 24 GB of VRAM), so you will need to quantize it to reduce its size and make it more efficient for inference. Quantization reduces the precision of the model weights (and activations) to shrink the memory footprint and improve performance, which enables deployment on consumer hardware while maintaining acceptable quality. Among quantization methods, Q4_K_M (4-bit "K-Medium" quantization, i.e. converting 16/32-bit floating-point weights to roughly 4-bit integers) has emerged as a popular balance between efficiency and accuracy, so that is what I will be using for this model.
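For a rough sense of the size reduction, here is the back-of-the-envelope arithmetic (ballpark only: GGUF files keep some tensors at higher precision, and Q4_K_M averages a bit more than 4 bits per weight):

# Ballpark model-file sizes before and after quantization
params = 27e9
bf16_gb = params * 16 / 8 / 1e9       # ~54 GB for the bf16 GGUF
q4_k_m_gb = params * 4.8 / 8 / 1e9    # ~16 GB, assuming ~4.8 bits per weight on average
print(round(bf16_gb), round(q4_k_m_gb))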

Quantization Process

I use the Llama.cpp toolkit for quantization. This is a C++ library that provides a set of tools for quantizing and running LLMs on consumer hardware. The workflow is as follows:

  1. Convert the model to the GGUF format using the convert_hf_to_gguf.py script from the Llama.cpp toolkit.
  2. Use the llama-quantize tool to quantize the GGUF model to the Q4_K_M format.

Convert the model to GGUF format: I am assuming that you have already installed the Llama.cpp toolkit and have the convert_hf_to_gguf.py script available. If not, you can clone the git repo and follow the instructions here to build the toolkit.

# Convert the model to GGUF format
python3 convert_hf_to_gguf.py <path_to_merged_model> --outfile <model_file_name> --outtype bf16 --no-lazy --model-name <model_name>

This will create a GGUF model file at the path you specified with --outfile. You can use this file as the input for quantization.

Quantize the model: Once you have the GGUF model file, you can use the llama-quantize tool to quantize the model to the Q4_K_M format.

# Quantize the model to Q4_K_M format
llama-quantize <input_gguf_file> <quantized_output_gguf_file> Q4_K_M

Deployment Options

Once you have the quantized model, you can start using it on your local machine. One of the most popular options for running LLMs on consumer hardware is Ollama. To use the quantized GGUF with Ollama, you need to wrap it in an Ollama model definition (a Modelfile) and register it. More details here.
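A minimal sketch of that step, with placeholder file and model names (check the Ollama documentation for template and parameter options):

# Write a minimal Ollama Modelfile that points at the quantized GGUF.
# File and model names here are placeholders.
from pathlib import Path

Path("Modelfile").write_text("FROM ./gemma-27b-hn-q4_k_m.gguf\n")
# Then, from the shell:
#   ollama create gemma-27b-hn -f Modelfile
#   ollama run gemma-27b-hn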

Troubleshooting Guide

The most common issues I faced during training were related to GPU memory and the training configuration, typically surfacing as a CUDA out-of-memory error. The usual remedies are to reduce micro_batch_size or sequence_len, increase gradient_accumulation_steps, and keep gradient_checkpointing enabled; for more suggestions, check the Axolotl documentation.

Conclusion

Fine-tuning LLMs can be a complex and time-consuming process, but it is also a rewarding experience. The DIY approach allows you to have complete control over the training process and the model configuration. I hope this blog post has provided you with a good starting point for fine-tuning LLMs on your own.

If you have any questions or feedback, please feel free to reach out to me on Bluesky.

p.s. If you are wondering which GPU to pick for your next fine-tuning project, I would suggest taking a look at the following Gist that I created with the help of Gemini AI. I was confused myself, and I found this to be a good starting point.