In this blog, we explore what fine-tuning is, why it is worth doing, how to prepare a dataset, and two ways to run a fine-tuning job: a 'do-it-for-me' service like OpenPipe, and a 'do-it-yourself' setup with Hugging Face AutoTrain.
Fine-tuning is the process of taking a pre-trained LLM such as Llama or Gemma (one that has already learned general language patterns) and training it further on a smaller, task-specific dataset. Instead of learning from scratch, you are adapting the model's existing knowledge so it becomes better at a particular task, understands a specific domain's jargon, or adopts a certain style or tone. It's about specialization, not foundational learning. In simple terms: think of a large language model (LLM) like ChatGPT or Llama as a highly educated graduate with a vast general knowledge base, built from reading a massive chunk of the internet. Fine-tuning is like giving that graduate specific on-the-job training for a particular role or company.
When I started using LLMs, I was not initially convinced about the need for fine-tuning. I should admit, I felt intimidated by the whole process. My thinking was: why not pick a good enough frontier model, write an elaborate enough prompt, and get the model to do what I wanted? But as I learned more about fine-tuning, it became less intimidating, and I realized it has some powerful advantages.
Let's explore the benefits of fine-tuning:
Improved Performance on Specific Tasks: A fine-tuned model will almost always outperform a general model on the narrow task it was trained for. A quantized (think compressed) Llama 8B model, when fine-tuned, can perform as well as or better than a frontier model on that specific task. This opens up opportunities that were previously not possible: lower costs, lower latency, higher accuracy, and more control - what is not to like about that 😎?
Domain Adaptation: What if your domain is proprietary or niche enough that the general model was never trained on it? Fine-tuning allows the model to learn the specific jargon, nuances, and context of a particular field (like legal, medical, or a specific company's internal knowledge) that might not be well represented in the general training data.
Better Adherence to Style and Tone: You can fine-tune a model to consistently respond in a specific voice, style, or format (e.g., maintaining a particular brand voice for marketing copy or chatbot responses). As an example, in our HNCompanion project, we wanted the model to respond in a specific format. To get that output from a general model, we had to provide an elaborate system prompt and a lot of examples. The system prompt alone was more than 1,000 tokens, which made inference slow and expensive for every request. Fine-tuning allowed us to drastically shrink the system prompt and get the model to respond in the desired format more consistently.
Once you have established your goals for fine-tuning (e.g., improving performance on a specific task or reducing costs), you have two approaches for fine-tuning LLMs:
Do-It-For-Me: This is the easiest option: use a 'Fine-tuning-as-a-Service' platform like OpenPipe or Together.ai to fine-tune models without managing the underlying infrastructure. These platforms often provide user-friendly interfaces and pre-built pipelines for fine-tuning.
Do-It-Yourself: This option gives you more control and flexibility, allowing you to fine-tune models on your own infrastructure or cloud resources. Depending on your use case, this can also be the more cost-effective option. You can use libraries like Hugging Face AutoTrain or Lightning AI's LitGPT to set up your fine-tuning pipeline. More advanced users can go down one more layer of abstraction and use the Hugging Face Transformers library directly. With this option, you have to manage the GPU infrastructure for training yourself. That could be your own hardware (if you have an Nvidia GPU with 16 or 24 GB of memory), a hosted environment like Google Colab, or GPU providers like Lambda Labs and RunPod. For enterprise use cases, you might want to consider cloud providers like AWS, Azure, or GCP (I won't be covering cloud providers in this blog).
Whichever option you choose, the first step is to prepare your dataset. The quality of your dataset directly impacts the performance of your fine-tuned model. You should have enough examples (at least a couple of hundred, and ideally a few thousand) that are relevant to your task. The dataset should also be diverse enough to cover the different scenarios that you expect the model to handle.
In my opinion, the best format for your dataset is JSONL: a list of examples, one per line. Each line in the file should be a JSON object with the following structure:
{"messages": [{"role": "system","content": "You are a helpful assistant"},{"role": "user","content": "What is the capital of Tasmania?"},{"role": "assistant","content": "Hobart"}]}
If you have been using an LLM for a while, you might already have a lot of examples you can use to create your dataset. In essence, you log the end-user interactions with the model and use those as your dataset. If you are not happy with the quality of the LLM's responses, you can manually curate the dataset by providing the correct responses. 'Fine-tuning-as-a-Service' providers like OpenPipe make this process quite easy.
In the case of HNCompanion, we created the dataset manually. We first created a list of past HN homepage posts and their comments. Then we used Google Gemini to generate the desired responses using our elaborate system prompt. We stored the LLM response and other metadata, like the input/output token counts, in a SQLite database. Once we had the data in the database, we used a simple NodeJS script to convert it into JSONL format. To keep the dataset diverse, we used the metadata to exclude outliers (e.g., posts with very few comments, or posts that were extremely long - some were so large that the total token count was close to 1M). I have uploaded the HNCompanion training data to the Hugging Face Hub. You can find the dataset here.
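Our conversion script was written in NodeJS; the Python sketch below illustrates the same idea - pull rows out of SQLite, drop outliers by token count, and write one chat-formatted JSON object per line. The table and column names are invented for illustration and are not the actual HNCompanion schema:

```python
import json
import sqlite3

SYSTEM_PROMPT = "You are HNCompanion, a summarizer of Hacker News discussions."  # placeholder

# Hypothetical schema: posts(post_text, summary, input_tokens)
conn = sqlite3.connect("hn_posts.db")  # placeholder database file
rows = conn.execute(
    """
    SELECT post_text, summary, input_tokens
    FROM posts
    WHERE input_tokens BETWEEN 500 AND 100000   -- drop tiny and extremely long discussions
    """
)

with open("train.jsonl", "w", encoding="utf-8") as out:
    for post_text, summary, _ in rows:
        record = {
            "messages": [
                {"role": "system", "content": SYSTEM_PROMPT},
                {"role": "user", "content": post_text},
                {"role": "assistant", "content": summary},
            ]
        }
        out.write(json.dumps(record) + "\n")
```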
If you are using a 'Fine-tuning-as-a-Service' provider, you can upload the dataset in JSONL format to the platform. The platform takes care of the rest, including giving you an estimated cost for the training and then running the actual training. Once the training is done, you can test the model's performance, and once you are happy with the results, you can deploy the model on their platform or download it to host yourself.
OpenPipe provides a convenient and user-friendly interface for fine-tuning LLMs. Their process is simple and representative enough that I will use it as an example for the rest of the blog. To get good results, you might end up repeating the prepare-dataset, train, evaluate, and deploy steps multiple times. Once you create an account with OpenPipe, you get a generous $100 in free credits, which should be enough to get you through the first few iterations of the process. They also provide detailed documentation to help you get started.
OpenPipe solution - Step 1: Prepare the dataset
If you have already created your dataset in JSONL format, you can upload it to OpenPipe.
Once the dataset is uploaded, the system will automatically start computing the input and output token counts for each example. It will also split the dataset into training and test sets. You can also use this interface to edit each record in the dataset (if required).
OpenPipe solution - Step 2: Start training
Once the dataset is ready (computing tokens can take some time for large datasets), you can start the training process. You can choose the base model (e.g., Llama 2, Gemma). OpenPipe provides a good set of default values for the training parameters. You can also see the 'Estimated training price', which is based on the dataset size (specifically the input and output token counts).
OpenPipe solution - Step 4: Model evaluation
Once the training is done, you can evaluate the model using the test dataset. You can also compare the performance of the fine-tuned model with the base model.
The dataset view will now show the inference results for each example in the test dataset. This is a great way to see how the model is performing on the test dataset.
OpenPipe solution - Step 5: Deploy the model
Once you are happy with the results, you can deploy the model on OpenPipe's platform, or download it to run on your own machine or cloud provider. For HNCompanion, I have uploaded the trained model to the Hugging Face Hub. Here are the links for the models:
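Whether you deploy on OpenPipe or host the weights yourself, loading a downloaded model with the Hugging Face Transformers library looks roughly like this; the model id below is a placeholder, not the actual HNCompanion repository:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "your-username/your-fine-tuned-model"  # placeholder, replace with your own repo

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

messages = [
    {"role": "system", "content": "You are a helpful assistant"},
    {"role": "user", "content": "What is the capital of Tasmania?"},
]
# Apply the model's chat template and generate a response
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)
outputs = model.generate(inputs, max_new_tokens=64)
print(tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True))
```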
If you are using a DIY solution, I would recommend getting started with the Hugging Face AutoTrain infrastructure to fine-tune your model. You can provision the AutoTrain infrastructure in Hugging Face Spaces using this link. Once your Space is cloned and provisioned, you can upload your dataset in JSONL format to the Space. You can also use a dataset that you have previously uploaded to the Hugging Face Hub.
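If your dataset is already on the Hub, it is worth inspecting it with the datasets library before kicking off a training run. Here is a quick sketch using the HNCompanion dataset (assuming the default train split):

```python
from datasets import load_dataset

# The HNCompanion training data, referenced later in the AutoTrain config
dataset = load_dataset("georgeck/hacker-news-discussion-summarization-large", split="train")

print(dataset)     # row count and column names
print(dataset[0])  # inspect one record; the config below maps the 'messages' column as text_column
```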
The parameters that you see in this interface are the GUI representation of the parameters described in the AutoTrain documentation here. I would highly recommend going through the documentation to understand the parameters and their impact on the training process. As an example, setting unsloth to true makes the training process faster and uses significantly less GPU memory.
As I have noted here, fine-tuning LLMs with unsloth.ai is a game-changer! It reduces GPU memory needs for large-context models and cuts training time. If you are using cloud GPU providers that bill by the hour, this will save you considerable money. And if you own a GPU, Unsloth.ai unlocks new possibilities: it lets you train models that were previously out of reach, making the most of your hardware. Unsloth.ai also provides a detailed Fine-Tuning Guide to help you get started.
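If you later go a layer below AutoTrain and drive the training yourself, Unsloth's Python API follows roughly this pattern. Treat it as a hedged sketch based on Unsloth's documentation and notebooks; check their Fine-Tuning Guide for the current parameter names:

```python
from unsloth import FastLanguageModel

# Load a 4-bit quantized base model with Unsloth's memory-efficient kernels
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="meta-llama/Meta-Llama-3-8B-Instruct",
    max_seq_length=8192,
    load_in_4bit=True,
)

# Attach LoRA adapters so only a small set of weights is trained
model = FastLanguageModel.get_peft_model(
    model,
    r=16,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj", "gate_proj", "up_proj", "down_proj"],
    lora_alpha=16,
)
# From here, the model/tokenizer pair can be passed to a TRL trainer (e.g., SFTTrainer)
```

Below is the AutoTrain configuration for this fine-tuning run - the same parameters the GUI exposes, expressed as a config file: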
```yaml
task: llm-orpo
base_model: meta-llama/Meta-Llama-3-8B-Instruct
project_name: your-project-name
log: tensorboard
backend: local

data:
  path: georgeck/hacker-news-discussion-summarization-large
  train_split: train
  valid_split: null
  chat_template: chatml
  column_mapping:
    text_column: messages

params:
  block_size: 1024
  model_max_length: 8192
  max_prompt_length: 512
  epochs: 3
  batch_size: 2
  lr: 3e-5
  peft: true
  quantization: int4
  target_modules: all-linear
  padding: right
  optimizer: adamw_torch
  scheduler: linear
  gradient_accumulation: 4
  mixed_precision: fp16
  unsloth: true

hub:
  username: ${HF_USERNAME}
  token: ${HF_TOKEN}
  push_to_hub: true
```
By navigating to the Space settings, you can configure and provision the GPU resources that you need for the training.
Once you have configured the compute resources, you can start the training process. The training process will take some time depending on the size of the dataset and the compute resources that you have provisioned.
In this blog, we have explored the benefits of fine-tuning LLMs and how you can get started with it. We looked at the do-it-for-me and do-it-yourself options for fine-tuning LLMs. In a later blog, I will explore the process of fine-tuning on GPU providers like RunPod in more detail.
I hope you found this useful. If you have any questions or comments, please feel free to reach out to me on 🦋 Bluesky. Let's continue the conversation there.