𝗜 𝗙𝗶𝗻𝗲-𝗧𝘂𝗻𝗲𝗱 𝗮 𝟮𝟳𝟬𝗠 𝗠𝗼𝗱𝗲𝗹 𝗼𝗻 𝗠𝘆 𝗟𝗮𝗽𝘁𝗼𝗽
I am testing three ways to fine-tune models. I use the same task for all three. I scale from the smallest model to the largest.
The series follows this path:
- Full Fine-Tuning (270M parameters)
- LoRA (1.5B parameters)
- QLoRA (7B parameters)
I want to understand the mechanics. I do not want to follow a tutorial blindly.
In this first step, I used full fine-tuning. This method updates every weight in the model. Of the three methods, it costs the most in memory and compute.
I used the Banking77 dataset. It contains about 13,000 customer support messages. The goal is to classify each message into one of 77 intents, such as a lost card or an exchange rate question.
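I chose Gemma 3 (270M). This model is small enough to train on a laptop with Apple Silicon. Full fine-tuning needs roughly four times the model's size in memory: one copy for the weights, one for the gradients, and two for the optimizer states, assuming an Adam-style optimizer. Activations come on top of that.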
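The four-times figure can be checked with quick arithmetic. This is a rough sketch, not a measurement: it assumes every value is stored in fp32 (4 bytes) and that the optimizer keeps two state values per weight, as Adam-style optimizers do. Neither assumption is stated in the post, and the function name is made up for illustration.

```python
# Back-of-the-envelope memory estimate for full fine-tuning.
# Assumptions (not from the post): fp32 storage (4 bytes per value) and an
# Adam-style optimizer that keeps two state values per weight.
# Activations and framework overhead are NOT included.

def full_finetune_memory_gb(num_params: int, bytes_per_value: int = 4) -> dict:
    """Return an approximate memory breakdown in gigabytes."""
    per_copy = num_params * bytes_per_value / 1e9  # one full copy of the model
    return {
        "weights": per_copy,
        "gradients": per_copy,
        "optimizer_states": 2 * per_copy,
        "total": 4 * per_copy,
    }

estimate = full_finetune_memory_gb(270_000_000)
for name, gb in estimate.items():
    print(f"{name:>16}: {gb:.2f} GB")
```

Under these assumptions a 270M model needs a little over 4 GB before activations, which is why it fits on a laptop while a 7B model would not.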
Instead of adding a classification head, I made the model generate the intent as text. This makes the process identical to instruction tuning. It prepares the project for the next steps.
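Framing classification as text generation means each training example becomes a prompt plus a short completion. The sketch below shows one way to build such a pair. The template wording and the helper name are hypothetical, since the post does not show the exact prompt.

```python
# Hypothetical prompt template; the post does not show the exact wording.
PROMPT_TEMPLATE = (
    "Classify the customer message into one banking intent.\n"
    "Message: {message}\n"
    "Intent:"
)

def build_example(message: str, intent: str) -> dict:
    """Turn a (message, label) pair into a prompt/completion pair."""
    return {
        "prompt": PROMPT_TEMPLATE.format(message=message.strip()),
        # The leading space separates the label from "Intent:".
        "completion": " " + intent,
    }

example = build_example("I still have not received my new card.", "card_arrival")
print(example["prompt"] + example["completion"])
```

Because the label is plain text, the same data format can be reused for LoRA and QLoRA later without changing the model architecture.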
A critical step is masking the loss. You compute the loss only on the label tokens and ignore the prompt tokens. If you skip this, the model spends most of its training signal on learning to reproduce your prompt.
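A minimal sketch of loss masking, using toy token ids instead of a real tokenizer. It relies on the common convention that label positions set to -100 are skipped by the loss: this is the default `ignore_index` of PyTorch's `CrossEntropyLoss`, and Hugging Face causal language models follow it. The helper name and the ids are made up for illustration.

```python
IGNORE_INDEX = -100  # default ignore_index of PyTorch's CrossEntropyLoss

def build_labels(prompt_ids: list, label_ids: list) -> tuple:
    """Concatenate prompt and label tokens, masking the prompt in the labels."""
    input_ids = prompt_ids + label_ids
    # Prompt positions get IGNORE_INDEX, so only label tokens contribute to the loss.
    labels = [IGNORE_INDEX] * len(prompt_ids) + label_ids
    return input_ids, labels

# Toy token ids standing in for a tokenized prompt and a tokenized intent.
prompt_ids = [11, 12, 13, 14]
label_ids = [21, 22]

input_ids, labels = build_labels(prompt_ids, label_ids)
print(input_ids)
print(labels)
```

The model still sees the full prompt as input. Only the grading changes: the four prompt positions are excluded from the loss, and the two label positions are scored.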
I used a low learning rate of 5e-5. Full fine-tuning moves every weight, so a high learning rate can overwrite pretrained knowledge. In my runs, a rate of 2e-4 caused the model to fail.
The results:
- 96% accuracy on common intents.
- The model works well on a laptop.
- It still confuses card arrival with delivery estimates.
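Confusions like the last one can be surfaced by counting which (true, predicted) pairs go wrong most often. The sketch below uses a handful of made-up predictions, not the post's actual results, and the helper names are hypothetical.

```python
from collections import Counter

def accuracy(y_true, y_pred):
    """Fraction of predictions that exactly match the true intent."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def top_confusions(y_true, y_pred, k=3):
    """Most frequent (true, predicted) pairs among the wrong predictions."""
    wrong = Counter((t, p) for t, p in zip(y_true, y_pred) if t != p)
    return wrong.most_common(k)

# Made-up toy data for illustration only; these are not the post's results.
y_true = ["card_arrival", "card_arrival", "card_delivery_estimate",
          "exchange_rate", "lost_or_stolen_card"]
y_pred = ["card_arrival", "card_delivery_estimate", "card_arrival",
          "exchange_rate", "lost_or_stolen_card"]

print(f"accuracy: {accuracy(y_true, y_pred):.2f}")
for (true_intent, predicted), count in top_confusions(y_true, y_pred):
    print(f"{true_intent} -> {predicted}: {count}")
```

With generated text, a prediction counts as correct only if the output string matches the label exactly, so it helps to strip whitespace from the model output before comparing.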
In Part 2, I will use a model roughly five times larger. I will train less than 1% of its weights using LoRA. I will see if it can match this accuracy.
Source: https://dev.to/sumanpro/i-fine-tuned-a-270m-model-on-my-laptop-full-fine-tuning-from-scratch-3p4l