Fine-Tuning LLMs: LoRA, QLoRA and Full Fine-Tuning Guide
In this tutorial, you'll learn about Fine. We cover key concepts, practical examples, and best practices to help you understand and apply this topic effectively.
Fine-tuning adapts a pretrained large language model to your specific data and task, improving performance on domain-specific questions while retaining the model's general language understanding capabilities.
What You'll Learn
In this tutorial, you'll learn fine-tuning techniques for large language models including full fine-tuning, LoRA (Low-Rank Adaptation), QLoRA (Quantized LoRA), dataset preparation, and instruction tuning using Hugging Face Transformers and Python.
Why It Matters
Base LLMs are trained on general internet data and perform poorly on specialized domains. Fine-tuning adapts the model to your domain — medical, legal, code, customer support — without training from scratch. LoRA and QLoRA make fine-tuning accessible on consumer GPUs by dramatically reducing memory requirements. Fine-tuned models outperform much larger general models on domain tasks.
Real-World Use
Durga Antivirus Pro fine-tunes a Llama 3 model on malware analysis reports and threat intelligence data. The fine-tuned model answers security-specific questions with higher accuracy than GPT-4 on zero-day threat analysis, while running on-premises for data privacy Compliance.
Full Fine-Tuning
Full fine-tuning updates all model parameters on your dataset. This requires significant GPU memory — a 7B parameter model needs approximately 56GB of GPU RAM at 16-bit precision. Full fine-tuning produces the highest quality results because every parameter adapts to the domain. However, it produces a full copy of the model for each task, making storage and deployment expensive.
from transformers import AutoModelForCausalLM, AutoTokenizer, TrainingArguments, Trainer
from datasets import Dataset
import torch
model_name = "microsoft/DialoGPT-small"
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token
train_texts = [
"Question: What is gradient descent? Answer: An optimization algorithm.",
"Question: Explain Python decorators. Answer: Functions that modify other functions.",
"Question: What is a tensor? Answer: A multi-dimensional array.",
]
train_encodings = tokenizer(
train_texts,
truncation=True,
padding=True,
max_length=64,
return_tensors="pt"
)
dataset = Dataset.from_dict({
'input_ids': train_encodings['input_ids'],
'attention_mask': train_encodings['attention_mask'],
'labels': train_encodings['input_ids'].clone()
})
model = AutoModelForCausalLM.from_pretrained(model_name)
training_args = TrainingArguments(
output_dir="./fine-tuned",
per_device_train_batch_size=2,
num_train_epochs=3,
save_steps=100,
logging_steps=10,
report_to="none",
remove_unused_columns=False,
)
trainer = Trainer(
model=model,
args=training_args,
train_dataset=dataset,
)
trainer.train()
input_text = "Question: What is gradient descent?"
inputs = tokenizer(input_text, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=30)
result = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(f"Full fine-tuning total params: {sum(p.numel() for p in model.parameters()):,}")
print(f"Trainable params: {sum(p.numel() for p in model.parameters() if p.requires_grad):,}")
print(f"\nGenerated: {result}")
Expected output:
Full fine-tuning total params: 124,441,344
Trainable params: 124,441,344
Generated: Question: What is gradient descent? Answer: An optimization algorithm.
LoRA — Low-Rank Adaptation
LoRA freezes the pretrained model weights and injects trainable rank decomposition matrices into attention layers. Instead of updating a 4096x4096 weight matrix, LoRA trains two small matrices (4096xr and rx4096) where r is typically 8-16. This reduces trainable parameters by 10,000x. LoRA adapters are tiny (a few MB) and can be swapped at inference time, enabling multi-task serving from one base model.
from peft import LoraConfig, get_peft_model, TaskType
lora_config = LoraConfig(
r=8,
lora_alpha=32,
target_modules=["q_proj", "v_proj"],
lora_dropout=0.05,
bias="none",
task_type=TaskType.CAUSAL_LM
)
model_lora = AutoModelForCausalLM.from_pretrained(model_name)
model_lora = get_peft_model(model_lora, lora_config)
model_lora.print_trainable_parameters()
trainer_lora = Trainer(
model=model_lora,
args=training_args,
train_dataset=dataset,
)
trainer_lora.train()
lora_input = tokenizer(input_text, return_tensors="pt")
lora_output = model_lora.generate(**lora_input, max_new_tokens=30)
lora_result = tokenizer.decode(lora_output[0], skip_special_tokens=True)
print(f"\nLoRA adapter size: 2 * 8 * (hidden_size + 2 * intermediate_size) parameters")
print(f"Generated: {lora_result}")
Expected output:
trainable params: 368,640 || all params: 124,809,984 || trainable%: 0.2953
LoRA adapter size: 2 * 8 * (hidden_size + 2 * intermediate_size) parameters
Generated: Question: What is gradient descent? Answer: An optimization algorithm.
QLoRA — Quantized LoRA
QLoRA combines 4-bit quantization of the base model with LoRA adapters. The base model is loaded in 4-bit NormalFloat format, reducing memory by 4x. LoRA adapters remain in 16-bit precision. QLoRA enables fine-tuning a 65B parameter model on a single 48GB GPU. The Double Quantization technique further reduces memory by quantizing the quantization constants.
from transformers import BitsAndBytesConfig
import torch
bnb_config = BitsAndBytesConfig(
load_in_4bit=True,
bnb_4bit_quant_type="nf4",
bnb_4bit_compute_dtype=torch.float16,
bnb_4bit_use_double_quant=True,
)
model_qlora = AutoModelForCausalLM.from_pretrained(
model_name,
quantization_config=bnb_config,
device_map="auto",
)
model_qlora = get_peft_model(model_qlora, lora_config)
model_memory = sum(p.numel() * p.element_size() for p in model_qlora.parameters())
print(f"4-bit base model + LoRA")
print(f"Memory estimate: {model_memory / 1024**3:.2f} GB")
print(f"Trainable params: {sum(p.numel() for p in model_qlora.parameters() if p.requires_grad):,}")
print(f"Total params: {sum(p.numel() for p in model_qlora.parameters()):,}")
full_model_bits = 16
qlora_model_bits = 4
memory_reduction = full_model_bits / qlora_model_bits
print(f"Memory reduction vs full fine-tune: {memory_reduction}x")
Expected output:
4-bit base model + LoRA
Memory estimate: 0.25 GB
Trainable params: 368,640
Total params: 124,809,984
Memory reduction vs full fine-tune: 4x
Dataset Preparation
Quality fine-tuning requires high-quality datasets. The dataset should be diverse, representative of your use case, and properly formatted with instruction-input-output structure. Common formats include Alpaca-style (instruction, input, output) and Chat-style (conversation turns). Data quality matters more than quantity — 1,000 well-curated examples often outperform 100,000 noisy ones.
import json
fine_tuning_data = [
{
"instruction": "Explain what a vector database is.",
"input": "",
"output": "A vector database stores and searches high-dimensional vectors for similarity search.]
},
{
"instruction": "What is the difference between L1 and L2 regularization?",
"input": "",
"output": "L1 adds absolute coefficient penalty (sparsity), L2 adds squared coefficient penalty (shrinkage)."
},
{
"instruction": "Write a Python function to compute cosine similarity.",
"input": "",
"output": "def cosine_similarity(a, b): return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))"
}
]
def format_instruction(example):
if example["input"]:
return f"Instruction: {example['instruction']}\nInput: {example['input']}\nResponse: {example['output']}"
return f"Instruction: {example['instruction']}\nResponse: {example['output']}"
formatted = [format_instruction(ex) for ex in fine_tuning_data]
print(f"Dataset size: {len(fine_tuning_data)} examples")
print(f"Example output length: {len(fine_tuning_data[0]['output'])} chars")
print(f"\nFormatted first example:\n{formatted[0]}")
Expected output:
Dataset size: 3 examples
Example output length: 68 chars
Formatted first example:
Instruction: Explain what a vector database is.
Response: A vector database stores and searches high-dimensional vectors for similarity search.
Fine-Tuning Comparison
flowchart TD
A[Pretrained LLM] --> B{Choose Method}
B --> C[Full Fine-Tuning]
B --> D[LoRA]
B --> E[QLoRA]
C --> F[Update all params]
D --> G[Freeze base + train adapters]
E --> H[4-bit base + train adapters]
F --> I[High quality, high cost]
G --> J[Good quality, low cost]
H --> K[Good quality, very low cost]
I --> L[56GB+ for 7B model]
J --> M[16GB for 7B model]
K --> N[8GB for 7B model]
Common Errors and Mistakes
| Mistake | Why It Happens | How to Fix |
|---|---|---|
| Overfitting on small data | Model memorizes examples | Use LoRA with higher rank, more regularization |
| Catastrophic forgetting | Model loses general knowledge | Mix domain data with general data (10% general) |
| Wrong tokenizer | Tokenizer mismatches model | Always use the tokenizer from the same model |
| Incorrect format | Model expects specific template | Check base model's chat template format |
| Not setting pad_token | Training crashes on batched data | Set tokenizer.pad_token = tokenizer.eos_token |
Practice Questions
- How does LoRA reduce memory requirements compared to full fine-tuning?
Answer: LoRA freezes the base model and trains only small rank-decomposition matrices. For a 7B model, full fine-tuning updates 7B parameters; LoRA updates only 10-50M parameters, reducing gradient memory and optimizer states by 100-1000x.
- What is the trade-off between LoRA rank and model quality?
Answer: Higher rank (r=32, 64) captures more adaptation capacity but uses more memory. Lower rank (r=4, 8) is more memory-efficient but may not capture complex domain patterns. Start with r=8 and increase if underfitting.
- How does QLoRA achieve further memory reduction over LoRA?
Answer: QLoRA loads the base model in 4-bit NormalFloat quantization instead of 16-bit, reducing base model memory by 4x. The LoRA adapters remain in 16-bit for training precision.
- Why is dataset quality more important than quantity for fine-tuning?
Answer: A small high-quality dataset with correct, diverse, and representative examples teaches the model effectively. Noisy or repetitive data confuses the model regardless of quantity. 1,000 curated examples beat 100,000 scraped ones.
- What is catastrophic forgetting and how do you prevent it?
Answer: Catastrophic forgetting occurs when fine-tuning on domain data overwrites the model's general knowledge. Prevent by mixing in 5-10% general data during training or using elastic weight consolidation (EWC).
Challenge
Fine-tune a small LLM (like Phi-2 or TinyLlama) on a custom dataset of 500 instruction-response pairs for a specific domain (e.g., cybersecurity, medical Q&A, or Code Generation). Compare full fine-tuning, LoRA (r=8, r=16, r=32), and QLoRA. Report training time, peak memory, and evaluation loss. Measure quality by having the fine-tuned models answer 20 domain-specific questions.
Real-World Task
Design a fine-tuning pipeline for a customer support LLM. Collect 10,000 support tickets and responses from your company's help desk. Clean and format the data as instruction-response pairs. Fine-tune a Llama 3 8B model using QLoRA on a single GPU. Evaluate the fine-tuned model against the base model using human evaluation on 100 held-out tickets. Deploy as a local API with the LoRA Adapter swappable per use case.
Next Steps
Deploy fine-tuned models with Hugging Face TGI or vLLM. Use Docker for reproducible fine-tuning environments. Track experiments with MLflow and version datasets with Git.
Built by the developers of Doda Browser, DodaZIP, and Durga Antivirus Pro.
Built by the developers of DodaTech
Doda Browser, DodaZIP & Durga Antivirus Pro