Fine-Tuning LLMs: LoRA, QLoRA and Full Fine-Tuning Guide

DodaTech Updated 2026-06-22 7 min read

In this tutorial, you'll learn about fine-tuning large language models. We cover key concepts, practical examples, and best practices to help you understand and apply this topic effectively.

Fine-tuning adapts a pretrained large language model to your specific data and task, improving performance on domain-specific questions while retaining the model's general language understanding capabilities.

What You'll Learn

In this tutorial, you'll learn fine-tuning techniques for large language models including full fine-tuning, LoRA (Low-Rank Adaptation), QLoRA (Quantized LoRA), dataset preparation, and instruction tuning using Hugging Face Transformers and Python.

Why It Matters

Base LLMs are trained on general internet data and often perform poorly on specialized domains. Fine-tuning adapts the model to your domain (medical, legal, code, customer support) without training from scratch. LoRA and QLoRA make fine-tuning accessible on consumer GPUs by dramatically reducing memory requirements. On narrow domain tasks, a fine-tuned model can match or outperform much larger general models.

Real-World Use

Durga Antivirus Pro fine-tunes a Llama 3 model on malware analysis reports and threat intelligence data. The fine-tuned model answers security-specific questions with higher accuracy than GPT-4 on zero-day threat analysis, while running on-premises for data privacy compliance.

Full Fine-Tuning

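Full fine-tuning updates all model parameters on your dataset. This requires significant GPU memory: a 7B parameter model needs approximately 56GB of GPU RAM at 16-bit precision once weights (14GB), gradients (14GB), and the two Adam optimizer states (28GB) are counted, and that is before activations. Mixed-precision setups that keep optimizer states in 32-bit need more. Full fine-tuning usually gives the highest quality results because every parameter can adapt to the domain. However, it produces a full copy of the model for each task, making storage and deployment expensive.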

from transformers import AutoModelForCausalLM, AutoTokenizer, TrainingArguments, Trainer
from datasets import Dataset
import torch

model_name = "microsoft/DialoGPT-small"
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token

train_texts = [
    "Question: What is gradient descent? Answer: An optimization algorithm.",
    "Question: Explain Python decorators. Answer: Functions that modify other functions.",
    "Question: What is a tensor? Answer: A multi-dimensional array.",
]

train_encodings = tokenizer(
    train_texts,
    truncation=True,
    padding=True,
    max_length=64,
    return_tensors="pt"
)

dataset = Dataset.from_dict({
    'input_ids': train_encodings['input_ids'],
    'attention_mask': train_encodings['attention_mask'],
    'labels': train_encodings['input_ids'].clone()
})

model = AutoModelForCausalLM.from_pretrained(model_name)

training_args = TrainingArguments(
    output_dir="./fine-tuned",
    per_device_train_batch_size=2,
    num_train_epochs=3,
    save_steps=100,
    logging_steps=10,
    report_to="none",
    remove_unused_columns=False,
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=dataset,
)

trainer.train()

input_text = "Question: What is gradient descent?"
# Trainer may have moved the model to a GPU, so place the prompt on the same device.
inputs = tokenizer(input_text, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=30, pad_token_id=tokenizer.eos_token_id)
result = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(f"Full fine-tuning total params: {sum(p.numel() for p in model.parameters()):,}")
print(f"Trainable params: {sum(p.numel() for p in model.parameters() if p.requires_grad):,}")
print(f"\nGenerated: {result}")

Expected output (the parameter counts are fixed by the architecture; the generated text varies with training randomness):

Full fine-tuning total params: 124,439,808
Trainable params: 124,439,808

Generated: Question: What is gradient descent? Answer: An optimization algorithm.
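
The 56GB figure quoted for a 7B model can be sanity-checked with simple arithmetic. The sketch below is a rough estimate only: it assumes an Adam-style optimizer that keeps two states per parameter and it ignores activations. The helper name `estimate_full_finetune_gb` and its parameters are hypothetical, not part of any library.

```python
def estimate_full_finetune_gb(num_params, weight_bytes=2, grad_bytes=2, optimizer_state_bytes=2):
    """Rough training-memory estimate in GB (1 GB = 1e9 bytes), ignoring activations.

    Assumes an Adam-style optimizer that keeps two states per parameter.
    """
    bytes_per_param = weight_bytes + grad_bytes + 2 * optimizer_state_bytes
    return num_params * bytes_per_param / 1e9


seven_b = 7_000_000_000
all_16bit = estimate_full_finetune_gb(seven_b)
fp32_optimizer = estimate_full_finetune_gb(seven_b, optimizer_state_bytes=4)
print(f"All tensors in 16-bit: {all_16bit:.0f} GB")
print(f"16-bit weights and gradients, 32-bit optimizer states: {fp32_optimizer:.0f} GB")
```

With everything held in 16-bit this gives 56 GB; keeping the optimizer states in 32-bit raises the estimate to 84 GB, which is why real requirements depend heavily on the training setup.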

LoRA — Low-Rank Adaptation

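LoRA freezes the pretrained model weights and injects trainable rank decomposition matrices into attention layers. Instead of updating a 4096x4096 weight matrix, LoRA trains two small matrices (4096xr and rx4096) where r is typically 8-16. For that single matrix at r=8, trainable parameters drop from about 16.8 million to 65,536, a 256x reduction. Across a whole model, where only some layers receive adapters, the reduction is far larger; the LoRA paper reports up to 10,000x for GPT-3 175B. LoRA adapters are small (typically a few MB to tens of MB) and can be swapped at inference time, enabling multi-task serving from one base model.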

from peft import LoraConfig, get_peft_model, TaskType

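# DialoGPT uses the GPT-2 architecture, whose fused attention projection is
# named "c_attn". It has no "q_proj"/"v_proj" modules (those names belong to
# Llama-style models), so targeting them would raise an error here.
lora_config = LoraConfig(
    r=8,
    lora_alpha=32,
    target_modules=["c_attn"],
    fan_in_fan_out=True,  # GPT-2 stores these weights in Conv1D (transposed) layout
    lora_dropout=0.05,
    bias="none",
    task_type=TaskType.CAUSAL_LM
)

model_lora = AutoModelForCausalLM.from_pretrained(model_name)
model_lora = get_peft_model(model_lora, lora_config)

model_lora.print_trainable_parameters()

trainer_lora = Trainer(
    model=model_lora,
    args=training_args,
    train_dataset=dataset,
)

trainer_lora.train()

# Place the prompt on the same device as the model before generating.
lora_input = tokenizer(input_text, return_tensors="pt").to(model_lora.device)
lora_output = model_lora.generate(**lora_input, max_new_tokens=30, pad_token_id=tokenizer.eos_token_id)
lora_result = tokenizer.decode(lora_output[0], skip_special_tokens=True)
# Count the adapter weights directly instead of quoting a formula.
adapter_params = sum(p.numel() for n, p in model_lora.named_parameters() if "lora_" in n)
print(f"\nLoRA adapter parameters: {adapter_params:,}")
print(f"Generated: {lora_result}")

Expected output (the generated text varies with training randomness):

trainable params: 294,912 || all params: 124,734,720 || trainable%: 0.2364

LoRA adapter parameters: 294,912
Generated: Question: What is gradient descent? Answer: An optimization algorithm.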
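
To make the parameter savings concrete, the sketch below counts trainable parameters for one weight matrix with and without LoRA. The helper `lora_param_count` is a hypothetical name used for illustration; it assumes the adapter consists of one d_in x r matrix and one r x d_out matrix, as described in the section text.

```python
def lora_param_count(d_in, d_out, r):
    """Parameters in one LoRA adapter: A is (d_in x r), B is (r x d_out)."""
    return r * (d_in + d_out)


d = 4096
full = d * d  # parameters in the frozen 4096x4096 matrix
for r in (4, 8, 16, 64):
    adapter = lora_param_count(d, d, r)
    print(f"r={r:>2}: {adapter:>9,} adapter params vs {full:,} full ({full // adapter}x fewer)")

# A GPT-2 small sized model: 12 layers, fused attention projection 768 -> 2304.
gpt2_adapter_total = 12 * lora_param_count(768, 2304, 8)
print(f"GPT-2 small, r=8 on the fused attention projection: {gpt2_adapter_total:,}")
```

Adapter size grows linearly with r, so doubling the rank doubles the adapter parameters while the frozen matrix stays the same size.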

QLoRA — Quantized LoRA

QLoRA combines 4-bit quantization of the base model with LoRA adapters. The base model is loaded in 4-bit NormalFloat format, reducing the memory for base model weights by about 4x relative to 16-bit. LoRA adapters remain in 16-bit precision. QLoRA enables fine-tuning a 65B parameter model on a single 48GB GPU. The Double Quantization technique further reduces memory by quantizing the quantization constants.

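from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model
import torch

# 4-bit loading requires a CUDA GPU and the bitsandbytes package.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,
    bnb_4bit_use_double_quant=True,
)

model_qlora = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=bnb_config,
    device_map="auto",
)

# GPT-2 style models expose the attention projection as "c_attn",
# so the adapter config targets that module name.
qlora_config = LoraConfig(
    r=8,
    lora_alpha=32,
    target_modules=["c_attn"],
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM",
)
model_qlora = get_peft_model(model_qlora, qlora_config)

print("4-bit base model + LoRA")
# get_memory_footprint() reports the bytes used by parameters and buffers.
print(f"Memory footprint: {model_qlora.get_memory_footprint() / 1024**3:.2f} GB")
print(f"Trainable params: {sum(p.numel() for p in model_qlora.parameters() if p.requires_grad):,}")
# Summing p.numel() over all parameters undercounts here because 4-bit
# weights are stored packed, so a total parameter count is not printed.

full_model_bits = 16
qlora_model_bits = 4
memory_reduction = full_model_bits / qlora_model_bits
print(f"Weight memory reduction vs 16-bit: {memory_reduction:.0f}x")

Expected output (the memory figure depends on library versions and hardware, so it is left as a placeholder):

4-bit base model + LoRA
Memory footprint: ... GB
Trainable params: 294,912
Weight memory reduction vs 16-bit: 4x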
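
The memory claims for QLoRA can be checked with back-of-the-envelope arithmetic. The sketch below assumes the settings described in the QLoRA paper: quantization blocks of 64 weights with one 32-bit constant each, and double quantization that stores those constants in 8-bit with one 32-bit constant per 256 blocks. The function names are hypothetical and the result covers base weights only, not adapters, gradients, or activations.

```python
def base_weight_gb(num_params, bits_per_param):
    """Memory for the base weights in GB (1 GB = 1e9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9


def constant_overhead_bits(block_size=64, double_quant=False, second_block=256):
    """Extra bits per weight spent on quantization constants."""
    if not double_quant:
        return 32 / block_size
    return 8 / block_size + 32 / (block_size * second_block)


params_65b = 65_000_000_000
plain = constant_overhead_bits()
dq = constant_overhead_bits(double_quant=True)
print(f"16-bit weights: {base_weight_gb(params_65b, 16):.1f} GB")
print(f"4-bit weights + constants: {base_weight_gb(params_65b, 4 + plain):.1f} GB")
print(f"4-bit + double quantization: {base_weight_gb(params_65b, 4 + dq):.1f} GB")
print(f"Saved per weight by double quantization: {plain - dq:.3f} bits")
```

Under these assumptions a 65B model needs about 130 GB of weights in 16-bit but roughly 33.5 GB in 4-bit with double quantization, which is how it can fit on a single 48GB GPU with room left for adapters and activations.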

Dataset Preparation

Quality fine-tuning requires high-quality datasets. The dataset should be diverse, representative of your use case, and properly formatted with instruction-input-output structure. Common formats include Alpaca-style (instruction, input, output) and Chat-style (conversation turns). Data quality matters more than quantity — 1,000 well-curated examples often outperform 100,000 noisy ones.

import json

fine_tuning_data = [
    {
        "instruction": "Explain what a vector database is.",
        "input": "",
        "output": "A vector database stores and searches high-dimensional vectors for similarity search."
    },
    {
        "instruction": "What is the difference between L1 and L2 regularization?",
        "input": "",
        "output": "L1 adds absolute coefficient penalty (sparsity), L2 adds squared coefficient penalty (shrinkage)."
    },
    {
        "instruction": "Write a Python function to compute cosine similarity.",
        "input": "",
        "output": "def cosine_similarity(a, b): return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))"
    }
]

def format_instruction(example):
    if example["input"]:
        return f"Instruction: {example['instruction']}\nInput: {example['input']}\nResponse: {example['output']}"
    return f"Instruction: {example['instruction']}\nResponse: {example['output']}"

formatted = [format_instruction(ex) for ex in fine_tuning_data]
print(f"Dataset size: {len(fine_tuning_data)} examples")
print(f"Example output length: {len(fine_tuning_data[0]['output'])} chars")
print(f"\nFormatted first example:\n{formatted[0]}")

Expected output:

Dataset size: 3 examples
Example output length: 85 chars

Formatted first example:
Instruction: Explain what a vector database is.
Response: A vector database stores and searches high-dimensional vectors for similarity search.
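
Since data quality matters more than quantity, a light cleaning pass before formatting is usually worthwhile. The sketch below is one possible approach, assuming Alpaca-style records like those in the example; the helper `clean_examples`, the length threshold, and the sample records are hypothetical choices for illustration.

```python
import json


def clean_examples(examples, min_output_chars=10):
    """Drop records with a missing instruction, a very short output, or a duplicate instruction."""
    seen = set()
    cleaned = []
    for ex in examples:
        instruction = ex.get("instruction", "").strip()
        output = ex.get("output", "").strip()
        key = instruction.lower()  # case-insensitive duplicate check
        if not instruction or len(output) < min_output_chars or key in seen:
            continue
        seen.add(key)
        cleaned.append({
            "instruction": instruction,
            "input": ex.get("input", "").strip(),
            "output": output,
        })
    return cleaned


raw = [
    {"instruction": "Explain what a tensor is.", "input": "", "output": "A tensor is a multi-dimensional array."},
    {"instruction": "explain what a tensor is. ", "input": "", "output": "A multi-dimensional array of numbers."},
    {"instruction": "Define overfitting.", "input": "", "output": "N/A"},
    {"instruction": "", "input": "", "output": "An answer without a question."},
]

cleaned = clean_examples(raw)
# One JSON object per line (JSONL) is a common on-disk format for fine-tuning data.
jsonl = "\n".join(json.dumps(ex) for ex in cleaned)
print(f"Kept {len(cleaned)} of {len(raw)} examples")
print(jsonl)
```

Only the first record survives: the second repeats its instruction, the third has a placeholder answer, and the fourth has no instruction. Real pipelines usually add checks specific to the domain on top of filters like these.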

Fine-Tuning Comparison

flowchart TD
  A[Pretrained LLM] --> B{Choose Method}
  B --> C[Full Fine-Tuning]
  B --> D[LoRA]
  B --> E[QLoRA]
  C --> F[Update all params]
  D --> G[Freeze base + train adapters]
  E --> H[4-bit base + train adapters]
  F --> I[High quality, high cost]
  G --> J[Good quality, low cost]
  H --> K[Good quality, very low cost]
  I --> L[56GB+ for 7B model]
  J --> M[16GB for 7B model]
  K --> N[8GB for 7B model]

Common Errors and Mistakes

Mistake	Why It Happens	How to Fix
Overfitting on small data	Model memorizes examples	Use a lower LoRA rank, more dropout, and fewer epochs; add more data if possible
Catastrophic forgetting	Model loses general knowledge	Mix domain data with general data (10% general)
Wrong tokenizer	Tokenizer mismatches model	Always use the tokenizer from the same model
Incorrect format	Model expects specific template	Check base model's chat template format
Not setting pad_token	Training crashes on batched data	Set tokenizer.pad_token = tokenizer.eos_token

Practice Questions

How does LoRA reduce memory requirements compared to full fine-tuning?

Answer: LoRA freezes the base model and trains only small rank-decomposition matrices. For a 7B model, full fine-tuning updates 7B parameters; LoRA updates only 10-50M parameters, reducing gradient memory and optimizer states by 100-1000x.

What is the trade-off between LoRA rank and model quality?

Answer: Higher rank (r=32, 64) captures more adaptation capacity but uses more memory. Lower rank (r=4, 8) is more memory-efficient but may not capture complex domain patterns. Start with r=8 and increase if underfitting.
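
The memory side of this trade-off is easy to quantify. The sketch below counts adapter parameters for a hypothetical Llama-style 7B layout (32 layers, hidden size 4096) with LoRA applied to either two or four square attention projections per layer. These dimensions and the helper name `adapter_params` are assumptions for illustration, so check your model's config before relying on them.

```python
def adapter_params(rank, num_layers=32, hidden=4096, matrices_per_layer=2):
    """Adapter parameters when LoRA is applied to square hidden x hidden projections."""
    per_matrix = rank * (hidden + hidden)  # A is hidden x r, B is r x hidden
    return num_layers * matrices_per_layer * per_matrix


for rank in (4, 8, 16, 32, 64):
    two = adapter_params(rank)                         # e.g. query and value projections
    four = adapter_params(rank, matrices_per_layer=4)  # e.g. query, key, value, output
    print(f"r={rank:>2}: {two:>11,} (2 matrices/layer)  {four:>11,} (4 matrices/layer)")
```

The count scales linearly with both the rank and the number of adapted matrices, so raising r from 8 to 32 quadruples adapter memory and optimizer state.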

How does QLoRA achieve further memory reduction over LoRA?

Answer: QLoRA loads the base model in 4-bit NormalFloat quantization instead of 16-bit, reducing base model memory by 4x. The LoRA adapters remain in 16-bit for training precision.

Why is dataset quality more important than quantity for fine-tuning?

Answer: A small high-quality dataset with correct, diverse, and representative examples teaches the model effectively. Noisy or repetitive data confuses the model regardless of quantity. 1,000 curated examples beat 100,000 scraped ones.

What is catastrophic forgetting and how do you prevent it?

Answer: Catastrophic forgetting occurs when fine-tuning on domain data overwrites the model's general knowledge. Prevent by mixing in 5-10% general data during training or using elastic weight consolidation (EWC).
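
Mixing in general data can be done with a few lines of standard-library Python. The sketch below is a minimal illustration, assuming both datasets are plain lists of examples; `mix_datasets` and the placeholder records are hypothetical.

```python
import random


def mix_datasets(domain, general, general_fraction=0.10, seed=0):
    """Return a shuffled list in which about general_fraction of the items are general examples."""
    if not 0 <= general_fraction < 1:
        raise ValueError("general_fraction must be in [0, 1)")
    rng = random.Random(seed)
    # Solve n_general / (len(domain) + n_general) = general_fraction for n_general.
    n_general = round(len(domain) * general_fraction / (1 - general_fraction))
    n_general = min(n_general, len(general))
    mixed = list(domain) + rng.sample(general, n_general)
    rng.shuffle(mixed)
    return mixed


domain_data = [f"domain-{i}" for i in range(900)]
general_data = [f"general-{i}" for i in range(5000)]
mixed = mix_datasets(domain_data, general_data)
share = sum(x.startswith("general") for x in mixed) / len(mixed)
print(f"{len(mixed)} examples, {share:.0%} general")
```

With 900 domain examples and a 10% target, the function samples 100 general examples for a total of 1,000. The fixed seed keeps the mix reproducible between runs.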

Challenge

Fine-tune a small LLM (like Phi-2 or TinyLlama) on a custom dataset of 500 instruction-response pairs for a specific domain (e.g., cybersecurity, medical Q&A, or code generation). Compare full fine-tuning, LoRA (r=8, r=16, r=32), and QLoRA. Report training time, peak memory, and evaluation loss. Measure quality by having the fine-tuned models answer 20 domain-specific questions.

Real-World Task

Design a fine-tuning pipeline for a customer support LLM. Collect 10,000 support tickets and responses from your company's help desk. Clean and format the data as instruction-response pairs. Fine-tune a Llama 3 8B model using QLoRA on a single GPU. Evaluate the fine-tuned model against the base model using human evaluation on 100 held-out tickets. Deploy as a local API with the LoRA adapter swappable per use case.

Next Steps

Deploy fine-tuned models with Hugging Face TGI or vLLM. Use Docker for reproducible fine-tuning environments. Track experiments with MLflow and version datasets with Git.

What is the difference between fine-tuning and RAG?

Fine-tuning changes model weights to improve performance on a domain. RAG retrieves relevant documents and injects them into the prompt at inference time. Fine-tuning adapts the model permanently; RAG adapts per query with external data. They are complementary — fine-tune for domain knowledge, use RAG for specific documents.

Built by the developers of Doda Browser, DodaZIP, and Durga Antivirus Pro.
