Fine-Tuning LLMs with LoRA and QLoRA — Parameter-Efficient Training Guide

DodaTech Updated 2026-06-22 5 min read

Parameter-efficient fine-tuning (PEFT) with LoRA and QLoRA lets you adapt large language models on consumer hardware by training only a small fraction of parameters — this guide walks through both techniques end to end.

What You'll Learn

You'll learn the theory behind low-rank adaptation, how to implement LoRA fine-tuning with Hugging Face PEFT, and how QLoRA uses 4-bit quantization to make fine-tuning fit on a single GPU.

Why It Matters

Full fine-tuning of a 7B parameter model requires over 60 GB of VRAM. LoRA reduces this to under 16 GB, and QLoRA further drops it to under 10 GB, making fine-tuning accessible to developers without enterprise GPU clusters.
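
The figures above can be sanity-checked with a back-of-envelope calculation. The sketch below is only an estimate under stated assumptions: mixed-precision Adam at about 16 bytes per parameter for full fine-tuning, 2 bytes per frozen parameter for LoRA, and half a byte for QLoRA. Activations, adapter state, and framework overhead are ignored, so real usage is higher. The helper name `gigabytes` is hypothetical.

```python
# Rough weight and optimizer memory for a 7B-parameter model.
# Bytes-per-parameter values are assumptions; activations and overhead are ignored.
PARAMS = 7e9
GB = 1e9


def gigabytes(bytes_per_param: float, n_params: float = PARAMS) -> float:
    """Convert a per-parameter byte cost into total gigabytes."""
    return n_params * bytes_per_param / GB


# Full fine-tuning with mixed-precision Adam:
# fp16 weights (2) + fp16 gradients (2) + fp32 master weights (4) + two fp32 Adam moments (8)
full_ft = gigabytes(2 + 2 + 4 + 8)

# LoRA: frozen fp16 base weights; the adapter and its optimizer state are comparatively tiny
lora = gigabytes(2)

# QLoRA: frozen 4-bit base weights, i.e. half a byte per parameter
qlora = gigabytes(0.5)

print(f"Full fine-tune (weights + grads + Adam): ~{full_ft:.0f} GB")
print(f"LoRA (frozen fp16 base weights):         ~{lora:.0f} GB")
print(f"QLoRA (frozen 4-bit base weights):       ~{qlora:.1f} GB")
```

The gap between these base figures and the headline VRAM numbers is the room left for activations, adapter gradients, and CUDA overhead.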

Real-World Use

Doda Browser's code completion feature uses a LoRA-fine-tuned CodeLlama model that suggests idiomatic Python completions based on the project's existing code patterns — trained in under 4 hours on a single RTX 4090.

Fine-Tuning Landscape

flowchart TD
    A[Base LLM] --> B{Method}
    B --> C[Full Fine-Tune]
    B --> D[LoRA]
    B --> E[QLoRA]
    C --> F[All params updated]
    C --> G[High VRAM]
    D --> H[Rank decomposition]
    D --> I[Medium VRAM]
    E --> J[4-bit NF4 + LoRA]
    E --> K[Lowest VRAM]

What Is LoRA?

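Low-Rank Adaptation (LoRA) freezes the pretrained weights and injects trainable low-rank matrices into selected layers (typically the attention projections), cutting the number of trainable parameters by orders of magnitude. The original LoRA paper reports up to a 10,000x reduction for GPT-3 175B; for smaller models the ratio is lower.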

from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

model_name = "microsoft/phi-2"
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.float16,
    device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token

# Configure LoRA
lora_config = LoraConfig(
    r=8,
    lora_alpha=32,
    target_modules=["q_proj", "v_proj"],
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM"
)

model = get_peft_model(model, lora_config)

# Count parameters
total = sum(p.numel() for p in model.parameters())
trainable = sum(
    p.numel() for p in model.parameters() if p.requires_grad
)
print(f"Total params: {total:,}")
print(f"Trainable params: {trainable:,}")
print(f"Trainable: {100 * trainable / total:.4f}%")

Expected output:

Total params: 2,782,305,280
Trainable params: 2,621,440
Trainable: 0.0942%
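
Under the hood, each adapted layer adds a scaled low-rank product to the frozen weight: h = W x + (alpha / r) * B (A x). The following is a minimal pure-Python sketch of that idea, not PEFT's actual implementation. The toy dimensions, the helper names `matvec`, `lora_forward`, and `lora_param_count`, and the 4096 x 4096 example matrix are all assumptions chosen for illustration.

```python
import random


def matvec(M, v):
    """Matrix-vector product for a matrix stored as a list of rows."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]


def lora_forward(x, W, A, B, alpha, r):
    """h = W x + (alpha / r) * B (A x); W is frozen, only A and B would be trained."""
    base = matvec(W, x)
    update = matvec(B, matvec(A, x))
    scale = alpha / r
    return [b + scale * u for b, u in zip(base, update)]


def lora_param_count(d_in, d_out, r):
    """Parameters in A (r x d_in) plus B (d_out x r)."""
    return r * (d_in + d_out)


random.seed(0)
d_in, d_out, r, alpha = 6, 4, 2, 8  # toy sizes, for illustration only

W = [[random.gauss(0, 1) for _ in range(d_in)] for _ in range(d_out)]  # frozen weight
A = [[random.gauss(0, 0.01) for _ in range(d_in)] for _ in range(r)]   # small random init
B = [[0.0] * r for _ in range(d_out)]                                  # zero init

x = [1.0] * d_in

# With B initialised to zero, the adapter contributes nothing at the start of training
starts_identical = lora_forward(x, W, A, B, alpha, r) == matvec(W, x)
print("Adapted layer equals base layer at init:", starts_identical)

# Parameter comparison for one hypothetical 4096 x 4096 projection at r=8
full = 4096 * 4096
low_rank = lora_param_count(4096, 4096, 8)
print(f"Full matrix: {full:,}  LoRA: {low_rank:,}  ratio: {full // low_rank}x")
```

Because B starts at zero, fine-tuning begins from exactly the pretrained behaviour, and the `alpha / r` factor controls how strongly the learned update is applied.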

Preparing Training Data

Format your dataset as instruction-response pairs for supervised fine-tuning.

from datasets import Dataset

train_data = [
    {"instruction": "Write a Python function to check if a number is prime.",
     "response": "def is_prime(n):\n    if n < 2: return False\n    for i in range(2, int(n**0.5) + 1):\n        if n % i == 0: return False\n    return True"},
    {"instruction": "Explain what a decorator does in Python.",
     "response": "A decorator is a function that takes another function and extends its behavior without explicitly modifying it."}
]

def format_example(example):
    return {
        "text": f"### Instruction:\n{example['instruction']}\n\n### Response:\n{example['response']}{tokenizer.eos_token}"
    }

dataset = Dataset.from_list(train_data)
dataset = dataset.map(format_example)
print(f"Dataset size: {len(dataset)}")
print(f"Example:\n{dataset[0]['text'][:200]}...")

Expected output:

Dataset size: 2
Example:
### Instruction:
Write a Python function to check if a number is prime.

### Response:
def is_prime(n):
    if n < 2: return False
    for i in range(2, int(n**0.5) + 1):
        if n % i == 0: return...

Training with LoRA

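Tokenize the formatted text, then use the Hugging Face Trainer for fine-tuning.

from transformers import TrainingArguments, Trainer, DataCollatorForSeq2Seq

# Trainer needs token ids rather than raw text; for causal LM the labels are the input ids
def tokenize(example):
    tokens = tokenizer(example["text"], truncation=True, max_length=512)
    tokens["labels"] = tokens["input_ids"].copy()
    return tokens

# Drop the raw string columns so only model inputs remain
dataset = dataset.map(tokenize, remove_columns=dataset.column_names)

training_args = TrainingArguments(
    output_dir="./phi2-lora",
    per_device_train_batch_size=4,
    gradient_accumulation_steps=4,  # effective batch size of 16 per device
    num_train_epochs=3,
    learning_rate=2e-4,
    fp16=True,
    logging_steps=10,
    save_strategy="epoch",
    report_to="none",
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=dataset,
    # Pads inputs per batch and pads labels with -100 so padding is ignored by the loss
    data_collator=DataCollatorForSeq2Seq(
        tokenizer, pad_to_multiple_of=8
    ),
)

trainer.train()
# Write the adapter to the top level of output_dir (epoch checkpoints go to subfolders)
trainer.save_model(training_args.output_dir)
print("Training complete!")
print(f"Model saved to {training_args.output_dir}")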

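Expected output (illustrative, from a run on a larger dataset; the two-example toy dataset above finishes before the first step-10 log line):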

Step  Training Loss
10    1.4523
20    0.8912
30    0.4456
Training complete!
Model saved to ./phi2-lora
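
How many loss lines you see depends on the number of optimizer steps, which follows from the dataset size, batch size, gradient accumulation, and epoch count. The sketch below is an approximation: the helper `estimate_optimizer_steps` is hypothetical, and the Trainer's exact rounding of partial accumulation windows differs slightly between versions. The 1,000-example dataset is an assumed size for comparison.

```python
import math


def estimate_optimizer_steps(num_examples, per_device_batch_size,
                             grad_accum_steps, epochs, num_devices=1):
    """Approximate count of optimizer updates over a whole training run."""
    batches_per_epoch = math.ceil(
        num_examples / (per_device_batch_size * num_devices)
    )
    updates_per_epoch = max(1, math.ceil(batches_per_epoch / grad_accum_steps))
    return updates_per_epoch * epochs


effective_batch = 4 * 4  # per_device_train_batch_size * gradient_accumulation_steps
toy_steps = estimate_optimizer_steps(2, 4, 4, 3)        # the two-example dataset
larger_steps = estimate_optimizer_steps(1000, 4, 4, 3)  # assumed 1,000 examples

print(f"Effective batch size: {effective_batch}")
print(f"Toy dataset: about {toy_steps} optimizer steps")
print(f"1,000 examples: about {larger_steps} optimizer steps")
```

With `logging_steps=10`, a run needs at least 10 optimizer steps before the first loss value is logged, so lower `logging_steps` to 1 when experimenting on a tiny dataset.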

QLoRA: Quantized LoRA

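QLoRA stores the frozen base model in 4-bit NormalFloat (NF4) precision and trains LoRA adapters on top of it, cutting base-weight memory roughly 4x relative to 16-bit weights while largely preserving LoRA training quality.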

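from transformers import BitsAndBytesConfig
from peft import prepare_model_for_kbit_training

# 4-bit quantization config
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",             # NormalFloat4 data type
    bnb_4bit_compute_dtype=torch.float16,  # matmuls run in fp16 after dequantization
    bnb_4bit_use_double_quant=True,        # also quantize the quantization constants
)

# Load model in 4-bit
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=bnb_config,
    device_map="auto",
)
# PEFT's recommended preparation step for quantized models before adding adapters
model = prepare_model_for_kbit_training(model)
model = get_peft_model(model, lora_config)

# Peak GPU memory for this process so far; run in a fresh process
# to isolate the QLoRA footprint from any earlier fp16 model load
memory = torch.cuda.max_memory_allocated() / 1024**3
trainable = sum(
    p.numel() for p in model.parameters() if p.requires_grad
)
print(f"Max GPU memory: {memory:.2f} GB")
print(f"Trainable params: {trainable:,}")

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=dataset,
    data_collator=DataCollatorForSeq2Seq(
        tokenizer, pad_to_multiple_of=8
    ),
)
trainer.train()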

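Expected output (illustrative; the memory figure varies with hardware and library versions, and the loss lines come from a run on a larger dataset):

Max GPU memory: 5.84 GB
Trainable params: 2,621,440
Step  Training Loss
10    1.4210
20    0.8765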
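
The `bnb_4bit_use_double_quant=True` flag is worth a quick calculation. Blockwise quantization stores one scaling constant per block of weights, and double quantization compresses those constants as well. The sketch below reproduces the per-parameter overhead arithmetic described in the QLoRA paper. The block sizes (64 weights per block, 256 constants per second-level block) are taken from that description and should be treated as assumptions about whatever library version you run.

```python
# Overhead of quantization constants, in bits per model parameter.
BLOCK = 64           # weights sharing one scaling constant (assumed)
SECOND_BLOCK = 256   # constants sharing one second-level constant (assumed)

# Without double quantization: one fp32 constant per block of 64 weights
overhead_plain = 32 / BLOCK

# With double quantization: 8-bit constants plus fp32 second-level constants
overhead_double = 8 / BLOCK + 32 / (BLOCK * SECOND_BLOCK)

bits_plain = 4 + overhead_plain
bits_double = 4 + overhead_double

# Savings for a 7B-parameter model, in megabytes
saved_mb = (overhead_plain - overhead_double) * 7e9 / 8 / 1e6

print(f"Bits per parameter without double quant: {bits_plain:.3f}")
print(f"Bits per parameter with double quant:    {bits_double:.3f}")
print(f"Approximate savings on a 7B model:       {saved_mb:.0f} MB")
```

The savings are modest next to the 4-bit weights themselves, but they cost nothing in adapter quality and can matter when a model barely fits in VRAM.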

Inference After Fine-Tuning

Load and test your fine-tuned model.

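from peft import PeftModel

# Reload the base model in fp16 and attach the saved LoRA adapter
base_model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.float16,
    device_map="auto"
)
# The path must contain adapter_config.json (the trainer.save_model output or a checkpoint-* subfolder)
fine_tuned = PeftModel.from_pretrained(base_model, "./phi2-lora")
# Fold the adapter into the base weights so inference needs no PEFT wrapper
fine_tuned = fine_tuned.merge_and_unload()
fine_tuned.eval()

# Use the same template as during training
prompt = "### Instruction:\nSort a list of numbers in Python\n\n### Response:\n"
inputs = tokenizer(prompt, return_tensors="pt").to(fine_tuned.device)
output = fine_tuned.generate(
    **inputs,
    max_new_tokens=100,
    do_sample=True,  # temperature only takes effect when sampling is enabled
    temperature=0.7,
    pad_token_id=tokenizer.eos_token_id,
)
# generate() returns the prompt followed by the completion; keep only the completion
response = tokenizer.decode(
    output[0], skip_special_tokens=True
).split("### Response:\n")[-1]
print(response)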

Expected output (illustrative; the exact text will vary with the fine-tuned weights and generation settings):

def sort_numbers(numbers):
    return sorted(numbers)

# Example usage
nums = [3, 1, 4, 1, 5, 9, 2, 6]
sorted_nums = sort_numbers(nums)
print(sorted_nums)  # [1, 1, 2, 3, 4, 5, 6, 9]
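
The prompt at inference time has to match the training template character for character, or the model sees a format it was never tuned on. One way to avoid drift is to keep the template in a single place. The sketch below is a small pure-Python helper pair. The names `build_prompt` and `extract_response` are hypothetical, and the decoded string is a hand-written stand-in for real model output.

```python
PROMPT_TEMPLATE = "### Instruction:\n{instruction}\n\n### Response:\n"
RESPONSE_MARKER = "### Response:\n"


def build_prompt(instruction: str) -> str:
    """Render an instruction with the same template used for training."""
    return PROMPT_TEMPLATE.format(instruction=instruction.strip())


def extract_response(decoded: str) -> str:
    """Return the text after the last response marker (or everything if absent)."""
    return decoded.split(RESPONSE_MARKER)[-1].strip()


prompt = build_prompt("Sort a list of numbers in Python")

# Stand-in for tokenizer.decode(...): the prompt followed by a completion
decoded = prompt + "def sort_numbers(numbers):\n    return sorted(numbers)"

print(extract_response(decoded))
```

The same `build_prompt` function can be reused in the dataset formatting step, with the response and EOS token appended, so training and inference cannot diverge.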

Common Errors

Error	Cause	Fix
CUDA out of memory during training	Batch size or LoRA rank too high	Reduce per_device_train_batch_size to 1 or lower r to 4
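Loss diverges after first step	Learning rate too high	Stay within roughly 1e-4 to 3e-4; LoRA tolerates a higher LR than full fine-tuning, but not an arbitrarily high one
Model repeats input during inference	Padded positions were not masked out of the labels, or the prompt was not stripped from the decoded output	Keep `label_pad_token_id=-100` (the `DataCollatorForSeq2Seq` default) so padding is ignored by the loss; `generate` returns the prompt followed by the completion, so remove the prompt when decoding
QLoRA training slower than expected	4-bit weights are dequantized on every forward and backward pass	Some slowdown is expected; set `bnb_4bit_compute_dtype` to float16 or bfloat16 rather than leaving the float32 default, and raise the per-device batch size if memory allows
Merged model performs worse than LoRA adapter	Merge introduced precision loss	Merge into a `float16` (non-quantized) copy of the base model, and test the adapter without merging first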

Practice Questions

Why does LoRA use a low-rank decomposition instead of training all parameters? The weight updates needed to adapt a pretrained model tend to have low intrinsic rank, so training a low-rank factorization of the update captures task-specific changes at a fraction of the parameter and optimizer-state cost.
What advantage does QLoRA offer over standard LoRA? QLoRA quantizes the frozen base model to 4-bit NF4, cutting base-weight memory roughly 4x relative to 16-bit weights while keeping the same trainable LoRA parameter count and close to the same accuracy.
How do you choose the rank (r) hyperparameter in LoRA? Higher r adds adaptation capacity but uses more memory; r=8 to r=16 works for most tasks, with r=64 reserved for complex domain shifts.
Why is the tokenizer's pad token set to the EOS token for causal LM fine-tuning? Many causal LM tokenizers ship without a pad token, so batching fails until one is assigned; reusing EOS avoids adding a new embedding, and padded positions are masked out of the loss so the model does not learn from them.
Challenge: Fine-tune a 7B parameter model (Mistral or Llama 2) on a domain-specific dataset of your choice using QLoRA, evaluate its perplexity against the base model, and deploy the adapter as a FastAPI inference endpoint.
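
For the rank question, it helps to see how the trainable parameter count scales with r. The sketch below assumes a 7B-class shape (hidden size 4096, 32 layers, two square target projections per layer, roughly matching Llama 2 7B with `q_proj` and `v_proj`). The helper name `lora_trainable_params` and the 16-bytes-per-parameter training-state estimate (fp32 weight, gradient, and two Adam moments) are assumptions, not measurements.

```python
HIDDEN = 4096          # assumed hidden size
LAYERS = 32            # assumed number of transformer layers
TARGETS_PER_LAYER = 2  # e.g. q_proj and v_proj


def lora_trainable_params(r, hidden=HIDDEN, layers=LAYERS,
                          targets=TARGETS_PER_LAYER):
    """Total LoRA parameters: A (r x hidden) plus B (hidden x r) per target module."""
    per_module = r * (hidden + hidden)
    return per_module * targets * layers


for r in (4, 8, 16, 64):
    n = lora_trainable_params(r)
    # fp32 weight + gradient + two Adam moments = 16 bytes per trainable parameter
    state_mb = n * 16 / 1e6
    print(f"r={r:<3} trainable={n:>12,}  adapter training state ~{state_mb:,.0f} MB")
```

The count grows linearly with r, so even r=64 keeps the adapter's training state far below the size of the base model; the practical limit on rank is usually overfitting on small datasets rather than memory.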

Mini Project

Build a code review assistant. Collect a dataset of Python code snippets with good and bad style examples, format them as instruction-response pairs, fine-tune a small LLM (Phi-2 or CodeLlama 7B) using LoRA with r=16, and create a Gradio interface where developers paste code and receive style suggestions.

Built by the developers of Doda Browser, DodaZIP, and Durga Antivirus Pro.
