Fine-Tuning LLMs with LoRA and QLoRA — Parameter-Efficient Training Guide

DodaTech Updated 2026-06-22 5 min read

Parameter-efficient fine-tuning (PEFT) with LoRA and QLoRA lets you adapt large language models on consumer hardware by training only a small fraction of parameters — this guide walks through both techniques end to end.

What You'll Learn

You'll learn the theory behind low-rank adaptation, how to implement LoRA fine-tuning with Hugging Face PEFT, and how QLoRA uses 4-bit quantization to make fine-tuning feasible on a single GPU.

Why It Matters

Full fine-tuning of a 7B-parameter model typically requires over 60 GB of VRAM once gradients and optimizer states are counted. LoRA can reduce this to around 16 GB, and QLoRA can bring it under 10 GB, which makes fine-tuning accessible to developers without enterprise GPU clusters.
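The memory figures above can be sanity-checked with back-of-the-envelope arithmetic. The sketch below is only a rough estimate: the helper names are hypothetical, the 16-bytes-per-parameter figure for full fine-tuning is a common rule of thumb for mixed-precision Adam (an assumption, not a measurement), and activations are ignored entirely.

```python
# Rough memory arithmetic; every byte count here is a rule-of-thumb assumption.
GIB = 1024**3


def weights_gib(num_params: float, bits_per_param: float) -> float:
    """Memory needed just to store the weights, in GiB."""
    return num_params * bits_per_param / 8 / GIB


def full_finetune_gib(num_params: float) -> float:
    """Weights + gradients + Adam state for mixed-precision training.

    Assumes 16 bytes per parameter: 2 (fp16 weights) + 2 (fp16 gradients)
    + 12 (fp32 master weights and two Adam moments). Activations excluded.
    """
    return num_params * 16 / GIB


n = 7e9  # a 7B-parameter model
print(f"fp16 weights:         {weights_gib(n, 16):.1f} GiB")
print(f"4-bit weights:        {weights_gib(n, 4):.1f} GiB")
print(f"full fine-tune state: {full_finetune_gib(n):.1f} GiB")
```

Under these assumptions the frozen weights alone take about 13 GiB in fp16 and about 3.3 GiB in 4-bit, which is why LoRA and QLoRA land in such different VRAM ranges even though both train the same small adapter.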

Real-World Use

Doda Browser's code completion feature uses a LoRA-fine-tuned CodeLlama model that suggests idiomatic Python completions based on the project's existing code patterns — trained in under 4 hours on a single RTX 4090.

Fine-Tuning Landscape

flowchart TD
    A[Base LLM] --> B{Method}
    B --> C[Full Fine-Tune]
    B --> D[LoRA]
    B --> E[QLoRA]
    C --> F[All params updated]
    C --> G[High VRAM]
    D --> H[Rank decomposition]
    D --> I[Medium VRAM]
    E --> J[4-bit NF4 + LoRA]
    E --> K[Lowest VRAM]

What Is LoRA?

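Low-Rank Adaptation (LoRA) freezes the pretrained weights and injects small trainable low-rank matrices, typically into the attention projection layers. This cuts the number of trainable parameters by several orders of magnitude (up to 10,000x in the original LoRA experiments on GPT-3).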

from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

model_name = "microsoft/phi-2"
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.float16,
    device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token

# Configure LoRA
lora_config = LoraConfig(
    r=8,
    lora_alpha=32,
    target_modules=["q_proj", "v_proj"],
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM"
)

model = get_peft_model(model, lora_config)

# Count parameters
total = sum(p.numel() for p in model.parameters())
trainable = sum(
    p.numel() for p in model.parameters() if p.requires_grad
)
print(f"Total params: {total:,}")
print(f"Trainable params: {trainable:,}")
print(f"Trainable: {100 * trainable / total:.4f}%")

Expected output:

Total params: 2,782,305,280
Trainable params: 2,621,440
Trainable: 0.0942%
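The size of a LoRA adapter follows directly from the rank: each adapted linear layer gains a matrix A of shape (r × d_in) and a matrix B of shape (d_out × r). The sketch below works this out by hand for the configuration above. The helper name is hypothetical, and the shape values assume Phi-2's published configuration (hidden size 2560, 32 decoder layers, square q_proj and v_proj).

```python
def lora_param_count(d_in: int, d_out: int, r: int) -> int:
    """Parameters in one LoRA adapter: A is (r x d_in), B is (d_out x r)."""
    return r * (d_in + d_out)


# Assumed Phi-2 shape: hidden size 2560, 32 decoder layers,
# with q_proj and v_proj both mapping 2560 -> 2560.
hidden, layers, rank = 2560, 32, 8

per_module = lora_param_count(hidden, hidden, rank)
total_trainable = per_module * 2 * layers  # q_proj + v_proj in every layer

print(f"Per module: {per_module:,}")            # Per module: 40,960
print(f"Total trainable: {total_trainable:,}")  # Total trainable: 2,621,440
```

Because the count scales linearly with r, doubling the rank or adding more entries to target_modules doubles or proportionally grows the adapter.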

Preparing Training Data

Format your dataset as instruction-response pairs for supervised fine-tuning.

from datasets import Dataset

train_data = [
    {"instruction": "Write a Python function to check if a number is prime.",
     "response": "def is_prime(n):\n    if n < 2: return False\n    for i in range(2, int(n**0.5) + 1):\n        if n % i == 0: return False\n    return True"},
    {"instruction": "Explain what a decorator does in Python.",
     "response": "A decorator is a function that takes another function and extends its behavior without explicitly modifying it."}
]

def format_example(example):
    return {
        "text": f"### Instruction:\n{example['instruction']}\n\n### Response:\n{example['response']}{tokenizer.eos_token}"
    }

dataset = Dataset.from_list(train_data)
dataset = dataset.map(format_example)
print(f"Dataset size: {len(dataset)}")
print(f"Example:\n{dataset[0]['text']}")

Expected output:

Dataset size: 2
Example:
### Instruction:
Write a Python function to check if a number is prime.

### Response:
def is_prime(n):
    if n < 2: return False
    for i in range(2, int(n**0.5) + 1):
        if n % i == 0: return False
    return True<|endoftext|>
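A common refinement for instruction data is to compute the loss only on the response tokens, so the model is not trained to reproduce the prompt. The sketch below shows the idea with plain lists of token ids. The helper name and the ids are hypothetical stand-ins for real tokenizer output, and -100 is the label value that PyTorch's cross-entropy loss ignores by default.

```python
IGNORE_INDEX = -100  # label value skipped by PyTorch's cross-entropy loss by default


def build_labels(prompt_ids: list[int], response_ids: list[int]) -> dict:
    """Concatenate prompt and response, masking the prompt in the labels."""
    input_ids = prompt_ids + response_ids
    labels = [IGNORE_INDEX] * len(prompt_ids) + list(response_ids)
    return {
        "input_ids": input_ids,
        "attention_mask": [1] * len(input_ids),
        "labels": labels,
    }


# Hypothetical token ids standing in for a tokenized prompt and response.
example = build_labels([101, 102, 103], [7, 8, 9, 2])
print(example["labels"])  # [-100, -100, -100, 7, 8, 9, 2]
```

In a real pipeline the two id lists would come from tokenizing the "### Instruction" part and the "### Response" part separately; the model still sees the full sequence as input, but only the response positions contribute to the loss.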

Training with LoRA

Use the Hugging Face Trainer for fine-tuning.

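from transformers import TrainingArguments, Trainer, DataCollatorForSeq2Seq

# Trainer needs token ids, not raw text: tokenize and copy input_ids to labels.
# max_length=512 is an arbitrary cap; adjust it to your data.
def tokenize(example):
    tokens = tokenizer(example["text"], truncation=True, max_length=512)
    tokens["labels"] = tokens["input_ids"].copy()
    return tokens

dataset = dataset.map(tokenize, remove_columns=dataset.column_names)

training_args = TrainingArguments(
    output_dir="./phi2-lora",
    per_device_train_batch_size=4,
    gradient_accumulation_steps=4,
    num_train_epochs=3,
    learning_rate=2e-4,
    fp16=True,
    logging_steps=10,
    save_strategy="epoch",
    report_to="none",
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=dataset,
    # Pads input_ids with the pad token and labels with -100,
    # so padded positions are ignored by the loss
    data_collator=DataCollatorForSeq2Seq(
        tokenizer, pad_to_multiple_of=8, label_pad_token_id=-100
    ),
)

trainer.train()
# Write the final LoRA adapter to output_dir (checkpoints go to subfolders)
trainer.save_model(training_args.output_dir)
print("Training complete!")
print(f"Model saved to {training_args.output_dir}")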

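Expected output (illustrative; it assumes a dataset large enough to reach 30 optimizer steps, whereas the two-example dataset above finishes in 3 steps and logs no intermediate loss):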

Step  Training Loss
10    1.4523
20    0.8912
30    0.4456
Training complete!
Model saved to ./phi2-lora
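How many optimizer steps a run takes depends on the dataset size, the per-device batch size, and gradient accumulation. The sketch below is an approximation with a hypothetical helper name; it rounds partial batches up, and the exact rounding used by Trainer can differ slightly between transformers versions.

```python
import math


def optimizer_steps(num_examples: int, per_device_batch: int,
                    grad_accum: int, epochs: int, num_devices: int = 1) -> int:
    """Approximate number of optimizer steps, rounding partial batches up."""
    batches_per_epoch = math.ceil(num_examples / (per_device_batch * num_devices))
    steps_per_epoch = max(math.ceil(batches_per_epoch / grad_accum), 1)
    return steps_per_epoch * epochs


# Settings from the TrainingArguments above: batch 4, accumulation 4, 3 epochs.
effective_batch = 4 * 4
print(effective_batch)                 # 16
print(optimizer_steps(2, 4, 4, 3))     # 3
print(optimizer_steps(1000, 4, 4, 3))  # 189
```

With logging_steps=10, a loss value is only reported once a run reaches at least 10 optimizer steps, so very small datasets produce no intermediate loss lines.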

QLoRA: Quantized LoRA

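QLoRA stores the frozen base model in 4-bit NormalFloat (NF4) precision, shrinking base-weight memory roughly 4x relative to 16-bit weights, while the LoRA adapters are still trained in higher precision so training quality is largely preserved.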

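from transformers import BitsAndBytesConfig
from peft import prepare_model_for_kbit_training

# 4-bit quantization config
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,
    bnb_4bit_use_double_quant=True,
)

# Load model in 4-bit
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=bnb_config,
    device_map="auto",
)
# Recommended before k-bit training: enables gradient checkpointing
# and input gradients on the quantized model
model = prepare_model_for_kbit_training(model)
model = get_peft_model(model, lora_config)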

# Measure memory
memory = torch.cuda.max_memory_allocated() / 1024**3
trainable = sum(
    p.numel() for p in model.parameters() if p.requires_grad
)
print(f"Max GPU memory: {memory:.2f} GB")
print(f"Trainable params: {trainable:,}")

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=dataset,
    data_collator=DataCollatorForSeq2Seq(
        tokenizer, pad_to_multiple_of=8
    ),
)
trainer.train()

Expected output:

Max GPU memory: 5.84 GB
Trainable params: 2,621,440
Step  Training Loss
10    1.4210
20    0.8765
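To make the idea of 4-bit storage concrete, the sketch below quantizes one block of weights to 4-bit codes with a shared per-block scale and then restores them. This is deliberately not NF4: it uses a plain uniform integer grid (absmax quantization) with hypothetical helper names and made-up weights. NF4 replaces the uniform grid with a non-uniform 16-value codebook suited to normally distributed weights, but the per-block "code plus scale" layout is the same principle.

```python
def quantize_block(block: list[float]) -> tuple[list[int], float]:
    """Absmax-quantize one block to signed integer codes in [-7, 7] (fits in 4 bits)."""
    absmax = max(abs(x) for x in block) or 1.0  # avoid dividing by zero
    scale = absmax / 7
    codes = [max(-7, min(7, round(x / scale))) for x in block]
    return codes, scale


def dequantize_block(codes: list[int], scale: float) -> list[float]:
    """Recover approximate weights from the codes and the block's scale."""
    return [c * scale for c in codes]


# Made-up weights for illustration only.
weights = [0.12, -0.40, 0.03, 0.31, -0.07, 0.25, -0.18, 0.00]
codes, scale = quantize_block(weights)
restored = dequantize_block(codes, scale)
max_err = max(abs(w - r) for w, r in zip(weights, restored))

print(codes)
print(f"scale={scale:.4f}, max error={max_err:.4f}")
```

The rounding error per weight is bounded by half the scale, which is why smaller blocks (each with its own scale) give better fidelity at the cost of storing more scales.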

Inference After Fine-Tuning

Load and test your fine-tuned model.

from peft import PeftModel

base_model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.float16,
    device_map="auto"
)
fine_tuned = PeftModel.from_pretrained(base_model, "./phi2-lora")
fine_tuned = fine_tuned.merge_and_unload()

prompt = "### Instruction:\nSort a list of numbers in Python\n\n### Response:\n"
inputs = tokenizer(prompt, return_tensors="pt").to(fine_tuned.device)
output = fine_tuned.generate(
    # temperature only takes effect when sampling is enabled
    **inputs, max_new_tokens=100, do_sample=True, temperature=0.7
)
response = tokenizer.decode(
    output[0], skip_special_tokens=True
).split("### Response:\n")[-1]
print(response)

Example output (the exact text depends on the trained adapter):

def sort_numbers(numbers):
    return sorted(numbers)

# Example usage
nums = [3, 1, 4, 1, 5, 9, 2, 6]
sorted_nums = sort_numbers(nums)
print(sorted_nums)  # [1, 1, 2, 3, 4, 5, 6, 9]
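Merging an adapter, as merge_and_unload() does above, folds the low-rank update into the base weight so that inference needs only one matrix multiply per layer. The sketch below checks the underlying identity on toy matrices in pure Python. The shapes and values are made up, the helper names are hypothetical, and the scaling factor lora_alpha / r follows the standard LoRA convention.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def add_scaled(w, delta, scale):
    """Element-wise w + scale * delta for two equally shaped matrices."""
    return [[wi + scale * di for wi, di in zip(wr, dr)] for wr, dr in zip(w, delta)]


# Toy shapes: W is 3x3, B is 3x1, A is 1x3 (rank r = 1), lora_alpha = 2.
W = [[1.0, 0.0, 2.0], [0.0, 1.0, 0.0], [3.0, 0.0, 1.0]]
B = [[0.5], [0.0], [-1.0]]
A = [[1.0, 2.0, 0.0]]
r, lora_alpha = 1, 2
scale = lora_alpha / r

x = [[1.0], [1.0], [1.0]]  # column-vector input

# Adapter path: base output plus the scaled low-rank correction.
adapter_y = add_scaled(matmul(W, x), matmul(B, matmul(A, x)), scale)

# Merged path: fold the update into the weight once, then one matmul.
W_merged = add_scaled(W, matmul(B, A), scale)
merged_y = matmul(W_merged, x)

print(adapter_y)  # [[6.0], [1.0], [-2.0]]
print(merged_y)   # [[6.0], [1.0], [-2.0]]
```

In exact arithmetic the two paths always agree; with float16 weights the merged matrix is rounded once more, which is one reason a merged model can score slightly differently from the unmerged adapter.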

Common Errors

Error | Cause | Fix
CUDA out of memory during training | Batch size or LoRA rank too high | Reduce per_device_train_batch_size to 1 or lower r to 4
Loss diverges after first step | Learning rate too high for PEFT | Stay in the 1e-4 to 3e-4 range; LoRA typically uses a higher learning rate than full fine-tuning, but not arbitrarily high
Model repeats input during inference | Trainer did not mask labels | Set label_pad_token_id=-100 for causal LM
QLoRA training slower than expected | 4-bit dequantization overhead | Increase gradient_accumulation_steps to reduce CPU-GPU syncs
Merged model performs worse than LoRA adapter | Merge introduced precision loss | Use float16 merge and test the adapter without merging first

Practice Questions

  1. Why does LoRA use a low-rank decomposition instead of training all parameters? The weight updates needed to adapt a pretrained model tend to have low intrinsic rank, so a pair of small rank-r matrices can capture task-specific changes without retraining the full weights.

  2. What advantage does QLoRA offer over standard LoRA? QLoRA quantizes the frozen base model to 4-bit NF4, cutting base-weight memory roughly 4x relative to 16-bit while keeping LoRA's trainable parameter count and most of its accuracy.

  3. How do you choose the rank (r) hyperparameter in LoRA? Higher r captures more adaptation capacity but uses more memory; r=8 to r=16 works for most tasks, r=64 for complex domain shifts.

  4. Why is the tokenizer's pad token set to the EOS token for causal LM fine-tuning? Many causal LM tokenizers ship without a pad token, so one must be assigned before batching; reusing EOS avoids adding a new embedding, and padded positions are masked out of the loss so the model does not learn meaningless padding patterns.

  5. Challenge: Fine-tune a 7B parameter model (Mistral or Llama 2) on a domain-specific dataset of your choice using QLoRA, evaluate its perplexity against the base model, and deploy the adapter as a FastAPI inference endpoint.

Mini Project

Build a code review assistant. Collect a dataset of Python code snippets with good and bad style examples, format them as instruction-response pairs, fine-tune a small LLM (Phi-2 or CodeLlama 7B) using LoRA with r=16, and create a Gradio interface where developers paste code and receive style suggestions.

Built by the developers of Doda Browser, DodaZIP, and Durga Antivirus Pro.