Fine-Tuning LLMs: LoRA vs QLoRA Production Guide

Full fine-tuning a 70B model requires $120K+ in datacenter GPUs. LoRA reduces memory by 10×, QLoRA by 20×—enabling domain adaptation on consumer hardware. This guide shows GPU requirements (H200, RTX 5090), quality trade-offs, and production deployment for Llama 3.2 models (Oct 2025).

The Fine-Tuning Spectrum (Oct 2025)

Full Fine-Tuning: Update all model parameters. Requires massive GPU memory (13B model = 26GB FP16 weights + 78GB gradients = 104GB total). Only viable for 8B models on single H200.

LoRA: Train tiny adapter layers (rank 8-64). 10× memory reduction. Llama 3.2 70B FP16: 140GB → 14GB trainable.

QLoRA: LoRA + 4-bit quantization. 20× memory reduction. Llama 3.2 70B: 140GB → 7GB trainable. Fits on single RTX 5090 (32GB).

GPU Memory Requirements (Llama 3.2 Models, Oct 2025)

Model Full FT (FP16) LoRA (rank 32) QLoRA (4-bit)
Llama 3.2 8B 48GB (H200) 16GB (RTX 5080) 8GB (RTX 5070)
Llama 3.2 70B 420GB (3× H200 141GB) 42GB (2× RTX 5090) 32GB (RTX 5090)

Key Insight: QLoRA makes 70B fine-tuning accessible on consumer GPUs. A single RTX 5090 ($2,000) can fine-tune models that previously required $120K+ in datacenter GPUs.

Performance Trade-Offs

Training Speed (Llama 3.2 8B, ~10K samples, Oct 2025)

  • Full FT: ~8 hours on H200 (1.5× faster than H100 due to HBM3e bandwidth)
  • LoRA (rank 32): ~3 hours on H200 (3× faster - fewer params to update)
  • QLoRA (4-bit): ~5 hours on RTX 5090 (2× faster than full FT, slower than LoRA due to dequantization)

Quality: Full FT vs LoRA vs QLoRA

Real-World Quality Results

Domain Adaptation (Legal Q&A):

  • Full FT: 92% accuracy (baseline)
  • LoRA rank 32: 91% accuracy (-1% vs full FT)
  • QLoRA 4-bit: 89% accuracy (-3% vs full FT)

Key Finding: LoRA recovers 90-95% of full fine-tuning quality with 10× less memory. QLoRA 80-90% quality with 20× less memory. For most production use cases, the 1-3% accuracy loss is acceptable given the 10-20× cost reduction.

When to Use Each Method

  • Full Fine-Tuning: Only if 8B model + H200 budget + need absolute best quality
  • LoRA: Production default. 1% quality loss, 10× memory savings, fast iteration
  • QLoRA: Budget deployments, rapid experimentation, consumer GPUs (RTX 5090, 5080)

Production Deployment Guide

LoRA Fine-Tuning (Llama 3 8B on Single GPU)

# Install dependencies
pip install transformers peft datasets bitsandbytes accelerate

# Fine-tune with LoRA (16GB VRAM)
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.2-8B",
    load_in_8bit=True,
    device_map="auto"
)

lora_config = LoraConfig(
    r=32,  # rank (8-64 typical)
    lora_alpha=64,  # scaling factor (2× rank common)
    target_modules=["q_proj", "v_proj"],  # attention layers
    lora_dropout=0.1,
    bias="none"
)

model = get_peft_model(model, lora_config)
# Train with HuggingFace Trainer...
# Training time: ~3 hours on RTX 5090 (Oct 2025)

QLoRA Fine-Tuning (Llama 3.2 70B on RTX 5090)

# QLoRA with 4-bit quantization (32GB VRAM)
from transformers import BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",  # NormalFloat 4-bit
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True  # nested quantization
)

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.2-70B",
    quantization_config=bnb_config,
    device_map="auto"
)

lora_config = LoraConfig(r=16, lora_alpha=32, ...)
model = get_peft_model(model, lora_config)
# Fits on single RTX 5090 (32GB) with headroom!

Serving Fine-Tuned LoRA Models

# Load base model + LoRA adapter
from peft import PeftModel

base_model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-3.2-8B")
model = PeftModel.from_pretrained(base_model, "./lora-adapter")

# Or merge adapter into base for faster inference
model = model.merge_and_unload()
model.save_pretrained("./merged-model")

# Serve with vLLM (supports LoRA adapters natively)
python -m vllm.entrypoints.openai.api_server \
  --model ./merged-model \
  --tensor-parallel-size 1

Cost Analysis: Full FT vs LoRA vs QLoRA

Llama 3.2 70B Fine-Tuning (10K samples, ~6 hours training, Oct 2025)

Full Fine-Tuning (3× H200 141GB): $120K hardware + $18/hr cloud = $108 training run
LoRA (2× RTX 5090 32GB): $4K hardware + $1.20/hr cloud = $7.20 training run
QLoRA (1× RTX 5090 32GB): $2K hardware + $0.60/hr cloud = $3.60 training run

Iteration Economics:

  • Full FT: $108/iteration → 10 iterations = $1,080 (slow experimentation)
  • LoRA: $7.20/iteration → 100 iterations = $720 (15× cheaper)
  • QLoRA: $3.60/iteration → 200 iterations = $720 (30× cheaper, enables rapid tuning)

Production Recommendation (Oct 2025)

Start with QLoRA for rapid experimentation (20-50 iterations to find best hyperparameters). Once satisfied, run final LoRA training for 1-2% quality boost if needed. Full fine-tuning rarely worth the 15× cost premium over LoRA—only justified for mission-critical applications where every 0.5% accuracy matters.

Related Articles

Fine-Tuning LLMs: LoRA vs QLoRA Production Guide

LoRA vs QLoRA: VRAM requirements, quality tradeoffs, and a production decision framework (when to fine-tune vs when to use RAG).

Want the full technical deep dive?

This page includes an executive brief in your language. Switch to English to read the full technical version with implementation details.

Key takeaways

  • Fine-tuning is for behavior and style; RAG is for freshness of knowledge.
  • LoRA improves quality with higher VRAM; QLoRA cuts cost and speeds iteration.
  • Decide based on dataset quality, drift rate, compliance needs, and target hardware.
  • Use a fixed eval set with regression gates; plan rollback and versioning.

30-day plan

  • Define the objective (format, tone, extraction, classification) and success metrics.
  • Build dataset + validation set; set a quality bar before training.
  • Train LoRA/QLoRA baseline and run eval + safety checks.
  • Deploy behind a feature flag; monitor quality and cost, then iterate.