The Fine-Tuning Spectrum (Oct 2025)
Full Fine-Tuning: Update all model parameters. Requires massive GPU memory (13B model: 26GB FP16 weights + ~78GB gradients and optimizer state ≈ 104GB, before activations). In practice only ~8B-scale models fit comfortably on a single H200 (141GB).
LoRA: Freeze the base model and train tiny adapter layers (rank 8-64). ~10× memory reduction. Llama 3.1 70B FP16: 140GB of trainable weights → ~14GB of trainable adapter state.
QLoRA: LoRA on top of a 4-bit-quantized base model. ~20× memory reduction. Llama 3.1 70B: 140GB → ~7GB of trainable state. Fits on a single RTX 5090 (32GB).
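The arithmetic behind these figures can be sanity-checked with a back-of-the-envelope estimator. A minimal sketch, assuming rule-of-thumb bytes-per-parameter costs (FP16 weights + gradients + optimizer state for full FT; a frozen FP16 base for LoRA; a 4-bit NF4 base for QLoRA) and ignoring activation memory, which scales with batch size and sequence length:

def training_vram_gb(params_b: float, method: str) -> float:
    """Ballpark training VRAM in GB for a model with params_b billion parameters."""
    gb_per_billion_params = {
        "full": 6.0,   # FP16 weights (2B) + FP16 grads (2B) + ~2B optimizer state
        "lora": 2.0,   # frozen FP16 base; adapter grads/optimizer are comparatively tiny
        "qlora": 0.5,  # 4-bit NF4 base; double quantization shaves the constants
    }
    return params_b * gb_per_billion_params[method]

for method in ("full", "lora", "qlora"):
    print(f"8B {method}: {training_vram_gb(8, method):.0f}GB, "
          f"70B {method}: {training_vram_gb(70, method):.0f}GB")
# Ballpark only: real usage shifts with batch size, sequence length,
# and whether the LoRA base is itself loaded in 8-bit.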
GPU Memory Requirements (Llama 3.1 Models, Oct 2025)
| Model | Full FT (FP16) | LoRA (rank 32) | QLoRA (4-bit) |
|---|---|---|---|
| Llama 3.1 8B | 48GB (H200) | 16GB (RTX 5080) | 8GB (RTX 5070) |
| Llama 3.1 70B | 420GB (3× H200 141GB) | 42GB (2× RTX 5090) | 32GB (RTX 5090) |
Key Insight: QLoRA makes 70B fine-tuning accessible on consumer GPUs. A single RTX 5090 ($2,000) can fine-tune models that previously required $120K+ in datacenter GPUs.
Performance Trade-Offs
Training Speed (Llama 3.1 8B, ~10K samples, Oct 2025)
- Full FT: ~8 hours on H200 (1.5× faster than on an H100, thanks to HBM3e bandwidth)
- LoRA (rank 32): ~3 hours on H200 (~2.7× faster than full FT; far fewer parameters receive gradient updates)
- QLoRA (4-bit): ~5 hours on RTX 5090 (~1.6× faster than full FT, but slower than LoRA due to dequantization overhead)
Quality: Full FT vs LoRA vs QLoRA
Real-World Quality Results
Domain Adaptation (Legal Q&A):
- Full FT: 92% accuracy (baseline)
- LoRA rank 32: 91% accuracy (-1% vs full FT)
- QLoRA 4-bit: 89% accuracy (-3% vs full FT)
Key Finding: LoRA typically recovers 90-95% of the quality gain from full fine-tuning with 10× less memory; QLoRA recovers 80-90% with 20× less memory. For most production use cases, a 1-3 point accuracy loss is acceptable given the 10-20× cost reduction.
When to Use Each Method
- Full Fine-Tuning: Only if 8B model + H200 budget + need absolute best quality
- LoRA: Production default. 1% quality loss, 10× memory savings, fast iteration
- QLoRA: Budget deployments, rapid experimentation, consumer GPUs (RTX 5090, 5080); a rough selection helper is sketched below
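That guidance can be condensed into a rough picker. The GB-per-billion-parameter thresholds below are illustrative assumptions distilled from the table above, not hard limits:

def pick_method(params_b: float, vram_gb: float, need_max_quality: bool = False) -> str:
    """Rough fine-tuning method selection from model size and available VRAM."""
    if need_max_quality and vram_gb >= 6.0 * params_b:
        return "full"   # all weights, gradients, and optimizer state fit
    if vram_gb >= 2.0 * params_b:
        return "lora"   # FP16 base weights fit; adapters add little
    if vram_gb >= 0.45 * params_b:
        return "qlora"  # 4-bit NF4 base weights fit
    return "qlora across multiple GPUs, or a smaller model"

print(pick_method(8, 32))   # -> lora  (e.g. RTX 5090)
print(pick_method(70, 32))  # -> qlora (e.g. RTX 5090)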
Production Deployment Guide
LoRA Fine-Tuning (Llama 3.1 8B on a Single GPU)
# Install dependencies
pip install torch transformers peft datasets bitsandbytes accelerate
# Fine-tune with LoRA (16GB VRAM)
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.1-8B",
    quantization_config=BitsAndBytesConfig(load_in_8bit=True),  # the bare load_in_8bit kwarg is deprecated
    device_map="auto"
)
model = prepare_model_for_kbit_training(model)  # gradient checkpointing + dtype fixes for quantized bases

lora_config = LoraConfig(
    r=32,                                 # rank (8-64 typical)
    lora_alpha=64,                        # scaling factor (2× rank common)
    target_modules=["q_proj", "v_proj"],  # attention layers
    lora_dropout=0.1,
    bias="none",
    task_type="CAUSAL_LM"
)
model = get_peft_model(model, lora_config)
# Train with HuggingFace Trainer...
# Training time: ~3 hours on RTX 5090 (Oct 2025)
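Filling in the Trainer step above: a minimal sketch, assuming a local train.jsonl with a "text" column; the batch size, learning rate, and epoch count are illustrative assumptions, not tuned values:

from datasets import load_dataset
from transformers import (AutoTokenizer, DataCollatorForLanguageModeling,
                          Trainer, TrainingArguments)

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.1-8B")
tokenizer.pad_token = tokenizer.eos_token  # Llama ships without a pad token

dataset = load_dataset("json", data_files="train.jsonl", split="train")
dataset = dataset.map(
    lambda ex: tokenizer(ex["text"], truncation=True, max_length=1024),
    remove_columns=dataset.column_names,
)

trainer = Trainer(
    model=model,  # the PEFT-wrapped model from above
    args=TrainingArguments(
        output_dir="./lora-adapter",
        per_device_train_batch_size=4,
        gradient_accumulation_steps=4,
        learning_rate=2e-4,   # LoRA tolerates higher rates than full FT
        num_train_epochs=3,
        bf16=True,
        logging_steps=10,
    ),
    train_dataset=dataset,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
model.save_pretrained("./lora-adapter")  # writes only the adapter weights (tens of MB)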
QLoRA Fine-Tuning (Llama 3.1 70B on RTX 5090)
# QLoRA with 4-bit quantization (32GB VRAM)
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",              # NormalFloat 4-bit
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True          # nested quantization of the quant constants
)
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.1-70B",
    quantization_config=bnb_config,
    device_map="auto"
)
model = prepare_model_for_kbit_training(model)
lora_config = LoraConfig(
    r=16, lora_alpha=32,
    target_modules=["q_proj", "v_proj"],  # same targets as the 8B recipe above
    lora_dropout=0.1, bias="none", task_type="CAUSAL_LM"
)
model = get_peft_model(model, lora_config)
# Fits on single RTX 5090 (32GB) with headroom!
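Before launching training it is worth confirming the wrap worked; both calls below are standard PEFT/Transformers utilities:

model.print_trainable_parameters()  # adapters should be a fraction of a percent of 70B
print(f"{model.get_memory_footprint() / 1e9:.1f} GB allocated for weights")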
Serving Fine-Tuned LoRA Models
# Load base model + LoRA adapter
from peft import PeftModel
base_model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-3.1-8B")
model = PeftModel.from_pretrained(base_model, "./lora-adapter")
# Or merge adapter into base for faster inference
model = model.merge_and_unload()
model.save_pretrained("./merged-model")
# Serve with vLLM (supports LoRA adapters natively)
python -m vllm.entrypoints.openai.api_server \
--model ./merged-model \
--tensor-parallel-size 1
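Once the server is up it speaks the OpenAI API. A minimal client sketch; localhost:8000 is vLLM's default address, and the prompt is just an example:

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")  # vLLM ignores the key
response = client.completions.create(
    model="./merged-model",  # must match the --model path given to the server
    prompt="Summarize the key fine-tuning trade-offs:",
    max_tokens=128,
)
print(response.choices[0].text)

If you prefer not to merge, vLLM can also serve the adapter directly with --enable-lora --lora-modules my-adapter=./lora-adapter, which lets one base model host several adapters.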
Cost Analysis: Full FT vs LoRA vs QLoRA
Llama 3.1 70B Fine-Tuning (10K samples, ~6 hours training, Oct 2025)
Iteration Economics:
- Full FT: $108/iteration → 10 iterations = $1,080 (slow experimentation)
- LoRA: $7.20/iteration → 100 iterations = $720 (15× cheaper)
- QLoRA: $3.60/iteration → 200 iterations = $720 (30× cheaper, enables rapid tuning)
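These figures back out to roughly $6/GPU-hour for the H200 and $0.60/GPU-hour for the RTX 5090 over a 6-hour run; a small sketch of that arithmetic, where the hourly rates are inferred assumptions rather than vendor quotes:

GPU_RATES = {"h200": 6.00, "rtx5090": 0.60}  # assumed $/GPU-hour
HOURS_PER_RUN = 6                            # ~6-hour training run from the header above

def cost_per_iteration(gpu: str, gpu_count: int) -> float:
    """Per-iteration cost = hourly rate × GPU count × training hours."""
    return GPU_RATES[gpu] * gpu_count * HOURS_PER_RUN

print(cost_per_iteration("h200", 3))     # Full FT, 3× H200 -> 108.0
print(cost_per_iteration("rtx5090", 2))  # LoRA,    2× 5090 -> 7.2
print(cost_per_iteration("rtx5090", 1))  # QLoRA,   1× 5090 -> 3.6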
Production Recommendation (Oct 2025)
Start with QLoRA for rapid experimentation (20-50 iterations to find the best hyperparameters). Once satisfied, run a final LoRA training pass for a 1-2 point quality boost if needed. Full fine-tuning is rarely worth the 15× cost premium over LoRA; it is only justified for mission-critical applications where every 0.5% of accuracy matters.