Fine-Tuning with LoRA & DPO

Fine-tune Qwen2.5-1.5B for JSON extraction using QLoRA (4-bit) supervised fine-tuning, then align with Direct Preference Optimization. Before-and-after metrics on identical evaluation prompts.

Qwen2.5-1.5B-Instruct · QLoRA (4-bit NF4) · LoRA r=16 · DPO beta=0.1 · JSON Extraction · T4 GPU

| Metric | Result | Notes |
|---|---|---|
| JSON Valid Rate Gain | -- | Base to SFT+DPO |
| Key F1 Improvement | -- | Extraction accuracy |
| Value Accuracy Gain | -- | Correct field values |
| Trainable Parameters | -- | Of total model params |

Training Pipeline

Four-stage pipeline from base model to preference-aligned fine-tune.

Step 1: Data Preparation

Load ultrachat_200k and format samples as JSON extraction instructions; prepare DPO preference pairs.
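
A minimal sketch of this step with the datasets library. The instruction template and the use of the assistant turn as a stand-in target are invented for illustration; the source does not show its exact formatting code.

```python
from datasets import load_dataset

# ultrachat_200k ships pre-split; "train_sft" holds the supervised portion.
raw = load_dataset("HuggingFaceH4/ultrachat_200k", split="train_sft")

def format_example(row):
    # Hypothetical template: the doc does not specify how chats become
    # JSON-extraction pairs, so both prompt and target here are stand-ins.
    source = row["messages"][0]["content"]
    return {
        "prompt": ("Extract the key entities from the text below and "
                   "return them as a JSON object.\n\n" + source),
        "completion": row["messages"][1]["content"],
    }

sft_data = raw.select(range(10_000)).map(
    format_example, remove_columns=raw.column_names
)

# trl's DPOTrainer expects prompt/chosen/rejected columns; the source
# prompt column is assumed here to be "input" (check the dataset card).
dpo_pairs = (
    load_dataset("argilla/distilabel-intel-orca-dpo-pairs", split="train")
    .select(range(5_000))
    .rename_column("input", "prompt")
)
```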

Step 2: SFT with QLoRA

4-bit quantized base model plus LoRA adapters. 3 epochs on 10K samples with the paged AdamW 8-bit optimizer.
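
A sketch of the QLoRA setup with transformers, peft, and trl, wired to the hyperparameters listed under QLoRA Configuration below. It assumes a recent trl release (SFTConfig, prompt/completion datasets) and sft_data from the previous step.

```python
import torch
from peft import LoraConfig
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from trl import SFTConfig, SFTTrainer

bnb = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",              # NF4 data type
    bnb_4bit_use_double_quant=True,         # also quantize the quant constants
    bnb_4bit_compute_dtype=torch.bfloat16,  # use float16 on pre-Ampere GPUs (T4)
)
model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen2.5-1.5B-Instruct", quantization_config=bnb, device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-1.5B-Instruct")

lora = LoraConfig(
    r=16, lora_alpha=32, lora_dropout=0.05, task_type="CAUSAL_LM",
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
)

trainer = SFTTrainer(
    model=model,
    train_dataset=sft_data,
    peft_config=lora,                       # adapters attached by the trainer
    args=SFTConfig(
        output_dir="qwen-json-sft",
        num_train_epochs=3,
        per_device_train_batch_size=4,
        gradient_accumulation_steps=4,
        learning_rate=2e-4,
        lr_scheduler_type="cosine",
        optim="paged_adamw_8bit",
    ),
)
trainer.train()
trainer.save_model("qwen-json-sft")         # saves the adapter weights
```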

Step 3: DPO Alignment

Merge the SFT adapter, apply a fresh LoRA, and train on 5K preference pairs with beta=0.1 sigmoid loss.
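
A sketch of the alignment step. merge_and_unload() folds the SFT adapter into the base weights; passing a fresh peft_config with ref_model=None lets trl recover the reference policy by disabling the adapter. Details such as processing_class assume a recent trl release; the adapter path and variables carry over from the sketches above.

```python
import torch
from peft import LoraConfig, PeftModel
from transformers import AutoModelForCausalLM
from trl import DPOConfig, DPOTrainer

# Reload the base in 16-bit so the SFT adapter can be merged cleanly,
# then start DPO from a fresh adapter on top of the merged weights.
base = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen2.5-1.5B-Instruct", torch_dtype=torch.bfloat16, device_map="auto"
)
merged = PeftModel.from_pretrained(base, "qwen-json-sft").merge_and_unload()

fresh_lora = LoraConfig(
    r=16, lora_alpha=32, lora_dropout=0.05, task_type="CAUSAL_LM",
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
)

trainer = DPOTrainer(
    model=merged,
    ref_model=None,                 # adapter-disabled model acts as reference
    train_dataset=dpo_pairs,        # prompt/chosen/rejected triples
    processing_class=tokenizer,
    peft_config=fresh_lora,
    args=DPOConfig(
        output_dir="qwen-json-dpo",
        num_train_epochs=1,
        per_device_train_batch_size=2,
        gradient_accumulation_steps=8,
        learning_rate=5e-5,
        beta=0.1,                   # preference-loss strength; sigmoid is the default loss
    ),
)
trainer.train()
```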

Step 4: Evaluation

8 JSON extraction prompts. Measure valid JSON rate, key F1, value accuracy, and latency across all variants.
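
One reasonable way to score the first three metrics (the source does not show its scoring code; latency timing is omitted here). Invalid JSON counts as zero toward key F1 and value accuracy.

```python
import json

def try_parse(text):
    """Parsed dict, or None when the model output is not valid JSON."""
    try:
        obj = json.loads(text)
        return obj if isinstance(obj, dict) else None
    except json.JSONDecodeError:
        return None

def key_f1(pred, gold):
    # F1 over the sets of top-level keys.
    p, g = set(pred), set(gold)
    tp = len(p & g)
    if tp == 0:
        return 0.0
    precision, recall = tp / len(p), tp / len(g)
    return 2 * precision * recall / (precision + recall)

def value_accuracy(pred, gold):
    # Exact-match accuracy on the keys both objects share.
    shared = set(pred) & set(gold)
    return sum(pred[k] == gold[k] for k in shared) / len(shared) if shared else 0.0

def score(outputs, references):
    """outputs: model output strings; references: gold dicts, aligned by index."""
    parsed = [try_parse(o) for o in outputs]
    n = len(outputs)
    return {
        "valid_json_rate": sum(p is not None for p in parsed) / n,
        "key_f1": sum(key_f1(p, g) for p, g in zip(parsed, references) if p) / n,
        "value_acc": sum(value_accuracy(p, g)
                         for p, g in zip(parsed, references) if p) / n,
    }
```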

QLoRA Configuration

Base Model: Qwen/Qwen2.5-1.5B-Instruct (1.5B params)
Quantization: 4-bit NF4 with double quantization
LoRA Rank: 16 (alpha=32, dropout=0.05)
Target Modules: q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj
SFT: 3 epochs, lr=2e-4, batch=4, grad_accum=4, cosine schedule
DPO: 1 epoch, lr=5e-5, batch=2, grad_accum=8, beta=0.1
Optimizer: Paged AdamW 8-bit
Hardware: NVIDIA T4 (16GB VRAM)

Before & After Results

Same 8 evaluation prompts, same hardware, three model variants.

Detailed Comparison

| Stage | Train Loss | Eval Loss |
|---|---|---|
| SFT Training | -- | -- |
| DPO Training | -- | -- |

Datasets

Curated, high-quality datasets for each training stage.

SFT: ultrachat_200k (HuggingFaceH4/ultrachat_200k)

Heavily filtered subset of the synthetic UltraChat corpus of multi-turn conversations; widely used in open-source instruction model training. 10K samples used.

SFT (Alt): Nemotron Instruction Following (nvidia/Nemotron-Instruction-Following-Chat-v1)

Large, commercially usable instruction-following dataset from NVIDIA, with verifier-filtered conversations and structured-output examples.

DPO: distilabel-intel-orca-dpo-pairs (argilla/distilabel-intel-orca-dpo-pairs)

Curated preference dataset with prompt/chosen/rejected triples, cleaned and enriched from Intel/orca_dpo_pairs for DPO alignment. 5K pairs used.

Eval: GSM8K (openai/gsm8k)

8.5K grade-school math word problems for testing multi-step reasoning; used as a supplementary evaluation benchmark.

Why LoRA + DPO?

Efficient fine-tuning under production constraints.

QLoRA Efficiency

Full fine-tuning of a 1.5B model requires ~12GB VRAM and updates all parameters. QLoRA quantizes the base weights to 4-bit and trains only ~0.6% of parameters via low-rank adapters. Result: the job fits on a free T4 GPU with minimal quality loss versus full fine-tuning.
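
A quick check of the trainable-parameter fraction once adapters are attached (model and lora come from the SFT sketch above; peft also exposes a helper for the same report):

```python
from peft import get_peft_model

peft_model = get_peft_model(model, lora)  # standalone check; SFTTrainer does
peft_model.print_trainable_parameters()   # this internally via peft_config

trainable = sum(p.numel() for p in peft_model.parameters() if p.requires_grad)
total = sum(p.numel() for p in peft_model.parameters())
print(f"{trainable / total:.2%} of parameters are trainable")
```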

SFT vs DPO

SFT teaches the model what to output. DPO teaches it which output is better. SFT alone may produce valid JSON but with wrong field choices. DPO uses preference pairs to steer the model toward outputs humans actually prefer — better key naming, more complete extraction.
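
What one preference pair looks like in the prompt/chosen/rejected format trl's DPOTrainer consumes; the field values here are invented to illustrate the key-naming and completeness distinction.

```python
pair = {
    "prompt": "Extract the invoice fields from: "
              "'Invoice #1042, due 2024-03-01, total $250.'",
    # Preferred: descriptive keys, every field captured.
    "chosen": '{"invoice_number": 1042, "due_date": "2024-03-01", "total_usd": 250}',
    # Dispreferred: vague keys, missing the total.
    "rejected": '{"number": 1042, "date": "2024-03-01"}',
}
```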

Rank Selection

LoRA rank (r) controls adapter capacity. r=4 is minimal, r=64 approaches full fine-tuning. We use r=16 as the sweet spot: enough capacity for task-specific adaptation without overfitting on 10K samples. Alpha=32 (2x rank) provides stable gradients.
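
Back-of-the-envelope adapter sizing: a frozen d_out x d_in weight gains B (d_out x r) and A (r x d_in), i.e. r * (d_in + d_out) trainable parameters. The square 1536-dimensional projection below is illustrative (1536 is Qwen2.5-1.5B's hidden size).

```python
d_out = d_in = 1536                      # illustrative square projection
for r in (4, 16, 64):
    adapter = r * (d_in + d_out)         # B: d_out x r, plus A: r x d_in
    frozen = d_out * d_in                # the weight being adapted
    print(f"r={r:2d}: {adapter:7,d} adapter params "
          f"({adapter / frozen:.1%} of the frozen matrix)")
```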

Quantization Tradeoff

NF4 quantization reduces weight memory by ~4x versus 16-bit. The double quantization flag further compresses the quantization constants. Compute stays in 16-bit (bfloat16 where supported; float16 on pre-Ampere GPUs such as the T4) for numerical stability. Measured quality loss vs full precision: <1% on most benchmarks.
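
Rough weight-memory arithmetic behind the ~4x figure, ignoring activations, optimizer state, and the KV cache; the ~0.4 bits/param saving from double quantization is the figure reported in the QLoRA paper.

```python
params = 1.5e9                                            # Qwen2.5-1.5B weight count
print(f"16-bit weights: {params * 2 / 2**30:.2f} GiB")    # 2 bytes per param
print(f"NF4 weights:    {params * 0.5 / 2**30:.2f} GiB")  # 4 bits per param
# Double quantization compresses the per-block quantization constants,
# saving roughly 0.4 bits per parameter on top of the 4-bit weights.
```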

Fine-Tuning Approaches Compared

| Approach | VRAM Required | Params Trained | Quality | Training Time |
|---|---|---|---|---|
| Full Fine-Tune | ~12 GB | 100% | Best | Hours |
| LoRA (16-bit) | ~8 GB | ~0.6% | Near-best | ~1 hour |
| QLoRA (4-bit) | ~4 GB | ~0.6% | Excellent | ~45 min |
| Prompt Engineering | 0 GB | 0% | Limited | Minutes |

Model Variant Summary

Head-to-head comparison on 8 JSON extraction evaluation prompts.

| Variant | Valid JSON | Key F1 | Value Acc | Latency |
|---|---|---|---|---|
| Base (Qwen2.5-1.5B) | -- | -- | -- | -- |
| + SFT (QLoRA) | -- | -- | -- | -- |
| + SFT + DPO | -- | -- | -- | -- |