Fine-Tuning with LoRA & DPO

Fine-tune Qwen2.5-1.5B for JSON extraction using QLoRA (4-bit) supervised fine-tuning, then align with Direct Preference Optimization. Before-and-after metrics on identical evaluation prompts.

Qwen2.5-1.5B-Instruct · QLoRA (4-bit NF4) · LoRA r=16 · DPO beta=0.1 · JSON Extraction · T4 GPU

| Metric | Result | Notes |
|---|---|---|
| JSON Valid Rate Gain | -- | Base to SFT+DPO |
| Key F1 Improvement | -- | Extraction accuracy |
| Value Accuracy Gain | -- | Correct field values |
| Trainable Parameters | -- | Of total model params |

Training Pipeline

Four-stage pipeline from base model to preference-aligned fine-tune.

Step 1: Data Preparation

Load ultrachat_200k and format samples as JSON extraction instructions; prepare DPO preference pairs.
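
A minimal sketch of this step with the datasets library. The instruction template and the use of the assistant turn as a stand-in target are invented for illustration; the source does not show its exact formatting code.

```python
from datasets import load_dataset

# ultrachat_200k ships pre-split; "train_sft" holds the supervised portion.
raw = load_dataset("HuggingFaceH4/ultrachat_200k", split="train_sft")

def format_example(row):
    # Hypothetical template: the doc does not specify how chats become
    # JSON-extraction pairs, so both prompt and target here are stand-ins.
    source = row["messages"][0]["content"]
    return {
        "prompt": ("Extract the key entities from the text below and "
                   "return them as a JSON object.\n\n" + source),
        "completion": row["messages"][1]["content"],
    }

sft_data = raw.select(range(10_000)).map(
    format_example, remove_columns=raw.column_names
)

# trl's DPOTrainer expects prompt/chosen/rejected columns; the source
# prompt column is assumed here to be "input" (check the dataset card).
dpo_pairs = (
    load_dataset("argilla/distilabel-intel-orca-dpo-pairs", split="train")
    .select(range(5_000))
    .rename_column("input", "prompt")
)
```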

Step 2: SFT with QLoRA

4-bit quantized base model plus LoRA adapters. 3 epochs on 10K samples with the paged AdamW 8-bit optimizer.
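
A sketch of the QLoRA setup with transformers, peft, and trl, wired to the hyperparameters listed under QLoRA Configuration below. It assumes a recent trl release (SFTConfig, prompt/completion datasets) and sft_data from the previous step.

```python
import torch
from peft import LoraConfig
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from trl import SFTConfig, SFTTrainer

bnb = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",              # NF4 data type
    bnb_4bit_use_double_quant=True,         # also quantize the quant constants
    bnb_4bit_compute_dtype=torch.bfloat16,  # use float16 on pre-Ampere GPUs (T4)
)
model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen2.5-1.5B-Instruct", quantization_config=bnb, device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-1.5B-Instruct")

lora = LoraConfig(
    r=16, lora_alpha=32, lora_dropout=0.05, task_type="CAUSAL_LM",
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
)

trainer = SFTTrainer(
    model=model,
    train_dataset=sft_data,
    peft_config=lora,                       # adapters attached by the trainer
    args=SFTConfig(
        output_dir="qwen-json-sft",
        num_train_epochs=3,
        per_device_train_batch_size=4,
        gradient_accumulation_steps=4,
        learning_rate=2e-4,
        lr_scheduler_type="cosine",
        optim="paged_adamw_8bit",
    ),
)
trainer.train()
trainer.save_model("qwen-json-sft")         # saves the adapter weights
```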

Step 3: DPO Alignment

Merge the SFT adapter, apply a fresh LoRA, and train on 5K preference pairs with beta=0.1 sigmoid loss.
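
A sketch of the alignment step. merge_and_unload() folds the SFT adapter into the base weights; passing a fresh peft_config with ref_model=None lets trl recover the reference policy by disabling the adapter. Details such as processing_class assume a recent trl release; the adapter path and variables carry over from the sketches above.

```python
import torch
from peft import LoraConfig, PeftModel
from transformers import AutoModelForCausalLM
from trl import DPOConfig, DPOTrainer

# Reload the base in 16-bit so the SFT adapter can be merged cleanly,
# then start DPO from a fresh adapter on top of the merged weights.
base = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen2.5-1.5B-Instruct", torch_dtype=torch.bfloat16, device_map="auto"
)
merged = PeftModel.from_pretrained(base, "qwen-json-sft").merge_and_unload()

fresh_lora = LoraConfig(
    r=16, lora_alpha=32, lora_dropout=0.05, task_type="CAUSAL_LM",
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
)

trainer = DPOTrainer(
    model=merged,
    ref_model=None,                 # adapter-disabled model acts as reference
    train_dataset=dpo_pairs,        # prompt/chosen/rejected triples
    processing_class=tokenizer,
    peft_config=fresh_lora,
    args=DPOConfig(
        output_dir="qwen-json-dpo",
        num_train_epochs=1,
        per_device_train_batch_size=2,
        gradient_accumulation_steps=8,
        learning_rate=5e-5,
        beta=0.1,                   # preference-loss strength; sigmoid is the default loss
    ),
)
trainer.train()
```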

Step 4: Evaluation

8 JSON extraction prompts. Measure valid JSON rate, key F1, value accuracy, and latency across all variants.
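
One reasonable way to score the first three metrics (the source does not show its scoring code; latency timing is omitted here). Invalid JSON counts as zero toward key F1 and value accuracy.

```python
import json

def try_parse(text):
    """Parsed dict, or None when the model output is not valid JSON."""
    try:
        obj = json.loads(text)
        return obj if isinstance(obj, dict) else None
    except json.JSONDecodeError:
        return None

def key_f1(pred, gold):
    # F1 over the sets of top-level keys.
    p, g = set(pred), set(gold)
    tp = len(p & g)
    if tp == 0:
        return 0.0
    precision, recall = tp / len(p), tp / len(g)
    return 2 * precision * recall / (precision + recall)

def value_accuracy(pred, gold):
    # Exact-match accuracy on the keys both objects share.
    shared = set(pred) & set(gold)
    return sum(pred[k] == gold[k] for k in shared) / len(shared) if shared else 0.0

def score(outputs, references):
    """outputs: model output strings; references: gold dicts, aligned by index."""
    parsed = [try_parse(o) for o in outputs]
    n = len(outputs)
    return {
        "valid_json_rate": sum(p is not None for p in parsed) / n,
        "key_f1": sum(key_f1(p, g) for p, g in zip(parsed, references) if p) / n,
        "value_acc": sum(value_accuracy(p, g)
                         for p, g in zip(parsed, references) if p) / n,
    }
```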

QLoRA Configuration

Base Model: Qwen/Qwen2.5-1.5B-Instruct (1.5B params)
Quantization: 4-bit NF4 with double quantization
LoRA Rank: 16 (alpha=32, dropout=0.05)
Target Modules: q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj
SFT: 3 epochs, lr=2e-4, batch=4, grad_accum=4, cosine schedule
DPO: 1 epoch, lr=5e-5, batch=2, grad_accum=8, beta=0.1
Optimizer: Paged AdamW 8-bit
Hardware: NVIDIA T4 (16GB VRAM)

Before & After Results

Same 8 evaluation prompts, same hardware, three model variants.

Detailed Comparison

| Stage | Train Loss | Eval Loss |
|---|---|---|
| SFT Training | -- | -- |
| DPO Training | -- | -- |

Datasets

Curated, high-quality datasets for each training stage.

SFT: ultrachat_200k (HuggingFaceH4/ultrachat_200k)

Heavily filtered subset of the synthetic UltraChat corpus of multi-turn conversations; widely used in open-source instruction model training. 10K samples used.

SFT (Alt): Nemotron Instruction Following (nvidia/Nemotron-Instruction-Following-Chat-v1)

Large, commercially usable instruction-following dataset from NVIDIA, with verifier-filtered conversations and structured-output examples.

DPO: distilabel-intel-orca-dpo-pairs (argilla/distilabel-intel-orca-dpo-pairs)

Curated preference dataset with prompt/chosen/rejected triples, cleaned and enriched from Intel/orca_dpo_pairs for DPO alignment. 5K pairs used.

Eval: GSM8K (openai/gsm8k)

8.5K grade-school math word problems for testing multi-step reasoning; used as a supplementary evaluation benchmark.

Why LoRA + DPO?

Efficient fine-tuning under production constraints.

QLoRA Efficiency

Full fine-tuning of a 1.5B model requires ~12GB VRAM and updates all parameters. QLoRA quantizes the base weights to 4-bit and trains only ~0.6% of parameters via low-rank adapters. Result: the job fits on a free T4 GPU with minimal quality loss versus full fine-tuning.
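
A quick check of the trainable-parameter fraction once adapters are attached (model and lora come from the SFT sketch above; peft also exposes a helper for the same report):

```python
from peft import get_peft_model

peft_model = get_peft_model(model, lora)  # standalone check; SFTTrainer does
peft_model.print_trainable_parameters()   # this internally via peft_config

trainable = sum(p.numel() for p in peft_model.parameters() if p.requires_grad)
total = sum(p.numel() for p in peft_model.parameters())
print(f"{trainable / total:.2%} of parameters are trainable")
```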

SFT vs DPO

SFT teaches the model what to output. DPO teaches it which output is better. SFT alone may produce valid JSON but with wrong field choices. DPO uses preference pairs to steer the model toward outputs humans actually prefer — better key naming, more complete extraction.
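
What one preference pair looks like in the prompt/chosen/rejected format trl's DPOTrainer consumes; the field values here are invented to illustrate the key-naming and completeness distinction.

```python
pair = {
    "prompt": "Extract the invoice fields from: "
              "'Invoice #1042, due 2024-03-01, total $250.'",
    # Preferred: descriptive keys, every field captured.
    "chosen": '{"invoice_number": 1042, "due_date": "2024-03-01", "total_usd": 250}',
    # Dispreferred: vague keys, missing the total.
    "rejected": '{"number": 1042, "date": "2024-03-01"}',
}
```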

Rank Selection

LoRA rank (r) controls adapter capacity. r=4 is minimal, r=64 approaches full fine-tuning. We use r=16 as the sweet spot: enough capacity for task-specific adaptation without overfitting on 10K samples. Alpha=32 (2x rank) provides stable gradients.
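
Back-of-the-envelope adapter sizing: a frozen d_out x d_in weight gains B (d_out x r) and A (r x d_in), i.e. r * (d_in + d_out) trainable parameters. The square 1536-dimensional projection below is illustrative (1536 is Qwen2.5-1.5B's hidden size).

```python
d_out = d_in = 1536                      # illustrative square projection
for r in (4, 16, 64):
    adapter = r * (d_in + d_out)         # B: d_out x r, plus A: r x d_in
    frozen = d_out * d_in                # the weight being adapted
    print(f"r={r:2d}: {adapter:7,d} adapter params "
          f"({adapter / frozen:.1%} of the frozen matrix)")
```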

Quantization Tradeoff

NF4 quantization reduces weight memory by ~4x versus 16-bit. The double quantization flag further compresses the quantization constants. Compute stays in 16-bit (bfloat16 where supported; float16 on pre-Ampere GPUs such as the T4) for numerical stability. Measured quality loss vs full precision: <1% on most benchmarks.
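
Rough weight-memory arithmetic behind the ~4x figure, ignoring activations, optimizer state, and the KV cache; the ~0.4 bits/param saving from double quantization is the figure reported in the QLoRA paper.

```python
params = 1.5e9                                            # Qwen2.5-1.5B weight count
print(f"16-bit weights: {params * 2 / 2**30:.2f} GiB")    # 2 bytes per param
print(f"NF4 weights:    {params * 0.5 / 2**30:.2f} GiB")  # 4 bits per param
# Double quantization compresses the per-block quantization constants,
# saving roughly 0.4 bits per parameter on top of the 4-bit weights.
```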

Fine-Tuning Approaches Compared

| Approach | VRAM Required | Params Trained | Quality | Training Time |
|---|---|---|---|---|
| Full Fine-Tune | ~12 GB | 100% | Best | Hours |
| LoRA (16-bit) | ~8 GB | ~0.6% | Near-best | ~1 hour |
| QLoRA (4-bit) | ~4 GB | ~0.6% | Excellent | ~45 min |
| Prompt Engineering | 0 GB | 0% | Limited | Minutes |

Model Variant Summary

Head-to-head comparison on 8 JSON extraction evaluation prompts.

| Variant | Valid JSON | Key F1 | Value Acc | Latency |
|---|---|---|---|---|
| Base (Qwen2.5-1.5B) | -- | -- | -- | -- |
| + SFT (QLoRA) | -- | -- | -- | -- |
| + SFT + DPO | -- | -- | -- | -- |