Fine-tune Qwen2.5-1.5B for JSON extraction using QLoRA (4-bit) supervised fine-tuning, then align with Direct Preference Optimization. Before-and-after metrics on identical evaluation prompts.
Four-stage pipeline from base model to preference-aligned fine-tune:

1. **Data prep:** Load ultrachat_200k and format it as JSON extraction instructions; prepare DPO preference pairs.
2. **SFT:** 4-bit quantized base model + LoRA adapters, trained for 3 epochs on 10K samples with the paged AdamW 8-bit optimizer.
3. **DPO:** Merge the SFT adapter, apply a fresh LoRA, and train on 5K preference pairs with beta=0.1 sigmoid loss.
4. **Evaluation:** 8 JSON extraction prompts; measure valid JSON rate, key F1, value accuracy, and latency across all variants.
Same 8 evaluation prompts, same hardware, three model variants.
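The data-prep stage can be sketched as follows. This is a minimal illustration, assuming a simple instruction template and hand-built preference pairs; the field layout (`prompt`/`chosen`/`rejected`) follows the convention TRL's `DPOTrainer` expects, but the template wording and example texts are hypothetical.

```python
import json

# Hypothetical instruction template for the JSON-extraction task.
SFT_TEMPLATE = (
    "Extract the entities from the text below as a JSON object "
    "with keys 'name', 'date', and 'amount'.\n\nText: {text}"
)

def make_sft_sample(text: str, extracted: dict) -> dict:
    """One supervised sample: instruction in, JSON string out."""
    return {
        "prompt": SFT_TEMPLATE.format(text=text),
        "completion": json.dumps(extracted),
    }

def make_dpo_pair(prompt: str, chosen: dict, rejected: dict) -> dict:
    """One preference pair in the prompt/chosen/rejected layout DPO expects."""
    return {
        "prompt": prompt,
        "chosen": json.dumps(chosen),
        "rejected": json.dumps(rejected),
    }

sample = make_sft_sample(
    "Acme invoiced $1,200 on 2024-03-01.",
    {"name": "Acme", "date": "2024-03-01", "amount": 1200},
)
pair = make_dpo_pair(
    sample["prompt"],
    chosen={"name": "Acme", "date": "2024-03-01", "amount": 1200},
    rejected={"company": "Acme"},  # wrong key name, incomplete extraction
)
```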
Curated, high-quality datasets for each training stage.
- **HuggingFaceH4/ultrachat_200k**: High-quality multi-turn chat conversations filtered from ShareGPT-style logs. Widely used in open-source instruction model training. 10K samples used.
- **nvidia/Nemotron-Instruction-Following-Chat-v1**: Large, commercially usable instruction-following dataset from NVIDIA. Verifier-filtered conversations and structured-output examples.
- **argilla/distilabel-intel-orca-dpo-pairs**: Curated preference dataset with prompt/chosen/rejected triples, cleaned and enriched from Intel/orca_dpo_pairs for DPO alignment. 5K pairs used.
- **openai/gsm8k**: 8.5K grade-school math word problems for testing multi-step reasoning. Used as a supplementary evaluation benchmark.

Efficient fine-tuning for production constraints.
Full fine-tuning of a 1.5B model requires ~12 GB of VRAM and updates all parameters. QLoRA quantizes the base model to 4-bit and trains only ~0.6% of parameters via low-rank adapters. Result: the job fits on a free T4 GPU with minimal quality loss versus full fine-tuning.
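To see why the trainable fraction is so small, a back-of-the-envelope count helps. The sketch below assumes Qwen2.5-1.5B-like shapes (hidden size 1536, 28 layers, MLP width 8960, grouped-query attention with a 256-wide KV projection) and r=16 adapters on every linear layer; these shapes are assumptions for illustration, not read from the model config.

```python
# Rough LoRA parameter count for a Qwen2.5-1.5B-like model.
# All shapes below are illustrative assumptions.
hidden, mlp, layers, r = 1536, 8960, 28, 16
kv = 256  # assumed grouped-query KV projection width

def lora_params(d_in, d_out, rank=r):
    # An adapter on a (d_in x d_out) linear adds rank * (d_in + d_out) params.
    return rank * (d_in + d_out)

per_layer = (
    lora_params(hidden, hidden)    # q_proj
    + lora_params(hidden, kv)      # k_proj
    + lora_params(hidden, kv)      # v_proj
    + lora_params(hidden, hidden)  # o_proj
    + lora_params(hidden, mlp)     # gate_proj
    + lora_params(hidden, mlp)     # up_proj
    + lora_params(mlp, hidden)     # down_proj
)
total_lora = per_layer * layers
fraction = total_lora / 1.54e9     # vs ~1.54B base parameters
print(f"{total_lora / 1e6:.1f}M adapter params, {fraction:.2%} of the model")
```

Which modules get adapters matters: wrapping only the attention projections lands well under 0.5%, while adapting every linear layer pushes past 1%, so a figure like ~0.6% corresponds to an intermediate choice of target modules.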
SFT teaches the model what to output. DPO teaches it which output is better. SFT alone may produce valid JSON but with wrong field choices. DPO uses preference pairs to steer the model toward outputs humans actually prefer — better key naming, more complete extraction.
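The DPO sigmoid loss itself is compact enough to write out. A minimal sketch, assuming summed per-token log-probabilities for each completion are already available (in practice TRL's `DPOTrainer` computes these from the policy and a frozen reference model):

```python
import math

def dpo_sigmoid_loss(policy_chosen_logp, policy_rejected_logp,
                     ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """-log sigmoid(beta * (chosen log-ratio - rejected log-ratio))."""
    chosen_logratio = policy_chosen_logp - ref_chosen_logp
    rejected_logratio = policy_rejected_logp - ref_rejected_logp
    margin = beta * (chosen_logratio - rejected_logratio)
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

# When the policy matches the reference, the margin is 0 and loss = ln 2.
print(dpo_sigmoid_loss(-10.0, -12.0, -10.0, -12.0))  # ≈ 0.693
```

The loss drops below ln 2 only when the policy assigns the chosen completion a larger log-probability gain over the reference than it assigns the rejected one, which is exactly the "steer toward preferred outputs" behavior described above.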
LoRA rank (r) controls adapter capacity. r=4 is minimal, r=64 approaches full fine-tuning. We use r=16 as the sweet spot: enough capacity for task-specific adaptation without overfitting on 10K samples. Alpha=32 (2x rank) provides stable gradients.
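With those numbers, the adapter setup might look like the following sketch using peft's `LoraConfig`. The dropout value and the target-module list are assumptions (typical choices for Qwen-family attention projections), not taken from the text above.

```python
from peft import LoraConfig

lora_config = LoraConfig(
    r=16,              # adapter rank: capacity vs. overfitting trade-off
    lora_alpha=32,     # 2x rank for stable gradient scaling
    lora_dropout=0.05, # assumption: a common default, not from the text
    bias="none",
    task_type="CAUSAL_LM",
    # assumption: which projections to adapt; adjust for your model
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
)
```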
NF4 quantization reduces model size by ~4x. The double quantization flag further compresses quantization constants. Compute stays in bfloat16 for numerical stability. Measured quality loss vs full precision: <1% on most benchmarks.
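In `transformers`, that quantization recipe corresponds to a `BitsAndBytesConfig` along these lines (a sketch of the standard bitsandbytes 4-bit flags):

```python
import torch
from transformers import BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",              # NormalFloat4 data type
    bnb_4bit_use_double_quant=True,         # also compress quantization constants
    bnb_4bit_compute_dtype=torch.bfloat16,  # matmuls run in bf16 for stability
)
```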
| Approach | VRAM Required | Params Trained | Quality | Training Time |
|---|---|---|---|---|
| Full Fine-Tune | ~12 GB | 100% | Best | Hours |
| LoRA (16-bit) | ~8 GB | ~0.6% | Near-best | ~1 hour |
| QLoRA (4-bit) | ~4 GB | ~0.6% | Excellent | ~45 min |
| Prompt Engineering | 0 GB | 0% | Limited | Minutes |
Head-to-head comparison on 8 JSON extraction evaluation prompts.
| Variant | Valid JSON | Key F1 | Value Acc | Latency |
|---|---|---|---|---|
| Base (Qwen2.5-1.5B) | -- | -- | -- | -- |
| + SFT (QLoRA) | -- | -- | -- | -- |
| + SFT + DPO | -- | -- | -- | -- |
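Each metric in the table takes only a few lines to compute. A minimal sketch, assuming gold and predicted extractions are flat dicts; the helper names are hypothetical:

```python
import json

def valid_json_rate(outputs):
    """Fraction of raw model outputs that parse as JSON."""
    ok = 0
    for text in outputs:
        try:
            json.loads(text)
            ok += 1
        except json.JSONDecodeError:
            pass
    return ok / len(outputs)

def key_f1(gold: dict, pred: dict) -> float:
    """F1 over the sets of top-level keys."""
    g, p = set(gold), set(pred)
    if not g and not p:
        return 1.0
    tp = len(g & p)
    if tp == 0:
        return 0.0
    precision, recall = tp / len(p), tp / len(g)
    return 2 * precision * recall / (precision + recall)

def value_accuracy(gold: dict, pred: dict) -> float:
    """Among keys both dicts share, fraction with exactly matching values."""
    shared = set(gold) & set(pred)
    if not shared:
        return 0.0
    return sum(gold[k] == pred[k] for k in shared) / len(shared)
```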