Together AI demonstrates fine-tuned open-source LLMs can outperform GPT-5.2 as evaluation judges using just 5,400 preference pairs, slashing costs dramatically.

Open-Source AI Judges Beat GPT-5.2 at 15x Lower Cost Using DPO Fine-Tuning

2026/02/03 03:30
3 min read


Luisa Crawford Feb 02, 2026 19:30


Fine-tuned open-source large language models can now outperform OpenAI's GPT-5.2 at evaluating AI outputs, at a fraction of the cost. Together AI released research showing its GPT-OSS 120B model achieved 62.63% accuracy on human preference alignment after Direct Preference Optimization (DPO) training, surpassing GPT-5.2's 61.62% baseline while running 14x faster and costing 15x less per token.

The findings matter for any organization running AI evaluation pipelines at scale. GPT-5.2 currently charges $1.75 per million input tokens and $14 per million output tokens. The fine-tuned GPT-OSS 120B? Just $0.15 and $0.60 respectively.
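The per-category price ratios are easy to check from the quoted numbers; this short sketch works them out (the 90/10 token split used to illustrate the blended figure is an assumption, not from the article):

```python
# Per-million-token prices quoted in the article (USD).
gpt52 = {"input": 1.75, "output": 14.00}
gpt_oss = {"input": 0.15, "output": 0.60}

# How many times cheaper the fine-tuned open model is per category.
ratios = {k: gpt52[k] / gpt_oss[k] for k in gpt52}
print(ratios)  # input ~11.7x, output ~23.3x cheaper

# Blended savings depend on the token mix of a judging call; an
# input-heavy mix (assumed 90% input / 10% output here) lands near
# the quoted ~15x.
mix = {"input": 0.9, "output": 0.1}
blended = sum(mix[k] * gpt52[k] for k in mix) / sum(mix[k] * gpt_oss[k] for k in mix)
print(round(blended, 1))
```

Judge prompts tend to be input-heavy (the prompt plus both candidate responses go in; a short verdict comes out), which is consistent with the article's blended 15x claim.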

The Training Approach

Together AI used DPO, a technique introduced in 2023 that bypasses the complex reinforcement-learning loop of traditional RLHF. Instead of training a separate reward model, DPO directly adjusts the language model's weights using preference pairs—one preferred response, one rejected response for each prompt.
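For readers unfamiliar with the objective, here is a minimal sketch of the per-pair DPO loss in plain Python (function and argument names are illustrative; real training sums log-probabilities over response tokens in batches):

```python
import math

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """DPO loss for one preference pair.

    Each argument is the total log-probability of a full response under
    the trainable policy or the frozen reference model. beta controls
    how far the policy may drift from the reference.
    """
    # How much more the policy prefers chosen over rejected,
    # relative to the reference model's preference.
    margin = (policy_chosen_logp - ref_chosen_logp) \
           - (policy_rejected_logp - ref_rejected_logp)
    # Negative log-sigmoid: small loss when the margin is positive.
    return -math.log(1.0 / (1.0 + math.exp(-beta * margin)))
```

When the policy and reference agree exactly, the margin is zero and the loss is ln 2; as the policy learns to prefer the chosen response more strongly than the reference does, the loss falls toward zero.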

The training data came from RewardBench 2, a benchmark containing examples with human-labeled preferred and rejected responses across six categories: safety, factuality, math, precise instruction following, focus, and ties. From roughly 1,500 training examples, the team generated 5,407 preference pairs.
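Going from roughly 1,500 examples to 5,407 pairs implies each example contributed several pairs, consistent with pairing the one preferred response against each of several rejected ones. A hypothetical sketch (field names are assumptions, not the benchmark's actual schema):

```python
def make_preference_pairs(examples):
    """Expand judge-training examples into (prompt, chosen, rejected)
    triples: one pair per rejected response, so an example with 3-4
    rejected responses yields 3-4 pairs."""
    pairs = []
    for ex in examples:
        for rejected in ex["rejected"]:
            pairs.append((ex["prompt"], ex["chosen"], rejected))
    return pairs

examples = [{"prompt": "What is 2+2?", "chosen": "4", "rejected": ["5", "22"]}]
print(len(make_preference_pairs(examples)))  # 2 pairs from 1 example
```

Under this scheme, ~1,500 examples averaging about 3.6 rejected responses each would produce the reported 5,407 pairs.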

Training took just 1.5 hours for GPT-OSS 120B using LoRA (Low-Rank Adaptation) with a learning rate of 5e-6 over three epochs.
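A training setup mirroring the reported hyperparameters (LoRA adapters, learning rate 5e-6, three epochs) might look like the following. This is a sketch, not Together AI's actual script: the LoRA rank/alpha and DPO beta are assumptions the article does not report, and the API names follow recent versions of Hugging Face TRL and PEFT, which may differ across releases.

```python
from peft import LoraConfig
from trl import DPOConfig, DPOTrainer

# Low-Rank Adaptation: train small adapter matrices instead of
# all 120B weights. r and lora_alpha are assumed values.
peft_config = LoraConfig(r=16, lora_alpha=32, task_type="CAUSAL_LM")

args = DPOConfig(
    learning_rate=5e-6,       # as reported
    num_train_epochs=3,       # as reported
    beta=0.1,                 # DPO temperature (assumed, not reported)
    output_dir="gpt-oss-judge-dpo",
)

trainer = DPOTrainer(
    model="openai/gpt-oss-120b",     # base model identifier
    args=args,
    train_dataset=preference_pairs,  # columns: prompt, chosen, rejected
    peft_config=peft_config,
)
trainer.train()
```

With LoRA only a small fraction of parameters receives gradients, which is what makes a 1.5-hour fine-tune of a 120B-parameter model plausible.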

Where Open Models Excel

The category-level breakdown reveals where fine-tuning delivered the biggest wins. GPT-OSS 120B after DPO beat GPT-5.2 on math evaluation by 10.3 percentage points and on focus (response quality assessment) by 6.3 points.

Safety evaluation proved easiest across all models, averaging 91.32% accuracy—unsurprising given these models undergo extensive safety training. Factuality detection hit 85.23%. The hardest category? Focus, where models averaged just 10.13% accuracy, highlighting how subjective quality judgments remain challenging.

One wrinkle: Qwen3 235B, which already beat GPT-5.2 out of the box at 62.63%, actually regressed slightly to 61.28% after fine-tuning. Not every model benefits from additional training, reinforcing that validation remains essential.

The Broader Implications

The "LLM-as-a-judge" paradigm has become standard for evaluating AI outputs at scale because judging is fundamentally simpler than generating. A model generating a response must juggle context, follow multi-step instructions, and synthesize information. Evaluating that response is a focused classification task.
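That "focused classification task" framing can be made concrete with a minimal pairwise-judge harness; the prompt wording and parsing below are illustrative, not the paper's actual template:

```python
# Hypothetical pairwise judge prompt: the model sees both candidate
# responses and must emit a single-letter verdict.
JUDGE_TEMPLATE = """You are an impartial judge. Given a prompt and two
responses, decide which response is better.

Prompt: {prompt}

Response A: {a}

Response B: {b}

Answer with exactly one letter, A or B."""

def parse_verdict(completion):
    """Map the judge model's completion onto a two-way label.

    Judging reduces open-ended generation to classification: anything
    other than a clean A/B is treated as unparseable."""
    verdict = completion.strip().upper()
    return verdict if verdict in {"A", "B"} else None

print(parse_verdict(" b\n"))  # B
```

Accuracy on a benchmark like RewardBench 2 then amounts to how often the parsed verdict matches the human-labeled preferred response.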

This research suggests organizations can build evaluation pipelines using open-source models they control entirely—no API dependencies, full visibility into model behavior, and the ability to fine-tune for specific domains. The cost savings at production scale are substantial.

Together AI published the full methodology in a cookbook notebook for teams wanting to replicate the approach with their own preference data.

  • ai
  • llm
  • dpo
  • machine-learning
  • open-source
