Together AI demonstrates fine-tuned open-source LLMs can outperform GPT-5.2 as evaluation judges using just 5,400 preference pairs, slashing costs dramatically.

Open-Source AI Judges Beat GPT-5.2 at 15x Lower Cost Using DPO Fine-Tuning

2026/02/03 03:30
3 min read


Luisa Crawford Feb 02, 2026 19:30


Fine-tuned open-source large language models can now outperform OpenAI's GPT-5.2 at evaluating AI outputs, at a fraction of the cost. Together AI released research showing its GPT-OSS 120B model achieved 62.63% accuracy on human preference alignment after Direct Preference Optimization (DPO) training, surpassing GPT-5.2's 61.62% baseline while running 14x faster and costing 15x less per token.

The findings matter for any organization running AI evaluation pipelines at scale. GPT-5.2 currently charges $1.75 per million input tokens and $14 per million output tokens. The fine-tuned GPT-OSS 120B? Just $0.15 and $0.60 respectively.
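The per-category price ratios are easy to check from the quoted numbers; this short sketch works them out (the 90/10 token split used to illustrate the blended figure is an assumption, not from the article):

```python
# Per-million-token prices quoted in the article (USD).
gpt52 = {"input": 1.75, "output": 14.00}
gpt_oss = {"input": 0.15, "output": 0.60}

# How many times cheaper the fine-tuned open model is per category.
ratios = {k: gpt52[k] / gpt_oss[k] for k in gpt52}
print(ratios)  # input ~11.7x, output ~23.3x cheaper

# Blended savings depend on the token mix of a judging call; an
# input-heavy mix (assumed 90% input / 10% output here) lands near
# the quoted ~15x.
mix = {"input": 0.9, "output": 0.1}
blended = sum(mix[k] * gpt52[k] for k in mix) / sum(mix[k] * gpt_oss[k] for k in mix)
print(round(blended, 1))
```

Judge prompts tend to be input-heavy (the prompt plus both candidate responses go in; a short verdict comes out), which is consistent with the article's blended 15x claim.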

The Training Approach

Together AI used DPO, a technique introduced in 2023 that bypasses the complex reinforcement-learning loop of traditional RLHF. Instead of training a separate reward model, DPO directly adjusts the language model's weights using preference pairs—one preferred response, one rejected response for each prompt.
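For readers unfamiliar with the objective, here is a minimal sketch of the per-pair DPO loss in plain Python (function and argument names are illustrative; real training sums log-probabilities over response tokens in batches):

```python
import math

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """DPO loss for one preference pair.

    Each argument is the total log-probability of a full response under
    the trainable policy or the frozen reference model. beta controls
    how far the policy may drift from the reference.
    """
    # How much more the policy prefers chosen over rejected,
    # relative to the reference model's preference.
    margin = (policy_chosen_logp - ref_chosen_logp) \
           - (policy_rejected_logp - ref_rejected_logp)
    # Negative log-sigmoid: small loss when the margin is positive.
    return -math.log(1.0 / (1.0 + math.exp(-beta * margin)))
```

When the policy and reference agree exactly, the margin is zero and the loss is ln 2; as the policy learns to prefer the chosen response more strongly than the reference does, the loss falls toward zero.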

The training data came from RewardBench 2, a benchmark containing examples with human-labeled preferred and rejected responses across six categories: safety, factuality, math, precise instruction following, focus, and ties. From roughly 1,500 training examples, the team generated 5,407 preference pairs.
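Going from roughly 1,500 examples to 5,407 pairs implies each example contributed several pairs, consistent with pairing the one preferred response against each of several rejected ones. A hypothetical sketch (field names are assumptions, not the benchmark's actual schema):

```python
def make_preference_pairs(examples):
    """Expand judge-training examples into (prompt, chosen, rejected)
    triples: one pair per rejected response, so an example with 3-4
    rejected responses yields 3-4 pairs."""
    pairs = []
    for ex in examples:
        for rejected in ex["rejected"]:
            pairs.append((ex["prompt"], ex["chosen"], rejected))
    return pairs

examples = [{"prompt": "What is 2+2?", "chosen": "4", "rejected": ["5", "22"]}]
print(len(make_preference_pairs(examples)))  # 2 pairs from 1 example
```

Under this scheme, ~1,500 examples averaging about 3.6 rejected responses each would produce the reported 5,407 pairs.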

Training took just 1.5 hours for GPT-OSS 120B using LoRA (Low-Rank Adaptation) with a learning rate of 5e-6 over three epochs.
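A training setup mirroring the reported hyperparameters (LoRA adapters, learning rate 5e-6, three epochs) might look like the following. This is a sketch, not Together AI's actual script: the LoRA rank/alpha and DPO beta are assumptions the article does not report, and the API names follow recent versions of Hugging Face TRL and PEFT, which may differ across releases.

```python
from peft import LoraConfig
from trl import DPOConfig, DPOTrainer

# Low-Rank Adaptation: train small adapter matrices instead of
# all 120B weights. r and lora_alpha are assumed values.
peft_config = LoraConfig(r=16, lora_alpha=32, task_type="CAUSAL_LM")

args = DPOConfig(
    learning_rate=5e-6,       # as reported
    num_train_epochs=3,       # as reported
    beta=0.1,                 # DPO temperature (assumed, not reported)
    output_dir="gpt-oss-judge-dpo",
)

trainer = DPOTrainer(
    model="openai/gpt-oss-120b",     # base model identifier
    args=args,
    train_dataset=preference_pairs,  # columns: prompt, chosen, rejected
    peft_config=peft_config,
)
trainer.train()
```

With LoRA only a small fraction of parameters receives gradients, which is what makes a 1.5-hour fine-tune of a 120B-parameter model plausible.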

Where Open Models Excel

The category-level breakdown reveals where fine-tuning delivered the biggest wins. GPT-OSS 120B after DPO beat GPT-5.2 on math evaluation by 10.3 percentage points and on focus (response quality assessment) by 6.3 points.

Safety evaluation proved easiest across all models, averaging 91.32% accuracy—unsurprising given these models undergo extensive safety training. Factuality detection hit 85.23%. The hardest category? Focus, where models averaged just 10.13% accuracy, highlighting how subjective quality judgments remain challenging.

One wrinkle: Qwen3 235B, which already beat GPT-5.2 out of the box at 62.63%, actually regressed slightly to 61.28% after fine-tuning. Not every model benefits from additional training, reinforcing that validation remains essential.

The Broader Implications

The "LLM-as-a-judge" paradigm has become standard for evaluating AI outputs at scale because judging is fundamentally simpler than generating. A model generating a response must juggle context, follow multi-step instructions, and synthesize information. Evaluating that response is a focused classification task.
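That "focused classification task" framing can be made concrete with a minimal pairwise-judge harness; the prompt wording and parsing below are illustrative, not the paper's actual template:

```python
# Hypothetical pairwise judge prompt: the model sees both candidate
# responses and must emit a single-letter verdict.
JUDGE_TEMPLATE = """You are an impartial judge. Given a prompt and two
responses, decide which response is better.

Prompt: {prompt}

Response A: {a}

Response B: {b}

Answer with exactly one letter, A or B."""

def parse_verdict(completion):
    """Map the judge model's completion onto a two-way label.

    Judging reduces open-ended generation to classification: anything
    other than a clean A/B is treated as unparseable."""
    verdict = completion.strip().upper()
    return verdict if verdict in {"A", "B"} else None

print(parse_verdict(" b\n"))  # B
```

Accuracy on a benchmark like RewardBench 2 then amounts to how often the parsed verdict matches the human-labeled preferred response.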

This research suggests organizations can build evaluation pipelines using open-source models they control entirely—no API dependencies, full visibility into model behavior, and the ability to fine-tune for specific domains. The cost savings at production scale are substantial.

Together AI published the full methodology in a cookbook notebook for teams wanting to replicate the approach with their own preference data.

  • ai
  • llm
  • dpo
  • machine-learning
  • open-source
