CRITICBENCH is a benchmark designed to test AI models using data that exposes subtle weaknesses in reasoning. Instead of focusing on obvious mistakes, it samples “convincing wrong answers”—responses that appear correct but contain hidden flaws—alongside correct outputs with varied complexity. By filtering low-quality models, emphasizing reasoning steps, and using nuanced sampling strategies across datasets like GSM8K, HumanEval, and TruthfulQA, CRITICBENCH offers a rigorous way to compare strong versus weak LLMs.

Why “Almost Right” Answers Are the Hardest Test for AI

Abstract and 1. Introduction

  2. Definition of Critique Ability

  3. Construction of CriticBench

    3.1 Data Generation

    3.2 Data Selection

  4. Properties of Critique Ability

    4.1 Scaling Law

    4.2 Self-Critique Ability

    4.3 Correlation to Certainty

  5. New Capacity with Critique: Self-Consistency with Self-Check

  6. Conclusion, References, and Acknowledgments

A. Notations

B. CriticBench: Sources of Queries

C. CriticBench: Data Generation Details

D. CriticBench: Data Selection Details

E. CriticBench: Statistics and Examples

F. Evaluation Settings

D CRITICBENCH: DATA SELECTION DETAILS

D.1 SAMPLING FROM CONVINCING WRONG-ANSWERS

The term convincing wrong-answer was coined by Lightman et al. (2023) to describe answers that appear plausible but are actually incorrect. Such answers are often partially correct but contain subtle errors that ultimately lead to wrong conclusions. They are harder for LLMs to assess accurately than answers with more obvious mistakes, and consequently serve as valuable evaluation examples for distinguishing stronger models from weaker ones.

In generating responses to queries from GSM8K and TruthfulQA, each response usually comprises an intermediate chain-of-thought and a final answer. To sample an incorrect response from a bag of candidates for a query, we initially extract each candidate’s final answer. Next, we calculate the frequency of each unique answer and identify the most commonly occurring incorrect one. If no incorrect answers are present, the query is omitted, as it is too easy to offer evaluative value. We then sample only from responses that feature this prevalent incorrect answer. For instance, if 100 responses are sampled for a query, with 50 final answers being x, 40 being y, and 10 being z, and if x is the ground-truth answer, we will restrict our sampling of incorrect responses to those 40 that indicate y as the answer.
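The selection above can be sketched as a short routine. This is a minimal illustration rather than the authors’ code; `extract_final_answer` is a hypothetical parser that pulls the final answer out of a chain-of-thought response.

```python
from collections import Counter

def sample_convincing_wrong(responses, ground_truth, extract_final_answer, k=1):
    """Restrict sampling to responses whose final answer is the most
    frequent *incorrect* one; return None if every answer is correct
    (the query is then dropped as too easy).

    `extract_final_answer` is a hypothetical helper, not from the paper.
    """
    answers = [extract_final_answer(r) for r in responses]
    # Count only the answers that disagree with the ground truth.
    wrong_counts = Counter(a for a in answers if a != ground_truth)
    if not wrong_counts:
        return None  # no incorrect candidates: omit the query
    top_wrong, _ = wrong_counts.most_common(1)[0]
    # Keep only responses ending in the most common wrong answer.
    pool = [r for r, a in zip(responses, answers) if a == top_wrong]
    return pool[:k]
```

In the paper’s example (50 × x, 40 × y, 10 × z with x correct), the pool would contain exactly the 40 responses ending in y.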

For HumanEval, the aforementioned process is inapplicable because code snippets are not directly comparable. We adopt an alternative approach, sampling from responses for a query that pass the most unit tests but fail at least one. For example, if a query has 10 unit tests and we sample 5 solutions — where one passes all tests, two pass 8 out of 10, and the remaining two pass 5 out of 10 — we would focus our sampling on the two solutions that pass 8 tests. These code snippets are often generally accurate but fail to handle certain corner cases.
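The unit-test-based variant can be sketched in the same spirit. This is a hedged illustration, not the authors’ implementation: it assumes a sandboxed `run_test(solution, test)` harness that returns True on pass.

```python
def select_near_miss_solutions(solutions, unit_tests, run_test):
    """Keep the candidate solutions that pass the most unit tests while
    still failing at least one; fully passing solutions are excluded.

    `run_test(solution, test)` is an assumed sandboxed harness
    returning True when the solution passes the given test.
    """
    scored = []
    for sol in solutions:
        passed = sum(run_test(sol, t) for t in unit_tests)
        if passed < len(unit_tests):  # must fail at least one test
            scored.append((passed, sol))
    if not scored:
        return []  # every candidate passed everything
    best = max(p for p, _ in scored)
    return [sol for p, sol in scored if p == best]
```

With the paper’s example (one solution passing 10/10, two passing 8/10, two passing 5/10), this keeps exactly the two 8/10 near-misses.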

D.2 COMPLEXITY-BASED SELECTION

Fu et al. (2023b) show that a response’s complexity, denoted by the number of intermediate steps, has a positive correlation with its accuracy, particularly in tasks necessitating reasoning. To leverage this finding, we employ a complexity-based sampling strategy when selecting from either correct or commonly incorrect responses.

Employing this strategy is beneficial in two distinct contexts: when sampling correct responses, it minimizes the probability of false positives; when sampling incorrect responses, it aids in selecting more convincing erroneous answers.
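As a sketch, complexity-based selection might simply rank candidates by step count and keep the most complex ones. The paper does not give code for this, and `count_steps` below is a hypothetical counter (e.g., the number of newline-separated chain-of-thought lines before the final answer).

```python
def complexity_select(responses, count_steps, top_k=1):
    """Rank candidate responses by their number of intermediate
    reasoning steps and keep the most complex ones.

    `count_steps` is an assumed helper that measures a response's
    complexity, e.g. by counting chain-of-thought lines.
    """
    ranked = sorted(responses, key=count_steps, reverse=True)
    return ranked[:top_k]
```

Applied to correct candidates, this favors responses whose longer derivations make a false positive less likely; applied to the pool of common wrong answers, it favors the most convincing erroneous ones.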

D.3 FILTERING BY GENERATOR

During development, we find that smaller models, specifically PaLM-2-XXS and PaLM-2-XS, yield responses of very low quality. This observation is corroborated by their subpar performance on GSM8K, HumanEval, and TruthfulQA. Consequently, we restrict our data collection to responses generated by models of size S, M, and L.

D.4 CERTAINTY-BASED SELECTION

E CRITICBENCH: STATISTICS AND EXAMPLES

E.1 STATISTICS

Table 2 presents the detailed statistics of CRITICBENCH and each subset.

Table 2: The statistics of CRITICBENCH and each subset.

E.2 EXAMPLES

Figures 8, 9, and 10 provide examples from CRITICBENCH.


:::info Authors:

(1) Liangchen Luo, Google Research (luolc@google.com);

(2) Zi Lin, UC San Diego;

(3) Yinxiao Liu, Google Research;

(4) Yun Zhu, Google Research;

(5) Jingbo Shang, UC San Diego;

(6) Lei Meng, Google Research (leimeng@google.com).

:::


:::info This paper is available on arxiv under CC BY 4.0 DEED license.

:::
