This article unpacks how large language models are evaluated on CRITICBENCH using few-shot chain-of-thought prompting. Unlike zero-shot methods, this approach ensures fair testing across both pretrained and instruction-tuned models by grounding judgments in principle-driven exemplars. Evaluation covers GSM8K, HumanEval, and TruthfulQA with carefully crafted prompts, multiple trials, and accuracy extracted from consistent output patterns—offering a rigorous lens into how well AI systems truly perform.This article unpacks how large language models are evaluated on CRITICBENCH using few-shot chain-of-thought prompting. Unlike zero-shot methods, this approach ensures fair testing across both pretrained and instruction-tuned models by grounding judgments in principle-driven exemplars. Evaluation covers GSM8K, HumanEval, and TruthfulQA with carefully crafted prompts, multiple trials, and accuracy extracted from consistent output patterns—offering a rigorous lens into how well AI systems truly perform.

The Prompt Patterns That Decide If an AI Is “Correct” or “Wrong”

Abstract and 1. Introduction

  1. Definition of Critique Ability

  2. Construction of CriticBench

    3.1 Data Generation

    3.2 Data Selection

  3. Properties of Critique Ability

    4.1 Scaling Law

    4.2 Self-Critique Ability

    4.3 Correlation to Certainty

  4. New Capacity with Critique: Self-Consistency with Self-Check

  5. Conclusion, References, and Acknowledgments

A. Notations

B. CriticBench: Sources of Queries

C. CriticBench: Data Generation Details

D. CriticBench: Data Selection Details

E. CriticBench: Statistics and Examples

F. Evaluation Settings

F EVALUATION SETTINGS

To evaluate large language models on CRITICBENCH, we employ few-shot chain-of-thought prompting, rather than zero-shot. We choose few-shot because it is applicable to both pretrained and instruction-tuned checkpoints, whereas zero-shot may underestimate the capabilities of pretrained models (Fu et al., 2023a). The prompt design draws inspiration from Constitutional AI (Bai

\ Figure 8: Examples from Critic-GSM8K.

\ Figure 9: Examples from Critic-HumanEval.

\ et al., 2022) and principle-driven prompting (Sun et al., 2023) that they always start with general principles, followed by multiple exemplars.

\ In the evaluation process, we use a temperature of 0.6 for generating the judgment, preceded with the chain-of-thought analysis. Each model is evaluated 8 times, and the average accuracy is reported. The few-shot exemplars always end with the pattern "Judgment: X.", where X is either correct or incorrect. We search for this pattern in the model output and extract X. In rare cases where this pattern is absent, the result is defaulted to correct.

\ Figure 10: Examples from Critic-TruthfulQA.

F.1 PROMPT FOR CRITIC-GSM8K

Listing 2 shows the 5-shot chain-of-thought prompt used to evaluate on Critic-GSM8K. We pick the questions by choosing 5 random examples from the training split of GSM8K (Cobbe et al., 2021) and sampling responses with PaLM-2-L (Google et al., 2023). We manually select the responses with appropriate quality. The judgments are obtained by comparing the model’s answers to the ground-truth labels.

\

Listing 2: 5-shot chain-of-thought prompt for Critic-GSM8K.

F.2 PROMPT FOR CRITIC-HUMANEVAL

Listing 3 presents the 3-shot chain-of-thought prompt for Critic-HumanEval. Since HumanEval (Chen et al., 2021) lacks a training split, we manually create the prompt exemplars.

\

Listing 3: 3-shot chain-of-thought prompt for Critic-HumanEval.

F.3 PROMPT FOR CRITIC-TRUTHFULQA

Listing 4 presents the 5-shot chain-of-thought prompt for Critic-TruthfulQA. Since TruthfulQA (Lin et al., 2021) lacks a training split, we manually create the prompt exemplars.

\

Listing 4: 5-shot chain-of-thought prompt for Critic-TruthfulQA.

\

:::info Authors:

(1) Liangchen Luo, Google Research (luolc@google.com);

(2) Zi Lin, UC San Diego;

(3) Yinxiao Liu, Google Research;

(4) Yun Zhu, Google Research;

(5) Jingbo Shang, UC San Diego;

(6) Lei Meng, Google Research (leimeng@google.com).

:::


:::info This paper is available on arxiv under CC BY 4.0 DEED license.

:::

\

Market Opportunity
Prompt Logo
Prompt Price(PROMPT)
$0.06729
$0.06729$0.06729
-2.26%
USD
Prompt (PROMPT) Live Price Chart
Disclaimer: The articles reposted on this site are sourced from public platforms and are provided for informational purposes only. They do not necessarily reflect the views of MEXC. All rights remain with the original authors. If you believe any content infringes on third-party rights, please contact service@support.mexc.com for removal. MEXC makes no guarantees regarding the accuracy, completeness, or timeliness of the content and is not responsible for any actions taken based on the information provided. The content does not constitute financial, legal, or other professional advice, nor should it be considered a recommendation or endorsement by MEXC.

You May Also Like

American Bitcoin’s $5B Nasdaq Debut Puts Trump-Backed Miner in Crypto Spotlight

American Bitcoin’s $5B Nasdaq Debut Puts Trump-Backed Miner in Crypto Spotlight

The post American Bitcoin’s $5B Nasdaq Debut Puts Trump-Backed Miner in Crypto Spotlight appeared on BitcoinEthereumNews.com. Key Takeaways: American Bitcoin (ABTC) surged nearly 85% on its Nasdaq debut, briefly reaching a $5B valuation. The Trump family, alongside Hut 8 Mining, controls 98% of the newly merged crypto-mining entity. Eric Trump called Bitcoin “modern-day gold,” predicting it could reach $1 million per coin. American Bitcoin, a fast-rising crypto mining firm with strong political and institutional backing, has officially entered Wall Street. After merging with Gryphon Digital Mining, the company made its Nasdaq debut under the ticker ABTC, instantly drawing global attention to both its stock performance and its bold vision for Bitcoin’s future. Read More: Trump-Backed Crypto Firm Eyes Asia for Bold Bitcoin Expansion Nasdaq Debut: An Explosive First Day ABTC’s first day of trading proved as dramatic as expected. Shares surged almost 85% at the open, touching a peak of $14 before settling at lower levels by the close. That initial spike valued the company around $5 billion, positioning it as one of 2025’s most-watched listings. At the last session, ABTC has been trading at $7.28 per share, which is a small positive 2.97% per day. Although the price has decelerated since opening highs, analysts note that the company has been off to a strong start and early investor activity is a hard-to-find feat in a newly-launched crypto mining business. According to market watchers, the listing comes at a time of new momentum in the digital asset markets. With Bitcoin trading above $110,000 this quarter, American Bitcoin’s entry comes at a time when both institutional investors and retail traders are showing heightened interest in exposure to Bitcoin-linked equities. Ownership Structure: Trump Family and Hut 8 at the Helm Its management and ownership set up has increased the visibility of the company. The Trump family and the Canadian mining giant Hut 8 Mining jointly own 98 percent…
Share
BitcoinEthereumNews2025/09/18 01:33
Bitcoin ETFs Surge with 20,685 BTC Inflows, Marking Strongest Week

Bitcoin ETFs Surge with 20,685 BTC Inflows, Marking Strongest Week

TLDR Bitcoin ETFs recorded their strongest weekly inflows since July, reaching 20,685 BTC. U.S. Bitcoin ETFs contributed nearly 97% of the total inflows last week. The surge in Bitcoin ETF inflows pushed holdings to a new high of 1.32 million BTC. Fidelity’s FBTC product accounted for 36% of the total inflows, marking an 18-month high. [...] The post Bitcoin ETFs Surge with 20,685 BTC Inflows, Marking Strongest Week appeared first on CoinCentral.
Share
Coincentral2025/09/18 02:30
Whales Shift Focus to Zero Knowledge Proof’s 3000x ROI Potential as Zcash & Toncoin’s Rally Slows Down

Whales Shift Focus to Zero Knowledge Proof’s 3000x ROI Potential as Zcash & Toncoin’s Rally Slows Down

Explore how Zero Knowledge Proof (ZKP) is reshaping personal finance, challenging banks, and standing out as one of the top crypto gainers ahead of ZCash and Toncoin
Share
coinlineup2026/01/15 13:00