This article unpacks how large language models are evaluated on CRITICBENCH using few-shot chain-of-thought prompting. Unlike zero-shot methods, this approach ensures fair testing across both pretrained and instruction-tuned models by grounding judgments in principle-driven exemplars. Evaluation covers GSM8K, HumanEval, and TruthfulQA with carefully crafted prompts, multiple trials, and accuracy extracted from consistent output patterns, offering a rigorous lens into how well AI systems truly perform.

The Prompt Patterns That Decide If an AI Is “Correct” or “Wrong”

Abstract and 1. Introduction

  2. Definition of Critique Ability

  3. Construction of CriticBench

    3.1 Data Generation

    3.2 Data Selection

  4. Properties of Critique Ability

    4.1 Scaling Law

    4.2 Self-Critique Ability

    4.3 Correlation to Certainty

  5. New Capacity with Critique: Self-Consistency with Self-Check

  6. Conclusion, References, and Acknowledgments

A. Notations

B. CriticBench: Sources of Queries

C. CriticBench: Data Generation Details

D. CriticBench: Data Selection Details

E. CriticBench: Statistics and Examples

F. Evaluation Settings

F EVALUATION SETTINGS

To evaluate large language models on CRITICBENCH, we employ few-shot chain-of-thought prompting rather than zero-shot. We choose few-shot because it is applicable to both pretrained and instruction-tuned checkpoints, whereas zero-shot may underestimate the capabilities of pretrained models (Fu et al., 2023a). The prompt design draws inspiration from Constitutional AI (Bai et al., 2022) and principle-driven prompting (Sun et al., 2023), in that the prompts always start with general principles, followed by multiple exemplars.

Figure 8: Examples from Critic-GSM8K.

Figure 9: Examples from Critic-HumanEval.
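The prompt layout described above (general principles first, then exemplars, then the item to be judged) can be sketched as follows. This is only an illustration: the `Exemplar` class, the `build_prompt` function, and the `Question:` / `Answer:` / `Analysis:` field labels are assumptions made for this example, not the wording of the actual listings. Only the closing "Judgment: X." pattern is taken from the text.

```python
from dataclasses import dataclass


@dataclass
class Exemplar:
    """One few-shot exemplar; the field names are hypothetical."""
    question: str
    answer: str
    analysis: str  # chain-of-thought critique of the answer
    judgment: str  # "correct" or "incorrect"


def build_prompt(principles: str, exemplars: list[Exemplar],
                 question: str, answer: str) -> str:
    """Principles first, then exemplars, then the open query to be judged."""
    blocks = [principles.strip()]
    for ex in exemplars:
        blocks.append(
            f"Question: {ex.question}\n"
            f"Answer: {ex.answer}\n"
            f"Analysis: {ex.analysis}\n"
            f"Judgment: {ex.judgment}."
        )
    # Leave the analysis open so the model writes its reasoning before judging.
    blocks.append(f"Question: {question}\nAnswer: {answer}\nAnalysis:")
    return "\n\n".join(blocks)
```

Ending the prompt at `Analysis:` is one way to make the model produce its reasoning before the judgment, which matches the chain-of-thought ordering of the exemplars.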

In the evaluation process, we sample with a temperature of 0.6 to generate the chain-of-thought analysis followed by the judgment. Each model is evaluated 8 times, and the average accuracy is reported. The few-shot exemplars always end with the pattern "Judgment: X.", where X is either correct or incorrect. We search for this pattern in the model output and extract X. In rare cases where this pattern is absent, the result defaults to correct.
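The extraction and averaging steps can be sketched as below. The function names are hypothetical, and the sketch assumes that the first occurrence of the pattern is the one to use, which the text does not specify. Sampling itself (temperature 0.6, 8 runs) is left to whatever model API is in use and appears here only as constants.

```python
import re

TEMPERATURE = 0.6  # sampling temperature stated in the text
NUM_TRIALS = 8     # number of evaluation runs stated in the text

# Matches the closing pattern of the exemplars: "Judgment: X."
JUDGMENT_RE = re.compile(r"Judgment:\s*(correct|incorrect)\.")


def extract_judgment(output: str) -> str:
    """Return the judged label, defaulting to "correct" if the pattern is absent."""
    match = JUDGMENT_RE.search(output)  # assumption: take the first occurrence
    return match.group(1) if match else "correct"


def accuracy(outputs: list[str], labels: list[str]) -> float:
    """Fraction of model outputs whose extracted judgment equals the label."""
    predictions = [extract_judgment(o) for o in outputs]
    return sum(p == l for p, l in zip(predictions, labels)) / len(labels)


def average_accuracy(trials: list[list[str]], labels: list[str]) -> float:
    """Mean accuracy over independent trials (one list of outputs per trial)."""
    return sum(accuracy(outputs, labels) for outputs in trials) / len(trials)
```

Note that the default-to-correct rule means an output with no recognizable judgment is scored as a hit whenever the true label is correct.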

Figure 10: Examples from Critic-TruthfulQA.

F.1 PROMPT FOR CRITIC-GSM8K

Listing 2 shows the 5-shot chain-of-thought prompt used to evaluate on Critic-GSM8K. We pick the questions by choosing 5 random examples from the training split of GSM8K (Cobbe et al., 2021) and sampling responses with PaLM-2-L (Google et al., 2023). We manually select the responses with appropriate quality. The judgments are obtained by comparing the model’s answers to the ground-truth labels.


Listing 2: 5-shot chain-of-thought prompt for Critic-GSM8K.
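As a rough illustration of how exemplar judgments can be derived by comparing a response with the ground-truth label (this is not the prompt in Listing 2), the sketch below treats the last number in a response as its final answer. That heuristic, the function names, and the fixed seed are all assumptions made for this example. In the paper the responses were also manually screened for quality.

```python
import random
import re

# A number with optional sign, thousands separators, and decimal part.
NUMBER_RE = re.compile(r"-?\d[\d,]*(?:\.\d+)?")


def final_number(text: str):
    """Return the last number in the text as a float, or None if there is none."""
    matches = NUMBER_RE.findall(text)
    if not matches:
        return None
    return float(matches[-1].replace(",", ""))


def label_response(response: str, ground_truth: str) -> str:
    """Label a response by comparing its final number with the ground truth."""
    predicted = final_number(response)
    expected = final_number(ground_truth)
    if predicted is not None and predicted == expected:
        return "correct"
    return "incorrect"


def pick_questions(train_split: list, k: int = 5, seed: int = 0) -> list:
    """Draw k random training examples; the seed is only for reproducibility."""
    return random.Random(seed).sample(train_split, k)
```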

F.2 PROMPT FOR CRITIC-HUMANEVAL

Listing 3 presents the 3-shot chain-of-thought prompt for Critic-HumanEval. Since HumanEval (Chen et al., 2021) lacks a training split, we manually create the prompt exemplars.


Listing 3: 3-shot chain-of-thought prompt for Critic-HumanEval.

F.3 PROMPT FOR CRITIC-TRUTHFULQA

Listing 4 presents the 5-shot chain-of-thought prompt for Critic-TruthfulQA. Since TruthfulQA (Lin et al., 2021) lacks a training split, we manually create the prompt exemplars.


Listing 4: 5-shot chain-of-thought prompt for Critic-TruthfulQA.


:::info Authors:

(1) Liangchen Luo, Google Research (luolc@google.com);

(2) Zi Lin, UC San Diego;

(3) Yinxiao Liu, Google Research;

(4) Yun Zhu, Google Research;

(5) Jingbo Shang, UC San Diego;

(6) Lei Meng, Google Research (leimeng@google.com).

:::


:::info This paper is available on arXiv under CC BY 4.0 DEED license.

:::
