This article unpacks how large language models are evaluated on CRITICBENCH using few-shot chain-of-thought prompting. Unlike zero-shot methods, this approach ensures fair testing across both pretrained and instruction-tuned models by grounding judgments in principle-driven exemplars. Evaluation covers GSM8K, HumanEval, and TruthfulQA with carefully crafted prompts, multiple trials, and accuracy extracted from consistent output patterns, offering a rigorous lens into how well AI systems truly perform.

The Prompt Patterns That Decide If an AI Is “Correct” or “Wrong”

Abstract and 1. Introduction

  2. Definition of Critique Ability

  3. Construction of CriticBench

    3.1 Data Generation

    3.2 Data Selection

  4. Properties of Critique Ability

    4.1 Scaling Law

    4.2 Self-Critique Ability

    4.3 Correlation to Certainty

  5. New Capacity with Critique: Self-Consistency with Self-Check

  6. Conclusion, References, and Acknowledgments

A. Notations

B. CriticBench: Sources of Queries

C. CriticBench: Data Generation Details

D. CriticBench: Data Selection Details

E. CriticBench: Statistics and Examples

F. Evaluation Settings

F EVALUATION SETTINGS

To evaluate large language models on CRITICBENCH, we employ few-shot chain-of-thought prompting rather than zero-shot prompting. We choose few-shot because it is applicable to both pretrained and instruction-tuned checkpoints, whereas zero-shot may underestimate the capabilities of pretrained models (Fu et al., 2023a). The prompt design draws inspiration from Constitutional AI (Bai et al., 2022) and principle-driven prompting (Sun et al., 2023) in that the prompts always start with general principles, followed by multiple exemplars.

Figure 8: Examples from Critic-GSM8K.

Figure 9: Examples from Critic-HumanEval.
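The principles-then-exemplars layout described in this section can be sketched as follows. This is a minimal illustration, not the paper's actual prompt: the `Exemplar` class, the `build_prompt` function, and the field labels `Question:`, `Answer:`, and `Analysis:` are assumptions made for the example. Only the closing `Judgment: X.` pattern comes from the text.

```python
from dataclasses import dataclass


@dataclass
class Exemplar:
    """One few-shot example (hypothetical structure)."""
    question: str
    response: str
    analysis: str   # chain-of-thought critique of the response
    judgment: str   # "correct" or "incorrect"


def build_prompt(principles, exemplars, question, response):
    """Assemble a prompt: general principles first, then the exemplars."""
    parts = ["\n".join(principles)]
    for ex in exemplars:
        parts.append(
            f"Question: {ex.question}\n"
            f"Answer: {ex.response}\n"
            f"Analysis: {ex.analysis}\n"
            f"Judgment: {ex.judgment}."  # every exemplar ends with this pattern
        )
    # The final block is left open so the model writes the analysis,
    # then the judgment, for the response under evaluation.
    parts.append(f"Question: {question}\nAnswer: {response}\nAnalysis:")
    return "\n\n".join(parts)
```

Because every exemplar closes with the same judgment line, a model that follows the demonstrated format produces output that can be parsed mechanically.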

In the evaluation process, we use a temperature of 0.6 to generate the judgment, which is preceded by the chain-of-thought analysis. Each model is evaluated 8 times, and the average accuracy is reported. The few-shot exemplars always end with the pattern "Judgment: X.", where X is either correct or incorrect. We search for this pattern in the model output and extract X. In rare cases where this pattern is absent, the result defaults to correct.
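The extraction and averaging steps can be sketched as below. This is an illustrative reading of the procedure, not the authors' code: the function names are hypothetical, and taking the first occurrence of the pattern (rather than the last) and matching case-insensitively are assumptions. Sampling the outputs at temperature 0.6 happens outside this snippet.

```python
import re

# Matches the "Judgment: X." pattern, where X is "correct" or "incorrect".
JUDGMENT_RE = re.compile(r"Judgment:\s*(correct|incorrect)\b", re.IGNORECASE)


def extract_judgment(output: str) -> str:
    """Return the judgment found in a model output.

    Assumption: the first occurrence is used. If the pattern is absent,
    the result defaults to "correct", as stated in the text.
    """
    match = JUDGMENT_RE.search(output)
    if match is None:
        return "correct"
    return match.group(1).lower()


def trial_accuracy(outputs, labels):
    """Accuracy of one evaluation run over all examples."""
    hits = sum(extract_judgment(o) == y for o, y in zip(outputs, labels))
    return hits / len(labels)


def average_accuracy(trials, labels):
    """Mean accuracy over repeated runs (8 in the paper's setting)."""
    return sum(trial_accuracy(t, labels) for t in trials) / len(trials)
```

Here `trials` is a list of runs, each holding one sampled output per benchmark example, and `labels` holds the gold judgments. Note that the default-to-correct rule counts a malformed output as a hit whenever the gold label is correct.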

Figure 10: Examples from Critic-TruthfulQA.

F.1 PROMPT FOR CRITIC-GSM8K

Listing 2 shows the 5-shot chain-of-thought prompt used to evaluate on Critic-GSM8K. The questions are 5 random examples from the training split of GSM8K (Cobbe et al., 2021), with responses sampled from PaLM-2-L (Google et al., 2023). We manually select responses of appropriate quality. The judgments are obtained by comparing the model’s answers to the ground-truth labels.
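One way to derive such judgments automatically is sketched below. It rests on two assumptions that the text does not confirm: that the final number in a sampled response is the model's answer, and that a simple numeric comparison is sufficient. GSM8K reference solutions do end with a line of the form `#### <answer>`. All function names are hypothetical.

```python
import re

# A number with optional sign, thousands separators, and decimal part.
NUMBER_RE = re.compile(r"-?\d[\d,]*(?:\.\d+)?")


def last_number(text):
    """Return the last number in the text as a float, or None if absent."""
    matches = NUMBER_RE.findall(text)
    if not matches:
        return None
    return float(matches[-1].replace(",", ""))


def gold_answer(solution):
    """Read the final answer that follows '####' in a GSM8K solution."""
    return last_number(solution.split("####")[-1])


def label_response(response, solution):
    """Label a response by comparing its final number to the gold answer."""
    predicted = last_number(response)
    if predicted is not None and predicted == gold_answer(solution):
        return "correct"
    return "incorrect"
```

A response with no number at all is labeled incorrect here, which is a design choice for this sketch rather than something the paper specifies.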

Listing 2: 5-shot chain-of-thought prompt for Critic-GSM8K.

F.2 PROMPT FOR CRITIC-HUMANEVAL

Listing 3 presents the 3-shot chain-of-thought prompt for Critic-HumanEval. Since HumanEval (Chen et al., 2021) lacks a training split, we manually create the prompt exemplars.

Listing 3: 3-shot chain-of-thought prompt for Critic-HumanEval.

F.3 PROMPT FOR CRITIC-TRUTHFULQA

Listing 4 presents the 5-shot chain-of-thought prompt for Critic-TruthfulQA. Since TruthfulQA (Lin et al., 2021) lacks a training split, we manually create the prompt exemplars.

Listing 4: 5-shot chain-of-thought prompt for Critic-TruthfulQA.

:::info Authors:

(1) Liangchen Luo, Google Research (luolc@google.com);

(2) Zi Lin, UC San Diego;

(3) Yinxiao Liu, Google Research;

(4) Yun Zhu, Google Research;

(5) Jingbo Shang, UC San Diego;

(6) Lei Meng, Google Research (leimeng@google.com).

:::


:::info This paper is available on arXiv under the CC BY 4.0 DEED license.

:::
