The Prompt Patterns That Decide If an AI Is “Correct” or “Wrong”

By: Hackernoon
2025/08/27 17:00

Abstract and 1. Introduction

  2. Definition of Critique Ability

  3. Construction of CriticBench

    3.1 Data Generation

    3.2 Data Selection

  4. Properties of Critique Ability

    4.1 Scaling Law

    4.2 Self-Critique Ability

    4.3 Correlation to Certainty

  5. New Capacity with Critique: Self-Consistency with Self-Check

  6. Conclusion, References, and Acknowledgments

A. Notations

B. CriticBench: Sources of Queries

C. CriticBench: Data Generation Details

D. CriticBench: Data Selection Details

E. CriticBench: Statistics and Examples

F. Evaluation Settings

F EVALUATION SETTINGS

Figure 8: Examples from Critic-GSM8K.

Figure 9: Examples from Critic-HumanEval.

To evaluate large language models on CRITICBENCH, we employ few-shot chain-of-thought prompting rather than zero-shot prompting. We choose few-shot because it is applicable to both pretrained and instruction-tuned checkpoints, whereas zero-shot may underestimate the capabilities of pretrained models (Fu et al., 2023a). The prompt design draws inspiration from Constitutional AI (Bai et al., 2022) and principle-driven prompting (Sun et al., 2023) in that the prompts always start with general principles, followed by multiple exemplars.
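The principles-then-exemplars layout described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `Exemplar` class, the `build_prompt` function, and the field labels (`Question:`, `Answer:`, `Analysis:`) are hypothetical and are not taken from the paper's listings. The one detail taken from the source is that every exemplar ends with "Judgment: X.".

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class Exemplar:
    """One few-shot example (hypothetical structure, not from the paper)."""
    question: str
    answer: str
    analysis: str   # chain-of-thought critique of the answer
    judgment: str   # "correct" or "incorrect"


def build_prompt(principles: Sequence[str],
                 exemplars: Sequence[Exemplar],
                 question: str,
                 answer: str) -> str:
    """Assemble a prompt: general principles first, then exemplars, then the query."""
    parts = ["\n".join(principles)]
    for ex in exemplars:
        parts.append(
            f"Question: {ex.question}\n"
            f"Answer: {ex.answer}\n"
            f"Analysis: {ex.analysis}\n"
            f"Judgment: {ex.judgment}."   # exemplars always end with this pattern
        )
    # The final block is left open so the model writes its analysis and judgment.
    parts.append(f"Question: {question}\nAnswer: {answer}\nAnalysis:")
    return "\n\n".join(parts)
```

Leaving the last block open at `Analysis:` is a design choice of this sketch. It nudges the model to produce its reasoning before the final "Judgment: X." line, matching the order of analysis and judgment that the exemplars demonstrate.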

In the evaluation process, we use a temperature of 0.6 for generating the judgment, which is preceded by the chain-of-thought analysis. Each model is evaluated 8 times, and the average accuracy is reported. The few-shot exemplars always end with the pattern "Judgment: X.", where X is either correct or incorrect. We search for this pattern in the model output and extract X. In rare cases where this pattern is absent, the result defaults to correct.
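A minimal sketch of this extraction and averaging step might look as follows. The function names, the exact regular expression, and the choice to take the first match are assumptions made for illustration. The source states only that the pattern is searched for, that a missing pattern defaults to correct, and that accuracy is averaged over 8 evaluations.

```python
import re
from typing import Sequence

# Assumed regex for the "Judgment: X." pattern; the paper does not give one.
JUDGMENT_RE = re.compile(r"Judgment:\s*(correct|incorrect)\b", re.IGNORECASE)


def extract_judgment(output: str) -> str:
    """Return "correct" or "incorrect" parsed from a model output."""
    match = JUDGMENT_RE.search(output)  # first occurrence (an assumption)
    if match is None:
        return "correct"  # default stated in the text when the pattern is absent
    return match.group(1).lower()


def average_accuracy(runs: Sequence[Sequence[str]],
                     labels: Sequence[str]) -> float:
    """Average per-run accuracy over repeated evaluations (8 in the paper)."""
    per_run = []
    for outputs in runs:
        predictions = [extract_judgment(o) for o in outputs]
        hits = sum(p == y for p, y in zip(predictions, labels))
        per_run.append(hits / len(labels))
    return sum(per_run) / len(per_run)
```

Here `runs` holds one list of raw model outputs per evaluation pass, and `labels` holds the gold judgments in the same order. Note that under the default rule, an output with no parsable judgment is scored as a prediction of correct.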

Figure 10: Examples from Critic-TruthfulQA.

F.1 PROMPT FOR CRITIC-GSM8K

Listing 2 shows the 5-shot chain-of-thought prompt used to evaluate on Critic-GSM8K. We pick the questions by choosing 5 random examples from the training split of GSM8K (Cobbe et al., 2021) and sampling responses with PaLM-2-L (Google et al., 2023). We manually select the responses with appropriate quality. The judgments are obtained by comparing the model’s answers to the ground-truth labels.
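As a rough illustration of how a judgment could be derived by comparing a response's final answer to the ground-truth label, consider the sketch below. The paper does not describe how answers are extracted or normalized, so the helper names, the stripping of commas and dollar signs, and the numeric tolerance are all assumptions.

```python
def normalize_number(text: str) -> float:
    """Parse a final numeric answer, ignoring commas and dollar signs (assumed rules)."""
    return float(text.replace(",", "").replace("$", "").strip())


def judge(model_answer: str, gold_answer: str, tol: float = 1e-6) -> str:
    """Label a response "correct" if its final answer matches the gold label."""
    same = abs(normalize_number(model_answer) - normalize_number(gold_answer)) < tol
    return "correct" if same else "incorrect"


# Hypothetical usage with made-up answers:
print(judge("1,000", "1000"))
print(judge("71", "72"))
```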

Listing 2: 5-shot chain-of-thought prompt for Critic-GSM8K.

F.2 PROMPT FOR CRITIC-HUMANEVAL

Listing 3 presents the 3-shot chain-of-thought prompt for Critic-HumanEval. Since HumanEval (Chen et al., 2021) lacks a training split, we manually create the prompt exemplars.

Listing 3: 3-shot chain-of-thought prompt for Critic-HumanEval.

F.3 PROMPT FOR CRITIC-TRUTHFULQA

Listing 4 presents the 5-shot chain-of-thought prompt for Critic-TruthfulQA. Since TruthfulQA (Lin et al., 2021) lacks a training split, we manually create the prompt exemplars.

Listing 4: 5-shot chain-of-thought prompt for Critic-TruthfulQA.

:::info Authors:

(1) Liangchen Luo, Google Research ([email protected]);

(2) Zi Lin, UC San Diego;

(3) Yinxiao Liu, Google Research;

(4) Yun Zhu, Google Research;

(5) Jingbo Shang, UC San Diego;

(6) Lei Meng, Google Research ([email protected]).

:::

:::info This paper is available on arXiv under the CC BY 4.0 DEED license.

:::
