CriticBench uses Google’s PaLM-2 model family to generate benchmark data for tasks like GSM8K, HumanEval, and TruthfulQA. By avoiding GPT and LLaMA due to licensing constraints, the project ensures a more open and compliant evaluation framework. Its methodology employs chain-of-thought prompting, code sandbox testing, and principle-driven prompting to create high-quality responses that capture both final answers and underlying reasoning, making it a valuable resource for critique-based AI evaluation.

Why CriticBench Refuses GPT & LLaMA for Data Generation

Abstract and 1. Introduction

  2. Definition of Critique Ability

  3. Construction of CriticBench

    3.1 Data Generation

    3.2 Data Selection

  4. Properties of Critique Ability

    4.1 Scaling Law

    4.2 Self-Critique Ability

    4.3 Correlation to Certainty

  5. New Capacity with Critique: Self-Consistency with Self-Check

  6. Conclusion, References, and Acknowledgments

A. Notations

B. CriticBench: Sources of Queries

C. CriticBench: Data Generation Details

D. CriticBench: Data Selection Details

E. CriticBench: Statistics and Examples

F. Evaluation Settings

C CRITICBENCH: DATA GENERATION DETAILS

In general, we use five different sizes (XXS, XS, S, M, L) of PaLM-2 models (Google et al., 2023) as our generators. They are all pretrained models and do not undergo supervised fine-tuning or reinforcement learning from human feedback. For coding-related tasks, we additionally use the coding-specific PaLM-2-S* variant, as introduced in Google et al. (2023). It is obtained through continual training of PaLM-2-S on a data mixture enriched with code-heavy and multilingual corpora.

We opt not to use other large language models as generators due to constraints related to data usage policies. For instance, OpenAI’s GPT series (OpenAI, 2023) and Meta’s LLaMA series (Touvron et al., 2023a;b) each have their own usage policies [6, 7]. Our aim is to establish an open benchmark with minimal constraints. To avoid the complications of incorporating licenses and usage policies from multiple sources, we limit data generation to the PaLM-2 model family, with which we are most familiar. We are actively working on a compliance review to facilitate the data release under a less restrictive license.

C.1 GSM8K

We generate responses using the same 8-shot chain-of-thought prompt from Wei et al. (2022b). We use nucleus sampling (Holtzman et al., 2020) with temperature T = 0.6 and p = 0.95 to sample 64 responses for each query. Following Lewkowycz et al. (2022) and Google et al. (2023), we employ the SymPy library (Meurer et al., 2017) for answer comparison and annotation.
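The sampling step can be illustrated with a minimal sketch of temperature-scaled nucleus (top-p) sampling over a single next-token distribution. The function name `nucleus_sample` and the toy logits are assumptions made for illustration; this is not the PaLM-2 decoding code, which the paper does not show.

```python
import math
import random


def nucleus_sample(logits, temperature=0.6, top_p=0.95, rng=random):
    """Sample one token index using temperature scaling and top-p filtering.

    `logits` is a list of unnormalized scores, one per vocabulary entry.
    The defaults mirror the GSM8K settings quoted above (T = 0.6, p = 0.95).
    """
    # Temperature-scaled softmax; subtracting the max keeps exp() stable.
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]

    # Keep the smallest set of most-probable tokens whose mass reaches top_p.
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    nucleus, mass = [], 0.0
    for i in order:
        nucleus.append(i)
        mass += probs[i]
        if mass >= top_p:
            break

    # Draw from the nucleus; random.choices renormalizes the weights itself.
    weights = [probs[i] for i in nucleus]
    return rng.choices(nucleus, weights=weights, k=1)[0]
```

Drawing 64 responses per query then amounts to running the full decoding loop 64 times with independent randomness, each step calling a routine like this one on the model's next-token logits.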

C.2 HUMANEVAL

Following Google et al. (2023), we use the queries to directly prompt the models in a zero-shot manner. We use nucleus sampling (Holtzman et al., 2020) with temperature T = 0.8 and p = 0.95 to sample 100 responses for each query. The generated responses are truncated at the next line of code without indentation. All samples are tested in a restricted code sandbox that includes only a limited number of relevant modules and is carefully isolated from the system environment.
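One plausible reading of the truncation rule is sketched below: the completion is kept line by line until the first non-blank line that starts at column 0, since such a line falls outside the function body being completed. The helper name `truncate_completion` is hypothetical, and the exact rule and the sandbox used by the authors are not given in the text.

```python
def truncate_completion(completion: str) -> str:
    """Keep the generated function body and stop at the first unindented line.

    Assumes the prompt ends inside a function definition, so every line of
    the intended body is indented and blank lines may appear in between.
    """
    kept = []
    for line in completion.splitlines():
        # A non-blank line at column 0 is code outside the function body.
        if line.strip() and not line[0].isspace():
            break
        kept.append(line)
    # Drop trailing blank lines and end with a single newline.
    return "\n".join(kept).rstrip() + "\n"
```

For example, a completion of `"    return a + b\n\nprint(add(1, 2))\n"` would be cut back to the single indented `return` line before being executed against the unit tests.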

C.3 TRUTHFULQA

In the original paper by Lin et al. (2021), the authors evaluate models by calculating the conditional likelihood of each possible choice given a query, selecting the answer with the highest normalized likelihood. While straightforward, this method has two primary limitations. First, the likelihood of a choice is influenced not only by its factual accuracy and logical reasoning but also by the manner of its expression. Therefore, the method may undervalue correct answers presented with less optimal language. Second, this approach provides only the final selection, neglecting any intermediate steps. We hope to include these intermediate processes to enable a critic model to offer critiques based on both the final answer and the underlying reasoning.
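The likelihood-based selection described above can be sketched as follows. The text does not specify the normalization, so this sketch assumes it means averaging per-token log-probabilities. The function name and the numbers in the usage note are invented for illustration.

```python
def pick_by_normalized_likelihood(choice_token_logprobs):
    """Return (best_index, scores) under length-normalized log-likelihood.

    `choice_token_logprobs` holds one list per answer choice; each inner list
    contains hypothetical per-token log-probabilities of that choice given
    the query. Averaging over tokens is an assumed form of normalization.
    """
    scores = [sum(lps) / len(lps) for lps in choice_token_logprobs]
    best = max(range(len(scores)), key=scores.__getitem__)
    return best, scores
```

With made-up inputs such as `[[-0.2, -0.3, -0.1], [-1.0, -0.5]]`, the first choice wins purely because its tokens are more probable on average. The function returns only an index and scores, with no reasoning trace, which is the second limitation noted above.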

Following OpenAI (2023), we adopt a 5-shot prompt for answer selection. Since OpenAI (2023) does not disclose its prompt template, we created our own version, detailed in Listing 1. Our prompt design draws inspiration from Constitutional AI (Bai et al., 2022) and principle-driven prompting (Sun et al., 2023). We use temperature T = 0.6 to sample 64 responses for each query.

We wish to clarify that although Lin et al. (2021) indicate that TruthfulQA is not intended for few-shot benchmarking, our objective is neither to test PaLM-2 models nor to advance the state of the art. Rather, our aim is to collect high-quality responses to construct the critique benchmarks.

Listing 1: 5-shot chain-of-thought prompt for TruthfulQA (mc1).


:::info Authors:

(1) Liangchen Luo, Google Research (luolc@google.com);

(2) Zi Lin, UC San Diego;

(3) Yinxiao Liu, Google Research;

(4) Yun Zhu, Google Research;

(5) Jingbo Shang, UC San Diego;

(6) Lei Meng, Google Research (leimeng@google.com).

:::


:::info This paper is available on arXiv under the CC BY 4.0 DEED license.

:::

[6] OpenAI’s usage policies: https://openai.com/policies/usage-policies

[7] LLaMA-2’s usage policy: https://ai.meta.com/llama/use-policy/
