CriticBench uses Google’s PaLM-2 model family to generate benchmark data for tasks like GSM8K, HumanEval, and TruthfulQA. By avoiding GPT and LLaMA due to licensing constraints, the project ensures a more open and compliant evaluation framework. Its methodology employs chain-of-thought prompting, code sandbox testing, and principle-driven prompting to create high-quality responses that capture both final answers and underlying reasoning, making it a valuable resource for critique-based AI evaluation.

Why CriticBench Refuses GPT & LLaMA for Data Generation

Abstract and 1. Introduction

  2. Definition of Critique Ability

  3. Construction of CriticBench

    3.1 Data Generation

    3.2 Data Selection

  4. Properties of Critique Ability

    4.1 Scaling Law

    4.2 Self-Critique Ability

    4.3 Correlation to Certainty

  5. New Capacity with Critique: Self-Consistency with Self-Check

  6. Conclusion, References, and Acknowledgments

A. Notations

B. CriticBench: Sources of Queries

C. CriticBench: Data Generation Details

D. CriticBench: Data Selection Details

E. CriticBench: Statistics and Examples

F. Evaluation Settings

C CRITICBENCH: DATA GENERATION DETAILS

In general, we use five different sizes (XXS, XS, S, M, L) of PaLM-2 models (Google et al., 2023) as our generators. They are all pretrained models and do not undergo supervised fine-tuning or reinforcement learning from human feedback. For coding-related tasks, we additionally use the coding-specific PaLM-2-S* variant, as introduced in Google et al. (2023). It is obtained through continual training of PaLM-2-S on a data mixture enriched with code-heavy and multilingual corpora.

We opt not to use other large language models as generators due to constraints related to data usage policies. For instance, OpenAI’s GPT series (OpenAI, 2023) and Meta’s LLaMA series (Touvron et al., 2023a;b) both have their specific usage policies [6,7]. Our aim is to establish an open benchmark with minimal constraints. To avoid the complications of incorporating licenses and usage policies from multiple sources, we limit data generation to the PaLM-2 model family, with which we are most familiar. We are actively working on a compliance review to facilitate the data release under a less restrictive license.

C.1 GSM8K

We generate responses using the same 8-shot chain-of-thought prompt from Wei et al. (2022b). We use nucleus sampling (Holtzman et al., 2020) with temperature T = 0.6 and p = 0.95 to sample 64 responses for each query. Following Lewkowycz et al. (2022) and Google et al. (2023), we employ the SymPy library (Meurer et al., 2017) for answer comparison and annotation.
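The paper references SymPy for answer comparison but does not include the code. The minimal sketch below shows, under assumptions, how a sampled final answer could be checked for symbolic equivalence against the reference answer; the `extract_final_answer` helper and the "The answer is" pattern are assumptions about the chain-of-thought output format, not the authors' exact implementation.

```python
import re

import sympy
from sympy.parsing.sympy_parser import parse_expr


def extract_final_answer(response: str) -> str:
    """Assumed format: the chain of thought ends with 'The answer is <value>.'"""
    match = re.search(r"The answer is\s*(.+?)\.?\s*$", response.strip())
    return match.group(1) if match else response.strip()


def answers_match(predicted: str, reference: str) -> bool:
    """Label a sampled response correct if its answer is symbolically equal to the reference."""
    try:
        # A zero difference after simplification means the two expressions are equivalent.
        return sympy.simplify(parse_expr(predicted) - parse_expr(reference)) == 0
    except Exception:
        # Fall back to exact string comparison when an answer cannot be parsed.
        return predicted.strip() == reference.strip()
```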

C.2 HUMANEVAL

Following Google et al. (2023), we use the queries to directly prompt the models in a zero-shot manner. We use nucleus sampling (Holtzman et al., 2020) with temperature T = 0.8 and p = 0.95 to sample 100 responses for each query. The generated responses are truncated at the next line of code without indentation. All samples are tested in a restricted code sandbox that exposes only a limited number of relevant modules and is carefully isolated from the system environment.
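The sandbox itself is not released, so the following is only a rough sketch: it truncates a completion at the first non-indented line and executes the assembled program (prompt, truncated completion, and test harness concatenated by the caller) in a subprocess with a hard timeout. The `ALLOWED_MODULES` allow-list and the function names are illustrative assumptions; a production sandbox would additionally need filesystem, network, and resource isolation.

```python
import multiprocessing

# Hypothetical allow-list of modules exposed to candidate programs (assumption).
ALLOWED_MODULES = ("math", "re", "itertools", "collections")


def truncate_completion(completion: str) -> str:
    """Keep generated lines only up to the first non-empty line without indentation."""
    kept = []
    for line in completion.split("\n"):
        if line.strip() and not line[0].isspace():
            break
        kept.append(line)
    return "\n".join(kept)


def _run_program(program: str, queue: multiprocessing.Queue) -> None:
    """Worker process: execute the assembled program with only allow-listed modules injected."""
    try:
        scope = {name: __import__(name) for name in ALLOWED_MODULES}
        exec(program, scope)  # any failed assertion or runtime error marks the sample incorrect
        queue.put(True)
    except BaseException:
        queue.put(False)


def run_in_sandbox(program: str, timeout_s: float = 5.0) -> bool:
    """Run prompt + truncated completion + test harness in a subprocess with a timeout."""
    queue = multiprocessing.Queue()
    proc = multiprocessing.Process(target=_run_program, args=(program, queue))
    proc.start()
    proc.join(timeout_s)
    if proc.is_alive():
        proc.terminate()  # treat hangs as failures
        return False
    return queue.get() if not queue.empty() else False
```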

C.3 TRUTHFULQA

In the original paper by Lin et al. (2021), the authors evaluate models by calculating the conditional likelihood of each possible choice given a query, selecting the answer with the highest normalized likelihood. While straightforward, this method has two primary limitations. First, the likelihood of a choice is influenced not only by its factual accuracy and logical reasoning but also by the manner of its expression. Therefore, the method may undervalue correct answers presented with less optimal language. Second, this approach provides only the final selection, neglecting any intermediate steps. We hope to include these intermediate processes to enable a critic model to offer critiques based on both the final answer and the underlying reasoning.
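As a point of reference, here is a minimal sketch of the likelihood-based selection described above, assuming a hypothetical `choice_token_logprobs(query, choice)` scoring function that returns one log-probability per token of the choice under the language model:

```python
from typing import Callable, Sequence


def select_by_normalized_likelihood(
    query: str,
    choices: Sequence[str],
    choice_token_logprobs: Callable[[str, str], Sequence[float]],
) -> int:
    """Return the index of the choice with the highest length-normalized log-likelihood."""
    def score(choice: str) -> float:
        logprobs = choice_token_logprobs(query, choice)
        return sum(logprobs) / max(len(logprobs), 1)  # average per-token log-probability

    return max(range(len(choices)), key=lambda i: score(choices[i]))
```

Note that this baseline yields only an index, which is exactly the limitation discussed above: there is no intermediate reasoning for a critic model to evaluate.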

We follow OpenAI (2023) in adopting a 5-shot prompt for answer selection. Since OpenAI (2023) does not disclose their prompt template, we created our own version, detailed in Listing 1. Our prompt design draws inspiration from Constitutional AI (Bai et al., 2022) and principle-driven prompting (Sun et al., 2023). We use temperature T = 0.6 to sample 64 responses for each query.

We wish to clarify that although Lin et al. (2021) indicate that TruthfulQA is not intended for few-shot benchmarking, our objective is neither to test PaLM-2 models nor to advance the state of the art. Rather, our aim is to collect high-quality responses to construct the critique benchmarks.

Listing 1: 5-shot chain-of-thought prompt for TruthfulQA (mc1).

:::info Authors:

(1) Liangchen Luo, Google Research (luolc@google.com);

(2) Zi Lin, UC San Diego;

(3) Yinxiao Liu, Google Research;

(4) Yun Zhu, Google Research;

(5) Jingbo Shang, UC San Diego;

(6) Lei Meng, Google Research (leimeng@google.com).

:::


:::info This paper is available on arxiv under CC BY 4.0 DEED license.

:::

[6] OpenAI’s usage policies: https://openai.com/policies/usage-policies

[7] LLaMA-2’s usage policy: https://ai.meta.com/llama/use-policy/
