CriticBench uses Google’s PaLM-2 model family to generate benchmark data for tasks like GSM8K, HumanEval, and TruthfulQA. By avoiding GPT and LLaMA due to licensing constraints, the project ensures a more open and compliant evaluation framework. Its methodology employs chain-of-thought prompting, code sandbox testing, and principle-driven prompting to create high-quality responses that capture both final answers and underlying reasoning, making it a valuable resource for critique-based AI evaluation.CriticBench uses Google’s PaLM-2 model family to generate benchmark data for tasks like GSM8K, HumanEval, and TruthfulQA. By avoiding GPT and LLaMA due to licensing constraints, the project ensures a more open and compliant evaluation framework. Its methodology employs chain-of-thought prompting, code sandbox testing, and principle-driven prompting to create high-quality responses that capture both final answers and underlying reasoning, making it a valuable resource for critique-based AI evaluation.

Why CriticBench Refuses GPT & LLaMA for Data Generation

2025/08/27 15:00
4분 읽기
이 콘텐츠에 대한 의견이나 우려 사항이 있으시면 crypto.news@mexc.com으로 연락주시기 바랍니다

Abstract and 1. Introduction

  1. Definition of Critique Ability

  2. Construction of CriticBench

    3.1 Data Generation

    3.2 Data Selection

  3. Properties of Critique Ability

    4.1 Scaling Law

    4.2 Self-Critique Ability

    4.3 Correlation to Certainty

  4. New Capacity with Critique: Self-Consistency with Self-Check

  5. Conclusion, References, and Acknowledgments

A. Notations

B. CriticBench: Sources of Queries

C. CriticBench: Data Generation Details

D. CriticBench: Data Selection Details

E. CriticBench: Statistics and Examples

F. Evaluation Settings

C CRITICBENCH: DATA GENERATION DETAILS

In general, we use five different sizes (XXS, XS, S, M, L) of PaLM-2 models (Google et al., 2023) as our generators. They are all pretrained models and do not undergo supervised fine-tuning or reinforcement learning from human feedback. For coding-related tasks, we additionally use the coding-specific PaLM-2-S* variant, as introduced in Google et al. (2023). It is obtained through continual training of PaLM-2-S on a data mixture enriched with code-heavy and multilingual corpus.

\ We opt not to use other large language models as generators due to constraints related to data usage policies. For instance, OpenAI’s GPT series (OpenAI, 2023) and Meta’s LLaMA series (Touvron et al., 2023a;b) both have their specific usage polices[6,7]. Our aim is to establish an open benchmark with minimal constraints. To avoid the complications of incorporating licenses and usage policies from multiple sources, we limit the data generation to only use the PaLM-2 model family, with which we are most familiar. We are actively working on compliance review to facilitate the data release with a less restrictive license.

C.1 GSM8K

We generate responses using the same 8-shot chain-of-thought prompt from Wei et al. (2022b). We use nucleus sampling (Holtzman et al., 2020) with temperature T = 0.6 and p = 0.95 to sample 64 responses for each query. Following Lewkowycz et al. (2022) and Google et al. (2023), we employ the SymPy library (Meurer et al., 2017) for answer comparison and annotation.

C.2 HUMANEVAL

Following Google et al. (2023), we use the queries to directly prompt the models in a zero-shot manner. We use nucleus sampling (Holtzman et al., 2020) with temperature T = 0.8 and p = 0.95 to sample 100 responses for each query. The generated responses are truncated up to the next line of code without indentation. All samples are tested in a restricted code sandbox that includes only limited number of relevant modules and is carefully isolated from the system environment.

C.3 TRUTHFULQA

In the original paper by Lin et al. (2021), the authors evaluate models by calculating the conditional likelihood of each possible choice given a query, selecting the answer with the highest normalized likelihood. While straightforward, this method has two primary limitations. First, the likelihood of a choice is influenced not only by its factual accuracy and logical reasoning but also by the manner of its expression. Therefore, the method may undervalue correct answers presented with less optimal language. Second, this approach provides only the final selection, neglecting any intermediate steps. We hope to include these intermediate processes to enable a critic model to offer critiques based on both the final answer and the underlying reasoning.

\ We follow OpenAI (2023) to adopt a 5-shot prompt for answer selection. Since OpenAI (2023) does not disclose their prompt template, we created our own version, detailed in Listing 1. Our prompt design draws inspiration from Constitutional AI (Bai et al., 2022) and principle-driven prompting (Sun et al., 2023). We use temperature T = 0.6 to sample 64 responses for each query.

\ We wish to clarify that although Lin et al. (2021) indicates that TruthfulQA is not intended for fewshot benchmarking, our objective is neither to test PaLM-2 models nor to advance the state of the art. Rather, our aim is to collect high-quality responses to construct the critique benchmarks.

\ Listing 1: 5-shot chain-of-thought prompt for TruthfulQA (mc1).

\ Listing 1: 5-shot chain-of-thought prompt for TruthfulQA (mc1).

\

:::info Authors:

(1) Liangchen Luo, Google Research (luolc@google.com);

(2) Zi Lin, UC San Diego;

(3) Yinxiao Liu, Google Research;

(4) Yun Zhu, Google Research;

(5) Jingbo Shang, UC San Diego;

(6) Lei Meng, Google Research (leimeng@google.com).

:::


:::info This paper is available on arxiv under CC BY 4.0 DEED license.

:::

[6] OpenAI’s usage policies: https://openai.com/policies/usage-policies

\ [7] LLaMA-2’s usage policy: https://ai.meta.com/llama/use-policy/

시장 기회
Moonveil 로고
Moonveil 가격(MORE)
$0.0000415
$0.0000415$0.0000415
+2.62%
USD
Moonveil (MORE) 실시간 가격 차트
면책 조항: 본 사이트에 재게시된 글들은 공개 플랫폼에서 가져온 것으로 정보 제공 목적으로만 제공됩니다. 이는 반드시 MEXC의 견해를 반영하는 것은 아닙니다. 모든 권리는 원저자에게 있습니다. 제3자의 권리를 침해하는 콘텐츠가 있다고 판단될 경우, crypto.news@mexc.com으로 연락하여 삭제 요청을 해주시기 바랍니다. MEXC는 콘텐츠의 정확성, 완전성 또는 시의적절성에 대해 어떠한 보증도 하지 않으며, 제공된 정보에 기반하여 취해진 어떠한 조치에 대해서도 책임을 지지 않습니다. 본 콘텐츠는 금융, 법률 또는 기타 전문적인 조언을 구성하지 않으며, MEXC의 추천이나 보증으로 간주되어서는 안 됩니다.

$30,000 in PRL + 15,000 USDT

$30,000 in PRL + 15,000 USDT$30,000 in PRL + 15,000 USDT

Deposit & trade PRL to boost your rewards!