This article evaluates how fine-tuning affects AI reasoning on structured puzzle tasks. Using Open-LLaMA as a base, models were trained on datasets of varying sizes (1M, 10M, 100M). Results show clear scaling benefits: the 100M-sample model achieved the best pass@1 accuracy in both in-distribution and out-of-distribution tests. While smaller models struggled with limited reasoning steps or logical errors, larger fine-tuned models demonstrated deeper problem-solving ability, outperforming both base and prompt-engineered approaches.

Evaluating Fine-Tuned LLMs on Reasoning Puzzles

:::info Authors:

(1) Haolong Li, Tongji University; work done during an internship at ByteDance (furlongli322@gmail.com);

(2) Yu Ma, Seed Foundation, ByteDance (mayu.1231@bytedance.com);

(3) Yinqi Zhang, East China Normal University; work done during an internship at ByteDance (zhang.inch@gmail.com);

(4) Chen Ye (Corresponding Author), ESSC Lab, Tongji University (yechen@tongji.edu.cn);

(5) Jie Chen, Seed Foundation, ByteDance; Project Leader (chenjiexjtu@gmail.com).

:::

Abstract and 1 Introduction

2 Problem Definition

2.1 Arithmetical Puzzle Problem

2.2 Data Synthesizing

2.3 Dataset

3 Model

4 Experiments

4.1 Evaluation

4.2 Results

4.3 Case Studies

5 Conclusion and Acknowledgements

6 Limitations

7 Ethics Statement and References

A Appendix

A.1 Hyperparameter Settings

A.2 Evaluation of the Base Model

A.3 Case Study

A.4 Visualization of the Proposed Puzzle

4.1 Evaluation

For the fine-tuned model, we use greedy decoding in a zero-shot setting to generate responses. To measure the model's performance on the proposed puzzle, a corresponding verifier is designed to evaluate the correctness of the responses automatically. Specifically, a solution is deemed correct if it satisfies the following rules:

• No extra or illegal characters.

• There are exactly N − 1 equations, and all the corresponding calculations are correct.

• F(X1, ..., XN | ops) = T.

• All {Xi | i ∈ {1, 2, ..., N}} and the intermediate calculation results are used only once.

Figure 1: Distributions of N and X for different training set sizes (1M / 10M / 100M samples). N denotes the total number of candidate integers in our puzzle; X = (X1, X2, ..., XN) denotes the candidate integers.

Figure 2: Distributions of the tokenized prompt and response lengths for different training set sizes (1M / 10M / 100M samples).

The detailed steps for evaluating a solution to this puzzle are described in Algorithm 2.
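To make these rules concrete, here is a minimal verifier sketch in Python. The response format (one `a op b = c` equation per line, with non-negative integers) and integer-only division are assumptions inferred from the rules above, not the authors' exact implementation of Algorithm 2.

```python
import re

# Assumed response format: one "a op b = c" equation per line.
EQ_RE = re.compile(r"^(\d+)\s*([+\-*/])\s*(\d+)\s*=\s*(\d+)$")

def verify(response: str, candidates: list[int], target: int) -> bool:
    lines = [ln.strip() for ln in response.strip().splitlines() if ln.strip()]
    if len(lines) != len(candidates) - 1:      # exactly N - 1 equations
        return False
    pool = list(candidates)                    # multiset of numbers still usable
    result = None
    for line in lines:
        m = EQ_RE.match(line)
        if m is None:                          # extra or illegal characters
            return False
        a, op, b, c = int(m[1]), m[2], int(m[3]), int(m[4])
        for x in (a, b):                       # each number is used only once
            if x not in pool:
                return False
            pool.remove(x)
        value = {"+": a + b, "-": a - b, "*": a * b,
                 "/": a // b if b and a % b == 0 else None}[op]
        if value != c:                         # stated calculation is wrong
            return False
        pool.append(c)                         # intermediate result usable once
        result = c
    return result == target                    # F(X1, ..., XN | ops) = T
```

For example, with candidates (1, 3, 4) and target 13, the response "3 * 4 = 12" followed by "12 + 1 = 13" verifies as correct. Because each equation consumes two numbers from the pool and produces one, exactly one number remains after N − 1 valid equations, so checking it against T also guarantees every candidate was consumed.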

4.2 Results

As mentioned in Section 2.3, we generated three training datasets of different sizes to explore the data scaling effects on the fine-tuned model. The pass@1 rates on the in-distribution and out-of-distribution test datasets are shown in Table 2. When fine-tuned with 100M samples, the model achieves the highest scores, with a zero-shot pass@1 of 0.44 on the in-distribution test dataset and 0.33 and 0.35 on the two OOD datasets, respectively.
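For concreteness, here is a sketch of how this zero-shot pass@1 could be computed: with greedy decoding there is exactly one response per prompt, so pass@1 reduces to the fraction of test problems whose single response passes the verifier sketched above. The checkpoint below is the published Open-LLaMA 3B base weights; the generation settings and the test-set format are assumptions, not the authors' exact setup.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Published base checkpoint; the paper's fine-tuned weights are not assumed
# public, so this only illustrates the decoding setup.
MODEL = "openlm-research/open_llama_3b"
tokenizer = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(MODEL)

def model_generate(prompt: str) -> str:
    inputs = tokenizer(prompt, return_tensors="pt")
    # do_sample=False gives greedy decoding; max_new_tokens is an assumption.
    out = model.generate(**inputs, do_sample=False, max_new_tokens=256)
    return tokenizer.decode(out[0][inputs["input_ids"].shape[1]:],
                            skip_special_tokens=True)

def pass_at_1(test_set):
    """test_set: list of (prompt, candidates, target) triples (assumed format)."""
    correct = sum(verify(model_generate(p), xs, t) for p, xs, t in test_set)
    return correct / len(test_set)
```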

Furthermore, Figure 3 shows the training curves of the models fine-tuned on these three datasets. A faster decay of the training loss is clearly observed as the training data size increases, consistent with the rapid increase of the pass@1 rate on the in-distribution dataset. The same performance gain also appears on the two OOD test datasets, as shown in Table 2.

Additionally, we also tested this puzzle on the base model (open-llama-3B) and several other open-source and closed-source models with both few-shot and CoT prompting. The results and some of the generated cases are shown in Appendix A.2, demonstrating the necessity of fine-tuning for solving such puzzle problems.
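For reference, a few-shot CoT probe of the base model might be assembled as below. The exemplar, its solution, and the template wording are purely illustrative, not the prompts actually used in Appendix A.2.

```python
# Hypothetical few-shot CoT prompt; exemplar and wording are illustrative only.
FEW_SHOT_EXEMPLAR = (
    "Candidates: 1, 3, 4  Target: 13\n"
    "Let's think step by step.\n"
    "3 * 4 = 12\n"
    "12 + 1 = 13\n"
)

def build_cot_prompt(candidates, target):
    query = f"Candidates: {', '.join(map(str, candidates))}  Target: {target}"
    return FEW_SHOT_EXEMPLAR + "\n" + query + "\nLet's think step by step.\n"
```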

4.3 Case Studies

We further examine the different solutions provided by the models trained with 1M / 10M / 100M samples on the form OOD test dataset for several challenging queries. As shown in Figure 4 in Appendix A.3, the model trained on 1M samples remains limited to a fixed number of reasoning steps, whereas the models trained on 10M / 100M samples exhibit a higher-level understanding of the problem and perform an adequate number of reasoning steps. Compared with the model trained on 100M samples, however, the model trained on 10M samples may still make computational or logical errors in the final step of reasoning.

Figure 3: The training loss and zero-shot pass@1 on the ID dataset for different training set sizes (1M / 10M / 100M samples).

Table 2: Zero-shot pass@1 of the model fine-tuned with different training set sizes (1M / 10M / 100M samples) on the ID, numerical OOD, and form OOD test datasets. The best results are highlighted.


:::info This paper is available on arXiv under the CC BY-NC-SA 4.0 (Attribution-NonCommercial-ShareAlike 4.0 International) license.

:::

