This article evaluates how fine-tuning affects AI reasoning on structured puzzle tasks. Using Open-LLaMA as a base, models were trained on datasets of varying sizes (1M, 10M, 100M). Results show clear scaling benefits: the 100M-sample model achieved the best pass@1 accuracy in both in-distribution and out-of-distribution tests. While smaller models struggled with limited reasoning steps or logical errors, larger fine-tuned models demonstrated deeper problem-solving ability, outperforming both base and prompt-engineered approaches.This article evaluates how fine-tuning affects AI reasoning on structured puzzle tasks. Using Open-LLaMA as a base, models were trained on datasets of varying sizes (1M, 10M, 100M). Results show clear scaling benefits: the 100M-sample model achieved the best pass@1 accuracy in both in-distribution and out-of-distribution tests. While smaller models struggled with limited reasoning steps or logical errors, larger fine-tuned models demonstrated deeper problem-solving ability, outperforming both base and prompt-engineered approaches.

Evaluating Fine-Tuned LLMs on Reasoning Puzzles

:::info Authors:

(1) Haolong Li, Tongji Universiy and work done during internship at ByteDance (furlongli322@gmail.com);

(2) Yu Ma, Seed Foundation, ByteDance (mayu.1231@bytedance.com);

(3) Yinqi Zhang, East China Normal University and work done during internship at ByteDance (zhang.inch@gmail.com);

(4) Chen Ye (Corresponding Author), ESSC Lab, Tongji Universiy (yechen@tongji.edu.cn);

(5) Jie Chen, Seed Foundation, ByteDance and a Project Leader (chenjiexjtu@gmail.com).

:::

Abstract and 1 Introduction

2 Problem Definition

2.1 Arithmetical Puzzle Problem

2.2 Data Synthesizing

2.3 Dataset

3 Model

4 Experiments

4.1 Evaluation

4.2 Results

4.3 Case Studies

5 Conclusion and Acknowledgements

6 Limitations

7 Ethics Statement and References

\ A Appendix

A.1 Hyperparameter Settings

A.2 Evaluation of the Base Model

A.3 Case Study

A.4 Visualization of the Proposed Puzzle

4.1 Evaluation

For the fine-tuned model, we use the greedy decoding strategy in a zero-shot setting to generate responses. To measure the model’s performance on the proposed puzzle, a corresponding verifier is designed to automatically evaluate the correctness of the responses. Specifically, a solution is deemed correct if it satisfies the following rules:

\ • No extra or illegal characters.

\ • There are only N − 1 equations and all the corresponding calculations are correct.

\ • F(X1, . . . , XN | ops) = T.

\ • All {Xi | i ∈ {1, 2, . . . , N}} and the intermediate calculation results are only used once.

\ Figure 1: Distributions of N and X for different training set sizes (1M / 10M / 100M samples). N denotes the total number of candidate integers of our puzzle, X = (X1, X2, . . . , XN ) denotes the candidate integers.

\ Figure 2: Distributions of the tokenized prompt and response lengths for different training set sizes (1M / 10M / 100M samples).

\ The detailed steps of evaluating the solution for this puzzle is described in Algorithm 2.

4.2 Results

As mentioned in Section 2.3, we have generated three training datasets with different sizes to explore the data scaling effects on the fine-tuned model. The pass@1 rate on different in-distribution and out-of-distribution test datasets are shown in Table 2. When the model is fine-tuned with 100M samples, it achieves the highest score with a zero-shot pass@1 of 0.44 in the in-distribution test dataset, and 0.33 and 0.35 in the two OOD datasets, respectively.

\ Furthermore, we have shown the training curves of the model fine-tuned on these three datasets in Figure 3. From Figure 3, a faster decaying rate is clearly observed in the training loss when increasing the training data size, which is consistent with the rapid increase of the pass@1 rate evaluated on the in-distribution dataset. The same enhancement of the performance also occurs in the two OOD test datasets as shown in Table 2.

\ Additionally, we have also conducted tests of this puzzle on the base model (open-llama-3B) and several other open-source and closed-source models with both few-shot and CoT prompting. The results and some of the generated cases are shown in Appendix A.2, demonstrating the necessity of fine-tuning with regard to solving such puzzle problems.

4.3 Case Studies

We further demonstrate the different solutions provided by models trained with 1M / 10M / 100M training data on the form OOD test dataset for several challenging queries. As shown in Figure 4 in Appendix A.3, the model trained on 1M samples is still limited to a fixed number of reasoning steps, whereas the models trained on 10M / 100M samples exhibit a higher-level understanding of the problem and perform an adequate number of reasoning steps. However, compared to the model trained on 100M samples, the model trained on 10M samples may still encounter computational or logical errors in the final step of reasoning.

\ \ Figure 3: The training loss and zero-shot pass@1 on ID dataset for different training set sizes (1M / 10M / 100M samples).

\ \ \ Table 2: Zero-shot pass@1 of the model fine-tuned with different training set sizes (1M / 10M / 100M samples) on ID, numerical OOD, and form OOD test datasets. The best results are highlighted.

\ \ \

:::info This paper is available on arxiv under CC BY-NC-SA 4.0 Deed (Attribution-Noncommercial-Sharelike 4.0 International) license.

:::

\

Market Opportunity
Prompt Logo
Prompt Price(PROMPT)
$0.06547
$0.06547$0.06547
+3.42%
USD
Prompt (PROMPT) Live Price Chart
Disclaimer: The articles reposted on this site are sourced from public platforms and are provided for informational purposes only. They do not necessarily reflect the views of MEXC. All rights remain with the original authors. If you believe any content infringes on third-party rights, please contact service@support.mexc.com for removal. MEXC makes no guarantees regarding the accuracy, completeness, or timeliness of the content and is not responsible for any actions taken based on the information provided. The content does not constitute financial, legal, or other professional advice, nor should it be considered a recommendation or endorsement by MEXC.

You May Also Like

CME Group to Launch Solana and XRP Futures Options

CME Group to Launch Solana and XRP Futures Options

The post CME Group to Launch Solana and XRP Futures Options appeared on BitcoinEthereumNews.com. An announcement was made by CME Group, the largest derivatives exchanger worldwide, revealed that it would introduce options for Solana and XRP futures. It is the latest addition to CME crypto derivatives as institutions and retail investors increase their demand for Solana and XRP. CME Expands Crypto Offerings With Solana and XRP Options Launch According to a press release, the launch is scheduled for October 13, 2025, pending regulatory approval. The new products will allow traders to access options on Solana, Micro Solana, XRP, and Micro XRP futures. Expiries will be offered on business days on a monthly, and quarterly basis to provide more flexibility to market players. CME Group said the contracts are designed to meet demand from institutions, hedge funds, and active retail traders. According to Giovanni Vicioso, the launch reflects high liquidity in Solana and XRP futures. Vicioso is the Global Head of Cryptocurrency Products for the CME Group. He noted that the new contracts will provide additional tools for risk management and exposure strategies. Recently, CME XRP futures registered record open interest amid ETF approval optimism, reinforcing confidence in contract demand. Cumberland, one of the leading liquidity providers, welcomed the development and said it highlights the shift beyond Bitcoin and Ethereum. FalconX, another trading firm, added that rising digital asset treasuries are increasing the need for hedging tools on alternative tokens like Solana and XRP. High Record Trading Volumes Demand Solana and XRP Futures Solana futures and XRP continue to gain popularity since their launch earlier this year. According to CME official records, many have bought and sold more than 540,000 Solana futures contracts since March. A value that amounts to over $22 billion dollars. Solana contracts hit a record 9,000 contracts in August, worth $437 million. Open interest also set a record at 12,500 contracts.…
Share
BitcoinEthereumNews2025/09/18 01:39
XCN Rallies 116% — Can Price Hold as New Holders Gain?

XCN Rallies 116% — Can Price Hold as New Holders Gain?

The post XCN Rallies 116% — Can Price Hold as New Holders Gain? appeared on BitcoinEthereumNews.com. Onyxcoin has delivered one of the strongest performances among
Share
BitcoinEthereumNews2026/01/14 18:59
Worldcoin Price Near $0.65 Faces Pressure as Whales Sell Into the Rally

Worldcoin Price Near $0.65 Faces Pressure as Whales Sell Into the Rally

The post Worldcoin Price Near $0.65 Faces Pressure as Whales Sell Into the Rally appeared on BitcoinEthereumNews.com. Key Insights Retail buyers continue to support
Share
BitcoinEthereumNews2026/01/14 19:12