This article explores how GPT-4 and GPT-3.5 perform in simulated game environments under three conditions: human-written rules, LLM-generated rules, and no rules. The results reveal that GPT-4 significantly outpaces GPT-3.5, especially when rules are absent, underscoring its superior ability to apply common sense and accurately predict game states.

GPT-4 vs GPT-3.5 Performance in Game Simulations

Abstract and 1. Introduction and Related Work

  2. Methodology

    2.1 LLM-Sim Task

    2.2 Data

    2.3 Evaluation

  3. Experiments

  4. Results

  5. Conclusion

  6. Limitations and Ethical Concerns, Acknowledgements, and References

A. Model details

B. Game transition examples

C. Game rules generation

D. Prompts

E. GPT-3.5 results

F. Histograms

D Prompts

The prompts introduced in this section include game rules, which can be either human-written or LLM-generated. For experiments without game rules, we simply remove the rules from the corresponding prompts.
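As a rough illustration of the "remove the rules" setting, the sketch below assembles a prompt from parts and leaves out the rules block when no rules are supplied. The function name `build_prompt`, its parameters, and the section labels are hypothetical and are not taken from the paper's actual prompts.

```python
from typing import Optional


def build_prompt(task_description: str, state_json: str, action: str,
                 rules: Optional[str] = None) -> str:
    """Assemble a simulation prompt; the rules block is omitted when no rules are given."""
    parts = [task_description]
    if rules:
        # Human-written or LLM-generated rules go in the same slot (assumed layout).
        parts.append("Game rules:\n" + rules)
    parts.append("Current state:\n" + state_json)
    parts.append("Action: " + action)
    return "\n\n".join(parts)


# Hypothetical inputs, for illustration only.
with_rules = build_prompt(
    "Predict the next game state.",
    '{"objects": []}',
    "look",
    rules="Taking an object moves it to the inventory.",
)
no_rules = build_prompt("Predict the next game state.", '{"objects": []}', "look")
```

Keeping the rules in a single optional slot means the three experimental settings differ only in what is passed as `rules`, so the rest of the prompt stays identical across conditions.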

D.1 Prompt Example: F_act

D.1.1 Full State Prediction

D.1.2 State Difference Prediction

D.2 Prompt Example: F_env

D.2.1 Full State Prediction

D.2.2 State Difference Prediction

D.3 Prompt Example: F_R (Game Progress)

D.4 Prompt Example: F

D.4.1 Full State Prediction

D.4.2 State Difference Prediction

D.5 Other Examples

Below is an example of the rule of an action:

Below is an example of the rule of an object:

Below is an example of the score rule:

Below is an example of a game state:

Table 5: Average accuracy per game of GPT-3.5 predicting whole state transitions (F) as well as action-driven transitions (F_act) and environment-driven transitions (F_env). We report settings that use LLM-generated rules, human-written rules, or no rules. Dynamic and static denote whether or not the game object properties and game progress should change; full and diff denote whether the prediction outcome is the full game state or the state differences. Numbers are shown as percentages.

Table 6: GPT-3.5 game progress prediction results.

Below is an example of a JSON object that describes the difference between two game states:
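To make the idea of a state difference concrete, here is a minimal sketch that computes and re-applies a diff between two game states. The schema (a dictionary of object ids mapped to property dictionaries, with `modified` and `removed` keys in the diff) is an assumption made for illustration and is not the paper's exact JSON format. As a simplification, properties deleted from a surviving object are not tracked.

```python
from typing import Any, Dict

State = Dict[str, Dict[str, Any]]


def state_diff(before: State, after: State) -> Dict[str, Any]:
    """Return changed properties per object, plus ids of removed objects."""
    modified: Dict[str, Dict[str, Any]] = {}
    for obj_id, props in after.items():
        old = before.get(obj_id)
        if old is None:
            # New object: report all of its properties.
            modified[obj_id] = dict(props)
            continue
        changed = {k: v for k, v in props.items() if k not in old or old[k] != v}
        if changed:
            modified[obj_id] = changed
    removed = [obj_id for obj_id in before if obj_id not in after]
    return {"modified": modified, "removed": removed}


def apply_diff(before: State, diff: Dict[str, Any]) -> State:
    """Rebuild the next state from the previous state and a diff."""
    result = {obj_id: dict(props) for obj_id, props in before.items()
              if obj_id not in diff["removed"]}
    for obj_id, changed in diff["modified"].items():
        result.setdefault(obj_id, {}).update(changed)
    return result


# Hypothetical states, for illustration only.
before = {"sink": {"isOn": False}, "cup": {"contains": "nothing", "location": "sink"}}
after = {"sink": {"isOn": True}, "cup": {"contains": "water", "location": "sink"}}
diff = state_diff(before, after)
```

In this toy case the diff mentions only `isOn` for the sink and `contains` for the cup, while the unchanged `location` property is left out. That is the practical distinction between the full state and state difference prediction settings.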

E GPT-3.5 results

Table 5 and Table 6 show the performance of a GPT-3.5 simulator predicting object properties and game progress, respectively. There is a large gap between GPT-4 and GPT-3.5 performance, providing yet another example of how quickly LLMs have developed over the past two years. It is also worth noting that the performance difference is larger when no rules are provided, indicating that GPT-3.5 is especially weak at applying common-sense knowledge to this few-shot world simulation task.
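One plausible reading of "average accuracy per game" in Table 5 is a macro-average: accuracy is computed within each game first, and the per-game percentages are then averaged so that every game counts equally. The sketch below follows that reading. The function names and the sample data are hypothetical, and the paper may aggregate differently.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def per_game_accuracy(results: Iterable[Tuple[str, bool]]) -> Dict[str, float]:
    """Map each game name to its percentage of correctly predicted transitions."""
    correct: Dict[str, int] = defaultdict(int)
    total: Dict[str, int] = defaultdict(int)
    for game, is_correct in results:
        total[game] += 1
        correct[game] += int(is_correct)
    return {game: 100.0 * correct[game] / total[game] for game in total}


def macro_average(per_game: Dict[str, float]) -> float:
    """Average the per-game percentages so that every game counts equally."""
    return sum(per_game.values()) / len(per_game)


# Made-up outcomes as (game, prediction_was_correct) pairs, for illustration only.
results = [
    ("game_a", True), ("game_a", False),
    ("game_b", True), ("game_b", True), ("game_b", True), ("game_b", False),
]
per_game = per_game_accuracy(results)
average = macro_average(per_game)
```

With this toy data the macro-average differs from the pooled accuracy over all six transitions, which is why the choice of aggregation matters when comparing models across games of different sizes.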

F Histograms

1. In Figure 3, we show detailed experimental results on the full state prediction task performed by GPT-4.

2. In Figure 4, we show detailed experimental results on the state difference prediction task performed by GPT-4.

3. In Figure 5, we show detailed experimental results on the full state prediction task performed by GPT-3.5.

4. In Figure 6, we show detailed experimental results on the state difference prediction task performed by GPT-3.5.

Table 7: Description of object properties mentioned in Figure 2

Figure 3: GPT-4 full state prediction with (a) human-generated rules, (b) LLM-generated rules, and (c) no rules.

Figure 4: GPT-4 state difference prediction with (a) human-generated rules, (b) LLM-generated rules, and (c) no rules.

Figure 5: GPT-3.5 full state prediction with (a) human-generated rules, (b) LLM-generated rules, and (c) no rules.

Figure 6: GPT-3.5 state difference prediction with (a) human-generated rules, (b) LLM-generated rules, and (c) no rules.

:::info Authors:

(1) Ruoyao Wang, University of Arizona (ruoyaowang@arizona.edu);

(2) Graham Todd, New York University (gdrtodd@nyu.edu);

(3) Ziang Xiao, Johns Hopkins University (ziang.xiao@jhu.edu);

(4) Xingdi Yuan, Microsoft Research Montréal (eric.yuan@microsoft.com);

(5) Marc-Alexandre Côté, Microsoft Research Montréal (macote@microsoft.com);

(6) Peter Clark, Allen Institute for AI (PeterC@allenai.org);

(7) Peter Jansen, University of Arizona and Allen Institute for AI (pajansen@arizona.edu).

:::


:::info This paper is available on arXiv under a CC BY 4.0 license.

:::
