Large language models hold promise as simulators of virtual environments, but new benchmarking with BYTESIZED32 shows that even GPT-4 falls short. While LLMs can generate plausible outcomes, they often fail at capturing complex state transitions requiring arithmetic, common sense, or scientific reasoning. This research highlights both their potential and current limitations, offering a novel benchmark for tracking progress as models evolve.

Are Large Language Models the Future of Game State Simulation?

:::info Authors:

(1) Ruoyao Wang, University of Arizona (ruoyaowang@arizona.edu);

(2) Graham Todd, New York University (gdrtodd@nyu.edu);

(3) Ziang Xiao, Johns Hopkins University (ziang.xiao@jhu.edu);

(4) Xingdi Yuan, Microsoft Research Montréal (eric.yuan@microsoft.com);

(5) Marc-Alexandre Côté, Microsoft Research Montréal (macote@microsoft.com);

(6) Peter Clark, Allen Institute for AI (PeterC@allenai.org);

(7) Peter Jansen, University of Arizona and Allen Institute for AI (pajansen@arizona.edu).

:::

Abstract and 1. Introduction and Related Work

  2. Methodology

    2.1 LLM-Sim Task

    2.2 Data

    2.3 Evaluation

  3. Experiments

  4. Results

  5. Conclusion

  6. Limitations and Ethical Concerns, Acknowledgements, and References

A. Model details

B. Game transition examples

C. Game rules generation

D. Prompts

E. GPT-3.5 results

F. Histograms

Abstract

Virtual environments play a key role in benchmarking advances in complex planning and decision-making tasks but are expensive and complicated to build by hand. Can current language models themselves serve as world simulators, correctly predicting how actions change different world states, thus bypassing the need for extensive manual coding? Our goal is to answer this question in the context of text-based simulators. Our approach is to build and use a new benchmark, called BYTESIZED32-State-Prediction, containing a dataset of text game state transitions and accompanying game tasks. We use this to directly quantify, for the first time, how well LLMs can serve as text-based world simulators. We test GPT-4 on this dataset and find that, despite its impressive performance, it is still an unreliable world simulator without further innovations. This work thus contributes both new insights into current LLMs' capabilities and weaknesses, as well as a novel benchmark to track future progress as new models appear.

1 Introduction and Related Work

Simulating the world is crucial for studying and understanding it. In many cases, however, the breadth and depth of available simulations are limited by the fact that their implementation requires extensive work from a team of human experts over weeks or months. Recent advances in large language models (LLMs) have pointed towards an alternate approach by leveraging the huge amount of knowledge contained in their pre-training datasets. But are they ready to be used directly as simulators?

We examine this question in the domain of text-based games, which naturally express the environment and its dynamics in natural language and have long been used as part of advances in decision-making processes (Côté et al., 2018; Fan et al., 2020; Urbanek et al., 2019; Shridhar et al., 2020; Hausknecht et al., 2020; Jansen, 2022; Wang et al., 2023), information extraction (Ammanabrolu and Hausknecht, 2020; Adhikari et al., 2020), and artificial reasoning (Wang et al., 2022).

Figure 1: An overview of our two approaches to using an LLM as a text game simulator. The example shows a cup in the sink being filled with water after the sink is turned on. The full state prediction includes all objects in the game, including the unrelated stove, while the state difference prediction excludes the unrelated stove. State changes caused by F_act and F_env are highlighted in yellow and green, respectively.
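
To make the two output formats in Figure 1 concrete, the sketch below shows what a full state prediction versus a state difference prediction might look like for the sink-and-cup example. The object names and property fields are illustrative assumptions, not the exact BYTESIZED32-State-Prediction schema.

```python
# Illustrative sketch only: object names and fields are assumptions,
# not the benchmark's actual JSON schema.

previous_state = {
    "sink":  {"isOn": False, "containsLiquid": False},
    "cup":   {"containedIn": "sink", "containsLiquid": False},
    "stove": {"isOn": False},  # unrelated to the action
}

action = "turn on sink"

# Full state prediction: the model must emit every object, changed or not.
full_state_prediction = {
    "sink":  {"isOn": True, "containsLiquid": True},           # F_act: direct effect of the action
    "cup":   {"containedIn": "sink", "containsLiquid": True},  # F_env: water fills the cup
    "stove": {"isOn": False},                                  # unchanged, but still required
}

# State difference prediction: the model emits only the objects that changed.
state_difference_prediction = {
    "sink": {"isOn": True, "containsLiquid": True},
    "cup":  {"containsLiquid": True},
}
```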

Broadly speaking, there are two ways to leverage LLMs in the context of world modeling and simulation. The first is neurosymbolic: a number of efforts use language models to generate code in a symbolic representation that allows for formal planning or inference (Liu et al., 2023; Nottingham et al., 2023; Wong et al., 2023; Tang et al., 2024). REASONING VIA PLANNING (RAP) (Hao et al., 2023) is one such approach: it constructs a world model using LLM priors and then uses a dedicated planning algorithm to decide on agent policies (LLMs themselves continue to struggle to act directly as planners (Valmeekam et al., 2023)). Similarly, BYTESIZED32 (Wang et al., 2023) tasks LLMs with instantiating simulations of scientific reasoning concepts in the form of large PYTHON programs. These efforts stand in contrast to the second, and comparatively less studied, approach of direct simulation. For instance, AI-DUNGEON represents a game world purely through the generated output of a language model, with inconsistent results (Walton, 2020). In this work, we provide the first quantitative analysis of the ability of LLMs to directly simulate virtual environments. We use structured JSON representations as a scaffold that both improves simulation accuracy and allows us to directly probe the LLM's abilities across a variety of conditions.
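
As a rough illustration of the direct-simulation setup with a JSON scaffold, the sketch below queries a chat model for the next game state given the current state and a player action. The prompt wording and the helper function are assumptions for illustration; the paper's actual prompts are given in Appendix D.

```python
import json

from openai import OpenAI  # assumes the openai Python client is installed

client = OpenAI()  # reads OPENAI_API_KEY from the environment


def predict_next_state(game_rules: str, state: dict, action: str) -> dict:
    """Ask the model to act as the simulator: given the current game state
    (as JSON) and a player action, return the predicted next state.
    The prompt text is an illustrative stand-in, not the paper's prompt."""
    prompt = (
        "You are a simulator for a text game.\n"
        f"Game rules:\n{game_rules}\n\n"
        f"Current state (JSON):\n{json.dumps(state, indent=2)}\n\n"
        f"Player action: {action}\n\n"
        "Respond with the full next state as JSON only."
    )
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    return json.loads(response.choices[0].message.content)
```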

In a systematic analysis of GPT-4 (Achiam et al., 2023), we find that LLMs broadly fail to capture state transitions not directly related to agent actions, as well as transitions that require arithmetic, common-sense, or scientific reasoning. Across a variety of conditions, model accuracy does not exceed 59.9% for transitions in which a non-trivial change in the world state occurs. These results suggest that, while promising and useful for downstream tasks, LLMs are not yet ready to act as reliable world simulators without further innovation.[1]
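
As a sketch of how accuracy can be reported separately for transitions that actually change the world state, the snippet below splits examples by whether the gold next state differs from the previous state. The example field names are assumptions, not the benchmark's actual data format.

```python
def score_predictions(examples: list[dict]) -> dict:
    """Split accuracy by whether the gold transition changes the world state.
    Each example is assumed (illustratively) to carry the previous state,
    the gold next state, and the model's predicted state."""
    changed_correct = changed_total = 0
    unchanged_correct = unchanged_total = 0
    for ex in examples:
        nontrivial = ex["gold_next_state"] != ex["previous_state"]
        correct = ex["predicted_state"] == ex["gold_next_state"]
        if nontrivial:
            changed_total += 1
            changed_correct += correct
        else:
            unchanged_total += 1
            unchanged_correct += correct
    return {
        "accuracy_on_state_changes": changed_correct / max(changed_total, 1),
        "accuracy_on_no_change": unchanged_correct / max(unchanged_total, 1),
    }
```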


:::info This paper is available on arxiv under CC BY 4.0 license.

:::

[1] Code and data are available at https://github.com/cognitiveailab/GPT-simulator.
