Large language models hold promise as simulators of virtual environments, but new benchmarking with BYTESIZED32 shows that even GPT-4 falls short. While LLMs can generate plausible outcomes, they often fail at capturing complex state transitions requiring arithmetic, common sense, or scientific reasoning. This research highlights both their potential and current limitations, offering a novel benchmark for tracking progress as models evolve.

Are Large Language Models the Future of Game State Simulation?

By: Hackernoon
2025/09/24 17:00

:::info Authors:

(1) Ruoyao Wang, University of Arizona;

(2) Graham Todd, New York University;

(3) Ziang Xiao, Johns Hopkins University;

(4) Xingdi Yuan, Microsoft Research Montréal;

(5) Marc-Alexandre Côté, Microsoft Research Montréal;

(6) Peter Clark, Allen Institute for AI;

(7) Peter Jansen, University of Arizona and Allen Institute for AI.

:::

Abstract and 1. Introduction and Related Work

  2. Methodology

    2.1 LLM-Sim Task

    2.2 Data

    2.3 Evaluation

  3. Experiments

  4. Results

  5. Conclusion

  6. Limitations and Ethical Concerns, Acknowledgements, and References

A. Model details

B. Game transition examples

C. Game rules generation

D. Prompts

E. GPT-3.5 results

F. Histograms

Abstract

Virtual environments play a key role in benchmarking advances in complex planning and decision-making tasks but are expensive and complicated to build by hand. Can current language models themselves serve as world simulators, correctly predicting how actions change different world states, thus bypassing the need for extensive manual coding? Our goal is to answer this question in the context of text-based simulators. Our approach is to build and use a new benchmark, called BYTESIZED32-State-Prediction, containing a dataset of text game state transitions and accompanying game tasks. We use this to directly quantify, for the first time, how well LLMs can serve as text-based world simulators. We test GPT-4 on this dataset and find that, despite its impressive performance, it is still an unreliable world simulator without further innovations. This work thus contributes both new insights into current LLMs' capabilities and weaknesses and a novel benchmark to track future progress as new models appear.

1 Introduction and Related Work

Simulating the world is crucial for studying and understanding it. In many cases, however, the breadth and depth of available simulations are limited by the fact that their implementation requires extensive work from a team of human experts over weeks or months. Recent advances in large language models (LLMs) have pointed towards an alternate approach by leveraging the huge amount of knowledge contained in their pre-training datasets. But are they ready to be used directly as simulators?

We examine this question in the domain of text-based games, which naturally express the environment and its dynamics in natural language and have long been used as part of advances in decision-making processes (Côté et al., 2018; Fan et al., 2020; Urbanek et al., 2019; Shridhar et al., 2020; Hausknecht et al., 2020; Jansen, 2022; Wang et al., 2023), information extraction (Ammanabrolu and Hausknecht, 2020; Adhikari et al., 2020), and artificial reasoning (Wang et al., 2022).

Figure 1: An overview of our two approaches to using an LLM as a text game simulator. The example shows a cup in the sink being filled with water after the sink is turned on. The full state prediction includes every object in the game, including the unrelated stove, while the state difference prediction excludes the unrelated stove. State changes caused by F_act and F_env are highlighted in yellow and green, respectively.
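To make the caption's distinction concrete, here is a minimal Python sketch of the cup-and-sink example. It is an illustration under stated assumptions, not the authors' implementation: the state layout, the property names (`is_on`, `contains`), and the helper functions `apply_action`, `apply_environment`, and `state_difference` are all hypothetical stand-ins for the action-driven (F_act) and environment-driven (F_env) changes the figure describes.

```python
import copy
import json


def apply_action(state, action):
    """Hypothetical stand-in for F_act: changes caused directly by the agent's action."""
    new_state = copy.deepcopy(state)
    if action == "turn on sink":
        new_state["sink"]["is_on"] = True
    return new_state


def apply_environment(state):
    """Hypothetical stand-in for F_env: changes caused by the environment's own dynamics."""
    new_state = copy.deepcopy(state)
    if new_state["sink"]["is_on"]:
        # A running sink fills every container sitting inside it.
        for name in new_state["sink"]["contains"]:
            new_state[name]["contains"] = ["water"]
    return new_state


def state_difference(before, after):
    """Keep only the objects whose properties changed between the two states."""
    return {name: props for name, props in after.items() if before[name] != props}


# Assumed toy game state: a sink holding a cup, plus an unrelated stove.
state = {
    "sink": {"is_on": False, "contains": ["cup"]},
    "cup": {"contains": []},
    "stove": {"is_on": False},
}

after_action = apply_action(state, "turn on sink")
full_prediction = apply_environment(after_action)           # every object, stove included
diff_prediction = state_difference(state, full_prediction)  # changed objects only

print(json.dumps(full_prediction, indent=2))
print(json.dumps(diff_prediction, indent=2))
```

In this sketch the full prediction repeats the untouched stove, while the difference prediction lists only the sink and the cup. That is the trade-off the two output formats in the figure represent.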

Broadly speaking, there are two ways to leverage LLMs in the context of world modeling and simulation. The first is neurosymbolic: a number of efforts use language models to generate code in a symbolic representation that allows for formal planning or inference (Liu et al., 2023; Nottingham et al., 2023; Wong et al., 2023; Tang et al., 2024). REASONING VIA PLANNING (RAP) (Hao et al., 2023) is one such approach: it constructs a world model using LLM priors and then uses a dedicated planning algorithm to decide on agent policies (LLMs themselves continue to struggle to act directly as planners (Valmeekam et al., 2023)). Similarly, BYTESIZED32 (Wang et al., 2023) tasks LLMs with instantiating simulations of scientific reasoning concepts in the form of large PYTHON programs. These efforts are in contrast to the second, and comparatively less studied, approach of direct simulation. For instance, AI-DUNGEON represents a game world purely through the generated output of a language model, with inconsistent results (Walton, 2020). In this work, we provide the first quantitative analysis of the abilities of LLMs to directly simulate virtual environments. We make use of structured representations in a JSON schema as a scaffold that both improves simulation accuracy and allows us to directly probe the LLM’s abilities across a variety of conditions.

In a systematic analysis of GPT-4 (Achiam et al., 2023), we find that LLMs broadly fail to capture state transitions not directly related to agent actions, as well as transitions that require arithmetic, common-sense, or scientific reasoning. Across a variety of conditions, model accuracy does not exceed 59.9% for transitions in which a non-trivial change in the world state occurs. These results suggest that, while promising and useful for downstream tasks, LLMs are not yet ready to act as reliable world simulators without further innovation.[1]
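The sketch below suggests why accuracy on transitions with a non-trivial state change is worth reporting on its own. It is a toy illustration, not the paper's evaluation code: the exact-match scoring rule, the example transitions, and the names `is_nontrivial`, `accuracy`, and `copy_state_baseline` are all assumptions made for this example, and the numbers it produces are not results from the paper.

```python
def is_nontrivial(before, after):
    """A transition is non-trivial here if the world state actually changes."""
    return before != after


def accuracy(examples, predict):
    """Fraction of examples whose predicted next state exactly matches the gold state."""
    if not examples:
        return 0.0
    correct = sum(
        1 for ex in examples if predict(ex["before"], ex["action"]) == ex["after"]
    )
    return correct / len(examples)


def copy_state_baseline(state, action):
    """Hypothetical baseline that ignores the action and predicts no change."""
    return state


# Assumed toy transitions (illustrative only).
examples = [
    {"before": {"sink": {"is_on": False}}, "action": "turn on sink",
     "after": {"sink": {"is_on": True}}},
    {"before": {"sink": {"is_on": True}}, "action": "look",
     "after": {"sink": {"is_on": True}}},
    {"before": {"stove": {"is_on": False}}, "action": "turn on stove",
     "after": {"stove": {"is_on": True}}},
]

dynamic = [ex for ex in examples if is_nontrivial(ex["before"], ex["after"])]
static = [ex for ex in examples if not is_nontrivial(ex["before"], ex["after"])]

print("overall:", accuracy(examples, copy_state_baseline))
print("non-trivial only:", accuracy(dynamic, copy_state_baseline))
print("unchanged only:", accuracy(static, copy_state_baseline))
```

A predictor that simply copies the input state is perfect on unchanged transitions and scores zero on the ones where something happens, so separating the two groups keeps easy cases from masking failures on real state changes.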


:::info This paper is available on arXiv under a CC BY 4.0 license.

:::

[1] Code and data are available at https://github.com/cognitiveailab/GPT-simulator.
