date 2025-09-24

Abstract and 1. Introduction and Related Work

This work considers two strong in-context learning LLMs, GPT-3.5 and GPT-4, in their ability to act as explicit formal simulators. We adopt these models because they are generally the most performant off-the-shelf models across a variety of benchmarks. While we observe that even GPT-3.5 and GPT-4 achieve only a modest score on the proposed task, we acknowledge that we did not exhaustively evaluate a large selection of large language models, and other models may perform better. We provide this work as a benchmark for evaluating the performance of existing and future models on the task of accurately simulating state space transitions.

In this work, we propose two formalisms for representing state spaces: one that includes the full state space, and one that captures only the difference between states. Both are represented as JSON objects. We chose these representations for their popularity and their compatibility with the input and output formats of most LLM pretraining data (e.g. Fakhoury et al., 2023), and because they allow direct comparison against gold-standard simulator output for evaluation. It is possible, however, that other representational formats may be more performant at the simulation task.
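To make the distinction between the two representations concrete, the following is a minimal sketch in Python. The schema is an assumption made purely for illustration: the object names (`stove`, `pot`), the property names, the `modified`/`removed` layout of the difference object, and the helper functions `state_diff` and `apply_diff` are all hypothetical and are not taken from the benchmark itself. The sketch also ignores the removal of individual properties from an object.

```python
import copy
import json

_MISSING = object()  # sentinel: distinguishes "property absent" from a None value


def state_diff(before, after):
    """Return a hypothetical state-difference object that turns `before` into `after`."""
    modified = {}
    for obj_id, props in after.items():
        old_props = before.get(obj_id, {})
        # Keep only the properties whose value is new or has changed.
        changed = {k: v for k, v in props.items() if old_props.get(k, _MISSING) != v}
        if changed or obj_id not in before:
            modified[obj_id] = changed
    removed = [obj_id for obj_id in before if obj_id not in after]
    return {"modified": modified, "removed": removed}


def apply_diff(before, diff):
    """Rebuild the full next state from the previous full state and a difference object."""
    after = copy.deepcopy(before)
    for obj_id in diff["removed"]:
        after.pop(obj_id, None)
    for obj_id, changed in diff["modified"].items():
        after.setdefault(obj_id, {}).update(copy.deepcopy(changed))
    return after


# Full-state representation: every object and property is listed.
before = {
    "stove": {"is_on": False},
    "pot": {"contains": ["water"], "temperature_c": 20},
}
# Assumed gold-standard next state after the action "turn on stove".
gold_after = {
    "stove": {"is_on": True},
    "pot": {"contains": ["water"], "temperature_c": 20},
}

# State-difference representation: only what changed is listed.
gold_diff = state_diff(before, gold_after)
print(json.dumps(gold_diff))

# Either representation can be checked against the gold state by exact match.
print(apply_diff(before, gold_diff) == gold_after)
```

Under this sketch, a predicted difference object is first applied to the previous state, so that both representations can be scored by the same exact-match comparison against the gold-standard state. This is one possible design, not necessarily the one used in the evaluation.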

Finally, the state spaces produced in this work are focused on the domain of common-sense and early (elementary) scientific reasoning. These tasks, such as opening containers or activating devices, were chosen because the results of these actions are common knowledge, and models are likely to be most performant in simulating them. While this work does address a selection of less frequent actions and properties, it does not address using LLMs as simulators for highly domain-specific areas, such as physical or medical simulation. A long-term goal of this work is to facilitate using language models as simulators for high-impact domains, and we view this work as a stepping stone toward developing progressively more capable language model simulators.

We do not foresee an immediate ethical or societal impact resulting from our work. However, we acknowledge that, as an LLM application, the proposed LLM-Sim task could be affected by misinformation and hallucinations introduced by the specific LLM selected by the user. Our work highlights the issues with using LLMs as text-based world simulators. In downstream tasks, such as game simulation, LLMs may generate misleading or non-factual information. For example, if the simulator suggests burning a house to boil water, our work does not prevent this, nor do we evaluate the ethical implications of such potentially dangerous suggestions. As a result, we believe such applications are neither suitable nor safe to deploy in settings where they interact directly with humans, especially children, e.g., in an educational setting. We urge researchers and practitioners to use our proposed task and dataset in a mindful manner.

We wish to thank the three anonymous reviewers for their helpful comments on an earlier draft of this paper.

(1) Ruoyao Wang, University of Arizona ([email protected]);

(2) Graham Todd, New York University ([email protected]);

(3) Ziang Xiao, Johns Hopkins University ([email protected]);

(4) Xingdi Yuan, Microsoft Research Montréal ([email protected]);

(5) Marc-Alexandre Côté, Microsoft Research Montréal ([email protected]);

(6) Peter Clark, Allen Institute for AI ([email protected]);

(7) Peter Jansen, University of Arizona and Allen Institute for AI ([email protected]).

