This article benchmarks GPT-3.5 and GPT-4 as formal simulators, testing their ability to model state spaces in common-sense and early scientific reasoning tasks. While these models show promise, they achieve only modest accuracy and raise ethical concerns, particularly around misinformation and unsafe outputs. The study highlights both the potential and the risks of using LLMs for simulations, framing the work as an early step toward more capable and responsible AI simulators.

AI Models Can't Be Trusted in High-Stakes Simulations Just Yet

Abstract and 1. Introduction and Related Work

  2. Methodology

    2.1 LLM-Sim Task

    2.2 Data

    2.3 Evaluation

  3. Experiments

  4. Results

  5. Conclusion

  6. Limitations and Ethical Concerns, Acknowledgements, and References

A. Model details

B. Game transition examples

C. Game rules generation

D. Prompts

E. GPT-3.5 results

F. Histograms


5 Conclusion


6 Limitations and Ethical Concerns

6.1 Limitations

This work considers two strong in-context learning LLMs, GPT-3.5 and GPT-4, in their ability to act as explicit formal simulators. We adopt these models because they are generally among the most performant off-the-shelf models across a variety of benchmarks. While we observe that even GPT-3.5 and GPT-4 achieve only modest scores on the proposed task, we acknowledge that we did not exhaustively evaluate a large selection of large language models, and other models may perform better. We provide this work as a benchmark for evaluating the performance of existing and future models on the task of accurately simulating state space transitions.
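The core of such a benchmark is comparing a model's predicted next-state JSON object against the gold simulator's output. The following Python sketch is illustrative only, not the paper's actual evaluation code; the function and variable names are our own. It scores exact matches over parsed JSON objects rather than raw strings, so formatting differences do not count as errors:

```python
import json

# A minimal sketch of scoring model-predicted next-state JSON
# objects against gold simulator output via exact match.
def exact_match_accuracy(predictions, golds):
    """Fraction of predicted states that exactly match the gold states."""
    matches = sum(
        1 for pred, gold in zip(predictions, golds)
        if json.loads(pred) == json.loads(gold)  # compare parsed objects, not strings
    )
    return matches / len(golds)

preds = ['{"door": {"open": true}}', '{"lamp": {"on": true}}']
golds = ['{"door": {"open": true}}', '{"lamp": {"on": false}}']
print(exact_match_accuracy(preds, golds))  # 0.5
```

Comparing parsed objects also makes the metric robust to key ordering, which LLM outputs do not reliably preserve.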

In this work, we propose two representational formalisms for state spaces: one that represents the full state space, while the other represents only state differences, both expressed as JSON objects. We chose these representations for their popularity and compatibility with the input and output formats of most LLM pretraining data (e.g., Fakhoury et al., 2023), and because they can be directly compared against gold-standard simulator output for evaluation, though other representational formats may prove more performant at the simulation task.
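To make the distinction between the two formalisms concrete, the sketch below (with hypothetical object and property names, not drawn from the paper's data) contrasts a full-state JSON representation with the more compact state-difference form, computed from two full states:

```python
import json

# Full-state representation: the complete game state as a JSON object.
state_before = {
    "stove": {"isOn": False, "temperature": 20},
    "pot": {"contains": "water", "temperature": 20},
}
state_after = {
    "stove": {"isOn": True, "temperature": 100},
    "pot": {"contains": "water", "temperature": 20},
}

def state_difference(before, after):
    """Return only the objects whose properties changed (state-difference form)."""
    diff = {}
    for obj, props in after.items():
        changed = {k: v for k, v in props.items()
                   if before.get(obj, {}).get(k) != v}
        if changed:
            diff[obj] = changed
    return diff

# Only the stove changed, so the difference form omits the pot entirely.
print(json.dumps(state_difference(state_before, state_after)))
```

A model predicting in the difference form emits far fewer tokens per transition, at the cost of requiring the consumer to apply the diff to the previous full state.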

Finally, the state spaces produced in this work focus on the domain of common-sense and early (elementary) scientific reasoning. These tasks, such as opening containers or activating devices, were chosen because the results of these actions are common knowledge, so models are likely to be most performant at simulating them. While this work does address a selection of less frequent actions and properties, it does not address using LLMs as simulators for highly domain-specific areas, such as physical or medical simulation. A long-term goal of this work is to facilitate using language models as simulators for high-impact domains, and we view this work as a stepping stone toward developing progressively more capable language model simulators.

6.2 Ethical Concerns

We do not foresee an immediate ethical or societal impact resulting from our work. However, we acknowledge that, as an LLM application, the proposed LLM-Sim task could be affected by misinformation and hallucinations introduced by the specific LLM a user selects. Our work highlights the issues with using LLMs as text-based world simulators. In downstream tasks, such as game simulation, LLMs may generate misleading or non-factual information. For example, if the simulator suggests burning down a house to boil water, our work neither prevents this nor evaluates the ethical implications of such potentially dangerous suggestions. As a result, we believe such applications are neither suitable nor safe to deploy in settings where they directly interact with humans, especially children, e.g., in an educational setting. We urge researchers and practitioners to use our proposed task and dataset mindfully.

Acknowledgements

We wish to thank the three anonymous reviewers for their helpful comments on an earlier draft of this paper.

References

Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, et al. 2023. GPT-4 technical report. arXiv preprint arXiv:2303.08774.

Ashutosh Adhikari, Xingdi Yuan, Marc-Alexandre Côté, Mikuláš Zelinka, Marc-Antoine Rondeau, Romain Laroche, Pascal Poupart, Jian Tang, Adam Trischler, and Will Hamilton. 2020. Learning dynamic belief graphs to generalize on text-based games. Advances in Neural Information Processing Systems, 33:3045–3057.

Prithviraj Ammanabrolu and Matthew Hausknecht. 2020. Graph constrained reinforcement learning for natural language action spaces. arXiv preprint arXiv:2001.08837.

Marc-Alexandre Côté, Ákos Kádár, Xingdi Yuan, Ben Kybartas, Tavian Barnes, Emery Fine, James Moore, Ruo Yu Tao, Matthew Hausknecht, Layla El Asri, Mahmoud Adada, Wendy Tay, and Adam Trischler. 2018. TextWorld: A learning environment for text-based games. CoRR, abs/1806.11532.

Sarah Fakhoury, Saikat Chakraborty, Madan Musuvathi, and Shuvendu K Lahiri. 2023. Towards generating functionally correct code edits from natural language issue descriptions. arXiv preprint arXiv:2304.03816.

Angela Fan, Jack Urbanek, Pratik Ringshia, Emily Dinan, Emma Qian, Siddharth Karamcheti, Shrimai Prabhumoye, Douwe Kiela, Tim Rocktaschel, Arthur Szlam, and Jason Weston. 2020. Generating interactive worlds with text. Proceedings of the AAAI Conference on Artificial Intelligence, 34(02):1693–1700.

Shibo Hao, Yi Gu, Haodi Ma, Joshua Hong, Zhen Wang, Daisy Wang, and Zhiting Hu. 2023. Reasoning with language model is planning with world model. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 8154–8173.

Matthew Hausknecht, Prithviraj Ammanabrolu, Marc-Alexandre Côté, and Xingdi Yuan. 2020. Interactive fiction games: A colossal adventure. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 34, pages 7903–7910.

Peter Jansen. 2022. A systematic survey of text worlds as embodied natural language environments. In Proceedings of the 3rd Wordplay: When Language Meets Games Workshop (Wordplay 2022), pages 1–15.

Leslie Pack Kaelbling, Michael L Littman, and Anthony R Cassandra. 1998. Planning and acting in partially observable stochastic domains. Artificial Intelligence, 101(1-2):99–134.

Bo Liu, Yuqian Jiang, Xiaohan Zhang, Qiang Liu, Shiqi Zhang, Joydeep Biswas, and Peter Stone. 2023. LLM+P: Empowering large language models with optimal planning proficiency. arXiv preprint arXiv:2304.11477.

Kolby Nottingham, Prithviraj Ammanabrolu, Alane Suhr, Yejin Choi, Hannaneh Hajishirzi, Sameer Singh, and Roy Fox. 2023. Do embodied agents dream of pixelated sheep: Embodied decision making using language guided world modelling. In International Conference on Machine Learning, pages 26311–26325. PMLR.

Mohit Shridhar, Xingdi Yuan, Marc-Alexandre Côté, Yonatan Bisk, Adam Trischler, and Matthew Hausknecht. 2020. ALFWorld: Aligning text and embodied environments for interactive learning. arXiv preprint arXiv:2010.03768.

Hao Tang, Darren Key, and Kevin Ellis. 2024. WorldCoder, a model-based LLM agent: Building world models by writing code and interacting with the environment. arXiv preprint arXiv:2402.12275.

Jack Urbanek, Angela Fan, Siddharth Karamcheti, Saachi Jain, Samuel Humeau, Emily Dinan, Tim Rocktäschel, Douwe Kiela, Arthur Szlam, and Jason Weston. 2019. Learning to speak and act in a fantasy text adventure game.

Karthik Valmeekam, Matthew Marquez, Sarath Sreedharan, and Subbarao Kambhampati. 2023. On the planning abilities of large language models - a critical investigation. Advances in Neural Information Processing Systems, 36:75993–76005.

Nick Walton. 2020. How we scaled AI Dungeon 2 to support over 1,000,000 users.

Ruoyao Wang, Peter Jansen, Marc-Alexandre Côté, and Prithviraj Ammanabrolu. 2022. ScienceWorld: Is your agent smarter than a 5th grader? In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, pages 11279–11298.

Ruoyao Wang, Graham Todd, Xingdi Yuan, Ziang Xiao, Marc-Alexandre Côté, and Peter Jansen. 2023. ByteSized32: A corpus and challenge task for generating task-specific world models expressed as text games. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 13455–13471, Singapore. Association for Computational Linguistics.

Lionel Wong, Gabriel Grand, Alexander K Lew, Noah D Goodman, Vikash K Mansinghka, Jacob Andreas, and Joshua B Tenenbaum. 2023. From word models to world models: Translating from natural language to the probabilistic language of thought. arXiv preprint arXiv:2306.12672.


:::info Authors:

(1) Ruoyao Wang, University of Arizona (ruoyaowang@arizona.edu);

(2) Graham Todd, New York University (gdrtodd@nyu.edu);

(3) Ziang Xiao, Johns Hopkins University (ziang.xiao@jhu.edu);

(4) Xingdi Yuan, Microsoft Research Montréal (eric.yuan@microsoft.com);

(5) Marc-Alexandre Côté, Microsoft Research Montréal (macote@microsoft.com);

(6) Peter Clark, Allen Institute for AI (PeterC@allenai.org);

(7) Peter Jansen, University of Arizona and Allen Institute for AI (pajansen@arizona.edu).

:::


:::info This paper is available on arxiv under CC BY 4.0 license.

:::

