This article details the experimental setup for evaluating RECKONING, a novel bi-level learning algorithm, on three diverse multi-hop logical reasoning datasetsThis article details the experimental setup for evaluating RECKONING, a novel bi-level learning algorithm, on three diverse multi-hop logical reasoning datasets

Evaluating Dynamic Knowledge Encoding: Experimental Setup for Multi-Hop Logical Reasoning

2025/10/24 09:15
3 min read
For feedback or concerns regarding this content, please contact us at crypto.news@mexc.com

Abstract and 1. Introduction

  1. Background

  2. Method

  3. Experiments

    4.1 Multi-hop Reasoning Performance

    4.2 Reasoning with Distractors

    4.3 Generalization to Real-World knowledge

    4.4 Run-time Analysis

    4.5 Memorizing Knowledge

  4. Related Work

  5. Conclusion, Acknowledgements, and References

\ A. Dataset

B. In-context Reasoning with Distractors

C. Implementation Details

D. Adaptive Learning Rate

E. Experiments with Large Language Models

4 Experiments

Setup We conduct our experiments on three datasets focusing on multi-hop logical reasoning over natural language knowledge: ProofWriter [73], which measures the model’s ability to emulate reasoning over facts and rules expressed in natural language; CLUTRR-SG [28], which is generated from the CLUTRR [71] benchmark, a logical reasoning task that involves reasoning over family relationships between entities grounded in first-order logical proofs; and FOLIO [29], a reasoning benchmark with first-order logical reasoning problems written by expert annotators based on real-world knowledge. Each problem in these datasets requires multiple reasoning hops to answer.[1]

\ We compare our method against the following baselines: (1) a fine-tuned model that performs a forward pass on only the question without access to the knowledge (No-Facts), (2) a fine-tuned model that performs a forward pass on only the knowledge without access to the question (No-Question), (3) a model trained using RECKONING with random knowledge that is not relevant to the questions (Random-Facts), and (4) an ICR baseline that concatenates the knowledge K with the question x in a single context and is trained using supervised learning to predict the answer (FT-ICR). Our first three baselines sanity-check whether any surface-level patterns in the questions and facts can be exploited to make accurate predictions. The last baseline compares RECKONING to the conventional way of reasoning with language models. Unless stated otherwise, we use the GPT-2-small [59] model (∼124M parameters) as our initialization and refer by RECKONING to our method trained with the multi-task objective. We compute each score from the average across three different runs. For more details on the implementation, datasets, and examples, see Appendix A and Appendix C.

\

:::info Authors:

(1) Zeming Chen, EPFL (zeming.chen@epfl.ch);

(2) Gail Weiss, EPFL (antoine.bosselut@epfl.ch);

(3) Eric Mitchell, Stanford University (eric.mitchell@cs.stanford.edu)';

(4) Asli Celikyilmaz, Meta AI Research (aslic@meta.com);

(5) Antoine Bosselut, EPFL (antoine.bosselut@epfl.ch).

:::


:::info This paper is available on arxiv under CC BY 4.0 DEED license.

:::

[1] In ProofWriter, the number of reasoning hops is called the proof depth. To unify the presentation of the results, we use the term “hop” to describe the number of reasoning steps for both datasets.

Disclaimer: The articles reposted on this site are sourced from public platforms and are provided for informational purposes only. They do not necessarily reflect the views of MEXC. All rights remain with the original authors. If you believe any content infringes on third-party rights, please contact crypto.news@mexc.com for removal. MEXC makes no guarantees regarding the accuracy, completeness, or timeliness of the content and is not responsible for any actions taken based on the information provided. The content does not constitute financial, legal, or other professional advice, nor should it be considered a recommendation or endorsement by MEXC.
Tags:

You May Also Like

Stake.com Built The Crypto Casino Name. Bet365 Built The Sportsbook Name. ZunaBet Is Building What Comes After Both.

Stake.com Built The Crypto Casino Name. Bet365 Built The Sportsbook Name. ZunaBet Is Building What Comes After Both.

Name recognition in online gambling is built on specialisation. Stake.com built its name by specialising in the crypto gambling community — a platform that understood
Share
Blockonomi2026/05/09 22:45
Why The Green Bay Packers Must Take The Cleveland Browns Seriously — As Hard As That Might Be

Why The Green Bay Packers Must Take The Cleveland Browns Seriously — As Hard As That Might Be

The post Why The Green Bay Packers Must Take The Cleveland Browns Seriously — As Hard As That Might Be appeared on BitcoinEthereumNews.com. Jordan Love and the Green Bay Packers are off to a 2-0 start. Getty Images The Green Bay Packers are, once again, one of the NFL’s better teams. The Cleveland Browns are, once again, one of the league’s doormats. It’s why unbeaten Green Bay (2-0) is a 8-point favorite at winless Cleveland (0-2) Sunday according to betmgm.com. The money line is also Green Bay -500. Most expect this to be a Packers’ rout, and it very well could be. But Green Bay knows taking anyone in this league for granted can prove costly. “I think if you look at their roster, the paper, who they have on that team, what they can do, they got a lot of talent and things can turn around quickly for them,” Packers safety Xavier McKinney said. “We just got to kind of keep that in mind and know we not just walking into something and they just going to lay down. That’s not what they going to do.” The Browns certainly haven’t laid down on defense. Far from. Cleveland is allowing an NFL-best 191.5 yards per game. The Browns gave up 141 yards to Cincinnati in Week 1, including just seven in the second half, but still lost, 17-16. Cleveland has given up an NFL-best 45.5 rushing yards per game and just 2.1 rushing yards per attempt. “The biggest thing is our defensive line is much, much improved over last year and I think we’ve got back to our personality,” defensive coordinator Jim Schwartz said recently. “When we play our best, our D-line leads us there as our engine.” The Browns rank third in the league in passing defense, allowing just 146.0 yards per game. Cleveland has also gone 30 straight games without allowing a 300-yard passer, the longest active streak in the NFL.…
Share
BitcoinEthereumNews2025/09/18 00:41
Crypto Market Drops as Fear Grows and Major Assets Decline

Crypto Market Drops as Fear Grows and Major Assets Decline

Crypto market falls 2.53% as Bitcoin ($BTC) and Ethereum (ETH) drop, while investor fear rises and NFT sales surge sharply despite DeFi slowdown
Share
Blockchainreporter2026/04/02 18:20

KAIO Global Debut

KAIO Global DebutKAIO Global Debut

Enjoy 0-fee KAIO trading and tap into the RWA boom