arxiv:2602.09517

Knowledge Integration Decay in Search-Augmented Reasoning of Large Language Models

Published on Feb 10

Authors:

Abstract

Large language models struggle with integrating retrieved knowledge into extended reasoning chains, but a new inference-time method called Self-Anchored Knowledge Encoding helps maintain knowledge integrity and improve performance on complex tasks.

AI-generated summary

Modern Large Language Models (LLMs) have demonstrated remarkable capabilities in complex tasks by employing search-augmented reasoning to incorporate external knowledge into long chains of thought. However, we identify a critical yet underexplored bottleneck in this paradigm, termed Knowledge Integration Decay (KID). Specifically, we observe that as the length of reasoning generated before search grows, models increasingly fail to integrate retrieved evidence into subsequent reasoning steps, limiting performance even when relevant information is available. To address this, we propose Self-Anchored Knowledge Encoding (SAKE), a training-free inference-time strategy designed to stabilize knowledge utilization. By anchoring retrieved knowledge at both the beginning and end of the reasoning process, SAKE prevents it from being overshadowed by prior context, thereby preserving its semantic integrity. Extensive experiments on multi-hop QA and complex reasoning benchmarks demonstrate that SAKE significantly mitigates KID and improves performance, offering a lightweight yet effective solution for knowledge integration in agentic LLMs.
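The abstract describes SAKE only at a high level: retrieved knowledge is anchored at both the beginning and the end of the reasoning process. The sketch below is one loose reading of that idea as prompt layout, not the authors' implementation. The function names, section labels, and the baseline layout used for comparison are all hypothetical and chosen for illustration.

```python
from typing import List


def format_evidence(docs: List[str]) -> str:
    """Render retrieved passages as a numbered block."""
    return "\n".join(f"[{i}] {doc}" for i, doc in enumerate(docs, start=1))


def build_appended_context(question: str, prior_reasoning: str, docs: List[str]) -> str:
    """Baseline layout (assumed): evidence appears once, after the prior reasoning."""
    return "\n\n".join([
        f"Question: {question}",
        f"Reasoning so far:\n{prior_reasoning}",
        f"Retrieved evidence:\n{format_evidence(docs)}",
    ])


def build_anchored_context(question: str, prior_reasoning: str, docs: List[str]) -> str:
    """Anchored layout (assumed): evidence appears before AND after the prior reasoning."""
    evidence = format_evidence(docs)
    return "\n\n".join([
        f"Question: {question}",
        f"Retrieved evidence:\n{evidence}",             # anchor at the beginning
        f"Reasoning so far:\n{prior_reasoning}",
        f"Retrieved evidence (repeated):\n{evidence}",  # anchor at the end
    ])


if __name__ == "__main__":
    print(build_anchored_context(
        question="Who directed the film?",
        prior_reasoning="I need to find the director first.",
        docs=["Passage about the film.", "Passage about the director."],
    ))
```

The only difference between the two layouts is where the evidence sits relative to the earlier reasoning, which is the variable the abstract ties to Knowledge Integration Decay. How the paper actually encodes or positions the evidence may differ from this sketch.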


Community

Epiphy77luM

1 day ago

Hi, this is a great paper. However, when I tried to reproduce the experiments, the results did not meet my expectations. Specifically, I used Qwen3-8B-Thinking with an E5 retriever on the Wiki18 dump, while keeping the other settings consistent with those reported in the paper.

Relative to the Search-o1 baseline, accuracy with SAKE dropped from 47.79% to 47.01% on HotpotQA and from 58.98% to 53.95% on 2Wiki. I would greatly appreciate any suggestions or explanations for this discrepancy. Also, if possible, could you kindly share the source code used in your experiments?

I look forward to your reply.


