Sima Arasteh, Pegah Jandaghi, Nicolaas Weideman, Dennis Perepech, Mukund Raghothaman, Christophe Hauser, Luis Garcia
The software compilation process has a tendency to obscure the original
design of the system and makes it difficult both to identify individual
components and discern their purpose simply by examining the resulting binary
code. Although decompilation techniques attempt to recover higher-level source
code from the machine code in question, they are not fully able to restore the
semantics of the original functions. Furthermore, binaries are often stripped
of metadata, and this makes it challenging to reverse engineer complex binary
software.
In this paper we show how a combination of binary decomposition techniques,
decompilation passes, and LLM-powered function summarization can be used to
build an economical engine to identify modules in stripped binaries and
associate them with high-level natural language descriptions. We instantiated
this technique with three underlying open-source LLMs -- CodeQwen,
DeepSeek-Coder and CodeStral -- and measured its effectiveness in identifying
modules in robotics firmware. This experimental evaluation involved 467 modules
from four devices from the ArduPilot software suite, and showed that CodeStral,
the best-performing backend LLM, achieves an average F1-score of 0.68 with an
online running time of just a handful of seconds.
Authors' comments: 11 pages, 5 figures
Yulong Hui, Yihao Liu, Yao Lu, Huanchen Zhang
Large Language Models (LLMs) encounter challenges in efficiently processing long-text queries, as seen in applications like enterprise document analysis and financial report comprehension. While conventional solutions employ long-context processing or Retrieval-Augmented Generation (RAG), they suffer from prohibitive input expenses or incomplete information. Recent advancements adopt context compression and dynamic retrieval loops, but still sacrifice critical details or incur iterative costs. To address these limitations, we propose OkraLong, a novel framework that flexibly optimizes the entire processing workflow. Unlike prior static or coarse-grained adaptive strategies, OkraLong adopts fine-grained orchestration through three synergistic components: analyzer, organizer and executor. The analyzer characterizes the task states, which guide the organizer in dynamically scheduling the workflow. The executor carries out the execution and generates the final answer. Experimental results demonstrate that OkraLong not only enhances answer accuracy but also achieves cost-effectiveness across a variety of datasets.
Joyce Cahoon, Prerna Singh, Nick Litombe, Jonathan Larson, Ha Trinh, Yiwen Zhu, Andreas Mueller, Fotis Psallidas et al.
In this work, we benchmark various graph-based retrieval-augmented generation (RAG) systems across a broad spectrum of query types, including OLTP-style (fact-based) and OLAP-style (thematic) queries, to address the complex demands of open-domain question answering (QA). Traditional RAG methods often fall short in handling nuanced, multi-document synthesis tasks. By structuring knowledge as graphs, we can facilitate the retrieval of context that captures greater semantic depth and enhances language model operations. We explore graph-based RAG methodologies and introduce TREX, a novel, cost-effective alternative that combines graph-based and vector-based retrieval techniques. Our benchmarking across four diverse datasets highlights the strengths of different RAG methodologies, demonstrates TREX's ability to handle multiple open-domain QA types, and reveals the limitations of current evaluation methods. In a real-world technical support case study, we demonstrate how TREX solutions can surpass conventional vector-based RAG in efficiently synthesizing data from heterogeneous sources. Our findings underscore the potential of augmenting large language models with advanced retrieval and orchestration capabilities, advancing scalable, graph-based AI solutions.
Davide Caffagni, Sara Sarto, Marcella Cornia, Lorenzo Baraldi, Rita Cucchiara
Cross-modal retrieval is gaining increasing efficacy and interest from the
research community, thanks to large-scale training, novel architectural and
learning designs, and its application in LLMs and multimodal LLMs. In this
paper, we move a step forward and design an approach that allows for multimodal
queries, composed of both an image and a text, and can search within
collections of multimodal documents, where images and text are interleaved. Our
model, ReT, employs multi-level representations extracted from different layers
of both visual and textual backbones, both at the query and document side. To
allow for multi-level and cross-modal understanding and feature extraction, ReT
employs a novel Transformer-based recurrent cell that integrates both textual
and visual features at different layers, and leverages sigmoidal gates inspired
by the classical design of LSTMs. Extensive experiments on M2KR and M-BEIR
benchmarks show that ReT achieves state-of-the-art performance across diverse
settings. Our source code and trained models are publicly available at
https://github.com/aimagelab/ReT.
Authors' comments: CVPR 2025
Linhai Zhang, Ziyang Gao, Deyu Zhou, Yulan He
Depression is a widespread mental health disorder, and clinical interviews are the gold standard for assessment. However, their reliance on scarce professionals highlights the need for automated detection. Current systems mainly employ black-box neural networks, which lack the interpretability that is crucial in mental health contexts. Some attempts to improve interpretability use post-hoc LLM generation but suffer from hallucination. To address these limitations, we propose RED, a Retrieval-augmented generation framework for Explainable depression Detection. RED retrieves evidence from clinical interview transcripts, providing explanations for predictions. Traditional query-based retrieval systems use a one-size-fits-all approach, which may not be optimal for depression detection, as user backgrounds and situations vary. We introduce a personalized query generation module that combines standard queries with user-specific background inferred by LLMs, tailoring retrieval to individual contexts. Additionally, to enhance LLM performance in social intelligence, we augment LLMs by retrieving relevant knowledge from a social intelligence datastore using an event-centric retriever. Experimental results on a real-world benchmark demonstrate RED's effectiveness compared to neural networks and LLM-based baselines.
Jiaming Luo, Weiyi Luo, Guoqing Sun, Mengchen Zhu, Haifeng Tang, Kunyao Lan, Mengyue Wu, Kenny Q. Zhu
Designing effective debt collection systems is crucial for improving
operational efficiency and reducing costs in the financial industry. However,
the challenges of maintaining script diversity, contextual relevance, and
coherence make this task particularly difficult. This paper presents a debt
collection system based on real debtor-collector data from a major commercial
bank. We construct a script library from real-world debt collection
conversations, and propose a two-stage retrieval-based response system for
contextual relevance. Experimental results show that our system improves script
diversity, enhances response relevance, and achieves practical deployment
efficiency through knowledge distillation. This work offers a scalable and
automated solution, providing valuable insights for advancing debt collection
practices in real-world applications.
Authors' comments: Accepted by NAACL 2025, Industry Track
Lu Dai, Yijie Xu, Jinhui Ye, Hao Liu, Hui Xiong
Large Language Models (LLMs) have demonstrated improved generation
performance by incorporating externally retrieved knowledge, a process known as
retrieval-augmented generation (RAG). Despite the potential of this approach,
existing studies evaluate RAG effectiveness by 1) assessing retrieval and
generation components jointly, which obscures retrieval's distinct
contribution, or 2) examining retrievers using traditional metrics such as
NDCG, which creates a gap in understanding retrieval's true utility in the
overall generation process. To address the above limitations, in this work, we
introduce an automatic evaluation method that measures retrieval quality
through the lens of information gain within the RAG framework. Specifically, we
propose Semantic Perplexity (SePer), a metric that captures the LLM's internal
belief about the correctness of the retrieved information. We quantify the
utility of retrieval by the extent to which it reduces semantic perplexity
post-retrieval. Extensive experiments demonstrate that SePer not only aligns
closely with human preferences but also offers a more precise and efficient
evaluation of retrieval utility across diverse RAG scenarios.
Authors' comments: ICLR 2025 Spotlight
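For context on the traditional ranking metric named in the abstract above, the following is a minimal sketch of NDCG@k. It uses the linear-gain form of DCG and, as a simplifying assumption, computes the ideal ranking from the relevance labels of the retrieved list only. The function names and the example labels are illustrative and are not taken from the paper; this is not SePer itself.

```python
import math


def dcg_at_k(relevances, k):
    """Discounted cumulative gain over the first k graded relevance labels."""
    return sum(
        rel / math.log2(pos + 1)
        for pos, rel in enumerate(relevances[:k], start=1)
    )


def ndcg_at_k(relevances, k):
    """DCG normalised by the DCG of the same labels sorted best-first.

    Assumption: the ideal ordering is derived from the given labels only,
    i.e. every relevant document is assumed to be in the list.
    """
    ideal = dcg_at_k(sorted(relevances, reverse=True), k)
    return dcg_at_k(relevances, k) / ideal if ideal > 0 else 0.0


# Illustrative labels: the top-ranked document is irrelevant,
# the next two are relevant.
example_score = ndcg_at_k([0, 1, 1], k=3)
```

A score of 1.0 means the retrieved list is already in the best possible order; pushing relevant documents down the list lowers the score.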
Teng Lin, Yizhang Zhu, Yuyu Luo, Nan Tang
Multi-entity question answering (MEQA) poses significant challenges for large language models (LLMs), which often struggle to consolidate scattered information across multiple documents. An example question might be "What is the distribution of IEEE Fellows among various fields of study?", which requires retrieving information from diverse sources, e.g., Wikipedia pages. The effectiveness of current retrieval-augmented generation (RAG) methods is limited by the LLMs' capacity to aggregate insights from numerous pages. To address this gap, this paper introduces a structured RAG (SRAG) framework that systematically organizes extracted entities into relational tables (e.g., tabulating entities with schema columns like "name" and "field of study") and then applies table-based reasoning techniques. Our approach decouples retrieval and reasoning, enabling LLMs to focus on structured data analysis rather than raw text aggregation. Extensive experiments on Wikipedia-based multi-entity QA tasks demonstrate that SRAG significantly outperforms state-of-the-art long-context LLMs and RAG solutions, achieving a 29.6% improvement in accuracy. The results underscore the efficacy of structuring unstructured data to enhance LLMs' reasoning capabilities.
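The idea of tabulating extracted entities and then reasoning over the table can be sketched as follows. This is only an illustration under stated assumptions: the rows are placeholder values rather than real data, the extraction step is taken as already done, and the `distribution` helper is hypothetical rather than part of SRAG.

```python
from collections import Counter

# Placeholder rows, standing in for entities extracted from several
# retrieved pages. The column names follow the schema example in the
# abstract; the values are made up for illustration.
rows = [
    {"name": "Person A", "field of study": "Computer Vision"},
    {"name": "Person B", "field of study": "Databases"},
    {"name": "Person C", "field of study": "Computer Vision"},
]


def distribution(table, column):
    """Return the share of rows for each distinct value in one column."""
    counts = Counter(row[column] for row in table)
    total = sum(counts.values())
    return {value: count / total for value, count in counts.items()}


dist = distribution(rows, "field of study")
```

Once the entities are in a table, the aggregate question becomes a simple group-and-count over one column instead of an aggregation over raw text.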
Jie Ouyang, Tingyue Pan, Mingyue Cheng, Ruiran Yan, Yucong Luo, Jiaying Lin, Qi Liu
While Retrieval-Augmented Generation (RAG) has emerged as an effective approach for addressing the knowledge outdating problem in Large Language Models (LLMs), it faces a critical challenge: the prevalence of outdated information in knowledge bases. Current research primarily focuses on incorporating up-to-date information, yet the impact of outdated information coexisting in retrieval sources remains inadequately addressed. To bridge this gap, we introduce HoH, the first benchmark specifically designed to evaluate the impact of outdated information on RAG. Our benchmark leverages token-level diff algorithms combined with LLM pipelines to efficiently create a large-scale QA dataset that accurately captures temporal knowledge evolution in real-world facts. Through comprehensive experiments, we reveal that outdated information significantly degrades RAG performance in two critical ways: (1) it substantially reduces response accuracy by distracting models from correct information, and (2) it can mislead models into generating potentially harmful outputs, even when current information is available. Current RAG approaches struggle with both retrieval and generation aspects when handling outdated information. These findings highlight the urgent need for innovative solutions to address the temporal challenges in RAG.
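As a rough illustration of what a token-level diff between two versions of a passage looks like, the sketch below uses Python's standard `difflib.SequenceMatcher` on whitespace-separated tokens. The abstract does not specify the benchmark's actual diff algorithm or tokenizer, so both choices here are assumptions, and the two sentences are invented placeholders.

```python
import difflib


def token_diff(old_text, new_text):
    """List the token spans that differ between two versions of a text."""
    old_tokens = old_text.split()
    new_tokens = new_text.split()
    matcher = difflib.SequenceMatcher(a=old_tokens, b=new_tokens, autojunk=False)
    changes = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":
            # tag is one of "replace", "delete", or "insert"
            changes.append((tag, old_tokens[i1:i2], new_tokens[j1:j2]))
    return changes


# Placeholder snapshots of the same sentence at two points in time.
old_version = "The library has 12 branches"
new_version = "The library has 15 branches"
changes = token_diff(old_version, new_version)
```

Each reported change pairs the outdated tokens with their replacement, which is the kind of signal a later pipeline stage could turn into a question with an outdated and a current answer.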
Shi-Shun Chen
Academic code associated with research papers is a valuable resource for scholars. In specialized fields outside computer science, code availability is often limited, making effective code retrieval essential. Google Scholar is a crucial academic search tool. If code published with a paper is not retrievable via Google Scholar, its accessibility and impact are significantly reduced. This study takes the term "accelerated degradation" combined with "reliability" as an example, and finds that, for papers published by Elsevier, only GitHub links included in abstracts are comprehensively retrieved by Google Scholar. When such links appear within the main body of a paper, even in the "Data Availability" section, they may be ignored and become unsearchable. These findings highlight the importance of strategically placing GitHub links in abstracts to enhance code discoverability on Google Scholar.
Hongchao Gu, Dexun Li, Kuicai Dong, Hao Zhang, Hang Lv, Hao Wang, Defu Lian, Yong Liu et al.
Generating knowledge-intensive and comprehensive long texts, such as encyclopedia articles, remains a significant challenge for Large Language Models. It requires not only the precise integration of facts but also the maintenance of thematic coherence throughout the article. Existing methods, such as direct generation and multi-agent discussion, often struggle with issues like hallucinations, topic incoherence, and significant latency. To address these challenges, we propose RAPID, an efficient retrieval-augmented long text generation framework. RAPID consists of three main modules: (1) Retrieval-augmented preliminary outline generation to reduce hallucinations, (2) Attribute-constrained search for efficient information discovery, (3) Plan-guided article generation for enhanced coherence. Extensive experiments on our newly compiled benchmark dataset, FreshWiki-2024, demonstrate that RAPID significantly outperforms state-of-the-art methods across a wide range of evaluation metrics (e.g., long-text generation, outline quality, and latency). Our work provides a robust and efficient solution to the challenges of automated long-text generation.
Yin Wu, Zhengxuan Zhang, Fuling Wang, Yuyu Luo, Hui Xiong, Nan Tang
Misinformation continues to pose a significant challenge in today's
information ecosystem, profoundly shaping public perception and behavior. Among
its various manifestations, Out-of-Context (OOC) misinformation is particularly
obscure, as it distorts meaning by pairing authentic images with misleading
textual narratives. Existing methods for detecting OOC misinformation
predominantly rely on coarse-grained similarity metrics between image-text
pairs, which often fail to capture subtle inconsistencies or provide meaningful
explainability. While multi-modal large language models (MLLMs) demonstrate
remarkable capabilities in visual reasoning and explanation generation, they
have not yet demonstrated the capacity to address complex, fine-grained, and
cross-modal distinctions necessary for robust OOC detection. To overcome these
limitations, we introduce EXCLAIM, a retrieval-based framework designed to
leverage external knowledge through a multi-granularity index of multi-modal
events and entities. Our approach integrates multi-granularity contextual
analysis with a multi-agent reasoning architecture to systematically evaluate
the consistency and integrity of multi-modal news content. Comprehensive
experiments validate the effectiveness and resilience of EXCLAIM, demonstrating
its ability to detect OOC misinformation with 4.3% higher accuracy compared to
state-of-the-art approaches, while offering explainable and actionable
insights.
Authors' comments: 15 pages, 2 figures
Shangzhe Di, Zhelun Yu, Guanghao Zhang, Haoyuan Li, Tao Zhong, Hao Cheng, Bolin Li, Wanggui He et al.
We propose ReKV, a novel training-free approach that enables efficient
streaming video question-answering (StreamingVQA), by seamlessly integrating
with existing Video Large Language Models (Video-LLMs). Traditional VideoQA
systems struggle with long videos, as they must process entire videos before
responding to queries, and repeat this process for each new question. In
contrast, our approach analyzes long videos in a streaming manner, allowing for
prompt responses as soon as user queries are received. Building on a common
Video-LLM, we first incorporate a sliding-window attention mechanism, ensuring
that input frames attend to a limited number of preceding frames, thereby
reducing computational overhead. To prevent information loss, we store
processed video key-value caches (KV-Caches) in RAM and on disk, reloading them
into GPU memory as needed. Additionally, we introduce a retrieval method that
leverages an external retriever or the parameters within Video-LLMs to retrieve
only query-relevant KV-Caches, ensuring both efficiency and accuracy in
question answering. ReKV enables the separation of video encoding and
question-answering across different processes and GPUs, significantly enhancing
the efficiency of StreamingVQA. Through comprehensive experimentation, we
validate the efficacy and practicality of our approach, which significantly
boosts efficiency and enhances applicability over existing VideoQA models.
Authors' comments: Accepted to ICLR 2025. Code: https://github.com/Becomebright/ReKV
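The sliding-window attention described in the abstract above can be illustrated with a boolean mask at frame granularity. This is a simplified sketch, not the ReKV implementation: real Video-LLMs attend over tokens rather than whole frames, and the function name and the `window` parameter (the number of preceding frames a frame may see, in addition to itself) are assumptions made for this example.

```python
def sliding_window_mask(num_frames, window):
    """Build a causal sliding-window mask.

    mask[i][j] is True when frame i may attend to frame j, that is,
    when j is frame i itself or one of the `window` frames before it.
    """
    return [
        [i - window <= j <= i for j in range(num_frames)]
        for i in range(num_frames)
    ]


mask = sliding_window_mask(num_frames=5, window=2)
for row in mask:
    print("".join("x" if allowed else "." for allowed in row))
```

Each row has at most `window + 1` allowed positions, so the attention cost per new frame stays bounded no matter how long the stream grows. Frames that fall outside the window are the ones whose KV-Caches would need to be stored and retrieved later.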
Jia Chen, Qian Dong, Haitao Li, Xiaohui He, Yan Gao, Shaosheng Cao, Yi Wu, Ping Yang et al.
User-generated content (UGC) communities, especially those featuring
multimodal content, improve user experiences by integrating visual and textual
information into results (or items). The challenge of improving user
experiences in complex systems with search and recommendation (S&R) services
has drawn significant attention from both academia and industry in recent years.
However, the lack of high-quality datasets has limited the research progress on
multimodal S&R. To address the growing need for developing better S&R
services, we present a novel multimodal information retrieval dataset in this
paper, namely Qilin. The dataset is collected from Xiaohongshu, a popular
social platform with over 300 million monthly active users and an average
search penetration rate of over 70%. In contrast to existing datasets,
Qilin offers a comprehensive collection of user sessions with
heterogeneous results like image-text notes, video notes, commercial notes, and
direct answers, facilitating the development of advanced multimodal neural
retrieval models across diverse task settings. To better model user
satisfaction and support the analysis of heterogeneous user behaviors, we also
collect extensive APP-level contextual signals and genuine user feedback.
Notably, Qilin contains user-favored answers and their referred results for
search requests triggering the Deep Query Answering (DQA) module. This allows
not only the training and evaluation of a Retrieval-augmented Generation (RAG)
pipeline, but also the exploration of how such a module would affect users'
search behavior. Through comprehensive analysis and experiments, we provide
interesting findings and insights for further improving S&R systems. We hope
that Qilin will significantly contribute to the advancement of
multimodal content platforms with S&R services in the future.
Authors' comments: 11 pages
Chandana Sree Mala, Gizem Gezici, Fosca Giannotti
Large Language Models (LLMs) excel in language comprehension and generation but are prone to hallucinations, producing factually incorrect or unsupported outputs. Retrieval Augmented Generation (RAG) systems address this issue by grounding LLM responses with external knowledge. This study evaluates the relationship between retriever effectiveness and hallucination reduction in LLMs using three retrieval approaches: sparse retrieval based on BM25 keyword search, dense retrieval using semantic search with Sentence Transformers, and a proposed hybrid retrieval module. The hybrid module incorporates query expansion and combines the results of sparse and dense retrievers through a dynamically weighted Reciprocal Rank Fusion score. Using the HaluBench dataset, a benchmark for hallucinations in question answering tasks, we assess retrieval performance with metrics such as mean average precision and normalised discounted cumulative gain, focusing on the relevance of the top three retrieved documents. Results show that the hybrid retriever achieves better relevance scores, outperforming both sparse and dense retrievers. Further evaluation of LLM-generated answers against ground truth using metrics such as accuracy, hallucination rate, and rejection rate reveals that the hybrid retriever achieves the highest accuracy on fails, the lowest hallucination rate, and the lowest rejection rate. These findings highlight the hybrid retriever's ability to enhance retrieval relevance, reduce hallucination rates, and improve LLM reliability, emphasising the importance of advanced retrieval techniques in mitigating hallucinations and improving response accuracy.
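The fusion step described above can be sketched with a weighted form of Reciprocal Rank Fusion, where each retriever contributes `weight / (k + rank)` for every document it returns. The abstract does not say how the weights are set dynamically, so the fixed weights, the constant `k = 60`, the function name, and the document ids below are all assumptions for illustration.

```python
from collections import defaultdict


def weighted_rrf(rankings, weights, k=60):
    """Fuse ranked lists of document ids with weighted Reciprocal Rank Fusion.

    rankings: maps a retriever name to its ranked list of document ids.
    weights:  maps a retriever name to its (assumed, fixed) weight.
    """
    scores = defaultdict(float)
    for retriever, ranked_ids in rankings.items():
        weight = weights[retriever]
        for rank, doc_id in enumerate(ranked_ids, start=1):
            scores[doc_id] += weight / (k + rank)
    # Highest fused score first; ties are broken by document id.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


# Placeholder rankings from a sparse and a dense retriever.
rankings = {
    "sparse": ["d1", "d2", "d3"],
    "dense": ["d3", "d1", "d4"],
}
weights = {"sparse": 0.4, "dense": 0.6}
fused = weighted_rrf(rankings, weights)
```

Because the score depends only on ranks, the two retrievers do not need comparable raw scores; documents ranked well by both lists rise to the top of the fused list.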
Angelo Ziletti, Leonardo D'Ambrosi
Clinical cohort definition is crucial for patient recruitment and
observational studies, yet translating inclusion/exclusion criteria into SQL
queries remains challenging and manual. We present an automated system
utilizing large language models that combines criteria parsing, two-level
retrieval augmented generation with specialized knowledge bases, medical
concept standardization, and SQL generation to retrieve patient cohorts with
patient funnels. The system achieves 0.75 F1-score in cohort identification on
EHR data, effectively capturing complex temporal and logical relationships.
These results demonstrate the feasibility of automated cohort generation for
epidemiological research.
Authors' comments: 7 pages, 1 figure
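For readers less familiar with the reported metric, the sketch below computes an F1-score by comparing a retrieved cohort with a reference cohort as sets of patient identifiers. The abstract does not describe the exact evaluation protocol, so treating cohorts as id sets is an assumption, and the identifiers are placeholders.

```python
def f1_score(predicted, reference):
    """Harmonic mean of precision and recall between two sets of ids."""
    predicted, reference = set(predicted), set(reference)
    true_positives = len(predicted & reference)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(predicted)
    recall = true_positives / len(reference)
    return 2 * precision * recall / (precision + recall)


# Placeholder cohorts: the generated query returns four patients,
# three of which belong to the five-patient reference cohort.
score = f1_score({"p1", "p2", "p3", "p4"}, {"p2", "p3", "p4", "p5", "p6"})
```

Here precision is 3/4 and recall is 3/5, so the F1-score is 2/3.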
Zelong Sun, Dong Jing, Zhiwu Lu
Zero-Shot Composed Image Retrieval (ZS-CIR) aims to retrieve target images by integrating information from a composed query (reference image and modification text) without training samples. Existing methods primarily combine caption models and large language models (LLMs) to generate target captions based on composed queries but face various issues such as incompatibility, visual information loss, and insufficient reasoning. In this work, we propose CoTMR, a training-free framework crafted for ZS-CIR with novel Chain-of-thought (CoT) and Multi-scale Reasoning. Instead of relying on caption models for modality transformation, CoTMR employs the Large Vision-Language Model (LVLM) to achieve unified understanding and reasoning for composed queries. To enhance the reasoning reliability, we devise CIRCoT, which guides the LVLM through a step-by-step inference process using predefined subtasks. Considering that existing approaches focus solely on global-level reasoning, our CoTMR incorporates multi-scale reasoning to achieve more comprehensive inference via fine-grained predictions about the presence or absence of key elements at the object scale. Further, we design a Multi-Grained Scoring (MGS) mechanism, which integrates CLIP similarity scores of the above reasoning outputs with candidate images to realize precise retrieval. Extensive experiments demonstrate that our CoTMR not only drastically outperforms previous methods across four prominent benchmarks but also offers appealing interpretability.
Michael Dinzinger, Laura Caspari, Kanishka Ghosh Dastidar, Jelena Mitrović, Michael Granitzer
We present WebFAQ, a large-scale collection of open-domain question answering
datasets derived from FAQ-style schema.org annotations. In total, the data
collection consists of 96 million natural question-answer (QA) pairs across 75
languages, including 47 million (49%) non-English samples. WebFAQ further
serves as the foundation for 20 monolingual retrieval benchmarks with a total
size of 11.2 million QA pairs (5.9 million non-English). These datasets are
carefully curated through refined filtering and near-duplicate detection,
yielding high-quality resources for training and evaluating multilingual dense
retrieval models. To empirically confirm WebFAQ's efficacy, we use the
collected QAs to fine-tune an in-domain pretrained XLM-RoBERTa model. Through
this process of dataset-specific fine-tuning, the model achieves significant
retrieval performance gains, which generalize - beyond WebFAQ - to other
multilingual retrieval benchmarks evaluated in a zero-shot setting. Last but not
least, we utilize WebFAQ to construct a set of QA-aligned bilingual corpora
spanning over 1000 language pairs using state-of-the-art bitext mining and
automated LLM-assessed translation evaluation. Due to our advanced, automated
method of bitext dataset generation, the resulting bilingual corpora
demonstrate higher translation quality compared to similar datasets. WebFAQ and
all associated resources are publicly available on GitHub and HuggingFace.
Authors' comments: 10 pages, 3 figures, 7 tables
Xujie Yuan, Yongxu Liu, Shimin Di, Shiwen Wu, Libin Zheng, Rui Meng, Lei Chen, Xiaofang Zhou et al.
The integration of Knowledge Graphs (KGs) into the Retrieval Augmented
Generation (RAG) framework has attracted significant interest, with early
studies showing promise in mitigating hallucinations and improving model
accuracy. However, a systematic understanding and comparative analysis of the
rapidly emerging KG-RAG methods are still lacking. This paper seeks to lay the
foundation for systematically answering the question of when and how to use
KG-RAG by analyzing their performance in various application scenarios
associated with different technical configurations. After outlining the mind
map using the KG-RAG framework and summarizing its popular pipeline, we conduct a
pilot empirical study of KG-RAG works to reimplement and evaluate 6 KG-RAG
methods across 7 datasets in diverse scenarios, analyzing the impact of 9
KG-RAG configurations in combination with 17 LLMs. Our results underscore the
critical role of appropriate application conditions and optimal configurations
of KG-RAG components.
Authors' comments: 8 pages, 2 figures, 14 tables
Yuxin Yang, Haoyang Wu, Tao Wang, Jia Yang, Hao Ma, Guojie Luo
The advent of Large Language Models (LLMs) has revolutionized natural language processing. However, these models face challenges in retrieving precise information from vast datasets. Retrieval-Augmented Generation (RAG) was developed to combine LLMs with external information retrieval systems to enhance the accuracy and context of responses. Despite improvements, RAG still struggles with comprehensive retrieval in high-volume, low-information-density databases and lacks relational awareness, leading to fragmented answers. To address this, this paper introduces the Pseudo-Knowledge Graph (PKG) framework, designed to overcome these limitations by integrating Meta-path Retrieval, In-graph Text and Vector Retrieval into LLMs. By preserving natural language text and leveraging various retrieval techniques, the PKG offers a richer knowledge representation and improves accuracy in information retrieval. Extensive evaluations using Open Compass and MultiHop-RAG benchmarks demonstrate the framework's effectiveness in managing large volumes of data and complex relationships.