Hang Li, Xiao Wang, Bevan Koopman, Guido Zuccon
Scaling dense retrievers to larger large language model (LLM) backbones has been a dominant strategy for improving their retrieval effectiveness. However, this has substantial cost implications: larger backbones require more expensive hardware (e.g. GPUs with more memory) and lead to higher indexing and querying costs (latency, energy consumption). In this paper, we challenge this paradigm by introducing PromptPRF, a feature-based pseudo-relevance feedback (PRF) framework that enables small LLM-based dense retrievers to achieve effectiveness comparable to much larger models. PromptPRF uses LLMs to extract query-independent, structured and unstructured features (e.g., entities, summaries, chain-of-thought keywords, essays) from top-ranked documents. These features are generated offline and integrated into dense query representations via prompting, enabling efficient retrieval without additional training. Unlike prior methods such as GRF, which rely on online, query-specific generation and sparse retrieval, PromptPRF decouples feedback generation from query processing and supports dense retrievers in a fully zero-shot setting. Experiments on TREC DL and BEIR benchmarks demonstrate that PromptPRF consistently improves retrieval effectiveness and offers favourable cost-effectiveness trade-offs. We further present ablation studies to understand the role of positional feedback and analyse the interplay between feature extractor size, PRF depth, and model performance. Our findings demonstrate that with effective PRF design, scaling the retriever is not always necessary, narrowing the gap between small and large models while reducing inference cost.
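The PromptPRF abstract describes features that are generated offline and then folded into the query via prompting. A minimal sketch of that idea follows, assuming a hypothetical in-memory feature store (`OFFLINE_FEATURES`), a hypothetical helper (`build_prf_prompt`), and an invented prompt wording; none of these names or strings come from the paper.

```python
from typing import Dict, List

# Hypothetical offline store: document id -> features extracted once by an LLM.
# In a real system these strings would come from an offline pass over the corpus.
OFFLINE_FEATURES: Dict[str, Dict[str, str]] = {
    "d1": {"entities": "BM25, dense retriever", "summary": "Compares sparse and dense retrieval."},
    "d2": {"entities": "pseudo-relevance feedback", "summary": "Surveys PRF for neural rankers."},
}


def build_prf_prompt(query: str, ranked_doc_ids: List[str], feature: str, depth: int) -> str:
    """Compose a query prompt from cached features of the top-`depth` documents."""
    lines = [f"Query: {query}"]
    for rank, doc_id in enumerate(ranked_doc_ids[:depth], start=1):
        # Lookup only: no LLM generation happens at query time.
        feedback = OFFLINE_FEATURES[doc_id][feature]
        lines.append(f"Feedback {rank} ({feature}): {feedback}")
    # Assumed instruction wording; the actual prompt used by the authors is not given.
    lines.append("Represent the query using the feedback above:")
    return "\n".join(lines)


prompt = build_prf_prompt("what is pseudo-relevance feedback", ["d2", "d1"], "summary", depth=1)
print(prompt)
```

The resulting string would then be passed to the dense retriever's query encoder in place of the raw query. The `depth` argument plays the role of the PRF depth mentioned in the abstract.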
Hikaru Shimadzu, Takehito Utsuro, Daisuke Kitayama
The 2023 edition of the White Paper on Information and Communications estimated that the number of users of social networking services (SNS) in Japan would exceed 100 million by 2022, and the influence of SNS in Japan is growing significantly. In addition, SNS-based marketing and research on the propagation of emotions and information on SNS are being actively pursued, creating the need for a system that predicts trends in SNS interactions. We have already created a system that simulates the behavior of various communities on SNS by building a virtual SNS environment in which agents post and reply to each other in a chat community created by agents using LLMs. In this paper, we evaluate how the retrieval-augmented generation (RAG) mechanism used to create posts and replies in this virtual SNS environment affects the ability to generate posts and replies. The evaluation confirmed that the proposed retrieval-augmented generation mechanism, which mimics human search behavior, generates the most natural exchanges.
Rikuto Tsuchida, Hibiki Yokoyama, Takehito Utsuro
The purpose of this paper is to examine whether large language models (LLMs) can understand what is good and evil when judging the good/evil reputation of celebrities. Specifically, we first apply a large language model (namely, ChatGPT) to the task of collecting sentences that mention the target celebrity from articles about celebrities on Web pages. Next, ChatGPT categorizes the collected sentences based on their contents and assigns a name to each category. These category names are referred to as "aspects" of each celebrity. Then, by applying the framework of retrieval-augmented generation (RAG), we show that the large language model is quite effective at judging the good/evil reputation of aspects and descriptions of each celebrity. Finally, to establish the advantage of the proposed method over existing services that incorporate RAG functions, we show that the proposed method of judging the good/evil of aspects and descriptions of each celebrity significantly outperforms one such existing service.
Jonas Oppenlaender
Planning a trip into a potentially unsafe area is a difficult task. We
conducted a formative study on travelers' information needs, finding that most
of them turn to search engines for trip planning. Search engines, however, fail
to provide easily interpretable results adapted to the context and personal
information needs of a traveler. Large language models (LLMs) create new
possibilities for providing personalized travel safety advice. To explore this
idea, we developed DangerMaps, a mapping system that assists its users in
researching the safety of an urban travel destination, whether it is pre-travel
or on-location. DangerMaps plots safety ratings onto a map and provides
explanations on demand. This late-breaking work specifically emphasizes the
challenges of designing real-world applications with large language models. We
provide a detailed description of our approach to prompt design and highlight
future areas of research.
Authors' comments: 17 pages, 7 figures, 1 table
Ruiyi Yang, Hao Xue, Imran Razzak, Hakim Hacid, Flora D. Salim
Graph Retrieval-Augmented Generation (GraphRAG) has proven highly effective
in enhancing the performance of Large Language Models (LLMs) on tasks that
require external knowledge. By leveraging Knowledge Graphs (KGs), GraphRAG
improves information retrieval for complex reasoning tasks, providing more
precise and comprehensive retrieval and generating more accurate responses to
QAs. However, most RAG methods fall short in addressing multi-step reasoning,
particularly when both information extraction and inference are necessary. To
address this limitation, this paper presents Knowledge Graph-Based Iterative
Retrieval-Augmented Generation (KG-IRAG), a novel framework that integrates KGs
with iterative reasoning to improve LLMs' ability to handle queries involving
temporal and logical dependencies. Through iterative retrieval steps, KG-IRAG
incrementally gathers relevant data from external KGs, enabling step-by-step
reasoning. The proposed approach is particularly suited for scenarios where
reasoning is required alongside dynamic temporal data extraction, such as
determining optimal travel times based on weather conditions or traffic
patterns. Experimental results show that KG-IRAG improves accuracy in complex
reasoning tasks by effectively integrating external knowledge with iterative,
logic-based retrieval. Additionally, three new datasets: weatherQA-Irish,
weatherQA-Sydney, and trafficQA-TFNSW, are formed to evaluate KG-IRAG's
performance, demonstrating its potential beyond traditional RAG applications.
Authors' comments: 14 pages, 4 figures
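The iterative retrieve-then-reason loop described for KG-IRAG can be illustrated with a toy example. The sketch below assumes a hypothetical temporal knowledge graph stored as a dictionary (`WEATHER_KG`) and an invented sufficiency rule ("stop at the first hour without rain"); it is not the authors' implementation, only an illustration of retrieving one time slot per iteration until the evidence answers the question.

```python
from typing import Dict, Optional, Tuple

# Toy temporal knowledge graph: (location, hour) -> observed weather.
# The data and the "no rain" sufficiency rule are illustrative assumptions.
WEATHER_KG: Dict[Tuple[str, int], str] = {
    ("Sydney", 9): "rain",
    ("Sydney", 10): "rain",
    ("Sydney", 11): "cloudy",
    ("Sydney", 12): "sunny",
}


def earliest_dry_hour(location: str, start_hour: int, max_steps: int = 6) -> Optional[int]:
    """Retrieve one time slot per iteration until the evidence answers the question."""
    hour = start_hour
    for _ in range(max_steps):
        fact = WEATHER_KG.get((location, hour))  # retrieval step
        if fact is None:
            return None  # the KG has no data for this slot; stop rather than guess
        if fact != "rain":  # reasoning step: is the retrieved evidence sufficient?
            return hour
        hour += 1  # not sufficient: move the retrieval window forward
    return None


print(earliest_dry_hour("Sydney", 9))
```

In a full system the sufficiency check would be made by an LLM rather than a string comparison, and `max_steps` bounds how many retrieval rounds are attempted.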
Zhengsheng Guo, Linwei Zheng, Xinyang Chen, Xuefeng Bai, Kehai Chen, Min Zhang
While human cognition inherently retrieves information from diverse and specialized knowledge sources during decision-making processes, current Retrieval-Augmented Generation (RAG) systems typically operate through single-source knowledge retrieval, leading to a cognitive-algorithmic discrepancy. To bridge this gap, we introduce MoK-RAG, a novel multi-source RAG framework that implements a mixture-of-knowledge-paths enhanced retrieval mechanism through functional partitioning of a large language model (LLM) corpus into distinct sections, enabling retrieval from multiple specialized knowledge paths. Applied to the generation of 3D simulated environments, our proposed MoK-RAG3D enhances this paradigm by partitioning 3D assets into distinct sections and organizing them based on a hierarchical knowledge tree structure. Unlike previous methods, which rely only on manual evaluation, we introduce automated evaluation methods for 3D scenes. Both automatic and human evaluations in our experiments demonstrate that MoK-RAG3D can assist Embodied AI agents in generating diverse scenes.
Yujin Wang, Quanfeng Liu, Zhengxin Jiang, Tianyi Wang, Junfeng Jiao, Hongqing Chu, Bingzhao Gao, Hong Chen
Accurately understanding and deciding on high-level meta-actions is essential for ensuring reliable and safe autonomous driving systems. While vision-language models (VLMs) have shown significant potential in various autonomous driving tasks, they often suffer from limitations such as inadequate spatial perception and hallucination, reducing their effectiveness in complex autonomous driving scenarios. To address these challenges, we propose a retrieval-augmented decision-making (RAD) framework, a novel architecture designed to enhance VLMs' capabilities to reliably generate meta-actions in autonomous driving scenes. RAD leverages a retrieval-augmented generation (RAG) pipeline to dynamically improve decision accuracy through a three-stage process consisting of the embedding flow, retrieving flow, and generating flow. Additionally, we fine-tune VLMs on a specifically curated dataset derived from the NuScenes dataset to enhance their spatial perception and bird's-eye view image comprehension capabilities. Extensive experimental evaluations on the curated NuScenes-based dataset demonstrate that RAD outperforms baseline methods across key evaluation metrics, including match accuracy, F1 score, and a self-defined overall score, highlighting its effectiveness in improving meta-action decision-making for autonomous driving tasks.
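The three-stage process named in the RAD abstract (embedding flow, retrieving flow, generating flow) can be sketched with toy data. The example below assumes hand-written three-dimensional scene embeddings, a hypothetical store (`SCENE_DB`), cosine similarity as the retrieval score, and an invented prompt layout; the VLM call itself is omitted, and none of these details are taken from the paper.

```python
import math
from typing import List, Tuple


def cosine(u: List[float], v: List[float]) -> float:
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


# Embedding flow (assumed precomputed): scene embedding plus the meta-action taken.
SCENE_DB: List[Tuple[List[float], str]] = [
    ([0.9, 0.1, 0.0], "slow down"),
    ([0.1, 0.9, 0.0], "turn left"),
    ([0.0, 0.2, 0.9], "go straight"),
]


def retrieve(query_embedding: List[float], top_k: int = 1) -> List[str]:
    """Retrieving flow: return meta-actions of the most similar stored scenes."""
    ranked = sorted(SCENE_DB, key=lambda item: cosine(query_embedding, item[0]), reverse=True)
    return [action for _, action in ranked[:top_k]]


def build_generation_prompt(scene_description: str, retrieved_actions: List[str]) -> str:
    """Generating flow input: the prompt a VLM would receive (the VLM call is omitted)."""
    examples = "; ".join(retrieved_actions)
    return f"Scene: {scene_description}\nActions in similar scenes: {examples}\nMeta-action:"


actions = retrieve([0.8, 0.2, 0.1])
print(build_generation_prompt("pedestrian ahead at a crossing", actions))
```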
Jerry Huang, Siddarth Madala, Risham Sidhu, Cheng Niu, Hao Peng, Julia Hockenmaier, Tong Zhang
Retrieval-augmented generation (RAG) systems rely on retrieval models for identifying relevant contexts and answer generation models for utilizing those contexts. However, retrievers exhibit imperfect recall and precision, limiting downstream performance. We introduce RAG-RL, an answer generation model trained not only to produce answers but also to identify and cite relevant information from larger sets of retrieved contexts, shifting some of the burden of identifying relevant documents from the retriever to the answer generator. Our approach uses curriculum learning, where the model is first trained on easier examples that include only relevant contexts. Our experiments show that these training samples enable models to acquire citation and reasoning skills with greater sample efficiency and generalizability, demonstrating strong model performance even as the number of irrelevant passages increases. We benchmark our methods on three open-domain multi-hop question answering datasets and report significant gains in answer and citation accuracy. Our experiments provide empirical insights into how easier training samples can give models stronger signals for learning specific skills (e.g., citation generation) and how different components of post-training (e.g., training set construction, rule-based rewards, training sample ordering, etc.) impact final model performance.
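Two components named in the RAG-RL abstract, rule-based rewards and training sample ordering, can be sketched as follows. The reward below (exact-match answer score plus citation F1, equally weighted) and the difficulty proxy (number of distractor passages) are assumptions chosen for illustration; the abstract does not specify the exact reward or ordering used.

```python
from typing import Dict, List, Set


def citation_f1(cited: Set[str], relevant: Set[str]) -> float:
    """F1 between cited passage ids and the gold relevant passage ids."""
    overlap = len(cited & relevant)
    if overlap == 0:
        return 0.0
    precision = overlap / len(cited)
    recall = overlap / len(relevant)
    return 2 * precision * recall / (precision + recall)


def rule_based_reward(answer: str, gold: str, cited: Set[str], relevant: Set[str]) -> float:
    """Assumed reward: exact-match answer score plus citation F1, equally weighted."""
    answer_score = 1.0 if answer.strip().lower() == gold.strip().lower() else 0.0
    return answer_score + citation_f1(cited, relevant)


def curriculum_order(samples: List[Dict]) -> List[Dict]:
    """Order training samples from easy (no distractors) to hard (many distractors)."""
    return sorted(samples, key=lambda s: s["num_distractors"])


samples = [
    {"id": "hard", "num_distractors": 8},
    {"id": "easy", "num_distractors": 0},
    {"id": "medium", "num_distractors": 3},
]
print([s["id"] for s in curriculum_order(samples)])
print(rule_based_reward("Paris", "paris", {"p1", "p3"}, {"p1", "p2"}))
```

Samples with zero distractors correspond to the "easier examples that include only relevant contexts" mentioned in the abstract.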
Da Wu, Zhanliang Wang, Quan Nguyen, Kai Wang
Background: Several studies show that large language models (LLMs) struggle
with phenotype-driven gene prioritization for rare diseases. These studies
typically use Human Phenotype Ontology (HPO) terms to prompt foundation models
like GPT and LLaMA to predict candidate genes. However, in real-world settings,
foundation models are not optimized for domain-specific tasks like clinical
diagnosis, and inputs are unstructured clinical notes rather than standardized
terms. How LLMs can be instructed to predict candidate genes or disease
diagnosis from unstructured clinical notes remains a major challenge. Methods:
We introduce RAG-driven CoT and CoT-driven RAG, two methods that combine
Chain-of-Thought (CoT) and Retrieval Augmented Generation (RAG) to analyze
clinical notes. A five-question CoT protocol mimics expert reasoning, while RAG
retrieves data from sources like HPO and OMIM (Online Mendelian Inheritance in
Man). We evaluated these approaches on rare disease datasets, including 5,980
Phenopacket-derived notes, 255 literature-based narratives, and 220 in-house
clinical notes from the Children's Hospital of Philadelphia. Results: We found that
recent foundation models, including Llama 3.3-70B-Instruct and
DeepSeek-R1-Distill-Llama-70B, outperformed earlier versions such as Llama 2
and GPT-3.5. We also showed that RAG-driven CoT and CoT-driven RAG both
outperform foundation models in candidate gene prioritization from clinical
notes; in particular, both methods with DeepSeek backbone resulted in a top-10
gene accuracy of over 40% on Phenopacket-derived clinical notes. RAG-driven CoT
works better for high-quality notes, where early retrieval can anchor the
subsequent reasoning steps in domain-specific evidence, while CoT-driven RAG
has an advantage when processing lengthy and noisy notes.
Authors' comments: 31 pages, 3 figures
Pengfei Luo, Jingbo Zhou, Tong Xu, Yuan Xia, Linli Xu, Enhong Chen
With the proliferation of images in online content, language-guided image
retrieval (LGIR) has emerged as a research hotspot over the past decade,
encompassing a variety of subtasks with diverse input forms. While the
development of large multimodal models (LMMs) has significantly facilitated
these tasks, existing approaches often address them in isolation, requiring the
construction of separate systems for each task. This not only increases system
complexity and maintenance costs, but also exacerbates challenges stemming from
language ambiguity and complex image content, making it difficult for retrieval
systems to provide accurate and reliable results. To this end, we propose
ImageScope, a training-free, three-stage framework that leverages collective
reasoning to unify LGIR tasks. The key insight behind the unification lies in
the compositional nature of language, which transforms diverse LGIR tasks into
a generalized text-to-image retrieval process, along with the reasoning of LMMs
serving as a universal verification to refine the results. To be specific, in
the first stage, we improve the robustness of the framework by synthesizing
search intents across varying levels of semantic granularity using
chain-of-thought (CoT) reasoning. In the second and third stages, we then
reflect on retrieval results by verifying predicate propositions locally, and
performing pairwise evaluations globally. Experiments conducted on six LGIR
datasets demonstrate that ImageScope outperforms competitive baselines.
Comprehensive evaluations and ablation studies further confirm the
effectiveness of our design.
Authors' comments: WWW 2025
Yuwei Zhang, Jayanth Srinivasa, Gaowen Liu, Jingbo Shang
Large Language Models (LLMs) often exhibit substantially shorter effective
context lengths than their claimed capacities, especially when handling complex
reasoning tasks that require integrating information from multiple parts of a
long context and performing multi-step reasoning. Although Chain-of-Thought
(CoT) prompting has shown promise in reducing task complexity, our empirical
analysis reveals that it does not fully resolve this limitation. Through
controlled experiments, we identify poor recall of implicit facts as the
primary cause of failure, which significantly hampers reasoning performance.
Interestingly, we observe that the internal attention weights from the
generated CoT tokens can effectively ground implicit facts, even when these
facts are not explicitly recalled. Building on this insight, we propose a novel
training-free algorithm, Attrieval, which leverages attention weights to
retrieve relevant facts from the long context and incorporates them into the
reasoning process. Additionally, we find that selecting context tokens from CoT
tokens further improves performance. Our results demonstrate that Attrieval
enhances long-context reasoning capability notably on both synthetic and
real-world QA datasets with various models.
Authors' comments: Work in progress
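The core idea stated for Attrieval, using attention weights from generated CoT tokens to retrieve facts from the long context, can be illustrated on a toy attention matrix. The sketch below is a simplification under stated assumptions: a single attention matrix (no layer or head selection), facts given as hypothetical token spans, and mean attention as the score. The actual algorithm may aggregate and filter attention differently.

```python
from typing import List, Tuple


def retrieve_facts_by_attention(
    attention: List[List[float]],       # attention[i][j]: weight from CoT token i to context token j
    fact_spans: List[Tuple[int, int]],  # [start, end) context-token range of each candidate fact
    top_k: int,
) -> List[int]:
    """Rank facts by the mean attention their tokens receive from the CoT tokens."""
    num_context = len(attention[0])
    # Average over CoT tokens to get one score per context token.
    token_scores = [
        sum(row[j] for row in attention) / len(attention) for j in range(num_context)
    ]
    # Average token scores inside each fact span so long facts are not favoured.
    fact_scores = [
        sum(token_scores[start:end]) / (end - start) for start, end in fact_spans
    ]
    ranked = sorted(range(len(fact_spans)), key=lambda i: fact_scores[i], reverse=True)
    return ranked[:top_k]


# Two CoT tokens attending over six context tokens that form three two-token facts.
attention = [
    [0.05, 0.05, 0.40, 0.30, 0.10, 0.10],
    [0.10, 0.10, 0.30, 0.30, 0.10, 0.10],
]
top_facts = retrieve_facts_by_attention(attention, [(0, 2), (2, 4), (4, 6)], top_k=2)
print(top_facts)
```

The returned indices identify the facts that would be appended to the reasoning prompt before the model continues.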
Jihao Zhao, Zhiyuan Ji, Zhaoxin Fan, Hanyu Wang, Simin Niu, Bo Tang, Feiyu Xiong, Zhiyu Li
Retrieval-Augmented Generation (RAG), while serving as a viable complement to large language models (LLMs), often overlooks the crucial aspect of text chunking within its pipeline. This paper first introduces a dual-metric evaluation method, comprising Boundary Clarity and Chunk Stickiness, to enable the direct quantification of chunking quality. Leveraging this assessment method, we highlight the inherent limitations of traditional and semantic chunking in handling complex contextual nuances, thereby substantiating the necessity of integrating LLMs into the chunking process. To address the inherent trade-off between computational efficiency and chunking precision in LLM-based approaches, we devise the granularity-aware Mixture-of-Chunkers (MoC) framework, which consists of a three-stage processing mechanism. Notably, our objective is to guide the chunker towards generating a structured list of chunking regular expressions, which are subsequently employed to extract chunks from the original text. Extensive experiments demonstrate that both our proposed metrics and the MoC framework effectively address the challenges of the chunking task, revealing the chunking kernel while enhancing the performance of the RAG system.
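The step in which a structured list of chunking regular expressions is applied to the original text can be sketched as below. The pattern style (each regex anchors a chunk by its opening and closing words) and the helper name `extract_chunks` are assumptions for illustration; the abstract does not specify the regex format that the chunker emits.

```python
import re
from typing import List


def extract_chunks(text: str, chunk_patterns: List[str]) -> List[str]:
    """Apply each chunking regex in order, consuming the text left to right."""
    chunks = []
    position = 0
    for pattern in chunk_patterns:
        match = re.compile(pattern, re.DOTALL).search(text, position)
        if match is None:
            continue  # a pattern that does not match is skipped, not guessed at
        chunks.append(match.group(0).strip())
        position = match.end()
    return chunks


text = (
    "RAG retrieves passages before generation. Chunking decides what a passage is. "
    "Fixed-size windows ignore topic shifts. Semantic chunkers follow them more closely."
)
# Hypothetical chunker output: each regex names a chunk by its first and last words.
patterns = [
    r"RAG retrieves.*?passage is\.",
    r"Fixed-size.*?more closely\.",
]
chunks = extract_chunks(text, patterns)
for chunk in chunks:
    print(chunk)
```

Because each chunk is copied from the source text by the regex match, the extracted chunks are verbatim spans of the original rather than text regenerated by a model.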
Ruihai Wu, Ziyu Zhu, Yuran Wang, Yue Chen, Jiarui Wang, Hao Dong
Cluttered garment manipulation poses significant challenges due to the complex, deformable nature of garments and intricate garment relations. Unlike single-garment manipulation, cluttered scenarios require managing complex garment entanglements and interactions, while maintaining garment cleanliness and manipulation stability. To address these demands, we propose to learn point-level affordance, a dense representation that models the complex space and multi-modal manipulation candidates while being aware of garment geometry, structure, and inter-object relations. Additionally, as it is difficult to directly retrieve a garment from some extremely entangled clutter, we introduce an adaptation module, guided by learned affordance, to reorganize highly entangled garments into states plausible for manipulation. Our framework demonstrates effectiveness over environments featuring diverse garment types and pile configurations in both simulation and the real world. Project page: https://garmentpile.github.io/.
Amirmohammad Azadi, Sina Zamani, Mohammadmostafa Rostamkhani, Sauleh Eetemadi
This paper describes our system for SemEval 2025 Task 7: Previously Fact-Checked Claim Retrieval. The task requires retrieving relevant fact-checks for a given input claim from the extensive, multilingual MultiClaim dataset, which comprises social media posts and fact-checks in several languages. To address this challenge, we first evaluated zero-shot performance using state-of-the-art English and multilingual retrieval models and then fine-tuned the most promising systems, leveraging machine translation to enhance crosslingual retrieval. Our best model achieved an accuracy of 85% on crosslingual data and 92% on monolingual data.
Mayank Singh, Abhijeet Kumar, Sasidhar Donaparthi, Gayatri Karambelkar
Data catalogs serve as repositories for organizing and accessing diverse
collections of data assets, but their effectiveness hinges on the ease with
which business users can look up relevant content. Unfortunately, many data
catalogs within organizations suffer from limited searchability due to
inadequate metadata such as asset descriptions. Hence, there is a need for a
content generation solution to enrich and curate metadata in a scalable way.
This paper explores the challenges associated with metadata creation and
proposes a unique prompt enrichment idea that leverages existing metadata
content through a retrieval-based few-shot technique tied to generative large
language models (LLMs). The paper also considers fine-tuning an LLM on existing
content and studies the behavior of few-shot pretrained LLMs (Llama, GPT-3.5)
vis-à-vis a few-shot fine-tuned LLM (Llama2-7b) by evaluating their performance
on accuracy, factual grounding, and toxicity. Our preliminary results show more
than 80% Rouge-1 F1 for the generated content. This translated into 87%-88% of
instances being accepted as is or curated with minor edits by data stewards. By
automatically generating accurate descriptions for tables and columns, the
research attempts to provide an overall framework for enterprises to
effectively scale metadata curation and enrich their data catalogs, thereby
vastly improving data catalog searchability and overall usability.
Authors' comments: Presented at the 5th International Conference on NLP & Text Mining (NLTM 2025)
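The Rouge-1 F1 figure reported in the abstract above is a unigram-overlap measure between a generated description and a reference description. A minimal sketch of the computation follows, assuming lowercased whitespace tokenization with no stemming and invented example strings; standard ROUGE toolkits may tokenize and normalize differently, so scores from this sketch are not directly comparable to published numbers.

```python
from collections import Counter


def rouge1_f1(candidate: str, reference: str) -> float:
    """Unigram-overlap F1 with lowercased whitespace tokenization (no stemming)."""
    cand_counts = Counter(candidate.lower().split())
    ref_counts = Counter(reference.lower().split())
    overlap = sum((cand_counts & ref_counts).values())  # clipped unigram matches
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand_counts.values())
    recall = overlap / sum(ref_counts.values())
    return 2 * precision * recall / (precision + recall)


# Hypothetical column descriptions: one generated, one curated by a data steward.
generated = "customer id for each retail account"
curated = "unique customer id for each account"
print(round(rouge1_f1(generated, curated), 3))
```

Here five of the six unigrams on each side match, so precision and recall are both 5/6 and the F1 is 5/6 as well.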
Jiawei Zhou, Lei Chen
In this paper, we analyze and empirically show that the learned relevance for conventional information retrieval (IR) scenarios may be inconsistent in retrieval-augmented generation (RAG) scenarios. To bridge this gap, we introduce OpenRAG, a RAG framework that is optimized end-to-end by tuning the retriever to capture in-context relevance, enabling adaptation to the diverse and evolving needs. Extensive experiments across a wide range of tasks demonstrate that OpenRAG, by tuning a retriever end-to-end, leads to a consistent improvement of 4.0% over the original retriever, consistently outperforming existing state-of-the-art retrievers by 2.1%. Additionally, our results indicate that for some tasks, an end-to-end tuned 0.2B retriever can achieve improvements that surpass those of RAG-oriented or instruction-tuned 8B large language models (LLMs), highlighting the cost-effectiveness of our approach in enhancing RAG systems.
Leandro Carísio Fernandes, Leandro dos Santos Ribeiro, Marcos Vinícius Borela de Castro, Leonardo Augusto da Silva Pacheco, Edans Flávius de Oliveira Sandes
This paper introduces JurisTCU, a Brazilian Portuguese dataset for legal
information retrieval (LIR). The dataset is freely available and consists of
16,045 jurisprudential documents from the Brazilian Federal Court of Accounts,
along with 150 queries annotated with relevance judgments. It addresses the
scarcity of Portuguese-language LIR datasets with query relevance annotations.
The queries are organized into three groups: real user keyword-based queries,
synthetic keyword-based queries, and synthetic question-based queries.
Relevance judgments were produced through a hybrid approach combining LLM-based
scoring with expert domain validation. We used JurisTCU in 14 experiments using
lexical search (document expansion methods) and semantic search (BERT-based and
OpenAI embeddings). We show that the document expansion methods significantly
improve the performance of standard BM25 search on this dataset, with
improvements exceeding 45% in P@10, R@10, and nDCG@10 metrics when evaluating
short keyword-based queries. Among the embedding models, the OpenAI models
produced the best results, with improvements of approximately 70% in P@10,
R@10, and nDCG@10 metrics for short keyword-based queries, suggesting that
these dense embeddings capture semantic relationships in this domain,
surpassing the reliance on lexical terms. Besides offering a dataset for the
Portuguese-language IR research community, suitable for evaluating search
systems, the results also contribute to enhancing a search system highly
relevant to Brazilian citizens.
Authors' comments: 21 pages
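The P@10, R@10, and nDCG@10 metrics used in the JurisTCU experiments can be computed as in the sketch below. It assumes graded relevance judgments stored in a dictionary, a linear gain with a log2 discount for nDCG, and an invented toy ranking; the paper's exact gain function and evaluation tooling are not stated in the abstract.

```python
import math
from typing import Dict, List


def precision_at_k(ranking: List[str], qrels: Dict[str, int], k: int) -> float:
    """Fraction of the top-k results that are relevant (grade > 0)."""
    return sum(1 for d in ranking[:k] if qrels.get(d, 0) > 0) / k


def recall_at_k(ranking: List[str], qrels: Dict[str, int], k: int) -> float:
    """Fraction of all relevant documents that appear in the top k."""
    total_relevant = sum(1 for grade in qrels.values() if grade > 0)
    if total_relevant == 0:
        return 0.0
    return sum(1 for d in ranking[:k] if qrels.get(d, 0) > 0) / total_relevant


def ndcg_at_k(ranking: List[str], qrels: Dict[str, int], k: int) -> float:
    """DCG of the ranking divided by the DCG of an ideal ordering (linear gain)."""
    def dcg(grades: List[int]) -> float:
        return sum(g / math.log2(i + 2) for i, g in enumerate(grades))

    ideal = dcg(sorted(qrels.values(), reverse=True)[:k])
    if ideal == 0:
        return 0.0
    return dcg([qrels.get(d, 0) for d in ranking[:k]]) / ideal


qrels = {"a": 2, "b": 1, "c": 0, "d": 1}  # hypothetical graded judgments
ranking = ["b", "c", "a", "e"]            # hypothetical system output
print(precision_at_k(ranking, qrels, 3))
print(recall_at_k(ranking, qrels, 3))
print(ndcg_at_k(ranking, qrels, 3))
```

Documents missing from `qrels` (such as "e") are treated as non-relevant, which is the usual convention for unjudged documents.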
Nandakishor M
In this paper, I present our work on DeepRAG, a specialized embedding model we built specifically for the Hindi language in RAG systems. While LLMs have gotten really good at generating text, their performance in retrieval tasks still depends heavily on having quality embeddings - something that's been lacking for Hindi despite it being one of the world's most spoken languages. We tackled this by creating embeddings from the ground up rather than just fine-tuning existing models. Our process involved collecting diverse Hindi texts (over 2.7M samples), training a custom SentencePiece tokenizer that actually understands Hindi morphology, designing a transformer architecture with Hindi-specific attention mechanisms, and optimizing with contrastive learning. Results were honestly better than I expected - we saw a 23% improvement in retrieval precision compared to the multilingual models everyone's been using. The paper details our methodology, which I think could help others working with low-resource languages where the one-size-fits-all multilingual models fall short. We've also integrated our embeddings with LangChain to build complete Hindi RAG systems, which might be useful for practitioners. While there's still tons more to explore, I believe this work addresses a critical gap for Hindi NLP and demonstrates why language-specific approaches matter.
Siyuan Wang, James R. Foulds, Md Osman Gani, Shimei Pan
In this paper, we introduce CIBER (Claim Investigation Based on Evidence Retrieval), an extension of the Retrieval-Augmented Generation (RAG) framework designed to identify corroborating and refuting documents as evidence for scientific claim verification. CIBER addresses the inherent uncertainty in Large Language Models (LLMs) by evaluating response consistency across diverse interrogation probes. By focusing on the behavioral analysis of LLMs without requiring access to their internal information, CIBER is applicable to both white-box and black-box models. Furthermore, CIBER operates in an unsupervised manner, enabling easy generalization across various scientific domains. Comprehensive evaluations conducted using LLMs with varying levels of linguistic proficiency reveal CIBER's superior performance compared to conventional RAG approaches. These findings not only highlight the effectiveness of CIBER but also provide valuable insights for future advancements in LLM-based scientific claim verification.
Hsin-Ling Hsu, Ping-Sheng Lin, Jing-Di Lin, Jengnan Tzeng
Hybrid Retrieval systems, combining Sparse and Dense Retrieval methods, struggle with Traditional Chinese non-narrative documents due to their complex formatting, rich vocabulary, and the insufficient understanding of Chinese synonyms by common embedding models. Previous approaches inadequately address the dual needs of these systems, focusing mainly on general text quality improvement rather than optimizing for retrieval. We propose Knowledge-Aware Preprocessing (KAP), a novel framework that transforms noisy OCR outputs into retrieval-optimized text. KAP adopts a two-stage approach: it first extracts text using OCR, then employs Multimodal Large Language Models to refine the output by integrating visual information from the original documents. This design reduces OCR noise, reconstructs structural elements, and formats the text to satisfy the distinct requirements of sparse and dense retrieval. Empirical results demonstrate that KAP consistently and significantly outperforms conventional preprocessing approaches. Our code is available at https://github.com/JustinHsu1019/KAP.