Chen Xu, Zhirui Deng, Clara Rus, Xiaopeng Ye, Yuanna Liu, Jun Xu, Zhicheng Dou, Ji-Rong Wen et al.
In modern information retrieval (IR), achieving more than just accuracy is essential to sustaining a healthy ecosystem, especially when addressing fairness and diversity considerations. To meet these needs, various datasets, algorithms, and evaluation frameworks have been introduced. However, these algorithms are often tested across diverse metrics, datasets, and experimental setups, leading to inconsistencies and difficulties in direct comparisons. This highlights the need for a comprehensive IR toolkit that enables standardized evaluation of fairness- and diversity-aware algorithms across different IR tasks. To address this challenge, we present FairDiverse, an open-source and standardized toolkit. FairDiverse offers a framework for integrating fair and diverse methods, including pre-processing, in-processing, and post-processing techniques, at different stages of the IR pipeline. The toolkit supports the evaluation of 28 fairness and diversity algorithms across 16 base models, covering two core IR tasks (search and recommendation), thereby establishing a comprehensive benchmark. Moreover, FairDiverse is highly extensible, providing multiple APIs that empower IR researchers to swiftly develop and evaluate their own fairness- and diversity-aware models, while ensuring fair comparisons with existing baselines. The project is open-sourced and available at https://github.com/XuChen0427/FairDiverse.
Qianchi Zhang, Hainan Zhang, Liang Pang, Ziwei Wang, Hongwei Zheng, Yongxin Tong, Zhiming Zheng
Retrieved documents containing noise will hinder Retrieval-Augmented
Generation (RAG) from detecting answer clues, necessitating noise filtering
mechanisms to enhance accuracy. Existing methods use reranking or summarization
to identify the most relevant sentences, but directly and accurately locating
answer clues from these large-scale and complex documents remains challenging.
Unlike these document-level operations, we treat noise filtering as a
sentence-level MinMax optimization problem: first identifying potential clues
from multiple documents, then ranking them by relevance, and finally retaining
the minimum number of clues through truncation. In this paper, we propose
FineFilter, a novel fine-grained noise filtering mechanism for RAG, consisting
of a clue extractor, a reranker, and a truncator. We optimize each module to
tackle complex reasoning challenges: (1) The clue extractor first uses
sentences containing the answer and similar ones as fine-tuning targets, aiming
to extract sufficient potential clues; (2) The reranker is trained to
prioritize effective clues based on the real feedback from the generation
module, with clues capable of generating correct answers as positive samples
and others as negative; (3) The truncator takes the minimum number of clues
needed to answer the question (truncation point) as fine-tuning targets, and
performs truncation on the reranked clues to achieve fine-grained noise
filtering. Experiments on three QA datasets demonstrate that FineFilter
significantly improves QA performance over baselines on both LLaMA3 and
Mistral. Further analysis confirms its effectiveness in complex reasoning,
robustness to unreliable retrieval, and generalization to different scenarios.
Authors' comments: 18 pages, 4 figures, 18 tables, under review
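The extract, rerank, truncate pipeline described in the FineFilter abstract above can be illustrated with a minimal sketch. This is not the authors' implementation: in the paper each stage is a fine-tuned model, whereas here word overlap with the question stands in for both the clue extractor and the reranker, and the truncation point `k` is passed in by hand rather than predicted. All function names are hypothetical.

```python
import re

def tokens(text):
    """Lowercased word set, used as a crude relevance signal."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def extract_clues(question, documents, min_overlap=1):
    """Stage 1 (stand-in): keep sentences sharing words with the question."""
    q = tokens(question)
    clues = []
    for doc in documents:
        for sent in re.split(r"(?<=[.!?])\s+", doc.strip()):
            if sent and len(tokens(sent) & q) >= min_overlap:
                clues.append(sent)
    return clues

def rerank(question, clues):
    """Stage 2 (stand-in): order clues by overlap with the question."""
    q = tokens(question)
    return sorted(clues, key=lambda s: len(tokens(s) & q), reverse=True)

def truncate(ranked_clues, k):
    """Stage 3 (stand-in): keep the first k clues; k is assumed given."""
    return ranked_clues[:k]

question = "Where was Marie Curie born?"
documents = [
    "Marie Curie was born in Warsaw. She won two Nobel Prizes.",
    "Paris is the capital of France. Curie worked in Paris.",
]
clues = extract_clues(question, documents)
filtered = truncate(rerank(question, clues), k=1)
print(filtered)
```

The point of the sketch is the order of operations at sentence level: gather candidates broadly, rank them, then keep as few as possible.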
SeongKu Kang, Bowen Jin, Wonbin Kweon, Yu Zhang, Dongha Lee, Jiawei Han, Hwanjo Yu
In specialized fields like the scientific domain, constructing large-scale
human-annotated datasets poses a significant challenge due to the need for
domain expertise. Recent methods have employed large language models to
generate synthetic queries, which serve as proxies for actual user queries.
However, they lack control over the content generated, often resulting in
incomplete coverage of academic concepts in documents. We introduce Concept
Coverage-based Query set Generation (CCQGen) framework, designed to generate a
set of queries with comprehensive coverage of the document's concepts. A key
distinction of CCQGen is that it adaptively adjusts the generation process
based on the previously generated queries. We identify concepts not
sufficiently covered by previous queries, and leverage them as conditions for
subsequent query generation. This approach guides each new query to complement
the previous ones, aiding in a thorough understanding of the document.
Extensive experiments demonstrate that CCQGen significantly enhances query
quality and retrieval performance.
Authors' comments: WSDM 2025
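The adaptive loop described in the CCQGen abstract above (find concepts the earlier queries missed, then condition the next query on them) can be sketched as follows. This is an illustration under stated assumptions, not the paper's method: coverage is measured by simple substring matching, and `template_generator` is a hypothetical stand-in for the LLM call.

```python
def uncovered_concepts(doc_concepts, queries, top_n=2):
    """Return the top_n concepts mentioned least often by the queries so far."""
    def coverage(concept):
        return sum(concept.lower() in q.lower() for q in queries)
    # sorted() is stable, so ties keep the original concept order.
    return sorted(doc_concepts, key=coverage)[:top_n]

def generate_query_set(doc_concepts, generate, num_queries=3, top_n=2):
    """Generate queries one at a time, each conditioned on uncovered concepts."""
    queries = []
    for _ in range(num_queries):
        conditions = uncovered_concepts(doc_concepts, queries, top_n)
        queries.append(generate(conditions))
    return queries

def template_generator(conditions):
    """Hypothetical stand-in for an LLM prompted with the condition concepts."""
    return "papers about " + " and ".join(conditions)

concepts = [
    "dense retrieval",
    "query generation",
    "contrastive learning",
    "scientific documents",
]
queries = generate_query_set(concepts, template_generator, num_queries=2)
for q in queries:
    print(q)
```

Because each step looks at what earlier queries already mention, the second query picks up the two concepts the first one left out.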
Ting-Rui Chiang, Dani Yogatama
The Rotary Position Embedding (RoPE) is widely used in the attention heads of many large language models (LLMs). It rotates dimensions in the query and the key vectors by different angles according to their positions in the input sequence. For long context modeling, the range of positions may vary a lot, and thus RoPE rotates some dimensions by a great range of angles. We hypothesize that the wide range of rotation angles may prevent LLMs from utilizing those dimensions. To validate this hypothesis, we present a controlled experiment showing that applying RoPE causes low utility of certain dimensions. Our analyses on three LLMs also indicate that these dimensions do not help LLMs do long-context question answering.
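To make the rotation described above concrete, here is a minimal pure-Python sketch of RoPE applied to a single vector. It assumes the interleaved pairing of dimensions (2i, 2i+1) and the commonly used base of 10000; real implementations differ in how they pair dimensions, and none of the names below come from the paper.

```python
import math

def rope_rotate(vec, position, base=10000.0):
    """Rotate each pair (vec[2i], vec[2i+1]) by position * base**(-2i/d).

    Low-index pairs get large angles that sweep many full turns over a
    long context; high-index pairs rotate slowly.
    """
    d = len(vec)
    if d % 2 != 0:
        raise ValueError("RoPE needs an even number of dimensions")
    out = []
    for i in range(d // 2):
        theta = position * base ** (-2.0 * i / d)
        x, y = vec[2 * i], vec[2 * i + 1]
        out.append(x * math.cos(theta) - y * math.sin(theta))
        out.append(x * math.sin(theta) + y * math.cos(theta))
    return out

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

q = [1.0, 0.0, 0.5, -0.5]
k = [0.3, 0.7, -0.2, 0.1]

# The attention score depends only on the relative offset (3 in both cases).
score_near = dot(rope_rotate(q, 5), rope_rotate(k, 2))
score_far = dot(rope_rotate(q, 1003), rope_rotate(k, 1000))
print(score_near, score_far)
```

With position 1000 and base 10000, the first pair is rotated by 1000 radians while the last pair of a wide head is rotated by a fraction of a radian, which is the spread of angles the hypothesis is about.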
Ze Liu, Zhengyang Liang, Junjie Zhou, Zheng Liu, Defu Lian
With the popularity of multimodal techniques, there is growing interest in acquiring useful information in visual forms. In this work, we formally define an emerging IR paradigm called Visualized Information Retrieval, or Vis-IR, where multimodal information, such as texts, images, tables and charts, is jointly represented by a unified visual format called Screenshots, for various retrieval applications. We further make three key contributions for Vis-IR. First, we create VIRA (Vis-IR Aggregation), a large-scale dataset comprising a vast collection of screenshots from diverse sources, carefully curated into captioned and question-answer formats. Second, we develop UniSE (Universal Screenshot Embeddings), a family of retrieval models that enable screenshots to query or be queried across arbitrary data modalities. Finally, we construct MVRB (Massive Visualized IR Benchmark), a comprehensive benchmark covering a variety of task forms and application scenarios. Through extensive evaluations on MVRB, we highlight the deficiencies of existing multimodal retrievers and the substantial improvements made by UniSE. Our work will be shared with the community, laying a solid foundation for this emerging field.
Alexandru Lecu, Adrian Groza, Lezan Hawizy
Large language models (LLMs) have significantly advanced the field of natural language generation. However, they frequently generate unverified outputs, which compromises their reliability in critical applications. In this study, we propose an innovative framework that combines structured biomedical knowledge with LLMs through a retrieval-augmented generation technique. Our system develops a thorough knowledge graph by identifying and refining causal relationships and named entities from medical abstracts related to age-related macular degeneration (AMD). Using a vector-based retrieval process and a locally deployed language model, our framework produces responses that are both contextually relevant and verifiable, with direct references to clinical evidence. Experimental results show that this method notably decreases hallucinations, enhances factual precision, and improves the clarity of generated responses, providing a robust solution for advanced biomedical chatbot applications.
Zihan Wang, Yaohui Zhu, Gim Hee Lee, Yachun Fan
Vision-and-Language Navigation (VLN) is an essential skill for embodied agents, allowing them to navigate in 3D environments following natural language instructions. High-performance navigation models require a large amount of training data, but the high cost of manually annotating data has seriously hindered this field. Therefore, some previous methods translate trajectory videos into step-by-step instructions for expanding data, but such instructions do not match well with users' communication styles that briefly describe destinations or state specific needs. Moreover, local navigation trajectories overlook global context and high-level task planning. To address these issues, we propose NavRAG, a retrieval-augmented generation (RAG) framework that generates user demand instructions for VLN. NavRAG leverages an LLM to build a hierarchical scene description tree for 3D scene understanding from global layout to local details, then simulates various user roles with specific demands to retrieve from the scene tree, generating diverse instructions with the LLM. We annotate over 2 million navigation instructions across 861 scenes and evaluate the data quality and navigation performance of trained models.
Mohammad Reza Rezaei, Adji Bousso Dieng
Retrieval-augmented generation (RAG) enhances large language models (LLMs)
for domain-specific question-answering (QA) tasks by leveraging external
knowledge sources. However, traditional RAG systems primarily focus on
relevance-based retrieval and often struggle with redundancy, especially when
reasoning requires connecting information from multiple sources. This paper
introduces Vendi-RAG, a framework based on an iterative process that jointly
optimizes retrieval diversity and answer quality. This joint optimization leads
to significantly higher accuracy for multi-hop QA tasks. Vendi-RAG leverages
the Vendi Score (VS), a flexible similarity-based diversity metric, to promote
semantic diversity in document retrieval. It then uses an LLM judge that
evaluates candidate answers, generated after a reasoning step, and outputs a
score that the retriever uses to balance relevance and diversity among the
retrieved documents during each iteration. Experiments on three challenging
datasets -- HotpotQA, MuSiQue, and 2WikiMultiHopQA -- demonstrate Vendi-RAG's
effectiveness in multi-hop reasoning tasks. The framework achieves significant
accuracy improvements over traditional single-step and multi-step RAG
approaches, with accuracy increases reaching up to +4.2% on HotpotQA, +4.1% on
2WikiMultiHopQA, and +1.3% on MuSiQue compared to Adaptive-RAG, the current
best baseline. The benefits of Vendi-RAG are even more pronounced as the number
of retrieved documents increases. Finally, we evaluated Vendi-RAG across
different LLM backbones, including GPT-3.5, GPT-4, and GPT-4o-mini, and
observed consistent improvements, demonstrating that the framework's advantages
are model-agnostic.
Authors' comments: A RAG pipeline that accounts for both diversity and answer quality
and that can be used with any LLM backbone to solve complex multi-hop
question-answering tasks
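The Vendi Score mentioned in the abstract above is defined as the exponential of the Shannon entropy of the eigenvalues of K/n, where K is an n-by-n similarity matrix with ones on the diagonal. A minimal pure-Python sketch follows; it covers only the metric, not the Vendi-RAG retrieval loop. The Jacobi eigenvalue routine is a generic textbook method chosen here to avoid external dependencies, and the example similarity values are made up for illustration.

```python
import math

def symmetric_eigenvalues(matrix, sweeps=50, tol=1e-12):
    """Eigenvalues of a symmetric matrix via cyclic Jacobi rotations."""
    a = [list(map(float, row)) for row in matrix]
    n = len(a)
    for _ in range(sweeps):
        off = sum(a[i][j] ** 2 for i in range(n) for j in range(i + 1, n))
        if off < tol:
            break
        for p in range(n - 1):
            for q in range(p + 1, n):
                apq = a[p][q]
                if abs(apq) < 1e-15:
                    continue
                theta = (a[q][q] - a[p][p]) / (2.0 * apq)
                t = math.copysign(1.0, theta) / (abs(theta) + math.hypot(theta, 1.0))
                c = 1.0 / math.hypot(t, 1.0)
                s = t * c
                a[p][p] -= t * apq
                a[q][q] += t * apq
                a[p][q] = a[q][p] = 0.0
                for r in range(n):
                    if r != p and r != q:
                        arp, arq = a[r][p], a[r][q]
                        a[r][p] = a[p][r] = c * arp - s * arq
                        a[r][q] = a[q][r] = s * arp + c * arq
    return [a[i][i] for i in range(n)]

def vendi_score(similarity):
    """exp(entropy) of the eigenvalues of similarity / n."""
    n = len(similarity)
    lams = [ev / n for ev in symmetric_eigenvalues(similarity)]
    # Treat zero (or numerically tiny) eigenvalues as contributing 0 * log 0 = 0.
    entropy = -sum(lam * math.log(lam) for lam in lams if lam > 1e-12)
    return math.exp(entropy)

# Two near-duplicate documents plus one distinct document (made-up values).
K = [[1.0, 0.9, 0.1],
     [0.9, 1.0, 0.1],
     [0.1, 0.1, 1.0]]
print(vendi_score(K))
```

The score can be read as an effective number of distinct items: it equals n when all items are mutually dissimilar (K is the identity) and 1 when all items are identical (K is all ones).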
Florian Neukart, Eike Marx, Valerii Vinokur
We present a series of quantum computing experiments designed to test a
central prediction of the Quantum Memory Matrix (QMM) hypothesis - that quantum
information can be locally stored in finite-dimensional cells of space-time and
later retrieved in a fully unitary and reversible manner. Our work encompasses
five distinct experiments: a basic three-qubit imprint-retrieval cycle, an
extended five-qubit model implementing two parallel cycles, and variations
incorporating dynamic evolution and controlled error injection. In each case, a
field qubit is prepared in an arbitrary superposition, and its state is
imprinted onto memory qubit(s) via controlled-R_y gates, with subsequent
controlled-SWAP operations retrieving the stored information into output
qubit(s). Execution on an IBM Quantum Processing Unit using the Qiskit Runtime
service yielded significant correlations between the initially prepared field
states and the retrieved outputs, with fidelities that, while subject to
hardware noise and decoherence, consistently demonstrate the reversible and
unitary nature of the process. These results not only confirm the basic
imprint-retrieval cycle as predicted by the QMM hypothesis but also establish a
scalable experimental methodology that may ultimately contribute to resolving
challenges such as the black hole information paradox and advancing our
understanding of quantum gravity.
Authors' comments: 9 pages, 5 figures
Tianci Liu, Haoxiang Jiang, Tianze Wang, Ran Xu, Yue Yu, Linjun Zhang, Tuo Zhao, Haoyu Wang
Large language models (LLMs) have achieved impressive performance but face high computational costs and latency, limiting their deployment in resource-constrained settings. In contrast, small-scale LLMs (SLMs) are more efficient yet struggle to capture evolving real-world knowledge. Retrieval-augmented generation (RAG) helps by integrating external knowledge, but imperfect retrieval can introduce distracting noise that misleads SLMs. We propose RoseRAG, a robust RAG framework for SLMs via Margin-aware Preference Optimization. RoseRAG employs multi-turn prompting for detailed reasoning, rejection sampling for high-quality explanations, and contrastive preference selection to refine responses by maximizing the likelihood gap between preferred and non-preferred outputs. By integrating these components into a margin-aware optimization process, RoseRAG robustly enhances the accuracy and reliability of SLMs for RAG applications. Extensive experiments on three open-domain question answering benchmarks indicate that our innovative RoseRAG surpasses state-of-the-art baselines significantly.
Xiangyu Zhang, Hexin Liu, Qiquan Zhang, Beena Ahmed, Julien Epps
Large Language Models (LLMs) have been increasingly adopted for health-related tasks, yet their performance in depression detection remains limited when relying solely on text input. While Retrieval-Augmented Generation (RAG) typically enhances LLM capabilities, our experiments indicate that traditional text-based RAG systems struggle to significantly improve depression detection accuracy. This challenge stems partly from the rich depression-relevant information encoded in acoustic speech patterns, information that current text-only approaches fail to capture effectively. To address this limitation, we conduct a systematic analysis of temporal speech patterns, comparing healthy individuals with those experiencing depression. Based on our findings, we introduce Speech Timing-based Retrieval-Augmented Generation, SpeechT-RAG, a novel system that leverages speech timing features for both accurate depression detection and reliable confidence estimation. This integrated approach not only outperforms traditional text-based RAG systems in detection accuracy but also enhances uncertainty quantification through a confidence scoring mechanism that naturally extends from the same temporal features. Our unified framework achieves comparable results to fine-tuned LLMs without additional training while simultaneously addressing the fundamental requirements for both accuracy and trustworthiness in mental health assessment.
Shangda Wu, Zhancheng Guo, Ruibin Yuan, Junyan Jiang, Seungheon Doh, Gus Xia, Juhan Nam, Xiaobing Li et al.
CLaMP 3 is a unified framework developed to address challenges of cross-modal
and cross-lingual generalization in music information retrieval. Using
contrastive learning, it aligns all major music modalities--including sheet
music, performance signals, and audio recordings--with multilingual text in a
shared representation space, enabling retrieval across unaligned modalities
with text as a bridge. It features a multilingual text encoder adaptable to
unseen languages, exhibiting strong cross-lingual generalization. Leveraging
retrieval-augmented generation, we curated M4-RAG, a web-scale dataset
consisting of 2.31 million music-text pairs. This dataset is enriched with
detailed metadata that represents a wide array of global musical traditions. To
advance future research, we release WikiMT-X, a benchmark comprising 1,000
triplets of sheet music, audio, and richly varied text descriptions.
Experiments show that CLaMP 3 achieves state-of-the-art performance on multiple
MIR tasks, significantly surpassing previous strong baselines and demonstrating
excellent generalization in multimodal and multilingual music contexts.
Authors' comments: 20 pages, 8 figures, 12 tables
Chang Liu, Ying Chang, Jianmin Li, Yiqian Qu, Yu Li, Lingyong Cao, Shuyuan Lin
Objectives: Large language models (LLMs) can harness medical knowledge for intelligent question answering (Q&A), promising support for auxiliary diagnosis and medical talent cultivation. However, there is a deficiency of highly efficient retrieval-augmented generation (RAG) frameworks within the domain of Traditional Chinese Medicine (TCM). Our purpose is to observe the effect of the Tree-Organized Self-Reflective Retrieval (TOSRR) framework on LLMs in TCM Q&A tasks. Materials and Methods: We introduce a novel approach to knowledge organization, constructing a hierarchical tree-structured knowledge base. At inference time, our self-reflection framework retrieves from this knowledge base, integrating information across chapters. Questions from the TCM Medical Licensing Examination (MLE) and the college Classics Course Exam (CCE) were randomly selected as benchmark datasets. Results: By coupling with GPT-4, the framework can improve the best performance on the TCM MLE benchmark by 19.85% in absolute accuracy, and improve recall accuracy from 27% to 38% on CCE datasets. In manual evaluation, the framework improves by a total of 18.52 points across dimensions of safety, consistency, explainability, compliance, and coherence. Conclusion: The TOSRR framework can effectively improve LLMs' capability in Q&A tasks of TCM.
Zhaodong Wang, Weizhi Du, Md Omar Faruk Rokon, Pooshpendu Adhikary, Yanbing Xue, Jiaxuan Xu, Jianghong Zhou, Kuang-chih Lee et al.
Sponsored search in e-commerce poses several unique and complex challenges. These challenges stem from factors such as the asymmetric language structure between search queries and product names, the inherent ambiguity in user search intent, and the vast volume of sparse and imbalanced search corpus data. The role of the retrieval component within a sponsored search system is pivotal, serving as the initial step that directly affects the subsequent ranking and bidding systems. In this paper, we present an end-to-end solution tailored to optimize the ads retrieval system on Walmart.com. First, we pretrain a BERT-like classification model with product category information, enhancing the model's understanding of Walmart product semantics. Second, we design a two-tower Siamese Network structure for embedding structures to augment training efficiency. Third, we introduce a Human-in-the-loop Progressive Fusion Training method to ensure robust model performance. Our results demonstrate the effectiveness of this pipeline. It enhances the search relevance metric by up to 16% compared to a baseline DSSM-based model. Moreover, our large-scale online A/B testing demonstrates that our approach surpasses the ad revenue of the existing production model.
Ye Dong, Yan Lin Aung, Sudipta Chattopadhyay, Jianying Zhou
Internet of Things (IoT) has gained widespread popularity, revolutionizing
industries and daily life. However, it has also emerged as a prime target for
attacks. Numerous efforts have been made to improve IoT security, and
substantial IoT security and threat information, such as datasets and reports,
have been developed. However, existing research often falls short in leveraging
these insights to assist or guide users in harnessing IoT security practices in
a clear and actionable way. In this paper, we propose ChatIoT, a large language
model (LLM)-based IoT security assistant designed to disseminate IoT security
and threat intelligence. By leveraging the versatile property of
retrieval-augmented generation (RAG), ChatIoT successfully integrates the
advanced language understanding and reasoning capabilities of LLM with
fast-evolving IoT security information. Moreover, we develop an end-to-end data
processing toolkit to handle heterogeneous datasets. This toolkit converts
datasets of various formats into retrievable documents and optimizes chunking
strategies for efficient retrieval. Additionally, we define a set of common use
case specifications to guide the LLM in generating answers aligned with users'
specific needs and expertise levels. Finally, we implement a prototype of
ChatIoT and conduct extensive experiments with different LLMs, such as LLaMA3,
LLaMA3.1, and GPT-4o. Experimental evaluations demonstrate that ChatIoT can
generate more reliable, relevant, and technically in-depth answers for most use
cases. When evaluating the answers with LLaMA3:70B, ChatIoT improves the above
metrics by over 10% on average, particularly in relevance and technicality,
compared to using LLMs alone.
Authors' comments: preprint, under revision, 19 pages, 13 figures, 8 tables
Daniel Freeman, Daniel Haider
The injectivity of ReLU layers in neural networks, the recovery of vectors
from clipped or saturated measurements, and (real) phase retrieval in
$\mathbb{R}^n$ allow for a similar problem formulation and characterization
using frame theory. In this paper, we revisit all three problems with a unified
perspective and derive lower Lipschitz bounds for ReLU layers and clipping
which are analogous to the previously known result for phase retrieval and are
optimal up to a constant factor.
Authors' comments: 22 pages
Kevin Flanagan, Dima Damen, Michael Wray
Video Moment Retrieval is a common task to evaluate the performance of
visual-language models - it involves localising start and end times of moments
in videos from query sentences. The current task formulation assumes that the
queried moment is present in the video, resulting in false positive moment
predictions when irrelevant query sentences are provided.
In this paper we propose the task of Negative-Aware Video Moment Retrieval
(NA-VMR), which considers both moment retrieval accuracy and negative query
rejection accuracy. We make the distinction between In-Domain and Out-of-Domain
negative queries and provide new evaluation benchmarks for two popular video
moment retrieval datasets: QVHighlights and Charades-STA. We analyse the
ability of current SOTA video moment retrieval approaches to adapt to
Negative-Aware Video Moment Retrieval and propose UniVTG-NA, an adaptation of
UniVTG designed to tackle NA-VMR. UniVTG-NA achieves high negative rejection
accuracy (avg. $98.4\%$) while retaining moment retrieval scores to
within $3.87\%$ Recall@1. Dataset splits and code are available at
https://github.com/keflanagan/MomentofUntruth
Authors' comments: 16 pages, 9 figures
Wei Li, Wen Luo, Guangyue Peng, Houfeng Wang
Grammatical error correction (GEC) aims to correct grammatical, spelling, and
semantic errors in natural language text. With the growth of large language
models (LLMs), direct text generation has gradually become the focus of the GEC
methods, and few-shot in-context learning presents a cost-effective solution.
However, selecting effective in-context examples remains challenging, as the
similarity between input texts does not necessarily correspond to similar
grammatical error patterns. In this paper, we propose a novel retrieval method
based on natural language grammatical error explanations (GEE) to address this
issue. Our method retrieves suitable few-shot demonstrations by matching the
GEE of the test input with that of pre-constructed database samples, where
explanations for erroneous samples are generated by LLMs. We conducted
multilingual GEC few-shot experiments on both major open-source and
closed-source LLMs. Experiments across five languages show that our method
outperforms existing semantic and BM25-based retrieval techniques, without
requiring additional training or language adaptation. This also suggests that
matching error patterns is key to selecting examples.
Authors' comments: Accepted by NAACL 2025 main conference
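The retrieval idea in the abstract above, selecting few-shot demonstrations by matching grammatical error explanations rather than the input sentences themselves, can be sketched as follows. This is an assumption-laden illustration: Jaccard overlap of words stands in for whatever explanation-matching function the paper uses, the explanations are hand-written here rather than generated by an LLM, and all names and data are hypothetical.

```python
import re

def tokens(text):
    """Lowercased alphabetic word set of an explanation."""
    return set(re.findall(r"[a-z]+", text.lower()))

def jaccard(a, b):
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def retrieve_demonstrations(test_explanation, database, k=2):
    """Return the k database samples whose explanations best match."""
    query = tokens(test_explanation)
    ranked = sorted(
        database,
        key=lambda ex: jaccard(query, tokens(ex["explanation"])),
        reverse=True,
    )
    return ranked[:k]

database = [
    {"source": "She go to school.", "target": "She goes to school.",
     "explanation": "subject-verb agreement error: third person singular verb needs -es"},
    {"source": "I have a apple.", "target": "I have an apple.",
     "explanation": "article error: use an before a vowel sound"},
    {"source": "He walk home.", "target": "He walks home.",
     "explanation": "subject-verb agreement error: third person singular verb needs -s"},
]

# Explanation of the test input's error (would come from an LLM in practice).
test_explanation = "subject-verb agreement error with a third person singular subject"
demos = retrieve_demonstrations(test_explanation, database, k=2)
for d in demos:
    print(d["source"], "->", d["target"])
```

Both retrieved samples share the error pattern of the test input even though their surface text differs, which is the property that plain input-text similarity does not guarantee.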
Prajwal Gatti, Kshitij Parikh, Dhriti Prasanna Paul, Manish Gupta, Anand Mishra
Non-native speakers with limited vocabulary often struggle to name specific
objects despite being able to visualize them, e.g., people outside Australia
searching for numbats. Further, users may want to search for such elusive
objects with difficult-to-sketch interactions, e.g., numbat digging in the
ground. In such common but complex situations, users desire a search interface
that accepts composite multimodal queries comprising hand-drawn sketches of
difficult-to-name but easy-to-draw objects and text describing
difficult-to-sketch but easy-to-verbalize object attributes or interaction with
the scene. This novel problem statement distinctly differs from the previously
well-researched TBIR (text-based image retrieval) and SBIR (sketch-based image
retrieval) problems. To study this under-explored task, we curate a dataset,
CSTBIR (Composite Sketch+Text Based Image Retrieval), consisting of approx. 2M
queries and 108K natural scene images. Further, as a solution to this problem,
we propose a pretrained multimodal transformer-based baseline, STNET
(Sketch+Text Network), that uses a hand-drawn sketch to localize relevant
objects in the natural scene image, and encodes the text and image to perform
image retrieval. In addition to contrastive learning, we propose multiple
training objectives that improve the performance of our model. Extensive
experiments show that our proposed method outperforms several state-of-the-art
retrieval methods for text-only, sketch-only, and composite query modalities.
We make the dataset and code available at our project website.
Authors' comments: Accepted at AAAI 2024, 9 pages. Project Website:
https://vl2g.github.io/projects/cstbir
Ruobing Yao, Yifei Zhang, Shuang Song, Yuhua Liu, Neng Gao, Chenyang Tu
While Retrieval-Augmented Generation (RAG) systems enhance Large Language Models (LLMs) by incorporating external knowledge, they still face persistent challenges in retrieval inefficiency and the inability of LLMs to filter out irrelevant information. We present ParetoRAG, an unsupervised framework that optimizes RAG systems through sentence-level refinement guided by the Pareto principle. By decomposing paragraphs into sentences and dynamically re-weighting core content while preserving contextual coherence, ParetoRAG achieves dual improvements in both retrieval precision and generation quality without requiring additional training or API resources. This framework has been empirically validated across various datasets, LLMs, and retrievers.