Shuai Lyu, Zijing Tian, Zhonghong Ou, Yifan Zhu, Xiao Zhang, Qiankun Ha, Haoran Luo, Meina Song
Cross-modal retrieval maps data across different modalities via semantic
relevance. Existing approaches implicitly assume that data pairs are
well-aligned and ignore the widely existing annotation noise, i.e., noisy
correspondence (NC). Consequently, it inevitably causes performance
degradation. Despite attempts that employ the co-teaching paradigm with
identical architectures to provide distinct data perspectives, the differences
between these architectures stem primarily from random initialization.
Thus, the models become increasingly homogeneous over the training
process. Consequently, the additional information brought by this paradigm is
severely limited. In order to resolve this problem, we introduce a Tripartite
learning with Semantic Variation Consistency (TSVC) for robust image-text
retrieval. We design a tripartite cooperative learning mechanism comprising a
Coordinator, a Master, and an Assistant model. The Coordinator distributes
data, and the Assistant model supports the Master model's noisy label
prediction with diverse data. Moreover, we introduce a soft label estimation
method based on mutual information variation, which quantifies the noise in new
samples and assigns corresponding soft labels. We also present a new loss
function to enhance robustness and optimize training effectiveness. Extensive
experiments on three widely used datasets demonstrate that, even at increasing
noise ratios, TSVC exhibits significant advantages in retrieval accuracy and
maintains stable training performance.
Authors' comments: This paper has been accepted to the Main Track of AAAI 2025. It
contains 9 pages, 7 figures, and is relevant to the areas of cross-modal
retrieval and machine learning. The work presents a novel approach in robust
image-text retrieval using a tripartite learning framework.
Aarush Sinha, Viraj Virk, Dipshikha Chakraborty, P. S. Sreeja
Language models (LMs) are now playing an increasingly large role in information generation and synthesis; the representation of scientific knowledge in these systems needs to be highly accurate. A prime challenge is hallucination, that is, generating apparently plausible but actually false information, including invented citations and nonexistent research papers. This kind of inaccuracy is dangerous in all domains that require high levels of factual correctness, such as academia and education. This work presents a pipeline for evaluating how frequently language models hallucinate when generating responses about the scientific literature. We propose ArxEval, an evaluation pipeline with two tasks using ArXiv as a repository: Jumbled Titles and Mixed Titles. Our evaluation includes fifteen widely used language models and provides comparative insights into their reliability in handling scientific literature.
Vera Pavlova
This study examines the use of Natural Language Processing (NLP) technology within the Islamic domain, focusing on developing an Islamic neural retrieval model. By leveraging the robust XLM-R model, the research employs a language reduction technique to create a lightweight bilingual large language model (LLM). Our approach to domain adaptation addresses the unique challenges faced in the Islamic domain, where substantial in-domain corpora exist only in Arabic and are limited in other languages, including English. The work utilizes a multi-stage training process for retrieval models, incorporating large retrieval datasets, such as MS MARCO, and smaller, in-domain datasets to improve retrieval performance. Additionally, we have curated an in-domain retrieval dataset in English by employing data augmentation techniques and involving a reliable Islamic source. This approach enhances the domain-specific dataset for retrieval, leading to further performance gains. The findings suggest that combining domain adaptation and a multi-stage training method for the bilingual Islamic neural retrieval model enables it to outperform monolingual models on downstream retrieval tasks.
Zengyi Gao, Yukun Cao, Hairu Wang, Ao Ke, Yuan Feng, Xike Xie, S Kevin Zhou
To mitigate the hallucination and knowledge deficiency in large language models (LLMs), Knowledge Graph (KG)-based Retrieval-Augmented Generation (RAG) has shown promising potential by utilizing KGs as an external resource to enhance LLM reasoning. However, existing KG-RAG approaches struggle with a trade-off between flexibility and retrieval quality. Modular methods prioritize flexibility by avoiding the use of KG-fine-tuned models during retrieval, leading to fixed retrieval strategies and suboptimal retrieval quality. Conversely, coupled methods embed KG information within models to improve retrieval quality, but at the expense of flexibility. In this paper, we propose a novel flexible modular KG-RAG framework, termed FRAG, which synergizes the advantages of both approaches. FRAG estimates the hop range of reasoning paths based solely on the query and classifies it as either simple or complex. To match the complexity of the query, tailored pipelines are applied to ensure efficient and accurate reasoning path retrieval, thus fostering the final reasoning process. By using the query text instead of the KG to infer the structural information of reasoning paths and employing adaptable retrieval strategies, FRAG improves retrieval quality while maintaining flexibility. Moreover, FRAG does not require extra LLM fine-tuning or calls, significantly boosting efficiency and conserving resources. Extensive experiments show that FRAG achieves state-of-the-art performance with high efficiency and low resource consumption.
Wenfeng Feng, Chuzhan Hao, Yuewei Zhang, Jingyi Song, Hao Wang
Leveraging the autonomous decision-making capabilities of large language
models (LLMs) has demonstrated superior performance in reasoning tasks.
However, despite the success of iterative or recursive retrieval-augmented
generation (RAG) techniques, these methods are often constrained to a single
solution space when confronted with complex problems. In this paper, we propose
a novel thinking pattern in RAG that integrates system analysis with efficient
reasoning actions, significantly activating intrinsic reasoning capabilities
and expanding the solution space of specific tasks via Monte Carlo Tree Search
(MCTS), which we refer to as AirRAG. Specifically, our approach designs five
fundamental reasoning actions, which are expanded to a broad tree-based
reasoning space using MCTS. The approach also incorporates self-consistency
verification to explore potential reasoning paths and inference scaling laws.
Additionally, computationally optimal strategies are employed to allocate more
inference resources to key actions, thereby enhancing overall performance.
Experimental results demonstrate the effectiveness of AirRAG, showing
significant performance gains on complex question-answering datasets.
Furthermore, AirRAG is flexible and lightweight, making it easy to integrate
with other advanced technologies.
Authors' comments: 17 pages, 14 figures
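The abstract does not say which selection rule AirRAG's tree search uses. For orientation only, the UCT rule commonly used in MCTS picks, at a node $s$, the action that balances the current value estimate against an exploration bonus. The symbols below ($Q$, $N$, $c$) are standard MCTS notation assumed here, not taken from the paper:

```latex
a^{*} = \arg\max_{a} \left[ Q(s, a) + c \sqrt{\frac{\ln N(s)}{N(s, a)}} \right]
```

Here $Q(s, a)$ is the mean reward observed after taking action $a$ at $s$, $N(s)$ is the visit count of $s$, $N(s, a)$ is the number of times $a$ was tried at $s$, and $c > 0$ is an exploration constant.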
Reham Omar, Omij Mangukiya, Essam Mansour
Dialogue benchmarks are crucial in training and evaluating chatbots engaging
in domain-specific conversations. Knowledge graphs (KGs) represent semantically
rich and well-organized data spanning various domains, such as DBLP, DBpedia,
and YAGO. Traditionally, dialogue benchmarks have been manually created from
documents, neglecting the potential of KGs in automating this process. Some
question-answering benchmarks are automatically generated using extensive
preprocessing from KGs, but they do not support dialogue generation. This paper
introduces Chatty-Gen, a novel multi-stage retrieval-augmented generation
platform for automatically generating high-quality dialogue benchmarks tailored
to a specific domain using a KG. Chatty-Gen decomposes the generation process
into manageable stages and uses assertion rules for automatic validation
between stages. Our approach enables control over intermediate results to
prevent time-consuming restarts due to hallucinations. It also reduces reliance
on costly and more powerful commercial LLMs. Chatty-Gen eliminates upfront
processing of the entire KG using efficient query-based retrieval to find
representative subgraphs based on the dialogue context. Our experiments with
several real and large KGs demonstrate that Chatty-Gen significantly
outperforms state-of-the-art systems and ensures consistent model and system
performance across multiple LLMs of diverse capabilities, such as GPT-4o,
Gemini 1.5, Llama 3, and Mistral.
Authors' comments: The paper is published in SIGMOD 2025
Soham Roy, Mitul Goswami, Nisharg Nargund, Suneeta Mohanty, Prasant Kumar Pattnaik
This study introduces a system leveraging Large Language Models (LLMs) to extract text and enhance user interaction with PDF documents via a conversational interface. Utilizing Retrieval-Augmented Generation (RAG), the system provides informative responses to user inquiries while highlighting relevant passages within the PDF. Upon user upload, the system processes the PDF, employing sentence embeddings to create a document-specific vector store. This vector store enables efficient retrieval of pertinent sections in response to user queries. The LLM then engages in a conversational exchange, using the retrieved information to extract text and generate comprehensive, contextually aware answers. While our approach demonstrates competitive ROUGE values compared to existing state-of-the-art techniques for text extraction and summarization, we acknowledge that further qualitative evaluation is necessary to fully assess its effectiveness in real-world applications. The proposed system thus offers a valuable tool for researchers, students, and anyone seeking to efficiently extract knowledge and gain insights from documents through an intuitive question-answering interface.
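The retrieval step described above (embed passages, store the vectors, return the closest passages for a query) can be sketched as follows. This is a minimal illustration, not the authors' system: the bag-of-words `embed` function stands in for a real sentence-embedding model, and the function names and sample passages are hypothetical.

```python
import math
import re
from collections import Counter


def embed(text):
    """Toy stand-in for a sentence embedding: a bag-of-words count vector."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def build_store(passages):
    """Document-specific vector store: a list of (passage, vector) pairs."""
    return [(p, embed(p)) for p in passages]


def retrieve(store, query, k=1):
    """Return the k stored passages most similar to the query."""
    q = embed(query)
    ranked = sorted(store, key=lambda item: cosine(q, item[1]), reverse=True)
    return [p for p, _ in ranked[:k]]


# Hypothetical passages extracted from an uploaded PDF.
passages = [
    "The warranty covers battery defects for two years.",
    "Shipping takes five business days within Europe.",
    "Returns are accepted within thirty days of purchase.",
]
store = build_store(passages)
top = retrieve(store, "How long does shipping take?", k=1)
print(top)
```

In a full pipeline the retrieved passages would be passed to the LLM as context and highlighted in the PDF; a learned embedding model would also match paraphrases that exact word overlap misses.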
Demetrio Deanda, Yuktha Priya Masupalli, Jeong Yang, Young Lee, Zechun Cao, Gongbo Liang
Medical images and reports offer invaluable insights into patient health. The
heterogeneity and complexity of these data hinder effective analysis. To bridge
this gap, we investigate contrastive learning models for cross-domain
retrieval, which associates medical images with their corresponding clinical
reports. This study benchmarks the robustness of four state-of-the-art
contrastive learning models: CLIP, CXR-RePaiR, MedCLIP, and CXR-CLIP. We
introduce an occlusion retrieval task to evaluate model performance under
varying levels of image corruption. Our findings reveal that all evaluated
models are highly sensitive to out-of-distribution data, as evidenced by the
proportional decrease in performance with increasing occlusion levels. While
MedCLIP exhibits slightly more robustness, its overall performance remains
significantly behind CXR-CLIP and CXR-RePaiR. CLIP, trained on a
general-purpose dataset, struggles with medical image-report retrieval,
highlighting the importance of domain-specific training data. The evaluation of
this work suggests that more effort needs to be spent on improving the
robustness of these models. By addressing these limitations, we can develop
more reliable cross-domain retrieval models for medical applications.
Authors' comments: This work is accepted to AAAI 2025 Workshop -- the 9th International
Workshop on Health Intelligence
Berent Ånund Strømnes Lunde
Ensemble-based Data Assimilation faces significant challenges in
high-dimensional systems due to spurious correlations and ensemble collapse.
These issues arise from estimating dense dependencies with limited ensemble
sizes. This paper introduces the Ensemble Information Filter, which encodes
Markov properties directly into the statistical model's precision matrix,
leveraging structure from SPDE dynamics to constrain information to propagate
locally. EnIF eliminates the need for ad-hoc localisation, improving
statistical consistency and scalability. Numerical experiments demonstrate its
advantages in filtering, smoothing, and parameter estimation, making EnIF a
robust and efficient solution for large-scale data assimilation problems.
Authors' comments: 25 pages, 10 figures
Xingyan Bin, Jianfei Cui, Wujie Yan, Zhichen Zhao, Xintian Han, Chongyang Yan, Feng Zhang, Xun Zhou et al.
Retrievers, which form one of the most important recommendation stages, are responsible for efficiently selecting possible positive samples for the later stages under strict latency limitations. Because of this, large-scale systems always rely on approximate calculations and indexes to roughly shrink the candidate scale, with a simple ranking model. Considering that simple models lack the ability to produce precise predictions, most existing methods focus on incorporating complicated ranking models. However, another fundamental problem, index effectiveness, remains unresolved, and it also bottlenecks the adoption of more complicated models. In this paper, we propose a novel index structure, the streaming Vector Quantization (VQ) model, as a new generation of retrieval paradigm. Streaming VQ attaches items to indexes in real time, granting it immediacy. Moreover, through meticulous verification of possible variants, it achieves additional benefits like index balancing and reparability, enabling it to support complicated ranking models as existing approaches do. As a lightweight and implementation-friendly architecture, streaming VQ has been deployed and has replaced all major retrievers in Douyin and Douyin Lite, resulting in a remarkable user engagement gain.
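To make the idea of attaching items to indexes in real time concrete, here is a generic online vector quantization sketch. It is not the paper's architecture, which the abstract does not detail: the class name, the moving-average codebook update, and the learning rate `lr` are all assumptions chosen for illustration.

```python
def nearest_code(codebook, vec):
    """Index of the codebook vector closest to vec (squared Euclidean distance)."""
    def dist(code):
        return sum((c - v) ** 2 for c, v in zip(code, vec))
    return min(range(len(codebook)), key=lambda i: dist(codebook[i]))


class StreamingIndex:
    """Hypothetical VQ index that (re)assigns each item as soon as it is seen."""

    def __init__(self, codebook, lr=0.1):
        self.codebook = [list(c) for c in codebook]
        self.lr = lr  # assumed step size for the moving-average update
        self.item_to_code = {}
        self.buckets = {i: set() for i in range(len(codebook))}

    def upsert(self, item_id, vec):
        """Assign the item to its nearest code, then nudge that code toward it."""
        code = nearest_code(self.codebook, vec)
        old = self.item_to_code.get(item_id)
        if old is not None and old != code:
            self.buckets[old].discard(item_id)
        self.item_to_code[item_id] = code
        self.buckets[code].add(item_id)
        self.codebook[code] = [
            c + self.lr * (v - c) for c, v in zip(self.codebook[code], vec)
        ]
        return code

    def candidates(self, query_vec):
        """Items sharing the query's nearest code; a later stage would rank them."""
        return self.buckets[nearest_code(self.codebook, query_vec)]


index = StreamingIndex([[0.0, 0.0], [10.0, 10.0]])
index.upsert("a", [1.0, 0.0])
index.upsert("b", [9.0, 10.0])
index.upsert("a", [9.5, 9.5])  # item "a" drifts and is re-indexed immediately
print(index.candidates([10.0, 10.0]))
```

The point of the sketch is only the immediacy property: an item's index entry changes in the same call that observes its new embedding, with no periodic rebuild of the index.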
Laura Orphal-Kobin, Gregor Pieplow, Alok Gokhale, Kilian Unterguggenberger, Tim Schröder
In regimes of low signal strengths and therefore a small signal-to-noise
ratio, standard data analysis methods often fail to accurately estimate system
properties. We present a method based on Monte Carlo simulations to effectively
restore robust parameter estimates from large sets of undersampled data. This
approach is illustrated through the analysis of photoluminescence excitation
spectroscopy data for optical linewidth characterization of a nitrogen-vacancy
color center in diamond. We evaluate the quality of parameter prediction using
standard statistical data analysis methods, such as the median, and the Monte
Carlo method. Depending on the signal strength, we find that the median can be
precise (narrow confidence intervals) but very inaccurate. A detailed analysis
across a broad range of parameters allows us to identify the experimental
conditions under which the median provides a reliable predictor of the quantum
emitter's linewidth. We also explore machine learning to perform the same task,
forming a promising addition to the parameter estimation toolkit. Finally, the
developed method offers a broadly applicable tool for accurate parameter
prediction from low signal data, opening new experimental regimes previously
deemed inaccessible.
Authors' comments: Main part: 7 pages incl. references, 5 figures
Yifang Xu, Yunzhuo Sun, Benxiang Zhai, Ming Li, Wenxin Liang, Yang Li, Sidan Du
The target of video moment retrieval (VMR) is to predict temporal spans
within a video that semantically match a given linguistic query. Existing VMR
methods based on multimodal large language models (MLLMs) overly rely on
expensive high-quality datasets and time-consuming fine-tuning. Although some
recent studies introduce a zero-shot setting to avoid fine-tuning, they
overlook inherent language bias in the query, leading to erroneous
localization. To tackle the aforementioned challenges, this paper proposes
Moment-GPT, a tuning-free pipeline for zero-shot VMR utilizing frozen MLLMs.
Specifically, we first employ LLaMA-3 to correct and rephrase the query to
mitigate language bias. Subsequently, we design a span generator combined with
MiniGPT-v2 to produce candidate spans adaptively. Finally, to leverage the
video comprehension capabilities of MLLMs, we apply VideoChatGPT and a span
scorer to select the most appropriate spans. Our proposed method substantially
outperforms the state-of-the-art MLLM-based and zero-shot models on several
public datasets, including QVHighlights, ActivityNet-Captions, and
Charades-STA.
Authors' comments: Accepted by AAAI 2025
Yifu Qiu, Varun Embar, Yizhe Zhang, Navdeep Jaitly, Shay B. Cohen, Benjamin Han
Recent advancements in long-context language models (LCLMs) promise to transform Retrieval-Augmented Generation (RAG) by simplifying pipelines. With their expanded context windows, LCLMs can process entire knowledge bases and perform retrieval and reasoning directly -- a capability we define as In-Context Retrieval and Reasoning (ICR^2). However, existing benchmarks like LOFT often overestimate LCLM performance by providing overly simplified contexts. To address this, we introduce ICR^2, a benchmark that evaluates LCLMs in more realistic scenarios by including confounding passages retrieved with strong retrievers. We then propose three methods to enhance LCLM performance: (1) retrieve-then-generate fine-tuning, (2) retrieval-attention-probing, which uses attention heads to filter and de-noise long contexts during decoding, and (3) joint retrieval head training alongside the generation head. Our evaluation of five well-known LCLMs on LOFT and ICR^2 demonstrates significant gains with our best approach applied to Mistral-7B: +17 and +15 points by Exact Match on LOFT, and +13 and +2 points on ICR^2, compared to vanilla RAG and supervised fine-tuning, respectively. It even outperforms GPT-4-Turbo on most tasks despite being a much smaller model.
Donghwi Jung, Keonwoo Kim, Seong-Woo Kim
We propose GOTPR, a robust place recognition method designed for outdoor environments where GPS signals are unavailable. Unlike existing approaches that use point cloud maps, which are large and difficult to store, GOTPR leverages scene graphs generated from text descriptions and maps for place recognition. This method improves scalability by replacing point clouds with compact data structures, allowing robots to efficiently store and utilize extensive map data. In addition, GOTPR eliminates the need for custom map creation by using publicly available OpenStreetMap data, which provides global spatial information. We evaluated its performance using the KITTI360Pose dataset with corresponding OpenStreetMap data, comparing it to existing point cloud-based place recognition methods. The results show that GOTPR achieves comparable accuracy while significantly reducing storage requirements. In city-scale tests, it completed processing within a few seconds, making it highly practical for real-world robotics applications. More information can be found at https://donghwijung.github.io/GOTPR_page/.
Xinyang Zhou, Fanyue Wei, Lixin Duan, Wen Li
Given a textual query along with a corresponding video, moment retrieval aims to localize the moments relevant to the query within the video. While commendable results have been demonstrated by existing transformer-based approaches, predicting the accurate temporal span of the target moment is still a major challenge. In this paper, we reveal that a crucial reason stems from the spurious correlation between the text queries and the moment context. Namely, the model may associate the textual query with the background frames rather than the target moment. To address this issue, we propose a temporal dynamic learning approach for moment retrieval, where two strategies are designed to mitigate the spurious correlation. First, we introduce a novel video synthesis approach to construct a dynamic context for the relevant moment. With separate yet similar videos mixed up, the synthesis approach empowers our model to attend to the target moment of the corresponding query under various dynamic contexts. Second, we enhance the representation by learning temporal dynamics. Besides the visual representation, text queries are aligned with temporal dynamic representations, which enables our model to establish a non-spurious correlation between the query-related moment and context. With the proposed method, the spurious correlation issue in moment retrieval can be largely alleviated. Our method establishes a new state-of-the-art performance on two popular benchmarks of moment retrieval, i.e., QVHighlights and Charades-STA. In addition, the detailed ablation analyses demonstrate the effectiveness of the proposed strategies. Our code will be publicly available.
Yuxin Fan, Yuxiang Wang, Lipeng Liu, Xirui Tang, Na Sun, Zidong Yu
In the contemporary context of rapid advancements in information technology and the exponential growth of data volume, language models are confronted with significant challenges in effectively navigating the dynamic and ever-evolving information landscape to update and adapt to novel knowledge in real time. In this work, an online update method is proposed, which is based on the existing Retrieval-Augmented Generation (RAG) model with multiple innovation mechanisms. Firstly, a dynamic memory is used to capture emerging data samples, which are then gradually integrated into the core model through a tunable knowledge distillation strategy. At the same time, hierarchical indexing and a multi-layer gating mechanism are introduced into the retrieval module to ensure that the retrieved content is more targeted and accurate. Finally, a multi-stage network structure is established for different types of inputs in the generation stage, and cross-attention matching and screening are carried out on the intermediate representations of each stage to ensure the effective integration and iterative update of new and old knowledge. Experimental results show that the proposed method outperforms existing mainstream comparison models in terms of knowledge retention and inference accuracy.
Suchana Datta, Dwaipayan Roy, Derek Greene, Gerardine Meaney
In English literature, the 19th century witnessed a significant transition in
styles, themes, and genres. Consequently, the novels from this period display
remarkable diversity. This paper explores these variations by examining the
evolution of term usage in 19th century English novels through the lens of
information retrieval. By applying a query expansion-based approach to a
decade-segmented collection of fiction from the British Library, we examine how
related terms vary over time. Our analysis employs multiple standard metrics
including Kendall's tau, Jaccard similarity, and Jensen-Shannon divergence to
assess overlaps and shifts in expanded query term sets. Our results indicate a
significant degree of divergence in the related terms across decades as
selected by the query expansion technique, suggesting substantial linguistic
and conceptual changes throughout the 19th century novels.
Authors' comments: Accepted at JCDL 2024
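The three metrics named in this abstract can be illustrated on two small sets of expansion terms. The sketch below is a plain-Python illustration, not the authors' code; the per-decade term lists are invented, and Kendall's tau is computed in its tau-a form over the terms the two rankings share, which is one of several possible conventions.

```python
import math
from itertools import combinations


def jaccard(a, b):
    """Jaccard similarity of two term sets."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if a | b else 1.0


def kendall_tau(rank_a, rank_b):
    """Kendall's tau-a over terms present in both rankings (no ties)."""
    common = [t for t in rank_a if t in rank_b]
    if len(common) < 2:
        return 0.0
    pos_b = {t: i for i, t in enumerate(rank_b)}
    concordant = discordant = 0
    for x, y in combinations(common, 2):  # x precedes y in rank_a
        if pos_b[x] < pos_b[y]:
            concordant += 1
        else:
            discordant += 1
    return (concordant - discordant) / (concordant + discordant)


def js_divergence(p, q):
    """Jensen-Shannon divergence (log base 2) between two term-weight dicts."""
    sp, sq = sum(p.values()), sum(q.values())
    jsd = 0.0
    for t in set(p) | set(q):
        pi, qi = p.get(t, 0) / sp, q.get(t, 0) / sq
        mi = (pi + qi) / 2
        if pi > 0:
            jsd += 0.5 * pi * math.log2(pi / mi)
        if qi > 0:
            jsd += 0.5 * qi * math.log2(qi / mi)
    return jsd


# Invented expansion terms for one query in two decades, ranked by weight.
terms_1830s = ["carriage", "horse", "coach", "road"]
terms_1890s = ["carriage", "railway", "horse", "train"]

print(jaccard(terms_1830s, terms_1890s))
print(kendall_tau(terms_1830s, terms_1890s))
print(js_divergence({t: 1 for t in terms_1830s}, {t: 1 for t in terms_1890s}))
```

Jaccard captures set overlap only, Kendall's tau captures agreement in ordering among shared terms, and Jensen-Shannon divergence compares the weight distributions, so the three can disagree on how much a term set has shifted.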
Peizhuo Lv, Mengjie Sun, Hao Wang, Xiaofeng Wang, Shengzhi Zhang, Yuxuan Chen, Kai Chen, Limin Sun
In recent years, tremendous success has been witnessed in Retrieval-Augmented Generation (RAG), widely used to enhance Large Language Models (LLMs) in domain-specific, knowledge-intensive, and privacy-sensitive tasks. However, attackers may steal those valuable RAGs and deploy or commercialize them, making it essential to detect Intellectual Property (IP) infringement. Most existing ownership protection solutions, such as watermarks, are designed for relational databases and texts. They cannot be directly applied to RAGs because relational database watermarks require white-box access to detect IP infringement, which is unrealistic for the knowledge base in RAGs. Meanwhile, post-processing by the adversary's deployed LLMs typically destroys text watermark information. To address those problems, we propose a novel black-box "knowledge watermark" approach, named RAG-WM, to detect IP infringement of RAGs. RAG-WM uses a multi-LLM interaction framework, comprising a Watermark Generator, Shadow LLM & RAG, and Watermark Discriminator, to create watermark texts based on watermark entity-relationship tuples and inject them into the target RAG. We evaluate RAG-WM across three domain-specific and two privacy-sensitive tasks on four benchmark LLMs. Experimental results show that RAG-WM effectively detects the stolen RAGs in various deployed LLMs. Furthermore, RAG-WM is robust against paraphrasing, unrelated content removal, knowledge insertion, and knowledge expansion attacks. Lastly, RAG-WM can also evade watermark detection approaches, highlighting its promising application in detecting IP infringement of RAG systems.
Kevin Bönisch, Alexander Mehler
We introduce a retrieval approach leveraging Support Vector Regression (SVR) ensembles, bootstrap aggregation (bagging), and embedding spaces on the German Dataset for Legal Information Retrieval (GerDaLIR). By conceptualizing the retrieval task in terms of multiple binary needle-in-a-haystack subtasks, we show improved recall over the baselines (0.849 > 0.803 | 0.829) using our voting ensemble, suggesting promising initial results, without training or fine-tuning any deep learning models. Our approach holds potential for further enhancement, particularly through refining the encoding models and optimizing hyperparameters.
Ofir Marom
Case-based reasoning (CBR) is an experience-based approach to problem
solving, where a repository of solved cases is adapted to solve new cases.
Recent research shows that Large Language Models (LLMs) with
Retrieval-Augmented Generation (RAG) can support the Retrieve and Reuse stages
of the CBR pipeline by retrieving similar cases and using them as additional
context to an LLM query. Most studies have focused on text-only applications;
however, in many real-world problems the components of a case are multimodal.
In this paper we present MCBR-RAG, a general RAG framework for multimodal CBR
applications. The MCBR-RAG framework converts non-text case components into
text-based representations, allowing it to: 1) learn application-specific
latent representations that can be indexed for retrieval, and 2) enrich the
query provided to the LLM by incorporating all case components for better
context. We demonstrate MCBR-RAG's effectiveness through experiments conducted
on a simplified Math-24 application and a more complex Backgammon application.
Our empirical results show that MCBR-RAG improves generation quality compared
to a baseline LLM with no contextual information provided.
Authors' comments: 15 pages, 7 figures