Soham Roy, Mitul Goswami, Nisharg Nargund, Suneeta Mohanty, Prasant Kumar Pattnaik
This study introduces a system that uses Large Language Models (LLMs) to extract text and enhance user interaction with PDF documents via a conversational interface. Using Retrieval-Augmented Generation (RAG), the system answers user inquiries while highlighting the relevant passages within the PDF. Upon user upload, the system processes the PDF, employing sentence embeddings to create a document-specific vector store. This vector store enables efficient retrieval of pertinent sections in response to user queries. The LLM then engages in a conversational exchange, using the retrieved information to extract text and generate comprehensive, contextually aware answers. The proposed system achieves ROUGE values competitive with existing state-of-the-art techniques for text extraction and summarization, although we acknowledge that further qualitative evaluation is necessary to fully assess its effectiveness in real-world applications. It thus offers a valuable tool for researchers, students, and anyone seeking to efficiently extract knowledge and gain insights from documents through an intuitive question-answering interface.
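The retrieval step described in this abstract can be illustrated with a minimal sketch. It is not the authors' implementation: a lowercase bag-of-words count vector stands in for a real sentence embedding, a plain list stands in for the vector store, and the function names and sample passages are hypothetical.

```python
import math
import re
from collections import Counter


def embed(text):
    """Toy stand-in for a sentence embedding: lowercase bag-of-words counts."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def build_store(passages):
    """Document-specific 'vector store': each passage paired with its vector."""
    return [(p, embed(p)) for p in passages]


def retrieve(store, query, k=2):
    """Return the k passages most similar to the query."""
    q = embed(query)
    ranked = sorted(store, key=lambda item: cosine(q, item[1]), reverse=True)
    return [p for p, _ in ranked[:k]]


# Hypothetical passages standing in for text extracted from an uploaded PDF.
passages = [
    "The vector store holds one embedding per passage of the uploaded PDF.",
    "ROUGE compares generated summaries against reference summaries.",
    "Relevant passages are highlighted in the PDF viewer.",
]
store = build_store(passages)
top = retrieve(store, "Which passages are highlighted in the PDF?", k=1)
```

In a full RAG pipeline the retrieved passages would then be placed in the LLM prompt as context; that generation step is omitted here.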
Demetrio Deanda, Yuktha Priya Masupalli, Jeong Yang, Young Lee, Zechun Cao, Gongbo Liang
Medical images and reports offer invaluable insights into patient health. The
heterogeneity and complexity of these data hinder effective analysis. To bridge
this gap, we investigate contrastive learning models for cross-domain
retrieval, which associates medical images with their corresponding clinical
reports. This study benchmarks the robustness of four state-of-the-art
contrastive learning models: CLIP, CXR-RePaiR, MedCLIP, and CXR-CLIP. We
introduce an occlusion retrieval task to evaluate model performance under
varying levels of image corruption. Our findings reveal that all evaluated
models are highly sensitive to out-of-distribution data, as evidenced by the
proportional decrease in performance with increasing occlusion levels. While
MedCLIP exhibits slightly more robustness, its overall performance remains
significantly behind CXR-CLIP and CXR-RePaiR. CLIP, trained on a
general-purpose dataset, struggles with medical image-report retrieval,
highlighting the importance of domain-specific training data. The evaluation of
this work suggests that more effort needs to be spent on improving the
robustness of these models. By addressing these limitations, we can develop
more reliable cross-domain retrieval models for medical applications.
Authors' comments: This work is accepted to AAAI 2025 Workshop -- the 9th International
Workshop on Health Intelligence
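As an illustration of what an occlusion level could look like in practice, the sketch below blanks a centered rectangular patch covering roughly a given fraction of an image. The abstract does not specify the patch shape, position, or fill value, so all three are assumptions here, and the function name is hypothetical.

```python
def occlude(image, fraction, fill=0):
    """Return a copy of a 2D image with a centered patch set to `fill`.

    `fraction` is the approximate share of the image area to cover (0.0 to 1.0).
    """
    if not 0.0 <= fraction <= 1.0:
        raise ValueError("fraction must be between 0 and 1")
    height, width = len(image), len(image[0])
    # Scale both sides by sqrt(fraction) so the patch area is about `fraction` of the image.
    patch_h = round(height * fraction ** 0.5)
    patch_w = round(width * fraction ** 0.5)
    top = (height - patch_h) // 2
    left = (width - patch_w) // 2
    out = [row[:] for row in image]  # copy so the input stays untouched
    for r in range(top, top + patch_h):
        for c in range(left, left + patch_w):
            out[r][c] = fill
    return out


# Toy 8x8 "image" of ones, occluded at three illustrative levels.
image = [[1] * 8 for _ in range(8)]
levels = [0.0, 0.25, 0.5]
occluded = {lvl: occlude(image, lvl) for lvl in levels}
```

A robustness benchmark would then run the same image-to-report retrieval on each occluded copy and compare the scores across levels.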
Berent Ånund Strømnes Lunde
Ensemble-based Data Assimilation faces significant challenges in
high-dimensional systems due to spurious correlations and ensemble collapse.
These issues arise from estimating dense dependencies with limited ensemble
sizes. This paper introduces the Ensemble Information Filter, which encodes
Markov properties directly into the statistical model's precision matrix,
leveraging structure from SPDE dynamics to constrain information to propagate
locally. EnIF eliminates the need for ad-hoc localisation, improving
statistical consistency and scalability. Numerical experiments demonstrate its
advantages in filtering, smoothing, and parameter estimation, making EnIF a
robust and efficient solution for large-scale data assimilation problems.
Authors' comments: 25 pages, 10 figures
Xingyan Bin, Jianfei Cui, Wujie Yan, Zhichen Zhao, Xintian Han, Chongyang Yan, Feng Zhang, Xun Zhou et al.
Retrievers, which form one of the most important recommendation stages, are responsible for efficiently selecting possible positive samples for the later stages under strict latency limitations. Because of this, large-scale systems always rely on approximate calculations and indexes to roughly shrink the candidate scale, with a simple ranking model. Considering that simple models lack the ability to produce precise predictions, most existing methods focus on incorporating complicated ranking models. However, another fundamental problem, index effectiveness, remains unresolved, and it also bottlenecks the use of more complicated models. In this paper, we propose a novel index structure, the streaming Vector Quantization (VQ) model, as a new generation of retrieval paradigm. Streaming VQ attaches items to indexes in real time, granting it immediacy. Moreover, through meticulous verification of possible variants, it achieves additional benefits like index balancing and reparability, enabling it to support complicated ranking models as existing approaches do. As a lightweight and implementation-friendly architecture, streaming VQ has been deployed in Douyin and Douyin Lite, where it has replaced all major retrievers, resulting in a remarkable user engagement gain.
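To make the idea of attaching items to index entries as they arrive more concrete, here is a generic online vector quantization sketch. It is not the streaming VQ model from the paper: the nearest-codeword rule, the moving-average update, the learning rate, and the item IDs are all assumptions chosen for illustration.

```python
def nearest(codebook, vec):
    """Index of the codeword closest to `vec` (squared Euclidean distance)."""
    dists = [sum((c - v) ** 2 for c, v in zip(code, vec)) for code in codebook]
    return dists.index(min(dists))


def assign_and_update(codebook, index, item_id, vec, lr=0.1):
    """Attach an arriving item to its nearest codeword, then nudge that codeword toward it."""
    k = nearest(codebook, vec)
    index[item_id] = k  # overwrites any earlier assignment for this item
    codebook[k] = [c + lr * (v - c) for c, v in zip(codebook[k], vec)]
    return k


# Two codewords in a 2-D embedding space; items arrive one at a time.
codebook = [[0.0, 0.0], [1.0, 1.0]]
index = {}
assign_and_update(codebook, index, "video_a", [0.1, 0.0])
assign_and_update(codebook, index, "video_b", [0.9, 1.2])
```

Because each item is indexed the moment it is seen, no periodic offline rebuild is needed in this toy version; index balancing and the ranking model discussed in the abstract are out of scope for the sketch.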
Laura Orphal-Kobin, Gregor Pieplow, Alok Gokhale, Kilian Unterguggenberger, Tim Schröder
In regimes of low signal strengths and therefore a small signal-to-noise
ratio, standard data analysis methods often fail to accurately estimate system
properties. We present a method based on Monte Carlo simulations to effectively
restore robust parameter estimates from large sets of undersampled data. This
approach is illustrated through the analysis of photoluminescence excitation
spectroscopy data for optical linewidth characterization of a nitrogen-vacancy
color center in diamond. We evaluate the quality of parameter prediction using
standard statistical data analysis methods, such as the median, and the Monte
Carlo method. Depending on the signal strength, we find that the median can be
precise (narrow confidence intervals) but very inaccurate. A detailed analysis
across a broad range of parameters allows us to identify the experimental
conditions under which the median provides a reliable predictor of the quantum
emitter's linewidth. We also explore machine learning to perform the same task,
forming a promising addition to the parameter estimation toolkit. Finally, the
developed method offers a broadly applicable tool for accurate parameter
prediction from low signal data, opening new experimental regimes previously
deemed inaccessible.
Authors' comments: Main part: 7 pages incl. references, 5 figures
Yifang Xu, Yunzhuo Sun, Benxiang Zhai, Ming Li, Wenxin Liang, Yang Li, Sidan Du
The goal of video moment retrieval (VMR) is to predict temporal spans
within a video that semantically match a given linguistic query. Existing VMR
methods based on multimodal large language models (MLLMs) overly rely on
expensive high-quality datasets and time-consuming fine-tuning. Although some
recent studies introduce a zero-shot setting to avoid fine-tuning, they
overlook inherent language bias in the query, leading to erroneous
localization. To tackle the aforementioned challenges, this paper proposes
Moment-GPT, a tuning-free pipeline for zero-shot VMR utilizing frozen MLLMs.
Specifically, we first employ LLaMA-3 to correct and rephrase the query to
mitigate language bias. Subsequently, we design a span generator combined with
MiniGPT-v2 to produce candidate spans adaptively. Finally, to leverage the
video comprehension capabilities of MLLMs, we apply VideoChatGPT and span
scorer to select the most appropriate spans. Our proposed method substantially
outperforms the state-of-the-art MLLM-based and zero-shot models on several
public datasets, including QVHighlights, ActivityNet-Captions, and
Charades-STA.
Authors' comments: Accepted by AAAI 2025
Yifu Qiu, Varun Embar, Yizhe Zhang, Navdeep Jaitly, Shay B. Cohen, Benjamin Han
Recent advancements in long-context language models (LCLMs) promise to transform Retrieval-Augmented Generation (RAG) by simplifying pipelines. With their expanded context windows, LCLMs can process entire knowledge bases and perform retrieval and reasoning directly -- a capability we define as In-Context Retrieval and Reasoning (ICR^2). However, existing benchmarks like LOFT often overestimate LCLM performance by providing overly simplified contexts. To address this, we introduce ICR^2, a benchmark that evaluates LCLMs in more realistic scenarios by including confounding passages retrieved with strong retrievers. We then propose three methods to enhance LCLM performance: (1) retrieve-then-generate fine-tuning, (2) retrieval-attention-probing, which uses attention heads to filter and de-noise long contexts during decoding, and (3) joint retrieval head training alongside the generation head. Our evaluation of five well-known LCLMs on LOFT and ICR^2 demonstrates significant gains with our best approach applied to Mistral-7B: +17 and +15 points by Exact Match on LOFT, and +13 and +2 points on ICR^2, compared to vanilla RAG and supervised fine-tuning, respectively. It even outperforms GPT-4-Turbo on most tasks despite being a much smaller model.
Donghwi Jung, Keonwoo Kim, Seong-Woo Kim
We propose GOTPR, a robust place recognition method designed for outdoor environments where GPS signals are unavailable. Unlike existing approaches that use point cloud maps, which are large and difficult to store, GOTPR leverages scene graphs generated from text descriptions and maps for place recognition. This method improves scalability by replacing point clouds with compact data structures, allowing robots to efficiently store and utilize extensive map data. In addition, GOTPR eliminates the need for custom map creation by using publicly available OpenStreetMap data, which provides global spatial information. We evaluated its performance using the KITTI360Pose dataset with corresponding OpenStreetMap data, comparing it to existing point cloud-based place recognition methods. The results show that GOTPR achieves comparable accuracy while significantly reducing storage requirements. In city-scale tests, it completed processing within a few seconds, making it highly practical for real-world robotics applications. More information can be found at https://donghwijung.github.io/GOTPR_page/.
Xinyang Zhou, Fanyue Wei, Lixin Duan, Wen Li
Given a textual query along with a corresponding video, moment retrieval aims to localize the moments relevant to the query within the video. While commendable results have been demonstrated by existing transformer-based approaches, predicting the accurate temporal span of the target moment is still a major challenge. In this paper, we reveal that a crucial reason is the spurious correlation between the text queries and the moment context. Namely, the model may associate the textual query with the background frames rather than the target moment. To address this issue, we propose a temporal dynamic learning approach for moment retrieval, where two strategies are designed to mitigate the spurious correlation. First, we introduce a novel video synthesis approach to construct a dynamic context for the relevant moment. With separate yet similar videos mixed up, the synthesis approach empowers our model to attend to the target moment of the corresponding query under various dynamic contexts. Second, we enhance the representation by learning temporal dynamics. Besides the visual representation, text queries are aligned with temporal dynamic representations, which enables our model to establish a non-spurious correlation between the query-related moment and context. With the proposed method, the spurious correlation issue in moment retrieval can be largely alleviated. Our method establishes a new state-of-the-art performance on two popular benchmarks of moment retrieval, i.e., QVHighlights and Charades-STA. In addition, the detailed ablation analyses demonstrate the effectiveness of the proposed strategies. Our code will be publicly available.
Yuxin Fan, Yuxiang Wang, Lipeng Liu, Xirui Tang, Na Sun, Zidong Yu
In the contemporary context of rapid advancements in information technology and the exponential growth of data volume, language models are confronted with significant challenges in effectively navigating the dynamic and ever-evolving information landscape to update and adapt to novel knowledge in real time. In this work, an online update method is proposed, which is based on the existing Retrieval-Augmented Generation (RAG) model with multiple innovation mechanisms. First, a dynamic memory is used to capture emerging data samples, which are then gradually integrated into the core model through a tunable knowledge distillation strategy. At the same time, hierarchical indexing and a multi-layer gating mechanism are introduced into the retrieval module to ensure that the retrieved content is more targeted and accurate. Finally, a multi-stage network structure is established for different types of inputs in the generation stage, and cross-attention matching and screening are carried out on the intermediate representations of each stage to ensure the effective integration and iterative update of new and old knowledge. Experimental results show that the proposed method is better than the existing mainstream comparison models in terms of knowledge retention and inference accuracy.
Suchana Datta, Dwaipayan Roy, Derek Greene, Gerardine Meaney
In English literature, the 19th century witnessed a significant transition in
styles, themes, and genres. Consequently, the novels from this period display
remarkable diversity. This paper explores these variations by examining the
evolution of term usage in 19th century English novels through the lens of
information retrieval. By applying a query expansion-based approach to a
decade-segmented collection of fiction from the British Library, we examine how
related terms vary over time. Our analysis employs multiple standard metrics
including Kendall's tau, Jaccard similarity, and Jensen-Shannon divergence to
assess overlaps and shifts in expanded query term sets. Our results indicate a
significant degree of divergence in the related terms across decades as
selected by the query expansion technique, suggesting substantial linguistic
and conceptual changes throughout the 19th century novels.
Authors' comments: Accepted at JCDL 2024
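The three metrics named in this abstract can be sketched in a few lines using their standard definitions. The example term lists are hypothetical and do not come from the paper, and the Kendall's tau shown is the simple tau-a variant restricted to terms that appear in both rankings, which is an assumption about how partially overlapping lists are handled.

```python
import math


def jaccard(a, b):
    """Size of the intersection over size of the union of two term sets."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if a | b else 1.0


def js_divergence(p, q):
    """Jensen-Shannon divergence (base 2) between two term-weight dicts."""
    terms = set(p) | set(q)
    p_total, q_total = sum(p.values()), sum(q.values())
    jsd = 0.0
    for t in terms:
        pi = p.get(t, 0.0) / p_total
        qi = q.get(t, 0.0) / q_total
        mi = 0.5 * (pi + qi)  # mixture distribution
        if pi > 0:
            jsd += 0.5 * pi * math.log2(pi / mi)
        if qi > 0:
            jsd += 0.5 * qi * math.log2(qi / mi)
    return jsd


def kendall_tau(rank_a, rank_b):
    """Kendall's tau-a over the terms both rankings share (no ties assumed)."""
    pos_b = {t: i for i, t in enumerate(rank_b)}
    shared = [t for t in rank_a if t in pos_b]
    n = len(shared)
    if n < 2:
        return 0.0
    concordant = discordant = 0
    for i in range(n):
        for j in range(i + 1, n):
            # `shared` follows rank_a order, so a pair is concordant if rank_b agrees.
            if pos_b[shared[i]] < pos_b[shared[j]]:
                concordant += 1
            else:
                discordant += 1
    return (concordant - discordant) / (n * (n - 1) / 2)


# Hypothetical expansion terms for the same query in two different decades.
terms_early = ["carriage", "horse", "coach", "road"]
terms_late = ["carriage", "railway", "horse", "station"]
overlap = jaccard(terms_early, terms_late)
agreement = kendall_tau(terms_early, terms_late)
```

Jaccard captures how many terms the two decades share, Kendall's tau captures whether the shared terms keep their relative order, and Jensen-Shannon divergence compares the full weight distributions.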
Peizhuo Lv, Mengjie Sun, Hao Wang, Xiaofeng Wang, Shengzhi Zhang, Yuxuan Chen, Kai Chen, Limin Sun
In recent years, tremendous success has been witnessed in Retrieval-Augmented Generation (RAG), widely used to enhance Large Language Models (LLMs) in domain-specific, knowledge-intensive, and privacy-sensitive tasks. However, attackers may steal those valuable RAGs and deploy or commercialize them, making it essential to detect Intellectual Property (IP) infringement. Most existing ownership protection solutions, such as watermarks, are designed for relational databases and texts. They cannot be directly applied to RAGs because relational database watermarks require white-box access to detect IP infringement, which is unrealistic for the knowledge base in RAGs. Meanwhile, post-processing by the adversary's deployed LLMs typically destroys text watermark information. To address those problems, we propose a novel black-box "knowledge watermark" approach, named RAG-WM, to detect IP infringement of RAGs. RAG-WM uses a multi-LLM interaction framework, comprising a Watermark Generator, Shadow LLM & RAG, and Watermark Discriminator, to create watermark texts based on watermark entity-relationship tuples and inject them into the target RAG. We evaluate RAG-WM across three domain-specific and two privacy-sensitive tasks on four benchmark LLMs. Experimental results show that RAG-WM effectively detects the stolen RAGs in various deployed LLMs. Furthermore, RAG-WM is robust against paraphrasing, unrelated content removal, knowledge insertion, and knowledge expansion attacks. Lastly, RAG-WM can also evade watermark detection approaches, highlighting its promising application in detecting IP infringement of RAG systems.
Kevin Bönisch, Alexander Mehler
We introduce a retrieval approach leveraging Support Vector Regression (SVR) ensembles, bootstrap aggregation (bagging), and embedding spaces on the German Dataset for Legal Information Retrieval (GerDaLIR). By conceptualizing the retrieval task in terms of multiple binary needle-in-a-haystack subtasks, we show improved recall over the baselines (0.849 > 0.803 | 0.829) using our voting ensemble, suggesting promising initial results, without training or fine-tuning any deep learning models. Our approach holds potential for further enhancement, particularly through refining the encoding models and optimizing hyperparameters.
Ofir Marom
Case-based reasoning (CBR) is an experience-based approach to problem
solving, where a repository of solved cases is adapted to solve new cases.
Recent research shows that Large Language Models (LLMs) with
Retrieval-Augmented Generation (RAG) can support the Retrieve and Reuse stages
of the CBR pipeline by retrieving similar cases and using them as additional
context to an LLM query. Most studies have focused on text-only applications;
however, in many real-world problems the components of a case are multimodal.
In this paper we present MCBR-RAG, a general RAG framework for multimodal CBR
applications. The MCBR-RAG framework converts non-text case components into
text-based representations, allowing it to: 1) learn application-specific
latent representations that can be indexed for retrieval, and 2) enrich the
query provided to the LLM by incorporating all case components for better
context. We demonstrate MCBR-RAG's effectiveness through experiments conducted
on a simplified Math-24 application and a more complex Backgammon application.
Our empirical results show that MCBR-RAG improves generation quality compared
to a baseline LLM with no contextual information provided.
Authors' comments: 15 pages, 7 figures
Patrice Béchard, Orlando Marquez Ayala
Retrieval-Augmented Generation (RAG) has become ubiquitous when deploying
Large Language Models (LLMs), as it can address typical limitations such as
generating hallucinated or outdated information. However, when building
real-world RAG applications, practical issues arise. First, the retrieved
information is generally domain-specific. Since it is computationally expensive
to fine-tune LLMs, it is more feasible to fine-tune the retriever to improve
the quality of the data included in the LLM input. Second, as more applications
are deployed in the same real-world system, one cannot afford to deploy
separate retrievers. Moreover, these RAG applications normally retrieve
different kinds of data. Our solution is to instruction fine-tune a small
retriever encoder on a variety of domain-specific tasks to allow us to deploy
one encoder that can serve many use cases, thereby achieving low cost,
scalability, and speed. We show how this encoder generalizes to out-of-domain
settings as well as to an unseen retrieval task on real-world enterprise use
cases.
Authors' comments: 9 pages, 2 figures. Submitted to NAACL 2025 Industry Track
Charles Corbière, Simon Roburin, Syrielle Montariol, Antoine Bosselut, Alexandre Alahi
While chain-of-thought (CoT) prompting improves reasoning in large language
models, its effectiveness in vision-language models (VLMs) remains limited due
to over-reliance on textual cues and memorized knowledge. To investigate the
visual reasoning capabilities of VLMs in complex real-world scenarios, we
introduce DrivingVQA, a visual question answering dataset derived from driving
theory exams, which contains 3,931 multiple-choice problems with expert-written
explanations and grounded entities relevant to the reasoning process.
Leveraging this dataset, we propose RIV-CoT, a Retrieval-Based Interleaved
Visual Chain-of-Thought method that enables VLMs to reason using visual crops
corresponding to these relevant entities. Our experiments demonstrate that
RIV-CoT improves answer accuracy by 3.1% and reasoning accuracy by 4.6% over
vanilla CoT prompting. Furthermore, we demonstrate that our method effectively
scales to the larger A-OKVQA reasoning dataset by leveraging automatically
generated pseudo-labels, outperforming CoT prompting.
Authors' comments: Project page: https://vita-epfl.github.io/DrivingVQA
Yannis Katsis, Sara Rosenthal, Kshitij Fadnis, Chulaka Gunasekara, Young-Suk Lee, Lucian Popa, Vraj Shah, Huaiyu Zhu et al.
Retrieval-augmented generation (RAG) has recently become a very popular task for Large Language Models (LLMs). Evaluating them on multi-turn RAG conversations, where the system is asked to generate a response to a question in the context of a preceding conversation, is an important and often overlooked task with several additional challenges. We present MTRAG: an end-to-end human-generated multi-turn RAG benchmark that reflects several real-world properties across diverse dimensions for evaluating the full RAG pipeline. MTRAG contains 110 conversations averaging 7.7 turns each across four domains for a total of 842 tasks. We also explore automation paths via synthetic data and LLM-as-a-Judge evaluation. Our human and automatic evaluations show that even state-of-the-art LLM RAG systems struggle on MTRAG. We demonstrate the need for strong retrieval and generation systems that can handle later turns, unanswerable questions, non-standalone questions, and multiple domains. MTRAG is available at https://github.com/ibm/mt-rag-benchmark.
Wen-Dong Jiang, Chih-Yung Chang, Diptendu Sinha Roy
Recently, violence detection systems developed using unified multimodal
models have achieved significant success and attracted widespread attention.
However, most of these systems face two critical challenges: the lack of
interpretability as black-box models and limited functionality, offering only
classification or retrieval capabilities. To address these challenges, this
paper proposes a novel interpretable violence detection system, termed the
Three-in-One (TIO) System. The TIO system integrates knowledge graphs (KG) and
graph attention networks (GAT) to provide three core functionalities:
detection, retrieval, and explanation. Specifically, the system processes each
video frame along with text descriptions generated by a large language model
(LLM) for videos containing potential violent behavior. It employs ImageBind to
generate high-dimensional embeddings for constructing a knowledge graph, uses
GAT for reasoning, and applies lightweight time series modules to extract video
embedding features. The final step connects a classifier and retriever for
multi-functional outputs. The interpretability of KG enables the system to
verify the reasoning process behind each output. Additionally, the paper
introduces several lightweight methods to reduce the resource consumption of
the TIO system and enhance its efficiency. Extensive experiments conducted on
the XD-Violence and UCF-Crime datasets validate the effectiveness of the
proposed system. A case study further reveals an intriguing phenomenon: as the
number of bystanders increases, the occurrence of violent behavior tends to
decrease.
Authors' comments: This work has been submitted to the IEEE for possible publication
Yindu Su, Huike Zou, Lin Sun, Ting Zhang, Haiyang Yang, Liyu Chen, David Lo, Qingheng Zhang et al.
Product Attribute Value Identification (PAVI) involves identifying attribute values from product profiles, a key task for improving product search, recommendations, and business analytics on e-commerce platforms. However, existing PAVI methods face critical challenges, such as inferring implicit values, handling out-of-distribution (OOD) values, and producing normalized outputs. To address these limitations, we introduce Taxonomy-Aware Contrastive Learning Retrieval (TACLR), the first retrieval-based method for PAVI. TACLR formulates PAVI as an information retrieval task by encoding product profiles and candidate values into embeddings and retrieving values based on their similarity to the item embedding. It leverages contrastive training with taxonomy-aware hard negative sampling and employs adaptive inference with dynamic thresholds. TACLR offers three key advantages: (1) it effectively handles implicit and OOD values while producing normalized outputs; (2) it scales to thousands of categories, tens of thousands of attributes, and millions of values; and (3) it supports efficient inference for high-load industrial scenarios. Extensive experiments on proprietary and public datasets validate the effectiveness and efficiency of TACLR. Moreover, it has been successfully deployed in a real-world e-commerce platform, processing millions of product listings daily while supporting dynamic, large-scale attribute taxonomies.
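The retrieval formulation in this abstract can be sketched as a nearest-candidate lookup with a similarity cutoff. This is only an illustration under stated assumptions: the embeddings are hand-written 2-D vectors rather than encoder outputs, the threshold is a fixed number passed in by the caller (the abstract does not say how the dynamic thresholds are derived), and the function names are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two dense vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def retrieve_value(item_vec, candidates, threshold):
    """Return the most similar candidate value, or None if nothing clears the threshold."""
    best_value, best_score = None, -1.0
    for value, vec in candidates.items():
        score = cosine(item_vec, vec)
        if score > best_score:
            best_value, best_score = value, score
    return best_value if best_score >= threshold else None


# Hypothetical candidate values for a "color" attribute, with toy embeddings.
color_values = {"red": [1.0, 0.0], "blue": [0.0, 1.0]}
clear_match = retrieve_value([0.9, 0.1], color_values, threshold=0.8)
no_match = retrieve_value([0.7, 0.7], color_values, threshold=0.8)
```

Returning `None` when no candidate is similar enough models the case where the product profile does not express the attribute at all, and because outputs are always drawn from the candidate set they are normalized by construction.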
Binita Saha, Utsha Saha, Muhammad Zubair Malik
This work presents a novel architecture for building Retrieval-Augmented Generation (RAG) systems to improve Question Answering (QA) tasks from a target corpus. Large Language Models (LLMs) have revolutionized the analysis and generation of human-like text. These models rely on pre-trained data and lack real-time updates unless integrated with live data tools. RAG enhances LLMs by integrating online resources and databases to generate contextually appropriate responses. However, traditional RAG still encounters challenges like information dilution and hallucinations when handling vast amounts of data. Our approach addresses these challenges by converting corpora into a domain-specific dataset and constructing a RAG architecture that generates responses from the target document. We introduce QuIM-RAG (Question-to-question Inverted Index Matching), a novel approach for the retrieval mechanism in our system. This strategy generates potential questions from document chunks and matches these with user queries to identify the most relevant text chunks for generating accurate answers. We have implemented our RAG system on top of the open-source Meta-LLaMA3-8B-instruct model by Meta Inc. that is available on Hugging Face. We constructed a custom corpus of 500+ pages from a high-traffic website accessed thousands of times daily for answering complex questions, along with manually prepared ground truth QA for evaluation. We compared our approach with traditional RAG models using BERT-Score and RAGAS, state-of-the-art metrics for evaluating LLM applications. Our evaluation demonstrates that our approach outperforms traditional RAG architectures on both metrics.
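The question-to-question matching idea can be sketched as follows. This is a simplified illustration rather than the QuIM-RAG implementation: the per-chunk questions are hand-written here (in the described system they are generated from document chunks), matching uses token-set overlap instead of learned embeddings, and the chunk IDs and function names are hypothetical.

```python
import re


def tokens(text):
    """Lowercase word tokens of a question, as a set."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def build_question_index(chunk_questions):
    """Flatten {chunk_id: [questions]} into (token_set, question, chunk_id) entries."""
    return [(tokens(q), q, chunk_id)
            for chunk_id, questions in chunk_questions.items()
            for q in questions]


def match_chunk(index, user_query):
    """Return the chunk ID whose stored question best overlaps the user query (Jaccard)."""
    q = tokens(user_query)

    def score(entry):
        t = entry[0]
        return len(q & t) / len(q | t) if q | t else 0.0

    return max(index, key=score)[2]


# Hypothetical questions attached to two document chunks.
chunk_questions = {
    "chunk_admissions": ["What is the application deadline?", "How do I apply?"],
    "chunk_fees": ["How much is tuition per semester?"],
}
index = build_question_index(chunk_questions)
best_chunk = match_chunk(index, "When is the application deadline?")
```

The matched chunk ID would then be used to fetch the chunk text and pass it to the LLM as context; comparing a query against questions rather than raw passages keeps both sides of the match in the same form.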