Deepak Gupta, Dina Demner-Fushman, William Hersh, Steven Bedrick, Kirk Roberts
With the advancement of large language models (LLMs), the biomedical domain has seen significant progress and improvement in multiple tasks such as biomedical question answering, lay language summarization of the biomedical literature, clinical note summarization, etc. However, hallucinations or confabulations remain one of the key challenges when using LLMs in the biomedical and other domains. Inaccuracies may be particularly harmful in high-risk situations, such as making clinical decisions or appraising biomedical research. Studies on the evaluation of the LLMs' abilities to ground generated statements in verifiable sources have shown that models perform significantly worse on lay-user generated questions, and often fail to reference relevant sources. This can be problematic when those seeking information want evidence from studies to back up the claims from LLMs[3]. Unsupported statements are a major barrier to using LLMs in any applications that may affect health. Methods for grounding generated statements in reliable sources along with practical evaluation approaches are needed to overcome this barrier. Towards this, in our pilot task organized at TREC 2024, we introduced the task of reference attribution as a means to mitigate the generation of false statements by LLMs answering biomedical questions.
Abdelkader Belhenniche, Roman Chertovskih
This article presents a new approach to enhancing data storage and retrieval in Query By Image Content (QBIC) systems by introducing the ${\rm NEM}_{\sigma}$ distance measure, which satisfies the relaxed triangle inequality. By leveraging the concept of extended $b$-metric spaces, we address complex distance relationships, thereby improving the accuracy and efficiency of image database management. The use of ${\rm NEM}_{\sigma}$ facilitates better scalability and accuracy in large-scale image retrieval systems, optimizing both the storage and retrieval processes. The proposed method represents a significant advancement over traditional distance measures, offering enhanced flexibility and precision in the context of image content-based querying. Additionally, we take inspiration from ice flow models using ${\rm NEM}_{\sigma}$ and ${\rm NEM}_r$, adding dynamic and location-based factors to better capture details in images.
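For context, a relaxed triangle inequality is usually stated as below. This is the standard $b$-metric form, given only as an illustration; it is not the definition of ${\rm NEM}_{\sigma}$, which the abstract does not spell out. A distance $d$ satisfies the relaxed triangle inequality with constant $s \ge 1$ if
$$d(x,z) \le s\,\bigl[d(x,y) + d(y,z)\bigr] \quad \text{for all } x, y, z.$$
For example, $d(x,y) = (x-y)^2$ on the real line violates the ordinary triangle inequality (take $x=0$, $y=1$, $z=2$: $4 > 1 + 1$) but satisfies the relaxed one with $s = 2$, because $(a+b)^2 \le 2a^2 + 2b^2$. In an extended $b$-metric space, the constant $s$ is replaced by a function $\theta(x,z) \ge 1$.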
Liu Yang, Fabian Paischer, Kaveh Hassani, Jiacheng Li, Shuai Shao, Zhang Gabriel Li, Yun He, Xue Feng et al.
Sequential dense retrieval models utilize advanced sequence learning techniques to compute item and user representations, which are then used to rank relevant items for a user through inner product computation between the user and all item representations. However, this approach requires storing a unique representation for each item, resulting in significant memory requirements as the number of items grows. In contrast, the recently proposed generative retrieval paradigm offers a promising alternative by directly predicting item indices using a generative model trained on semantic IDs that encapsulate items' semantic information. Despite its potential for large-scale applications, a comprehensive comparison between generative retrieval and sequential dense retrieval under fair conditions is still lacking, leaving open questions regarding performance and computation trade-offs. To address this, we compare these two approaches under controlled conditions on academic benchmarks and propose LIGER (LeveragIng dense retrieval for GEnerative Retrieval), a hybrid model that combines the strengths of these two widely used methods. LIGER integrates sequential dense retrieval into generative retrieval, mitigating performance differences and enhancing cold-start item recommendation in the datasets evaluated. This hybrid approach provides insights into the trade-offs between these approaches and demonstrates improvements in efficiency and effectiveness for recommendation systems in small-scale benchmarks.
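The inner-product ranking step described above can be sketched in a few lines. This is a minimal illustration, not the LIGER model: the function names, item ids, and two-dimensional embeddings are hypothetical, and a real system would use learned representations. It also makes the memory cost visible, since one vector is stored per item.

```python
def inner_product(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def rank_items(user_vec, item_vecs, k):
    """Return the ids of the k items with the highest inner-product score.

    Ties are broken by item id so that the output is deterministic.
    """
    scored = [(inner_product(user_vec, vec), item_id)
              for item_id, vec in item_vecs.items()]
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [item_id for _, item_id in scored[:k]]


# Hypothetical toy embeddings: one stored vector per item.
items = {
    "item_a": [1.0, 0.0],
    "item_b": [0.0, 1.0],
    "item_c": [0.5, 0.5],
}
user = [0.9, 0.1]
top2 = rank_items(user, items, k=2)
print(top2)
```

Scoring touches every stored item vector, which is the storage and computation cost that the generative retrieval paradigm aims to avoid.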
Ronglai Zuo, Rolandos Alexandros Potamias, Evangelos Ververas, Jiankang Deng, Stefanos Zafeiriou
Sign language is a visual language that encompasses all linguistic features of natural languages and serves as the primary communication method for the deaf and hard-of-hearing communities. Although many studies have successfully adapted pretrained language models (LMs) for sign language translation (sign-to-text), the reverse task, sign language generation (SLG, text-to-sign), remains largely unexplored. In this work, we introduce a multilingual sign language model, Signs as Tokens (SOKE), which can generate 3D sign avatars autoregressively from text inputs using a pretrained LM. To align sign language with the LM, we leverage a decoupled tokenizer that discretizes continuous signs into token sequences representing various body parts. During decoding, unlike existing approaches that flatten all part-wise tokens into a single sequence and predict one token at a time, we propose a multi-head decoding method capable of predicting multiple tokens simultaneously. This approach improves inference efficiency while maintaining effective information fusion across different body parts. To further ease the generation process, we propose a retrieval-enhanced SLG approach, which incorporates external sign dictionaries to provide accurate word-level signs as auxiliary conditions, significantly improving the precision of generated signs. Extensive qualitative and quantitative evaluations demonstrate the effectiveness of SOKE. Code, models, and data will be made publicly available.
Albert Kornilov, Tatiana Shavrina
Recent advances in language modeling have demonstrated significant
improvements in zero-shot capabilities, including in-context learning,
instruction following, and machine translation for extremely under-resourced
languages (Tanzer et al., 2024). However, many languages with limited written
resources rely primarily on formal descriptions of grammar and vocabulary.
In this paper, we introduce a set of benchmarks to evaluate how well models
can extract and classify information from the complex descriptions found in
linguistic grammars. We present a Retrieval-Augmented Generation (RAG)-based
approach that leverages these descriptions for downstream tasks such as machine
translation. Our benchmarks encompass linguistic descriptions for 248 languages
across 142 language families, focusing on typological features from WALS and
Grambank.
This set of benchmarks offers the first comprehensive evaluation of language
models' in-context ability to accurately interpret and extract linguistic
features, providing a critical resource for scaling NLP to low-resource
languages. The code and data are publicly available at
https://github.com/al-the-eigenvalue/RAG-on-grammars.
Authors' comments: submitted to COLING 2025
Akari Asai, Jacqueline He, Rulin Shao, Weijia Shi, Amanpreet Singh, Joseph Chee Chang, Kyle Lo, Luca Soldaini et al.
Scientific progress depends on researchers' ability to synthesize the growing body of literature. Can large language models (LMs) assist scientists in this task? We introduce OpenScholar, a specialized retrieval-augmented LM that answers scientific queries by identifying relevant passages from 45 million open-access papers and synthesizing citation-backed responses. To evaluate OpenScholar, we develop ScholarQABench, the first large-scale multi-domain benchmark for literature search, comprising 2,967 expert-written queries and 208 long-form answers across computer science, physics, neuroscience, and biomedicine. On ScholarQABench, OpenScholar-8B outperforms GPT-4o by 5% and PaperQA2 by 7% in correctness, despite being a smaller, open model. While GPT-4o hallucinates citations 78 to 90% of the time, OpenScholar achieves citation accuracy on par with human experts. OpenScholar's datastore, retriever, and self-feedback inference loop also improve off-the-shelf LMs: for instance, OpenScholar-GPT4o improves GPT-4o's correctness by 12%. In human evaluations, experts preferred OpenScholar-8B and OpenScholar-GPT4o responses over expert-written ones 51% and 70% of the time, respectively, compared to GPT-4o's 32%. We open-source all of our code, models, datastore, data and a public demo.
Shenglai Zeng, Jiankun Zhang, Bingheng Li, Yuping Lin, Tianqi Zheng, Dante Everaert, Hanqing Lu, Hui Liu et al.
Retrieval-Augmented Generation (RAG) systems have shown promise in enhancing the performance of Large Language Models (LLMs). However, these systems face challenges in effectively integrating external knowledge with the LLM's internal knowledge, often leading to issues with misleading or unhelpful information. This work aims to provide a systematic study on knowledge checking in RAG systems. We conduct a comprehensive analysis of LLM representation behaviors and demonstrate the significance of using representations in knowledge checking. Motivated by the findings, we further develop representation-based classifiers for knowledge filtering. We show substantial improvements in RAG performance, even when dealing with noisy knowledge databases. Our study provides new insights into leveraging LLM representations for enhancing the reliability and effectiveness of RAG systems.
Yongdong Luo, Xiawu Zheng, Xiao Yang, Guilin Li, Haojia Lin, Jinfa Huang, Jiayi Ji, Fei Chao et al.
Existing large video-language models (LVLMs) struggle to comprehend long
videos correctly due to limited context. To address this problem, fine-tuning
long-context LVLMs and employing GPT-based agents have emerged as promising
solutions. However, fine-tuning LVLMs would require extensive high-quality data
and substantial GPU resources, while GPT-based agents would rely on proprietary
models (e.g., GPT-4o). In this paper, we propose Video Retrieval-Augmented
Generation (Video-RAG), a training-free and cost-effective pipeline that
employs visually-aligned auxiliary texts to help facilitate cross-modality
alignment while providing additional information beyond the visual content.
Specifically, we leverage open-source external tools to extract
visually-aligned information from pure video data (e.g., audio, optical
character, and object detection), and incorporate the extracted information
into an existing LVLM as auxiliary texts, alongside video frames and queries,
in a plug-and-play manner. Our Video-RAG offers several key advantages: (i)
lightweight with low computing overhead due to single-turn retrieval; (ii) easy
implementation and compatibility with any LVLM; and (iii) significant,
consistent performance gains across long video understanding benchmarks,
including Video-MME, MLVU, and LongVideoBench. Notably, our model demonstrates
superior performance over proprietary models like Gemini-1.5-Pro and GPT-4o
when utilized with a 72B model.
Authors' comments: 10 pages, 6 figures
Amar Abane, Anis Bekri, Abdella Battou, Saddek Bensalem
Efficiently processing and interpreting network data is critical for the operation of increasingly complex networks. Recent advances in Large Language Models (LLMs) and Retrieval-Augmented Generation (RAG) techniques have improved data processing in network management. However, existing RAG methods like VectorRAG and GraphRAG struggle with the complexity and implicit nature of semi-structured technical data, leading to inefficiencies in time, cost, and retrieval. This paper introduces FastRAG, a novel RAG approach designed for semi-structured data. FastRAG employs schema learning and script learning to extract and structure data without needing to submit entire data sources to an LLM. It integrates text search with knowledge graph (KG) querying to improve accuracy in retrieving context-rich information. Evaluation results demonstrate that FastRAG provides accurate question answering, with improvements of up to 90% in time and 85% in cost compared to GraphRAG.
Jakub Stetina, Martin Fajcik, Michal Stefanik, Michal Hradis
This article presents a comprehensive evaluation of seven off-the-shelf document retrieval models (SPLADE, PLAID, PLAID-X, SimCSE, Contriever, OpenAI ADA, and Gemma2) on the Czech retrieval dataset DaReCzech. The primary objective of our experiments is to estimate the quality of modern retrieval approaches in the Czech language. Our analyses cover retrieval quality, speed, and memory footprint. We also analyze whether it is better to apply a model directly to Czech text or to machine-translate the text into English and then perform retrieval in English. Our experiments identify the most effective option for Czech information retrieval. The findings reveal notable performance differences among the models, with Gemma2 achieving the highest precision and recall and Contriever performing poorly. Overall, the SPLADE and PLAID models offered a balance of efficiency and performance.
Matthias Becht, Hans-Peter Lehmann, Peter Sanders
A retrieval data structure stores a static function f : S -> {0,1}^r . For all x in S, it returns the r-bit value f(x), while for other inputs it may return an arbitrary result. The structure cannot answer membership queries, so it does not have to encode S. The information theoretic space lower bound for arbitrary inputs is r|S| bits. Retrieval data structures have widespread applications. They can be used as an approximate membership filter for S by storing fingerprints of the keys in S, where they are faster and more space efficient than Bloom filters. They can also be used as a basic building block of succinct data structures like perfect hash functions. Bumped Ribbon Retrieval (BuRR) [Dillinger et al., SEA'22] is a recently developed retrieval data structure that is fast to construct with a space overhead of less than 1%. The idea is to solve a nearly diagonal system of linear equations to determine a matrix that, multiplied with the hash of each key, gives the desired output values. During solving, BuRR might bump lines of the equation system to another layer of the same data structure. While the paper describes a simple parallel construction based on bumping the keys on thread boundaries, it does not give an implementation. In this brief announcement, we now fill this gap. Our parallel implementation is transparent to the queries. It achieves a speedup of 14 on 32 cores for 8-bit filters. The additional space overhead is 105 bytes per thread, or 105 slots. This matches 0.0007% of the total space consumption when constructing with 1 billion input keys. A large portion of the construction time is spent on parallel sorting.
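To make the idea of a retrieval data structure concrete, here is a minimal sketch that stores a static function by solving a linear system over GF(2). It is not BuRR: the rows are dense random bit vectors instead of ribbon-shaped ones, there is no bumping and no parallelism, and the `slack` parameter wastes far more space than the sub-1% overhead quoted above. The names `row_for`, `build`, and `query`, and the choice of SHAKE-128 as the hash, are assumptions made for illustration.

```python
import hashlib


def row_for(key, m, seed):
    """Hash a key to a nonzero m-bit coefficient row (illustrative hash choice)."""
    raw = hashlib.shake_128(f"{seed}:{key}".encode()).digest((m + 7) // 8)
    row = int.from_bytes(raw, "big") & ((1 << m) - 1)
    return row or 1


def build(mapping, r, slack=8, max_seeds=64):
    """Find a table Z such that XOR-ing Z[i] over the set bits of row_for(x) gives f(x)."""
    if not all(0 <= v < (1 << r) for v in mapping.values()):
        raise ValueError("values must fit in r bits")
    m = len(mapping) + slack
    for seed in range(max_seeds):
        pivots = {}  # highest set bit -> (reduced row, reduced value)
        consistent = True
        for key, value in mapping.items():
            row, val = row_for(key, m, seed), value
            while row:
                p = row.bit_length() - 1
                if p not in pivots:
                    pivots[p] = (row, val)
                    break
                prow, pval = pivots[p]
                row ^= prow
                val ^= pval
            else:
                # Row reduced to zero: solvable only if the value also cancelled.
                if val != 0:
                    consistent = False
                    break
        if not consistent:
            continue  # retry with a different hash seed
        table = [0] * m  # free variables stay 0
        for p in sorted(pivots):  # back substitution, lowest pivot first
            row, acc = pivots[p]
            rest = row ^ (1 << p)
            while rest:
                low = rest & -rest
                acc ^= table[low.bit_length() - 1]
                rest ^= low
            table[p] = acc
        return seed, table
    raise RuntimeError("no solvable system found")


def query(structure, key):
    """Return f(key) for keys in S; keys outside S yield an arbitrary r-bit value."""
    seed, table = structure
    row = row_for(key, len(table), seed)
    acc = 0
    while row:
        low = row & -row
        acc ^= table[low.bit_length() - 1]
        row ^= low
    return acc


f = {"apple": 3, "banana": 0, "cherry": 7, "date": 5}
structure = build(f, r=3)
print({key: query(structure, key) for key in f})
```

The structure never stores the keys themselves, which is why it cannot answer membership queries and why a key outside S simply returns some arbitrary value.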
Zongmeng Zhang, Jinhua Zhu, Wengang Zhou, Xiang Qi, Peng Zhang, Houqiang Li
Dense retrieval, which aims to encode the semantic information of arbitrary
text into dense vector representations or embeddings, has emerged as an
effective and efficient paradigm for text retrieval, consequently becoming an
essential component in various natural language processing systems. These
systems typically focus on optimizing the embedding space by attending to the
relevance of text pairs, while overlooking the Boolean logic inherent in
language, which may not be captured by current training objectives. In this
work, we first investigate whether current retrieval systems can comprehend the
Boolean logic implied in language. To answer this question, we formulate the
task of Boolean Dense Retrieval and collect a benchmark dataset, BoolQuestions,
which covers complex queries containing basic Boolean logic and corresponding
annotated passages. Through extensive experimental results on the proposed task
and benchmark dataset, we draw the conclusion that current dense retrieval
systems do not fully understand Boolean logic in language, and there is a long
way to go to improve our dense retrieval systems. Furthermore, to promote
further research on enhancing the understanding of Boolean logic for language
models, we explore Boolean operation on decomposed query and propose a
contrastive continual training method that serves as a strong baseline for the
research community.
Authors' comments: Findings of the Association for Computational Linguistics: EMNLP 2024
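One simple reading of "Boolean operation on decomposed query" is to retrieve separately for each atomic sub-query and then combine the result sets with set operations. The sketch below illustrates that reading only; it is not the authors' method or their contrastive training procedure. The keyword-matching `retrieve` function is a toy stand-in for a dense retriever, and the corpus and the tuple-based expression format are hypothetical.

```python
def retrieve(term, corpus):
    """Toy stand-in for a dense retriever: ids of passages containing the term."""
    return {pid for pid, text in corpus.items()
            if term.lower() in text.lower().split()}


def boolean_retrieve(expr, corpus):
    """Evaluate a nested expression over retrieved passage-id sets.

    An expression is either a term string or a tuple of the form
    ("AND", a, b, ...), ("OR", a, b, ...), or ("NOT", a).
    """
    if isinstance(expr, str):
        return retrieve(expr, corpus)
    op, *args = expr
    results = [boolean_retrieve(arg, corpus) for arg in args]
    if op == "AND":
        return set.intersection(*results)
    if op == "OR":
        return set.union(*results)
    if op == "NOT":
        return set(corpus) - results[0]
    raise ValueError(f"unknown operator: {op}")


# Hypothetical three-passage corpus.
corpus = {
    1: "cats chase mice",
    2: "dogs chase cats",
    3: "dogs sleep",
}
dogs_not_cats = boolean_retrieve(("AND", "dogs", ("NOT", "cats")), corpus)
mice_or_sleep = boolean_retrieve(("OR", "mice", "sleep"), corpus)
print(dogs_not_cats, mice_or_sleep)
```

With a real dense retriever each sub-query would return a ranked list, not an exact set, so the combination step would need a score threshold or a rank cutoff.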
Quan Ze Chen, K. J. Kevin Feng, Chan Young Park, Amy X. Zhang
Alignment of large language models (LLMs) to societal values should account for pluralistic values from diverse groups. One technique uses in-context learning for inference-time alignment, but only considers similarity when drawing few-shot examples, not accounting for cross-group differences in value prioritization. We propose SPICA, a framework for pluralistic alignment that accounts for group-level differences during in-context example retrieval. SPICA introduces three designs to facilitate pluralistic alignment: scenario banks, group-informed metrics, and in-context alignment prompts. From an evaluation of SPICA on an alignment task collecting inputs from four demographic groups ($n = 544$), our metrics retrieve in-context examples that more closely match observed preferences, with the best prompt configuration using multiple contrastive responses to demonstrate examples. In an end-to-end evaluation ($n = 80$), we observe that SPICA-aligned models are higher rated than a baseline similarity-only retrieval approach, with groups seeing up to a +0.16 point improvement on a 5 point scale. Additionally, gains from SPICA were more uniform, with all groups benefiting from alignment rather than only some. Finally, we find that while a group-agnostic approach can effectively align to aggregated values, it is not most suited for aligning to divergent groups.
Arne Wulff, Swapan Madabhushi Venkata, Boyang Chen, Sebastian Feld, Matthias Möller, Yinglu Tang
We, the QAIMS lab at the Aerospace Faculty of TU Delft, participated as
finalists in the Airbus/BMW Quantum Computing Challenge 2024. Stacking sequence
retrieval, a complex combinatorial task within a bi-level optimization
framework, is crucial for designing laminated composites that meet aerospace
requirements for weight, strength, and stiffness. This document presents the
scientifically relevant sections of our submission, which builds on our prior
research on applying quantum computation to this challenging design problem.
For the competition, we expanded our previous work in several significant ways.
First, we incorporated a full set of manufacturing constraints into our
algorithmic framework, including those previously established theoretically but
not yet demonstrated, thereby aligning our approach more closely with
real-world manufacturing demands. We implemented the F-VQE algorithm, which
enhances the probability shaping of optimal solutions, improving on simpler
variational quantum algorithms. Our approach also demonstrates flexibility by
accommodating diverse objectives as well as finer ply-angle increments
alongside the previously demonstrated conventional ply angles. Scalability was
tested using the DMRG algorithm, which, despite limitations in entanglement
representation, enabled simulations with up to 200 plies. Results were directly
compared to conventional stacking sequence retrieval algorithms with DMRG
showing high competitiveness. Given DMRG's limited entanglement capabilities,
it serves as a conservative baseline, suggesting potential for even greater
performance on fully realized quantum systems. This document serves to make our
competition results publicly available as we prepare a formal publication on
these findings and their implications for aerospace materials design
optimization.
Authors' comments: Scientifically relevant sections of a submission to 2024 Airbus/BMW
Quantum Computing Challenge. 26 pages, 10 figures, 3 tables
Ashkan Nejad, Mohammad Reza Faraji, Xiaojun Qi
With the widespread adoption of digital devices equipped with cameras and the rapid development of Internet technology, numerous content-based image retrieval systems and novel image feature extraction techniques have emerged in recent years. This paper introduces a saliency map-based image retrieval approach using invariant Krawtchouk moments (SM-IKM) to enhance retrieval speed and accuracy. The proposed method applies a global contrast-based salient region detection algorithm to create a saliency map that effectively isolates the foreground from the background. It then combines multiple orders of invariant Krawtchouk moments (IKM) with local binary patterns (LBPs) and color histograms to comprehensively represent the foreground and background. Additionally, it incorporates LBPs derived from the saliency map to improve discriminative power, facilitating more precise image differentiation. A bag-of-visual-words (BoVW) model is employed to generate a codebook for classification and discrimination. By using compact IKMs in the BoVW framework and integrating a range of region-based features, including color histograms, LBPs, and saliency map-enhanced LBPs, our proposed SM-IKM achieves efficient and accurate image retrieval. Extensive experiments on publicly available datasets, such as Caltech 101 and Wang, demonstrate that SM-IKM outperforms recent state-of-the-art retrieval methods. The source code for SM-IKM is available at github.com/arnejad/SMIKM.
João Alberto de Oliveira Lima
This work addresses the challenge of capturing the complexities of legal
knowledge by proposing a multi-layered embedding-based retrieval method for
legal and legislative texts. Creating embeddings not only for individual
articles but also for their components (paragraphs, clauses) and structural
groupings (books, titles, chapters, etc.), we seek to capture the subtleties of
legal information through the use of dense vectors of embeddings, representing
it at varying levels of granularity. Our method meets various information needs
by allowing the Retrieval Augmented Generation system to provide accurate
responses, whether for specific segments or entire sections, tailored to the
user's query. We explore the concepts of aboutness, semantic chunking, and
inherent hierarchy within legal texts, arguing that this method enhances
legal information retrieval. Although the focus is on Brazil's legislative
methods and the Brazilian Constitution, which follow a civil law tradition, our
findings should in principle be applicable across different legal systems,
including those adhering to common law traditions. Furthermore, the principles
of the proposed method extend beyond the legal domain, offering valuable
insights for organizing and retrieving information in any field characterized
by information encoded in hierarchical text.
Authors' comments: 27 pages, 10 figures
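The multi-layered idea of producing one retrievable unit per structural level can be sketched as a tree walk. This is an illustrative assumption about how such chunks might be enumerated, not the paper's implementation: the node format (`level`, `label`, `text`, `children`), the function name `layered_chunks`, and the sample provisions are all hypothetical, and the embedding step itself is left out.

```python
def layered_chunks(node, path=()):
    """Yield (level, path, text) for every descendant of a node, then the node itself.

    The text of a grouping node is the concatenation of its children's text,
    so each level of the hierarchy yields its own unit to embed.
    """
    here = path + (node["label"],)
    children = node.get("children", [])
    if children:
        parts = []
        for child in children:
            child_chunks = list(layered_chunks(child, here))
            yield from child_chunks
            # The last chunk yielded for a child is the child's own full text.
            parts.append(child_chunks[-1][2])
        text = " ".join(parts)
    else:
        text = node["text"]
    yield (node["level"], here, text)


# Hypothetical miniature legal text.
doc = {"level": "chapter", "label": "Chapter I", "children": [
    {"level": "article", "label": "Art. 1", "children": [
        {"level": "paragraph", "label": "Par. 1", "text": "First rule."},
        {"level": "paragraph", "label": "Par. 2", "text": "Second rule."},
    ]},
    {"level": "article", "label": "Art. 2", "text": "Third rule."},
]}
chunks = list(layered_chunks(doc))
for level, path, text in chunks:
    print(level, " > ".join(path), "|", text)
```

Each tuple would then be embedded separately, so a query can match a single paragraph, a whole article, or an entire chapter, depending on its scope.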
Rose E. Wang, Pawan Wirawarn, Kenny Lam, Omar Khattab, Dorottya Demszky
Many open-ended conversations (e.g., tutoring lessons or business meetings)
revolve around pre-defined reference materials, like worksheets or meeting
bullets. To provide a framework for studying such conversation structure, we
introduce Problem-Oriented Segmentation & Retrieval (POSR), the task of jointly
breaking down conversations into segments and linking each segment to the
relevant reference item. As a case study, we apply POSR to education where
effectively structuring lessons around problems is critical yet difficult. We
present LessonLink, the first dataset of real-world tutoring lessons, featuring
3,500 segments, spanning 24,300 minutes of instruction and linked to 116 SAT
math problems. We define and evaluate several joint and independent approaches
for POSR, including segmentation (e.g., TextTiling), retrieval (e.g., ColBERT),
and large language model (LLM) methods. Our results highlight that modeling
POSR as one joint task is essential: POSR methods outperform independent
segmentation and retrieval pipelines by up to +76% on joint metrics and surpass
traditional segmentation methods by up to +78% on segmentation metrics. We
demonstrate POSR's practical impact on downstream education applications,
deriving new insights on the language and time use in real-world lesson
structures.
Authors' comments: EMNLP 2024 Findings. Our code and dataset are open-sourced at
https://github.com/rosewang2008/posr
Kristijan Armeni, Marko Pranjić, Senja Pollak
To predict upcoming text, language models must in some cases retrieve
in-context information verbatim. In this report, we investigated how the
ability of language models to retrieve arbitrary in-context nouns developed
during training (across time) and as language models trained on the same
dataset increase in size (across scale). We then asked whether learning of
in-context retrieval correlates with learning of more challenging zero-shot
benchmarks. Furthermore, inspired by semantic effects in human short-term
memory, we evaluated the retrieval with respect to a major semantic component
of target nouns, namely whether they denote a concrete or abstract entity, as
rated by humans. We show that verbatim in-context retrieval developed in a
sudden transition early in the training process, after about 1% of the training
tokens. This was observed across model sizes (from 14M up to 12B
parameters), and the transition occurred slightly later for the two smallest
models. We further found that the development of verbatim in-context retrieval
is positively correlated with the learning of zero-shot benchmarks. Around the
transition point, all models showed an advantage in retrieving concrete nouns
as opposed to abstract nouns. In all but the two smallest models, the advantage
dissipated toward the end of training.
Authors' comments: accepted to Conference on Natural Language Learning 2024
(https://www.conll.org/)
Ziwei Liu, Liang Zhang, Qian Li, Jianghua Wu, Guangxu Zhu
Retrieval-augmented generation (RAG) has shown impressive capability in providing reliable answer predictions and addressing hallucination problems. A typical RAG implementation uses powerful retrieval models to extract external information and large language models (LLMs) to generate answers. In contrast, recent LLM-based retrieval has gained attention for its substantial improvements in information retrieval (IR) due to the LLMs' semantic understanding capability. However, directly applying LLMs to RAG systems presents challenges. This may cause feature locality problems, as massive parametric knowledge can hinder effective usage of global information across the corpus; for example, an LLM-based retriever often inputs document summaries instead of full documents. Moreover, various pre-trained tasks in LLMs introduce variance, further weakening performance as a retriever. To address these issues, we propose a novel two-stage fine-tuning architecture called Invar-RAG. In the retrieval stage, an LLM-based retriever is constructed by integrating LoRA-based representation learning to tackle feature locality issues. To enhance retrieval performance, we develop two patterns (invariant and variant patterns) and an invariance loss to reduce LLM variance. In the generation stage, a refined fine-tuning method is employed to improve LLM accuracy in generating answers based on retrieved information. Experimental results show that Invar-RAG significantly outperforms existing baselines across three open-domain question answering (ODQA) datasets. Code is available in the Supplementary Material for reproducibility.
Andrés Muñoz, Nancy Thomas, Annita Vapsi, Daniel Borrajo
Many industrial and service sectors require tools to extract vehicle characteristics from images. This task is complex not only because of the variety of noise and the large number of classes, but also because of the constant introduction of new vehicle models to the market. In this paper, we present Veri-Car, an information retrieval integrated approach designed to help with this task. It leverages supervised learning techniques to accurately identify the make, type, model, year, color, and license plate of cars. The approach also addresses the challenge of handling open-world problems, where new car models and variations frequently emerge, by employing a sophisticated combination of pre-trained models and a hierarchical multi-similarity loss. Veri-Car demonstrates robust performance, achieving high precision and accuracy in classifying both seen and unseen data. Additionally, it integrates an ensemble license plate detection model and an OCR model to extract license plate numbers with impressive accuracy.