Kyle Buettner, Adriana Kovashka
There is a scarcity of multilingual vision-language models that properly
account for the perceptual differences that are reflected in image captions
across languages and cultures. In this work, through a multimodal, multilingual
retrieval case study, we quantify the existing lack of model flexibility. We
empirically show performance gaps between training on captions that come from
native German perception and captions that have been either machine-translated
or human-translated from English into German. To address these gaps, we further
propose and evaluate caption augmentation strategies. While we achieve mean
recall improvements (+1.3), gaps still remain, indicating an open area of
future work for the community.
Authors' comments: Short paper accepted to EMNLP24 (Main)
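For readers unfamiliar with the metric, the sketch below shows one common way a mean recall score is computed in retrieval evaluation: Recall@K is averaged over queries for several cutoffs, and the per-cutoff values are then averaged. The cutoffs (1, 5, 10), the function names, and the data layout are assumptions for illustration and are not taken from the paper.

```python
def recall_at_k(ranked_ids, relevant_ids, k):
    """Return 1.0 if any relevant item appears in the top k results, else 0.0."""
    return 1.0 if any(item in relevant_ids for item in ranked_ids[:k]) else 0.0


def mean_recall(rankings, relevant, ks=(1, 5, 10)):
    """Average Recall@K over all queries, then average across the cutoffs in ks.

    rankings: dict mapping query id -> list of retrieved ids, best first
    relevant: dict mapping query id -> set of ids counted as correct
    """
    per_cutoff = []
    for k in ks:
        hits = [recall_at_k(rankings[q], relevant[q], k) for q in rankings]
        per_cutoff.append(sum(hits) / len(hits))
    return sum(per_cutoff) / len(per_cutoff)
```

Under this convention, a reported change in mean recall reflects the averaged change across all cutoffs rather than a change at any single cutoff.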
Shayekh Bin Islam, Md Asib Rahman, K S M Tozammel Hossain, Enamul Hoque, Shafiq Joty, Md Rizwan Parvez
Retrieval-Augmented Generation (RAG) has been shown to enhance the factual
accuracy of Large Language Models (LLMs), but existing methods often suffer
from limited reasoning capabilities in effectively using the retrieved
evidence, particularly when using open-source LLMs. To mitigate this gap, we
introduce a novel framework, Open-RAG, designed to enhance reasoning
capabilities in RAG with open-source LLMs. Our framework transforms an
arbitrary dense LLM into a parameter-efficient sparse mixture of experts (MoE)
model capable of handling complex reasoning tasks, including both single- and
multi-hop queries. Open-RAG uniquely trains the model to navigate challenging
distractors that appear relevant but are misleading. As a result, Open-RAG
leverages latent learning, dynamically selecting relevant experts and
integrating external knowledge effectively for more accurate and contextually
relevant responses. In addition, we propose a hybrid adaptive retrieval method
to determine retrieval necessity and balance the trade-off between performance
gain and inference speed. Experimental results show that the Llama2-7B-based
Open-RAG outperforms state-of-the-art LLMs and RAG models such as ChatGPT,
Self-RAG, and Command R+ in various knowledge-intensive tasks. We open-source
our code and models at https://openragmoe.github.io/
Authors' comments: Accepted to EMNLP 2024 Findings. Website:
https://openragmoe.github.io/. 14 pages, 7 figures, 5 tables
Jiaqi Lei, Liang Hu, Yi Bu, Jiqun Liu
Previous research in the information retrieval (IR) field has focused on
summarizing progress and synthesizing knowledge and techniques from individual
studies and data-driven experiments, but the extent of contributions and
collaborations between researchers from different communities (e.g., academia
and industry) in advancing IR knowledge remains unclear. To address this gap,
this study explores several characteristics of information retrieval research
in four areas: productivity patterns and preferred venues, the relationship
between citations and downloads, changes in research topics, and changes in
patterns of scientific collaboration, by analyzing 53,471 papers published
between 2000 and 2018 from the Association for Computing Machinery (ACM)
Digital Library dataset. Through the analysis and interpretation of empirical
datasets, we find that academic research, industry research, and collaborative
research between academia and industry focused on different topics. Among the
collaboration models, Academia-Industry Collaboration is more oriented towards
large teamwork. Collaborative networks between researchers in academia and
industry suggest that the field of information retrieval has become richer over
time in terms of themes, foci, and sub-themes, becoming a more diverse field of
study.
Authors' comments: 43 pages, 11 figures
Melody Yu
In this paper, we present SeeSay, an assistive device designed for individuals with visual impairments. This system leverages large language models (LLMs) for speech recognition and visual querying. It effectively identifies, records, and responds to the user's environment by providing audio guidance using retrieval-augmented generation (RAG). Our experiments demonstrate the system's capability to recognize its surroundings and respond to queries with audio feedback in diverse settings. We hope that the SeeSay system will facilitate users' comprehension and recollection of their surroundings, thereby enhancing their environmental perception, improving navigational capabilities, and boosting overall independence.
Xiang Hu, Zhihao Teng, Jun Zhao, Wei Wu, Kewei Tu
Despite the success of Transformers, handling long contexts remains
challenging due to the limited length generalization and quadratic complexity
of self-attention. Thus Transformers often require post-training with a larger
attention window, significantly increasing computational and memory costs. In
this paper, we propose a novel attention mechanism based on dynamic context,
Grouped Cross Attention (GCA), which can generalize to 1000 times the
pre-training context length while maintaining the ability to access distant
information with a constant attention window size. For a given input sequence,
we split it into chunks and use each chunk to retrieve top-k relevant past
chunks for subsequent text generation. Specifically, unlike most previous works
that use an off-the-shelf retriever, our key innovation allows the retriever to
learn how to retrieve past chunks that better minimize the auto-regressive loss
of subsequent tokens in an end-to-end manner. Such a mechanism accommodates
retrieved chunks with a fixed-size attention window to achieve long-range
information access, significantly reducing computational and memory costs
during training and inference. Experiments show that GCA-based models achieve
near-perfect accuracy in passkey retrieval for 16M context lengths, which is
1000 times the training length.
Authors' comments: accepted to ICML 2025
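The chunk-and-retrieve step described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the GCA mechanism itself: it assumes each chunk is already summarized by a fixed vector and scores past chunks with a plain dot product, whereas the paper's retriever is learned end-to-end. All function names here are hypothetical.

```python
def split_into_chunks(tokens, chunk_size):
    """Split a token sequence into consecutive chunks of at most chunk_size tokens."""
    return [tokens[i:i + chunk_size] for i in range(0, len(tokens), chunk_size)]


def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def top_k_past_chunks(chunk_vectors, current_index, k):
    """Return indices of the k best-scoring chunks that precede current_index.

    Only chunks strictly before the current one are candidates, so retrieval
    never looks at future text. Ties are broken in favor of the earlier chunk.
    """
    query = chunk_vectors[current_index]
    scored = [(dot(query, chunk_vectors[j]), j) for j in range(current_index)]
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [j for _, j in scored[:k]]
```

Because only k retrieved chunks are attended to at each step, the attention window in this sketch stays constant regardless of how long the full sequence grows.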
Reshmi Ghosh, Rahul Seetharaman, Hitesh Wadhwa, Somyaa Aggarwal, Samyadeep Basu, Soundararajan Srinivasan, Wenlong Zhao, Shreyas Chaudhari et al.
Retrieval Augmented Generation (RAG) is a widely used approach for leveraging
external context in several natural language applications such as question
answering and information retrieval. Yet, the exact nature in which a Language
Model (LM) leverages this non-parametric memory or retrieved context isn't
clearly understood. This paper mechanistically examines the RAG pipeline to
highlight that LMs demonstrate a "shortcut" effect and have a strong bias
towards utilizing the retrieved context to answer questions, while relying
minimally on model priors. We propose (a) Causal Mediation Analysis, for
proving that parametric memory is minimally utilized when answering a question,
and (b) Attention Contributions and Knockouts, for showing that the last token
residual stream does not get enriched from the subject token in the question, but
does get enriched from tokens of the RAG context. We find this pronounced "shortcut"
behaviour to hold across both LLMs (e.g., LLaMA) and SLMs (e.g., Phi).
Authors' comments: Accepted to Blackbox NLP @ EMNLP 2024
Sarah Packowski, Inge Halilovic, Jenifer Schlotfeldt, Trish Smith
Retrieval-augmented generation (RAG) is a popular technique for using large
language models (LLMs) to build customer-support, question-answering solutions.
In this paper, we share our team's practical experience building and
maintaining enterprise-scale RAG solutions that answer users' questions about
our software based on product documentation. Our experience has not always
matched the most common patterns in the RAG literature. This paper focuses on
solution strategies that are modular and model-agnostic. For example, our
experience over the past few years - using different search methods and LLMs,
and many knowledge base collections - has been that simple changes to the way
we create knowledge base content can have a huge impact on our RAG solutions'
success. In this paper, we also discuss how we monitor and evaluate results.
Common RAG benchmark evaluation techniques have not been useful for evaluating
responses to novel user questions, so we have found a flexible, "human in the
lead" approach is required.
Authors' comments: 6 pages, 4 figures, to be published in ICAAI 2024 conference
proceedings
Bryan Li, Fiona Luo, Samar Haider, Adwait Agashe, Tammy Li, Runqi Liu, Muqing Miao, Shriya Ramakrishnan et al.
The paradigm of retrieval-augmented generation (RAG) helps mitigate hallucinations of large language models (LLMs). However, RAG also introduces biases contained within the retrieved documents. These biases can be amplified in scenarios that are multilingual and culturally sensitive, such as territorial disputes. In this paper, we introduce BordIRLines, a benchmark consisting of 720 territorial dispute queries paired with 14k Wikipedia documents across 49 languages. To evaluate LLMs' cross-lingual robustness for this task, we formalize several modes for multilingual retrieval. Our experiments on several LLMs reveal that retrieving multilingual documents best improves response consistency and decreases geopolitical bias over using purely in-language documents, showing how incorporating diverse perspectives improves robustness. Also, querying in low-resource languages displays a much wider variance in the linguistic distribution of response citations. Our further experiments and case studies investigate how cross-lingual RAG is affected by aspects from IR to document contents. We release our benchmark and code to support further research towards ensuring equitable information access across languages at https://huggingface.co/datasets/borderlines/bordirlines.
Tao Tan, Yining Qian, Ang Lv, Hongzhan Lin, Songhao Wu, Yongbo Wang, Feng Wang, Jingtong Wu et al.
Large language models (LLMs) enhanced with retrieval-augmented generation
(RAG) have introduced a new paradigm for web search. However, the limited
context awareness of LLMs degrades their performance on RAG tasks. Existing
methods to enhance context awareness are often inefficient, incurring time or
memory overhead during inference, and many are tailored to specific position
embeddings. In this paper, we propose Position-Embedding-Agnostic attention
Re-weighting (PEAR), which enhances the context awareness of LLMs with zero
inference overhead. Specifically, on a proxy task focused on context copying,
we first detect heads that suppress the model's context awareness, thereby
diminishing RAG performance. To weaken the impact of these heads, we re-weight
their outputs with learnable coefficients. The LLM (with frozen parameters) is
optimized by adjusting these coefficients to minimize loss on the proxy task.
As a result, the coefficients are optimized to values less than one, thereby
reducing their tendency to suppress RAG performance. During inference, the
optimized coefficients are fixed to re-weight these heads, regardless of the
specific task at hand. Our proposed PEAR offers two major advantages over
previous approaches: (1) It introduces zero additional inference overhead in
terms of memory usage or inference time, while outperforming competitive
baselines in accuracy and efficiency across various RAG tasks. (2) It is
independent of position embedding algorithms, ensuring broader applicability.
Authors' comments: preprint
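The re-weighting idea can be illustrated with a small sketch. It assumes each attention head's contribution is available as a separate vector and that the layer output is their sum; the coefficients are passed in as fixed numbers, whereas in the abstract they are learned on a proxy task with the LLM's parameters frozen. The function name and data layout are hypothetical.

```python
def reweight_heads(head_outputs, coefficients):
    """Scale each head's output vector by its coefficient and sum across heads.

    head_outputs: list of per-head contribution vectors (all the same length)
    coefficients: one scalar per head; a value below 1.0 weakens that head,
                  and 1.0 leaves it unchanged
    """
    dim = len(head_outputs[0])
    combined = [0.0] * dim
    for output, coeff in zip(head_outputs, coefficients):
        for i in range(dim):
            combined[i] += coeff * output[i]
    return combined
```

Since the coefficients are fixed after optimization, a scaling of this kind adds one multiplication per head output; the zero-overhead claim in the abstract is consistent with such constants being cheap to apply or fold into existing weights, though how the authors implement this is not stated here.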
Xuyang Wu, Shuowei Li, Hsin-Tai Wu, Zhiqiang Tao, Yi Fang
Retrieval-Augmented Generation (RAG) has recently gained significant
attention for its enhanced ability to integrate external knowledge sources into
open-domain question answering (QA) tasks. However, it remains unclear how
these models address fairness concerns, particularly with respect to sensitive
attributes such as gender, geographic location, and other demographic factors.
First, as language models evolve to prioritize utility, like improving exact
match accuracy, fairness considerations may have been largely overlooked.
Second, the complex, multi-component architecture of RAG methods poses
challenges in identifying and mitigating biases, as each component is optimized
for distinct objectives. In this paper, we aim to empirically evaluate fairness
in several RAG methods. We propose a fairness evaluation framework tailored to
RAG, using scenario-based questions and analyzing disparities across
demographic attributes. Our experimental results indicate that, despite recent
advances in utility-driven optimization, fairness issues persist in both the
retrieval and generation stages. These findings underscore the need for
targeted interventions to address fairness concerns throughout the RAG
pipeline. The dataset and code used in this study are publicly available in
the GitHub repository at https://github.com/elviswxy/RAG_fairness.
Authors' comments: Published at COLING 2025
Yining Juan, Chung-Chi Chen, Hen-Hsen Huang, Hsin-Hsi Chen
In diverse professional environments, ranging from academic conferences to corporate earnings calls, the ability to anticipate audience questions stands paramount. Traditional methods, which rely on manual assessment of an audience's background, interests, and subject knowledge, often fall short - particularly when facing large or heterogeneous groups, leading to imprecision and inefficiency. While NLP has made strides in text-based question generation, its primary focus remains on academic settings, leaving the intricate challenges of professional domains, especially earnings call conferences, underserved. Addressing this gap, our paper pioneers the multi-question generation (MQG) task specifically designed for earnings call contexts. Our methodology involves an exhaustive collection of earnings call transcripts and a novel annotation technique to classify potential questions. Furthermore, we introduce a retriever-enhanced strategy to extract relevant information. With a core aim of generating a spectrum of potential questions that analysts might pose, we derive these directly from earnings call content. Empirical evaluations underscore our approach's edge, revealing notable excellence in the accuracy, consistency, and perplexity of the questions generated.
Steven Ndungu, Trienko Grobler, Stefan J. Wijnholds, George Azzopardi
The morphologies of astronomical sources are highly complex, making it
essential not only to classify the identified sources into their predefined
categories but also to determine the sources that are most similar to a given
query source. Image-based retrieval is essential, as it allows an astronomer
with a source under study to ask a computer to sift through the large archived
database of sources to find the most similar ones. This is of particular
interest if the source under study does not fall into a "known" category
(anomalous). Our work uses the trainable COSFIRE (Combination of Shifted Filter
Responses) approach for image retrieval. COSFIRE filters are automatically
configured to extract the hyperlocal geometric arrangements that uniquely
describe the morphological characteristics of patterns of interest in a given
image; in this case astronomical sources. This is achieved by automatically
examining the shape properties of a given prototype source in an image, which
ultimately determines the selectivity of a COSFIRE filter. We further utilize
hashing techniques, which are efficient in terms of required computation and
storage, enabling scalability in handling large data sets in the image
retrieval process. We evaluated the effectiveness of our approach by conducting
experiments on a benchmark data set of radio galaxies, containing 1,180
training images and 404 test images. Notably, our approach achieved a mean
average precision of 91% for image retrieval, surpassing the performance of the
competing DenseNet-based method. Moreover, the COSFIRE filters are
significantly more computationally efficient, requiring $\sim\!14\times$ fewer
operations than the DenseNet-based method.
Authors' comments: 11 pages, 7 figures
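To make the hashing-based retrieval step more concrete, here is a minimal sketch that turns real-valued descriptors into binary codes and ranks database items by Hamming distance to the query. The simple thresholding rule is an assumption for illustration; the abstract does not specify which hashing technique is applied to the COSFIRE descriptors, and all names below are hypothetical.

```python
def binarize(vector, threshold=0.0):
    """Map a real-valued descriptor to a binary code by thresholding each entry."""
    return tuple(1 if x > threshold else 0 for x in vector)


def hamming_distance(code_a, code_b):
    """Count the positions at which two equal-length binary codes differ."""
    return sum(a != b for a, b in zip(code_a, code_b))


def rank_by_hamming(query_code, database_codes, top_n):
    """Return indices of the top_n database codes closest to the query code.

    Ties are broken in favor of the lower database index.
    """
    order = sorted(
        range(len(database_codes)),
        key=lambda i: (hamming_distance(query_code, database_codes[i]), i),
    )
    return order[:top_n]
```

Binary codes of this kind are compact to store and cheap to compare, which is the efficiency property the abstract attributes to hashing.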
Helong Huang, Chris W. Ormel, Michiel Min
Context. Clouds are ubiquitous in exoplanets' atmospheres and play an
important role in setting the opacity and chemical inventory of the atmosphere.
Understanding clouds is a critical step in interpreting exoplanets'
spectroscopic data. Aims. The aim is to model the multi-species nature of
clouds in atmospheric retrieval studies. To this end, we develop ExoLyn - a 1D
cloud model that balances physical consistency with computational efficiency.
Methods. ExoLyn solves the transport equation of cloud particles and vapor
under cloud condensation rates that are self-consistently calculated from
thermodynamics. ExoLyn is a standalone, open-source package that can be
combined with optool to calculate solid opacities and with
petitRADTRANS to generate transmission or emission spectra. Results.
With ExoLyn we find that the compositional structure of clouds in hot Jupiter
planets' atmospheres is layered with a cloud dominated by magnesium-silicates
on top of an iron cloud. This finding is consistent with more complex cloud
formation models but can be obtained with ExoLyn in only a few seconds. The
composition of the cloud particles can be constrained from the spectrum; for
example, MgSiO3 and Mg2SiO4 components give rise to an absorption feature at
8-10 um. We investigate the dependence of the cloud structure on the bulk
elemental composition of the planet and find that SiO2-dominated clouds form
on metal-rich planets and Fe clouds with a strong extinction effect form on
C-rich planets. Conclusions. Designed for maximum flexibility, ExoLyn can
also be used in retrieval analysis of sub-Neptunes and self-luminous planets.
The efficiency of ExoLyn opens the possibility of joint retrieval of
exoplanets' gas and cloud components.
Authors' comments: 17 pages, 12 figures, accepted by A&A
Soeun Lee, Si-Woo Kim, Taewhan Kim, Dong-Jin Kim
Recent advancements in image captioning have explored text-only training
methods to overcome the limitations of paired image-text data. However,
existing text-only training methods often overlook the modality gap between
using text data during training and employing images during inference. To
address this issue, we propose a novel approach called Image-like Retrieval,
which aligns text features with visually relevant features to mitigate the
modality gap. Our method further enhances the accuracy of generated captions by
designing a Fusion Module that integrates retrieved captions with input
features. Additionally, we introduce a Frequency-based Entity Filtering
technique that significantly improves caption quality. We integrate these
methods into a unified framework, which we refer to as IFCap
(Image-like Retrieval and Frequency-based Entity
Filtering for Zero-shot Captioning). Through extensive
experimentation, our straightforward yet powerful approach has demonstrated its
efficacy, outperforming the state-of-the-art methods by a significant margin in
both image captioning and video captioning compared to zero-shot captioning
based on text-only training.
Authors' comments: Accepted to EMNLP 2024
Ashmi Banerjee, Adithi Satish, Wolfgang Wörndl
Tourism Recommender Systems (TRS) have traditionally focused on providing
personalized travel suggestions, often prioritizing user preferences without
considering broader sustainability goals. Integrating sustainability into TRS
has become essential with the increasing need to balance environmental impact,
local community interests, and visitor satisfaction. This paper proposes a
novel approach to enhancing TRS for sustainable city trips using Large Language
Models (LLMs) and a modified Retrieval-Augmented Generation (RAG) pipeline. We
enhance the traditional RAG system by incorporating a sustainability metric
based on a city's popularity and seasonal demand during the prompt augmentation
phase. This modification, called Sustainability Augmented Reranking (SAR),
ensures the system's recommendations align with sustainability goals.
Evaluations using popular open-source LLMs, such as Llama-3.1-Instruct-8B and
Mistral-Instruct-7B, demonstrate that the SAR-enhanced approach consistently
matches or outperforms the baseline (without SAR) across most metrics,
highlighting the benefits of incorporating sustainability into TRS.
Authors' comments: Accepted at the RecSoGood 2024 Workshop co-located with the 18th ACM
Conference on Recommender Systems (RecSys 2024)
Nilanjan Sinhababu, Andrew Parry, Debasis Ganguly, Debasis Samanta, Pabitra Mitra
A supervised ranking model, despite its advantage of being effective, usually
involves complex processing - typically multiple stages of task-specific
pre-training and fine-tuning. This has motivated researchers to explore simpler
pipelines leveraging large language models (LLMs) that are capable of working
in a zero-shot manner. However, since zero-shot inference does not make use of
a training set of pairs of queries and their relevant documents, its
performance is mostly worse than that of supervised models, which are trained
on such example pairs. Motivated by the existing findings that training
examples generally improve zero-shot performance, in our work, we explore if
this also applies to ranking models. More specifically, given a query and a
pair of documents, the preference prediction task is improved by augmenting
examples of preferences for similar queries from a training set. Our proposed
pairwise few-shot ranker demonstrates consistent improvements over the
zero-shot baseline on both in-domain (TREC DL) and out-of-domain (BEIR subset)
retrieval benchmarks. Our method also achieves performance close to that of a
supervised model without requiring any complex training pipeline.
Authors' comments: Accepted to EMNLP 2024
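The few-shot augmentation described in the abstract can be sketched as prompt construction: pick training queries similar to the input query and prepend their known preferences as examples. The similarity measure (word-level Jaccard overlap), the prompt wording, and the record fields `query`, `preferred`, and `other` are all assumptions for illustration; the paper's actual retriever and prompt template are not given in the abstract.

```python
def jaccard(text_a, text_b):
    """Word-level Jaccard similarity between two strings."""
    words_a, words_b = set(text_a.lower().split()), set(text_b.lower().split())
    union = words_a | words_b
    return len(words_a & words_b) / len(union) if union else 0.0


def select_examples(query, training_set, k):
    """Return the k training records whose queries are most similar to the input."""
    ranked = sorted(training_set, key=lambda ex: -jaccard(query, ex["query"]))
    return ranked[:k]


def build_prompt(query, doc_a, doc_b, examples):
    """Assemble a pairwise preference prompt with few-shot examples first.

    For simplicity the preferred document is always shown as A in the
    examples; a real setup would likely randomize the order to avoid
    teaching the model a position bias.
    """
    parts = []
    for ex in examples:
        parts.append(
            f"Query: {ex['query']}\nDocument A: {ex['preferred']}\n"
            f"Document B: {ex['other']}\nMore relevant: A"
        )
    parts.append(
        f"Query: {query}\nDocument A: {doc_a}\nDocument B: {doc_b}\nMore relevant:"
    )
    return "\n\n".join(parts)
```

The resulting string would be passed to an LLM, whose completion ("A" or "B") serves as the pairwise preference used for ranking.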
Quanting Xie, So Yeon Min, Tianyi Zhang, Aarav Bajaj, Ruslan Salakhutdinov, Matthew Johnson-Roberson, Yonatan Bisk
There is no limit to how much a robot might explore and learn, but all of
that knowledge needs to be searchable and actionable. Within language research,
retrieval augmented generation (RAG) has become the workhorse of large-scale
non-parametric knowledge; however, existing techniques do not directly transfer
to the embodied domain, which is multimodal and where data is highly correlated and
perception requires abstraction.
To address these challenges, we introduce Embodied-RAG, a framework that
enhances the foundational model of an embodied agent with a non-parametric
memory system capable of autonomously constructing hierarchical knowledge for
both navigation and language generation. Embodied-RAG handles a full range of
spatial and semantic resolutions across diverse environments and query types,
whether for a specific object or a holistic description of ambiance. At its
core, Embodied-RAG's memory is structured as a semantic forest, storing
language descriptions at varying levels of detail. This hierarchical
organization allows the system to efficiently generate context-sensitive
outputs across different robotic platforms. We demonstrate that Embodied-RAG
effectively bridges RAG to the robotics domain, successfully handling over 200
explanation and navigation queries across 19 environments, highlighting its
promise as a general-purpose non-parametric system for embodied agents.
Authors' comments: Web: https://quanting-xie.github.io/Embodied-RAG-web/
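A hierarchical memory of language descriptions, as outlined in the abstract, could be queried with a top-down descent like the one sketched below. This is only an illustration under strong assumptions: nodes are scored by simple word overlap with the query, and the search follows a single best child at each level. The `Node` class and both function names are hypothetical; the abstract does not describe how Embodied-RAG scores or traverses its semantic forest.

```python
class Node:
    """One node in a description tree: broad summaries near the root, details at leaves."""

    def __init__(self, description, children=None):
        self.description = description
        self.children = children or []


def word_overlap(query, text):
    """Number of distinct lowercase words shared by the query and the text."""
    return len(set(query.lower().split()) & set(text.lower().split()))


def descend(root, query):
    """Follow the best-matching child at each level and return the path of descriptions."""
    node = root
    path = [node.description]
    while node.children:
        node = max(node.children, key=lambda child: word_overlap(query, child.description))
        path.append(node.description)
    return path
```

Stopping the descent early would return a coarser, more holistic description, while continuing to a leaf returns a specific object-level one, which mirrors the range of resolutions mentioned in the abstract.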
Qiuhai Zeng, Zimeng Qiu, Dae Yon Hwang, Xin He, William M. Campbell
Dense retrieval systems are commonly used for information retrieval (IR).
They rely on learning text representations through an encoder and usually
require supervised modeling via labelled data which can be costly to obtain or
simply unavailable. In this study, we introduce a novel unsupervised text
representation learning technique via instruction-tuning the pre-trained
encoder-decoder large language models (LLM) under the dual-encoder retrieval
framework. We demonstrate the corpus representation can be augmented by the
representations of relevant synthetic queries generated by the instruct-tuned
LLM founded on the Rao-Blackwell theorem. Furthermore, we effectively align the
query and corpus text representation with self-instructed-tuning. Specifically,
we first prompt an open-box pre-trained LLM to follow defined instructions
(i.e. question generation and keyword summarization) to generate synthetic
queries. Next, we fine-tune the pre-trained LLM with defined instructions and
the generated queries that passed a quality check. Finally, we generate synthetic
queries with the instruction-tuned LLM for each corpus and represent each
corpus by a weighted average of the synthetic query and original corpus
embeddings. We evaluate our proposed method under low-resource settings on
three English and one German retrieval datasets, measuring NDCG@10, MRR@100, and
Recall@100. We significantly improve the average zero-shot retrieval
performance on all metrics, improving open-box FLAN-T5 model variations by
[3.34%, 3.50%] absolute and exceeding three competitive dense retrievers
(i.e., mDPR, T-Systems, mBART-Large), with a model at least 38% smaller,
by 1.96%, 4.62%, and 9.52% absolute on NDCG@10.
Authors' comments: Accepted at DCAI24 workshop@CIKM2024
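The final augmentation step, combining an entry's own embedding with the embeddings of its synthetic queries, can be sketched as a weighted average. The single mixing weight `doc_weight`, its default value, and the equal weighting among synthetic queries are assumptions for illustration; the abstract does not state the actual weighting scheme.

```python
def augment_embedding(doc_embedding, query_embeddings, doc_weight=0.5):
    """Blend an entry's embedding with the mean embedding of its synthetic queries.

    doc_embedding: list of floats for the original text
    query_embeddings: list of equal-length vectors, one per synthetic query
    doc_weight: share given to the original embedding (hypothetical parameter);
                the remaining share goes to the mean query embedding
    """
    if not query_embeddings:
        # Nothing to blend in, so return a copy of the original embedding.
        return list(doc_embedding)
    count = len(query_embeddings)
    mean_query = [
        sum(q[i] for q in query_embeddings) / count
        for i in range(len(doc_embedding))
    ]
    return [
        doc_weight * d + (1.0 - doc_weight) * m
        for d, m in zip(doc_embedding, mean_query)
    ]
```

The intuition is that synthetic queries resemble the real queries seen at search time, so mixing their embeddings in should move the stored representation closer to where matching queries are likely to fall.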
Omar Mussa, Omer Rana, Benoît Goossens, Pablo Orozco-Terwengel, Charith Perera
Despite the recent broad adoption of Large Language Models (LLMs) across
various domains, their potential for enriching information systems in
extracting and exploring Linked Data (LD) and Resource Description Framework
(RDF) triplestores has not been extensively explored. This paper examines the
integration of LLMs within existing systems, emphasising the enhancement of
conversational user interfaces (UIs) and their capabilities for data extraction
by producing more accurate SPARQL queries without the requirement for model
retraining. Typically, conversational UI models necessitate retraining with the
introduction of new datasets or updates, limiting their functionality as
general-purpose extraction tools. Our approach addresses this limitation by
incorporating LLMs into the conversational UI workflow, significantly enhancing
their ability to comprehend and process user queries effectively. By leveraging
the advanced natural language understanding capabilities of LLMs, our method
improves RDF entity extraction within web systems employing conventional
chatbots. This integration facilitates a more nuanced and context-aware
interaction model, critical for handling the complex query patterns often
encountered in RDF datasets and Linked Open Data (LOD) endpoints. The
evaluation of this methodology shows a marked enhancement in system
expressivity and the accuracy of responses to user queries, indicating a
promising direction for future research in this area. This investigation not
only underscores the versatility of LLMs in enhancing existing information
systems but also sets the stage for further explorations into their potential
applications within more specialised domains of web information systems.
Authors' comments: This paper has been accepted at the 25th International Web
Information Systems Engineering Conference (WISE 2024)
Wenlong Dong, Dehao Huang, Jiangshan Liu, Chao Tang, Hong Zhang
Task-oriented grasping (TOG) is crucial for robots to accomplish manipulation tasks, requiring the determination of TOG positions and directions. Existing methods either rely on costly manual TOG annotations or only extract coarse grasping positions or regions from human demonstrations, limiting their practicality in real-world applications. To address these limitations, we introduce RTAGrasp, a Retrieval, Transfer, and Alignment framework inspired by human grasping strategies. Specifically, our approach first effortlessly constructs a robot memory from human grasping demonstration videos, extracting both TOG position and direction constraints. Then, given a task instruction and a visual observation of the target object, RTAGrasp retrieves the most similar human grasping experience from its memory and leverages semantic matching capabilities of vision foundation models to transfer the TOG constraints to the target object in a training-free manner. Finally, RTAGrasp aligns the transferred TOG constraints with the robot's action for execution. Evaluations on the public TOG benchmark, the TaskGrasp dataset, show the competitive performance of RTAGrasp on both seen and unseen object categories compared to existing baseline methods. Real-world experiments further validate its effectiveness on a robotic arm. Our code, appendix, and video are available at https://sites.google.com/view/rtagrasp/home.