Chunying Zhou, Xiaoyuan Xie, Gong Chen, Peng He, Bing Li
Most existing studies focus on information retrieval-based techniques for fault localization, which build representations for bug reports and source code files and match their semantic vectors through similarity measurement. However, such approaches often ignore useful information that might help improve localization performance, such as 1) the interaction relationship between bug reports and source code files; 2) the similarity relationship between bug reports; and 3) the co-citation relationship between source code files. In this paper, we propose a novel approach named Multi-View Adaptive Contrastive Learning for Information Retrieval Fault Localization (MACL-IRFL) to learn the above-mentioned relationships for software fault localization. Specifically, we first generate data augmentations separately from the report-code interaction view, the report-report similarity view and the code-code co-citation view, and adopt a graph neural network to aggregate the information of bug reports or source code files from the three views in the embedding process. Moreover, we perform contrastive learning across these views. Our contrastive learning task forces the bug report representations to encode information shared by the report-report and report-code views, and the source code file representations to encode information shared by the code-code and report-code views, thereby alleviating the noise from auxiliary information. Finally, to evaluate the performance of our approach, we conduct extensive experiments on five open-source Java projects. The results show that our model improves over the best baseline by up to 28.93%, 25.57% and 20.35% on Accuracy@1, MAP and MRR, respectively.
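The cross-view contrastive objective described above can be illustrated with a standard InfoNCE-style loss. This is a minimal sketch, assuming that each bug report has one embedding per view, that row i of both views refers to the same report (the positive pair), and that all other rows act as negatives; the cosine similarity, the `temperature` value and the function names are illustrative assumptions, not the exact MACL-IRFL loss.

```python
import math


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def cross_view_infonce(view_a, view_b, temperature=0.5):
    """Average InfoNCE loss pulling row i of view_a towards row i of view_b.

    view_a, view_b: lists of embeddings for the same entities (e.g. bug reports)
    taken from two different views (e.g. report-report and report-code).
    """
    n = len(view_a)
    total = 0.0
    for i in range(n):
        # Similarity of entity i in view A to every entity in view B.
        logits = [cosine(view_a[i], view_b[j]) / temperature for j in range(n)]
        # Numerically stable log-sum-exp over all candidates.
        m = max(logits)
        log_sum = m + math.log(sum(math.exp(x - m) for x in logits))
        # Negative log-probability of picking the matching entity.
        total += log_sum - logits[i]
    return total / n


# Two hypothetical reports whose views agree vs. the same embeddings with the pairing swapped.
reports_rr = [[1.0, 0.0], [0.0, 1.0]]
reports_rc = [[0.9, 0.1], [0.1, 0.9]]
swapped_rc = [[0.1, 0.9], [0.9, 0.1]]
print(cross_view_infonce(reports_rr, reports_rc) < cross_view_infonce(reports_rr, swapped_rc))  # True
```

Under this reading, the loss would be applied once to the two report views and once to the two code views, so each representation is pushed towards what its views share.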
Warren Jouanneau, Marc Palyart, Emma Jouffroy
Finding the perfect match between a job proposal and a set of freelancers is not an easy task to perform at scale, especially in multiple languages. In this paper, we propose a novel neural retriever architecture that tackles this problem in a multilingual setting. Our method encodes project descriptions and freelancer profiles by leveraging pre-trained multilingual language models. These models serve as the backbone for a custom transformer architecture that aims to preserve the structure of the profiles and projects. This model is trained with a contrastive loss on historical data. Through several experiments, we show that this approach effectively captures skill-matching similarity and enables efficient matching, outperforming traditional methods.
Jiaming Zhou, Shiwan Zhao, Jiabei He, Hui Wang, Wenjia Zeng, Yong Chen, Haoqin Sun, Aobo Kong et al.
State-of-the-art models like OpenAI's Whisper exhibit strong performance in multilingual automatic speech recognition (ASR), but they still face challenges in accurately recognizing diverse subdialects. In this paper, we propose M2R-whisper, a novel multi-stage and multi-scale retrieval augmentation approach designed to enhance ASR performance in low-resource settings. Building on the principles of in-context learning (ICL) and retrieval-augmented techniques, our method employs sentence-level ICL in the pre-processing stage to harness contextual information, while integrating token-level k-Nearest Neighbors (kNN) retrieval as a post-processing step to further refine the final output distribution. By synergistically combining sentence-level and token-level retrieval strategies, M2R-whisper effectively mitigates various types of recognition errors. Experiments conducted on Mandarin and subdialect datasets, including AISHELL-1 and KeSpeech, demonstrate substantial improvements in ASR accuracy, all achieved without any parameter updates.
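The abstract does not spell out the token-level kNN step. As an assumption, the sketch below follows the widely used kNN-LM recipe: retrieved datastore entries are turned into a distribution over tokens (closer neighbours weigh more), which is then linearly interpolated with the model's own next-token distribution. The datastore contents, distances, `lam` and `temperature` are hypothetical.

```python
import math


def knn_distribution(neighbors, vocab, temperature=1.0):
    """Turn retrieved (distance, token) pairs into a distribution over `vocab`.

    Closer neighbours get exponentially more weight; tokens never retrieved get 0.
    """
    scores = dict.fromkeys(vocab, 0.0)
    for distance, token in neighbors:
        if token in scores:
            scores[token] += math.exp(-distance / temperature)
    total = sum(scores.values())
    if total == 0.0:
        raise ValueError("no retrieved neighbour matches the vocabulary")
    return {token: s / total for token, s in scores.items()}


def interpolate(p_model, p_knn, lam=0.3):
    """Mix the two distributions: lam * p_knn + (1 - lam) * p_model."""
    return {tok: lam * p_knn[tok] + (1.0 - lam) * p_model[tok] for tok in p_model}


# Hypothetical decoder output for one step and three retrieved datastore entries.
p_model = {"a": 0.6, "b": 0.3, "c": 0.1}
neighbors = [(0.1, "b"), (0.2, "b"), (1.5, "a")]
p_knn = knn_distribution(neighbors, p_model.keys())
p_final = interpolate(p_model, p_knn)
print(max(p_final, key=p_final.get))  # b
```

Here the retrieved neighbours flip the most likely token from "a" to "b" without touching any model parameters, which is the kind of post-processing correction the abstract describes.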
Weiye Xu, Min Wang, Wengang Zhou, Houqiang Li
Embodied Everyday Task is a popular task in the embodied AI community, requiring agents to perform a sequence of actions based on natural language instructions and visual observations. Traditional learning-based approaches face two challenges. Firstly, natural language instructions often lack explicit task planning. Secondly, extensive training is required to equip models with knowledge of the task environment. Previous works based on Large Language Models (LLMs) either perform poorly due to the lack of task-specific knowledge or rely on ground truth as few-shot samples. To address these limitations, we propose a novel approach called Progressive Retrieval Augmented Generation (P-RAG), which not only effectively leverages the powerful language processing capabilities of LLMs but also progressively accumulates task-specific knowledge without ground truth. Compared to conventional RAG methods, which retrieve relevant information from the database in a one-shot manner to assist generation, P-RAG introduces an iterative approach that progressively updates the database. In each iteration, P-RAG retrieves from the latest database and obtains historical information from the previous interaction as experiential references for the current interaction. Moreover, we introduce a more granular retrieval scheme that retrieves not only similar tasks but also similar situations, providing more valuable reference experiences. Extensive experiments show that P-RAG achieves competitive results without using ground truth and can further improve performance through self-iterations.
Wonduk Seo, Haojie Zhang, Yueyang Zhang, Changhao Zhang, Songyao Duan, Lixin Su, Daiting Shi, Jiashu Zhao et al.
Query reformulation is a well-known problem in Information Retrieval (IR) that aims to improve the success rate of a single search by automatically modifying the user's input query. Recent methods leverage Large Language Models (LLMs) to improve query reformulation, but often generate limited and redundant expansions, which may constrain their effectiveness in capturing diverse intents. In this paper, we propose GenCRF, a Generative Clustering and Reformulation Framework that, for the first time, adaptively captures diverse intents in the retrieval phase based on multiple differentiated, well-generated queries. GenCRF leverages LLMs to generate varied queries from the initial query using customized prompts, then clusters them into groups that distinctly represent diverse intents. Furthermore, the framework combines the queries representing these diverse intents through innovative weighted aggregation strategies to optimize retrieval performance, and integrates a novel Query Evaluation Rewarding Model (QERM) to refine the process through feedback loops. Empirical experiments on the BEIR benchmark demonstrate that GenCRF achieves state-of-the-art performance, surpassing previous query reformulation SOTAs by up to 12% on nDCG@10. These techniques can be adapted to various LLMs, significantly boosting retriever performance and advancing the field of Information Retrieval.
To Eun Kim, Fernando Diaz
Modern language models frequently include retrieval components to improve
their outputs, giving rise to a growing number of retrieval-augmented
generation (RAG) systems. Yet, most existing work in RAG has underemphasized
fair ranking techniques and neglected the diverse interests of all
stakeholders. In this paper, we present the first comprehensive study of RAG
systems that incorporate fairness-aware rankings, focusing on both ranking
fairness and attribution fairness (ensuring equitable exposure of sources
cited in the final text). We specifically examine item-side fairness, i.e.,
whether retrieved documents receive balanced exposure, and assess how this
affects both the system's overall performance and the eventual distribution of
cited sources. Across twelve RAG models and seven tasks, we find that
fairness-aware retrieval frequently retains or even improves ranking
effectiveness and generation quality, countering the widespread belief that
fairness compromises system performance. Moreover, we show that fair retrieval
leads to more balanced attribution in the final responses, ensuring that the
cited sources are credited more equitably. Our results underscore the
importance of item-side fairness throughout both retrieval and generation
phases, offering key insights for building more responsible and equitable RAG
systems and illustrating promising avenues for future exploration in fair
ranking and source attribution.
Authors' comments: Top 5 Spotlight at AFME Workshop at NeurIPS 2024
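The paper's exact fairness metric is not given in the abstract. One common way to make "balanced exposure" concrete, used here as an assumption, is expected exposure under a position-based browsing model: a document at rank r receives attention 1/log2(r + 1), averaged over the rankings a (possibly stochastic) retriever produces. The optional `k` cut-off is a hypothetical stand-in for passing only the top-k documents to the generator.

```python
import math
from collections import defaultdict


def position_weight(rank):
    """Attention paid to 1-based rank `rank` under a DCG-style log discount."""
    return 1.0 / math.log2(rank + 1)


def expected_exposure(rankings, k=None):
    """Average exposure per document over a sample of rankings, optionally truncated at top-k."""
    exposure = defaultdict(float)
    for ranking in rankings:
        shown = ranking if k is None else ranking[:k]
        for rank, doc in enumerate(shown, start=1):
            exposure[doc] += position_weight(rank)
    return {doc: e / len(rankings) for doc, e in exposure.items()}


# Two equally relevant documents: a fixed ranking vs. a randomized (fairness-aware) policy.
deterministic = expected_exposure([["d1", "d2"]])
stochastic = expected_exposure([["d1", "d2"], ["d2", "d1"]])
print(deterministic)  # d1 gets 1.0, d2 about 0.63
print(stochastic)     # both about 0.82
```

A deterministic ranker always gives the same document the top slot, while shuffling equally relevant documents spreads exposure evenly at no cost in expected relevance, which is the intuition behind fairness-aware retrieval retaining effectiveness.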
Hyukmo Kang, Kyle Van Gorkom, Meghdoot Biswas, Daewook Kim, Ewan S. Douglas
Continuous wavefront sensing helps space observatories maintain their on-orbit optical performance. To measure the phase of a wavefront, phase retrieval is an attractive technique, as it uses multiple point spread function (PSF) images acquired by the telescope itself, without extra metrology systems or complicated calibration. Focus-diverse phase retrieval uses PSFs from predetermined defocused positions to enhance the dynamic range of the algorithm. We describe an updated visible light active optics testbed with the addition of a linear motorized focus stage. The performance of the phase retrieval algorithm in broadband is tested under various cases. While broadband bandpass filters offer a higher signal-to-noise ratio (SNR), phase retrieval performance can be limited by image blur caused by diffraction and by increased computing cost. We used multiple bandpass filters (10 nm, 88 nm, and 150 nm) and investigated the effects of bandwidth on accuracy and on the required image acquisition conditions such as SNR, reaching accuracies below 20 nm RMS wavefront error at the widest bandwidth. We also investigated how the dynamic range of the phase retrieval algorithm depends on bandwidth, and the amount of defocus required to expand the dynamic range. Finally, we simulated the continuous wavefront sensing and correction loop with a range of statistically generated representative telescope disturbance time series to test for edge cases.
Frédéric Grosshans, Michał Horodecki, Mio Murao, Tomasz Młynik, Marco Túlio Quintino, Michał Studziński, Satoshi Yoshida
This work considers a teleportation task for Alice and Bob in a scenario
where Bob cannot perform corrections. In particular, we analyse the task of
\textit{multicopy state teleportation}, where Alice has $k$ identical copies of
an arbitrary unknown $d$-dimensional qudit state $\vert\psi\rangle$ to teleport
a single copy of $\vert\psi\rangle$ to Bob using a maximally entangled
two-qudit state shared between Alice and Bob without Bob's correction. Alice
may perform a joint measurement on her half of the entangled state and the $k$
copies of $\vert\psi\rangle$. We prove that the maximal probability of success
for teleporting the exact state $\vert\psi\rangle$ to Bob is
$p(d,k)=\frac{k}{d(k-1+d)}$ and present an explicit protocol to attain this
performance. Then, by utilising $k$ copies of an arbitrary target state
$\vert\psi\rangle$, we show how the multicopy state teleportation protocol can
be employed to enhance the success probability of storage and retrieval of
quantum programs, which aims to universally retrieve the action of an arbitrary
quantum channel that is stored in a state. Our proofs make use of group
representation theory methods, which may find applications beyond the problems
addressed in this work.
Authors' comments: 25 pages, 3 figures. Comments are welcome
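As a quick numerical check of the success probability p(d,k) = k/(d(k-1+d)) quoted above, the sketch below evaluates it exactly with rational arithmetic; the function name is illustrative.

```python
from fractions import Fraction


def p_success(d: int, k: int) -> Fraction:
    """Maximal success probability p(d, k) = k / (d (k - 1 + d)) for k copies of a qudit."""
    if d < 2 or k < 1:
        raise ValueError("need qudit dimension d >= 2 and k >= 1 copies")
    return Fraction(k, d * (k - 1 + d))


for k in (1, 2, 3, 10, 1000):
    print(k, p_success(2, k), float(p_success(2, k)))
```

For a qubit (d = 2) this gives 1/4, 1/3, 3/8, 5/11 and 500/1001. In general, k = 1 gives 1/d², the probability that ordinary teleportation happens to produce the outcome needing no correction; each extra copy helps, since p(d,k+1) - p(d,k) = (d-1)/(d(k+d)(k-1+d)) > 0; and p(d,k) approaches 1/d as k grows.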
K. R. Schuurman, A. Meyer
Accurate estimates of surface solar irradiance (SSI) are essential for solar
resource assessments and solar energy forecasts in grid integration and
building control applications. SSI estimates for spatially extended regions can
be retrieved from geostationary satellites such as Meteosat. Traditional SSI
satellite retrievals like Heliosat rely on physical radiative transfer
modelling. We introduce the first machine-learning-based satellite retrieval
for instantaneous SSI and demonstrate its capability to provide accurate and
generalizable SSI estimates across Europe. Our deep learning retrieval provides
near real-time SSI estimates based on data-driven emulation of Heliosat and
fine-tuning on pyranometer networks. By including SSI from ground stations, our
SSI retrieval model can outperform Heliosat accuracy and generalize well to
regions with other climates and surface albedos in cloudy conditions (clear-sky
index < 0.8). We also show that the SSI retrieved from Heliosat exhibits large
biases in mountain regions, and that training and fine-tuning our retrieval
models on SSI data from ground stations strongly reduces these biases,
outperforming Heliosat. Furthermore, we quantify the relative importance of the
Meteosat channels and other predictor variables like solar zenith angle for the
accuracy of our deep learning SSI retrieval model in different cloud
conditions. We find that in cloudy conditions multiple near-infrared and
infrared channels enhance the performance. Our results can facilitate the
development of more accurate satellite retrieval models of surface solar
irradiance.
Authors' comments: 19 pages, 11 figures
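The cloudy-condition criterion above uses the clear-sky index, the ratio of measured SSI to the SSI a clear-sky model predicts for the same place and time. The sketch below applies the "clear-sky index < 0.8" threshold; which clear-sky model supplies the denominator is not stated here, and the sample values are hypothetical.

```python
def clear_sky_index(ssi, ssi_clear_sky):
    """k* = measured SSI / modelled clear-sky SSI (both in W/m^2)."""
    if ssi_clear_sky <= 0:
        raise ValueError("clear-sky SSI must be positive (sun above the horizon)")
    return ssi / ssi_clear_sky


def is_cloudy(ssi, ssi_clear_sky, threshold=0.8):
    """Classify a sample as cloudy when its clear-sky index falls below `threshold`."""
    return clear_sky_index(ssi, ssi_clear_sky) < threshold


# Hypothetical samples: (measured SSI, clear-sky SSI) in W/m^2.
samples = [(620.0, 650.0), (300.0, 600.0), (480.0, 600.0)]
print([is_cloudy(s, c) for s, c in samples])  # [False, True, False]
```

Because the comparison is strict, a sample with an index of exactly 0.8 is not counted as cloudy.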
Mohamed Sobhi Jabal, Pranav Warman, Jikai Zhang, Kartikeye Gupta, Ayush Jain, Maciej Mazurowski, Walter Wiggins, Kirti Magudia et al.
Purpose: To develop and evaluate an automated system for extracting structured clinical information from unstructured radiology and pathology reports using open-weights large language models (LMs) and retrieval augmented generation (RAG), and to assess the effects of model configuration variables on extraction performance. Methods and Materials: The study utilized two datasets: 7,294 radiology reports annotated for Brain Tumor Reporting and Data System (BT-RADS) scores and 2,154 pathology reports annotated for isocitrate dehydrogenase (IDH) mutation status. An automated pipeline was developed to benchmark the performance of various LMs and RAG configurations. The impact of model size, quantization, prompting strategies, output formatting, and inference parameters was systematically evaluated. Results: The best performing models achieved over 98% accuracy in extracting BT-RADS scores from radiology reports and over 90% for IDH mutation status extraction from pathology reports. The top-performing model was a medically fine-tuned Llama 3. Larger, newer, and domain fine-tuned models consistently outperformed older and smaller models. Model quantization had minimal impact on performance. Few-shot prompting significantly improved accuracy. RAG improved performance for complex pathology reports but not for shorter radiology reports. Conclusions: Open LMs demonstrate significant potential for automated extraction of structured clinical data from unstructured clinical reports with local privacy-preserving application. Careful model selection, prompt engineering, and semi-automated optimization using annotated data are critical for optimal performance. These approaches could be reliable enough for practical use in research workflows, highlighting the potential for human-machine collaboration in healthcare data extraction.
Alireza Salemi, Hamed Zamani
Privacy-preserving methods for personalizing large language models (LLMs) are relatively under-explored. There are two schools of thought on this topic: (1) generating personalized outputs by personalizing the input prompt through retrieval augmentation from the user's personal information (RAG-based methods), and (2) parameter-efficient fine-tuning of LLMs per user that considers efficiency and space limitations (PEFT-based methods). This paper presents the first systematic comparison between the two approaches on a wide range of personalization tasks using seven diverse datasets. Our results indicate that RAG-based and PEFT-based personalization methods on average yield 14.92% and 1.07% improvements over the non-personalized LLM, respectively. We find that combining RAG with PEFT elevates these improvements to 15.98%. Additionally, we identify a positive correlation between the amount of user data and PEFT's effectiveness, indicating that RAG is a better choice for cold-start users (i.e., users with limited personal data).
Simon Schleich, Sudeshna Boro Saikia, Quentin Changeat, Manuel Güdel, Aiko Voigt, Ingo Waldmann
We investigate the impact of using multipoint p-T profiles of varying
complexity on the retrieval of synthetically generated hot Jupiter transmission
spectra modelled after state-of-the-art observations of the hot Jupiter
WASP-39 b with JWST. We perform homogenised atmospheric retrievals with the
TauREx retrieval framework on a sample of synthetically generated transmission
spectra, accounting for varying cases of underlying p-T profiles, cloud-top
pressures, and expected noise levels. These retrievals are performed using a
fixed-pressure multipoint p-T prescription with increasing complexity, ranging
from isothermal to an eleven-point profile. We evaluate the performance of the
retrievals based on the Bayesian model evidence, and the accuracy of the
retrievals compared to the known input parameters. We find that performing
atmospheric retrievals using an isothermal prescription for the
pressure-temperature profile consistently results in wrongly retrieved
atmospheric parameters when compared to the known input parameters. For an
underlying p-T profile with a fully positive lapse rate, we find that a
two-point profile is sufficient to retrieve the known atmospheric parameters,
while under the presence of an atmospheric temperature inversion, we find that
a more complex profile is necessary. Our investigation shows that, for a data
quality scenario mirroring state-of-the-art observations of a hot Jupiter with
JWST, an isothermal p-T prescription is insufficient to correctly retrieve the
known atmospheric parameters. We find a model complexity preference dependent
on the underlying pressure-temperature structure, but argue that a p-T
prescription on the complexity level of a four-point profile should be
preferred. This represents the overlap between the lowest number of free
parameters and highest model preference in the cases investigated in this work.
Authors' comments: 16 pages, 14 figures, accepted for publication in A&A
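To make the "multipoint p-T profile" prescription concrete, the sketch below evaluates a profile defined by temperatures at fixed pressure nodes: one node gives an isothermal profile, two nodes a single lapse rate, and three or more can represent an inversion. It is a stand-alone illustration, not TauREx's own profile class; linear interpolation in log10(pressure) and constant temperature outside the node range are assumptions, and the node values are hypothetical.

```python
import math
from bisect import bisect_left


def npoint_temperature(nodes, pressure):
    """Temperature at `pressure` from a multipoint p-T profile.

    nodes: (pressure, temperature) pairs at fixed pressures, in any order.
    Assumes linear interpolation in log10(pressure) and constant temperature
    outside the node range; a single node gives an isothermal profile.
    """
    pts = sorted((math.log10(p), t) for p, t in nodes)
    x = math.log10(pressure)
    if x <= pts[0][0]:
        return pts[0][1]
    if x >= pts[-1][0]:
        return pts[-1][1]
    xs = [lp for lp, _ in pts]
    i = bisect_left(xs, x)  # guarantees xs[i - 1] < x <= xs[i]
    (x0, t0), (x1, t1) = pts[i - 1], pts[i]
    return t0 + (t1 - t0) * (x - x0) / (x1 - x0)


# Hypothetical profiles (pressures in bar, temperatures in K).
isothermal = [(1.0, 1100.0)]
two_point = [(10.0, 1500.0), (1e-4, 900.0)]
inversion = [(10.0, 1500.0), (1e-2, 1000.0), (1e-4, 1300.0)]
for p in (1.0, 1e-2, 1e-3):
    print(p, npoint_temperature(isothermal, p), npoint_temperature(two_point, p),
          npoint_temperature(inversion, p))
```

In a retrieval, the node temperatures would be free parameters, so each extra point adds one dimension to the sampled parameter space; this is the trade-off against Bayesian evidence that the study explores.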
Siqi Li, Danni Liu, Jan Niehues
Direct speech translation (ST) models often struggle with rare words. Incorrect translation of these words can have severe consequences, impacting translation quality and user trust. While rare word translation is inherently challenging for neural models due to sparse learning signals, real-world scenarios often allow access to translations of past recordings on similar topics. To leverage these valuable resources, we propose a retrieval-and-demonstration approach to enhance rare word translation accuracy in direct ST models. First, we adapt existing ST models to incorporate retrieved examples for rare word translation, which allows the model to benefit from prepended examples, similar to in-context learning. We then develop a cross-modal (speech-to-speech, speech-to-text, text-to-text) retriever to locate suitable examples. We demonstrate that standard ST models can be effectively adapted to leverage examples for rare word translation, improving rare word translation accuracy over the baseline by 17.6% with gold examples and 8.5% with retrieved examples. Moreover, our speech-to-speech retrieval approach outperforms other modalities and exhibits higher robustness to unseen speakers. Our code is publicly available (https://github.com/SiqiLii/Retrieve-and-Demonstration-ST).
Jialu Tang, Tong Xia, Yuan Lu, Cecilia Mascolo, Aaqib Saeed
Interpreting electrocardiograms (ECGs) and generating comprehensive reports remain challenging tasks in cardiology, often requiring specialized expertise and significant time investment. To address these critical issues, we propose ECG-ReGen, a retrieval-based approach for ECG-to-text report generation and question answering. Our method uses self-supervised learning to train the ECG encoder, enabling efficient similarity searches and report retrieval. By combining pre-training with dynamic retrieval and Large Language Model (LLM)-based refinement, ECG-ReGen effectively analyzes ECG data and answers related queries, with the potential to improve patient care. Experiments conducted on the PTB-XL and MIMIC-IV-ECG datasets demonstrate superior performance in both in-domain and cross-domain scenarios for report generation. Furthermore, our approach exhibits competitive performance on the ECG-QA dataset compared to fully supervised methods when utilizing off-the-shelf LLMs for zero-shot question answering. This approach, effectively combining a self-supervised encoder and LLMs, offers a scalable and efficient solution for accurate ECG interpretation, holding significant potential to enhance clinical decision-making.
Yifei Xin, Zhihong Zhu, Xuxin Cheng, Xusheng Yang, Yuexian Zou
Most existing audio-text retrieval (ATR) approaches typically rely on a
single-level interaction to associate audio and text, limiting their ability to
align different modalities and leading to suboptimal matches. In this work, we
present a novel ATR framework that leverages two-stream Transformers in
conjunction with a Hierarchical Alignment (THA) module to identify multi-level
correspondences of different Transformer blocks between audio and text.
Moreover, current ATR methods mainly focus on learning a global-level
representation, missing the fine-grained details needed to capture audio events
that correspond to textual semantics. To bridge this gap, we introduce a
Disentangled Cross-modal Representation (DCR) approach that disentangles
high-dimensional features into compact latent factors to grasp fine-grained
audio-text semantic correlations. Additionally, we develop a confidence-aware
(CA) module to estimate the confidence of each latent factor pair and
adaptively aggregate cross-modal latent factors to achieve local semantic
alignment. Experiments show that our THA effectively boosts ATR performance,
with the DCR approach further contributing to consistent performance gains.
Authors' comments: Accepted by Interspeech 2024
Hanane Djeddal, Pierre Erbacher, Raouf Toukal, Laure Soulier, Karen Pinel-Sauvagnat, Sophia Katrenko, Lynda Tamine
With the growing success of Large Language Models (LLMs) in information-seeking scenarios, search engines are now adopting generative approaches to provide answers along with in-line citations as attribution. While existing work focuses mainly on attributed question answering, in this paper we target information-seeking scenarios, which are often more challenging due to the open-ended nature of the queries and the size of the label space in terms of the diversity of candidate attributed answers per query. We propose a reproducible framework to evaluate and benchmark attributed information seeking, using any backbone LLM and different architectural designs: (1) Generate, (2) Retrieve then Generate, and (3) Generate then Retrieve. Experiments using HAGRID, an attributed information-seeking dataset, show the impact of different scenarios on both the correctness and attributability of answers.
Sidong Feng, Haochuan Lu, Jianqin Jiang, Ting Xiong, Likun Huang, Yinglin Liang, Xiaoqin Li, Yuetang Deng et al.
UI automation tests play a crucial role in ensuring the quality of mobile applications. Despite the growing popularity of machine learning techniques to generate these tests, they still face several challenges, such as the mismatch of UI elements. The recent advances in Large Language Models (LLMs) have addressed these issues by leveraging their semantic understanding capabilities. However, a significant gap remains in applying these models to industrial-level app testing, particularly in terms of cost optimization and knowledge limitation. To address this, we introduce CAT to create cost-effective UI automation tests for industry apps by combining machine learning and LLMs with best practices. Given the task description, CAT employs Retrieval Augmented Generation (RAG) to source examples of industrial app usage as the few-shot learning context, assisting LLMs in generating the specific sequence of actions. CAT then employs machine learning techniques, with LLMs serving as a complementary optimizer, to map the target element on the UI screen. Our evaluations on the WeChat testing dataset demonstrate CAT's performance and cost-effectiveness, achieving 90% UI automation at a cost of $0.34 and outperforming the state of the art. We have also integrated our approach into the real-world WeChat testing platform, demonstrating its usefulness in detecting 141 bugs and enhancing the developers' testing process.
Esmaeil Narimissa, David Raithel
The performance of Retrieval-Augmented Generation (RAG) systems in
information retrieval is significantly influenced by the characteristics of the
documents being processed. In this study, the structured nature of textbooks,
the conciseness of articles, and the narrative complexity of novels are shown
to require distinct retrieval strategies. A comparative evaluation of multiple
document-splitting methods reveals that the Recursive Character Splitter
outperforms the Token-based Splitter in preserving contextual integrity. A
novel evaluation technique is introduced, utilizing an open-source model to
generate a comprehensive dataset of question-and-answer pairs, simulating
realistic retrieval scenarios to enhance testing efficiency and metric
reliability. The evaluation employs weighted scoring metrics, including
SequenceMatcher, BLEU, METEOR, and BERT Score, to assess the system's accuracy
and relevance. This approach establishes a refined standard for evaluating the
precision of RAG systems, with future research focusing on optimizing chunk and
overlap sizes to improve retrieval accuracy and efficiency.
Authors' comments: This article is 16 pages long and includes detailed comparisons of
RAG systems and document splitting techniques
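The idea behind a recursive character splitter, as compared above, can be sketched as follows: try the coarsest separator first (paragraphs), and fall back to finer separators (lines, words, single characters) only for pieces that are still too long, so chunk boundaries tend to land at natural breaks. This is a simplified stand-alone illustration, not LangChain's `RecursiveCharacterTextSplitter`; the separator list and `chunk_size` are assumptions, chunk overlap is omitted, and `token_split` is a naive whitespace-token splitter included only for contrast.

```python
def recursive_split(text, chunk_size, separators=("\n\n", "\n", " ", "")):
    """Split `text` into chunks of at most `chunk_size` characters.

    Separators are tried from coarsest to finest; the final "" means
    "cut between characters" and must stay last in the tuple.
    """
    if len(text) <= chunk_size:
        return [text] if text else []
    sep, finer = separators[0], separators[1:]
    if sep == "":
        # Last resort: hard cut into fixed-size character windows.
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]
    chunks, current = [], ""
    for part in text.split(sep):
        candidate = part if not current else current + sep + part
        if len(candidate) <= chunk_size:
            current = candidate  # keep merging pieces while they fit
            continue
        if current:
            chunks.append(current)
        if len(part) <= chunk_size:
            current = part
        else:
            # This piece alone is too long: split it with finer separators.
            chunks.extend(recursive_split(part, chunk_size, finer))
            current = ""
    if current:
        chunks.append(current)
    return chunks


def token_split(text, tokens_per_chunk):
    """Naive token-based splitter: fixed windows of whitespace tokens, ignoring structure."""
    tokens = text.split()
    return [" ".join(tokens[i:i + tokens_per_chunk]) for i in range(0, len(tokens), tokens_per_chunk)]


text = "Intro paragraph one.\n\nSecond paragraph is here."
print(recursive_split(text, 25))  # ['Intro paragraph one.', 'Second paragraph is here.']
print(token_split(text, 4))       # ['Intro paragraph one. Second', 'paragraph is here.']
```

The fixed-size token windows cut across the paragraph boundary, whereas the recursive splitter keeps each paragraph intact, which illustrates why recursive splitting can better preserve contextual integrity.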
Gentiana Rashiti, Geethan Karunaratne, Mrinmaya Sachan, Abu Sebastian, Abbas Rahimi
Retrieval augmented generation (RAG) systems such as Retro have been shown to improve language modeling capabilities and reduce toxicity and hallucinations by retrieving from a database of non-parametric memory containing trillions of entries. We introduce Retro-li, which shows that retrieval can also help with a small-scale database, although a smaller, and hence sparser, non-parametric memory demands more accurate neighbors. This can be achieved with a proper semantic similarity search. We further propose, for the first time, adding a regularization to the non-parametric memory: it significantly reduces perplexity when the neighbor search operations are noisy during inference, and it improves generalization when a domain shift occurs. We also show that Retro-li's non-parametric memory can potentially be implemented on analog in-memory computing hardware, exhibiting O(1) search time while introducing noise into neighbor retrieval, with minimal (<1%) performance loss. Our code is available at: https://github.com/IBM/Retrieval-Enhanced-Transformer-Little.
Xun Xian, Ganghua Wang, Xuan Bi, Jayanth Srinivasa, Ashish Kundu, Charles Fleming, Mingyi Hong, Jie Ding
Retrieval-Augmented Generation (RAG) has been empirically shown to enhance
the performance of large language models (LLMs) in knowledge-intensive domains
such as healthcare, finance, and legal contexts. Given a query, RAG retrieves
relevant documents from a corpus and integrates them into the LLMs' generation
process. In this study, we investigate the adversarial robustness of RAG,
focusing specifically on examining the retrieval system. First, across 225
different setup combinations of corpus, retriever, query, and targeted
information, we show that retrieval systems are vulnerable to universal
poisoning attacks in medical Q\&A. In such attacks, adversaries generate
poisoned documents containing a broad spectrum of targeted information, such as
personally identifiable information. When these poisoned documents are inserted
into a corpus, they can be accurately retrieved by any user, as long as
attacker-specified queries are used. Investigating this vulnerability, we
discovered that the deviation from the query's embedding to that of the
poisoned document tends to follow a pattern in which the high similarity
between the poisoned document and the query is retained, thereby enabling
precise retrieval. Based on these findings, we develop a new detection-based
defense to ensure the safe use of RAG. Through extensive experiments spanning
various Q\&A domains, we observed that our proposed method consistently
achieves excellent detection rates in nearly all cases.
Authors' comments: Accepted by ICML 2025