Minsang Kim, Seungjun Baek
LLMs have boosted progress in many AI applications. Recently, there have been
attempts to distill the vast knowledge of LLMs into information retrieval
systems. These distillation methods mostly rely on the output probabilities of
LLMs, which are unavailable in the latest black-box LLMs. We propose Syntriever, a
training framework for retrievers using synthetic data from black-box LLMs.
Syntriever consists of two stages. First, in the distillation stage, we
synthesize relevant and plausibly irrelevant passages and augmented queries
for the given queries using chain-of-thought. The LLM is asked to self-verify the
synthetic data for possible hallucinations, after which retrievers are trained
with a loss designed to cluster the embeddings of relevant passages. Second,
in the alignment stage, we align the retriever with the preferences of LLMs. We
propose a preference modeling method called partial Plackett-Luce ranking to
learn LLM preferences, with regularization that prevents the model from deviating
excessively from the one trained in the distillation stage. Experiments show that
Syntriever achieves state-of-the-art performance in nDCG@$K$ on benchmark
datasets from various domains. The code is available at
https://github.com/kmswin1/Syntriever.
Authors' comments: the Nations of the Americas Chapter of the Association for
Computational Linguistics (NAACL), Findings, Accepted
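For readers unfamiliar with the nDCG@$K$ metric reported in the abstract above, a minimal sketch of its common definition follows. It assumes graded relevance labels with linear gain and a log2 rank discount (some evaluations use an exponential gain instead). The function names are illustrative and are not taken from the Syntriever code base, and the ideal ranking is computed only from the labels passed in.

```python
import math


def dcg_at_k(relevances, k):
    """Discounted cumulative gain over the first k relevance labels.

    Assumption: linear gain (rel) and a log2(rank + 1) discount,
    with ranks starting at 1.
    """
    return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(relevances[:k]))


def ndcg_at_k(relevances, k):
    """nDCG@k: DCG of the given ranking divided by the DCG of the ideal ranking.

    `relevances` lists the relevance label of each retrieved item in ranked order.
    Returns 0.0 when no item has positive relevance.
    """
    ideal_dcg = dcg_at_k(sorted(relevances, reverse=True), k)
    if ideal_dcg == 0:
        return 0.0
    return dcg_at_k(relevances, k) / ideal_dcg
```

A ranking that already lists items in decreasing relevance scores 1.0, and pushing a relevant item further down the list lowers the score.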
Xinyu Mao, Teerapong Leelanupab, Harrisen Scells, Guido Zuccon
Screening is a time-consuming and labour-intensive yet required task for
medical systematic reviews, as tens of thousands of studies often need to be
screened. Prioritising relevant studies to be screened allows downstream
systematic review creation tasks to start earlier and save time. In previous
work, we developed a dense retrieval method to prioritise relevant studies with
reviewer feedback during the title and abstract screening stage. Our method
outperforms previous active learning methods in both effectiveness and
efficiency. In this demo, we extend this prior work by creating (1) a web-based
screening tool that enables end-users to screen studies exploiting
state-of-the-art methods and (2) a Python library that integrates models and
feedback mechanisms and allows researchers to develop and demonstrate new
active learning methods. We describe the tool's design and showcase how it can
aid screening. The tool is available at https://densereviewer.ielab.io. The
source code is also open sourced at https://github.com/ielab/densereviewer.
Authors' comments: Accepted at ECIR 2025
Bo Lin, Shangwen Wang, Liqian Chen, Xiaoguang Mao
The integration of Large Language Models (LLMs) into software development has revolutionized the field, particularly through the use of Retrieval-Augmented Code Generation (RACG) systems that enhance code generation with information from external knowledge bases. However, the security implications of RACG systems, particularly the risks posed by vulnerable code examples in the knowledge base, remain largely unexplored. This risk is particularly concerning given that public code repositories, which often serve as the sources for knowledge base collection in RACG systems, are usually accessible to anyone in the community. Malicious attackers can exploit this accessibility to inject vulnerable code into the knowledge base, making it toxic. Once these poisoned samples are retrieved and incorporated into the generated code, they can propagate security vulnerabilities into the final product. This paper presents the first comprehensive study on the security risks associated with RACG systems, focusing on how vulnerable code in the knowledge base compromises the security of generated code. We investigate the security of LLM-generated code across different settings through extensive experiments using four major LLMs, two retrievers, and two poisoning scenarios. Our findings highlight the significant threat of knowledge base poisoning, where even a single poisoned code example can compromise up to 48% of generated code. Our findings provide crucial insights into vulnerability introduction in RACG systems and offer practical mitigation recommendations, thereby helping improve the security of LLM-generated code in future work.
Napat Laosaengpha, Thanit Tativannarat, Attapol Rutherford, Ekapol Chuangsuwanich
Understanding the textual components of resumes and job postings is critical
for improving job-matching accuracy and optimizing job search systems in online
recruitment platforms. However, existing works primarily focus on analyzing
individual components within this information, requiring multiple specialized
tools to analyze each aspect. Such disjointed methods could potentially hinder
overall generalizability in recruitment-related text processing. Therefore, we
propose a unified sentence encoder that uses a multi-task dual-encoder
framework to jointly learn multiple components. The results show that our
method outperforms other state-of-the-art
models, despite its smaller model size. Moreover, we propose a novel metric,
Language Bias Kullback-Leibler Divergence (LBKL), to evaluate language bias in
the encoder, demonstrating significant bias reduction and superior
cross-lingual performance.
Authors' comments: To be published in CompJobs Workshop at AAAI 2025
Xumeng Wen, Shun Zheng, Zhen Xu, Yiming Sun, Jiang Bian
Recent studies have shown that large language models (LLMs), when customized
with post-training on tabular data, can acquire general tabular in-context
learning (TabICL) capabilities. These models are able to transfer effectively
across diverse data schemas and different task domains. However, existing
LLM-based TabICL approaches are constrained to few-shot scenarios due to the
sequence length limitations of LLMs, as tabular instances represented in plain
text consume substantial tokens. To address this limitation and enable scalable
TabICL for any data size, we propose retrieval-augmented LLMs tailored to
tabular data. Our approach incorporates a customized retrieval module, combined
with retrieval-guided instruction-tuning for LLMs. This enables LLMs to
effectively leverage larger datasets, achieving significantly improved
performance across 69 widely recognized datasets and demonstrating promising
scaling behavior. Extensive comparisons with state-of-the-art tabular models
reveal that, while LLM-based TabICL still lags behind well-tuned numeric models
in overall performance, it uncovers powerful algorithms under limited contexts,
enhances ensemble diversity, and excels on specific datasets. These unique
properties underscore the potential of language as a universal and accessible
interface for scalable tabular data learning.
Authors' comments: Preprint
Seonok Kim
Large Language Models (LLMs) have demonstrated impressive capabilities across natural language processing tasks. However, their application to specialized domains such as medicine and biology requires further optimization to ensure factual accuracy, reliability, and contextual depth. We introduce MedBioLM, a domain-adapted biomedical question-answering model designed to enhance both short-form and long-form queries. By integrating fine-tuning and retrieval-augmented generation (RAG), MedBioLM dynamically incorporates domain-specific knowledge, improving reasoning abilities and factual accuracy. To evaluate its effectiveness, we fine-tuned the model on diverse biomedical QA datasets, covering structured multiple-choice assessments and complex clinical reasoning tasks. Fine-tuning significantly improves accuracy on benchmark datasets, while RAG enhances factual consistency. These results highlight the potential of domain-optimized LLMs in advancing biomedical research, medical education, and clinical decision support.
Atharva Mangeshkumar Agrawal, Rutika Pandurang Shinde, Vasanth Kumar Bhukya, Ashmita Chakraborty, Sagar Bharat Shah, Tanmay Shukla, Sree Pradeep Kumar Relangi, Nilesh Mutyam
Large language models (LLMs) have shown impressive capabilities in natural
language processing tasks, including dialogue generation. This research aims to
conduct a novel comparative analysis of two prominent techniques, fine-tuning
with LoRA (Low-Rank Adaptation) and the Retrieval-Augmented Generation (RAG)
framework, in the context of doctor-patient chat conversations with multiple
datasets of mixed medical domains. The analysis involves three state-of-the-art
models: Llama-2, GPT, and an LSTM model. Employing real-world doctor-patient
dialogues, we comprehensively evaluate the performance of models, assessing key
metrics such as language quality (perplexity, BLEU score), factual accuracy
(fact-checking against medical knowledge bases), adherence to medical
guidelines, and overall human judgments (coherence, empathy, safety). The
findings provide insights into the strengths and limitations of each approach,
shedding light on their suitability for healthcare applications. Furthermore,
the research investigates the robustness of the models in handling diverse
patient queries, ranging from general health inquiries to specific medical
conditions. The impact of domain-specific knowledge integration is also
explored, highlighting the potential for enhancing LLM performance through
targeted data augmentation and retrieval strategies.
Authors' comments: 12 pages
Xiao Hu, Eric Liu, Weizhou Wang, Xiangyu Guo, David Lie
Retrieval-Augmented Generation (RAG) offers a solution to mitigate hallucinations in Large Language Models (LLMs) by grounding their outputs to knowledge retrieved from external sources. The use of private resources and data in constructing these external data stores can expose them to risks of extraction attacks, in which attackers attempt to steal data from these private databases. Existing RAG extraction attacks often rely on manually crafted prompts, which limit their effectiveness. In this paper, we introduce a framework called MARAGE for optimizing an adversarial string that, when appended to user queries submitted to a target RAG system, causes outputs containing the retrieved RAG data verbatim. MARAGE leverages a continuous optimization scheme that integrates gradients from multiple models with different architectures simultaneously to enhance the transferability of the optimized string to unseen models. Additionally, we propose a strategy that emphasizes the initial tokens in the target RAG data, further improving the attack's generalizability. Evaluations show that MARAGE consistently outperforms both manual and optimization-based baselines across multiple LLMs and RAG datasets, while maintaining robust transferability to previously unseen models. Moreover, we conduct probing tasks to shed light on the reasons why MARAGE is more effective compared to the baselines and to analyze the impact of our approach on the model's internal state.
Natasha Maniar, Samantha W. T. Chan, Wazeer Zulfikar, Scott Ren, Christine Xu, Pattie Maes
Older adults have increasing difficulty with retrospective memory, hindering
their ability to perform daily activities and placing stress on caregivers to
ensure their wellbeing. Recent developments in Artificial Intelligence (AI) and
large context-aware multimodal models offer an opportunity to create memory
support systems that assist older adults with common issues like object
finding. This paper discusses the development of an AI-based, wearable memory
assistant, MemPal, that helps older adults with a common problem, finding lost
objects at home, and presents results from tests of the system in older adults'
own homes. Using visual context from a wearable camera, the multimodal LLM
system creates a real-time automated text diary of the person's activities for
memory support purposes, offering object retrieval assistance using a
voice-based interface. The system is designed to support additional use cases
like context-based proactive safety reminders and recall of past actions. We
report on a quantitative and qualitative study with N=15 older adults within
their own homes that showed improved performance of object finding with
audio-based assistance compared to no aid and positive overall user perceptions
of the designed system. We discuss further applications of MemPal's design as a
multi-purpose memory aid and future design guidelines to adapt memory
assistants to older adults' unique needs.
Authors' comments: 15 pages
Selin Aslan, Tristan van Leeuwen, Allard Mosk, Palina Salanevich
In phase retrieval and similar inverse problems, the stability of solutions across different noise levels is crucial for applications. One approach to promoting it is to use signal priors in the form of a generative model as regularization, at the expense of introducing a bias in the reconstruction. In this paper, we explore and compare the reconstruction properties of classical and generative inverse problem formulations. We propose a new unified reconstruction approach that mitigates overfitting to the generative model for varying noise levels.
Yuyang Gong, Zhuo Chen, Miaokun Chen, Fengchang Yu, Wei Lu, Xiaofeng Wang, Xiaozhong Liu, Jiawei Liu
Retrieval-Augmented Generation (RAG) systems based on Large Language Models (LLMs) have become essential for tasks such as question answering and content generation. However, their increasing impact on public opinion and information dissemination has made them a critical focus for security research due to inherent vulnerabilities. Previous studies have predominantly addressed attacks targeting factual or single-query manipulations. In this paper, we address a more practical scenario: topic-oriented adversarial opinion manipulation attacks on RAG models, where LLMs are required to reason and synthesize multiple perspectives, rendering them particularly susceptible to systematic knowledge poisoning. Specifically, we propose Topic-FlipRAG, a two-stage manipulation attack pipeline that strategically crafts adversarial perturbations to influence opinions across related queries. This approach combines traditional adversarial ranking attack techniques and leverages the extensive internal relevant knowledge and reasoning capabilities of LLMs to execute semantic-level perturbations. Experiments show that the proposed attacks effectively shift the opinion of the model's outputs on specific topics, significantly impacting user information perception. Current mitigation methods cannot effectively defend against such attacks, highlighting the necessity for enhanced safeguards for RAG systems, and offering crucial insights for LLM security research.
Dazhou Yu, Riyang Bao, Ruiyu Ning, Jinghong Peng, Gengchen Mai, Liang Zhao
Spatial reasoning remains a challenge for Large Language Models (LLMs), which struggle with spatial data retrieval and reasoning. We propose Spatial Retrieval-Augmented Generation (Spatial-RAG), a framework that extends RAG to spatial tasks by integrating sparse spatial retrieval (spatial databases) and dense semantic retrieval (LLM-based similarity). A multi-objective ranking strategy balances spatial constraints and semantic relevance, while an LLM-guided generator ensures coherent responses. Experiments on a real-world tourism dataset show that Spatial-RAG significantly improves spatial question answering, bridging the gap between LLMs and spatial intelligence.
Xinyan Guan, Jiali Zeng, Fandong Meng, Chunlei Xin, Yaojie Lu, Hongyu Lin, Xianpei Han, Le Sun et al.
Large Language Models (LLMs) have shown remarkable reasoning capabilities, while their practical applications are limited by severe factual hallucinations due to limitations in the timeliness, accuracy, and comprehensiveness of their parametric knowledge. Meanwhile, enhancing retrieval-augmented generation (RAG) with reasoning remains challenging due to ineffective task decomposition and redundant retrieval, which can introduce noise and degrade response quality. In this paper, we propose DeepRAG, a framework that models retrieval-augmented reasoning as a Markov Decision Process (MDP), enabling reasonable and adaptive retrieval. By iteratively decomposing queries, DeepRAG dynamically determines whether to retrieve external knowledge or rely on parametric reasoning at each step. Experiments show that DeepRAG improves retrieval efficiency and boosts answer accuracy by 26.4%, demonstrating its effectiveness in enhancing retrieval-augmented reasoning.
Yuanhuiyi Lyu, Xu Zheng, Lutao Jiang, Yibo Yan, Xin Zou, Huiyu Zhou, Linfeng Zhang, Xuming Hu
Recent text-to-image generative models, e.g., Stable Diffusion V3 and Flux,
have achieved notable progress. However, these models are strongly restricted
to their limited knowledge, a.k.a., their own fixed parameters, that are
trained with closed datasets. This leads to significant hallucinations or
distortions when facing fine-grained and unseen novel real-world objects, e.g.,
the appearance of the Tesla Cybertruck. To this end, we present the first
real-object-based retrieval-augmented generation framework (RealRAG), which
augments fine-grained and unseen novel object generation by learning and
retrieving real-world images to overcome the knowledge gaps of generative
models. Specifically, to integrate missing memory for unseen novel object
generation, we train a reflective retriever by self-reflective contrastive
learning, which injects the generator's knowledge into the self-reflective
negatives, ensuring that the retrieved augmented images compensate for the
model's missing knowledge. Furthermore, the real-object-based framework
integrates fine-grained visual knowledge for the generative models, tackling
the distortion problem and improving the realism for fine-grained object
generation. Our RealRAG is superior in its modular application to all types of
state-of-the-art text-to-image generative models and also delivers remarkable
performance boosts with all of them, such as a gain of 16.18% FID score with
the auto-regressive model on the Stanford Car benchmark.
Authors' comments: Accepted to ICML2025
Rajat Keshri, Arun George Zachariah, Michael Boone
Ensuring that code accurately reflects the algorithms and methods described in research papers is critical for maintaining credibility and fostering trust in AI research. This paper presents a novel system designed to verify code implementations against the algorithms and methodologies outlined in corresponding research papers. Our system employs Retrieval-Augmented Generation to extract relevant details from both the research papers and code bases, followed by a structured comparison using Large Language Models. This approach improves the accuracy and comprehensiveness of code implementation verification while contributing to the transparency, explainability, and reproducibility of AI research. By automating the verification process, our system reduces manual effort, enhances research credibility, and ultimately advances the state of the art in code verification.
Genc Hoxha, Olivér Angyal, Begüm Demir
The development of image time series retrieval (ITSR) methods is a growing research interest in remote sensing (RS). Given a user-defined image time series (i.e., the query time series), the ITSR methods search and retrieve from large archives the image time series that have similar content to the query time series. The existing ITSR methods in RS are designed for unimodal retrieval problems, limiting their usability and versatility. To overcome this issue, for the first time in RS, we introduce the task of cross-modal text-ITSR. In particular, we present a self-supervised cross-modal text-image time series retrieval (text-ITSR) method that enables the retrieval of image time series using text sentences as queries, and vice versa. In detail, we focus our attention on text-ITSR in pairs of images (i.e., bitemporal images). The proposed text-ITSR method consists of two key components: 1) modality-specific encoders to model the semantic content of bitemporal images and text sentences with discriminative features; and 2) modality-specific projection heads to align textual and image representations in a shared embedding space. To effectively model the temporal information within the bitemporal images, we introduce two fusion strategies: i) a global feature fusion (GFF) strategy that combines global image features through simple yet effective operators; and ii) a transformer-based feature fusion (TFF) strategy that leverages transformers for fine-grained temporal integration. Extensive experiments conducted on two benchmark RS archives demonstrate the effectiveness of the proposed method in accurately retrieving semantically relevant bitemporal images (or text sentences) to a query text sentence (or bitemporal image). The code of this work is publicly available at https://git.tu-berlin.de/rsim/cross-modal-text-tsir.
Manveer Singh Tamber, Jimmy Lin
Consider a scenario in which a user searches for information, only to encounter texts flooded with misleading or non-relevant content. This scenario exemplifies a simple yet potent vulnerability in neural Information Retrieval (IR) pipelines: content injection attacks. We find that embedding models for retrieval, rerankers, and large language model (LLM) relevance judges are vulnerable to these attacks, in which adversaries insert misleading text into passages to manipulate model judgements. We identify two primary threats: (1) inserting unrelated or harmful content within passages that still appear deceptively "relevant", and (2) inserting entire queries or key query terms into passages to boost their perceived relevance. While the second tactic has been explored in prior research, we present, to our knowledge, the first empirical analysis of the first threat, demonstrating how state-of-the-art models can be easily misled. Our study systematically examines the factors that influence an attack's success, such as the placement of injected content and the balance between relevant and non-relevant material. Additionally, we explore various defense strategies, including adversarial passage classifiers, retriever fine-tuning to discount manipulated content, and prompting LLM judges to adopt a more cautious approach. However, we find that these countermeasures often involve trade-offs, sacrificing effectiveness for attack robustness and sometimes penalizing legitimate documents in the process. Our findings highlight the need for stronger defenses against these evolving adversarial strategies to maintain the trustworthiness of IR systems. We release our code and scripts to facilitate further research.
Youssef Maklad, Fares Wael, Wael Elsersy, Ali Hamdi
This paper presents a novel approach to evaluate the efficiency of a RAG-based agentic Large Language Model (LLM) architecture in network packet seed generation for network protocol fuzzing. Enhanced by chain-of-thought (CoT) prompting techniques, the proposed approach focuses on improving the structural quality of the seeds in order to guide protocol fuzzing frameworks through a wide exploration of the protocol state space. Our method leverages RAG and text embeddings in two stages. In the first stage, the agent dynamically refers to a knowledge base of Request For Comments (RFC) documents to answer queries regarding the protocol Finite State Machine (FSM), then iteratively reasons through the retrieved knowledge for output refinement and proper seed placement. In the second stage, we evaluate the structural quality of the agent's output using metrics such as BLEU, ROUGE, and Word Error Rate (WER), comparing the generated packets against the ground truth packets. Our experiments demonstrate significant improvements of up to 18.19%, 14.81%, and 23.45% in BLEU, ROUGE, and WER, respectively, over baseline models. These results confirm the potential of such an approach for improving LLM-based protocol fuzzing frameworks for the identification of hidden vulnerabilities.
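As an illustration of one metric named in the abstract above, the following is a minimal sketch of Word Error Rate computed as word-level edit distance divided by reference length. It assumes whitespace tokenization of the packet text, which the abstract does not specify, and the function name is illustrative rather than taken from the authors' implementation.

```python
def word_error_rate(reference, hypothesis):
    """WER = (substitutions + deletions + insertions) / number of reference words.

    Assumption: both inputs are strings tokenized on whitespace.
    """
    ref_words = reference.split()
    hyp_words = hypothesis.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")

    # Row-by-row Levenshtein distance over words.
    previous = list(range(len(hyp_words) + 1))
    for i, ref_word in enumerate(ref_words, start=1):
        current = [i]
        for j, hyp_word in enumerate(hyp_words, start=1):
            substitution_cost = 0 if ref_word == hyp_word else 1
            current.append(
                min(
                    previous[j] + 1,  # deletion of a reference word
                    current[j - 1] + 1,  # insertion of a hypothesis word
                    previous[j - 1] + substitution_cost,  # match or substitution
                )
            )
        previous = current
    return previous[-1] / len(ref_words)
```

Lower values indicate a closer match to the reference; identical sequences give 0.0, and the value can exceed 1.0 when the hypothesis is much longer than the reference.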
Subhankar Maity, Aniket Deroy, Sudeshna Sarkar
Question generation in education is a time-consuming and cognitively
demanding task, as it requires creating questions that are both contextually
relevant and pedagogically sound. Current automated question generation methods
often generate questions that are out of context. In this work, we explore
advanced techniques for automated question generation in educational contexts,
focusing on In-Context Learning (ICL), Retrieval-Augmented Generation (RAG),
and a novel Hybrid Model that merges both methods. We use GPT-4 for ICL
using few-shot examples and BART with a retrieval module for RAG. The Hybrid
Model combines RAG and ICL to address these issues and improve question
quality. Evaluation is conducted using automated metrics, followed by human
evaluation metrics. Our results show that both the ICL approach and the Hybrid
Model consistently outperform other methods, including baseline models, by
generating more contextually accurate and relevant questions.
Authors' comments: Accepted at the 16th Meeting of the Forum for Information Retrieval
Evaluation as a Regular Paper
Mohamed Nomeir, Alptug Aytekin, Sennur Ulukus
We consider the problem of finding the asymptotic capacity of symmetric private information retrieval (SPIR) with $B$ Byzantine servers. Prior to finding the capacity, a definition for the Byzantine servers is needed, since there are two different definitions in the literature. In \cite{byzantine_tpir}, where it was first defined, the Byzantine servers can send any symbol from the storage, their received queries and some independent random symbols. In \cite{unresponsive_byzantine_1}, Byzantine servers send any random symbol independently of their storage and queries. It is clear that these definitions are not identical, especially when \emph{symmetric} privacy is required. To that end, we define Byzantine servers, inspired by \cite{byzantine_tpir}, as the servers that can share everything, before and after the scheme initiation. In this setting, we find an upper bound, for the case of an infinite number of messages, that must be satisfied by all schemes that protect against this setting, and we develop a scheme that achieves this upper bound. Hence, we identify the capacity of the problem.