Anand Selvadurai, Jasheen Shaik, Girish Chandrasekar, ShriRadhaKrishnan Balamurugan, Eswara Reddy
Embedding models have become essential for retrieval-augmented generation
(RAG) tasks, semantic clustering, and text re-ranking. But despite their
growing use, many of these come with notable limitations. For example, Jina
fails to capture the semantic content of medical documents, while models such
as MiniLM often perform poorly on long-form documents. Domain-adapted models,
while specialized, often underperform in general-purpose tasks, reducing their
overall applicability. General-domain tokenizers often misinterpret medical
vocabulary. The limitations of current embedding models, whether in
tokenization accuracy, domain comprehension, or handling long sequences,
highlight the need for more versatile solutions. In this work, we present
MedEIR, a novel embedding model and tokenizer jointly optimized for both
medical and general NLP tasks, incorporating ALiBi-based long-context
processing to support sequences of up to 8,192 tokens. MedEIR was pre-trained
on only 6 billion tokens, significantly fewer than Jina's, followed by
fine-tuning on 3 million sentence pairs. MedEIR consistently outperforms Jina
V2 and MiniLM across MTEB benchmarks, achieving top scores on ArguAna (55.24),
NFCorpus (38.44), MedicalQARetrieval (74.25), SciFact (72.04), and TRECCOVID
(79.56). These results highlight the potential of MedEIR as a highly effective
embedding model, demonstrating strong performance across both general-purpose
and domain-specific tasks and outperforming existing models on multiple
benchmarks.
Authors' comments: 9 pages, 1 figure. This manuscript is a substantial revision of a
previously submitted paper. We have explicitly clarified novelty,
strengthened scholarly depth, and expanded experimental validation
Wei Yang, Jingjing Fu, Rui Wang, Jinyu Wang, Lei Song, Jiang Bian
Vision-language retrieval-augmented generation (RAG) has become an effective
approach for tackling Knowledge-Based Visual Question Answering (KB-VQA), which
requires external knowledge beyond the visual content presented in images. The
effectiveness of Vision-language RAG systems hinges on multimodal retrieval,
which is inherently challenging due to the diverse modalities and knowledge
granularities in both queries and knowledge bases. Existing methods have not
fully tapped into the potential interplay between these elements. We propose a
multimodal RAG system featuring a coarse-to-fine, multi-step retrieval that
harmonizes multiple granularities and modalities to enhance efficacy. Our
system begins with a broad initial search aligning knowledge granularity for
cross-modal retrieval, followed by a multimodal fusion reranking to capture the
nuanced multimodal information for top entity selection. A text reranker then
filters out the most relevant fine-grained section for augmented generation.
Extensive experiments on the InfoSeek and Encyclopedic-VQA benchmarks show our
method achieves state-of-the-art retrieval performance and highly competitive
answering results, underscoring its effectiveness in advancing KB-VQA systems.
Authors' comments: 19 pages, 6 figures, 17 tables
Ozan Gokdemir, Carlo Siebenschuh, Alexander Brace, Azton Wells, Brian Hsu, Kyle Hippe, Priyanka V. Setty, Aswathy Ajith et al.
The volume of scientific literature is growing exponentially, leading to
underutilized discoveries, duplicated efforts, and limited cross-disciplinary
collaboration. Retrieval Augmented Generation (RAG) offers a way to assist
scientists by improving the factuality of Large Language Models (LLMs) in
processing this influx of information. However, scaling RAG to handle millions
of articles introduces significant challenges, including the high computational
costs associated with parsing documents and embedding scientific knowledge, as
well as the algorithmic complexity of aligning these representations with the
nuanced semantics of scientific content. To address these issues, we introduce
HiPerRAG, a RAG workflow powered by high performance computing (HPC) to index
and retrieve knowledge from more than 3.6 million scientific articles. At its
core are Oreo, a high-throughput model for multimodal document parsing, and
ColTrast, a query-aware encoder fine-tuning algorithm that enhances retrieval
accuracy by using contrastive learning and late-interaction techniques.
HiPerRAG delivers robust performance on existing scientific question answering
benchmarks and two new benchmarks introduced in this work, achieving 90%
accuracy on SciQ and 76% on PubMedQA-outperforming both domain-specific models
like PubMedGPT and commercial LLMs such as GPT-4. Scaling to thousands of GPUs
on the Polaris, Sunspot, and Frontier supercomputers, HiPerRAG delivers million
document-scale RAG workflows for unifying scientific knowledge and fostering
interdisciplinary innovation.
Authors' comments: This paper has been accepted at the Platform for Advanced Scientific
Computing Conference (PASC 25), June 16-18, 2025, Brugg-Windisch, Switzerland
Richard Zhipeng Wang, Guangyao Li, Silvia Gentilini, Marcello Calvanese Strinati, Claudio Conti, Natalia G. Berloff
Phase-retrieval from coded diffraction patterns (CDP) is important to X-ray
crystallography, diffraction tomography and astronomical imaging, yet remains a
hard, non-convex inverse problem. We show that CDP recovery can be reformulated
exactly as the minimisation of a continuous-variable XY Hamiltonian and solved
by gain-based photonic networks. The coupled-mode equations we exploit are the
natural mean-field dynamics of exciton-polariton condensate lattices,
coupled-laser arrays and driven photon Bose-Einstein condensates, while other
hardware such as the spatial photonic Ising machine can implement the same
update rule through high-speed digital feedback, preserving full optical
parallelism. Numerical experiments on images, two- and three-dimensional
vortices and unstructured complex data demonstrate that the gain-based solver
consistently outperforms the state-of-the-art Relaxed-Reflect-Reflect (RRR)
algorithm in the medium-noise regime (signal-to-noise ratios 10--40 dB) and
retains this advantage as problem size scales. Because the physical platform
performs the continuous optimisation, our approach promises fast,
energy-efficient phase retrieval on readily available photonic hardware. uch as
two- and three-dimensional vortices, and unstructured random data. Moreover,
the solver's accuracy remains high as problem sizes increase, underscoring its
scalability.
Authors' comments: 11 pages, 7 figures
Tal Amir, Tamir Bendory, Nadav Dym, Dan Edidin
The generalized phase retrieval problem over compact groups aims to recover a set of matrices, representing an unknown signal, from their associated Gram matrices, leveraging prior structural knowledge about the signal. This framework generalizes the classical phase retrieval problem, which reconstructs a signal from the magnitudes of its Fourier transform, to a richer setting involving non-abelian compact groups. In this broader context, the unknown phases in Fourier space are replaced by unknown orthogonal matrices that arise from the action of a compact group on a finite-dimensional vector space. This problem is primarily motivated by advances in electron microscopy to determining the 3D structure of biological macromolecules from highly noisy observations. To capture realistic assumptions from machine learning and signal processing, we model the signal as belonging to one of several broad structural families: a generic linear subspace, a sparse representation in a generic basis, the output of a generic ReLU neural network, or a generic low-dimensional manifold. Our main result shows that, under mild conditions, the generalized phase retrieval problem not only admits a unique solution (up to inherent group symmetries), but also satisfies a bi-Lipschitz property. This implies robustness to both noise and model mismatch, an essential requirement for practical use, especially when measurements are severely corrupted by noise. These findings provide theoretical support for a wide class of scientific problems under modern structural assumptions, and they offer strong foundations for developing robust algorithms in high-noise regimes.
Yuzhou Zhu, Zheng Zhang, Ruyi Zhang, Liang Zhou
Attosecond streaking phase retrieval is essential for resolving electron dynamics on sub-femtosecond time scales yet traditional algorithms rely on iterative minimization and central momentum approximations that degrade accuracy for broadband pulses. In this work phase retrieval is reformulated as a supervised computer-vision problem and four neural architectures are systematically compared. A convolutional network demonstrates strong sensitivity to local streak edges but lacks global context; a vision transformer captures long-range delay-energy correlations at the expense of local inductive bias; a hybrid CNN-ViT model unites local feature extraction and full-graph attention; and a capsule network further enforces spatial pose agreement through dynamic routing. A theoretical analysis introduces local, global and positional sensitivity measures and derives surrogate error bounds that predict the strict ordering $CNN<ViT<Hybrid<Capsule$. Controlled experiments on synthetic streaking spectrograms confirm this hierarchy, with the capsule network achieving the highest retrieval fidelity. Looking forward, embedding the strong-field integral into physics-informed neural networks and exploring photonic hardware implementations promise pathways toward real-time attosecond pulse characterization under demanding experimental conditions.
Camilla Hollanti, Neehar Verma
A private information retrieval (PIR) scheme is a protocol that allows a user
to retrieve a file from a database without revealing the identity of the
desired file to a curious database. Given a distributed data storage system,
efficient PIR can be achieved by making assumptions about the colluding
capabilities of the storage servers holding the database. If these assumptions
turn out to be incorrect, privacy is lost. In this work, we focus on the
worst-case assumption: full collusion or, equivalently, viewing the storage
system virtually as a single honest-but-curious server. We present CB-cPIR, a
single-server code-based computational private information retrieval (cPIR)
scheme that derives security from code-based cryptography. Specifically, the
queries are protected by the hardness of decoding a random linear code. The
scheme is heavily inspired by the pioneering code-based cPIR scheme proposed by
Holzbaur, Hollanti, and Wachter-Zeh in [Holzbaur et al., "Computational
Code-Based Single-Server Private Information Retrieval", 2020 IEEE ISIT] and
fixes the vulnerabilities of the original scheme arising from highly probable
rank differences in submatrices of the user's query. For further validation, we
draw comparisons to the state-of-the-art lattice-based cPIR schemes.
Authors' comments: This paper builds on the work done in arXiv: 2402.02871v1 (IEEE
ISIT24) and arXiv: 2001.07049 (IEEE ISIT20)
Sameer Malik, Moyuru Yamada, Ayush Singh, Dishank Aggarwal
Comprehending long videos remains a significant challenge for Large Multi-modal Models (LMMs). Current LMMs struggle to process even minutes to hours videos due to their lack of explicit memory and retrieval mechanisms. To address this limitation, we propose RAVU (Retrieval Augmented Video Understanding), a novel framework for video understanding enhanced by retrieval with compositional reasoning over a spatio-temporal graph. We construct a graph representation of the video, capturing both spatial and temporal relationships between entities. This graph serves as a long-term memory, allowing us to track objects and their actions across time. To answer complex queries, we decompose the queries into a sequence of reasoning steps and execute these steps on the graph, retrieving relevant key information. Our approach enables more accurate understanding of long videos, particularly for queries that require multi-hop reasoning and tracking objects across frames. Our approach demonstrate superior performances with limited retrieved frames (5-10) compared with other SOTA methods and baselines on two major video QA datasets, NExT-QA and EgoSchema.
Zhengliang Shi, Lingyong Yan, Weiwei Sun, Yue Feng, Pengjie Ren, Xinyu Ma, Shuaiqiang Wang, Dawei Yin et al.
Retrieval-augmented generation (RAG) integrates large language models ( LLM s) with retrievers to access external knowledge, improving the factuality of LLM generation in knowledge-grounded tasks. To optimize the RAG performance, most previous work independently fine-tunes the retriever to adapt to frozen LLM s or trains the LLMs to use documents retrieved by off-the-shelf retrievers, lacking end-to-end training supervision. Recent work addresses this limitation by jointly training these two components but relies on overly simplifying assumptions of document independence, which has been criticized for being far from real-world scenarios. Thus, effectively optimizing the overall RAG performance remains a critical challenge. We propose a direct retrieval-augmented optimization framework, named DRO, that enables end-to-end training of two key components: (i) a generative knowledge selection model and (ii) an LLM generator. DRO alternates between two phases: (i) document permutation estimation and (ii) re-weighted maximization, progressively improving RAG components through a variational approach. In the estimation step, we treat document permutation as a latent variable and directly estimate its distribution from the selection model by applying an importance sampling strategy. In the maximization step, we calibrate the optimization expectation using importance weights and jointly train the selection model and LLM generator. Our theoretical analysis reveals that DRO is analogous to policy-gradient methods in reinforcement learning. Extensive experiments conducted on five datasets illustrate that DRO outperforms the best baseline with 5%-15% improvements in EM and F1. We also provide in-depth experiments to qualitatively analyze the stability, convergence, and variance of DRO.
Madhukar Reddy Vongala, Saurabh Srivastava, Jana Košecká
Vision-language pretraining on large datasets of images-text pairs is one of
the main building blocks of current Vision-Language Models. While with
additional training, these models excel in various downstream tasks, including
visual question answering, image captioning, and visual commonsense reasoning.
However, a notable weakness of pretrained models like CLIP, is their inability
to perform entity grounding and compositional image and text
matching~\cite{Jiang2024ComCLIP, yang2023amc, Rajabi2023GroundedVSR,
learninglocalizeCVPR24}. In this work we propose a novel learning-free
zero-shot augmentation of CLIP embeddings that has favorable compositional
properties. We compute separate embeddings of sub-images of object entities and
relations that are localized by the state of the art open vocabulary detectors
and dynamically adjust the baseline global image embedding. % The final
embedding is obtained by computing a weighted combination of the sub-image
embeddings. The resulting embedding is then utilized for similarity computation
with text embedding, resulting in a average 1.5\% improvement in image-text
matching accuracy on the Visual Genome and SVO Probes
datasets~\cite{krishna2017visualgenome, svo}. Notably, the enhanced embeddings
demonstrate superior retrieval performance, thus achieving significant gains on
the Flickr30K and MS-COCO retrieval benchmarks~\cite{flickr30ke, mscoco},
improving the state-of-the-art Recall@1 by 12\% and 0.4\%, respectively. Our
code is available at https://github.com/madhukarreddyvongala/GroundingCLIP.
Authors' comments: Accepted at CVPR-W
David Nazareno Campo, Javier Conde, Álvaro Alonso, Gabriel Huecas, Joaquín Salvachúa, Pedro Reviriego
The proliferation of Generative Artificial Ingelligence (AI), especially Large Language Models, presents transformative opportunities for urban applications through Urban Foundation Models. However, base models face limitations, as they only contain the knowledge available at the time of training, and updating them is both time-consuming and costly. Retrieval Augmented Generation (RAG) has emerged in the literature as the preferred approach for injecting contextual information into Foundation Models. It prevails over techniques such as fine-tuning, which are less effective in dynamic, real-time scenarios like those found in urban environments. However, traditional RAG architectures, based on semantic databases, knowledge graphs, structured data, or AI-powered web searches, do not fully meet the demands of urban contexts. Urban environments are complex systems characterized by large volumes of interconnected data, frequent updates, real-time processing requirements, security needs, and strong links to the physical world. This work proposes a real-time spatial RAG architecture that defines the necessary components for the effective integration of generative AI into cities, leveraging temporal and spatial filtering capabilities through linked data. The proposed architecture is implemented using FIWARE, an ecosystem of software components to develop smart city solutions and digital twins. The design and implementation are demonstrated through the use case of a tourism assistant in the city of Madrid. The use case serves to validate the correct integration of Foundation Models through the proposed RAG architecture.
Malte Mosbach, Sven Behnke
Building models responsive to input prompts represents a transformative shift in machine learning. This paradigm holds significant potential for robotics problems, such as targeted manipulation amidst clutter. In this work, we present a novel approach to combine promptable foundation models with reinforcement learning (RL), enabling robots to perform dexterous manipulation tasks in a prompt-responsive manner. Existing methods struggle to link high-level commands with fine-grained dexterous control. We address this gap with a memory-augmented student-teacher learning framework. We use the Segment-Anything 2 (SAM 2) model as a perception backbone to infer an object of interest from user prompts. While detections are imperfect, their temporal sequence provides rich information for implicit state estimation by memory-augmented models. Our approach successfully learns prompt-responsive policies, demonstrated in picking objects from cluttered scenes. Videos and code are available at https://memory-student-teacher.github.io
Manak Raj, Nidhi Mishra
This review paper explores recent advancements and emerging approaches in
Information Retrieval (IR) applied to Natural Language Processing (NLP). We
examine traditional IR models such as Boolean, vector space, probabilistic, and
inference network models, and highlight modern techniques including deep
learning, reinforcement learning, and pretrained transformer models like BERT.
We discuss key tools and libraries - Lucene, Anserini, and Pyserini - for
efficient text indexing and search. A comparative analysis of sparse, dense,
and hybrid retrieval methods is presented, along with applications in web
search engines, cross-language IR, argument mining, private information
retrieval, and hate speech detection. Finally, we identify open challenges and
future research directions to enhance retrieval accuracy, scalability, and
ethical considerations.
Authors' comments: 12 pages, 4 figures, comprehensive literature review covering six key
IR-NLP papers, plus keywords and full reference list
Akshay Kekuda, Yuyang Zhang, Arun Udayashankar
In this abstract we present a series of optimizations we performed on the
two-tower model architecture [14], and training and evaluation datasets to
implement semantic product search at Best Buy. Search queries on bestbuy.com
follow the pareto distribution whereby a minority of them account for most
searches. This leaves us with a long tail of search queries that have low
frequency of issuance. The queries in the long tail suffer from very spare
interaction signals. Our current work focuses on building a model to serve the
long tail queries. We present a series of optimizations we have done to this
model to maximize conversion for the purpose of retrieval from the catalog. The
first optimization we present is using a large language model to improve the
sparsity of conversion signals. The second optimization is pretraining an
off-the-shelf transformer-based model on the Best Buy catalog data. The third
optimization we present is on the finetuning front. We use query-to-query pairs
in addition to query-to-product pairs and combining the above strategies for
finetuning the model. We also demonstrate how merging the weights of these
finetuned models improves the evaluation metrics. Finally, we provide a recipe
for curating an evaluation dataset for continuous monitoring of model
performance with human-in-the-loop evaluation. We found that adding this recall
mechanism to our current term match-based recall improved conversion by 3% in
an online A/B test.
Authors' comments: Published at RecSys '24: Proceedings of the 18th ACM Conference on
Recommender Systems
Run Ling, Wenji Wang, Yuting Liu, Guibing Guo, Linying Jiang, Xingwei Wang
Personalized image generation is crucial for improving the user experience, as it renders reference images into preferred ones according to user visual preferences. Although effective, existing methods face two main issues. First, existing methods treat all items in the user historical sequence equally when extracting user preferences, overlooking the varying semantic similarities between historical items and the reference item. Disproportionately high weights for low-similarity items distort users' visual preferences for the reference item. Second, existing methods heavily rely on consistency between generated and reference images to optimize the generation, which leads to underfitting user preferences and hinders personalization. To address these issues, we propose Retrieval Augment Personalized Image GenerAtion guided by Recommendation (RAGAR). Our approach uses a retrieval mechanism to assign different weights to historical items according to their similarities to the reference item, thereby extracting more refined users' visual preferences for the reference item. Then we introduce a novel rank task based on the multi-modal ranking model to optimize the personalization of the generated images instead of forcing depend on consistency. Extensive experiments and human evaluations on three real-world datasets demonstrate that RAGAR achieves significant improvements in both personalization and semantic metrics compared to five baselines.
Marco Braga, Pranav Kasela, Alessandro Raganato, Gabriella Pasi
Large Language Models (LLMs) have shown impressive zero-shot performance
across a variety of Natural Language Processing tasks, including document
re-ranking. However, their effectiveness degrades on unseen tasks and domains,
largely due to shifts in vocabulary and word distributions. In this paper, we
investigate Task Arithmetic, a technique that combines the weights of LLMs
pre-trained on different tasks or domains via simple mathematical operations,
such as addition or subtraction, to adapt retrieval models without requiring
additional fine-tuning. Our method is able to synthesize diverse tasks and
domain knowledge into a single model, enabling effective zero-shot adaptation
in different retrieval contexts. Extensive experiments on publicly available
scientific, biomedical, and multilingual datasets show that our method improves
state-of-the-art re-ranking performance by up to 18% in NDCG@10 and 15% in
P@10. In addition to these empirical gains, our analysis provides insights into
the strengths and limitations of Task Arithmetic as a practical strategy for
zero-shot learning and model adaptation. We make our code publicly available at
https://github.com/DetectiveMB/Task-Arithmetic-for-ZS-IR.
Authors' comments: Accepted in SIGIR '25
Maxime Bouthors, Josep Crego, François Yvon
Conventional retrieval-augmented neural machine translation (RANMT) systems
leverage bilingual corpora, e.g., translation memories (TMs). Yet, in many
settings, in-domain monolingual target-side corpora are often available. This
work explores ways to take advantage of such resources by retrieving relevant
segments directly in the target language, based on a source-side query. For
this, we design improved cross-lingual retrieval systems, trained with both
sentence level and word-level matching objectives. In our experiments with two
RANMT architectures, we first demonstrate the benefits of such cross-lingual
objectives in a controlled setting, obtaining translation performances that
surpass standard TM-based models. We then showcase our method on a real-world
set-up, where the target monolingual resources far exceed the amount of
parallel data and observe large improvements of our new techniques, which
outperform both the baseline setting, and general-purpose cross-lingual
retrievers.
Authors' comments: 13 pages
Baolei Zhang, Haoran Xin, Minghong Fang, Zhuqing Liu, Biao Yi, Tong Li, Zheli Liu
Large language models (LLMs) integrated with retrieval-augmented generation
(RAG) systems improve accuracy by leveraging external knowledge sources.
However, recent research has revealed RAG's susceptibility to poisoning
attacks, where the attacker injects poisoned texts into the knowledge database,
leading to attacker-desired responses. Existing defenses, which predominantly
focus on inference-time mitigation, have proven insufficient against
sophisticated attacks. In this paper, we introduce RAGForensics, the first
traceback system for RAG, designed to identify poisoned texts within the
knowledge database that are responsible for the attacks. RAGForensics operates
iteratively, first retrieving a subset of texts from the database and then
utilizing a specially crafted prompt to guide an LLM in detecting potential
poisoning texts. Empirical evaluations across multiple datasets demonstrate the
effectiveness of RAGForensics against state-of-the-art poisoning attacks. This
work pioneers the traceback of poisoned texts in RAG systems, providing a
practical and promising defense mechanism to enhance their security.
Authors' comments: Accepted by The Web Conference 2025
Cristina Ioana Muntean, Franco Maria Nardini, Raffaele Perego, Guido Rocchietti, Cosimo Rulli
Pre-trained language models have been widely exploited to learn dense
representations of documents and queries for information retrieval. While
previous efforts have primarily focused on improving effectiveness and user
satisfaction, response time remains a critical bottleneck of conversational
search systems. To address this, we exploit the topical locality inherent in
conversational queries, i.e., the tendency of queries within a conversation to
focus on related topics. By leveraging query embedding similarities, we
dynamically restrict the search space to semantically relevant document
clusters, reducing computational complexity without compromising retrieval
quality. We evaluate our approach on the TREC CAsT 2019 and 2020 datasets using
multiple embedding models and vector indexes, achieving improvements in
processing speed of up to 10.4X with little loss in performance (4.4X without
any loss). Our results show that the proposed system effectively handles
complex, multiturn queries with high precision and efficiency, offering a
practical solution for real-time conversational search.
Authors' comments: 5 pages, 2 figures, SIGIR 2025
Máté Gedeon
Speech Event Extraction (SpeechEE) is a challenging task that lies at the intersection of Automatic Speech Recognition (ASR) and Natural Language Processing (NLP), requiring the identification of structured event information from spoken language. In this work, we present a modular, pipeline-based SpeechEE framework that integrates high-performance ASR with semantic search-enhanced prompting of Large Language Models (LLMs). Our system first classifies speech segments likely to contain events using a hybrid filtering mechanism including rule-based, BERT-based, and LLM-based models. It then employs few-shot LLM prompting, dynamically enriched via semantic similarity retrieval, to identify event triggers and extract corresponding arguments. We evaluate the pipeline using multiple LLMs (Llama3-8B, GPT-4o-mini, and o1-mini) highlighting significant performance gains with o1-mini, which achieves 63.3% F1 on trigger classification and 27.8% F1 on argument classification, outperforming prior benchmarks. Our results demonstrate that pipeline approaches, when empowered by retrieval-augmented LLMs, can rival or exceed end-to-end systems while maintaining interpretability and modularity. This work provides practical insights into LLM-driven event extraction and opens pathways for future hybrid models combining textual and acoustic features.