Mengyao Xu, Gabriel Moreira, Ronay Ak, Radek Osmulski, Yauhen Babakhin, Zhiding Yu, Benedikt Schifferer, Even Oldridge
Motivated by the growing demand for retrieval systems that operate across modalities, we introduce llama-nemoretriever-colembed, a unified text-image retrieval model that delivers state-of-the-art performance across multiple benchmarks. We release two model variants, 1B and 3B. The 3B model achieves state of the art performance, scoring NDCG@5 91.0 on ViDoRe V1 and 63.5 on ViDoRe V2, placing first on both leaderboards as of June 27, 2025. Our approach leverages the NVIDIA Eagle2 Vision-Language model (VLM), modifies its architecture by replacing causal attention with bidirectional attention, and integrates a ColBERT-style late interaction mechanism to enable fine-grained multimodal retrieval in a shared embedding space. While this mechanism delivers superior retrieval accuracy, it introduces trade-offs in storage and efficiency. We provide a comprehensive analysis of these trade-offs. Additionally, we adopt a two-stage training strategy to enhance the model's retrieval capabilities.
Hengran Zhang, Keping Bi, Jiafeng Guo
While large language models (LLMs) are increasingly deployed as dense
retrievers, the impact of their domain-specific specialization on retrieval
effectiveness remains underexplored. This investigation systematically examines
how task-specific adaptations in LLMs influence their retrieval capabilities,
an essential step toward developing unified retrievers capable of handling
text, code, images, and multimodal content. We conduct extensive experiments
with eight Qwen2.5 7B LLMs, including base, instruction-tuned,
code/math-specialized, long reasoning, and vision-language models across
zero-shot retrieval settings and the supervised setting. For the zero-shot
retrieval settings, we consider text retrieval from the BEIR benchmark and code
retrieval from the CoIR benchmark. Further, to evaluate supervised performance,
all LLMs are fine-tuned on the MS MARCO dataset. We find that mathematical
specialization and the long reasoning capability cause consistent degradation
in three settings, indicating conflicts between mathematical reasoning and
semantic matching. The vision-language model and code-specialized LLMs
demonstrate superior zero-shot performance compared to other LLMs, even
surpassing BM25 on the code retrieval task, and maintain comparable performance
to base LLMs in supervised settings. These findings suggest promising
directions for the unified retrieval task leveraging cross-domain and
cross-modal fusion.
Authors' comments: Accepted by CCIR25 and published by Springer LNCS or LNAI
Ikuya Yamada, Ryokan Ri, Takeshi Kojima, Yusuke Iwasawa, Yutaka Matsuo
Dense retrievers often struggle with queries involving less-frequent entities due to their limited entity knowledge. We propose the Knowledgeable Passage Retriever (KPR), a BERT-based retriever enhanced with a context-entity attention layer and dynamically updatable entity embeddings. This design enables KPR to incorporate external entity knowledge without retraining. Experiments on three datasets show that KPR consistently improves retrieval accuracy, achieving a substantial 12.6% gain on the EntityQuestions dataset over the model without KPR extensions. When built on the off-the-shelf bge-base retriever, KPR achieves state-of-the-art performance among similarly sized models on two datasets. Code and models will be released soon.
Lea Fischbach, Akbar Karimi, Caroline Kleen, Alfred Lameli, Lucie Flek
Deep learning models for dialect identification are often limited by the
scarcity of dialectal data. To address this challenge, we propose to use
Retrieval-based Voice Conversion (RVC) as an effective data augmentation method
for a low-resource German dialect classification task. By converting audio
samples to a uniform target speaker, RVC minimizes speaker-related variability,
enabling models to focus on dialect-specific linguistic and phonetic features.
Our experiments demonstrate that RVC enhances classification performance when
utilized as a standalone augmentation method. Furthermore, combining RVC with
other augmentation methods such as frequency masking and segment removal leads
to additional performance gains, highlighting its potential for improving
dialect classification in low-resource scenarios.
Authors' comments: Accepted to Interspeech 2025
Jingyu Pan, Isaac Jacobson, Zheng Zhao, Tung-Chieh Chen, Guanglei Zhou, Chen-Chia Chang, Vineet Rashingkar, Yiran Chen
Modern very large-scale integration (VLSI) design requires the implementation
of integrated circuits using electronic design automation (EDA) tools. Due to
the complexity of EDA algorithms, the vast parameter space poses a huge
challenge to chip design optimization, as the combination of even moderate
numbers of parameters creates an enormous solution space to explore. Manual
parameter selection remains industrial practice despite being excessively
laborious and limited by expert experience. To address this issue, we present
CROP, the first large language model (LLM)-powered automatic VLSI design flow
tuning framework. Our approach includes: (1) a scalable methodology for
transforming RTL source code into dense vector representations, (2) an
embedding-based retrieval system for matching designs with semantically similar
circuits, and (3) a retrieval-augmented generation (RAG)-enhanced LLM-guided
parameter search system that constrains the search process with prior knowledge
from similar designs. Experiment results demonstrate CROP's ability to achieve
superior quality-of-results (QoR) with fewer iterations than existing
approaches on industrial designs, including a 9.9% reduction in power
consumption.
Authors' comments: Accepted by ICCAD 2025
Xinxi Lyu, Michael Duan, Rulin Shao, Pang Wei Koh, Sewon Min
Retrieval-augmented Generation (RAG) has primarily been studied in limited
settings, such as factoid question answering; more challenging,
reasoning-intensive benchmarks have seen limited success from minimal RAG. In
this work, we challenge this prevailing view on established,
reasoning-intensive benchmarks: MMLU, MMLU Pro, AGI Eval, GPQA, and MATH. We
identify a key missing component in prior work: a usable, web-scale datastore
aligned with the breadth of pretraining data. To this end, we introduce
CompactDS: a diverse, high-quality, web-scale datastore that achieves high
retrieval accuracy and subsecond latency on a single-node. The key insights are
(1) most web content can be filtered out without sacrificing coverage, and a
compact, high-quality subset is sufficient; and (2) combining in-memory
approximate nearest neighbor (ANN) retrieval and on-disk exact search balances
speed and recall. Using CompactDS, we show that a minimal RAG pipeline achieves
consistent accuracy improvements across all benchmarks and model sizes
(8B--70B), with relative gains of 10% on MMLU, 33% on MMLU Pro, 14% on GPQA,
and 19% on MATH. No single data source suffices alone, highlighting the
importance of diversity of sources (web crawls, curated math, academic papers,
textbooks). Finally, we show that our carefully designed in-house datastore
matches or outperforms web search engines such as Google Search, as well as
recently proposed, complex agent-based RAG systems--all while maintaining
simplicity, reproducibility, and self-containment. We release CompactDS and our
retrieval pipeline, supporting future research exploring retrieval-based AI
systems.
Authors' comments: 33 pages, 2 figures, 27 tables
Thanh-Tung Phan-Nguyen, Khoi-Nguyen Nguyen-Ngoc, Tam V. Nguyen, Minh-Triet Tran, Trung-Nghia Le
The global fashion e-commerce industry has become integral to people's daily lives, leveraging technological advancements to offer personalized shopping experiences, primarily through recommendation systems that enhance customer engagement through personalized suggestions. To improve customers' experience in online shopping, we propose a novel comprehensive KiseKloset system for outfit retrieval, recommendation, and try-on. We explore two approaches for outfit retrieval: similar item retrieval and text feedback-guided item retrieval. Notably, we introduce a novel transformer architecture designed to recommend complementary items from diverse categories. Furthermore, we enhance the overall performance of the search pipeline by integrating approximate algorithms to optimize the search process. Additionally, addressing the crucial needs of online shoppers, we employ a lightweight yet efficient virtual try-on framework capable of real-time operation, memory efficiency, and maintaining realistic outputs compared to its predecessors. This virtual try-on module empowers users to visualize specific garments on themselves, enhancing the customers' experience and reducing costs associated with damaged items for retailers. We deployed our end-to-end system for online users to test and provide feedback, enabling us to measure their satisfaction levels. The results of our user study revealed that 84% of participants found our comprehensive system highly useful, significantly improving their online shopping experience.
Sophie Zhou, Shu Kong
Art plagiarism detection plays a crucial role in protecting artists'
copyrights and intellectual property, yet it remains a challenging problem in
forensic analysis. In this paper, we address the task of recognizing
plagiarized paintings and explaining the detected plagarisms by retrieving
visually similar authentic artworks. To support this study, we construct a
dataset by collecting painting photos and synthesizing plagiarized versions
using generative AI, tailored to specific artists' styles. We first establish a
baseline approach using off-the-shelf features from the visual foundation model
DINOv2 to retrieve the most similar images in the database and classify
plagiarism based on a similarity threshold. Surprisingly, this non-learned
method achieves a high recognition accuracy of 97.2\% but suffers from low
retrieval precision 29.0\% average precision (AP). To improve retrieval
quality, we finetune DINOv2 with a metric learning loss using positive and
negative sample pairs sampled in the database. The finetuned model greatly
improves retrieval performance by 12\% AP over the baseline, though it
unexpectedly results in a lower recognition accuracy (92.7\%). We conclude with
insightful discussions and outline directions for future research.
Authors' comments: to appear at AVSS'25
Saraga S., Anagha M. S., Dincy R. Arikkat, Rafidha Rehiman K. A., Serena Nicolazzo, Antonino Nocera, Vinod P
The widespread use of Android applications has made them a prime target for cyberattacks, significantly increasing the risk of malware that threatens user privacy, security, and device functionality. Effective malware detection is thus critical, with static analysis, dynamic analysis, and Machine Learning being widely used approaches. In this work, we focus on a Machine Learning-based method utilizing static features. We first compiled a dataset of benign and malicious APKs and performed static analysis to extract features such as code structure, permissions, and manifest file content, without executing the apps. Instead of relying solely on raw static features, our system uses an LLM to generate high-level functional descriptions of APKs. To mitigate hallucinations, which are a known vulnerability of LLM, we integrated Retrieval-Augmented Generation (RAG), enabling the LLM to ground its output in relevant context. Using carefully designed prompts, we guide the LLM to produce coherent function summaries, which are then analyzed using a transformer-based model, improving detection accuracy over conventional feature-based methods for malware detection.
Wali Mohammad Abdullah, Md. Morshedul Islam, Devraj Parmar, Happy Hasmukhbhai Patel, Sindhuja Prabhakaran, Baidya Saha
Large Language Models (LLMs) like GPT-3.5-Turbo are increasingly used to assist software development, yet they often produce incomplete code or incorrect imports, especially when lacking access to external or project-specific documentation. We introduce RAILS (Retrieval-Augmented Intelligence for Learning Software Development), a framework that augments LLM prompts with semantically retrieved context from curated Java resources using FAISS and OpenAI embeddings. RAILS incorporates an iterative validation loop guided by compiler feedback to refine suggestions. We evaluated RAILS on 78 real-world Java import error cases spanning standard libraries, GUI APIs, external tools, and custom utilities. Despite using the same LLM, RAILS outperforms baseline prompting by preserving intent, avoiding hallucinations, and surfacing correct imports even when libraries are unavailable locally. Future work will integrate symbolic filtering via PostgreSQL and extend support to other languages and IDEs.
Wali Mohammad Abdullah, Azmain Kabir
We present P4OMP, a retrieval-augmented framework for transforming serial C/C++ code into OpenMP-annotated parallel code using large language models (LLMs). To our knowledge, this is the first system to apply retrieval-based prompting for OpenMP pragma correctness without model fine-tuning or compiler instrumentation. P4OMP leverages Retrieval-Augmented Generation (RAG) with structured instructional knowledge from OpenMP tutorials to improve the reliability of prompt-driven code generation. By grounding generation in the retrieved context, P4OMP improves syntactic correctness compared to baseline prompting with GPT-3.5-Turbo. We evaluate P4OMP against a baseline, GPT-3.5-Turbo without retrieval, on a comprehensive benchmark of 108 real-world C++ programs drawn from Stack Overflow, PolyBench, and NAS benchmark suites. P4OMP achieves 100% compilation success on all parallelizable cases, while the baseline fails to compile in 20 out of 108 cases. Six cases that rely on non-random-access iterators or thread-unsafe constructs are excluded due to fundamental OpenMP limitations. A detailed analysis demonstrates how P4OMP consistently avoids scoping errors, syntactic misuse, and invalid directive combinations that commonly affect baseline-generated code. We further demonstrate strong runtime scaling across seven compute-intensive benchmarks on an HPC cluster. P4OMP offers a robust, modular pipeline that significantly improves the reliability and applicability of LLM-generated OpenMP code.
Kevin Duh, Eugene Yang, Orion Weller, Andrew Yates, Dawn Lawrie
The HLTCOE LiveRAG submission utilized the GPT-researcher framework for
researching the context of the question, filtering the returned results, and
generating the final answer. The retrieval system was a ColBERT bi-encoder
architecture, which represents a passage with many dense tokens. Retrieval used
a local, compressed index of the FineWeb10-BT collection created with PLAID-X,
using a model fine-tuned for multilingual retrieval. Query generation from
context was done with Qwen2.5-7B-Instruct, while filtering was accomplished
with m2-bert-80M-8k-retrieval. Up to nine passages were used as context to
generate an answer using Falcon3-10B. This system placed 5th in the LiveRAG
automatic evaluation for correctness with a score of 1.07.
Authors' comments: 5 pages, 1 figure
Xinghe Cheng, Zihan Zhang, Jiapu Wang, Liangda Fang, Chaobo He, Quanlong Guan, Shirui Pan, Weiqi Luo
Learning path recommendation seeks to provide learners with a structured sequence of learning items (e.g., knowledge concepts or exercises) to optimize their learning efficiency. Despite significant efforts in this area, most existing methods primarily rely on prerequisite relationships, which present two major limitations: 1) Many educational datasets do not explicitly provide prerequisite relationships between knowledge concepts, hindering the application of current learning path recommendation methods. 2) Relying solely on prerequisite relationships as the sole knowledge structure can impede learning progress and negatively impact student outcomes. To address these challenges, we propose a novel approach, Discrimination Learning Enhances Learning Path Recommendation (DLELP), which enhances learning path recommendations by incorporating both prerequisite and similarity relationships between knowledge concepts. Specifically, we introduce a knowledge concept structure graph generation module that adaptively constructs knowledge concept structure graphs for different educational datasets, significantly improving the generalizability of learning path recommendation methods. We then propose a Discrimination Learning-driven Reinforcement Learning (DLRL) framework, which mitigates the issue of blocked learning paths, further enhancing the efficacy of learning path recommendations. Finally, we conduct extensive experiments on three benchmark datasets, demonstrating that our method not only achieves state-of-the-art performance but also provides interpretable reasoning for the recommended learning paths.
Haiyang Peng, Deren Han, Meng Huang
This paper investigates the stability of phase retrieval by analyzing the condition number of the nonlinear map $\Psi_{\boldsymbol{A}}(\boldsymbol{x}) = \bigl(\lvert \langle {\boldsymbol{a}}_j, \boldsymbol{x} \rangle \rvert^2 \bigr)_{1 \le j \le m}$, where $\boldsymbol{a}_j \in \mathbb{H}^n$ are known sensing vectors with $\mathbb{H} \in \{\mathbb{R}, \mathbb{C}\}$. For each $p \ge 1$, we define the condition number $\beta_{\Psi_{\boldsymbol{A}}}^{\ell_p}$ as the ratio of optimal upper and lower Lipschitz constants of $\Psi_{\boldsymbol{A}}$ measured in the $\ell_p$ norm, with respect to the metric $\mathrm {dist}_\mathbb{H}\left(\boldsymbol{x}, \boldsymbol{y}\right) = \|\boldsymbol{x} \boldsymbol{x}^\ast - \boldsymbol{y} \boldsymbol{y}^\ast\|_*$. We establish universal lower bounds on $\beta_{\Psi_{\boldsymbol{A}}}^{\ell_p}$ for any sensing matrix $\boldsymbol{A} \in \mathbb{H}^{m \times d}$, proving that $\beta_{\Psi_{\boldsymbol{A}}}^{\ell_1} \ge \pi/2$ and $\beta_{\Psi_{\boldsymbol{A}}}^{\ell_2} \ge \sqrt{3}$ in the real case $(\mathbb{H} = \mathbb{R})$, and $\beta_{\Psi_{\boldsymbol{A}}}^{\ell_p} \ge 2$ for $p=1,2$ in the complex case $(\mathbb{H} = \mathbb{C})$. These bounds are shown to be asymptotically tight: both a deterministic harmonic frame $\boldsymbol{E}_m \in \mathbb{R}^{m \times 2}$ and Gaussian random matrices $\boldsymbol{A} \in \mathbb{H}^{m \times d}$ asymptotically attain them. Notably, the harmonic frame $\boldsymbol{E}_m \in \mathbb{R}^{m \times 2}$ achieves the optimal lower bound $\sqrt{3}$ for all $m \ge 3$ when $p=2$, thus serving as an optimal sensing matrix within $\boldsymbol{A} \in \mathbb{R}^{m \times 2}$. Our results provide the first explicit uniform lower bounds on $\beta_{\Psi_{\boldsymbol{A}}}^{\ell_p}$ and offer insights into the fundamental stability limits of phase retrieval.
Reza Yousefi Maragheh, Pratheek Vadla, Priyank Gupta, Kai Zhao, Aysenur Inan, Kehui Yao, Jianpeng Xu, Praveen Kanumala et al.
Retrieval-Augmented Generation (RAG) has shown promise in enhancing recommendation systems by incorporating external context into large language model prompts. However, existing RAG-based approaches often rely on static retrieval heuristics and fail to capture nuanced user preferences in dynamic recommendation scenarios. In this work, we introduce ARAG, an Agentic Retrieval-Augmented Generation framework for Personalized Recommendation, which integrates a multi-agent collaboration mechanism into the RAG pipeline. To better understand the long-term and session behavior of the user, ARAG leverages four specialized LLM-based agents: a User Understanding Agent that summarizes user preferences from long-term and session contexts, a Natural Language Inference (NLI) Agent that evaluates semantic alignment between candidate items retrieved by RAG and inferred intent, a context summary agent that summarizes the findings of NLI agent, and an Item Ranker Agent that generates a ranked list of recommendations based on contextual fit. We evaluate ARAG accross three datasets. Experimental results demonstrate that ARAG significantly outperforms standard RAG and recency-based baselines, achieving up to 42.1% improvement in NDCG@5 and 35.5% in Hit@5. We also, conduct an ablation study to analyse the effect by different components of ARAG. Our findings highlight the effectiveness of integrating agentic reasoning into retrieval-augmented recommendation and provide new directions for LLM-based personalization.
Guanting Dong, Xiaoxi Li, Yuyao Zhang, Mengjie Deng
Real-world live retrieval-augmented generation (RAG) systems face significant
challenges when processing user queries that are often noisy, ambiguous, and
contain multiple intents. While RAG enhances large language models (LLMs) with
external knowledge, current systems typically struggle with such complex
inputs, as they are often trained or evaluated on cleaner data. This paper
introduces Omni-RAG, a novel framework designed to improve the robustness and
effectiveness of RAG systems in live, open-domain settings. Omni-RAG employs
LLM-assisted query understanding to preprocess user inputs through three key
modules: (1) Deep Query Understanding and Decomposition, which utilizes LLMs
with tailored prompts to denoise queries (e.g., correcting spelling errors) and
decompose multi-intent queries into structured sub-queries; (2) Intent-Aware
Knowledge Retrieval, which performs retrieval for each sub-query from a corpus
(i.e., FineWeb using OpenSearch) and aggregates the results; and (3) Reranking
and Generation, where a reranker (i.e., BGE) refines document selection before
a final response is generated by an LLM (i.e., Falcon-10B) using a
chain-of-thought prompt. Omni-RAG aims to bridge the gap between current RAG
capabilities and the demands of real-world applications, such as those
highlighted by the SIGIR 2025 LiveRAG Challenge, by robustly handling complex
and noisy queries.
Authors' comments: Accepted at SIGIR 2025 LiveRAG Workshop (Oral Presentation)
Dinh-Khoi Vo, Van-Loc Nguyen, Minh-Triet Tran, Trung-Nghia Le
Retrieving 3D objects in complex indoor environments using only a masked 2D image and a natural language description presents significant challenges. The ROOMELSA challenge limits access to full 3D scene context, complicating reasoning about object appearance, geometry, and semantics. These challenges are intensified by distorted viewpoints, textureless masked regions, ambiguous language prompts, and noisy segmentation masks. To address this, we propose SAMURAI: Shape-Aware Multimodal Retrieval for 3D Object Identification. SAMURAI integrates CLIP-based semantic matching with shape-guided re-ranking derived from binary silhouettes of masked regions, alongside a robust majority voting strategy. A dedicated preprocessing pipeline enhances mask quality by extracting the largest connected component and removing background noise. Our hybrid retrieval framework leverages both language and shape cues, achieving competitive performance on the ROOMELSA private test set. These results highlight the importance of combining shape priors with language understanding for robust open-world 3D object retrieval.
Fangyuan Zhang, Zhengjun Huang, Yingli Zhou, Qintian Guo, Zhixun Li, Wensheng Luo, Di Jiang, Yixiang Fang et al.
Graph-based Retrieval-Augmented Generation (Graph-RAG) enhances large
language models (LLMs) by structuring retrieval over an external corpus.
However, existing approaches typically assume a static corpus, requiring
expensive full-graph reconstruction whenever new documents arrive, limiting
their scalability in dynamic, evolving environments. To address these
limitations, we introduce EraRAG, a novel multi-layered Graph-RAG framework
that supports efficient and scalable dynamic updates. Our method leverages
hyperplane-based Locality-Sensitive Hashing (LSH) to partition and organize the
original corpus into hierarchical graph structures, enabling efficient and
localized insertions of new data without disrupting the existing topology. The
design eliminates the need for retraining or costly recomputation while
preserving high retrieval accuracy and low latency. Experiments on large-scale
benchmarks demonstrate that EraRag achieves up to an order of magnitude
reduction in update time and token consumption compared to existing Graph-RAG
systems, while providing superior accuracy performance. This work offers a
practical path forward for RAG systems that must operate over continually
growing corpora, bridging the gap between retrieval efficiency and
adaptability. Our code and data are available at
https://github.com/EverM0re/EraRAG-Official.
Authors' comments: Under review
Zhigong Zhou, Ning Ding, Xiaochuan Fan, Yue Shang, Yiming Qiu, Jingwei Zhuo, Zhiwei Ge, Songlin Wang et al.
Semantic retrieval, which retrieves semantically matched items given a
textual query, has been an essential component to enhance system effectiveness
in e-commerce search. In this paper, we study the multimodal retrieval problem,
where the visual information (e.g, image) of item is leveraged as supplementary
of textual information to enrich item representation and further improve
retrieval performance. Though learning from cross-modality data has been
studied extensively in tasks such as visual question answering or media
summarization, multimodal retrieval remains a non-trivial and unsolved problem
especially in the asymmetric scenario where the query is unimodal while the
item is multimodal. In this paper, we propose a novel model named SMAR, which
stands for Semantic-enhanced Modality-Asymmetric Retrieval, to tackle the
problem of modality fusion and alignment in this kind of asymmetric scenario.
Extensive experimental results on an industrial dataset show that the proposed
model outperforms baseline models significantly in retrieval accuracy. We have
open sourced our industrial dataset for the sake of reproducibility and future
research works.
Authors' comments: published in sigir2023
Jia-Huei Ju, Suzan Verberne, Maarten de Rijke, Andrew Yates
Retrieval-augmented generation (RAG) enhances large language models by incorporating context retrieved from external knowledge sources. While the effectiveness of the retrieval module is typically evaluated with relevance-based ranking metrics, such metrics may be insufficient to reflect the retrieval's impact on the final RAG result, especially in long-form generation scenarios. We argue that providing a comprehensive retrieval-augmented context is important for long-form RAG tasks like report generation and propose metrics for assessing the context independent of generation. We introduce CRUX, a \textbf{C}ontrolled \textbf{R}etrieval-a\textbf{U}gmented conte\textbf{X}t evaluation framework designed to directly assess retrieval-augmented contexts. This framework uses human-written summaries to control the information scope of knowledge, enabling us to measure how well the context covers information essential for long-form generation. CRUX uses question-based evaluation to assess RAG's retrieval in a fine-grained manner. Empirical results show that CRUX offers more reflective and diagnostic evaluation. Our findings also reveal substantial room for improvement in current retrieval methods, pointing to promising directions for advancing RAG's retrieval. Our data and code are publicly available to support and advance future research on retrieval.