Yoorhim Cho, Hongyeob Kim, Semin Kim, Youjia Zhang, Yunseok Choi, Sungeun Hong
Visuo-tactile perception aims to understand an object's tactile properties, such as texture, softness, and rigidity. However, the field remains underexplored because collecting tactile data is costly and labor-intensive. We observe that visually distinct objects can exhibit similar surface textures or material properties. For example, a leather sofa and a leather jacket have different appearances but share similar tactile properties. This implies that tactile understanding can be guided by material cues in visual data, even without direct tactile supervision. In this paper, we introduce RA-Touch, a retrieval-augmented framework that improves visuo-tactile perception by leveraging visual data enriched with tactile semantics. We carefully recaption a large-scale visual dataset with tactile-focused descriptions, enabling the model to access tactile semantics typically absent from conventional visual datasets. A key challenge remains in effectively utilizing these tactile-aware external descriptions. RA-Touch addresses this by retrieving visual-textual representations aligned with tactile inputs and integrating them to focus on relevant textural and material properties. By outperforming prior methods on the TVL benchmark, our method demonstrates the potential of retrieval-based visual reuse for tactile understanding. Code is available at https://aim-skku.github.io/RA-Touch
Ziyang Zeng, Dun Zhang, Jiacheng Li, Panxiang Zou, Yuqing Yang
This study investigates a specific form of positional bias, termed the Myopic
Trap, where retrieval models disproportionately attend to the early parts of
documents while overlooking relevant information that appears later. To
systematically quantify this phenomenon, we propose a semantics-preserving
evaluation framework that repurposes the existing NLP datasets into
position-aware retrieval benchmarks. By evaluating the SOTA models of full
retrieval pipeline, including BM25, embedding models, ColBERT-style
late-interaction models, and reranker models, we offer a broader empirical
perspective on positional bias than prior work. Experimental results show that
embedding models and ColBERT-style models exhibit significant performance
degradation when query-related content is shifted toward later positions,
indicating a pronounced head bias. Notably, under the same training
configuration, ColBERT-style approach show greater potential for mitigating
positional bias compared to the traditional single-vector approach. In
contrast, BM25 and reranker models remain largely unaffected by such
perturbations, underscoring their robustness to positional bias. Code and data
are publicly available at: www.github.com/NovaSearch-Team/RAG-Retrieval.
Authors' comments: 10 pages, 3 figures, 4 tables. Under review
Qifeng Cai, Hao Liang, Hejun Dong, Meiyi Qiang, Ruichuan An, Zhaoyang Han, Zhengzhou Zhu, Bin Cui et al.
Long videos contain a vast amount of information, making video-text retrieval an essential and challenging task in multimodal learning. However, existing benchmarks suffer from limited video duration, low-quality captions, and coarse annotation granularity, which hinder the evaluation of advanced video-text retrieval methods. To address these limitations, we introduce LoVR, a benchmark specifically designed for long video-text retrieval. LoVR contains 467 long videos and over 40,804 fine-grained clips with high-quality captions. To overcome the issue of poor machine-generated annotations, we propose an efficient caption generation framework that integrates VLM automatic generation, caption quality scoring, and dynamic refinement. This pipeline improves annotation accuracy while maintaining scalability. Furthermore, we introduce a semantic fusion method to generate coherent full-video captions without losing important contextual information. Our benchmark introduces longer videos, more detailed captions, and a larger-scale dataset, presenting new challenges for video understanding and retrieval. Extensive experiments on various advanced embedding models demonstrate that LoVR is a challenging benchmark, revealing the limitations of current approaches and providing valuable insights for future research. We release the code and dataset link at https://github.com/TechNomad-ds/LoVR-benchmark
Junyu Luo, Yusheng Zhao, Xiao Luo, Zhiping Xiao, Wei Ju, Li Shen, Dacheng Tao, Ming Zhang
Unsupervised efficient domain adaptive retrieval aims to transfer knowledge
from a labeled source domain to an unlabeled target domain, while maintaining
low storage cost and high retrieval efficiency. However, existing methods
typically fail to address potential noise in the target domain, and directly
align high-level features across domains, thus resulting in suboptimal
retrieval performance. To address these challenges, we propose a novel
Cross-Domain Diffusion with Progressive Alignment method (COUPLE). This
approach revisits unsupervised efficient domain adaptive retrieval from a graph
diffusion perspective, simulating cross-domain adaptation dynamics to achieve a
stable target domain adaptation process. First, we construct a cross-domain
relationship graph and leverage noise-robust graph flow diffusion to simulate
the transfer dynamics from the source domain to the target domain, identifying
lower noise clusters. We then leverage the graph diffusion results for
discriminative hash code learning, effectively learning from the target domain
while reducing the negative impact of noise. Furthermore, we employ a
hierarchical Mixup operation for progressive domain alignment, which is
performed along the cross-domain random walk paths. Utilizing target domain
discriminative hash learning and progressive domain alignment, COUPLE enables
effective domain adaptive hash learning. Extensive experiments demonstrate
COUPLE's effectiveness on competitive benchmarks.
Authors' comments: IEEE TIP
Yuning Jiang, Feiyang Shang, Freedy Tan Wei You, Huilin Wang, Chia Ren Cong, Qiaoran Meng, Nay Oo, Hoon Wei Lim et al.
The dynamic landscape of cybersecurity demands precise and scalable solutions for vulnerability management in heterogeneous systems, where configuration-specific vulnerabilities are often misidentified due to inconsistent data in databases like the National Vulnerability Database (NVD). Inaccurate Common Platform Enumeration (CPE) data in NVD further leads to false positives and incomplete vulnerability retrieval. Informed by our systematic analysis of CPE and CVEdeails data, revealing more than 50% vendor name inconsistencies, we propose VulCPE, a framework that standardizes data and models configuration dependencies using a unified CPE schema (uCPE), entity recognition, relation extraction, and graph-based modeling. VulCPE achieves superior retrieval precision (0.766) and coverage (0.926) over existing tools. VulCPE ensures precise, context-aware vulnerability management, enhancing cyber resilience.
Runchu Tian, Xueqiang Xu, Bowen Jin, SeongKu Kang, Jiawei Han
Scientific retrieval is essential for advancing academic discovery. Within
this process, document reranking plays a critical role by refining first-stage
retrieval results. However, large language model (LLM) listwise reranking faces
unique challenges in the scientific domain. First-stage retrieval is often
suboptimal in the scientific domain, so relevant documents are ranked lower.
Moreover, conventional listwise reranking uses the full text of candidate
documents in the context window, limiting the number of candidates that can be
considered. As a result, many relevant documents are excluded before reranking,
which constrains overall retrieval performance. To address these challenges, we
explore compact document representations based on semantic features such as
categories, sections, and keywords, and propose a training-free, model-agnostic
reranking framework for scientific retrieval called CoRank. The framework
involves three stages: (i) offline extraction of document-level features, (ii)
coarse reranking using these compact representations, and (iii) fine-grained
reranking on full texts of the top candidates from stage (ii). This hybrid
design provides a high-level abstraction of document semantics, expands
candidate coverage, and retains critical details required for precise ranking.
Experiments on LitSearch and CSFCube show that CoRank significantly improves
reranking performance across different LLM backbones, increasing nDCG@10 from
32.0 to 39.7. Overall, these results highlight the value of information
extraction for reranking in scientific retrieval.
Authors' comments: 17 pages, 4 figures
Tommaso Mario Buonocore, Enea Parimbelli
Content moderation for large language models (LLMs) remains a significant
challenge, requiring flexible and adaptable solutions that can quickly respond
to emerging threats. This paper introduces Retrieval Augmented Rejection (RAR),
a novel approach that leverages a retrieval-augmented generation (RAG)
architecture to dynamically reject unsafe user queries without model
retraining. By strategically inserting and marking malicious documents into the
vector database, the system can identify and reject harmful requests when these
documents are retrieved. Our preliminary results show that RAR achieves
comparable performance to embedded moderation in LLMs like Claude 3.5 Sonnet,
while offering superior flexibility and real-time customization capabilities, a
fundamental feature to timely address critical vulnerabilities. This approach
introduces no architectural changes to existing RAG systems, requiring only the
addition of specially crafted documents and a simple rejection mechanism based
on retrieval results.
Authors' comments: 7 pages, 4 figures, 2 tables
Kevin Chenhao Li, Vahid Zolfaghari, Nenad Petrovic, Fengjunjie Pan, Alois Knoll
The Object Constraint Language (OCL) is essential for defining precise constraints within Model-Based Systems Engineering (MBSE). However, manually writing OCL rules is complex and time-consuming. This study explores the optimization of Retrieval-Augmented Generation (RAG) for automating OCL rule generation, focusing on the impact of different retrieval strategies. We evaluate three retrieval approaches $\unicode{x2013}$ BM25 (lexical-based), BERT-based (semantic retrieval), and SPLADE (sparse-vector retrieval) $\unicode{x2013}$ analyzing their effectiveness in providing relevant context for a large language model. To further assess our approach, we compare and benchmark our retrieval-optimized generation results against PathOCL, a state-of-the-art graph-based method. We directly compare BM25, BERT, and SPLADE retrieval methods with PathOCL to understand how different retrieval methods perform for a unified evaluation framework. Our experimental results, focusing on retrieval-augmented generation, indicate that while retrieval can enhance generation accuracy, its effectiveness depends on the retrieval method and the number of retrieved chunks (k). BM25 underperforms the baseline, whereas semantic approaches (BERT and SPLADE) achieve better results, with SPLADE performing best at lower k values. However, excessive retrieval with high k parameter can lead to retrieving irrelevant chunks which degrades model performance. Our findings highlight the importance of optimizing retrieval configurations to balance context relevance and output consistency. This research provides insights into improving OCL rule generation using RAG and underscores the need for tailoring retrieval.
Liuji Chen, Xiaofang Yang, Yuanzhuo Lu, Jinghao Zhang, Xin Sun, Qiang Liu, Shu Wu, Jing Dong et al.
Retrieval-Augmented Generation (RAG) systems, widely used to improve the
factual grounding of large language models (LLMs), are increasingly vulnerable
to poisoning attacks, where adversaries inject manipulated content into the
retriever's corpus. While prior research has predominantly focused on
single-attacker settings, real-world scenarios often involve multiple,
competing attackers with conflicting objectives. In this work, we introduce
PoisonArena, the first benchmark to systematically study and evaluate competing
poisoning attacks in RAG. We formalize the multi-attacker threat model, where
attackers vie to control the answer to the same query using mutually exclusive
misinformation. PoisonArena leverages the Bradley-Terry model to quantify each
method's competitive effectiveness in such adversarial environments. Through
extensive experiments on the Natural Questions and MS MARCO datasets, we
demonstrate that many attack strategies successful in isolation fail under
competitive pressure. Our findings highlight the limitations of conventional
evaluation metrics like Attack Success Rate (ASR) and F1 score and underscore
the need for competitive evaluation to assess real-world attack robustness.
PoisonArena provides a standardized framework to benchmark and develop future
attack and defense strategies under more realistic, multi-adversary conditions.
Project page: https://github.com/yxf203/PoisonArena.
Authors' comments: 29 pages
Yongyuan Chu, Lu Yang, Wenle Weng, Junqiu Liu, Hairun Guo
The nature of optical metrology is to perform efficient transfer and precise
retrieval for lasers and optical signals, which is beneficial for a variety of
applications ranging from optical clocking, spectroscopy, to telecommunications
and quantum optics. While efforts have been made to promote the detection
accuracy of optical frequencies, retrieval on optical waveforms remains on the
autocorrelation scheme with limited performances. Here, we demonstrate a novel
scheme for optical metrology, particularly on direct retrieval of optical
waveform in terms of the field amplitude profile. The scheme is based on
massive four-wave-mixings underlying a nanophotonic supercontinuum process,
which enables arbitrary transfer of an additive laser to modulational sidebands
of the broadened continuum. Detection of the transferred signals is then
flexible to be within the whole span of the supercontinuum from visible to the
mid-infrared range. We demonstrate such a transfer scheme for both CW lasers
and pulsed lasers. For the latter, the temporal amplitude profile of the
optical wave can be retrieved, which reveals high-order dynamics of solitary
pulses including the self-steepening, self-compression, and the soliton
splitting, and shows a remarkable square-fold increase of signal-to-noise ratio
in the power spectrum. Our results may contribute to advance optical metrology
particularly towards chip scale optical waveform detection, and more
fundamentally, they reveal insights of massive ultrafast nonlinear interactions
underlying the soliton-based supercontinuum process.
Authors' comments: 17 pages, 9 figures
Yuhao Wang, Ruiyang Ren, Yucheng Wang, Wayne Xin Zhao, Jing Liu, Hua Wu, Haifeng Wang
Considering the inherent limitations of parametric knowledge in large
language models (LLMs), retrieval-augmented generation (RAG) is widely employed
to expand their knowledge scope. Since RAG has shown promise in
knowledge-intensive tasks like open-domain question answering, its broader
application to complex tasks and intelligent assistants has further advanced
its utility. Despite this progress, the underlying knowledge utilization
mechanisms of LLM-based RAG remain underexplored. In this paper, we present a
systematic investigation of the intrinsic mechanisms by which LLMs integrate
internal (parametric) and external (retrieved) knowledge in RAG scenarios.
Specially, we employ knowledge stream analysis at the macroscopic level, and
investigate the function of individual modules at the microscopic level.
Drawing on knowledge streaming analyses, we decompose the knowledge utilization
process into four distinct stages within LLM layers: knowledge refinement,
knowledge elicitation, knowledge expression, and knowledge contestation. We
further demonstrate that the relevance of passages guides the streaming of
knowledge through these stages. At the module level, we introduce a new method,
knowledge activation probability entropy (KAPE) for neuron identification
associated with either internal or external knowledge. By selectively
deactivating these neurons, we achieve targeted shifts in the LLM's reliance on
one knowledge source over the other. Moreover, we discern complementary roles
for multi-head attention and multi-layer perceptron layers during knowledge
formation. These insights offer a foundation for improving interpretability and
reliability in retrieval-augmented LLMs, paving the way for more robust and
transparent generative solutions in knowledge-intensive domains.
Authors' comments: SIGIR 2025
Zhangyu Wang, Siyuan Gao, Rong Zhou, Hao Wang, Li Ning
Large Language Models (LLMs) have achieved impressive progress in natural language processing, but their limited ability to retain long-term context constrains performance on document-level or multi-turn tasks. Retrieval-Augmented Generation (RAG) mitigates this by retrieving relevant information from an external corpus. However, existing RAG systems often rely on embedding-based retrieval trained on corpus-level semantic similarity, which can lead to retrieving content that is semantically similar in form but misaligned with the question's true intent. Furthermore, recent RAG variants construct graph- or hierarchy-based structures to improve retrieval accuracy, resulting in significant computation and storage overhead. In this paper, we propose an embedding-free retrieval framework. Our method leverages the logical inferencing ability of LLMs in retrieval using iterative search space refinement guided by our novel importance measure and extend our retrieval results with logically related information without explicit graph construction. Experiments on long-context QA benchmarks, including NovelQA and Marathon, show that our approach outperforms strong baselines while reducing storage and runtime by over an order of magnitude.
Radek Osmulsk, Gabriel de Souza P. Moreira, Ronay Ak, Mengyao Xu, Benedikt Schifferer, Even Oldridge
Document retrieval is an important task for search and Retrieval-Augmented Generation (RAG) applications. Large Language Models (LLMs) have contributed to improving the accuracy of text-based document retrieval. However, documents with complex layout and visual elements like tables, charts and infographics are not perfectly represented in textual format. Recently, image-based document retrieval pipelines have become popular, which use visual large language models (VLMs) to retrieve relevant page images given a query. Current evaluation benchmarks on visual document retrieval are limited, as they primarily focus only English language, rely on synthetically generated questions and offer a small corpus size. Therefore, we introduce MIRACL-VISION, a multilingual visual document retrieval evaluation benchmark. MIRACL-VISION covers 18 languages, and is an extension of the MIRACL dataset, a popular benchmark to evaluate text-based multilingual retrieval pipelines. MIRACL was built using a human-intensive annotation process to generate high-quality questions. In order to reduce MIRACL-VISION corpus size to make evaluation more compute friendly while keeping the datasets challenging, we have designed a method for eliminating the "easy" negatives from the corpus. We conducted extensive experiments comparing MIRACL-VISION with other benchmarks, using popular public text and image models. We observe a gap in state-of-the-art VLM-based embedding models on multilingual capabilities, with up to 59.7% lower retrieval accuracy than a text-based retrieval models. Even for the English language, the visual models retrieval accuracy is 12.1% lower compared to text-based models. MIRACL-VISION is a challenging, representative, multilingual evaluation benchmark for visual retrieval pipelines and will help the community build robust models for document retrieval.
Ruobing Yao, Yifei Zhang, Shuang Song, Neng Gao, Chenyang Tu
Retrieval-Augmented Generation (RAG) compensates for the static knowledge limitations of Large Language Models (LLMs) by integrating external knowledge, producing responses with enhanced factual correctness and query-specific contextualization. However, it also introduces new attack surfaces such as corpus poisoning at the same time. Most of the existing defense methods rely on the internal knowledge of the model, which conflicts with the design concept of RAG. To bridge the gap, EcoSafeRAG uses sentence-level processing and bait-guided context diversity detection to identify malicious content by analyzing the context diversity of candidate documents without relying on LLM internal knowledge. Experiments show EcoSafeRAG delivers state-of-the-art security with plug-and-play deployment, simultaneously improving clean-scenario RAG performance while maintaining practical operational costs (relatively 1.2$\times$ latency, 48\%-80\% token reduction versus Vanilla RAG).
Jiajie Jin, Xiaoxi Li, Guanting Dong, Yuyao Zhang, Yutao Zhu, Yongkang Wu, Zhonghua Li, Qi Ye et al.
Real-world RAG applications often encounter long-context input scenarios, where redundant information and noise results in higher inference costs and reduced performance. To address these challenges, we propose LongRefiner, an efficient plug-and-play refiner that leverages the inherent structural characteristics of long documents. LongRefiner employs dual-level query analysis, hierarchical document structuring, and adaptive refinement through multi-task learning on a single foundation model. Experiments on seven QA datasets demonstrate that LongRefiner achieves competitive performance in various scenarios while using 10x fewer computational costs and latency compared to the best baseline. Further analysis validates that LongRefiner is scalable, efficient, and effective, providing practical insights for real-world long-text RAG applications. Our code is available at https://github.com/ignorejjj/LongRefiner.
Ruizhe Wang, Ryan J. MacDonald, Neale P. Gibson, Nikole K. Lewis
High-resolution spectroscopy (R > 25,000) has opened new opportunities to
characterize exoplanet atmospheres from the ground. By resolving individual
lines in planetary emission and transmission spectra, one can sensitively probe
the chemical inventory and temperature structure of exoplanets. However, a
significant challenge to reliable and reproducible atmospheric inferences from
high-resolution datasets has been the lack of open source codes for
high-resolution retrievals. Here, we present a unified high-resolution
retrieval framework, for both emission and transmission spectroscopy, made
publicly available within the open source POSEIDON retrieval code. Our
high-resolution retrieval framework is fast (typically < 12 hours), accessible
(no GPUs required), and well-documented via Python notebooks. We validate our
framework by reproducing previous emission retrievals of the hot Jupiter
WASP-77Ab and transmission retrievals of the ultra-hot Jupiter WASP-121b. Our
results are broadly consistent with those of published works when making the
same data detrending assumptions, but we demonstrate that user choices can
subtly propagate into retrieved chemical abundances.
Authors' comments: Documentation available at
https://poseidon-retrievals.readthedocs.io/en/latest/content/retrieval_tutorials.html#high-resolution-retrievals-new
Sheng Liang, Hang Lv, Zhihao Wen, Yaxiong Wu, Yongyue Zhang, Hao Wang, Yong Liu
Event extraction (EE) is a fundamental task in natural language processing
(NLP) that involves identifying and extracting event information from
unstructured text. Effective EE in real-world scenarios requires two key steps:
selecting appropriate schemas from hundreds of candidates and executing the
extraction process. Existing research exhibits two critical gaps: (1) the rigid
schema fixation in existing pipeline systems, and (2) the absence of benchmarks
for evaluating joint schema matching and extraction. Although large language
models (LLMs) offer potential solutions, their schema hallucination tendencies
and context window limitations pose challenges for practical deployment. In
response, we propose Adaptive Schema-aware Event Extraction (ASEE), a novel
paradigm combining schema paraphrasing with schema retrieval-augmented
generation. ASEE adeptly retrieves paraphrased schemas and accurately generates
targeted structures. To facilitate rigorous evaluation, we construct the
Multi-Dimensional Schema-aware Event Extraction (MD-SEE) benchmark, which
systematically consolidates 12 datasets across diverse domains, complexity
levels, and language settings. Extensive evaluations on MD-SEE show that our
proposed ASEE demonstrates strong adaptability across various scenarios,
significantly improving the accuracy of event extraction.
Authors' comments: 15 pages, 3 figures
Dvir Cohen, Lin Burg, Sviatoslav Pykhnivskyi, Hagit Gur, Stanislav Kovynov, Olga Atzmon, Gilad Barkan
Retrieval-Augmented Generation (RAG) is a cornerstone of modern question answering (QA) systems, enabling grounded answers based on external knowledge. Although recent progress has been driven by open-domain datasets, enterprise QA systems need datasets that mirror the concrete, domain-specific issues users raise in day-to-day support scenarios. Critically, evaluating end-to-end RAG systems requires benchmarks comprising not only question--answer pairs but also the specific knowledge base (KB) snapshot from which answers were derived. To address this need, we introduce WixQA, a benchmark suite featuring QA datasets precisely grounded in the released KB corpus, enabling holistic evaluation of retrieval and generation components. WixQA includes three distinct QA datasets derived from Wix.com customer support interactions and grounded in a snapshot of the public Wix Help Center KB: (i) WixQA-ExpertWritten, 200 real user queries with expert-authored, multi-step answers; (ii) WixQA-Simulated, 200 expert-validated QA pairs distilled from user dialogues; and (iii) WixQA-Synthetic, 6,222 LLM-generated QA pairs, with one pair systematically derived from each article in the knowledge base. We release the KB snapshot alongside the datasets under MIT license and provide comprehensive baseline results, forming a unique benchmark for evaluating enterprise RAG systems in realistic enterprise environments.
Mingxu Tao, Bowen Tang, Mingxuan Ma, Yining Zhang, Hourun Li, Feifan Wen, Hao Ma, Jia Yang
The rise of Large Language Models~(LLMs) revolutionizes information
retrieval, allowing users to obtain required answers through complex
instructions within conversations. However, publicly available services remain
inadequate in addressing the needs of faculty and students to search
campus-specific information. It is primarily due to the LLM's lack of
domain-specific knowledge and the limitation of search engines in supporting
multilingual and timely scenarios. To tackle these challenges, we introduce
ALOHA, a multilingual agent enhanced by hierarchical retrieval for university
orientation. We also integrate external APIs into the front-end interface to
provide interactive service. The human evaluation and case study show our
proposed system has strong capabilities to yield correct, timely, and
user-friendly responses to the queries in multiple languages, surpassing
commercial chatbots and search engines. The system has been deployed and has
provided service for more than 12,000 people.
Authors' comments: To appear in NAACL 2025 Demo Track
Linus Stuhlmann, Michael Alexander Saxer, Jonathan Fürst
Biomedical question-answering (QA) systems require effective retrieval and
generation components to ensure accuracy, efficiency, and scalability. This
study systematically examines a Retrieval-Augmented Generation (RAG) system for
biomedical QA, evaluating retrieval strategies and response time trade-offs. We
first assess state-of-the-art retrieval methods, including BM25, BioBERT,
MedCPT, and a hybrid approach, alongside common data stores such as
Elasticsearch, MongoDB, and FAISS, on a ~10% subset of PubMed (2.4M documents)
to measure indexing efficiency, retrieval latency, and retriever performance in
the end-to-end RAG system. Based on these insights, we deploy the final RAG
system on the full 24M PubMed corpus, comparing different retrievers' impact on
overall performance. Evaluations of the retrieval depth show that retrieving 50
documents with BM25 before reranking with MedCPT optimally balances accuracy
(0.90), recall (0.90), and response time (1.91s). BM25 retrieval time remains
stable (82ms), while MedCPT incurs the main computational cost. These results
highlight previously not well-known trade-offs in retrieval depth, efficiency,
and scalability for biomedical QA. With open-source code, the system is fully
reproducible and extensible.
Authors' comments: Accepted at SDS25