Ruofan Hu, Yan Xia, Minjie Hong, Jieming Zhu, Bo Chen, Xiaoda Yang, Minghui Fang, Tao Jin
Multimodal large language models (MLLMs) have seen substantial progress in
recent years. However, their ability to represent multimodal information in the
acoustic domain remains underexplored. In this work, we introduce Vela, a novel
framework designed to adapt MLLMs for the generation of universal multimodal
embeddings. By leveraging MLLMs with specially crafted prompts and selected
in-context learning examples, Vela effectively bridges the modality gap across
various modalities. We then propose a single-modality training approach, where
the model is trained exclusively on text pairs. Our experiments show that Vela
outperforms traditional CLAP models in standard text-audio retrieval tasks.
Furthermore, we introduce new benchmarks that expose CLAP models' limitations
in handling long texts and complex retrieval tasks. In contrast, Vela, by
harnessing the capabilities of MLLMs, demonstrates robust performance in these
scenarios. Our code will soon be available.
Authors' comments: Accepted by Interspeech 2025
Chelsi Jain, Yiran Wu, Yifan Zeng, Jiale Liu, Shengyu Dai, Zhenwen Shao, Qingyun Wu, Huazheng Wang
Document Visual Question Answering (DocVQA) is a practical yet challenging task: answering questions about documents whose evidence may span multiple pages and different modalities of information, e.g., images and tables. To handle multiple modalities, recent methods follow a similar Retrieval Augmented Generation (RAG) pipeline, but use embedding models based on Visual Language Models (VLMs) to embed and retrieve relevant pages as images, and generate answers with VLMs that can accept an image as input. In this paper, we introduce SimpleDoc, a lightweight yet powerful retrieval-augmented framework for DocVQA. It boosts evidence page gathering by first retrieving candidates through embedding similarity and then filtering and re-ranking these candidates based on page summaries. A single VLM-based reasoner agent repeatedly invokes this dual-cue retriever, iteratively pulling fresh pages into a working memory until the question is confidently answered. SimpleDoc outperforms previous baselines by 3.2% on average on 4 DocVQA datasets while retrieving far fewer pages. Our code is available at https://github.com/ag2ai/SimpleDoc.
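The two-cue retrieval described above (shortlist by embedding similarity, then filter and re-rank by page summaries) can be sketched as follows. This is a minimal illustration, not SimpleDoc's implementation: the function `retrieve_pages`, the page dictionary layout, and the keyword-overlap scorer are all hypothetical, and the overlap scorer is only a stand-in for whatever summary-based filtering the authors use.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def retrieve_pages(query_vec, query_terms, pages, n_candidates=3):
    """pages: list of dicts with 'id', 'embedding', and 'summary' keys (assumed layout)."""
    # Stage 1: shortlist pages by embedding similarity to the query.
    ranked = sorted(pages, key=lambda p: dot(query_vec, p["embedding"]), reverse=True)
    candidates = ranked[:n_candidates]

    # Stage 2: drop candidates whose summary shares no term with the query,
    # then re-rank the rest by term overlap (stand-in for a summary-based judge).
    def overlap(page):
        return len(query_terms & set(page["summary"].lower().split()))

    kept = [p for p in candidates if overlap(p) > 0]
    kept.sort(key=overlap, reverse=True)
    return [p["id"] for p in kept]

pages = [
    {"id": 1, "embedding": (0.9, 0.1), "summary": "Table of revenue by region for 2023"},
    {"id": 2, "embedding": (0.8, 0.3), "summary": "Photo of the headquarters building"},
    {"id": 3, "embedding": (0.7, 0.2), "summary": "Chart of quarterly revenue growth"},
    {"id": 4, "embedding": (0.1, 0.9), "summary": "Revenue outlook for 2023 and 2024"},
]
selected = retrieve_pages((1.0, 0.0), {"revenue", "2023"}, pages)
```

In this toy data, page 2 passes the embedding stage but is filtered out by its summary, while page 4 never reaches the second stage because its embedding is far from the query.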
Chia-Heng Yu, Yen-Lung Tsai
Traditional Retrieval-Augmented Generation (RAG) systems employ brute-force
inner product search to retrieve the top-k most similar documents, which are
then combined with the user query and passed to a language model. This allows the
model to access external knowledge and reduce hallucinations. However,
selecting an appropriate k value remains a significant challenge in practical
applications: a small k may fail to retrieve sufficient information, while a
large k can introduce excessive and irrelevant content. To address this, we
propose a hierarchical clustering-based retrieval method that eliminates the
need to predefine k. Our approach maintains the accuracy and relevance of
system responses while adaptively selecting semantically relevant content. In
the experiment stage, we applied our method to a Taiwanese legal dataset with
expert-graded queries. The results show that our approach achieves superior
performance in expert evaluations and maintains high precision while
eliminating the need to predefine k, demonstrating improved accuracy and
interpretability in legal text retrieval tasks. Our framework is simple to
implement and easily integrates with existing RAG pipelines, making it a
practical solution for real-world applications under limited resources.
Authors' comments: 19 pages, 5 figures, Code available at
https://github.com/arthur422tp/hierachical
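As an illustration of retrieval without a preset k, the sketch below groups the query and the documents by single-linkage clustering and returns whichever documents land in the query's cluster. It is one possible reading of the idea, not the authors' algorithm: the function names, the cosine distance, and the `threshold` cut are assumptions, and the sketch trades k for a distance threshold because the abstract does not say how the hierarchy is cut.

```python
import math

def cosine_distance(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)

def retrieve_by_cluster(query_vec, doc_vecs, threshold=0.2):
    """Return indices of documents that share the query's single-linkage cluster."""
    points = [query_vec] + list(doc_vecs)  # index 0 is the query
    parent = list(range(len(points)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    # Linking every pair closer than `threshold` yields the same clusters as
    # cutting a single-linkage dendrogram at that height.
    for i in range(len(points)):
        for j in range(i + 1, len(points)):
            if cosine_distance(points[i], points[j]) <= threshold:
                parent[find(i)] = find(j)

    query_root = find(0)
    return [i - 1 for i in range(1, len(points)) if find(i) == query_root]

doc_vecs = [(1.0, 0.1), (0.9, 0.2), (0.0, 1.0), (0.1, 0.9)]
hits = retrieve_by_cluster((1.0, 0.0), doc_vecs)
```

The number of returned documents follows from the cluster structure: here two documents are returned, and a query with no close neighbors would return none.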
Imdad Ullah, Najm Hassan, Tariq Ahamed Ahangar, Zawar Hussain Shah, Mehregan Mahdavi, Andrew Levula
User profiling is crucial in providing personalised services, as it relies on analysing user behaviour and preferences to deliver targeted services. This approach enhances user experience and promotes heightened engagement. Nevertheless, user profiling also gives rise to noteworthy privacy considerations due to the extensive tracking and monitoring of personal data, potentially leading to surveillance or identity theft. We propose a dual-ring protection mechanism to protect user privacy by examining various threats to user privacy, such as behavioural attacks, profiling fingerprinting and monitoring, profile perturbation, etc., both on the user and service provider sides. We develop user profiles that contain sensitive private attributes and an equivalent profile based on differential privacy for evaluating personalised services. We determine the entropy of the resultant profiles during each update to protect profiling attributes and invoke various processes, such as data evaporation, to artificially increase entropy or destroy private profiling attributes. Furthermore, we use different variants of private information retrieval (PIR) to retrieve personalised services against differentially private profiles. We implement critical components of the proposed model via a proof-of-concept mobile app to demonstrate its applicability over a specific case study of advertising services, which can be generalised to other services. Our experimental results show that the observed processing delays with different PIR schemes are similar to the current advertising systems.
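To make the entropy check concrete, the sketch below computes the Shannon entropy of a profile represented as counts per interest category. The representation and the function name `profile_entropy` are assumptions for illustration; the abstract does not specify how profiles are encoded or which threshold triggers data evaporation.

```python
import math

def profile_entropy(interest_counts):
    """Shannon entropy (in bits) of a profile's interest-category distribution."""
    total = sum(interest_counts.values())
    entropy = 0.0
    for count in interest_counts.values():
        if count > 0:
            p = count / total
            entropy -= p * math.log2(p)
    return entropy

# A profile dominated by one interest is more revealing (lower entropy)
# than one spread evenly across categories.
focused = {"sports": 18, "travel": 1, "finance": 1}
uniform = {"sports": 5, "travel": 5, "finance": 5, "health": 5}
focused_bits = profile_entropy(focused)
uniform_bits = profile_entropy(uniform)
```

Under this reading, a mechanism that artificially increases entropy would push a profile such as `focused` toward the flatter shape of `uniform`.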
Larissa Mori, Carlos Sousa de Oliveira, Yuehwern Yih, Mario Ventresca
Legal passage retrieval is an important task that assists legal practitioners in the time-intensive process of finding relevant precedents to support legal arguments. This study investigates the task of retrieving legal passages or paragraphs from decisions of the Court of Justice of the European Union (CJEU), whose language is highly structured and formulaic, leading to repetitive patterns. Understanding when lexical or semantic models are more effective at handling the repetitive nature of legal language is key to developing retrieval systems that are more accurate, efficient, and transparent for specific legal domains. To this end, we explore when this routinized legal language is better suited for retrieval using methods that rely on lexical and statistical features, such as BM25, or dense retrieval models trained to capture semantic and contextual information. A qualitative and quantitative analysis with three complementary metrics shows that both lexical and dense models perform well in scenarios with more repetitive usage of language, whereas BM25 performs better than the dense models in more nuanced scenarios where repetition and verbatim quotes are less prevalent and in longer queries. Our experiments also show that BM25 is a strong baseline, surpassing off-the-shelf dense models in 4 out of 7 performance metrics. However, fine-tuning a dense model on domain-specific data led to improved performance, surpassing BM25 in most metrics, and we analyze the effect of the amount of data used in fine-tuning on the model's performance and temporal robustness. The code, dataset and appendix related to this work are available at: https://github.com/larimo/lexsem-legal-ir.
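For readers unfamiliar with the lexical baseline, here is a minimal sketch of Okapi BM25 scoring over whitespace-tokenized passages. The parameter values `k1=1.5` and `b=0.75` are common defaults assumed here, not settings reported in the abstract, and the toy passages are invented for illustration.

```python
import math
from collections import Counter

def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Score each tokenized document in `docs` against the tokenized `query`."""
    n_docs = len(docs)
    avg_len = sum(len(d) for d in docs) / n_docs
    # Document frequency: number of documents containing each term.
    df = Counter(term for d in docs for term in set(d))
    scores = []
    for d in docs:
        tf = Counter(d)
        score = 0.0
        for term in query:
            if term not in tf:
                continue
            # Smoothed IDF; the "+ 1.0" inside the log keeps it non-negative.
            idf = math.log((n_docs - df[term] + 0.5) / (df[term] + 0.5) + 1.0)
            # Term-frequency saturation with document-length normalization.
            norm = tf[term] + k1 * (1 - b + b * len(d) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores

docs = [
    "the court refers to settled case law".split(),
    "the applicant claims damages".split(),
    "according to settled case law the court".split(),
]
query = "settled case law".split()
scores = bm25_scores(query, docs)
```

Because BM25 only counts matching terms, the first and third passages receive the same score despite their different word order, which hints at why formulaic, repetitive language suits lexical matching.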
Deborah Bardet, Quentin Changeat, Olivia Venot, Emilie Panek
Constraining the chemical structure of exoplanetary atmospheres is pivotal
for interpreting spectroscopic data and understanding planetary evolution.
Traditional retrieval methods often assume thermochemical equilibrium or free
profiles, which may fail to capture disequilibrium processes like
photodissociation and vertical mixing. This study leverages the TauREx 3.1
retrieval framework coupled with FRECKLL, a disequilibrium chemistry model, to
address these challenges. The study aims to (1) assess the impact of
disequilibrium chemistry on constraining metallicity and C/O ratios; (2)
evaluate the role of refractory species (TiO and VO) in spectral retrievals;
(3) explore consistency between transit and eclipse observations for
temperature and chemical profiles; and (4) determine the effects of retrieval
priors and data reduction methods. Ten hot-Jupiter atmospheres were reanalyzed
using Hubble Space Telescope (HST) WFC3 data in eclipse and transit. The
TauREx-FRECKLL model incorporated disequilibrium chemistry calculations with a
Bayesian framework to infer atmospheric properties. The disequilibrium approach
significantly altered retrieved metallicity and C/O ratios compared to
equilibrium models, impacting planet formation insights. Retrievals reconciled
transit and eclipse temperature profiles in deeper atmospheric layers but not
in upper layers. Results were highly dependent on spectral resolution and
retrieval priors, emphasizing limitations of HST data and the need for broader
spectral coverage from instruments like JWST. This study demonstrates the
feasibility and importance of incorporating disequilibrium chemistry in
atmospheric retrievals, highlighting its potential for advancing our
understanding of exoplanetary atmospheres with next-generation telescopes.
Authors' comments: 24 pages, 22 figures, accepted for publication in Astronomy and
Astrophysics
Joon Soo Yoo, Taeho Kim, Ji Won Yoon
Location-based services often require users to share sensitive locational data, raising privacy concerns due to potential misuse or exploitation by untrusted servers. In response, we present VeLoPIR, a versatile location-based private information retrieval (PIR) system designed to preserve user privacy while enabling efficient and scalable query processing. VeLoPIR introduces three operational modes (interval validation, coordinate validation, and identifier matching) that support a broad range of real-world applications, including information and emergency alerts. To enhance performance, VeLoPIR incorporates multi-level algorithmic optimizations with parallel structures, achieving significant scalability across both CPU and GPU platforms. We also provide formal security and privacy proofs, confirming the system's robustness under standard cryptographic assumptions. Extensive experiments on real-world datasets demonstrate that VeLoPIR achieves up to 11.55 times speed-up over a prior baseline. The implementation of VeLoPIR is publicly available at https://github.com/PrivStatBool/VeLoPIR.
Mingjun Xu, Jinhan Dong, Jue Hou, Zehui Wang, Sihang Li, Zhifeng Gao, Renxin Zhong, Hengxing Cai
Multimodal document retrieval systems enable information access across text, images, and layouts, benefiting various domains like document-based question answering, report analysis, and interactive content summarization. Rerankers improve retrieval precision by reordering retrieved candidates. However, current multimodal reranking methods remain underexplored, with significant room for improvement in both training strategies and overall effectiveness. Moreover, the lack of explicit reasoning makes it difficult to analyze and optimize these methods further. In this paper, we propose MM-R5, a MultiModal Reasoning-Enhanced ReRanker via Reinforcement Learning for Document Retrieval, aiming to provide a more effective and reliable solution for multimodal reranking tasks. MM-R5 is trained in two stages: supervised fine-tuning (SFT) and reinforcement learning (RL). In the SFT stage, we focus on improving instruction-following and guiding the model to generate complete and high-quality reasoning chains. To support this, we introduce a novel data construction strategy that produces rich, high-quality reasoning data. In the RL stage, we design a task-specific reward framework, including a reranking reward tailored for multimodal candidates and a composite template-based reward to further refine reasoning quality. We conduct extensive experiments on MMDocIR, a challenging public benchmark spanning multiple domains. MM-R5 achieves state-of-the-art performance on most metrics and delivers comparable results to much larger models on the remaining ones. Moreover, compared to the best retrieval-only method, MM-R5 improves recall@1 by over 4%. These results validate the effectiveness of our reasoning-enhanced training pipeline.
Simin Sun, Yuchuan Jin, Miroslaw Staron
Developing autonomous driving systems (ADSs) involves generating and storing extensive log data from test drives, which is essential for verification, research, and simulation. However, these high-frequency logs, recorded over varying durations, pose challenges for developers attempting to locate specific driving scenarios. This difficulty arises from the wide range of signals representing various vehicle components and driving conditions, as well as some developers' unfamiliarity with the detailed meaning of these signals. Traditional SQL-based querying exacerbates this challenge by demanding both domain expertise and database knowledge, often yielding results that are difficult to verify for accuracy. This paper introduces a Large Language Model (LLM)-supported approach that combines signal log data with video recordings from test drives, enabling natural-language-based scenario searches while reducing the need for specialized knowledge. By leveraging scenario distance graphs and relative gap indicators, it provides quantifiable metrics to evaluate the reliability of query results. The method is implemented as an API for efficient database querying and retrieval of relevant records, paired with video frames for intuitive visualization. Evaluation on an open industrial dataset demonstrates improved efficiency and reliability in scenario retrieval, eliminating dependency on a single data source and conventional SQL.
Seongbo Jang, Seonghyeon Lee, Dongha Lee, Hwanjo Yu
Multimodal chatbots have become one of the major topics for dialogue systems
in both the research community and industry. Recently, researchers have shed light
on the multimodality of responses as well as dialogue contexts. This work
explores how a dialogue system can output responses in various modalities such
as text and image. To this end, we first formulate a multimodal dialogue
response retrieval task for retrieval-based systems as the combination of three
subtasks. We then propose three integration methods based on a two-step
approach and an end-to-end approach, and compare the merits and demerits of
each method. Experimental results on two datasets demonstrate that the
end-to-end approach achieves comparable performance without an intermediate
step in the two-step approach. In addition, a parameter sharing strategy not
only reduces the number of parameters but also boosts performance by
transferring knowledge across the subtasks and the modalities.
Authors' comments: 9 pages, 1 figure
Alireza Salemi, Mukta Maddipatla, Hamed Zamani
This paper presents mRAG, a multi-agent retrieval-augmented generation (RAG) framework composed of specialized agents for subtasks such as planning, searching, reasoning, and coordination. Our system uses a self-training paradigm with reward-guided trajectory sampling to optimize inter-agent collaboration and enhance response generation. Evaluated on DataMorgana-derived datasets during the SIGIR 2025 LiveRAG competition, mRAG outperforms conventional RAG baselines. We further analyze competition outcomes and showcase the framework's strengths with case studies, demonstrating its efficacy for complex, real-world RAG tasks.
Priyanka Kargupta, Runchu Tian, Jiawei Han
Claims made by individuals or entities are oftentimes nuanced and cannot be
clearly labeled as entirely "true" or "false" -- as is frequently the case with
scientific and political claims. However, a claim (e.g., "vaccine A is better
than vaccine B") can be dissected into its integral aspects and sub-aspects
(e.g., efficacy, safety, distribution), which are individually easier to
validate. This enables a more comprehensive, structured response that provides
a well-rounded perspective on a given problem while also allowing the reader to
prioritize specific angles of interest within the claim (e.g., safety towards
children). Thus, we propose ClaimSpect, a retrieval-augmented generation-based
framework for automatically constructing a hierarchy of aspects typically
considered when addressing a claim and enriching them with corpus-specific
perspectives. This structure hierarchically partitions an input corpus to
retrieve relevant segments, which assist in discovering new sub-aspects.
Moreover, these segments enable the discovery of varying perspectives towards
an aspect of the claim (e.g., support, neutral, or oppose) and their respective
prevalence (e.g., "how many biomedical papers believe vaccine A is more
transportable than B?"). We apply ClaimSpect to a wide variety of real-world
scientific and political claims featured in our constructed dataset, showcasing
its robustness and accuracy in deconstructing a nuanced claim and representing
perspectives within a corpus. Through real-world case studies and human
evaluation, we validate its effectiveness over multiple baselines.
Authors' comments: Accepted to ACL 2025 Main Conference. Code available at:
https://github.com/pkargupta/claimspect
Numaan Naeem, Sarfraz Ahmad, Momina Ahsan, Hasan Iqbal
This paper presents our system for Track 1: Mistake Identification in the BEA
2025 Shared Task on Pedagogical Ability Assessment of AI-powered Tutors. The
task involves evaluating whether a tutor's response correctly identifies a
mistake in a student's mathematical reasoning. We explore four approaches: (1)
an ensemble of machine learning models over pooled token embeddings from
multiple pretrained language models (LMs); (2) a frozen sentence-transformer
using [CLS] embeddings with an MLP classifier; (3) a history-aware model with
multi-head attention between token-level history and response embeddings; and
(4) a retrieval-augmented few-shot prompting system with a large language model
(LLM), i.e., GPT-4o. Our final system retrieves semantically similar examples,
constructs structured prompts, and uses schema-guided output parsing to produce
interpretable predictions. It outperforms all baselines, demonstrating the
effectiveness of combining example-driven prompting with LLM reasoning for
pedagogical feedback assessment. Our code is available at
https://github.com/NaumanNaeem/BEA_2025.
Authors' comments: 6 pages, 2 figures, 1 table
Liang Yin, Xudong Xie, Zhang Li, Xiang Bai, Yuliang Liu
Scene text retrieval has made significant progress with the assistance of accurate text localization. However, existing approaches typically require costly bounding box annotations for training. Besides, they mostly adopt a customized retrieval strategy but struggle to unify various types of queries to meet diverse retrieval needs. To address these issues, we introduce Multi-query Scene Text retrieval with Attention Recycling (MSTAR), a box-free approach for scene text retrieval. It incorporates progressive vision embedding to dynamically capture the multi-grained representation of texts and harmonizes free-style text queries with style-aware instructions. Additionally, a multi-instance matching module is integrated to enhance vision-language alignment. Furthermore, we build the Multi-Query Text Retrieval (MQTR) dataset, the first benchmark designed to evaluate the multi-query scene text retrieval capability of models, comprising four query types and 16k images. Extensive experiments demonstrate the superiority of our method across seven public datasets and the MQTR dataset. Notably, MSTAR marginally surpasses the previous state-of-the-art model by 6.4% in MAP on Total-Text while eliminating box annotation costs. Moreover, on the MQTR benchmark, MSTAR significantly outperforms the previous models by an average of 8.5%. The code and datasets are available at https://github.com/yingift/MSTAR.
Jing He, Yiqing Wang, Lingling Li, Kexin Zhang, Puhua Chen
This report presents ContextRefine-CLIP (CR-CLIP), an efficient model for visual-textual multi-instance retrieval tasks. The approach is based on the dual-encoder AVION, on which we introduce a cross-modal attention flow module to achieve bidirectional dynamic interaction and refinement between visual and textual features to generate more context-aware joint representations. For soft-label relevance matrices provided in tasks such as EPIC-KITCHENS-100, CR-CLIP can work with Symmetric Multi-Similarity Loss to achieve more accurate semantic alignment and optimization using the refined features. Without using ensemble learning, the CR-CLIP model achieves 66.78 mAP and 82.08 nDCG on the EPIC-KITCHENS-100 public leaderboard, which significantly outperforms the baseline model and validates its effectiveness in cross-modal retrieval. The code will be released as open source at https://github.com/delCayr/ContextRefine-Clip
Shubhashis Roy Dipta, Francis Ferraro
Recent approaches have shown impressive proficiency in extracting and leveraging parametric knowledge from Large Language Models (LLMs) and Vision-Language Models (VLMs). In this work, we consider how we can improve the identification and retrieval of videos related to complex real-world events by automatically extracting latent parametric knowledge about those events. We present Q2E: a Query-to-Event decomposition method for zero-shot multilingual text-to-video retrieval, adaptable across datasets, domains, LLMs, or VLMs. Our approach demonstrates that we can enhance the understanding of otherwise overly simplified human queries by decomposing the query using the knowledge embedded in LLMs and VLMs. We additionally show how to apply our approach to both visual and speech-based inputs. To combine this varied multimodal knowledge, we adopt entropy-based fusion scoring for zero-shot fusion. Through evaluations on two diverse datasets and multiple retrieval metrics, we demonstrate that Q2E outperforms several state-of-the-art baselines. Our evaluation also shows that integrating audio information can significantly improve text-to-video retrieval. We have released code and data for future research.
Yael Frischholz, Devis Tuia, Michael Lehning
Accurate retrieval of surface solar radiation (SSR) from satellite imagery
critically depends on estimating the background reflectance that a spaceborne
sensor would observe under clear-sky conditions. Deviations from this baseline
can then be used to detect cloud presence and guide radiative transfer models
in inferring atmospheric attenuation. Operational retrieval algorithms
typically approximate background reflectance using monthly statistics, assuming
surface properties vary slowly relative to atmospheric conditions. However,
this approach fails in mountainous regions where intermittent snow cover and
changing snow surfaces are frequent. We propose an attention-based emulator for
SSR retrieval that implicitly learns to infer clear-sky surface reflectance
from raw satellite image sequences. Built on the Temporo-Spatial Vision
Transformer, our approach eliminates the need for hand-crafted features such as
explicit albedo maps or cloud masks. The emulator is trained on instantaneous
SSR estimates from the HelioMont algorithm over Switzerland, a region
characterized by complex terrain and dynamic snow cover. Inputs include
multi-spectral SEVIRI imagery from the Meteosat Second Generation platform,
augmented with static topographic features and solar geometry. The target
variable is HelioMont's SSR, computed as the sum of its direct and diffuse
horizontal irradiance components, given at a spatial resolution of 1.7 km. We
show that, when provided a sufficiently long temporal context, the model
matches the performance of albedo-informed models, highlighting the model's
ability to internally learn and exploit latent surface reflectance dynamics.
Our geospatial analysis shows this effect is most powerful in mountainous
regions and improves generalization in both simple and complex topographic
settings. Code and datasets are publicly available at
https://github.com/frischwood/HeMu-dev.git
Authors' comments: 14 pages, 7 figures
Wuwei Zhang, Fangcong Yin, Howard Yen, Danqi Chen, Xi Ye
Recent work has identified retrieval heads (Wu et al., 2025b), a subset of attention heads responsible for retrieving salient information in long-context language models (LMs), as measured by their copy-paste behavior in Needle-in-a-Haystack tasks. In this paper, we introduce QRHEAD (Query-Focused Retrieval Head), an improved set of attention heads that enhance retrieval from long context. We identify QRHEAD by aggregating attention scores with respect to the input query, using a handful of examples from real-world tasks (e.g., long-context QA). We further introduce QRRETRIEVER, an efficient and effective retriever that uses the accumulated attention mass of QRHEAD as retrieval scores. We use QRRETRIEVER for long-context reasoning by selecting the most relevant parts with the highest retrieval scores. On multi-hop reasoning tasks LongMemEval and CLIPPER, this yields over 10% performance gains over full context and outperforms strong dense retrievers. We also evaluate QRRETRIEVER as a re-ranker on the BEIR benchmark and find that it achieves strong zero-shot performance, outperforming other LLM-based re-rankers such as RankGPT. Further analysis shows that both the query-context attention scoring and task selection are crucial for identifying QRHEAD with strong downstream utility. Overall, our work contributes a general-purpose retriever and offers interpretability insights into the long-context capabilities of LMs.
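The idea of using accumulated attention mass as a retrieval score can be sketched on toy data as follows. This is an illustration under stated assumptions rather than the paper's implementation: the function name, the nested-list layout `attn[head][query_token][context_token]`, and the averaging over heads and query tokens are hypothetical choices, and the attention weights are invented.

```python
def attention_retrieval_scores(attn, doc_spans):
    """Score each document by the attention mass its tokens receive from the query.

    attn[h][q][t]: attention weight from query token q to context token t in head h.
    doc_spans: list of (start, end) context-token ranges, end exclusive.
    """
    n_rows = len(attn) * len(attn[0])  # heads x query tokens
    scores = []
    for start, end in doc_spans:
        mass = 0.0
        for head in attn:
            for query_row in head:
                mass += sum(query_row[start:end])
        # Average over heads and query tokens so scores are comparable.
        scores.append(mass / n_rows)
    return scores

# Toy weights: 2 selected heads, 2 query tokens, 6 context tokens
# (two documents of 3 tokens each). Each row sums to 1.
attn = [
    [[0.05, 0.05, 0.10, 0.30, 0.30, 0.20],
     [0.10, 0.10, 0.10, 0.20, 0.30, 0.20]],
    [[0.10, 0.00, 0.10, 0.40, 0.20, 0.20],
     [0.05, 0.05, 0.00, 0.30, 0.30, 0.30]],
]
doc_spans = [(0, 3), (3, 6)]
scores = attention_retrieval_scores(attn, doc_spans)
best_doc = max(range(len(scores)), key=scores.__getitem__)
```

In a real model, the `attn` values would come from the attention maps of the selected heads, and the highest-scoring spans would be kept as the reduced context.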
Tianjun Yao, Haoxuan Li, Zhiqiang Shen, Pan Li, Tongliang Liu, Kun Zhang
Large Language Models (LLMs) have shown strong inductive reasoning ability
across various domains, but their reliability is hindered by outdated
knowledge and hallucinations. Retrieval-Augmented Generation mitigates these
issues by grounding LLMs with external knowledge; however, most existing RAG
pipelines rely on unstructured text, limiting interpretability and structured
reasoning. Knowledge graphs, which represent facts as relational triples, offer
a more structured and compact alternative. Recent studies have explored
integrating knowledge graphs with LLMs for knowledge graph question answering
(KGQA), with a significant proportion adopting the retrieve-then-reasoning
paradigm. In this framework, graph-based retrievers have demonstrated strong
empirical performance, yet they still face challenges in generalization
ability. In this work, we propose RAPL, a novel framework for efficient and
effective graph retrieval in KGQA. RAPL addresses these limitations through
three aspects: (1) a two-stage labeling strategy that combines heuristic
signals with parametric models to provide causally grounded supervision; (2) a
model-agnostic graph transformation approach to capture both intra- and
inter-triple interactions, thereby enhancing representational capacity; and (3)
a path-based reasoning strategy that facilitates learning from the injected
rational knowledge and supports the downstream reasoner through structured inputs.
Empirically, RAPL outperforms state-of-the-art methods by 2.66%-20.34%, and
significantly reduces the performance gap between smaller and more powerful
LLM-based reasoners, as well as the gap under cross-dataset settings,
highlighting its superior retrieval capability and generalizability. Code is
available at: https://github.com/tianyao-aka/RAPL.
Authors' comments: 32 pages, 28 figures
Hyeongcheol Park, MinHyuk Jang, Ha Dam Baek, Gyusam Chang, Jiyoung Seo, Jiwan Park, Hogun Park, Sangpil Kim
Multimodal Knowledge Graphs (MMKGs), which represent explicit knowledge across multiple modalities, play a pivotal role by complementing the implicit knowledge of Multimodal Large Language Models (MLLMs) and enabling more grounded reasoning via Retrieval Augmented Generation (RAG). However, existing MMKGs are generally limited in scope: they are often constructed by augmenting pre-existing knowledge graphs, which restricts their knowledge, resulting in outdated or incomplete knowledge coverage, and they often support only a narrow range of modalities, such as text and visual information. These limitations reduce their extensibility and applicability to a broad range of multimodal tasks, particularly as the field shifts toward richer modalities such as video and audio in recent MLLMs. Therefore, we propose the Visual-Audio-Text Knowledge Graph (VAT-KG), the first concept-centric and knowledge-intensive multimodal knowledge graph that covers visual, audio, and text information, where each triplet is linked to multimodal data and enriched with detailed descriptions of concepts. Specifically, our construction pipeline ensures cross-modal knowledge alignment between multimodal data and fine-grained semantics through a series of stringent filtering and alignment steps, enabling the automatic generation of MMKGs from any multimodal dataset. We further introduce a novel multimodal RAG framework that retrieves detailed concept-level knowledge in response to queries from arbitrary modalities. Experiments on question answering tasks across various modalities demonstrate the effectiveness of VAT-KG in supporting MLLMs, highlighting its practical value in unifying and leveraging multimodal knowledge.
Authors' comments: Project Page: https://vatkg.github.io/