Xuan Xu, Beilin Chu, Qinhong Lin, Yixiao Zhong, Fufang Wen, Jiaqi Liu, Binjie Fei, Yu Li et al.
In recent years, large language models (LLMs) have demonstrated significant potential in constructing passage retrieval datasets. However, existing methods still face limitations in expressing cross-doc query needs and controlling annotation quality. To address these issues, this paper proposes a bidirectional generation pipeline, which aims to generate 3-level hierarchical queries for both intra-doc and cross-doc scenarios and mine additional relevance labels on top of direct mapping annotation. The pipeline introduces two query generation methods: bottom-up from single-doc text and top-down from multi-doc titles. The bottom-up method uses LLMs to disassemble and generate structured queries at both sentence-level and passage-level simultaneously from intra-doc passages. The top-down approach incorporates three key financial elements--industry, topic, and time--to divide report titles into clusters and prompts LLMs to generate topic-level queries from each cluster. For relevance annotation, our pipeline not only relies on direct mapping annotation from the generation relationship but also implements an indirect positives mining method to enrich the relevant query-passage pairs. Using this pipeline, we constructed a Financial Passage Retrieval Generated dataset (FinCPRG) from almost 1.3k Chinese financial research reports, which includes hierarchical queries and rich relevance labels. Through evaluations of mined relevance labels, benchmarking and training experiments, we assessed the quality of FinCPRG and validated its effectiveness as a passage retrieval dataset for both training and benchmarking.
Jeiyoon Park, Yongshin Han, Minseop Kim, Kisu Yang
We propose AMADEUS, which is composed of Adaptive Context-aware Text Splitter
(ACTS), Guided Selection (GS), and Attribute Extractor (AE). ACTS finds an
optimal chunk length and hierarchical contexts for each character. AE
identifies a character's general attributes from the chunks retrieved by GS and
uses these attributes as a final context to maintain robust persona consistency
even when answering out-of-knowledge questions. To facilitate the development
and evaluation of RAG-based RPAs, we construct CharacterRAG, a role-playing
dataset that consists of persona documents for 15 distinct fictional characters
totaling 976K written characters, and 450 question and answer pairs. We find
that our framework effectively models not only the knowledge possessed by
characters, but also various attributes such as personality.
Authors' comments: preprint
Junjie Wu, Jiangnan Li, Yuqing Li, Lemao Liu, Liyan Xu, Jiwei Li, Dit-Yan Yeung, Jie Zhou et al.
Retrieval-augmented generation (RAG) over long documents typically involves
splitting the text into smaller chunks, which serve as the basic units for
retrieval. However, due to dependencies across the original document,
contextual information is often essential for accurately interpreting each
chunk. To address this, prior work has explored encoding longer context windows
to produce embeddings for longer chunks. Despite these efforts, gains in
retrieval and downstream tasks remain limited. This is because (1) longer
chunks strain the capacity of embedding models due to the increased amount of
information they must encode, and (2) many real-world applications still
require returning localized evidence due to constraints on model or human
bandwidth.
We propose an alternative approach to this challenge by representing short
chunks in a way that is conditioned on a broader context window to enhance
retrieval performance -- i.e., situating a chunk's meaning within its context.
We further show that existing embedding models are not well-equipped to encode
such situated context effectively, and thus introduce a new training paradigm
and develop the situated embedding models (SitEmb). To evaluate our method, we
curate a book-plot retrieval dataset specifically designed to assess situated
retrieval capabilities. On this benchmark, our SitEmb-v1 model based on BGE-M3
substantially outperforms state-of-the-art embedding models, including several
with up to 7-8B parameters, with only 1B parameters. Our 8B SitEmb-v1.5 model
further improves performance by over 10% and shows strong results across
different languages and several downstream applications.
Authors' comments: Our trained models can be downloaded from:
https://huggingface.co/SituatedEmbedding
Jaskaranjeet Singh, Rakesh Thakur
Despite the rapid advancement of large language models (LLMs), low-resource languages remain largely excluded from the NLP landscape. We present PunGPT2, the first fully open-source suite of Punjabi large language models, trained from scratch on a 35GB domain-diverse corpus encompassing literature, religious texts, news, and social discourse. Unlike prior multilingual approaches, PunGPT2 captures rich syntactic and morphological features unique to Punjabi through a tokenizer optimised with byte pair encoding and linguistically aligned pretraining objectives. To improve factual grounding and domain recall, we introduce Pun-RAG, a retrieval-augmented generation framework combining PunGPT2 with a dense FAISS retriever over a curated Punjabi knowledge base. We further develop Pun-Instruct, a parameter-efficient, instruction-tuned variant using QLoRA, enabling robust zero-shot and instruction-following performance with significantly reduced compute needs. As a key innovation, we propose Quantum-RAG, a novel hybrid retrieval system that fuses sparse (BM25) and dense methods with quantum-inspired semantic matching. By encoding queries using amplitude-based embeddings and retrieving via quantum kernel similarity, Quantum-RAG achieves improved contextual relevance with minimal memory overhead, marking the first practical integration of quantum representations in low-resource language generation. Our models significantly outperform strong multilingual baselines (mBERT, mT5, MuRIL) in perplexity, factuality, and fluency. This work provides a scalable, reproducible blueprint for extending LLM capabilities to underrepresented languages and pioneers quantum-aware retrieval in low-resource NLP.
Dong Li, Yichen Niu, Ying Ai, Xiang Zou, Biqing Qi, Jianxing Liu
Large language models (LLMs) have demonstrated strong performance in natural language generation but remain limited in knowledge-intensive tasks due to outdated or incomplete internal knowledge. Retrieval-Augmented Generation (RAG) addresses this by incorporating external retrieval, with GraphRAG further enhancing performance through structured knowledge graphs and multi-hop reasoning. However, existing GraphRAG methods largely ignore the temporal dynamics of knowledge, leading to issues such as temporal ambiguity, time-insensitive retrieval, and semantic redundancy. To overcome these limitations, we propose Temporal GraphRAG (T-GRAG), a dynamic, temporally-aware RAG framework that models the evolution of knowledge over time. T-GRAG consists of five key components: (1) a Temporal Knowledge Graph Generator that creates time-stamped, evolving graph structures; (2) a Temporal Query Decomposition mechanism that breaks complex temporal queries into manageable sub-queries; (3) a Three-layer Interactive Retriever that progressively filters and refines retrieval across temporal subgraphs; (4) a Source Text Extractor to mitigate noise; and (5) an LLM-based Generator that synthesizes contextually and temporally accurate responses. We also introduce Time-LongQA, a novel benchmark dataset based on real-world corporate annual reports, designed to test temporal reasoning across evolving knowledge. Extensive experiments show that T-GRAG significantly outperforms prior RAG and GraphRAG baselines in both retrieval accuracy and response relevance under temporal constraints, highlighting the necessity of modeling knowledge evolution for robust long-text question answering. Our code is publicly available on the T-GRAG
Zeyu Xu, Junkang Zhang, Qiang Wang, Yi Liu
Vision-Language Models (VLMs) have enabled substantial progress in video understanding by leveraging cross-modal reasoning capabilities. However, their effectiveness is limited by the restricted context window and the high computational cost required to process long videos with thousands of frames. Retrieval-augmented generation (RAG) addresses this challenge by selecting only the most relevant frames as input, thereby reducing the computational burden. Nevertheless, existing video RAG methods struggle to balance retrieval efficiency and accuracy, particularly when handling diverse and complex video content. To address these limitations, we propose E-VRAG, a novel and efficient video RAG framework for video understanding. We first apply a frame pre-filtering method based on hierarchical query decomposition to eliminate irrelevant frames, reducing computational costs at the data level. We then employ a lightweight VLM for frame scoring, further reducing computational costs at the model level. Additionally, we propose a frame retrieval strategy that leverages the global statistical distribution of inter-frame scores to mitigate the potential performance degradation from using a lightweight VLM. Finally, we introduce a multi-view question answering scheme for the retrieved frames, enhancing the VLM's capability to extract and comprehend information from long video contexts. Experiments on four public benchmarks show that E-VRAG achieves about 70% reduction in computational cost and higher accuracy compared to baseline methods, all without additional training. These results demonstrate the effectiveness of E-VRAG in improving both efficiency and accuracy for video RAG tasks.
Minjeong Park, Hongbeen Park, Sangwon Lee, Yoonha Jang, Jinkyu Kim
Pedestrian Attribute Recognition (PAR) plays a crucial role in various vision tasks such as person retrieval and identification. Most existing attribute-based retrieval methods operate under the closed-set assumption that all attribute classes are consistently available during both training and inference. However, this assumption limits their applicability in real-world scenarios where novel attributes may emerge. Moreover, predefined attributes in benchmark datasets are often generic and shared across individuals, making them less discriminative for retrieving the target person. To address these challenges, we propose the Open-Attribute Recognition for Person Retrieval (OAPR) task, which aims to retrieve individuals based on attribute cues, regardless of whether those attributes were seen during training. To support this task, we introduce a novel framework designed to learn generalizable body part representations that cover a broad range of attribute categories. Furthermore, we reconstruct four widely used datasets for open-attribute recognition. Comprehensive experiments on these datasets demonstrate the necessity of the OAPR task and the effectiveness of our framework. The source code and pre-trained models will be publicly available upon publication.
Bingshen Mu, Hexin Liu, Hongfei Xue, Kun Wei, Lei Xie
Automatic Speech Recognition (ASR) aims to convert human speech content into corresponding text. In conversational scenarios, effectively utilizing context can enhance its accuracy. Large Language Models' (LLMs) exceptional long-context understanding and reasoning abilities enable LLM-based ASR (LLM-ASR) to leverage historical context for recognizing conversational speech, which has a high degree of contextual relevance. However, existing conversational LLM-ASR methods use a fixed number of preceding utterances or the entire conversation history as context, resulting in significant ASR confusion and computational costs due to massive irrelevant and redundant information. This paper proposes a multi-modal retrieval-and-selection method named MARS that augments conversational LLM-ASR by enabling it to retrieve and select the most relevant acoustic and textual historical context for the current utterance. Specifically, multi-modal retrieval obtains a set of candidate historical contexts, each exhibiting high acoustic or textual similarity to the current utterance. Multi-modal selection calculates the acoustic and textual similarities for each retrieved candidate historical context and, by employing our proposed near-ideal ranking method to consider both similarities, selects the best historical context. Evaluations on the Interspeech 2025 Multilingual Conversational Speech Language Model Challenge dataset show that the LLM-ASR, when trained on only 1.5K hours of data and equipped with MARS, outperforms the state-of-the-art top-ranking system trained on 179K hours of data.
Shubham Kumar Nigam, Tanmay Dubey, Noel Shallum, Arnab Bhattacharya
Legal precedent retrieval is a cornerstone of the common law system, governed by the principle of stare decisis, which demands consistency in judicial decisions. However, the growing complexity and volume of legal documents challenge traditional retrieval methods. TraceRetriever mirrors real-world legal search by operating with limited case information, extracting only rhetorically significant segments instead of requiring complete documents. Our pipeline integrates BM25, Vector Database, and Cross-Encoder models, combining initial results through Reciprocal Rank Fusion before final re-ranking. Rhetorical annotations are generated using a Hierarchical BiLSTM CRF classifier trained on Indian judgments. Evaluated on IL-PCR and COLIEE 2025 datasets, TraceRetriever addresses growing document volume challenges while aligning with practical search constraints, providing a reliable and scalable foundation for precedent retrieval and enhancing legal research when only partial case knowledge is available.
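The fusion step named in the abstract above can be illustrated with a minimal sketch of Reciprocal Rank Fusion. This is not the authors' implementation: the function name, the case IDs, and the smoothing constant k=60 (a commonly used default) are assumptions made for illustration only.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of document IDs into a single ranking.

    Each document receives 1 / (k + rank) from every list it appears in,
    where rank starts at 1. k=60 is an assumed default, not a value taken
    from the abstract.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties are broken by document ID.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


# Hypothetical outputs of a lexical (BM25) and a vector retriever.
bm25_ranking = ["case_a", "case_b", "case_c"]
vector_ranking = ["case_b", "case_d", "case_a"]

fused = reciprocal_rank_fusion([bm25_ranking, vector_ranking])
print(fused)
```

Documents ranked well by both retrievers rise to the top of the fused list, which would then be passed to a re-ranker such as a cross-encoder.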
Roie Kazoom, Ofir Cohen, Rami Puzis, Asaf Shabtai, Ofer Hadar
We introduce VAULT, a fully automated adversarial RAG pipeline that systematically uncovers and remedies weaknesses in NLI models through three stages: retrieval, adversarial generation, and iterative retraining. First, we perform balanced few-shot retrieval by embedding premises with both semantic (BGE) and lexical (BM25) similarity. Next, we assemble these contexts into LLM prompts to generate adversarial hypotheses, which are then validated by an LLM ensemble for label fidelity. Finally, the validated adversarial examples are injected back into the training set at increasing mixing ratios, progressively fortifying a zero-shot RoBERTa-base model. On standard benchmarks, VAULT elevates RoBERTa-base accuracy from 88.48% to 92.60% on SNLI (+4.12%), from 75.04% to 80.95% on ANLI (+5.91%), and from 54.67% to 71.99% on MultiNLI (+17.32%). It also consistently outperforms prior in-context adversarial methods by up to 2.0% across datasets. By automating high-quality adversarial data curation at scale, VAULT enables rapid, human-independent robustness improvements in NLI inference tasks.
Stefan Englmeier, Max A. Büttner, Katharina Winter, Fabian B. Flohr
Autonomous driving systems must operate reliably in safety-critical
scenarios, particularly those involving unusual or complex behavior by
Vulnerable Road Users (VRUs). Identifying these edge cases in driving datasets
is essential for robust evaluation and generalization, but retrieving such rare
human behavior scenarios within the long tail of large-scale datasets is
challenging. To support targeted evaluation of autonomous driving systems in
diverse, human-centered scenarios, we propose a novel context-aware motion
retrieval framework. Our method combines Skinned Multi-Person Linear
(SMPL)-based motion sequences and corresponding video frames before encoding
them into a shared multimodal embedding space aligned with natural language.
Our approach enables the scalable retrieval of human behavior and their context
through text queries. This work also introduces our dataset WayMoCo, an
extension of the Waymo Open Dataset. It contains automatically labeled motion
and scene context descriptions derived from generated pseudo-ground-truth SMPL
sequences and corresponding image data. Our approach outperforms
state-of-the-art models by up to 27.5% accuracy in motion-context retrieval,
when evaluated on the WayMoCo dataset.
Authors' comments: 9 pages, 10 figures, project page
https://iv.ee.hm.edu/contextmotionclip/, submitted to IEEE Transactions on
Intelligent Vehicles (T-IV), This work has been submitted to the IEEE for
possible publication
Thomas Konings, Linus Heinke, Robin Baeyens, Kaustubh Hakim, Valentin Christiaens, Leen Decin
Observations of WASP-107b suggest a metal-rich and carbon-deprived atmosphere
with an extremely hot interior based on detections of SO$_2$, H$_2$O, CO$_2$,
CO, NH$_3$, and CH$_4$. In this paper, we aim to determine the reliability of a
1D radiative-convective photochemical-equilibrium (1D-RCPE) retrieval method in
inferring atmospheric properties of WASP-107b. Our grid of radiative-convective
balanced pressure-temperature profiles and 1D photochemical equilibrated models
covers a range of metallicities (Z), carbon-to-oxygen ratios (C/O), intrinsic
temperatures (T$_{int}$), and eddy diffusion coefficients (K$_{zz}$). We obtain
good fits with our 1D-RCPE retrievals based on a few molecular features of
H$_2$O, CO$_2$, SO$_2$, and CH$_4$, but find no substantial contribution of
NH$_3$. We find that the degeneracy between metallicity, cloud pressure, and a
model offset is broken by the presence of strong SO$_2$ features, confirming
that SO$_2$ is a robust metallicity indicator. We systematically retrieve
sub-solar C/O based on the relative amplitude of a strong CO$_2$ feature with
respect to the broad band of H$_2$O, which is sensitive to a
wavelength-dependent scattering slope. We find that high-altitude clouds
obscure the CH$_4$-rich layers, preventing the retrievals from constraining
T$_{int}$, but that higher values of K$_{zz}$ can transport material above the
cloud deck, allowing a fit of the CH$_4$ feature. However, T$_{int}$ and
K$_{zz}$ can vary substantially between retrievals depending on the adopted
cloud parametrization. We conclude that the 1D-RCPE retrieval method can
provide useful insights if the underlying grid of forward models is well
understood. We find that WASP-107b's atmosphere is enriched in metals (3 to 5
times solar) and carbon-deprived (C/O <= 0.20). However, we lack robust
constraints on the intrinsic temperature and vertical mixing strength.
Authors' comments: 30 pages, 18 figures, 5 tables, accepted by A&A
Sarah G. A. Barbosa, Raissa Estrela, Paulo C. F. da Silva Filho, Daniel B. de Freitas
Upcoming direct-imaging missions like the Habitable Worlds Observatory (HWO)
aim to characterize dozens of Earth-like exoplanets by capturing their
reflected-light spectra. However, traditional atmospheric retrieval frameworks
are too computationally intensive to explore the high-dimensional parameter
spaces such missions will generate. Here, we present a one-dimensional
convolutional neural network (1D CNN), trained on over one million synthetic,
noise-injected spectra simulating Archean, Proterozoic, and Modern Earth
analogs, as observed by LUVOIR-B (0.2-2.0 $\mu$m) and HabEx/SS (0.2-1.8
$\mu$m). Our model simultaneously infers six molecular abundances (including
biosignatures O$_2$ and O$_3$) along with radius, gravity, surface pressure,
and temperature. Inference on unseen test data is performed via Monte Carlo
Dropout, enabling uncertainty estimation across thousands of realizations
within seconds. The network performs best where spectral features are
prominent, accurately recovering CH$_4$ and CO$_2$ in Archean atmospheres and
O$_2$ and O$_3$ in Modern cases, while avoiding false positives and outputting
near-zero abundances in scenarios of true absence such as Archean O$_2$ and
O$_3$. Interpretation via Integrated Gradients confirms that the model bases
its predictions on physically meaningful features, including the Fraunhofer A
band for O$_2$, and the Hartley-Huggins band for O$_3$. Credibility curve
analysis indicates that O$_3$ remains retrievable across a wide range of
stellar types and distances, while O$_2$ is detectable out to 12 pc around FG
stars. These results elevate the CNN from proof of concept to a mission-ready
retrieval engine, capable of processing direct-imaging spectra with HWO on an
operational cadence.
Authors' comments: 19 pages, 9 figures, submitted to MNRAS
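The Monte Carlo Dropout inference mentioned in the abstract above can be sketched as follows. This is a toy illustration only: it uses a tiny randomly initialised fully connected network rather than the paper's 1D CNN, and every name, layer size, and the dropout rate are assumptions. The point it shows is that dropout stays active at inference time, so repeated forward passes give a spread of predictions from which a mean and an uncertainty can be read.

```python
import random
import statistics


def forward(x, w1, w2, drop_prob, rng):
    """One stochastic forward pass of a tiny ReLU network with dropout kept on."""
    hidden = []
    for unit_weights in w1:
        activation = max(0.0, sum(xi * wi for xi, wi in zip(x, unit_weights)))
        keep = rng.random() >= drop_prob
        # Inverted dropout: rescale kept units so the expected activation is unchanged.
        hidden.append(activation / (1.0 - drop_prob) if keep else 0.0)
    return sum(h * w for h, w in zip(hidden, w2))


def mc_dropout_predict(x, w1, w2, drop_prob=0.2, n_samples=500, seed=0):
    """Return the mean and standard deviation over n_samples stochastic passes."""
    rng = random.Random(seed)
    samples = [forward(x, w1, w2, drop_prob, rng) for _ in range(n_samples)]
    return statistics.fmean(samples), statistics.pstdev(samples)


# Hypothetical input "spectrum" with 8 bins and random (untrained) weights.
init = random.Random(42)
x = [init.gauss(0, 1) for _ in range(8)]
w1 = [[init.gauss(0, 1) for _ in range(8)] for _ in range(16)]
w2 = [init.gauss(0, 1) for _ in range(16)]

mean, std = mc_dropout_predict(x, w1, w2)
print(f"prediction = {mean:.3f} +/- {std:.3f}")
```

With the dropout rate set to zero every pass is identical and the reported spread collapses to zero, which is a quick sanity check that the uncertainty comes from the dropout masks alone.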
Chuanyue Yu, Kuo Zhao, Yuhan Li, Heng Chang, Mingjian Feng, Xiangzhe Jiang, Yufei Sun, Jia Li et al.
Graph Retrieval-Augmented Generation (GraphRAG) has shown great effectiveness in enhancing the reasoning abilities of LLMs by leveraging graph structures for knowledge representation and modeling complex real-world relationships. However, existing GraphRAG methods still face significant bottlenecks when handling complex problems that require multi-hop reasoning, as their query and retrieval phases are largely based on pre-defined heuristics and do not fully utilize the reasoning potentials of LLMs. To address this problem, we propose GraphRAG-R1, an adaptive GraphRAG framework by training LLMs with process-constrained outcome-based reinforcement learning (RL) to enhance the multi-hop reasoning ability. Our method can decompose complex problems, autonomously invoke retrieval tools to acquire necessary information, and perform effective reasoning. Specifically, we utilize a modified version of Group Relative Policy Optimization (GRPO) that supports rollout-with-thinking capability. Next, we design two process-constrained reward functions. To handle the shallow retrieval problem, we design a Progressive Retrieval Attenuation (PRA) reward to encourage essential retrievals. Then, to handle the over-thinking problem, we design Cost-Aware F1 (CAF) reward to balance the model performance with computational costs. We further design a phase-dependent training strategy, containing three training stages corresponding to cold start and these two rewards. Lastly, our method adopts a hybrid graph-textual retrieval to improve the reasoning capacity. Extensive experimental results demonstrate that GraphRAG-R1 boosts LLM capabilities in solving complex reasoning problems compared to state-of-the-art GraphRAG methods on both in-domain and out-of-domain datasets. Furthermore, our framework can be flexibly integrated with various existing retrieval methods, consistently delivering performance improvements.
Keer Lu, Zheng Liang, Youquan Li, Jiejun Tan, Da Pan, Shusen Zhang, Guosheng Dong, Huang Leng
In medical scenarios, effectively retrieving external knowledge and leveraging it for rigorous logical reasoning is of significant importance. Despite their potential, existing work has predominantly focused on enhancing either retrieval or reasoning capabilities of the models in isolation, with little attention given to their joint optimization, which leads to limited coordination between the two processes. Additionally, current methods rely heavily on supervised fine-tuning (SFT), which can cause models to memorize existing problem-solving pathways, thereby restricting their generalization ability when confronted with novel problem contexts. Furthermore, while some studies have explored improving retrieval-augmented reasoning in general domains via reinforcement learning, their reward function designs do not adequately capture the specific demands of the medical domain. To address these challenges, we introduce **Med-R$^3$**, a **Med**ical **R**etrieval-augmented **R**easoning framework driven by progressive **R**einforcement learning. In this framework, we first develop the model's ability to perform logical reasoning over medical problems. Subsequently, on the basis of this foundation, we adaptively optimize the retrieval capability to better align with the characteristics of the knowledge corpus and external information utilization throughout the reasoning process. Finally, we conduct joint optimization of the model's retrieval and reasoning coordination. Extensive experiments indicate that **Med-R$^3$** could achieve state-of-the-art performances, with LLaMA3.1-8B-Instruct + Med-R$^3$ surpassing closed-source GPT-4o-mini by 3.93\% at a comparable parameter scale, while Qwen2.5-14B augmented with Med-R$^3$ shows a more substantial gain of 13.53\%.
Dohwan Ko, Ji Soo Lee, Minhyuk Choi, Zihang Meng, Hyunwoo J. Kim
Text-Video Retrieval aims to find the most relevant text (or video) candidate
given a video (or text) query from large-scale online databases. Recent work
leverages multi-modal large language models (MLLMs) to improve retrieval,
especially for long or complex query-candidate pairs. However, we observe that
the naive application of MLLMs, i.e., retrieval based on candidate likelihood,
introduces candidate prior bias, favoring candidates with inherently higher
priors over those more relevant to the query. To this end, we propose a novel
retrieval framework, Bidirectional Likelihood Estimation with MLLM (BLiM),
which leverages both query and candidate likelihoods by training the model to
generate text from a given video as well as video features from a given text.
Furthermore, we introduce Candidate Prior Normalization (CPN), a simple yet
effective training-free score calibration module designed to mitigate candidate
prior bias in candidate likelihood. On four Text-Video Retrieval benchmarks,
our BLiM equipped with CPN outperforms previous state-of-the-art models by 6.4
R@1 on average, effectively alleviating candidate prior bias and emphasizing
query-candidate relevance. Our in-depth analysis across various multi-modal
tasks beyond retrieval highlights the broad applicability of CPN which enhances
visual understanding by reducing reliance on textual priors. Code is available
at https://github.com/mlvlab/BLiM.
Authors' comments: ICCV 2025 Highlight
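The candidate prior bias described in the abstract above can be made concrete with a small sketch. The abstract does not give the formula for Candidate Prior Normalization, so the calibration below (subtracting a scaled log prior from the candidate log-likelihood) is an assumed form chosen for illustration, and the function name, the weight alpha, and all numbers are hypothetical.

```python
def calibrated_score(log_p_cand_given_query, log_p_cand_prior, alpha=1.0):
    """Assumed calibration: penalise candidates that are likely regardless of the query.

    log_p_cand_given_query: log-likelihood of the candidate given the query.
    log_p_cand_prior:       log-likelihood of the candidate with no query at all.
    alpha:                  hypothetical strength of the prior correction.
    """
    return log_p_cand_given_query - alpha * log_p_cand_prior


# Hypothetical log-likelihoods for two text candidates given the same video query.
candidates = {
    "generic caption": {"given_query": -2.0, "prior": -2.5},
    "specific caption": {"given_query": -3.0, "prior": -6.0},
}

# Ranking by raw candidate likelihood favours the high-prior generic caption.
raw_best = max(candidates, key=lambda c: candidates[c]["given_query"])

# After the assumed calibration, the query-specific caption wins.
calibrated_best = max(
    candidates,
    key=lambda c: calibrated_score(candidates[c]["given_query"], candidates[c]["prior"]),
)

print(raw_best, "->", calibrated_best)
```

Because the correction only rescales scores that are already computed, a calibration of this kind needs no additional training, in line with the training-free description in the abstract.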
Sungguk Cha, DongWook Kim, Taeseung Hahn, Mintae Kim, Youngsub Han, Byoung-Ki Jeon
Retrieval-Augmented Generation (RAG) systems rely heavily on effective query formulation to unlock external knowledge, yet optimizing queries for diverse, unstructured real-world documents remains a challenge. We introduce \textbf{RL-QR}, a reinforcement learning framework for retriever-specific query rewriting that eliminates the need for human-annotated datasets and extends applicability to both text-only and multi-modal databases. By synthesizing scenario-question pairs and leveraging Generalized Reward Policy Optimization (GRPO), RL-QR trains query rewriters tailored to specific retrievers, enhancing retrieval performance across varied domains. Experiments on industrial in-house data demonstrate significant improvements, with $\text{RL-QR}_{\text{multi-modal}}$ achieving an 11\% relative gain in NDCG@3 for multi-modal RAG and $\text{RL-QR}_{\text{lexical}}$ yielding a 9\% gain for lexical retrievers. However, challenges persist with semantic and hybrid retrievers, where rewriters failed to improve performance, likely due to training misalignments. Our findings highlight RL-QR's potential to revolutionize query optimization for RAG systems, offering a scalable, annotation-free solution for real-world retrieval tasks, while identifying avenues for further refinement in semantic retrieval contexts.
Yufei Chen, Yao Wang, Haibin Zhang, Tao Gu
Retrieval-augmented generation (RAG) systems enhance large language models (LLMs) by integrating external knowledge bases, but this advancement introduces significant privacy risks. Existing privacy attacks on RAG systems can trigger data leakage but often fail to accurately isolate knowledge-base-derived sentences within mixed responses. They also lack robustness when applied across multiple domains. This paper addresses these challenges by presenting a novel black-box attack framework that exploits knowledge asymmetry between RAG and standard LLMs to achieve fine-grained privacy extraction across heterogeneous knowledge landscapes. We propose a chain-of-thought reasoning strategy that creates adaptive prompts to steer RAG systems away from sensitive content. Specifically, we first decompose adversarial queries to maximize information disparity and then apply a semantic relationship scoring to resolve lexical and syntactic ambiguities. We finally train a neural network on these feature scores to precisely identify sentences containing private information. Unlike prior work, our framework generalizes to unseen domains through iterative refinement without pre-defined knowledge. Experimental results show that we achieve over 91% privacy extraction rate in single-domain and 83% in multi-domain scenarios, reducing sensitive sentence exposure by over 65% in case studies. This work bridges the gap between attack and defense in RAG systems, enabling precise extraction of private information while providing a foundation for adaptive mitigation.
Hyeon Seong Jeong, Sangwoo Jo, Byeong Hyun Yoon, Yoonseok Heo, Haedong Jeong, Taehoon Kim
Understanding complex multimodal documents remains challenging due to their structural inconsistencies and limited training data availability. We introduce \textit{DocsRay}, a training-free document understanding system that integrates pseudo Table of Contents (TOC) generation with hierarchical Retrieval-Augmented Generation (RAG). Our approach leverages multimodal Large Language Models' (LLMs) native capabilities to seamlessly process documents containing diverse elements such as text, images, charts, and tables without requiring specialized models or additional training. DocsRay's framework synergistically combines three key techniques: (1) a semantic structuring module using prompt-based LLM interactions to generate a hierarchical pseudo-TOC, (2) zero-shot multimodal analysis that converts diverse document elements into unified, text-centric representations using the inherent capabilities of multimodal LLMs, and (3) an efficient two-stage hierarchical retrieval system that reduces retrieval complexity from $O(N)$ to $O(S + k_1 \cdot N_s)$. Evaluated on documents averaging 49.4 pages and 20,971 textual tokens, DocsRay reduced query latency from 3.89 to 2.12 seconds, achieving a 45% efficiency improvement. On the MMLongBench-Doc benchmark, DocsRay-Pro attains an accuracy of 64.7%, substantially surpassing previous state-of-the-art results.
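The two-stage hierarchical retrieval behind the stated cost reduction from $O(N)$ to $O(S + k_1 \cdot N_s)$ can be sketched as below. The abstract does not define the symbols, so this sketch assumes S is the number of sections, k_1 the number of sections kept after the first stage, and N_s the number of chunks per section. The word-overlap scorer, the section data, and all function names are hypothetical stand-ins, not DocsRay's actual components.

```python
def overlap(query, text):
    """Toy relevance score: number of distinct words shared by query and text."""
    return len(set(query.lower().split()) & set(text.lower().split()))


def two_stage_retrieve(query, sections, k1=1, k2=1):
    """Score S section titles first, then only the chunks of the top-k1 sections."""
    # Stage 1: S comparisons against section titles.
    top_sections = sorted(
        sections, key=lambda s: overlap(query, s["title"]), reverse=True
    )[:k1]
    # Stage 2: about k1 * N_s comparisons, restricted to the selected sections.
    candidates = [chunk for s in top_sections for chunk in s["chunks"]]
    return sorted(candidates, key=lambda c: overlap(query, c), reverse=True)[:k2]


# Hypothetical pseudo table of contents with chunks attached to each section.
sections = [
    {"title": "Revenue and profit",
     "chunks": ["Revenue grew ten percent", "Profit margin fell slightly"]},
    {"title": "Risk factors",
     "chunks": ["Currency risk remains high", "Supply risk is moderate"]},
    {"title": "Outlook",
     "chunks": ["Guidance is unchanged", "Demand should recover"]},
]

hits = two_stage_retrieve("currency risk", sections)
print(hits)
```

In this toy document a flat search would score all 6 chunks, while the two-stage search scores 3 titles plus 2 chunks; the saving grows as the number of chunks per document increases.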
Shiyao Yu, Zi-An Wang, Kangning Yin, Zheng Tian, Mingyuan Zhang, Weixin Si, Shihao Zou
Motion retrieval is crucial for motion acquisition, offering superior
precision, realism, controllability, and editability compared to motion
generation. Existing approaches leverage contrastive learning to construct a
unified embedding space for motion retrieval from text or visual modality.
However, these methods lack a more intuitive and user-friendly interaction mode
and often overlook the sequential representation of most modalities for
improved retrieval performance. To address these limitations, we propose a
framework that aligns four modalities -- text, audio, video, and motion --
within a fine-grained joint embedding space, incorporating audio for the first
time in motion retrieval to enhance user immersion and convenience. This
fine-grained space is achieved through a sequence-level contrastive learning
approach, which captures critical details across modalities for better
alignment. To evaluate our framework, we augment existing text-motion datasets
with synthetic but diverse audio recordings, creating two multi-modal motion
retrieval datasets. Experimental results demonstrate superior performance over
state-of-the-art methods across multiple sub-tasks, including a 10.16%
improvement in R@10 for text-to-motion retrieval and a 25.43% improvement in
R@1 for video-to-motion retrieval on the HumanML3D dataset. Furthermore, our
results show that our 4-modal framework significantly outperforms its 3-modal
counterpart, underscoring the potential of multi-modal motion retrieval for
advancing motion acquisition.
Authors' comments: Accepted by IEEE TMM 2025