Van-Thinh Vo, Minh-Khoi Nguyen, Minh-Huy Tran, Anh-Quan Nguyen-Tran, Duy-Tan Nguyen, Khanh-Loi Nguyen, Anh-Minh Phan
Multimedia information retrieval from videos remains a challenging problem. While recent systems have advanced multimodal search through semantic, object, and OCR queries - and can retrieve temporally consecutive scenes - they often rely on a single query modality for an entire sequence, limiting robustness in complex temporal contexts. To overcome this, we propose a cross-modal temporal event retrieval framework that enables different query modalities to describe distinct scenes within a sequence. To determine decision thresholds for scene transition and slide change adaptively, we build Kernel Density Gaussian Mixture Thresholding (KDE-GMM) algorithm, ensuring optimal keyframe selection. These extracted keyframes act as compact, high-quality visual exemplars that retain each segment's semantic essence, improving retrieval precision and efficiency. Additionally, the system incorporates a large language model (LLM) to refine and expand user queries, enhancing overall retrieval performance. The proposed system's effectiveness and robustness were demonstrated through its strong results in the Ho Chi Minh AI Challenge 2025.
Authors' comments: 11 pages, 6 figures, SOICT 2025
Alamgir Munir Qazi, John P. McCrae, Jamal Abdul Nasir
The proliferation of misinformation necessitates robust yet computationally efficient fact verification systems. While current state-of-the-art approaches leverage Large Language Models (LLMs) for generating explanatory rationales, these methods face significant computational barriers and hallucination risks in real-world deployments. We present DeReC (Dense Retrieval Classification), a lightweight framework that demonstrates how general-purpose text embeddings can effectively replace autoregressive LLM-based approaches in fact verification tasks. By combining dense retrieval with specialized classification, our system achieves better accuracy while being significantly more efficient. DeReC outperforms explanation-generating LLMs in efficiency, reducing runtime by 95% on RAWFC (23 minutes 36 seconds compared to 454 minutes 12 seconds) and by 92% on LIAR-RAW (134 minutes 14 seconds compared to 1692 minutes 23 seconds), showcasing its effectiveness across varying dataset sizes. On the RAWFC dataset, DeReC achieves an F1 score of 65.58%, surpassing the state-of-the-art method L-Defense (61.20%). Our results demonstrate that carefully engineered retrieval-based systems can match or exceed LLM performance in specialized tasks while being significantly more practical for real-world deployment.
Qi Luo, Xiaonan Li, Yuxin Wang, Tingshuo Fan, Yuan Li, Xinchi Chen, Xipeng Qiu
Large Language Models (LLMs) excel at reasoning and generation but are inherently limited by static pretraining data, resulting in factual inaccuracies and weak adaptability to new information. Retrieval-Augmented Generation (RAG) addresses this issue by grounding LLMs in external knowledge; However, the effectiveness of RAG critically depends on whether the model can adequately access relevant information. Existing RAG systems rely on a single retriever with fixed top-k selection, restricting access to a narrow and static subset of the corpus. As a result, this single-retriever paradigm has become the primary bottleneck for comprehensive external information acquisition, especially in tasks requiring corpus-level reasoning. To overcome this limitation, we propose MARAG-R1, a reinforcement-learned multi-tool RAG framework that enables LLMs to dynamically coordinate multiple retrieval mechanisms for broader and more precise information access. MARAG-R1 equips the model with four retrieval tools -- semantic search, keyword search, filtering, and aggregation -- and learns both how and when to use them through a two-stage training process: supervised fine-tuning followed by reinforcement learning. This design allows the model to interleave reasoning and retrieval, progressively gathering sufficient evidence for corpus-level synthesis. Experiments on GlobalQA, HotpotQA, and 2WikiMultiHopQA demonstrate that MARAG-R1 substantially outperforms strong baselines and achieves new state-of-the-art results in corpus-level reasoning tasks.
Junchi Yu, Yujie Liu, Jindong Gu, Philip Torr, Dongzhan Zhou
Retrieval-Augmented Generation (RAG) based on knowledge graphs (KGs) enhances
large language models (LLMs) by providing structured and interpretable external
knowledge. However, existing KG-based RAG methods struggle to retrieve accurate
and diverse information from text-rich KGs for complex real-world queries.
Process Reward Models (PRMs) offer a way to align the retrieval process of
KG-based RAG with query-specific knowledge requirements, but they heavily rely
on process-level supervision signals that are expensive and hard to obtain on
KGs. To address this challenge, we propose GraphFlow, a framework that
efficiently retrieves accurate and diverse knowledge required for real-world
queries from text-rich KGs. GraphFlow employs a transition-based flow matching
objective to jointly optimize a retrieval policy and a flow estimator. The flow
estimator factorizes the reward of the retrieval outcome into the intermediate
retrieval states. Such reward factorization guides the retrieval policy to
retrieve candidates from KGs in proportion to their reward. This allows
GraphFlow to explore high-quality regions of KGs that yield diverse and
relevant results. We evaluate GraphFlow on the STaRK benchmark, which includes
real-world queries from multiple domains over text-rich KGs. GraphFlow
outperforms strong KG-RAG baselines, including GPT-4o, by 10% on average in hit
rate and recall. It also shows strong generalization to unseen KGs,
demonstrating its effectiveness and robustness.
Authors' comments: NeurIPS 2025 (Spotlight)
Yingchen zhang, Ruqing zhang, Jiafeng Guo, Wenjun Peng, Sen Li, Fuyu Lv
Generative retrieval (GR) is an emerging paradigm that leverages large language models (LLMs) to autoregressively generate document identifiers (docids) relevant to a given query. Prior works have focused on leveraging the generative capabilities of LLMs to improve GR, while overlooking that their reasoning capabilities could likewise help. This raises a key question: Can explicit reasoning benefit GR? To investigate, we first conduct a preliminary study where an LLM is prompted to generate free-form chain-of-thought (CoT) reasoning before performing constrained docid decoding. Although this method outperforms standard GR, the generated reasoning tends to be verbose and poorly aligned with the docid space. These limitations motivate the development of a reasoning mechanism better tailored to GR. Therefore, we propose Reason-for-Retrieval (R4R), a reasoning-augmented framework for GR that converts free-form CoT reasoning into a compact, structured format, and iteratively refines the reasoning during the retrieval process. R4R augments an existing GR method by leveraging a reasoning-capable LLM that has been instruction-tuned for GR. At inference time, R4R first uses the LLM to generate an initial structured reasoning; then the same LLM alternates between (i) constrained decoding with the chosen GR method to produce candidate docids and (ii) updating the reasoning based on retrieval results to improve the next round. R4R does not require additional models or training, and instead a single LLM serves as both the reasoning generator and the retriever. Extensive experiments on Natural Questions, MS MARCO, and a real-world item-search benchmark validate the effectiveness of R4R.
Yongjie Wang, Yue Yu, Kaisong Song, Jun Lin, Zhiqi Shen
Large Language Models (LLMs) have enabled a wide range of applications
through their powerful capabilities in language understanding and generation.
However, as LLMs are trained on static corpora, they face difficulties in
addressing rapidly evolving information or domain-specific queries.
Retrieval-Augmented Generation (RAG) was developed to overcome this limitation
by integrating LLMs with external retrieval mechanisms, allowing them to access
up-to-date and contextually relevant knowledge. However, as LLMs themselves
continue to advance in scale and capability, the relative advantages of
traditional RAG frameworks have become less pronounced and necessary. Here, we
present a comprehensive review of RAG, beginning with its overarching
objectives and core components. We then analyze the key challenges within RAG,
highlighting critical weakness that may limit its effectiveness. Finally, we
showcase applications where LLMs alone perform inadequately, but where RAG,
when combined with LLMs, can substantially enhance their effectiveness. We hope
this work will encourage researchers to reconsider the role of RAG and inspire
the development of next-generation RAG systems.
Authors' comments: Under Review
Yingyi Zhang, Pengyue Jia, Derong Xu, Yi Wen, Xianneng Li, Yichao Wang, Wenlin Zhang, Xiaopeng Li et al.
Retrieval-Augmented Generation (RAG) critically depends on effective query expansion to retrieve relevant information. However, existing expansion methods adopt uniform strategies that overlook user-specific semantics, ignoring individual expression styles, preferences, and historical context. In practice, identical queries in text can express vastly different intentions across users. This representational rigidity limits the ability of current RAG systems to generalize effectively in personalized settings. Specifically, we identify two core challenges for personalization: 1) user expression styles are inherently diverse, making it difficult for standard expansions to preserve personalized intent. 2) user corpora induce heterogeneous semantic structures-varying in topical focus and lexical organization-which hinders the effective anchoring of expanded queries within the user's corpora space. To address these challenges, we propose Personalize Before Retrieve (PBR), a framework that incorporates user-specific signals into query expansion prior to retrieval. PBR consists of two components: P-PRF, which generates stylistically aligned pseudo feedback using user history for simulating user expression style, and P-Anchor, which performs graph-based structure alignment over user corpora to capture its structure. Together, they produce personalized query representations tailored for retrieval. Experiments on two personalized benchmarks show that PBR consistently outperforms strong baselines, with up to 10% gains on PersonaBench across retrievers. Our findings demonstrate the value of modeling personalization before retrieval to close the semantic gap in user-adaptive RAG systems. Our code is available at https://github.com/Zhang-Yingyi/PBR-code.
Kai Guo, Xinnan Dai, Shenglai Zeng, Harry Shomer, Haoyu Han, Yu Wang, Jiliang Tang
Retrieval-augmented generation (RAG) is a powerful paradigm for improving large language models (LLMs) on knowledge-intensive question answering. Graph-based RAG (GraphRAG) leverages entity-relation graphs to support multi-hop reasoning, but most systems still rely on static retrieval. When crucial evidence, especially bridge documents that connect disjoint entities, is absent, reasoning collapses and hallucinations persist. Iterative retrieval, which performs multiple rounds of evidence selection, has emerged as a promising alternative, yet its role within GraphRAG remains poorly understood. We present the first systematic study of iterative retrieval in GraphRAG, analyzing how different strategies interact with graph-based backbones and under what conditions they succeed or fail. Our findings reveal clear opportunities: iteration improves complex multi-hop questions, helps promote bridge documents into leading ranks, and different strategies offer complementary strengths. At the same time, pitfalls remain: naive expansion often introduces noise that reduces precision, gains are limited on single-hop or simple comparison questions, and several bridge evidences still be buried too deep to be effectively used. Together, these results highlight a central bottleneck, namely that GraphRAG's effectiveness depends not only on recall but also on whether bridge evidence is consistently promoted into leading positions where it can support reasoning chains. To address this challenge, we propose Bridge-Guided Dual-Thought-based Retrieval (BDTR), a simple yet effective framework that generates complementary thoughts and leverages reasoning chains to recalibrate rankings and bring bridge evidence into leading positions. BDTR achieves consistent improvements across diverse GraphRAG settings and provides guidance for the design of future GraphRAG systems.
Haoxiang Jin, Ronghan Li, Qiguang Miao, Zixiang Lu
Large Language Models (LLMs) often generate inaccurate responses (hallucinations) when faced with questions beyond their knowledge scope. Retrieval-Augmented Generation (RAG) addresses this by leveraging external knowledge, but a critical challenge remains: determining whether retrieved contexts effectively enhance the model`s ability to answer specific queries. This challenge underscores the importance of knowledge boundary awareness, which current methods-relying on discrete labels or limited signals-fail to address adequately, as they overlook the rich information in LLMs` continuous internal hidden states. To tackle this, we propose a novel post-retrieval knowledge filtering approach. First, we construct a confidence detection model based on LLMs` internal hidden states to quantify how retrieved contexts enhance the model`s confidence. Using this model, we build a preference dataset (NQ_Rerank) to fine-tune a reranker, enabling it to prioritize contexts preferred by the downstream LLM during reranking. Additionally, we introduce Confidence-Based Dynamic Retrieval (CBDR), which adaptively triggers retrieval based on the LLM`s initial confidence in the original question, reducing knowledge conflicts and improving efficiency. Experimental results demonstrate significant improvements in accuracy for context screening and end-to-end RAG performance, along with a notable reduction in retrieval costs while maintaining competitive accuracy.
Huifeng Lin, Gang Su, Jintao Liang, You Wu, Rui Zhao, Ziyue Li
Retrieval-Augmented Generation (RAG) based on Large Language Models (LLMs) is
a powerful solution to understand and query the industry's closed-source
documents. However, basic RAG often struggles with complex QA tasks in legal
and regulatory domains, particularly when dealing with numerous government
documents. The top-$k$ strategy frequently misses golden chunks, leading to
incomplete or inaccurate answers. To address these retrieval bottlenecks, we
explore two strategies to improve evidence coverage and answer quality. The
first is a One-SHOT retrieval method that adaptively selects chunks based on a
token budget, allowing as much relevant content as possible to be included
within the model's context window. Additionally, we design modules to further
filter and refine the chunks. The second is an iterative retrieval strategy
built on a Reasoning Agentic RAG framework, where a reasoning LLM dynamically
issues search queries, evaluates retrieved results, and progressively refines
the context over multiple turns. We identify query drift and retrieval laziness
issues and further design two modules to tackle them. Through extensive
experiments on a dataset of government documents, we aim to offer practical
insights and guidance for real-world applications in legal and regulatory
domains.
Authors' comments: under Review of EMNLP 2025
Or Shachar, Uri Katz, Yoav Goldberg, Oren Glickman
We present NER Retriever, a zero-shot retrieval framework for ad-hoc Named
Entity Retrieval, a variant of Named Entity Recognition (NER), where the types
of interest are not provided in advance, and a user-defined type description is
used to retrieve documents mentioning entities of that type. Instead of relying
on fixed schemas or fine-tuned models, our method builds on internal
representations of large language models (LLMs) to embed both entity mentions
and user-provided open-ended type descriptions into a shared semantic space. We
show that internal representations, specifically the value vectors from
mid-layer transformer blocks, encode fine-grained type information more
effectively than commonly used top-layer embeddings. To refine these
representations, we train a lightweight contrastive projection network that
aligns type-compatible entities while separating unrelated types. The resulting
entity embeddings are compact, type-aware, and well-suited for nearest-neighbor
search. Evaluated on three benchmarks, NER Retriever significantly outperforms
both lexical and dense sentence-level retrieval baselines. Our findings provide
empirical support for representation selection within LLMs and demonstrate a
practical solution for scalable, schema-free entity retrieval. The NER
Retriever Codebase is publicly available at
https://github.com/ShacharOr100/ner_retriever
Authors' comments: Findings of EMNLP 2025
Baiqiang Wang, Qian Lou, Mengxin Zheng, Dongfang Zhao
Retrieval-Augmented Generation (RAG) has become a foundational component of modern AI systems, yet it introduces significant privacy risks by exposing user queries to service providers. To address this, we introduce PIR-RAG, a practical system for privacy-preserving RAG. PIR-RAG employs a novel architecture that uses coarse-grained semantic clustering to prune the search space, combined with a fast, lattice-based Private Information Retrieval (PIR) protocol. This design allows for the efficient retrieval of entire document clusters, uniquely optimizing for the end-to-end RAG workflow where full document content is required. Our comprehensive evaluation against strong baseline architectures, including graph-based PIR and Tiptoe-style private scoring, demonstrates PIR-RAG's scalability and its superior performance in terms of "RAG-Ready Latency"-the true end-to-end time required to securely fetch content for an LLM. Our work establishes PIR-RAG as a viable and highly efficient solution for privacy in large-scale AI systems.
Xuejun Chang, Zaiqiao Meng, Debasis Ganguly
Retrievability of a document is a collection-based statistic that measures
its expected (reciprocal) rank of being retrieved within a specific rank
cut-off. A collection with uniformly distributed retrievability scores across
documents is an indicator of fair document exposure. While retrievability
scores have been used to quantify the fairness of exposure for a collection, in
our work, we use the distribution of retrievability scores to measure the
exposure bias of retrieval models. We hypothesise that an uneven distribution
of retrievability scores across the entire collection may not accurately
reflect exposure bias but rather indicate variations in topical relevance. As a
solution, we propose a topic-focused localised retrievability measure, which we
call \textit{T-Retrievability} (topic-retrievability), which first computes
retrievability scores over multiple groups of topically-related documents, and
then aggregates these localised values to obtain the collection-level
statistics. Our analysis using this proposed T-Retrievability measure uncovers
new insights into the exposure characteristics of various neural ranking
models. The findings suggest that this localised measure provides a more
nuanced understanding of exposure fairness, offering a more reliable approach
for assessing document accessibility in IR systems.
Authors' comments: Accepted by Proceedings of the 34th ACM International Conference on
Information and Knowledge Management (CIKM 2025), November 10-14, 2025,
Seoul, Republic of Korea
Yanbo Dai, Zhenlan Ji, Zongjie Li, Kuan Li, Shuai Wang
Retrieval-Augmented Generation (RAG) has become a standard approach for improving the reliability of large language models (LLMs). Prior work demonstrates the vulnerability of RAG systems by misleading them into generating attacker-chosen outputs through poisoning the knowledge base. However, this paper uncovers that such attacks could be mitigated by the strong \textit{self-correction ability (SCA)} of modern LLMs, which can reject false context once properly configured. This SCA poses a significant challenge for attackers aiming to manipulate RAG systems. In contrast to previous poisoning methods, which primarily target the knowledge base, we introduce \textsc{DisarmRAG}, a new poisoning paradigm that compromises the retriever itself to suppress the SCA and enforce attacker-chosen outputs. This compromisation enables the attacker to straightforwardly embed anti-SCA instructions into the context provided to the generator, thereby bypassing the SCA. To this end, we present a contrastive-learning-based model editing technique that performs localized and stealthy edits, ensuring the retriever returns a malicious instruction only for specific victim queries while preserving benign retrieval behavior. To further strengthen the attack, we design an iterative co-optimization framework that automatically discovers robust instructions capable of bypassing prompt-based defenses. We extensively evaluate DisarmRAG across six LLMs and three QA benchmarks. Our results show near-perfect retrieval of malicious instructions, which successfully suppress SCA and achieve attack success rates exceeding 90\% under diverse defensive prompts. Also, the edited retriever remains stealthy under several detection methods, highlighting the urgent need for retriever-centric defenses.
Jonghyun Song, Youngjune Lee, Gyu-Hwung Cho, Ilhyeon Song, Saehun Kim, Yohan Jo
Vision-Language Pretrained (VLP) models have achieved impressive performance
on multimodal tasks, including text-image retrieval, based on dense
representations. Meanwhile, Learned Sparse Retrieval (LSR) has gained traction
in text-only settings due to its interpretability and efficiency with fast
term-based lookup via inverted indexes. Inspired by these advantages, recent
work has extended LSR to the multimodal domain. However, these methods often
rely on computationally expensive contrastive pre-training, or distillation
from a frozen dense model, which limits the potential for mutual enhancement.
To address these limitations, we propose a simple yet effective framework that
enables bi-directional learning between dense and sparse representations
through Self-Knowledge Distillation. This bi-directional learning is achieved
using an integrated similarity score-a weighted sum of dense and sparse
similarities-which serves as a shared teacher signal for both representations.
To ensure efficiency, we fine-tune the final layer of the dense encoder and the
sparse projection head, enabling easy adaptation of any existing VLP model.
Experiments on MSCOCO and Flickr30k demonstrate that our sparse retriever not
only outperforms existing sparse baselines, but also achieves performance
comparable to-or even surpassing-its dense counterparts, while retaining the
benefits of sparse models.
Authors' comments: accepted to CIKM 2025 short research paper track
Weijia Liu, Jiuxin Cao, Bo Miao, Zhiheng Fu, Xuelin Zhu, Jiawei Ge, Bo Liu, Mehwish Nasim et al.
Current text-driven Video Moment Retrieval (VMR) methods encode all video
clips, including irrelevant ones, disrupting multimodal alignment and hindering
optimization. To this end, we propose a denoise-then-retrieve paradigm that
explicitly filters text-irrelevant clips from videos and then retrieves the
target moment using purified multimodal representations. Following this
paradigm, we introduce the Denoise-then-Retrieve Network (DRNet), comprising
Text-Conditioned Denoising (TCD) and Text-Reconstruction Feedback (TRF)
modules. TCD integrates cross-attention and structured state space blocks to
dynamically identify noisy clips and produce a noise mask to purify multimodal
video representations. TRF further distills a single query embedding from
purified video representations and aligns it with the text embedding, serving
as auxiliary supervision for denoising during training. Finally, we perform
conditional retrieval using text embeddings on purified video representations
for accurate VMR. Experiments on Charades-STA and QVHighlights demonstrate that
our approach surpasses state-of-the-art methods on all metrics. Furthermore,
our denoise-then-retrieve paradigm is adaptable and can be seamlessly
integrated into advanced VMR models to boost performance.
Authors' comments: Accepted by IJCAI 2025
Jaehyun Kwak, Ramahdani Muhammad Izaaz Inhar, Se-Young Yun, Sung-Ju Lee
Composed Image Retrieval (CIR) retrieves relevant images based on a reference
image and accompanying text describing desired modifications. However, existing
CIR methods only focus on retrieving the target image and disregard the
relevance of other images. This limitation arises because most methods
employing contrastive learning-which treats the target image as positive and
all other images in the batch as negatives-can inadvertently include false
negatives. This may result in retrieving irrelevant images, reducing user
satisfaction even when the target image is retrieved. To address this issue, we
propose Query-Relevant Retrieval through Hard Negative Sampling (QuRe), which
optimizes a reward model objective to reduce false negatives. Additionally, we
introduce a hard negative sampling strategy that selects images positioned
between two steep drops in relevance scores following the target image, to
effectively filter false negatives. In order to evaluate CIR models on their
alignment with human satisfaction, we create Human-Preference FashionIQ
(HP-FashionIQ), a new dataset that explicitly captures user preferences beyond
target retrieval. Extensive experiments demonstrate that QuRe achieves
state-of-the-art performance on FashionIQ and CIRR datasets while exhibiting
the strongest alignment with human preferences on the HP-FashionIQ dataset. The
source code is available at https://github.com/jackwaky/QuRe.
Authors' comments: Accepted to ICML 2025
Ting-Wen Ko, Jyun-Yu Jiang, Pu-Jen Cheng
Retrieval-augmented generation (RAG) enhances large language models (LLMs) by incorporating external documents at inference time, enabling up-to-date knowledge access without costly retraining. However, conventional RAG methods retrieve passages independently, often leading to redundant, noisy, or insufficiently diverse context-particularly problematic - particularly problematic in noisy corpora and for multi-hop questions. To address this, we propose Adaptive Passage Combination Retrieval (AdaPCR), a novel framework for open-domain question answering with black-box LMs. AdaPCR explicitly models dependencies between passages by considering passage combinations as units for retrieval and reranking. It consists of a context-aware query reformulation using concatenated passages, and a reranking step trained with a predictive objective aligned with downstream answer likelihood. Crucially, AdaPCR adaptively selects the number of retrieved passages without additional stopping modules. Experiments across several QA benchmarks show that AdaPCR outperforms baselines, particularly in multi-hop reasoning, demonstrating the effectiveness of modeling inter-passage dependencies for improved retrieval.
Antoine Nzeyimana, Andre Niyongabo Rubungo
The recent mainstream adoption of large language model (LLM) technology is enabling novel applications in the form of chatbots and virtual assistants across many domains. With the aim of grounding LLMs in trusted domains and avoiding the problem of hallucinations, retrieval-augmented generation (RAG) has emerged as a viable solution. In order to deploy sustainable RAG systems in low-resource settings, achieving high retrieval accuracy is not only a usability requirement but also a cost-saving strategy. Through empirical evaluations on a Kinyarwanda-language dataset, we find that the most limiting factors in achieving high retrieval accuracy are limited language coverage and inadequate sub-word tokenization in pre-trained language models. We propose a new retriever model, KinyaColBERT, which integrates two key concepts: late word-level interactions between queries and documents, and a morphology-based tokenization coupled with two-tier transformer encoding. This methodology results in lexically grounded contextual embeddings that are both fine-grained and self-contained. Our evaluation results indicate that KinyaColBERT outperforms strong baselines and leading commercial text embedding APIs on a Kinyarwanda agricultural retrieval benchmark. By adopting this retrieval strategy, we believe that practitioners in other low-resource settings can not only achieve reliable RAG systems but also deploy solutions that are more cost-effective.
Li-Cheng Shen, Jih-Kang Hsieh, Wei-Hua Li, Chu-Song Chen
Text-to-image retrieval (TIR) aims to find relevant images based on a textual
query, but existing approaches are primarily based on whole-image captions and
lack interpretability. Meanwhile, referring expression segmentation (RES)
enables precise object localization based on natural language descriptions but
is computationally expensive when applied across large image collections. To
bridge this gap, we introduce Mask-aware TIR (MaTIR), a new task that unifies
TIR and RES, requiring both efficient image search and accurate object
segmentation. To address this task, we propose a two-stage framework,
comprising a first stage for segmentation-aware image retrieval and a second
stage for reranking and object grounding with a multimodal large language model
(MLLM). We leverage SAM 2 to generate object masks and Alpha-CLIP to extract
region-level embeddings offline at first, enabling effective and scalable
online retrieval. Secondly, MLLM is used to refine retrieval rankings and
generate bounding boxes, which are matched to segmentation masks. We evaluate
our approach on COCO and D$^3$ datasets, demonstrating significant improvements
in both retrieval accuracy and segmentation quality over previous methods.
Authors' comments: ICMR 2025