Qian Dong, Jia Chen, Qingyao Ai, Hongning Wang, Haitao Li, Yi Wu, Yao Hu, Yiqun Liu et al.
Existing retrieval-augmented code generation (RACG) methods typically use an
external retrieval module to fetch semantically similar code snippets used for
generating subsequent fragments. However, even for consecutive code fragments,
the content often diverges due to logical progression, resulting in a content
gap. This gap undermines the performance of current RACG methods, as
\textit{external} retrieval modules based on content matching fail to infer the
specific information need of LLMs to generate the next code fragment.
Therefore, we propose \textbf{SelfRACG}, a novel paradigm that enables large
language models (LLMs) to \textbf{Self}-express their information needs to
enhance \textbf{RACG}. Specifically, SelfRACG includes an information need
expression module and a two-stage information need-guided training strategy,
which encourages LLMs to express their information need. Extensive experiments
demonstrate that SelfRACG can retrieve external knowledge that better aligns
with the LLM's own information needs, resulting in superior generation
performance compared to vanilla RACG.
Authors' comments: Tsinghua&Xiaohongshu
Agada Joseph Oche, Ademola Glory Folashade, Tirthankar Ghosal, Arpan Biswas
Retrieval-Augmented Generation (RAG) represents a major advancement in
natural language processing (NLP), combining large language models (LLMs) with
information retrieval systems to enhance factual grounding, accuracy, and
contextual relevance. This paper presents a comprehensive systematic review of
RAG, tracing its evolution from early developments in open domain question
answering to recent state-of-the-art implementations across diverse
applications. The review begins by outlining the motivations behind RAG,
particularly its ability to mitigate hallucinations and outdated knowledge in
parametric models. Core technical components-retrieval mechanisms,
sequence-to-sequence generation models, and fusion strategies are examined in
detail. A year-by-year analysis highlights key milestones and research trends,
providing insight into RAG's rapid growth. The paper further explores the
deployment of RAG in enterprise systems, addressing practical challenges
related to retrieval of proprietary data, security, and scalability. A
comparative evaluation of RAG implementations is conducted, benchmarking
performance on retrieval accuracy, generation fluency, latency, and
computational efficiency. Persistent challenges such as retrieval quality,
privacy concerns, and integration overhead are critically assessed. Finally,
the review highlights emerging solutions, including hybrid retrieval
approaches, privacy-preserving techniques, optimized fusion strategies, and
agentic RAG architectures. These innovations point toward a future of more
reliable, efficient, and context-aware knowledge-intensive NLP systems.
Authors' comments: 33 pages, 2 figures
Zhengyun Zhao, Huaiyuan Ying, Yue Zhong, Sheng Yu
Electronic Health Records (EHRs) are pivotal in clinical practices, yet their
retrieval remains a challenge mainly due to semantic gap issues. Recent
advancements in dense retrieval offer promising solutions but existing models,
both general-domain and biomedical-domain, fall short due to insufficient
medical knowledge or mismatched training corpora. This paper introduces
\texttt{DR.EHR}, a series of dense retrieval models specifically tailored for
EHR retrieval. We propose a two-stage training pipeline utilizing MIMIC-IV
discharge summaries to address the need for extensive medical knowledge and
large-scale training data. The first stage involves medical entity extraction
and knowledge injection from a biomedical knowledge graph, while the second
stage employs large language models to generate diverse training data. We train
two variants of \texttt{DR.EHR}, with 110M and 7B parameters, respectively.
Evaluated on the CliniQ benchmark, our models significantly outperforms all
existing dense retrievers, achieving state-of-the-art results. Detailed
analyses confirm our models' superiority across various match and query types,
particularly in challenging semantic matches like implication and abbreviation.
Ablation studies validate the effectiveness of each pipeline component, and
supplementary experiments on EHR QA datasets demonstrate the models'
generalizability on natural language questions, including complex ones with
multiple entities. This work significantly advances EHR retrieval, offering a
robust solution for clinical applications.
Authors' comments: Model and code released upon acceptance
Ruiqi He, Zekun Fei, Jiaqi Li, Xinyuan Zhu, Biao Yi, Siyi Lv, Weijie Liu, Zheli Liu
Vector Database (VDB) can efficiently index and search high-dimensional vector embeddings from unstructured data, crucially enabling fast semantic similarity search essential for modern AI applications like generative AI and recommendation systems. Since current VDB service providers predominantly use proprietary black-box models, users are forced to expose raw query text to them via API in exchange for the vector retrieval services. Consequently, if query text involves confidential records from finance or healthcare domains, this mechanism inevitably leads to critical leakage of user's sensitive information. To address this issue, we introduce STEER (\textbf{S}ecure \textbf{T}ransformed \textbf{E}mbedding v\textbf{E}ctor\textbf{ R}etrieval), a private vector retrieval framework that leverages the alignment relationship between the semantic spaces of different embedding models to derive approximate embeddings for the query text. STEER performs the retrieval using the approximate embeddings within the original VDB and requires no modifications to the server side. Our theoretical and experimental analyses demonstrate that STEER effectively safeguards query text privacy while maintaining the retrieval accuracy. Even though approximate embeddings are approximations of the embeddings from proprietary models, they still prevent the providers from recovering the query text through Embedding Inversion Attacks (EIAs). Extensive experimental results show that Recall@100 of STEER can basically achieve a decrease of less than 5\%. Furthermore, even when searching within a text corpus of millions of entries, STEER achieves a Recall@20 accuracy 20\% higher than current baselines.
Zezhou Yang, Ting Peng, Cuiyun Gao, Chaozheng Wang, Hailiang Huang, Yuetang Deng
Code completion, a crucial task in software engineering that enhances
developer productivity, has seen substantial improvements with the rapid
advancement of large language models (LLMs). In recent years,
retrieval-augmented generation (RAG) has emerged as a promising method to
enhance the code completion capabilities of LLMs, which leverages relevant
context from codebases without requiring model retraining. While existing
studies have demonstrated the effectiveness of RAG on public repositories and
benchmarks, the potential distribution shift between open-source and
closed-source codebases presents unique challenges that remain unexplored. To
mitigate the gap, we conduct an empirical study to investigate the performance
of widely-used RAG methods for code completion in the industrial-scale codebase
of WeChat, one of the largest proprietary software systems. Specifically, we
extensively explore two main types of RAG methods, namely identifier-based RAG
and similarity-based RAG, across 26 open-source LLMs ranging from 0.5B to 671B
parameters. For a more comprehensive analysis, we employ different retrieval
techniques for similarity-based RAG, including lexical and semantic retrieval.
Based on 1,669 internal repositories, we achieve several key findings: (1) both
RAG methods demonstrate effectiveness in closed-source repositories, with
similarity-based RAG showing superior performance, (2) the effectiveness of
similarity-based RAG improves with more advanced retrieval techniques, where
BM25 (lexical retrieval) and GTE-Qwen (semantic retrieval) achieve superior
performance, and (3) the combination of lexical and semantic retrieval
techniques yields optimal results, demonstrating complementary strengths.
Furthermore, we conduct a developer survey to validate the practical utility of
RAG methods in real-world development environments.
Authors' comments: Accepted in ICSME 25 Industry Track
Yifu Chen, Bingchen Huang, Zhiling Wang, Yuanchao Du, Junfeng Luo, Lei Shen, Zhineng chen
In-context learning (ICL) has become a classic approach for enabling LLMs to handle various tasks based on a few input-output examples. The effectiveness of ICL heavily relies on the quality of these examples, and previous works which focused on enhancing example retrieval capabilities have achieved impressive performances. However, two challenges remain in retrieving high-quality examples: (1) Difficulty in distinguishing cross-task data distributions, (2) Difficulty in making the fine-grained connection between retriever output and feedback from LLMs. In this paper, we propose a novel framework called TDR. TDR decouples the ICL examples from different tasks, which enables the retrieval module to retrieve examples specific to the target task within a multi-task dataset. Furthermore, TDR models fine-grained feedback from LLMs to supervise and guide the training of the retrieval module, which helps to retrieve high-quality examples. We conducted extensive experiments on a suite of 30 NLP tasks, the results demonstrate that TDR consistently improved results across all datasets and achieves state-of-the-art performance. Meanwhile, our approach is a plug-and-play method, which can be easily combined with various LLMs to improve example retrieval abilities for ICL. The code is available at https://github.com/Nnn-s/TDR.
Shubham Mohole, Hongjun Choi, Shusen Liu, Christine Klymko, Shashank Kushwaha, Derek Shi, Wesam Sakla, Sainyam Galhotra et al.
Retrieval-augmented generation (RAG) systems are increasingly adopted in clinical decision support, yet they remain methodologically blind-they retrieve evidence but cannot vet its scientific quality. A paper claiming "Antioxidant proteins decreased after alloferon treatment" and a rigorous multi-laboratory replication study will be treated as equally credible, even if the former lacked scientific rigor or was even retracted. To address this challenge, we introduce VERIRAG, a framework that makes three notable contributions: (i) the Veritable, an 11-point checklist that evaluates each source for methodological rigor, including data integrity and statistical validity; (ii) a Hard-to-Vary (HV) Score, a quantitative aggregator that weights evidence by its quality and diversity; and (iii) a Dynamic Acceptance Threshold, which calibrates the required evidence based on how extraordinary a claim is. Across four datasets-comprising retracted, conflicting, comprehensive, and settled science corpora-the VERIRAG approach consistently outperforms all baselines, achieving absolute F1 scores ranging from 0.53 to 0.65, representing a 10 to 14 point improvement over the next-best method in each respective dataset. We will release all materials necessary for reproducing our results.
Farnaz Khun Jush, Steffen Vogler, Matthias Lenga
The increasing volume of medical images poses challenges for radiologists in retrieving relevant cases. Content-based image retrieval (CBIR) systems offer potential for efficient access to similar cases, yet lack standardized evaluation and comprehensive studies. Building on prior studies for tumor characterization via CBIR, this study advances CBIR research for volumetric medical images through three key contributions: (1) a framework eliminating reliance on pre-segmented data and organ-specific datasets, aligning with large and unstructured image archiving systems, i.e. PACS in clinical practice; (2) introduction of C-MIR, a novel volumetric re-ranking method adapting ColBERT's contextualized late interaction mechanism for 3D medical imaging; (3) comprehensive evaluation across four tumor sites using three feature extractors and three database configurations. Our evaluations highlight the significant advantages of C-MIR. We demonstrate the successful adaptation of the late interaction principle to volumetric medical images, enabling effective context-aware re-ranking. A key finding is C-MIR's ability to effectively localize the region of interest, eliminating the need for pre-segmentation of datasets and offering a computationally efficient alternative to systems relying on expensive data enrichment steps. C-MIR demonstrates promising improvements in tumor flagging, achieving improved performance, particularly for colon and lung tumors (p<0.05). C-MIR also shows potential for improving tumor staging, warranting further exploration of its capabilities. Ultimately, our work seeks to bridge the gap between advanced retrieval techniques and their practical applications in healthcare, paving the way for improved diagnostic processes.
Ruijie Yang, Yan Zhu, Peiyao Fu, Yizhe Zhang, Zhihua Wang, Quanlin Li, Pinghong Zhou, Xian Yang et al.
Colorectal cancer (CRC) remains a leading cause of cancer-related mortality, underscoring the importance of timely polyp detection and diagnosis. While deep learning models have improved optical-assisted diagnostics, they often demand extensive labeled datasets and yield "black-box" outputs with limited interpretability. In this paper, we propose EndoFinder, an online polyp retrieval framework that leverages multi-view scene representations for explainable and scalable CRC diagnosis. First, we develop a Polyp-aware Image Encoder by combining contrastive learning and a reconstruction task, guided by polyp segmentation masks. This self-supervised approach captures robust features without relying on large-scale annotated data. Next, we treat each polyp as a three-dimensional "scene" and introduce a Scene Representation Transformer, which fuses multiple views of the polyp into a single latent representation. By discretizing this representation through a hashing layer, EndoFinder enables real-time retrieval from a compiled database of historical polyp cases, where diagnostic information serves as interpretable references for new queries. We evaluate EndoFinder on both public and newly collected polyp datasets for re-identification and pathology classification. Results show that EndoFinder outperforms existing methods in accuracy while providing transparent, retrieval-based insights for clinical decision-making. By contributing a novel dataset and a scalable, explainable framework, our work addresses key challenges in polyp diagnosis and offers a promising direction for more efficient AI-driven colonoscopy workflows. The source code is available at https://github.com/ku262/EndoFinder-Scene.
Zhehao Tan, Yihan Jiao, Dan Yang, Lei Liu, Jie Feng, Duolin Sun, Yue Shen, Jian Wang et al.
Retrieval-Augmented Generation (RAG) enhances large language models (LLMs) by integrating external knowledge, where the LLM's ability to generate responses based on the combination of a given query and retrieved documents is crucial. However, most benchmarks focus on overall RAG system performance, rarely assessing LLM-specific capabilities. Current benchmarks emphasize broad aspects such as noise robustness, but lack a systematic and granular evaluation framework on document utilization. To this end, we introduce \textit{Placeholder-RAG-Benchmark}, a multi-level fine-grained benchmark, emphasizing the following progressive dimensions: (1) multi-level filtering abilities, (2) combination abilities, and (3) reference reasoning. To provide a more nuanced understanding of LLMs' roles in RAG systems, we formulate an innovative placeholder-based approach to decouple the contributions of the LLM's parametric knowledge and the external knowledge. Experiments demonstrate the limitations of representative LLMs in the RAG system's generation capabilities, particularly in error resilience and context faithfulness. Our benchmark provides a reproducible framework for developing more reliable and efficient RAG systems. Our code is available in https://github.com/Alipay-Med/PRGB.
Fangjian Lei, Mariam El Mezouar, Shayan Noei, Ying Zou
Large Language Models (LLMs) have shown promise in assisting developers with code-related questions; however, LLMs carry the risk of generating unreliable answers. To address this, Retrieval-Augmented Generation (RAG) has been proposed to reduce the unreliability (i.e., hallucinations) of LLMs. However, designing effective pipelines remains challenging due to numerous design choices. In this paper, we construct a retrieval corpus of over 3 million Java and Python related Stack Overflow posts with accepted answers, and explore various RAG pipeline designs to answer developer questions, evaluating their effectiveness in generating accurate and reliable responses. More specifically, we (1) design and evaluate 7 different RAG pipelines and 63 pipeline variants to answer questions that have historically similar matches, and (2) address new questions without any close prior matches by automatically lowering the similarity threshold during retrieval, thereby increasing the chance of finding partially relevant context and improving coverage for unseen cases. We find that implementing a RAG pipeline combining hypothetical-documentation-embedding (HyDE) with the full-answer context performs best in retrieving and answering similarcontent for Stack Overflow questions. Finally, we apply our optimal RAG pipeline to 4 open-source LLMs and compare the results to their zero-shot performance. Our findings show that RAG with our optimal RAG pipeline consistently outperforms zero-shot baselines across models, achieving higher scores for helpfulness, correctness, and detail with LLM-as-a-judge. These findings demonstrate that our optimal RAG pipelines robustly enhance answer quality for a wide range of developer queries including both previously seen and novel questions across different LLMs
Hellina Hailu Nigatu, Min Li, Maartje ter Hoeve, Saloni Potdar, Sarah Chasins
Knowledge Graphs represent real-world entities and the relationships between
them. Multilingual Knowledge Graph Construction (mKGC) refers to the task of
automatically constructing or predicting missing entities and links for
knowledge graphs in a multilingual setting. In this work, we reformulate the
mKGC task as a Question Answering (QA) task and introduce mRAKL: a
Retrieval-Augmented Generation (RAG) based system to perform mKGC. We achieve
this by using the head entity and linking relation in a question, and having
our model predict the tail entity as an answer. Our experiments focus primarily
on two low-resourced languages: Tigrinya and Amharic. We experiment with using
higher-resourced languages Arabic and English for cross-lingual transfer. With
a BM25 retriever, we find that the RAG-based approach improves performance over
a no-context setting. Further, our ablation studies show that with an idealized
retrieval system, mRAKL improves accuracy by 4.92 and 8.79 percentage points
for Tigrinya and Amharic, respectively.
Authors' comments: Accepted to Findings of ACL 2025
Linbin Li, Haiyang Peng, Yong Xia, Meng Huang
This paper investigates phase retrieval using the Reshaped Wirtinger Flow
(RWF) algorithm, focusing on recovering target vector $\vx \in \R^n$ from
magnitude measurements \(y_i = \left| \langle \va_i, \vx \rangle \right|, \; i
= 1, \ldots, m,\) under random initialization, where $\va_i \in \R^n$ are
measurement vectors. For Gaussian measurement designs, we prove that when $m\ge
O(n \log^2 n\log^3 m)$, the RWF algorithm with random initialization achieves
$\epsilon$-accuracy within \(O\big(\log n + \log(1/\epsilon)\big)\) iterations,
thereby attaining nearly optimal sample and computational complexities
comparable to those previously established for spectrally initialized methods.
Numerical experiments demonstrate that the convergence rate is robust to
initialization randomness and remains stable even with larger step sizes.
Authors' comments: 33 pages, 6 figures
Xinping Zhao, Shouzheng Huang, Yan Zhong, Xinshuo Hu, Baotian Hu, Min Zhang
Retrieval-Augmented Generation (RAG) effectively improves the accuracy of
Large Language Models (LLMs). However, retrieval noises significantly impact
the quality of LLMs' generation, necessitating the development of denoising
mechanisms. Previous methods extract evidence straightforwardly without
explicit thinking, which risks filtering out key clues and struggles with
generalization. To this end, we propose LEAR, which learns to extract rational
evidence by (1) explicitly reasoning to identify potential cues within
retrieval contents first, and then (2) consciously extracting to avoid omitting
any key cues helpful for answering questions. Specifically, we frame evidence
reasoning and evidence extraction into one unified response for end-to-end
training; apply knowledge token masks for disentanglement to derive
reasoning-based and extraction-based answers; and devise three types of
verifiable reward functions, including answer, length, and format, to update
the model via the policy optimization algorithm. Extensive experiments on three
benchmark datasets show the effectiveness of LEAR, providing compact and
high-quality evidence, improving the accuracy of downstream tasks, and
promoting effective application in online RAG systems.
Authors' comments: 16 pages, 7 Figures, 10 Tables
Wenhao Li, Selvakumar Manickam, Yung-wey Chong, Shankar Karuppayah
Phishing websites remain a major cybersecurity threat, yet existing methods
primarily focus on detection, while the recognition of underlying malicious
intentions remains largely unexplored. To address this gap, we propose
PhishIntentionLLM, a multi-agent retrieval-augmented generation (RAG) framework
that uncovers phishing intentions from website screenshots. Leveraging the
visual-language capabilities of large language models (LLMs), our framework
identifies four key phishing objectives: Credential Theft, Financial Fraud,
Malware Distribution, and Personal Information Harvesting. We construct and
release the first phishing intention ground truth dataset (~2K samples) and
evaluate the framework using four commercial LLMs. Experimental results show
that PhishIntentionLLM achieves a micro-precision of 0.7895 with GPT-4o and
significantly outperforms the single-agent baseline with a ~95% improvement in
micro-precision. Compared to the previous work, it achieves 0.8545 precision
for credential theft, marking a ~4% improvement. Additionally, we generate a
larger dataset of ~9K samples for large-scale phishing intention profiling
across sectors. This work provides a scalable and interpretable solution for
intention-aware phishing analysis.
Authors' comments: Accepted by EAI ICDF2C 2025
Xiaofeng Shi, Yuduo Li, Qian Kou, Longbin Yu, Jinxin Xie, Hua Zhou
Recent advances in large language models (LLMs) have opened new opportunities for academic literature retrieval. However, existing systems often rely on rigid pipelines and exhibit limited reasoning capabilities. We introduce SPAR, a multi-agent framework that incorporates RefChain-based query decomposition and query evolution to enable more flexible and effective search. To facilitate systematic evaluation, we also construct SPARBench, a challenging benchmark with expert-annotated relevance labels. Experimental results demonstrate that SPAR substantially outperforms strong baselines, achieving up to +56% F1 on AutoScholar and +23% F1 on SPARBench over the best-performing baseline. Together, SPAR and SPARBench provide a scalable, interpretable, and high-performing foundation for advancing research in scholarly retrieval. Code and data will be available at: https://github.com/xiaofengShi/SPAR
Jerry Wang, Fang Yu
Adversarial prompt attacks can significantly alter the reliability of
Retrieval-Augmented Generation (RAG) systems by re-ranking them to produce
incorrect outputs. In this paper, we present a novel method that applies
Differential Evolution (DE) to optimize adversarial prompt suffixes for
RAG-based question answering. Our approach is gradient-free, treating the RAG
pipeline as a black box and evolving a population of candidate suffixes to
maximize the retrieval rank of a targeted incorrect document to be closer to
real world scenarios. We conducted experiments on the BEIR QA datasets to
evaluate attack success at certain retrieval rank thresholds under multiple
retrieving applications. Our results demonstrate that DE-based prompt
optimization attains competitive (and in some cases higher) success rates
compared to GGPP to dense retrievers and PRADA to sparse retrievers, while
using only a small number of tokens (<=5 tokens) in the adversarial suffix.
Furthermore, we introduce a readability-aware suffix construction strategy,
validated by a statistically significant reduction in MLM negative
log-likelihood with Welch's t-test. Through evaluations with a BERT-based
adversarial suffix detector, we show that DE-generated suffixes evade
detection, yielding near-chance detection accuracy.
Authors' comments: Accepted by KDD Workshop on Prompt Optimization 2025
Amna Ali, Liyanage C. De Silva, Pg Emeroylariffion Abas
Patent examiners and inventors face significant pressure to verify the originality and non-obviousness of inventions, and the intricate nature of patent data intensifies the challenges of patent retrieval. Therefore, there is a pressing need to devise cutting-edge retrieval strategies that can reliably achieve the desired recall. This study introduces FullRecall, a novel patent retrieval approach that effectively manages the complexity of patent data while maintaining the reliability of relevance matching and maximising recall. It leverages IPC-guided knowledge to generate informative phrases, which are processed to extract key information in the form of noun phrases characterising the query patent under observation. From these, the top k keyphrases are selected to construct a query for retrieving a focused subset of the dataset. This initial retrieval step achieves complete recall, successfully capturing all relevant documents. To further refine the results, a ranking scheme is applied to the retrieved subset, reducing its size while maintaining 100% recall. This multi-phase process demonstrates an effective strategy for balancing precision and recall in patent retrieval tasks. Comprehensive experiments were conducted, and the results were compared with baseline studies, namely HRR2 [1] and ReQ-ReC [2]. The proposed approach yielded superior results, achieving 100% recall in all five test cases. However, HRR2[1] recall values across the five test cases were 10%, 25%, 33.3%, 0%, and 14.29%, while ReQ-ReC [2] showed 50% for the first test case, 25% for the second test case, and 0% for the third, fourth, and fifth test cases. The 100% recall ensures that no relevant prior art is overlooked, thereby strengthening the patent pre-filing and examination processes, hence reducing potential legal risks.
Xiaojie Li, Chu Li, Shi-Zhe Chen, Xi Chen
Universal multimodal retrieval (UMR), which aims to address complex retrieval
tasks where both queries and candidates span diverse modalities, has been
significantly advanced by the emergence of MLLMs. While state-of-the-art
MLLM-based methods in the literature predominantly adopt contrastive learning
principles, they often differ in their specific training recipes. Despite their
success, the mechanisms underlying their retrieval capabilities remain largely
unexplored, potentially resulting in suboptimal performance and limited
generalization ability. To address these issues, we present a comprehensive
study aimed at uncovering the key factors that drive effective embedding
learning for UMR using MLLMs. We begin by implementing a general MLLM-based
embedding learning pipeline, and systematically analyze the primary
contributors to high-performing universal retrieval systems. Based on this, we
explore various aspects of the details in embedding generation and training
strategies, including progressive transition, hard negative mining and
re-ranker distillation. Notably, our findings reveal that often-overlooked
factors can have a substantial impact on model performance. Building on these
discoveries, we introduce a unified framework termed U-MARVEL
(\textbf{U}niversal \textbf{M}ultimod\textbf{A}l \textbf{R}etrie\textbf{V}al
via \textbf{E}mbedding \textbf{L}earning), which outperforms state-of-the-art
competitors on the M-BEIR benchmark by a large margin in supervised settings,
and also exihibits strong zero-shot performance on several tasks such as
composed image retrieval and text-to-video retrieval. These results underscore
the generalization potential of our framework across various embedding-based
retrieval tasks. Code is available at https://github.com/chaxjli/U-MARVEL
Authors' comments: Technical Report (in progress)
Van-Hoang Le, Duc-Vu Nguyen, Kiet Van Nguyen, Ngan Luu-Thuy Nguyen
Large Language Models (LLMs) face significant challenges in specialized
domains like law, where precision and domain-specific knowledge are critical.
This paper presents a streamlined two-stage framework consisting of Retrieval
and Re-ranking to enhance legal document retrieval efficiency and accuracy. Our
approach employs a fine-tuned Bi-Encoder for rapid candidate retrieval,
followed by a Cross-Encoder for precise re-ranking, both optimized through
strategic negative example mining. Key innovations include the introduction of
the Exist@m metric to evaluate retrieval effectiveness and the use of semi-hard
negatives to mitigate training bias, which significantly improved re-ranking
performance. Evaluated on the SoICT Hackathon 2024 for Legal Document
Retrieval, our team, 4Huiter, achieved a top-three position. While
top-performing teams employed ensemble models and iterative self-training on
large bge-m3 architectures, our lightweight, single-pass approach offered a
competitive alternative with far fewer parameters. The framework demonstrates
that optimized data processing, tailored loss functions, and balanced negative
sampling are pivotal for building robust retrieval-augmented systems in legal
contexts.
Authors' comments: Accepted at ICCCI 2025