Yixuan Cao, Zhengrong Chen, Chengxuan Xia, Kun Wu, Ping Luo
Quantitative facts are continually generated by companies and governments,
supporting data-driven decision-making. While common facts are structured, many
long-tail quantitative facts remain buried in unstructured documents, making
them difficult to access. We propose the task of Quantity Retrieval: given a
description of a quantitative fact, the system returns the relevant value and
supporting evidence. Understanding quantity semantics in context is essential
for this task. We introduce a framework based on description parsing that
converts text into structured (description, quantity) pairs for effective
retrieval. To improve learning, we construct a large paraphrase dataset using
weak supervision based on quantity co-occurrence. We evaluate our approach on a
large corpus of financial annual reports and a newly annotated quantity
description dataset. Our method significantly improves top-1 retrieval accuracy
from 30.98 percent to 64.66 percent.
Authors' comments: Extended version of the paper accepted in DEXA 2025
Youngjoon Jang, Seongtae Hong, Junyoung Son, Sungjin Park, Chanjun Park, Heuiseok Lim
Retrieval-Augmented Generation (RAG) has emerged as a crucial framework in natural language processing (NLP), improving factual consistency and reducing hallucinations by integrating external document retrieval with large language models (LLMs). However, the effectiveness of RAG is often hindered by coreferential complexity in retrieved documents, introducing ambiguity that disrupts in-context learning. In this study, we systematically investigate how entity coreference affects both document retrieval and generative performance in RAG-based systems, focusing on retrieval relevance, contextual understanding, and overall response quality. We demonstrate that coreference resolution enhances retrieval effectiveness and improves question-answering (QA) performance. Through comparative analysis of different pooling strategies in retrieval tasks, we find that mean pooling demonstrates superior context capturing ability after applying coreference resolution. In QA tasks, we discover that smaller models benefit more from the disambiguation process, likely due to their limited inherent capacity for handling referential ambiguity. With these findings, this study aims to provide a deeper understanding of the challenges posed by coreferential complexity in RAG, providing guidance for improving retrieval and generation in knowledge-intensive AI applications.
Evert Nasedkin, Merle Schrader, Johanna M. Vos, Beth Biller, Ben Burningham, Nicolas B. Cowan, Jacqueline Faherty, Eileen Gonzales et al.
SIMP-0136 is a T2.5 brown dwarf whose young age ($200\pm50$~Myr) and low mass
($15\pm3$~M$_{\rm Jup}$) make it an ideal analogue for the directly imaged
exoplanet population. With a 2.4 hour period, it is known to be variable in
both the infrared and the radio, which has been attributed to changes in the
cloud coverage and the presence of an aurora respectively. To quantify the
changes in the atmospheric state that drive this variability, we obtained
time-series spectra of SIMP-0136 covering one full rotation with both
NIRSpec/PRISM and the MIRI/LRS on board JWST. We performed a series of
time-resolved atmospheric retrievals using petitRADTRANS in order to measure
changes in the temperature structure, chemistry, and cloudiness. We inferred
the presence of a ~250 K thermal inversion above 10 mbar of SIMP-0136 at all
phases, and propose that this inversion is due to the deposition of energy into
the upper atmosphere by an aurora. Statistical tests were performed in order to
determine which parameters drive the observed spectroscopic variability. The
primary contribution was due to changes in the temperature profile at pressures
deeper than 10 mbar, which resulted in variation of the effective temperature
from 1243 K to 1248 K. This changing effective temperature was also correlated
to observed changes in the abundances of CO2 and H2S, while all other chemical
species were consistent with being homogeneous throughout the atmosphere.
Patchy silicate clouds were required to fit the observed spectra, but the cloud
properties were not found to systematically vary with longitude. This work
paints a portrait of an L/T transition object where the primary variability
mechanisms are magnetic and thermodynamic in nature, rather than due to
inhomogeneous cloud coverage.
Authors' comments: 25 pages, 22 figures, submitted to Astronomy & Astrophysics
Chetan Arora, Fanyu Wang, Chakkrit Tantithamthavorn, Aldeida Aleti, Shaun Kenyon
Requirements engineering (RE) in the space industry is inherently complex, demanding high precision, alignment with rigorous standards, and adaptability to mission-specific constraints. Smaller space organisations and new entrants often struggle to derive actionable requirements from extensive, unstructured documents such as mission briefs, interface specifications, and regulatory standards. In this innovation opportunity paper, we explore the potential of Retrieval-Augmented Generation (RAG) models to support and (semi-)automate requirements generation in the space domain. We present a modular, AI-driven approach that preprocesses raw space mission documents, classifies them into semantically meaningful categories, retrieves contextually relevant content from domain standards, and synthesises draft requirements using large language models (LLMs). We apply the approach to a real-world mission document from the space domain to demonstrate feasibility and assess early outcomes in collaboration with our industry partner, Starbound Space Solutions. Our preliminary results indicate that the approach can reduce manual effort, improve coverage of relevant requirements, and support lightweight compliance alignment. We outline a roadmap toward broader integration of AI in RE workflows, intending to lower barriers for smaller organisations to participate in large-scale, safety-critical missions.
Qin Zhou, Guoyan Liang, Xindi Li, Jingyuan Chen, Wang Zhe, Chang Yao, Sai Wu
Automated radiology report generation is essential for improving diagnostic
efficiency and reducing the workload of medical professionals. However,
existing methods face significant challenges, such as disease class imbalance
and insufficient cross-modal fusion. To address these issues, we propose the
learnable Retrieval Enhanced Visual-Text Alignment and Fusion (REVTAF)
framework, which effectively tackles both class imbalance and visual-text
fusion in report generation. REVTAF incorporates two core components: (1) a
Learnable Retrieval Enhancer (LRE) that utilizes semantic hierarchies from
hyperbolic space and intra-batch context through a ranking-based metric. LRE
adaptively retrieves the most relevant reference reports, enhancing image
representations, particularly for underrepresented (tail) class inputs; and (2)
a fine-grained visual-text alignment and fusion strategy that ensures
consistency across multi-source cross-attention maps for precise alignment.
This component further employs an optimal transport-based cross-attention
mechanism to dynamically integrate task-relevant textual knowledge for improved
report generation. By combining adaptive retrieval with multi-source alignment
and fusion, REVTAF achieves fine-grained visual-text integration under weak
image-report level supervision while effectively mitigating data imbalance
issues. The experiments demonstrate that REVTAF outperforms state-of-the-art
methods, achieving an average improvement of 7.4% on the MIMIC-CXR dataset and
2.9% on the IU X-Ray dataset. Comparisons with mainstream multimodal LLMs
(e.g., GPT-series models), further highlight its superiority in radiology
report generation https://github.com/banbooliang/REVTAF-RRG.
Authors' comments: 10 pages,3 figures, conference
Chen Amiraz, Yaroslav Fyodorov, Elad Haramaty, Zohar Karnin, Liane Lewin-Eytan
Cross-lingual retrieval-augmented generation (RAG) is a critical capability for retrieving and generating answers across languages. Prior work in this context has mostly focused on generation and relied on benchmarks derived from open-domain sources, most notably Wikipedia. In such settings, retrieval challenges often remain hidden due to language imbalances, overlap with pretraining data, and memorized content. To address this gap, we study Arabic-English RAG in a domain-specific setting using benchmarks derived from real-world corporate datasets. Our benchmarks include all combinations of languages for the user query and the supporting document, drawn independently and uniformly at random. This enables a systematic study of multilingual retrieval behavior. Our findings reveal that retrieval is a critical bottleneck in cross-lingual domain-specific scenarios, with significant performance drops occurring when the user query and supporting document languages differ. A key insight is that these failures stem primarily from the retriever's difficulty in ranking documents across languages. Finally, we propose a simple retrieval strategy that addresses this source of failure by enforcing equal retrieval from both languages, resulting in substantial improvements in cross-lingual and overall performance. These results highlight meaningful opportunities for improving multilingual retrieval, particularly in practical, real-world RAG applications.
Yibo Lyu, Rui Shao, Gongwei Chen, Yijie Zhu, Weili Guan, Liqiang Nie
As multimedia content expands, the demand for unified multimodal retrieval (UMR) in real-world applications increases. Recent work leverages multimodal large language models (MLLMs) to tackle this task. However, their large parameter size results in high training costs and low inference efficiency. To address this, we propose PUMA: a Layer-Pruned Language Model for Efficient Unified Multimodal Retrieval with Modality-Adaptive Learning. Our approach improves UMR from both structural and learning perspectives. (1) Structurally, we propose Layer-Pruned Self-Distillation, which prunes MLLMs by keeping only shallow layers while distilling features from dropped deep layers as teacher signals. This reduces parameters and preserves representation capability. (2) On the learning side, we introduce Modality-Adaptive Contrastive Learning Loss (MAC-Loss), which separates in-batch negatives into harder intra-modality and easier inter-modality groups based on the target modality, assigning different temperature strategies to enhance learning efficiency. Experiments show our method significantly reduces resource usage while maintaining strong performance.
Authors' comments: Accepted to ACM MM 2025
Anirban Saha Anik, Xiaoying Song, Elliott Wang, Bryan Wang, Bengisu Yarimbas, Lingzi Hong
Large language models (LLMs) incorporated with Retrieval-Augmented Generation (RAG) have demonstrated powerful capabilities in generating counterspeech against misinformation. However, current studies rely on limited evidence and offer less control over final outputs. To address these challenges, we propose a Multi-agent Retrieval-Augmented Framework to generate counterspeech against health misinformation, incorporating multiple LLMs to optimize knowledge retrieval, evidence enhancement, and response refinement. Our approach integrates both static and dynamic evidence, ensuring that the generated counterspeech is relevant, well-grounded, and up-to-date. Our method outperforms baseline approaches in politeness, relevance, informativeness, and factual accuracy, demonstrating its effectiveness in generating high-quality counterspeech. To further validate our approach, we conduct ablation studies to verify the necessity of each component in our framework. Furthermore, human evaluations reveal that refinement significantly enhances counterspeech quality and obtains human preference.
Fengran Mo, Yifan Gao, Chuan Meng, Xin Liu, Zhuofeng Wu, Kelong Mao, Zhengyang Wang, Pei Chen et al.
The rapid advancement of conversational search systems revolutionizes how
information is accessed by enabling the multi-turn interaction between the user
and the system. Existing conversational search systems are usually built with
two different models. This separation restricts the system from leveraging the
intrinsic knowledge of the models simultaneously, which cannot ensure the
effectiveness of retrieval benefiting the generation. The existing studies for
developing unified models cannot fully address the aspects of understanding
conversational context, managing retrieval independently, and generating
responses. In this paper, we explore how to unify dense retrieval and response
generation for large language models in conversation. We conduct joint
fine-tuning with different objectives and design two mechanisms to reduce the
inconsistency risks while mitigating data discrepancy. The evaluations on five
conversational search datasets demonstrate that our unified model can mutually
improve both tasks and outperform the existing baselines.
Authors' comments: Accepted by ACL 2025 (main)
Xueqing Xu, Boris Bolliet, Adrian Dimitrov, Andrew Laverick, Francisco Villaescusa-Navarro, Licong Xu, Íñigo Zubeldia
We evaluate 9 Retrieval Augmented Generation (RAG) agent configurations on
105 Cosmology Question-Answer (QA) pairs that we built specifically for this
purpose.The RAG configurations are manually evaluated by a human expert, that
is, a total of 945 generated answers were assessed. We find that currently the
best RAG agent configuration is with OpenAI embedding and generative model,
yielding 91.4\% accuracy. Using our human evaluation results we calibrate
LLM-as-a-Judge (LLMaaJ) system which can be used as a robust proxy for human
evaluation. These results allow us to systematically select the best RAG agent
configuration for multi-agent system for autonomous scientific discovery in
astrophysics (e.g., cmbagent presented in a companion paper) and provide us
with an LLMaaJ system that can be scaled to thousands of cosmology QA pairs. We
make our QA dataset, human evaluation results, RAG pipelines, and LLMaaJ system
publicly available for further use by the astrophysics community.
Authors' comments: Accepted contribution (spotlight) to the ICML 2025 Workshop on
Machine Learning for Astrophysics; codes:
https://huggingface.co/datasets/ASTROANTS/CosmoPaperQA,
https://github.com/CMBAgents/cmbagent, https://github.com/CMBAgents/scirag
Sezen Perçin, Xin Su, Qutub Sha Syed, Phillip Howard, Aleksei Kuvshinov, Leo Schwinn, Kay-Ulrich Scholl
Large language models (LLMs) are very costly and inefficient to update with
new information. To address this limitation, retrieval-augmented generation
(RAG) has been proposed as a solution that dynamically incorporates external
knowledge during inference, improving factual consistency and reducing
hallucinations. Despite its promise, RAG systems face practical challenges-most
notably, a strong dependence on the quality of the input query for accurate
retrieval. In this paper, we investigate the sensitivity of different
components in the RAG pipeline to various types of query perturbations. Our
analysis reveals that the performance of commonly used retrievers can degrade
significantly even under minor query variations. We study each module in
isolation as well as their combined effect in an end-to-end question answering
setting, using both general-domain and domain-specific datasets. Additionally,
we propose an evaluation framework to systematically assess the query-level
robustness of RAG pipelines and offer actionable recommendations for
practitioners based on the results of more than 1092 experiments we performed.
Authors' comments: Accepted to Generation, Evaluation & Metrics (GEM) Workshop at ACL
2025
Garapati Keerthana, Manik Gupta
Large language models (LLMs), including zero-shot and few-shot paradigms,
have shown promising capabilities in clinical text generation. However,
real-world applications face two key challenges: (1) patient data is highly
unstructured, heterogeneous, and scattered across multiple note types and (2)
clinical notes are often long and semantically dense, making naive prompting
infeasible due to context length constraints and the risk of omitting
clinically relevant information.
We introduce CLI-RAG (Clinically Informed Retrieval-Augmented Generation), a
domain-specific framework for structured and clinically grounded text
generation using LLMs. It incorporates a novel hierarchical chunking strategy
that respects clinical document structure and introduces a task-specific
dual-stage retrieval mechanism. The global stage identifies relevant note types
using evidence-based queries, while the local stage extracts high-value content
within those notes creating relevance at both document and section levels.
We apply the system to generate structured progress notes for individual
hospital visits using 15 clinical note types from the MIMIC-III dataset.
Experiments show that it preserves temporal and semantic alignment across
visits, achieving an average alignment score of 87.7%, surpassing the 80.7%
baseline from real clinician-authored notes. The generated outputs also
demonstrate high consistency across LLMs, reinforcing deterministic behavior
essential for reproducibility, reliability, and clinical trust.
Authors' comments: 12 pages, 4 figures
Matteo Masto, Vincent Favre-Nicolin, Steven Leake, Clement Atlan, Marie-Ingrid Richard, Tobias Schülli, Ewen Bellec
In Bragg Coherent Diffraction Imaging (BCDI), Phase Retrieval of highly strained crystals is often challenging with standard iterative algorithms. This computational obstacle limits the potential of the technique as it precludes the reconstruction of physically interesting highly-strained particles. Here, we propose a novel approach to this problem using a supervised Convolutional Neural Network (CNN) trained on 3D simulated diffraction data to predict the corresponding reciprocal space phase. This method allows to fully exploit the potential of the CNN by mapping functions within the same space and leveraging structural similarities between input and output. The final object is obtained by the inverse Fourier transform of the retrieved complex diffracted amplitude and is then further refined with iterative algorithms. We demonstrate that our model outperforms standard algorithms on highly strained simulated data not included in the training set, as well as on experimental data.
François Gardères, Shizhe Chen, Camille-Sovanneary Gauthier, Jean Ponce
The composed image retrieval (CIR) task is to retrieve target images given a reference image and a modification text. Recent methods for CIR leverage large pretrained vision-language models (VLMs) and achieve good performance on general-domain concepts like color and texture. However, they still struggle with application domains like fashion, because the rich and diverse vocabulary used in fashion requires specific fine-grained vision and language understanding. An additional difficulty is the lack of large-scale fashion datasets with detailed and relevant annotations, due to the expensive cost of manual annotation by specialists. To address these challenges, we introduce FACap, a large-scale, automatically constructed fashion-domain CIR dataset. It leverages web-sourced fashion images and a two-stage annotation pipeline powered by a VLM and a large language model (LLM) to generate accurate and detailed modification texts. Then, we propose a new CIR model FashionBLIP-2, which fine-tunes the general-domain BLIP-2 model on FACap with lightweight adapters and multi-head query-candidate matching to better account for fine-grained fashion-specific information. FashionBLIP-2 is evaluated with and without additional fine-tuning on the Fashion IQ benchmark and the enhanced evaluation dataset enhFashionIQ, leveraging our pipeline to obtain higher-quality annotations. Experimental results show that the combination of FashionBLIP-2 and pretraining with FACap significantly improves the model's performance in fashion CIR especially for retrieval with fine-grained modification texts, demonstrating the value of our dataset and approach in a highly demanding environment such as e-commerce websites. Code is available at https://fgxaos.github.io/facap-paper-website/.
Swapnil Bhattacharyya, Shrey Ganatra, Harshvivek Kashid, Spandan Anaokar, Shruti Nair, Reshma Sekhar, Siddharth Manohar, Rahul Hemrajani et al.
AI-based judicial assistance and case prediction have been extensively studied in criminal and civil domains, but remain largely unexplored in consumer law, especially in India. In this paper, we present Nyay-Darpan, a novel two-in-one framework that (i) summarizes consumer case files and (ii) retrieves similar case judgements to aid decision-making in consumer dispute resolution. Our methodology not only addresses the gap in consumer law AI tools but also introduces an innovative approach to evaluate the quality of the summary. The term 'Nyay-Darpan' translates into 'Mirror of Justice', symbolizing the ability of our tool to reflect the core of consumer disputes through precise summarization and intelligent case retrieval. Our system achieves over 75 percent accuracy in similar case prediction and approximately 70 percent accuracy across material summary evaluation metrics, demonstrating its practical effectiveness. We will publicly release the Nyay-Darpan framework and dataset to promote reproducibility and facilitate further research in this underexplored yet impactful domain.
Haiwen Li, Delong Liu, Zhaohui Hou, Zhicheng Zhao, Fei Su
As a challenging vision-language (VL) task, Composed Image Retrieval (CIR)
aims to retrieve target images using multimodal (image+text) queries. Although
many existing CIR methods have attained promising performance, their reliance
on costly, manually labeled triplets hinders scalability and zero-shot
capability. To address this issue, we propose a scalable pipeline for automatic
triplet generation, along with a fully synthetic dataset named Composed Image
Retrieval on High-quality Synthetic Triplets (CIRHS). Our pipeline leverages a
large language model (LLM) to generate diverse prompts, controlling a
text-to-image generative model to produce image pairs with identical elements
in each pair, which are then filtered and reorganized to form the CIRHS
dataset. In addition, we introduce Hybrid Contextual Alignment (CoAlign), a
novel CIR framework, which can accomplish global alignment and local reasoning
within a broader context, enabling the model to learn more robust and
informative representations. By utilizing the synthetic CIRHS dataset, CoAlign
achieves outstanding zero-shot performance on three commonly used benchmarks,
demonstrating for the first time the feasibility of training CIR models on a
fully synthetic dataset. Furthermore, under supervised training, our method
outperforms all the state-of-the-art supervised CIR approaches, validating the
effectiveness of our proposed retrieval framework. The code and the CIRHS
dataset will be released soon.
Authors' comments: This paper was originally submitted to ACM MM 2025 on April 12, 2025
Y. Du
Vector retrieval systems exhibit significant performance variance across
queries due to heterogeneous embedding quality. We propose a lightweight
framework for predicting retrieval performance at the query level by combining
quantization robustness and neighborhood density metrics. Our approach is
motivated by the observation that high-quality embeddings occupy geometrically
stable regions in the embedding space and exhibit consistent neighborhood
structures. We evaluate our method on 4 standard retrieval datasets, showing
consistent improvements of 9.4$\pm$1.2\% in Recall@10 over competitive
baselines. The framework requires minimal computational overhead (less than 5\%
of retrieval time) and enables adaptive retrieval strategies. Our analysis
reveals systematic patterns in embedding quality across different query types,
providing insights for targeted training data augmentation.
Authors' comments: 7 pages
Yuan An, John Liu, Niyam Acharya, Ruhma Hashmi
Retrieval practice is a well-established pedagogical technique known to significantly enhance student learning and knowledge retention. However, generating high-quality retrieval practice questions is often time-consuming and labor intensive for instructors, especially in rapidly evolving technical subjects. Large Language Models (LLMs) offer the potential to automate this process by generating questions in response to prompts, yet the effectiveness of LLM-generated retrieval practice on student learning remains to be established. In this study, we conducted an empirical study involving two college-level data science courses, with approximately 60 students. We compared learning outcomes during one week in which students received LLM-generated multiple-choice retrieval practice questions to those from a week in which no such questions were provided. Results indicate that students exposed to LLM-generated retrieval practice achieved significantly higher knowledge retention, with an average accuracy of 89%, compared to 73% in the week without such practice. These findings suggest that LLM-generated retrieval questions can effectively support student learning and may provide a scalable solution for integrating retrieval practice into real-time teaching. However, despite these encouraging outcomes and the potential time-saving benefits, cautions must be taken, as the quality of LLM-generated questions can vary. Instructors must still manually verify and revise the generated questions before releasing them to students.
Shashank Verma, Fengyi Jiang, Xiangning Xue
Biomedical semantic question answering rooted in information retrieval can
play a crucial role in keeping up to date with vast, rapidly evolving and
ever-growing biomedical literature. A robust system can help researchers,
healthcare professionals and even layman users access relevant knowledge
grounded in evidence. The BioASQ 2025 Task13b Challenge serves as an important
benchmark, offering a competitive platform for advancement of this space. This
paper presents the methodologies and results from our participation in this
challenge where we built a Retrieval-Augmented Generation (RAG) system that can
answer biomedical questions by retrieving relevant PubMed documents and
snippets to generate answers. For the retrieval task, we generated dense
embeddings from biomedical articles for initial retrieval, and applied an
ensemble of finetuned cross-encoders and large language models (LLMs) for
re-ranking to identify top relevant documents. Our solution achieved an MAP@10
of 0.1581, placing 10th on the leaderboard for the retrieval task. For answer
generation, we employed few-shot prompting of instruction-tuned LLMs. Our
system achieved macro-F1 score of 0.95 for yes/no questions (rank 12), Mean
Reciprocal Rank (MRR) of 0.64 for factoid questions (rank 1), mean-F1 score of
0.63 for list questions (rank 5), and ROUGE-SU4 F1 score of 0.29 for ideal
answers (rank 11).
Authors' comments: Paper submitted to CLEF 2025 CEUR-WS
Alex ZH Dou, Zhongwei Wan, Dongfei Cui, Xin Wang, Jing Xiong, Haokun Lin, Chaofan Tao, Shen Yan et al.
Test-time scaling has emerged as a promising paradigm in language modeling,
leveraging additional computational resources at inference time to enhance
model performance. In this work, we introduce R2-LLMs, a novel and versatile
hierarchical retrieval-augmented reasoning framework designed to improve
test-time scaling in large language models (LLMs) without requiring
distillation from more advanced models to obtain chain-of-thought (CoT)
training data. R2-LLMs enhances inference-time generalization by integrating
dual-level retrieval-based in-context learning: (1) At the coarse level, our
approach extracts abstract templates from complex reasoning problems and
retrieves similar problem-answer pairs to facilitate high-level in-context
learning; (2) At the fine level, during Monte Carlo Tree Search (MCTS), R2-LLMs
efficiently retrieves analogous intermediate solution steps from reference
mathematical problem datasets, refining step-wise reasoning with the aid of a
process reward model (PRM) for scoring. R2-LLMs is a robust hierarchical
reasoning-augmentation method that enhances in-context-level reasoning while
seamlessly integrating with step-level tree search methods. Utilizing PRM, it
refines both candidate generation and decision-making for improved reasoning
accuracy. Empirical evaluations on the MATH500, GSM8K, and OlympiadBench-TO
datasets achieve substantial relative improvement with an increase of up to 16%
using LLaMA-3.1-8B compared to the baselines, showcasing the effectiveness of
our approach in complex reasoning tasks.
Authors' comments: Technical Report