Jeffrey Eben, Aitzaz Ahmad, Stephen Lau
Despite advances in large language model (LLM)-based natural language interfaces for databases, scaling to enterprise-level data catalogs remains an under-explored challenge. Prior works addressing this challenge rely on domain-specific fine-tuning - complicating deployment - and fail to leverage important semantic context contained within database metadata. To address these limitations, we introduce a component-based retrieval architecture that decomposes database schemas and metadata into discrete semantic units, each separately indexed for targeted retrieval. Our approach prioritizes effective table identification while leveraging column-level information, ensuring the total number of retrieved tables remains within a manageable context budget. Experiments demonstrate that our method maintains high recall and accuracy, with our system outperforming baselines over massive databases with varying structure and available metadata. Our solution enables practical text-to-SQL systems deployable across diverse enterprise settings without specialized fine-tuning, addressing a critical scalability gap in natural language database interfaces.
Jie He, Victor Gutierrez Basulto, Jeff Z. Pan
Reinforcement learning-based retrieval-augmented generation (RAG) methods enhance the reasoning abilities of large language models (LLMs). However, most rely only on final-answer rewards, overlooking intermediate reasoning quality. This paper analyzes existing RAG reasoning models and identifies three main failure patterns: (1) information insufficiency, meaning the model fails to retrieve adequate support; (2) faulty reasoning, where logical or content-level flaws appear despite sufficient information; and (3) answer-reasoning inconsistency, where a valid reasoning chain leads to a mismatched final answer. We propose TIRESRAG-R1, a novel framework using a think-retrieve-reflect process and a multi-dimensional reward system to improve reasoning and stability. TIRESRAG-R1 introduces: (1) a sufficiency reward to encourage thorough retrieval; (2) a reasoning quality reward to assess the rationality and accuracy of the reasoning chain; and (3) a reflection reward to detect and revise errors. It also employs a difficulty-aware reweighting strategy and training sample filtering to boost performance on complex tasks. Experiments on four multi-hop QA datasets show that TIRESRAG-R1 outperforms prior RAG methods and generalizes well to single-hop tasks. The code and data are available at: https://github.com/probe2/TIRESRAG-R1.
Cheng Cheng, Baixiang Wu, Jun Xian
In this paper, we study conjugate phase retrieval for complex-valued signals residing on graphs and explore its applications to shift-invariant spaces. Given a complex-valued graph signal $\bf f$ residing on the graph $\mathcal G$, we introduce a graph ${\mathcal G}_{\bf f}$ and show that its connectivity is sufficient to determine $\bf f$ up to a global unimodular constant and conjugation. We then construct two explicit graph models and show that graph signals residing on them can be recovered, up to a unimodular constant and conjugation, from their absolute values on the vertices and the relative magnitudes between neighboring vertices. Building on this graph-based framework, we apply our results to shift-invariant spaces generated by real-valued functions. For signals in the Paley-Wiener space, we show that any complex-valued function can be recovered, up to a unimodular constant and conjugation, from structured phaseless samples taken at three times the Nyquist rate. For more general shift-invariant spaces, we establish the conjugate phase retrievability of signals from phaseless samples collected on a discrete sampling set, in conjunction with relative magnitude measurements between neighboring sample points. Two numerical reconstruction algorithms are introduced to recover the signals in the Paley-Wiener space and general shift-invariant spaces, up to a unimodular constant and conjugation, from the given phaseless measurements.
Nicola Fanelli, Gennaro Vessio, Giovanna Castellano
Analyzing digitized artworks presents unique challenges, requiring not only visual interpretation but also a deep understanding of rich artistic, contextual, and historical knowledge. We introduce ArtSeek, a multimodal framework for art analysis that combines multimodal large language models with retrieval-augmented generation. Unlike prior work, our pipeline relies only on image input, enabling applicability to artworks without links to Wikidata or Wikipedia, a situation common in most digitized collections. ArtSeek integrates three key components: an intelligent multimodal retrieval module based on late interaction retrieval, a contrastive multitask classification network for predicting artist, genre, style, media, and tags, and an agentic reasoning strategy enabled through in-context examples for complex visual question answering and artwork explanation via Qwen2.5-VL. Central to this approach is WikiFragments, a Wikipedia-scale dataset of image-text fragments curated to support knowledge-grounded multimodal reasoning. Our framework achieves state-of-the-art results on multiple benchmarks, including a +8.4% F1 improvement in style classification over GraphCLIP and a +7.1 BLEU@1 gain in captioning on ArtPedia. Qualitative analyses show that ArtSeek can interpret visual motifs, infer historical context, and retrieve relevant knowledge, even for obscure works. Though focused on visual arts, our approach generalizes to other domains requiring external knowledge, supporting scalable multimodal AI research. Both the dataset and the source code will be made publicly available at https://github.com/cilabuniba/artseek.
M. Amani, M. Dashtdar
A method for direct phase difference reconstruction using single-shot dual-wavelength off-axis digital holography is presented. This approach enables direct imaging of samples with high steps without the need to reconstruct phase images at each individual wavelength. As the dual wavelengths in the reference and object arms pass through a common path in this configuration, single-wavelength arrangements can be applied. Due to the unique capability of the presented method, a sodium-vapor lamp source has been utilized to obtain two closely spaced wavelengths ($\lambda_1 = 589$ nm and $\lambda_2 = 589.6$ nm), resulting in a synthetic wavelength of $\Lambda = 578.8\ \mu\mathrm{m}$ in the Michelson configuration. To evaluate the validity of the method, the height of an air wedge measured using the proposed approach has been compared with the result obtained from phase unwrapping in the single-wavelength method. The capability of the proposed technique to image samples with high step structures is further demonstrated by measuring a $30\ \mu\mathrm{m}$ step height and a glass plate with a thickness of approximately $140\ \mu\mathrm{m}$.
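As a quick check of the synthetic wavelength quoted above, the sketch below assumes the standard dual-wavelength definition $\Lambda = \lambda_1 \lambda_2 / |\lambda_1 - \lambda_2|$. The abstract does not state the formula, and the function name is illustrative only.

```python
def synthetic_wavelength(lambda_1_nm: float, lambda_2_nm: float) -> float:
    """Return the synthetic (beat) wavelength in nanometres.

    Assumes the standard definition Lambda = l1 * l2 / |l1 - l2|.
    """
    if lambda_1_nm == lambda_2_nm:
        raise ValueError("the two wavelengths must differ")
    return lambda_1_nm * lambda_2_nm / abs(lambda_1_nm - lambda_2_nm)


# Sodium doublet values quoted in the abstract, converted from nm to micrometres.
big_lambda_um = synthetic_wavelength(589.0, 589.6) / 1000.0
print(f"{big_lambda_um:.1f} um")  # prints "578.8 um"
```

Because the two lines are only 0.6 nm apart, the synthetic wavelength is roughly a thousand times longer than either optical wavelength, which is consistent with the abstract's claim that step heights of tens of micrometres can be measured directly.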
Hao Ye, Mengshi Qi, Zhaohong Liu, Liang Liu, Huadong Ma
In this work, we study how vision-language models (VLMs) can be utilized to enhance the safety of autonomous driving systems, including perception, situational understanding, and path planning. However, existing research has largely overlooked the evaluation of these models in traffic safety-critical driving scenarios. To bridge this gap, we create a benchmark (SafeDrive228K) and propose a new baseline based on VLM with knowledge graph-based retrieval-augmented generation (SafeDriveRAG) for visual question answering (VQA). Specifically, we introduce SafeDrive228K, the first large-scale multimodal question-answering benchmark comprising 228K examples across 18 sub-tasks. This benchmark encompasses a diverse range of traffic safety queries, from traffic accidents and corner cases to common safety knowledge, enabling a thorough assessment of the comprehension and reasoning abilities of the models. Furthermore, we propose a plug-and-play multimodal knowledge graph-based retrieval-augmented generation approach that employs a novel multi-scale subgraph retrieval algorithm for efficient information retrieval. By incorporating traffic safety guidelines collected from the Internet, this framework further enhances the model's capacity to handle safety-critical situations. Finally, we conduct comprehensive evaluations on five mainstream VLMs to assess their reliability in safety-sensitive driving tasks. Experimental results demonstrate that integrating RAG significantly improves performance, achieving a +4.73% gain in Traffic Accidents tasks, +8.79% in Corner Cases tasks and +14.57% in Traffic Safety Commonsense across five mainstream VLMs, underscoring the potential of our proposed benchmark and methodology for advancing research in traffic safety. Our source code and data are available at https://github.com/Lumos0507/SafeDriveRAG.
Jungyeon Lee, Kangmin Lee, Taeuk Kim
Knowledge conflict often arises in retrieval-augmented generation (RAG) systems, where retrieved documents may be inconsistent with one another or contradict the model's parametric knowledge. Existing benchmarks for investigating the phenomenon have notable limitations, including a narrow focus on the question answering setup, heavy reliance on entity substitution techniques, and a restricted range of conflict types. To address these issues, we propose a knowledge graph (KG)-based framework that generates varied and subtle conflicts between two similar yet distinct contexts, while ensuring interpretability through the explicit relational structure of KGs. Experimental results on our benchmark, MAGIC, provide intriguing insights into the inner workings of LLMs regarding knowledge conflict: both open-source and proprietary models struggle with conflict detection -- especially when multi-hop reasoning is required -- and often fail to pinpoint the exact source of contradictions. Finally, we present in-depth analyses that serve as a foundation for improving LLMs in integrating diverse, sometimes even conflicting, information.
Zhichuan Wang, Yang Zhou, Zhe Liu, Rui Yu, Song Bai, Yulong Wang, Xinwei He, Xiang Bai
Open-set 3D object retrieval (3DOR) is an emerging task aiming to retrieve 3D
objects of unseen categories beyond the training set. Existing methods
typically utilize all modalities (i.e., voxels, point clouds, multi-view
images) and train specific backbones before fusion. However, they still
struggle to produce generalized representations due to insufficient 3D training
data. Being contrastively pre-trained on web-scale image-text pairs, CLIP
inherently produces generalized representations for a wide range of downstream
tasks. Building upon it, we present a simple yet effective framework named
Describe, Adapt and Combine (DAC) by taking only multi-view images for open-set
3DOR. DAC innovatively synergizes a CLIP model with a multi-modal large
language model (MLLM) to learn generalized 3D representations, where the MLLM
is used for dual purposes. First, it describes the seen category information to
align with CLIP's training objective for adaptation during training. Second, it
provides external hints about unknown objects complementary to visual cues
during inference. To improve the synergy, we introduce an Additive-Bias
Low-Rank adaptation (AB-LoRA), which alleviates overfitting and further
enhances the generalization to unseen categories. With only multi-view images,
DAC significantly surpasses prior art by an average of +10.01% mAP on four
open-set 3DOR datasets. Moreover, its generalization is also validated on
image-based and cross-dataset setups. Code is available at
https://github.com/wangzhichuan123/DAC.
Authors' comments: Accepted to ICCV 2025
Ashley Rector, Keaton Minor, Kamden Minor, Jeff McCormack, Beth Breeden, Ryan Nowers, Jay Dorris
This study evaluated Sherpa Rx, an artificial intelligence tool leveraging large language models and retrieval-augmented generation (RAG) for pharmacogenomics, to validate its performance on key response metrics. Sherpa Rx integrated Clinical Pharmacogenetics Implementation Consortium (CPIC) guidelines with Pharmacogenomics Knowledgebase (PharmGKB) data to generate contextually relevant responses. A dataset (N=260 queries) spanning 26 CPIC guidelines was used to evaluate drug-gene interactions, dosing recommendations, and therapeutic implications. In Phase 1, only CPIC data was embedded. Phase 2 additionally incorporated PharmGKB content. Responses were scored on accuracy, relevance, clarity, completeness (5-point Likert scale), and recall. Wilcoxon signed-rank tests compared accuracy between Phase 1 and Phase 2, and between Phase 2 and ChatGPT-4omini. A 20-question quiz assessed the tool's real-world applicability against other models. In Phase 1 (N=260), Sherpa Rx demonstrated high performance, with accuracy 4.9, relevance 5.0, clarity 5.0, completeness 4.8, and recall 0.99. The subset analysis (N=20) showed improvements in accuracy (4.6 vs. 4.4, Phase 2 vs. Phase 1 subset) and completeness (5.0 vs. 4.8). ChatGPT-4omini performed comparably in relevance (5.0) and clarity (4.9) but lagged in accuracy (3.9) and completeness (4.2). The difference in accuracy between Phase 1 and Phase 2 was not statistically significant. However, Phase 2 significantly outperformed ChatGPT-4omini. On the 20-question quiz, Sherpa Rx achieved 90% accuracy, outperforming other models. Integrating additional resources like CPIC and PharmGKB with RAG enhances AI accuracy and performance. This study highlights the transformative potential of generative AI like Sherpa Rx in pharmacogenomics, improving decision-making with accurate, personalized responses.
Sodtavilan Odonchimed, Tatsuya Matsushima, Simon Holk, Yusuke Iwasawa, Yutaka Matsuo
Diffusion Policies (DPs) have attracted attention for their ability to achieve significant accuracy improvements in various imitation learning tasks. However, DPs depend on Diffusion Models, which require multiple noise removal steps to generate a single action, resulting in long generation times. To solve this problem, knowledge distillation-based methods such as Consistency Policy (CP) have been proposed. However, these methods require a significant amount of training time, especially for difficult tasks. In this study, we propose RAGDP (Retrieve-Augmented Generation for Diffusion Policies) as a novel framework that eliminates the need for additional training by using a knowledge base to expedite the inference of pre-trained DPs. Concretely, RAGDP encodes observation-action pairs through the DP encoder to construct a vector database of expert demonstrations. During inference, the current observation is embedded, and the most similar expert action is extracted. This extracted action is combined with an intermediate noise removal step to reduce the number of steps required compared to the original diffusion step. We show that by using RAGDP with the base model and existing acceleration methods, we improve the accuracy and speed trade-off with no additional training. Even when accelerating the models 20 times, RAGDP maintains an advantage in accuracy, with a 7% increase over distillation models such as CP.
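The retrieval step described above can be sketched as a nearest-neighbour lookup over stored (observation embedding, expert action) pairs. This is a minimal illustration under stated assumptions: cosine similarity as the metric, toy two-dimensional embeddings, and hypothetical function names. It is not the authors' implementation, and it does not show how the retrieved action is merged into the denoising schedule.

```python
import math
from typing import List, Sequence, Tuple

Vector = Sequence[float]


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity between two vectors (0.0 if either has zero norm)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve_action(query: Vector, database: List[Tuple[Vector, Vector]]) -> Vector:
    """Return the expert action whose observation embedding is closest to the query."""
    _, best_action = max(database, key=lambda pair: cosine(query, pair[0]))
    return best_action


# Toy database: (observation embedding, expert action). Values are made up.
database = [
    ([1.0, 0.0], [0.5, 0.0]),
    ([0.0, 1.0], [0.0, -0.5]),
]
action = retrieve_action([0.9, 0.1], database)  # closest to the first entry
```

In a full pipeline the retrieved action would serve as the starting point for a shortened denoising run, which is where the reduction in step count comes from.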
Likun Tan, Kuan-Wei Huang, Kevin Wu
Hallucinations in large language models pose a critical challenge for applications requiring factual reliability, particularly in high-stakes domains such as finance. This work presents an effective approach for detecting and editing factually incorrect content in model-generated responses based on the provided context. Given a user-defined domain-specific error taxonomy, we construct a synthetic dataset by inserting tagged errors into financial question-answering corpora and then fine-tune four language models, Phi-4, Phi-4-mini, Qwen3-4B, and Qwen3-14B, to detect and edit these factual inaccuracies. Our best-performing model, fine-tuned Phi-4, achieves an 8% improvement in binary F1 score and a 30% gain in overall detection performance compared to OpenAI-o3. Notably, our fine-tuned Phi-4-mini model, despite having only 4 billion parameters, maintains competitive performance with just a 2% drop in binary detection and a 0.1% decline in overall detection compared to OpenAI-o3. Our work provides a practical solution for detecting and editing factual inconsistencies in financial text generation while introducing a generalizable framework that can enhance the trustworthiness and alignment of large language models across diverse applications beyond finance. Our code and data are available at https://github.com/pegasi-ai/fine-grained-editting.
Duc-Tai Dinh, Duc Anh Khoa Dinh
We present ZSE-Cap (Zero-Shot Ensemble for Captioning), our 4th place system in the Event-Enriched Image Analysis (EVENTA) shared task on article-grounded image retrieval and captioning. Our zero-shot approach requires no finetuning on the competition's data. For retrieval, we ensemble similarity scores from CLIP, SigLIP, and DINOv2. For captioning, we leverage a carefully engineered prompt to guide the Gemma 3 model, enabling it to link high-level events from the article to the visual content in the image. Our system achieved a final score of 0.42002, securing a top-4 position on the private test set, demonstrating the effectiveness of combining foundation models through ensembling and prompting. Our code is available at https://github.com/ductai05/ZSE-Cap.
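Score ensembling of the kind mentioned above can be sketched as follows. The abstract does not say how the per-model scores are combined, so this sketch assumes min-max normalization per model followed by an unweighted mean. The scores and function names are hypothetical.

```python
from typing import Dict, List


def min_max(scores: List[float]) -> List[float]:
    """Rescale scores to [0, 1] so different models become comparable."""
    lo, hi = min(scores), max(scores)
    if hi == lo:
        return [0.0 for _ in scores]
    return [(s - lo) / (hi - lo) for s in scores]


def ensemble_rank(per_model_scores: Dict[str, List[float]]) -> List[int]:
    """Return candidate indices sorted by mean normalized score, best first."""
    normalized = [min_max(s) for s in per_model_scores.values()]
    n_candidates = len(normalized[0])
    fused = [
        sum(model[i] for model in normalized) / len(normalized)
        for i in range(n_candidates)
    ]
    return sorted(range(n_candidates), key=lambda i: fused[i], reverse=True)


# Made-up similarity scores for three candidate images.
scores = {
    "clip": [0.31, 0.28, 0.22],
    "siglip": [0.10, 0.14, 0.05],
    "dinov2": [0.62, 0.70, 0.40],
}
ranking = ensemble_rank(scores)  # candidate 1 wins two of three models
```

Normalizing before averaging matters because the raw score ranges of different encoders are not comparable; without it, the model with the widest range would dominate the fused ranking.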
Michele Grimaldi, Carlo Cernicchiaro, Sebastian Realpe Rua, Alaaeddine El-Masri-El-Chaarani, Markus Buchholz, Loizos Michael, Pere Ridao Rodriguez, Ignacio Carlucho et al.
Robotic platforms have become essential for marine operations by providing regular and continuous access to offshore assets, such as underwater infrastructure inspection, environmental monitoring, and resource exploration. However, the complex and dynamic nature of underwater environments, characterized by limited visibility, unpredictable currents, and communication constraints, presents significant challenges that demand advanced autonomy while ensuring operator trust and oversight. Central to addressing these challenges are knowledge representation and reasoning techniques, particularly knowledge graphs and retrieval-augmented generation (RAG) systems, that enable robots to efficiently structure, retrieve, and interpret complex environmental data. These capabilities empower robotic agents to reason, adapt, and respond effectively to changing conditions. The primary goal of this work is to demonstrate both multi-agent autonomy and shared autonomy, where multiple robotic agents operate independently while remaining connected to a human supervisor. We show how a RAG-powered large language model, augmented with knowledge graph data and domain taxonomy, enables autonomous multi-agent decision-making and facilitates seamless human-robot interaction, resulting in 100% mission validation and behavior completeness. Finally, ablation studies reveal that without structured knowledge from the graph and/or taxonomy, the LLM is prone to hallucinations, which can compromise decision quality.
Ran Xu, Yuchen Zhuang, Yue Yu, Haoyu Wang, Wenqi Shi, Carl Yang
Retrieval-augmented generation (RAG) enhances large language models (LLMs) by
integrating external knowledge retrieved at inference time. While RAG
demonstrates strong performance on benchmarks largely derived from
general-domain corpora like Wikipedia, its effectiveness under realistic,
diverse retrieval scenarios remains underexplored. We evaluated RAG systems
using MassiveDS, a large-scale datastore with a mixture of knowledge, and
identified critical limitations: retrieval mainly benefits smaller models,
rerankers add minimal value, and no single retrieval source consistently
excels. Moreover, current LLMs struggle to route queries across heterogeneous
knowledge sources. These findings highlight the need for adaptive retrieval
strategies before deploying RAG in real-world settings. Our code and data can
be found at https://github.com/ritaranx/RAG_in_the_Wild.
Authors' comments: Work in Progress. Code will be published at:
https://github.com/ritaranx/RAG_in_the_Wild
Xin Zhang, Lissette Iturburu, Juan Nicolas Villamizar, Xiaoyu Liu, Manuel Salmeron, Shirley J. Dyke, Julio Ramirez
Structural drawings are widely used in many fields, e.g., mechanical engineering and civil engineering. In civil engineering, structural drawings serve as the main communication tool between architects, engineers, and builders to avoid conflicts, act as legal documentation, and provide a reference for future maintenance or evaluation needs. They are often organized using key elements such as title/subtitle blocks, scales, plan views, elevation views, sections, and detailed sections, which are annotated with standardized symbols and line types for interpretation by engineers and contractors. Despite advances in software capabilities, the task of generating a structural drawing remains labor-intensive and time-consuming for structural engineers. Here we introduce a novel generative AI-based method for generating structural drawings employing a large language model (LLM) agent. The method incorporates a retrieval-augmented generation (RAG) technique using externally-sourced facts to enhance the accuracy and reliability of the language model. This method is capable of understanding varied natural language descriptions, processing these to extract necessary information, and generating code to produce the desired structural drawing in AutoCAD. The approach developed, demonstrated, and evaluated herein enables the efficient and direct conversion of a structural drawing's natural language description into an AutoCAD drawing. It significantly reduces the workload compared to the current working process associated with manual drawing production and facilitates the typical iterative process engineers use to express design ideas in a simplified way.
Rahul Raja, Arpita Vats
Vector databases typically rely on approximate nearest neighbor (ANN) search to retrieve the top-k closest vectors to a query in embedding space. While effective, this approach often yields semantically redundant results, missing the diversity and contextual richness required by applications such as retrieval-augmented generation (RAG), multi-hop QA, and memory-augmented agents. We introduce a new retrieval paradigm: semantic compression, which aims to select a compact, representative set of vectors that captures the broader semantic structure around a query. We formalize this objective using principles from submodular optimization and information geometry, and show that it generalizes traditional top-k retrieval by prioritizing coverage and diversity. To operationalize this idea, we propose graph-augmented vector retrieval, which overlays semantic graphs (e.g., kNN or knowledge-based links) atop vector spaces to enable multi-hop, context-aware search. We theoretically analyze the limitations of proximity-based retrieval under high-dimensional concentration and highlight how graph structures can improve semantic coverage. Our work outlines a foundation for meaning-centric vector search systems, emphasizing hybrid indexing, diversity-aware querying, and structured semantic retrieval. We make our implementation publicly available to foster future research in this area.
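To make the contrast with plain top-k retrieval concrete, the sketch below uses greedy maximal marginal relevance (MMR), a well-known diversity-aware heuristic. It is swapped in here purely for illustration and is not the authors' semantic compression objective or their graph-augmented method. The vectors, the `trade_off` parameter, and the function names are assumptions.

```python
import math
from typing import List, Sequence

Vector = Sequence[float]


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity between two vectors (0.0 if either has zero norm)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def top_k(query: Vector, vectors: List[Vector], k: int) -> List[int]:
    """Plain top-k: rank purely by similarity to the query."""
    order = sorted(range(len(vectors)), key=lambda i: cosine(query, vectors[i]), reverse=True)
    return order[:k]


def mmr_select(query: Vector, vectors: List[Vector], k: int, trade_off: float = 0.5) -> List[int]:
    """Greedy MMR: balance relevance against redundancy with already selected items."""
    selected: List[int] = []
    candidates = list(range(len(vectors)))
    while candidates and len(selected) < k:
        def score(i: int) -> float:
            relevance = cosine(query, vectors[i])
            redundancy = max((cosine(vectors[i], vectors[j]) for j in selected), default=0.0)
            return trade_off * relevance - (1.0 - trade_off) * redundancy
        best = max(candidates, key=score)
        selected.append(best)
        candidates.remove(best)
    return selected


# Toy data: vectors 0 and 1 are near-duplicates; vector 2 covers a different direction.
vectors = [[1.0, 0.0], [0.99, 0.05], [0.6, 0.8]]
query = [1.0, 0.2]
plain = top_k(query, vectors, 2)        # returns the two near-duplicates
diverse = mmr_select(query, vectors, 2)  # keeps one duplicate, adds vector 2
```

Setting `trade_off=1.0` removes the redundancy penalty and reduces MMR to plain top-k, which mirrors the abstract's point that coverage-oriented selection generalizes proximity-only retrieval.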
Georgios Varnavides, Julie Marie Bekkevold, Stephanie M Ribet, Mary C Scott, Lewys Jones, Colin Ophus
The contrast transfer function (CTF) is widely used to evaluate phase
retrieval methods in scanning transmission electron microscopy (STEM),
including center-of-mass imaging, parallax imaging, direct ptychography, and
iterative ptychography. However, the CTF reflects only the maximum usable
signal, neglecting the effects of finite electron fluence and the
Poisson-limited nature of detection. As a result, it can significantly
overestimate practical performance, especially in low-dose regimes. Here, we
employ the spectral signal-to-noise ratio (SSNR) as a dose-aware statistical
framework to evaluate the recoverable signal as a function of spatial
frequency. Using numerical reconstructions of white-noise objects, we show that
center-of-mass, parallax, and direct ptychography exhibit dose-independent
SSNRs, with closed-form analytic expressions. In contrast, iterative
ptychography exhibits a surprising dose dependence: at low fluence, its SSNR
converges to that of direct ptychography; at high fluence, it saturates at a
value consistent with the maximum detective quantum efficiency predicted by
recent quantum Fisher information bounds. The results highlight the limitations
of CTF-based evaluation and motivate SSNR as a more accurate, dose-aware metric
for assessing STEM phase retrieval methods.
Authors' comments: 9 pages, 5 figures
Minghao Tang, Shiyu Ni, Jiafeng Guo, Keping Bi
Retrieval-augmented generation (RAG) has been widely adopted to augment large language models (LLMs) with external knowledge for knowledge-intensive tasks. However, its effectiveness is often undermined by the presence of noisy (i.e., low-quality) retrieved passages. Enhancing LLMs' robustness to such noise is critical for improving the reliability of RAG systems. Recent advances have equipped LLMs with strong reasoning and self-reflection capabilities, allowing them to identify and correct errors in their reasoning process. Inspired by this ability, we propose Passage Injection, a simple yet effective method that explicitly incorporates retrieved passages into LLMs' reasoning process, aiming to enhance the model's ability to recognize and resist noisy passages. We validate Passage Injection under general RAG settings using BM25 as the retriever. Experiments on four reasoning-enhanced LLMs across four factual QA datasets demonstrate that Passage Injection significantly improves overall RAG performance. Further analysis on two noisy retrieval settings (random noise, where the model is provided irrelevant passages, and counterfactual noise, where it is given misleading passages) shows that Passage Injection consistently improves robustness. Controlled experiments confirm that Passage Injection can also effectively leverage helpful passages. These findings suggest that incorporating passages in LLMs' reasoning process is a promising direction for building more robust RAG systems. The code can be found at https://github.com/mh-tang/Passage-Injection.
Jack J. Davey, Kai Hou Yip, Quentin Changeat, Ingo P. Waldmann
In studies of exoplanet atmospheres using transmission spectroscopy, Bayesian
retrievals are the most popular form of analysis. In these procedures it is
common to adopt a Gaussian likelihood. However, this implicitly assumes that
the upper and lower error bars on the spectral points are equal. With recent
observations from the James Webb Space Telescope (JWST) offering higher quality
of data, it is worth revisiting this assumption to understand the impact that
an asymmetry between the error bars may have on retrieved parameters. In this
study, we challenge the approximation by comparing retrievals using a
symmetric, Gaussian likelihood, and an asymmetric, split normal likelihood. We
find that the influence of this assumption is minimal at the scales of
asymmetry observed in JWST observations of WASP-39 b (with a maximum asymmetry
of 77%) but we show that it would become critical with greater levels of
asymmetry (e.g. an average asymmetry of 80%). Furthermore, we stress the
importance of the shape of the asymmetric distribution and the difficulty in
fitting this distribution from three summary statistics (the median and an
upper and lower bound on the transit depth). An asymmetric likelihood sampler
will incorrectly predict parameters if the shape of the likelihood does not
match that of the underlying noise distribution even when the levels of
asymmetry are equal in both. Overall, we find that it is safe to use the
Gaussian likelihood assumption for current datasets but it is worth considering
the potential bias if greater asymmetries are observed.
Authors' comments: 17 pages, 16 figures - Submitted to RASTI
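The two likelihoods compared in the abstract above can be sketched for a single data point as follows. This assumes the usual two-piece (split) normal density, with scale `sigma_lo` below the mode and `sigma_hi` above it. How the authors map reported error bars onto these scales is not stated, so the function names and the numbers are illustrative only.

```python
import math


def gaussian_logpdf(x: float, mu: float, sigma: float) -> float:
    """Log-density of a symmetric normal distribution."""
    return -0.5 * ((x - mu) / sigma) ** 2 - math.log(sigma * math.sqrt(2.0 * math.pi))


def split_normal_logpdf(x: float, mode: float, sigma_lo: float, sigma_hi: float) -> float:
    """Log-density of a two-piece normal: sigma_lo below the mode, sigma_hi above."""
    sigma = sigma_lo if x < mode else sigma_hi
    # Shared normalization constant: sqrt(2 / pi) / (sigma_lo + sigma_hi).
    log_norm = 0.5 * math.log(2.0 / math.pi) - math.log(sigma_lo + sigma_hi)
    return log_norm - 0.5 * ((x - mode) / sigma) ** 2


# With equal scales the split normal reduces to the Gaussian.
symmetric_gap = split_normal_logpdf(1.3, 1.0, 0.2, 0.2) - gaussian_logpdf(1.3, 1.0, 0.2)

# With unequal scales, the same offset is penalized more on the narrow side.
below = split_normal_logpdf(0.9, 1.0, 0.1, 0.3)
above = split_normal_logpdf(1.1, 1.0, 0.1, 0.3)
```

The asymmetric case shows why the choice can bias a retrieval: a model that lands an equal distance above or below the measured value receives a different log-likelihood depending on which error bar is larger.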
Hengran Zhang, Keping Bi, Jiafeng Guo, Jiaming Zhang, Shuaiqiang Wang, Dawei Yin, Xueqi Cheng
Retrieval-augmented generation (RAG) enhances large language models (LLMs) by
incorporating retrieved information. Standard retrieval processes prioritize
relevance, focusing on topical alignment between queries and passages. In
contrast, in RAG, the emphasis has shifted to utility, which considers the
usefulness of passages for generating accurate answers. Despite empirical
evidence showing the benefits of utility-based retrieval in RAG, the high
computational cost of using LLMs for utility judgments limits the number of
passages evaluated. This restriction is problematic for complex queries
requiring extensive information. To address this, we propose a method to
distill the utility judgment capabilities of LLMs into smaller, more efficient
models. Our approach focuses on utility-based selection rather than ranking,
enabling dynamic passage selection tailored to specific queries without the
need for fixed thresholds. We train student models to learn pseudo-answer
generation and utility judgments from teacher LLMs, using a sliding window
method that dynamically selects useful passages. Our experiments demonstrate
that utility-based selection provides a flexible and cost-effective solution
for RAG, significantly reducing computational costs while improving answer
quality. We present the distillation results using Qwen3-32B as the teacher
model for both relevance ranking and utility-based selection, distilled into
RankQwen1.7B and UtilityQwen1.7B. Our findings indicate that for complex
questions, utility-based selection is more effective than relevance ranking in
enhancing answer generation performance. We will release the relevance ranking
and utility-based selection annotations for the MS MARCO dataset, supporting
further research in this area.
Authors' comments: 9 pages, 5 figures