Viktor Moskvoretskii, Maria Lysyuk, Mikhail Salnikov, Nikolay Ivanov, Sergey Pletenev, Daria Galimzianova, Nikita Krayko, Vasily Konovalov et al.
Retrieval Augmented Generation (RAG) improves correctness of Question
Answering (QA) and addresses hallucinations in Large Language Models (LLMs),
yet greatly increases computational costs. Besides, RAG is not always needed, as
it may introduce irrelevant information. Recent adaptive retrieval methods
integrate LLMs' intrinsic knowledge with external information appealing to LLM
self-knowledge, but they often neglect efficiency evaluations and comparisons
with uncertainty estimation techniques. We bridge this gap by conducting a
comprehensive analysis of 35 adaptive retrieval methods, including 8 recent
approaches and 27 uncertainty estimation techniques, across 6 datasets using 10
metrics for QA performance, self-knowledge, and efficiency. Our findings show
that uncertainty estimation techniques often outperform complex pipelines in
terms of efficiency and self-knowledge, while maintaining comparable QA
performance.
Authors' comments: The code and data are at https://github.com/s-nlp/AdaRAGUE
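As a rough illustration of how an uncertainty estimate can gate retrieval, the sketch below calls the retriever only when the mean token entropy of a first-pass answer exceeds a threshold. It is not taken from the paper or its repository: the function names, the choice of mean token entropy as the uncertainty score, and the threshold value are all assumptions made for this example.

```python
import math
from typing import Sequence


def mean_token_entropy(token_distributions: Sequence[Sequence[float]]) -> float:
    """Average Shannon entropy (in nats) over per-token probability distributions.

    Assumes a non-empty sequence of distributions, one per generated token.
    """
    entropies = []
    for dist in token_distributions:
        entropies.append(-sum(p * math.log(p) for p in dist if p > 0.0))
    return sum(entropies) / len(entropies)


def should_retrieve(token_distributions: Sequence[Sequence[float]],
                    threshold: float = 0.5) -> bool:
    """Hypothetical gate: retrieve only when the model looks uncertain.

    The threshold is an illustrative value that would need tuning per model.
    """
    return mean_token_entropy(token_distributions) > threshold


# Toy per-token distributions over a three-word vocabulary (made-up numbers).
confident = [[0.97, 0.02, 0.01], [0.99, 0.005, 0.005]]
uncertain = [[0.40, 0.35, 0.25], [0.34, 0.33, 0.33]]

print(should_retrieve(confident), should_retrieve(uncertain))
```

A gate of this kind costs a single forward pass with no extra LLM calls, which is one way a simple uncertainty score could be cheaper than a multi-step adaptive pipeline.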
Junyi Wang, Quan Zang, Jinyu Wang, Minquan Cheng
In this paper, we study the coded caching scheme for the $(L, K, M, N)$
multi-user information retrieval (MIR) system, which consists of a content
library containing $N$ files, a base station (BS) with $L$ antennas that cannot
access the library, and $K$ single-antenna users, each of which can cache at
most $M$ files from the library. The users communicate with the others assisted
by the BS to decode their required files. In this paper, we focus on designing
a coded caching scheme with low communication latency measured by normalized
delivery time (NDT), computational complexity, and subpacketizations. When
$\frac{KM}{N}\geq L$, we first simplify the precoding matrix in the downlink step
to an identity matrix and use the multiple-antenna placement delivery array
(MAPDA), which was originally proposed for the multiple-input single-output
networks, to generate several new schemes for the MIR system. Compared to the
existing schemes, both the theoretical and numerical analyses show that our new
schemes achieve much lower computational complexity and smaller
subpacketizations with the same NDT.
Authors' comments: 14
Anoosheh Heidarzadeh, Ningze Wang, Alex Sprintson
This work presents an algorithmic framework that uses linear programming to construct \emph{addition-based Private Information Retrieval (AB-PIR)} schemes, where retrieval is performed by downloading only linear combinations of message symbols with coefficients set to 0 or 1. The AB-PIR schemes generalize several existing capacity-achieving PIR schemes and are of practical interest because they use only addition operations -- avoiding multiplication and other complex operations -- and are compatible with any finite field, including binary. Our framework broadens the search space to include all feasible solutions and can be used to construct optimal AB-PIR schemes for the entire range of problem parameters, including the number of servers, the total number of messages, and the number of messages that need to be retrieved. The framework enables us to identify schemes that outperform the previously proposed PIR schemes in certain cases and, in other cases, achieve performance on par with the best-known AB-PIR solutions. Additionally, the schemes generated by our framework can be integrated into existing solutions for several related PIR scenarios, improving their overall performance.
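To make "linear combinations with coefficients set to 0 or 1" concrete, here is a minimal sketch of the classic two-server XOR-based PIR scheme, in which each server returns a single binary-coefficient sum of the messages. It is not the linear-programming framework described above and is not rate-optimal; it only illustrates addition-only retrieval over the binary field. All function names are illustrative.

```python
import secrets
from typing import List, Tuple


def make_queries(num_messages: int, wanted: int) -> Tuple[List[int], List[int]]:
    """Return two 0/1 coefficient vectors that differ only at the wanted index.

    Each vector on its own is uniformly random, so a single server
    learns nothing about which message is wanted.
    """
    q1 = [secrets.randbelow(2) for _ in range(num_messages)]
    q2 = list(q1)
    q2[wanted] ^= 1
    return q1, q2


def answer(messages: List[int], query: List[int]) -> int:
    """Server side: add (bitwise XOR) the messages whose coefficient is 1."""
    acc = 0
    for message, coeff in zip(messages, query):
        if coeff:
            acc ^= message
    return acc


def retrieve(messages: List[int], wanted: int) -> int:
    """Client side: the two answers differ exactly by the wanted message."""
    q1, q2 = make_queries(len(messages), wanted)
    return answer(messages, q1) ^ answer(messages, q2)


# Both servers hold the same four messages, encoded here as small bit strings.
database = [0b1010, 0b0111, 0b1100, 0b0001]
print(retrieve(database, 2) == database[2])
```

Only additions over the binary field are used, which is the property that makes this family of schemes cheap for the servers to evaluate.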
Yicheng Tao, Haotian Liu, Shanwen Wang, Hongteng Xu
Premise selection is a crucial yet challenging step in mathematical formalization, especially for users with limited experience. Due to the lack of available formalization projects, existing approaches that leverage language models often suffer from data scarcity. In this work, we introduce an innovative method for training a premise retriever to support the formalization of mathematics. Our approach employs a BERT model to embed proof states and premises into a shared latent space. The retrieval model is trained within a contrastive learning framework and incorporates a domain-specific tokenizer along with a fine-grained similarity computation method. Experimental results show that our model is highly competitive compared to existing baselines, achieving strong performance while requiring fewer computational resources. Performance is further enhanced through the integration of a re-ranking module. To streamline the formalization process, we will release a search engine that enables users to query Mathlib theorems directly using proof states, significantly improving accessibility and efficiency. Code is available at https://github.com/ruc-ai4math/Premise-Retrieval.
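The contrastive objective can be sketched with a standard InfoNCE loss over in-batch negatives, shown below in plain Python. This is a generic formulation, not the authors' implementation: cosine similarity stands in for the paper's fine-grained similarity computation, and the temperature value and function names are assumptions.

```python
import math
from typing import List

Vector = List[float]


def dot(a: Vector, b: Vector) -> float:
    return sum(x * y for x, y in zip(a, b))


def normalize(v: Vector) -> Vector:
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]


def info_nce(states: List[Vector], premises: List[Vector],
             temperature: float = 0.05) -> float:
    """InfoNCE loss where states[i] is paired with premises[i].

    Every other premise in the batch acts as a negative for states[i].
    """
    states = [normalize(s) for s in states]
    premises = [normalize(p) for p in premises]
    total = 0.0
    for i, state in enumerate(states):
        logits = [dot(state, p) / temperature for p in premises]
        # Numerically stable log-sum-exp over the batch.
        top = max(logits)
        log_z = top + math.log(sum(math.exp(l - top) for l in logits))
        total += log_z - logits[i]
    return total / len(states)


# Toy embeddings: matched pairs give a small loss, mismatched pairs a large one.
batch = [[1.0, 0.0], [0.0, 1.0]]
loss_matched = info_nce(batch, batch)
loss_mismatched = info_nce(batch, batch[::-1])
print(loss_matched < loss_mismatched)
```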
Chandan Anand, Jayesh Seshadri, Prasad Krishnan, Gowtham R. Kurri
In information-theoretic private information retrieval (PIR), a client wants to retrieve one desired file out of $M$ files, stored across $N$ servers, while keeping the index of the desired file private from each $T$-sized subset of servers. A PIR protocol must ideally maximize the rate, which is the ratio of the file size to the total quantum of the download from the servers, while ensuring such privacy. In Weak-PIR (WPIR), the criterion of perfect information-theoretic privacy is relaxed. This enables higher rates to be achieved, while some information about the desired file index leaks to the servers. This leakage is captured by various known privacy metrics. By leveraging the well-established capacity-achieving schemes of Sun and Jafar under non-colluding ($T=1$) and colluding ($1<T\leq N$) scenarios, we present WPIR protocols for these scenarios. We also present a new WPIR scheme for the MDS scenario, by building upon the scheme by Banawan and Ulukus for this scenario. We present corresponding explicit rate-privacy trade-offs for these setups, under the mutual-information and the maximal leakage privacy metrics. In the collusion-free setup, our presented rate-privacy trade-off under maximal leakage matches that of the previous state of the art. With respect to the MDS scenario under the maximal leakage metric, we compare with the non-explicit trade-off in the literature, and show that our scheme performs better for some numerical examples. For the $T$-collusion setup (under both privacy metrics) and for the MDS setup under the mutual information metric, our rate-privacy trade-offs are the first in the literature, to the best of our knowledge.
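For reference, the rate definition and the perfect-privacy baseline that WPIR relaxes can be computed as follows. The capacity expression is the known result of Sun and Jafar for $N$ replicated servers, $M$ files, and $T$-collusion, $C = \left(1 + \frac{T}{N} + \dots + \left(\frac{T}{N}\right)^{M-1}\right)^{-1}$. The function names are illustrative, and the sketch does not model the weak-privacy trade-offs derived in the paper.

```python
from fractions import Fraction


def pir_rate(file_size: int, total_download: int) -> Fraction:
    """Rate = size of the desired file / total amount downloaded from all servers."""
    return Fraction(file_size, total_download)


def pir_capacity(num_servers: int, num_files: int, collusion: int = 1) -> Fraction:
    """Capacity of perfectly private PIR with T-collusion over replicated servers.

    With collusion=1 this reduces to the non-colluding capacity.
    """
    ratio = Fraction(collusion, num_servers)
    return 1 / sum(ratio ** k for k in range(num_files))


print(pir_capacity(2, 2))      # N=2 servers, M=2 files, no collusion
print(pir_capacity(3, 2, 2))   # N=3 servers, M=2 files, T=2
print(pir_rate(2, 3))          # e.g. 2 file symbols recovered from 3 downloaded
```

Any WPIR scheme with nonzero leakage can be compared against this baseline: a rate above `pir_capacity(...)` is only possible because perfect privacy has been given up.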
Andrew Parry, Catherine Chen, Carsten Eickhoff, Sean MacAvaney
Mechanistic interpretability is an emerging diagnostic approach for neural
models that has gained traction in broader natural language processing domains.
This paradigm aims to provide attribution to components of neural systems where
causal relationships between hidden layers and output were previously
uninterpretable. As the use of neural models in IR for retrieval and evaluation
becomes ubiquitous, we need to ensure that we can interpret why a model
produces a given output for both transparency and the betterment of systems.
This work comprises a flexible framework for diagnostic analysis and
intervention within these highly parametric neural systems specifically
tailored for IR tasks and architectures. In providing such a framework, we look
to facilitate further research in interpretable IR with a broader scope for
practical interventions derived from mechanistic interpretability. We provide
preliminary analysis and look to demonstrate our framework through an axiomatic
lens to show its applications and ease of use for those IR practitioners
inexperienced in this emerging paradigm.
Authors' comments: 5 pages, 2 figures, Accepted to ECIR 2025 as a Demo Paper
Chuangtao Ma, Sriom Chakrabarti, Arijit Khan, Bálint Molnár
Traditional similarity-based schema matching methods are incapable of
resolving semantic ambiguities and conflicts in domain-specific complex mapping
scenarios due to missing commonsense and domain-specific knowledge. The
hallucination problem of large language models (LLMs) also makes it challenging
for LLM-based schema matching to address the above issues. Therefore, we
propose a Knowledge Graph-based Retrieval-Augmented Generation model for Schema
Matching, referred to as the KG-RAG4SM. In particular, KG-RAG4SM introduces
novel vector-based, graph traversal-based, and query-based graph retrievals, as
well as a hybrid approach and ranking schemes that identify the most relevant
subgraphs from external large knowledge graphs (KGs). We showcase that KG-based
retrieval-augmented LLMs are capable of generating more accurate results for
complex matching cases without any re-training. Our experimental results show
that KG-RAG4SM outperforms the LLM-based state-of-the-art (SOTA) methods (e.g.,
Jellyfish-8B) by 35.89% and 30.50% in terms of precision and F1 score on the
MIMIC dataset, respectively; KG-RAG4SM with GPT-4o-mini outperforms the
pre-trained language model (PLM)-based SOTA methods (e.g., SMAT) by 69.20% and
21.97% in terms of precision and F1 score on the Synthea dataset, respectively.
The results also demonstrate that our approach is more efficient in end-to-end
schema matching, and scales to retrieve from large KGs. Our case studies on the
dataset from the real-world schema matching scenario exhibit that the
hallucination problem of LLMs for schema matching is well mitigated by our
solution.
Authors' comments: Under Review
Aditi Singh, Abul Ehtesham, Saket Kumar, Tala Talaei Khoei
Large Language Models (LLMs) have revolutionized artificial intelligence (AI) by enabling human-like text generation and natural language understanding. However, their reliance on static training data limits their ability to respond to dynamic, real-time queries, resulting in outdated or inaccurate outputs. Retrieval-Augmented Generation (RAG) has emerged as a solution, enhancing LLMs by integrating real-time data retrieval to provide contextually relevant and up-to-date responses. Despite its promise, traditional RAG systems are constrained by static workflows and lack the adaptability required for multistep reasoning and complex task management. Agentic Retrieval-Augmented Generation (Agentic RAG) transcends these limitations by embedding autonomous AI agents into the RAG pipeline. These agents leverage agentic design patterns (reflection, planning, tool use, and multiagent collaboration) to dynamically manage retrieval strategies, iteratively refine contextual understanding, and adapt workflows to meet complex task requirements. This integration enables Agentic RAG systems to deliver unparalleled flexibility, scalability, and context awareness across diverse applications. This survey provides a comprehensive exploration of Agentic RAG, beginning with its foundational principles and the evolution of RAG paradigms. It presents a detailed taxonomy of Agentic RAG architectures, highlights key applications in industries such as healthcare, finance, and education, and examines practical implementation strategies. Additionally, it addresses challenges in scaling these systems, ensuring ethical decision making, and optimizing performance for real-world applications, while providing detailed insights into frameworks and tools for implementing Agentic RAG.
Kuicai Dong, Yujing Chang, Xin Deik Goh, Dexun Li, Ruiming Tang, Yong Liu
Multimodal document retrieval aims to identify and retrieve various forms of
multimodal content, such as figures, tables, charts, and layout information
from extensive documents. Despite its increasing popularity, there is a notable
lack of a comprehensive and robust benchmark to effectively evaluate the
performance of systems in such tasks. To address this gap, this work introduces
a new benchmark, named MMDocIR, that encompasses two distinct tasks: page-level
and layout-level retrieval. The former evaluates the performance of identifying
the most relevant pages within a long document, while the latter assesses the
ability of detecting specific layouts, providing a more fine-grained measure
than whole-page analysis. A layout refers to a variety of elements, including
textual paragraphs, equations, figures, tables, or charts. The MMDocIR
benchmark comprises a rich dataset featuring 1,685 questions annotated by
experts and 173,843 questions with bootstrapped labels, making it a valuable
resource in multimodal document retrieval for both training and evaluation.
Through rigorous experiments, we demonstrate that (i) visual retrievers
significantly outperform their text counterparts, (ii) the MMDocIR training set
effectively enhances the performance of multimodal document retrieval and (iii)
text retrievers leveraging VLM-text significantly outperform retrievers
relying on OCR-text. Our dataset is available at
https://mmdocrag.github.io/MMDocIR/.
Authors' comments: https://huggingface.co/MMDocIR
Zhongxiang Sun, Qipeng Wang, Weijie Yu, Xiaoxue Zang, Kai Zheng, Jun Xu, Xiao Zhang, Song Yang et al.
Retrieval-Augmented Generation (RAG) systems for Large Language Models (LLMs)
hold promise in knowledge-intensive tasks but face limitations in complex
multi-step reasoning. While recent methods have integrated RAG with
chain-of-thought reasoning or test-time search using Process Reward Models
(PRMs), these approaches encounter challenges such as a lack of explanations,
bias in PRM training data, early-step bias in PRM scores, and insufficient
post-training optimization of reasoning potential. To address these issues, we
propose Retrieval-Augmented Reasoning through Trustworthy Process Rewarding
(ReARTeR), a framework that enhances RAG systems' reasoning capabilities
through post-training and test-time scaling. At test time, ReARTeR introduces
Trustworthy Process Rewarding via a Process Reward Model for accurate scalar
scoring and a Process Explanation Model (PEM) for generating natural language
explanations, enabling step refinement. During post-training, it utilizes Monte
Carlo Tree Search guided by Trustworthy Process Rewarding to collect
high-quality step-level preference data, optimized through Iterative Preference
Optimization. ReARTeR addresses three core challenges: (1) misalignment between
PRM and PEM, tackled through off-policy preference learning; (2) bias in PRM
training data, mitigated by balanced annotation methods and stronger
annotations for challenging examples; and (3) early-step bias in PRM, resolved
through a temporal-difference-based look-ahead search strategy. Experimental
results on multi-step reasoning benchmarks demonstrate significant
improvements, underscoring ReARTeR's potential to advance the reasoning
capabilities of RAG systems.
Authors' comments: 11 pages, 5 figures
Siran Li, Linus Stenzel, Carsten Eickhoff, Seyed Ali Bahrainian
Retrieval-Augmented Generation (RAG) systems have recently shown remarkable advancements by integrating retrieval mechanisms into language models, enhancing their ability to produce more accurate and contextually relevant responses. However, the influence of various components and configurations within RAG systems remains underexplored. A comprehensive understanding of these elements is essential for tailoring RAG systems to complex retrieval tasks and ensuring optimal performance across diverse applications. In this paper, we develop several advanced RAG system designs that incorporate query expansion, various novel retrieval strategies, and a novel Contrastive In-Context Learning RAG. Our study systematically investigates key factors, including language model size, prompt design, document chunk size, knowledge base size, retrieval stride, query expansion techniques, Contrastive In-Context Learning knowledge bases, multilingual knowledge bases, and Focus Mode retrieving relevant context at sentence-level. Through extensive experimentation, we provide a detailed analysis of how these factors influence response quality. Our findings offer actionable insights for developing RAG systems, striking a balance between contextual richness and retrieval-generation efficiency, thereby paving the way for more adaptable and high-performing RAG frameworks in diverse real-world scenarios. Our code and implementation details are publicly available.
Jinjing Zhu, Songze Li, Lin Wang
Conventional knowledge distillation (KD) approaches are designed for the student model to predict outputs similar to those of the teacher model for each sample. Unfortunately, the relationship across samples with the same class is often neglected. In this paper, we explore redefining the knowledge in distillation, capturing the relationship between each sample and its corresponding in-context samples (a group of similar samples with the same or different classes), and perform KD from an in-context sample retrieval perspective. As KD is a type of learned label smoothing regularization (LSR), we first conduct a theoretical analysis showing that the teacher's knowledge from the in-context samples is a crucial contributor to regularizing the student training with the corresponding samples. Buttressed by the analysis, we propose a novel in-context knowledge distillation (IC-KD) framework that shows its superiority across diverse KD paradigms (offline, online, and teacher-free KD). Firstly, we construct a feature memory bank from the teacher model and retrieve in-context samples for each corresponding sample through retrieval-based learning. We then introduce Positive In-Context Distillation (PICD) to reduce the discrepancy between a sample from the student and the aggregated in-context samples with the same class from the teacher in the logit space. Moreover, Negative In-Context Distillation (NICD) is introduced to separate a sample from the student and the in-context samples with different classes from the teacher in the logit space. Extensive experiments demonstrate that IC-KD is effective across various types of KD, and consistently achieves state-of-the-art performance on CIFAR-100 and ImageNet datasets.
Bhavin Jawade, Joao V. B. Soares, Kapil Thadani, Deen Dayal Mohan, Amir Erfan Eshratifar, Benjamin Culpepper, Paloma de Juan, Srirangaraj Setlur et al.
Compositional image retrieval (CIR) is a multimodal learning task where a
model combines a query image with a user-provided text modification to retrieve
a target image. CIR finds applications in a variety of domains including
product retrieval (e-commerce) and web search. Existing methods primarily focus
on fully-supervised learning, wherein models are trained on datasets of labeled
triplets such as FashionIQ and CIRR. This poses two significant challenges: (i)
curating such triplet datasets is labor intensive; and (ii) models lack
generalization to unseen objects and domains. In this work, we propose SCOT
(Self-supervised COmpositional Training), a novel zero-shot compositional
pretraining strategy that combines existing large image-text pair datasets with
the generative capabilities of large language models to contrastively train an
embedding composition network. Specifically, we show that the text embedding
from a large-scale contrastively-pretrained vision-language model can be
utilized as proxy target supervision during compositional pretraining,
replacing the target image embedding. In zero-shot settings, this strategy
surpasses SOTA zero-shot compositional retrieval methods as well as many
fully-supervised methods on standard benchmarks such as FashionIQ and CIRR.
Authors' comments: Paper accepted at WACV 2025 in round 1
Rui Liu, Zhenqi Jia, Feilong Bao, Haizhou Li
Conversational speech synthesis (CSS) aims to take the current dialogue (CD)
history as a reference to synthesize expressive speech that aligns with the
conversational style. Unlike CD, stored dialogue (SD) contains preserved
dialogue fragments from earlier stages of user-agent interaction, which include
style expression knowledge relevant to scenarios similar to those in CD. Note
that this knowledge plays a significant role in enabling the agent to
synthesize expressive conversational speech that generates empathetic feedback.
However, prior research has overlooked this aspect. To address this issue, we
propose a novel Retrieval-Augmented Dialogue Knowledge Aggregation scheme for
expressive CSS, termed RADKA-CSS, which includes three main components: 1) To
effectively retrieve dialogues from SD that are similar to CD in terms of both
semantics and style, we first build a stored dialogue semantic-style database
(SDSSD), which includes the text and audio samples. Then, we design a
multi-attribute retrieval scheme to match the dialogue semantic and style
vectors of the CD with the stored dialogue semantic and style vectors in the
SDSSD, retrieving the most similar dialogues. 2) To effectively utilize the
style knowledge from CD and SD, we propose adopting the multi-granularity graph
structure to encode the dialogue and introducing a multi-source style knowledge
aggregation mechanism. 3) Finally, the aggregated style knowledge is fed into
the speech synthesizer to help the agent synthesize expressive speech that
aligns with the conversational style. We conducted a comprehensive and in-depth
experiment based on the DailyTalk dataset, which is a benchmarking dataset for
the CSS task.
Both objective and subjective evaluations demonstrate that RADKA-CSS
outperforms baseline models in expressiveness rendering. Code and audio samples
can be found at: https://github.com/Coder-jzq/RADKA-CSS.
Authors' comments: Accepted by Information Fusion 2025
Steven H. Wang, Maksim Zubkov, Kexin Fan, Sarah Harrell, Yuyang Sun, Wei Chen, Andreas Plesner, Roger Wattenhofer
Information retrieval, specifically contract clause retrieval, is
foundational to contract drafting because lawyers rarely draft contracts from
scratch; instead, they locate and revise the most relevant precedent. We
introduce the Atticus Clause Retrieval Dataset (ACORD), the first retrieval
benchmark for contract drafting fully annotated by experts. ACORD focuses on
complex contract clauses such as Limitation of Liability, Indemnification,
Change of Control, and Most Favored Nation. It includes 114 queries and over
126,000 query-clause pairs, each ranked on a scale from 1 to 5 stars. The task
is to find the most relevant precedent clauses to a query. The bi-encoder
retriever paired with pointwise LLM re-rankers shows promising results.
However, substantial improvements are still needed to effectively manage the
complex legal work typically undertaken by lawyers. As the first retrieval
benchmark for contract drafting annotated by experts, ACORD can serve as a
valuable IR benchmark for the NLP community.
Authors' comments: Accepted to ACL 2025. See the project page at
https://www.atticusprojectai.org/acord
Bing Gao, Ran Gu, Shigui Ma
This paper presents a rigorous theoretical convergence analysis of the
Wirtinger Flow (WF) algorithm for Poisson phase retrieval, a fundamental
problem in imaging applications. Unlike prior analyses that rely on truncation
or additional adjustments to handle outliers, our framework avoids eliminating
measurements or introducing extra computational steps, thereby reducing overall
complexity. We prove that WF achieves linear convergence to the true signal
under noiseless conditions and remains robust and stable in the presence of
bounded noise for Poisson phase retrieval. Additionally, we propose an
incremental variant of WF, which significantly improves computational
efficiency and guarantees convergence to the true signal with high probability
under suitable conditions.
Authors' comments: 22 pages, 4 figures
Ruitao Pu, Yang Qin, Dezhong Peng, Xiaomin Song, Huiming Zheng
Cross-modal retrieval (CMR) typically involves learning common representations to directly measure similarities between multimodal samples. Most existing CMR methods commonly assume multimodal samples in pairs and employ joint training to learn common representations, limiting the flexibility of CMR. Although some methods adopt independent training strategies for each modality to improve flexibility in CMR, they utilize the randomly initialized orthogonal matrices to guide representation learning, which is suboptimal since they assume inter-class samples are independent of each other, limiting the potential of semantic alignments between sample representations and ground-truth labels. To address these issues, we propose a novel method termed Deep Reversible Consistency Learning (DRCL) for cross-modal retrieval. DRCL includes two core modules, i.e., Selective Prior Learning (SPL) and Reversible Semantic Consistency learning (RSC). More specifically, SPL first learns a transformation weight matrix on each modality and selects the best one based on the quality score as the Prior, which greatly avoids blind selection of priors learned from low-quality modalities. Then, RSC employs a Modality-invariant Representation Recasting mechanism (MRR) to recast the potential modality-invariant representations from sample semantic labels by the generalized inverse matrix of the prior. Since labels are devoid of modal-specific information, we utilize the recast features to guide the representation learning, thus maintaining semantic consistency to the fullest extent possible. In addition, a feature augmentation mechanism (FA) is introduced in RSC to encourage the model to learn over a wider data distribution for diversity. Finally, extensive experiments conducted on five widely used datasets and comparisons with 15 state-of-the-art baselines demonstrate the effectiveness and superiority of our DRCL.
Matin Mortaheb, Mohammad A. Amir Khojastepour, Srimat T. Chakradhar, Sennur Ulukus
Retrieval-augmented generation (RAG) enhances large language models (LLMs) by incorporating external knowledge to generate a response within a context with improved accuracy and reduced hallucinations. However, multi-modal RAG systems face unique challenges: (i) the retrieval process may select entries (e.g., images, documents) that are irrelevant to the user query, and (ii) vision-language models or multi-modal language models like GPT-4o may hallucinate when processing these entries to generate RAG output. In this paper, we aim to address the first challenge, i.e., improving the selection of relevant context from the knowledge base in the retrieval phase of multi-modal RAG. Specifically, we leverage the relevancy score (RS) measure designed in our previous work for evaluating RAG performance to select more relevant entries in the retrieval process. Retrieval based on embeddings (say, CLIP-based embeddings) and cosine similarity usually performs poorly, particularly for multi-modal data. We show that by using a more advanced relevancy measure, one can enhance the retrieval process by selecting more relevant pieces from the knowledge base and eliminating the irrelevant pieces from the context by adaptively selecting up to $k$ entries instead of a fixed number of entries. Our evaluation using the COCO dataset demonstrates significant enhancement in selecting relevant context and in the accuracy of the generated response.
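The difference between fixed top-$k$ retrieval and adaptive "up to $k$" selection can be sketched as below. The relevancy scores here are placeholder numbers; in the paper they would come from the learned RS model, which is not reproduced. The function name and the `min_score` cutoff are assumptions made for this example.

```python
from typing import List, Tuple

ScoredEntry = Tuple[str, float]


def select_up_to_k(scored_entries: List[ScoredEntry], k: int,
                   min_score: float) -> List[ScoredEntry]:
    """Keep at most k entries, and only those scoring at least min_score.

    Unlike fixed top-k, this can return fewer than k entries (even none)
    when the knowledge base holds little that is relevant to the query.
    """
    kept = [(entry, score) for entry, score in scored_entries if score >= min_score]
    kept.sort(key=lambda pair: pair[1], reverse=True)
    return kept[:k]


# Placeholder relevancy scores for four knowledge-base entries.
candidates = [("img_a", 0.91), ("doc_b", 0.15), ("img_c", 0.78), ("doc_d", 0.40)]

print(select_up_to_k(candidates, k=3, min_score=0.5))
```

With `k=3`, fixed top-$k$ would also pass `doc_d` into the context; the adaptive variant drops it because its score falls below the cutoff.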
Yongkang Li, Panagiotis Eustratiadis, Evangelos Kanoulas
HotFlip is a topical gradient-based word substitution method for attacking
language models. Recently, this method has been further applied to attack
retrieval systems by generating malicious passages that are injected into a
corpus, i.e., corpus poisoning. However, HotFlip is known to be computationally
inefficient, with the majority of time being spent on gradient accumulation for
each query-passage pair during the adversarial token generation phase, making
it impossible to generate an adequate number of adversarial passages in a
reasonable amount of time. Moreover, the attack method itself assumes access to
a set of user queries, a strong assumption that does not correspond to how
real-world adversarial attacks are usually performed. In this paper, we first
significantly boost the efficiency of HotFlip, reducing the adversarial
generation process from 4 hours per document to only 15 minutes, using the same
hardware. We further contribute experiments and analysis on two additional
tasks: (1) transfer-based black-box attacks, and (2) query-agnostic attacks.
Whenever possible, we provide comparisons between the original method and our
improved version. Our experiments demonstrate that HotFlip can effectively
attack a variety of dense retrievers, with an observed trend that its attack
performance diminishes against more advanced and recent methods. Interestingly,
we observe that while HotFlip performs poorly in a black-box setting,
indicating limited capacity for generalization, in query-agnostic scenarios its
performance is correlated to the volume of injected adversarial passages.
Authors' comments: This paper has been accepted for oral presentation in the
reproducibility track at ECIR 2025
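The core HotFlip step, as generally described, scores each candidate replacement token with a first-order approximation: the estimated change in the attack objective from swapping the current token's embedding for a candidate's is the dot product of the embedding difference with the gradient. The sketch below shows only that selection rule on toy numbers. The embedding table and gradient are made up, and the function name is illustrative rather than taken from the reproduced code.

```python
from typing import List, Tuple

Vector = List[float]


def best_flip(embedding_table: List[Vector], current_id: int,
              grad: Vector) -> Tuple[int, float]:
    """Pick the token whose swap gives the largest first-order objective gain.

    gain(candidate) = (e_candidate - e_current) . grad, where grad is the
    gradient of the attack objective with respect to the current embedding.
    Returns (current_id, 0.0) if no swap is predicted to help.
    """
    current = embedding_table[current_id]
    best_id, best_gain = current_id, 0.0
    for token_id, emb in enumerate(embedding_table):
        gain = sum((e - c) * g for e, c, g in zip(emb, current, grad))
        if gain > best_gain:
            best_id, best_gain = token_id, gain
    return best_id, best_gain


# Toy vocabulary of four 2-d embeddings and a made-up gradient.
table = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]]
gradient = [0.2, 0.9]

print(best_flip(table, current_id=0, grad=gradient))
```

In a real attack the gradient has to be accumulated over many query-passage pairs before this step, which is the cost the abstract identifies as the main bottleneck.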
Navya Yarrabelly, Saloni Mittal
This work deals with the challenge of learning and reasoning over multi-modal
multi-hop question answering (QA). We propose a graph reasoning network based
on the semantic structure of the sentences to learn multi-source reasoning
paths and find the supporting facts across both image and text modalities for
answering the question. In this paper, we investigate the importance of graph
structure for multi-modal multi-hop question answering. Our analysis is
centered on WebQA. We construct a strong baseline model that finds relevant
sources using a pairwise classification task. We establish that, with the
proper use of feature representations from pre-trained models, graph structure
helps in improving multi-modal multi-hop question answering. We point out that
both graph structure and adjacency matrix are task-related prior knowledge, and
graph structure can be leveraged to improve the retrieval performance for the
task. Experiments and visualized analysis demonstrate that message propagation
over graph networks or the entire graph structure can replace massive
multimodal transformers with token-wise cross-attention. We demonstrate the
applicability of our method and show a performance gain of 4.6% in
retrieval F1 score over the transformer baselines, despite it being a very light
model. We further demonstrate the applicability of our model to a large-scale
retrieval setting.
Authors' comments: arXiv admin note: text overlap with arXiv:2010.03604 by other authors