Tianyuan Shi, Liangzhi Li, Zijian Lin, Tao Yang, Xiaojun Quan, Qifan Wang
Efficient knowledge retrieval plays a pivotal role in ensuring the success of
end-to-end task-oriented dialogue systems by facilitating the selection of
relevant information necessary to fulfill user requests. However, current
approaches generally integrate knowledge retrieval and response generation,
which poses scalability challenges when dealing with extensive knowledge bases.
Taking inspiration from open-domain question answering, we propose a
retriever-generator architecture that harnesses a retriever to retrieve
pertinent knowledge and a generator to generate system responses. Due to the
lack of retriever training labels, we propose relying on feedback from the
generator as pseudo-labels to train the retriever. To achieve this, we
introduce a dual-feedback mechanism that generates both positive and negative
feedback based on the output of the generator. Our method demonstrates superior
performance in task-oriented dialogue tasks, as evidenced by experimental
results on three benchmark datasets.
Authors' comments: Accepted to EMNLP 2023 (Main Conference)
Qi Gou, Zehua Xia, Bowen Yu, Haiyang Yu, Fei Huang, Yongbin Li, Nguyen Cam-Tu
Given a textual passage and an answer, humans are able to ask questions with
various expressions, but this ability is still challenging for most question
generation (QG) systems. Existing solutions mainly focus on the internal
knowledge within the given passage or the semantic word space for diverse
content planning. These methods, however, have not considered the potential of
external knowledge for expression diversity. To bridge this gap, we propose
RAST, a framework for Retrieval-Augmented Style Transfer, where the objective
is to utilize the style of diverse templates for question generation. For
training RAST, we develop a novel Reinforcement Learning (RL) based approach
that maximizes a weighted combination of diversity reward and consistency
reward. Here, the consistency reward is computed by a Question-Answering (QA)
model, whereas the diversity reward measures how much the final output mimics
the retrieved template. Experimental results show that our method outperforms
previous diversity-driven baselines on diversity while being comparable in
terms of consistency scores. Our code is available at
https://github.com/gouqi666/RAST.
Authors' comments: EMNLP2023 camera-ready
Xu Yuan, Zheng Zhang, Xunguang Wang, Lin Wu
Deep hashing has been intensively studied and successfully applied in large-scale image retrieval systems due to its efficiency and effectiveness. Recent studies have recognized that the existence of adversarial examples poses a security threat to deep hashing models, that is, adversarial vulnerability. Notably, it is challenging to efficiently distill reliable semantic representatives for deep hashing to guide adversarial learning, which hinders the enhancement of adversarial robustness of deep hashing-based retrieval models. Moreover, current research on adversarial training for deep hashing is hard to formalize into a unified minimax structure. In this paper, we explore Semantic-Aware Adversarial Training (SAAT) for improving the adversarial robustness of deep hashing models. Specifically, we conceive a discriminative mainstay features learning (DMFL) scheme to construct semantic representatives for guiding adversarial learning in deep hashing. Particularly, our DMFL, which comes with a strict theoretical guarantee, is adaptively optimized in a discriminative learning manner, where both discriminative and semantic properties are jointly considered. Moreover, adversarial examples are fabricated by maximizing the Hamming distance between the hash codes of adversarial samples and mainstay features, the efficacy of which is validated in the adversarial attack trials. Further, for the first time, we formulate adversarial training of deep hashing as a unified minimax optimization under the guidance of the generated mainstay codes. Extensive experiments on benchmark datasets show superb attack performance against state-of-the-art algorithms; meanwhile, the proposed adversarial training can effectively eliminate adversarial perturbations for trustworthy deep hashing-based retrieval. Our code is available at https://github.com/xandery-geek/SAAT.
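The quantity being maximized in the attack described above is a Hamming distance between hash codes. The following is a minimal sketch of that distance only, assuming codes with entries in {-1, +1}; the function name and the example vectors are hypothetical and are not taken from the SAAT code base.

```python
import numpy as np


def hamming_distance(a, b):
    """Hamming distance between two hash codes with entries in {-1, +1}."""
    a = np.asarray(a)
    b = np.asarray(b)
    # For +/-1 codes of length K, the number of differing bits is
    # (K - <a, b>) / 2, since matching bits add +1 and differing bits -1.
    return (a.size - int(a @ b)) // 2


# Hypothetical 8-bit codes: one for a sample, one for its "mainstay" code.
code = np.array([1, -1, 1, 1, -1, -1, 1, -1])
mainstay = np.array([1, 1, 1, -1, -1, -1, 1, 1])
distance = hamming_distance(code, mainstay)
```

The inner-product form is often preferred in deep hashing because it stays well defined on relaxed, real-valued codes, which is what makes the distance usable inside a gradient-based attack.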
Vaibhav Mavi, Abulhair Saparov, Chen Zhao
Applying existing question answering (QA) systems to specialized domains like
law and finance presents challenges that necessitate domain expertise. Although
large language models (LLMs) have shown impressive language comprehension and
in-context learning capabilities, their inability to handle very long
inputs/contexts is well known. Tasks specific to these domains need significant
background knowledge, leading to contexts that can often exceed the maximum
length that existing LLMs can process. This study explores leveraging the
semi-structured nature of legal and financial data to efficiently retrieve
relevant context, enabling the use of LLMs for domain-specialized QA. The
resulting system outperforms contemporary models and also provides useful
explanations for the answers, encouraging the integration of LLMs into legal
and financial NLP systems for future research.
Authors' comments: to appear in NLLP 2023
Yunxiang Zhang, Muhammad Khalifa, Lajanugen Logeswaran, Moontae Lee, Honglak Lee, Lu Wang
Open-domain question answering (QA) systems are often built with retrieval
modules. However, retrieving passages from a given source is known to suffer
from insufficient knowledge coverage. Alternatively, prompting large language
models (LLMs) to generate contextual passages based on their parametric
knowledge has been shown to improve QA performance. Yet, LLMs tend to
"hallucinate" content that conflicts with the retrieved knowledge. Based on the
intuition that answers supported by both sources are more likely to be correct,
we propose COMBO, a Compatibility-Oriented knowledge Merging for Better
Open-domain QA framework, to effectively leverage the two sources of
information. Concretely, we match LLM-generated passages with retrieved
counterparts into compatible pairs, based on discriminators trained with silver
compatibility labels. Then a Fusion-in-Decoder-based reader model handles
passage pairs to arrive at the final answer. Experiments show that COMBO
outperforms competitive baselines on three out of four tested open-domain QA
benchmarks. Further analysis reveals that our proposed framework demonstrates
greater efficacy in scenarios with a higher degree of knowledge conflicts.
Authors' comments: EMNLP 2023 - Camera Ready
Uri Katz, Matan Vetzler, Amir DN Cohen, Yoav Goldberg
Recognizing entities in texts is a central need in many information-seeking
scenarios, and indeed, Named Entity Recognition (NER) is arguably one of the
most successful examples of a widely adopted NLP task and corresponding NLP
technology. Recent advances in large language models (LLMs) appear to provide
effective solutions (also) for NER tasks that were traditionally handled with
dedicated models, often matching or surpassing the abilities of the dedicated
models. Should NER be considered a solved problem? We argue to the contrary:
the capabilities provided by LLMs are not the end of NER research, but rather
an exciting beginning. They allow taking NER to the next level, tackling
increasingly more useful, and increasingly more challenging, variants. We
present three variants of the NER task, together with a dataset to support
them. The first is a move towards more fine-grained -- and intersectional --
entity types. The second is a move towards zero-shot recognition and extraction
of these fine-grained types based on entity-type labels. The third, and most
challenging, is the move from the recognition setup to a novel retrieval setup,
where the query is a zero-shot entity type, and the expected result is all the
sentences from a large, pre-indexed corpus that contain entities of these
types, and their corresponding spans. We show that all of these are far from
being solved. We provide a large, silver-annotated corpus of 4 million
paragraphs covering 500 entity types, to facilitate research towards all of
these three goals.
Authors' comments: Findings of EMNLP 2023
Priyanka Ranade, Anupam Joshi
Narrative construction is the process of organizing disparate event information into a logical plot structure that models an end-to-end story. Intelligence analysis is an example of a domain that can benefit tremendously from narrative construction techniques, particularly in aiding analysts during the largely manual and costly process of synthesizing event information into comprehensive intelligence reports. Manual intelligence report generation is often prone to challenges such as integrating dynamic event information, writing fine-grained queries, and closing information gaps. This motivates the development of a system that retrieves and represents critical aspects of events in a form that aids in automatic generation of intelligence reports. We introduce a Retrieval Augmented Generation (RAG) approach to augment prompting of an autoregressive decoder by retrieving structured information asserted in a knowledge graph to generate targeted information based on a narrative plot model. We apply our approach to the problem of neural intelligence report generation and introduce FABULA, a framework to augment intelligence analysis workflows using RAG. An analyst can use FABULA to query an Event Plot Graph (EPG) to retrieve relevant event plot points, which can be used to augment prompting of a Large Language Model (LLM) during intelligence report generation. Our evaluation studies show that the plot points included in the generated intelligence reports have high semantic relevance, high coherence, and low data redundancy.
Xiangru Jian, Yimu Wang
Over recent decades, significant advancements in cross-modal retrieval have been
mainly driven by breakthroughs in visual and linguistic modeling. However, a
recent study shows that multi-modal data representations tend to cluster within
a limited convex cone (the representation degeneration problem), which hinders
retrieval performance due to the inseparability of these representations. In
our study, we first empirically validate the presence of the representation
degeneration problem across multiple cross-modal benchmarks and methods. Next,
to address it, we introduce a novel method, called InvGC, a post-processing
technique inspired by graph convolution and average pooling. Specifically,
InvGC defines the graph topology within the datasets and then applies graph
convolution in a subtractive manner. This method effectively separates
representations by increasing the distances between data points. To improve the
efficiency and effectiveness of InvGC, we propose an advanced graph topology,
LocalAdj, which only aims to increase the distances between each data point and
its nearest neighbors. To understand why InvGC works, we present a detailed
theoretical analysis, proving that the lower bound of recall will be improved
after deploying InvGC. Extensive empirical results show that InvGC and InvGC
w/LocalAdj significantly mitigate the representation degeneration problem,
thereby enhancing retrieval performance.
Our code is available at
https://github.com/yimuwangcs/Better_Cross_Modal_Retrieval
Authors' comments: Findings of EMNLP 2023
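The idea of applying graph convolution "in a subtractive manner" can be illustrated with a small sketch. This is not the authors' implementation: the function names, the cosine-similarity edge weights, the row normalization (an average-pooling-style choice), and the values of `r` and `k` are all assumptions made for illustration. The `k` argument mimics the LocalAdj idea of only using each point's nearest neighbors.

```python
import numpy as np


def mean_pairwise_cosine(x):
    """Average cosine similarity over all distinct pairs of rows."""
    x = x / np.linalg.norm(x, axis=1, keepdims=True)
    sim = x @ x.T
    n = len(x)
    return (sim.sum() - np.trace(sim)) / (n * (n - 1))


def subtractive_graph_conv(x, r=0.5, k=None):
    """Push each point away from the weighted mean of its graph neighbors.

    x: (n, d) embeddings; r: step size (assumed hyperparameter);
    k: if given (positive int), keep only the k most similar neighbors.
    """
    x = np.asarray(x, dtype=float)
    x = x / np.linalg.norm(x, axis=1, keepdims=True)
    sim = np.clip(x @ x.T, 0.0, None)   # non-negative edge weights
    np.fill_diagonal(sim, 0.0)          # no self-loops
    if k is not None:
        # Zero out every edge except the k largest in each row.
        drop = np.argsort(sim, axis=1)[:, :-k]
        np.put_along_axis(sim, drop, 0.0, axis=1)
    weights = sim / np.maximum(sim.sum(axis=1, keepdims=True), 1e-12)
    out = x - r * (weights @ x)         # subtract instead of aggregate
    return out / np.linalg.norm(out, axis=1, keepdims=True)


# Synthetic embeddings squeezed into a narrow cone around one direction.
rng = np.random.default_rng(0)
base = np.zeros(16)
base[0] = 1.0
x = base + 0.1 * rng.standard_normal((50, 16))

before = mean_pairwise_cosine(x)
after_full = mean_pairwise_cosine(subtractive_graph_conv(x, r=0.5))
after_local = mean_pairwise_cosine(subtractive_graph_conv(x, r=0.5, k=5))
```

On this synthetic data the average pairwise cosine similarity drops after the operation, which is the separation effect the abstract describes; how this translates to retrieval recall depends on the real data and on the paper's exact formulation.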
Moshe Berchansky, Peter Izsak, Avi Caciularu, Ido Dagan, Moshe Wasserblat
Fusion-in-Decoder (FiD) is an effective retrieval-augmented language model applied across a variety of open-domain tasks, such as question answering, fact checking, etc. In FiD, supporting passages are first retrieved and then processed using a generative model (Reader), which can cause a significant bottleneck in decoding time, particularly with long outputs. In this work, we analyze the contribution and necessity of all the retrieved passages to the performance of reader models, and propose eliminating some of the retrieved information, at the token level, that might not contribute essential information to the answer generation process. We demonstrate that our method can reduce run-time by up to 62.2%, with only a 2% reduction in performance, and in some cases, even improve the performance results.
Roman Jacome, Kumar Vijay Mishra, Brian M. Sadler, Henry Arguello
Hypercomplex signal processing (HSP) provides state-of-the-art tools to
handle multidimensional signals by harnessing intrinsic correlation of the
signal dimensions through Clifford algebra. Recently, the hypercomplex
representation of the phase retrieval (PR) problem, wherein a complex-valued
signal is estimated through its intensity-only projections, has attracted
significant interest. The hypercomplex PR (HPR) arises in many optical imaging
and computational sensing applications that usually comprise quaternion and
octonion-valued signals. Analogous to the traditional PR, measurements in HPR
may involve complex, hypercomplex, Fourier, and other sensing matrices. This
set of problems opens opportunities for developing novel HSP tools and
algorithms. This article provides a synopsis of the emerging areas and
applications of HPR with a focus on optical imaging.
Authors' comments: 10 pages, 4 figures, 2 tables
Tianchi Yang, Minghui Song, Zihan Zhang, Haizhen Huang, Weiwei Deng, Feng Sun, Qi Zhang
Generative retrieval, which is a new advanced paradigm for document retrieval, has recently attracted research interest, since it encodes all documents into the model and directly generates the retrieved documents. However, its power is still underutilized since it heavily relies on the "preprocessed" document identifiers (docids), thus limiting its retrieval performance and ability to retrieve new documents. In this paper, we propose a novel fully end-to-end retrieval paradigm. It can not only learn the best docids for existing and new documents automatically and end-to-end via a semantic indexing module, but also perform end-to-end document retrieval via an encoder-decoder-based generative model, namely Auto Search Indexer (ASI). Besides, we design a reparameterization mechanism to combine the above two modules into a joint optimization framework. Extensive experimental results demonstrate the superiority of our model over advanced baselines on both public and industrial datasets and also verify the ability to deal with new documents.
Yulin Chen, Zhenran Xu, Baotian Hu, Min Zhang
Entity linking aims to link ambiguous mentions to their corresponding
entities in a knowledge base. One of the key challenges comes from insufficient
labeled data for specific domains. Although dense retrievers have achieved
excellent performance on several benchmarks, their performance decreases
significantly when only a limited amount of in-domain labeled data is
available. In such a few-shot setting, we revisit the sparse retrieval method,
and propose an ELECTRA-based keyword extractor to denoise the mention context
and construct a better query expression. For training the extractor, we propose
a distant supervision method to automatically generate training data based on
overlapping tokens between mention contexts and entity descriptions.
Experimental results on the ZESHEL dataset demonstrate that the proposed method
outperforms state-of-the-art models by a significant margin across all test
domains, showing the effectiveness of keyword-enhanced sparse retrieval.
Authors' comments: EMNLP 2023
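The distant supervision step above, labeling context tokens by their overlap with the entity description, can be sketched as follows. This is an illustrative assumption about how such labels might be produced: the whitespace tokenization, the tiny stopword list, and the function names are hypothetical and not taken from the paper.

```python
import string

# Hypothetical stopword list; a real system would use a fuller one.
STOPWORDS = frozenset({"a", "an", "the", "of", "in", "is", "and", "to"})


def _normalize(token):
    """Lowercase a token and strip surrounding punctuation."""
    return token.lower().strip(string.punctuation)


def distant_keyword_labels(context, description):
    """Label each context token 1 if it also occurs in the description."""
    desc_vocab = {_normalize(t) for t in description.split()}
    desc_vocab -= STOPWORDS | {""}
    return [
        (tok, 1 if _normalize(tok) in desc_vocab else 0)
        for tok in context.split()
    ]


context = "The captain of the Enterprise fought the Borg."
description = "Enterprise is a starship commanded by a captain."
labels = distant_keyword_labels(context, description)
keywords = [tok for tok, flag in labels if flag]
```

Tokens labeled 1 would serve as positive examples for a keyword extractor, and the extracted keywords would then form the query for a sparse retriever such as BM25.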
Shuwen Yang, Anran Wu, Xingjiao Wu, Luwei Xiao, Tianlong Ma, Cheng Jin, Liang He
Pre-trained multimodal models have achieved significant success in retrieval-based question answering. However, current multimodal retrieval question-answering models face two main challenges. Firstly, utilizing compressed evidence features as input to the model results in the loss of fine-grained information within the evidence. Secondly, a gap exists between the feature extraction of evidence and the question, which hinders the model from effectively extracting critical features from the evidence based on the given question. We propose a two-stage framework for evidence retrieval and question-answering to alleviate these issues. First and foremost, we propose a progressive evidence refinement strategy for selecting crucial evidence. This strategy employs an iterative evidence retrieval approach to uncover the logical sequence among the evidence pieces. It incorporates two rounds of filtering to optimize the solution space, thus further ensuring temporal efficiency. Subsequently, we introduce a semi-supervised contrastive learning training strategy based on negative samples to expand the scope of the question domain, allowing for a more thorough exploration of latent knowledge within known samples. Finally, in order to mitigate the loss of fine-grained information, we devise a multi-turn retrieval and question-answering strategy to handle multimodal inputs. This strategy involves incorporating multimodal evidence directly into the model as part of the historical dialogue and question. Meanwhile, we leverage a cross-modal attention mechanism to capture the underlying connections between the evidence and the question, and the answer is generated through a decoding generation approach. We validate the model's effectiveness through extensive experiments, achieving outstanding performance on WebQA and MultimodalQA benchmark tests.
Shyamgopal Karthik, Karsten Roth, Massimiliano Mancini, Zeynep Akata
Given an image and a target modification (e.g., an image of the Eiffel tower and the text "without people and at night-time"), Compositional Image Retrieval (CIR) aims to retrieve the relevant target image in a database. While supervised approaches rely on costly annotation of triplets (i.e., query image, textual modification, and target image), recent research sidesteps this need by using large-scale vision-language models (VLMs), performing Zero-Shot CIR (ZS-CIR). However, state-of-the-art approaches in ZS-CIR still require training task-specific, customized models over large amounts of image-text pairs. In this work, we propose to tackle CIR in a training-free manner via our Compositional Image Retrieval through Vision-by-Language (CIReVL), a simple, yet human-understandable and scalable pipeline that effectively recombines large-scale VLMs with large language models (LLMs). By captioning the reference image using a pre-trained generative VLM and asking an LLM to recompose the caption based on the textual target modification for subsequent retrieval via, e.g., CLIP, we achieve modular language reasoning. In four ZS-CIR benchmarks, we find competitive and in part state-of-the-art performance, improving over supervised methods. Moreover, the modularity of CIReVL offers simple scalability without re-training, allowing us both to investigate scaling laws and bottlenecks for ZS-CIR and to easily scale up to, in parts, more than double previously reported results. Finally, we show that CIReVL makes CIR human-understandable by composing image and text in a modular fashion in the language domain, thereby making it intervenable and allowing failure cases to be re-aligned post hoc. Code will be released upon acceptance.
Xueguang Ma, Liang Wang, Nan Yang, Furu Wei, Jimmy Lin
The effectiveness of multi-stage text retrieval has been solidly demonstrated since before the era of pre-trained language models. However, most existing studies utilize models that predate recent advances in large language models (LLMs). This study seeks to explore potential improvements that state-of-the-art LLMs can bring. We conduct a comprehensive study, fine-tuning the latest LLaMA model both as a dense retriever (RepLLaMA) and as a pointwise reranker (RankLLaMA) for both passage retrieval and document retrieval using the MS MARCO datasets. Our findings demonstrate that the effectiveness of large language models indeed surpasses that of smaller models. Additionally, since LLMs can inherently handle longer contexts, they can represent entire documents holistically, obviating the need for traditional segmenting and pooling strategies. Furthermore, evaluations on BEIR demonstrate that our RepLLaMA-RankLLaMA pipeline exhibits strong zero-shot effectiveness. Model checkpoints from this study are available on HuggingFace.
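The two-stage pattern described above, a dense retriever followed by a pointwise reranker, can be sketched generically. This is a toy illustration and not the RepLLaMA or RankLLaMA code: the function name, the hand-written embeddings, and the lookup-table reranker are all hypothetical stand-ins for model outputs.

```python
import numpy as np


def retrieve_then_rerank(query_vec, doc_vecs, rerank_score, top_k=3):
    """Stage 1: dot-product retrieval. Stage 2: pointwise reranking.

    rerank_score is any callable mapping a document index to a relevance
    score; in a real pipeline it would run a reranker on (query, document).
    """
    scores = doc_vecs @ query_vec
    candidates = np.argsort(-scores)[:top_k].tolist()
    return sorted(candidates, key=rerank_score, reverse=True)


# Toy embeddings standing in for dense retriever outputs.
doc_vecs = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.5, 0.5]])
query_vec = np.array([1.0, 0.0])

# Hypothetical reranker scores for each candidate document index.
toy_scores = {0: 0.2, 1: 0.9, 2: 0.1, 3: 0.5}
ranking = retrieve_then_rerank(query_vec, doc_vecs, toy_scores.get, top_k=3)
```

The reranker only sees the `top_k` candidates from the first stage, so a document the retriever misses (index 2 here) cannot be recovered later; this is the usual trade-off between first-stage recall and second-stage cost.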
Shiyang Yan, Zongxuan Liu, Lin Xu
Metric learning plays a critical role in training image retrieval and classification models. It is also a key algorithm in representation learning, e.g., for feature learning and its alignment in metric space. Hyperbolic embedding has been recently developed. Compared to the conventional Euclidean embedding in most of the previously developed models, Hyperbolic embedding can be more effective in representing the hierarchical data structure. Separately, uncertainty estimation/measurement is a long-standing challenge in artificial intelligence. Successful uncertainty estimation can improve a machine learning model's performance, robustness, and security. In Hyperbolic space, uncertainty measurement is of at least equal, if not greater, importance. In this paper, we develop a Hyperbolic image embedding with uncertainty-aware metric learning for image retrieval. We call our method Hyp-UML: Hyperbolic Uncertainty-aware Metric Learning. Our contributions are threefold: we propose an image embedding algorithm based on Hyperbolic space, with a corresponding uncertainty value, and we propose two types of uncertainty-aware metric learning, for the popular Contrastive learning and conventional margin-based metric learning, respectively. We perform extensive experimental validations to show that the proposed algorithm can achieve state-of-the-art results among related methods. The comprehensive ablation study validates the effectiveness of each component of the proposed algorithm.
Luiza Pozzobon, Beyza Ermis, Patrick Lewis, Sara Hooker
Considerable effort has been dedicated to mitigating toxicity, but existing methods often require drastic modifications to model parameters or the use of computationally intensive auxiliary models. Furthermore, previous approaches have often neglected the crucial factor of language's evolving nature over time. In this work, we present a comprehensive perspective on toxicity mitigation that takes into account its changing nature. We introduce Goodtriever, a flexible methodology that matches the current state-of-the-art toxicity mitigation while achieving 43% relative latency reduction during inference and being more computationally efficient. By incorporating a retrieval-based approach at decoding time, Goodtriever enables toxicity-controlled text generation. Our research advocates for an increased focus on adaptable mitigation techniques, which better reflect the data drift models face when deployed in the wild. Code and data are available at https://github.com/for-ai/goodtriever.
Pandeng Li, Hongtao Xie, Jiannan Ge, Lei Zhang, Shaobo Min, Yongdong Zhang
Unsupervised video hashing usually optimizes binary codes by learning to
reconstruct input videos. Such a reconstruction constraint spends much effort on
frame-level temporal context changes without focusing on video-level global
semantics that are more useful for retrieval. Hence, we address this problem by
decomposing video information into reconstruction-dependent and
semantic-dependent information, which disentangles the semantic extraction from
the reconstruction constraint. Specifically, we first design a simple dual-stream
structure, including a temporal layer and a hash layer. Then, with the help of
semantic similarity knowledge obtained from self-supervision, the hash layer
learns to capture information for semantic retrieval, while the temporal layer
learns to capture the information for reconstruction. In this way, the model
naturally preserves the disentangled semantics into binary codes. Validated by
comprehensive experiments, our method consistently outperforms
state-of-the-art methods on three video benchmarks.
Authors' comments: 17 pages, 8 figures, ECCV 2022
Guoyuan An, Juhyung Seon, Inkyu An, Yuchi Huo, Sung-Eui Yoon
This paper presents an innovative approach to enhancing explainable image retrieval, particularly in situations where a fine-tuning set is unavailable. The widely-used SPatial verification (SP) method, despite its efficacy, relies on a spatial model and the hypothesis-testing strategy for instance recognition, leading to inherent limitations, including the assumption of planar structures and neglect of topological relations among features. To address these shortcomings, we introduce a pioneering technique that replaces the spatial model with a topological one within the RANSAC process. We propose bio-inspired saccade and fovea functions to verify the topological consistency among features, effectively circumventing the issues associated with SP's spatial model. Our experimental results demonstrate that our method significantly outperforms SP, achieving state-of-the-art performance in non-fine-tuning retrieval. Furthermore, our approach can enhance performance when used in conjunction with fine-tuned features. Importantly, our method retains high explainability and is lightweight, offering a practical and adaptable solution for a variety of real-world applications.
Qingfa Xiao, Shuangyin Li, Lei Chen
Prompt-based learning's efficacy across numerous natural language processing
tasks has led to its integration into dense passage retrieval. Prior research
has mainly focused on enhancing the semantic understanding of pre-trained
language models by optimizing a single vector as a continuous prompt. This
approach, however, leads to a semantic space collapse; identical semantic
information seeps into all representations, causing their distributions to
converge in a restricted region. This hinders differentiation between relevant
and irrelevant passages during dense retrieval. To tackle this issue, we
present Topic-DPR, a dense passage retrieval model that uses topic-based
prompts. Unlike the single prompt method, multiple topic-based prompts are
established over a probabilistic simplex and optimized simultaneously through
contrastive learning. This encourages representations to align with their topic
distributions, improving space uniformity. Furthermore, we introduce a novel
positive and negative sampling strategy, leveraging semi-structured data to
boost dense retrieval efficiency. Experimental results from two datasets affirm
that our method surpasses previous state-of-the-art retrieval techniques.
Authors' comments: Findings of EMNLP 2023