Frank Palma Gomez, Ramon Sanabria, Yun-hsuan Sung, Daniel Cer, Siddharth Dalmia, Gustavo Hernandez Abrego
Large language models (LLMs) are trained on text-only data that go far beyond the languages with paired speech and text data. At the same time, Dual Encoder (DE) based retrieval systems project queries and documents into the same embedding space and have demonstrated their success in retrieval and bi-text mining. To match speech and text in many languages, we propose using LLMs to initialize multi-modal DE retrieval systems. Unlike traditional methods, our system doesn't require speech data during LLM pre-training and can exploit LLM's multilingual text understanding capabilities to match speech and text in languages unseen during retrieval training. Our multi-modal LLM-based retrieval system is capable of matching speech and text in 102 languages despite only training on 21 languages. Our system outperforms previous systems trained explicitly on all 102 languages. We achieve a 10% absolute improvement in Recall@1 averaged across these languages. Additionally, our model demonstrates cross-lingual speech and text matching, which is further enhanced by readily available machine translation data.
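As a rough illustration of how a dual-encoder retrieval system is scored with Recall@1, the sketch below ranks documents by cosine similarity and counts how often the top-ranked document is the paired one. The toy embeddings, the function names, and the assumption that query i is paired with document i are all hypothetical and not taken from the paper.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def recall_at_1(query_embs, doc_embs):
    """Fraction of queries whose nearest document has the same index.

    Assumes query i is paired with document i (hypothetical toy setup).
    """
    hits = 0
    for i, q in enumerate(query_embs):
        best = max(range(len(doc_embs)), key=lambda j: cosine(q, doc_embs[j]))
        hits += int(best == i)
    return hits / len(query_embs)

# Toy embeddings standing in for speech (queries) and text (documents).
speech = [[1.0, 0.1], [0.2, 1.0], [0.9, 0.8]]
text = [[1.0, 0.0], [0.0, 1.0], [1.0, -1.0]]
score = recall_at_1(speech, text)  # two of the three queries hit their pair
```

In a real system the embeddings would come from the speech and text encoders; only the scoring step is shown here.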
Jiarong Xian, Jibao Yuan, Peiwei Zheng, Dexian Chen, Nie yuntao
Text plagiarism detection is a common natural language processing task
that aims to detect whether a given text contains plagiarism or copying from
other texts. In existing research, detecting high-level plagiarism remains
a challenge due to the lack of high-quality datasets. In this paper, we propose
a plagiarized text data generation method based on GPT-3.5, which produces
a text plagiarism detection dataset of 32,927 pairs covering a wide range of
plagiarism methods, bridging this research gap. Meanwhile, we
propose a plagiarism identification method based on Faiss with BERT that offers
high efficiency and high accuracy. Our experiments show that this
model outperforms other models on several metrics, achieving 98.86%, 98.90%,
98.86%, and 0.9888 for Accuracy, Precision, Recall, and F1 Score, respectively.
Finally, we provide a user-friendly demo platform that allows users to
upload a text library and intuitively participate in the plagiarism analysis.
Authors' comments: arXiv admin note: text overlap with arXiv:1604.06573 by other authors
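Since F1 is the harmonic mean of precision and recall, the F1 score reported above can be cross-checked from the reported precision and recall. The short sketch below does that arithmetic; the function name is illustrative only.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (both as fractions in [0, 1])."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Precision (98.90%) and recall (98.86%) as reported in the abstract.
reported_f1 = f1_score(0.9890, 0.9886)
print(round(reported_f1, 4))  # 0.9888, matching the reported F1 score
```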
Chi-Min Chan, Chunpu Xu, Ruibin Yuan, Hongyin Luo, Wei Xue, Yike Guo, Jie Fu
Large Language Models (LLMs) exhibit remarkable capabilities but are prone to generating inaccurate or hallucinatory responses. This limitation stems from their reliance on vast pretraining datasets, making them susceptible to errors in unseen scenarios. To tackle these challenges, Retrieval-Augmented Generation (RAG) incorporates external, relevant documents into the response generation process, thus leveraging non-parametric knowledge alongside LLMs' in-context learning abilities. However, existing RAG implementations primarily focus on the initial input for context retrieval, overlooking the nuances of ambiguous or complex queries that necessitate further clarification or decomposition for accurate responses. To this end, we propose learning to Refine Query for Retrieval Augmented Generation (RQ-RAG) in this paper, endeavoring to enhance the model by equipping it with capabilities for explicit rewriting, decomposition, and disambiguation. Our experimental results indicate that our method, when applied to a 7B Llama2 model, surpasses the previous state-of-the-art (SOTA) by an average of 1.9% across three single-hop QA datasets, and also demonstrates enhanced performance in handling complex, multi-hop QA datasets. Our code is available at https://github.com/chanchimin/RQ-RAG.
Tenghao Huang, Dongwon Jung, Muhao Chen
Recent advancements in integrating external tools with Large Language Models
(LLMs) have opened new frontiers, with applications in mathematical reasoning,
code generators, and smart assistants. However, existing methods, relying on
simple one-time retrieval strategies, fall short of effectively and accurately
shortlisting relevant tools. This paper introduces a novel PLUTO (Planning,
Learning, and Understanding for TOols) approach, encompassing
`Plan-and-Retrieve (P&R)` and `Edit-and-Ground (E&G)` paradigms. The P&R
paradigm consists of a neural retrieval module for shortlisting relevant tools
and an LLM-based query planner that decomposes complex queries into actionable
tasks, enhancing the effectiveness of tool utilization. The E&G paradigm
utilizes LLMs to enrich tool descriptions based on user scenarios, bridging the
gap between user queries and tool functionalities. Experimental results
demonstrate that these paradigms significantly improve recall and NDCG in
tool retrieval tasks, surpassing current state-of-the-art models.
Authors' comments: This paper is accepted at NAACL-Findings 2024
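For readers unfamiliar with the two metrics named above, the following sketch computes recall@k and binary-relevance NDCG@k for a single query. The tool names are hypothetical and the binary-relevance setting is an assumption; the paper's exact evaluation protocol may differ.

```python
import math

def recall_at_k(ranked, relevant, k):
    """Share of relevant items that appear in the top-k of the ranking."""
    if not relevant:
        return 0.0
    return len(set(ranked[:k]) & set(relevant)) / len(relevant)

def ndcg_at_k(ranked, relevant, k):
    """Binary-relevance NDCG@k with the usual log2 position discount."""
    rel = set(relevant)
    dcg = sum(1.0 / math.log2(i + 2) for i, t in enumerate(ranked[:k]) if t in rel)
    ideal = sum(1.0 / math.log2(i + 2) for i in range(min(k, len(rel))))
    return dcg / ideal if ideal else 0.0

# Hypothetical tool names, for illustration only.
ranked_tools = ["weather_api", "calculator", "calendar", "translator"]
gold_tools = ["calculator", "translator"]
r = recall_at_k(ranked_tools, gold_tools, 3)  # one of two gold tools in the top 3
n = ndcg_at_k(ranked_tools, gold_tools, 3)
```

Recall only checks whether the gold tools appear in the shortlist, while NDCG also rewards placing them near the top.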
Junhao Xu, Longdi Xian, Zening Liu, Mingliang Chen, Qiuyang Yin, Fenghua Song
Artificial Intelligence Generated Content (AIGC) technology development has
facilitated the creation of rumors with misinformation, impacting societal,
economic, and political ecosystems, challenging democracy. Current rumor
detection efforts fall short by merely labeling potential misinformation
(classification task), inadequately addressing the issue, and it is unrealistic
to have authoritative institutions debunk every piece of information on social
media. Our proposed comprehensive debunking process not only detects rumors but
also provides explanatory generated content to refute the authenticity of the
information. The Expert-Citizen Collective Wisdom (ECCW) module we designed
ensures high-precision assessment of the credibility of information, and the
retrieval module is responsible for retrieving relevant knowledge from a
real-time updated debunking database based on information keywords. By using
prompt engineering techniques, we feed results and knowledge into an LLM (Large
Language Model), achieving satisfactory discrimination and explanatory effects
while eliminating the need for fine-tuning, saving computational costs, and
contributing to debunking efforts.
Authors' comments: 8 pages
Kai Zhang, Yi Luan, Hexiang Hu, Kenton Lee, Siyuan Qiao, Wenhu Chen, Yu Su, Ming-Wei Chang
Image retrieval, i.e., finding desired images given a reference image,
inherently encompasses rich, multi-faceted search intents that are difficult to
capture solely using image-based measures. Recent work leverages text
instructions to allow users to more freely express their search intents.
However, existing work primarily focuses on image pairs that are visually
similar and/or can be characterized by a small set of pre-defined relations.
The core thesis of this paper is that text instructions can enable retrieving
images with richer relations beyond visual similarity. To show this, we
introduce MagicLens, a series of self-supervised image retrieval models that
support open-ended instructions. MagicLens is built on a key novel insight:
image pairs that naturally occur on the same web pages contain a wide range of
implicit relations (e.g., inside view of), and we can make those implicit
relations explicit by synthesizing instructions via large multimodal models
(LMMs) and large language models (LLMs). Trained on 36.7M (query image,
instruction, target image) triplets with rich semantic relations mined from the
web, MagicLens achieves comparable or better results on eight benchmarks of
various image retrieval tasks than prior state-of-the-art (SOTA) methods.
Remarkably, it outperforms previous SOTA but with a 50X smaller model size on
multiple benchmarks. Additional human analyses on a 1.4M-image unseen corpus
further demonstrate the diversity of search intents supported by MagicLens.
Authors' comments: Work in progress
Robik Shrestha, Yang Zou, Qiuyu Chen, Zhiheng Li, Yusheng Xie, Siqi Deng
Existing text-to-image generative models reflect or even amplify societal
biases ingrained in their training data. This is especially concerning for
human image generation where models are biased against certain demographic
groups. Existing attempts to rectify this issue are hindered by the inherent
limitations of the pre-trained models and fail to substantially improve
demographic diversity. In this work, we introduce Fair Retrieval Augmented
Generation (FairRAG), a novel framework that conditions pre-trained generative
models on reference images retrieved from an external image database to improve
fairness in human generation. FairRAG enables conditioning through a
lightweight linear module that projects reference images into the textual
space. To enhance fairness, FairRAG applies simple-yet-effective debiasing
strategies, providing images from diverse demographic groups during the
generative process. Extensive experiments demonstrate that FairRAG outperforms
existing methods in terms of demographic diversity, image-text alignment, and
image fidelity while incurring minimal computational overhead during inference.
Authors' comments: CVPR 2024
Benedikt Diederichs, Frank Filbir, Patricia Römer
The problem of phase retrieval has many applications in the field of optical imaging. Motivated by imaging experiments with biological specimens, we primarily consider the setting of low-dose illumination where Poisson noise plays the dominant role. In this paper, we discuss gradient descent algorithms based on different loss functions adapted to data affected by Poisson noise, in particular in the low-dose regime. Starting from the maximum log-likelihood function for the Poisson distribution, we investigate different regularizations and approximations of the problem to design an algorithm that meets the requirements that are faced in applications. In the course of this, we focus on low-count measurements. For all suggested loss functions, we study the convergence of the respective gradient descent algorithms to stationary points and find constant step sizes that guarantee descent of the loss in each iteration. Numerical experiments in the low-dose regime are performed to corroborate the theoretical observations.
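For orientation, a standard form of the Poisson negative log-likelihood for phase retrieval, together with a plain gradient descent step, can be written as follows. The notation (signal $x$, measurement vectors $a_i$, photon counts $y_i$, step size $\mu$, regularization parameter $\varepsilon$) is assumed here for illustration, and the specific losses studied in the paper may differ.

$$
\mathcal{L}(x) = \sum_{i=1}^{m} \Big( |\langle a_i, x \rangle|^2 - y_i \log |\langle a_i, x \rangle|^2 \Big),
\qquad
x^{(t+1)} = x^{(t)} - \mu \, \nabla \mathcal{L}\big(x^{(t)}\big).
$$

In the real-valued case the gradient is

$$
\nabla \mathcal{L}(x) = 2 \sum_{i=1}^{m} \left( 1 - \frac{y_i}{\langle a_i, x \rangle^2} \right) \langle a_i, x \rangle \, a_i ,
$$

which is undefined wherever $\langle a_i, x \rangle = 0$. One common remedy, relevant for low-count data, is to replace $|\langle a_i, x \rangle|^2$ inside the logarithm by $|\langle a_i, x \rangle|^2 + \varepsilon$ with a small $\varepsilon > 0$.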
Changkun Liu, Jianhao Jiao, Huajian Huang, Zhengyang Ma, Dimitrios Kanoulas, Tristan Braud
State-of-the-art hierarchical localisation pipelines (HLoc) employ image
retrieval (IR) to establish 2D-3D correspondences by selecting the top-$k$ most
similar images from a reference database. While increasing $k$ improves
localisation robustness, it also linearly increases computational cost and
runtime, creating a significant bottleneck. This paper investigates the
relationship between global and local descriptors, showing that greater
similarity between the global descriptors of query and database images
increases the proportion of feature matches. Low similarity queries
significantly benefit from increasing $k$, while high similarity queries
rapidly experience diminishing returns. Building on these observations, we
propose an adaptive strategy that adjusts $k$ based on the similarity between
the query's global descriptor and those in the database, effectively mitigating
the feature-matching bottleneck. Our approach optimizes processing time without
sacrificing accuracy. Experiments on three indoor and outdoor datasets show
that AIR-HLoc reduces feature matching time by up to 30%, while preserving
state-of-the-art accuracy. The results demonstrate that AIR-HLoc facilitates a
latency-sensitive localisation system.
Authors' comments: Accepted to the 2025 IEEE International Conference on Robotics and
Automation (ICRA)
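The idea of adapting $k$ to retrieval confidence can be sketched as below: queries whose best global-descriptor similarity is high keep only a few database images, while low-similarity queries keep the full budget. The thresholds, the three-level schedule, and the function name are hypothetical and are not taken from the paper.

```python
def adaptive_k(similarities, k_max=10, high=0.8, low=0.5):
    """Return indices of the database images to keep for feature matching.

    `similarities` holds query-to-database global descriptor similarities.
    Thresholds and the k schedule are hypothetical illustration values.
    """
    top = max(similarities)
    if top >= high:        # confident retrieval: a few images suffice
        k = max(1, k_max // 4)
    elif top >= low:       # medium confidence: half the budget
        k = max(1, k_max // 2)
    else:                  # hard query: use the full budget
        k = k_max
    ranked = sorted(range(len(similarities)),
                    key=lambda i: similarities[i], reverse=True)
    return ranked[:k]

easy = adaptive_k([0.9, 0.2, 0.85, 0.1, 0.3], k_max=8)  # keeps 2 images
hard = adaptive_k([0.3, 0.4, 0.1], k_max=8)             # keeps all 3 images
```

Because feature matching cost grows linearly with $k$, the saving comes from the queries that fall into the high-similarity branch.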
Deokhyung Kang, Baikjin Jung, Yunsu Kim, Gary Geunbae Lee
In table-text open-domain question answering, a retriever system retrieves
relevant evidence from tables and text to answer questions. Previous studies in
table-text open-domain question answering have two common challenges: firstly,
their retrievers can be affected by false-positive labels in training datasets;
secondly, they may struggle to provide appropriate evidence for questions that
require reasoning across the table. To address these issues, we propose
Denoised Table-Text Retriever (DoTTeR). Our approach involves utilizing a
denoised training dataset with fewer false positive labels by discarding
instances with lower question-relevance scores measured through a false
positive detection model. Subsequently, we integrate table-level ranking
information into the retriever to assist in finding evidence for questions that
demand reasoning across the table. To encode this ranking information, we
fine-tune a rank-aware column encoder to identify minimum and maximum values
within a column. Experimental results demonstrate that DoTTeR significantly
outperforms strong baselines on both retrieval recall and downstream QA tasks.
Our code is available at https://github.com/deokhk/DoTTeR.
Authors' comments: Accepted to LREC-COLING 2024
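The two ingredients described above can be pictured with a small sketch: filtering training instances by a relevance score, and locating the minimum and maximum of a table column, which is the kind of target a rank-aware column encoder would learn to identify. The `score` field, the threshold, and both function names are assumptions made for illustration; the scores here are placeholders for the output of a false-positive detection model.

```python
def denoise(instances, threshold=0.5):
    """Keep training instances whose relevance score reaches the threshold."""
    return [ex for ex in instances if ex["score"] >= threshold]

def column_extremes(column):
    """Row indices of the minimum and maximum numeric values in a column."""
    idx = range(len(column))
    return min(idx, key=column.__getitem__), max(idx, key=column.__getitem__)

# Hypothetical scored training instances.
train = [
    {"question": "q1", "score": 0.91},
    {"question": "q2", "score": 0.12},  # likely false positive, discarded
    {"question": "q3", "score": 0.67},
]
clean = denoise(train)
lo_row, hi_row = column_extremes([3.5, 1.0, 7.2])
```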
Yanran Tang, Ruihong Qiu, Hongzhi Yin, Xue Li, Zi Huang
In case law, the precedents are the relevant cases that are used to support the decisions made by the judges and the opinions of lawyers towards a given case. This relevance is referred to as the case-to-case reference relation. To efficiently find relevant cases from a large case pool, retrieval tools are widely used by legal practitioners. Existing legal case retrieval models mainly work by comparing the text representations of individual cases. Although they obtain a decent retrieval accuracy, the intrinsic case connectivity relationships among cases have not been well exploited for case encoding, therefore limiting the further improvement of retrieval performance. In a case pool, there are three types of case connectivity relationships: the case reference relationship, the case semantic relationship, and the case legal charge relationship. Due to the inductive nature of the legal case retrieval task, using case reference as input is not applicable for testing. Thus, in this paper, a CaseLink model based on inductive graph learning is proposed to utilise the intrinsic case connectivity for legal case retrieval, in which a novel Global Case Graph is incorporated to represent both the case semantic relationship and the case legal charge relationship. A novel contrastive objective with a regularisation on the degree of case nodes is proposed to leverage the information carried by the case reference relationship to optimise the model. Extensive experiments have been conducted on two benchmark datasets, which demonstrate the state-of-the-art performance of CaseLink. The code has been released at https://github.com/yanran-tang/CaseLink.
Junhua Liu, Yong Keat Tan, Bin Fu, Kwan Hui Lim
Multi-turn intent classification is notably challenging due to the complexity
and evolving nature of conversational contexts. This paper introduces LARA, a
Linguistic-Adaptive Retrieval-Augmentation framework to enhance accuracy in
multi-turn classification tasks across six languages, accommodating a large
number of intents in chatbot interactions. LARA combines a fine-tuned smaller
model with a retrieval-augmented mechanism, integrated within the architecture
of LLMs. The integration allows LARA to dynamically utilize past dialogues and
relevant intents, thereby improving the understanding of the context.
Furthermore, our adaptive retrieval techniques bolster the cross-lingual
capabilities of LLMs without extensive retraining and fine-tuning.
Comprehensive experiments demonstrate that LARA achieves state-of-the-art
performance on multi-turn intent classification tasks, improving the average
accuracy by 3.67% over state-of-the-art single-turn intent classifiers.
Authors' comments: Accepted to EMNLP'24
Ashish Chouhan, Michael Gertz
With the increase in legislative documents at the EU, the number of new terms
and their definitions is increasing as well. As per the Joint Practical Guide
of the European Parliament, the Council and the Commission, terms used in legal
documents shall be consistent, and identical concepts shall be expressed
without departing from their meaning in ordinary, legal, or technical language.
Thus, while drafting a new legislative document, having a framework that
provides insights about existing definitions and helps define new terms based
on a document's context will support such harmonized legal definitions across
different regulations and thus avoid ambiguities. In this paper, we present
LexDrafter, a framework that assists in drafting Definitions articles for
legislative documents using retrieval augmented generation (RAG) and existing
term definitions present in different legislative documents. For this,
definition elements are built by extracting definitions from existing
documents. Using definition elements and RAG, a Definitions article can be
suggested on demand for a legislative document that is being drafted. We
demonstrate and evaluate the functionality of LexDrafter using a collection of
EU documents from the energy domain. The code for the LexDrafter framework is
available at https://github.com/achouhan93/LexDrafter.
Authors' comments: Accepted at LREC-COLING 2024
Yucheng Suo, Fan Ma, Linchao Zhu, Yi Yang
We study the zero-shot Composed Image Retrieval (ZS-CIR) task, which is to
retrieve the target image given a reference image and a description without
training on the triplet datasets. Previous works generate pseudo-word tokens by
projecting the reference image features to the text embedding space. However,
they focus on the global visual representation, ignoring the representation of
detailed attributes, e.g., color, object number and layout. To address this
challenge, we propose a Knowledge-Enhanced Dual-stream zero-shot composed image
retrieval framework (KEDs). KEDs implicitly models the attributes of the
reference images by incorporating a database. The database enriches the
pseudo-word tokens by providing relevant images and captions, emphasizing
shared attribute information in various aspects. In this way, KEDs recognizes
the reference image from diverse perspectives. Moreover, KEDs adopts an extra
stream that aligns pseudo-word tokens with textual concepts, leveraging
pseudo-triplets mined from image-text pairs. The pseudo-word tokens generated
in this stream are explicitly aligned with fine-grained semantics in the text
embedding space. Extensive experiments on widely used benchmarks, i.e.,
ImageNet-R, COCO object, Fashion-IQ and CIRR, show that KEDs outperforms
previous zero-shot composed image retrieval methods.
Authors' comments: CVPR 2024
Orion Weller, Benjamin Chang, Sean MacAvaney, Kyle Lo, Arman Cohan, Benjamin Van Durme, Dawn Lawrie, Luca Soldaini
Modern Large Language Models (LLMs) are capable of following long and complex instructions that enable a diverse range of user tasks. However, despite Information Retrieval (IR) models using LLMs as the backbone of their architectures, nearly all of them still only take queries as input, with no instructions. For the handful of recent models that do take instructions, it's unclear how they use them. We introduce our dataset FollowIR, which contains a rigorous instruction evaluation benchmark as well as a training set for helping IR models learn to better follow real-world instructions. FollowIR builds off the long history of the TREC conferences: as TREC provides human annotators with instructions (also known as narratives) to determine document relevance, so should IR models be able to understand and decide relevance based on these detailed instructions. Our evaluation benchmark starts with three deeply judged TREC collections and alters the annotator instructions, re-annotating relevant documents. Through this process, we can measure how well IR models follow instructions, through a new pairwise evaluation framework. Our results indicate that existing retrieval models fail to correctly use instructions, using them for basic keywords and struggling to understand long-form information. However, we show that it is possible for IR models to learn to follow complex instructions: our new FollowIR-7B model has significant improvements (over 13%) after fine-tuning on our training set.
Xiaobin Zhang, Liangjun Zang, Qianwen Liu, Shuchong Wei, Songlin Hu
Event temporal relation (TempRel) is a primary subject of the event relation
extraction task. However, the inherent ambiguity of TempRel increases the
difficulty of the task. With the rise of prompt engineering, it is important to
design effective prompt templates and verbalizers to extract relevant
knowledge. The traditional manually designed templates struggle to extract
precise temporal knowledge. This paper introduces a novel retrieval-augmented
TempRel extraction approach, leveraging knowledge retrieved from large language
models (LLMs) to enhance prompt templates and verbalizers. Our method
capitalizes on the diverse capabilities of various LLMs to generate a wide
array of ideas for template and verbalizer design. Our proposed method fully
exploits the potential of LLMs for generation tasks and contributes more
knowledge to our design. Empirical evaluations across three widely recognized
datasets demonstrate the efficacy of our method in improving the performance of
event temporal relation extraction tasks.
Authors' comments: 8 pages, 6 figures. Accepted to the International Joint Conference on
Neural Networks (IJCNN 2024)
Zhenrui Yue, Huimin Zeng, Yimeng Lu, Lanyu Shang, Yang Zhang, Dong Wang
The proliferation of online misinformation has posed significant threats to
public interest. While numerous online users actively participate in the combat
against misinformation, many such responses are characterized by a lack
of politeness and supporting facts. As a solution, text generation approaches
are proposed to automatically produce counter-misinformation responses.
Nevertheless, existing methods are often trained end-to-end without leveraging
external knowledge, resulting in subpar text quality and excessively repetitive
responses. In this paper, we propose retrieval augmented response generation
for online misinformation (RARG), which collects supporting evidence from
scientific sources and generates counter-misinformation responses based on the
evidence. In particular, our RARG consists of two stages: (1) evidence
collection, where we design a retrieval pipeline to retrieve and rerank
evidence documents using a database comprising over 1M academic articles; (2)
response generation, in which we align large language models (LLMs) to generate
evidence-based responses via reinforcement learning from human feedback (RLHF).
We propose a reward function to maximize the utilization of the retrieved
evidence while maintaining the quality of the generated text, which yields
polite and factual responses that clearly refute misinformation. To
demonstrate the effectiveness of our method, we study the case of COVID-19 and
perform extensive experiments with both in- and cross-domain datasets, where
RARG consistently outperforms baselines by generating high-quality
counter-misinformation responses.
Authors' comments: Accepted to NAACL 2024
Lucas Iijima, Nikolaos Giakoumoglou, Tania Stathaki
Cross-Domain Image Retrieval (CDIR) is a challenging task in computer vision, aiming to match images across different visual domains such as sketches, paintings, and photographs. Traditional approaches focus on visual image features and rely heavily on supervised learning with labeled data and cross-domain correspondences, and they often struggle with the significant domain gap. This paper introduces a novel unsupervised approach to CDIR that incorporates textual context by leveraging pre-trained vision-language models. Our method, dubbed Caption-Matching (CM), uses generated image captions as a domain-agnostic intermediate representation, enabling effective cross-domain similarity computation without the need for labeled data or fine-tuning. We evaluate our method on standard CDIR benchmark datasets, demonstrating state-of-the-art performance in unsupervised settings with improvements of 24.0% on Office-Home and 132.2% on DomainNet over previous methods. We also demonstrate our method's effectiveness on a dataset of AI-generated images from Midjourney, showcasing its ability to handle complex, multi-domain queries.
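To make the role of captions as an intermediate representation concrete, the sketch below compares images purely through their captions, using a bag-of-words cosine similarity. This is a deliberately simplified stand-in: the captions are hard-coded rather than generated by a vision-language model, and the bag-of-words comparison, the image identifiers, and the function names are assumptions for illustration, not the paper's similarity computation.

```python
import math
from collections import Counter

def caption_vector(caption):
    """Bag-of-words counts for a caption (lowercased, whitespace-tokenized)."""
    return Counter(caption.lower().split())

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def retrieve(query_caption, gallery_captions):
    """Return the gallery image id whose caption best matches the query caption."""
    q = caption_vector(query_caption)
    return max(gallery_captions,
               key=lambda name: cosine(q, caption_vector(gallery_captions[name])))

# Hypothetical captions: the query is a sketch, the gallery holds photographs.
gallery = {
    "photo_001": "a brown dog running on grass",
    "photo_002": "red bicycle leaning on a wall",
}
best = retrieve("a sketch of a dog running", gallery)
```

Because the comparison happens in text space, the visual gap between a sketch and a photograph does not enter the similarity directly.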
Xiang Li, Zhenyu Li, Chen Shi, Yong Xu, Qing Du, Mingkui Tan, Jun Huang, Wei Lin
The task of financial analysis primarily encompasses two key areas: stock
trend prediction and the corresponding financial question answering. Currently,
machine learning and deep learning algorithms (ML&DL) have been widely applied
for stock trend predictions, leading to significant progress. However, these
methods fail to provide reasons for predictions, lacking interpretability and
reasoning processes. Also, they cannot integrate textual information such as
financial news or reports. Meanwhile, large language models (LLMs) have
remarkable textual understanding and generation ability. But due to the
scarcity of financial training datasets and limited integration with real-time
knowledge, LLMs still suffer from hallucinations and are unable to keep up with
the latest information. To tackle these challenges, we first release AlphaFin
datasets, combining traditional research datasets, real-time financial data,
and handwritten chain-of-thought (CoT) data. It has a positive impact on
training LLMs for completing financial analysis. We then use AlphaFin datasets
to benchmark a state-of-the-art method, called Stock-Chain, for effectively
tackling the financial analysis task, which integrates retrieval-augmented
generation (RAG) techniques. Extensive experiments are conducted to demonstrate
the effectiveness of our framework on financial analysis.
Authors' comments: COLING 2024. The first three authors contributed equally. Project
website: https://github.com/AlphaFin-proj/AlphaFin
Yubao Tang, Ruqing Zhang, Jiafeng Guo, Maarten de Rijke, Wei Chen, Xueqi Cheng
Recently, a novel generative retrieval (GR) paradigm has been proposed, where
a single sequence-to-sequence model is learned to directly generate a list of
relevant document identifiers (docids) given a query. Existing GR models
commonly employ maximum likelihood estimation (MLE) for optimization: this
involves maximizing the likelihood of a single relevant docid given an input
query, with the assumption that the likelihood for each docid is independent of
the other docids in the list. We refer to these models as the pointwise
approach in this paper. While the pointwise approach has been shown to be
effective in the context of GR, it is considered sub-optimal due to its
disregard for the fundamental principle that ranking involves making
predictions about lists. In this paper, we address this limitation by
introducing an alternative listwise approach, which empowers the GR model to
optimize the relevance at the docid list level. Specifically, we view the
generation of a ranked docid list as a sequence learning process: at each step
we learn a subset of parameters that maximizes the corresponding generation
likelihood of the $i$-th docid given the (preceding) top $i-1$ docids. To
formalize the sequence learning process, we design a positional conditional
probability for GR. To alleviate the potential impact of beam search on the
generation quality during inference, we perform relevance calibration on the
generation likelihood of model-generated docids according to relevance grades.
We conduct extensive experiments on representative binary and multi-graded
relevance datasets. Our empirical results demonstrate that our method
outperforms state-of-the-art GR baselines in terms of retrieval performance.
Authors' comments: Accepted by ACM Transactions on Information Systems
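The contrast between the pointwise and listwise views described above can be written compactly. The notation is assumed here for illustration: $q$ is the query, $d_1, \dots, d_n$ is the ranked docid list, and $\theta$ denotes the model parameters. The paper's exact objective, including its step-specific parameters and relevance calibration, may differ.

$$
\text{Pointwise:}\quad P_\theta(d_1, \dots, d_n \mid q) \approx \prod_{i=1}^{n} P_\theta(d_i \mid q)
$$

$$
\text{Listwise:}\quad P_\theta(d_1, \dots, d_n \mid q) = \prod_{i=1}^{n} P_\theta(d_i \mid q, d_1, \dots, d_{i-1})
$$

The pointwise form treats each docid as independent of the others, whereas the listwise form conditions the $i$-th docid on the preceding top $i-1$ docids, so the likelihood is defined at the level of the whole ranked list.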