Jheng-Hong Yang, Jimmy Lin
Vision--Language Models (VLMs) have demonstrated success across diverse
applications, yet their potential to assist in relevance judgments remains
uncertain. This paper assesses the relevance estimation capabilities of VLMs,
including CLIP, LLaVA, and GPT-4V, within a large-scale \textit{ad hoc}
retrieval task tailored for multimedia content creation in a zero-shot fashion.
Preliminary experiments reveal the following: (1) Both LLaVA and GPT-4V,
encompassing open-source and closed-source visual-instruction-tuned Large
Language Models (LLMs), achieve notable Kendall's $\tau \sim 0.4$ when compared
to human relevance judgments, surpassing the CLIPScore metric. (2) While
CLIPScore is strongly preferred, LLMs are less biased towards CLIP-based
retrieval systems. (3) GPT-4V's score distribution aligns more closely with
human judgments than other models, achieving a Cohen's $\kappa$ value of around
0.08, which outperforms CLIPScore at approximately -0.096. These findings
underscore the potential of LLM-powered VLMs in enhancing relevance judgments.
Authors' comments: Accepted by ACM SIGIR 2024 LLM4Eval Workshop:
https://llm4eval.github.io/papers
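The two agreement statistics quoted in this abstract can be illustrated with a minimal, standard-library sketch. It is not the authors' evaluation code: the function names and the toy inputs are assumptions made for illustration, and the tau-a variant is used, in which tied pairs count as neither concordant nor discordant.

```python
from collections import Counter
from itertools import combinations


def kendall_tau_a(x, y):
    """Kendall's tau-a: (concordant - discordant) / number of pairs."""
    concordant = discordant = 0
    for (x1, y1), (x2, y2) in combinations(zip(x, y), 2):
        sign = (x1 - x2) * (y1 - y2)
        if sign > 0:
            concordant += 1
        elif sign < 0:
            discordant += 1
    n = len(x)
    return (concordant - discordant) / (n * (n - 1) / 2)


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa: observed agreement corrected for chance agreement."""
    n = len(labels_a)
    observed = sum(1 for a, b in zip(labels_a, labels_b) if a == b) / n
    counts_a, counts_b = Counter(labels_a), Counter(labels_b)
    expected = sum(counts_a[k] * counts_b[k] for k in counts_a) / (n * n)
    # Undefined when expected == 1 (both raters always give the same single label).
    return (observed - expected) / (1 - expected)


# Hypothetical scores: human judgments vs. a model's judgments.
human_rank = [4, 3, 2, 1]
model_rank = [4, 2, 3, 1]
tau = kendall_tau_a(human_rank, model_rank)

human_labels = [1, 1, 0, 0]
model_labels = [1, 0, 0, 0]
kappa = cohen_kappa(human_labels, model_labels)
```

A kappa near 0, as reported above, indicates agreement close to chance level, while a negative value indicates agreement below chance.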
H. R. Tizhoosh
Medical images play a crucial role in modern healthcare by providing vital information for diagnosis, treatment planning, and disease monitoring. Fields such as radiology and pathology rely heavily on accurate image interpretation, with radiologists examining X-rays, CT scans, and MRIs to diagnose conditions from fractures to cancer, while pathologists use microscopy and digital images to detect cellular abnormalities for diagnosing cancers and infections. Technological advancements have exponentially increased the volume and complexity of medical images, necessitating efficient tools for management and retrieval. Content-Based Image Retrieval (CBIR) systems address this need by searching and retrieving images based on visual content, enhancing diagnostic accuracy by allowing clinicians to find similar cases and compare pathological patterns. Comprehensive validation of image search engines in medical applications involves evaluating performance metrics such as accuracy, indexing and search times, and storage overhead, ensuring reliable and efficient retrieval of accurate results, as demonstrated by recent validations in histopathology.
Youna Kim, Hyuhng Joon Kim, Cheonbok Park, Choonghyun Park, Hyunsoo Cho, Junyeob Kim, Kang Min Yoo, Sang-goo Lee et al.
When using large language models (LLMs) in knowledge-intensive tasks, such as
open-domain question answering, external context can bridge the gap between
external knowledge and the LLMs' parametric knowledge. Recent research has sought
to amplify contextual knowledge over the parametric knowledge of LLMs
with contrastive decoding approaches. While these approaches can yield
truthful responses when relevant context is provided, they are prone to
vulnerabilities when faced with noisy contexts. We extend the scope of previous
studies to encompass noisy contexts and propose adaptive contrastive decoding
(ACD) to leverage contextual influence effectively. ACD demonstrates
improvements in open-domain question answering tasks compared to baselines,
especially in robustness by remaining undistracted by noisy contexts in
retrieval-augmented generation.
Authors' comments: EMNLP 2024 Findings
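To make the idea of contrastive decoding with an adaptive weight concrete, the following is a minimal sketch. The combination rule (1 + alpha) * logit_with_context - alpha * logit_without_context is the usual context-aware contrastive form. The entropy-based choice of alpha is an assumption made for illustration, since the abstract does not specify how ACD sets its weight, and all function names are hypothetical.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(l - m) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]


def entropy(probs):
    """Shannon entropy (in nats) of a probability distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def adaptive_contrastive_logits(logits_ctx, logits_no_ctx):
    """Contrast next-token logits computed with and without retrieved context.

    Assumed weighting: alpha is small when the with-context distribution is
    uncertain (high entropy, e.g. a noisy context), so the contrast is damped.
    """
    h_ctx = entropy(softmax(logits_ctx))
    h_no = entropy(softmax(logits_no_ctx))
    alpha = h_no / (h_no + h_ctx) if (h_no + h_ctx) > 0 else 0.0
    adjusted = [lc + alpha * (lc - ln) for lc, ln in zip(logits_ctx, logits_no_ctx)]
    return adjusted, alpha


# Helpful context: the model becomes confident once the context is added.
adjusted_good, alpha_good = adaptive_contrastive_logits([5.0, 0.0, 0.0], [0.0, 0.0, 0.0])
# Noisy context: the model is less certain with the context than without it.
adjusted_noisy, alpha_noisy = adaptive_contrastive_logits([0.0, 0.0, 0.0], [5.0, 0.0, 0.0])
```

Under this assumed weighting, the helpful context receives a larger alpha than the noisy one, which is the qualitative behaviour the abstract describes as remaining undistracted by noisy contexts.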
Yue Duan, Zhangxuan Gu, Zhenzhe Ying, Lei Qi, Changhua Meng, Yinghuan Shi
In the realm of cross-modal retrieval, seamlessly integrating diverse
modalities within multimedia remains a formidable challenge, especially given
the complexities introduced by noisy correspondence learning (NCL). Such noise
often stems from mismatched data pairs, which is a significant obstacle
distinct from traditional noisy labels. This paper introduces the
Pseudo-Classification based Pseudo-Captioning (PC$^2$) framework to address
this challenge. PC$^2$ offers a threefold strategy: firstly, it establishes an
auxiliary "pseudo-classification" task that interprets captions as categorical
labels, steering the model to learn image-text semantic similarity through a
non-contrastive mechanism. Secondly, unlike prevailing margin-based techniques,
capitalizing on PC$^2$'s pseudo-classification capability, we generate
pseudo-captions to provide more informative and tangible supervision for each
mismatched pair. Thirdly, the oscillation of pseudo-classification is leveraged
to assist the correction of correspondence. In addition to technical
contributions, we develop a realistic NCL dataset called Noise of Web (NoW),
which could serve as a powerful new NCL benchmark where noise exists naturally.
Empirical evaluations of PC$^2$ showcase marked improvements over existing
state-of-the-art robust cross-modal retrieval techniques on both simulated and
realistic datasets with various NCL settings. The contributed dataset and
source code are released at https://github.com/alipay/PC2-NoiseofWeb.
Authors' comments: Accepted by ACM MM 2024
Xiaoye Qu, Qiyuan Chen, Wei Wei, Jishuo Sun, Jianfeng Dong
Despite the remarkable ability of large vision-language models (LVLMs) in image comprehension, these models frequently generate plausible yet factually incorrect responses, a phenomenon known as hallucination. Recently, in large language models (LLMs), augmenting LLMs by retrieving information from external knowledge resources has proven to be a promising solution for mitigating hallucinations. However, retrieval augmentation in LVLMs significantly lags behind the widespread applications of LVLMs. Moreover, when transferred to augmenting LVLMs, it sometimes even exacerbates the model's degree of hallucination. Motivated by this research gap and counter-intuitive phenomenon, we introduce a novel framework, the Active Retrieval-Augmented large vision-language model (ARA), specifically designed to address hallucinations by incorporating three critical dimensions: (i) dissecting the retrieval targets based on the inherent hierarchical structures of images; (ii) pinpointing the most effective retrieval methods and filtering out the reliable retrieval results; and (iii) timing the retrieval process to coincide with episodes of low certainty, while circumventing unnecessary retrieval during periods of high certainty. To assess the capability of our proposed ARA model in reducing hallucination, we employ three widely used LVLMs (LLaVA-1.5, Qwen-VL, and mPLUG-Owl2) across four benchmarks. Our empirical observations suggest that by utilizing fitting retrieval mechanisms and timing the retrieval judiciously, we can effectively mitigate the hallucination problem. We hope that this study can provide deeper insights into how to adapt retrieval augmentation to LVLMs for reducing hallucinations with more effective retrieval and minimal retrieval occurrences.
Okko Makkonen, David Karpuk, Camilla Hollanti
Private information retrieval (PIR) considers the problem of retrieving a
data item from a database or distributed storage system without disclosing any
information about which data item was retrieved. Secure PIR complements this
problem by further requiring the contents of the data to be kept secure.
Privacy and security can be achieved by adding suitable noise to the queries
and data using methods from secret sharing. In this paper, a new framework for
homomorphic secret sharing in secure and private information retrieval from
colluding servers is proposed, generalizing the original cross-subspace
alignment (CSA) codes proposed by Jia, Sun, and Jafar. We utilize this
framework to give a secure PIR construction using algebraic geometry codes over
hyperelliptic curves of arbitrary genus. It is shown that the proposed scheme
offers interesting tradeoffs between the field size, file size, number of
colluding servers, and the total number of servers. When the field size is
fixed, this translates in some cases to higher retrieval rates than those of
the original scheme. In addition, the new schemes exist also for some
parameters where the original ones do not.
Authors' comments: 19 pages, 1 figure. Extended version of arXiv:2405.18052
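As background for the PIR setting described above, here is a minimal sketch of the classic two-server XOR scheme. It is a swapped-in textbook construction, not the algebraic-geometry-code scheme of this paper: it assumes two non-colluding servers that each hold the full database in the clear, so it offers neither collusion resistance nor data security. All function names are hypothetical.

```python
import secrets


def make_queries(n_items, index):
    """Build two query bit-vectors that differ only at the wanted index.

    Each vector on its own is uniformly random, so neither server learns
    which item is being retrieved.
    """
    q1 = [secrets.randbelow(2) for _ in range(n_items)]
    q2 = list(q1)
    q2[index] ^= 1
    return q1, q2


def server_answer(database, query):
    """Each server XORs together the items selected by its query vector."""
    answer = 0
    for item, bit in zip(database, query):
        if bit:
            answer ^= item
    return answer


def retrieve(database, index):
    """XOR of the two answers cancels everything except the wanted item."""
    q1, q2 = make_queries(len(database), index)
    return server_answer(database, q1) ^ server_answer(database, q2)


database = [17, 42, 99, 7]  # toy items encoded as integers
item = retrieve(database, 2)
```

The random query plays the role of the noise mentioned in the abstract. Schemes for colluding servers replace the single random mask with secret sharing so that any small group of servers still learns nothing about the index.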
Gangyan Zeng, Yuan Zhang, Jin Wei, Dongbao Yang, Peng Zhang, Yiwen Gao, Xugong Qin, Yu Zhou
Scene text retrieval aims to find all images containing the query text from
an image gallery. Current efforts tend to adopt an Optical Character
Recognition (OCR) pipeline, which requires complicated text detection and/or
recognition processes, resulting in inefficient and inflexible retrieval.
Different from them, in this work we propose to explore the intrinsic potential
of Contrastive Language-Image Pre-training (CLIP) for OCR-free scene text
retrieval. Through empirical analysis, we observe that the main challenges of
CLIP as a text retriever are: 1) limited text perceptual scale, and 2)
entangled visual-semantic concepts. To this end, a novel model termed FDP
(Focus, Distinguish, and Prompt) is developed. FDP first focuses on scene text
via shifting the attention to the text area and probing the hidden text
knowledge, and then divides the query text into content word and function word
for processing, in which a semantic-aware prompting scheme and a distracted
queries assistance module are utilized. Extensive experiments show that FDP
significantly enhances the inference speed while achieving better or
competitive retrieval accuracy compared to existing methods. Notably, on the
IIIT-STR benchmark, FDP surpasses the state-of-the-art model by 4.37% with a 4
times faster speed. Furthermore, additional experiments under phrase-level and
attribute-aware scene text retrieval settings validate FDP's particular
advantages in handling diverse forms of query text. The source code will be
publicly available at https://github.com/Gyann-z/FDP.
Authors' comments: Accepted by ACM MM 2024
Houye Ji, Ye Tang, Zhaoxin Chen, Lixi Deng, Jun Hu, Lei Su
With the rapid development of the short video industry, traditional e-commerce has encountered a new paradigm, video-driven e-commerce, which leverages attractive videos for product showcases and provides both video and item services for users. Benefitting from the dynamic and visualized introduction of items, video-driven e-commerce has shown huge potential in stimulating consumer confidence and promoting sales. In this paper, we focus on the video retrieval task, facing the following challenges: (1) How to handle the heterogeneities among users, items, and videos? (2) How to mine the complementarity between items and videos for better user understanding? We first leverage the dual graph to model the co-existence of user-video and user-item interactions in video-driven e-commerce and innovatively reduce user preference understanding to a graph matching problem. To solve it, we further propose a novel bi-level Graph Matching Network (GMN), which mainly consists of node- and preference-level graph matching. Given a user, node-level graph matching aims to match videos and items, while preference-level graph matching aims to match multiple user preferences extracted from both videos and items. The proposed GMN can then generate and improve user embeddings by aggregating matched nodes or preferences from the dual graph in a bi-level manner. Comprehensive experiments show the superiority of the proposed GMN with significant improvements over state-of-the-art approaches (e.g., AUC +1.9% and CTR +7.15%). We have developed it on a well-known video-driven e-commerce platform, serving hundreds of millions of users every day.
Anton Korikov, George Saad, Ethan Baron, Mustafa Khan, Manav Shah, Scott Sanner
While user-generated product reviews often contain large quantities of information, their utility in addressing natural language product queries has been limited, with a key challenge being the need to aggregate information from multiple low-level sources (reviews) to a higher item level during retrieval. Existing methods for reviewed-item retrieval (RIR) typically take a late fusion (LF) approach which computes query-item scores by simply averaging the top-K query-review similarity scores for an item. However, we demonstrate that for multi-aspect queries and multi-aspect items, LF is highly sensitive to the distribution of aspects covered by reviews in terms of aspect frequency and the degree of aspect separation across reviews. To address these LF failures, we propose several novel aspect fusion (AF) strategies which include Large Language Model (LLM) query extraction and generative reranking. Our experiments show that for imbalanced review corpora, AF can improve over LF by a MAP@10 increase from 0.36 to 0.52, while achieving equivalent performance for balanced review corpora.
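The late fusion (LF) baseline described above, which scores an item by averaging its top-K query-review similarities, can be sketched as follows. The function names and the toy similarity values are assumptions made for illustration. The similarities would normally come from an embedding model, which is omitted here.

```python
def late_fusion_score(review_sims, k):
    """Average the k highest query-review similarity scores for one item."""
    top_k = sorted(review_sims, reverse=True)[:k]
    return sum(top_k) / len(top_k)


def rank_items(item_to_review_sims, k):
    """Rank items by their late-fusion score, best first."""
    scores = {
        item: late_fusion_score(sims, k)
        for item, sims in item_to_review_sims.items()
        if sims  # skip items without reviews
    }
    return sorted(scores, key=scores.get, reverse=True)


# Hypothetical query-review similarities for two items.
sims = {
    "item_a": [0.9, 0.1, 0.1],  # one highly relevant review, the rest off-topic
    "item_b": [0.6, 0.6, 0.2],  # several moderately relevant reviews
}
ranking = rank_items(sims, k=2)
```

With k=2, item_b outranks item_a even though item_a has the single best-matching review. This illustrates how LF depends on how relevant content is distributed across reviews, which is the sensitivity the abstract attributes to it.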
Guangzhi Xiong, Qiao Jin, Xiao Wang, Minjia Zhang, Zhiyong Lu, Aidong Zhang
The emergent abilities of large language models (LLMs) have demonstrated
great potential in solving medical questions. They can possess considerable
medical knowledge, but may still hallucinate and are inflexible with respect to
knowledge updates. While Retrieval-Augmented Generation (RAG) has been proposed
to enhance the medical question-answering capabilities of LLMs with external
knowledge bases, it may still fail in complex cases where multiple rounds of
information-seeking are required. To address such an issue, we propose
iterative RAG for medicine (i-MedRAG), where LLMs can iteratively ask follow-up
queries based on previous information-seeking attempts. In each iteration of
i-MedRAG, the follow-up queries will be answered by a conventional RAG system
and they will be further used to guide the query generation in the next
iteration. Our experiments show the improved performance of various LLMs
brought by i-MedRAG compared with conventional RAG on complex questions from
clinical vignettes in the United States Medical Licensing Examination (USMLE),
as well as various knowledge tests in the Massive Multitask Language
Understanding (MMLU) dataset. Notably, our zero-shot i-MedRAG outperforms all
existing prompt engineering and fine-tuning methods on GPT-3.5, achieving an
accuracy of 69.68% on the MedQA dataset. In addition, we characterize the
scaling properties of i-MedRAG with different iterations of follow-up queries
and different numbers of queries per iteration. Our case studies show that
i-MedRAG can flexibly ask follow-up queries to form reasoning chains, providing
an in-depth analysis of medical questions. To the best of our knowledge, this
is the first-of-its-kind study on incorporating follow-up queries into medical
RAG. The implementation of i-MedRAG is available at
https://github.com/Teddy-XiongGZ/MedRAG.
Authors' comments: Accepted to PSB 2025
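The iterative loop described in this abstract can be sketched in a few lines. This is an illustrative skeleton under stated assumptions, not the released i-MedRAG implementation: the three callables stand in for LLM and RAG calls, and every name and parameter here is hypothetical.

```python
def iterative_rag(question, ask_followups, answer_query, final_answer,
                  n_iterations=2, n_queries=2):
    """Run several rounds of follow-up queries before answering.

    ask_followups(question, history) -> list of follow-up query strings
    answer_query(query)              -> answer from a conventional RAG system
    final_answer(question, history)  -> final answer given all query-answer pairs
    """
    history = []  # (query, answer) pairs from earlier iterations
    for _ in range(n_iterations):
        # New queries are conditioned on what earlier rounds already found.
        queries = ask_followups(question, history)[:n_queries]
        for query in queries:
            history.append((query, answer_query(query)))
    return final_answer(question, history), history


# Toy stand-ins for the LLM and the RAG system, used only to exercise the loop.
def toy_followups(question, history):
    return [f"follow-up {len(history) + i} about: {question}" for i in range(3)]


def toy_rag(query):
    return f"evidence for [{query}]"


def toy_final(question, history):
    return f"answer based on {len(history)} retrieved facts"


answer, trace = iterative_rag("Which drug is indicated?", toy_followups, toy_rag, toy_final)
```

The two parameters n_iterations and n_queries correspond to the two scaling axes mentioned in the abstract: the number of follow-up rounds and the number of queries per round.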
Lukas Gienapp, Niklas Deckers, Martin Potthast, Harrisen Scells
Representation-based retrieval models, so-called biencoders, estimate the
relevance of a document to a query by calculating the similarity of their
respective embeddings. Current state-of-the-art biencoders are trained using an
expensive training regime involving knowledge distillation from a teacher model
and batch-sampling. Instead of relying on a teacher model, we contribute a
novel parameter-free loss function for self-supervision that exploits the
pre-trained language modeling capabilities of the encoder model as a training
signal, eliminating the need for batch sampling by performing implicit hard
negative mining. We investigate the capabilities of our proposed approach
through extensive ablation studies, demonstrating that self-distillation can
match the effectiveness of teacher distillation using only 13.5% of the data,
while offering a speedup in training time between 3x and 15x compared to
parametrized losses. Code and data are made openly available.
Authors' comments: 9 Pages, 4 Tables, 6 Figures
Zhirui Kuai, Zuxu Chen, Huimu Wang, Mingming Li, Dadong Miao, Binbin Wang, Xusong Chen, Li Kuang et al.
Generative retrieval (GR) has emerged as a transformative paradigm in search and recommender systems, leveraging numeric-based identifier representations to enhance efficiency and generalization. Notably, methods like TIGER, which employ Residual Quantization-based Semantic Identifiers (RQ-SID), have shown significant promise in e-commerce scenarios by effectively managing item IDs. However, a critical issue, termed the "\textbf{Hourglass}" phenomenon, occurs in RQ-SID, where intermediate codebook tokens become overly concentrated, hindering the full utilization of generative retrieval methods. This paper analyzes and addresses this problem by identifying data sparsity and long-tailed distribution as the primary causes. Through comprehensive experiments and detailed ablation studies, we analyze the impact of these factors on codebook utilization and data distribution. Our findings reveal that the "Hourglass" phenomenon substantially impacts the performance of RQ-SID in generative retrieval. We propose effective solutions to mitigate this issue, thereby significantly enhancing the effectiveness of generative retrieval in real-world e-commerce applications.
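To show what a residual-quantization semantic identifier looks like, and how per-layer token concentration could be measured, here is a minimal sketch. It assumes fixed, hand-written codebooks rather than learned ones, and all names are hypothetical. It is not the TIGER or RQ-SID training procedure.

```python
from collections import Counter


def nearest(codebook, vec):
    """Index of the codeword closest to vec in squared Euclidean distance."""
    return min(
        range(len(codebook)),
        key=lambda i: sum((c - v) ** 2 for c, v in zip(codebook[i], vec)),
    )


def rq_encode(vec, codebooks):
    """Quantize vec layer by layer; each layer encodes the previous residual."""
    residual = list(vec)
    ids = []
    for codebook in codebooks:
        idx = nearest(codebook, residual)
        ids.append(idx)
        residual = [r - c for r, c in zip(residual, codebook[idx])]
    return ids, residual


def layer_token_counts(vectors, codebooks):
    """Count how often each token is used at each layer."""
    counts = [Counter() for _ in codebooks]
    for vec in vectors:
        ids, _ = rq_encode(vec, codebooks)
        for layer, idx in enumerate(ids):
            counts[layer][idx] += 1
    return counts


# Toy two-layer codebooks over 2-D embeddings (illustrative values only).
codebooks = [
    [[0.0, 0.0], [10.0, 10.0]],              # coarse layer
    [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]],    # residual layer
]
semantic_id, leftover = rq_encode([10.0, 11.0], codebooks)
usage = layer_token_counts([[10.0, 11.0], [0.0, 1.0], [10.0, 10.0]], codebooks)
```

Comparing the per-layer counts returned by layer_token_counts is one simple way to check whether a few tokens dominate an intermediate layer, which is the concentration pattern the abstract calls the "Hourglass" phenomenon.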
Yingcai Ma, Ziyang Wang, Yuliang Yan, Jian Wu, Yuning Jiang, Longbin Li, Wen Chen, Jianhang Huang
In recommendation systems, the matching stage is becoming increasingly critical, serving as the upper limit for the entire recommendation process. Recently, some studies have started to explore the use of multi-scenario information for recommendations, such as model-based and data-based approaches. However, the matching stage faces significant challenges due to the need to support ultra-large-scale retrieval while meeting low latency requirements. As a result, the methods applied at this stage (collaborative filtering and two-tower models) are often designed to be lightweight, hindering the full utilization of extensive information. On the other hand, the ranking stage features the most sophisticated models with the strongest scoring capabilities, but due to the limited screen size of mobile devices, most of the ranked results may not gain exposure or be displayed. In this paper, we introduce an innovative multi-scenario nearline retrieval framework. It operates by harnessing ranking logs from various scenarios through Flink, allowing us to incorporate finely ranked results from other scenarios into our matching stage in near real-time. In addition, we propose a streaming scoring module, which selects a crucial subset from the candidate pool. Implemented on the "Guess You Like" (homepage of the Taobao APP), China's premier e-commerce platform, our method has shown substantial improvements, most notably a 5% uptick in product transactions. Furthermore, the proposed approach is not only model-free but also highly efficient, suggesting it can be quickly implemented in diverse scenarios and demonstrate promising performance.
Yanxu Mao, Xiaohui Chen, Peipei Liu, Tiehan Cui, Zuhui Yue, Zheng Li
Document-level relation extraction (DocRE) aims to extract relations between entities from unstructured document text. Compared to sentence-level relation extraction, it requires more complex semantic understanding from a broader text context. Currently, some studies utilize logical rules within evidence sentences to enhance the performance of DocRE. However, for data without provided evidence sentences, researchers often obtain a list of evidence sentences for the entire document through evidence retrieval (ER). As a result, DocRE suffers from two challenges: firstly, the relevance between evidence and entity pairs is weak; secondly, complex cross-relations between long-distance multi-entities are insufficiently extracted. To overcome these challenges, we propose GEGA, a novel model for DocRE. The model leverages graph neural networks to construct multiple weight matrices, guiding attention allocation to evidence sentences. It also employs multi-scale representation aggregation to enhance ER. Subsequently, we integrate the most efficient evidence information to implement both fully supervised and weakly supervised training processes for the model. We evaluate the GEGA model on three widely used benchmark datasets: DocRED, Re-DocRED, and Revisit-DocRED. The experimental results indicate that our model achieves comprehensive improvements over the existing SOTA model.
Riccardo Orlando, Pere-Lluis Huguet Cabot, Edoardo Barba, Roberto Navigli
Entity Linking (EL) and Relation Extraction (RE) are fundamental tasks in
Natural Language Processing, serving as critical components in a wide range of
applications. In this paper, we propose ReLiK, a Retriever-Reader architecture
for both EL and RE, where, given an input text, the Retriever module undertakes
the identification of candidate entities or relations that could potentially
appear within the text. Subsequently, the Reader module is tasked to discern
the pertinent retrieved entities or relations and establish their alignment
with the corresponding textual spans. Notably, we put forward an innovative
input representation that incorporates the candidate entities or relations
alongside the text, making it possible to link entities or extract relations in
a single forward pass and to fully leverage pre-trained language models'
contextualization capabilities, in contrast with previous
Retriever-Reader-based methods, which require a forward pass for each
candidate. Our formulation of EL and RE achieves state-of-the-art performance
in both in-domain and out-of-domain benchmarks while using academic budget
training and with up to 40x faster inference than competitors. Finally,
we show how our architecture can be used seamlessly for closed Information
Extraction (cIE), i.e., EL + RE, setting a new state of the art by employing a
shared Reader that simultaneously extracts entities and relations.
Authors' comments: Findings of the Association for Computational Linguistics ACL 2024
Kyra Wilson, Aylin Caliskan
Artificial intelligence (AI) hiring tools have revolutionized resume
screening, and large language models (LLMs) have the potential to do the same.
However, given the biases which are embedded within LLMs, it is unclear whether
they can be used in this scenario without disadvantaging groups based on their
protected attributes. In this work, we investigate the possibilities of using
LLMs in a resume screening setting via a document retrieval framework that
simulates job candidate selection. Using that framework, we then perform a
resume audit study to determine whether a selection of Massive Text Embedding
(MTE) models are biased in resume screening scenarios. We simulate this for
nine occupations, using a collection of over 500 publicly available resumes and
500 job descriptions. We find that the MTEs are biased, significantly favoring
White-associated names in 85.1\% of cases and female-associated names in only
11.1\% of cases, with a minority of cases showing no statistically significant
differences. Further analyses show that Black males are disadvantaged in up to
100\% of cases, replicating real-world patterns of bias in employment settings,
and validate three hypotheses of intersectionality. We also find an impact of
document length as well as the corpus frequency of names in the selection of
resumes. These findings have implications for widely used AI tools that are
automating employment, fairness, and tech policy.
Authors' comments: To be published in Proceedings of the 2024 AAAI/ACM Conference on AI,
Ethics, and Society; code available at
https://github.com/kyrawilson/Resume-Screening-Bias
Mikel Williams-Lekuona, Georgina Cosma
In the field of Image-Text Retrieval (ITR), recent advancements have
leveraged large-scale Vision-Language Pretraining (VLP) for Fine-Grained (FG)
instance-level retrieval, achieving high accuracy at the cost of increased
computational complexity. For Coarse-Grained (CG) category-level retrieval,
prominent approaches employ Cross-Modal Hashing (CMH) to prioritise efficiency,
albeit at the cost of retrieval performance. Due to differences in
methodologies, FG and CG models are rarely compared directly within evaluations
in the literature, resulting in a lack of empirical data quantifying the
retrieval performance-efficiency tradeoffs between the two. This paper
addresses this gap by introducing the \texttt{FiCo-ITR} library, which
standardises evaluation methodologies for both FG and CG models, facilitating
direct comparisons. We conduct empirical evaluations of representative models
from both subfields, analysing precision, recall, and computational complexity
across varying data scales. Our findings offer new insights into the
performance-efficiency trade-offs between recent representative FG and CG
models, highlighting their respective strengths and limitations. These findings
provide the foundation necessary to make more informed decisions regarding
model selection for specific retrieval tasks and highlight avenues for future
research into hybrid systems that leverage the strengths of both FG and CG
approaches.
Authors' comments: 19 pages, submitted to International Journal of Multimedia
Information Retrieval
Manish Bhattarai, Javier E. Santos, Shawn Jones, Ayan Biswas, Boian Alexandrov, Daniel O'Malley
The advent of large language models (LLMs) has significantly advanced the
field of code translation, enabling automated translation between programming
languages. However, these models often struggle with complex translation tasks
due to inadequate contextual understanding. This paper introduces a novel
approach that enhances code translation through Few-Shot Learning, augmented
with retrieval-based techniques. By leveraging a repository of existing code
translations, we dynamically retrieve the most relevant examples to guide the
model in translating new code segments. Our method, based on
Retrieval-Augmented Generation (RAG), substantially improves translation
quality by providing contextual examples from which the model can learn in
real-time. We selected RAG over traditional fine-tuning methods due to its
ability to utilize existing codebases or a locally stored corpus of code, which
allows for dynamic adaptation to diverse translation tasks without extensive
retraining. Extensive experiments on diverse datasets with open LLM models such
as Starcoder, Llama3-70B Instruct, CodeLlama-34B Instruct, Granite-34B Code
Instruct, and Mixtral-8x22B, as well as commercial LLM models like GPT-3.5
Turbo and GPT-4o, demonstrate our approach's superiority over traditional
zero-shot methods, especially in translating between Fortran and C++. We also
explored varying numbers of shots (i.e., examples provided during inference;
specifically 1, 2, and 3 shots) and different embedding models for RAG,
including Nomic-Embed, Starencoder, and CodeBERT, to assess the robustness and
effectiveness of our approach.
Authors' comments: LLM for code translation
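The retrieval-augmented few-shot prompting described in this abstract can be sketched as follows. This is a minimal illustration under stated assumptions: the bag-of-tokens `embed` function is a toy stand-in for embedding models such as those named above, the prompt layout is invented for the example, and all function names are hypothetical.

```python
import math
from collections import Counter


def embed(code):
    """Toy stand-in for a code embedding model: whitespace token counts."""
    return Counter(code.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve_examples(query, pairs, k):
    """Return the k (source, target) pairs whose source is most similar to query."""
    q = embed(query)
    return sorted(pairs, key=lambda p: cosine(q, embed(p[0])), reverse=True)[:k]


def build_prompt(query, pairs, k=2):
    """Assemble a k-shot translation prompt from the retrieved examples."""
    parts = [f"Fortran:\n{src}\nC++:\n{tgt}\n" for src, tgt in retrieve_examples(query, pairs, k)]
    parts.append(f"Fortran:\n{query}\nC++:\n")
    return "\n".join(parts)


# Hypothetical repository of existing translation pairs.
corpus = [
    ("do i = 1, n", "for (int i = 1; i <= n; ++i)"),
    ("print *, x", "std::cout << x;"),
    ("real :: a", "float a;"),
]
prompt = build_prompt("do j = 1, m", corpus, k=1)
```

The resulting prompt would then be passed to an LLM. Changing k corresponds to the 1-, 2-, and 3-shot settings mentioned in the abstract.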
Neele Falk, Andreas Waldis, Iryna Gurevych
Argument retrieval is the task of finding relevant arguments for a given query. While existing approaches rely solely on the semantic alignment of queries and arguments, this first shared task on perspective argument retrieval incorporates perspectives during retrieval, accounting for latent influences in argumentation. We present a novel multilingual dataset covering demographic and socio-cultural (socio) variables, such as age, gender, and political attitude, representing minority and majority groups in society. We distinguish between three scenarios to explore how retrieval systems consider explicitly (in both query and corpus) and implicitly (only in query) formulated perspectives. This paper provides an overview of this shared task and summarizes the results of the six submitted systems. We find substantial challenges in incorporating perspectivism, especially when aiming for personalization based solely on the text of arguments without explicitly providing socio profiles. Moreover, retrieval systems tend to be biased towards the majority group but partially mitigate bias for the female gender. While we bootstrap perspective argument retrieval, further research is essential to optimize retrieval systems to facilitate personalization and reduce polarization.
Xin Zhang, Yanzhao Zhang, Dingkun Long, Wen Xie, Ziqi Dai, Jialong Tang, Huan Lin, Baosong Yang et al.
We present systematic efforts in building a long-context multilingual text
representation model (TRM) and reranker from scratch for text retrieval. We
first introduce a text encoder (base size) enhanced with RoPE and unpadding,
pre-trained in a native 8192-token context (longer than 512 of previous
multilingual encoders). Then we construct a hybrid TRM and a cross-encoder
reranker by contrastive learning. Evaluations show that our text encoder
outperforms the same-sized previous state-of-the-art XLM-R. Meanwhile, our TRM
and reranker match the performance of large-sized state-of-the-art BGE-M3
models and achieve better results on long-context retrieval benchmarks. Further
analysis demonstrates that our proposed models exhibit higher efficiency during
both training and inference. We believe their efficiency and effectiveness
could benefit various research and industrial applications.
Authors' comments: Camera-ready version of EMNLP 2024: Industry Track