Akshat Mohan Dasula, Hrushitha Tigulla, Preethika Bhukya
Traditionally in the domain of legal research, the retrieval of pertinent
citations from intricate case descriptions has demanded manual effort and
keyword-based search applications that mandate expertise in understanding legal
jargon. Legal case descriptions hold pivotal information for legal
professionals and researchers, necessitating more efficient and automated
approaches. We propose a methodology that combines natural language processing
(NLP) and machine learning techniques to enhance the organization and
utilization of legal case descriptions. This approach revolves around the
creation of textual embeddings with the help of state-of-the-art embedding models.
Our methodology addresses two primary objectives: unsupervised clustering and
supervised citation retrieval, both designed to automate the citation
extraction process. Although the proposed methodology can be used for any
dataset, we employed the Supreme Court of the United States (SCOTUS) dataset,
yielding remarkable results. Our methodology achieved an impressive accuracy
rate of 90.9%. By automating labor-intensive processes, we pave the way for a
more efficient, time-saving, and accessible landscape in legal research,
benefiting legal professionals, academics, and researchers.
Authors' comments: 14 pages, 16 images, Submitted to Multimedia Tools and Applications
Springer journal
Yu Wang, Nedim Lipka, Ruiyi Zhang, Alexa Siu, Yuying Zhao, Bo Ni, Xin Wang, Ryan Rossi et al.
Despite the impressive advancements of Large Language Models (LLMs) in generating text, they are often limited by the knowledge contained in the input and prone to producing inaccurate or hallucinated content. To tackle these issues, Retrieval-augmented Generation (RAG) is employed as an effective strategy to enhance the available knowledge base and anchor the responses in reality by pulling additional texts from external databases. In real-world applications, texts are often linked through entities within a graph, such as citations in academic papers or comments in social networks. This paper exploits these topological relationships to guide the retrieval process in RAG. Specifically, we explore two kinds of topological connections: proximity-based, focusing on closely connected nodes, and role-based, which looks at nodes sharing similar subgraph structures. Our empirical research confirms their relevance to text relationships, leading us to develop a Topology-aware Retrieval-augmented Generation framework. This framework includes a retrieval module that selects texts based on their topological relationships and an aggregation module that integrates these texts into prompts to stimulate LLMs for text generation. We have curated established text-attributed networks and conducted comprehensive experiments to validate the effectiveness of this framework, demonstrating its potential to enhance RAG with topological awareness.
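As a rough illustration of the proximity-based retrieval idea, the sketch below collects the texts attached to nodes within a fixed number of hops of a query node and joins them into a prompt. The toy graph, the hop limit, the function names `retrieve_by_proximity` and `build_prompt`, and the prompt template are all illustrative assumptions, not the framework described in the paper.

```python
from collections import deque

def retrieve_by_proximity(adj, texts, start, max_hops=2, top_k=3):
    """Return texts of the nodes closest to `start`, ordered by hop count (BFS)."""
    seen = {start}
    queue = deque([(start, 0)])
    hits = []
    while queue:
        node, dist = queue.popleft()
        if node != start:
            hits.append((dist, node))
        if dist < max_hops:
            for nb in adj.get(node, []):
                if nb not in seen:
                    seen.add(nb)
                    queue.append((nb, dist + 1))
    hits.sort()  # closest nodes first, ties broken by node name
    return [texts[n] for _, n in hits[:top_k]]

def build_prompt(question, retrieved):
    """Aggregate retrieved texts into a single prompt (hypothetical template)."""
    context = "\n".join(f"- {t}" for t in retrieved)
    return f"Context:\n{context}\n\nQuestion: {question}"

# Toy citation graph (assumed data): p1 is linked to p2 and p3, and so on.
adj = {"p1": ["p2", "p3"], "p2": ["p1", "p4"], "p3": ["p1"],
       "p4": ["p2", "p5"], "p5": ["p4"]}
texts = {k: f"abstract of {k}" for k in adj}
neighbors = retrieve_by_proximity(adj, texts, "p1")
prompt = build_prompt("What does p1 build on?", neighbors)
```

Role-based retrieval would replace the hop count with a similarity between local subgraph structures; that part is not sketched here.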
Chong Xiang, Tong Wu, Zexuan Zhong, David Wagner, Danqi Chen, Prateek Mittal
Retrieval-augmented generation (RAG) has been shown vulnerable to retrieval corruption attacks: an attacker can inject malicious passages into retrieval results to induce inaccurate responses. In this paper, we propose RobustRAG as the first defense framework against retrieval corruption attacks. The key insight of RobustRAG is an isolate-then-aggregate strategy: we get LLM responses from each passage in isolation and then securely aggregate these isolated responses. To instantiate RobustRAG, we design keyword-based and decoding-based algorithms for securely aggregating unstructured text responses. Notably, RobustRAG can achieve certifiable robustness: we can formally prove and certify that, for certain queries, RobustRAG can always return accurate responses, even when the attacker has full knowledge of our defense and can arbitrarily inject a small number of malicious passages. We evaluate RobustRAG on open-domain QA and long-form text generation datasets and demonstrate its effectiveness and generalizability across various tasks and datasets.
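The isolate-then-aggregate idea can be illustrated with a toy keyword vote. In the sketch below, `answer_from_passage` is a hypothetical stand-in for an LLM call that sees one passage at a time, and the keyword extraction and `min_support` threshold are simplifying assumptions. It is not the paper's algorithm and carries no robustness certificate.

```python
from collections import Counter

def answer_from_passage(question, passage):
    """Hypothetical stand-in for an LLM call that answers from ONE passage only."""
    return passage  # toy behaviour: echo the passage

def extract_keywords(response):
    """Crude keyword extraction: lowercase words longer than three characters."""
    return {w.strip(".,").lower() for w in response.split() if len(w) > 3}

def isolate_then_aggregate(question, passages, min_support):
    # Isolate: obtain one response per passage, never mixing passages.
    responses = [answer_from_passage(question, p) for p in passages]
    # Aggregate: count in how many isolated responses each keyword appears.
    counts = Counter()
    for r in responses:
        counts.update(extract_keywords(r))
    # Keep only keywords supported by enough independent responses.
    return sorted(k for k, c in counts.items() if c >= min_support)

passages = [
    "The capital of France is Paris.",
    "Paris is the capital city of France.",
    "France has its capital in Paris.",
    "The capital of France is Berlin.",  # injected malicious passage
]
keywords = isolate_then_aggregate(
    "What is the capital of France?", passages, min_support=3
)
```

Because each keyword needs support from several isolated responses, a single injected passage cannot push its own keyword over the threshold in this toy setting.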
Bill Psomas, Ioannis Kakogeorgiou, Nikos Efthymiadis, Giorgos Tolias, Ondrej Chum, Yannis Avrithis, Konstantinos Karantzalos
This work introduces composed image retrieval to remote sensing. It allows
querying a large image archive with image examples altered by a textual
description, enriching the descriptive power over unimodal queries, either
visual or textual. Various attributes can be modified by the textual part, such
as shape, color, or context. A novel method fusing image-to-image and
text-to-image similarity is introduced. We demonstrate that a vision-language
model possesses sufficient descriptive power and that no further learning step
or training data is necessary. We present a new evaluation benchmark focused on
color, context, density, existence, quantity, and shape modifications. Our work
not only sets the state-of-the-art for this task, but also serves as a
foundational step in addressing a gap in the field of remote sensing image
retrieval. Code at: https://github.com/billpsomas/rscir
Authors' comments: Accepted for ORAL presentation at the 2024 IEEE International
Geoscience and Remote Sensing Symposium
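One simple way to picture the fusion of image-to-image and text-to-image similarity is a weighted sum of normalized cosine scores. The sketch below assumes that embeddings from some vision-language model are already available as NumPy arrays; the min-max normalization, the weight `lam`, and the function names are assumptions for illustration and may differ from the fusion used in the paper.

```python
import numpy as np

def cosine_sim(query, gallery):
    """Cosine similarity between one query vector and each gallery row."""
    q = query / np.linalg.norm(query)
    g = gallery / np.linalg.norm(gallery, axis=1, keepdims=True)
    return g @ q

def minmax(x):
    """Rescale scores to [0, 1] so both modalities are comparable."""
    span = x.max() - x.min()
    return (x - x.min()) / span if span > 0 else np.zeros_like(x)

def composed_scores(img_query, txt_query, gallery, lam=0.5):
    """Weighted sum of image-to-image and text-to-image similarity."""
    s_img = minmax(cosine_sim(img_query, gallery))
    s_txt = minmax(cosine_sim(txt_query, gallery))
    return lam * s_img + (1 - lam) * s_txt

# Toy 2-D embeddings (assumed data): item 2 matches both the image and the text.
gallery = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
img_query = np.array([1.0, 0.0])
txt_query = np.array([0.0, 1.0])
ranking = np.argsort(-composed_scores(img_query, txt_query, gallery))
```

In this toy case the gallery item that partially matches both the query image and the text modifier is ranked first, ahead of the items that match only one of them.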
Laura Dietz
This resource paper addresses the challenge of evaluating Information
Retrieval (IR) systems in the era of autoregressive Large Language Models
(LLMs). Traditional methods relying on passage-level judgments are no longer
effective due to the diversity of responses generated by LLM-based systems. We
provide a workbench to explore several alternative evaluation approaches to
judge the relevance of a system's response that incorporate LLMs: 1. Asking an
LLM whether the response is relevant; 2. Asking the LLM which set of nuggets
(i.e., relevant key facts) is covered in the response; 3. Asking the LLM to
answer a set of exam questions with the response.
This workbench aims to facilitate the development of new, reusable test
collections. Researchers can manually refine sets of nuggets and exam
questions, observing their impact on system evaluation and leaderboard
rankings.
Resource available at https://github.com/TREMA-UNH/autograding-workbench
Authors' comments: 10 pages. To appear in the Resource & Reproducibility Track of SIGIR
2024
Sara Sarto, Marcella Cornia, Lorenzo Baraldi, Alessandro Nicolosi, Rita Cucchiara
The objective of image captioning models is to bridge the gap between the
visual and linguistic modalities by generating natural language descriptions
that accurately reflect the content of input images. In recent years,
researchers have leveraged deep learning-based models and made advances in the
extraction of visual features and the design of multimodal connections to
tackle this task. This work presents a novel approach towards developing image
captioning models that utilize an external kNN memory to improve the generation
process. Specifically, we propose two model variants that incorporate a
knowledge retriever component that is based on visual similarities, a
differentiable encoder to represent input images, and a kNN-augmented language
model to predict tokens based on contextual cues and text retrieved from the
external memory. We experimentally validate our approach on the COCO and nocaps
datasets and demonstrate that incorporating an explicit external memory can
significantly enhance the quality of captions, especially with a larger
retrieval corpus. This work provides valuable insights into retrieval-augmented
captioning models and opens up new avenues for improving image captioning at a
larger scale.
Authors' comments: ACM Transactions on Multimedia Computing, Communications and
Applications (2024)
Gengchen Wei, Xinle Pang, Tianning Zhang, Yu Sun, Xun Qian, Chen Lin, Han-Sen Zhong, Wanli Ouyang
With over 200 million published academic documents and millions of new documents being written each year, academic researchers face the challenge of searching for information within this vast corpus. However, existing retrieval systems struggle to understand the semantics and domain knowledge present in academic papers. In this work, we demonstrate that by utilizing large language models, a document retrieval system can achieve advanced semantic understanding capabilities, significantly outperforming existing systems. Our approach involves training the retriever and reranker using domain-specific data generated by large language models. Additionally, we utilize large language models to identify candidates from the references of retrieved papers to further enhance the performance. We use a test set annotated by academic researchers in the fields of quantum physics and computer vision to evaluate our system's performance. The results show that our system, DocReLM, achieves a top-10 accuracy of 44.12% in computer vision, compared to Google Scholar's 15.69%, and 36.21% in quantum physics, compared to Google Scholar's 12.96%.
Jean-Jacques Godeme, Jalal Fadili, Claude Amra, Myriam Zerrad
In this paper, we aim to reconstruct an n-dimensional real vector from m phaseless measurements corrupted by additive noise. We extend the noiseless framework developed in [15], based on mirror descent (or Bregman gradient descent), to deal with noisy measurements and prove that the procedure is stable to (small enough) additive noise. In the deterministic case, we show that mirror descent converges to a critical point of the phase retrieval problem, and if the algorithm is well initialized and the noise is small enough, the critical point is near the true vector up to a global sign change. When the measurements are i.i.d. Gaussian and the signal-to-noise ratio is large enough, we provide global convergence guarantees that ensure that with high probability, mirror descent converges to a global minimizer near the true vector (up to a global sign change), as soon as the number of measurements m is large enough. The sample complexity bound can be improved if a spectral method is used to provide a good initial guess. We complement our theoretical study with several numerical results showing that mirror descent is both a computationally and statistically efficient scheme to solve the phase retrieval problem.
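To make the mirror descent (Bregman gradient descent) step concrete, here is a minimal sketch for real-valued phase retrieval. It assumes the quartic least-squares objective f(x) = (1/(4m)) sum_i ((a_i^T x)^2 - y_i)^2 and the kernel psi(x) = ||x||^4/4 + ||x||^2/2, a common choice for quartic problems; the abstract does not state which kernel, step size, or initialization the authors use, so these, along with all function names and the toy data, are assumptions.

```python
import numpy as np

def objective(x, A, y):
    """f(x) = (1/(4m)) * sum_i ((a_i^T x)^2 - y_i)^2."""
    return np.mean(((A @ x) ** 2 - y) ** 2) / 4.0

def gradient(x, A, y):
    z = A @ x
    return A.T @ ((z ** 2 - y) * z) / len(y)

def grad_psi(x):
    """Mirror map for the assumed kernel psi(x) = ||x||^4/4 + ||x||^2/2."""
    return (x @ x + 1.0) * x

def inv_grad_psi(p):
    """Invert the mirror map: x = t*p where c*t^3 + t - 1 = 0, c = ||p||^2."""
    c = p @ p
    lo, hi = 0.0, 1.0  # the cubic is increasing, with its root in [0, 1]
    for _ in range(80):  # plain bisection
        mid = 0.5 * (lo + hi)
        if c * mid ** 3 + mid - 1.0 > 0.0:
            hi = mid
        else:
            lo = mid
    return 0.5 * (lo + hi) * p

def mirror_descent(A, y, x0, n_iter=200):
    # Conservative step from an assumed relative-smoothness bound for this kernel.
    row_norm2 = np.sum(A ** 2, axis=1)
    L = np.mean(3.0 * row_norm2 ** 2 + row_norm2 * np.abs(y))
    step = 0.99 / L
    x = x0.copy()
    for _ in range(n_iter):
        # Bregman step: grad_psi(x_next) = grad_psi(x) - step * grad f(x)
        x = inv_grad_psi(grad_psi(x) - step * gradient(x, A, y))
    return x

# Toy problem (assumed data): Gaussian measurements with small additive noise.
rng = np.random.default_rng(0)
n, m = 5, 60
x_true = rng.standard_normal(n)
A = rng.standard_normal((m, n))
y = (A @ x_true) ** 2 + 0.01 * rng.standard_normal(m)
x0 = rng.standard_normal(n)
x_hat = mirror_descent(A, y, x0)
```

With a random start, this sketch is only expected to decrease the objective; recovering `x_true` up to sign generally requires a good initialization or enough measurements, as the abstract discusses.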
Abhishek Divekar, Greg Durrett
Large language models (LLMs) are versatile and can address many tasks, but for computational efficiency, it is often desirable to distill their capabilities into smaller student models. One way to do this for classification tasks is via dataset synthesis, which can be accomplished by generating examples of each label from the LLM. Prior approaches to synthesis use few-shot prompting, which relies on the LLM's parametric knowledge to generate usable examples. However, this leads to issues of repetition, bias towards popular entities, and stylistic differences from human text. In this work, we propose Synthesize by Retrieval and Refinement (SynthesizRR), which uses retrieval augmentation to introduce variety into the dataset synthesis process: as retrieved passages vary, the LLM is "seeded" with different content to generate its examples. We empirically study the synthesis of six datasets, covering topic classification, sentiment analysis, tone detection, and humor, requiring complex synthesis strategies. We find SynthesizRR greatly improves lexical and semantic diversity, similarity to human-written text, and distillation performance, when compared to standard 32-shot prompting and six baseline approaches.
Dominik Farhan
Entity linking (EL) is the computational process of connecting textual
mentions to corresponding entities. Like many areas of natural language
processing, the EL field has greatly benefited from deep learning, leading to
significant performance improvements. However, present-day approaches are
expensive to train and rely on diverse data sources, complicating their
reproducibility. In this thesis, we develop multiple systems that are fast to
train, demonstrating that competitive entity linking can be achieved without a
large GPU cluster. Moreover, we train on a publicly available dataset, ensuring
reproducibility and accessibility. Our models are evaluated on 9 languages,
giving an accurate overview of their strengths. Furthermore, we offer
a detailed analysis of bi-encoder training hyperparameters, a popular approach
in EL, to guide their informed selection. Overall, our work shows that building
competitive neural network based EL systems that operate in multiple languages
is possible even with limited resources, thus making EL more approachable.
Authors' comments: Bachelor's thesis, Charles University
Hossein A. Rahmani, Nick Craswell, Emine Yilmaz, Bhaskar Mitra, Daniel Campos
Test collections play a vital role in evaluation of information retrieval
(IR) systems. Obtaining a diverse set of user queries for test collection
construction can be challenging, and acquiring relevance judgments, which
indicate the appropriateness of retrieved documents to a query, is often costly
and resource-intensive. Generating synthetic datasets using Large Language
Models (LLMs) has recently gained significant attention in various
applications. In IR, while previous work exploited the capabilities of LLMs to
generate synthetic queries or documents to augment training data and improve
the performance of ranking models, using LLMs for constructing synthetic test
collections is relatively unexplored. Previous studies demonstrate that LLMs
have the potential to generate synthetic relevance judgments for use in the
evaluation of IR systems. In this paper, we comprehensively investigate whether
it is possible to use LLMs to construct fully synthetic test collections by
generating not only synthetic judgments but also synthetic queries. In
particular, we analyse whether it is possible to construct reliable synthetic
test collections and the potential risks of bias such test collections may
exhibit towards LLM-based models. Our experiments indicate that using LLMs it
is possible to construct synthetic test collections that can reliably be used
for retrieval evaluation.
Authors' comments: SIGIR 2024
Lazaro Janier Gonzalez-Soler, Maciej Salwowski, Christian Rathgeb, Daniel Fischer
Tattoos have been used effectively as soft biometrics to assist law
enforcement in the identification of offenders and victims, as they contain
discriminative information, and are a useful indicator to locate members of a
criminal gang or organisation. Due to various privacy issues in the acquisition
of images containing tattoos, only a limited number of databases exist. This
lack of databases has delayed the development of new methods to effectively
retrieve a potential suspect's tattoo images from a candidate gallery. To
mitigate this issue, in our work, we use an unsupervised generative approach to
create a balanced database consisting of 28,550 semi-synthetic images with
tattooed subjects from 571 tattoo categories. Further, we introduce a novel
Tattoo Template Reconstruction Network (TattTRN), which learns to map the input
tattoo sample to its respective tattoo template to enhance the distinguishing
attributes of the final feature embedding. Experimental results with real data,
i.e., WebTattoo and BIVTatt databases, demonstrate the soundness of the
presented approach: an accuracy of up to 99% is achieved when checking at most
the first 20 entries of the candidate list.
Authors' comments: Accepted at CVPR Workshop 2024
Juhwan Lee, Jisu Kim
This study addresses the hallucination problem in large language models (LLMs). We adopted Retrieval-Augmented Generation (RAG) (Lewis et al., 2020), a technique that involves embedding relevant information in the prompt to obtain accurate answers. However, RAG also faced inherent issues in retrieving correct information. To address this, we employed the Dense Passage Retrieval (DPR) (Karpukhin et al., 2020) model for fetching domain-specific documents related to user queries. Despite this, the DPR model still lacked accuracy in document retrieval. We enhanced the DPR model by incorporating control tokens, achieving significantly superior performance over the standard DPR model, with a 13% improvement in Top-1 accuracy and a 4% improvement in Top-20 accuracy.
Xin Du, Lixin Xiu, Kumiko Tanaka-Ishii
We apply an information-theoretic perspective to reconsider generative
document retrieval (GDR), in which a document $x \in X$ is indexed by $t \in
T$, and a neural autoregressive model is trained to map queries $Q$ to $T$. GDR
can be considered to involve information transmission from documents $X$ to
queries $Q$, with the requirement to transmit more bits via the indexes $T$. By
applying Shannon's rate-distortion theory, the optimality of indexing can be
analyzed in terms of the mutual information, and the design of the indexes $T$
can then be regarded as a bottleneck in GDR. After reformulating GDR from
this perspective, we empirically quantify the bottleneck underlying GDR.
Finally, using the NQ320K and MARCO datasets, we evaluate our proposed
bottleneck-minimal indexing method in comparison with various previous indexing
methods, and we show that it outperforms those methods.
Authors' comments: Accepted for ICML 2024
Hao Yu, Aoran Gan, Kai Zhang, Shiwei Tong, Qi Liu, Zhaofeng Liu
Retrieval-Augmented Generation (RAG) has recently gained traction in natural language processing. Numerous studies and real-world applications are leveraging its ability to enhance generative models through external information retrieval. Evaluating these RAG systems, however, poses unique challenges due to their hybrid structure and reliance on dynamic knowledge sources. To better understand these challenges, we conduct A Unified Evaluation Process of RAG (Auepora) and aim to provide a comprehensive overview of the evaluation and benchmarks of RAG systems. Specifically, we examine and compare several quantifiable metrics of the Retrieval and Generation components, such as relevance, accuracy, and faithfulness, within the current RAG benchmarks, encompassing the possible output and ground truth pairs. We then analyze the various datasets and metrics, discuss the limitations of current benchmarks, and suggest potential directions to advance the field of RAG benchmarks.
Yong Guan, Dingxiao Liu, Jinchen Ma, Hao Peng, Xiaozhi Wang, Lei Hou, Ru Li
Generative document retrieval, an emerging paradigm in information retrieval,
learns to build connections between documents and identifiers within a single
model, garnering significant attention. However, there are still two
challenges: (1) neglecting inner-content correlation during document
representation; (2) lacking explicit semantic structure during identifier
construction. Events, however, have rich relations and a well-defined
taxonomy, which could help address these two challenges. Inspired
by this, we propose Event GDR, an event-centric generative document retrieval
model, integrating event knowledge into this task. Specifically, we utilize an
exchange-then-reflection method based on multi-agents for event knowledge
extraction. For document representation, we employ events and relations to
model the document to guarantee the comprehensiveness and inner-content
correlation. For identifier construction, we map the events to a well-defined
event taxonomy to construct the identifiers with explicit semantic structure.
Our method achieves significant improvement over the baselines on two datasets,
and we hope it provides insights for future research.
Authors' comments: Accepted to WWW 2024
Eugene Yang
High Recall Retrieval (HRR), such as eDiscovery and medical systematic
review, is a search problem that optimizes the cost of retrieving most of the relevant
documents in a given collection. Iterative approaches, such as iterative
relevance feedback and uncertainty sampling, are shown to be effective under
various operational scenarios. Despite neural models demonstrating success in
other text-related tasks, linear models such as logistic regression, in
general, are still more effective and efficient in HRR since the model is
trained and retrieves documents from the same fixed collection. In this work,
we leverage SPLADE, an efficient retrieval model that transforms documents into
contextualized sparse vectors, for HRR. Our approach combines the best of both
worlds, leveraging both the contextualization from pretrained language models
and the efficiency of linear models. It reduces the review cost by 10% and 18%
in two HRR evaluation collections under a one-phase review workflow with a
target recall of 80%. The experiment is implemented with TARexp and is
available at https://github.com/eugene-yang/LSR-for-TAR.
Authors' comments: 5 pages, 1 figure, accepted at SIGIR 2024 as short paper
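The one-phase review workflow with iterative relevance feedback can be sketched as a simulation loop: train a linear model on the documents reviewed so far, review the top-scoring unreviewed batch, and repeat until the target recall is reached. The sketch below uses a hand-rolled logistic regression on random toy features rather than SPLADE vectors or TARexp, and it stops using oracle knowledge of the total number of relevant documents, which is only available in simulation. All names and data are assumptions.

```python
import numpy as np

def fit_logreg(X, y, n_steps=200, lr=0.5):
    """Plain gradient-descent logistic regression (no bias term, no regularizer)."""
    w = np.zeros(X.shape[1])
    for _ in range(n_steps):
        p = 1.0 / (1.0 + np.exp(-(X @ w)))
        w -= lr * X.T @ (p - y) / len(y)
    return w

def one_phase_review(X, labels, seed_idx, batch=10, target_recall=0.8):
    """Review top-scoring documents in batches until the target recall is met."""
    reviewed = list(seed_idx)
    total_rel = labels.sum()
    while labels[reviewed].sum() < target_recall * total_rel:
        w = fit_logreg(X[reviewed], labels[reviewed])
        scores = X @ w
        unreviewed = np.setdiff1d(np.arange(len(labels)), reviewed)
        top = unreviewed[np.argsort(-scores[unreviewed])[:batch]]
        reviewed.extend(top.tolist())  # the reviewer labels this batch
    return reviewed

# Toy collection (assumed data): about 10% relevant, feature 0 carries the signal.
rng = np.random.default_rng(0)
labels = (rng.random(500) < 0.1).astype(float)
X = rng.random((500, 20)) * 0.1
X[:, 0] += labels
seed = [int(np.flatnonzero(labels == 1)[0]), int(np.flatnonzero(labels == 0)[0])]
reviewed = one_phase_review(X, labels, seed)
recall = labels[reviewed].sum() / labels.sum()
```

Here the review cost is simply `len(reviewed)`; a cost reduction such as the one reported would correspond to reaching the same recall with fewer reviewed documents.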
Eugene Yang, Thomas Jänich, James Mayfield, Dawn Lawrie
Multilingual information retrieval (MLIR) considers the problem of ranking
documents in several languages for a query expressed in a language that may
differ from any of those languages. Recent work has observed that approaches
such as combining ranked lists representing a single document language each or
using multilingual pretrained language models demonstrate a preference for one
language over others. This results in systematic unfair treatment of documents
in different languages. This work proposes a language fairness metric to
evaluate whether documents across different languages are fairly ranked through
statistical equivalence testing using the Kruskal-Wallis test. In contrast to
most prior work in group fairness, we do not consider any language to be an
unprotected group. Thus our proposed measure, PEER (Probability of
Equal Expected Rank), is the first fairness metric specifically designed to
capture the language fairness of MLIR systems. We demonstrate the behavior of
PEER on artificial ranked lists. We also evaluate real MLIR systems on two
publicly available benchmarks and show that the PEER scores align with prior
analytical findings on MLIR fairness. Our implementation is compatible with
ir-measures and is available at http://github.com/hltcoe/peer_measure.
Authors' comments: 5 pages, 1 figure, accepted at SIGIR 2024 as short paper
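The Kruskal-Wallis test behind this kind of fairness check compares the rank positions that each language's documents receive. The sketch below computes only the H statistic for a single ranked list with distinct positions (so no tie correction is needed); it is not the PEER measure itself, and the helper names and toy lists are assumptions.

```python
def kruskal_wallis_h(groups):
    """Kruskal-Wallis H for groups of distinct values (no tie correction)."""
    pooled = sorted(v for g in groups for v in g)
    rank_of = {v: i + 1 for i, v in enumerate(pooled)}
    n_total = len(pooled)
    h = 0.0
    for g in groups:
        rank_sum = sum(rank_of[v] for v in g)
        h += rank_sum ** 2 / len(g)
    return 12.0 / (n_total * (n_total + 1)) * h - 3.0 * (n_total + 1)

def ranks_by_language(ranked_langs):
    """Group 1-based rank positions by the language of the document at each rank."""
    groups = {}
    for position, lang in enumerate(ranked_langs, start=1):
        groups.setdefault(lang, []).append(position)
    return list(groups.values())

# Toy ranked lists (assumed data): language of the document at each rank.
interleaved = ["en", "zh", "ru"] * 3
segregated = ["en"] * 3 + ["zh"] * 3 + ["ru"] * 3
h_interleaved = kruskal_wallis_h(ranks_by_language(interleaved))
h_segregated = kruskal_wallis_h(ranks_by_language(segregated))
```

For these toy lists H is about 0.8 when languages are interleaved and about 7.2 when each language occupies its own block of ranks. A larger H gives stronger evidence that the languages are not ranked equivalently; turning H into a p-value requires the chi-square distribution with one fewer degrees of freedom than the number of languages.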
Hao-Cheng Lo, Jung-Mei Chu, Jieh Hsiang, Chun-Chieh Cho
In patent prosecution, image-based retrieval systems for identifying
similarities between current patent images and prior art are pivotal to ensure
the novelty and non-obviousness of patent applications. Despite their growing
popularity in recent years, existing attempts, while effective at recognizing
images within the same patent, fail to deliver practical value due to their
limited generalizability in retrieving relevant prior art. Moreover, this task
inherently involves the challenges posed by the abstract visual features of
patent images, the skewed distribution of image classifications, and the
semantic information of image descriptions. Therefore, we propose a
language-informed, distribution-aware multimodal approach to patent image
feature learning, which enriches the semantic understanding of patent images by
integrating Large Language Models and improves the performance of
underrepresented classes with our proposed distribution-aware contrastive
losses. Extensive experiments on the DeepPatent2 dataset show that our proposed
method achieves state-of-the-art or comparable performance in image-based
patent retrieval with mAP +53.3%, Recall@10 +41.8%, and MRR@10 +51.9%.
Furthermore, through an in-depth user analysis, we explore how our model aids
patent professionals in their image retrieval efforts, highlighting the model's
real-world applicability and effectiveness.
Authors' comments: 8 pages. Under review
Jiabao Wang, Yang Wu, Jun Wang, Ni Chen
The multi-plane phase retrieval method provides a budget-friendly and effective way to perform phase imaging, yet it often encounters alignment challenges due to shifts along the optical axis in experiments. Traditional methods, such as employing beamsplitters instead of mechanical stage movements or adjusting focus using tunable light sources, add complexity to the setup required for multi-plane phase retrieval. Attempts to address these issues computationally face difficulties due to the variable impact of diffraction, which renders conventional homography techniques inadequate. In our research, we introduce a novel Adaptive Cascade Calibrated (ACC) strategy for multi-plane phase retrieval that overcomes misalignment issues. This technique detects feature points within the refocused sample space and calculates the transformation matrix for neighboring planes on-the-fly to digitally adjust measurements, facilitating alignment-free multi-plane phase retrieval. This approach not only avoids the need for complex and expensive optical hardware but also simplifies the imaging setup, reducing overall costs. The effectiveness of our method is validated through simulations and real-world optical experiments.