Dongyu Ru, Lin Qiu, Xiangkun Hu, Tianhang Zhang, Peng Shi, Shuaichen Chang, Jiayang Cheng, Cunxiang Wang et al.
Although Retrieval-Augmented Generation (RAG) has shown promising capability
in leveraging external knowledge, a comprehensive evaluation of RAG systems is
still challenging due to the modular nature of RAG, evaluation of long-form
responses and reliability of measurements. In this paper, we propose a
fine-grained evaluation framework, RAGChecker, that incorporates a suite of
diagnostic metrics for both the retrieval and generation modules. Meta
evaluation verifies that RAGChecker has significantly better correlations with
human judgments than other evaluation metrics. Using RAGChecker, we evaluate 8
RAG systems and conduct an in-depth analysis of their performance, revealing
insightful patterns and trade-offs in the design choices of RAG architectures.
The metrics of RAGChecker can guide researchers and practitioners in developing
more effective RAG systems.
Authors' comments: Under Review
Jacob-Junqi Tian, Hao Yu, Yury Orlovskiy, Tyler Vergho, Mauricio Rivera, Mayank Goel, Zachary Yang, Jean-Francois Godbout et al.
This paper develops an agent-based automated fact-checking approach for
detecting misinformation. We demonstrate that combining a powerful LLM agent,
which does not have access to the internet for searches, with an online web
search agent yields better results than when each tool is used independently.
Our approach is robust across multiple models, outperforming alternatives and
increasing the macro F1 of misinformation detection by as much as 20 percent
compared to LLMs without search. We also conduct extensive analyses on the
sources our system leverages and their biases, decisions in the construction of
the system like the search tool and the knowledge base, the type of evidence
needed and its impact on the results, and other parts of the overall process.
By combining strong performance with in-depth understanding, we hope to provide
building blocks for future search-enabled misinformation mitigation systems.
Authors' comments: 1 main figure, 8 tables, 10 pages, 12 figures in Appendix, 7 tables
in Appendix. GitHub URL: https://github.com/ComplexData-MILA/webretrieval
Giuseppe De Gregorio, Simon Perrin, Rodrigo C. G. Pena, Isabelle Marthot-Santaniello, Harold Mouchère
The intersection of computer vision and machine learning has emerged as a promising avenue for advancing historical research, facilitating a more profound exploration of our past. However, the application of machine learning approaches in historical palaeography is often met with criticism due to their perceived "black box" nature. In response to this challenge, we introduce NeuroPapyri, an innovative deep learning-based model specifically designed for the analysis of images containing ancient Greek papyri. To address concerns related to transparency and interpretability, the model incorporates an attention mechanism. This attention mechanism not only enhances the model's performance but also provides a visual representation of the image regions that significantly contribute to the decision-making process. Specifically calibrated for processing images of papyrus documents with lines of handwritten text, the model utilizes individual attention maps to indicate the presence or absence of specific characters in the input image. This paper presents the NeuroPapyri model, including its architecture and training methodology. Results from the evaluation demonstrate NeuroPapyri's efficacy in document retrieval, showcasing its potential to advance the analysis of historical manuscripts.
Kaijing Ma, Han Fang, Xianghao Zang, Chao Ban, Lanxiang Zhou, Zhongjiang He, Yongxiang Li, Hao Sun et al.
Video Moment Retrieval, which aims to locate in-context video moments according to a natural language query, is an essential task for cross-modal grounding. Existing methods focus on enhancing the cross-modal interactions between all moments and the textual description for video understanding. However, constantly interacting with all locations is unreasonable because of uneven semantic distribution across the timeline and noisy visual backgrounds. This paper proposes a cross-modal Context Denoising Network (CDNet) for accurate moment retrieval by disentangling complex correlations and denoising irrelevant dynamics. Specifically, we propose a query-guided semantic disentanglement (QSD) to decouple video moments by estimating alignment levels according to the global and fine-grained correlation. A Context-aware Dynamic Denoisement (CDD) is proposed to enhance understanding of aligned spatial-temporal details by learning a group of query-relevant offsets. Extensive experiments on public benchmarks demonstrate that the proposed CDNet achieves state-of-the-art performance.
Lifeng Zhou, Yuke Li
In this paper, we propose a novel framework for speech-image retrieval. We utilize speech-image contrastive (SIC) learning tasks to align speech and image representations at a coarse level and speech-image matching (SIM) learning tasks to further refine the fine-grained cross-modal alignment. SIC and SIM learning tasks are jointly trained in a unified manner. To optimize the learning process, we utilize an embedding queue that facilitates efficient sampling of high-quality and diverse negative representations during SIC learning. Additionally, it enhances the learning of SIM tasks by effectively mining hard negatives based on contrastive similarities calculated in SIC tasks. To further optimize learning under noisy supervision, we incorporate momentum distillation into the training process. Experimental results show that our framework outperforms the state-of-the-art method by more than 4% in R@1 on two benchmark datasets for the speech-image retrieval tasks. Moreover, as observed in zero-shot experiments, our framework demonstrates excellent generalization capabilities.
Dattaraj Rao
Retrieval Augmented Generation (RAG) is the most popular pattern for modern Large Language Model (LLM) applications. RAG involves taking a user query and finding relevant paragraphs of context in a large corpus, typically captured in a vector database. Once the first level of search over the vector database has completed, the top n chunks of relevant text are included directly in the context and sent as a prompt to the LLM. The problem with this approach is that the quality of the text chunks depends on the effectiveness of the search. There is no strong post-processing after search to determine whether a chunk holds enough information to include in the prompt. In addition, chunks often contain conflicting information on the same subject, and the model has no prior knowledge of which chunk to prioritize when making a decision. Often, this leads the model to state that there are conflicting statements and that it cannot produce an answer. In this research, we propose a Bayesian approach to verify the quality of text chunks from the search results. Bayes' theorem relates the conditional probability of a hypothesis given evidence to the likelihood of that evidence and to prior probabilities. We propose that estimating the likelihood that a text chunk yields a quality answer, and combining it with a prior probability of the chunk's quality, can improve the overall quality of responses from RAG systems. We can use the LLM itself to obtain the likelihood of relevance of a context paragraph. For the prior probability of a text chunk, we use the page number in the parsed documents. The assumption is that paragraphs on earlier pages have a higher probability of containing findings and of being relevant to generalizing an answer.
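The Bayesian scoring idea in this abstract can be illustrated with a minimal Python sketch. Everything here is an assumption made for illustration and not the author's implementation: the `Chunk` type, the shape of the page-based prior (`page_prior` with a hypothetical `decay` parameter), and the hard-coded likelihood values, which in practice would come from asking the LLM how relevant a chunk is.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class Chunk:
    text: str
    page: int          # 1-based page number the chunk was parsed from
    likelihood: float  # assumed LLM relevance score in [0, 1]


def page_prior(page: int, decay: float = 0.1) -> float:
    """Hypothetical prior: chunks on earlier pages get a higher prior."""
    return 1.0 / (1.0 + decay * (page - 1))


def posterior_scores(chunks: list[Chunk]) -> list[float]:
    """Posterior is proportional to likelihood * prior, normalized over the chunks."""
    unnormalized = [c.likelihood * page_prior(c.page) for c in chunks]
    total = sum(unnormalized)
    if total == 0:
        return [0.0 for _ in chunks]
    return [u / total for u in unnormalized]


def rank_chunks(chunks: list[Chunk]) -> list[Chunk]:
    """Return the chunks sorted by descending posterior score."""
    scores = posterior_scores(chunks)
    order = sorted(range(len(chunks)), key=lambda i: scores[i], reverse=True)
    return [chunks[i] for i in order]
```

With this particular prior, a moderately relevant chunk on page 1 can outrank a more relevant chunk on page 10; whether that trade-off is desirable depends on how the prior is chosen.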
Renascence Tarafder Prapty, Ashish Kundu, Arun Iyengar
As the complexity of modern systems increases, so does the importance of assessing their security posture through effective vulnerability management and threat modeling techniques. One powerful tool in the arsenal of cybersecurity professionals is the attack graph, a representation of all potential attack paths within a system that an adversary might exploit to achieve a certain objective. Traditional methods of generating attack graphs involve expert knowledge, manual curation, and computational algorithms that might not cover the entire threat landscape due to the ever-evolving nature of vulnerabilities and exploits. This paper explores the approach of leveraging large language models (LLMs), such as ChatGPT, to automate the generation of attack graphs by intelligently chaining Common Vulnerabilities and Exposures (CVEs) based on their preconditions and effects. It also shows how to utilize LLMs to create attack graphs from threat reports.
Juexin Lin, Sachin Yadav, Feng Liu, Nicholas Rossi, Praveen Reddy Suram, Satya Chembolu, Prijith Chandran, Hrushikesh Mohapatra et al.
Embedding-based neural retrieval (EBR) is an effective search retrieval
method in product search for tackling the vocabulary gap between customer
search queries and products. The initial launch of our EBR system at Walmart
yielded significant gains in relevance and add-to-cart rates [1]. However,
despite EBR generally retrieving more relevant products for reranking, we have
observed numerous instances of relevance degradation. Enhancing retrieval
performance is crucial, as it directly influences product reranking and affects
the customer shopping experience. Factors contributing to these degradations
include false positives/negatives in the training data and the inability to
handle query misspellings. To address these issues, we present several
approaches to further strengthen the capabilities of our EBR model in terms of
retrieval relevance. We introduce a Relevance Reward Model (RRM) based on human
relevance feedback. We utilize RRM to remove noise from the training data and
distill it into our EBR model through a multi-objective loss. In addition, we
present techniques to increase the performance of our EBR model, such as
typo-aware training and semi-positive generation. The effectiveness of our EBR
is demonstrated through offline relevance evaluation, online AB tests, and
successful deployments to live production.
[1] Alessandro Magnani, Feng Liu, Suthee Chaidaroon, Sachin Yadav, Praveen
Reddy Suram, Ajit Puthenputhussery, Sijie Chen, Min Xie, Anirudh Kashi, Tony
Lee, et al. 2022. Semantic retrieval at Walmart. In Proceedings of the 28th ACM
SIGKDD Conference on Knowledge Discovery and Data Mining. 3495-3503.
Authors' comments: 8 pages, 3 figures, CIKM 2024
Ziyuan Zhuang, Zhiyang Zhang, Sitao Cheng, Fangkai Yang, Jia Liu, Shujian Huang, Qingwei Lin, Saravan Rajmohan et al.
Retrieval-augmented generation (RAG) methods encounter difficulties when
addressing complex questions like multi-hop queries. While iterative retrieval
methods improve performance by gathering additional information, current
approaches often rely on multiple calls of large language models (LLMs). In
this paper, we introduce EfficientRAG, an efficient retriever for multi-hop
question answering. EfficientRAG iteratively generates new queries without the
need for LLM calls at each iteration and filters out irrelevant information.
Experimental results demonstrate that EfficientRAG surpasses existing RAG
methods on three open-domain multi-hop question-answering datasets.
Authors' comments: 20 pages, 4 figures
Sophia Ho, Jinsol Park, Patrick Wang
We present CREST (Compact Retrieval-Based Speculative Decoding), a redesign of REST that allows it to be effectively "compacted". REST is a drafting technique for speculative decoding based on retrieving exact n-gram matches of the most recent n tokens generated by the target LLM from a datastore. The key idea of CREST is to only store a subset of the smallest and most common n-grams in the datastore with the hope of achieving comparable performance with less storage space. We found that storing a subset of n-grams both reduces storage space and improves performance. CREST matches REST's accepted token length with 10.6-13.5x less storage space and achieves a 16.5-17.1% higher acceptance length than REST using the same storage space on the HumanEval and MT Bench benchmarks.
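As a rough illustration of retrieval-based drafting with a compacted datastore, the Python sketch below counts short n-gram contexts and their next tokens, keeps only the most common contexts, and drafts a token from the longest matching suffix. It is a toy under stated assumptions, not the authors' REST or CREST implementation: the function names, the `max_n` and `keep` parameters, and the single-token draft are all hypothetical simplifications.

```python
from collections import Counter, defaultdict


def build_datastore(tokens, max_n=2, keep=1000):
    """Count (context n-gram -> next token) pairs; keep only the most common contexts."""
    context_counts = Counter()
    continuations = defaultdict(Counter)
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n):
            ctx = tuple(tokens[i:i + n])
            context_counts[ctx] += 1
            continuations[ctx][tokens[i + n]] += 1
    # Compaction step: drop every context outside the `keep` most frequent ones.
    kept = [ctx for ctx, _ in context_counts.most_common(keep)]
    return {ctx: continuations[ctx] for ctx in kept}


def draft_next(datastore, recent, max_n=2):
    """Return the most frequent continuation of the longest matching suffix, or None."""
    for n in range(min(max_n, len(recent)), 0, -1):
        ctx = tuple(recent[-n:])
        if ctx in datastore:
            return datastore[ctx].most_common(1)[0][0]
    return None
```

Lowering `keep` shrinks the datastore at the cost of more lookups that return `None`, which is the storage versus acceptance trade-off the abstract describes.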
Irina Nikishina, Özge Sevgili, Mahei Manhai Li, Chris Biemann, Martin Semmann
In this research, we develop a taxonomy to conceptualize a comprehensive overview of the constituting characteristics that define retrieval augmented generation (RAG) applications, facilitating the adoption of this technology for different application domains. To the best of our knowledge, no holistic RAG application taxonomies have been developed so far. We employ a method that is foreign to the ACL community and thus contribute to the set of methods for taxonomy creation. It comprises four iterative phases designed to refine and enhance our understanding and presentation of RAG's core dimensions. We have developed a total of five meta-dimensions and sixteen dimensions to comprehensively capture the concept of RAG applications. Thus, the taxonomy can be used to better understand RAG applications and to derive design knowledge for future solutions in specific application domains.
Yanfei Chen, Jinsung Yoon, Devendra Singh Sachan, Qingze Wang, Vincent Cohen-Addad, Mohammadhossein Bateni, Chen-Yu Lee, Tomas Pfister
Recent advances in large language models (LLMs) have enabled autonomous
agents with complex reasoning and task-fulfillment capabilities using a wide
range of tools. However, effectively identifying the most relevant tools for a
given task becomes a key bottleneck as the toolset size grows, hindering
reliable tool utilization. To address this, we introduce Re-Invoke, an
unsupervised tool retrieval method designed to scale effectively to large
toolsets without training. Specifically, we first generate a diverse set of
synthetic queries that comprehensively cover different aspects of the query
space associated with each tool document during the tool indexing phase.
Second, we leverage the LLM's query understanding capabilities to extract key
tool-related context and underlying intents from user queries during the
inference phase. Finally, we employ a novel multi-view similarity ranking
strategy based on intents to pinpoint the most relevant tools for each query.
Our evaluation demonstrates that Re-Invoke significantly outperforms
state-of-the-art alternatives in both single-tool and multi-tool scenarios, all
within a fully unsupervised setting. Notably, on the ToolE datasets, we achieve
a 20% relative improvement in nDCG@5 for single-tool retrieval and a 39%
improvement for multi-tool retrieval.
Authors' comments: EMNLP Findings 2024
Zhichun Wang, Xuan Chen
Entity Alignment (EA) aims to match equivalent entities in different Knowledge Graphs (KGs), which is essential for knowledge fusion and integration. Recently, embedding-based EA has attracted significant attention and many approaches have been proposed. Early approaches primarily focused on learning entity embeddings from the structural features of KGs, defined by relation triples. Later methods incorporated entities' names and attributes as auxiliary information to enhance embeddings for EA. However, these approaches often used different techniques to encode structural and attribute information, limiting their interaction and mutual enhancement. In this work, we propose a dense entity retrieval framework for EA, leveraging language models to uniformly encode various features of entities and facilitate nearest entity search across KGs. Alignment candidates are first generated through entity retrieval and are subsequently reranked to determine the final alignments. We conduct comprehensive experiments on both cross-lingual and monolingual EA datasets, demonstrating that our approach achieves state-of-the-art performance compared to existing EA methods.
Jagoda Wojcik, Jiaqi Jiang, Jiacheng Wu, Shan Luo
Cross-Modal Retrieval (CMR), which retrieves relevant items from one modality
(e.g., audio) given a query in another modality (e.g., visual), has undergone
significant advancements in recent years. This capability is crucial for robots
to integrate and interpret information across diverse sensory inputs. However,
the retrieval space in existing robotic CMR approaches often consists of only
one modality, which limits the robot's performance. In this paper, we propose a
novel CMR model that incorporates three different modalities, i.e., visual,
audio and tactile, for enhanced multi-modal object retrieval, named VAT-CMR.
In this model, multi-modal representations are first fused to provide a
holistic view of object features. To mitigate the semantic gaps between
representations of different modalities, a dominant modality is then selected
during the classification training phase to improve the distinctiveness of the
representations, so as to improve the retrieval performance. To evaluate our
proposed approach, we conducted a case study and the results demonstrate that
our VAT-CMR model surpasses competing approaches. Further, our proposed
dominant modality selection significantly enhances cross-retrieval accuracy.
Authors' comments: 7 pages, 6 figures, accepted to the 2024 IEEE/RSJ International
Conference on Intelligent Robots and Systems (IROS 2024)
Rubing Chen, Xulu Zhang, Jiaxin Wu, Wenqi Fan, Xiao-Yong Wei, Qing Li
This paper addresses the need for improved precision in existing knowledge-enhanced question-answering frameworks, specifically Retrieval-Augmented Generation (RAG) methods that primarily focus on enhancing recall. We propose a multi-layer knowledge pyramid approach within the RAG framework to achieve a better balance between precision and recall. The knowledge pyramid consists of three layers: Ontologies, Knowledge Graphs (KGs), and chunk-based raw text. We employ cross-layer augmentation techniques for comprehensive knowledge coverage and dynamic updates of the Ontology schema and instances. To ensure compactness, we utilize cross-layer filtering methods for knowledge condensation in KGs. Our approach, named PolyRAG, follows a waterfall model for retrieval, starting from the top of the pyramid and progressing down until a confident answer is obtained. We introduce two benchmarks for domain-specific knowledge retrieval, one in the academic domain and the other in the financial domain. The effectiveness of the methods has been validated through comprehensive experiments, in which they outperform 19 SOTA methods. An encouraging observation is that the proposed method augments GPT-4, providing a 395% F1 gain by improving its score from 0.1636 to 0.8109.
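The waterfall retrieval described in this abstract can be sketched in a few lines of Python. This is an illustrative assumption, not PolyRAG's code: the layer retrievers are hypothetical callables returning an `(answer, confidence)` pair, the `threshold` value is arbitrary, and falling back to the most confident answer when no layer clears the threshold is a design choice made here for the sketch.

```python
def waterfall_retrieve(query, layers, threshold=0.5):
    """Query layers top-down; stop at the first answer that is confident enough.

    `layers` is an ordered list of (name, retriever) pairs, where each
    retriever maps a query to an (answer, confidence) tuple.
    Returns (answer, confidence, layer_name).
    """
    best = (None, 0.0, None)
    for name, retriever in layers:
        answer, confidence = retriever(query)
        if confidence >= threshold:
            return answer, confidence, name
        # Remember the most confident answer seen so far as a fallback.
        if confidence > best[1]:
            best = (answer, confidence, name)
    return best
```

A caller would list the layers from the top of the pyramid down, for example ontology first, then knowledge graph, then raw text chunks, so that lower layers are consulted only when higher ones are not confident.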
Joon Soo Yoo, Mi Yeon Hong, Ji Won Heo, Kang Hoon Lee, Ji Won Yoon
Location-based services offer immense utility, but also pose significant
privacy risks. In response, we propose LocPIR, a novel framework using
homomorphic encryption (HE), specifically the TFHE scheme, to preserve user
location privacy when retrieving data from public clouds. Our system employs
TFHE's expertise in non-polynomial evaluations, crucial for comparison
operations. LocPIR showcases minimal client-server interaction, reduced memory
overhead, and efficient throughput. Performance tests confirm its computational
speed, making it a viable solution for practical scenarios, demonstrated via
application to a COVID-19 alert model. Thus, LocPIR effectively addresses
privacy concerns in location-based services, enabling secure data sharing from
the public cloud.
Authors' comments: Accepted at the IEEE International Conference on Advanced Video and
Signal-Based Surveillance (AVSS) 2024
Yuan Xia, Jingbo Zhou, Zhenhui Shi, Jun Chen, Haifeng Huang
The Retrieval-Augmented Language Model (RALM) has shown remarkable performance on knowledge-intensive tasks by incorporating external knowledge during inference, which mitigates the factual hallucinations inherent in large language models (LLMs). Despite these advancements, challenges persist in the implementation of RALMs, particularly concerning their reliability and traceability. Specifically, irrelevant document retrieval may result in unhelpful response generation or even degrade the performance of LLMs, while the lack of proper citations in generated outputs complicates efforts to verify the trustworthiness of the models. To this end, we propose a novel self-reasoning framework aimed at improving the reliability and traceability of RALMs, whose core idea is to leverage reasoning trajectories generated by the LLM itself. The framework involves constructing self-reasoning trajectories with three processes: a relevance-aware process, an evidence-aware selective process, and a trajectory analysis process. We have evaluated our framework across four public datasets (two short-form QA datasets, one long-form QA dataset, and one fact verification dataset) to demonstrate the superiority of our method, which outperforms existing state-of-the-art models and achieves performance comparable to GPT-4 while using only 2,000 training samples.
Hongming Tan, Shaoxiong Zhan, Hai Lin, Hai-Tao Zheng, Wai Kin Chan
In dense retrieval, embedding long texts into dense vectors can result in information loss, leading to inaccurate query-text matching. Additionally, low-quality texts with excessive noise or sparse key information are unlikely to align well with relevant queries. Recent studies mainly focus on improving the sentence embedding model or retrieval process. In this work, we introduce a novel text augmentation framework for dense retrieval. This framework transforms raw documents into information-dense text formats, which supplement the original texts to effectively address the aforementioned issues without modifying embedding or retrieval methodologies. Two text representations are generated via large language models (LLMs) zero-shot prompting: question-answer pairs and element-driven events. We term this approach QAEA-DR: unifying question-answer generation and event extraction in a text augmentation framework for dense retrieval. To further enhance the quality of generated texts, a scoring-based evaluation and regeneration mechanism is introduced in LLM prompting. Our QAEA-DR model has a positive impact on dense retrieval, supported by both theoretical analysis and empirical experiments.
Viktor Nikitin, Marcus Carlsson, Doga Gursoy, Rajmund Mokso, Peter Cloetens
In conventional tomographic reconstruction, the pre-processing step includes flat-field correction, where each sample projection on the detector is divided by a reference image taken without the sample. When using coherent X-rays as a probe, this approach overlooks the phase component of the illumination field (probe), leading to artifacts in phase-retrieved projection images, which are then propagated to the reconstructed 3D sample representation. The problem intensifies in nano-holotomography with focusing optics, which, due to various imperfections, create high-frequency components in the probe function. Here, we present a new iterative reconstruction scheme for holotomography that simultaneously retrieves the complex-valued probe function. Implemented on GPUs, this algorithm yields a 3D reconstruction that resolves layers twice as thin in a 3D ALD standard sample measured using nano-holotomography.
Mingming Li, Huimu Wang, Zuxu Chen, Guangtao Nie, Yiming Qiu, Guoyu Tang, Lin Liu, Jingwei Zhuo
Generative retrieval introduces a groundbreaking paradigm to document retrieval by directly generating the identifier of a pertinent document in response to a specific query. This paradigm has demonstrated considerable benefits and potential, particularly in representation and generalization capabilities, within the context of large language models. However, it faces significant challenges in E-commerce search scenarios, including the complexity of generating detailed item titles from brief queries, the presence of noise in item titles with weak language order, issues with long-tail queries, and the interpretability of results. To address these challenges, we have developed an innovative framework for E-commerce search, called generative retrieval with preference optimization. This framework is designed to effectively learn and align an autoregressive model with target data, subsequently generating the final item through constraint-based beam search. By employing multi-span identifiers to represent raw item titles and transforming the task of generating titles from queries into the task of generating multi-span identifiers from queries, we aim to simplify the generation process. The framework further aligns with human preferences using click data and employs a constrained search method to identify key spans for retrieving the final item, thereby enhancing result interpretability. Our extensive experiments show that this framework achieves competitive performance on a real-world dataset, and online A/B tests demonstrate the superiority and effectiveness in improving conversion gains.