Nicholas Pipitone, Ghita Houir Alami
Retrieval-Augmented Generation (RAG) systems are showing promising potential, and are becoming increasingly relevant in AI-powered legal applications. Existing benchmarks, such as LegalBench, assess the generative capabilities of Large Language Models (LLMs) in the legal domain, but there is a critical gap in evaluating the retrieval component of RAG systems. To address this, we introduce LegalBench-RAG, the first benchmark specifically designed to evaluate the retrieval step of RAG pipelines within the legal space. LegalBench-RAG emphasizes precise retrieval by focusing on extracting minimal, highly relevant text segments from legal documents. These highly relevant snippets are preferred over retrieving document IDs, or large sequences of imprecise chunks, both of which can exceed context window limitations. Long context windows cost more to process, induce higher latency, and lead LLMs to forget or hallucinate information. Additionally, precise results allow LLMs to generate citations for the end user. The LegalBench-RAG benchmark is constructed by retracing the context used in LegalBench queries back to their original locations within the legal corpus, resulting in a dataset of 6,858 query-answer pairs over a corpus of over 79M characters, entirely human-annotated by legal experts. We also introduce LegalBench-RAG-mini, a lightweight version for rapid iteration and experimentation. By providing a dedicated benchmark for legal retrieval, LegalBench-RAG serves as a critical tool for companies and researchers focused on enhancing the accuracy and performance of RAG systems in the legal domain. The LegalBench-RAG dataset is publicly available at https://github.com/zeroentropy-cc/legalbenchrag.
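One way to score the "minimal, highly relevant text segments" described above is character-level precision and recall over spans. The following is a minimal sketch under assumptions: the abstract does not state the benchmark's exact metric definitions, and the names `Span`, `merge_spans`, `overlap_chars`, and `span_precision_recall` are hypothetical.

```python
from typing import List, Tuple

# Assumption: a span is a half-open character range [start, end) in one document.
Span = Tuple[int, int]


def merge_spans(spans: List[Span]) -> List[Span]:
    """Sort spans and merge overlapping or touching ones so no character is counted twice."""
    merged: List[Span] = []
    for start, end in sorted(spans):
        if merged and start <= merged[-1][1]:
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


def overlap_chars(a: List[Span], b: List[Span]) -> int:
    """Count characters covered by both span lists (each list already merged)."""
    total = 0
    for a_start, a_end in a:
        for b_start, b_end in b:
            total += max(0, min(a_end, b_end) - max(a_start, b_start))
    return total


def span_precision_recall(retrieved: List[Span], relevant: List[Span]) -> Tuple[float, float]:
    """Character-level precision and recall of retrieved spans against annotated spans."""
    retrieved, relevant = merge_spans(retrieved), merge_spans(relevant)
    hit = overlap_chars(retrieved, relevant)
    retrieved_len = sum(end - start for start, end in retrieved)
    relevant_len = sum(end - start for start, end in relevant)
    precision = hit / retrieved_len if retrieved_len else 0.0
    recall = hit / relevant_len if relevant_len else 0.0
    return precision, recall
```

Under this scoring, retrieving one large imprecise chunk keeps recall high but lowers precision, which is the trade-off the abstract argues a retrieval benchmark should expose.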
Xiao Wang, Yuehang Li, Fuling Wang, Shiao Wang, Chuanfu Li, Bo Jiang
Inspired by the tremendous success of Large Language Models (LLMs), existing
X-ray medical report generation methods attempt to leverage large models to
achieve better performance. They usually adopt a Transformer to extract the
visual features of a given X-ray image and then feed them into the LLM for
text generation. How to extract more effective information for the LLMs to help
them improve final results is an urgent problem that needs to be solved.
Additionally, the use of visual Transformer models also brings high
computational complexity. To address these issues, this paper proposes a novel
context-guided efficient X-ray medical report generation framework.
Specifically, we introduce Mamba as the vision backbone with linear
complexity, and the performance obtained is comparable to that of the strong
Transformer model. More importantly, we perform context retrieval from the
training set for samples within each mini-batch during the training phase,
utilizing both positively and negatively related samples to enhance feature
representation and discriminative learning. Subsequently, we feed the vision
tokens, context information, and prompt statements to invoke the LLM for
generating high-quality medical reports. Extensive experiments on three X-ray
report generation datasets (i.e., IU-Xray, MIMIC-CXR, CheXpert Plus) fully
validated the effectiveness of our proposed model. The source code of this work
will be released at https://github.com/Event-AHU/Medical_Image_Analysis.
Authors' comments: In Peer Review
Haijin Wang, Zheng Chen, Nan Shang, Shangheng Yao, Zibin Pan, Fushuan Wen, Junhua Zhao
Carbon footprint accounting (CFA) is crucial for quantifying greenhouse gas emissions and achieving carbon neutrality. The dynamic nature of processes, accounting rules, carbon-related policies, and energy supply structures necessitates real-time updates of CFA. Traditional life cycle assessment methods rely heavily on human expertise, making near-real-time updates challenging. This paper introduces a novel approach integrating large language models (LLMs) with retrieval-augmented generation (RAG) technology to enhance the real-time, professional, and economical aspects of carbon footprint information retrieval and analysis. By leveraging LLMs' logical and language understanding abilities and RAG's efficient retrieval capabilities, the proposed method LLMs-RAG-CFA can retrieve more relevant professional information to assist LLMs, enhancing the model's generative abilities. This method offers broad professional coverage, efficient real-time carbon footprint information acquisition and accounting, and cost-effective automation without frequent updates to LLM parameters. Experimental results across five industries (primary aluminum, lithium battery, photovoltaic, new energy vehicles, and transformers) demonstrate that the LLMs-RAG-CFA method outperforms traditional methods and other LLMs, achieving higher information retrieval rates and significantly lower information deviations and carbon footprint accounting deviations. The economically viable design utilizes RAG technology to balance real-time updates with cost-effectiveness, providing an efficient, reliable, and cost-saving solution for real-time carbon emission management, thereby enhancing environmental sustainability practices.
Sebastian Heineking, Jonas Probst, Daniel Steinbach, Martin Potthast, Harrisen Scells
Evaluating the output of generative large language models (LLMs) is challenging and difficult to scale. Many evaluations of LLMs focus on tasks such as single-choice question-answering or text classification. These tasks are not suitable for assessing open-ended question-answering capabilities, which are critical in domains where expertise is required. One such domain is health, where misleading or incorrect answers can have a negative impact on a user's well-being. Using human experts to evaluate the quality of LLM answers is generally considered the gold standard, but expert annotation is costly and slow. We present a method for evaluating LLM answers that uses ranking models trained on annotated document collections as a substitute for explicit relevance judgements and apply it to the CLEF 2021 eHealth dataset. In a user study, our method correlates with the preferences of a human expert (Kendall's $\tau=0.64$). It is also consistent with previous findings in that the quality of generated answers improves with the size of the model and more sophisticated prompting strategies.
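The agreement statistic reported above, Kendall's tau, compares how two scorers order the same set of answers. The following is a minimal sketch of the tau-a variant, in which tied pairs count as neither concordant nor discordant. The abstract does not say which variant the authors used, so this choice is an assumption, and the function name `kendall_tau` is hypothetical.

```python
from itertools import combinations
from typing import Sequence


def kendall_tau(x: Sequence[float], y: Sequence[float]) -> float:
    """Kendall's tau-a between two score lists over the same items."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two equally long sequences with at least 2 items")
    concordant = discordant = 0
    for i, j in combinations(range(len(x)), 2):
        # Positive product: both scorers order the pair the same way.
        product = (x[i] - x[j]) * (y[i] - y[j])
        if product > 0:
            concordant += 1
        elif product < 0:
            discordant += 1
    n_pairs = len(x) * (len(x) - 1) // 2
    return (concordant - discordant) / n_pairs
```

Here `x` could hold the scores assigned by the ranking model and `y` the preferences of the human expert for the same answers. A value of 1 means identical orderings and -1 means fully reversed orderings.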
Zeyuan Chen, Haiyan Wu, Kaixin Wu, Wei Chen, Mingjie Zhong, Jia Xu, Zhongyi Liu, Wei Zhang
Relevance modeling is a critical component for enhancing user experience in search engines, with the primary objective of identifying items that align with users' queries. Traditional models only rely on the semantic congruence between queries and items to ascertain relevance. However, this approach represents merely one aspect of the relevance judgement, and is insufficient in isolation. Even powerful Large Language Models (LLMs) still cannot accurately judge the relevance of a query and an item from a semantic perspective. To augment LLMs-driven relevance modeling, this study proposes leveraging user interactions recorded in search logs to yield insights into users' implicit search intentions. The challenge lies in the effective prompting of LLMs to capture dynamic search intentions, which poses several obstacles in real-world relevance scenarios, i.e., the absence of domain-specific knowledge, the inadequacy of an isolated prompt, and the prohibitive costs associated with deploying LLMs. In response, we propose ProRBP, a novel Progressive Retrieved Behavior-augmented Prompting framework for integrating search scenario-oriented knowledge with LLMs effectively. Specifically, we perform the user-driven behavior neighbors retrieval from the daily search logs to obtain domain-specific knowledge in time, retrieving candidates that users consider to meet their expectations. Then, we guide LLMs for relevance modeling by employing advanced prompting techniques that progressively improve the outputs of the LLMs, followed by a progressive aggregation with comprehensive consideration of diverse aspects. For online serving, we have developed an industrial application framework tailored for the deployment of LLMs in relevance modeling. Experiments on real-world industry data and online A/B testing demonstrate our proposal achieves promising performance.
Geethan Sannidhi, Sagar Srinivas Sakhinana, Venkataramana Runkana
Pre-trained large language models (PLLMs) like OpenAI ChatGPT and Google
Gemini face challenges such as inaccurate factual recall, hallucinations,
biases, and future data leakage for temporal Knowledge Graph (tKG) forecasting.
To address these issues, we introduce sLA-tKGF (small-scale language assistant
for tKG forecasting), which utilizes Retrieval-Augmented Generation (RAG)
aided, custom-trained small-scale language models through a tabula rasa
approach from scratch for effective tKG forecasting. Our framework constructs
knowledge-infused prompts with relevant historical data from tKGs, web search
results, and PLLMs-generated textual descriptions to understand historical
entity relationships prior to the target time. It leverages these external
knowledge-infused prompts for deeper understanding and reasoning of
context-specific semantic and temporal information to zero-shot prompt
small-scale language models for more accurate predictions of future events
within tKGs. It reduces hallucinations and mitigates distributional shift
challenges through comprehending changing trends over time. As a result, it
enables more accurate and contextually grounded forecasts of future events
while minimizing computational demands. Rigorous empirical studies demonstrate
our framework's robustness, scalability, and state-of-the-art (SOTA) performance
on benchmark datasets with interpretable and trustworthy tKG forecasting.
Authors' comments: Paper was accepted at the ACM KDD 2024 Undergraduate Consortium.
Please find the link: https://kdd2024.kdd.org/undergraduate-consortium/
Kevin Jose Thomas
This paper introduces an open-source interface for American Sign Language
fingerspell recognition and semantic pose retrieval, aimed to serve as a
stepping stone towards more advanced sign language translation systems.
Utilizing a combination of convolutional neural networks and pose estimation
models, the interface provides two modular components: a recognition module for
translating ASL fingerspelling into spoken English and a production module for
converting spoken English into ASL pose sequences. The system is designed to be
highly accessible, user-friendly, and capable of functioning in real-time under
varying environmental conditions like backgrounds, lighting, skin tones, and
hand sizes. We discuss the technical details of the model architecture,
application in the wild, as well as potential future enhancements for
real-world consumer applications.
Authors' comments: 8 pages, 9 figures
Matt Langsenkamp, Bryan Amador, Richard Zanibbi
A Pyramidal Histogram Of Characters (PHOC) represents the spatial location of symbols as binary vectors. The vectors are composed of levels that split a formula into equal-sized regions of one or more types (e.g., rectangles or ellipses). For each region type, this produces a pyramid of overlapping regions, where the first level contains the entire formula, and the final level the finest-grained regions. In this work, we introduce concentric rectangles for regions, and analyze whether subsequent PHOC levels encode redundant information by omitting levels from PHOC configurations. As a baseline, we include a bag of words PHOC containing only the first whole-formula level. Finally, using the ARQMath-3 formula retrieval benchmark, we demonstrate that some levels encoded in the original PHOC configurations are redundant, that PHOC models with rectangular regions outperform earlier PHOC models, and that despite their simplicity, PHOC models are surprisingly competitive with the state-of-the-art. PHOC is not math-specific, and might be used for chemical diagrams, charts, or other graphics.
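The pyramid construction described above can be illustrated in one dimension. The following is a simplified sketch, not the authors' implementation. It assumes each symbol is given as a label with a horizontal extent normalized to [0, 1], that level n splits the formula into n equal-width vertical strips, and that a bit is set when a symbol's extent overlaps the strip. The name `phoc_vector` is hypothetical.

```python
from typing import List, Tuple

# Assumption: (label, x_min, x_max) with coordinates normalized to [0, 1].
Symbol = Tuple[str, float, float]


def phoc_vector(symbols: List[Symbol], vocabulary: List[str], levels: int) -> List[int]:
    """Concatenate one binary presence vector per region, for levels 1..levels."""
    vector: List[int] = []
    for level in range(1, levels + 1):
        for region in range(level):
            r_min, r_max = region / level, (region + 1) / level
            for label in vocabulary:
                # Bit is 1 if any symbol with this label overlaps the region.
                present = any(
                    s_label == label and s_min < r_max and s_max > r_min
                    for s_label, s_min, s_max in symbols
                )
                vector.append(1 if present else 0)
    return vector
```

With `levels=1` only the whole-formula region remains, which corresponds to the bag-of-words baseline mentioned in the abstract. Omitting a level from the loop is one way to test whether that level carries redundant information.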
Tongyoung Kim, Soojin Yoon, Seongku Kang, Jinyoung Yeo, Dongha Lee
Language Models (LMs) are increasingly employed in recommendation systems due to their advanced language understanding and generation capabilities. Recent recommender systems based on generative retrieval have leveraged the inferential abilities of LMs to directly generate the index tokens of the next item, based on item sequences within the user's interaction history. Previous studies have mostly focused on item indices based solely on textual semantic or collaborative information. However, although the standalone effectiveness of these aspects has been demonstrated, the integration of this information has remained unexplored. Our in-depth analysis finds that there is a significant difference in the knowledge captured by the model from heterogeneous item indices and diverse input prompts, which can have a high potential for complementarity. In this paper, we propose SC-Rec, a unified recommender system that learns diverse preference knowledge from two distinct item indices and multiple prompt templates. Furthermore, SC-Rec adopts a novel reranking strategy that aggregates a set of ranking results, inferred based on different indices and prompts, to achieve the self-consistency of the model. Our empirical evaluation on three real-world datasets demonstrates that SC-Rec considerably outperforms the state-of-the-art methods for sequential recommendation, effectively incorporating complementary knowledge from varied outputs of the model.
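The abstract does not specify how SC-Rec aggregates the ranking results produced by different indices and prompts. As an illustration of the general idea only, the sketch below uses reciprocal rank fusion, a common rank-aggregation technique swapped in here as a stand-in. The function name `fuse_rankings` and the smoothing constant `k=60` are assumptions.

```python
from collections import defaultdict
from typing import Dict, List


def fuse_rankings(rankings: List[List[str]], k: int = 60) -> List[str]:
    """Reciprocal rank fusion: items ranked highly by many lists rise to the top."""
    scores: Dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, item in enumerate(ranking, start=1):
            scores[item] += 1.0 / (k + rank)
    # Sort by descending fused score; break ties by item id for determinism.
    return sorted(scores, key=lambda item: (-scores[item], item))
```

Each inner list would be the ranked item ids produced by one combination of item index and prompt template. Items that appear near the top across many such lists receive the highest fused score.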
Rong-Ching Chang, Jiawei Zhang
Despite advancements in Large Language Models (LLMs) and Retrieval-Augmented Generation (RAG) systems, their effectiveness is often hindered by a lack of integration with entity relationships and community structures, limiting their ability to provide contextually rich and accurate information retrieval for fact-checking. We introduce CommunityKG-RAG (Community Knowledge Graph-Retrieval Augmented Generation), a novel zero-shot framework that integrates community structures within Knowledge Graphs (KGs) with RAG systems to enhance the fact-checking process. Capable of adapting to new domains and queries without additional training, CommunityKG-RAG utilizes the multi-hop nature of community structures within KGs to significantly improve the accuracy and relevance of information retrieval. Our experimental results demonstrate that CommunityKG-RAG outperforms traditional methods, representing a significant advancement in fact-checking by offering a robust, scalable, and efficient solution.
Zhengyuan Zhu, Daniel Lee, Hong Zhang, Sai Sree Harsha, Loic Feujio, Akash Maharaj, Yunyao Li
Recent advancements in retrieval-augmented generation (RAG) have demonstrated
impressive performance in the question-answering (QA) task. However, most
previous works predominantly focus on text-based answers. While some studies
address multimodal data, they still fall short in generating comprehensive
multimodal answers, particularly for explaining concepts or providing
step-by-step tutorials on how to accomplish specific goals. This capability is
especially valuable for applications such as enterprise chatbots and settings
such as customer service and educational systems, where the answers are sourced
from multimodal data. In this paper, we introduce a simple and effective
framework named MuRAR (Multimodal Retrieval and Answer Refinement). MuRAR
enhances text-based answers by retrieving relevant multimodal data and refining
the responses to create coherent multimodal answers. This framework can be
easily extended to support multimodal answers in enterprise chatbots with
minimal modifications. Human evaluation results indicate that multimodal
answers generated by MuRAR are more useful and readable compared to plain text
answers.
Authors' comments: Accepted at COLING 2025
Yuhao Jia, Zile Wu, Shengao Yi, Yifei Sun, Xiao Huang
Urban forecasting has increasingly benefited from high-dimensional spatial data through two primary approaches: graph-based methods that rely on predefined spatial structures, and region-based methods that focus on learning expressive urban representations. Although these methods have laid a strong foundation, they either rely heavily on structured spatial data, struggle to adapt to task-specific dependencies, or fail to integrate holistic urban context. Moreover, no existing framework systematically integrates these two paradigms and overcomes their respective limitations. To address this gap, we propose a novel, unified framework for high-dimensional urban forecasting, composed of three key components: (1) the Urban Region Representation Module that organizes latent embeddings and semantic descriptions for each region, (2) the Task-aware Dependency Retrieval module that selects relevant context regions based on natural language prompts, and (3) the Prediction Module, exemplified by our proposed GeoTransformer architecture, which adopts a novel geospatial attention mechanism to incorporate spatial proximity and information entropy as priors. Our framework is modular, supports diverse representation methods and forecasting models, and can operate even with minimal input. Quantitative experiments and qualitative analysis across six urban forecasting tasks demonstrate strong task generalization and validate the framework's effectiveness.
Bruno Amaral Teixeira de Freitas, Roberto de Alencar Lotufo
This work presents Retail-GPT, an open-source RAG-based chatbot designed to
enhance user engagement in retail e-commerce by guiding users through product
recommendations and assisting with cart operations. The system is
cross-platform and adaptable to various e-commerce domains, avoiding reliance
on specific chat applications or commercial activities. Retail-GPT engages in
human-like conversations, interprets user demands, checks product availability,
and manages cart operations, aiming to serve as a virtual sales agent and test
the viability of such assistants across different retail businesses.
Authors' comments: 5 pages, 4 figures
Jinming Nian, Zhiyuan Peng, Qifan Wang, Yi Fang
In knowledge-intensive tasks such as open-domain question answering (OpenQA), large language models (LLMs) often struggle to generate factual answers, relying solely on their internal (parametric) knowledge. To address this limitation, Retrieval-Augmented Generation (RAG) systems enhance LLMs by retrieving relevant information from external sources, thereby positioning the retriever as a pivotal component. Although dense retrieval demonstrates state-of-the-art performance, its training poses challenges due to the scarcity of ground-truth evidence, largely attributed to the high costs of human annotation. In this paper, we propose W-RAG, a method that draws weak training signals from the downstream task (such as OpenQA) of an LLM, and fine-tunes the retriever to prioritize passages that most benefit the task. Specifically, we rerank the top-$k$ passages retrieved via BM25 by assessing the probability that the LLM will generate the correct answer for a question given each passage. The highest-ranking passages are then used as positive fine-tuning examples for dense retrieval. We conduct comprehensive experiments across four publicly available OpenQA datasets to demonstrate that our approach enhances both retrieval and OpenQA performance compared to baseline models, achieving results comparable to models fine-tuned with human-labeled data.
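The weak-labeling step described above can be sketched as follows. This is a minimal illustration under assumptions. The scoring callable stands in for the LLM's log-probability of generating the correct answer given a passage, and `toy_log_likelihood` is a placeholder that only checks for string containment so the example runs without a model. All names are hypothetical.

```python
from typing import Callable, List, Tuple

# Assumption: scorer(question, passage, answer) returns a log-likelihood-style score.
Scorer = Callable[[str, str, str], float]


def select_weak_positives(
    question: str,
    answer: str,
    passages: List[str],
    answer_log_likelihood: Scorer,
    top_n: int = 1,
) -> List[Tuple[str, str]]:
    """Rerank candidate passages by answer likelihood and keep the best as positives."""
    scored = [(answer_log_likelihood(question, p, answer), p) for p in passages]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [(question, passage) for _, passage in scored[:top_n]]


def toy_log_likelihood(question: str, passage: str, answer: str) -> float:
    """Placeholder scorer, not an LLM: rewards passages that contain the answer."""
    return 0.0 if answer.lower() in passage.lower() else -10.0
```

In the setting the abstract describes, `passages` would be the top-$k$ BM25 results, and the returned (question, passage) pairs would serve as positive examples for fine-tuning the dense retriever.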
John A. Kappelmeier, Ryan J. MacDonald, Nikole K. Lewis
Transmission spectroscopy is the most widely used technique for studying
exoplanet atmospheres. Since the planetary nightside faces the observer during
a transit, highly irradiated giant exoplanets with warm nightsides emit thermal
radiation that can contaminate transmission spectra. Observations of ultra-hot
Jupiters in the near- and mid-infrared with JWST are especially susceptible to
nightside contamination. However, nightside thermal emission is generally not
considered in atmospheric retrievals of exoplanet transmission spectra. Here,
we quantify the potential biases from neglecting nightside thermal emission in
multidimensional atmospheric retrievals of an ultra-hot Jupiter. Using
simulated JWST transmission spectra of the ultra-hot Jupiter WASP-33b (0.8-12
$\mu$m), we find that transmission spectra retrievals without nightside
emission can overestimate molecular abundances by almost an order-of-magnitude
and underestimate the dayside temperature by $\gtrsim$ 400 K. We show that a
modified retrieval prescription, including both transmitted light and nightside
thermal emission, correctly recovers the atmospheric properties and is favored
by Bayesian model comparisons. Nightside thermal contamination can be readily
implemented in retrieval models via a first-order approximation, and we provide
formulae to estimate whether this effect is likely to be significant for a
given planet. We recommend that nightside emission should be included as
standard practice when interpreting ultra-hot Jupiter transmission spectra with
JWST.
Authors' comments: 21 pages, 11 figures. Accepted for publication in ApJ
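The size of the nightside effect can be estimated with a simple dilution argument. Out of transit the observer sees the star plus the nightside; in transit the star is dimmed by the true depth while the nightside stays fully visible. The observed depth is therefore the true depth divided by (1 + F_night / F_star). The sketch below is not the paper's formulae. It assumes both bodies radiate as blackbodies and that the true depth equals the planet-to-star area ratio. The function names are hypothetical.

```python
import math

HC_OVER_K = 1.438776877e-2  # second radiation constant h*c/k_B, in m*K


def planck_ratio(wavelength_m: float, t_night: float, t_star: float) -> float:
    """Ratio B_lambda(t_night) / B_lambda(t_star) at a single wavelength."""
    x_night = HC_OVER_K / (wavelength_m * t_night)
    x_star = HC_OVER_K / (wavelength_m * t_star)
    return math.expm1(x_star) / math.expm1(x_night)


def diluted_depth(true_depth: float, wavelength_m: float, t_night: float, t_star: float) -> float:
    """Observed transit depth after dilution by nightside thermal emission."""
    # Assumption: true_depth = (Rp/Rs)^2, so the flux ratio is depth * Planck ratio.
    night_to_star_flux = true_depth * planck_ratio(wavelength_m, t_night, t_star)
    return true_depth / (1.0 + night_to_star_flux)
```

Because the Planck ratio grows toward longer wavelengths for a planet cooler than its star, the dilution is wavelength dependent, which is consistent with the abstract's point that near- and mid-infrared observations are the most susceptible.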
Tobias A. Opsahl
Despite recent success in natural language processing (NLP), fact
verification still remains a difficult task. Due to misinformation spreading
increasingly fast, attention has been directed towards automatically verifying
the correctness of claims. In the domain of NLP, this is usually done by
training supervised machine learning models to verify claims by utilizing
evidence from trustworthy corpora. We present efficient methods for verifying
claims on a dataset where the evidence is in the form of structured knowledge
graphs. We use the FactKG dataset, which is constructed from the DBpedia
knowledge graph extracted from Wikipedia. By simplifying the evidence retrieval
process, from fine-tuned language models to simple logical retrievals, we are
able to construct models that both require less computational resources and
achieve better test-set accuracy.
Authors' comments: 10 pages, 3 figures, appendix
Weijian Xie, Xuefeng Liang, Yuhui Liu, Kaihua Ni, Hong Cheng, Zetian Hu
Large Language Models (LLMs) have greatly contributed to the development of
adaptive intelligent agents and are positioned as an important way to achieve
Artificial General Intelligence (AGI). However, LLMs are prone to produce
factually incorrect information and often produce "phantom" content that
undermines their reliability, which poses a serious challenge for their
deployment in real-world scenarios. Enhancing LLMs by combining external
databases and information retrieval mechanisms is an effective path. To address
the above challenges, we propose a new approach called WeKnow-RAG, which
integrates Web search and Knowledge Graphs into a "Retrieval-Augmented
Generation (RAG)" system. First, the accuracy and reliability of LLM responses
are improved by combining the structured representation of Knowledge Graphs
with the flexibility of dense vector retrieval. WeKnow-RAG then utilizes
domain-specific knowledge graphs to satisfy a variety of queries and domains,
thereby improving performance on factual information and complex reasoning
tasks by employing multi-stage web page retrieval techniques using both sparse
and dense retrieval methods. Our approach effectively balances the efficiency
and accuracy of information retrieval, thus improving the overall retrieval
process. Finally, we also integrate a self-assessment mechanism for the LLM to
evaluate the trustworthiness of the answers it generates. Our approach proves
its outstanding effectiveness in a wide range of offline experiments and online
submissions.
Authors' comments: 8 pages, 2 figures, technical report for 3rd place in Task 3 of Meta
KDD Cup 2024 CRAG Challenge
Lifeng Zhou, Yuke Li, Rui Deng, Yuting Yang, Haoqi Zhu
The success of speech-image retrieval relies on establishing an effective
alignment between speech and image. Existing methods often model cross-modal
interaction through simple cosine similarity of the global feature of each
modality, which falls short in capturing fine-grained details within modalities.
To address this issue, we introduce an effective framework and a novel learning
task named cross-modal denoising (CMD) to enhance cross-modal interaction to
achieve finer-level cross-modal alignment. Specifically, CMD is a denoising
task designed to reconstruct semantic features from noisy features within one
modality by interacting with features from another modality. Notably, CMD operates
exclusively during model training and can be removed during inference without
adding extra inference time. The experimental results demonstrate that our
framework outperforms the state-of-the-art method by 2.0% in mean R@1 on the
Flickr8k dataset and by 1.7% in mean R@1 on the SpokenCOCO dataset for the
speech-image retrieval tasks. These experimental results validate
the efficiency and effectiveness of our framework.
Authors' comments: arXiv admin note: substantial text overlap with arXiv:2408.13119
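The baseline the abstract contrasts against, cosine similarity between global features, and the R@1 metric it reports can be sketched as follows. This is a minimal illustration under assumptions. It evaluates one retrieval direction only and assumes `queries[i]` is paired with `candidates[i]`. The function names are hypothetical.

```python
import math
from typing import List, Sequence


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity between two global feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def recall_at_k(queries: List[Sequence[float]], candidates: List[Sequence[float]], k: int = 1) -> float:
    """Fraction of queries whose paired candidate appears among the top-k matches."""
    hits = 0
    for i, query in enumerate(queries):
        ranked = sorted(
            range(len(candidates)),
            key=lambda j: cosine(query, candidates[j]),
            reverse=True,
        )
        if i in ranked[:k]:
            hits += 1
    return hits / len(queries)
```

A "mean R@1" figure would plausibly average this value over the speech-to-image and image-to-speech directions, though the abstract does not define it.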
Dayong Wu, Jiaqi Li, Baoxin Wang, Honghong Zhao, Siyuan Xue, Yanjie Yang, Zhijun Chang, Rui Zhang et al.
Large language models (LLMs) have shown remarkable achievements across various language tasks. To enhance the performance of LLMs in scientific literature services, we developed the scientific literature LLM (SciLit-LLM) through pre-training and supervised fine-tuning on scientific literature, building upon the iFLYTEK Spark LLM. Furthermore, we present a knowledge service system Spark Research Assistant (SparkRA) based on our SciLit-LLM. SparkRA is accessible online and provides three primary functions: literature investigation, paper reading, and academic writing. As of July 30, 2024, SparkRA has garnered over 50,000 registered users, with a total usage count exceeding 1.3 million.
Kaushik Rangadurai, Siyang Yuan, Minhui Huang, Yiqun Liu, Golnaz Ghasemiesfeh, Yunchen Pu, Haiyu Lu, Xingfeng He et al.
Retrieval, the initial stage of a recommendation system, is tasked with
down-selecting items from a pool of tens of millions of candidates to a few
thousand. Embedding Based Retrieval (EBR) has been a typical choice for this
problem, addressing the computational demands of deep neural networks across
vast item corpora. EBR utilizes Two Tower or Siamese Networks to learn
representations for users and items, and employs Approximate Nearest Neighbor
(ANN) search to efficiently retrieve relevant items. Despite its popularity in
industry, EBR faces limitations. The Two Tower architecture, relying on a
single dot product interaction, struggles to capture complex data distributions
due to limited capability in learning expressive interactions between users and
items. Additionally, ANN index building and representation learning for user
and item are often separate, leading to inconsistencies exacerbated by
representation (e.g. continuous online training) and item drift (e.g. items
expired and new items added). In this paper, we introduce the Hierarchical
Structured Neural Network (HSNN), an efficient deep neural network model to
learn intricate user and item interactions beyond the commonly used dot product
in retrieval tasks, achieving sublinear computational costs relative to corpus
size. A Modular Neural Network (MoNN) is designed to maintain high
expressiveness for interaction learning while ensuring efficiency. A mixture of
MoNNs operates on a hierarchical item index to achieve extensive computation
sharing, enabling it to scale up to large corpus sizes. MoNN and the
hierarchical index are jointly learnt to continuously adapt to distribution
shifts in both user interests and item distributions. HSNN achieves substantial
improvement in offline evaluation compared to prevailing methods.
Authors' comments: Resubmit
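For reference, the Two Tower baseline that the abstract argues is limited scores each item with a single dot product between a user embedding and an item embedding. The sketch below illustrates that baseline, not HSNN. It uses exact brute-force search where a production system would use an ANN index, and the name `retrieve_top_k` is hypothetical.

```python
import heapq
from typing import Dict, List, Sequence


def retrieve_top_k(user_vec: Sequence[float], item_vecs: Dict[str, Sequence[float]], k: int) -> List[str]:
    """Return the ids of the k items with the highest dot-product score."""

    def score(item_id: str) -> float:
        # The only user-item interaction in a two-tower model: one dot product.
        return sum(u * v for u, v in zip(user_vec, item_vecs[item_id]))

    return heapq.nlargest(k, item_vecs, key=score)
```

Since the score is one inner product, any richer interaction between user and item features has to be compressed into the two embeddings ahead of time, which is the expressiveness limit the abstract describes.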