Wanlong Liu, Enqi Zhang, Li Zhou, Dingyi Zeng, Shaohuan Cheng, Chen Zhang, Malu Zhang, Wenyu Chen
Recent works have demonstrated the effectiveness of retrieval augmentation in
the Event Argument Extraction (EAE) task. However, existing retrieval-based EAE
methods have two main limitations: (1) input length constraints and (2) the gap
between the retriever and the inference model. These issues limit the diversity
and quality of the retrieved information. In this paper, we propose a
Compressive Memory-based Retrieval (CMR) mechanism for EAE, which addresses the
two limitations mentioned above. Our compressive memory, designed as a dynamic
matrix that effectively caches retrieved information and supports continuous
updates, overcomes the limitations of the input length. Additionally, after
pre-loading all candidate demonstrations into the compressive memory, the model
further retrieves and filters relevant information from memory based on the
input query, bridging the gap between the retriever and the inference model.
Extensive experiments show that our method achieves new state-of-the-art
performance on three public datasets (RAMS, WikiEvents, ACE05), significantly
outperforming existing retrieval-based EAE methods.
Authors' comments: 15 pages
Kanchan Shivashankar, Nadine Steinmetz
Scholarly communication is a rapidly growing field containing a wealth of knowledge. However, because it is published as unstructured documents, it is challenging to extract useful information through conventional document retrieval methods. Scholarly knowledge graphs solve this problem by representing the documents in a semantic network, providing hidden insights, summaries, and ease of accessibility through queries. Naturally, question answering for scholarly graphs expands the accessibility to a wider audience. But some of the knowledge in this domain is still presented as unstructured text, thus requiring a hybrid solution for question answering systems. In this paper, we present a two-step solution using an open-source Large Language Model (LLM), Llama3.1, for the Scholarly-QALD dataset. First, we extract the context pertaining to the question from different structured and unstructured data sources: the DBLP and SemOpenAlex knowledge graphs and Wikipedia text. Second, we implement prompt engineering to improve the information retrieval performance of the LLM. Our approach achieved an F1 score of 40%; we also observed some anomalous responses from the LLM, which are discussed in the final part of the paper.
Shaojun Li, Hengchao Shang, Daimeng Wei, Jiaxin Guo, Zongyao Li, Xianghui He, Min Zhang, Hao Yang
Recent advancements in integrating speech information into large language
models (LLMs) have significantly improved automatic speech recognition (ASR)
accuracy. However, existing methods are often constrained by the capabilities of
the speech encoders under varied acoustic conditions, such as accents. To
address this, we propose LA-RAG, a novel Retrieval-Augmented Generation (RAG)
paradigm for LLM-based ASR. LA-RAG leverages fine-grained token-level speech
datastores and a speech-to-speech retrieval mechanism to enhance ASR accuracy
via LLM in-context learning (ICL) capabilities. Experiments on Mandarin and
various Chinese dialect datasets demonstrate significant improvements in ASR
accuracy compared to existing methods, validating the effectiveness of our
approach, especially in handling accent variations.
Authors' comments: submitted to ICASSP 2025
Zhihong Lei, Xingyu Na, Mingbin Xu, Ernest Pusateri, Christophe Van Gysel, Yuanyuan Zhang, Shiyi Han, Zhen Huang
Large language models (LLMs) have shown a strong capability for modeling multimodal signals, including audio and text, allowing the model to generate spoken or textual responses given a speech input. However, it remains a challenge for the model to recognize personal named entities, such as contacts in a phone book, when the input modality is speech. In this work, we start with a speech recognition task and propose a retrieval-based solution to contextualize the LLM: we first let the LLM detect named entities in speech without any context, then use each detected named entity as a query to retrieve phonetically similar named entities from a personal database and feed them to the LLM, and finally run context-aware LLM decoding. In a voice assistant task, our solution achieved up to 30.2% relative word error rate reduction and 73.6% relative named entity error rate reduction compared to a baseline system without contextualization. Notably, our solution by design avoids prompting the LLM with the full named entity database, making it highly efficient and applicable to large named entity databases.
Fabio Gregório, Rafaela Castro, Kele Belloze, Rui Pedro Lopes, Eduardo Bezerra
The Brazilian Constitution, known as the Citizen's Charter, provides
mechanisms for citizens to petition the Judiciary, including the so-called
special appeal. This specific type of appeal aims to standardize the legal
interpretation of Brazilian legislation in cases where the decision contradicts
federal laws. The handling of special appeals is a daily task in the Judiciary,
regularly presenting significant demands in its courts. We propose a new method
called GLARE, based on unsupervised machine learning, to help the legal analyst
classify a special appeal on a topic from a list made available by the National
Court of Brazil (STJ). As part of this method, we propose a modification of the
graph-based LexRank algorithm, which we call Guided LexRank. This algorithm
generates the summary of a special appeal. The degree of similarity between the
generated summary and different topics is evaluated using the BM25 algorithm.
As a result, the method presents a ranking of themes most appropriate to the
analyzed special appeal. The proposed method does not require prior labeling of
the text to be evaluated and eliminates the need for large volumes of data to
train a model. We evaluate the effectiveness of the method by applying it to a
special appeal corpus previously classified by human experts.
Authors' comments: 26 pages, 8 figures, submitted to AI and Law
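The topic-ranking step in the abstract above (scoring a generated summary against each candidate topic with BM25) can be illustrated with a minimal sketch of Okapi BM25. This is not the GLARE implementation: the whitespace tokenization, the parameter values `k1=1.5` and `b=0.75`, the function names, and the toy topics and summary are all assumptions made for illustration, and the Guided LexRank summarizer is not reproduced here.

```python
import math
from collections import Counter


def bm25_scores(query_tokens, docs_tokens, k1=1.5, b=0.75):
    """Score each tokenized document against the query with Okapi BM25.

    k1 and b are commonly used default values, assumed here for illustration.
    """
    n_docs = len(docs_tokens)
    avg_len = sum(len(d) for d in docs_tokens) / n_docs
    # Document frequency: number of documents containing each term.
    df = Counter()
    for doc in docs_tokens:
        df.update(set(doc))

    scores = []
    for doc in docs_tokens:
        tf = Counter(doc)
        score = 0.0
        # Each distinct query term contributes once (a simplifying choice).
        for term in set(query_tokens):
            if term not in tf:
                continue
            # Smoothed IDF variant that stays non-negative.
            idf = math.log(1.0 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            norm = tf[term] + k1 * (1.0 - b + b * len(doc) / avg_len)
            score += idf * tf[term] * (k1 + 1.0) / norm
        scores.append(score)
    return scores


def rank_topics(summary, topics):
    """Return topic indices ordered from most to least similar to the summary."""
    query = summary.lower().split()
    docs = [t.lower().split() for t in topics]
    scores = bm25_scores(query, docs)
    return sorted(range(len(topics)), key=lambda i: scores[i], reverse=True)


# Hypothetical topics and summary, invented for this example.
topics = [
    "tax exemption for imported goods",
    "consumer contract bank interest rates",
    "criminal procedure evidence admissibility",
]
summary = "appeal disputes bank interest rates in a consumer loan contract"
scores = bm25_scores(summary.lower().split(), [t.lower().split() for t in topics])
ranking = rank_topics(summary, topics)
```

Because the topics act as the "documents" and the summary as the query, no labeled training data is needed; the output is simply an ordering of topics that an analyst could inspect.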
Philip Fradkin, Puria Azadi, Karush Suri, Frederik Wenkel, Ali Bashashati, Maciej Sypetkowski, Dominique Beaini
Predicting molecular impact on cellular function is a core challenge in therapeutic design. Phenomic experiments, designed to capture cellular morphology, utilize microscopy-based techniques and demonstrate a high-throughput solution for uncovering molecular impact on the cell. In this work, we learn a joint latent space between molecular structures and microscopy phenomic experiments, aligning paired samples with contrastive learning. Specifically, we study the problem of Contrastive PhenoMolecular Retrieval, which consists of zero-shot molecular structure identification conditioned on phenomic experiments. We assess challenges in multi-modal learning of phenomics and molecular modalities such as experimental batch effect, inactive molecule perturbations, and encoding perturbation concentration. We demonstrate improved multi-modal learner retrieval through (1) a uni-modal pre-trained phenomics model, (2) a novel inter-sample similarity-aware loss, and (3) models conditioned on a representation of molecular concentration. Following this recipe, we propose MolPhenix, a molecular phenomics model. MolPhenix leverages a pre-trained phenomics model to demonstrate significant performance gains across perturbation concentrations, molecular scaffolds, and activity thresholds. In particular, we demonstrate an 8.1x improvement in zero-shot molecular retrieval of active molecules over the previous state-of-the-art, reaching 77.33% in top-1% accuracy. These results open the door for machine learning to be applied in virtual phenomics screening, which can significantly benefit drug discovery applications.
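To make the idea of aligning paired samples in a joint latent space concrete, the following is a minimal sketch of a standard symmetric contrastive (InfoNCE, CLIP-style) objective followed by nearest-neighbor retrieval. It is not the MolPhenix loss: the inter-sample similarity-aware loss and the concentration conditioning mentioned in the abstract are not reproduced, and the function names, the temperature of 0.1, and the toy two-dimensional embeddings are assumptions for illustration.

```python
import math


def normalize(v):
    """Scale a vector to unit Euclidean length."""
    norm = math.sqrt(sum(a * a for a in v))
    return [a / norm for a in v]


def diagonal_cross_entropy(logits):
    """Mean cross-entropy where row i's correct class is column i."""
    total = 0.0
    for i, row in enumerate(logits):
        peak = max(row)  # subtract the max for numerical stability
        log_z = peak + math.log(sum(math.exp(v - peak) for v in row))
        total += log_z - row[i]
    return total / len(logits)


def contrastive_loss(pheno, mol, temperature=0.1):
    """Symmetric InfoNCE loss over paired phenomic and molecular embeddings."""
    p = [normalize(v) for v in pheno]
    m = [normalize(v) for v in mol]
    # Cosine similarity between every phenomic/molecular pair, scaled.
    logits = [[sum(a * b for a, b in zip(pi, mj)) / temperature for mj in m]
              for pi in p]
    transposed = [list(col) for col in zip(*logits)]
    return 0.5 * (diagonal_cross_entropy(logits)
                  + diagonal_cross_entropy(transposed))


def retrieve_molecule(pheno_vec, mol_bank):
    """Index of the molecule embedding closest to one phenomic embedding."""
    q = normalize(pheno_vec)
    sims = [sum(a * b for a, b in zip(q, normalize(v))) for v in mol_bank]
    return max(range(len(sims)), key=lambda i: sims[i])


# Toy embeddings, invented for illustration only.
pheno = [[1.0, 0.0], [0.0, 1.0]]
mol = [[0.9, 0.1], [0.1, 0.9]]
aligned_loss = contrastive_loss(pheno, mol)
shuffled_loss = contrastive_loss(pheno, mol[::-1])
best = retrieve_molecule(pheno[0], mol)
```

In this toy setting the loss is lower when matched pairs sit on the diagonal of the similarity matrix than when the pairing is shuffled, which is the property that makes retrieval by nearest neighbor meaningful after training.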
Ernest Pusateri, Anmol Walia, Anirudh Kashi, Bortik Bandyopadhyay, Nadia Hyder, Sayantan Mahinder, Raviteja Anantha, Daben Liu et al.
In recent years, end-to-end automatic speech recognition (ASR) systems have
proven themselves remarkably accurate and performant, but these systems still
have a significant error rate for entity names which appear infrequently in
their training data. In parallel to the rise of end-to-end ASR systems, large
language models (LLMs) have proven to be a versatile tool for various natural
language processing (NLP) tasks. In NLP tasks where a database of relevant
knowledge is available, retrieval augmented generation (RAG) has achieved
impressive results when used with LLMs. In this work, we propose a RAG-like
technique for correcting speech recognition entity name errors. Our approach
uses a vector database to index a set of relevant entities. At runtime,
database queries are generated from possibly errorful textual ASR hypotheses,
and the entities retrieved using these queries are fed, along with the ASR
hypotheses, to an LLM which has been adapted to correct ASR errors. Overall,
our best system achieves 33%-39% relative word error rate reductions on
synthetic test sets focused on voice assistant queries of rare music entities
without regressing on the STOP test set, a publicly available voice assistant
test set covering many domains.
Authors' comments: Submitted to ICASSP 2025
Francisco Valentini, Viviana Cotik, Damián Furman, Ivan Bercovich, Edgar Altszyler, Juan Manuel Pérez
Information retrieval (IR) is the task of finding relevant documents in response to a user query. Although Spanish is the second most spoken native language, current IR benchmarks lack Spanish data, hindering the development of information access tools for Spanish speakers. We introduce MessIRve, a large-scale Spanish IR dataset with around 730 thousand queries from Google's autocomplete API and relevant documents sourced from Wikipedia. MessIRve's queries reflect diverse Spanish-speaking regions, unlike other datasets that are translated from English or do not consider dialectal variations. The large size of the dataset allows it to cover a wide variety of topics, unlike smaller datasets. We provide a comprehensive description of the dataset, comparisons with existing datasets, and baseline evaluations of prominent IR models. Our contributions aim to advance Spanish IR research and improve information access for Spanish speakers.
Zhiyuan Hu, Julián Tachella, Michael Unser, Jonathan Dong
Phase retrieval, a nonlinear problem prevalent in imaging applications, has been extensively studied using random models, some of which have i.i.d. sensing matrix components. While these models offer robust reconstruction guarantees, they are computationally expensive and impractical for real-world scenarios. In contrast, Fourier-based models, common in applications such as ptychography and coded diffraction imaging, are computationally more efficient but lack the theoretical guarantees of random models. Here, we introduce structured random models for phase retrieval that combine the efficiency of fast Fourier transforms with the versatility of random diagonal matrices. These models emulate i.i.d. random matrices at a fraction of the computational cost. Our approach demonstrates robust reconstructions comparable to fully random models using gradient descent and spectral methods. Furthermore, we establish that a minimum of two structured layers is necessary to achieve these structured-random properties. The proposed method is suitable for optical implementation and offers an efficient and robust alternative for phase retrieval in practical imaging applications.
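A forward model that alternates random diagonal matrices with fast Fourier transforms can be sketched as below. This is a sketch under stated assumptions, not the authors' construction: the diagonals are assumed to hold unit-modulus entries with uniformly random phases, the transform is square (no oversampling), the FFT is a plain radix-2 implementation written out so the example needs no external library, and all function names are hypothetical.

```python
import cmath
import math
import random


def fft(values):
    """Recursive radix-2 FFT; the input length must be a power of two."""
    n = len(values)
    if n == 1:
        return list(values)
    even = fft(values[0::2])
    odd = fft(values[1::2])
    out = [0j] * n
    for k in range(n // 2):
        twiddle = cmath.exp(-2j * math.pi * k / n) * odd[k]
        out[k] = even[k] + twiddle
        out[k + n // 2] = even[k] - twiddle
    return out


def random_phase_diagonal(n, rng):
    """Diagonal of unit-modulus entries with random phases (assumed form)."""
    return [cmath.exp(2j * math.pi * rng.random()) for _ in range(n)]


def structured_forward(x, diagonals):
    """Apply F D_L ... F D_1 to x, one (diagonal, FFT) layer at a time."""
    z = list(x)
    scale = 1.0 / math.sqrt(len(x))  # makes each FFT unitary
    for diag in diagonals:
        z = [d * v for d, v in zip(diag, z)]
        z = [scale * v for v in fft(z)]
    return z


def measure(x, diagonals):
    """Phaseless measurements: only the magnitudes of the transform are kept."""
    return [abs(v) for v in structured_forward(x, diagonals)]


rng = random.Random(0)
n = 8
x = [complex(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(n)]
layers = [random_phase_diagonal(n, rng) for _ in range(2)]  # two structured layers
y = measure(x, layers)
```

Each layer costs O(n log n) operations instead of the O(n^2) needed for a dense i.i.d. matrix, which is the efficiency argument made in the abstract. The measurements are unchanged if `x` is multiplied by a global phase, the usual ambiguity in phase retrieval.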
Suyan Li, Fuxiang Huang, Lei Zhang
In the real world, where information is abundant and diverse across different
modalities, understanding and utilizing various data types to improve retrieval
systems is a key focus of research. Multimodal composite retrieval integrates
diverse modalities such as text, image, and audio to provide more
accurate, personalized, and contextually relevant results. To facilitate a
deeper understanding of this promising direction, this survey explores
multimodal composite editing and retrieval in depth, covering image-text
composite editing, image-text composite retrieval, and other multimodal
composite retrieval. In this survey, we systematically organize the application
scenarios, methods, benchmarks, experiments, and future directions. Multimodal
learning is a hot topic in the large-model era, and several surveys on
multimodal learning and vision-language models with transformers have been
published in the PAMI journal. To the best of our knowledge, this survey is the
first comprehensive review of the literature on multimodal composite retrieval,
which is a timely complement of multimodal fusion to existing reviews. To help
readers quickly track this field, we have built a project page for this survey,
which can be found at
https://github.com/fuxianghuang1/Multimodal-Composite-Editing-and-Retrieval.
Authors' comments: 20 pages, 3 figures, and 11 tables
Jintian Zhang, Cheng Peng, Mengshu Sun, Xiang Chen, Lei Liang, Zhiqiang Zhang, Jun Zhou, Huajun Chen et al.
Despite the recent advancements in Large Language Models (LLMs), which have
significantly enhanced the generative capabilities for various NLP tasks, LLMs
still face limitations in directly handling retrieval tasks. However, many
practical applications demand the seamless integration of both retrieval and
generation. This paper introduces a novel and efficient One-pass Generation and
retrieval framework (OneGen), designed to improve LLMs' performance on tasks
that require both generation and retrieval. The proposed framework bridges the
traditionally separate training approaches for generation and retrieval by
incorporating retrieval tokens generated autoregressively. This enables a
single LLM to handle both tasks simultaneously in a unified forward pass. We
conduct experiments on two distinct types of composite tasks, RAG and Entity
Linking, to validate the pluggability, effectiveness, and efficiency of OneGen
in training and inference. Furthermore, our results show that integrating
generation and retrieval within the same context preserves the generative
capabilities of LLMs while improving retrieval performance. To the best of our
knowledge, OneGen is the first to enable LLMs to conduct vector retrieval
during generation.
Authors' comments: EMNLP 2024 Findings; code is available at
https://github.com/zjunlp/OneGen
Hemanth Kandula, Damianos Karakos, Haoling Qiu, Benjamin Rozonoyer, Ian Soboroff, Lee Tarlin, Bonan Min
Frequently, users of an Information Retrieval (IR) system start with an overarching information need (a.k.a., an analytic task) and proceed to define finer-grained queries covering various important aspects (i.e., sub-topics) of that analytic task. We present a novel, interactive system called $\textit{QueryBuilder}$, which allows a novice, English-speaking user to create queries with a small amount of effort, through efficient exploration of an English development corpus in order to rapidly develop cross-lingual information retrieval queries corresponding to the user's information needs. QueryBuilder performs near real-time retrieval of documents based on user-entered search terms; the user looks through the retrieved documents and marks sentences as relevant to the information needed. The marked sentences are used by the system as additional information in query formation and refinement: query terms (and, optionally, event features, which capture event $'triggers'$ (indicator terms) and agent/patient roles) are appropriately weighted, and a neural-based system, which better captures textual meaning, retrieves other relevant content. The process of retrieval and marking is repeated as many times as desired, giving rise to increasingly refined queries in each iteration. The final product is a fine-grained query used in Cross-Lingual Information Retrieval (CLIR). Our experiments using analytic tasks and requests from the IARPA BETTER IR datasets show that with a small amount of effort (at most 10 minutes per sub-topic), novice users can form $\textit{useful}$ fine-grained queries including in languages they don't understand. QueryBuilder also provides beneficial capabilities to the traditional corpus exploration and query formation process. A demonstration video is released at https://vimeo.com/734795835
Sepanta Zeighami, Zac Wellmer, Aditya Parameswaran
$k$-Nearest Neighbor search on dense vector embeddings ($k$-NN retrieval) from pre-trained embedding models is the predominant retrieval method for text and images, as well as Retrieval-Augmented Generation (RAG) pipelines. In practice, application developers often fine-tune the embeddings to improve their accuracy on the dataset and query workload at hand. Existing approaches either fine-tune the pre-trained model itself or, more efficiently, but at the cost of accuracy, train adaptor models to transform the output of the pre-trained model. We present NUDGE, a family of novel non-parametric embedding fine-tuning approaches that are significantly more accurate and efficient than both sets of existing approaches. NUDGE directly modifies the embeddings of data records to maximize the accuracy of $k$-NN retrieval. We present a thorough theoretical and experimental study of NUDGE's non-parametric approach. We show that even though the underlying problem is NP-Hard, constrained variations can be solved efficiently. These constraints additionally ensure that the changes to the embeddings are modest, avoiding large distortions to the semantics learned during pre-training. In experiments across five pre-trained models and nine standard text and image retrieval datasets, NUDGE runs in minutes and often improves NDCG@10 by more than 10% over existing fine-tuning methods. On average, NUDGE provides a 3.3x and 4.3x higher increase in accuracy and runs 200x and 3x faster, respectively, over fine-tuning the pre-trained model and training adaptors.
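For readers unfamiliar with the two building blocks named in this abstract, the sketch below shows brute-force $k$-NN retrieval by cosine similarity and the NDCG@k metric used to evaluate it. It does not implement NUDGE's embedding updates; the function names, the toy embeddings, and the single relevance label are assumptions for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def knn(query, embeddings, k):
    """Indices of the k records most similar to the query, best first."""
    order = sorted(range(len(embeddings)),
                   key=lambda i: cosine(query, embeddings[i]), reverse=True)
    return order[:k]


def ndcg_at_k(ranked_ids, relevance, k):
    """NDCG@k for a ranked list, given a dict of graded relevance labels."""
    dcg = sum(relevance.get(doc, 0) / math.log2(rank + 2)
              for rank, doc in enumerate(ranked_ids[:k]))
    ideal = sorted(relevance.values(), reverse=True)[:k]
    idcg = sum(rel / math.log2(rank + 2) for rank, rel in enumerate(ideal))
    return dcg / idcg if idcg > 0 else 0.0


# Toy data record embeddings and query, invented for this example.
embeddings = [[1.0, 0.0], [0.8, 0.6], [0.0, 1.0]]
query = [0.6, 0.8]
relevance = {1: 1}  # record 1 is the only relevant one
top = knn(query, embeddings, k=2)
score = ndcg_at_k(top, relevance, k=2)
```

A non-parametric fine-tuning method in the sense of the abstract would change the entries of `embeddings` directly so that relevant records rank higher for the training queries, leaving the embedding model itself untouched.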
Thiem Nguyen Ba, Vinh Doan The, Tung Pham Quang, Toan Tran Van
In the modern era of rapidly increasing data volumes, accurately retrieving
and recommending relevant documents has become crucial in enhancing the
reliability of Question Answering (QA) systems. Recently, Retrieval Augmented
Generation (RAG) has gained significant recognition for enhancing the
capabilities of large language models (LLMs) by mitigating hallucination issues
in QA systems, which is particularly beneficial in the legal domain. Various
methods, such as semantic search using dense vector embeddings or a combination
of multiple techniques to improve results before feeding them to LLMs, have
been proposed. However, these methods often fall short when applied to the
Vietnamese language due to several challenges, namely inefficient Vietnamese
data processing leading to excessive token length or overly simplistic ensemble
techniques that lead to instability and limited improvement. Moreover, a
critical issue often overlooked is the ordering of final relevant documents
which are used as reference to ensure the accuracy of the answers provided by
LLMs. In this report, we introduce three main modifications to
address these challenges. First, we explore various practical approaches to
data processing to overcome the limitations of the embedding model.
Additionally, we enhance Reciprocal Rank Fusion by normalizing order to combine
results from keyword and vector searches effectively. We also meticulously
re-rank the source pieces of information used by LLMs with Active Retrieval to
improve user experience when refining the information generated. In our
opinion, this technique can also be considered as a new re-ranking method that
might be used in place of the traditional cross encoder. Finally, we integrate
these techniques into a comprehensive QA system, significantly improving its
performance and reliability.
Authors' comments: 7 pages
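The fusion of keyword and vector search results mentioned in the abstract above builds on Reciprocal Rank Fusion (RRF). The sketch below shows only the standard form of RRF; the order-normalized variant proposed in the report is not specified in the abstract and is not reproduced. The constant `k=60` is the conventional default, and the function name and document identifiers are hypothetical.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of document ids with standard RRF.

    Each document receives 1 / (k + rank) from every list it appears in,
    with ranks starting at 1. k=60 is the conventional default (assumed).
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores, key=lambda d: scores[d], reverse=True)


# Hypothetical result lists from a keyword search and a vector search.
keyword_hits = ["d1", "d2", "d3"]
vector_hits = ["d3", "d1", "d4"]
fused = reciprocal_rank_fusion([keyword_hits, vector_hits])
```

Because RRF uses only ranks and not raw scores, it can combine BM25-style keyword scores and embedding similarities without calibrating them to a common scale.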
Yun Joon Soh, Hanxian Huang, Yuandong Tian, Jishen Zhao
Supporting longer context for Large Language Models (LLMs) is a promising direction to advance LLMs. As training a model for a longer context window is computationally expensive, many alternative solutions, such as Retrieval Augmented Generation (RAG), have been used. However, most existing RAG methods adopt embedding-based retrieval that falls short on long contexts. To address such challenges, we propose an attention-based retrieval technique, You Only Use Reactive Attention slice (YOURA). YOURA leverages a novel retrieval heuristic called reaction score to rank the relevance of each sentence in the input context with the query sentence. Intuitively, we measure how the per-token attention score "reacts" to the query and greedily retrieve the most reactive sentences. Internally, YOURA generates a token-indexed vector (called reaction vector) for the whole input context. To map each sentence to the token-indexed vector, we propose an Embedding-Agnostic Sentence Yield (EASY), a best-effort token wiggling algorithm. We evaluate our retrieval technique on three open-source pre-trained LLMs across six LongBench QA datasets. Our technique achieves up to 30% vLLM inference throughput improvement for serving long-context queries with a nearly identical quality score to the simple yet effective truncate-middle approach.
Yuan Yang, Siheng Xiong, Ehsan Shareghi, Faramarz Fekri
Recent advancements in large language models (LLMs) have significantly enhanced their capacity to aggregate and process information across multiple modalities, enabling them to perform a wide range of tasks such as multimodal data querying, tool usage, web interactions, and handling long documents. These capabilities pave the way for transforming LLMs from mere chatbots into general-purpose agents capable of interacting with the real world. This paper explores the concept of using a language model as the core component of an operating system (OS), effectively acting as a CPU that processes data stored in a context window, which functions as RAM. A key challenge in realizing such an LM OS is managing the life-long context and ensuring statefulness across sessions, a feature limited by the current session-based interaction paradigm due to the context window size limit. To address this, we introduce compressor-retriever, a model-agnostic architecture designed for life-long context management. Unlike other long-context solutions such as retrieval-augmented generation, our approach exclusively uses the base model's forward function to compress and retrieve context, ensuring end-to-end differentiability. Preliminary experiments demonstrate the effectiveness of this architecture in in-context learning tasks, marking a step towards the development of a fully stateful LLM OS. Project repo available at: https://github.com/gblackout/LM-OS
Leqi Shen, Tianxiang Hao, Sicheng Zhao, Yifeng Zhang, Pengzhang Liu, Yongjun Bao, Guiguang Ding
Most text-video retrieval methods utilize the text-image pre-trained CLIP as a backbone, incorporating complex modules that result in high computational overhead. As a result, many studies focus on efficient fine-tuning. The primary challenge in efficient adaptation arises from the inherent differences between image and video modalities. Each sampled video frame must be processed by the image encoder independently, which increases complexity and complicates practical deployment. Although existing efficient methods fine-tune with few trainable parameters, they still incur high inference costs due to the large number of tokens. In this work, we argue that temporal redundancy significantly contributes to the model's high complexity due to the repeated information in consecutive frames. Existing token compression methods for image models fail to solve these unique challenges, as they overlook temporal redundancy across frames. To tackle these problems, we propose Temporal Token Merging (TempMe) to reduce temporal redundancy. Specifically, we introduce a progressive multi-granularity framework. By gradually combining neighboring clips, we merge temporal tokens across different frames and learn video-level features, leading to lower complexity and better performance. Extensive experiments validate the superiority of our TempMe. Compared to previous efficient text-video retrieval methods, TempMe significantly reduces output tokens by 95% and GFLOPs by 51%, while achieving a 1.8X speedup and a 4.4% R-Sum improvement. Additionally, TempMe exhibits robust generalization capabilities by integrating effectively with both efficient and full fine-tuning methods. With full fine-tuning, TempMe achieves a significant 7.9% R-Sum improvement, trains 1.57X faster, and uses 75.2% of the GPU memory. Our code will be released.
Andreea-Maria Oncescu, João F. Henriques, A. Sophia Koepke
Recent advancements in machine learning have fueled research on multimodal
tasks, such as text-to-video and text-to-audio retrieval. These
tasks require models to understand the semantic content of video and audio
data, including objects and characters. The models also need to learn spatial
arrangements and temporal relationships. In this work, we analyse the temporal
ordering of sounds, which is an understudied problem in the context of
text-to-audio retrieval. In particular, we dissect the temporal understanding
capabilities of a state-of-the-art model for text-to-audio retrieval on the
AudioCaps and Clotho datasets. Additionally, we introduce a synthetic
text-audio dataset that provides a controlled setting for evaluating temporal
capabilities of recent models. Lastly, we present a loss function that
encourages text-audio models to focus on the temporal ordering of events. Code
and data are available at
https://www.robots.ox.ac.uk/~vgg/research/audio-retrieval/dtu/.
Authors' comments: 9 pages, 5 figures, ACM Multimedia 2024,
https://www.robots.ox.ac.uk/~vgg/research/audio-retrieval/dtu/
Hajime Ueda, Shun Katakami, Masato Okada
Phase retrieval refers to the problem of recovering a high-dimensional vector $\boldsymbol{x} \in \mathbb{C}^N$ from the magnitude of its linear transform $\boldsymbol{z} = A \boldsymbol{x}$, observed through a noisy channel. To mitigate the ill-posed nature of the inverse problem, it is common practice to observe the magnitude of linear measurements $\boldsymbol{z}^{(1)} = A^{(1)} \boldsymbol{x},..., \boldsymbol{z}^{(L)} = A^{(L)}\boldsymbol{x}$ using multiple sensing matrices $A^{(1)},..., A^{(L)}$, with ptychographic imaging being a notable example of such strategies. Inspired by existing algorithms for ptychographic reconstruction, we introduce stochasticity to Vector Approximate Message Passing (VAMP), a computationally efficient algorithm applicable to a wide range of Bayesian inverse problems. By testing our approach in the setup of phase retrieval, we show the superior convergence speed of the proposed algorithm.
Su Hyeon Lim, Minkuk Kim, Hyeon Bae Kim, Seong Tae Kim
The Visual Question Answering with Natural Language Explanation (VQA-NLE) task is
challenging due to its high demand for reasoning-based inference. Recent
VQA-NLE studies focus on enhancing model networks to amplify the model's
reasoning capability but this approach is resource-consuming and unstable. In
this work, we introduce a new VQA-NLE model, ReRe (Retrieval-augmented natural
language Reasoning), which leverages retrieved information from memory to
aid in generating accurate answers and persuasive explanations without relying
on complex networks and extra datasets. ReRe is an encoder-decoder architecture
model using a pre-trained CLIP vision encoder and a pre-trained GPT-2 language
model as a decoder. Cross-attention layers are added to GPT-2 for
processing retrieval features. ReRe outperforms previous methods in VQA
accuracy and explanation score and shows improvement in NLE with more
persuasive and reliable explanations.
Authors' comments: ICIP Workshop 2024