Qingfei Zhao, Ruobing Wang, Yukuo Cen, Daren Zha, Shicheng Tan, Yuxiao Dong, Jie Tang
Long-Context Question Answering (LCQA), a challenging task, aims to reason
over long-context documents to yield accurate answers to questions. Existing
long-context Large Language Models (LLMs) for LCQA often struggle with the
"lost in the middle" issue. Retrieval-Augmented Generation (RAG) mitigates this
issue by providing external factual evidence. However, its chunking strategy
disrupts the global long-context information, and its low-quality retrieval in
long contexts hinders LLMs from identifying effective factual details due to
substantial noise. To this end, we propose LongRAG, a general,
dual-perspective, and robust LLM-based RAG system paradigm for LCQA to enhance
RAG's understanding of complex long-context knowledge (i.e., global information
and factual details). We design LongRAG as a plug-and-play paradigm,
facilitating adaptation to various domains and LLMs. Extensive experiments on
three multi-hop datasets demonstrate that LongRAG significantly outperforms
long-context LLMs (up by 6.94%), advanced RAG (up by 6.16%), and Vanilla RAG
(up by 17.25%). Furthermore, we conduct quantitative ablation studies and
multi-dimensional analyses, highlighting the effectiveness of the system's
components and fine-tuning strategies. Data and code are available at
https://github.com/QingFei1/LongRAG.
Authors' comments: EMNLP 2024 Main, Final
Yuanmin Tang, Jing Yu, Keke Gai, Jiamin Zhuang, Gaopeng Gou, Gang Xiong, Qi Wu
Zero-Shot Composed Image Retrieval (ZS-CIR) supports diverse tasks with a
broad range of visual content manipulation intentions that can be related to
domain, scene, object, and attribute. A key challenge for ZS-CIR is to
accurately map an image representation to a pseudo-word token that captures the
image information relevant to the manipulation intention for generalized CIR. However,
the gap between the retrieval and pre-training stages in existing methods leads to
significant redundancy in the pseudo-word tokens. In this paper, we propose a
novel denoising image-to-word mapping approach, named Denoise-I2W, for mapping
images into denoising pseudo-word tokens that, without intention-irrelevant
visual information, enhance accurate ZS-CIR. Specifically, a pseudo triplet
construction module first automatically constructs pseudo triplets
(i.e., a pseudo-reference image, a pseudo-manipulation text, and a
target image) for pre-training the denoising mapping network. Then, a
pseudo-composed mapping module maps the pseudo-reference image to a pseudo-word
token and combines it with the pseudo-manipulation text with manipulation
intention. This combination aligns with the target image, facilitating
denoising intention-irrelevant visual information for mapping. Our proposed
Denoise-I2W is a model-agnostic and annotation-free approach. It demonstrates
strong generalization capabilities across three state-of-the-art ZS-CIR models
on four benchmark datasets. By integrating Denoise-I2W with existing best
models, we obtain consistent and significant performance boosts ranging from
1.45% to 4.17% over the best methods without increasing inference costs, and
achieve new state-of-the-art results on ZS-CIR. Our code is available at
https://github.com/Pter61/denoise-i2w-tmm.
Authors' comments: This work was submitted to IJCAI 2024, with a score of weak accept
and borderline accept
Zongmeng Zhang, Yufeng Shi, Jinhua Zhu, Wengang Zhou, Xiang Qi, Peng Zhang, Houqiang Li
Trustworthiness is an essential prerequisite for the real-world application
of large language models. In this paper, we focus on the trustworthiness of
language models with respect to retrieval augmentation. Despite being supported
with external evidence, retrieval-augmented generation still suffers from
hallucinations, one primary cause of which is the conflict between contextual
and parametric knowledge. We deem that retrieval-augmented language models have
the inherent capability to supply responses according to both contextual
and parametric knowledge. Inspired by aligning language models with human
preference, we take the first step towards aligning retrieval-augmented
language models to a status where they respond relying solely on the external
evidence and disregard the interference of parametric knowledge. Specifically,
we propose a reinforcement-learning-based algorithm, Trustworthy-Alignment,
theoretically and experimentally demonstrating large language models'
capability of reaching a trustworthy status without explicit supervision on how
to respond. Our work highlights the potential of large language models to
explore their intrinsic abilities on their own and expands the application
scenarios of alignment from fulfilling human preference to creating trustworthy
agents.
Authors' comments: ICML 2024
Gustavo Penha, Ali Vardasbi, Enrico Palumbo, Marco de Nadai, Hugues Bouchard
Generative retrieval for search and recommendation is a promising paradigm
for retrieving items, offering an alternative to traditional methods that
depend on external indexes and nearest-neighbor searches. Instead, generative
models directly associate inputs with item IDs. Given the breakthroughs of
Large Language Models (LLMs), these generative systems can play a crucial role
in centralizing a variety of Information Retrieval (IR) tasks in a single model
that performs tasks such as query understanding, retrieval, recommendation,
explanation, re-ranking, and response generation. Despite the growing interest
in such a unified generative approach for IR systems, the advantages of using a
single, multi-task model over multiple specialized models are not well
established in the literature. This paper investigates whether and when such a
unified approach can outperform task-specific models in the IR tasks of search
and recommendation, broadly co-existing in multiple industrial online
platforms, such as Spotify, YouTube, and Netflix. Previous work shows that (1)
the latent representations of items learned by generative recommenders are
biased towards popularity, and (2) content-based and
collaborative-filtering-based information can improve an item's
representations. Motivated by this, our study is guided by two hypotheses: [H1]
the joint training regularizes the estimation of each item's popularity, and
[H2] the joint training regularizes the item's latent representations, where
search captures content-based aspects of an item and recommendation captures
collaborative-filtering aspects. Our extensive experiments with both simulated
and real-world data support both [H1] and [H2] as key contributors to the
effectiveness improvements observed in the unified search and recommendation
generative models over the single-task approaches.
Authors' comments: Accepted for publication in the 18th ACM Conference on Recommender
Systems (RecSys'24)
Junjie Huang, Jiarui Qin, Jianghao Lin, Ziming Feng, Yong Yu, Weinan Zhang
Recommender systems (RS) are pivotal in managing information overload in
modern digital services. A key challenge in RS is efficiently processing vast
item pools to deliver highly personalized recommendations under strict latency
constraints. Multi-stage cascade ranking addresses this by employing
computationally efficient retrieval methods to cover diverse user interests,
followed by more precise ranking models to refine the results. In the retrieval
stage, multi-channel retrieval is often used to generate distinct item subsets
from different candidate generators, leveraging the complementary strengths of
these methods to maximize coverage. However, forwarding all retrieved items
overwhelms downstream rankers, necessitating truncation. Despite advancements
in individual retrieval methods, multi-channel fusion, the process of
efficiently merging multi-channel retrieval results, remains underexplored. We
are the first to identify and systematically investigate multi-channel fusion
in the retrieval stage. Current industry practices often rely on heuristic
approaches and manual designs, which tend to yield suboptimal performance.
Moreover, traditional gradient-based methods like SGD are unsuitable for this
task due to the non-differentiable nature of the selection process. In this
paper, we explore advanced channel fusion strategies by assigning
systematically optimized weights to each channel. We utilize black-box
optimization techniques, including the Cross Entropy Method and Bayesian
Optimization for global weight optimization, alongside policy gradient-based
approaches for personalized merging. Our methods enhance both personalization
and flexibility, achieving significant performance improvements across multiple
datasets and yielding substantial gains in real-world deployments, offering a
scalable solution for optimizing multi-channel fusion in retrieval.
Authors' comments: 12 pages, 8 figures
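The weighted channel fusion described in the abstract above can be illustrated with a minimal Python sketch of the Cross Entropy Method. The toy channels, their scores, the relevant set, the recall@k objective, and every function name here are hypothetical; the abstract does not state the paper's actual objective or hyperparameters. The point of the sketch is that top-k selection is non-differentiable, so the weights are searched by sampling rather than by gradient descent.

```python
import random
import statistics

def fuse(channels, weights, k):
    """Merge per-channel score dicts by weighted sum and keep the top-k items."""
    fused = {}
    for channel, w in zip(channels, weights):
        for item, score in channel.items():
            fused[item] = fused.get(item, 0.0) + w * score
    # Sort by descending fused score; break ties by item id for determinism.
    return sorted(fused, key=lambda item: (-fused[item], item))[:k]

def recall_at_k(channels, weights, relevant, k):
    """Fraction of relevant items that survive truncation to k items."""
    return len(set(fuse(channels, weights, k)) & relevant) / len(relevant)

def cross_entropy_method(objective, dim, iterations=20, population=50,
                         elite=10, seed=0):
    """Black-box maximization: sample weights, keep elites, refit a Gaussian."""
    rng = random.Random(seed)
    mean, std = [0.5] * dim, [0.5] * dim
    best_w, best_val = list(mean), objective(mean)
    for _ in range(iterations):
        samples = [[max(0.0, rng.gauss(m, s)) for m, s in zip(mean, std)]
                   for _ in range(population)]
        elites = sorted(samples, key=objective, reverse=True)[:elite]
        top_val = objective(elites[0])
        if top_val > best_val:
            best_w, best_val = elites[0], top_val
        # Refit the sampling distribution to the elite set.
        mean = [statistics.fmean(col) for col in zip(*elites)]
        std = [statistics.pstdev(col) + 1e-3 for col in zip(*elites)]
    return best_w, best_val

# Hypothetical toy data: three retrieval channels scoring five items.
CHANNELS = [
    {"a": 0.9, "b": 0.7, "c": 0.1},
    {"c": 0.85, "d": 0.6, "a": 0.2},
    {"e": 0.7, "b": 0.3, "d": 0.2},
]
RELEVANT = {"c", "d"}
K = 2

def objective(weights):
    return recall_at_k(CHANNELS, weights, RELEVANT, K)

uniform_recall = objective([0.5, 0.5, 0.5])
best_weights, best_recall = cross_entropy_method(objective, dim=3)
print("uniform recall@2:", uniform_recall)
print("optimized recall@2:", best_recall)
```

Because the search starts from the uniform weights, the returned recall can never be worse than the uniform baseline on this toy problem.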
Ayman Asad Khan, Md Toufique Hasan, Kai Kristian Kemell, Jussi Rasku, Pekka Abrahamsson
This paper presents an experience report on the development of Retrieval
Augmented Generation (RAG) systems using PDF documents as the primary data
source. The RAG architecture combines generative capabilities of Large Language
Models (LLMs) with the precision of information retrieval. This approach has
the potential to redefine how we interact with and augment both structured and
unstructured knowledge in generative models to enhance transparency, accuracy,
and contextuality of responses. The paper details the end-to-end pipeline, from
data collection and preprocessing to retrieval indexing and response generation,
highlighting technical challenges and practical solutions. We aim to offer
insights to researchers and practitioners developing similar systems using two
distinct approaches: OpenAI's Assistant API with GPT Series and Llama's
open-source models. The practical implications of this research lie in
enhancing the reliability of generative AI systems in various sectors where
domain-specific knowledge and real-time information retrieval are important. The
Python code used in this work is also available at:
https://github.com/GPT-Laboratory/RAG-LLM-Development-Guidebook-from-PDFs.
Authors' comments: 36 pages, 8 figures, 2 tables, and python code snippets
Chen-Chi Chang, Han-Pi Chang, Hung-Shin Lee
In an era where cultural preservation is increasingly intertwined with
technological innovation, this study introduces a groundbreaking approach to
promoting and safeguarding the rich heritage of Taiwanese Hakka culture through
the development of a Retrieval-Augmented Generation (RAG)-enhanced chatbot.
Traditional large language models (LLMs), while powerful, often fall short in
delivering accurate and contextually rich responses, particularly in culturally
specific domains. By integrating external databases with generative AI models,
RAG technology bridges this gap, empowering chatbots to not only provide
precise answers but also resonate deeply with the cultural nuances that are
crucial for authentic interactions. This study delves into the intricate
process of augmenting the chatbot's knowledge base with targeted cultural data,
specifically curated to reflect the unique aspects of Hakka traditions,
language, and practices. Through dynamic information retrieval, the
RAG-enhanced chatbot becomes a versatile tool capable of handling complex
inquiries that demand an in-depth understanding of Hakka cultural context. This
is particularly significant in an age where digital platforms often dilute
cultural identities, making the role of culturally aware AI systems more
critical than ever. System usability studies conducted as part of our research
reveal a marked improvement in both user satisfaction and engagement,
highlighting the chatbot's effectiveness in fostering a deeper connection with
Hakka culture. The feedback underscores the potential of RAG technology to not
only enhance user experience but also to serve as a vital instrument in the
broader mission of ethnic mainstreaming and cultural celebration.
Authors' comments: Accepted to IEEE RASSE 2024
Kashob Kumar Roy, Pritom Saha Akash, Kevin Chen-Chuan Chang, Lucian Popa
Open-domain long-form text generation requires generating coherent,
comprehensive responses that address complex queries with both breadth and
depth. This task is challenging due to the need to accurately capture diverse
facets of input queries. Existing iterative retrieval-augmented generation
(RAG) approaches often struggle to delve deeply into each facet of complex
queries and integrate knowledge from various sources effectively. This paper
introduces ConTReGen, a novel framework that employs a context-driven,
tree-structured retrieval approach to enhance the depth and relevance of
retrieved content. ConTReGen integrates a hierarchical, top-down in-depth
exploration of query facets with a systematic bottom-up synthesis, ensuring
comprehensive coverage and coherent integration of multifaceted information.
Extensive experiments on multiple datasets, including LFQA and ODSUM, alongside
a newly introduced dataset, ODSUM-WikiHow, demonstrate that ConTReGen
outperforms existing state-of-the-art RAG models.
Authors' comments: Accepted at EMNLP'24 Findings
Xin Zhou, Ping Nie, Yiwen Guo, Haojie Wei, Zhanqiu Zhang, Pasquale Minervini, Ruotian Ma, Tao Gui et al.
Retrieval-Augmented Generation (RAG) has significantly improved the ability of Large Language Models (LLMs) to solve knowledge-intensive tasks. While existing research seeks to enhance RAG performance by retrieving higher-quality documents or designing RAG-specific LLMs, the internal mechanisms within LLMs that contribute to the effectiveness of RAG systems remain underexplored. In this paper, we aim to investigate these internal mechanisms within the popular Mixture-of-Experts (MoE)-based LLMs and demonstrate how to improve RAG by examining expert activations in these LLMs. Our controlled experiments reveal that several core groups of experts are primarily responsible for RAG-related behaviors. The activation of these core experts can signify the model's inclination towards external/internal knowledge and adjust its behavior. For instance, we identify core experts that can (1) indicate the sufficiency of the model's internal knowledge, (2) assess the quality of retrieved documents, and (3) enhance the model's ability to utilize context. Based on these findings, we propose several strategies to enhance RAG's efficiency and effectiveness through expert activation. Experimental results across various datasets and MoE-based LLMs show the effectiveness of our method.
Shang Wang, Tianqing Zhu, Dayong Ye, Wanlei Zhou
The deployment of large language models (LLMs) like ChatGPT and Gemini has
shown their powerful natural language generation capabilities. However, these
models can inadvertently learn and retain sensitive information and harmful
content during training, raising significant ethical and legal concerns. To
address these issues, machine unlearning has been introduced as a potential
solution. While existing unlearning methods take into account the specific
characteristics of LLMs, they often suffer from high computational demands,
limited applicability, or the risk of catastrophic forgetting. To address these
limitations, we propose a lightweight unlearning framework based on
Retrieval-Augmented Generation (RAG) technology. By modifying the external
knowledge base of RAG, we simulate the effects of forgetting without directly
interacting with the unlearned LLM. We approach the construction of unlearned
knowledge as a constrained optimization problem, deriving two key components
that underpin the effectiveness of RAG-based unlearning. This RAG-based
approach is particularly effective for closed-source LLMs, where existing
unlearning methods often fail. We evaluate our framework through extensive
experiments on both open-source and closed-source models, including ChatGPT,
Gemini, Llama-2-7b-chat-hf, and PaLM 2. The results demonstrate that our
approach meets five key unlearning criteria: effectiveness, universality,
harmlessness, simplicity, and robustness. Meanwhile, this approach can extend
to multimodal large language models and LLM-based agents.
Authors' comments: 15 pages, 9 figures, 9 tables
Yuankai Li, Jia-Chen Gu, Di Wu, Kai-Wei Chang, Nanyun Peng
Retrieval-augmented generation (RAG) can supplement large language models
(LLMs) by integrating external knowledge. However, as the number of retrieved
documents increases, the input length to LLMs grows linearly, causing a
dramatic increase in latency and a degradation in long-context understanding.
This is particularly serious for multi-hop questions that require a chain of
reasoning across documents. To accelerate inference, reduce costs, and minimize
distractions, this paper presents BRIEF (Bridging Retrieval and Inference
through Evidence Fusion), a lightweight approach that performs query-aware
multi-hop reasoning by compressing retrieved documents into highly dense
textual summaries to integrate into in-context RAG. To enable learning
compression for multi-hop reasoning, we curate synthetic data by extracting
atomic propositions that encapsulate distinct factoids from the source
documents to compose synthetic summaries. Based on our synthetic data built
entirely by open-source models, BRIEF generates more concise summaries and
enables a range of LLMs to achieve exceptional open-domain question answering
(QA) performance. For example, on HotpotQA, BRIEF improves the compression rate
by a factor of 2 compared to the state-of-the-art baseline, while outperforming it by
3.00% EM and 4.16% F1 with Flan-UL2 as the reader model. It also generates more
concise summaries than proprietary GPT-3.5, while demonstrating nearly
identical QA performance.
Authors' comments: Accepted by NAACL 2025 Findings. Project page:
https://jasonforjoy.github.io/BRIEF/
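The EM and F1 figures quoted in the abstract above are answer-level QA metrics. A minimal Python sketch follows, assuming the SQuAD-style definitions commonly used for HotpotQA (lowercasing, stripping punctuation and articles, token-overlap F1); the abstract does not spell out its evaluation script, so the normalization rules and function names here are assumptions.

```python
import re
import string
from collections import Counter

def normalize_answer(text):
    """Lowercase, drop punctuation and English articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())

def exact_match(prediction, gold):
    """1.0 if the normalized strings are identical, else 0.0."""
    return float(normalize_answer(prediction) == normalize_answer(gold))

def f1_score(prediction, gold):
    """Harmonic mean of token-level precision and recall."""
    pred_tokens = normalize_answer(prediction).split()
    gold_tokens = normalize_answer(gold).split()
    if not pred_tokens or not gold_tokens:
        # Both empty counts as a match; one empty counts as a miss.
        return float(pred_tokens == gold_tokens)
    overlap = sum((Counter(pred_tokens) & Counter(gold_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)

em_example = exact_match("The Eiffel Tower.", "eiffel tower")
f1_example = f1_score("Eiffel Tower in Paris", "Eiffel Tower")
print(em_example, round(f1_example, 4))
```

In the second example the prediction contains both gold tokens plus two extra ones, so precision is 2/4, recall is 2/2, and F1 is 2/3.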
Hao-Tang Tsui, Chien-Yao Wang, Hong-Yuan Mark Liao
Identifying and localizing objects within images is a fundamental challenge, and numerous efforts have been made to enhance model accuracy by experimenting with diverse architectures and refining training strategies. Nevertheless, a prevalent limitation of existing models is that they overemphasize the current input while ignoring information from the entire dataset. We introduce an innovative Retriever-Dictionary (RD) module to address this issue. This architecture enables YOLO-based models to efficiently retrieve features from a Dictionary that contains the insight of the dataset, which is built from the knowledge of Visual Models (VM), Large Language Models (LLM), or Visual Language Models (VLM). The flexible RD enables the model to incorporate such explicit knowledge, which benefits multiple tasks from the pixel level to the image level, specifically segmentation, detection, and classification. The experiments show that using the RD significantly improves model performance, achieving more than a 3% increase in mean Average Precision for object detection with less than a 1% increase in model parameters. Beyond 1-stage object detection models, the RD module improves the effectiveness of 2-stage models and DETR-based architectures, such as Faster R-CNN and Deformable DETR. Code is released at https://github.com/henrytsui000/YOLO.
Zichen Wang, Yaokun Ji, Jianing Tian, Shuangjia Zheng
Antibodies are essential proteins responsible for immune responses in organisms, capable of specifically recognizing antigen molecules of pathogens. Recent advances in generative models have significantly enhanced rational antibody design. However, existing methods mainly create antibodies from scratch without template constraints, leading to model optimization challenges and unnatural sequences. To address these issues, we propose a retrieval-augmented diffusion framework, termed RADAb, for efficient antibody design. Our method leverages a set of structural homologous motifs that align with query structural constraints to guide the generative model in inversely optimizing antibodies according to desired design criteria. Specifically, we introduce a structure-informed retrieval mechanism that integrates these exemplar motifs with the input backbone through a novel dual-branch denoising module, utilizing both structural and evolutionary information. Additionally, we develop a conditional diffusion model that iteratively refines the optimization process by incorporating both global context and local evolutionary conditions. Our approach is agnostic to the choice of generative models. Empirical experiments demonstrate that our method achieves state-of-the-art performance in multiple antibody inverse folding and optimization tasks, offering a new perspective on biomolecular generative models.
Jiajing Chen, Runyuan Bao, Hongye Zheng, Zhen Qi, Jianjun Wei, Jiacheng Hu
This study aims to improve the accuracy and quality of large language models (LLMs) in answering questions by integrating Elasticsearch into the Retrieval Augmented Generation (RAG) framework. The experiment uses the Stanford Question Answering Dataset (SQuAD) version 2.0 as the test dataset and compares the performance of different retrieval methods, including traditional methods based on keyword matching or semantic similarity calculation, BM25-RAG and TF-IDF-RAG, and the newly proposed ES-RAG scheme. The results show that ES-RAG not only has obvious advantages in retrieval efficiency but also performs well in key indicators such as accuracy, which is 0.51 percentage points higher than TF-IDF-RAG. In addition, Elasticsearch's powerful search capabilities and rich configuration options enable the entire question-answering system to better handle complex queries and provide more flexible and efficient responses based on the diverse needs of users. Future research can further explore how to optimize the interaction mechanism between Elasticsearch and LLMs, such as introducing higher-level semantic understanding and context-awareness capabilities, to achieve a more intelligent and human-like question-answering experience.
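For readers unfamiliar with the keyword-matching baselines named in the abstract above, here is a minimal Python sketch of BM25 scoring. It uses the common Okapi form with the Lucene-style smoothed IDF and the usual defaults k1 = 1.2 and b = 0.75; the whitespace tokenizer, the toy documents, and the function names are assumptions, and this is not the paper's BM25-RAG or ES-RAG pipeline.

```python
import math
from collections import Counter

def tokenize(text):
    """Hypothetical tokenizer: lowercase and split on whitespace."""
    return text.lower().split()

def bm25_scores(query, documents, k1=1.2, b=0.75):
    """Return one BM25 score per document for the given query."""
    docs = [tokenize(d) for d in documents]
    n_docs = len(docs)
    avg_len = sum(len(d) for d in docs) / n_docs
    # Document frequency: number of documents containing each term.
    doc_freq = Counter(term for d in docs for term in set(d))
    scores = []
    for d in docs:
        tf = Counter(d)
        score = 0.0
        for term in tokenize(query):
            if term not in tf:
                continue
            idf = math.log(
                1 + (n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5)
            )
            # Term-frequency saturation with document-length normalization.
            norm = tf[term] + k1 * (1 - b + b * len(d) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores

DOCUMENTS = [
    "the cat sat on the mat",
    "dogs chase cats",
    "elasticsearch ranks documents with bm25",
]
scores = bm25_scores("bm25 elasticsearch", DOCUMENTS)
ranking = sorted(range(len(DOCUMENTS)), key=lambda i: -scores[i])
print(ranking[0], scores)
```

Only the third document contains the query terms, so it is ranked first and the other two score zero.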
Muhe Ding, Yang Ma, Pengda Qin, Jianlong Wu, Yuhong Li, Liqiang Nie
Multimodal Large Language Models (MLLMs) have recently received substantial
interest, which shows their emerging potential as general-purpose models for
various vision-language tasks. MLLMs involve significant external knowledge
within their parameters; however, it is challenging to continually update these
models with the latest knowledge, which involves huge computational costs and
poor interpretability. Retrieval augmentation techniques have proven to be
effective plugins for both LLMs and MLLMs. In this study, we propose multimodal
adaptive Retrieval-Augmented Bootstrapping Language-Image Pre-training
(RA-BLIP), a novel retrieval-augmented framework for various MLLMs. Considering
the redundant information within vision modality, we first leverage the
question to instruct the extraction of visual information through interactions
with one set of learnable queries, minimizing irrelevant interference during
retrieval and generation. Besides, we introduce a pre-trained multimodal
adaptive fusion module to achieve question text-to-multimodal retrieval and
integration of multimodal knowledge by projecting visual and language
modalities into a unified semantic space. Furthermore, we present an Adaptive
Selection Knowledge Generation (ASKG) strategy to train the generator to
autonomously discern the relevance of retrieved knowledge, which realizes
excellent denoising performance. Extensive experiments on open multimodal
question-answering datasets demonstrate that RA-BLIP achieves strong
performance and surpasses the state-of-the-art retrieval-augmented models.
Authors' comments: 10 pages, 6 figures, Journal
Simon Lupart, Mohammad Aliannejadi, Evangelos Kanoulas
Conversational Search (CS) involves retrieving relevant documents from a
corpus while considering the conversational context, integrating retrieval with
context modeling. Recent advancements in Large Language Models (LLMs) have
significantly enhanced CS by enabling query rewriting based on conversational
context. However, employing LLMs during inference poses efficiency challenges.
Existing solutions mitigate this issue by distilling embeddings derived from
human-rewritten queries, focusing primarily on learning the context modeling
task. These methods, however, often separate the contrastive retrieval task
from the distillation process, treating it as an independent loss term. To
overcome these limitations, we introduce DiSCo (Distillation of Sparse
Conversational retrieval), a novel approach that unifies retrieval and context
modeling through a relaxed distillation objective. Instead of relying
exclusively on representation learning, our method distills similarity scores
between conversations and documents, providing more freedom in the
representation space and better leveraging the contrastive nature of document
relevance. Extensive experiments on Learned Sparse Retrieval (LSR) across five
CS datasets demonstrate that DiSCo achieves substantial improvements in both
in-domain and out-of-domain retrieval tasks, achieving up to a six-point gain
in recall for out-of-domain datasets over state-of-the-art methods.
Additionally, DiSCo employs a multi-teacher distillation strategy, using
multiple LLMs as teachers, further enhancing performance and surpassing the
individual teachers in in-domain settings. Furthermore, analysis of model
sparsity reveals that DiSCo allows for more effective control over the sparsity
of the trained models.
Authors' comments: 11 pages, 6 figures. SIGIR '25 Proceedings of the 48th International
ACM SIGIR Conference on Research and Development in Information Retrieval
July 13-18, 2025, Padua, Italy
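The contrast the abstract above draws between representation distillation and score distillation can be sketched in a few lines of Python. The KL divergence over softmax-normalized scores, the squared-error representation loss, the toy vectors, and the function names are all assumptions made for illustration; the abstract only says that similarity scores between conversations and documents are distilled.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def kl_divergence(p, q):
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def representation_loss(student_query, teacher_query):
    """Squared error between embeddings: forces the student to copy the teacher."""
    return sum((s - t) ** 2 for s, t in zip(student_query, teacher_query))

def score_distillation_loss(student_query, teacher_query, docs):
    """KL between teacher and student score distributions over the documents."""
    teacher_scores = [dot(teacher_query, d) for d in docs]
    student_scores = [dot(student_query, d) for d in docs]
    return kl_divergence(softmax(teacher_scores), softmax(student_scores))

# Hypothetical sparse vectors: the documents never use the third dimension.
DOCS = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
TEACHER = [2.0, 1.0, 0.0]
STUDENT = [2.0, 1.0, 5.0]  # differs from the teacher only where no document has weight

rep_loss = representation_loss(STUDENT, TEACHER)
score_loss = score_distillation_loss(STUDENT, TEACHER, DOCS)
print(rep_loss, score_loss)
```

The student here ranks the documents exactly as the teacher does, so the score-based loss is zero, while the representation loss still penalizes it. That gap is one way to read the "more freedom in the representation space" claim.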
Simone Conia, Daniel Lee, Min Li, Umar Farooq Minhas, Saloni Potdar, Yunyao Li
Translating text that contains entity names is a challenging task, as
cultural-related references can vary significantly across languages. These
variations may also be caused by transcreation, an adaptation process that
entails more than transliteration and word-for-word translation. In this paper,
we address the problem of cross-cultural translation on two fronts: (i) we
introduce XC-Translate, the first large-scale, manually-created benchmark for
machine translation that focuses on text that contains potentially
culturally-nuanced entity names, and (ii) we propose KG-MT, a novel end-to-end
method to integrate information from a multilingual knowledge graph into a
neural machine translation model by leveraging a dense retrieval mechanism. Our
experiments and analyses show that current machine translation systems and
large language models still struggle to translate texts containing entity
names, whereas KG-MT outperforms state-of-the-art approaches by a large margin,
obtaining a 129% and 62% relative improvement compared to NLLB-200 and GPT-4,
respectively.
Authors' comments: Accepted at EMNLP 2024
Dongfang Zhao
This paper introduces Federated Retrieval-Augmented Generation (FRAG), a novel database management paradigm tailored for the growing needs of retrieval-augmented generation (RAG) systems, which are increasingly powered by large language models (LLMs). FRAG enables mutually-distrusted parties to collaboratively perform Approximate k-Nearest Neighbor (ANN) searches on encrypted query vectors and encrypted data stored in distributed vector databases, all while ensuring that no party can gain any knowledge about the queries or data of others. Achieving this paradigm presents two key challenges: (i) ensuring strong security guarantees, such as Indistinguishability under Chosen-Plaintext Attack (IND-CPA), under practical assumptions (e.g., we avoid overly optimistic assumptions like non-collusion among parties); and (ii) maintaining performance overheads comparable to traditional, non-federated RAG systems. To address these challenges, FRAG employs a single-key homomorphic encryption protocol that simplifies key management across mutually-distrusted parties. Additionally, FRAG introduces a multiplicative caching technique to efficiently encrypt floating-point numbers, significantly improving computational performance in large-scale federated environments. We provide a rigorous security proof using standard cryptographic reductions and demonstrate the practical scalability and efficiency of FRAG through extensive experiments on both benchmark and real-world datasets.
Sreyan Ghosh, Mohammad Sadegh Rasooli, Michael Levit, Peidong Wang, Jian Xue, Dinesh Manocha, Jinyu Li
Generative Error Correction (GEC) has emerged as a powerful post-processing
method to enhance the performance of Automatic Speech Recognition (ASR)
systems. However, we show that GEC models struggle to generalize beyond the
specific types of errors encountered during training, limiting their ability to
correct new, unseen errors at test time, particularly in out-of-domain (OOD)
scenarios. This phenomenon amplifies with named entities (NEs), where, in
addition to insufficient contextual information or knowledge about the NEs,
novel NEs keep emerging. To address these issues, we propose DARAG (Data- and
Retrieval-Augmented Generative Error Correction), a novel approach designed to
improve GEC for ASR in in-domain (ID) and OOD scenarios. We augment the GEC
training dataset with synthetic data generated by prompting LLMs and
text-to-speech models, thereby simulating additional errors from which the
model can learn. For OOD scenarios, we simulate test-time errors from new
domains similarly and in an unsupervised fashion. Additionally, to better
handle named entities, we introduce retrieval-augmented correction by
augmenting the input with entities retrieved from a database. Our approach is
simple, scalable, and both domain- and language-agnostic. We experiment on
multiple datasets and settings, showing that DARAG outperforms all our
baselines, achieving 8% to 30% relative WER improvements in ID and 10% to
33% improvements in OOD settings.
Authors' comments: Preprint. Under Review
Shangda Wu, Yashan Wang, Ruibin Yuan, Zhancheng Guo, Xu Tan, Ge Zhang, Monan Zhou, Jing Chen et al.
Current music information retrieval systems face challenges in managing
linguistic diversity and integrating various musical modalities. These
limitations reduce their effectiveness in a global, multimodal music
environment. To address these issues, we introduce CLaMP 2, a system compatible
with 101 languages that supports both ABC notation (a text-based musical
notation format) and MIDI (Musical Instrument Digital Interface) for music
information retrieval. CLaMP 2, pre-trained on 1.5 million ABC-MIDI-text
triplets, includes a multilingual text encoder and a multimodal music encoder
aligned via contrastive learning. By leveraging large language models, we
obtain refined and consistent multilingual descriptions at scale, significantly
reducing textual noise and balancing language distribution. Our experiments
show that CLaMP 2 achieves state-of-the-art results in both multilingual
semantic search and music classification across modalities, thus establishing a
new standard for inclusive and global music information retrieval.
Authors' comments: 17 pages, 10 figures, 4 tables, accepted by NAACL 2025
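The contrastive alignment of text and music encoders mentioned in the abstract above is commonly implemented as a symmetric InfoNCE loss over a batch of paired embeddings. The Python sketch below assumes that formulation, cosine similarity, and a temperature of 0.07; the abstract gives none of these details, and the toy embeddings and function names are hypothetical.

```python
import math

def normalize(v):
    """Scale a vector to unit length so dot products are cosine similarities."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def cross_entropy_row(logits, target):
    """Negative log-softmax of the target entry, computed stably."""
    m = max(logits)
    log_sum = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_sum - logits[target]

def contrastive_loss(text_embs, music_embs, temperature=0.07):
    """Symmetric InfoNCE: pair i of text should match pair i of music."""
    t = [normalize(v) for v in text_embs]
    m = [normalize(v) for v in music_embs]
    sims = [[sum(a * b for a, b in zip(ti, mj)) / temperature for mj in m]
            for ti in t]
    n = len(sims)
    text_to_music = sum(cross_entropy_row(sims[i], i) for i in range(n)) / n
    music_to_text = sum(
        cross_entropy_row([sims[i][j] for i in range(n)], j) for j in range(n)
    ) / n
    return (text_to_music + music_to_text) / 2

# Hypothetical two-pair batch: matched pairs point in similar directions.
TEXT = [[1.0, 0.0], [0.0, 1.0]]
MUSIC = [[0.9, 0.1], [0.1, 0.9]]

aligned_loss = contrastive_loss(TEXT, MUSIC)
shuffled_loss = contrastive_loss(TEXT, MUSIC[::-1])
print(aligned_loss < shuffled_loss)
```

The loss is low when each description is closest to its own piece of music and high when the pairs are shuffled, which is the signal that pulls the two encoders into a shared space.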