Ali Asgarov, Samir Rustamov
This research explores the development of multimodal vision-language models for image retrieval in low-resource languages, specifically Azerbaijani. Existing vision-language models primarily support high-resource languages, and fine-tuning them remains computationally demanding. To address challenges in vision-language retrieval for low-resource languages, we integrated the CLIP model architecture and employed several techniques to balance computational efficiency with performance. These techniques include synthetic data generation through machine translation, image augmentation, and further training the attention mechanisms of transformer-based models with domain-specific data. We integrated Multilingual BERT as a text encoder with image encoders like ResNet50, EfficientNet0, Vision Transformer (ViT), and Tiny Swin Transformer. Our study found that models like EfficientNet0 and Tiny Swin Transformer perform best on the datasets they were trained on, such as COCO, Flickr30k, and Flickr8k. Augmentation techniques boosted EfficientNet0 MAP on Flickr30k from 0.84 to 0.87 and ResNet50 MAP on MSCOCO from 0.70 to 0.80, contributing to a new state of the art in vision-language retrieval. We share our configurations and results to support further research. Code and pre-trained models are available at https://github.com/aliasgerovs/azclip.
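The MAP figures quoted in the abstract above are mean average precision scores. As a minimal sketch, assuming the conventional definition (the paper's exact evaluation protocol is not stated here), the metric can be computed as follows. The image identifiers and rankings are hypothetical toy data, not results from the paper.

```python
from typing import Iterable, Sequence, Set, Tuple


def average_precision(ranked_ids: Sequence[str], relevant: Set[str]) -> float:
    """Average precision for one query: mean of precision@k over the relevant hits."""
    if not relevant:
        return 0.0
    hits = 0
    precision_sum = 0.0
    for rank, item_id in enumerate(ranked_ids, start=1):
        if item_id in relevant:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / len(relevant)


def mean_average_precision(runs: Iterable[Tuple[Sequence[str], Set[str]]]) -> float:
    """Mean of per-query average precision over (ranking, relevant_set) pairs."""
    runs = list(runs)
    if not runs:
        return 0.0
    return sum(average_precision(ranking, rel) for ranking, rel in runs) / len(runs)


# Hypothetical toy data: two caption queries, each with one matching image.
runs = [
    (["img_a", "img_b", "img_c"], {"img_a"}),  # relevant image ranked first
    (["img_b", "img_c", "img_a"], {"img_c"}),  # relevant image ranked second
]
map_score = mean_average_precision(runs)
```

With one relevant image per query, average precision reduces to the reciprocal rank of that image, so small rank improvements near the top of the list move MAP noticeably.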
Wenrui Li, Wei Han, Yandu Chen, Yeyu Chai, Yidan Lu, Xingtao Wang, Xiaopeng Fan
Due to the challenges in acquiring paired Text-3D data and the inherent irregularity of 3D data structures, combined representation learning of 3D point clouds and text remains unexplored. In this paper, we propose a novel Riemann-based Multi-scale Attention Reasoning Network (RMARN) for text-3D retrieval. Specifically, the extracted text and point cloud features are refined by their respective Adaptive Feature Refiner (AFR). Furthermore, we introduce the innovative Riemann Local Similarity (RLS) module and the Global Pooling Similarity (GPS) module. However, as 3D point cloud data and text data often possess complex geometric structures in high-dimensional space, the proposed RLS employs a novel Riemann Attention Mechanism to reflect the intrinsic geometric relationships of the data. Without explicitly defining the manifold, RMARN learns the manifold parameters to better represent the distances between text-point cloud samples. To address the challenges of lacking paired text-3D data, we have created the large-scale Text-3D Retrieval dataset T3DR-HIT, which comprises over 3,380 pairs of text and point cloud data. T3DR-HIT contains coarse-grained indoor 3D scenes and fine-grained Chinese artifact scenes, consisting of 1,380 and over 2,000 text-3D pairs, respectively. Experiments on our custom datasets demonstrate the superior performance of the proposed method. Our code and proposed datasets are available at https://github.com/liwrui/RMARN.
Dillon Davis, Huiji Gao, Weiwei Guo, Thomas Legrand, Malay Haldar, Alex Deng, Han Zhao, Liwei He et al.
The Airbnb search system grapples with many unique challenges as it continues to evolve. We oversee a marketplace that is nuanced by geography, diversity of homes, and guests with a variety of preferences. Crafting an efficient search system that can accommodate diverse guest needs, while showcasing relevant homes lies at the heart of Airbnb's success. Airbnb search has many challenges that parallel other recommendation and search systems but it has a unique information retrieval problem, upstream of ranking, called location retrieval. It requires defining a topological map area that is relevant to the searched query for homes listing retrieval. The purpose of this paper is to demonstrate the methodology, challenges, and impact of building a machine learning based location retrieval product from the ground up. Despite the lack of suitable, prevalent machine learning based approaches, we tackle cold start, generalization, differentiation and algorithmic bias. We detail the efficacy of heuristics, statistics, machine learning, and reinforcement learning approaches to solve these challenges, particularly for systems that are often unexplored by current literature.
Chenghua Gao, Min Li, Jianshuo Liu, Junxing Ren, Lin Chen, Haoyu Liu, Bo Meng, Jitao Fu et al.
Video Moment Retrieval (VMR) aims to retrieve relevant moments of an
untrimmed video corresponding to the query. While cross-modal interaction
approaches have shown progress in filtering out query-irrelevant information in
videos, they assume the precise alignment between the query semantics and the
corresponding video moments, potentially overlooking the misunderstanding of
the natural language semantics. To address this challenge, we propose a novel
model called QD-VMR, a query debiasing model with enhanced contextual
understanding. Firstly, we leverage a Global Partial Aligner module via video
clip and query features alignment and video-query contrastive learning to
enhance the cross-modal understanding capabilities of the model. Subsequently,
we employ a Query Debiasing Module to obtain debiased query features
efficiently, and a Visual Enhancement module to refine the video features
related to the query. Finally, we adopt the DETR structure to predict the
possible target video moments. Through extensive evaluations on three benchmark
datasets, QD-VMR achieves state-of-the-art performance, proving its potential
to improve the accuracy of VMR. Further analytical experiments demonstrate the
effectiveness of our proposed module. Our code will be released to facilitate
future research.
Authors' comments: 9 pages, 4 figures, 4 tables
Kun Luo, Minghao Qin, Zheng Liu, Shitao Xiao, Jun Zhao, Kang Liu
Pretrained language models like BERT and T5 serve as crucial backbone
encoders for dense retrieval. However, these models often exhibit limited
generalization capabilities and face challenges in improving in-domain
accuracy. Recent research has explored using large language models (LLMs) as
retrievers, achieving SOTA performance across various tasks. Despite these
advancements, the specific benefits of LLMs over traditional retrievers and the
impact of different LLM configurations, such as parameter sizes, pretraining
duration, and alignment processes on retrieval tasks remain unclear. In this
work, we conduct a comprehensive empirical study on a wide range of retrieval
tasks, including in-domain accuracy, data efficiency, zero-shot generalization,
lengthy retrieval, instruction-based retrieval, and multi-task learning. We
evaluate over 15 different backbone LLMs and non-LLMs. Our findings reveal that
larger models and extensive pretraining consistently enhance in-domain accuracy
and data efficiency. Additionally, larger models demonstrate significant
potential in zero-shot generalization, lengthy retrieval, instruction-based
retrieval, and multi-task learning. These results underscore the advantages of
LLMs as versatile and effective backbone encoders in dense retrieval, providing
valuable insights for future research and development in this field.
Authors' comments: Submitted to EMNLP24
Shiyue Zhang, Zheng Chong, Xujie Zhang, Hanhui Li, Yuhao Cheng, Yiqiang Yan, Xiaodan Liang
General text-to-image models bring revolutionary innovation to the fields of
arts, design, and media. However, when applied to garment generation, even the
state-of-the-art text-to-image models suffer from fine-grained semantic
misalignment, particularly concerning the quantity, position, and
interrelations of garment components. Addressing this, we propose
GarmentAligner, a text-to-garment diffusion model trained with
retrieval-augmented multi-level corrections. To achieve semantic alignment at
the component level, we introduce an automatic component extraction pipeline to
obtain spatial and quantitative information of garment components from
corresponding images and captions. Subsequently, to exploit component
relationships within the garment images, we construct retrieval subsets for
each garment by retrieval augmentation based on component-level similarity
ranking and conduct contrastive learning to enhance the model perception of
components from positive and negative samples. To further enhance the alignment
of components across semantic, spatial, and quantitative granularities, we
propose the utilization of multi-level correction losses that leverage detailed
component information. The experimental findings demonstrate that
GarmentAligner achieves superior fidelity and fine-grained semantic alignment
when compared to existing competitors.
Authors' comments: Accepted by ECCV 2024
Jie Wu, Zhaochun Ren, Suzan Verberne
In this paper, we analyze the capabilities of the multi-lingual Dense Passage Retriever (mDPR) for extremely low-resource languages. In the Cross-lingual Open-Retrieval Answer Generation (CORA) pipeline, mDPR achieves success on multilingual open QA benchmarks across 26 languages, of which 9 were unseen during training. These results are promising for Question Answering (QA) for low-resource languages. We focus on two extremely low-resource languages for which mDPR performs poorly: Amharic and Khmer. We collect and curate datasets to train mDPR models using Translation Language Modeling (TLM) and question-passage alignment. We also investigate the effect of our extension on the language distribution in the retrieval results. Our results on the MKQA and AmQA datasets show that language alignment brings improvements to mDPR for the low-resource languages, but the improvements are modest and the results remain low. We conclude that fulfilling CORA's promise to enable multilingual open QA in extremely low-resource settings is challenging because the model, the data, and the evaluation approach are intertwined. Hence, all three need attention in follow-up work. We release our code for reproducibility and future work: https://anonymous.4open.science/r/Question-Answering-for-Low-Resource-Languages-B13C/
Omar Erak, Nouf Alabbasi, Omar Alhussein, Ismail Lotfi, Amr Hussein, Sami Muhaidat, Merouane Debbah
Recent studies show that large language models (LLMs) struggle with technical
standards in telecommunications. We propose a fine-tuned retrieval-augmented
generation (RAG) system based on the Phi-2 small language model (SLM) to serve
as an oracle for communication networks. Our developed system leverages
forward-looking semantic chunking to adaptively determine parsing breakpoints
based on embedding similarity, enabling effective processing of diverse
document formats. To handle the challenge of multiple similar contexts in
technical standards, we employ a re-ranking algorithm to prioritize the most
relevant retrieved chunks. Recognizing the limitations of Phi-2's small context
window, we implement a recent technique, namely SelfExtend, to expand the
context window during inference, which not only boosts the performance but also
can accommodate a wider range of user queries and design requirements from
customers to specialized technicians. For fine-tuning, we utilize the low-rank
adaptation (LoRA) technique to enhance computational efficiency during training
and enable effective fine-tuning on small datasets. Our comprehensive
experiments demonstrate substantial improvements over existing
question-answering approaches in the telecom domain, achieving performance that
exceeds larger language models such as GPT-4 (which is about 880 times larger
in size). This work presents a novel approach to leveraging SLMs for
communication networks, offering a balance of efficiency and performance. This
work can serve as a foundation towards agentic language models for networks.
Authors' comments: submitted to Proc. IEEE Globecom
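The abstract above describes choosing chunk breakpoints from embedding similarity. The following is a minimal sketch of that general idea, not the authors' forward-looking variant, whose details the abstract does not give. It assumes a chunk boundary is placed wherever the cosine similarity between adjacent sentences drops below a threshold. The `toy_embed` function, the threshold value, and the example sentences are all hypothetical stand-ins; a real system would use a learned sentence encoder.

```python
import math
import re
from collections import Counter
from typing import Callable, List, Optional


def toy_embed(sentence: str) -> Counter:
    """Stand-in embedding: lowercase bag-of-words counts (assumption, not a real encoder)."""
    return Counter(re.findall(r"[a-z0-9]+", sentence.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def semantic_chunks(
    sentences: List[str],
    threshold: float = 0.2,
    embed: Callable[[str], Counter] = toy_embed,
) -> List[List[str]]:
    """Start a new chunk whenever adjacent sentences fall below the similarity threshold."""
    chunks: List[List[str]] = []
    current: List[str] = []
    prev_vec: Optional[Counter] = None
    for sentence in sentences:
        vec = embed(sentence)
        if current and prev_vec is not None and cosine(prev_vec, vec) < threshold:
            chunks.append(current)
            current = []
        current.append(sentence)
        prev_vec = vec
    if current:
        chunks.append(current)
    return chunks


# Hypothetical example text: two related sentences followed by an unrelated one.
sentences = [
    "The handover procedure starts at the source cell.",
    "The source cell forwards the handover request to the target cell.",
    "Billing records are exported nightly.",
]
chunks = semantic_chunks(sentences)
```

Here the first two sentences share several terms and stay together, while the third shares none and opens a new chunk. The threshold controls how aggressively the text is split.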
Priyanka Mandikal
LLMs have revolutionized the landscape of information retrieval and knowledge
dissemination. However, their application in specialized areas is often
hindered by factual inaccuracies and hallucinations, especially in long-tail
knowledge distributions. We explore the potential of retrieval-augmented
generation (RAG) models for long-form question answering (LFQA) in a
specialized knowledge domain. We present VedantaNY-10M, a dataset curated from
extensive public discourses on the ancient Indian philosophy of Advaita
Vedanta. We develop and benchmark a RAG model against a standard, non-RAG LLM,
focusing on transcription, retrieval, and generation performance. Human
evaluations by computational linguists and domain experts show that the RAG
model significantly outperforms the standard model in producing factual and
comprehensive responses having fewer hallucinations. In addition, a
keyword-based hybrid retriever that emphasizes unique low-frequency terms
further improves results. Our study provides insights into effectively
integrating modern large language models with ancient knowledge systems.
Project page with dataset and code: https://sites.google.com/view/vedantany-10m
Authors' comments: Outstanding Paper at the Machine Learning for Ancient Languages
Workshop, 2024.ml4al-1.23, Association for Computational Linguistics (ACL)
2024
Xuanwang Zhang, Yunze Song, Yidong Wang, Shuyun Tang, Xinfeng Li, Zhengran Zeng, Zhen Wu, Wei Ye et al.
Large Language Models (LLMs) demonstrate human-level capabilities in
dialogue, reasoning, and knowledge retention. However, even the most advanced
LLMs face challenges such as hallucinations and real-time updating of their
knowledge. Current research addresses this bottleneck by equipping LLMs with
external knowledge, a technique known as Retrieval Augmented Generation (RAG).
However, two key issues have constrained the development of RAG. First, there is a
growing lack of comprehensive and fair comparisons between novel RAG
algorithms. Second, open-source tools such as LlamaIndex and LangChain employ
high-level abstractions, which results in a lack of transparency and limits the
ability to develop novel algorithms and evaluation metrics. To close this gap,
we introduce RAGLAB, a modular and research-oriented open-source library.
RAGLAB reproduces 6 existing algorithms and provides a comprehensive ecosystem
for investigating RAG algorithms. Leveraging RAGLAB, we conduct a fair
comparison of 6 RAG algorithms across 10 benchmarks. With RAGLAB, researchers
can efficiently compare the performance of various algorithms and develop novel
algorithms.
Authors' comments: 6 pages, 3 figures
Jiheng Liang, Ziru Yu, Zujie Xie, Xiangyang Yu
Large Language Models (LLMs) have demonstrated significant success in a range of
natural language processing (NLP) tasks within the general domain. The emergence of
LLMs has introduced innovative methodologies across diverse fields, including
the natural sciences. Researchers aim to implement automated, concurrent
processes driven by LLMs to supplant conventional manual, repetitive and
labor-intensive work. In the domain of spectral analysis and detection, it is
imperative for researchers to autonomously acquire pertinent knowledge across
various research objects, which encompasses the spectroscopic techniques and
the chemometric methods that are employed in experiments and analysis.
Paradoxically, despite the recognition of spectroscopic detection as an
effective analytical method, the fundamental process of knowledge retrieval
remains both time-intensive and repetitive. In response to this challenge, we
first introduced the Spectral Detection and Analysis Based Paper (SDAAP)
dataset, which is the first open-source textual knowledge dataset for spectral
analysis and detection and contains annotated literature data as well as
corresponding knowledge instruction data. Subsequently, we also designed an
automated Q&A framework based on the SDAAP dataset, which can retrieve
relevant knowledge and generate high-quality responses by extracting entities
in the input as retrieval parameters. Notably, within this
framework, the LLM is used only as a tool to provide generalizability, while the RAG
technique is used to accurately capture the source of the knowledge. This
approach not only improves the quality of the generated responses, but also
ensures the traceability of the knowledge. Experimental results show that our
framework generates responses with more reliable expertise compared to the
baseline.
Authors' comments: 16 pages, 10 figures, 3 tables
Rounak Meyur, Hung Phan, Sridevi Wagle, Jan Strube, Mahantesh Halappanavar, Sameera Horawalavithana, Anurag Acharya, Sai Munikoti
Wind energy project assessments present significant challenges for
decision-makers, who must navigate and synthesize hundreds of pages of
environmental and scientific documentation. These documents often span
different regions and project scales, covering multiple domains of expertise.
This process traditionally demands immense time and specialized knowledge from
decision-makers. The advent of Large Language Models (LLMs) and Retrieval
Augmented Generation (RAG) approaches offers a transformative solution, enabling
rapid, accurate cross-document information retrieval and synthesis. As the
landscape of Natural Language Processing (NLP) and text generation continues to
evolve, benchmarking becomes essential to evaluate and compare the performance
of different RAG-based LLMs. In this paper, we present a comprehensive
framework to generate a domain relevant RAG benchmark. Our framework is based
on automatic question-answer generation with Human (domain experts)-AI (LLM)
teaming. As a case study, we demonstrate the framework by introducing WeQA, a
first-of-its-kind benchmark for the wind energy domain, which comprises
multiple scientific documents/reports related to environmental aspects of wind
energy projects. Our framework systematically evaluates RAG performance using
diverse metrics and multiple question types with varying complexity levels,
providing a foundation for rigorous assessment of RAG-based systems in complex
scientific domains and enabling researchers to identify areas for improvement
in domain-specific applications.
Authors' comments: 8 pages without Limitation and References
Adel Elmahdy, Sheng-Chieh Lin, Amin Ahmad
Information retrieval across different languages is an increasingly important
challenge in natural language processing. Recent approaches based on
multilingual pre-trained language models have achieved remarkable success, yet
they often optimize for either monolingual, cross-lingual, or multilingual
retrieval performance at the expense of others. This paper proposes a novel
hybrid batch training strategy to simultaneously improve zero-shot retrieval
performance across monolingual, cross-lingual, and multilingual settings while
mitigating language bias. The approach fine-tunes multilingual language models
using a mix of monolingual and cross-lingual question-answer pair batches
sampled based on dataset size. Experiments on XQuAD-R, MLQA-R, and MIRACL
benchmark datasets show that the proposed method consistently achieves
comparable or superior results in zero-shot retrieval across various languages
and retrieval tasks compared to monolingual-only or cross-lingual-only
training. Hybrid batch training also substantially reduces language bias in
multilingual retrieval compared to monolingual training. These results
demonstrate the effectiveness of the proposed approach for learning
language-agnostic representations that enable strong zero-shot retrieval
performance across diverse languages.
Authors' comments: 15 pages, 2 figures, 13 tables
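The hybrid batch strategy in the abstract above mixes monolingual and cross-lingual batches, sampled based on dataset size. The following is a minimal sketch of one plausible reading, assuming each batch is drawn entirely from one dataset and that dataset is picked with probability proportional to its size. The dataset names, tuple layout, and the `sample_hybrid_batches` helper are hypothetical and not taken from the paper.

```python
import random
from typing import Dict, List, Tuple

Example = Tuple[str, str, int]  # (question language, answer language, example id)


def sample_hybrid_batches(
    datasets: Dict[str, List[Example]],
    batch_size: int,
    num_batches: int,
    seed: int = 0,
) -> List[Tuple[str, List[Example]]]:
    """Draw homogeneous batches, choosing the source dataset in proportion to its size."""
    rng = random.Random(seed)
    names = list(datasets)
    weights = [len(datasets[name]) for name in names]
    batches = []
    for _ in range(num_batches):
        # Pick which dataset (monolingual or cross-lingual) this batch comes from.
        name = rng.choices(names, weights=weights, k=1)[0]
        pool = datasets[name]
        # Sample without replacement inside the chosen dataset.
        batch = rng.sample(pool, k=min(batch_size, len(pool)))
        batches.append((name, batch))
    return batches


# Hypothetical toy datasets: one monolingual, one cross-lingual.
datasets = {
    "mono_en": [("en", "en", i) for i in range(60)],
    "cross_en_de": [("en", "de", i) for i in range(40)],
}
batches = sample_hybrid_batches(datasets, batch_size=8, num_batches=50)
```

Keeping each batch homogeneous means in-batch negatives share a language configuration, while the size-proportional choice lets larger datasets contribute more batches over training.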
Xiaoming Zhang, Ming Wang, Xiaocui Yang, Daling Wang, Shi Feng, Yifei Zhang
Multi-hop Question Answering (QA) necessitates complex reasoning by
integrating multiple pieces of information to resolve intricate questions.
However, existing QA systems encounter challenges such as outdated information,
context window length limitations, and an accuracy-quantity trade-off. To
address these issues, we propose a novel framework, the Hierarchical
Retrieval-Augmented Generation Model with Rethink (HiRAG), comprising five key
modules: Decomposer, Definer, Retriever, Filter, and Summarizer. We
introduce a new hierarchical retrieval strategy that incorporates both sparse
retrieval at the document level and dense retrieval at the chunk level,
effectively integrating their strengths. Additionally, we propose a
single-candidate retrieval method to mitigate the limitations of
multi-candidate retrieval. We also construct two new corpora, Indexed
Wikicorpus and Profile Wikicorpus, to address the issues of outdated and
insufficient knowledge.
Our experimental results on four datasets demonstrate that HiRAG outperforms
state-of-the-art models across most metrics, and our Indexed Wikicorpus is
effective. The code for HiRAG is available at
https://github.com/2282588541a/HiRAG
Authors' comments: under review
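The hierarchical retrieval strategy in the abstract above combines sparse retrieval at the document level with dense retrieval at the chunk level. The following is a minimal sketch of that two-stage shape only, not the HiRAG implementation. It assumes a term-overlap count as a crude stand-in for a sparse scorer, and a word-count vector with cosine similarity as a placeholder for a learned dense encoder. The corpus, query, and function names are hypothetical.

```python
import math
import re
from collections import Counter
from typing import Dict, List, Tuple


def tokens(text: str) -> List[str]:
    return re.findall(r"[a-z0-9]+", text.lower())


def sparse_score(query: str, doc_text: str) -> int:
    """Term-overlap count; a crude stand-in for a sparse scorer such as BM25."""
    query_terms = set(tokens(query))
    return sum(1 for t in tokens(doc_text) if t in query_terms)


def toy_dense_score(query: str, chunk: str) -> float:
    """Cosine over word counts; a placeholder for similarity between learned embeddings."""
    q, c = Counter(tokens(query)), Counter(tokens(chunk))
    dot = sum(q[t] * c[t] for t in q)
    norm = math.sqrt(sum(v * v for v in q.values())) * math.sqrt(sum(v * v for v in c.values()))
    return dot / norm if norm else 0.0


def hierarchical_retrieve(
    query: str,
    docs: Dict[str, List[str]],
    top_docs: int = 1,
    top_chunks: int = 1,
) -> List[Tuple[str, str]]:
    # Stage 1: rank whole documents with the sparse scorer and keep the best few.
    ranked = sorted(docs, key=lambda name: sparse_score(query, " ".join(docs[name])), reverse=True)
    kept = ranked[:top_docs]
    # Stage 2: rank only the chunks of the surviving documents with the dense scorer.
    scored = [(toy_dense_score(query, chunk), name, chunk) for name in kept for chunk in docs[name]]
    scored.sort(key=lambda item: item[0], reverse=True)
    return [(name, chunk) for _, name, chunk in scored[:top_chunks]]


# Hypothetical two-document corpus, each document pre-split into chunks.
docs = {
    "Marie Curie": ["Marie Curie was born in Warsaw.", "She won the Nobel Prize in Physics in 1903."],
    "Albert Einstein": ["Albert Einstein was born in Ulm.", "He developed the theory of relativity."],
}
result = hierarchical_retrieve("Where was Marie Curie born?", docs)
```

Filtering at the document level first keeps the more expensive chunk-level scoring confined to a small candidate set.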
Francisco Vega Ibáñez, Jo Verbeeck
The challenge of imaging low-density objects in an electron microscope
without causing beam damage is significant in modern TEM. This is especially
true for life science imaging, where the sample, rather than the instrument,
still determines the resolution limit. Here, we explore whether we have to
accept this or can progress further in this area. To do this, we use numerical
simulations to see how much information we can obtain from a weak phase object
at different electron doses. Starting from a model with four phase values, we
compare Zernike phase contrast with measuring diffracted intensity under
multiple random phase illuminations to solve the inverse problem. Our
simulations have shown that diffraction-based methods perform better than the
Zernike method, as we have found and addressed a normalization issue that, in
some other studies, led to an overly optimistic representation of the Zernike
setup. We further validated this using more realistic 2D objects and found that
random phase illuminated diffraction can be up to five times more efficient
than an ideal Zernike implementation. These findings suggest that
diffraction-based methods could be a promising approach for imaging
beam-sensitive materials and that current low-dose imaging methods are not yet
at the quantum limit.
Authors' comments: 25 pages, 7 figures
Haoran Tang, Meng Cao, Jinfa Huang, Ruyang Liu, Peng Jin, Ge Li, Xiaodan Liang
Text-Video Retrieval (TVR) aims to align and associate relevant video content
with corresponding natural language queries. Most existing TVR methods are
based on large-scale pre-trained vision-language models (e.g., CLIP). However,
due to the inherent plain structure of CLIP, few TVR methods explore the
multi-scale representations which offer richer contextual information for a
more thorough understanding. To this end, we propose MUSE, a multi-scale mamba
with linear computational complexity for efficient cross-resolution modeling.
Specifically, the multi-scale representations are generated by applying a
feature pyramid on the last single-scale feature map. Then, we employ the Mamba
structure as an efficient multi-scale learner to jointly learn scale-wise
representations. Furthermore, we conduct comprehensive studies to investigate
different model structures and designs. Extensive results on three popular
benchmarks have validated the superiority of MUSE.
Authors' comments: Accepted by AAAI 2025
Alex Gichamba, Tewodros Kederalah Idris, Brian Ebiyau, Eric Nyberg, Teruko Mitamura
Domain-specific question answering remains challenging for language models,
given the deep technical knowledge required to answer questions correctly. This
difficulty is amplified for smaller language models that cannot encode as much
information in their parameters as larger models. The "Specializing Large
Language Models for Telecom Networks" challenge aimed to enhance the
performance of two small language models, Phi-2 and Falcon-7B in
telecommunication question answering. In this paper, we present our question
answering systems for this challenge. Our solutions achieved leading marks of
81.9% accuracy for Phi-2 and 57.3% for Falcon-7B. We have publicly released our
code and fine-tuned models.
Authors' comments: 7 pages, 2 figures, and 8 tables. This paper has been accepted at the
2024 IEEE Global Communications (GLOBECOM) Workshops
Xiangyu Zhao, Yuehan Zhang, Wenlong Zhang, Xiao-Ming Wu
The fashion domain encompasses a variety of real-world multimodal tasks,
including multimodal retrieval and multimodal generation. The rapid
advancements in artificial intelligence generated content, particularly in
technologies like large language models for text generation and diffusion
models for visual generation, have sparked widespread research interest in
applying these multimodal models in the fashion domain. However, tasks
involving embeddings, such as image-to-text or text-to-image retrieval, have
been largely overlooked from this perspective due to the diverse nature of the
multimodal fashion domain. Moreover, current research on multi-task single models
lacks focus on image generation. In this work, we present UniFashion, a unified
framework that simultaneously tackles the challenges of multimodal generation
and retrieval tasks within the fashion domain, integrating image generation
with retrieval tasks and text generation tasks. UniFashion unifies embedding
and generative tasks by integrating a diffusion model and LLM, enabling
controllable and high-fidelity generation. Our model significantly outperforms
previous single-task state-of-the-art models across diverse fashion tasks, and
can be readily adapted to manage complex vision-language tasks. This work
demonstrates the potential learning synergy between multimodal generation and
retrieval, offering a promising direction for future research in the fashion
domain. The source code is available at
https://github.com/xiangyu-mm/UniFashion.
Authors' comments: Accepted by EMNLP 2024, main conference
Karl El Hajal, Ajinkya Kulkarni, Enno Hermann, Mathew Magimai.-Doss
While recent zero-shot multispeaker text-to-speech (TTS) models achieve impressive results, they typically rely on extensive transcribed speech datasets from numerous speakers and intricate training pipelines. Meanwhile, self-supervised learning (SSL) speech features have emerged as effective intermediate representations for TTS. It was also observed that SSL features from different speakers that are linearly close share phonetic information while maintaining individual speaker identity, which enables straight-forward and robust voice cloning. In this study, we introduce SSL-TTS, a lightweight and efficient zero-shot TTS framework trained on transcribed speech from a single speaker. SSL-TTS leverages SSL features and retrieval methods for simple and robust zero-shot multi-speaker synthesis. Objective and subjective evaluations show that our approach achieves performance comparable to state-of-the-art models that require significantly larger training datasets. The low training data requirements mean that SSL-TTS is well suited for the development of multi-speaker TTS systems for low-resource domains and languages. We also introduce an interpolation parameter which enables fine control over the output speech by blending voices. Demo samples are available at https://idiap.github.io/ssl-tts
Guangyuan Ma, Yongliang Ma, Xing Wu, Zhenpeng Su, Ming Zhou, Songlin Hu
Large Language Model-based Dense Retrieval (LLM-DR) optimizes over numerous
heterogeneous fine-tuning collections from different domains. However, the
discussion about its training data distribution is still minimal. Previous
studies rely on empirically assigned dataset choices or sampling ratios, which
inevitably lead to sub-optimal retrieval performances. In this paper, we
propose a new task-level Distributionally Robust Optimization (tDRO) algorithm
for LLM-DR fine-tuning, targeted at improving the universal domain
generalization ability by end-to-end reweighting the data distribution of each
task. The tDRO parameterizes the domain weights and updates them with scaled
domain gradients. The optimized weights are then transferred to the LLM-DR
fine-tuning to train more robust retrievers. Experiments show optimal
improvements on large-scale retrieval benchmarks and a reduction of up to 30% in
dataset usage after applying our optimization algorithm to a series of
different-sized LLM-DR models.
Authors' comments: Accepted by AAAI25. Source code is available at
https://github.com/ma787639046/tdro
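The abstract above says that tDRO parameterizes domain weights and updates them with scaled domain gradients, but it does not give the update rule. The sketch below therefore swaps in a generic Group-DRO-style exponentiated-gradient step, which should not be read as the authors' algorithm. Dividing each domain loss by a reference loss is an assumption about what "scaled" could mean, and the function name, learning rate, and loss values are hypothetical.

```python
import math
from typing import List


def update_domain_weights(
    weights: List[float],
    domain_losses: List[float],
    reference_losses: List[float],
    lr: float = 0.5,
) -> List[float]:
    """One exponentiated-gradient step: upweight domains whose scaled loss is high."""
    # Assumption: scale each domain loss by a reference loss so domains are comparable.
    scaled = [loss / ref for loss, ref in zip(domain_losses, reference_losses)]
    unnormalized = [w * math.exp(lr * s) for w, s in zip(weights, scaled)]
    total = sum(unnormalized)
    # Renormalize so the weights remain a probability distribution over domains.
    return [u / total for u in unnormalized]


# Hypothetical losses for three fine-tuning domains, starting from uniform weights.
weights = [1 / 3, 1 / 3, 1 / 3]
new_weights = update_domain_weights(
    weights,
    domain_losses=[2.0, 1.0, 0.5],
    reference_losses=[1.0, 1.0, 1.0],
)
```

In this kind of scheme, domains where the model lags get sampled more often, and the resulting weights can then serve as sampling ratios for the final fine-tuning run.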