Nicola Messina, Jan Sedmidubsky, Fabrizio Falchi, Tomáš Rebok
Pose-estimation methods enable extracting human motion from common videos in the structured form of 3D skeleton sequences. Despite great application opportunities, effective content-based access to such spatio-temporal motion data is a challenging problem. In this paper, we focus on the recently introduced text-motion retrieval tasks, which aim to search for database motions that are the most relevant to a specified natural-language textual description (text-to-motion) and vice-versa (motion-to-text). Despite recent efforts to explore these promising avenues, a primary challenge remains the insufficient data available to train robust text-motion models effectively. To address this issue, we propose to investigate joint-dataset learning - where we train on multiple text-motion datasets simultaneously - together with the introduction of a Cross-Consistent Contrastive Loss function (CCCL), which regularizes the learned text-motion common space by imposing uni-modal constraints that augment the representation ability of the trained network. To learn a proper motion representation, we also introduce a transformer-based motion encoder, called MoT++, which employs spatio-temporal attention to process sequences of skeleton data. We demonstrate the benefits of the proposed approaches on the widely-used KIT Motion-Language and HumanML3D datasets. We perform detailed experimentation on joint-dataset learning and cross-dataset scenarios, showing the effectiveness of each introduced module in a carefully conducted ablation study and, in turn, pointing out the limitations of state-of-the-art methods.
Zijie J. Wang, Duen Horng Chau
Retrieval-augmented text generation (RAG) addresses the common limitations of
large language models (LLMs), such as hallucination, by retrieving information
from an updatable external knowledge base. However, existing approaches often
require dedicated backend servers for data storage and retrieval, thereby
limiting their applicability in use cases that require strict data privacy,
such as personal finance, education, and medicine. To address the pressing need
for client-side dense retrieval, we introduce MeMemo, the first open-source
JavaScript toolkit that adapts the state-of-the-art approximate nearest
neighbor search technique HNSW to browser environments. Developed with modern
and native Web technologies, such as IndexedDB and Web Workers, our toolkit
leverages client-side hardware capabilities to enable researchers and
developers to efficiently search through millions of high-dimensional vectors
in the browser. MeMemo enables exciting new design and research opportunities,
such as private and personalized content creation and interactive prototyping,
as demonstrated in our example application RAG Playground. Reflecting on our
work, we discuss the opportunities and challenges for on-device dense
retrieval. MeMemo is available at https://github.com/poloclub/mememo.
Authors' comments: Accepted to SIGIR 2024. 6 pages, 2 figures. For a live demo, visit
https://poloclub.github.io/mememo/. Code is open-source at
https://github.com/poloclub/mememo
Yilong Lai, Jialong Wu, Congzhi Zhang, Haowen Sun, Deyu Zhou
Conversational Query Reformulation (CQR) has significantly advanced in
addressing the challenges of conversational search, particularly those stemming
from the latent user intent and the need for historical context. Recent works
aimed to boost the performance of CQR through alignment. However, they are
designed for one specific retrieval system, which potentially results in
sub-optimal generalization. To overcome this limitation, we present a novel
framework AdaCQR. By aligning reformulation models with both term-based and
semantic-based retrieval systems, AdaCQR enhances the generalizability of
information-seeking queries among diverse retrieval environments through a
two-stage training strategy. Moreover, two effective approaches are proposed to
obtain superior labels and diverse input candidates, boosting the efficiency
and robustness of the framework. Experimental results on the TopiOCQA and QReCC
datasets demonstrate that AdaCQR outperforms the existing methods in a more
efficient framework, offering both quantitative and qualitative improvements in
conversational query reformulation.
Authors' comments: Accepted by COLING 2025
Hanwen Su, Ge Song, Kai Huang, Jiyan Wang, Ming Yang
In this paper, we study the problem of zero-shot sketch-based image retrieval (ZS-SBIR). Prior methods tackle the problem in a two-modality setting with only category labels or even no textual information involved. However, the growing prevalence of Large-scale pre-trained Language Models (LLMs), which have demonstrated great knowledge learned from web-scale data, gives us an opportunity to draw on collective textual information. Our key innovation lies in the usage of text data as auxiliary information for images, thus leveraging the inherent zero-shot generalization ability that language offers. To this end, we propose an approach called Cross-Modal Attention Alignment Network with Auxiliary Text Description for zero-shot sketch-based image retrieval. The network consists of three components: (i) a Description Generation Module that generates textual descriptions for each training category by prompting an LLM with several interrogative sentences, (ii) a Feature Extraction Module that includes two ViTs for sketch and image data and a transformer for extracting tokens of the sentences of each training category, and (iii) a Cross-modal Alignment Module that exchanges the token features of both text-sketch and text-image pairs using a cross-attention mechanism and aligns the tokens locally and globally. Extensive experiments on three benchmark datasets show the superior performance of our approach over state-of-the-art ZS-SBIR methods.
Wenbo Xu, Liang Yan, Peiyi Han, Haifeng Zhu, Chuanyi Liu, Shaoming Duan, Cuiyun Gao, Yingwei Liang
Large Language Model-based (LLM-based) Text-to-SQL methods have achieved important progress in generating SQL queries for real-world applications. When confronted with table content-aware questions in real-world scenarios, ambiguous data content keywords and non-existent database schema column names within the question lead to the poor performance of existing methods. To solve this problem, we propose a novel approach towards Table Content-aware Text-to-SQL with Self-Retrieval (TCSR-SQL). It leverages the LLM's in-context learning capability to extract data content keywords within the question and infer the possibly related database schema, which is used to generate Seed SQL to fuzz search databases. The search results are further used to confirm the encoding knowledge with the designed encoding knowledge table, including column names and exact stored content values used in the SQL. The encoding knowledge is then used to obtain the final Precise SQL through a multi-round generation-execution-revision process. To validate our approach, we introduce a table-content-aware, question-related benchmark dataset containing 1,692 question-SQL pairs. Comprehensive experiments conducted on this benchmark demonstrate the remarkable performance of TCSR-SQL, achieving an improvement of at least 13.7% in execution accuracy compared to other state-of-the-art methods.
Takyoung Kim, Kyungjae Lee, Young Rok Jang, Ji Yong Cho, Gangwoo Kim, Minseok Cho, Moontae Lee
Interactions with large language models (LLMs) often yield long and detailed
responses, leveraging both parametric knowledge and retrieval-augmented
generation (RAG). While these responses can provide rich insights, they often
include redundant or less engaging content not aligned with user interests.
This issue becomes apparent when users specify particular subtopics to include
or exclude -- termed coverage-conditioned ($C^2$) queries -- as LLMs often
struggle to provide tailored responses. To address this challenge, we
investigate the role of query outlines, sequences of subqueries designed to
guide LLMs in generating responses that meet specific user requirements. To
systematically create and evaluate these outlines, we introduce QTree, a
dataset of 10K hierarchical sets of information-seeking subqueries that define
structured boundaries for outline creation and evaluation in $C^2$ scenarios.
Additionally, we develop QPlanner, a 7B language model trained to generate
customized outlines within boundaries of QTree. We evaluate the effectiveness
of the generated outlines through automatic and human judgements, focusing on
their impact within retrieval-augmented generation (RAG) systems. Experimental
results demonstrate that QPlanner, especially when trained with alignment
techniques like DPO, generates higher-quality outlines that better fulfill
diverse user needs.
Authors' comments: NAACL 2025 (Findings, Long). Resources are available at
https://github.com/youngerous/qtree
Sirui Xia, Xintao Wang, Jiaqing Liang, Yifei Zhang, Weikang Zhou, Jiaji Deng, Fei Yu, Yanghua Xiao
Retrieval-Augmented Generation (RAG) has been widely adopted to enhance Large
Language Models (LLMs) in knowledge-intensive tasks. To enhance credibility and
verifiability in RAG systems, Attributed Text Generation (ATG) is proposed,
which provides citations to retrieval knowledge in LLM-generated responses.
Prior methods mainly adopt coarse-grained attributions, with passage-level or
paragraph-level references or citations, which fall short in verifiability.
This paper proposes ReClaim (Refer & Claim), a fine-grained ATG method that
alternates the generation of references and answers step by step. Different
from previous coarse-grained attribution, ReClaim provides sentence-level
citations in long-form question-answering tasks. With extensive experiments, we
verify the effectiveness of ReClaim in extensive settings, achieving a citation
accuracy rate of 90%.
Authors' comments: Accepted to NAACL 2025 Findings
Yanfang Chen, Ding Chen, Shichao Song, Simin Niu, Hanyu Wang, Zeyun Tang, Feiyu Xiong, Zhiyu Li
As people increasingly prioritize their health, the speed and breadth of health information dissemination on the internet have also grown. At the same time, the presence of false health information (health rumors) intermingled with genuine content poses a significant potential threat to public health. However, current research on Chinese health rumors still lacks a large-scale, public, and open-source dataset of health rumor information, as well as effective and reliable rumor detection methods. This paper addresses this gap by constructing a dataset containing 1.12 million health-related rumors (HealthRCN) through web scraping of common health-related questions and a series of data processing steps. HealthRCN is the largest known dataset of Chinese health information rumors to date. Based on this dataset, we propose retrieval-augmented large language models for Chinese health rumor detection and explainability (HRDE). This model leverages retrieved relevant information to accurately determine whether the input health information is a rumor and provides explanatory responses, effectively aiding users in verifying the authenticity of health information. In evaluation experiments, we compared multiple models and found that HRDE outperformed them all, including GPT-4-1106-Preview, in rumor detection accuracy and answer quality. HRDE achieved an average accuracy of 91.04% and an F1 score of 91.58%.
Xinyu Mao, Shengyao Zhuang, Bevan Koopman, Guido Zuccon
The goal of screening prioritisation in systematic reviews is to identify
relevant documents with high recall and rank them in early positions for
review. This saves reviewing effort if paired with a stopping criterion, and
speeds up review completion if performed alongside downstream tasks. Recent
studies have shown that neural models have good potential on this task, but
their time-consuming fine-tuning and inference discourage their widespread use
for screening prioritisation. In this paper, we propose an alternative approach
that still relies on neural models, but leverages dense representations and
relevance feedback to enhance screening prioritisation, without the need for
costly model fine-tuning and inference. This method exploits continuous
relevance feedback from reviewers during document screening to efficiently
update the dense query representation, which is then applied to rank the
remaining documents to be screened. We evaluate this approach across the CLEF
TAR datasets for this task. Results suggest that the investigated dense
query-driven approach is more efficient than directly using neural models and
shows promising effectiveness compared to previous methods developed on the
considered datasets. Our code is available at
https://github.com/ielab/dense-screening-feedback.
Authors' comments: Accepted at SIGIR 2024; typos corrected
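The feedback loop described in the abstract above (update a dense query representation from reviewer judgements, then re-rank the unscreened documents) can be illustrated with a Rocchio-style update. This is only a minimal sketch: the abstract does not state the update rule, so the Rocchio form, the weights `alpha`, `beta`, `gamma`, and the function names are all assumptions made for illustration.

```python
from math import sqrt


def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def normalize(v):
    """Scale v to unit length (returned unchanged if it is the zero vector)."""
    norm = sqrt(dot(v, v))
    return [x / norm for x in v] if norm > 0 else list(v)


def update_query(query, doc, relevant, alpha=1.0, beta=0.5, gamma=0.25):
    """Rocchio-style step (assumed rule): move the query towards a relevant
    document embedding, or away from a non-relevant one."""
    sign = beta if relevant else -gamma
    return normalize([alpha * q + sign * d for q, d in zip(query, doc)])


def rank_remaining(query, docs, screened):
    """Rank unscreened document ids by inner product with the query."""
    remaining = [doc_id for doc_id in docs if doc_id not in screened]
    return sorted(remaining, key=lambda i: dot(query, docs[i]), reverse=True)
```

No model fine-tuning or re-encoding happens inside the loop: each judgement costs one vector update and one pass of inner products over precomputed document embeddings.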
Jia Fu, Xiaoting Qin, Fangkai Yang, Lu Wang, Jue Zhang, Qingwei Lin, Yubo Chen, Dongmei Zhang et al.
Recent advancements in Large Language Models have transformed ML/AI development, necessitating a reevaluation of AutoML principles for Retrieval-Augmented Generation (RAG) systems. To address the challenges of hyper-parameter optimization and online adaptation in RAG, we propose the AutoRAG-HP framework, which formulates hyper-parameter tuning as an online multi-armed bandit (MAB) problem and introduces a novel two-level Hierarchical MAB (Hier-MAB) method for efficient exploration of large search spaces. We conduct extensive experiments on tuning hyper-parameters, such as top-k retrieved documents, prompt compression ratio, and embedding methods, using the ALCE-ASQA and Natural Questions datasets. Our evaluation of jointly optimizing all three hyper-parameters demonstrates that MAB-based online learning methods can achieve Recall@5 $\approx 0.8$ for scenarios with prominent gradients in the search space, using only $\sim20\%$ of the LLM API calls required by the Grid Search approach. Additionally, the proposed Hier-MAB approach outperforms other baselines in more challenging optimization scenarios. The code will be made available at https://aka.ms/autorag.
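To make the bandit formulation above concrete, the following is a minimal sketch of flat UCB1 over a set of candidate hyper-parameter values, where each arm is one value and each pull is one evaluation. It is not the paper's Hier-MAB method; the function name `ucb1_tune` and the `reward_fn` callback are hypothetical stand-ins for whatever online quality signal a RAG system provides.

```python
import math


def ucb1_tune(arms, reward_fn, rounds):
    """Pick arms with UCB1 for `rounds` pulls.

    arms: list of hashable candidate settings (e.g. top-k values).
    reward_fn: hypothetical callback mapping an arm to a reward in [0, 1].
    Returns the arm with the best empirical mean and the pull counts.
    """
    counts = {a: 0 for a in arms}
    sums = {a: 0.0 for a in arms}

    for t in range(1, rounds + 1):
        untried = [a for a in arms if counts[a] == 0]
        if untried:
            arm = untried[0]  # play every arm once before using the index
        else:
            # UCB1 index: empirical mean plus an exploration bonus
            arm = max(
                arms,
                key=lambda a: sums[a] / counts[a]
                + math.sqrt(2.0 * math.log(t) / counts[a]),
            )
        sums[arm] += reward_fn(arm)
        counts[arm] += 1

    tried = [a for a in arms if counts[a] > 0]
    best = max(tried, key=lambda a: sums[a] / counts[a])
    return best, counts
```

Compared with grid search, which spends the same number of evaluations on every configuration, the index concentrates pulls on promising arms while still revisiting the others occasionally.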
Ivica Kostric, Krisztian Balog
Conversational passage retrieval is challenging as it often requires the
resolution of references to previous utterances and needs to deal with the
complexities of natural language, such as coreference and ellipsis. To address
these challenges, pre-trained sequence-to-sequence neural query rewriters are
commonly used to generate a single de-contextualized query based on
conversation history. Previous research shows that combining multiple query
rewrites for the same user utterance has a positive effect on retrieval
performance. We propose the use of a neural query rewriter to generate multiple
queries and show how to integrate those queries in the passage retrieval
pipeline efficiently. The main strength of our approach lies in its simplicity:
it leverages how the beam search algorithm works and can produce multiple query
rewrites at no additional cost. Our contributions further include devising ways
to utilize multi-query rewrites in both sparse and dense first-pass retrieval.
We demonstrate that applying our approach on top of a standard passage
retrieval pipeline delivers state-of-the-art performance without sacrificing
efficiency.
Authors' comments: Proceedings of the 47th International ACM SIGIR Conference on
Research and Development in Information Retrieval
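As an illustration of how several beam-search rewrites might feed a sparse first-pass retriever, the sketch below merges them into a single weighted bag of terms. The weighting scheme (normalising beam log-probabilities into weights and summing them per term) and the function name `weighted_terms` are assumptions for illustration; the abstract does not spell out the integration method.

```python
import math
from collections import defaultdict


def weighted_terms(rewrites):
    """Merge query rewrites into one weighted term query.

    rewrites: list of (query_text, log_prob) pairs, e.g. the beams returned
    by a sequence-to-sequence rewriter.
    Returns a dict mapping each term to the summed normalised weight of the
    rewrites that contain it.
    """
    max_lp = max(lp for _, lp in rewrites)
    # Subtract the max before exponentiating for numerical stability.
    probs = [math.exp(lp - max_lp) for _, lp in rewrites]
    total = sum(probs)

    weights = defaultdict(float)
    for (text, _), p in zip(rewrites, probs):
        for term in set(text.lower().split()):
            weights[term] += p / total
    return dict(weights)
```

Under this scheme a term that appears in every rewrite receives weight 1.0, while a term that appears only in low-scoring beams contributes little, so the merged query can be issued to a term-weighted sparse retriever in a single pass.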
Huaying Zhang, Rintaro Yanagi, Ren Togo, Takahiro Ogawa, Miki Haseyama
This paper proposes a novel zero-shot composed image retrieval (CIR) method
considering the query-target relationship by masked image-text pairs. The
objective of CIR is to retrieve the target image using a query image and a
query text. Existing methods use a textual inversion network to convert the
query image into a pseudo word to compose the image and text and use a
pre-trained visual-language model to realize the retrieval. However, they do
not consider the query-target relationship to train the textual inversion
network to acquire information for retrieval. In this paper, we propose a novel
zero-shot CIR method that is trained end-to-end using masked image-text pairs.
By exploiting the abundant image-text pairs that are convenient to obtain with
a masking strategy for learning the query-target relationship, it is expected
that accurate zero-shot CIR using a retrieval-focused textual inversion network
can be realized. Experimental results show the effectiveness of the proposed
method.
Authors' comments: Accepted as a conference paper in IEEE ICIP 2024
Anubhab Majumder, Kausik Bhattacharya, Amaresh Chakrabarti
Representing systems using the SAPPhIRE causality model is found useful in supporting design-by-analogy. However, creating a SAPPhIRE model of artificial or biological systems is an effort-intensive process that requires human experts to source technical knowledge from multiple technical documents regarding how the system works. This research investigates how to leverage Large Language Models (LLMs) in creating structured descriptions of systems using the SAPPhIRE model of causality. This paper, the second part of the two-part research, presents a new Retrieval-Augmented Generation (RAG) tool for generating information related to SAPPhIRE constructs of artificial systems and reports the results from a preliminary evaluation of the tool's success - focusing on the factual accuracy and reliability of outcomes.
Zheyang Xiong, Vasilis Papageorgiou, Kangwook Lee, Dimitris Papailiopoulos
Recent studies have shown that Large Language Models (LLMs) struggle to accurately retrieve information and maintain reasoning capabilities when processing long-context inputs. To address these limitations, we propose a finetuning approach utilizing a carefully designed synthetic dataset comprising numerical key-value retrieval tasks. Our experiments on models like GPT-3.5 Turbo and Mistral 7B demonstrate that finetuning LLMs on this dataset significantly improves LLMs' information retrieval and reasoning capabilities in longer-context settings. We present an analysis of the finetuned models, illustrating the transfer of skills from synthetic to real task evaluations (e.g., $10.5\%$ improvement on $20$ documents MDQA at position $10$ for GPT-3.5 Turbo). We also find that finetuned LLMs' performance on general benchmarks remains almost constant while LLMs finetuned on other baseline long-context augmentation data can encourage hallucination (e.g., on TriviaQA, Mistral 7B finetuned on our synthetic data causes no performance drop while other baseline data can cause a drop that ranges from $2.33\%$ to $6.19\%$). Our study highlights the potential of finetuning on synthetic data for improving the performance of LLMs on longer-context tasks.
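A numerical key-value retrieval task of the kind mentioned above could be generated along the following lines. This is a sketch under assumptions: the prompt wording, the dictionary sizes, and the function name `make_kv_task` are invented for illustration and are not taken from the paper's dataset.

```python
import random


def make_kv_task(num_dicts=5, keys_per_dict=3, max_int=10_000, seed=None):
    """Build one synthetic retrieval example.

    Returns (prompt, answer): the prompt lists several dictionaries of
    integer keys and values, then asks for the value of one key.
    """
    rng = random.Random(seed)
    # Sample keys without replacement so the gold key is unambiguous.
    keys = rng.sample(range(max_int), num_dicts * keys_per_dict)

    dicts = []
    for i in range(num_dicts):
        chunk = keys[i * keys_per_dict:(i + 1) * keys_per_dict]
        dicts.append({k: rng.randrange(max_int) for k in chunk})

    target = rng.randrange(num_dicts)
    gold_key = rng.choice(list(dicts[target]))

    lines = [f"Dictionary [{i}] {d}" for i, d in enumerate(dicts)]
    prompt = "\n".join(lines) + f"\nReport the value of key {gold_key}."
    return prompt, dicts[target][gold_key]
```

Because the content is random integers, such examples carry no factual knowledge, and the position of the gold dictionary in the context varies from example to example.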
Devashish Vikas Gupta, Azeez Syed Ali Ishaqui, Divya Kiran Kadiyala
Large language models (LLMs) have shown promising results in learning and contextualizing information from different forms of data. Recent advancements in foundational models, particularly those employing self-attention mechanisms, have significantly enhanced our ability to comprehend the semantics of diverse data types. One area that could benefit greatly from multi-modality is the understanding of geospatial data, which inherently has multiple modalities. However, current Natural Language Processing (NLP) mechanisms struggle to effectively address geospatial queries. Existing pre-trained LLMs are inadequately equipped to meet the unique demands of geospatial data, lacking the ability to retrieve precise spatio-temporal data in real time, thus leading to significantly reduced accuracy in answering complex geospatial queries. To address these limitations, we introduce Geode, a pioneering system designed to tackle zero-shot geospatial question-answering tasks with high precision using spatio-temporal data retrieval. Our approach represents a significant improvement in addressing the limitations of current LLMs, demonstrating remarkable improvement in geospatial question-answering abilities compared to existing state-of-the-art pre-trained models.
Zhijie Nie, Richong Zhang, Zhangchi Feng, Hailang Huang, Xudong Liu
Cross-lingual Cross-modal Retrieval (CCR) is an essential task in web search,
which aims to break the barriers between modality and language simultaneously
and achieve image-text retrieval in the multi-lingual scenario with a single
model. In recent years, excellent progress has been made based on cross-lingual
cross-modal pre-training; particularly, the methods based on contrastive
learning on large-scale data have significantly improved retrieval tasks.
However, these methods directly follow the existing pre-training methods in the
cross-lingual or cross-modal domain, leading to two problems of inconsistency
in CCR: The methods with cross-lingual style suffer from the intra-modal error
propagation, resulting in inconsistent recall performance across languages in
the whole dataset. The methods with cross-modal style suffer from the
inter-modal optimization direction bias, resulting in inconsistent rank across
languages within each instance, which cannot be reflected by Recall@K. To solve
these problems, we propose a simple but effective 1-to-K contrastive learning
method, which treats each language equally and eliminates error propagation and
optimization bias. In addition, we propose a new evaluation metric, Mean Rank
Variance (MRV), to reflect the rank inconsistency across languages within each
instance. Extensive experiments on four CCR datasets show that our method
improves both recall rates and MRV with smaller-scale pre-trained data,
achieving a new state of the art.
Authors' comments: Accepted by KDD 2024 Research Track
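The idea behind a rank-consistency metric such as MRV can be sketched as follows: for each instance, take the rank at which the correct item is retrieved under each language, compute the variance of those ranks, and average over instances. This is one plausible reading of the abstract and is labelled as an assumption; the paper's exact definition of MRV may differ, and the function name is hypothetical.

```python
from statistics import pvariance


def mean_rank_variance(ranks_per_instance):
    """Average, over instances, of the population variance of the ranks
    obtained for the same instance across languages.

    ranks_per_instance: list of lists, one inner list per instance holding
    the rank of the correct item under each language.
    """
    variances = [float(pvariance(ranks)) for ranks in ranks_per_instance]
    return sum(variances) / len(variances)
```

Two systems with identical Recall@K can differ on such a measure: ranks of 1, 3, and 5 across three languages all count as hits for Recall@5, yet they give a non-zero variance, whereas ranks of 1, 1, and 1 give zero.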
Zhen Tan, Chengshuai Zhao, Raha Moraffah, Yifan Li, Song Wang, Jundong Li, Tianlong Chen, Huan Liu
Retrieval-Augmented Generative (RAG) models enhance Large Language Models
(LLMs) by integrating external knowledge bases, improving their performance in
applications like fact-checking and information searching. In this paper, we
demonstrate a security threat where adversaries can exploit the openness of
these knowledge bases by injecting deceptive content into the retrieval
database, intentionally changing the model's behavior. This threat is critical
as it mirrors real-world usage scenarios where RAG systems interact with
publicly accessible knowledge bases, such as web scrapings and user-contributed
data pools. We target a realistic setting where the
adversary has no knowledge of users' queries, knowledge base data, or the LLM
parameters. We demonstrate that it is possible to exploit the model
successfully through crafted content uploads with access to the retriever. Our
findings emphasize an urgent need for security measures in the design and
deployment of RAG systems to prevent potential manipulation and ensure the
integrity of machine-generated content.
Authors' comments: Preprint
Guanting Dong, Yutao Zhu, Chenghao Zhang, Zechen Wang, Zhicheng Dou, Ji-Rong Wen
Retrieval-augmented generation (RAG) has demonstrated effectiveness in
mitigating the hallucination problem of large language models (LLMs). However,
the difficulty of aligning the retriever with the diverse LLMs' knowledge
preferences poses an inevitable challenge in developing a reliable
RAG system. To address this issue, we propose DPA-RAG, a universal framework
designed to align diverse knowledge preferences within RAG systems.
Specifically, we initially introduce a preference knowledge construction
pipeline and incorporate five novel query augmentation strategies to alleviate
preference data scarcity. Based on preference data, DPA-RAG accomplishes both
external and internal preference alignment: 1) It jointly integrates pair-wise,
point-wise, and contrastive preference alignment abilities into the reranker,
achieving external preference alignment among RAG components. 2) It further
introduces a pre-aligned stage before vanilla Supervised Fine-tuning (SFT),
enabling LLMs to implicitly capture knowledge aligned with their reasoning
preferences, achieving LLMs' internal alignment. Experimental results across
four knowledge-intensive QA datasets demonstrate that DPA-RAG outperforms all
baselines and seamlessly integrates both black-box and open-sourced LLM
readers. Further qualitative analysis and discussions also provide empirical
guidance for achieving reliable RAG systems. Our code is publicly available at
https://github.com/dongguanting/DPA-RAG.
Authors' comments: Work in progress
Yang Wang, Alberto Garcia Hernandez, Roman Kyslyi, Nicholas Kersting
We present a comprehensive study of answer quality evaluation in
Retrieval-Augmented Generation (RAG) applications using vRAG-Eval, a novel
grading system that is designed to assess correctness, completeness, and
honesty. We further map the grading of the aforementioned quality aspects into a
binary score, indicating an accept or reject decision, mirroring the intuitive
"thumbs-up" or "thumbs-down" gesture commonly used in chat applications. This
approach suits factual business contexts where a clear decision opinion is
essential. Our assessment applies vRAG-Eval to two Large Language Models
(LLMs), evaluating the quality of answers generated by a vanilla RAG
application. We compare these evaluations with human expert judgments and find
a substantial alignment between GPT-4's assessments and those of human experts,
reaching 83% agreement on accept or reject decisions. This study highlights the
potential of LLMs as reliable evaluators in closed-domain, closed-ended
settings, particularly when human evaluations require significant resources.
Authors' comments: 13 pages, 8 figures, 12 tables
Lukas Bahr, Christoph Wehner, Judith Wewerka, José Bittencourt, Ute Schmid, Rüdiger Daub
Failure mode and effects analysis (FMEA) is an essential tool for mitigating potential failures, particularly during the ramp-up phases of new products. However, its effectiveness is often limited by the reasoning capabilities of the FMEA tools, which are usually tabular structured. Meanwhile, large language models (LLMs) offer novel prospects for advanced natural language processing tasks. However, LLMs face challenges in tasks that require factual knowledge, a gap that retrieval-augmented generation (RAG) approaches aim to fill. RAG retrieves information from a non-parametric data store and uses a language model to generate responses. Building on this concept, we propose to enhance the non-parametric data store with a knowledge graph (KG). By integrating a KG into the RAG framework, we aim to leverage analytical and semantic question-answering capabilities for FMEA data. This paper contributes by presenting set-theoretic standardization and a schema for FMEA data, an algorithm for creating vector embeddings from the FMEA-KG, and a KG-enhanced RAG framework. Our approach is validated through a user experience design study, and we measure the precision and performance of the context retrieval recall.