Shuai Wang
This paper introduces ProofCloud, a proof retrieval engine for verified
proofs in higher order logic. It provides a fast proof searching service for
mathematicians and computer scientists for the reuse of proofs and proof
packages. In addition, it includes the first complete proof-checking results
and benchmarks of the OpenTheory repository.
Authors' comments: The paper was presented at the International Workshop on User
Interfaces for Theorem Provers (UITP) in 2016
Runtao Ren, Yinyu Wu, Xuhui Zhang, Jinke Ren, Yanyan Shen, Shuqiang Wang, Kim-Fung Tsang
The rapid evolution of mobile edge computing (MEC) has introduced significant
challenges in optimizing resource allocation in highly dynamic wireless
communication systems, in which task offloading decisions should be made in
real time. However, existing resource allocation strategies adapt poorly
to the dynamic and heterogeneous characteristics of MEC systems because they
lack scalability, context-awareness, and interpretability. To address these
issues, this paper proposes a novel retrieval-augmented generation (RAG) method
to improve the performance of MEC systems. Specifically, a latency minimization
problem is first formulated to jointly optimize the data offloading ratio,
transmit power allocation, and computing resource allocation. Then, an
LLM-enabled information-retrieval mechanism is proposed to solve the problem
efficiently. Extensive experiments across multi-user, multi-task, and highly
dynamic offloading scenarios show that the proposed method consistently reduces
latency compared to several DL-based approaches, achieving 57% improvement
under varying user computing ability, 86% with different servers, 30% under
distinct transmit powers, and 42% for varying data volumes. These results show
the effectiveness of LLM-driven solutions in solving resource allocation
problems in MEC systems.
Authors' comments: This manuscript has been submitted to IEEE
Huanyu Zhang, Chang Xu, Yi-Fan Zhang, Zhang Zhang, Liang Wang, Jiang Bian, Tieniu Tan
Time series forecasting plays a crucial role in data mining, driving rapid advancements across numerous industries. With the emergence of large models, time series foundation models (TSFMs) have exhibited remarkable generalization capabilities, such as zero-shot learning, through large-scale pre-training. Meanwhile, Retrieval-Augmented Generation (RAG) methods have been widely employed to enhance the performance of foundation models on unseen data, allowing models to access external knowledge. In this paper, we introduce TimeRAF, a Retrieval-Augmented Forecasting model that enhances zero-shot time series forecasting through retrieval-augmented techniques. We develop customized time series knowledge bases that are tailored to the specific forecasting tasks. TimeRAF employs an end-to-end learnable retriever to extract valuable information from the knowledge base. Additionally, we propose Channel Prompting for knowledge integration, which effectively extracts relevant information from the retrieved knowledge along the channel dimension. Extensive experiments demonstrate the effectiveness of our model, showing significant improvement across various domains and datasets.
Nicola Messina, Lucia Vadicamo, Leo Maltese, Claudio Gennaro
Recent advancements in deep learning have significantly enhanced
content-based retrieval methods, notably through models like CLIP that map
images and texts into a shared embedding space. However, these methods often
struggle with domain-specific entities and long-tail concepts absent from their
training data, particularly in identifying specific individuals. In this paper,
we explore the task of identity-aware cross-modal retrieval, which aims to
retrieve images of persons in specific contexts based on natural language
queries. This task is critical in various scenarios, such as searching and
browsing personalized video collections or large audio-visual archives
maintained by national broadcasters. We introduce a novel dataset, COCO Person
FaceSwap (COCO-PFS), derived from the widely used COCO dataset and enriched
with deepfake-generated faces from VGGFace2. This dataset addresses the lack of
large-scale datasets needed for training and evaluating models for this task.
Our experiments assess the performance of different CLIP variations repurposed
for this task, including our architecture, Identity-aware CLIP (Id-CLIP), which
achieves competitive retrieval performance through targeted fine-tuning. Our
contributions lay the groundwork for more robust cross-modal retrieval systems
capable of recognizing long-tail identities and contextual nuances. Data and
code are available at https://github.com/mesnico/IdCLIP.
Authors' comments: Accepted as full paper at ECIR 2025
Shintaro Ozaki, Yuta Kato, Siyuan Feng, Masayo Tomita, Kazuki Hayashi, Wataru Hashimoto, Ryoma Obara, Masafumi Oyamada et al.
Retrieval Augmented Generation (RAG) complements the knowledge of Large
Language Models (LLMs) by leveraging external information to enhance response
accuracy for queries. This approach is widely applied in several fields because
it can inject the most up-to-date information, and
researchers are focusing on understanding and improving this aspect to unlock
the full potential of RAG in such high-stakes applications. However, despite
the potential of RAG to address these needs, the mechanisms behind the
confidence levels of its outputs remain underexplored, even though confidence
is critical in domains such as finance, healthcare,
and medicine. Our study focuses on the impact of RAG on confidence within the
medical domain under various configurations and models. We evaluate confidence
by treating the model's predicted probability as its output and calculating
Expected Calibration Error (ECE) and Adaptive Calibration Error (ACE) scores
based on the probabilities and accuracy. In addition, we analyze whether the
order of retrieved documents within prompts calibrates the confidence. Our
findings reveal large variation in confidence and accuracy depending on the
model, settings, and the format of input prompts. These results underscore the
necessity of optimizing configurations based on the specific model and
conditions.
Authors' comments: Accepted to BioNLP2025 (Workshop colocated with ACL2025)
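The calibration metric named in the abstract above can be illustrated with a minimal sketch of Expected Calibration Error. It assumes equal-width confidence bins, and the function and parameter names are hypothetical rather than taken from the paper. Adaptive Calibration Error is usually described as the same idea computed over bins holding equal numbers of predictions, which this sketch does not implement.

```python
from typing import List, Sequence, Tuple


def expected_calibration_error(
    confidences: Sequence[float],
    correct: Sequence[bool],
    n_bins: int = 10,
) -> float:
    """Weighted mean of |accuracy - mean confidence| over equal-width bins."""
    if len(confidences) != len(correct):
        raise ValueError("confidences and correct must have the same length")
    if not confidences:
        raise ValueError("need at least one prediction")

    # Group (confidence, correctness) pairs by confidence bin.
    bins: List[List[Tuple[float, float]]] = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        # A confidence of exactly 1.0 is placed in the last bin.
        index = min(int(conf * n_bins), n_bins - 1)
        bins[index].append((conf, 1.0 if ok else 0.0))

    total = len(confidences)
    ece = 0.0
    for members in bins:
        if not members:
            continue  # empty bins contribute nothing
        avg_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(a for _, a in members) / len(members)
        ece += (len(members) / total) * abs(accuracy - avg_conf)
    return ece
```

A lower value means the predicted probabilities track empirical accuracy more closely. The binning convention (here, half-open bins closed on the left) is an assumption and varies between implementations.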
Wenqing Li, Xiaosong Zhu, Pengfei Lan, Kai Wang, Wanzhu He, Hannes Hübener, Umberto De Giovannini, Peixiang Lu
We propose a scheme for retrieving the ultrafast valley polarization (VP) dynamics in two-dimensional hexagonal materials via attosecond circular dichroism (CD) transient absorption spectroscopy. This approach builds on the CD transition between the first and higher conduction bands induced by the circularly polarized probe pulses. The population imbalance at nonequivalent valleys in the first conduction band is proportionally mapped onto the difference in absorption coefficients of two probe pulses with opposite helicities, supporting an unprecedented quantitative retrieval of the corresponding VP dynamics with subfemtosecond time resolution. We theoretically demonstrate the scheme for h-BN and MoS2 through ab initio calculations, achieving an accurate retrieval of the VP dynamics, particularly the transient VP switching processes, with a time resolution of 250 as.
Yang Du, Yuqi Liu, Qin Jin
Cross-modal (e.g., image-text, video-text) retrieval is an important task in
the fields of information retrieval and multimodal vision-language understanding.
Temporal understanding makes video-text retrieval more challenging than
image-text retrieval. However, we find that the widely used video-text
benchmarks fall short in comprehensively assessing model abilities,
especially temporal understanding, so that large-scale image-text
pre-trained models can already achieve zero-shot performance comparable to
video-text pre-trained models. In this paper, we introduce RTime, a novel
temporal-emphasized video-text retrieval dataset. We first obtain videos of
actions or events with significant temporality, and then reverse these videos
to create harder negative samples. We then recruit annotators to judge the
significance and reversibility of candidate videos, and write captions for
qualified videos. We further use GPT-4 to generate additional captions based on
human-written captions. Our RTime dataset currently consists of 21k videos with
10 captions per video, totalling about 122 hours. Based on RTime, we propose
three retrieval benchmark tasks: RTime-Origin, RTime-Hard, and RTime-Binary. We
further enhance the use of harder-negatives in model training, and benchmark a
variety of video-text models on RTime. Extensive experimental analysis shows
that RTime indeed poses new and greater challenges to video-text retrieval. We
release our RTime dataset at https://github.com/qyr0403/Reversed-in-Time to further
advance video-text retrieval and multimodal understanding research.
Authors' comments: ACMMM 2024 poster
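The reversal step described in the abstract above can be sketched as follows: a clip is represented as an ordered sequence of frames, and the hard negative is the same sequence played backwards. This is only an illustration of the idea. The function names are hypothetical, frames are stand-in values, and the actual dataset pipeline (including the human checks for reversibility) is not reproduced.

```python
from typing import List, Sequence, Tuple, TypeVar

Frame = TypeVar("Frame")


def reversed_negative(frames: Sequence[Frame]) -> List[Frame]:
    """Return the frames in reverse temporal order."""
    return list(reversed(frames))


def build_triplet(
    caption: str, frames: Sequence[Frame]
) -> Tuple[str, List[Frame], List[Frame]]:
    """Pair a caption with its original clip and the reversed clip as a negative."""
    return caption, list(frames), reversed_negative(frames)
```

Because the reversed clip contains exactly the same frames, a model can only tell it apart from the positive by using temporal order, which is what makes it a harder negative than an unrelated video.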
Le Dong, Qixuan Cao, Lei Pu, Fangfang Wu, Weisheng Dong, Xin Li, Guangming Shi
ERVD: An Efficient and Robust ViT-Based Distillation Framework for Remote Sensing Image Retrieval
Derong Xu, Xinhang Li, Ziheng Zhang, Zhenxi Lin, Zhihong Zhu, Zhi Zheng, Xian Wu, Xiangyu Zhao et al.
Large Language Models (LLMs) demonstrate remarkable capabilities, yet
struggle with hallucination and outdated knowledge when tasked with complex
knowledge reasoning, resulting in factually incorrect outputs. Previous studies
have attempted to mitigate these issues by retrieving factual knowledge from large-scale
knowledge graphs (KGs) to assist LLMs in logical reasoning and prediction of
answers. However, this kind of approach often introduces noise and irrelevant
data, especially in situations with extensive context from multiple knowledge
aspects. As a result, LLM attention can be misled away from the question
and relevant information. In our study, we introduce an Adaptive Multi-Aspect
Retrieval-augmented over KGs (Amar) framework. This method retrieves knowledge
including entities, relations, and subgraphs, and converts each piece of
retrieved text into prompt embeddings. The Amar framework comprises two key
sub-components: 1) a self-alignment module that aligns commonalities among
entities, relations, and subgraphs to enhance retrieved text, thereby reducing
noise interference; 2) a relevance gating module that employs a soft gate to
learn the relevance score between the question and multi-aspect retrieved data, to
determine which information should be used to enhance LLMs' output and which should be
filtered out altogether. Our method has achieved state-of-the-art performance on
two common datasets, WebQSP and CWQ, showing a 1.9% improvement in accuracy
over its best competitor and a 6.6% improvement in logical form generation
over a method that directly uses retrieved text as context prompts. These
results demonstrate the effectiveness of Amar in improving the reasoning of
LLMs.
Authors' comments: Accepted by AAAI'2025
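The soft relevance gate mentioned in the abstract above can be pictured with a small sketch. In the paper the gate is learned. Here, purely as an assumption for illustration, the relevance score is a fixed scaled dot product between a question vector and each retrieved-item vector, passed through a sigmoid. All names and the `drop_below` threshold are hypothetical.

```python
import math
from typing import List, Sequence, Tuple


def sigmoid(x: float) -> float:
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def relevance_gate(question: Sequence[float], item: Sequence[float]) -> float:
    """Gate between 0 and 1 from a scaled dot product (stand-in for a learned scorer)."""
    score = sum(q * x for q, x in zip(question, item)) / math.sqrt(len(question))
    return sigmoid(score)


def apply_soft_gate(
    question: Sequence[float],
    items: Sequence[Sequence[float]],
    drop_below: float = 0.0,
) -> List[Tuple[float, List[float]]]:
    """Scale each item embedding by its gate; drop items whose gate is below the threshold."""
    gated = []
    for item in items:
        gate = relevance_gate(question, item)
        if gate < drop_below:
            continue  # very low relevance is treated as filtered out
        gated.append((gate, [gate * x for x in item]))
    return gated
```

With `drop_below` left at 0.0 nothing is removed and the gate only down-weights less relevant items. Raising the threshold turns the soft gate into a filter.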
Yanlin Feng, Simone Papicchio, Sajjadur Rahman
Retrieval from graph data is crucial for augmenting large language models (LLMs) with both open-domain knowledge and private enterprise data, and it is also a key component in the recent GraphRAG system (Edge et al., 2024). Despite decades of research on knowledge graphs and knowledge base question answering, leading LLM frameworks (e.g., Langchain and LlamaIndex) have only minimal support for retrieval from modern encyclopedic knowledge graphs like Wikidata. In this paper, we analyze the root cause and suggest that modern RDF knowledge graphs (e.g., Wikidata, Freebase) are less efficient for LLMs due to overly large schemas that far exceed the typical LLM context window, use of resource identifiers, overlapping relation types, and lack of normalization. As a solution, we propose property graph views on top of the underlying RDF graph that can be efficiently queried by LLMs using Cypher. We instantiated this idea on Wikidata and introduced CypherBench, the first benchmark with 11 large-scale, multi-domain property graphs with 7.8 million entities and over 10,000 questions. To achieve this, we tackled several key challenges, including developing an RDF-to-property graph conversion engine, creating a systematic pipeline for text-to-Cypher task generation, and designing new evaluation metrics.
Shuqi Cui, Nirmalya Thakur, Audrey Poon
Emojis are widely used across social media platforms but are often lost in noisy or garbled text, posing challenges for data analysis and machine learning. Conventional preprocessing approaches recommend removing such text, risking the loss of emojis and their contextual meaning. This paper proposes a three-step reverse-engineering methodology to retrieve emojis from garbled text in social media posts. The methodology also identifies reasons for the generation of such text during social media data mining. To evaluate its effectiveness, the approach was applied to 509,248 Tweets about the Mpox outbreak, a dataset referenced in about 30 prior works that failed to retrieve emojis from garbled text. Our method retrieved 157,748 emojis from 76,914 Tweets. Improvements in text readability and coherence were demonstrated through metrics such as Flesch Reading Ease, Flesch-Kincaid Grade Level, Coleman-Liau Index, Automated Readability Index, Dale-Chall Readability Score, Text Standard, and Reading Time. Additionally, the frequency of individual emojis and their patterns of usage in these Tweets were analyzed, and the results are presented.
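One common way such garbled text arises is that UTF-8 bytes are decoded with a single-byte encoding, so each emoji turns into a run of unrelated characters. The sketch below reverses that one failure mode. It is not the paper's three-step methodology, and the function name and the default `wrong_encoding` are assumptions chosen for illustration.

```python
def recover_mojibake(text: str, wrong_encoding: str = "latin-1") -> str:
    """Undo UTF-8 text that was mistakenly decoded with a single-byte encoding.

    Returns the input unchanged when it cannot be reinterpreted, for example
    when it already contains characters outside the assumed wrong encoding.
    """
    try:
        # Recover the original bytes, then decode them as UTF-8.
        return text.encode(wrong_encoding).decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        return text
```

For example, `"😷".encode("utf-8").decode("latin-1")` produces a four-character garbled string, and passing that string to `recover_mojibake` returns the original emoji. Text garbled by a different encoding pair, or damaged by lossy replacement, would need a different treatment.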
Antony Seabra, Claudio Cavalcante, Joao Nepomuceno, Lucas Lago, Nicolaas Ruberg, Sergio Lifschitz
We propose a methodology that combines several advanced techniques in Large
Language Model (LLM) retrieval to support the development of robust,
multi-source question-answer systems. This methodology is designed to integrate
information from diverse data sources, including unstructured documents (PDFs)
and structured databases, through a coordinated multi-agent orchestration and
dynamic retrieval approach. Our methodology leverages specialized agents, such
as SQL agents, Retrieval-Augmented Generation (RAG) agents, and router agents,
that dynamically select the most appropriate retrieval strategy based on the
nature of each query. To further improve accuracy and contextual relevance, we
employ dynamic prompt engineering, which adapts in real time to query-specific
contexts. The methodology's effectiveness is demonstrated within the domain of
Contract Management, where complex queries often require seamless interaction
between unstructured and structured data. Our results indicate that this
approach enhances response accuracy and relevance, offering a versatile and
scalable framework for developing question-answer systems that can operate
across various domains and data sources.
Authors' comments: International Conference on NLP, AI, Computer Science & Engineering
(NLAICSE 2024)
Jeongsu Yu
Text embedding models play a crucial role in natural language processing, particularly in information retrieval, and their importance is further highlighted with the recent utilization of RAG (Retrieval-Augmented Generation). This study presents an efficient fine-tuning methodology encompassing data selection, loss function, and model architecture to enhance the information retrieval performance of pre-trained text embedding models. In particular, this study proposes a novel Contrastive Learning Penalty function that overcomes the limitations of existing Contrastive Learning. The proposed methodology achieves significant performance improvements over existing methods in document retrieval tasks. This study is expected to contribute to improving the performance of information retrieval systems through fine-tuning of text embedding models. The code for this study can be found at https://github.com/CreaLabs/Enhanced-BGE-M3-with-CLP-and-MoE, and the best-performing model can be found at https://huggingface.co/CreaLabs.
Rishiraj Saha Roy, Chris Hinze, Joel Schlotthauer, Farzad Naderi, Viktor Hangya, Andreas Foltyn, Luzian Hahn, Fabian Kuech
Conversational question answering (ConvQA) is a convenient means of searching
over RDF knowledge graphs (KGs), where a prevalent approach is to translate
natural language questions to SPARQL queries. However, SPARQL has certain
shortcomings: (i) it is brittle for complex intents and conversational
questions, and (ii) it is not suitable for more abstract needs. Instead, we
propose a novel two-pronged system where we fuse: (i) SQL-query results over a
database automatically derived from the KG, and (ii) text-search results over
verbalizations of KG facts. Our pipeline supports iterative retrieval: when the
results of any branch are found to be unsatisfactory, the system can
automatically opt for further rounds. We put everything together in a retrieval
augmented generation (RAG) setup, where an LLM generates a coherent response
from accumulated search results. We demonstrate the superiority of our proposed
system over several baselines on a knowledge graph of BMW automobiles.
Authors' comments: Accepted at BTW 2025, 10 pages
Elham Peimani, Gurpreet Singh, Nisarg Mahyavanshi, Aman Arora, Awais Shaikh
Retrieving semantically relevant documents in niche domains poses significant
challenges for traditional TF-IDF-based systems, often resulting in low
similarity scores and suboptimal retrieval performance. This paper addresses
these challenges by introducing an iterative and semi-automated query
refinement methodology tailored to Humber College's career services webpages.
Initially, generic queries related to interview preparation yield low
top-document similarities (approximately 0.2-0.3). To enhance retrieval
effectiveness, we implement a two-fold approach: first, domain-aware query
refinement by incorporating specialized terms such as
resources-online-learning, student-online-services, and career-advising;
second, the integration of structured educational descriptors like "online
resume and interview improvement tools." Additionally, we automate the
extraction of domain-specific keywords from top-ranked documents to suggest
relevant terms for query expansion. Through experiments conducted on five
baseline queries, our semi-automated iterative refinement process elevates the
average top similarity score from approximately 0.18 to 0.42, marking a
substantial improvement in retrieval performance. The implementation details,
including reproducible code and experimental setups, are made available in our
GitHub repositories https://github.com/Elipei88/HumberChatbotBackend and
https://github.com/Nisarg851/HumberChatbot. We also discuss the
limitations of our approach and propose future directions, including the
integration of advanced neural retrieval models.
Authors' comments: To be submitted to CoLM 2025
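The effect of query refinement on TF-IDF similarity, as described in the abstract above, can be reproduced in miniature. The sketch below is a self-contained TF-IDF with cosine similarity. The smoothed IDF formula, the tokenizer, and every name are assumptions for illustration and are not taken from the authors' repositories.

```python
import math
import re
from collections import Counter
from typing import Dict, List


def tokenize(text: str) -> List[str]:
    """Lowercase word tokens; hyphenated terms such as career-advising stay whole."""
    return re.findall(r"[a-z0-9]+(?:-[a-z0-9]+)*", text.lower())


def build_idf(docs: List[List[str]]) -> Dict[str, float]:
    """Smoothed inverse document frequency for every term in the corpus."""
    n = len(docs)
    df = Counter(term for doc in docs for term in set(doc))
    return {term: math.log((1 + n) / (1 + count)) + 1.0 for term, count in df.items()}


def tfidf(tokens: List[str], idf: Dict[str, float]) -> Dict[str, float]:
    """Sparse TF-IDF vector; terms unseen in the corpus are ignored."""
    counts = Counter(t for t in tokens if t in idf)
    return {t: c * idf[t] for t, c in counts.items()}


def cosine(a: Dict[str, float], b: Dict[str, float]) -> float:
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_similarity(query: str, corpus: List[str]) -> float:
    """Highest cosine similarity between the query and any document."""
    docs = [tokenize(d) for d in corpus]
    idf = build_idf(docs)
    q = tfidf(tokenize(query), idf)
    return max((cosine(q, tfidf(d, idf)) for d in docs), default=0.0)
```

On a toy corpus, adding domain terms that actually occur in the target page (for instance, extending "interview tips" with "career advising resume") raises the top similarity, which is the mechanism the refinement process relies on. The size of the gain depends on the corpus and will not match the figures reported in the abstract.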
Jinyan Su, Jin Peng Zhou, Zhengxin Zhang, Preslav Nakov, Claire Cardie
Retrieval-Augmented Generation (RAG) systems have emerged as a promising solution to mitigate LLM hallucinations and enhance their performance in knowledge-intensive domains. However, these systems are vulnerable to adversarial poisoning attacks, where malicious passages injected into retrieval databases can mislead the model into generating factually incorrect outputs. In this paper, we investigate both the retrieval and the generation components of RAG systems to understand how to enhance their robustness against such attacks. From the retrieval perspective, we analyze why and how the adversarial contexts are retrieved and assess how the quality of the retrieved passages impacts downstream generation. From a generation perspective, we evaluate whether LLMs' advanced critical thinking and internal knowledge capabilities can be leveraged to mitigate the impact of adversarial contexts, i.e., using skeptical prompting as a self-defense mechanism. Our experiments and findings provide actionable insights into designing safer and more resilient retrieval-augmented frameworks, paving the way for their reliable deployment in real-world applications.
Luo Ji, Feixiang Guo, Teng Chen, Qingqing Gu, Xiaoyu Wang, Ningyuan Xi, Yihong Wang, Peng Yu et al.
Despite recent advances in Retrieval-Augmented Generation (RAG)
systems, most retrieval methodologies are developed for factual
retrieval, which assumes that queries and positive documents are semantically similar.
In this paper, we instead propose and study a more challenging type of
retrieval task, called hidden rationale retrieval, in which the query and document
are not similar, but their connection can be inferred through reasoning chains, logical relationships,
or empirical experience. To address such problems, an instruction-tuned large
language model (LLM) with a cross-encoder architecture could be a reasonable
choice. To further strengthen pioneering LLM-based retrievers, we design a
special instruction that transforms the retrieval task into a generative task
by prompting the LLM to answer a binary-choice question. The model can be
fine-tuned with direct preference optimization (DPO). The framework is also
optimized for computational efficiency with no performance degradation. We name
this retrieval framework RaHoRe and verify its superior zero-shot and fine-tuned
performance on Emotional Support Conversation (ESC), compared with
previous retrieval works. Our study suggests the potential to employ LLMs as a
foundation for a wider scope of retrieval tasks. Our code, models, and
datasets are available at https://github.com/flyfree5/LaHoRe.
Authors' comments: 11 pages, 3 figures, accepted by ECIR 2025
Meng-Chieh Lee, Qi Zhu, Costas Mavromatis, Zhen Han, Soji Adeshina, Vassilis N. Ioannidis, Huzefa Rangwala, Christos Faloutsos
Given a semi-structured knowledge base (SKB), where text documents are
interconnected by relations, how can we effectively retrieve relevant
information to answer user questions? Retrieval-Augmented Generation (RAG)
retrieves documents to assist large language models (LLMs) in question
answering, while Graph RAG (GRAG) uses structured knowledge bases as its
knowledge source. However, many questions require both textual and relational
information from the SKB (referred to as "hybrid" questions), which complicates
the retrieval process and underscores the need for a hybrid retrieval method
that leverages both types of information. In this paper, through our empirical analysis,
we identify key insights that show why existing methods may struggle with
hybrid question answering (HQA) over SKB. Based on these insights, we propose
HybGRAG for HQA consisting of a retriever bank and a critic module, with the
following advantages: (1) Agentic, it automatically refines the output by
incorporating feedback from the critic module, (2) Adaptive, it solves hybrid
questions requiring both textual and relational information with the retriever
bank, (3) Interpretable, it justifies decision making with an intuitive refinement
path, and (4) Effective, it surpasses all baselines on HQA benchmarks. In
experiments on the STaRK benchmark, HybGRAG achieves significant performance
gains, with an average relative improvement in Hit@1 of 51%.
Authors' comments: Accepted to ACL 2025
Giulio D'Erasmo, Giovanni Trappolini, Nicola Tonellotto, Fabrizio Silvestri
Recent advances in Information Retrieval have leveraged high-dimensional embedding spaces to improve the retrieval of relevant documents. Moreover, the Manifold Clustering Hypothesis suggests that despite these high-dimensional representations, documents relevant to a query reside on a lower-dimensional, query-dependent manifold. While this hypothesis has inspired new retrieval methods, existing approaches still face challenges in effectively separating non-relevant information from relevant signals. We propose a novel methodology that addresses these limitations by leveraging information from both relevant and non-relevant documents. Our method, ECLIPSE, computes a centroid based on irrelevant documents as a reference to estimate noisy dimensions present in relevant ones, enhancing retrieval performance. Extensive experiments on three in-domain and one out-of-domain benchmarks demonstrate an average improvement of up to 19.50% (resp. 22.35%) in mAP(AP) and 11.42% (resp. 13.10%) in nDCG@10 w.r.t. the DIME-based baseline (resp. the baseline using all dimensions). Our results pave the way for more robust, pseudo-irrelevance-based retrieval systems in future IR research.
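For readers unfamiliar with the nDCG@10 metric quoted in the abstract above, a minimal sketch follows. It uses the linear-gain form of DCG and, as a simplifying assumption, derives the ideal ranking from the relevance labels of the ranked list itself rather than from all judged documents for the query. The function names are hypothetical.

```python
import math
from typing import Sequence


def dcg_at_k(relevances: Sequence[float], k: int) -> float:
    """Discounted cumulative gain over the top-k positions (linear gain)."""
    return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(relevances[:k]))


def ndcg_at_k(relevances: Sequence[float], k: int = 10) -> float:
    """DCG of the given ranking divided by the DCG of its ideal reordering."""
    ideal = dcg_at_k(sorted(relevances, reverse=True), k)
    if ideal == 0.0:
        return 0.0  # no relevant documents in the list
    return dcg_at_k(relevances, k) / ideal
```

The input is the list of relevance labels in ranked order, so `ndcg_at_k([3, 2, 1])` scores a perfectly ordered list and any misordering lowers the value. Evaluation toolkits may use an exponential gain or a different ideal ranking, so absolute values can differ from those in the abstract.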
Youngwon Lee, Seung-won Hwang, Daniel Campos, Filip Graliński, Zhewei Yao, Yuxiong He
With the adoption of retrieval-augmented generation (RAG), large language models (LLMs) are expected to ground their generation in the retrieved contexts. Yet, this is hindered by the position bias of LLMs, which fail to attend evenly to all contexts. Previous work has addressed this by synthesizing contexts with perturbed positions of the gold segment, creating a position-diversified training set. We extend this intuition to propose consistency regularization with augmentation and distillation. First, we augment each training instance with its position perturbation to encourage consistent predictions, regardless of ordering. We also distill behaviors of this pair, although it can be counterproductive in certain RAG scenarios where the given order from the retriever is crucial for generation quality. We thus propose CORD, balancing COnsistency and Rank Distillation. CORD adaptively samples noise-controlled perturbations from an interpolation space, ensuring both consistency and respect for the rank prior. Empirical results show this balance enables CORD to outperform consistently across diverse RAG benchmarks.