Xin Jiang, Kaiqiang Wang, Yinlong Wang, Fengchang Lv, Taiyang Peng, Shuai Yang, Xianteng Wu, Pengye Zhang et al.
In recommendation systems, the relevance and novelty of the final results are
selected through a cascade system of Matching -> Ranking -> Strategy. The
matching model serves as the starting point of the pipeline and determines the
upper bound of the subsequent stages. Balancing the relevance and novelty of
matching results is a crucial step in the design and optimization of
recommendation systems, contributing significantly to improving recommendation
quality. However, typical matching algorithms do not address relevance and
novelty well at the same time. One main reason is that deep
matching algorithms exhibit significant uncertainty when estimating long-tail
items (e.g., due to insufficient training samples). The
uncertainty not only affects the training of the models but also influences the
confidence in the index construction and beam search retrieval process of these
models. This paper proposes the UICR (Uncertainty-based explore for Index
Construction and Retrieval) algorithm, which introduces the concept of
uncertainty modeling in the matching stage and achieves multi-task modeling of
model uncertainty and index uncertainty. The final matching results are
obtained by combining the relevance score and uncertainty score inferred by the
model. Experimental results demonstrate that UICR improves novelty without
sacrificing relevance in real-world industrial production environments and
on multiple open-source datasets. Remarkably, online A/B test results for display
advertising in Shopee demonstrate the effectiveness of the proposed algorithm.
Authors' comments: Accepted by CIKM 2024
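The UICR abstract above says that the final matching results combine a relevance score with an uncertainty score. The sketch below illustrates one way such a combination can favour long-tail items. The additive form, the weight `alpha`, and all names and numbers are assumptions made for illustration; the abstract does not state the actual formula.

```python
from typing import NamedTuple


class Candidate(NamedTuple):
    item_id: str
    relevance: float    # relevance estimate for the item
    uncertainty: float  # uncertainty estimate for the item


def explore_score(candidate: Candidate, alpha: float) -> float:
    # Assumed additive combination (not taken from the paper):
    # a larger alpha rewards uncertain, typically long-tail, items more.
    return candidate.relevance + alpha * candidate.uncertainty


def top_k(candidates: list[Candidate], k: int, alpha: float) -> list[Candidate]:
    # Rank by the combined score, highest first, and keep the first k.
    ranked = sorted(candidates, key=lambda c: explore_score(c, alpha), reverse=True)
    return ranked[:k]


# Hypothetical candidates, not real data.
candidates = [
    Candidate("head_item", 0.90, 0.05),
    Candidate("mid_item", 0.70, 0.10),
    Candidate("tail_item", 0.60, 0.80),
]

ranked_plain = [c.item_id for c in top_k(candidates, 2, alpha=0.0)]
ranked_explore = [c.item_id for c in top_k(candidates, 2, alpha=0.5)]
```

With `alpha=0.0` the ranking uses relevance alone. With `alpha=0.5` the uncertain tail item enters the top two, which is the kind of novelty gain the abstract describes.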
Liwen Sun, James Zhao, Megan Han, Chenyan Xiong
Multimodal foundation models hold significant potential for automating
radiology report generation, thereby assisting clinicians in diagnosing cardiac
diseases. However, generated reports often suffer from serious factual
inaccuracy. In this paper, we introduce a fact-aware multimodal
retrieval-augmented pipeline for generating accurate radiology reports
(FactMM-RAG). We first leverage RadGraph to mine factual report pairs, then
integrate factual knowledge to train a universal multimodal retriever. Given a
radiology image, our retriever can identify high-quality reference reports to
augment multimodal foundation models, thus enhancing the factual completeness
and correctness of report generation. Experiments on two benchmark datasets
show that our multimodal retriever outperforms state-of-the-art retrievers on
both language generation and radiology-specific metrics, with gains of up to 6.5% and 2%
in F1CheXbert and F1RadGraph scores. Further analysis indicates that employing
our factually-informed training strategy imposes an effective supervision
signal, without relying on explicit diagnostic label guidance, and successfully
propagates fact-aware capabilities from the multimodal retriever to the
multimodal foundation model in radiology report generation.
Authors' comments: NAACL 2025 main
Yiyang Jiang, Wengyu Zhang, Xulu Zhang, Xiaoyong Wei, Chang Wen Chen, Qing Li
In this paper, we investigate the feasibility of leveraging large language
models (LLMs) for integrating general knowledge and incorporating pseudo-events
as priors for temporal content distribution in video moment retrieval (VMR)
models. The motivation behind this study arises from the limitations of using
LLMs as decoders for generating discrete textual descriptions, which hinders
their direct application to continuous outputs like salience scores and
inter-frame embeddings that capture inter-frame relations. To overcome these
limitations, we propose utilizing LLM encoders instead of decoders. Through a
feasibility study, we demonstrate that LLM encoders effectively refine
inter-concept relations in multimodal embeddings, even without being trained on
textual embeddings. We also show that the refinement capability of LLM encoders
can be transferred to other embeddings, such as BLIP and T5, as long as these
embeddings exhibit similar inter-concept similarity patterns to CLIP
embeddings. We present a general framework for integrating LLM encoders into
existing VMR architectures, specifically within the fusion module. Through
experimental validation, we demonstrate the effectiveness of our proposed
methods by achieving state-of-the-art performance in VMR. The source code can
be accessed at https://github.com/fletcherjiang/LLMEPET.
Authors' comments: Accepted to ACM Multimedia 2024
Jeffy Yu
The rapid growth of Decentralized Finance (DeFi) has been accompanied by
substantial financial losses due to smart contract vulnerabilities,
underscoring the critical need for effective security auditing. With attacks
becoming more frequent, the necessity of and demand for auditing services have
escalated. This especially creates a financial burden for independent
developers and small businesses, who often have limited available funding for
these services. Our study builds upon existing frameworks by integrating
Retrieval-Augmented Generation (RAG) with large language models (LLMs),
specifically employing GPT-4-1106 for its 128k token context window. We
construct a vector store of 830 known vulnerable contracts, leveraging Pinecone
for vector storage, OpenAI's text-embedding-ada-002 for embeddings, and
LangChain to construct the RAG-LLM pipeline. Prompts were designed to provide a
binary answer for vulnerability detection. We first test 52 smart contracts 40
times each against a provided vulnerability type, verifying the replicability
and consistency of the RAG-LLM. Encouraging results were observed, with a 62.7%
success rate in guided detection of vulnerabilities. Second, we challenge the
model under a "blind" audit setup, without the vulnerability type provided in
the prompt, wherein 219 contracts undergo 40 tests each. This setup evaluates
the general vulnerability detection capabilities without hinted context
assistance. Under these conditions, a 60.71% success rate was observed. While
the results are promising, we still emphasize the need for human auditing at
this time. We provide this study as a proof of concept for a cost-effective
smart contract auditing process, moving towards democratic access to security.
Authors' comments: 17 pages, 3 figures, 4 tables
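The evaluation described above repeats each contract test 40 times, so the number of individual runs behind each reported rate follows from simple arithmetic. The sketch below works it out. The `success_rate` helper assumes the rate is the fraction of individual runs judged correct, which the abstract does not spell out, and the example outcomes are hypothetical.

```python
def total_runs(num_contracts: int, repeats: int) -> int:
    """Number of individual model calls when every contract is tested `repeats` times."""
    return num_contracts * repeats


def success_rate(outcomes: list[bool]) -> float:
    """Fraction of runs judged correct (an assumed aggregation, see the lead-in)."""
    if not outcomes:
        raise ValueError("outcomes must not be empty")
    return sum(outcomes) / len(outcomes)


guided_runs = total_runs(52, 40)   # guided setup: vulnerability type given in the prompt
blind_runs = total_runs(219, 40)   # blind setup: no vulnerability type in the prompt

# Hypothetical outcomes for a single contract tested 40 times (not real data).
example_outcomes = [True] * 25 + [False] * 15
example_rate = success_rate(example_outcomes)
```

Under these counts the guided setup covers 2,080 runs and the blind setup 8,760 runs.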
Fan Zhao, You Chen
Deep learning-based methods for Time Series Classification (TSC) typically utilize deep networks to extract features, which are then processed through a combination of a Fully Connected (FC) layer and a SoftMax function. However, we have observed the phenomenon of inter-class similarity and intra-class inconsistency in the datasets from the UCR archive and further analyzed how this phenomenon adversely affects the "FC+SoftMax" paradigm. To address the issue, we introduce ECR, which, for the first time to our knowledge, applies a deep learning-based retrieval algorithm to the TSC problem and integrates classification and retrieval models. Experimental results on 112 UCR datasets demonstrate that ECR is state-of-the-art (SOTA) compared to existing deep learning-based methods. Furthermore, we have developed a more precise classifier, ECRTime, which is an ensemble of ECR models. ECRTime surpasses the currently most accurate deep learning classifier, InceptionTime, in terms of accuracy, achieving this with reduced training time and comparable scalability.
Morgan Berkane, Richard Taïeb, Gabriel Granveau, Pascal Salières, Charles Bourassin-Bouchet, Camille Lévêque, Jérémie Caillat
We show that the complete photoemission dynamics in situations of
electron-ion entanglement can be retrieved from photoelectron spectral
measurements without information on the ion. To this end, we develop an
energy-time analysis of the photoelectron's reduced density matrix based on
first principles. We test and assess our approach with numerical simulations on
a low-dimensional model molecule in interaction with broadband composite pulses
occulting the vibrational resolution. Our method is directly applicable to
recent experimental schemes measuring the photoelectron reduced density
matrices in atomic and molecular photoemission. Therefore, it opens a new
window on the dynamics of decoherence and entanglement at the attosecond
timescale.
Authors' comments: 7 pages, 2 figures
Rujun Han, Yuhao Zhang, Peng Qi, Yumo Xu, Jenyuan Wang, Lan Liu, William Yang Wang, Bonan Min et al.
Question answering based on retrieval augmented generation (RAG-QA) is an important research topic in NLP and has a wide range of real-world applications. However, most existing datasets for this task are either constructed using a single source corpus or consist of short extractive answers, which fall short of evaluating large language model (LLM) based RAG-QA systems on cross-domain generalization. To address these limitations, we create Long-form RobustQA (LFRQA), a new dataset comprising human-written long-form answers that integrate short extractive answers from multiple documents into a single, coherent narrative, covering 26K queries and large corpora across seven different domains. We further propose RAG-QA Arena by directly comparing model-generated answers against LFRQA's answers using LLMs as evaluators. We show via extensive experiments that RAG-QA Arena and human judgments on answer quality are highly correlated. Moreover, only 41.3% of the most competitive LLM's answers are preferred to LFRQA's answers, demonstrating RAG-QA Arena as a challenging evaluation platform for future research.
Donghoon Han, Eunhwan Park, Gisang Lee, Adam Lee, Nojun Kwak
The rapid expansion of multimedia content has made accurately retrieving
relevant videos from large collections increasingly challenging. Recent
advancements in text-video retrieval have focused on cross-modal interactions,
large-scale foundation model training, and probabilistic modeling, yet often
neglect the crucial user perspective, leading to discrepancies between user
queries and the content retrieved. To address this, we introduce MERLIN
(Multimodal Embedding Refinement via LLM-based Iterative Navigation), a novel,
training-free pipeline that leverages Large Language Models (LLMs) for
iterative feedback learning. MERLIN refines query embeddings from a user
perspective, enhancing alignment between queries and video content through a
dynamic question answering process. Experimental results on datasets like
MSR-VTT, MSVD, and ActivityNet demonstrate that MERLIN substantially improves
Recall@1, outperforming existing systems and confirming the benefits of
integrating LLMs into multimodal retrieval systems for more responsive and
context-aware multimedia retrieval.
Authors' comments: Work in progress
Han Zhou, Wei Dong, Xiaohong Liu, Shuaicheng Liu, Xiongkuo Min, Guangtao Zhai, Jun Chen
Most existing Low-light Image Enhancement (LLIE) methods either directly map
Low-Light (LL) to Normal-Light (NL) images or use semantic or illumination maps
as guides. However, the ill-posed nature of LLIE and the difficulty of semantic
retrieval from impaired inputs limit these methods, especially in extremely
low-light conditions. To address this issue, we present a new LLIE network via
Generative LAtent feature based codebook REtrieval (GLARE), in which the
codebook prior is derived from undegraded NL images using a Vector Quantization
(VQ) strategy. More importantly, we develop a generative Invertible Latent
Normalizing Flow (I-LNF) module to align the LL feature distribution to NL
latent representations, guaranteeing the correct code retrieval in the
codebook. In addition, a novel Adaptive Feature Transformation (AFT) module,
featuring an adjustable function for users and comprising an Adaptive Mix-up
Block (AMB) along with a dual-decoder architecture, is devised to further
enhance fidelity while preserving the realistic details provided by codebook
prior. Extensive experiments confirm the superior performance of GLARE on
various benchmark datasets and real-world data. Its effectiveness as a
preprocessing tool in low-light object detection tasks further validates GLARE
for high-level vision applications. Code is released at
https://github.com/LowLevelAI/GLARE.
Authors' comments: Accepted by ECCV 2024
Alexander R. Pelletier, Joseph Ramirez, Irsyad Adam, Simha Sankar, Yu Yan, Ding Wang, Dylan Steinecke, Wei Wang et al.
The vast amount of biomedical information available today presents a significant challenge for investigators seeking to digest, process, and understand these findings effectively. Large Language Models (LLMs) have emerged as powerful tools to navigate this complex and challenging data landscape. However, LLMs may lead to hallucinatory responses, making Retrieval Augmented Generation (RAG) crucial for achieving accurate information. In this protocol, we present RUGGED (Retrieval Under Graph-Guided Explainable disease Distinction), a comprehensive workflow designed to support investigators with knowledge integration and hypothesis generation, identifying validated paths forward. Relevant biomedical information from publications and knowledge bases is reviewed, integrated, and extracted via text-mining association analysis and explainable graph prediction models on disease nodes, forecasting potential links among drugs and diseases. These analyses, along with biomedical texts, are integrated into a framework that facilitates user-directed mechanism elucidation as well as hypothesis exploration through RAG-enabled LLMs. A clinical use-case demonstrates RUGGED's ability to evaluate and recommend therapeutics for Arrhythmogenic Cardiomyopathy (ACM) and Dilated Cardiomyopathy (DCM), analyzing prescribed drugs for molecular interactions and unexplored uses. The platform minimizes LLM hallucinations, offers actionable insights, and improves the investigation of novel therapeutics.
Zhouyu Jiang, Mengshu Sun, Lei Liang, Zhiqiang Zhang
Multi-hop question answering is a challenging task with distinct industrial
relevance, and Retrieval-Augmented Generation (RAG) methods based on large
language models (LLMs) have become a popular approach to tackle this task.
Owing to the potential inability to retrieve all necessary information in a
single iteration, a series of iterative RAG methods has been recently
developed, showing significant performance improvements. However, existing
methods still face two critical challenges: context overload resulting from
multiple rounds of retrieval, and over-planning and repetitive planning due to
the lack of a recorded retrieval trajectory. In this paper, we propose a novel
iterative RAG method called ReSP, equipped with a dual-function summarizer.
This summarizer compresses information from retrieved documents, targeting both
the overarching question and the current sub-question concurrently.
Experimental results on the multi-hop question-answering datasets HotpotQA and
2WikiMultihopQA demonstrate that our method significantly outperforms the
state-of-the-art, and exhibits excellent robustness concerning context length.
Authors' comments: Accepted by WWW2025 Agent4IR Workshop
Garima Agrawal, Tharindu Kumarage, Zeyad Alghamdi, Huan Liu
Large Language Models (LLMs) are proficient at generating coherent and contextually relevant text but face challenges when addressing knowledge-intensive queries in domain-specific and factual question-answering tasks. Retrieval-augmented generation (RAG) systems mitigate this by incorporating external knowledge sources, such as structured knowledge graphs (KGs). However, LLMs often struggle to produce accurate answers despite access to KG-extracted information containing necessary facts. Our study investigates this dilemma by analyzing error patterns in existing KG-based RAG methods and identifying eight critical failure points. We observed that these errors predominantly occur due to insufficient focus on discerning the question's intent and adequately gathering relevant context from the knowledge graph facts. Drawing on this analysis, we propose the Mindful-RAG approach, a framework designed for intent-based and contextually aligned knowledge retrieval. This method explicitly targets the identified failures and offers improvements in the correctness and relevance of responses provided by LLMs, representing a significant step forward from existing methods.
Mo Li, Songyang Zhang, Taolin Zhang, Haodong Duan, Yunxin Liu, Kai Chen
The capability of large language models to handle long-context information is
crucial across various real-world applications. Existing evaluation methods
often rely either on real-world long texts, making it difficult to exclude the
influence of models' inherent knowledge, or introduce irrelevant filler content
to artificially achieve target lengths, reducing assessment effectiveness. To
address these limitations, we introduce NeedleBench, a synthetic framework for
assessing retrieval and reasoning performance in bilingual long-context tasks
with adaptive context lengths. NeedleBench systematically embeds key data
points at varying depths to rigorously test model capabilities. Tasks are
categorized into two scenarios: information-sparse, featuring minimal relevant
details within extensive irrelevant text to simulate simple retrieval tasks;
and information-dense (the Ancestral Trace Challenge), where relevant
information is continuously distributed throughout the context to simulate
complex reasoning tasks. Our experiments reveal that although recent reasoning
models like DeepSeek-R1 and OpenAI's o3 excel in mathematical reasoning, they
struggle with continuous retrieval and reasoning in information-dense
scenarios, even at shorter context lengths. We also characterize a phenomenon
termed 'under-thinking', where models prematurely conclude reasoning despite
available information. NeedleBench thus provides critical insights and targeted
tools essential for evaluating and improving LLMs' long-context capabilities.
All resources are available at OpenCompass:
https://github.com/open-compass/opencompass.
Authors' comments: v2: updated with tested models and Multi-Needle Reasoning
implementation
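The NeedleBench abstract above describes embedding key data points at varying depths of a long context. A minimal sketch of that idea follows. The function name, the sentence-level granularity, and the filler text are assumptions for illustration and are not taken from the OpenCompass implementation.

```python
def insert_needle(filler_sentences: list[str], needle: str, depth: float) -> list[str]:
    """Place `needle` at a relative depth in the filler: 0.0 is the start, 1.0 is the end."""
    if not 0.0 <= depth <= 1.0:
        raise ValueError("depth must be between 0.0 and 1.0")
    # Convert the relative depth into a sentence index.
    position = int(depth * len(filler_sentences))
    return filler_sentences[:position] + [needle] + filler_sentences[position:]


# Hypothetical filler and needle text.
filler = [f"Filler sentence {i}." for i in range(10)]
needle = "The secret code is 4217."

# One context per depth, so a model can be queried at each depth separately.
contexts = {depth: insert_needle(filler, needle, depth) for depth in (0.0, 0.5, 1.0)}
```

Sweeping `depth` and the number of filler sentences gives a grid of test cases for probing retrieval at different positions and context lengths.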
Kaiming Shen, Xichen Ding, Zixiang Zheng, Yuqi Gong, Qianqian Li, Zhongyi Liu, Guannan Zhang
The modeling of users' behaviors is crucial in modern recommendation systems.
A lot of research focuses on modeling users' lifelong sequences, which can be
extremely long and sometimes exceed thousands of items. These models use the
target item to search for the most relevant items from the historical sequence.
However, training lifelong sequences in click-through rate (CTR) prediction or
personalized search ranking (PSR) is extremely difficult due to the
insufficient learning of ID embeddings, especially when the IDs in the
lifelong sequence features do not appear in the samples of the training dataset.
Additionally, existing target attention mechanisms struggle to learn the
multi-modal representations of items in the sequence well. The distributions of
the multi-modal embeddings (text, image, and attributes) of a user's interacted
items are not properly aligned, and there is divergence across modalities. We
also observe that users' search query sequences and item browsing sequences can
fully depict users' intents and benefit from each other. To address these
challenges, we propose a unified lifelong multi-modal sequence model called
SEMINAR (Search Enhanced Multi-Modal Interest Network and Approximate Retrieval).
Specifically, a network called Pretraining Search Unit (PSU) learns the
lifelong sequences of multi-modal query-item pairs in a pretraining-finetuning
manner with multiple objectives: multi-modal alignment, next query-item pair
prediction, query-item relevance prediction, etc. After pretraining, the
downstream model restores the pretrained embedding as initialization and
finetunes the network. To accelerate the online retrieval speed of multi-modal
embedding, we propose a multi-modal codebook-based product quantization
strategy to approximate the exact attention calculati
Authors' comments: 9 pages, code released
Fengyu Cai, Xinran Zhao, Tong Chen, Sihao Chen, Hongming Zhang, Iryna Gurevych, Heinz Koeppl
Recent studies show the growing significance of document retrieval in the
generation of LLMs, i.e., RAG, within the scientific domain by bridging their
knowledge gap. However, dense retrievers often struggle with domain-specific
retrieval and complex query-document relationships, particularly when query
segments correspond to various parts of a document. To alleviate such prevalent
challenges, this paper introduces $\texttt{MixGR}$, which improves dense
retrievers' awareness of query-document matching across various levels of
granularity in queries and documents using a zero-shot approach.
$\texttt{MixGR}$ fuses various metrics based on these granularities into a unified
score that reflects a comprehensive query-document similarity. Our experiments
demonstrate that $\texttt{MixGR}$ outperforms previous document retrieval by
24.7%, 9.8%, and 6.9% on nDCG@5 with unsupervised, supervised, and LLM-based
retrievers, respectively, averaged on queries containing multiple subqueries
from five scientific retrieval datasets. Moreover, the efficacy of two
downstream scientific question-answering tasks highlights the advantage of
$\texttt{MixGR}$ to boost the application of LLMs in the scientific domain. The
code and experimental datasets are available.
Authors' comments: EMNLP 2024 Main Conference
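The MixGR abstract above describes fusing several granularity-specific metrics into a single score, without naming the fusion rule. The sketch below uses reciprocal rank fusion as one common way to merge rankings. Treating it as the fusion step is an assumption, and the granularity labels, document IDs, and the constant `k=60` are hypothetical illustration choices.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Merge several rankings; each document earns 1 / (k + rank) from every ranking."""
    fused: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            fused[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties are broken by document ID for determinism.
    return sorted(fused, key=lambda doc_id: (-fused[doc_id], doc_id))


# Hypothetical rankings of the same three documents at three granularities.
whole_query_vs_document = ["d1", "d2", "d3"]
subquery_vs_document = ["d2", "d1", "d3"]
subquery_vs_passage = ["d2", "d3", "d1"]

fused_ranking = reciprocal_rank_fusion(
    [whole_query_vs_document, subquery_vs_document, subquery_vs_passage]
)
```

Here `d2` ends up first because it ranks highly under two of the three granularities, even though the whole-query ranking alone prefers `d1`.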
Huan Ning, Zhenlong Li, Temitope Akinboyewa, M. Naser Lessani
Powered by the emerging large language models (LLMs), autonomous geographic information systems (GIS) agents have the potential to accomplish spatial analyses and cartographic tasks. However, a research gap exists to support fully autonomous GIS agents: how to enable agents to discover and download the necessary data for geospatial analyses. This study proposes LLM-Find, an autonomous GIS agent framework capable of selecting and fetching required geospatial data by generating, executing, and debugging programs. LLM-Find utilizes the LLM as the decision-maker, selects the appropriate data source(s) from a pre-defined source list, and fetches the data from the chosen source. Each data source has a handbook that records the metadata and technical details for data retrieval. The proposed framework is designed in a plug-and-play style to ensure flexibility and extensibility. Human users or autonomous data crawlers can add a new data source by adding a new handbook. We developed a prototype agent based on LLM-Find, and experimental results demonstrate its capability of retrieving data from various sources including OpenStreetMap, administrative boundaries and demographic data from the US Census Bureau, satellite basemaps from ESRI World Imagery, weather data from a commercial provider, and the COVID-19 data from the NYTimes GitHub. Our study is among the first attempts to develop an autonomous geospatial data retrieval agent.
Mingjie Shao, Wei-Kun Chen, Cheng-Yang Yu, Ya-Feng Liu, Wing-Kin Ma
As communication systems advance towards the future 6G era, the incorporation of large-scale antenna arrays in base stations (BSs) presents challenges such as increased hardware costs and energy consumption. To address these issues, the use of one-bit analog-to-digital converters (ADCs)/digital-to-analog converters (DACs) has gained significant attention. This paper focuses on one-bit multiple-input multiple-output (MIMO) detection in an uplink multiuser transmission scenario where the BS employs one-bit ADCs. One-bit quantization retains only the sign information and loses the amplitude information, which poses a unique challenge in the corresponding detection problem. The maximum-likelihood (ML) formulation of one-bit MIMO detection has a challenging likelihood function that hinders the application of many high-performance detectors developed for classic MIMO detection (under high-resolution ADCs). While many approximate methods for the ML detection problem have been studied, an efficient global algorithm is still lacking. This paper fills this gap by proposing an efficient branch-and-bound algorithm, which is guaranteed to find the global solution of the one-bit ML MIMO detection problem. Additionally, a new amplitude retrieval (AR) detection approach is developed, incorporating explicit amplitude variables into the problem formulation. The AR approach yields simpler objective functions that enable the development of efficient algorithms offering both global and approximate solutions. The paper also contributes to the computational complexity analysis of both ML and AR detection problems. Extensive simulations are conducted to demonstrate the effectiveness and efficiency of the proposed formulations and algorithms.
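The abstract above notes that one-bit quantization keeps only the sign information and discards amplitude. A small sketch of that effect on complex baseband samples follows. Quantizing the real and imaginary parts separately and mapping zero to +1 are assumed conventions, and the sample values are hypothetical.

```python
def one_bit_quantize(samples: list[complex]) -> list[complex]:
    """Keep only the sign of the real and imaginary parts of each sample."""
    def sign(value: float) -> float:
        # Assumed convention: zero maps to +1.
        return 1.0 if value >= 0 else -1.0

    return [complex(sign(s.real), sign(s.imag)) for s in samples]


# Two hypothetical received vectors with the same signs but very different amplitudes.
weak_signal = [0.1 + 0.2j, -0.3 + 0.4j]
strong_signal = [5.0 + 9.0j, -7.0 + 2.0j]

weak_quantized = one_bit_quantize(weak_signal)
strong_quantized = one_bit_quantize(strong_signal)
```

Both inputs quantize to the same output, which shows why the detector cannot read amplitudes directly from the one-bit observations.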
Vaibhav Balloli, Sara Beery, Elizabeth Bondi-Kelly
Image retrieval plays a pivotal role in applications from wildlife
conservation to healthcare, for finding individual animals or relevant images
to aid diagnosis. Although deep learning techniques for image retrieval have
advanced significantly, their imperfect real-world performance often
necessitates including human expertise. Human-in-the-loop approaches typically
rely on humans completing the task independently and then combining their
opinions with an AI model in various ways, as these models offer very little
interpretability or \textit{correctability}. To allow humans to intervene in
the AI model instead, thereby saving human time and effort, we adapt the
Concept Bottleneck Model (CBM) and propose \texttt{CHAIR}. \texttt{CHAIR} (a)
enables humans to correct intermediate concepts, which helps \textit{improve}
embeddings generated, and (b) allows for flexible levels of intervention that
accommodate varying levels of human expertise for better retrieval. To show the
efficacy of \texttt{CHAIR}, we demonstrate that our method performs better than
similar models on image retrieval metrics without any external intervention.
Furthermore, we also showcase how human intervention helps further improve
retrieval performance, thereby achieving human-AI complementarity.
Authors' comments: Accepted at Human-Centred AI Track at IJCAI 2024
Pranav Poudel, Prashant Shrestha, Sanskar Amgain, Yash Raj Shrestha, Prashnna Gyawali, Binod Bhattarai
Multimodal AI has demonstrated superior performance over unimodal approaches
by leveraging diverse data sources for more comprehensive analysis. However,
applying this effectiveness in healthcare is challenging due to the limited
availability of public datasets. Federated learning presents an exciting
solution, allowing the use of extensive databases from hospitals and health
centers without centralizing sensitive data, thus maintaining privacy and
security. Yet, research in multimodal federated learning, particularly in
scenarios with missing modalities, a common issue in healthcare datasets, remains
scarce, highlighting a critical area for future exploration. Toward this, we
propose a novel method for multimodal federated learning with missing
modalities. Our contribution lies in a novel cross-modal data augmentation by
retrieval, leveraging the small publicly available dataset to fill the missing
modalities in the clients. Our method learns the parameters in a federated
manner, ensuring privacy protection and improving performance in multiple
challenging multimodal benchmarks in the medical domain, surpassing several
competitive baselines. Code Available: https://github.com/bhattarailab/CAR-MFL
Authors' comments: Accepted at MICCAI 2024
Zhenhe Wu, Zhongqiu Li, Jie Zhang, Mengxiang Li, Yu Zhao, Ruiyu Fang, Zhongjiang He, Xuelong Li et al.
Large language models (LLMs) with in-context learning have significantly improved the performance of the text-to-SQL task. Previous works generally focus on using exclusive SQL generation prompts to improve the LLMs' reasoning ability. However, they mostly struggle to handle large databases with numerous tables and columns, and usually ignore the significance of pre-processing the database and extracting valuable information for more efficient prompt engineering. Based on the above analysis, we propose RB-SQL, a novel retrieval-based LLM framework for in-context prompt engineering, which consists of three modules that retrieve concise tables and columns as schema, and targeted examples for in-context learning. Experimental results demonstrate that our model achieves better performance than several competitive baselines on the public datasets BIRD and Spider.