Regan Bolton, Mohammadreza Sheikhfathollahi, Simon Parkinson, Dan Basher, Howard Parkinson
Operational Technology Cybersecurity (OTCS) continues to be a dominant challenge for critical infrastructure such as railways. As these systems become increasingly vulnerable to malicious attacks due to digitalization, effective documentation and compliance processes are essential to protect these safety-critical systems. This paper proposes a novel system that leverages Large Language Models (LLMs) and multi-stage retrieval to enhance the compliance verification process against standards like IEC 62443 and the rail-specific IEC 63452. We first evaluate a Baseline Compliance Architecture (BCA) for answering OTCS compliance queries, then develop an extended approach called Parallel Compliance Architecture (PCA) that incorporates additional context from regulatory standards. Through empirical evaluation comparing OpenAI-gpt-4o and Claude-3.5-haiku models in these architectures, we demonstrate that the PCA significantly improves both correctness and reasoning quality in compliance verification. Our research establishes metrics for response correctness, logical reasoning, and hallucination detection, highlighting the strengths and limitations of using LLMs for compliance verification in railway cybersecurity. The results suggest that retrieval-augmented approaches can significantly improve the efficiency and accuracy of compliance assessments, particularly valuable in an industry facing a shortage of cybersecurity expertise.
Dachun Sun, You Lyu, Jinning Li, Yizhuo Chen, Tianshi Wang, Tomoyoshi Kimura, Tarek Abdelzaher
This paper introduces SCRAG, a prediction framework inspired by social computing, designed to forecast community responses to real or hypothetical social media posts. SCRAG can be used by public relations specialists (e.g., to craft messaging in ways that avoid unintended misinterpretations) or public figures and influencers (e.g., to anticipate social responses), among other applications related to public sentiment prediction, crisis management, and social what-if analysis. While large language models (LLMs) have achieved remarkable success in generating coherent and contextually rich text, their reliance on static training data and susceptibility to hallucinations limit their effectiveness at response forecasting in dynamic social media environments. SCRAG overcomes these challenges by integrating LLMs with a Retrieval-Augmented Generation (RAG) technique rooted in social computing. Specifically, our framework retrieves (i) historical responses from the target community to capture their ideological, semantic, and emotional makeup, and (ii) external knowledge from sources such as news articles to inject time-sensitive context. This information is then jointly used to forecast the responses of the target community to new posts or narratives. Extensive experiments across six scenarios on the X platform (formerly Twitter), tested with various embedding models and LLMs, demonstrate over 10% improvements on average in key evaluation metrics. A concrete example further shows its effectiveness in capturing diverse ideologies and nuances. Our work provides a social computing tool for applications where accurate and concrete insights into community responses are crucial.
Quentin Romero Lauro, Shreya Shankar, Sepanta Zeighami, Aditya Parameswaran
Retrieval-augmented generation (RAG) pipelines have become the de facto
approach for building AI assistants with access to external, domain-specific
knowledge. Given a user query, RAG pipelines typically first retrieve (R)
relevant information from external sources, before invoking a Large Language
Model (LLM), augmented (A) with this information, to generate (G) responses.
Modern RAG pipelines frequently chain multiple retrieval and generation
components, in any order. However, developing effective RAG pipelines is
challenging because retrieval and generation components are intertwined, making
it hard to identify which component(s) cause errors in the eventual output. The
parameters with the greatest impact on output quality often require hours of
pre-processing after each change, creating prohibitively slow feedback cycles.
To address these challenges, we present RAGGY, a developer tool that combines a
Python library of composable RAG primitives with an interactive interface for
real-time debugging. We contribute the design and implementation of RAGGY,
insights into expert debugging patterns through a qualitative study with 12
engineers, and design implications for future RAG tools that better align with
developers' natural workflows.
Authors' comments: 15 pages, 7 figures, 2 tables
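The retrieve (R), augment (A), generate (G) sequence described in this abstract can be illustrated with a minimal sketch. This is not RAGGY's API: the function name `rag_answer` and the `retrieve` and `generate` callables are hypothetical stand-ins for a real retriever and LLM call.

```python
from typing import Callable, List


def rag_answer(
    query: str,
    retrieve: Callable[[str], List[str]],
    generate: Callable[[str], str],
) -> str:
    """Single-hop RAG: retrieve passages, build a prompt, generate a response."""
    # R: fetch passages relevant to the query (hypothetical retriever).
    passages = retrieve(query)
    # A: augment the query with the retrieved passages.
    prompt = "\n".join(passages + [f"Question: {query}"])
    # G: hand the augmented prompt to the model (hypothetical LLM call).
    return generate(prompt)
```

Because the two stages are chained, a bad answer may come from either callable, which is the debugging difficulty the abstract points to.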
Haoxuan Li, Yi Bin, Yunshan Ma, Guoqing Wang, Yang Yang, See-Kiong Ng, Tat-Seng Chua
Cross-modal retrieval (CMR) is a fundamental task in multimedia research, focused on retrieving semantically relevant targets across different modalities. While traditional CMR methods match text and image via embedding-based similarity calculations, recent advancements in pre-trained generative models have established generative retrieval as a promising alternative. This paradigm assigns each target a unique identifier and leverages a generative model to directly predict identifiers corresponding to input queries without explicit indexing. Despite its great potential, current generative CMR approaches still face semantic information insufficiency in both identifier construction and generation processes. To address these limitations, we propose a novel unified Semantic-enhanced generative Cross-mOdal REtrieval framework (SemCORE), designed to unleash the semantic understanding capabilities in the generative cross-modal retrieval task. Specifically, we first construct a Structured natural language IDentifier (SID) that effectively aligns target identifiers with generative models optimized for natural language comprehension and generation. Furthermore, we introduce a Generative Semantic Verification (GSV) strategy enabling fine-grained target discrimination. Additionally, to the best of our knowledge, SemCORE is the first framework to simultaneously consider both text-to-image and image-to-text retrieval tasks within generative cross-modal retrieval. Extensive experiments demonstrate that our framework outperforms state-of-the-art generative cross-modal retrieval methods. Notably, SemCORE achieves substantial improvements across benchmark datasets, with an average increase of 8.65 points in Recall@1 for text-to-image retrieval.
Hao Deng, Haibo Xing, Kanefumi Matsuyama, Moyu Zhang, Jinxin Hu, Hong Wen, Yu Zhang, Xiaoyi Zeng et al.
Multi-objective embedding-based retrieval (EBR) has become increasingly
critical due to the growing complexity of user behaviors and commercial
objectives. While traditional approaches often suffer from data sparsity and
limited information sharing between objectives, recent methods utilizing a
shared network alongside dedicated sub-networks for each objective partially
address these limitations. However, such methods significantly increase the
model parameters, leading to an increased retrieval latency and a limited
ability to model causal relationships between objectives. To address these
challenges, we propose the Cascaded Selective Mask Fine-Tuning (CSMF), a novel
method that enhances both retrieval efficiency and serving performance for
multi-objective EBR. The CSMF framework selectively masks model parameters to
free up independent learning space for each objective, leveraging the cascading
relationships between objectives during the sequential fine-tuning. Without
increasing network parameters or online retrieval overhead, CSMF computes a
linearly weighted fusion score for multiple objective probabilities while
supporting flexible adjustment of each objective's weight across various
recommendation scenarios. Experimental results on real-world datasets
demonstrate the superior performance of CSMF, and online experiments validate
its significant practical value.
Authors' comments: 10 pages, 8 figures, Proceedings of the 48th International ACM SIGIR
Conference on Research and Development in Information Retrieval (SIGIR '25),
July 13--18, 2025, Padua, Italy
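The linearly weighted fusion score mentioned in the CSMF abstract can be illustrated with a small sketch. It assumes each objective yields one probability per candidate item; the objective names (`click`, `purchase`), the weights, and the function names are hypothetical and are not taken from the paper.

```python
from typing import Dict, List


def fused_score(probs: Dict[str, float], weights: Dict[str, float]) -> float:
    """Linearly weighted fusion of per-objective probabilities for one item."""
    return sum(weights[name] * probs[name] for name in weights)


def rank_items(
    item_probs: Dict[str, Dict[str, float]], weights: Dict[str, float]
) -> List[str]:
    """Return item ids sorted by fused score, highest first."""
    return sorted(
        item_probs,
        key=lambda item: fused_score(item_probs[item], weights),
        reverse=True,
    )


# Hypothetical objectives and probabilities, for illustration only.
ITEM_PROBS = {
    "item_a": {"click": 0.5, "purchase": 0.25},
    "item_b": {"click": 0.75, "purchase": 0.0},
}
```

Changing only the weights reorders the candidates without touching the model, which is one way to read the "flexible adjustment of each objective's weight" in the abstract.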
Nearchos Potamitis, Akhil Arora
Recent advancements in large language models (LLMs) have catalyzed the
development of general-purpose autonomous agents, demonstrating remarkable
performance in complex reasoning tasks across various domains. This surge has
spurred the evolution of a plethora of prompt-based reasoning frameworks. A
recent focus has been on iterative reasoning strategies that refine outputs
through self-evaluation and verbalized feedback. However, these strategies
require additional computational complexity to enable models to recognize and
correct their mistakes, leading to a significant increase in their cost. In
this work, we introduce the concept of ``retrials without feedback'', an
embarrassingly simple yet powerful mechanism for enhancing reasoning frameworks
by allowing LLMs to retry problem-solving attempts upon identifying incorrect
answers. Unlike conventional iterative refinement methods, our method does not
require explicit self-reflection or verbalized feedback, simplifying the
refinement process. Our findings indicate that simpler retrial-based approaches
often outperform more sophisticated reasoning frameworks, suggesting that the
benefits of complex methods may not always justify their computational costs.
By challenging the prevailing assumption that more intricate reasoning
strategies inherently lead to better performance, our work offers new insights
into how simpler, more efficient approaches can achieve optimal results. So,
are retrials all you need?
Authors' comments: 8 pages, 16 figures, 1 table. arXiv admin note: text overlap with
arXiv:2405.06691
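A minimal sketch of a retrial loop without feedback follows. It assumes an external check (`is_correct`) that can flag a wrong answer and a solver (`attempt`) that is sampled afresh each time; both callables, the function name, and the trial budget are hypothetical and not taken from the paper.

```python
from typing import Callable, Optional


def solve_with_retrials(
    attempt: Callable[[str], str],
    is_correct: Callable[[str], bool],
    problem: str,
    max_trials: int = 3,
) -> Optional[str]:
    """Retry `attempt` from scratch until `is_correct` accepts an answer.

    Nothing from a failed attempt (no critique, no reflection) is passed
    to the next attempt; each trial sees only the original problem.
    """
    for _ in range(max_trials):
        answer = attempt(problem)
        if is_correct(answer):
            return answer
    # Budget exhausted without an accepted answer.
    return None
```

The only tunable cost here is `max_trials`, in contrast to refinement loops that also pay for generating and consuming verbalized feedback.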
Adithya Pratapa, Teruko Mitamura
Recent advances in long-context reasoning abilities of language models have led to interesting applications in large-scale multi-document summarization. However, prior work has shown that these long-context models are not effective at their claimed context windows. To this end, retrieval-augmented systems provide an efficient and effective alternative. However, their performance can be highly sensitive to the choice of retrieval context length. In this work, we present a hybrid method that combines retrieval-augmented systems with long-context windows supported by recent language models. Our method first estimates the optimal retrieval length as a function of the retriever, summarizer, and dataset. On a randomly sampled subset of the dataset, we use a panel of LLMs to generate a pool of silver references. We use these silver references to estimate the optimal context length for a given RAG system configuration. Our results on the multi-document summarization task showcase the effectiveness of our method across model classes and sizes. We compare against length estimates from strong long-context benchmarks such as RULER and HELMET. Our analysis also highlights the effectiveness of our estimation method for very long-context LMs and its generalization to new classes of LMs.
Jiatai Wang, Zhiwei Xu, Di Jin, Xuewen Yang, Tao Li
The proliferation of large language models (LLMs) has significantly advanced information retrieval systems, particularly in response generation (RG). Unfortunately, LLMs often face knowledge conflicts between internal memory and retrieved external information, arising from misinformation, biases, or outdated knowledge. These conflicts undermine response reliability and introduce uncertainty in decision-making. In this work, we analyze how LLMs navigate knowledge conflicts from an information-theoretic perspective and reveal that when conflicting and supplementary information exhibit significant differences, LLMs confidently resolve their preferences. However, when the distinction is ambiguous, LLMs experience heightened uncertainty. Based on this insight, we propose Swin-VIB, a novel framework that integrates a pipeline of variational information bottleneck models into the adaptive augmentation of retrieved information and the guidance of LLM preference in response generation. Extensive experiments on single-choice, open-ended question-answering (QA), and retrieval augmented generation (RAG) validate our theoretical findings and demonstrate the efficacy of Swin-VIB. Notably, our method improves single-choice task accuracy by at least 7.54% over competitive baselines.
Zheng Wang, Shu Xian Teo, Jun Jie Chew, Wei Shi
Recent advancements in large language models (LLMs) have enabled their use as
agents for planning complex tasks. Existing methods typically rely on a
thought-action-observation (TAO) process to enhance LLM performance, but these
approaches are often constrained by the LLMs' limited knowledge of complex
tasks. Retrieval-augmented generation (RAG) offers new opportunities by
leveraging external databases to ground generation in retrieved information. In
this paper, we identify two key challenges (enlargability and transferability)
in applying RAG to task planning. We propose InstructRAG, a novel solution
within a multi-agent meta-reinforcement learning framework, to address these
challenges. InstructRAG includes a graph to organize past instruction paths
(sequences of correct actions), an RL-Agent with Reinforcement Learning to
expand graph coverage for enlargability, and an ML-Agent with Meta-Learning to
improve task generalization for transferability. The two agents are trained
end-to-end to optimize overall planning performance. Our experiments on four
widely used task planning datasets demonstrate that InstructRAG significantly
enhances performance and adapts efficiently to new tasks, achieving up to a
19.2% improvement over the best existing approach.
Authors' comments: This paper has been accepted by SIGIR 2025
WonJun Moon, Cheol-Ho Cho, Woojin Jun, Minho Shim, Taeoh Kim, Inwoong Lee, Dongyoon Wee, Jae-Pil Heo
In a retrieval system, simultaneously achieving search accuracy and efficiency is inherently challenging. This challenge is particularly pronounced in partially relevant video retrieval (PRVR), where incorporating more diverse context representations at varying temporal scales for each video enhances accuracy but increases computational and memory costs. To address this dichotomy, we propose a prototypical PRVR framework that encodes diverse contexts within a video into a fixed number of prototypes. We then introduce several strategies to enhance text association and video understanding within the prototypes, along with an orthogonal objective to ensure that the prototypes capture a diverse range of content. To keep the prototypes searchable via text queries while accurately encoding video contexts, we implement cross- and uni-modal reconstruction tasks. The cross-modal reconstruction task aligns the prototypes with textual features within a shared space, while the uni-modal reconstruction task preserves all video contexts during encoding. Additionally, we employ a video mixing technique to provide weak guidance to further align prototypes and associated textual representations. Extensive evaluations on TVR, ActivityNet-Captions, and QVHighlights validate the effectiveness of our approach without sacrificing efficiency.
Amey Hengle, Prasoon Bajpai, Soham Dan, Tanmoy Chakraborty
Existing multilingual long-context benchmarks, often based on the popular
needle-in-a-haystack test, primarily evaluate a model's ability to locate
specific information buried within irrelevant texts. However, such a
retrieval-centric approach is myopic and inherently limited, as successful
recall alone does not indicate a model's capacity to reason over extended
contexts. Moreover, these benchmarks are susceptible to data leakage and
short-circuiting, and they risk making the evaluation a priori identifiable. To
address these limitations, we introduce MLRBench, a new synthetic benchmark for
multilingual long-context reasoning. Unlike existing benchmarks, MLRBench goes
beyond surface-level retrieval by including tasks that assess multi-hop
inference, aggregation, and epistemic reasoning. Spanning seven languages,
MLRBench is designed to be parallel, resistant to leakage, and scalable to
arbitrary context lengths. Our extensive experiments with an open-weight large
language model (LLM) reveal a pronounced gap between high- and low-resource
languages, particularly for tasks requiring the model to aggregate multiple
facts or predict the absence of information. We also find that, in multilingual
settings, LLMs effectively utilize less than 30% of their claimed context
length. Although off-the-shelf Retrieval Augmented Generation helps alleviate
this to a certain extent, it does not solve the long-context problem. We
open-source MLRBench to enable future research in improved evaluation and
training of multilingual LLMs.
Authors' comments: 33 Pages in Total - 23 (Main Manuscript) + 10 (Appendix)
Grace Byun, Shinsun Lee, Nayoung Choi, Jinho Choi
Existing Retrieval-Augmented Generation (RAG) systems face challenges in enterprise settings due to limited retrieval scope and data security risks. When relevant internal documents are unavailable, the system struggles to generate accurate and complete responses. Additionally, using closed-source Large Language Models (LLMs) raises concerns about exposing proprietary information. To address these issues, we propose the Secure Multifaceted-RAG (SecMulti-RAG) framework, which retrieves not only from internal documents but also from two supplementary sources: pre-generated expert knowledge for anticipated queries and on-demand external LLM-generated knowledge. To mitigate security risks, we adopt a local open-source generator and selectively utilize external LLMs only when prompts are deemed safe by a filtering mechanism. This approach enhances completeness, prevents data leakage, and reduces costs. In our evaluation on a report generation task in the automotive industry, SecMulti-RAG significantly outperforms traditional RAG - achieving 79.3 to 91.9 percent win rates across correctness, richness, and helpfulness in LLM-based evaluation, and 56.3 to 70.4 percent in human evaluation. This highlights SecMulti-RAG as a practical and secure solution for enterprise RAG.
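The three-source retrieval with a safety gate described in the SecMulti-RAG abstract can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: `internal_search`, `expert_search`, `external_llm`, and `is_safe` are hypothetical callables standing in for the internal retriever, the pre-generated expert knowledge store, the external LLM, and the filtering mechanism.

```python
from typing import Callable, List


def gather_context(
    query: str,
    internal_search: Callable[[str], List[str]],
    expert_search: Callable[[str], List[str]],
    external_llm: Callable[[str], str],
    is_safe: Callable[[str], bool],
) -> List[str]:
    """Collect context from three sources, gating the external LLM call."""
    # Internal documents and pre-generated expert knowledge stay local.
    context = list(internal_search(query))
    context += expert_search(query)
    # The query leaves the local environment only if the filter allows it.
    if is_safe(query):
        context.append(external_llm(query))
    return context
```

The collected context would then be passed to the local open-source generator; when the filter rejects a query, the external LLM is never called, so nothing proprietary is sent out.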
Changjiang Gao, Hankun Lin, Shujian Huang, Xin Huang, Xue Han, Junlan Feng, Chao Deng, Jiajun Chen
The ability of cross-lingual context retrieval is a fundamental aspect of cross-lingual alignment of large language models (LLMs), where the model extracts context information in one language based on requests in another language. Despite its importance in real-life applications, this ability has not been adequately investigated for state-of-the-art models. In this paper, we evaluate the cross-lingual context retrieval ability of over 40 LLMs across 12 languages to understand the source of this ability, using cross-lingual machine reading comprehension (xMRC) as a representative scenario. Our results show that several small, post-trained open LLMs show strong cross-lingual context retrieval ability, comparable to closed-source LLMs such as GPT-4o, and their estimated oracle performances greatly improve after post-training. Our interpretability analysis shows that the cross-lingual context retrieval process can be divided into two main phases: question encoding and answer retrieval, which are formed in pre-training and post-training, respectively. The phasing stability correlates with xMRC performance, and the xMRC bottleneck lies at the last model layers in the second phase, where the effect of post-training can be evidently observed. Our results also indicate that larger-scale pretraining cannot improve the xMRC performance. Instead, larger LLMs need further multilingual post-training to fully unlock their cross-lingual context retrieval potential. Our code is available at https://github.com/NJUNLP/Cross-Lingual-Context-Retrieval
Bingjie Gao, Xinyu Gao, Xiaoxue Wu, Yujie Zhou, Yu Qiao, Li Niu, Xinyuan Chen, Yaohui Wang
The evolution of Text-to-video (T2V) generative models, trained on
large-scale datasets, has been marked by significant progress. However, the
sensitivity of T2V generative models to input prompts highlights the critical
role of prompt design in influencing generative outcomes. Prior research has
predominantly relied on Large Language Models (LLMs) to align user-provided
prompts with the distribution of training prompts, albeit without tailored
guidance encompassing prompt vocabulary and sentence structure nuances. To this
end, we introduce RAPO, a novel Retrieval-Augmented Prompt Optimization
framework. To address potential inaccuracies and ambiguous details in
LLM-generated prompts, RAPO refines the naive prompts through dual
optimization branches, selecting the superior prompt for T2V generation. The
first branch augments user prompts with diverse modifiers extracted from a
learned relational graph, refining them to align with the format of training
prompts via a fine-tuned LLM. Conversely, the second branch rewrites the naive
prompt using a pre-trained LLM following a well-defined instruction set.
Extensive experiments demonstrate that RAPO can effectively enhance both the
static and dynamic dimensions of generated videos, demonstrating the
significance of prompt optimization for user-provided prompts.
Authors' comments: accepted by CVPR2025, Project website:
https://whynothaha.github.io/Prompt_optimizer/RAPO.html
Xing David Wang, Ulf Leser
Curation of biomedical knowledge bases (KBs) relies on extracting accurate multi-entity relational facts from the literature - a process that remains largely manual and expert-driven. An essential step in this workflow is retrieving documents that can support or complete partially observed n-ary relations. We present a neural retrieval model designed to assist KB curation by identifying documents that help fill in missing relation arguments and provide relevant contextual evidence. To reduce dependence on scarce gold-standard training data, we exploit existing KB records to construct weakly supervised training sets. Our approach introduces two key technical contributions: (i) a layered contrastive loss that enables learning from noisy and incomplete relational structures, and (ii) a balanced sampling strategy that generates high-quality negatives from diverse KB records. On two biomedical retrieval benchmarks, our approach achieves state-of-the-art performance, outperforming strong baselines in NDCG@10 by 5.7 and 3.7 percentage points, respectively.
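For readers unfamiliar with the NDCG@10 metric reported in this abstract, a minimal sketch of one common formulation is given below. It assumes linear gain (some variants use 2^rel - 1 instead) and computes the ideal ordering from the relevance labels of the ranked list itself; the function names are hypothetical, and the paper's exact evaluation setup is not stated in the abstract.

```python
import math
from typing import Sequence


def dcg_at_k(relevances: Sequence[float], k: int) -> float:
    """Discounted cumulative gain over the top-k results (linear gain)."""
    return sum(
        rel / math.log2(rank + 2)  # rank is 0-based, so position 1 divides by log2(2)
        for rank, rel in enumerate(relevances[:k])
    )


def ndcg_at_k(relevances: Sequence[float], k: int = 10) -> float:
    """DCG normalized by the DCG of the ideal (descending) ordering."""
    ideal = dcg_at_k(sorted(relevances, reverse=True), k)
    return dcg_at_k(relevances, k) / ideal if ideal > 0 else 0.0
```

Here `relevances` lists the graded relevance of each returned document in ranked order; a score of 1.0 means the ranking already matches the ideal ordering.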
Kartik Ramkrishnan, Antonia Zhai, Stephen McCamant, Pen Chung Yew
Microarchitectural attacks are a significant concern, leading to many
hardware-based defense proposals. However, different defenses target different
classes of attacks, and their impact on each other has not been fully
considered. To raise awareness of this problem, we study an interaction between
two state-of-the-art defenses in this paper, timing obfuscations of remote
cache lines (TORC) and delaying speculative changes to remote cache lines
(DSRC). TORC mitigates cache-hit based attacks and DSRC mitigates speculative
coherence state change attacks.
We observe that DSRC enables coherence information to be retrieved into the
processor core, where it is out of the reach of timing obfuscations to protect.
This creates an unforeseen consequence that redo operations can be triggered
within the core to detect the presence or absence of remote cache lines, which
constitutes a security vulnerability. We demonstrate that a new covert channel
attack is possible using this vulnerability. We propose two ways to mitigate
the attack, whose performance varies depending on an application's cache usage.
One way is to never send remote exclusive coherence state (E) information to
the core even if it is created. The other way is to never create a remote E
state, which is responsible for triggering redos.
We demonstrate the timing difference caused by this microarchitectural
defense assumption violation using GEM5 simulations. Performance evaluation on
SPECrate 2017 and PARSEC benchmarks of the two fixes shows less than 32%
average overhead across both sets of benchmarks. The repair which prevented the
creation of remote E state had less than 2.8% average overhead.
Authors' comments: 12 pages
Dan Luo, Chengyuan Ma, Weiqin Li, Jun Wang, Wei Chen, Zhiyong Wu
With the advancement of speech synthesis technology, users have higher
expectations for the naturalness and expressiveness of synthesized speech. But
previous research ignores the importance of prompt selection. This study
proposes a text-to-speech (TTS) framework based on Retrieval-Augmented
Generation (RAG) technology, which can dynamically adjust the speech style
according to the text content to achieve more natural and vivid communication
effects. We have constructed a speech style knowledge database containing
high-quality speech samples in various contexts and developed a style matching
scheme. This scheme uses embeddings, extracted by Llama, PER-LLM-Embedder, and
Moka, to match with samples in the knowledge database, selecting the most
appropriate speech style for synthesis. Furthermore, our empirical research
validates the effectiveness of the proposed method. Our demo can be viewed at:
https://thuhcsi.github.io/icme2025-AutoStyle-TTS
Authors' comments: accepted by ICME25
I. Ben Soltane, M. Roy, R. Andre, N. Bonod
The Singularity Expansion Method Parameter Optimizer - SEMPO - is a toolbox
to extract the complex poles, zeros and residues of an arbitrary response
function acquired along the real frequency axis. SEMPO determines this
full set of complex parameters of linear physical systems from their spectral
responses only, without prior information about the system. The method
leverages the Singularity Expansion Method of the physical signal. This
analytical expansion of the meromorphic function in the complex frequency plane
motivates the use of the Cauchy method and auto-differentiation-based
optimization approach to retrieve the complex poles, zeros and residues from
the knowledge of the spectrum over a finite and real spectral range. Both
approaches can be sequentially associated to provide highly accurate
reconstructions of physical signals in large spectral windows. The performance
of SEMPO is assessed and analysed in several configurations that include the
dielectric permittivity of materials and the optical response spectra of
various optical metasurfaces.
Authors' comments: 31 pages, 8 figures
Arka Ujjal Dey, Muhammad Junaid Awan, Georgia Channing, Christian Schroeder de Witt, John Collomosse
We propose CRAVE (Cluster-based Retrieval Augmented Verification with Explanation), a novel framework that integrates retrieval-augmented Large Language Models (LLMs) with clustering techniques to address fact-checking challenges on social media. CRAVE automatically retrieves multimodal evidence from diverse, often contradictory, sources. Evidence is clustered into coherent narratives, and evaluated via an LLM-based judge to deliver fact-checking verdicts explained by evidence summaries. By synthesizing evidence from both text and image modalities and incorporating agent-based refinement, CRAVE ensures consistency and diversity in evidence representation. Comprehensive experiments demonstrate CRAVE's efficacy in retrieval precision, clustering quality, and judgment accuracy, showcasing its potential as a robust decision-support tool for fact-checkers.
Zihan Ling, Zhiyao Guo, Yixuan Huang, Yi An, Shuai Xiao, Jinsong Lan, Xiaoyong Zhu, Bo Zheng
Recent advancements in large language models (LLMs) and multi-modal LLMs have been remarkable. However, these models still rely solely on their parametric knowledge, which limits their ability to generate up-to-date information and increases the risk of producing erroneous content. Retrieval-Augmented Generation (RAG) partially mitigates these challenges by incorporating external data sources, yet the reliance on databases and retrieval systems can introduce irrelevant or inaccurate documents, ultimately undermining both performance and reasoning quality. In this paper, we propose Multi-Modal Knowledge-Based Retrieval-Augmented Generation (MMKB-RAG), a novel multi-modal RAG framework that leverages the inherent knowledge boundaries of models to dynamically generate semantic tags for the retrieval process. This strategy enables the joint filtering of retrieved documents, retaining only the most relevant and accurate references. Extensive experiments on knowledge-based visual question-answering tasks demonstrate the efficacy of our approach: on the E-VQA dataset, our method improves performance by +4.2% on the Single-Hop subset and +0.4% on the full dataset, while on the InfoSeek dataset, it achieves gains of +7.8% on the Unseen-Q subset, +8.2% on the Unseen-E subset, and +8.1% on the full dataset. These results highlight significant enhancements in both accuracy and robustness over the current state-of-the-art MLLM and RAG frameworks.