Yu Liu, Yanbing Liu, Fangfang Yuan, Cong Cao, Youbang Sun, Kun Peng, WeiZhuo Chen, Jianjun Li et al.
Recent advances in large language models (LLMs) and dense retrievers have driven significant progress in retrieval-augmented generation (RAG). However, existing approaches face significant challenges in complex reasoning-oriented multi-hop retrieval tasks: 1) Ineffective reasoning-oriented planning: Prior methods struggle to generate robust multi-step plans for complex queries, as rule-based decomposers perform poorly on out-of-template questions. 2) Suboptimal reasoning-driven retrieval: Related methods employ limited query reformulation, leading to iterative retrieval loops that often fail to locate golden documents. 3) Insufficient reasoning-guided filtering: Prevailing methods lack the fine-grained reasoning to effectively filter salient information from noisy results, hindering utilization of retrieved knowledge. Fundamentally, these limitations all stem from the weak coupling between retrieval and reasoning in current RAG architectures. We introduce the Orchestrated Planner-Executor Reasoning Architecture (OPERA), a novel reasoning-driven retrieval framework. OPERA's Goal Planning Module (GPM) decomposes questions into sub-goals, which are executed by a Reason-Execute Module (REM) with specialized components for precise reasoning and effective retrieval. To train OPERA, we propose Multi-Agents Progressive Group Relative Policy Optimization (MAPGRPO), a novel variant of GRPO. Experiments on complex multi-hop benchmarks show OPERA's superior performance, validating both the MAPGRPO method and OPERA's design. Code is available at https://github.com/Ameame1/OPERA.
Guangyu Yang, Jinghong Chen, Jingbiao Mei, Weizhe Lin, Bill Byrne
Large Language Models (LLMs) remain vulnerable to jailbreak attacks, which attempt to elicit harmful responses from LLMs. The evolving nature and diversity of these attacks pose many challenges for defense systems, including (1) adaptation to counter emerging attack strategies without costly retraining, and (2) control of the trade-off between safety and utility. To address these challenges, we propose Retrieval-Augmented Defense (RAD), a novel framework for jailbreak detection that incorporates a database of known attack examples into Retrieval-Augmented Generation, which is used to infer the underlying, malicious user query and jailbreak strategy used to attack the system. RAD enables training-free updates for newly discovered jailbreak strategies and provides a mechanism to balance safety and utility. Experiments on StrongREJECT show that RAD substantially reduces the effectiveness of strong jailbreak attacks such as PAP and PAIR while maintaining low rejection rates for benign queries. We propose a novel evaluation scheme and show that RAD achieves a robust safety-utility trade-off across a range of operating points in a controllable manner.
Huangyu Dai, Lingtao Mao, Ben Chen, Zihan Wang, Zihan Liang, Ying Han, Chenyi Lei, Han Li
Hotword customization is crucial in ASR to enhance the accuracy of domain-specific terms. It has been primarily driven by the advancements in traditional models and Audio large language models (LLMs). However, existing models often struggle with large-scale hotwords, as the recognition rate drops dramatically with the number of hotwords increasing. In this paper, we introduce a novel hotword customization system that utilizes a hotword pre-retrieval module (H-PRM) to identify the most relevant hotword candidate by measuring the acoustic similarity between the hotwords and the speech segment. This plug-and-play solution can be easily integrated into traditional models such as SeACo-Paraformer, significantly enhancing hotwords post-recall rate (PRR). Additionally, we incorporate H-PRM into Audio LLMs through a prompt-based approach, enabling seamless customization of hotwords. Extensive testing validates that H-PRM can outperform existing methods, showing a new direction for hotword customization in ASR.
Mengzheng Yang, Yanfei Ren, David Osei Opoku, Ruochang Li, Peng Ren, Chunxiao Xing
Current general-purpose large language models (LLMs) commonly exhibit knowledge hallucination and insufficient domain-specific adaptability in domain-specific tasks, limiting their effectiveness in specialized question answering scenarios. Retrieval-augmented generation (RAG) effectively tackles these challenges by integrating external knowledge to enhance accuracy and relevance. However, traditional RAG still faces limitations in domain knowledge accuracy and context modeling.To enhance domain-specific question answering performance, this work focuses on a graph-based RAG framework, emphasizing the critical role of knowledge graph quality during the generation process. We propose DSRAG (Domain-Specific RAG), a multimodal knowledge graph-driven retrieval-augmented generation framework designed for domain-specific applications. Our approach leverages domain-specific documents as the primary knowledge source, integrating heterogeneous information such as text, images, and tables to construct a multimodal knowledge graph covering both conceptual and instance layers. Building on this foundation, we introduce semantic pruning and structured subgraph retrieval mechanisms, combining knowledge graph context and vector retrieval results to guide the language model towards producing more reliable responses. Evaluations using the Langfuse multidimensional scoring mechanism show that our method excels in domain-specific question answering, validating the efficacy of integrating multimodal knowledge graphs with retrieval-augmented generation.
Authors' comments: 12 pages, 5 figures. Accepted to the 22nd International Conference on Web Information Systems and Applications (WISA 2025)
Xiaogang Yang, Dawit Hailu, Vojtěch Kulvait, Thomas Jentschke, Silja Flenner, Imke Greving, Stuart I. Campbell, Johannes Hagemann et al.
X-ray phase contrast imaging significantly improves the visualization of
structures with weak or uniform absorption, broadening its applications across
a wide range of scientific disciplines. Propagation-based phase contrast is
particularly suitable for time- or dose-critical in vivo/in situ/operando
(tomography) experiments because it requires only a single intensity
measurement. However, the phase information of the wave field is lost during
the measurement and must be recovered. Conventional algebraic and iterative
methods often rely on specific approximations or boundary conditions that may
not be met by many samples or experimental setups. In addition, they require
manual tuning of reconstruction parameters by experts, making them less
adaptable for complex or variable conditions. Here we present a self-learning
approach for solving the inverse problem of phase retrieval in the near-field
regime of Fresnel theory using a single intensity measurement (hologram). A
physics-informed generative adversarial network is employed to reconstruct both
the phase and absorbance of the unpropagated wave field in the sample plane
from a single hologram. Unlike most deep learning approaches for phase
retrieval, our approach does not require paired, unpaired, or simulated
training data. This significantly broadens the applicability of our approach,
as acquiring or generating suitable training data remains a major challenge due
to the wide variability in sample types and experimental configurations. The
algorithm demonstrates robust and consistent performance across diverse imaging
conditions and sample types, delivering quantitative, high-quality
reconstructions for both simulated data and experimental datasets acquired at
beamline P05 at PETRA III (DESY, Hamburg), operated by Helmholtz-Zentrum
Hereon. Furthermore, it enables the simultaneous retrieval of both phase and
absorption information.
Authors' comments: Version of record published in Optics Express, Vol. 33, Issue 17, pp.
35832-35851 (2025). Merged article, 20 pages of main text, 1 page of
supplement header, and 7 pages of supplement (total 28 pages). Contains 10
figures in the main article and 5 figures in the supplement
Aleksandr Listopad
Conventional vector-based memory systems rely on cosine or inner product similarity within real-valued embedding spaces. While computationally efficient, such approaches are inherently phase-insensitive and limited in their ability to capture resonance phenomena crucial for meaning representation. We propose Wave-Based Semantic Memory, a novel framework that models knowledge as wave patterns $Ï(x) = A(x) e^{iÏ(x)}$ and retrieves it through resonance-based interference. This approach preserves both amplitude and phase information, enabling more expressive and robust semantic similarity. We demonstrate that resonance-based retrieval achieves higher discriminative power in cases where vector methods fail, including phase shifts, negations, and compositional queries. Our implementation, ResonanceDB, shows scalability to millions of patterns with millisecond latency, positioning wave-based memory as a viable alternative to vector stores for AGI-oriented reasoning and knowledge representation.
Authors' comments: 9 pages, 6 figures
Skatje Myers, Dmitriy Dligach, Timothy A. Miller, Samantha Barr, Yanjun Gao, Matthew Churpek, Anoop Mayampurath, Majid Afshar
Electronic health records (EHRs) are long, noisy, and often redundant, posing a major challenge for the clinicians who must navigate them. Large language models (LLMs) offer a promising solution for extracting and reasoning over this unstructured text, but the length of clinical notes often exceeds even state-of-the-art models' extended context windows. Retrieval-augmented generation (RAG) offers an alternative by retrieving task-relevant passages from across the entire EHR, potentially reducing the amount of required input tokens. In this work, we propose three clinical tasks designed to be replicable across health systems with minimal effort: 1) extracting imaging procedures, 2) generating timelines of antibiotic use, and 3) identifying key diagnoses. Using EHRs from actual hospitalized patients, we test three state-of-the-art LLMs with varying amounts of provided context, using either targeted text retrieval or the most recent clinical notes. We find that RAG closely matches or exceeds the performance of using recent notes, and approaches the performance of using the models' full context while requiring drastically fewer input tokens. Our results suggest that RAG remains a competitive and efficient approach even as newer models become capable of handling increasingly longer amounts of text.
Haoyu Zhao, Jiaxi Gu, Shicong Wang, Xing Zhang, Hang Xu, Zuxuan Wu, Yu-Gang Jiang
The explosive growth of video streaming presents challenges in achieving high
accuracy and low training costs for video-language retrieval. However, existing
methods rely on large-scale pre-training to improve video retrieval
performance, resulting in significant computational demands. Additionally, the
fine-grained information in videos and texts remains underexplored. To
alleviate these problems, we propose a novel framework to learn fine-grained
features for better alignment and introduce an inference pipeline to improve
performance without additional training. Specifically, we employ coarse-to-fine
objectives to understand the semantic information of video-text pairs,
including contrastive and matching learning. The fine-grained data used for
training is obtained through the Granularity-Aware Representation module, which
is designed based on similarity analysis between video frames and words in
captions. Furthermore, we observe that the repetition of keywords in the
original captions, referred to as "Repetition", can enhance retrieval
performance and improve alignment between video and text. Based on this
insight, we propose a novel and effective inference pipeline that incorporates
a voting mechanism and a new Matching Entropy metric to achieve better
retrieval performance without requiring additional pre-training. Experimental
results on four benchmarks demonstrate that the proposed method outperforms
previous approaches. Additionally, our inference pipeline achieves significant
performance improvements, with a 2.1% increase in Recall@1 on the MSR-VTT
dataset and a 1.6% increase on the DiDeMo dataset.
Authors' comments: 11 pages, 4 figures
Chengcheng Guo, Junda She, Kuo Cai, Shiyao Wang, Qigen Hu, Qiang Luo, Kun Gai, Guorui Zhou
Large-scale industrial recommendation systems typically employ a two-stage
paradigm of retrieval and ranking to handle huge amounts of information. Recent
research focuses on improving the performance of retrieval model. A promising
way is to introduce extensive information about users and items. On one hand,
lifelong sequential behavior is valuable. Existing lifelong behavior modeling
methods in ranking stage focus on the interaction of lifelong behavior and
candidate items from retrieval stage. In retrieval stage, it is difficult to
utilize lifelong behavior because of a large corpus of candidate items. On the
other hand, existing retrieval methods mostly relay on interaction information,
potentially disregarding valuable multi-modal information. To solve these
problems, we represent the pioneering exploration of leveraging multi-modal
information and lifelong sequence model within the advanced tree-based
retrieval model. We propose Multi-modal Indexing and Searching with lifelong
Sequence (MISS), which contains a multi-modal index tree and a multi-modal
lifelong sequence modeling module. Specifically, for better index structure, we
propose multi-modal index tree, which is built using the multi-modal embedding
to precisely represent item similarity. To precisely capture diverse user
interests in user lifelong sequence, we propose collaborative general search
unit (Co-GSU) and multi-modal general search unit (MM-GSU) for
multi-perspective interests searching.
Authors' comments: CIKM 2025
Yixin Chen, Ying Xiong, Shangyu Wu, Yufei Cui, Xue Liu, Nan Guan, Chun Jason Xue
Tool-augmented large language models (LLMs) leverage external functions to extend their capabilities, but inaccurate function calls can lead to inefficiencies and increased costs.Existing methods address this challenge by fine-tuning LLMs or using demonstration-based prompting, yet they often suffer from high training overhead and fail to account for inconsistent demonstration samples, which misguide the model's invocation behavior. In this paper, we trained a behavior-aligned retriever (BAR), which provides behaviorally consistent demonstrations to help LLMs make more accurate tool-using decisions. To train the BAR, we construct a corpus including different function-calling behaviors, i.e., calling or non-calling.We use the contrastive learning framework to train the BAR with customized positive/negative pairs and a dual-negative contrastive loss, ensuring robust retrieval of behaviorally consistent examples.Experiments demonstrate that our approach significantly reduces erroneous function calls while maintaining high task performance, offering a cost-effective and efficient solution for tool-augmented LLMs.
Chongyu Qu, Allen J. Luna, Thomas Z. Li, Junchao Zhu, Junlin Guo, Juming Xiong, Kim L. Sandler, Bennett A. Landman et al.
Accurate lung cancer risk prediction remains challenging due to substantial variability across patient populations and clinical settings -- no single model performs best for all cohorts. To address this, we propose a personalized lung cancer risk prediction agent that dynamically selects the most appropriate model for each patient by combining cohort-specific knowledge with modern retrieval and reasoning techniques. Given a patient's CT scan and structured metadata -- including demographic, clinical, and nodule-level features -- the agent first performs cohort retrieval using FAISS-based similarity search across nine diverse real-world cohorts to identify the most relevant patient population from a multi-institutional database. Second, a Large Language Model (LLM) is prompted with the retrieved cohort and its associated performance metrics to recommend the optimal prediction algorithm from a pool of eight representative models, including classical linear risk models (e.g., Mayo, Brock), temporally-aware models (e.g., TDVIT, DLSTM), and multi-modal computer vision-based approaches (e.g., Liao, Sybil, DLS, DLI). This two-stage agent pipeline -- retrieval via FAISS and reasoning via LLM -- enables dynamic, cohort-aware risk prediction personalized to each patient's profile. Building on this architecture, the agent supports flexible and cohort-driven model selection across diverse clinical populations, offering a practical path toward individualized risk assessment in real-world lung cancer screening.
Youssef Maklad, Fares Wael, Ali Hamdi, Wael Elsersy, Khaled Shaban
Traditional protocol fuzzing techniques, such as those employed by AFL-based systems, often lack effectiveness due to a limited semantic understanding of complex protocol grammars and rigid seed mutation strategies. Recent works, such as ChatAFL, have integrated Large Language Models (LLMs) to guide protocol fuzzing and address these limitations, pushing protocol fuzzers to wider exploration of the protocol state space. But ChatAFL still faces issues like unreliable output, LLM hallucinations, and assumptions of LLM knowledge about protocol specifications. This paper introduces MultiFuzz, a novel dense retrieval-based multi-agent system designed to overcome these limitations by integrating semantic-aware context retrieval, specialized agents, and structured tool-assisted reasoning. MultiFuzz utilizes agentic chunks of protocol documentation (RFC Documents) to build embeddings in a vector database for a retrieval-augmented generation (RAG) pipeline, enabling agents to generate more reliable and structured outputs, enhancing the fuzzer in mutating protocol messages with enhanced state coverage and adherence to syntactic constraints. The framework decomposes the fuzzing process into modular groups of agents that collaborate through chain-of-thought reasoning to dynamically adapt fuzzing strategies based on the retrieved contextual knowledge. Experimental evaluations on the Real-Time Streaming Protocol (RTSP) demonstrate that MultiFuzz significantly improves branch coverage and explores deeper protocol states and transitions over state-of-the-art (SOTA) fuzzers such as NSFuzz, AFLNet, and ChatAFL. By combining dense retrieval, agentic coordination, and language model reasoning, MultiFuzz establishes a new paradigm in autonomous protocol fuzzing, offering a scalable and extensible foundation for future research in intelligent agentic-based fuzzing systems.
Yi Wang, Haoran Luo, Lu Meng
With the widespread application of electroencephalography (EEG) in neuroscience and clinical practice, efficiently retrieving and semantically interpreting large-scale, multi-source, heterogeneous EEG data has become a pressing challenge. We propose EEG-MedRAG, a three-layer hypergraph-based retrieval-augmented generation framework that unifies EEG domain knowledge, individual patient cases, and a large-scale repository into a traversable n-ary relational hypergraph, enabling joint semantic-temporal retrieval and causal-chain diagnostic generation. Concurrently, we introduce the first cross-disease, cross-role EEG clinical QA benchmark, spanning seven disorders and five authentic clinical perspectives. This benchmark allows systematic evaluation of disease-agnostic generalization and role-aware contextual understanding. Experiments show that EEG-MedRAG significantly outperforms TimeRAG and HyperGraphRAG in answer accuracy and retrieval, highlighting its strong potential for real-world clinical decision support. Our data and code are publicly available at https://github.com/yi9206413-boop/EEG-MedRAG.
Zefang Liu, Arman Anwar
Incident response (IR) requires fast, coordinated, and well-informed decision-making to contain and mitigate cyber threats. While large language models (LLMs) have shown promise as autonomous agents in simulated IR settings, their reasoning is often limited by a lack of access to external knowledge. In this work, we present AutoBnB-RAG, an extension of the AutoBnB framework that incorporates retrieval-augmented generation (RAG) into multi-agent incident response simulations. Built on the Backdoors & Breaches (B&B) tabletop game environment, AutoBnB-RAG enables agents to issue retrieval queries and incorporate external evidence during collaborative investigations. We introduce two retrieval settings: one grounded in curated technical documentation (RAG-Wiki), and another using narrative-style incident reports (RAG-News). We evaluate performance across eight team structures, including newly introduced argumentative configurations designed to promote critical reasoning. To validate practical utility, we also simulate real-world cyber incidents based on public breach reports, demonstrating AutoBnB-RAG's ability to reconstruct complex multi-stage attacks. Our results show that retrieval augmentation improves decision quality and success rates across diverse organizational models. This work demonstrates the value of integrating retrieval mechanisms into LLM-based multi-agent systems for cybersecurity decision-making.
Hecheng Wang, Jiankun Ren, Jia Yu, Lizhe Qi, Yunquan Sun
Humans effortlessly retrieve objects in cluttered, partially observable environments by combining visual reasoning, active viewpoint adjustment, and physical interaction-with only a single pair of eyes. In contrast, most existing robotic systems rely on carefully positioned fixed or multi-camera setups with complete scene visibility, which limits adaptability and incurs high hardware costs. We present \textbf{RoboRetriever}, a novel framework for real-world object retrieval that operates using only a \textbf{single} wrist-mounted RGB-D camera and free-form natural language instructions. RoboRetriever grounds visual observations to build and update a \textbf{dynamic hierarchical scene graph} that encodes object semantics, geometry, and inter-object relations over time. The supervisor module reasons over this memory and task instruction to infer the target object and coordinate an integrated action module combining \textbf{active perception}, \textbf{interactive perception}, and \textbf{manipulation}. To enable task-aware scene-grounded active perception, we introduce a novel visual prompting scheme that leverages large reasoning vision-language models to determine 6-DoF camera poses aligned with the semantic task goal and geometry scene context. We evaluate RoboRetriever on diverse real-world object retrieval tasks, including scenarios with human intervention, demonstrating strong adaptability and robustness in cluttered scenes with only one RGB-D camera.
Zhe Chen, Yusheng Liao, Shuyang Jiang, Zhiyuan Zhu, Haolin Li, Yanfeng Wang, Yu Wang
Medical large vision-language Models (Med-LVLMs) have shown promise in clinical applications but suffer from factual inaccuracies and unreliable outputs, posing risks in real-world diagnostics. While retrieval-augmented generation has emerged as a potential solution, current medical multimodal RAG systems are unable to perform effective retrieval across heterogeneous sources. The irrelevance of retrieved reports affects the factuality of analysis, while insufficient knowledge affects the credibility of clinical decision-making. To bridge the gap, we construct MedAtlas, which includes extensive multimodal report repositories and diverse text corpora. Based on it, we present HeteroRAG, a novel framework that enhances Med-LVLMs through heterogeneous knowledge sources. The framework introduces Modality-specific CLIPs for effective report retrieval and a Multi-corpora Query Generator for dynamically constructing queries for diverse corpora. Incorporating knowledge from such multifaceted sources, Med-LVLM is then trained with Heterogeneous Knowledge Preference Tuning to achieve cross-modality and multi-source knowledge alignment. Extensive experiments across 12 datasets and 3 modalities demonstrate that the proposed HeteroRAG achieves state-of-the-art performance in most medical vision language benchmarks, significantly improving factual accuracy and reliability of Med-LVLMs.
Giavanna Jadick, Patrick La Rivière
Multi-energy CT has long demonstrated its ability to enhance image quality
with material decomposition. Yet, it has largely been limited to applications
that already have high contrast. More recently, x-ray phase-contrast (XPC)
imaging has gained interest for its potential to improve detectability in tasks
lacking such contrast. Previous work has demonstrated the benefit of combining
multi-energy imaging with XPC for material decomposition. While the existing
method is promising, its analytical approach requires several approximations
that limit its broad applicability, and it is based on projection imaging,
requiring separate tomographic reconstruction for three-dimensional (3D)
volumes. We propose a natively 3D optimization-based solution that leverages
modern computational advances, namely automatic differentiability, to
efficiently solve the multi-energy CT phase retrieval problem with a more
accurate forward model. In a simulation study, we quantitatively demonstrate
the improvement of our proposed technique relative to the existing analytical
standard, and we affirm the benefits of phase contrast over traditional
absorption-only x-ray imaging.
Authors' comments: Presented at the 18th International Meeting on Fully 3D Image
Reconstruction in Radiology and Nuclear Medicine (2025)
Zhuorui Liu, Chen Zhang, Dawei Song
With the rapid development of large language models (LLMs), handling long
context has become one of the vital abilities in LLMs. Such long-context
ability is accompanied by difficulties in deployment, especially due to the
increased consumption of KV cache. There is certain work aiming to optimize the
memory footprint of KV cache, inspired by the observation that attention heads
can be categorized into retrieval heads that are of great significance and
streaming heads that are of less significance. Typically, identifying the
streaming heads and and waiving the KV cache in the streaming heads would
largely reduce the overhead without hurting the performance that much. However,
since employing both retrieval and streaming heads in one layer decomposes one
large round of attention computation into two small ones, it may unexpectedly
bring extra latency on accessing and indexing tensors. Based on this intuition,
we impose an important improvement to the identification process of retrieval
and streaming heads, in which we design a criterion that enforces exclusively
retrieval or streaming heads gathered in one unique layer. In this way, we
further eliminate the extra latency and only incur negligible performance
degradation. Our method named \textsc{ZigzagAttention} is competitive among
considered baselines owing to reduced latency and comparable performance.
Authors' comments: 5 pages, 4 figures
Chor Boon Tan, Conghui Hu, Gim Hee Lee
The recent growth of large foundation models that can easily generate
pseudo-labels for huge quantity of unlabeled data makes unsupervised Zero-Shot
Cross-Domain Image Retrieval (UZS-CDIR) less relevant. In this paper, we
therefore turn our attention to weakly supervised ZS-CDIR (WSZS-CDIR) with
noisy pseudo labels generated by large foundation models such as CLIP. To this
end, we propose CLAIR to refine the noisy pseudo-labels with a confidence score
from the similarity between the CLIP text and image features. Furthermore, we
design inter-instance and inter-cluster contrastive losses to encode images
into a class-aware latent space, and an inter-domain contrastive loss to
alleviate domain discrepancies. We also learn a novel cross-domain mapping
function in closed-form, using only CLIP text embeddings to project image
features from one domain to another, thereby further aligning the image
features for retrieval. Finally, we enhance the zero-shot generalization
ability of our CLAIR to handle novel categories by introducing an extra set of
learnable prompts. Extensive experiments are carried out using TUBerlin,
Sketchy, Quickdraw, and DomainNet zero-shot datasets, where our CLAIR
consistently shows superior performance compared to existing state-of-the-art
methods.
Authors' comments: BMVC 2025
Haidong Xu, Guangwei Xu, Zhedong Zheng, Xiatian Zhu, Wei Ji, Xiangtai Li, Ruijie Guo, Meishan Zhang et al.
This paper introduces VimoRAG, a novel video-based retrieval-augmented motion
generation framework for motion large language models (LLMs). As motion LLMs
face severe out-of-domain/out-of-vocabulary issues due to limited annotated
data, VimoRAG leverages large-scale in-the-wild video databases to enhance 3D
motion generation by retrieving relevant 2D human motion signals. While
video-based motion RAG is nontrivial, we address two key bottlenecks: (1)
developing an effective motion-centered video retrieval model that
distinguishes human poses and actions, and (2) mitigating the issue of error
propagation caused by suboptimal retrieval results. We design the Gemini Motion
Video Retriever mechanism and the Motion-centric Dual-alignment DPO Trainer,
enabling effective retrieval and generation processes. Experimental results
show that VimoRAG significantly boosts the performance of motion LLMs
constrained to text-only input.
Authors' comments: 20 pages,13 figures