F. Lesjak, L. Nortmann, D. Cont, P. J. Amado, M. Azzaro, J. A. Caballero, S. Czesla, A. Hatzes et al.
WASP-69b and KELT-11b are two low-density hot Jupiters, which are expected to
show strong atmospheric features in their transmission spectra. Such features
offer valuable insights into the chemical composition, thermal structure, and
cloud properties of exoplanet atmospheres. High-resolution spectroscopic
observations can be used to study the line-forming regions in exoplanet
atmospheres and potentially detect signals despite the presence of clouds. We
aimed to detect various molecular species and constrain the chemical abundances
and cloud deck pressures using high-resolution spectroscopy. We observed
multiple transits of these planets with CARMENES and applied the
cross-correlation method to detect atmospheric signatures. Further, we used an
injection-recovery approach and retrievals to place constraints on the
atmospheric properties. We detected a tentative H$_2$O signal for KELT-11b but
not for WASP-69b, and searches for other molecules such as H$_2$S and CH$_4$
resulted in non-detections for both planets. By investigating the signal
strength of injected synthetic models, we constrained which atmospheric
abundances and cloud deck pressures are consistent with our cross-correlation
results. In addition, we show that a retrieval-based approach leads to similar
constraints of these parameters.
Authors' comments: Accepted for publication in Astronomy & Astrophysics, 18 pages, 17
figures
Yawei Cai, Jiapeng Mi, Nan Ji, Haotian Rong, Yawei Zhang, Zhangti Li, Wenbin Guo, Rensong Xie
Composed Image Retrieval (CIR) is a cross-modal task that aims to retrieve
target images from large-scale databases using a reference image and a
modification text. Most existing methods rely on a single model to perform
feature fusion and similarity matching. However, this paradigm faces two major
challenges. First, one model alone can't see the whole picture and the tiny
details at the same time; it has to handle different tasks with the same
weights, so it often misses the small but important links between image and
text. Second, the absence of dynamic weight allocation prevents adaptive
leveraging of complementary model strengths, so the resulting embedding drifts
away from the target and misleads the nearest-neighbor search in CIR. To
address these limitations, we propose Dynamic Adaptive Fusion (DAFM) for
multi-model collaboration in CIR. Rather than optimizing a single method in
isolation, DAFM exploits the complementary strengths of heterogeneous models
and adaptively rebalances their contributions. This not only maximizes
retrieval accuracy but also ensures that the performance gains are independent
of the fusion order, highlighting the robustness of our approach. Experiments
on the CIRR and FashionIQ benchmarks demonstrate consistent improvements. Our
method achieves a Recall@10 of 93.21 and an Rmean of 84.43 on CIRR, and an
average Rmean of 67.48 on FashionIQ, surpassing recent strong baselines by up
to 4.5%. These results confirm that dynamic multi-model collaboration provides
an effective and general solution for CIR.
Authors' comments: 10 pages,4 figures
Hyunkyu Kim, Yeeun Yoo, Youngjun Kwak
As financial applications of large language models (LLMs) gain attention,
accurate Information Retrieval (IR) remains crucial for reliable AI services.
However, existing benchmarks fail to capture the complex and domain-specific
information needs of real-world banking scenarios. Building domain-specific IR
benchmarks is costly and constrained by legal restrictions on using real
customer data. To address these challenges, we propose a systematic methodology
for constructing domain-specific IR benchmarks through LLM-based query
generation. As a concrete implementation of this methodology, our pipeline
combines single and multi-document query generation with an enhanced and
reasoning-augmented answerability assessment method, achieving stronger
alignment with human judgments than prior approaches. Using this methodology,
we construct KoBankIR, comprising 815 queries derived from 204 official banking
documents. Our experiments show that existing retrieval models struggle with
the complex multi-document queries in KoBankIR, demonstrating the value of our
systematic approach for domain-specific benchmark construction and underscoring
the need for improved retrieval techniques in financial domains.
Authors' comments: Accepted(Oral) by ICAIF 2025. Hyunkyu Kim and Yeeun Yoo contributed
equally to this work
Harshit Nainwani, Hediyeh Baban
Retrieval systems are essential to contemporary AI pipelines, although most
confuse two separate processes: finding relevant information and giving enough
context for reasoning. We introduce the Search-Is-Not-Retrieve (SINR)
framework, a dual-layer architecture that distinguishes between fine-grained
search representations and coarse-grained retrieval contexts. SINR enhances the
composability, scalability, and context fidelity of retrieval systems by
directly connecting small, semantically accurate search chunks to larger,
contextually complete retrieve chunks, all without incurring extra processing
costs. This design changes retrieval from a passive step to an active one,
making the system architecture more like how people process information. We
discuss the SINR framework's conceptual foundation, formal structure,
implementation issues, and qualitative outcomes. This provides a practical
foundation for the next generation of AI systems that use retrieval.
Authors' comments: 22 pages, 2 figures, technical framework paper
Yu Li, Lehui Li, Qingmin Liao, Fengli Xu, Yong Li
Large language model agents are becoming increasingly capable at web-centric
tasks such as information retrieval, complex reasoning. These emerging
capabilities have given rise to surge research interests in developing LLM
agent for facilitating scientific quest. One key application in AI research is
to automate experiment design through agentic dataset and baseline retrieval.
However, prior efforts suffer from limited data coverage, as recommendation
datasets primarily harvest candidates from public portals and omit many
datasets actually used in published papers, and from an overreliance on content
similarity that biases model toward superficial similarity and overlooks
experimental suitability. Harnessing collective perception embedded in the
baseline and dataset citation network, we present a comprehensive framework for
baseline and dataset recommendation. First, we design an automated
data-collection pipeline that links roughly one hundred thousand accepted
papers to the baselines and datasets they actually used. Second, we propose a
collective perception enhanced retriever. To represent the position of each
dataset or baseline within the scholarly network, it concatenates
self-descriptions with aggregated citation contexts. To achieve efficient
candidate recall, we finetune an embedding model on these representations.
Finally, we develop a reasoning-augmented reranker that exact interaction
chains to construct explicit reasoning chains and finetunes a large language
model to produce interpretable justifications and refined rankings. The dataset
we curated covers 85\% of the datasets and baselines used at top AI conferences
over the past five years. On our dataset, the proposed method outperforms the
strongest prior baseline with average gains of +5.85\% in Recall@20, +8.30\% in
HitRate@5. Taken together, our results advance reliable, interpretable
automation of experimental design.
Authors' comments: 10 pages
Jaehoon Lee, Sohyun Kim, Wanggeun Park, Geon Lee, Seungkyung Kim, Minyoung Lee
Existing benchmarks for visual document retrieval (VDR) largely overlook
non-English languages and the structural complexity of official publications.
To address this critical gap, we introduce SDS KoPub VDR, the first
large-scale, publicly available benchmark for retrieving and understanding
Korean public documents. The benchmark is built upon a corpus of 361 real-world
documents (40,781 pages), including 256 files under the KOGL Type 1 license and
105 from official legal portals, capturing complex visual elements like tables,
charts, and multi-column layouts. To establish a challenging and reliable
evaluation set, we constructed 600 query-page-answer triples. These were
initially generated using multimodal models (e.g., GPT-4o) and subsequently
underwent a rigorous human verification and refinement process to ensure
factual accuracy and contextual relevance. The queries span six major public
domains and are systematically categorized by the reasoning modality required:
text-based, visual-based (e.g., chart interpretation), and cross-modal. We
evaluate SDS KoPub VDR on two complementary tasks that reflect distinct
retrieval paradigms: (1) text-only retrieval, which measures a model's ability
to locate relevant document pages based solely on textual signals, and (2)
multimodal retrieval, which assesses retrieval performance when visual features
(e.g., tables, charts, and layouts) are jointly leveraged alongside text. This
dual-task evaluation reveals substantial performance gaps, particularly in
multimodal scenarios requiring cross-modal reasoning, even for state-of-the-art
models. As a foundational resource, SDS KoPub VDR not only enables rigorous and
fine-grained evaluation across textual and multimodal retrieval tasks but also
provides a clear roadmap for advancing multimodal AI in complex, real-world
document intelligence.
Authors' comments: 27 pages, 15 figures, 6 tables
Vasco V. Branco, Jandó Benedek, Lidia Pivovarova, Luís Correia, Pedro Cardoso
1. A hard stop for the implementation of rigorous conservation initiatives is our lack of key species data, especially occurrence data. Furthermore, researchers have to contend with an accelerated speed at which new information must be collected and processed due to anthropogenic activity. Publications ranging from scientific papers to gray literature contain this crucial information but their data are often not machine-readable, requiring extensive human work to be retrieved. 2. We present the ARETE R package, an open-source software aiming to automate data extraction of species occurrences powered by large language models, namely using the chatGPT Application Programming Interface. This R package integrates all steps of the data extraction and validation process, from Optical Character Recognition to detection of outliers and output in tabular format. Furthermore, we validate ARETE through systematic comparison between what is modelled and the work of human annotators. 3. We demonstrate the usefulness of the approach by comparing range maps produced using GBIF data and with those automatically extracted for 100 species of spiders. Newly extracted data allowed to expand the known Extent of Occurrence by a mean three orders of magnitude, revealing new areas where the species were found in the past, which mayhave important implications for spatial conservation planning and extinction risk assessments. 4. ARETE allows faster access to hitherto untapped occurrence data, a potential game changer in projects requiring such data. Researchers will be able to better prioritize resources, manually verifying selected species while maintaining automated extraction for the majority. This workflow also allows predicting available bibliographic data during project planning.
Sadia Sultana, Saiyma Sittul Muna, Mosammat Zannatul Samarukh, Ajwad Abrar, Tareque Mohmud Chowdhury
Developing accurate biomedical Question Answering (QA) systems in
low-resource languages remains a major challenge, limiting equitable access to
reliable medical knowledge. This paper introduces BanglaMedQA and
BanglaMMedBench, the first large-scale Bangla biomedical Multiple Choice
Question (MCQ) datasets designed to evaluate reasoning and retrieval in medical
artificial intelligence (AI). The study applies and benchmarks several
Retrieval-Augmented Generation (RAG) strategies, including Traditional,
Zero-Shot Fallback, Agentic, Iterative Feedback, and Aggregate RAG, combining
textbook-based and web retrieval with generative reasoning to improve factual
accuracy. A key novelty lies in integrating a Bangla medical textbook corpus
through Optical Character Recognition (OCR) and implementing an Agentic RAG
pipeline that dynamically selects between retrieval and reasoning strategies.
Experimental results show that the Agentic RAG achieved the highest accuracy
89.54% with openai/gpt-oss-120b, outperforming other configurations and
demonstrating superior rationale quality. These findings highlight the
potential of RAG-based methods to enhance the reliability and accessibility of
Bangla medical QA, establishing a foundation for future research in
multilingual medical artificial intelligence.
Authors' comments: Under Review
Xinying Qian, Ying Zhang, Yu Zhao, Baohang Zhou, Xuhui Sui, Xiaojie Yuan
Temporal Knowledge Graph Question Answering (TKGQA) aims to answer
time-sensitive questions by leveraging factual information from Temporal
Knowledge Graphs (TKGs). While previous studies have employed pre-trained TKG
embeddings or graph neural networks to inject temporal knowledge, they fail to
fully understand the complex semantic information of time constraints.
Recently, Large Language Models (LLMs) have shown remarkable progress,
benefiting from their strong semantic understanding and reasoning
generalization capabilities. However, their temporal reasoning ability remains
limited. LLMs frequently suffer from hallucination and a lack of knowledge. To
address these limitations, we propose the Plan of Knowledge framework with a
contrastive temporal retriever, which is named PoK. Specifically, the proposed
Plan of Knowledge module decomposes a complex temporal question into a sequence
of sub-objectives from the pre-defined tools, serving as intermediate guidance
for reasoning exploration. In parallel, we construct a Temporal Knowledge Store
(TKS) with a contrastive retrieval framework, enabling the model to selectively
retrieve semantically and temporally aligned facts from TKGs. By combining
structured planning with temporal knowledge retrieval, PoK effectively enhances
the interpretability and factual consistency of temporal reasoning. Extensive
experiments on four benchmark TKGQA datasets demonstrate that PoK significantly
improves the retrieval precision and reasoning accuracy of LLMs, surpassing the
performance of the state-of-the-art TKGQA methods by 56.0% at most.
Authors' comments: Submitted to the IEEE for possible publication
Shiyin Lin
Large Language Models (LLMs) enhanced with retrieval -- commonly referred to as Retrieval-Augmented Generation (RAG) -- have demonstrated strong performance in knowledge-intensive tasks. However, RAG pipelines often fail when retrieved evidence is incomplete, leaving gaps in the reasoning process. In such cases, \emph{abductive inference} -- the process of generating plausible missing premises to explain observations -- offers a principled approach to bridge these gaps. In this paper, we propose a framework that integrates abductive inference into retrieval-augmented LLMs. Our method detects insufficient evidence, generates candidate missing premises, and validates them through consistency and plausibility checks. Experimental results on abductive reasoning and multi-hop QA benchmarks show that our approach improves both answer accuracy and reasoning faithfulness. This work highlights abductive inference as a promising direction for enhancing the robustness and explainability of RAG systems.
Günther Schindler, Maximilian Schambach, Michael Medek, Sam Thelin
We study LLMs for tabular prediction with mixed text, numeric, and categorical fields. We introduce TabGemma, a schema-agnostic in-context learner that treats rows as sequences and tackles two practical hurdles when adapting pretrained LLMs for tabular predictions: unstable numeric tokenization and limited context size. We propose to canonicalize numbers via signed scientific notation and continue pretraining of a 12B Gemma 3 model with a target imputation objective using a large-scale real world dataset. For inference, we use a compact n-gram-based retrieval to select informative exemplars that fit within a 128k-token window. On semantically rich benchmarks, TabGemma establishes a new state of the art on classification across low- and high-data regimes and improves monotonically with more context rows. For regression, it is competitive at small sample sizes but trails conventional approaches as data grows. Our results show that LLMs can be effective tabular in-context learners on highly semantic tasks when paired with dedicated numeric handling and context retrieval, while motivating further advances in numeric modeling and long-context scaling.
One Octadion, Bondan Sapta Prakoso, Nanang Yudi Setiawan, Novanto Yudistira
In this study, we explore the fine-tuning of Large Language Models (LLMs) to
better support policymakers in their crucial work of understanding, analyzing,
and crafting legal regulations. To equip the model with a deep understanding of
legal texts, we curated a supervised dataset tailored to the specific needs of
the legal domain. Additionally, we integrated the Retrieval-Augmented
Generation (RAG) method, enabling the LLM to access and incorporate up-to-date
legal knowledge from external sources. This combination of fine-tuning and
RAG-based augmentation results in a tool that not only processes legal
information but actively assists policymakers in interpreting regulations and
drafting new ones that align with current needs. The results demonstrate that
this approach can significantly enhance the effectiveness of legal research and
regulation development, offering a valuable resource in the ever-evolving field
of law.
Authors' comments: 11 pages (including references), 2 figures, 4 tables, published in
Atlantis Press (Open Access under CC BY-NC 4.0 license)
Yi Yang, Yiming Xu, Timo Kaiser, Hao Cheng, Bodo Rosenhahn, Michael Ying Yang
In this report, we present our solution to the MOT25-Spatiotemporal Action Grounding (MOT25-StAG) Challenge. The aim of this challenge is to accurately localize and track multiple objects that match specific and free-form language queries, using video data of complex real-world scenes as input. We model the underlying task as a video retrieval problem and present a two-stage, zero-shot approach, combining the advantages of the SOTA tracking model FastTracker and Multi-modal Large Language Model LLaVA-Video. On the MOT25-StAG test set, our method achieves m-HIoU and HOTA scores of 20.68 and 10.73 respectively, which won second place in the challenge.
Shantanu Agarwal, Joel Barry, Elizabeth Boschee, Scott Miller
Machine Translation for English Retrieval of Information in Any Language (MATERIAL) is an IARPA initiative targeted to advance the state of cross-lingual information retrieval (CLIR). This report provides a detailed description of Information Sciences Institute's (ISI's) Summarization and domain-Adaptive Retrieval Across Language's (SARAL's) effort for MATERIAL. Specifically, we outline our team's novel approach to handle CLIR with emphasis in developing an approach amenable to retrieve a query-relevant document \textit{set}, and not just a ranked document-list. In MATERIAL's Phase-3 evaluations, SARAL exceeded the performance of other teams in five out of six evaluation conditions spanning three different languages (Farsi, Kazakh, and Georgian).
Wenchang Lei, Ping Zou, Yue Wang, Feng Sun, Lei Zhao
Large language models (LLMs) exhibit strong semantic understanding, yet
struggle when user instructions involve ambiguous or conceptually misaligned
terms. We propose the Language Graph Model (LGM) to enhance conceptual clarity
by extracting meta-relations-inheritance, alias, and composition-from natural
language. The model further employs a reflection mechanism to validate these
meta-relations. Leveraging a Concept Iterative Retrieval Algorithm, these
relations and related descriptions are dynamically supplied to the LLM,
improving its ability to interpret concepts and generate accurate responses.
Unlike conventional Retrieval-Augmented Generation (RAG) approaches that rely
on extended context windows, our method enables large language models to
process texts of any length without the need for truncation. Experiments on
standard benchmarks demonstrate that the LGM consistently outperforms existing
RAG baselines.
Authors' comments: 30 pages, 5 figures
C. A. Murray, Z. Berta-Thompson
Starspot-crossing events (SCEs) in exoplanet transit lightcurves are becoming
increasingly common as we focus on cooler host stars and observe higher
precision photometric and spectroscopic lightcurves. In this work we explore
how these events affect our retrievals of transit depths, and the accuracy with
which we can derive spot properties. We inject and recover synthetic SCEs in
photometric lightcurves using starry. We find that for high signal-to-noise
SCEs we constrain the spot longitudes tightly (>80% within 1 degree of the true
value), but degeneracies complicate retrieving spot contrasts, radii and
latitudes (within 17%, 19%, and 9 degrees respectively). On average the
difference between injected and recovered transit depths is 0.78% or 78.3ppm.
In most (80%) injections we recover the transit depth to within 0.6%. For
transit depths inflated >1.3% by the Transit Light Source Effect (TLSE),
fitting for a spot-crossing improves the transit depth retrieval over masking
the SCE in >95% of cases. However, we find that for spots with small contrasts
(<5%) and/or covering fractions (<2%), we are likely to over-correct for the
TLSE, recovering a worse transit depth than simply masking. In addition, even
when fitted, we find SCEs can inflate the uncertainties on recovered transit
depths significantly, especially for JWST-like precisions. Finally, we
determine how SCE observables can narrow the degenerate spot parameter space to
provide useful priors for MCMC sampling, demonstrating this technique on a real
SCE observed in Kepler-51d's lightcurve.
Authors' comments: Submitted to AJ
Xu Liu, Yan Chen, Kan Ling, Yichi Zhu, Hengrun Zhang, Guisheng Fan, Huiqun Yu
The widespread deployment of Large Language Models (LLMs) as public-facing web services and APIs has made their security a core concern for the web ecosystem. Jailbreak attacks, as one of the significant threats to LLMs, have recently attracted extensive research. In this paper, we reveal a jailbreak strategy which can effectively evade current defense strategies. It can extract valuable information from failed or partially successful attack attempts and contains self-evolution from attack interactions, resulting in sufficient strategy diversity and adaptability. Inspired by continuous learning and modular design principles, we propose ASTRA, a jailbreak framework that autonomously discovers, retrieves, and evolves attack strategies to achieve more efficient and adaptive attacks. To enable this autonomous evolution, we design a closed-loop "attack-evaluate-distill-reuse" core mechanism that not only generates attack prompts but also automatically distills and generalizes reusable attack strategies from every interaction. To systematically accumulate and apply this attack knowledge, we introduce a three-tier strategy library that categorizes strategies into Effective, Promising, and Ineffective based on their performance scores. The strategy library not only provides precise guidance for attack generation but also possesses exceptional extensibility and transferability. We conduct extensive experiments under a black-box setting, and the results show that ASTRA achieves an average Attack Success Rate (ASR) of 82.7%, significantly outperforming baselines.
Yuting Yang, Tiancheng Yuan, Jamal Hashim, Thiago Garrett, Jeffrey Qian, Ann Zhang, Yifan Wang, Weijia Song et al.
There is growing interest in deploying ML inference and knowledge retrieval as services that could support both interactive queries by end users and more demanding request flows that arise from AIs integrated into a end-user applications and deployed as agents. Our central premise is that these latter cases will bring service level latency objectives (SLOs). Existing ML serving platforms use batching to optimize for high throughput, exposing them to unpredictable tail latencies. Vortex enables an SLO-first approach. For identical tasks, Vortex's pipelines achieve significantly lower and more stable latencies than TorchServe and Ray Serve over a wide range of workloads, often enabling a given SLO target at more than twice the request rate. When RDMA is available, the Vortex advantage is even more significant.
Elias Lumer, Faheem Nizar, Anmol Gulati, Pradeep Honaganahalli Basavaraju, Vamse Kumar Subbiah
Recent advances in LLM Multi-Agent Systems enable scalable orchestration of sub-agents, each coordinating hundreds or thousands of tools or Model Context Protocol (MCP) servers. However, existing retrieval methods typically match queries against coarse agent-level descriptions before routing, which obscures fine-grained tool functionality and often results in suboptimal agent selection. We introduce Tool-to-Agent Retrieval, a unified framework that embeds both tools and their parent agents in a shared vector space and connects them through metadata relationships. By explicitly representing tool capabilities and traversing metadata to the agent level, Tool-to-Agent Retrieval enables granular tool-level or agent-level retrieval, ensuring that agents and their underlying tools or MCP servers are equally represented without the context dilution that arises from chunking many tools together. Evaluating Tool-to-Agent Retrieval across eight embedding models, our approach achieves consistent improvements of 19.4% in Recall@5 and 17.7% in nDCG@5 over previous state-of-the-art agent retrievers on the LiveMCPBench benchmark.
Shamse Tasnim Cynthia, Banani Roy
Scientific Workflow Management Systems (SWfMSs) such as Galaxy have become essential infrastructure in bioinformatics, supporting the design, execution, and sharing of complex multi-step analyses. Despite hosting hundreds of reusable workflows across domains, Galaxy's current keyword-based retrieval system offers limited support for semantic query interpretation and often fails to surface relevant workflows when exact term matches are absent. To address this gap, we propose a task-aware, two-stage retrieval framework that integrates dense vector search with large language model (LLM)-based reranking. Our system first retrieves candidate workflows using state-of-the-art embedding models and then reranks them using instruction-tuned generative LLMs (GPT-4o, Mistral-7B) based on semantic task alignment. To support robust evaluation, we construct a benchmark dataset of Galaxy workflows annotated with semantic topics via BERTopic and synthesize realistic task-oriented queries using LLMs. We conduct a comprehensive comparison of lexical, dense, and reranking models using standard IR metrics, presenting the first systematic evaluation of retrieval performance in the Galaxy ecosystem. Results show that our approach significantly improves top-k accuracy and relevance, particularly for long or under-specified queries. We further integrate our system as a prototype tool within Galaxy, providing a proof-of-concept for LLM-enhanced workflow search. This work advances the usability and accessibility of scientific workflows, especially for novice users and interdisciplinary researchers.