Jinxin Shi, Zongsheng Cao, Runmin Ma, Yusong Hu, Jie Zhou, Xin Li, Lei Bai, Liang He et al.
The deep-research framework orchestrates external tools to perform complex,
multi-step scientific reasoning that exceeds the native limits of a single
large language model. However, it still suffers from context pollution, weak
evidentiary support, and brittle execution paths. To address these issues, we
propose DualResearch, a retrieval and fusion framework that matches the
epistemic structure of tool-intensive reasoning by jointly modeling two
complementary graphs: a breadth semantic graph that encodes stable background
knowledge, and a depth causal graph that captures execution provenance. Each
graph has a layer-native relevance function, seed-anchored semantic diffusion
for breadth, and causal-semantic path matching with reliability weighting for
depth. To reconcile their heterogeneity and query-dependent uncertainty,
DualResearch converts per-layer path evidence into answer distributions and
fuses them in log space via an entropy-gated rule with global calibration. The
fusion up-weights the more certain channel and amplifies agreement. As a
complement to deep-research systems, DualResearch compresses lengthy multi-tool
execution logs into a concise reasoning graph, and we show that it can
reconstruct answers stably and effectively. On the scientific reasoning
benchmarks HLE and GPQA, DualResearch achieves competitive performance. Using
log files from the open-source system InternAgent, its accuracy improves by
7.7% on HLE and 6.06% on GPQA.
Authors' comments: 16 pages, 6 figures, 5 tables, Under Review
Kostiantyn Bevziuk, Andrii Fatula, Svetozar Lashin Yaroslav Opanasenko, Anna Tukhtarova, Ashok Jallepalli Pradeepkumar Sharma, Hritvik Shrivastava
We present a repository decomposition system that converts large software repositories into a vectorized knowledge graph which mirrors project architectural and semantic structure, capturing semantic relationships and allowing a significant level of automatization of further repository development. The graph encodes syntactic relations such as containment, implementation, references, calls, and inheritance, and augments nodes with LLM-derived summaries and vector embeddings. A hybrid retrieval pipeline combines semantic retrieval with graph-aware expansion, and an LLM-based assistant formulates constrained, read-only graph requests and produces human-oriented explanations.
Vasudha Yanuganti, Ishaan Puri, Swapnil Chhatre, Mantinder Singh, Ashok Jallepalli, Hritvik Shrivastava, Pradeep Kumar Sharma
Modern codebases make it hard for developers and AI coding assistants to find the right source files when answering questions like "How does this feature work?" or "Where was the bug introduced?" Traditional code search (keyword or IR based) often misses semantic context and cross file links, while large language models (LLMs) understand natural language but lack repository specific detail. We present a method for file path retrieval that fine tunes a strong LLM (Qwen3-8B) with QLoRA and Unsloth optimizations to predict relevant file paths directly from a natural language query. To build training data, we introduce six code aware strategies that use abstract syntax tree (AST) structure and repository content to generate realistic question-answer pairs, where answers are sets of file paths. The strategies range from single file prompts to hierarchical repository summaries, providing broad coverage. We fine tune on Python projects including Flask, Click, Jinja, FastAPI, and PyTorch, and obtain high retrieval accuracy: up to 91\% exact match and 93\% recall on held out queries, clearly beating single strategy training. On a large codebase like PyTorch (about 4,000 Python files), the model reaches 59\% recall, showing scalability. We analyze how multi level code signals help the LLM reason over cross file context and discuss dataset design, limits (for example, context length in very large repos), and future integration of retrieval with LLM based code intelligence.
Jianlyu Chen, Junwei Lan, Chaofan Li, Defu Lian, Zheng Liu
In this paper, we introduce ReasonEmbed, a novel text embedding model
developed for reasoning-intensive document retrieval. Our work includes three
key technical contributions. First, we propose ReMixer, a new data synthesis
method that overcomes the triviality problem prevalent in previous synthetic
datasets, enabling large-scale production of 82K high-quality training samples.
Second, we design Redapter, a self-adaptive learning algorithm that dynamically
adjusts training each sample's weight based on its reasoning intensity. This
allows the model to effectively capture the complex semantic relationships
between queries and documents. Third, we implement ReasonEmbed across multiple
backbones of varying sizes, all of which achieve superior performance on
reasoning-intensive retrieval tasks. Notably, our ReasonEmbed-Qwen3-8B model
offers a record-high nDCG@10 score of 38.1 on the BRIGHT benchmark, which
significantly outperforms existing text embedding models. We will fully
open-source our created resources in ReasonEmbed to push forward the research
advancement in this field.
Authors' comments: 17 pages, 3 figures
Daniel Huwiler, Kurt Stockinger, Jonathan Fürst
Retrieval-Augmented Generation (RAG) systems fail when documents evolve through versioning-a ubiquitous characteristic of technical documentation. Existing approaches achieve only 58-64% accuracy on version-sensitive questions, retrieving semantically similar content without temporal validity checks. We present VersionRAG, a version-aware RAG framework that explicitly models document evolution through a hierarchical graph structure capturing version sequences, content boundaries, and changes between document states. During retrieval, VersionRAG routes queries through specialized paths based on intent classification, enabling precise version-aware filtering and change tracking. On our VersionQA benchmark-100 manually curated questions across 34 versioned technical documents-VersionRAG achieves 90% accuracy, outperforming naive RAG (58%) and GraphRAG (64%). VersionRAG reaches 60% accuracy on implicit change detection where baselines fail (0-10%), demonstrating its ability to track undocumented modifications. Additionally, VersionRAG requires 97% fewer tokens during indexing than GraphRAG, making it practical for large-scale deployment. Our work establishes versioned document QA as a distinct task and provides both a solution and benchmark for future research.
Yuxin Huang, Simeng Wu, Ran Song, Yan Xiang, Yantuan Xian, Shengxiang Gao, Zhengtao Yu
Generative Information Retrieval is an emerging retrieval paradigm that
exhibits remarkable performance in monolingual scenarios.However, applying
these methods to multilingual retrieval still encounters two primary
challenges, cross-lingual identifier misalignment and identifier inflation. To
address these limitations, we propose Multilingual Generative Retrieval via
Cross-lingual Semantic Compression (MGR-CSC), a novel framework that unifies
semantically equivalent multilingual keywords into shared atoms to align
semantics and compresses the identifier space, and we propose a dynamic
multi-step constrained decoding strategy during retrieval. MGR-CSC improves
cross-lingual alignment by assigning consistent identifiers and enhances
decoding efficiency by reducing redundancy. Experiments demonstrate that
MGR-CSC achieves outstanding retrieval accuracy, improving by 6.83% on
mMarco100k and 4.77% on mNQ320k, while reducing document identifiers length by
74.51% and 78.2%, respectively.
Authors' comments: EMNLP 2025, Findings, Long
Peilin Wu, Mian Zhang, Kun Wan, Wentian Zhao, Kaiyu He, Xinya Du, Zhiyu Chen
Agentic RAG is a powerful technique for incorporating external information
that LLMs lack, enabling better problem solving and question answering.
However, suboptimal search behaviors exist widely, such as over-search
(retrieving information already known) and under-search (failing to search when
necessary), which leads to unnecessary overhead and unreliable outputs. Current
training methods, which typically rely on outcome-based rewards in a RL
framework, lack the fine-grained control needed to address these
inefficiencies. To overcome this, we introduce Hierarchical Process Rewards for
Efficient agentic RAG (HiPRAG), a training methodology that incorporates a
fine-grained, knowledge-grounded process reward into the RL training. Our
approach evaluates the necessity of each search decision on-the-fly by
decomposing the agent's reasoning trajectory into discrete, parsable steps. We
then apply a hierarchical reward function that provides an additional bonus
based on the proportion of optimal search and non-search steps, on top of
commonly used outcome and format rewards. Experiments on the Qwen2.5 and
Llama-3.2 models across seven diverse QA benchmarks show that our method
achieves average accuracies of 65.4% (3B) and 67.2% (7B). This is accomplished
while improving search efficiency, reducing the over-search rate to just 2.3%
and concurrently lowering the under-search rate. These results demonstrate the
efficacy of optimizing the reasoning process itself, not just the final
outcome. Further experiments and analysis demonstrate that HiPRAG shows good
generalizability across a wide range of RL algorithms, model families, sizes,
and types. This work demonstrates the importance and potential of fine-grained
control through RL, for improving the efficiency and optimality of reasoning
for search agents.
Authors' comments: Under review
Markus Reuter, Tobias Lingenberg, Rūta Liepiņa, Francesca Lagioia, Marco Lippi, Giovanni Sartor, Andrea Passerini, Burcu Sayin
Retrieval-Augmented Generation (RAG) is a promising approach to mitigate
hallucinations in Large Language Models (LLMs) for legal applications, but its
reliability is critically dependent on the accuracy of the retrieval step. This
is particularly challenging in the legal domain, where large databases of
structurally similar documents often cause retrieval systems to fail. In this
paper, we address this challenge by first identifying and quantifying a
critical failure mode we term Document-Level Retrieval Mismatch (DRM), where
the retriever selects information from entirely incorrect source documents. To
mitigate DRM, we investigate a simple and computationally efficient technique
which we refer to as Summary-Augmented Chunking (SAC). This method enhances
each text chunk with a document-level synthetic summary, thereby injecting
crucial global context that would otherwise be lost during a standard chunking
process. Our experiments on a diverse set of legal information retrieval tasks
show that SAC greatly reduces DRM and, consequently, also improves text-level
retrieval precision and recall. Interestingly, we find that a generic
summarization strategy outperforms an approach that incorporates legal expert
domain knowledge to target specific legal elements. Our work provides evidence
that this practical, scalable, and easily integrable technique enhances the
reliability of RAG systems when applied to large-scale legal document datasets.
Authors' comments: Accepted for the 7th Natural Legal Language Processing Workshop (NLLP
2025), co-located with EMNLP 2025
Junki Mori, Kazuya Kakizaki, Taiki Miyagawa, Jun Sakuma
Retrieval-Augmented Generation (RAG) enhances large language models (LLMs) by
grounding them in external knowledge. However, its application in sensitive
domains is limited by privacy risks. Existing private RAG methods typically
rely on query-time differential privacy (DP), which requires repeated noise
injection and leads to accumulated privacy loss. To address this issue, we
propose DP-SynRAG, a framework that uses LLMs to generate differentially
private synthetic RAG databases. Unlike prior methods, the synthetic text can
be reused once created, thereby avoiding repeated noise injection and
additional privacy costs. To preserve essential information for downstream RAG
tasks, DP-SynRAG extends private prediction, which instructs LLMs to generate
text that mimics subsampled database records in a DP manner. Experiments show
that DP-SynRAG achieves superior performanec to the state-of-the-art private
RAG systems while maintaining a fixed privacy budget, offering a scalable
solution for privacy-preserving RAG.
Authors' comments: Under review
Avishree Khare, Hideki Okamoto, Bardh Hoxha, Georgios Fainekos, Rajeev Alur
Neural models such as YOLO and HuBERT can be used to detect local properties such as objects ("car") and emotions ("angry") in individual frames of videos and audio clips respectively. The likelihood of these detections is indicated by scores in [0, 1]. Lifting these scores to temporal properties over sequences can be useful for several downstream applications such as query matching (e.g., "does the speaker eventually sound happy in this audio clip?"), and ranked retrieval (e.g., "retrieve top 5 videos with a 10 second scene where a car is detected until a pedestrian is detected"). In this work, we formalize this problem of assigning Scores for TempOral Properties (STOPs) over sequences, given potentially noisy score predictors for local properties. We then propose a scoring function called LogSTOP that can efficiently compute these scores for temporal properties represented in Linear Temporal Logic. Empirically, LogSTOP, with YOLO and HuBERT, outperforms Large Vision / Audio Language Models and other Temporal Logic-based baselines by at least 16% on query matching with temporal properties over objects-in-videos and emotions-in-speech respectively. Similarly, on ranked retrieval with temporal properties over objects and actions in videos, LogSTOP with Grounding DINO and SlowR50 reports at least a 19% and 16% increase in mean average precision and recall over zero-shot text-to-video retrieval baselines respectively.
Yoav Gur-Arieh, Mor Geva, Atticus Geiger
A key component of in-context reasoning is the ability of language models (LMs) to bind entities for later retrieval. For example, an LM might represent "Ann loves pie" by binding "Ann" to "pie", allowing it to later retrieve "Ann" when asked "Who loves pie?" Prior research on short lists of bound entities found strong evidence that LMs implement such retrieval via a positional mechanism, where "Ann" is retrieved based on its position in context. In this work, we find that this mechanism generalizes poorly to more complex settings; as the number of bound entities in context increases, the positional mechanism becomes noisy and unreliable in middle positions. To compensate for this, we find that LMs supplement the positional mechanism with a lexical mechanism (retrieving "Ann" using its bound counterpart "pie") and a reflexive mechanism (retrieving "Ann" through a direct pointer). Through extensive experiments on nine models and ten binding tasks, we uncover a consistent pattern in how LMs mix these mechanisms to drive model behavior. We leverage these insights to develop a causal model combining all three mechanisms that estimates next token distributions with 95% agreement. Finally, we show that our model generalizes to substantially longer inputs of open-ended text interleaved with entity groups, further demonstrating the robustness of our findings in more natural settings. Overall, our study establishes a more complete picture of how LMs bind and retrieve entities in-context.
Deshui Yu, Yizhi Wang, Saihui Jin, Taojie Zhu, Fanyi Zeng, Wen Qian, Zirui Huang, Jingli Ouyang et al.
Large language models (LLMs) excel on general tasks yet still hallucinate in high-barrier domains such as pathology. Prior work often relies on domain fine-tuning, which neither expands the knowledge boundary nor enforces evidence-grounded constraints. We therefore build a pathology vector database covering 28 subfields and 1.53 million paragraphs, and present YpathRAG, a pathology-oriented RAG framework with dual-channel hybrid retrieval (BGE-M3 dense retrieval coupled with vocabulary-guided sparse retrieval) and an LLM-based supportive-evidence judgment module that closes the retrieval-judgment-generation loop. We also release two evaluation benchmarks, YpathR and YpathQA-M. On YpathR, YpathRAG attains Recall@5 of 98.64%, a gain of 23 percentage points over the baseline; on YpathQA-M, a set of the 300 most challenging questions, it increases the accuracies of both general and medical LLMs by 9.0% on average and up to 15.6%. These results demonstrate improved retrieval quality and factual reliability, providing a scalable construction paradigm and interpretable evaluation for pathology-oriented RAG.
Bruno Korbar, Andrew Zisserman
The goal of this paper is to be able to retrieve images using a compound
query that combines object instance information from an image, with a natural
text description of what that object is doing or where it is. For example, to
retrieve an image of "Fluffy the unicorn (specified by an image) on someone's
head". To achieve this we design a mapping network that can "translate" from a
local image embedding (of the object instance) to a text token, such that the
combination of the token and a natural language query is suitable for CLIP
style text encoding, and image retrieval. Generating a text token in this
manner involves a simple training procedure, that only needs to be performed
once for each object instance. We show that our approach of using a trainable
mapping network, termed pi-map, together with frozen CLIP text and image
encoders, improves the state of the art on two benchmarks designed to assess
personalized retrieval.
Authors' comments: Published as an oral in CBMI2025
Yufeng Du, Minyang Tian, Srikanth Ronanki, Subendhu Rongali, Sravan Bodapati, Aram Galstyan, Azton Wells, Roy Schwartz et al.
Large language models (LLMs) often fail to scale their performance on
long-context tasks performance in line with the context lengths they support.
This gap is commonly attributed to retrieval failures -- the models' inability
to identify relevant information in the long inputs. Accordingly, recent
efforts often focus on evaluating and improving LLMs' retrieval performance: if
retrieval is perfect, a model should, in principle, perform just as well on a
long input as it does on a short one -- or should it? This paper presents
findings that the answer to this question may be negative. Our systematic
experiments across 5 open- and closed-source LLMs on math, question answering,
and coding tasks reveal that, even when models can perfectly retrieve all
relevant information, their performance still degrades substantially
(13.9%--85%) as input length increases but remains well within the models'
claimed lengths. This failure occurs even when the irrelevant tokens are
replaced with minimally distracting whitespace, and, more surprisingly, when
they are all masked and the models are forced to attend only to the relevant
tokens. A similar performance drop is observed when all relevant evidence is
placed immediately before the question. Our findings reveal a
previously-unrealized limitation: the sheer length of the input alone can hurt
LLM performance, independent of retrieval quality and without any distraction.
They motivate our simple, model-agnostic mitigation strategy that transforms a
long-context task into a short-context one by prompting the model to recite the
retrieved evidence before attempting to solve the problem. On RULER, we observe
a consistent improvement of GPT-4o up to 4% on an already strong baseline.
Authors' comments: 18 pages (9 pages of main content), 5 figures, accepted at the
Findings of EMNLP 2025
Omri Uzan, Asaf Yehudai, Roi pony, Eyal Shnarch, Ariel Gera
Multimodal encoders have pushed the boundaries of visual document retrieval, matching textual query tokens directly to image patches and achieving state-of-the-art performance on public benchmarks. Recent models relying on this paradigm have massively scaled the sizes of their query and document representations, presenting obstacles to deployment and scalability in real-world pipelines. Furthermore, purely vision-centric approaches may be constrained by the inherent modality gap still exhibited by modern vision-language models. In this work, we connect these challenges to the paradigm of hybrid retrieval, investigating whether a lightweight dense text retriever can enhance a stronger vision-centric model. Existing hybrid methods, which rely on coarse-grained fusion of ranks or scores, fail to exploit the rich interactions within each model's representation space. To address this, we introduce Guided Query Refinement (GQR), a novel test-time optimization method that refines a primary retriever's query embedding using guidance from a complementary retriever's scores. Through extensive experiments on visual document retrieval benchmarks, we demonstrate that GQR allows vision-centric models to match the performance of models with significantly larger representations, while being up to 14x faster and requiring 54x less memory. Our findings show that GQR effectively pushes the Pareto frontier for performance and efficiency in multimodal retrieval. We release our code at https://github.com/IBM/test-time-hybrid-retrieval
Yongan Yu, Xianda Du, Qingchen Hu, Jiahao Liang, Jingwei Ni, Dan Qiang, Kaiyu Huang, Grant McKenzie et al.
Historical archives on weather events are collections of enduring primary source records that offer rich, untapped narratives of how societies have experienced and responded to extreme weather events. These qualitative accounts provide insights into societal vulnerability and resilience that are largely absent from meteorological records, making them valuable for climate scientists to understand societal responses. However, their vast scale, noisy digitized quality, and archaic language make it difficult to transform them into structured knowledge for climate research. To address this challenge, we introduce WeatherArchive-Bench, the first benchmark for evaluating retrieval-augmented generation (RAG) systems on historical weather archives. WeatherArchive-Bench comprises two tasks: WeatherArchive-Retrieval, which measures a system's ability to locate historically relevant passages from over one million archival news segments, and WeatherArchive-Assessment, which evaluates whether Large Language Models (LLMs) can classify societal vulnerability and resilience indicators from extreme weather narratives. Extensive experiments across sparse, dense, and re-ranking retrievers, as well as a diverse set of LLMs, reveal that dense retrievers often fail on historical terminology, while LLMs frequently misinterpret vulnerability and resilience concepts. These findings highlight key limitations in reasoning about complex societal indicators and provide insights for designing more robust climate-focused RAG systems from archival contexts. The constructed dataset and evaluation framework are publicly available at https://anonymous.4open.science/r/WeatherArchive-Bench/.
Faisal Hamman, Chenyang Zhu, Anoop Kumar, Xujun Peng, Sanghamitra Dutta, Daben Liu, Alfy Samuel
RAG systems are increasingly deployed in high-stakes domains where users
expect outputs to be consistent across semantically equivalent queries.
However, existing systems often exhibit significant inconsistencies due to
variability in both the retriever and generator (LLM), undermining trust and
reliability. In this work, we focus on information consistency, i.e., the
requirement that outputs convey the same core content across semantically
equivalent inputs. We introduce a principled evaluation framework that
decomposes RAG consistency into retriever-level, generator-level, and
end-to-end components, helping identify inconsistency sources. To improve
consistency, we propose Paraphrased Set Group Relative Policy Optimization
(PS-GRPO), an RL approach that leverages multiple rollouts across paraphrased
set to assign group similarity rewards. We leverage PS-GRPO to achieve
Information Consistent RAG (Con-RAG), training the generator to produce
consistent outputs across paraphrased queries and remain robust to
retrieval-induced variability. Because exact reward computation over paraphrase
sets is computationally expensive, we also introduce a scalable approximation
method that retains effectiveness while enabling efficient, large-scale
training. Empirical evaluations across short-form, multi-hop, and long-form QA
benchmarks demonstrate that Con-RAG significantly improves both consistency and
accuracy over strong baselines, even in the absence of explicit ground-truth
supervision. Our work provides practical solutions for evaluating and building
reliable RAG systems for safety-critical deployments.
Authors' comments: Accepted at NeurIPS 2025 Workshop on Reliable ML from Unreliable Data
Lingnan Xu, Chong Feng, Kaiyuan Zhang, Liu Zhengyong, Wenqiang Xu, Fanqing Meng
While large language models (LLMs) demonstrate impressive capabilities, their
reliance on parametric knowledge often leads to factual inaccuracies.
Retrieval-Augmented Generation (RAG) mitigates this by leveraging external
documents, yet existing approaches treat retrieved passages as isolated chunks,
ignoring valuable structure that is crucial for document organization.
Motivated by this gap, we propose Retrieve-DocumentRoute-Read (RDR2), a novel
framework that explicitly incorporates structural information throughout the
RAG process. RDR2 employs an LLM-based router to dynamically navigate document
structure trees, jointly evaluating content relevance and hierarchical
relationships to assemble optimal evidence. Our key innovation lies in
formulating document routing as a trainable task, with automatic action
curation and structure-aware passage selection inspired by human reading
strategies. Through comprehensive evaluation on five challenging datasets, RDR2
achieves state-of-the-art performance, demonstrating that explicit structural
awareness significantly enhances RAG systems' ability to acquire and utilize
knowledge, particularly in complex scenarios requiring multi-document
synthesis.
Authors' comments: EMNLP2025 Findings
Simon Lupart, Daniël van Dijk, Eric Langezaal, Ian van Dort, Mohammad Aliannejadi
Personalized Conversational Information Retrieval (CIR) has seen rapid
progress in recent years, driven by the development of Large Language Models
(LLMs). Personalized CIR aims to enhance document retrieval by leveraging
user-specific information, such as preferences, knowledge, or constraints, to
tailor responses to individual needs. A key resource for this task is the TREC
iKAT 2023 dataset, designed to evaluate personalization in CIR pipelines.
Building on this resource, Mo et al. explored several strategies for
incorporating Personal Textual Knowledge Bases (PTKB) into LLM-based query
reformulation. Their findings suggested that personalization from PTKBs could
be detrimental and that human annotations were often noisy. However, these
conclusions were based on single-run experiments using the GPT-3.5 Turbo model,
raising concerns about output variability and repeatability. In this
reproducibility study, we rigorously reproduce and extend their work, focusing
on LLM output variability and model generalization. We apply the original
methods to the new TREC iKAT 2024 dataset and evaluate a diverse range of
models, including Llama (1B-70B), Qwen-7B, GPT-4o-mini. Our results show that
human-selected PTKBs consistently enhance retrieval performance, while
LLM-based selection methods do not reliably outperform manual choices. We
further compare variance across datasets and observe higher variability on iKAT
than on CAsT, highlighting the challenges of evaluating personalized CIR.
Notably, recall-oriented metrics exhibit lower variance than precision-oriented
ones, a critical insight for first-stage retrievers. Finally, we underscore the
need for multi-run evaluations and variance reporting when assessing LLM-based
CIR systems. By broadening evaluation across models, datasets, and metrics, our
study contributes to more robust and generalizable practices for personalized
CIR.
Authors' comments: 11 pages, 5 figures, SIGIR-AP'25 Proceedings of the 2025 Annual
International ACM SIGIR Conference on Research and Development in Information
Retrieval in the Asia Pacific Region (SIGIR-AP 2025), December 7--10, 2025,
Xi'an, China
Aneesha Sampath, Oya Aran, Emily Mower Provost
We introduce the SEER (Span-based Emotion Evidence Retrieval) Benchmark to test Large Language Models' (LLMs) ability to identify the specific spans of text that express emotion. Unlike traditional emotion recognition tasks that assign a single label to an entire sentence, SEER targets the underexplored task of emotion evidence detection: pinpointing which exact phrases convey emotion. This span-level approach is crucial for applications like empathetic dialogue and clinical support, which need to know how emotion is expressed, not just what the emotion is. SEER includes two tasks: identifying emotion evidence within a single sentence, and identifying evidence across a short passage of five consecutive sentences. It contains new annotations for both emotion and emotion evidence on 1200 real-world sentences. We evaluate 14 open-source LLMs and find that, while some models approach average human performance on single-sentence inputs, their accuracy degrades in longer passages. Our error analysis reveals key failure modes, including overreliance on emotion keywords and false positives in neutral text.