Katsunori Kitano, Toshio Aoyagi
It is well known that a sparsely coded network in which the activity level is
extremely low has intriguing equilibrium properties. In the present work, we
study the dynamical properties of a neural network designed to store sparsely
coded sequential patterns rather than static ones. Applying the theory of
statistical neurodynamics, we derive the dynamical equations governing the
retrieval process which are described by some macroscopic order parameters such
as the overlap. It is found that our theory provides good predictions for the
storage capacity and the basin of attraction obtained through numerical
simulations. The results indicate that the nature of the basin of attraction
depends on the methods of activity control employed. Furthermore, it is found
that robustness against random synaptic dilution slightly deteriorates with the
degree of sparseness.
Authors' comments: 9 pages including 4 EPSF figures, latex209, ref[21] is modified
David A. Evans, Chengxiang Zhai
Information retrieval is an important application area of natural-language
processing where one encounters the genuine challenge of processing large
quantities of unrestricted natural-language text. This paper reports on the
application of a few simple, yet robust and efficient noun-phrase analysis
techniques to create better indexing phrases for information retrieval. In
particular, we describe a hybrid approach to the extraction of meaningful
(continuous or discontinuous) subcompounds from complex noun phrases using both
corpus statistics and linguistic heuristics. Results of experiments show that
indexing based on such extracted subcompounds improves both recall and
precision in an information retrieval system. The noun-phrase analysis
techniques are also potentially useful for book indexing and automatic
thesaurus extraction.
Authors' comments: 8 pages, gzipped, uuencoded Postscript file, to appear in ACL'96
Jun-ichi Inoue
We investigate the retrieval phase diagrams of an asynchronous
fully-connected attractor network with non-monotonic transfer function by means
of a mean-field approximation. We find for the noiseless zero-temperature case
that this non-monotonic Hopfield network can store more patterns than a network
with monotonic transfer function investigated by Amit et al. Properties of
retrieval phase diagrams of non-monotonic networks agree with the results
obtained by Nishimori and Opris who treated synchronous networks. We also
investigate the optimal storage capacity of the non-monotonic Hopfield model
with state-dependent synaptic couplings introduced by Zertuche et al. We show
that the non-monotonic Hopfield model with state-dependent synapses stores more
patterns than the conventional Hopfield model. Our formulation can be easily
extended to a general transfer function.
Authors' comments: Latex 13 pages using IOP style file
Toshio Aoyagi
We propose a network of oscillators to retrieve given patterns in which the
oscillators keep a fixed phase relationship with one another. In this
description, the phase and the amplitude of the oscillators can be regarded as
the timing and the strength of the neuronal spikes, respectively. Using the
amplitudes for encoding, we enable the network to realize not only oscillatory
states but also non-firing states. In addition, it is shown that under suitable
conditions the system has a Lyapunov function ensuring a stable retrieval
process. Finally, the associative memory capability of the network is
demonstrated numerically.
Authors' comments: 9 pages (Revtex source file) including 2 figures (postscript)
Neil C. Rowe
We discuss implementation issues of MARIE-1, a mostly symbolic parser fully
implemented, and MARIE-2, a more statistical parser partially implemented. They
address a corpus of 100,000 picture captions. We argue that the mixed approach
of MARIE-2 should be better for this corpus because its algorithms (not data)
are simpler.
Authors' comments: Workshop on the Balancing Act, ACL-94, Las Cruces NM, July 1994
Iman Saberi, Fatemeh Fard
Large Language Models (LLMs) and Code-LLMs (CLLMs) have significantly
improved code generation, but they frequently face difficulties when dealing
with challenging and complex problems. Retrieval-Augmented Generation (RAG)
addresses this issue by retrieving and integrating external knowledge at
inference time. However, retrieval models often fail to find the most relevant
context, and generation models, with limited context capacity, can hallucinate
when given irrelevant data. We present a novel framework that leverages a
Programming Knowledge Graph (PKG) to semantically represent and retrieve code.
This approach enables fine-grained code retrieval by focusing on the most
relevant segments while reducing irrelevant context through a tree-pruning
technique. PKG is coupled with a re-ranking mechanism to further reduce
hallucinations by selectively integrating non-RAG solutions. We propose two
retrieval approaches, block-wise and function-wise, based on the PKG, optimizing
context granularity. Evaluations on the HumanEval and MBPP benchmarks show our
method improves pass@1 accuracy by up to 20%, and outperforms state-of-the-art
models by up to 34% on MBPP. Our contributions include PKG-based retrieval,
tree pruning to enhance retrieval precision, a re-ranking method for robust
solution selection and a Fill-in-the-Middle (FIM) enhancer module for automatic
code augmentation with relevant comments and docstrings.
Authors' comments: 20 pages, Conference
Partha Pratim Roy, Ayan Kumar Bhunia, Avirup Bhattacharyya, Umapada Pal
Retrieval of text information from natural scene images and video frames is a
challenging task due to its inherent problems like complex character shapes,
low resolution, background noise, etc. Available OCR systems often fail to
retrieve such information in scene/video frames. Keyword spotting, an
alternative way to retrieve information, performs efficient text searching in
such scenarios. However, current word spotting techniques in scene/video images
are script-specific and they are mainly developed for Latin script. This paper
presents a novel word spotting framework using dynamic shape coding for text
retrieval in natural scene images and video frames. The framework is designed to
search for a query keyword across multiple scripts with the help of on-the-fly
script-wise keyword generation for the corresponding script. We have used a
two-stage word spotting approach using Hidden Markov Model (HMM) to detect the
translated keyword in a given text line by identifying the script of the line.
A novel unsupervised dynamic shape coding based scheme has been used to group
similar shape characters to avoid confusion and to improve text alignment.
Next, the hypothesized locations are verified to improve retrieval performance.
To evaluate the proposed system for searching keywords in natural scene images
and video frames, we have considered two popular Indic scripts, Bangla
(Bengali) and Devanagari, along with English. Inspired by the zone-wise
recognition approach in Indic scripts[1], zone-wise text information has been
used to improve the traditional word spotting performance in Indic scripts. For
our experiment, a dataset consisting of images of different scenes and video
frames of English, Bangla and Devanagari scripts was considered. The results
obtained showed the effectiveness of our proposed word spotting approach.
Authors' comments: Multimedia Tools and Applications, Springer
Caprice L. Phillips, Jacqueline K. Faherty, Ben Burningham, Johanna M. Vos, Eileen Gonzales, Emily J. Griffith, Sherelyn Alejandro Merchan, Emily Calamari et al.
We present an atmospheric retrieval analysis on a set of young, cloudy, red
L-dwarfs -- CWISER J124332.12+600126.2 and WISEP J004701.06+680352.1 -- using
the Brewster retrieval framework. We also present the first elemental
abundance measurements of the young K-dwarf (K0) host star, BD+60 1417, using
high-resolution (R = 50,000) spectra taken with PEPSI/LBT. In the complex
cloudy L-dwarf regime the emergence of condensate cloud species complicates
retrieval analysis when only near-infrared data is available. We find that for
both L dwarfs in this work, despite testing three different thermal profile
parameterizations, we are unable to constrain reliable abundance measurements
and thus the C/O ratio. While we cannot conclude what the abundances are, we
can conclude that the data strongly favor a cloud model over a cloudless model.
We note that the difficulty in retrieval constraints persists regardless of the
signal to noise of the data examined (S/N $\sim$ 10 for CWISER
J124332.12+600126.2 and 40 for WISEP J004701.06+680352.1). The results
presented in this work provide valuable lessons about retrieving young,
low-surface gravity, cloudy L-dwarfs. This work provides continued evidence of
missing information in models and the crucial need for JWST to guide and inform
retrieval analysis in this regime.
Authors' comments: Accepted to ApJ
Ben W. P. Lew, Thomas Roellig, Natasha E. Batalha, Michael Line, Thomas Greene, Sagnick Murkherjee, Richard Freedman, Michael Meyer et al.
The launch of the James Webb Space Telescope (JWST) marks a pivotal moment
for precise atmospheric characterization of Y dwarfs, the coldest brown dwarf
spectral type. In this study, we leverage moderate spectral resolution
observations (R $\sim$ 2700) with the G395H grating of the Near-Infrared
Spectrograph (NIRSpec) onboard JWST to characterize the nearby (9.9 pc) Y
dwarf WISEPA J182831.08+265037.8 (WISE 1828). With the NIRSpec G395H 2.88-5.12
$\mathrm{\mu}$m spectrum, we measure the abundances of CO, CO$_2$, CH$_4$,
H$_2$S, NH$_3$, and H$_2$O, which are the major carbon, nitrogen, oxygen, and
sulfur bearing species in the atmosphere. Based on the retrieved volume mixing
ratios with the atmospheric retrieval framework CHIMERA, we report that the C/O
ratio is $0.45 \pm 0.01$, close to the solar C/O value of 0.55, and the
metallicity is +0.30 $\pm$ 0.02 dex. Comparison of the retrieval
results with the forward modeling results suggests that the model bias for C/O
and metallicity could be as high as 0.03 and 0.97 dex, respectively. We also
report a lower limit on the $^{12}$CO/$^{13}$CO ratio of $>40$,
consistent with the nominal solar value of 90. Our results highlight the
potential of JWST in measuring the C/O ratios down to percent-level precision
and characterizing isotopologues of cold planetary atmospheres similar to WISE
1828.
Authors' comments: 18 pages + references, including 11 figures, accepted for publication
in The Astronomical Journal
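As a small worked illustration of how a C/O ratio follows from retrieved volume mixing ratios, the sketch below counts carbon and oxygen atoms per molecule over the six species named in the abstract. It is only a gas-phase bookkeeping example under stated assumptions, not the CHIMERA procedure: the function name `c_to_o_ratio`, the table `STOICHIOMETRY`, and the mixing ratios in `example_vmr` are hypothetical and chosen purely to exercise the arithmetic.

```python
from typing import Dict, Tuple

# (carbon atoms, oxygen atoms) per molecule for the species named in the abstract.
STOICHIOMETRY: Dict[str, Tuple[int, int]] = {
    "CO": (1, 1),
    "CO2": (1, 2),
    "CH4": (1, 0),
    "H2O": (0, 1),
    "H2S": (0, 0),
    "NH3": (0, 0),
}


def c_to_o_ratio(vmr: Dict[str, float]) -> float:
    """Gas-phase C/O from volume mixing ratios, counting atoms per molecule."""
    carbon = sum(STOICHIOMETRY[s][0] * x for s, x in vmr.items())
    oxygen = sum(STOICHIOMETRY[s][1] * x for s, x in vmr.items())
    if oxygen == 0:
        raise ValueError("no oxygen-bearing species in the input")
    return carbon / oxygen


# Hypothetical mixing ratios (assumption): not values from the paper.
example_vmr = {"CH4": 4.0e-4, "CO": 1.0e-5, "CO2": 1.0e-7, "H2O": 8.0e-4}
print(f"C/O = {c_to_o_ratio(example_vmr):.3f}")
```

Species that carry neither carbon nor oxygen (H2S, NH3) leave the ratio unchanged, which is why they appear in the table with zero counts.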
Shengwei Zhao, Jingwen Yao, Sitong Wei, Linhai Xu, Yuying Liu, Dong Zhang, Zhiqiang Tian, Shaoyi Du
Multi-modal Retrieval-Augmented Generation (MMRAG) enables highly credible generation by integrating external multi-modal knowledge, thus demonstrating impressive performance in complex multi-modal scenarios. However, existing MMRAG methods fail to clarify the reasoning logic behind retrieval and response generation, which limits the explainability of the results. To address this gap, we propose to introduce reinforcement learning into multi-modal retrieval-augmented generation, enhancing the reasoning capabilities of multi-modal large language models through a two-stage reinforcement fine-tuning framework to achieve explainable multi-modal retrieval-augmented generation. Specifically, in the first stage, rule-based reinforcement fine-tuning is employed to perform coarse-grained point-wise ranking of multi-modal documents, effectively filtering out those that are significantly irrelevant. In the second stage, reasoning-based reinforcement fine-tuning is utilized to jointly optimize fine-grained list-wise ranking and answer generation, guiding multi-modal large language models to output explainable reasoning logic in the MMRAG process. Our method achieves state-of-the-art results on WebQA and MultimodalQA, two benchmark datasets for multi-modal retrieval-augmented generation, and its effectiveness is validated through comprehensive ablation experiments.
Authors' comments: This paper was accepted to AAAI2026
Kunming Shao, Zhipeng Liao, Jiangnan Yu, Liang Zhao, Qiwei Li, Xijie Huang, Jingyu He, Fengshi Tian et al.
Retrieval-Augmented Generation (RAG) enhances large language models (LLMs) by
integrating external knowledge retrieval but faces challenges on edge devices
due to high storage, energy, and latency demands. Computing-in-Memory (CIM)
offers a promising solution by storing document embeddings in CIM macros and
enabling in-situ parallel retrievals but is constrained by either low memory
density or limited computational accuracy. To address these challenges, we
present DIRC-RAG, a novel edge RAG acceleration architecture leveraging Digital
In-ReRAM Computation (DIRC). DIRC integrates a high-density multi-level ReRAM
subarray with an SRAM cell, utilizing SRAM and differential sensing for robust
ReRAM readout and digital multiply-accumulate (MAC) operations. By storing all
document embeddings within the CIM macro, DIRC achieves ultra-low-power,
single-cycle data loading, substantially reducing both energy consumption and
latency compared to off-chip DRAM. A query-stationary (QS) dataflow is supported
for RAG tasks, minimizing on-chip data movement and reducing SRAM buffer
requirements. We introduce error optimization for the DIRC ReRAM-SRAM cell by
extracting the bit-wise spatial error distribution of the ReRAM subarray and
applying targeted bit-wise data remapping. An error detection circuit is also
implemented to enhance readout resilience against device- and circuit-level
variations. Simulation results demonstrate that DIRC-RAG in a TSMC 40nm process
achieves an on-chip non-volatile memory density of 5.18 Mb/mm$^2$ and a throughput
of 131 TOPS. It delivers a 4 MB retrieval latency of 5.6 $\mu$s/query and an
energy consumption of 0.956 $\mu$J/query, while maintaining retrieval
precision.
Authors' comments: Accepted by 2025 IEEE/ACM ISLPED
Xinyu Ma, Jiafeng Guo, Ruqing Zhang, Yixing Fan, Xueqi Cheng
Dense retrieval has shown promising results in many information retrieval
(IR) related tasks, whose foundation is high-quality text representation
learning for effective search. Some recent studies have shown that
autoencoder-based language models are able to boost the dense retrieval
performance using a weak decoder. However, we argue that 1) it is not
discriminative to decode all the input texts and, 2) even a weak decoder has
the bypass effect on the encoder. Therefore, in this work, we introduce a novel
contrastive span prediction task to pre-train the encoder alone, but still
retain the bottleneck ability of the autoencoder. The key idea is to force the
encoder to generate the text representation close to its own random spans while
far away from others using a group-wise contrastive loss. In this way, we can
1) learn discriminative text representations efficiently with the group-wise
contrastive learning over spans and, 2) avoid the bypass effect of the decoder
thoroughly. Comprehensive experiments over publicly available retrieval
benchmark datasets show that our approach can outperform existing pre-training
methods for dense retrieval significantly.
Authors' comments: Accepted to SIGIR 2022
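The key idea in the abstract above (pull a text representation toward its own spans and away from other texts' spans) can be illustrated with a generic InfoNCE-style loss. The sketch below is a simplified stand-in, not the paper's exact objective: the function `group_contrastive_loss`, the dot-product similarity, the temperature value, and the toy 2-d vectors are all assumptions made for illustration.

```python
import math
from typing import List, Sequence

Vector = Sequence[float]


def dot(a: Vector, b: Vector) -> float:
    return sum(x * y for x, y in zip(a, b))


def group_contrastive_loss(
    docs: List[Vector], spans: List[List[Vector]], temperature: float = 0.1
) -> float:
    """Mean InfoNCE-style loss: each document against every span in the batch.

    spans[i] holds the spans sampled from document i (its positives);
    spans from all other documents act as negatives.
    """
    all_spans = [s for group in spans for s in group]
    total, count = 0.0, 0
    for doc, own_spans in zip(docs, spans):
        logits = [dot(doc, s) / temperature for s in all_spans]
        peak = max(logits)  # subtract the max for numerical stability
        log_norm = peak + math.log(sum(math.exp(z - peak) for z in logits))
        for s in own_spans:
            total += log_norm - dot(doc, s) / temperature
            count += 1
    if count == 0:
        raise ValueError("at least one document with one span is required")
    return total / count


# Toy vectors (assumption): two documents with two spans each.
toy_spans = [[[1.0, 0.0], [0.9, 0.1]], [[0.0, 1.0], [0.1, 0.9]]]
aligned_docs = [[1.0, 0.0], [0.0, 1.0]]   # each doc sits near its own spans
swapped_docs = [[0.0, 1.0], [1.0, 0.0]]   # each doc sits near the other group
print(group_contrastive_loss(aligned_docs, toy_spans))
print(group_contrastive_loss(swapped_docs, toy_spans))
```

The loss is lower when each document vector lies close to its own spans than when the assignments are swapped, which is the behaviour a contrastive objective of this kind rewards.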
Zilin Xiao, Pavel Suma, Ayush Sachdeva, Hao-Jen Wang, Giorgos Kordopatis-Zilos, Giorgos Tolias, Vicente Ordonez
We introduce LOCORE, Long-Context Re-ranker, a model that takes as input
local descriptors corresponding to an image query and a list of gallery images
and outputs similarity scores between the query and each gallery image. This
model is used for image retrieval, where typically a first ranking is performed
with an efficient similarity measure, and then a shortlist of top-ranked images
is re-ranked based on a more fine-grained similarity measure. Compared to
existing methods that perform pair-wise similarity estimation with local
descriptors or list-wise re-ranking with global descriptors, LOCORE is the
first method to perform list-wise re-ranking with local descriptors. To achieve
this, we leverage efficient long-context sequence models to effectively capture
the dependencies between query and gallery images at the local-descriptor
level. During testing, we process long shortlists with a sliding window
strategy that is tailored to overcome the context size limitations of sequence
models. Our approach achieves superior performance compared with other
re-rankers on established image retrieval benchmarks of landmarks (ROxf and
RPar), products (SOP), fashion items (In-Shop), and bird species (CUB-200)
while having comparable latency to the pair-wise local descriptor re-rankers.
Authors' comments: CVPR 2025
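One way to picture re-ranking a shortlist that is longer than a model's context is to score overlapping windows and merge the per-item scores. The sketch below does this by averaging; it is an assumption-laden illustration, not LOCORE's actual strategy, which the abstract does not detail. The names `sliding_window_rerank`, `score_window`, and `toy_scorer` are hypothetical, and the word-overlap scorer merely stands in for a list-wise re-ranking model.

```python
from typing import Callable, Dict, List, Sequence


def sliding_window_rerank(
    query: str,
    shortlist: Sequence[str],
    score_window: Callable[[str, Sequence[str]], List[float]],
    window: int = 4,
    stride: int = 2,
) -> List[str]:
    """Re-rank `shortlist` by scoring overlapping windows and averaging."""
    if window <= 0 or stride <= 0 or stride > window:
        raise ValueError("need 0 < stride <= window so every item is scored")
    totals: Dict[int, float] = {}
    counts: Dict[int, int] = {}
    start = 0
    while start < len(shortlist):
        idx = list(range(start, min(start + window, len(shortlist))))
        scores = score_window(query, [shortlist[i] for i in idx])
        for i, s in zip(idx, scores):
            totals[i] = totals.get(i, 0.0) + s
            counts[i] = counts.get(i, 0) + 1
        if idx[-1] == len(shortlist) - 1:
            break  # the last window already reached the end of the shortlist
        start += stride
    mean = {i: totals[i] / counts[i] for i in totals}
    # Higher mean score first; ties keep the original shortlist order.
    order = sorted(mean, key=lambda i: (-mean[i], i))
    return [shortlist[i] for i in order]


def toy_scorer(query: str, items: Sequence[str]) -> List[float]:
    """Hypothetical stand-in for a list-wise model: count shared words."""
    q = set(query.split())
    return [float(len(q & set(item.split()))) for item in items]


gallery = ["red car", "blue tower", "old stone tower", "stone bridge",
           "old stone tower bridge"]
print(sliding_window_rerank("old stone tower", gallery, toy_scorer))
```

Averaging is only one possible merge rule; a real list-wise model's scores depend on the other items in the window, so the choice of window size, stride, and merge rule would matter more than it does with this toy scorer.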
Jingwei Zhuo, Ziru Xu, Wei Dai, Han Zhu, Han Li, Jian Xu, Kun Gai
Retrieving relevant targets from an extremely large target set under
computational limits is a common challenge for information retrieval and
recommendation systems. Tree models, which formulate targets as leaves of a
tree with trainable node-wise scorers, have attracted a lot of interest in
tackling this challenge due to their logarithmic computational complexity in
both training and testing. Tree-based deep models (TDMs) and probabilistic
label trees (PLTs) are two representative kinds. Though achieving many
practical successes, existing tree models suffer from the training-testing
discrepancy, where the retrieval performance deterioration caused by beam
search in testing is not considered in training. This leads to an intrinsic gap
between the most relevant targets and those retrieved by beam search, even
with optimally trained node-wise scorers. We take a first step towards
understanding and analyzing this problem theoretically, and develop the concepts
of Bayes optimality under beam search and calibration under beam search as
general analysis tools for this purpose. Moreover, to eliminate the
discrepancy, we propose a novel algorithm for learning optimal tree models
under beam search. Experiments on both synthetic and real data verify the
rationality of our theoretical analysis and demonstrate the superiority of our
algorithm compared to state-of-the-art methods.
Authors' comments: To appear in the 37th International Conference on Machine Learning
(ICML 2020)
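The gap described in the abstract above, between the most relevant targets and those reached by beam search, can be seen on a tiny tree. The sketch below is a generic level-wise beam search over node scores, not the paper's algorithm; the tree `CHILDREN`, the scores in `SCORE`, and the function name `beam_search` are hypothetical values made up for illustration.

```python
from typing import Dict, List


def beam_search(
    children: Dict[str, List[str]],
    score: Dict[str, float],
    root: str,
    beam_size: int,
) -> List[str]:
    """Level-wise beam search: keep the `beam_size` best-scored nodes per level."""
    beam = [root]
    leaves: List[str] = []
    while beam:
        candidates: List[str] = []
        for node in beam:
            kids = children.get(node, [])
            if kids:
                candidates.extend(kids)
            else:
                leaves.append(node)  # a node without children is a target
        candidates.sort(key=lambda n: (-score[n], n))
        beam = candidates[:beam_size]
    leaves.sort(key=lambda n: (-score[n], n))
    return leaves[:beam_size]


# Hypothetical two-level tree (assumption): leaf "c" is the most relevant
# target, but its parent "R" scores below the sibling "L".
CHILDREN = {"root": ["L", "R"], "L": ["a", "b"], "R": ["c", "d"]}
SCORE = {"L": 0.6, "R": 0.5, "a": 0.3, "b": 0.2, "c": 0.9, "d": 0.1}

print(beam_search(CHILDREN, SCORE, "root", beam_size=1))  # prunes "R", so "c" is lost
print(beam_search(CHILDREN, SCORE, "root", beam_size=2))  # keeps "R", so "c" is found
```

With a beam of one, the search commits to "L" at the first level and can never reach "c", even though every node score is used exactly as given. That is the kind of deterioration that training which ignores beam search does not account for.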
Zoe Fingleton, Nazanin Siavash, Armin Moin
In this paper, we focus on automating two of the widely used Verification and Validation (V&V) activities in the Software Development Lifecycle (SDLC): Software testing and software inspection (also known as review). Concerning the former, we concentrate on automated test case generation using Large Language Models (LLMs). For the latter, we enable inspection of the source code by LLMs. To address the known LLM hallucination problem, in which LLMs confidently produce incorrect outputs, we implement a Retrieval Augmented Generation (RAG) pipeline to integrate supplementary knowledge sources and provide additional context to the LLM. Our experimental results indicate that incorporating external context via the RAG pipeline has a generally positive impact on both test case generation and code inspection. This novel approach reduces the total project cost by saving human testers'/inspectors' time. It also improves the effectiveness and efficiency of these V&V activities, as evidenced by our experimental study.
Joel Perca, Luis Sante, Juanpablo Heredia, Joao Rulff, Claudio Silva, Jorge Poco
Extracting actionable insights from long-duration urban videos is often labor-intensive: analysts must manually sift through raw footage to pinpoint target events or uncover broader behavioral trends. In this work, we present URBANCLIPATLAS, a visual analytics system for exploring long urban videos recorded at street intersections. URBANCLIPATLAS combines retrieval-augmented generation (RAG), taxonomy-aware entity extraction, and video grounding to support event retrieval and interpretation. The system segments extended recordings into short clips, generates textual descriptions with a vision-language model, and indexes them for semantic retrieval. A knowledge graph maps entities and relations from LLM answers onto a domain-specific taxonomy and aligns them with detected objects and trajectories to support visual grounding and verification. URBANCLIPATLAS supports scene retrieval through an augmented chat-based interface and improves scene interpretation by tightly aligning textual outputs with video evidence. This design strengthens the connection between textual reasoning and visual evidence, reducing the effort required to validate model outputs and refine hypotheses. We demonstrate the usefulness of URBANCLIPATLAS on the StreetAware dataset through two case studies involving hazardous scenarios and crossing dynamics at street intersections. URBANCLIPATLAS helps analysts reason about safety- and mobility-related patterns across large urban video collections.
Authors' comments: 12 pages and 6 figures
Yangchen Zeng, Zhenyu Yu, Dongming Jiang, Wenbo Zhang, Yifan Hong, Zhanhua Hu, Jiao Luo, Kangning Cui
Transformer-based detectors have advanced small-object detection, but they often remain inefficient and vulnerable to background-induced query noise, which motivates deep decoders to refine low-quality queries. We present HELP (Heatmap-guided Embedding Learning Paradigm), a noise-aware positional-semantic fusion framework that studies where to embed positional information by selectively preserving positional encodings in foreground-salient regions while suppressing background clutter. Within HELP, we introduce Heatmap-guided Positional Embedding (HPE) as the core embedding mechanism and visualize it with a heatbar for interpretable diagnosis and fine-tuning. HPE is integrated into both the encoder and decoder: it guides noise-suppressed feature encoding by injecting heatmap-aware positional encoding, and it enables high-quality query retrieval by filtering background-dominant embeddings via a gradient-based mask filter before decoding. To address feature sparsity in complex small targets, we integrate Linear-Snake Convolution to enrich retrieval-relevant representations. The gradient-based heatmap supervision is used during training only, incurring no additional gradient computation at inference. As a result, our design reduces decoder layers from eight to three and achieves a 59.4% parameter reduction (66.3M vs. 163M) while maintaining consistent accuracy gains under a reduced compute budget across benchmarks. Code Repository: https://github.com/yidimopozhibai/Noise-Suppressed-Query-Retrieval
Authors' comments: Accepted to ACM ICMR 2026; 14 pages, 6 figures, and 4 tables
Gabriele Mattioli, Evelyn Turri, Sara Sarto, Lorenzo Baraldi, Marcella Cornia, Lorenzo Baraldi, Rita Cucchiara
Tool learning with foundation models aims to endow AI systems with the ability to invoke external resources -- such as APIs, computational utilities, and specialized models -- to solve complex tasks beyond the reach of standalone language generation. While recent advances in Large Language Models (LLMs) and Multimodal Large Language Models (MLLMs) have expanded their reasoning and perception capabilities, existing tool-use methods are predominantly limited to text-only inputs and closed-world settings. Consequently, they struggle to interpret multimodal user instructions and cannot generalize to tools unseen during training. In this work, we introduce RaTA-Tool, a novel framework for open-world multimodal tool selection. Rather than learning direct mappings from user queries to fixed tool identifiers, our approach enables an MLLM to convert a multimodal query into a structured task description and subsequently retrieve the most appropriate tool by matching this representation against semantically rich, machine-readable tool descriptions. This retrieval-based formulation naturally supports extensibility to new tools without retraining. To further improve alignment between task descriptions and tool selection, we incorporate a preference-based optimization stage using Direct Preference Optimization (DPO). To support research in this setting, we also introduce the first dataset for open-world multimodal tool use, featuring standardized tool descriptions derived from Hugging Face model cards. Extensive experiments demonstrate that our approach significantly improves tool-selection performance, particularly in open-world, multimodal scenarios.
Authors' comments: ICPR 2026
Xin Xie, Dongyun Xue, Wuguannan Yao, Mingxiao Feng, Wengang Zhou, Xiang Qi, Houqiang Li, Peng Zhang
LLM-powered systems require complex multi-step decision-making abilities to solve real-world tasks, yet current planning approaches face a trade-off between the high latency of inference-time search and the limited generalization of supervised fine-tuning. To address this limitation, we introduce SGA-MCTS, a framework that casts LLM planning as non-parametric retrieval. Offline, we leverage Monte Carlo Tree Search (MCTS) to explore the solution space and distill high-fidelity trajectories into State-Goal-Action (SGA) atoms. These atoms are de-lexicalized primitives that abstract concrete entities into symbolic slots, preserving reusable causal logic while discarding domain-specific noise. Online, a retrieval-augmented agent employs a hybrid symbolic-semantic mechanism to fetch relevant SGAs and re-ground them into the current context as soft reasoning hints. Empirical results on complex benchmarks demonstrate that this paradigm enables frozen, open-weights models to match the performance of SOTA systems (e.g., GPT-5) without task-specific fine-tuning. By effectively amortizing the heavy computational cost of search, SGA-MCTS achieves System 2 reasoning depth at System 1 inference speeds, rendering autonomous planning both scalable and real-time feasible.
Sumit Mukherjee, Juan Shu, Nairwita Mazumder, Tate Kernell, Celena Wheeler, Shannon Hastings, Chris Sidey-Gibbons
Clinical value set authoring -- the task of identifying all codes in a standardized vocabulary that define a clinical concept -- is a recurring bottleneck in clinical quality measurement and phenotyping. A natural approach is to prompt a large language model (LLM) to generate the required codes directly, but structured clinical vocabularies are large, version-controlled, and not reliably memorized during pretraining. We propose Retrieval-Augmented Set Completion (RASC): retrieve the $K$ most similar existing value sets from a curated corpus to form a candidate pool, then apply a classifier to each candidate code. Theoretically, retrieve-and-select can reduce statistical complexity by shrinking the effective output space from the full vocabulary to a much smaller retrieved candidate pool. We demonstrate the utility of RASC on 11,803 publicly available VSAC value sets, constructing the first large-scale benchmark for this task. A cross-encoder fine-tuned on SAPBert achieves AUROC 0.852 and value-set-level F1 0.298, outperforming a simpler three-layer Multilayer Perceptron (AUROC 0.799, F1 0.250); both reduce the number of irrelevant candidates per true positive from 12.3 (retrieval-only) to approximately 3.2 and 4.4, respectively. Zero-shot GPT-4o achieves value-set-level F1 0.105, with 48.6% of returned codes absent from VSAC entirely. This performance gap widens with increasing value set size, consistent with RASC's theoretical advantage. We observe similar performance gains across two other classifier model types, namely a cross-encoder initialized from pre-trained SAPBert and a LightGBM model, demonstrating that RASC's benefits extend beyond a single model class. The code to download and create the benchmark dataset, as well as the model training code, is available at: https://github.com/mukhes3/RASC.
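The retrieve-then-select pattern described above can be sketched in a few lines: retrieve the K most similar existing value sets, pool their codes, then let a per-code classifier decide what to keep. Everything below is a toy under stated assumptions and is not the RASC implementation. The word-level Jaccard similarity stands in for whatever retriever the authors use, the vote-count rule stands in for their trained classifiers, and the corpus names and codes (`code_A` and so on) are invented placeholders.

```python
from typing import Callable, Dict, List, Set

Corpus = Dict[str, Set[str]]  # value-set name -> set of codes


def jaccard(a: str, b: str) -> float:
    """Word-level Jaccard similarity between two names (toy retriever)."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    union = ta | tb
    return len(ta & tb) / len(union) if union else 0.0


def retrieve(query: str, corpus: Corpus, k: int) -> List[str]:
    """Names of the k value sets most similar to the query."""
    ranked = sorted(corpus, key=lambda name: (-jaccard(query, name), name))
    return ranked[:k]


def vote_classifier(corpus: Corpus, min_votes: int) -> Callable[[str, List[str]], bool]:
    """Hypothetical stand-in for a trained classifier: keep a code if at
    least `min_votes` of the retrieved value sets contain it."""
    def keep(code: str, neighbours: List[str]) -> bool:
        return sum(code in corpus[n] for n in neighbours) >= min_votes
    return keep


def complete_set(query: str, corpus: Corpus, k: int,
                 keep: Callable[[str, List[str]], bool]) -> Set[str]:
    """Retrieve k neighbours, pool their codes, then filter code by code."""
    neighbours = retrieve(query, corpus, k)
    pool: Set[str] = set().union(*(corpus[n] for n in neighbours))
    return {code for code in pool if keep(code, neighbours)}


# Invented placeholder corpus (assumption): not real vocabulary codes.
toy_corpus: Corpus = {
    "type 2 diabetes": {"code_A", "code_B"},
    "diabetes mellitus": {"code_A", "code_C"},
    "essential hypertension": {"code_D"},
}
print(complete_set("diabetes type 2", toy_corpus, k=2,
                   keep=vote_classifier(toy_corpus, min_votes=2)))
```

The classifier only ever sees codes from the retrieved pool, never the full vocabulary, which is the output-space reduction the abstract points to.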