Jimmy Lin
Practitioners working on dense retrieval today face a bewildering number of choices. Beyond selecting the embedding model, another consequential choice is the actual implementation of nearest-neighbor vector search. While best practices recommend HNSW indexes, flat vector indexes with brute-force search represent another viable option, particularly for smaller corpora and for rapid prototyping. In this paper, we provide experimental results on the BEIR dataset using the open-source Lucene search library that explicate the tradeoffs between HNSW and flat indexes (including quantized variants) from the perspectives of indexing time, query evaluation performance, and retrieval quality. With additional comparisons between dense and sparse retrievers, our results provide guidance for today's search practitioner in understanding the design space of dense and sparse retrievers. To our knowledge, we are the first to provide operational advice supported by empirical experiments in this regard.
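To make the flat-index option concrete, here is a minimal Python sketch of brute-force nearest-neighbor search: every corpus vector is scored against the query and the top-k are kept. It is not the Lucene implementation evaluated in the paper; the names `cosine` and `flat_search`, the use of cosine similarity, and the toy two-dimensional corpus are all assumptions made for illustration.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def flat_search(query, corpus, k):
    """Score every vector in the corpus (no index structure) and return
    the k best (doc_id, score) pairs, highest score first."""
    scored = [(doc_id, cosine(query, vec)) for doc_id, vec in corpus.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]

# Hypothetical toy corpus: document id -> embedding vector.
corpus = {"d1": [1.0, 0.0], "d2": [0.0, 1.0], "d3": [1.0, 1.0]}
top = flat_search([1.0, 0.1], corpus, k=2)
print(top)
```

The cost of each query grows linearly with corpus size, which is why such an approach is mainly attractive for smaller corpora and prototyping, whereas a graph index such as HNSW trades indexing time for faster approximate search.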
Zi Yang
We argue that there are two major distinct capabilities in long context understanding: retrieval and holistic understanding. Understanding and further improving LLMs' long context capabilities would not be possible without knowing the tasks' focus categories. We aim to automatically identify retrieval focused and holistic understanding focused problems from suites of benchmarks and quantitatively measure the difficulty within each focus. In this paper, we present the Dolce framework, which parameterizes each problem by $\lambda$ (complexity) and $k$ (redundancy) and assigns it to one of five predefined focus categories. We propose to sample short contexts from the full context and estimate the probability that an LLM solves the problem using the sampled spans. To find the $\lambda$ and $k$ for each problem, we further propose a mixture model of a non-parametric background noise component and a parametric/non-parametric hybrid oracle component, where we derive the probability functions parameterized by $\lambda$ and $k$ for both the correct-or-wrong (COW) scenario and the partial-point-in-grading (PIG) scenario. Our proposed methods identify that 0% to 67% of the problems are retrieval focused and 0% to 90% of the problems are holistic understanding focused across 44 existing long context evaluation tasks.
Jihyun Lee, Gary Geunbae Lee
Traditional dialogue state tracking approaches heavily rely on extensive training data and handcrafted features, limiting their scalability and adaptability to new domains. In this paper, we propose a novel method that leverages inference and in-context learning with ChatGPT for domain transfer in dialogue state tracking, without any parameter updates. By guiding ChatGPT's chain of thought, we enable it to retrieve relevant examples and generalize knowledge to accurately infer dialogue states, solely through inference. Experimental results on the MultiWOZ dataset demonstrate competitive performance and promising generalization across domains. Our parameter-free approach offers a scalable and adaptable solution, opening new research directions in domain transfer learning.
Zhicong Wu, Qifeng Su, Ke Gu, Xiaodong Shi
Oracle Bone Inscription (OBI) is the earliest mature writing system in China, which represents a crucial stage in the development of hieroglyphs. Nevertheless, the substantial quantity of undeciphered OBI characters remains a significant challenge for scholars, while conventional methods of ancient script research are both time-consuming and labor-intensive. In this paper, we propose a cross-font image retrieval network (CFIRN) to decipher OBI characters by establishing associations between OBI characters and other script forms, simulating the interpretive behavior of paleography scholars. Concretely, our network employs a siamese framework to extract deep features from character images of various fonts, fully exploring structural clues at different resolutions via a multiscale feature integration (MFI) module and a multiscale refinement classifier (MRC). Extensive experiments on three challenging cross-font image retrieval datasets demonstrate that, given undeciphered OBI characters, our CFIRN can effectively achieve accurate matches with characters from other gallery fonts, thereby facilitating the deciphering.
Runqing Zhang, Xue Zhou
Most existing text-to-image person retrieval methods assume that the training image-text pairs are perfectly aligned; however, the noisy correspondence (NC) issue (i.e., incorrect or unreliable alignment) exists due to poor image quality and labeling errors. Additionally, random masking augmentation may inadvertently discard critical semantic content, introducing noisy matches between images and text descriptions. To address the above two challenges, we propose a noise label suppression method to mitigate NC and an Attention-Weighted Selective Mask (AWM) strategy to resolve the issues caused by random masking. Specifically, the Bidirectional Similarity Distribution Matching (BSDM) loss enables the model to effectively learn from positive pairs while preventing it from over-relying on them, thereby mitigating the risk of overfitting to noisy labels. In conjunction with this, Weight Adjustment Focal (WAF) loss improves the model's ability to handle hard samples. Furthermore, AWM processes raw images through an EMA version of the image encoder, selectively retaining tokens with strong semantic connections to the text, enabling better feature extraction. Extensive experiments demonstrate the effectiveness of our approach in addressing noise-related issues and improving retrieval performance.
Daniel B. Hier, Thanh Son Do, Tayo Obafemi-Ajayi
Large language models (LLMs) have shown improved accuracy in phenotype term
normalization tasks when augmented with retrievers that suggest candidate
normalizations based on term definitions. In this work, we introduce a
simplified retriever that enhances LLM accuracy by searching the Human
Phenotype Ontology (HPO) for candidate matches using contextual word embeddings
from BioBERT without the need for explicit term definitions. Testing this
method on terms derived from the clinical synopses of Online Mendelian
Inheritance in Man (OMIM), we demonstrate that the normalization accuracy of a
state-of-the-art LLM increases from a baseline of 62.3% without augmentation to
90.3% with retriever augmentation. This approach is potentially generalizable
to other biomedical term normalization tasks and offers an efficient
alternative to more complex retrieval methods.
Authors' comments: Published by Frontiers in Digital Health
Adarsh Barik, Anand Krishna, Vincent Y. F. Tan
In this work, we study the robust phase retrieval problem where the task is to recover an unknown signal $\theta^* \in \mathbb{R}^d$ in the presence of potentially arbitrarily corrupted magnitude-only linear measurements. We propose an alternating minimization approach that incorporates an oracle solver for a non-convex optimization problem as a subroutine. Our algorithm guarantees convergence to $\theta^*$ and provides an explicit polynomial dependence of the convergence rate on the fraction of corrupted measurements. We then provide an efficient construction of the aforementioned oracle under a sparse arbitrary outliers model and offer valuable insights into the geometric properties of the loss landscape in phase retrieval with corrupted measurements. Our proposed oracle avoids the need for computationally intensive spectral initialization, using a simple gradient descent algorithm with a constant step size and random initialization instead. Additionally, our overall algorithm achieves nearly linear sample complexity, $\mathcal{O}(d \, \mathrm{polylog}(d))$.
Ren-Di Wu, Yu-Yen Lin, Huei-Fang Yang
Composed image retrieval (CIR), which formulates the query as a combination
of a reference image and modified text, has emerged as a new form of image
search due to its enhanced ability to capture user intent. However, training a
CIR model in a supervised manner typically requires labor-intensive collection
of (reference image, text modifier, target image) triplets. While existing
zero-shot CIR (ZS-CIR) methods eliminate the need for training on specific
downstream datasets, they still require additional pretraining on large-scale
image datasets. In this paper, we introduce a training-free approach for
ZS-CIR. Our approach, Weighted Modality fusion and similarity for CIR
(WeiMoCIR), operates under the assumption that image and text modalities can be
effectively combined using a simple weighted average. This allows the query
representation to be constructed directly from the reference image and text
modifier. To further enhance retrieval performance, we employ multimodal large
language models (MLLMs) to generate image captions for the database images and
incorporate these textual captions into the similarity computation by combining
them with image information using a weighted average. Our approach is simple,
easy to implement, and its effectiveness is validated through experiments on
the FashionIQ and CIRR datasets. Code is available at
https://github.com/whats2000/WeiMoCIR.
Authors' comments: 14 pages, 6 figures, International Conference on Technologies and
Applications of Artificial Intelligence (TAAI) Camera Ready
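The weighted-average idea described in the abstract above can be sketched in a few lines of Python. This is an illustrative sketch only, not the authors' released code: it assumes image, text, and caption features have already been extracted by some pretrained encoder into a shared embedding space, and the weights `alpha` and `beta`, the function names, and the choice to average similarities (rather than features) on the database side are hypothetical.

```python
import math

def normalize(v):
    """Scale a non-zero vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def fuse_query(ref_image_feat, text_feat, alpha):
    """Query = weighted average of reference-image and text-modifier features.
    alpha is a hypothetical weight on the image modality."""
    img = normalize(ref_image_feat)
    txt = normalize(text_feat)
    return normalize([alpha * a + (1 - alpha) * b for a, b in zip(img, txt)])

def rank(query, database, beta):
    """Rank database items by a weighted average of the query-to-image and
    query-to-caption similarities. beta is a hypothetical weight on the image
    similarity; database maps item name -> (image_feat, caption_feat)."""
    scores = {}
    for name, (img_feat, cap_feat) in database.items():
        sim_img = dot(query, normalize(img_feat))
        sim_cap = dot(query, normalize(cap_feat))
        scores[name] = beta * sim_img + (1 - beta) * sim_cap
    return sorted(scores, key=scores.get, reverse=True)

# Toy features standing in for encoder outputs.
database = {"a": ([1.0, 0.0], [1.0, 0.0]), "b": ([0.0, 1.0], [0.0, 1.0])}
image_heavy = rank(fuse_query([1.0, 0.0], [0.0, 1.0], alpha=0.8), database, beta=0.5)
text_heavy = rank(fuse_query([1.0, 0.0], [0.0, 1.0], alpha=0.2), database, beta=0.5)
print(image_heavy, text_heavy)
```

Because no parameters are learned, the only tuning in such a sketch is the choice of the two weights, which is consistent with the training-free framing of the abstract.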
Aliakbar Nafar, Kristen Brent Venable, Parisa Kordjamshidi
Generative Large Language Models (LLMs) are capable of being in-context learners. However, the underlying mechanism of in-context learning (ICL) is still a major research question, and experimental research results about how models exploit ICL are not always consistent. In this work, focusing on regression tasks, we propose a framework for evaluating in-context learning mechanisms, which we claim are a combination of retrieving internal knowledge and learning from in-context examples. First, we show that LLMs can solve real-world regression problems and then design experiments to measure the extent to which the LLM retrieves its internal knowledge versus learning from in-context examples. We argue that this process lies on a spectrum between these two extremes. We provide an in-depth analysis of the degrees to which these mechanisms are triggered depending on various factors, such as prior knowledge about the tasks and the type and richness of the information provided by the in-context examples. We employ three LLMs and utilize multiple datasets to corroborate the robustness of our findings. Our results shed light on how to engineer prompts to leverage meta-learning from in-context examples and foster knowledge retrieval depending on the problem being addressed.
Saghir Alfasly, Peyman Nejat, Ghazal Alabtah, Sobhan Hemati, Krishna Rani Kalari, H. R. Tizhoosh
We have tested recently published foundation models for histopathology for
image retrieval. We report the macro-averaged F1 score for top-1 retrieval,
majority of top-3 retrievals, and majority of top-5 retrievals. We perform
zero-shot retrievals, i.e., we do not alter embeddings and we do not train any
classifier. As test data, we used diagnostic slides of TCGA, The Cancer Genome
Atlas, consisting of 23 organs and 117 cancer subtypes. As a search platform we
used Yottixel, which enabled us to perform WSI search using patches. The achieved F1
scores show low performance, e.g., for top-5 retrievals, 27% +/- 13%
(Yottixel-DenseNet), 42% +/- 14% (Yottixel-UNI), 40% +/- 13% (Yottixel-Virchow),
and 41% +/- 13% (Yottixel-GigaPath). The results for GigaPath WSI will be delayed
due to the significant computational resources required for processing.
Authors' comments: This paper will be updated with more results
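The evaluation protocol named in the abstract above (top-1, majority of top-3, majority of top-5, scored by macro-averaged F1) can be sketched as follows. This is a generic illustration, not the authors' evaluation code: the tie-breaking rule (the best-ranked label wins) and the decision to average F1 over the classes present in the ground truth are assumptions, and the labels are invented toy data.

```python
from collections import Counter

def majority_label(retrieved_labels, k):
    """Predicted label = most common label among the top-k retrieved items.
    Assumption: ties are broken in favor of the best-ranked label."""
    top_k = retrieved_labels[:k]
    counts = Counter(top_k)
    best = max(counts.values())
    for label in top_k:
        if counts[label] == best:
            return label

def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 over the classes in y_true."""
    f1_scores = []
    for c in sorted(set(y_true)):
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        f1_scores.append(2 * tp / denom if denom else 0.0)
    return sum(f1_scores) / len(f1_scores)

# Toy data: true subtype per query and ranked labels of retrieved items.
y_true = ["lung", "lung", "breast"]
retrieved = [
    ["lung", "breast", "lung"],
    ["breast", "breast", "lung"],
    ["breast", "lung", "breast"],
]
y_pred_top3 = [majority_label(r, 3) for r in retrieved]
score = macro_f1(y_true, y_pred_top3)
print(y_pred_top3, round(score, 3))
```

Setting `k=1` in `majority_label` reduces the rule to plain top-1 retrieval, so the same two functions cover all three reported settings.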
Paulina Toro Isaza, Michael Nidd, Noah Zheutlin, Jae-wook Ahn, Chidansh Amitkumar Bhatt, Yu Deng, Ruchi Mahindru, Martin Franz et al.
Clients wishing to implement generative AI in the domain of IT Support and
AIOps face two critical issues: domain coverage and model size constraints due
to model choice limitations. Clients might choose to not use larger proprietary
models such as GPT-4 due to cost and privacy concerns and so are limited to
smaller models with potentially less domain coverage that do not generalize to
the client's domain. Retrieval augmented generation is a common solution that
addresses both of these issues: a retrieval system first retrieves the
necessary domain knowledge which a smaller generative model leverages as
context for generation. We present a system developed for a client in the IT
Support domain for support case solution recommendation that combines retrieval
augmented generation (RAG) for answer generation with an encoder-only model for
classification and a generative large language model for query generation. We
cover architecture details, data collection and annotation, development journey
and preliminary validations, expected final deployment process and evaluation
plans, and finally lessons learned.
Authors' comments: 7 pages, 3 figures, 6 tables
Jiasheng Zhang, Jialin Chen, Ali Maatouk, Ngoc Bui, Qianqian Xie, Leandros Tassiulas, Jie Shao, Hua Xu et al.
With the advent of large language models (LLMs), managing scientific
literature via LLMs has become a promising direction of research. However,
existing approaches often overlook the rich structural and semantic relevance
among scientific literature, limiting their ability to discern the
relationships between pieces of scientific knowledge, and suffer from various
types of hallucinations. These methods also focus narrowly on individual
downstream tasks, limiting their applicability across use cases. Here we
propose LitFM, the first literature foundation model designed for a wide
variety of practical downstream tasks on domain-specific literature, with a
focus on citation information. At its core, LitFM contains a novel graph
retriever to integrate graph structure by navigating citation graphs and
extracting relevant literature, thereby enhancing model reliability. LitFM also
leverages a knowledge-infused LLM, fine-tuned through a well-developed
instruction paradigm. It enables LitFM to extract domain-specific knowledge
from literature and reason about the relationships among them. By integrating citation
graphs during both training and inference, LitFM can generalize to unseen
papers and accurately assess their relevance within existing literature.
Additionally, we introduce new large-scale literature citation benchmark
datasets in three academic fields, featuring sentence-level citation
information and local context. Extensive experiments validate the superiority
of LitFM, which achieves a 28.1% improvement in precision on the retrieval task and an
average improvement of 7.52% over the state of the art across six downstream
literature-related tasks.
Authors' comments: 18 pages, 12 figures
Jizhou Huang, Haifeng Wang, Yibo Sun, Miao Fan, Zhengjie Huang, Chunyuan Yuan, Yawen Li
The increasing interest in international travel has raised the demand for
retrieving points of interest (POIs) in multiple languages. This is especially
important for finding local venues such as restaurants and scenic spots in
unfamiliar languages when traveling abroad. Multilingual POI retrieval, enabling users to find
desired POIs in a demanded language using queries in numerous languages, has
become an indispensable feature of today's global map applications such as
Baidu Maps. This task is non-trivial because of two key challenges: (1)
visiting sparsity and (2) multilingual query-POI matching. To this end, we
propose a Heterogeneous Graph Attention Matching Network (HGAMN) to
concurrently address both challenges. Specifically, using the search logs of
Baidu Maps, we construct a heterogeneous graph that contains two types of
nodes: POI nodes and query nodes. To alleviate challenge #1, we construct
edges between different POI nodes to link the low-frequency POIs with the
high-frequency ones, which enables the transfer of knowledge from the latter to
the former. To mitigate challenge #2, we construct edges between POI and query
nodes based on the co-occurrences between queries and POIs, where queries in
different languages and formulations can be aggregated for individual POIs.
Moreover, we develop an attention-based network to jointly learn node
representations of the heterogeneous graph and further design a cross-attention
module to fuse the representations of both types of nodes for query-POI
relevance scoring. Extensive experiments conducted on large-scale real-world
datasets from Baidu Maps demonstrate the superiority and effectiveness of
HGAMN. In addition, HGAMN has already been deployed in production at Baidu
Maps, where it serves hundreds of millions of requests every
day.
Authors' comments: Accepted by KDD'21
E. Béguin, A. Chiavassa, A. Ahmad, B. Freytag, S. Uttenthaler
Context. The complex dynamics of asymptotic giant branch (AGB) stars and the
resulting stellar winds have a significant impact on the measurements of
stellar parameters and amplify their uncertainties. Three-dimensional (3D)
radiative hydrodynamic (RHD) simulations of convection suggest that
convection-related structures at the surface of AGB stars affect the photocentre
displacement and the parallax uncertainty measured by Gaia. Aims. We explore
the impact of the convection on the photocentre variability and aim to
establish analytical laws between the photocentre displacement and stellar
parameters to retrieve such parameters from the parallax uncertainty. Methods.
We used a selection of 31 RHD simulations with CO5BOLD and the post-processing
radiative transfer code Optim3D to compute intensity maps in the Gaia G band
[320-1050 nm]. From these maps, we calculated the photocentre position and
temporal fluctuations. We then compared the synthetic standard deviation to the
parallax uncertainty of a sample of 53 Mira stars observed with Gaia. Results.
The simulations show a displacement of the photocentre across the surface
ranging from 4% to 13% of the corresponding stellar radius, in agreement with
previous studies. We provide an analytical law relating the pulsation period of
the simulations and the photocentre displacement as well as the pulsation
period and stellar parameters. By combining these laws, we retrieve the surface
gravity, the effective temperature, and the radius for the stars in our sample.
Conclusions. Our analysis highlights an original procedure to retrieve stellar
parameters by using both state-of-the-art 3D numerical simulations of AGB
stellar convection and parallax observations of AGB stars. This will help us
refine our understanding of these giants.
Authors' comments: 14 pages, 17 figures
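The photocentre computation mentioned in the Methods of the abstract above can be illustrated with a short Python sketch. It assumes the usual definition of the photocentre as the intensity-weighted mean position of a 2D intensity map; the per-axis standard deviation over snapshots used here to summarize temporal fluctuations, the pixel units, and the function names are assumptions and may differ from the paper's exact procedure.

```python
import statistics

def photocentre(intensity_map):
    """Intensity-weighted mean (x, y) position of a 2D map, in pixel units.
    intensity_map is a list of rows; the total intensity must be non-zero."""
    total = 0.0
    sum_x = 0.0
    sum_y = 0.0
    for y, row in enumerate(intensity_map):
        for x, value in enumerate(row):
            total += value
            sum_x += x * value
            sum_y += y * value
    return sum_x / total, sum_y / total

def photocentre_std(maps):
    """Population standard deviation of the photocentre x and y coordinates
    over a time series of maps (an assumed summary of the fluctuations)."""
    centres = [photocentre(m) for m in maps]
    xs = [c[0] for c in centres]
    ys = [c[1] for c in centres]
    return statistics.pstdev(xs), statistics.pstdev(ys)

# Toy 2x2 snapshots: a bright spot moving one pixel along x.
snapshots = [
    [[1.0, 0.0], [0.0, 0.0]],
    [[0.0, 1.0], [0.0, 0.0]],
]
sigma_x, sigma_y = photocentre_std(snapshots)
print(photocentre([[1.0, 1.0], [1.0, 1.0]]), sigma_x, sigma_y)
```

In practice the pixel displacement would be converted to a physical or angular scale before being compared with an observed parallax uncertainty; that conversion is omitted here.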
Carlos E. Muñoz-Romero, Andrea Banzatti, Karin I. Öberg, Klaus M. Pontoppidan, Colette Salyk, Joan Najita, Geoffrey A. Blake, Sebastiaan Krijt et al.
The mid-infrared water vapor emission spectrum provides a novel way to
characterize the delivery of icy pebbles towards the innermost ($<5$ au)
regions of planet-forming disks. Recently, JWST MIRI-MRS showed that compact
disks exhibit an excess of low-energy water vapor emission relative to extended
multi-gapped disks, suggesting that icy pebble drift is more efficient in the
former. We carry out detailed emission line modeling to retrieve the excitation
conditions of rotational water vapor emission in a sample of four compact and
three extended disks within the JDISC Survey. We present two-temperature H$_2$O
slab model retrievals and, for the first time, constrain the spatial
distribution of water vapor by fitting parametric radial temperature and column
density profiles. Such models statistically outperform the two-temperature slab
fits. We find a correlation between the observable hot water vapor mass and
stellar mass accretion rate, as well as an anti-correlation between cold water
vapor mass and sub-mm dust disk radius, confirming previously reported water
line flux trends. We find that the mid-IR spectrum traces H$_2$O with
temperatures down to 180-300 K, but the coldest 150-170 K gas remains
undetected. Furthermore, the H$_2$O temperature profiles are generally steeper
and cooler than the expected 'super-heated' dust temperature in passive
irradiated disks. The column density profiles are used to estimate icy pebble
mass fluxes, which suggest that compact and extended disks may produce markedly
distinct inner-disk exoplanet populations if local feeding mechanisms dominate
their assembly.
Authors' comments: Accepted for publication in ApJ
Manu Gaur, Darshan Singh S, Makarand Tapaswi
Image captioning systems are unable to generate fine-grained captions as they are trained on data that is either noisy (alt-text) or generic (human annotations). This is further exacerbated by maximum likelihood training that encourages generation of frequently occurring phrases. Previous works have tried to address this limitation by fine-tuning captioners with a self-retrieval (SR) reward. However, we find that SR fine-tuning tends to reduce caption faithfulness and can even induce hallucination. In this work, we circumvent this bottleneck by improving the MLE initialization of the captioning system and designing a curriculum for the SR fine-tuning process. To this end, we present (1) Visual Caption Boosting, a novel framework to instill fine-grainedness in generic image captioning datasets while remaining anchored in human annotations; and (2) BagCurri, a carefully designed training curriculum that more optimally leverages the contrastive nature of the self-retrieval reward. Jointly, they enable the captioner to describe fine-grained aspects in the image while preserving faithfulness to ground-truth captions. Our approach outperforms previous work by +8.9% on SR against 99 random distractors (RD100) (Dessi et al., 2023); and +7.6% on ImageCoDe. Additionally, existing metrics to evaluate captioning systems fail to reward diversity or evaluate a model's fine-grained understanding ability. Our third contribution addresses this by proposing self-retrieval from the lens of evaluation. We introduce TrueMatch, a benchmark comprising bags of highly similar images that uses SR to assess the captioner's ability to capture subtle visual distinctions. We evaluate and compare several state-of-the-art open-source MLLMs on TrueMatch, and find that our SR approach outperforms them all by a significant margin (e.g., +4.8% to +7.1% over Cambrian) while having 1-2 orders of magnitude fewer parameters.
Phuc-Tinh Pham Do, Duy-Ngoc Dinh Cao, Khanh Quoc Tran, Kiet Van Nguyen
In this article, we propose the R2GQA system, a Retriever-Reader-Generator Question Answering system, consisting of three main components: Document Retriever, Machine Reader, and Answer Generator. The Retriever module employs advanced information retrieval techniques to extract the context of articles from a dataset of legal regulation documents. The Machine Reader module utilizes state-of-the-art natural language understanding algorithms to comprehend the retrieved documents and extract answers. Finally, the Generator module synthesizes the extracted answers into concise and informative responses to students' questions about legal regulations. Furthermore, we built the ViRHE4QA dataset in the domain of university training regulations, comprising 9,758 question-answer pairs with a rigorous construction process. This is the first Vietnamese dataset in the higher education regulations domain with various types of answers, both extractive and abstractive. In addition, the R2GQA system is the first system to offer abstractive answers in Vietnamese. This paper discusses the design and implementation of each module within the R2GQA system on the ViRHE4QA dataset, highlighting their functionalities and interactions. Furthermore, we present experimental results demonstrating the effectiveness and utility of the proposed system in supporting students' comprehension of legal regulations in higher education settings. In general, the R2GQA system and the ViRHE4QA dataset promise to contribute significantly to related research and help students navigate complex legal documents and regulations, empowering them to make informed decisions and adhere to institutional policies effectively. Our dataset is available for research purposes.
Fatma Yasmine Loumachi, Mohamed Chahine Ghanem, Mohamed Amine Ferrag
Cyber timeline analysis, or forensic timeline analysis, is crucial in Digital
Forensics and Incident Response (DFIR). It examines artefacts and events,
particularly timestamps and metadata, to detect anomalies, establish
correlations, and reconstruct incident timelines. Traditional methods rely on
structured artefacts, such as logs and filesystem metadata, using specialised
tools for evidence identification and feature extraction. This paper introduces
GenDFIR, a framework leveraging large language models (LLMs), specifically
Llama 3.1 8B in zero-shot mode, integrated with a Retrieval-Augmented
Generation (RAG) agent. Incident data is preprocessed into a structured
knowledge base, enabling the RAG agent to retrieve relevant events based on
user prompts. The LLM interprets this context, offering semantic enrichment.
Tests on synthetic data in a controlled environment demonstrate
GenDFIR's reliability and robustness, showcasing LLMs' potential to automate
timeline analysis and advance threat detection.
Authors' comments: 24 pages V5.3
Mitchell DeHaven
In this paper we present a multi-adapter retrieval augmented generation
system (MARAGS) for Meta's Comprehensive RAG (CRAG) competition for KDD CUP
2024. CRAG is a question answering dataset that contains 3 different subtasks aimed
at realistic RAG-related question answering tasks, with a diverse set of
question topics, question types, time-dynamic answers, and questions featuring
entities of varying popularity.
Our system follows a standard setup for web-based RAG, which uses processed
web pages to provide context for an LLM to produce generations, while also
querying API endpoints for additional information. MARAGS also utilizes
multiple different adapters to solve the various requirements for these tasks
with a standard cross-encoder model for ranking candidate passages relevant for
answering the question. Our system achieved 2nd place for Task 1 as well as 3rd
place on Task 2.
Authors' comments: Accepted to CRAG KDD Cup 24 Workshop
Ingo Ziegler, Abdullatif Köksal, Desmond Elliott, Hinrich Schütze
Building high-quality datasets for specialized tasks is a time-consuming and resource-intensive process that often requires specialized domain knowledge. We propose Corpus Retrieval and Augmentation for Fine-Tuning (CRAFT), a method for generating synthetic datasets, given a small number of user-written few-shots that demonstrate the task to be performed. Given the few-shot examples, we use large-scale public web-crawled corpora and similarity-based document retrieval to find other relevant human-written documents. Lastly, instruction-tuned large language models (LLMs) augment the retrieved documents into custom-formatted task samples, which then can be used for fine-tuning. We demonstrate that CRAFT can efficiently generate large-scale task-specific training datasets for four diverse tasks: biology question-answering (QA), medicine QA and commonsense QA as well as summarization. Our experiments show that CRAFT-based models outperform or achieve comparable performance to general LLMs for QA tasks, while CRAFT-based summarization models outperform models trained on human-curated data by 46 preference points.