Zechen Bai, Tianjun Xiao, Tong He, Pichao Wang, Zheng Zhang, Thomas Brox, Mike Zheng Shou
As online video content rapidly grows, the task of text-video retrieval (TVR)
becomes increasingly important. A key challenge in TVR is the information
asymmetry between video and text: videos are inherently richer in information,
while their textual descriptions often capture only fragments of this
complexity. This paper introduces a novel, data-centric framework to bridge
this gap by enriching textual representations to better match the richness of
video content. During training, videos are segmented into event-level clips and
captioned to ensure comprehensive coverage. During retrieval, a large language
model (LLM) generates semantically diverse queries to capture a broader range
of possible matches. To enhance retrieval efficiency, we propose a query
selection mechanism that identifies the most relevant and diverse queries,
reducing computational cost while improving accuracy. Our method achieves
state-of-the-art results across multiple benchmarks, demonstrating the power of
data-centric approaches in addressing information asymmetry in TVR. This work
paves the way for new research focused on leveraging data to improve
cross-modal retrieval.
Authors' comments: Accepted by ICLR 2025
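The abstract above does not say how the query selection mechanism balances relevance and diversity. As a minimal sketch only, the block below uses a greedy maximal-marginal-relevance (MMR) style rule, which is a common stand-in and not necessarily the authors' method. It assumes each LLM-generated query already has a non-zero embedding vector; the names `select_queries` and `trade_off` are hypothetical.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def select_queries(original, candidates, k=2, trade_off=0.5):
    """Greedily pick k candidate indices (hypothetical MMR-style rule).

    Each step picks the candidate that is most similar to the original
    query while being least similar to the candidates already selected.
    """
    selected = []
    remaining = list(range(len(candidates)))
    while remaining and len(selected) < k:
        def mmr(i):
            relevance = cosine(candidates[i], original)
            redundancy = max(
                (cosine(candidates[i], candidates[j]) for j in selected),
                default=0.0,
            )
            return trade_off * relevance - (1 - trade_off) * redundancy
        best = max(remaining, key=mmr)
        selected.append(best)
        remaining.remove(best)
    return selected
```

With two near-duplicate candidates and one distinct candidate, the rule keeps one of the duplicates and then prefers the distinct query, so fewer queries need to be run against the video index.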
Junyan Ye, Zhutao Lv, Weijia Li, Jinhua Yu, Haote Yang, Huaping Zhong, Conghui He
Cross-view geolocalization identifies the geographic location of street view
images by matching them with a georeferenced satellite database. Significant
challenges arise due to the drastic appearance and geometry differences between
views. In this paper, we propose a new approach for cross-view image
geo-localization, i.e., the Panorama-BEV Co-Retrieval Network. Specifically, by
utilizing the ground plane assumption and geometric relations, we convert
street view panorama images into the BEV view, reducing the gap between street
panoramas and satellite imagery. Alongside the existing retrieval between street
view panoramas and satellite images, we introduce a BEV-to-satellite retrieval
branch for collaborative retrieval. By retaining the original
street view retrieval branch, we overcome the limited perception range issue of
BEV representation. Our network enables comprehensive perception of both the
global layout and local details around the street view capture locations.
Additionally, we introduce CVGlobal, a global cross-view dataset that is closer
to real-world scenarios. This dataset adopts a more realistic setup, with
street view directions not aligned with satellite images. CVGlobal also
includes cross-regional, cross-temporal, and street view to map retrieval
tests, enabling a comprehensive evaluation of algorithm performance. Our method
excels in multiple tests on common cross-view datasets such as CVUSA, CVACT,
VIGOR, and our newly introduced CVGlobal, surpassing the current
state-of-the-art approaches. The code and datasets can be found at
https://github.com/yejy53/EP-BEV.
Authors' comments: Accepted by ECCV 2024
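The panorama-to-BEV conversion in the abstract above rests on the ground plane assumption. The sketch below shows the underlying geometry for a single ground point. It is illustrative and not the authors' code: it assumes an equirectangular panorama whose centre column faces forward and whose middle row is the horizon, a flat ground plane, and a known camera height. The function name and parameters are hypothetical.

```python
import math

def bev_to_panorama(x, y, camera_height, pano_width, pano_height):
    """Map a ground point to equirectangular pixel coordinates.

    x: metres to the right of the camera, y: metres in front of it.
    Returns (column, row) as floats. Assumes flat ground, a forward-facing
    centre column, and the horizon on the middle row.
    """
    # Horizontal viewing angle: 0 is straight ahead, positive to the right.
    azimuth = math.atan2(x, y)
    # Vertical viewing angle: negative because the ground is below the camera.
    elevation = math.atan2(-camera_height, math.hypot(x, y))
    # Wrap the column so a point directly behind maps to column 0.
    column = ((azimuth / (2 * math.pi) + 0.5) * pano_width) % pano_width
    row = (0.5 - elevation / math.pi) * pano_height
    return column, row
```

A BEV image can then be filled by evaluating this mapping for every cell of a metric grid around the camera and sampling the panorama at the returned coordinates. Points far from the camera crowd toward the horizon row, which is one way to see the limited perception range of the BEV representation.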
Bhaskarjit Sarmah, Benika Hall, Rohan Rao, Sunil Patel, Stefano Pasquali, Dhagash Mehta
Extracting and interpreting intricate information from unstructured text
data arising in financial applications, such as earnings call transcripts,
presents substantial challenges to large language models (LLMs), even when
using current best practices for Retrieval Augmented Generation (RAG), referred
to here as VectorRAG techniques because they use vector databases for
information retrieval. The difficulties stem from domain-specific terminology
and the complex formats of the documents. We introduce a novel approach, called
HybridRAG, that combines Knowledge Graph (KG) based RAG techniques (called
GraphRAG) with VectorRAG techniques to enhance question-answer (Q&A) systems for
information extraction from financial documents. The approach is shown to be
capable of generating accurate and contextually relevant answers. Using
experiments on a set of financial earnings call transcripts, which come in Q&A
format and hence provide a natural set of ground-truth Q&A pairs, we show that
HybridRAG, which retrieves context from both the vector database and the KG,
outperforms both traditional VectorRAG and GraphRAG individually when evaluated
at both the retrieval and generation stages in terms of retrieval accuracy and
answer generation. The proposed technique has applications beyond the financial
domain.
Authors' comments: 9 pages, 2 figures, 5 tables
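The idea of retrieving context from both a vector database and a knowledge graph can be sketched as follows. This is a toy illustration under stated assumptions, not the authors' pipeline: word overlap stands in for embedding similarity, a list of (subject, predicate, object) triples stands in for the KG, and the two contexts are simply concatenated. All function names are hypothetical.

```python
def vector_retrieve(question, chunks, top_k=2):
    """Toy stand-in for a vector database: rank chunks by word overlap."""
    q_words = set(question.lower().split())
    ranked = sorted(
        chunks,
        key=lambda c: len(q_words & set(c.lower().split())),
        reverse=True,
    )
    return ranked[:top_k]

def graph_retrieve(question, triples):
    """Toy stand-in for a knowledge graph lookup.

    Returns triples whose subject or object is mentioned in the question.
    """
    q = question.lower()
    return [f"{s} {p} {o}" for s, p, o in triples
            if s.lower() in q or o.lower() in q]

def hybrid_context(question, chunks, triples, top_k=2):
    """Concatenate both contexts, vector results first, without duplicates."""
    merged = []
    passages = vector_retrieve(question, chunks, top_k) + graph_retrieve(question, triples)
    for passage in passages:
        if passage not in merged:
            merged.append(passage)
    return "\n".join(merged)
```

The merged string would then be placed in the LLM prompt ahead of the question. A real system would replace both toy retrievers with an embedding index and a graph query.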
Fahimeh Arabyani Neyshaburi, Ali Akbar Arefijamaal, Ghadir Sadeghi
Projective Hilbert spaces, the underlying spaces of this paper, are obtained by identifying two vectors of a Hilbert space $\mathcal{H}$ which have the same phase, and are denoted by $\hat{\mathcal{H}}$. For a family $\Phi$ of vectors of $\mathcal{H}$ we introduce a topology $\tau_{\Phi}$ on $\hat{\mathcal{H}}$ and provide a topology-based approach for analyzing $\hat{\mathcal{H}}$. This leads to a new classification of the phase retrieval property. We prove that $(\hat{\mathcal{H}}, \tau_{\Phi})$ is $\sigma$-compact, and that it is Hausdorff if and only if $\Phi$ does phase retrieval. In particular, if $\Phi$ is phase retrieval, then we prove that $(\hat{\mathcal{H}}, \tau_{\Phi})$ is metrizable and $\hat{\mathcal{H}}$ is paracompact by a direct limit topology. Also, we make a comparison between $\tau_{\Phi}$ and some known topologies including the quotient topology, the weak topology and the direct-limit topology. Furthermore, we establish a metric $d_{\Phi}$ on $\hat{\mathcal{H}}$ and show that $d_{\Phi}$ is weaker than the Bures-Wasserstein distance on $\hat{\mathcal{H}}$. As a result, in the finite dimensional case, we prove that $\tau_{\Phi}$ coincides with the weak topology and $\tau_{d_{\Phi}}$ on $\hat{\mathcal{H}}$ if and only if $\Phi$ is phase retrieval.
Francesco Busolin, Claudio Lucchese, Franco Maria Nardini, Salvatore Orlando, Raffaele Perego, Salvatore Trani
Learned dense representations are a popular family of techniques for encoding
queries and documents using high-dimensional embeddings, which enable retrieval
by performing approximate k nearest-neighbors search (A-kNN). A popular
technique for making A-kNN search efficient is based on a two-level index,
where the embeddings of documents are clustered offline and, at query
processing, a fixed number N of clusters closest to the query is visited
exhaustively to compute the result set. In this paper, we build upon the
state of the art for early-exit A-kNN and propose an unsupervised method based
on the notion of patience, which can reach competitive effectiveness with large
efficiency gains. Moreover, we discuss a cascade approach where we first
identify queries that find their nearest neighbor within the closest t << N
clusters, and then we decide how many more to visit based on our patience
approach or other state-of-the-art strategies. Reproducible experiments
employing state-of-the-art dense retrieval models and publicly available
resources show that our techniques improve the A-kNN efficiency with up to 5x
speedups while achieving negligible effectiveness losses. All the code used is
available at https://github.com/francescobusolin/faiss_pEE
Authors' comments: 6 pages, published at CIKM 2024
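One plausible reading of "patience" in the abstract above is to stop visiting clusters once the top-k result set has stayed unchanged for a fixed number of consecutive clusters. The sketch below implements that reading on a toy two-level index. It is not the code from the linked repository, and the exact stopping rule used by the authors may differ. The names `patience_search`, `max_clusters`, and `patience` are hypothetical.

```python
import math

def patience_search(query, centroids, clusters, k=2, max_clusters=4, patience=2):
    """Early-exit search over a clustered index (illustrative only).

    centroids: list of centroid vectors.
    clusters:  list of lists of (doc_id, vector), aligned with centroids.
    Visits clusters nearest-first and stops when the top-k ids have not
    changed for `patience` consecutive clusters, or after `max_clusters`.
    Returns (top-k doc ids, number of clusters visited).
    """
    order = sorted(range(len(centroids)),
                   key=lambda c: math.dist(query, centroids[c]))
    results = []   # current top-k as (distance, doc_id), best first
    stable = 0     # consecutive clusters that left the top-k unchanged
    visited = 0
    for c in order[:max_clusters]:
        visited += 1
        scored = [(math.dist(query, vec), doc_id) for doc_id, vec in clusters[c]]
        new_results = sorted(results + scored)[:k]
        if [d for _, d in new_results] == [d for _, d in results]:
            stable += 1
        else:
            stable = 0
        results = new_results
        if stable >= patience:
            break
    return [doc_id for _, doc_id in results], visited
```

The fixed-N baseline described in the abstract corresponds to setting `patience` larger than `max_clusters`, so that every one of the N closest clusters is scanned.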
Jan Hartman, Rishabh Mehrotra, Hitesh Sagtani, Dominic Cooney, Rafal Gajdulewicz, Beyang Liu, Julie Tibshirani, Quinn Slack
In this work, we discuss a recently popular type of recommender system: an LLM-based coding assistant. Connecting the task of providing code recommendations in multiple formats to traditional RecSys challenges, we outline several similarities and differences due to domain specifics. We emphasize the importance of providing relevant context to an LLM for this use case and discuss lessons learned from context enhancements and from offline and online evaluation of such AI-assisted coding systems.
Marko Hostnik, Marko Robnik-Šikonja
The use of large language models (LLMs) is becoming increasingly widespread
among software developers. However, privacy and computational requirements are
problematic with commercial solutions and the use of LLMs. In this work, we
focus on using relatively small and efficient LLMs with 160M parameters that
are suitable for local execution and augmentation with retrieval from local
projects. We train two open transformer-based models, the generative GPT-2 and
the retrieval-adapted RETRO, on open-source Python files, and empirically
compare them, confirming the benefits of embedding-based retrieval.
Furthermore, we improve our models' performance with In-context
retrieval-augmented generation (RAG), which retrieves code snippets using the
Jaccard similarity of tokens. We evaluate In-context RAG on larger models and
determine that, despite its simplicity, the approach is more suitable than
using the RETRO architecture. Experimental results indicate that In-context RAG
improves the code completion baseline by over 26%, while RETRO improves over
the similarly sized GPT-2 baseline by 12%. We highlight the key role of proper
tokenization in achieving the full potential of LLMs in code completion.
Authors' comments: 30 pages, 15 figures; Accepted manuscript for Expert Systems with
Applications
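The In-context RAG step described above retrieves code snippets by the Jaccard similarity of tokens. A minimal sketch of that retrieval follows. The regex tokenizer is a crude stand-in for the model tokenizer used in the paper, and the prompt layout in `build_prompt` is an assumption made for illustration.

```python
import re

def tokenize(code):
    """Split code into a set of identifier, number, and symbol tokens."""
    return set(re.findall(r"[A-Za-z_]\w*|\d+|[^\s\w]", code))

def jaccard(a, b):
    """Jaccard similarity |A ∩ B| / |A ∪ B| of two token sets."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def retrieve_snippets(context, snippets, top_k=1):
    """Return the top_k snippets most similar to the unfinished code."""
    query = tokenize(context)
    ranked = sorted(snippets,
                    key=lambda s: jaccard(query, tokenize(s)),
                    reverse=True)
    return ranked[:top_k]

def build_prompt(context, snippets, top_k=1):
    """Place retrieved snippets before the code to complete (assumed layout)."""
    retrieved = retrieve_snippets(context, snippets, top_k)
    header = "\n".join("# Retrieved snippet:\n" + s for s in retrieved)
    return header + "\n" + context
```

Because the retrieved text is simply prepended to the model input, this approach needs no architectural change, in contrast to RETRO.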
Sebastian Bruch, Franco Maria Nardini, Cosimo Rulli, Rossano Venturini
Learned sparse representations form an effective and interpretable class of embeddings for text retrieval. While exact top-k retrieval over such embeddings faces efficiency challenges, a recent algorithm called Seismic has enabled remarkably fast, highly-accurate approximate retrieval. Seismic statically prunes inverted lists, organizes each list into geometrically-cohesive blocks, and augments each block with a summary vector. At query time, each inverted list associated with a query term is traversed one block at a time in an arbitrary order, with the inner product between the query and summaries determining if a block must be evaluated. When a block is deemed promising, its documents are fully evaluated with a forward index. Seismic is one to two orders of magnitude faster than state-of-the-art inverted index-based solutions and significantly outperforms the winning graph-based submissions to the BigANN 2023 Challenge. In this work, we speed up Seismic further by introducing two innovations to its query processing subroutine. First, we traverse blocks in order of importance, rather than arbitrarily. Second, we take the list of documents retrieved by Seismic and expand it to include the neighbors of each document using an offline k-regular nearest neighbor graph; the expanded list is then ranked to produce the final top-k set. Experiments on two public datasets show that our extension, named SeismicWave, can reach almost-exact accuracy levels and is up to 2.2x faster than Seismic.
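The second innovation, expanding the retrieved list with graph neighbors and re-ranking, can be sketched as below. This is a toy illustration, not the SeismicWave implementation: sparse vectors are plain dictionaries, the nearest neighbor graph is a dictionary of adjacency lists assumed to be built offline, and the function name is hypothetical.

```python
def expand_and_rerank(query, initial_ids, knn_graph, forward_index, k=2):
    """Expand candidates with their graph neighbors, then rank exactly.

    query:         dict term -> weight (sparse query vector).
    initial_ids:   doc ids returned by the first-stage approximate search.
    knn_graph:     dict doc id -> list of neighbor doc ids (built offline).
    forward_index: dict doc id -> dict term -> weight (sparse doc vector).
    """
    candidates = set(initial_ids)
    for doc_id in initial_ids:
        candidates.update(knn_graph.get(doc_id, []))

    def score(doc_id):
        # Exact inner product between the query and the full document vector.
        doc = forward_index[doc_id]
        return sum(weight * doc.get(term, 0.0) for term, weight in query.items())

    # Highest score first; ties are broken by doc id for determinism.
    return sorted(candidates, key=lambda d: (-score(d), d))[:k]
```

The expansion lets a relevant document that the pruned inverted lists missed re-enter the candidate set through one of its neighbors.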
Seong-Il Park, Seung-Woo Choi, Na-Hyun Kim, Jay-Yoon Lee
Retrieval-Augmented Language Models (RALMs) have significantly improved
performance in open-domain question answering (QA) by leveraging external
knowledge. However, RALMs still struggle with unanswerable queries, where the
retrieved contexts do not contain the correct answer, and with conflicting
information, where different sources provide contradictory answers due to
imperfect retrieval. This study introduces an in-context learning-based
approach to enhance the reasoning capabilities of RALMs, making them more
robust in imperfect retrieval scenarios. Our method incorporates Machine
Reading Comprehension (MRC) demonstrations, referred to as cases, to boost the
model's capabilities to identify unanswerabilities and conflicts among the
retrieved contexts. Experiments on two open-domain QA datasets show that our
approach increases accuracy in identifying unanswerable and conflicting
scenarios without requiring additional fine-tuning. This work demonstrates that
in-context learning can effectively enhance the robustness of RALMs in
open-domain QA tasks.
Authors' comments: 10 pages, 2 figures
Usama Ahmed Jamal
Blogs and social networking sites serve as platforms where users express
their interests, ideas, and thoughts. Targeted marketing uses recommendation
systems to suggest services and products to users or clients, and it does so
by extracting keywords and main topics from user-generated texts. Most
conventional methods identify personal interests only on the basis of surveys
and rating systems. The proposed research differs in that it aims to use
user-generated text as the source for identifying and analyzing personal
interests as a knowledge base of users. The proposed semantic graph based
approach identifies the references of clients and users by analyzing their own
texts, such as tweets. Keywords are extracted from the text generated by the
user on social networking sites, using algorithms that extract keywords
automatically from the available content provided by the user. The extracted
keywords are ranked based on frequency and degree. Furthermore, the semantic
graph based model helps provide useful suggestions by extracting the interests
of users from an analysis of their social media content. In this approach, the
graph comprises nodes and edges, where nodes represent the keywords extracted
by the algorithm and edges show the semantic connections between the nodes.
The method does not require internet related user activities like surveys or
ratings to gather user interest related information.
Authors' comments: This research was conducted as part of Master Thesis in Computer
Science by the first author at HITEC University Taxila
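Ranking keywords by frequency and degree resembles the scoring used by RAKE-style extractors, where each word scores degree divided by frequency and a phrase scores the sum over its words. The abstract above does not name the algorithm, so the sketch below is an assumption made for illustration; the stopword list is deliberately tiny and hypothetical.

```python
import re
from collections import defaultdict

# Tiny illustrative stopword list; a real system would use a fuller one.
STOPWORDS = {"a", "an", "and", "the", "of", "to", "in", "is", "for", "on",
             "my", "i", "with"}

def candidate_phrases(text):
    """Split text at stopwords and punctuation into candidate phrases."""
    phrases, current = [], []
    for token in re.findall(r"[a-z]+|[^a-z\s]", text.lower()):
        if token.isalpha() and token not in STOPWORDS:
            current.append(token)
        elif current:
            phrases.append(tuple(current))
            current = []
    if current:
        phrases.append(tuple(current))
    return phrases

def rank_keywords(text):
    """Score each phrase as the sum of degree/frequency over its words."""
    phrases = candidate_phrases(text)
    freq, degree = defaultdict(int), defaultdict(int)
    for phrase in phrases:
        for word in phrase:
            freq[word] += 1
            # Degree counts co-occurring words in the phrase, itself included.
            degree[word] += len(phrase)
    scores = {" ".join(p): sum(degree[w] / freq[w] for w in p)
              for p in set(phrases)}
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
```

The ranked keywords could then serve as the nodes of the semantic graph, with edges added by whatever semantic relation the system uses.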
Junde Wu, Jiayuan Zhu, Yunli Qi
We introduce a novel graph-based Retrieval-Augmented Generation (RAG) framework specifically designed for the medical domain, called MedGraphRAG, aimed at enhancing Large Language Model (LLM) capabilities and generating evidence-based results, thereby improving safety and reliability when handling private medical data. Our comprehensive pipeline begins with a hybrid static-semantic approach to document chunking, significantly improving context capture over traditional methods. Extracted entities are used to create a three-tier hierarchical graph structure, linking entities to foundational medical knowledge sourced from medical papers and dictionaries. These entities are then interconnected to form meta-graphs, which are merged based on semantic similarities to develop a comprehensive global graph. This structure supports precise information retrieval and response generation. The retrieval process employs a U-retrieve method to balance global awareness and indexing efficiency of the LLM. Our approach is validated through a comprehensive ablation study comparing various methods for document chunking, graph construction, and information retrieval. The results not only demonstrate that our hierarchical graph construction method consistently outperforms state-of-the-art models on multiple medical Q&A benchmarks, but also confirm that the responses generated include source documentation, significantly enhancing the reliability of medical LLMs in practical applications. Code will be at: https://github.com/MedicineToken/Medical-Graph-RAG/tree/main
Zifan Wang, Christopher Ormerod
Automated Short Answer Scoring (ASAS) is a critical component in educational
assessment. While traditional ASAS systems relied on rule-based algorithms or
complex deep learning methods, recent advancements in Generative Language
Models (GLMs) offer new opportunities for improvement. This study explores the
application of GLMs to ASAS, leveraging their off-the-shelf capabilities and
performance in various domains. We propose a novel pipeline that combines
vector databases, transformer-based encoders, and GLMs to enhance short answer
scoring accuracy. Our approach stores training responses in a vector database,
retrieves semantically similar responses during inference, and employs a GLM to
analyze these responses and determine appropriate scores. We further optimize
the system through fine-tuned retrieval processes and prompt engineering.
Evaluation on the SemEval 2013 dataset demonstrates a significant improvement
on the SCIENTSBANK 3-way and 2-way tasks compared to existing methods,
highlighting the potential of GLMs in advancing ASAS technology.
Authors' comments: 20 pages, 2 figures
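The pipeline described above (store scored training responses, retrieve similar ones at inference, and pass them to a GLM) can be sketched without any external service. In the block below, bag-of-words counts stand in for the transformer-based encoder, an in-memory list stands in for the vector database, and the prompt wording is an assumption. No model is called; the function only builds the prompt text.

```python
import math
from collections import Counter

def embed(text):
    """Bag-of-words counts as a toy stand-in for a transformer encoder."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two Counter vectors."""
    dot = sum(a[t] * b[t] for t in a)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

def retrieve_examples(response, scored_responses, top_k=2):
    """Return the top_k (text, label) pairs most similar to the response."""
    query = embed(response)
    ranked = sorted(scored_responses,
                    key=lambda ex: cosine(query, embed(ex[0])),
                    reverse=True)
    return ranked[:top_k]

def build_scoring_prompt(question, response, scored_responses, top_k=2):
    """Assemble a few-shot scoring prompt (hypothetical wording)."""
    lines = [f"Question: {question}", "Scored examples:"]
    for text, label in retrieve_examples(response, scored_responses, top_k):
        lines.append(f"- Response: {text} | Score: {label}")
    lines.append(f"Score this response: {response}")
    return "\n".join(lines)
```

The retrieved examples act as response-specific demonstrations, so the GLM sees how similar answers were scored before judging the new one.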
Jinzhao Zhou, Yiqun Duan, Ziyi Zhao, Yu-Cheng Chang, Yu-Kai Wang, Thomas Do, Chin-Teng Lin
Decoding linguistic information from non-invasive brain signals using EEG has gained increasing research attention due to its vast application potential. Recently, a number of works have adopted a generative-based framework to decode electroencephalogram (EEG) signals into sentences by utilizing the powerful generative capacity of pretrained large language models (LLMs). However, this approach has several drawbacks that hinder the further development of linguistic applications for brain-computer interfaces (BCIs). Specifically, the ability of the EEG encoder to learn semantic information from EEG data remains questionable, and the LLM decoder's tendency to generate sentences based on its training memory can be hard to avoid. These issues necessitate a novel approach for converting EEG signals into sentences. In this paper, we propose a novel two-step pipeline that addresses these limitations and enhances the validity of linguistic EEG decoding research. We first confirm that word-level semantic information can be learned from EEG data recorded during natural reading by training a Conformer encoder via a masked contrastive objective for word-level classification. To achieve sentence decoding results, we employ a training-free retrieval method to retrieve sentences based on the predictions from the EEG encoder. Extensive experiments and ablation studies were conducted in this paper for a comprehensive evaluation of the proposed approach. Visualization of the top prediction candidates reveals that our model effectively groups EEG segments into semantic categories with similar meanings, thereby validating its ability to learn patterns from unspoken EEG recordings. Despite the exploratory nature of this work, these results suggest that our method holds promise for providing more reliable solutions for converting EEG signals into text.
Ruizhe Zhang, Yongxin Xu, Yuzhen Xiao, Runchuan Zhu, Xinke Jiang, Xu Chu, Junfeng Zhao, Yasha Wang
By integrating external knowledge, Retrieval-Augmented Generation (RAG) has become an effective strategy for mitigating the hallucination problems that large language models (LLMs) encounter when dealing with knowledge-intensive tasks. However, in the process of integrating external non-parametric supporting evidence with internal parametric knowledge, inevitable knowledge conflicts may arise, leading to confusion in the model's responses. To enhance the knowledge selection of LLMs in various contexts, some research has focused on refining their behavior patterns through instruction-tuning. Nonetheless, due to the absence of explicit negative signals and comparative objectives, models fine-tuned in this manner may still exhibit undesirable behaviors in the intricate and realistic retrieval scenarios. To this end, we propose a Knowledge-aware Preference Optimization, dubbed KaPO, aimed at achieving controllable knowledge selection in real retrieval scenarios. Concretely, we explore and simulate error types across diverse context combinations and learn how to avoid these negative signals through preference optimization methods. Simultaneously, by adjusting the balance between response length and the proportion of preference data representing different behavior patterns, we enhance the adherence capabilities and noise robustness of LLMs in a balanced manner. Experimental results show that KaPO outperforms previous methods for handling knowledge conflicts by over 37%, while also exhibiting robust generalization across various out-of-distribution datasets.
Pavel Suma, Giorgos Kordopatis-Zilos, Ahmet Iscen, Giorgos Tolias
This work investigates the problem of instance-level image retrieval
re-ranking with the constraint of memory efficiency, ultimately aiming to limit
memory usage to 1KB per image. Departing from the prevalent focus on
performance enhancements, this work prioritizes the crucial trade-off between
performance and memory requirements. The proposed model uses a
transformer-based architecture designed to estimate image-to-image similarity
by capturing interactions within and across images based on their local
descriptors. A distinctive property of the model is the capability for
asymmetric similarity estimation. Database images are represented with a
smaller number of descriptors compared to query images, enabling performance
improvements without increasing memory consumption. To ensure adaptability
across different applications, a universal model is introduced that adjusts to
a varying number of local descriptors during the testing phase. Results on
standard benchmarks demonstrate the superiority of our approach over both
hand-crafted and learned models. In particular, compared with current
state-of-the-art methods that overlook their memory footprint, our approach not
only attains superior performance but does so with a significantly reduced
memory footprint. The code and pretrained models are publicly available at:
https://github.com/pavelsuma/ames
Authors' comments: ECCV 2024
Tiezheng Guo, Chen Wang, Yanyi Liu, Jiawei Tang, Pan Li, Sai Xu, Qingwen Yang, Xianlin Gao et al.
Retrieving external knowledge and prompting large language models with relevant information is an effective paradigm to enhance the performance of question-answering tasks. Previous research typically handles paragraphs from external documents in isolation, resulting in a lack of context and ambiguous references, particularly in multi-document and complex tasks. To overcome these challenges, we propose a new retrieval framework, IIER, that leverages Inter-chunk Interactions to Enhance Retrieval. This framework captures the internal connections between document chunks by considering three types of interactions: structural, keyword, and semantic. We then construct a unified Chunk-Interaction Graph to represent all external documents comprehensively. Additionally, we design a graph-based evidence chain retriever that utilizes previous paths and chunk interactions to guide the retrieval process. It identifies multiple seed nodes based on the target question and iteratively searches for relevant chunks to gather supporting evidence. This retrieval process refines the context and reasoning chain, aiding the large language model in reasoning and answer generation. Extensive experiments demonstrate that IIER outperforms strong baselines across four datasets, highlighting its effectiveness in improving retrieval and reasoning capabilities.
Daniel Fleischer, Moshe Berchansky, Moshe Wasserblat, Peter Izsak
Implementing Retrieval-Augmented Generation (RAG) systems is inherently
complex, requiring deep understanding of data, use cases, and intricate design
decisions. Additionally, evaluating these systems presents significant
challenges, necessitating assessment of both retrieval accuracy and generative
quality through a multi-faceted approach. We introduce RAG Foundry, an
open-source framework for augmenting large language models for RAG use cases.
RAG Foundry integrates data creation, training, inference and evaluation into a
single workflow, facilitating the creation of data-augmented datasets for
training and evaluating large language models in RAG settings. This integration
enables rapid prototyping and experimentation with various RAG techniques,
allowing users to easily generate datasets and train RAG models using internal
or specialized knowledge sources. We demonstrate the framework's effectiveness by
augmenting and fine-tuning Llama-3 and Phi-3 models with diverse RAG
configurations, showcasing consistent improvements across three
knowledge-intensive datasets. Code is released as open source at
https://github.com/IntelLabs/RAGFoundry.
Authors' comments: 10 pages
Gongxin Yao, Xinyang Li, Yixin Xuan, Yu Pan
Image-to-point cloud registration seeks to estimate the relative camera pose
between an image and a point cloud, which remains an open question due to the
data modality gaps. The recent
matching-based methods tend to tackle this by building 2D-3D correspondences.
In this paper, we reveal the information loss inherent in these methods and
propose a matching-free paradigm, named MaFreeI2P. Our key insight is to
actively retrieve the camera pose in SE(3) space by contrasting the geometric
features between the point cloud and the query image. To achieve this, we first
sample a set of candidate camera poses and construct their cost volume using
the cross-modal features. Superior to matching, cost volume can preserve more
information and its feature similarity implicitly reflects the confidence level
of the sampled poses. Afterwards, we employ a convolutional network to
adaptively formulate a similarity assessment function, where the input cost
volume is further improved by filtering and pose-based weighting. Finally, we
update the camera pose based on the similarity scores, and adopt a heuristic
strategy to iteratively shrink the pose sampling space for convergence. Our
MaFreeI2P achieves a very competitive registration accuracy and recall on the
KITTI-Odometry and Apollo-DaoxiangLake datasets.
Authors' comments: Accepted to IEEE Conference on Multimedia Expo 2024
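The sample, score, and shrink loop in the abstract above can be illustrated with a toy coarse-to-fine search. This is only a sketch under strong simplifications: poses are low-dimensional translation vectors instead of SE(3) elements, and `similarity` is any user-supplied scoring function instead of a learned network over a cost volume. All names and default values are hypothetical.

```python
import itertools

def refine_pose(similarity, center, half_range, steps=5, iterations=6, shrink=0.5):
    """Iteratively sample candidate poses and shrink the search range.

    similarity: callable mapping a pose tuple to a score (higher is better).
    center:     initial pose estimate as a tuple of floats.
    half_range: half-width of the sampling grid along each axis.
    """
    for _ in range(iterations):
        # Evenly spaced offsets in [-half_range, +half_range].
        offsets = [half_range * (2 * i / (steps - 1) - 1) for i in range(steps)]
        candidates = [
            tuple(c + o for c, o in zip(center, combo))
            for combo in itertools.product(offsets, repeat=len(center))
        ]
        # Move to the best-scoring candidate, then narrow the search.
        center = max(candidates, key=similarity)
        half_range *= shrink
    return center
```

Since the zero offset is always part of the grid, the current estimate stays among the candidates, so the score never decreases from one iteration to the next.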
Taichi Nishimura, Shota Nakada, Hokuto Munakata, Tatsuya Komatsu
We propose Lighthouse, a user-friendly library for reproducible video moment
retrieval and highlight detection (MR-HD). Although researchers have proposed
various MR-HD approaches, the research community faces two main issues. The
first is a lack of comprehensive and reproducible experiments across various
methods, datasets, and video-text features. This is because no unified training
and evaluation codebase covers multiple settings. The second is user-unfriendly
design. Because previous works use different libraries, researchers must set up
individual environments. In addition, most works release only the training
code, requiring users to implement the whole inference process of MR-HD.
Lighthouse addresses these issues by implementing a unified reproducible
codebase that includes six models, three features, and five datasets. In
addition, it provides an inference API and web demo to make these methods
easily accessible for researchers and developers. Our experiments demonstrate
that Lighthouse generally reproduces the reported scores in the reference
papers. The code is available at https://github.com/line/lighthouse.
Authors' comments: accepted at EMNLP 2024 - system demonstration track
Jihye Choi, Nils Palumbo, Prasad Chalasani, Matthew M. Engelhard, Somesh Jha, Anivarya Kumar, David Page
In the era of Large Language Models (LLMs), given their remarkable text
understanding and generation abilities, there is an unprecedented opportunity
to develop new, LLM-based methods for trustworthy medical knowledge synthesis,
extraction and summarization. This paper focuses on the problem of
Pharmacovigilance (PhV), where the significance and challenges lie in
identifying Adverse Drug Events (ADEs) from diverse text sources, such as
medical literature, clinical notes, and drug labels. Unfortunately, this task
is hindered by factors including variations in the terminologies of drugs and
outcomes, and ADE descriptions often being buried in large amounts of narrative
text. We present MALADE, the first effective collaborative multi-agent system
powered by LLMs with Retrieval Augmented Generation for ADE extraction from drug
label data. This technique involves augmenting a query to an LLM with relevant
information extracted from text resources, and instructing the LLM to compose a
response consistent with the augmented data. MALADE is a general LLM-agnostic
architecture, and its unique capabilities are: (1) leveraging a variety of
external sources, such as medical literature, drug labels, and FDA tools (e.g.,
OpenFDA drug information API), (2) extracting drug-outcome association in a
structured format along with the strength of the association, and (3) providing
explanations for established associations. Instantiated with GPT-4 Turbo or
GPT-4o, and FDA drug label data, MALADE demonstrates its efficacy with an Area
Under ROC Curve of 0.90 against the OMOP Ground Truth table of ADEs. Our
implementation leverages the Langroid multi-agent LLM framework and can be
found at https://github.com/jihyechoi77/malade.
Authors' comments: Paper published at Machine Learning for Healthcare 2024 (MLHC'24)