Jonathan Pan, Swee Liang Wong, Yidi Yuan
Detecting anomalies in system logs is a vital activity for ensuring the
cyber resiliency of systems. It is applied to fault identification and
facilitates cyber investigation and digital forensics. However, because logs
from different systems and components differ significantly, such analysis is
difficult for humans to perform given the volume, variety and velocity of
logs. This is further complicated by the lack or unavailability of anomalous
log entries with which to train machine learning or artificial intelligence
models for such purposes. In this research
work, we explore the use of a Retrieval Augmented Large Language Model that
leverages a vector database to detect anomalies from logs. We used a Question
and Answer configuration pipeline. To the best of our knowledge, our experiment,
which we call RAGLog, is novel, and the experimental results show much
promise.
Authors' comments: arXiv admin note: substantial text overlap with arXiv:2203.10960
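The retrieval idea in the RAGLog abstract above (compare an incoming log entry against stored normal entries) can be illustrated with a minimal sketch. It assumes a toy in-memory store and a bag-of-words cosine similarity in place of a real vector database, and a fixed threshold in place of the LLM's judgement. The names `tokenize`, `cosine`, `is_anomalous`, the sample log lines, and the 0.5 threshold are all hypothetical and do not come from the paper.

```python
import math
import re
from collections import Counter

def tokenize(line):
    """Lowercase a log line and keep only alphabetic tokens (numbers are dropped)."""
    return re.findall(r"[a-z]+", line.lower())

def cosine(a, b):
    """Cosine similarity between two token-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

# Hypothetical store of known-normal log entries (stand-in for a vector database).
NORMAL_LOGS = [
    "Accepted password for user alice from 10.0.0.5",
    "Session opened for user bob",
    "Disk usage at 41 percent on /dev/sda1",
]
STORE = [Counter(tokenize(line)) for line in NORMAL_LOGS]

def is_anomalous(line, threshold=0.5):
    """Flag a line when its best match among the normal entries is below the threshold."""
    query = Counter(tokenize(line))
    best = max(cosine(query, ref) for ref in STORE)
    return best < threshold
```

In this sketch only normal entries are stored, which mirrors the motivation in the abstract that anomalous examples are scarce; a line is flagged simply because nothing similar has been seen before.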
Şeyma Bodur, Edgar Martínez-Moro, Diego Ruano
A Private Information Retrieval (PIR) protocol based on coding theory for a single server is proposed. It provides computational security against linear algebra attacks, addressing the main drawback of previous PIR proposals based on coding theory. The approach involves two types of codes, each over a different ring: an inner non-free linear code that is used as a distinguisher of some elements added to the query matrix, and an outer code that is used for generating the query matrix. Moreover, it only uses modular arithmetic at the server level and in the recovery stage if the base ring chosen for the inner code is $\mathbb Z_m$.
Mohammad Bokaei, Saeed Razavikia, Stefano Rini, Arash Amini, Hamid Behrouzi
In this paper, we investigate the problem of recovering the frequency components of a mixture of $K$ complex sinusoids from a random subset of $N$ equally-spaced time-domain samples. Because of the random subset, the samples are effectively non-uniform. Besides, the frequency values of each of the $K$ complex sinusoids are assumed to vary continuously within a given range. For this problem, we propose a two-step strategy: (i) we first lift the incomplete set of uniform samples (unavailable samples are treated as missing data) into a structured matrix with missing entries, which is potentially low-rank; then (ii) we complete the matrix by solving a weighted nuclear norm minimization problem. We call the method \emph{weighted lifted-structured (WLi) low-rank matrix recovery}. Our approach can be applied to a range of matrix structures such as Hankel and double-Hankel, among others, and provides improvement over existing unweighted schemes such as EMaC and DEMaC. We provide theoretical guarantees for the proposed method, as well as numerical simulations in both noiseless and noisy settings. Both the theoretical and the numerical results confirm the superiority of the proposed approach.
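The reason lifting helps can be seen in a small sketch of step (i) alone, using fully observed samples: a Hankel matrix built from a mixture of $K$ complex sinusoids has rank $K$. The code below is an illustration under that assumption, not the paper's method; it does not implement the weighted completion of step (ii), and the function names, frequencies, amplitudes, and tolerance are hypothetical choices.

```python
import cmath

def sinusoid_mixture(freqs, amps, n_samples):
    """Samples x[n] = sum_k amps[k] * exp(2j*pi*freqs[k]*n) for n = 0..n_samples-1."""
    return [
        sum(a * cmath.exp(2j * cmath.pi * f * n) for f, a in zip(freqs, amps))
        for n in range(n_samples)
    ]

def hankel(x, rows):
    """Lift a sequence into a Hankel matrix H[i][j] = x[i + j] with the given row count."""
    cols = len(x) - rows + 1
    return [[x[i + j] for j in range(cols)] for i in range(rows)]

def numerical_rank(matrix, tol=1e-9):
    """Rank via Gaussian elimination with partial pivoting on a copy of the matrix."""
    m = [row[:] for row in matrix]
    n_rows, n_cols = len(m), len(m[0])
    rank = 0
    for col in range(n_cols):
        if rank == n_rows:
            break
        pivot = max(range(rank, n_rows), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < tol:
            continue  # no usable pivot in this column
        m[rank], m[pivot] = m[pivot], m[rank]
        for r in range(rank + 1, n_rows):
            factor = m[r][col] / m[rank][col]
            for c in range(col, n_cols):
                m[r][c] -= factor * m[rank][c]
        rank += 1
    return rank

# K = 2 sinusoids, N = 16 samples, lifted into an 8 x 9 Hankel matrix.
x = sinusoid_mixture(freqs=[0.12, 0.37], amps=[1.0, 0.5], n_samples=16)
H = hankel(x, rows=8)
rank_of_H = numerical_rank(H)
```

Because the lifted matrix has rank far below its dimensions, missing samples (which become missing anti-diagonals of `H`) can in principle be filled in by a low-rank completion method.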
Zilin Xiao, Ming Gong, Jie Wu, Xingyao Zhang, Linjun Shou, Jian Pei, Daxin Jiang
Generative approaches powered by large language models (LLMs) have
demonstrated emergent abilities in tasks that require complex reasoning
abilities. Yet the generative nature still makes the generated content suffer
from hallucinations, thus unsuitable for entity-centric tasks like entity
linking (EL) requiring precise entity predictions over a large knowledge base.
We present Instructed Generative Entity Linker (INSGENEL), the first approach
that enables causal language models to perform entity linking over knowledge
bases. We propose several methods to equip language models with EL capability,
including (i) a sequence-to-sequence training EL
objective with instruction-tuning, (ii) a novel generative EL framework based
on a light-weight potential mention retriever that frees the model from heavy
and non-parallelizable decoding, achieving 4$\times$ speedup without compromise
on linking metrics. INSGENEL outperforms previous generative alternatives with
+6.8 F1 points gain on average, also with a huge advantage in training data
efficiency and training compute consumption. In addition, our skillfully
engineered in-context learning (ICL) framework for EL still lags behind
INSGENEL significantly, reaffirming that the EL task remains a persistent
hurdle for general LLMs.
Authors' comments: Accepted to EMNLP 2023 Main
Yucan Guo, Zixuan Li, Xiaolong Jin, Yantao Liu, Yutao Zeng, Wenxuan Liu, Xiang Li, Pan Yang et al.
Information Extraction (IE) aims to extract structural knowledge (e.g., entities, relations, events) from natural language texts, which brings challenges to existing methods due to task-specific schemas and complex text expressions. Code, as a typical kind of formalized language, is capable of describing structural knowledge under various schemas in a universal way. On the other hand, Large Language Models (LLMs) trained on both codes and texts have demonstrated powerful capabilities of transforming texts into codes, which provides a feasible solution to IE tasks. Therefore, in this paper, we propose a universal retrieval-augmented code generation framework based on LLMs, called Code4UIE, for IE tasks. Specifically, Code4UIE adopts Python classes to define task-specific schemas of various structural knowledge in a universal way. By so doing, extracting knowledge under these schemas can be transformed into generating codes that instantiate the predefined Python classes with the information in texts. To generate these codes more precisely, Code4UIE adopts the in-context learning mechanism to instruct LLMs with examples. In order to obtain appropriate examples for different tasks, Code4UIE explores several example retrieval strategies, which can retrieve examples semantically similar to the given texts. Extensive experiments on five representative IE tasks across nine datasets demonstrate the effectiveness of the Code4UIE framework.
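The schema-as-code idea in the Code4UIE abstract above can be sketched as follows. The class names (`Entity`, `Person`, `Organization`, `WorkFor`), the input sentence, and the hard-coded "generated" string are hypothetical illustrations, not the schemas or prompts used in the paper; in a real pipeline the string would come from an LLM prompted with retrieved examples.

```python
from dataclasses import dataclass

@dataclass
class Entity:
    """Base class for any extracted entity; `name` holds the text span."""
    name: str

@dataclass
class Person(Entity):
    """Schema for person entities."""

@dataclass
class Organization(Entity):
    """Schema for organization entities."""

@dataclass
class WorkFor:
    """Schema for a relation: a person works for an organization."""
    head: Person
    tail: Organization

# Hypothetical input text: "Alice Smith has worked at Acme Corp since 2019."
# The string below stands in for the code an LLM would be asked to generate.
generated_code = """
alice = Person(name="Alice Smith")
acme = Organization(name="Acme Corp")
relations = [WorkFor(head=alice, tail=acme)]
"""

namespace = {"Person": Person, "Organization": Organization, "WorkFor": WorkFor}
exec(generated_code, namespace)  # acceptable here only because the string is hard-coded
relations = namespace["relations"]
```

Executing untrusted model output with `exec` is unsafe in practice; a real system would more likely parse the generated code and validate it against the schema classes.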
Paul Irofti
We present here a reverse engineering tool that can be used for information retrieval and anti-malware techniques. Our main contribution is the design and implementation of an instrumentation framework aimed at providing insight into the emulation process. Sample emulation is achieved via translation of the binary code to an intermediate representation followed by compilation and execution. The design makes this a versatile tool that can be used for multiple tasks such as information retrieval, reverse engineering, debugging, and integration with anti-malware products.
Fan Luo, Mihai Surdeanu
Lexical and semantic matches are commonly used as relevance measurements for
information retrieval. Together they estimate the semantic equivalence between
the query and the candidates. However, semantic equivalence is not the only
relevance signal that needs to be considered when retrieving evidences for
multi-hop questions. In this work, we demonstrate that textual entailment
relation is another important relevance dimension that should be considered. To
retrieve evidences that are either semantically equivalent to or entailed by
the question simultaneously, we divide the task of evidence retrieval for
multi-hop question answering (QA) into two sub-tasks, i.e., semantic textual
similarity and inference similarity retrieval. We propose two ensemble models,
EAR and EARnest, which tackle each of the sub-tasks separately and then jointly
re-rank sentences with the consideration of the diverse relevance signals.
Experimental results on HotpotQA verify that our models not only significantly
outperform all the single retrieval models they are based on, but are also more
effective than two intuitive ensemble baseline models.
Authors' comments: Accepted by NAACL-HLT SRW 2022
Mayank Kothyari, Dhruva Dhingra, Sunita Sarawagi, Soumen Chakrabarti
Existing Text-to-SQL generators require the entire schema to be encoded with
the user text. This is expensive or impractical for large databases with tens
of thousands of columns. Standard dense retrieval techniques are inadequate for
schema subsetting of a large structured database, where the correct semantics
of retrieval demands that we rank sets of schema elements rather than
individual elements. In response, we propose a two-stage process for effective
coverage during retrieval. First, we instruct an LLM to hallucinate a minimal
DB schema deemed adequate to answer the query. We use the hallucinated schema
to retrieve a subset of the actual schema, by composing the results from
multiple dense retrievals. Remarkably, hallucination, generally considered a
nuisance, turns out to be actually useful as a bridging mechanism. Since no
benchmarks exist for schema
subsetting on large databases, we introduce three benchmarks. Two
semi-synthetic datasets are derived from the union of schemas in two well-known
datasets, SPIDER and BIRD, resulting in 4502 and 798 schema elements
respectively. A real-life benchmark called SocialDB is sourced from an actual
large data warehouse comprising 17844 schema elements. We show that our method
leads to significantly higher recall than SOTA retrieval-based augmentation
methods.
Authors' comments: To appear at EMNLP 2023 (Main)
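The two-stage process in the abstract above (hallucinate a small schema, then use each hallucinated element as a retrieval probe and combine the results) can be sketched as below. This is a simplified illustration: a token-overlap (Jaccard) score stands in for dense retrieval, the hallucinated schema is hard-coded rather than produced by an LLM, and every table, column, and function name is hypothetical.

```python
def tokens(name):
    """Split a schema element such as 'purchase.customer_id' into lowercase tokens."""
    return set(name.lower().replace(".", "_").split("_"))

def jaccard(a, b):
    """Token-overlap similarity, used here as a stand-in for embedding similarity."""
    ta, tb = tokens(a), tokens(b)
    return len(ta & tb) / len(ta | tb)

def subset_schema(hallucinated, actual, k=2):
    """For each hallucinated element keep its top-k matching actual elements; return the union."""
    selected = set()
    for h in hallucinated:
        scored = [(jaccard(h, a), a) for a in actual]
        scored = [(s, a) for s, a in scored if s > 0]  # ignore elements with no overlap
        scored.sort(key=lambda pair: pair[0], reverse=True)
        selected.update(a for _, a in scored[:k])
    return selected

ACTUAL_SCHEMA = [
    "customer.customer_id", "customer.full_name", "customer.signup_date",
    "purchase.purchase_id", "purchase.customer_id", "purchase.total_amount",
    "warehouse.location", "warehouse.capacity",
]
# Hypothetical LLM guess for the question "total spend per customer name".
HALLUCINATED = ["customers.name", "orders.customer_id", "orders.amount"]

subset = subset_schema(HALLUCINATED, ACTUAL_SCHEMA)
```

The hallucinated names do not exist in the real schema, yet each one is close enough to a real element to pull it in, which is the bridging role the abstract describes.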
Shicheng Xu, Liang Pang, Jiangnan Li, Mo Yu, Fandong Meng, Huawei Shen, Xueqi Cheng, Jie Zhou
Retrieving relevant plots from the book for a query is a critical task, which can improve the reading experience and efficiency of readers. Readers usually only give an abstract and vague description as the query based on their own understanding, summaries, or speculations of the plot, which requires the retrieval model to have a strong ability to estimate the abstract semantic associations between the query and candidate plots. However, existing information retrieval (IR) datasets cannot reflect this ability well. In this paper, we propose Plot Retrieval, a labeled dataset to train and evaluate the performance of IR models on the novel task Plot Retrieval. Text pairs in Plot Retrieval have less word overlap and more abstract semantic association, which can reflect the ability of the IR models to estimate the abstract semantic association, rather than just traditional lexical or semantic matching. Extensive experiments across various lexical retrieval, sparse retrieval, dense retrieval, and cross-encoder methods compared with human studies on Plot Retrieval show current IR models still struggle in capturing abstract semantic association between texts. Plot Retrieval can be the benchmark for further research on the semantic association modeling ability of IR models.
Maxwell A. Xu, Alexander Moreno, Hui Wei, Benjamin M. Marlin, James M. Rehg
The success of self-supervised contrastive learning hinges on identifying
positive data pairs, such that when they are pushed together in embedding
space, the space encodes useful information for subsequent downstream tasks.
Constructing positive pairs is non-trivial as the pairing must be similar
enough to reflect a shared semantic meaning, but different enough to capture
within-class variation. Classical approaches in vision use augmentations to
exploit well-established invariances to construct positive pairs, but
invariances in the time-series domain are much less obvious. In our work, we
propose a novel method of using a learned measure for identifying positive
pairs. Our Retrieval-Based Reconstruction (REBAR) measure quantifies the
similarity between two sequences as the reconstruction error that results from
reconstructing one sequence with retrieved information from the other. Then, if
the two sequences have high REBAR similarity, we label them as a positive pair.
Through validation experiments, we show that the REBAR error is a predictor of
mutual class membership. Once integrated into a contrastive learning framework,
our REBAR method learns an embedding that achieves state-of-the-art performance
on downstream tasks across various modalities.
Authors' comments: ICLR 2024 | Code available at: https://github.com/maxxu05/rebar
Xiaoqian Li, Ercong Nie, Sheng Liang
The promise of Large Language Models (LLMs) in Natural Language Processing
has often been overshadowed by their limited performance in low-resource
languages such as Bangla. To address this, our paper presents a pioneering
approach that utilizes cross-lingual retrieval augmented in-context learning.
By strategically sourcing semantically similar prompts from a high-resource
language, we enable multilingual pretrained language models (MPLMs), especially
the generative model BLOOMZ, to successfully boost performance on Bangla tasks.
Our extensive evaluation highlights that the cross-lingual retrieval augmented
prompts bring steady improvements to MPLMs over the zero-shot performance.
Authors' comments: In The 1st Bangla Language Processing (BLP) Workshop, held in
conjunction with The Conference on Empirical Methods in Natural Language
Processing (EMNLP), December 2023
Yuqi Wang, Zeqiang Wang, Wei Wang, Qi Chen, Kaizhu Huang, Anh Nguyen, Suparna De
In the era of the Internet of Things (IoT), the retrieval of relevant medical information has become essential for efficient clinical decision-making. This paper introduces MedFusionRank, a novel approach to zero-shot medical information retrieval (MIR) that combines the strengths of pre-trained language models and statistical methods while addressing their limitations. The proposed approach leverages a pre-trained BERT-style model to extract compact yet informative keywords. These keywords are then enriched with domain knowledge by linking them to conceptual entities within a medical knowledge graph. Experimental evaluations on medical datasets demonstrate MedFusionRank's superior performance over existing methods, with promising results across a variety of evaluation metrics. MedFusionRank demonstrates efficacy in retrieving relevant information, even from short or single-term queries.
Daman Arora, Anush Kini, Sayak Ray Chowdhury, Nagarajan Natarajan, Gaurav Sinha, Amit Sharma
Given a query and a document corpus, the information retrieval (IR) task is
to output a ranked list of relevant documents. Combining large language models
(LLMs) with embedding-based retrieval models, recent work shows promising
results on the zero-shot retrieval problem, i.e., no access to labeled data
from the target domain. Two such popular paradigms are generation-augmented
retrieval or GAR (generate additional context for the query and then retrieve),
and retrieval-augmented generation or RAG (retrieve relevant documents as
context and then generate answers). The success of these paradigms hinges on
(i) high-recall retrieval models, which are difficult to obtain in the
zero-shot setting, and (ii) high-precision (re-)ranking models which typically
need a good initialization. In this work, we propose a novel GAR-meets-RAG
recurrence formulation that overcomes the challenges of existing paradigms. Our
method iteratively improves retrieval (via GAR) and rewrite (via RAG) stages in
the zero-shot setting. A key design principle is that the rewrite-retrieval
stages improve the recall of the system and a final re-ranking stage improves
the precision. We conduct extensive experiments on zero-shot passage retrieval
benchmarks, BEIR and TREC-DL. Our method establishes a new state-of-the-art in
the BEIR benchmark, outperforming previous best results in Recall@100 and
nDCG@10 metrics on 6 out of 8 datasets, with up to 17% relative gains over the
previous best.
Authors' comments: preprint
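The two metrics named in the abstract above, Recall@k and nDCG@k, can be computed as in the following sketch. It uses the common linear-gain form of DCG with a log2(rank + 1) discount; some evaluation toolkits use an exponential gain instead, so this is one standard variant rather than necessarily the exact one used in the paper. The function names and document ids are hypothetical.

```python
import math

def recall_at_k(ranked_ids, relevant_ids, k):
    """Fraction of the relevant documents that appear in the top-k results."""
    if not relevant_ids:
        return 0.0
    hits = sum(1 for doc in ranked_ids[:k] if doc in relevant_ids)
    return hits / len(relevant_ids)

def dcg_at_k(gains, k):
    """Discounted cumulative gain: gain at rank r is divided by log2(r + 1)."""
    return sum(g / math.log2(i + 2) for i, g in enumerate(gains[:k]))

def ndcg_at_k(ranked_ids, relevance, k):
    """DCG of the ranking divided by the DCG of the ideal ordering of the labels."""
    gains = [relevance.get(doc, 0) for doc in ranked_ids]
    ideal = sorted(relevance.values(), reverse=True)
    ideal_dcg = dcg_at_k(ideal, k)
    return dcg_at_k(gains, k) / ideal_dcg if ideal_dcg else 0.0

# Two relevant documents; the system ranks one first and the other third.
relevance = {"d1": 1, "d2": 1}
ranking = ["d1", "d3", "d2"]
recall = recall_at_k(ranking, set(relevance), k=2)
ndcg = ndcg_at_k(ranking, relevance, k=10)
```

Recall@k ignores order within the top k, while nDCG@k rewards placing relevant documents earlier, which is why the abstract pairs a recall-oriented rewrite-retrieval loop with a precision-oriented final re-ranking stage.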
Sunhao Dai, Yuqi Zhou, Liang Pang, Weihao Liu, Xiaolin Hu, Yong Liu, Xiao Zhang, Gang Wang et al.
Recently, the emergence of large language models (LLMs) has revolutionized
the paradigm of information retrieval (IR) applications, especially in web
search, by generating vast amounts of human-like texts on the Internet. As a
result, IR systems in the LLM era are facing a new challenge: the indexed
documents are now not only written by human beings but also automatically
generated by the LLMs. How these LLM-generated documents influence the IR
systems is a pressing and still unexplored question. In this work, we conduct a
quantitative evaluation of IR models in scenarios where both human-written and
LLM-generated texts are involved. Surprisingly, our findings indicate that
neural retrieval models tend to rank LLM-generated documents higher. We refer
to this category of biases in neural retrievers towards the LLM-generated
content as the \textbf{source bias}. Moreover, we discover that this bias is
not confined to the first-stage neural retrievers, but extends to the
second-stage neural re-rankers. Then, in-depth analyses from the perspective of
text compression indicate that LLM-generated texts exhibit more focused
semantics with less noise, making it easier for neural retrieval models to
semantically match. To mitigate the source bias, we also propose a plug-and-play
debiased constraint for the optimization objective, and experimental results
show its effectiveness. Finally, we discuss the potential severe concerns
stemming from the observed source bias and hope our findings can serve as a
critical wake-up call to the IR community and beyond. To facilitate future
explorations of IR in the LLM era, the two newly constructed benchmarks are
available at https://github.com/KID-22/Source-Bias.
Authors' comments: KDD 2024
Rodrigo Braz Teixeira, Giorgio Carugno, Izaak Neri, Pablo Sartori
Biological mixtures, such as the cellular cytoplasm, are composed of a large number of different components. From this heterogeneity, ordered mesoscopic structures emerge, such as liquid phases with controlled composition. These structures compete with each other for the same components. This raises several questions, such as what types of interactions allow the retrieval of multiple ordered mesoscopic structures, and what are the physical limitations for the retrieval of said structures. In this work, we develop an analytically tractable model for liquids capable of retrieving states with target compositions. We name this model the liquid Hopfield model in reference to corresponding work in the theory of associative neural networks. By solving this model, we show that non-linear repulsive interactions are necessary for retrieval of target structures. We demonstrate that this is because liquid mixtures at low temperatures tend to transition to phases with few components, a phenomenon that we term localization. Taken together, our results demonstrate a trade-off between retrieval and localization phenomena in liquid mixtures.
Haitao Li, Yunqiu Shao, Yueyue Wu, Qingyao Ai, Yixiao Ma, Yiqun Liu
As an important component of intelligent legal systems, legal case retrieval plays a critical role in ensuring judicial justice and fairness. However, the development of legal case retrieval technologies in the Chinese legal system is restricted by three problems in existing datasets: limited data size, narrow definitions of legal relevance, and naive candidate pooling strategies used in data sampling. To alleviate these issues, we introduce LeCaRDv2, a large-scale Legal Case Retrieval Dataset (version 2). It consists of 800 queries and 55,192 candidates extracted from 4.3 million criminal case documents. To the best of our knowledge, LeCaRDv2 is one of the largest Chinese legal case retrieval datasets, providing extensive coverage of criminal charges. Additionally, we enrich the existing relevance criteria by considering three key aspects: characterization, penalty, and procedure. These comprehensive criteria enrich the dataset and may provide a more holistic perspective. Furthermore, we propose a two-level candidate set pooling strategy that effectively identifies potential candidates for each query case. It's important to note that all cases in the dataset have been annotated by multiple legal experts specializing in criminal law. Their expertise ensures the accuracy and reliability of the annotations. We evaluate several state-of-the-art retrieval models on LeCaRDv2, demonstrating that there is still significant room for improvement in legal case retrieval. The details of LeCaRDv2 can be found at the anonymous website https://github.com/anonymous1113243/LeCaRDv2.
Palak Jain, Livio Baldini Soares, Tom Kwiatkowski
We present 1-Pager, the first system that answers a question and retrieves
evidence using a single Transformer-based model and decoding process. 1-Pager
incrementally partitions the retrieval corpus using constrained decoding to
select a document and answer string, and we show that this is competitive with
comparable retrieve-and-read alternatives according to both retrieval and
answer accuracy metrics. 1-Pager also outperforms the equivalent closed-book
question answering model, by grounding predictions in an evidence corpus. While
1-Pager is not yet on-par with more expensive systems that read many more
documents before generating an answer, we argue that it provides an important
step toward attributed generation by folding retrieval into the
sequence-to-sequence paradigm that is currently dominant in NLP. We also show
that the search paths used to partition the corpus are easy to read and
understand, paving a way forward for interpretable neural retrieval.
Authors' comments: Accepted at EMNLP 2023 (Findings)
Peixuan Han, Zhenghao Liu, Zhiyuan Liu, Chenyan Xiong
The anchor-document data derived from web graphs offers a wealth of paired information for training dense retrieval models in an unsupervised manner. However, unsupervised data contains diverse patterns across the web graph and often exhibits significant imbalance, leading to suboptimal performance in underrepresented or difficult groups. In this paper, we introduce WebDRO, an efficient approach for clustering the web graph data and optimizing group weights to enhance the robustness of dense retrieval models. Initially, we build an embedding model for clustering anchor-document pairs. Specifically, we contrastively train the embedding model for link prediction, which guides the embedding model in capturing the document features behind the web graph links. Subsequently, we employ group distributionally robust optimization to recalibrate the weights across different clusters of anchor-document pairs while training retrieval models. During training, we direct the model to assign higher weights to clusters with higher loss and focus more on worst-case scenarios. This approach ensures that the model has strong generalization ability on all data patterns. Our experiments on MS MARCO and BEIR demonstrate that our method can effectively improve retrieval performance in unsupervised training and finetuning settings. Further analysis confirms the stability and validity of group weights learned by WebDRO. The code of this paper can be obtained from https://github.com/Hanpx20/GroupDRO_Dense_Retrieval.
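The reweighting behaviour described in the WebDRO abstract above (higher weight for clusters with higher loss) can be sketched with the exponentiated-gradient update commonly used in group distributionally robust optimization. Whether WebDRO uses exactly this update is an assumption; the function names, step size, and loss values below are hypothetical.

```python
import math

def update_group_weights(weights, group_losses, step_size=1.0):
    """Scale each group weight by exp(step_size * loss), then renormalize to sum to 1."""
    scaled = [w * math.exp(step_size * loss) for w, loss in zip(weights, group_losses)]
    total = sum(scaled)
    return [s / total for s in scaled]

def weighted_loss(weights, group_losses):
    """Objective seen by the retriever: per-group losses combined with the current weights."""
    return sum(w * loss for w, loss in zip(weights, group_losses))

# Three clusters of anchor-document pairs with uniform starting weights.
weights = [1 / 3, 1 / 3, 1 / 3]
losses = [0.2, 0.9, 0.4]  # the second cluster is currently the hardest
new_weights = update_group_weights(weights, losses)
```

After the update the hardest cluster carries the largest weight, so the next gradient step on `weighted_loss` concentrates on it; repeating the two steps alternately approximates optimizing the worst-case group.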
Qingquan Li, Yiran Hu, Feng Yao, Chaojun Xiao, Zhiyuan Liu, Maosong Sun, Weixing Shen
Similar case retrieval (SCR) is a representative legal AI application that
plays a pivotal role in promoting judicial fairness. However, existing SCR
datasets only focus on the fact description section when judging the similarity
between cases, ignoring other valuable sections (e.g., the court's opinion)
that can provide insight into the underlying reasoning process. Furthermore, the case
similarities are typically measured solely by the textual semantics of the fact
descriptions, which may fail to capture the full complexity of legal cases from
the perspective of legal knowledge. In this work, we present MUSER, a similar
case retrieval dataset based on multi-view similarity measurement and
comprehensive legal element with sentence-level legal element annotations.
Specifically, we select three perspectives (legal fact, dispute focus, and law
statutory) and build a comprehensive and structured label schema of legal
elements for each of them, to enable accurate and knowledgeable evaluation of
case similarities. The constructed dataset originates from Chinese civil cases
and contains 100 query cases and 4,024 candidate cases. We implement several
text classification algorithms for legal element prediction and various
retrieval methods for retrieving similar cases on MUSER. The experimental
results indicate that incorporating legal elements can benefit the performance
of SCR models, but further efforts are still required to address the remaining
challenges posed by MUSER. The source code and dataset are released at
https://github.com/THUlawtech/MUSER.
Authors' comments: Accepted by CIKM 2023 Resource Track
Marah I Abdin, Suriya Gunasekar, Varun Chandrasekaran, Jerry Li, Mert Yuksekgonul, Rahee Ghosh Peshawaria, Ranjita Naik, Besmira Nushi
We study the ability of state-of-the-art models to answer constraint
satisfaction queries for information retrieval (e.g., 'a list of ice cream
shops in San Diego'). In the past, such queries were considered to be tasks
that could only be solved via web-search or knowledge bases. More recently,
large language models (LLMs) have demonstrated initial emergent abilities in
this task. However, many current retrieval benchmarks are either saturated or
do not measure constraint satisfaction. Motivated by rising concerns around
factual incorrectness and hallucinations of LLMs, we present KITAB, a new
dataset for measuring constraint satisfaction abilities of language models.
KITAB consists of book-related data across more than 600 authors and 13,000
queries, and also offers an associated dynamic data collection and constraint
verification approach for acquiring similar test data for other authors. Our
extended experiments on GPT4 and GPT3.5 characterize and decouple common
failure modes across dimensions such as information popularity, constraint
types, and context availability. Results show that in the absence of context,
models exhibit severe limitations as measured by irrelevant information,
factual errors, and incompleteness, many of which worsen as information
popularity decreases. While context availability mitigates irrelevant
information, it is not helpful for satisfying constraints, identifying
fundamental barriers to constraint satisfaction. We open source our
contributions to foster further research on improving constraint satisfaction
abilities of future models.
Authors' comments: 23 pages