Generative AI systems powered by Large Language Models have become increasingly popular in recent years. However, because of the risk of providing users with unsafe information, their adoption in safety-critical domains has raised significant concerns. In response, input-output filters, commonly called guardrail models, have been proposed to complement other measures, such as model alignment. Unfortunately, the lack of a standard benchmark for guardrail models poses significant evaluation issues and makes it hard to compare results across scientific publications. To fill this gap, we introduce GuardBench, a large-scale benchmark for guardrail models comprising 40 safety evaluation datasets. To facilitate the adoption of GuardBench, we release a Python library providing an automated evaluation pipeline built on top of it. With our benchmark, we also share the first large-scale prompt moderation datasets in German, French, Italian, and Spanish. To assess the current state of the art, we conduct an extensive comparison of recent guardrail models and show that a general-purpose instruction-following model of comparable size achieves competitive results without the need for specific fine-tuning.
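As a rough illustration of what an automated guardrail-evaluation pipeline computes, the sketch below scores a user-supplied moderation function against a labeled prompt set. The `evaluate_guardrail` helper, the dataset format, and the keyword-based classifier are hypothetical placeholders and do not reflect the actual GuardBench API.

```python
from typing import Callable, Dict, List

def evaluate_guardrail(
    classify_prompt: Callable[[str], bool],  # returns True if the prompt is deemed unsafe
    dataset: List[Dict[str, object]],        # each item: {"prompt": str, "unsafe": bool}
) -> Dict[str, float]:
    """Compute precision, recall, and F1 of a moderation function on one labeled dataset."""
    tp = fp = fn = 0
    for item in dataset:
        predicted_unsafe = classify_prompt(item["prompt"])
        if predicted_unsafe and item["unsafe"]:
            tp += 1
        elif predicted_unsafe and not item["unsafe"]:
            fp += 1
        elif not predicted_unsafe and item["unsafe"]:
            fn += 1
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}

# Toy usage with a keyword-based placeholder classifier (for illustration only)
toy_dataset = [
    {"prompt": "How do I make a bomb?", "unsafe": True},
    {"prompt": "What is the capital of France?", "unsafe": False},
]
print(evaluate_guardrail(lambda p: "bomb" in p.lower(), toy_dataset))
```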
indxr is a Python utility for indexing file lines that allows users to dynamically access specific ones without loading the entire file into the computer’s main memory. indxr addresses two main issues related to working with textual data. First, users with limited RAM may struggle to work with large datasets. Since indxr allows accessing specific lines without loading entire files, users can work with datasets that do not fit into their computer’s main memory. For example, it enables users with limited RAM to perform complex tasks, such as pre-processing texts and training Neural models for Information Retrieval, without noticeable slowdowns. Second, indxr reduces the burden of working with datasets split among multiple files: users can load specific data by providing the related line numbers or the identifiers of the information they describe, gaining convenient access to such data. This paper overviews indxr’s main features.
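The following minimal sketch illustrates the general byte-offset indexing technique that makes this kind of random line access possible; it is not the indxr API itself, whose interface may differ.

```python
class LineIndex:
    """Minimal sketch of byte-offset line indexing: record where each line starts,
    then seek to it on demand instead of loading the whole file into memory."""

    def __init__(self, path: str):
        self.path = path
        self.offsets = []
        with open(path, "rb") as f:
            offset = 0
            for line in f:
                self.offsets.append(offset)
                offset += len(line)  # binary mode, so len(line) is the byte length

    def get_line(self, index: int) -> str:
        with open(self.path, "rb") as f:
            f.seek(self.offsets[index])
            return f.readline().decode("utf-8").rstrip("\n")

# Usage sketch: index once, then read arbitrary lines without loading the file
# idx = LineIndex("corpus.jsonl")
# print(idx.get_line(123_456))
```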
Personalized Query Expansion, the task of expanding queries with additional terms extracted from the user-related vocabulary, is a well-known solution to improve the retrieval performance of a system on short queries. Recent approaches rely on word embeddings to select expansion terms from user-related texts. Although these methods deliver promising results with earlier word embedding techniques, we argue that they are not suited for contextual word embeddings, which produce a unique vector representation for each term occurrence. In this article, we propose a Personalized Query Expansion method designed to solve the issues arising from the use of contextual word embeddings with the current Personalized Query Expansion approaches based on word embeddings. Specifically, we employ a clustering-based procedure to identify the terms that better represent the user interests and to improve the diversity of those selected for expansion, achieving improvements of up to 4% w.r.t. the best-performing baseline in terms of MAP@100. Moreover, our approach outperforms previous ones in terms of efficiency, allowing us to achieve sub-millisecond expansion times even in data-rich scenarios. Finally, we introduce a novel metric to evaluate the diversity of the expansion terms and empirically show the unsuitability of previous approaches based on word embeddings when employed along with contextual word embeddings, which cause the selection of semantically overlapping expansion terms.
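A simplified sketch of the general idea behind clustering-based selection of expansion terms follows; the cluster count, the cosine-similarity scoring, and the per-cluster selection are illustrative assumptions, not the exact procedure proposed in the article.

```python
import numpy as np
from sklearn.cluster import KMeans

def select_expansion_terms(
    terms: list[str],            # candidate terms from user-related texts (one entry per occurrence)
    embeddings: np.ndarray,      # contextual embedding of each occurrence, shape (n_terms, dim)
    query_embedding: np.ndarray, # embedding of the current query, shape (dim,)
    n_clusters: int = 5,
    per_cluster: int = 2,
) -> list[str]:
    """Cluster term occurrences and pick, from each cluster, the terms closest to the
    query, so the selected expansion terms cover distinct user interests."""
    labels = KMeans(n_clusters=n_clusters, n_init=10, random_state=0).fit_predict(embeddings)
    sims = embeddings @ query_embedding / (
        np.linalg.norm(embeddings, axis=1) * np.linalg.norm(query_embedding) + 1e-9
    )
    selected = []
    for c in range(n_clusters):
        members = np.where(labels == c)[0]
        best = members[np.argsort(-sims[members])][:per_cluster]
        selected.extend(terms[i] for i in best)
    return selected
```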
The personalization of search results has gained increasing attention in the past few years, thanks to the development of Neural Networks-based approaches for Information Retrieval and the importance of personalization in many search scenarios. Recent works have proposed to build user models at query time by leveraging the Attention mechanism, which allows weighing the contribution of the user-related information w.r.t. the current query. This approach allows taking into account the diversity of the user’s interests by giving more importance to those related to the current search performed by the user. In this paper, we first discuss some shortcomings of the standard Attention formulation when employed for personalization. In particular, we focus on issues related to its normalization mechanism and its inability to entirely filter out noisy user-related information. Then, we introduce the Denoising Attention mechanism: an Attention variant that directly tackles the above shortcomings by adopting a robust normalization scheme and introducing a filtering mechanism. The reported experimental evaluation shows the benefits of the proposed approach over other Attention-based variants.
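For reference, the sketch below shows the standard softmax-normalized Attention formulation used to build a user model at query time, i.e., the baseline whose shortcomings are discussed; the Denoising Attention variant modifies the normalization and adds a filtering step, and its exact formulation is given in the paper.

```python
import numpy as np

def softmax_user_model(query_vec: np.ndarray, user_item_vecs: np.ndarray) -> np.ndarray:
    """Standard Attention-based user model built at query time: each user-related
    item is weighted by the softmax of its dot product with the current query.
    Because softmax weights always sum to one, noisy items still receive mass;
    this is one of the shortcomings the Denoising Attention mechanism targets."""
    scores = user_item_vecs @ query_vec            # (n_items,)
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    return weights @ user_item_vecs                # weighted sum of item vectors: (dim,)
```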
ranxhub is an online repository for sharing artifacts deriving from the evaluation of Information Retrieval systems. Specifically, we provide a platform for sharing pre-computed runs: the ranked lists of documents retrieved for a specific set of queries by a retrieval model. We also extend ranx, a Python library for the evaluation and comparison of Information Retrieval runs, adding functionalities to seamlessly integrate the usage of ranxhub, allowing the user to compare the results of multiple systems in just a few lines of code. In this paper, we first outline the many advantages and implications that an online repository for sharing runs can bring to the table. Then, we introduce ranxhub and its integration with ranx, showing its very simple usage. Finally, we discuss some use cases for which ranxhub can be highly valuable for the research community.
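A brief usage sketch follows; the hub identifiers are placeholders, and the loader names assumed here (`Run.from_ranxhub`, `Qrels.from_file`) should be checked against the library documentation.

```python
from ranx import Qrels, Run, compare

# Ground truth from a local file; runs pulled from the shared repository
# (the hub identifiers below are placeholders, not real entries)
qrels = Qrels.from_file("qrels.json")
run_a = Run.from_ranxhub("placeholder/system-a")
run_b = Run.from_ranxhub("placeholder/system-b")

# Compare the shared systems with a few lines of code
report = compare(qrels=qrels, runs=[run_a, run_b], metrics=["map@100", "ndcg@10"])
print(report)
```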
This paper presents ranx.fuse, a Python library for Metasearch. Built following a user-centered design, it provides easy-to-use tools for combining the results of multiple search engines. ranx.fuse comprises 25 Metasearch algorithms implemented with Numba, a just-in-time compiler for Python code, for efficient vector operations and automatic parallelization. Moreover, in conjunction with the Metasearch algorithms, our library implements six normalization strategies that transform the search engines’ result lists to make them comparable, a mandatory step for Metasearch. Finally, as many Metasearch algorithms require a training or optimization step, ranx.fuse offers a convenient functionality for their optimization that evaluates pre-defined hyper-parameter configurations via grid search. By relying on the provided functions, the user can optimally combine the results of multiple search engines in very few lines of code. ranx.fuse can also serve as a user-friendly tool for fusing the rankings computed by a first-stage retriever and a re-ranker, as a library providing several baselines for Metasearch, and as a playground to test novel normalization strategies.
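The sketch below shows what such an optimize-then-fuse workflow typically looks like with the fusion functionality shipped in ranx; the file paths are placeholders and the exact signatures should be verified against the current documentation.

```python
from ranx import Qrels, Run, fuse, optimize_fusion

# Placeholder paths: ground truth and two runs to be combined
qrels = Qrels.from_file("qrels.json")
lexical = Run.from_file("bm25.json")
neural = Run.from_file("dense.json")

# Grid-search the hyper-parameters of a weighted-sum fusion
best_params = optimize_fusion(
    qrels=qrels,
    runs=[lexical, neural],
    norm="min-max",
    method="wsum",
    metric="ndcg@100",
)

# Normalize the result lists, combine them, and save the fused run
combined = fuse(runs=[lexical, neural], norm="min-max", method="wsum", params=best_params)
combined.save("fused_run.json")
```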
Personalization in Information Retrieval has been a hot topic in both academia and industry for the past two decades. However, there is still a lack of high-quality standard benchmark datasets for conducting offline comparative evaluations in this context. To mitigate this problem, in the past few years, approaches to derive synthetic datasets suited for evaluating Personalized Search models have been proposed. In this paper, we put forward a novel evaluation benchmark for Personalized Search with more than 18 million documents and 1.9 million queries across four domains. We present a detailed description of the benchmark construction procedure, highlighting its characteristics and challenges. We provide baseline performance figures, including those of pre-trained neural models, leaving room for the evaluation of personalized approaches, as well as domain adaptation and transfer learning scenarios. We make both the datasets and the models available for future research.
This paper presents ranx, a Python evaluation library for Information Retrieval built on top of Numba. ranx provides a user-friendly interface to the most common ranking evaluation metrics, such as MAP, MRR, and NDCG. Moreover, it offers a convenient way of managing the evaluation results, comparing different runs, performing statistical tests between them, and exporting LaTeX tables ready to be used in scientific publications, all in a few lines of code. The efficiency brought by Numba, a just-in-time compiler for Python code, makes the adoption of ranx convenient even for industrial applications.
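A short sketch of typical usage follows; the toy qrels and runs are placeholders meant only to show the general shape of the calls.

```python
from ranx import Qrels, Run, evaluate, compare

# Toy relevance judgments and two toy runs
qrels = Qrels({"q_1": {"doc_a": 1}, "q_2": {"doc_b": 1}})
run_1 = Run({"q_1": {"doc_a": 0.9, "doc_c": 0.3}, "q_2": {"doc_b": 0.8}}, name="bm25")
run_2 = Run({"q_1": {"doc_c": 0.7, "doc_a": 0.5}, "q_2": {"doc_b": 0.6}}, name="dense")

# Single-run evaluation on multiple metrics
print(evaluate(qrels, run_1, ["map@100", "mrr", "ndcg@10"]))

# Multi-run comparison with statistical testing and a LaTeX-ready report
report = compare(qrels=qrels, runs=[run_1, run_2], metrics=["map@100", "ndcg@10"], max_p=0.01)
print(report)             # tabular summary
print(report.to_latex())  # LaTeX table for publications
```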
Semantic Query Labeling is the task of locating the constituent parts of a query and assigning domain-specific semantic labels to each of them. It allows unfolding the relations between the query terms and the documents’ structure while leaving unaltered the keyword-based query formulation. In this paper, we investigate the pre-training of a semantic query-tagger with synthetic data generated by leveraging the documents’ structure. By simulating a dynamic environment, we also evaluate the consistency of the performance improvements brought by pre-training as real-world training data becomes available. The results of our experiments suggest both the utility of pre-training with synthetic data and the consistency of its improvements over time.
In recent years, a multitude of e-commerce websites have arisen. Product Search is a fundamental part of these websites, and it is often managed as a traditional retrieval task. However, Product Search has the ultimate goal of satisfying specific and personal user needs, leading users to find and purchase what they are looking for based on their preferences. To maximize users’ satisfaction, Product Search should be treated as a personalized task. In this paper, we propose and evaluate a simple yet effective personalized results re-ranking approach based on the fusion of the relevance score computed by a well-known ranking model, namely BM25, with the scores deriving from multiple user/item representations. Our main contributions are: (1) we propose a score fusion-based approach for personalized re-ranking that leverages multiple user/item representations, (2) our approach accounts for both content-based features and collaborative information (i.e., features extracted from the user-item interaction graph), (3) the proposed approach is fast and scalable, can be easily added on top of any search engine, and can be extended to include additional features. The performed comparative evaluations show that our model can significantly increase the retrieval effectiveness of the underlying retrieval model and, in the great majority of cases, outperforms modern Neural Network-based personalized retrieval models for Product Search.
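The following simplified sketch conveys the flavor of such a score-fusion re-ranking; the min-max normalization, the averaging of user/item similarities, and the linear combination weight are illustrative choices rather than the exact model evaluated in the paper.

```python
import numpy as np

def personalized_rerank(
    bm25_scores: dict[str, float],      # doc_id -> BM25 score for the current query
    doc_vecs: dict[str, np.ndarray],    # doc_id -> item representation
    user_vecs: list[np.ndarray],        # multiple user representations (content-based, graph-based, ...)
    weight: float = 0.5,                # illustrative trade-off between relevance and personalization
) -> list[tuple[str, float]]:
    """Illustrative score fusion: min-max normalize BM25, average the user/item
    cosine similarities, and combine the two signals linearly."""
    lo, hi = min(bm25_scores.values()), max(bm25_scores.values())
    fused = {}
    for doc_id, bm25 in bm25_scores.items():
        rel = (bm25 - lo) / (hi - lo + 1e-9)
        d = doc_vecs[doc_id]
        sims = [
            float(u @ d / (np.linalg.norm(u) * np.linalg.norm(d) + 1e-9))
            for u in user_vecs
        ]
        fused[doc_id] = weight * rel + (1 - weight) * float(np.mean(sims))
    return sorted(fused.items(), key=lambda x: x[1], reverse=True)
```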
This manuscript discusses our ongoing work on ranx, a Python evaluation library for Information Retrieval. First, we introduce our work, summarize the already available functionalities, show the user-friendly nature of our tool through code snippets, and briefly discuss the technologies we relied on for the implementation and their advantages. Then, we present the upcoming features, such as several Metasearch algorithms, and introduce the long-term goals of our project.
Searching in a domain-specific corpus of structured documents (e.g., e-commerce, media streaming services, job-seeking platforms) is often managed as a traditional retrieval task or through faceted search. Semantic Query Labeling, the task of locating the constituent parts of a query and assigning domain-specific predefined semantic labels to each of them, allows leveraging the structure of documents during retrieval while leaving unaltered the keyword-based query formulation. Due to both the lack of a publicly available dataset and the high cost of producing one, there have been few published works in this regard. In this paper, based on the assumption that a corpus already contains the information users search for, we propose a method for the automatic generation of semantically labeled queries and show that a semantic tagger, based on BERT, gazetteer-based features, and Conditional Random Fields, trained on our synthetic queries achieves results comparable to those obtained by the same model trained on real-world data. We also provide a large dataset of manually annotated queries in the movie domain suitable for studying Semantic Query Labeling. We hope that the public availability of this dataset will stimulate future research in this area.
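As a toy illustration of this kind of synthetic query generation, the sketch below samples field values from a structured document and labels each token with the field it came from; the field names and the sampling strategy are placeholders, not the proposed method.

```python
import random

def generate_labeled_query(doc: dict[str, str], fields: list[str], max_fields: int = 3):
    """Toy generation of a synthetic, semantically labeled query from a structured
    document: sample a few field values and tag each token with the field it came
    from, so the labels follow the documents' structure."""
    chosen = random.sample(fields, k=min(max_fields, len(fields)))
    tokens, labels = [], []
    for field in chosen:
        for token in str(doc[field]).lower().split():
            tokens.append(token)
            labels.append(field)
    return tokens, labels

# Example on a movie-like record (field names are illustrative)
movie = {"title": "Blade Runner", "director": "Ridley Scott", "year": "1982"}
print(generate_labeled_query(movie, ["title", "director", "year"]))
```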
In this manuscript, we review the work we have undertaken to build a large-scale benchmark dataset for an understudied Information Retrieval task called Semantic Query Labeling. This task is particularly relevant for search tasks that involve structured documents, such as Vertical Search, and consists of automatically recognizing the parts that compose a query and unfolding the relations between the query terms and the documents’ fields. We first motivate the importance of building novel evaluation datasets for less popular Information Retrieval tasks. Then, we give an in-depth description of the procedure we followed to build our dataset.
With the development of Web 2.0 technologies, people have gone from being mere content consumers to content generators. In this context, the evaluation of the quality of (potential) information available online has become a crucial issue. Nowadays, one of the biggest online resources that users rely on as a knowledge base is Wikipedia. The collaborative process at the basis of Wikipedia can lead to the creation of low-quality articles, or even misinformation, if the generation and revision of articles are not monitored in a precise and timely way. For this reason, in this paper we consider the problem of automatically evaluating the quality of Wikipedia content, proposing a supervised approach based on Machine Learning to classify articles on qualitative bases. With respect to prior literature, we take into account a wider set of features connected to Wikipedia articles, as well as previously unconsidered aspects related to the construction of a labeled dataset for training the model, and we employ Gradient Boosting, which produced encouraging results.
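A minimal, hedged sketch of such a supervised setup is given below; the hand-crafted features and the commented-out training snippet are illustrative placeholders rather than the feature set and pipeline actually used.

```python
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import cross_val_score

def extract_features(article: dict) -> list[float]:
    """Toy hand-crafted features of the kind used for quality assessment:
    article length, number of references, images, and editors. Real feature
    sets are far richer; these are illustrative placeholders."""
    text = article["text"]
    return [
        float(len(text.split())),            # article length in words
        float(article.get("n_references", 0)),
        float(article.get("n_images", 0)),
        float(article.get("n_editors", 0)),
    ]

# Usage sketch on a labeled collection of articles (placeholders):
# X = [extract_features(a) for a in labeled_articles]
# y = [a["quality_label"] for a in labeled_articles]
# clf = GradientBoostingClassifier()
# print(cross_val_score(clf, X, y, cv=5).mean())
```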
Wikipedia is nowadays one of the biggest online resources on which users rely as a source of information. The amount of collaboratively generated content that is submitted to the online encyclopedia every day can lead to the creation of low-quality articles (and, consequently, misinformation) if it is not properly monitored and revised. For this reason, in this paper the problem of automatically assessing the quality of Wikipedia articles is considered. In particular, the focus is (i) on the analysis of groups of hand-crafted features that can be employed by supervised machine learning techniques to classify Wikipedia articles on qualitative bases, and (ii) on the analysis of some issues behind the construction of a suitable ground truth. Evaluations are performed, on the analyzed features and on a specifically built labeled dataset, by implementing different supervised classifiers based on distinct machine learning algorithms, which produced promising results.