Generative AI systems powered by Large Language Models have become increasingly popular in recent years. However, due to the risk of providing users with unsafe information, their adoption in safety-critical domains has raised significant concerns. In response, input-output filters, commonly called guardrail models, have been proposed to complement other measures, such as model alignment. Unfortunately, the lack of a standard benchmark for guardrail models poses significant evaluation issues and makes it hard to compare results across scientific publications. To fill this gap, we introduce GuardBench, a large-scale benchmark for guardrail models comprising 40 safety evaluation datasets. To facilitate the adoption of GuardBench, we release a Python library providing an automated evaluation pipeline built on top of it. With our benchmark, we also share the first large-scale prompt moderation datasets in German, French, Italian, and Spanish. To assess the current state of the art, we conduct an extensive comparison of recent guardrail models and show that a general-purpose instruction-following model of comparable size achieves competitive results without the need for specific fine-tuning.
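As an illustration of how such an automated evaluation pipeline can be used, below is a minimal, hypothetical sketch of benchmarking a toy guardrail model with the guardbench library. The entry point, callback signature, and argument names are assumptions for illustration and may differ from the library's actual interface.

```python
# Hypothetical sketch: plugging a guardrail model into an automated
# evaluation pipeline such as the one released with GuardBench.
# NOTE: `benchmark` and its arguments are assumed names, not the
# library's documented API.
from guardbench import benchmark  # assumed entry point


def moderate(conversations: list[list[dict[str, str]]]) -> list[float]:
    """Return, for each conversation, the probability that it is unsafe.

    A real guardrail model would run inference here; this toy version
    simply flags conversations containing a blocklisted keyword.
    """
    blocklist = {"explosive", "malware"}
    scores = []
    for conversation in conversations:
        text = " ".join(turn["content"].lower() for turn in conversation)
        scores.append(1.0 if any(word in text for word in blocklist) else 0.0)
    return scores


if __name__ == "__main__":
    # Evaluate the callback on every dataset of the benchmark.
    benchmark(moderate, model_name="toy-keyword-guardrail", datasets="all")
```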
In Advances in Information Retrieval - 46th European Conference on Information Retrieval, ECIR 2024, Glasgow, UK, March 24-28, 2024, Proceedings, Part V, 2024
indxr is a Python utility for indexing file lines that allows users to dynamically access specific ones without loading the entire file into the computer’s main memory. indxr addresses two main issues related to working with textual data. First, users who do not have plenty of RAM at their disposal may struggle to work with large datasets. Since indxr allows accessing specific lines without loading entire files, users can work with datasets that do not fit into their computer’s main memory. For example, it enables users to perform complex tasks, such as pre-processing texts and training Neural models for Information Retrieval or other tasks, with limited RAM and without noticeable slowdowns. Second, indxr reduces the burden of working with datasets split among multiple files: users can load specific data by providing the related line numbers or the identifiers of the information they describe, thus gaining convenient access to such data. This paper overviews indxr’s main features.
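For a sense of the intended usage, here is a minimal sketch of line-level access with indxr. The constructor arguments and method names (key_id, get, mget) are assumptions based on typical usage and may differ from the library's actual interface.

```python
# Minimal sketch of dynamic line access with indxr (names are assumptions).
from indxr import Indxr

# Index the byte offset of each line of a JSONL file once, assuming each
# record carries an "id" field usable as a key.
index = Indxr("collection.jsonl", key_id="id")  # hypothetical arguments

first_doc = index[0]                             # fetch a single line by position
doc = index.get("doc_42")                        # fetch a line by its identifier
batch = index.mget(["doc_1", "doc_2", "doc_3"])  # fetch several lines at once
```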
Personalized Query Expansion, the task of expanding queries with additional terms extracted from the user-related vocabulary, is a well-known solution to improve the retrieval performance of a system w.r.t. short queries. Recent approaches rely on word embeddings to select expansion terms from user-related texts. Although delivering promising results with former word embedding techniques, we argue that these methods are not suited for contextual word embeddings, which produce a unique vector representation for each term occurrence. In this article, we propose a Personalized Query Expansion method designed to solve the issues arising from the use of contextual word embeddings with the current Personalized Query Expansion approaches based on word embeddings. Specifically, we employ a clustering-based procedure to identify the terms that better represent the user interests and to improve the diversity of those selected for expansion, achieving improvements of up to 4% w.r.t. the best-performing baseline in terms of MAP@100. Moreover, our approach outperforms previous ones in terms of efficiency, allowing us to achieve sub-millisecond expansion times even in data-rich scenarios. Finally, we introduce a novel metric to evaluate the diversity of the expansion terms and empirically show the unsuitability of previous approaches based on word embeddings when employed along with contextual word embeddings, as they cause the selection of semantically overlapping expansion terms.
@article{DBLP:journals/tois/BassaniTP24,author={Bassani, Elias and Tonellotto, Nicola and Pasi, Gabriella},title={Personalized Query Expansion with Contextual Word Embeddings},journal={{ACM} Trans. Inf. Syst.},volume={42},number={2},pages={61:1--61:35},year={2024},url={https://doi.org/10.1145/3624988},doi={10.1145/3624988},timestamp={Fri, 26 Jan 2024 07:56:37 +0100}}
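To make the clustering-based selection concrete, here is an illustrative sketch (not the paper's exact procedure) that clusters contextual embeddings of user-related term occurrences and keeps, for each cluster, the term closest to the query, trading off topicality and diversity.

```python
# Illustrative sketch: diversify expansion terms by clustering contextual
# embeddings of term occurrences from user-related texts (e.g., BERT vectors)
# and picking one query-relevant representative per cluster.
import numpy as np
from sklearn.cluster import KMeans


def select_expansion_terms(terms, embeddings, query_embedding, n_clusters=5):
    """terms: list[str]; embeddings: (n_terms, dim); query_embedding: (dim,).
    Assumes L2-normalized embeddings, so dot products act as cosine similarities."""
    labels = KMeans(n_clusters=n_clusters, n_init=10).fit_predict(embeddings)
    selected = []
    for c in range(n_clusters):
        members = np.where(labels == c)[0]
        if members.size == 0:
            continue
        # Within each cluster, keep the term most similar to the query, so
        # expansions stay on-topic while covering distinct user interests.
        sims = embeddings[members] @ query_embedding
        selected.append(terms[members[np.argmax(sims)]])
    return selected
```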
The personalization of search results has gained increasing attention in the past few years, thanks to the development of Neural Networks-based approaches for Information Retrieval and the importance of personalization in many search scenarios. Recent works have proposed to build user models at query time by leveraging the Attention mechanism, which allows weighing the contribution of the user-related information w.r.t. the current query. This approach allows taking into account the diversity of the user’s interests by giving more importance to those related to the current search performed by the user. In this paper, we first discuss some shortcomings of the standard Attention formulation when employed for personalization. In particular, we focus on issues related to its normalization mechanism and its inability to entirely filter out noisy user-related information. Then, we introduce the Denoising Attention mechanism: an Attention variant that directly tackles the above shortcomings by adopting a robust normalization scheme and introducing a filtering mechanism. The reported experimental evaluation shows the benefits of the proposed approach over other Attention-based variants.
@article{DBLP:journals/corr/abs-2308-15968,author={Bassani, Elias and Kasela, Pranav and Pasi, Gabriella},title={Denoising Attention for Query-aware User Modeling in Personalized Search},journal={CoRR},volume={abs/2308.15968},year={2023},url={https://doi.org/10.48550/arXiv.2308.15968},doi={10.48550/arXiv.2308.15968},eprinttype={arXiv},eprint={2308.15968},timestamp={Mon, 04 Sep 2023 15:29:24 +0200}}
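For context, the snippet below sketches the standard softmax-normalized, query-aware attention over user-related representations that the paper starts from; Denoising Attention replaces this normalization and adds a filtering mechanism, and its exact formulation is given in the paper.

```python
# Generic sketch of query-aware attention for user modeling: the user model
# is a weighted sum of user-related document embeddings, weighted by their
# similarity to the current query. Softmax forces the weights to sum to one,
# so noisy user-related items still receive probability mass, which is one of
# the shortcomings the paper addresses.
import torch
import torch.nn.functional as F


def softmax_user_model(query: torch.Tensor, user_docs: torch.Tensor) -> torch.Tensor:
    """query: (dim,); user_docs: (n_docs, dim). Returns a (dim,) user vector."""
    scores = user_docs @ query          # similarity of each user item to the query
    weights = F.softmax(scores, dim=0)  # standard normalization
    return weights @ user_docs          # query-aware user representation
```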
ranxhub is an online repository for sharing artifacts deriving from the evaluation of Information Retrieval systems. Specifically, we provide a platform for sharing pre-computed runs: the ranked lists of documents retrieved for a specific set of queries by a retrieval model. We also extend ranx, a Python library for the evaluation and comparison of Information Retrieval runs, adding functionalities to integrate the usage of ranxhub seamlessly, allowing the user to compare the results of multiple systems in just a few lines of code. In this paper, we first outline the many advantages and implications that an online repository for sharing runs can bring to the table. Then, we introduce ranxhub and its integration with ranx, showing its very simple usage. Finally, we discuss some use cases for which ranxhub can be highly valuable for the research community.
@inproceedings{DBLP:conf/sigir/Bassani23,author={Bassani, Elias},title={ranxhub: An Online Repository for Information Retrieval Runs},booktitle={{SIGIR}},publisher={{ACM}},year={2023},url={https://doi.org/10.1145/3539618.3591823},doi={10.1145/3539618.3591823},timestamp={Fri, 21 Jul 2023 22:25:24 +0200},}
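A minimal sketch of fetching a shared run from ranxhub and comparing it against a local one with ranx; the run identifier and file paths are illustrative, and the constructor names follow the ranx documentation to the best of our knowledge.

```python
# Sketch: download a pre-computed run shared on ranxhub and compare it with
# a locally produced run. The run id and file names are placeholders.
from ranx import Qrels, Run, compare

qrels = Qrels.from_file("qrels.json")                    # relevance judgments
shared_run = Run.from_ranxhub("collection/bm25-run-id")  # hypothetical ranxhub id
my_run = Run.from_file("my_system.trec", kind="trec")    # local run in TREC format

report = compare(qrels=qrels, runs=[shared_run, my_run], metrics=["map@100", "ndcg@10"])
print(report)  # side-by-side comparison of the two systems
```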
This paper presents ranx.fuse, a Python library for Metasearch. Built following a user-centered design, it provides easy-to-use tools for combining the results of multiple search engines. ranx.fuse comprises 25 Metasearch algorithms implemented with Numba, a just-in-time compiler for Python code, for efficient vector operations and automatic parallelization. Moreover, in conjunction with the Metasearch algorithms, our library implements six normalization strategies that transform the search engines’ result lists to make them comparable, a mandatory step for Metasearch. Finally, as many Metasearch algorithms require a training or optimization step, ranx.fuse offers a convenient functionality for their optimization that evaluates pre-defined hyper-parameter configurations via grid search. By relying on the provided functions, the user can optimally combine the results of multiple search engines in very few lines of code. ranx.fuse can also serve as a user-friendly tool for fusing the rankings computed by a first-stage retriever and a re-ranker, as a library providing several baselines for Metasearch, and as a playground to test novel normalization strategies.
@inproceedings{DBLP:conf/cikm/BassaniR22,author={Bassani, Elias and Romelli, Luca},editor={Hasan, Mohammad Al and Xiong, Li},title={ranx.fuse: {A} Python Library for Metasearch},booktitle={{CIKM}},pages={4808--4812},publisher={{ACM}},year={2022},url={https://doi.org/10.1145/3511808.3557207},doi={10.1145/3511808.3557207},timestamp={Wed, 19 Oct 2022 17:09:02 +0200},}
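The sketch below shows how the optimization and fusion steps fit together, using the fusion functionality that now ships with ranx; the chosen normalization, fusion method, and file paths are illustrative.

```python
# Sketch: optimize a weighted-sum fusion of two runs on training queries,
# then fuse the min-max normalized runs with the optimized weights.
from ranx import Qrels, Run, fuse, optimize_fusion

qrels = Qrels.from_file("train_qrels.json")
bm25 = Run.from_file("bm25.trec", kind="trec")
dense = Run.from_file("dense.trec", kind="trec")

best_params = optimize_fusion(
    qrels=qrels,
    runs=[bm25, dense],
    norm="min-max",     # normalization strategy applied before fusion
    method="wsum",      # weighted-sum Metasearch algorithm
    metric="ndcg@100",  # metric optimized via grid search
)
combined_run = fuse(runs=[bm25, dense], norm="min-max", method="wsum", params=best_params)
```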
Personalization in Information Retrieval has been a hot topic in both academia and industry for the past two decades. However, there is still a lack of high-quality standard benchmark datasets for conducting offline comparative evaluations in this context. To mitigate this problem, in the past few years, approaches to derive synthetic datasets suited for evaluating Personalized Search models have been proposed. In this paper, we put forward a novel evaluation benchmark for Personalized Search with more than 18 million documents and 1.9 million queries across four domains. We present a detailed description of the benchmark construction procedure, highlighting its characteristics and challenges. We provide baseline performance, including pre-trained neural models, leaving room for the evaluation of personalized approaches, as well as domain adaptation and transfer learning scenarios. We make both datasets and models available for future research.
@inproceedings{DBLP:conf/cikm/BassaniKRP22,author={Bassani, Elias and Kasela, Pranav and Raganato, Alessandro and Pasi, Gabriella},editor={Hasan, Mohammad Al and Xiong, Li},title={A Multi-Domain Benchmark for Personalized Search Evaluation},booktitle={{CIKM}},pages={3822--3827},publisher={{ACM}},year={2022},url={https://doi.org/10.1145/3511808.3557536},doi={10.1145/3511808.3557536},timestamp={Wed, 19 Oct 2022 17:09:02 +0200},}
This paper presents ranx, a Python evaluation library for Information Retrieval built on top of Numba. ranx provides a user-friendly interface to the most common ranking evaluation metrics, such as MAP, MRR, and NDCG. Moreover, it offers a convenient way of managing the evaluation results, comparing different runs, performing statistical tests between them, and exporting LaTeX tables ready to be used in scientific publications, all in a few lines of code. The efficiency brought by Numba, a just-in-time compiler for Python code, makes the adoption of ranx convenient even for industrial applications.
@inproceedings{DBLP:conf/ecir/Bassani22,author={Bassani, Elias},editor={Hagen, Matthias and Verberne, Suzan and Macdonald, Craig and Seifert, Christin and Balog, Krisztian and N{\o}rv{\aa}g, Kjetil and Setty, Vinay},title={ranx: {A} Blazing-Fast Python Library for Ranking Evaluation and Comparison},booktitle={{ECIR}},series={Lecture Notes in Computer Science},volume={13186},pages={259--264},publisher={Springer},year={2022},url={https://doi.org/10.1007/978-3-030-99739-7\_30},doi={10.1007/978-3-030-99739-7\_30},timestamp={Wed, 27 Apr 2022 20:12:25 +0200},}
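The following snippet illustrates the typical ranx workflow described above: evaluating a run, comparing multiple runs with statistical tests, and exporting a LaTeX table (metric names and data are illustrative).

```python
from ranx import Qrels, Run, evaluate, compare

qrels = Qrels({"q_1": {"doc_a": 1, "doc_b": 2}})
run_1 = Run({"q_1": {"doc_a": 0.8, "doc_b": 0.7, "doc_c": 0.3}}, name="bm25")
run_2 = Run({"q_1": {"doc_b": 0.9, "doc_a": 0.4}}, name="reranker")

# Single-run evaluation: returns a dict of metric scores.
print(evaluate(qrels, run_1, ["map@100", "mrr", "ndcg@10"]))

# Multi-run comparison with statistical significance testing.
report = compare(qrels=qrels, runs=[run_1, run_2], metrics=["map@100", "ndcg@10"], max_p=0.05)
print(report)             # comparison table
print(report.to_latex())  # LaTeX table ready for a publication
```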
Semantic Query Labeling is the task of locating the constituent parts of a query and assigning domain-specific semantic labels to each of them. It allows unfolding the relations between the query terms and the documents’ structure while leaving unaltered the keyword-based query formulation. In this paper, we investigate the pre-training of a semantic query-tagger with synthetic data generated by leveraging the documents’ structure. By simulating a dynamic environment, we also evaluate the consistency of performance improvements brought by pre-training as real-world training data becomes available. The results of our experiments suggest both the utility of pre-training with synthetic data and the consistency of its improvements over time.
@inproceedings{DBLP:conf/ecir/BassaniP22,author={Bassani, Elias and Pasi, Gabriella},editor={Hagen, Matthias and Verberne, Suzan and Macdonald, Craig and Seifert, Christin and Balog, Krisztian and N{\o}rv{\aa}g, Kjetil and Setty, Vinay},title={Evaluating the Use of Synthetic Queries for Pre-training a Semantic Query Tagger},booktitle={{ECIR}},series={Lecture Notes in Computer Science},volume={13186},pages={39--46},publisher={Springer},year={2022},url={https://doi.org/10.1007/978-3-030-99739-7\_5},doi={10.1007/978-3-030-99739-7\_5},timestamp={Wed, 27 Apr 2022 20:12:25 +0200},}
In recent years, a multitude of e-commerce websites have arisen. Product Search is a fundamental part of these websites, which is often managed as a traditional retrieval task. However, Product Search has the ultimate goal of satisfying specific and personal user needs, leading users to find and purchase what they are looking for, based on their preferences. To maximize users’ satisfaction, Product Search should be treated as a personalized task. In this paper, we propose and evaluate a simple yet effective personalized results re-ranking approach based on the fusion of the relevance score computed by a well-known ranking model, namely BM25, with the scores deriving from multiple user/item representations. Our main contributions are: (1) we propose a score fusion-based approach for personalized re-ranking that leverages multiple user/item representations, (2) our approach accounts for both content-based features and collaborative information (i.e. features extracted from the user-item interactions graph), (3) the proposed approach is fast and scalable, can be easily added on top of any search engine, and can be extended to include additional features. The performed comparative evaluations show that our model can significantly increase the retrieval effectiveness of the underlying retrieval model and, in the great majority of cases, outperforms modern Neural Network-based personalized retrieval models for Product Search.
@article{DBLP:journals/inffus/BassaniP22,author={Bassani, Elias and Pasi, Gabriella},title={A multi-representation re-ranking model for Personalized Product Search},journal={Inf. Fusion},volume={81},pages={240--249},year={2022},url={https://doi.org/10.1016/j.inffus.2021.11.010},doi={10.1016/j.inffus.2021.11.010},timestamp={Wed, 23 Feb 2022 11:17:45 +0100},}
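To make the score-fusion idea concrete, here is an illustrative sketch (not the paper's exact model) that re-ranks BM25 candidates by fusing the normalized BM25 score with similarities between each item and multiple user representations, such as a content-based one and a graph-based one.

```python
# Illustrative sketch of personalized re-ranking via score fusion.
import numpy as np


def min_max(x: np.ndarray) -> np.ndarray:
    # Min-max normalization makes heterogeneous scores comparable before fusion.
    return (x - x.min()) / (x.max() - x.min() + 1e-9)


def rerank(bm25_scores, item_vectors, user_vectors, weights, bm25_weight=1.0):
    """bm25_scores: (n_items,); item_vectors, user_vectors, and weights are dicts
    keyed by representation name (e.g., "content", "graph"). Vectors are assumed
    to be L2-normalized, so dot products act as cosine similarities."""
    fused = bm25_weight * min_max(bm25_scores)
    for name, items in item_vectors.items():      # items: (n_items, dim)
        sims = items @ user_vectors[name]          # user_vectors[name]: (dim,)
        fused = fused + weights[name] * min_max(sims)
    return np.argsort(-fused)                      # item indices, best first
```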
This manuscript discusses our ongoing work on ranx, a Python evaluation library for Information Retrieval. First, we introduce our work, summarize the already available functionalities, show the user-friendly nature of our tool through code snippets, and briefly discuss the technologies we relied on for the implementation and their advantages. Then, we present the upcoming features, such as several Metasearch algorithms, and introduce the long-term goals of our project.
@inproceedings{DBLP:conf/iir/Bassani22,author={Bassani, Elias},editor={Pasi, Gabriella and Cremonesi, Paolo and Orlando, Salvatore and Zanker, Markus and Massimo, David and Turati, Gloria},title={Towards an Information Retrieval Evaluation Library},booktitle={IIR},series={{CEUR} Workshop Proceedings},volume={3177},publisher={CEUR-WS.org},year={2022},timestamp={Fri, 10 Mar 2023 16:22:48 +0100},}
Searching in a domain-specific corpus of structured documents (e.g., e-commerce, media streaming services, job-seeking platforms) is often managed as a traditional retrieval task or through faceted search. Semantic Query Labeling — the task of locating the constituent parts of a query and assigning domain-specific predefined semantic labels to each of them — allows leveraging the structure of documents during retrieval while leaving unaltered the keyword-based query formulation. Due to both the lack of a publicly available dataset and the high cost of producing one, there have been few published works in this regard. In this paper, based on the assumption that a corpus already contains the information the users search for, we propose a method for the automatic generation of semantically labeled queries and show that a semantic tagger — based on BERT, gazetteer-based features, and Conditional Random Fields — trained on our synthetic queries achieves results comparable to those obtained by the same model trained on real-world data. We also provide a large dataset of manually annotated queries in the movie domain suitable for studying Semantic Query Labeling. We hope that the public availability of this dataset will stimulate future research in this area.
@inproceedings{DBLP:conf/sigir/BassaniP21,author={Bassani, Elias and Pasi, Gabriella},editor={Diaz, Fernando and Shah, Chirag and Suel, Torsten and Castells, Pablo and Jones, Rosie and Sakai, Tetsuya},title={Semantic Query Labeling Through Synthetic Query Generation},booktitle={{SIGIR}},pages={2278--2282},publisher={{ACM}},year={2021},url={https://doi.org/10.1145/3404835.3463071},doi={10.1145/3404835.3463071},timestamp={Sun, 02 Oct 2022 16:15:14 +0200},}
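The core idea can be sketched as follows: sample values from the structured fields of a document and let each query token inherit the label of the field it came from. This is only an illustration; the paper's generation procedure is more elaborate.

```python
# Illustrative sketch: generate a synthetic, semantically labeled query from a
# structured document by sampling field values and labeling tokens with the
# name of their source field.
import random


def synthetic_query(document: dict[str, str], fields: list[str], max_fields: int = 2):
    chosen = random.sample(fields, k=min(max_fields, len(fields)))
    tokens, labels = [], []
    for field in chosen:
        for token in document[field].split():
            tokens.append(token.lower())
            labels.append(field)  # the field name acts as the semantic label
    return tokens, labels


# Example with a movie document:
doc = {"title": "The Matrix", "director": "Lana Wachowski", "year": "1999"}
print(synthetic_query(doc, ["title", "director"]))
# e.g. (['the', 'matrix', 'lana', 'wachowski'], ['title', 'title', 'director', 'director'])
```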
In this manuscript, we review the work we undertook to build a large-scale benchmark dataset for an understudied Information Retrieval task called Semantic Query Labeling. This task is particularly relevant for search scenarios that involve structured documents, such as Vertical Search, and consists of automatically recognizing the parts that compose a query and unfolding the relations between the query terms and the documents’ fields. We first motivate the importance of building novel evaluation datasets for less popular Information Retrieval tasks. Then, we give an in-depth description of the procedure we followed to build our dataset.
@inproceedings{DBLP:conf/iir/BassaniP21,author={Bassani, Elias and Pasi, Gabriella},editor={Anelli, Vito Walter and Noia, Tommaso Di and Ferro, Nicola and Narducci, Fedelucio},title={On Building Benchmark Datasets for Understudied Information Retrieval Tasks: the Case of Semantic Query Labeling},booktitle={IIR},series={{CEUR} Workshop Proceedings},volume={2947},publisher={CEUR-WS.org},year={2021},timestamp={Fri, 10 Mar 2023 16:22:48 +0100},}
With the development of Web 2.0 technologies, people have gone from being mere content users to content generators. In this context, the evaluation of the quality of (potential) information available online has become a crucial issue. Nowadays, one of the biggest online resources that users rely on as a knowledge base is Wikipedia. The collaborative aspect at the basis of Wikipedia can lead to the creation of low-quality articles or even misinformation if the process of monitoring the generation and the revision of articles is not performed in a precise and timely way. For this reason, in this paper, the problem of automatically evaluating the quality of Wikipedia contents is considered, by proposing a supervised approach based on Machine Learning to perform the classification of articles on a qualitative basis. With respect to prior literature, a wider set of features connected to Wikipedia articles has been taken into account, as well as previously unconsidered aspects connected to the generation of a labeled dataset to train the model, and the use of Gradient Boosting, which produced encouraging results.
@inproceedings{DBLP:conf/sac/BassaniV19,author={Bassani, Elias and Viviani, Marco},editor={Hung, Chih{-}Cheng and Papadopoulos, George A.},title={Automatically assessing the quality of Wikipedia contents},booktitle={{SAC}},pages={804--807},publisher={{ACM}},year={2019},url={https://doi.org/10.1145/3297280.3297357},doi={10.1145/3297280.3297357},timestamp={Sun, 02 Oct 2022 16:14:25 +0200},}
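A minimal sketch of the supervised setup described above: classifying Wikipedia articles into quality classes from hand-crafted features with Gradient Boosting. The file name and feature layout are assumptions for illustration.

```python
# Sketch: quality classification of Wikipedia articles with Gradient Boosting.
import pandas as pd
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split

# Assumed CSV with one row per article: hand-crafted features plus a quality label.
data = pd.read_csv("wikipedia_article_features.csv")
X = data.drop(columns=["quality_class"])
y = data["quality_class"]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
clf = GradientBoostingClassifier().fit(X_train, y_train)
print(classification_report(y_test, clf.predict(X_test)))
```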
Wikipedia is nowadays one of the biggest online resources on which users rely as a source of information. The amount of collaboratively generated content that is sent to the online encyclopedia every day can lead to the creation of low-quality articles (and, consequently, misinformation) if not properly monitored and revised. For this reason, in this paper, the problem of automatically assessing the quality of Wikipedia articles is considered. In particular, the focus is (i) on the analysis of groups of hand-crafted features that can be employed by supervised machine learning techniques to classify Wikipedia articles on a qualitative basis, and (ii) on the analysis of some issues behind the construction of a suitable ground truth. Evaluations are performed, on the analyzed features and on a specifically built labeled dataset, by implementing different supervised classifiers based on distinct machine learning algorithms, which produced promising results.
@inproceedings{DBLP:conf/ic3k/BassaniV19,author={Bassani, Elias and Viviani, Marco},editor={Fred, Ana L. N. and Filipe, Joaquim},title={Quality of Wikipedia Articles: Analyzing Features and Building a Ground Truth for Supervised Classification},booktitle={{KDIR}},pages={338--346},publisher={ScitePress},year={2019},url={https://doi.org/10.5220/0008149303380346},doi={10.5220/0008149303380346},timestamp={Sun, 02 Oct 2022 16:03:00 +0200},}