# SEMINARIO DI INFORMATICA

**Course coordinator:** Alberto Policriti

**Duration:** 28 hours

## Programma

**SCIENTIFIC DISCIPLINARY SECTOR - INF/01**

**6h - Prof. Alberto Policriti**

**6h - Prof. Giovanna D'Agostino**

**16h - Lectures by Dr. N. Prezza and Dr. A. Tomescu**

The disciplinary course "Seminario di Informatica" will illustrate tools, techniques, and results (some of them very recent) relating to the algorithmic aspects of representing large text datasets. In particular, the themes of indexing, compression, and retrieval will be introduced and illustrated on a variety of data structures, including automata.

The course will consist of the following three parts. It will be moderately technical, and its only prerequisite is mathematical maturity.

Contact Prof. Policriti (alberto.policriti@uniud.it) for any further questions.

---

**Part I. Title: Compressed Text Indexing (N. Prezza)**

Text indexing is a classical algorithmic problem that has been studied for over four decades: given a string S, the goal is to build a data structure (an index) that supports efficiently locating the substrings of S. The first optimal-time solution to the problem, suffix trees, dates back to 1973. A suffix tree, however, requires too much space to be stored: in practice, several times the space of S itself.
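As a concrete (if deliberately naive) illustration of the indexing problem, the sketch below builds the suffix array of S, the array-based cousin of the suffix tree, and answers locate queries by binary search over the sorted suffixes. All names here are illustrative, not part of any particular library.

```python
from bisect import bisect_left

def build_suffix_array(s):
    """Starting positions of the suffixes of s, in lexicographic order.
    (Naive O(n^2 log n) construction; linear-time algorithms exist.)"""
    return sorted(range(len(s)), key=lambda i: s[i:])

def locate(s, sa, pattern):
    """All positions where pattern occurs in s, via binary search on sa."""
    suffixes = [s[i:] for i in sa]   # materialized only for clarity
    lo = bisect_left(suffixes, pattern)
    hi = lo
    while hi < len(sa) and suffixes[hi].startswith(pattern):
        hi += 1
    return sorted(sa[lo:hi])

sa = build_suffix_array("mississippi")
print(locate("mississippi", sa, "ssi"))  # [2, 5]
```

The suffix array alone already takes several times the space of S (one integer per character), which is exactly the overhead that the compressed indexes discussed below are designed to avoid.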

Due to the recent explosion in the production of large repetitive datasets in applications such as bioinformatics (where nowadays it is required to index thousands of human genomes), the indexing community has lately shifted its attention towards techniques aimed at building compressed indexes. The space of a compressed index is proportional to the amount of information contained in S, rather than to its length.

In this course I will give an overview of the main techniques developed to date to solve the compressed indexing problem. In particular, I will discuss standard solutions based on compressed suffix arrays and on the Lempel-Ziv compression algorithm [1]. Finally, I will introduce a recent technique, string attractors [2], that generalizes all previous solutions and makes it possible to obtain, for the first time, universally compressed indexes that answer queries in optimal time [3].
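To make the notion of string attractor [2] concrete: a set Γ of positions is an attractor of S if every substring of S has at least one occurrence that spans a position of Γ. The brute-force checker below (an illustrative sketch, not the algorithmics of the paper) verifies this condition directly:

```python
def crosses(k, length, gset):
    """Does the occurrence s[k : k+length] contain a position of gset?"""
    return any(k <= g < k + length for g in gset)

def is_attractor(s, gamma):
    """Naive cubic-time check that gamma (0-based positions) is a
    string attractor of s: every substring must have some occurrence
    crossing a position of gamma."""
    gset = set(gamma)
    n = len(s)
    for i in range(n):
        for j in range(i + 1, n + 1):
            sub = s[i:j]
            k = s.find(sub)
            while k != -1 and not crosses(k, len(sub), gset):
                k = s.find(sub, k + 1)
            if k == -1:
                return False  # sub never crosses gamma
    return True

print(is_attractor("aabba", [1, 3]))  # True
print(is_attractor("aabba", [0]))     # False: "b" never crosses position 0
```

The size of the smallest attractor lower-bounds the output of essentially all dictionary compressors, which is what makes the notion a common ground for the solutions above.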

[1] Navarro G, Mäkinen V.: Compressed full-text indexes. ACM Computing Surveys (CSUR). 2007 Apr 12;39(1):2.

[2] Kempa D, Prezza N.: At the roots of dictionary compression: String attractors. In Proceedings of the 50th Annual ACM SIGACT Symposium on Theory of Computing. 2018 Jun 20 (pp. 827-840). ACM.

[3] Navarro G, Prezza N.: Universal compressed text indexing. Theoretical Computer Science. 2019 Mar 1;762:41-50.

---

**Part II. Title: "Indexing sets of strings: Wheeler languages"**

Indexing strings via prefix (or suffix) sorting is, arguably, one of the most successful algorithmic techniques developed in the last decades. Can indexing texts be extended to indexing languages? We illustrate such an extension considering the sub-class of regular languages accepted by automata whose states correspond to collections of strings that are naturally sorted co-lexicographically. Starting from the recently introduced notion of Wheeler graph [1], which extends the concept of prefix sorting to labeled graphs, we investigate the properties of Wheeler languages, that is, regular languages admitting an accepting Wheeler finite automaton.

We prove that the order relation over states induces an interval order among the elements of the language accepted by the automaton. This allows elegant counterparts of classic results (e.g. the Myhill-Nerode theorem) as well as simplifications of constructions and algorithms (e.g. determinisation) that are exponential in the general case but polynomial in the Wheeler case.

In the first seminar we start with a gentle introduction to string manipulation, prefix sorting, and the Burrows-Wheeler Transform, together with a reminder of the general theory of finite automata, completed with the definition and first properties of Wheeler automata.
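As a warm-up in the spirit of this first seminar, here is the textbook rotation-sorting definition of the Burrows-Wheeler Transform (a quadratic-space sketch for illustration only; `$` is the usual end-of-string sentinel, smaller than every other character):

```python
def bwt(s):
    """BWT of s: the last column of the sorted rotations of s + '$'."""
    t = s + "$"
    rotations = sorted(t[i:] + t[:i] for i in range(len(t)))
    return "".join(r[-1] for r in rotations)

def inverse_bwt(b):
    """Invert the BWT by repeatedly prepending the transform as a new
    column and re-sorting the rows."""
    rows = [""] * len(b)
    for _ in range(len(b)):
        rows = sorted(b[i] + rows[i] for i in range(len(b)))
    return next(r for r in rows if r.endswith("$"))[:-1]

print(bwt("banana"))  # "annb$aa": equal characters cluster together
```

The clustering of equal characters is what makes the transform compressible, and the sorted-rotation view is exactly the prefix-sorting idea that Wheeler graphs lift from strings to labeled graphs.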

During the second seminar we illustrate recent results [2] from the theory of Wheeler languages.

[1] T. Gagie, G. Manzini, J. Sirén: Wheeler graphs: A framework for BWT-based data structures. Theoretical Computer Science, 2017.

[2] J. Alanko, G. D'Agostino, A. Policriti, and N. Prezza: Regular Languages meet Prefix Sorting. To appear in SODA 2020.

---

**Part III: Title: Fine-grained Complexity and Automata (A. Tomescu)**

A large body of traditional complexity theory is focused on classifying problems as tractable or intractable, where tractable generally means polynomial-time solvable. This is based on reductions between problems, such as NP-hardness reductions, and some conjectures, like P ≠ NP. However, such an approach groups all polynomial-time solvable problems into one class, and does not allow one to prove, for example, that a problem cannot be solved in better than quadratic time.

An emerging field aiming to tackle this issue is that of "fine-grained complexity" [1]. The aim is to find analogous reductions between problems ("fine-grained reductions") and conjectures (such as the Strong Exponential Time Hypothesis) allowing for a much finer classification of the complexity of polynomial-time solvable problems.

In this course we will give an overview of the main concepts and present fine-grained reductions for some well-known problems. One of these is the problem of deciding whether a finite automaton accepts a given word or, equivalently, whether a given string appears as the label of a walk in a labeled graph. We show that, conditional on the Strong Exponential Time Hypothesis, this cannot be solved in better than quadratic time, even if the graph has many nice properties (acyclicity, determinism) [2].

[1] Virginia Vassilevska Williams: Hardness of Easy Problems: Basing Hardness on Popular Conjectures such as the Strong Exponential Time Hypothesis (Invited Talk). IPEC 2015: 17-29

[2] Massimo Equi, Roberto Grossi, Veli Mäkinen, Alexandru I. Tomescu: On the Complexity of String Matching for Graphs. ICALP 2019: 55:1-55:15