Algorithms for massive data

Responsabile didattico: Nicola Prezza

Durata: 28 ore

Periodo didattico: annuale



28h - prof. Nicola Prezza


The goal of this course is to introduce algorithmic techniques for dealing with massive data: data so large that it does not fit in the computer's memory. In such extreme scenarios, data compression comes at rescue. One way to compress data is to throw away uninteresting features and keep only information useful for our task at hand. The most important concept here will be that of randomized data sketches: despite being exponentially smaller than the data itself, sketches carry enough information to allow us to estimate some properties of the original objects, for example set cardinalities and distances with respect to some metric. If, on the other hand, the original data must not be lost, then lossless compression is the solution. Here we will exploit the fact that, in some scenarios, data is extremely redundant and can be considerably reduced in size (even by orders of magnitude) without losing any information. Interestingly, we will see that often computation on compressed data can be performed faster than on the original data: the active field of compressed data structures exploits the duality between compression and computation to achieve this feat.


1. Basics

- Probability theory recap (random variables, concentration inequalities)

- Hashing (universal and k-independent hashing)

2. Similarity-preserving sketching

- Karp-Rabin hashing

- Jaccard similarity via minhash

- Locality-sensitive hashing

3. Mining data streams

- Pattern matching (Karp-Rabin's and Porat-Porat's algorithms)

- Counting with small registers: Morris' algorithm

- Counting distinct elements (Flajolet-Martin's and bottom-k algorithms)

- Counting ones in a window: Datar-Gionis-Indyk-Motwani's algorithm

4. Compressed data structures

- Data compression (entropy, codes, dictionary compressors)

- Dictionary data structures (Elias-Fano, succinct bitvectors)

Compressed suffix arrays

- FM index

- Beyond statistical compression (LZ-indexes, run-length BWT)

Elenco corsi 2021/2022