Algorithmic Techniques for Taming Big Data (DS-563/CS-543, Fall 2021)

Course Description: Growing amounts of available data lead to significant challenges in processing them efficiently. In many cases, it is no longer possible to design feasible algorithms that can freely access the entire data set. Instead of that, we often have to resort to techniques that allow for reducing the amount of data such as sampling, sketching, dimensionality reduction, and core sets. Apart from these approaches, the course will also explore scenarios in which large data sets are distributed across several machines, or even geographical locations, and the goal is to design efficient communication protocols or MapReduce algorithms.

The course will include a final project and programming assignments in which we will explore the performance of our techniques when applied to publicly available data sets. Throughout the course, we will explore various strategies for implementing techniques that have theoretical guarantees in practice.

Instructor: Krzysztof Onak (konak@bu.edu)
Office Hours: Tuesday 4–6pm, MCS 138N (or adjacent common area)

Teaching Fellow: Nadya Voronova (voronova@bu.edu)
Office Hours: Wednesday 2–4pm, MCS 141

Lecture: Tuesday/Thursday 2–3:15, MCS B33
Discussion: Wednesday 10:10–11am, MCS B37

Piazza (announcements and discussions): https://piazza.com/bu/fall2021/ds563cs543/home
Gradescope code for submitting homework: BPDKV8

Satisfies the Following HUB Units: Quantitative Reasoning II (QR2) and Creativity/Innovation (CRI)

Lectures

9/2 (Lecture 1): Introduction and class overview. CountMin sketch. [Handout] [Notes]
9/7 (Lecture 2): CountMin sketch (continued). Heavy hitters. [Notes]
9/9 (Lecture 3): Counting distinct elements.
9/14 (Lecture 4): Brief discussion of general moment estimation. Adversarially robust streaming algorithms. [Notes]
9/16 (Lecture 5): Adversarially robust streaming algorithms (continued). [Notes]
9/21 (Lecture 6): Graph connectivity sketches. [Notes]
9/27 (Lecture 7): Johnson–Lindenstrauss lemma. [Notes]
9/28 (Lecture 8): Applications of dimensionality reduction.
9/30 (Lecture 9): Brief discussion: optimality of the Johnson–Lindenstrauss lemma, dimensionality reduction in $\ell_1$. Locality sensitive hashing.
10/5 (Lecture 10): Locality sensitive hashing (continued).
10/7 (Lecture 11): Core sets (minimum enclosing ball).
10/12: Virtual Monday. No lecture
10/14 (Lecture 12): Core sets (approximate median).
10/19 (Lecture 13): Core sets ($k$-center clustering).
10/21 (Lecture 14): Estimation/learning of discrete distributions. Uniformity testing.
10/26 (Lecture 15): Uniformity testing (continued).
10/28 (Lecture 16): Estimation/learning of monotone and unimodal distributions.
11/2 (Lecture 17): Estimation/learning of monotone and unimodal distribution. Monotonicity testing.
11/4 (Lecture 18): Monotonicity testing (continued).
11/9 (Lecture 19): Estimation of the number of edges in a graph.
11/11 (Lecture 20): Estimation of the number of edges in a graph (continued).
11/16 (Lecture 21): Estimation of the maximum matching size in sparse graphs.
11/18 (Lecture 22): The Massively Parallel Computation model (MPC). Sample MPC algorithms.
11/23 (Lecture 23): More MPC algorithms: matchings, sorting.
11/30 (Lecture 24): Simulating random walks in MPC
12/2 (Lecture 25): Connectivity in MPC and conditional lower bounds. Communication complexity for proving streaming lower bounds.
Lecture 12/7, Discussion 12/8, Lecture 12/9: Final project presentations

Useful Auxiliary Materials

Useful probabilistic inequalities
There is no textbook. A good list of resources—including books, surveys, lectures notes, and presentations—on many topics covered in this class can be found at https://sublinear.info/index.php?title=Resources.
Another list of related courses: https://www.sketchingbigdata.org.

Assignments

Note: the due date is subject to change until the assignment is handed out.

Course Requirements

Apart from active participation, the class will require solving homework problem sets, two experimental programming assignments, and a final project. The overall grade will be based on the following factors:

class participation: 5%
homework: 25%
two programming assignments: 25%
project proposal: 5%
final project: 40%

Programming Assignments: The course will feature two programming assignments in which students will implement algorithms covered in class and apply them to data sets of their choice. Collaboration here is not allowed (except for discussing high–level ideas), i.e., students are required to implement algorithms and run experiments on their own.

Final Project: Possible final projects ideas include but are not limited to

implementing an algorithm not covered in class and testing its practical performance on real–world data,
creating an open–source implementation of one of the algorithms with easy to follow documentation,
developing a new algorithm with good theoretical or practical guarantees.

The outcome of a project will be a short technical report, describing obtained results and conclusions. As opposed to programming assignments, students are allowed to work in teams of two or three. A list of potential projects topics will be provided, but students are encouraged to develop their own ideas. These projects have to be approved by the instructor.

Late Homework Policy: No extensions (except for extraordinary circumstances). If you submit your programming assignment or homework one day late, you may lose 10% of points.

Laptop and Cellphone Policy

Using laptops, cellphones, tablets, and other similar electronic devices is generally not allowed. If you want to use your laptop or tablet for taking notes, you have to email a copy of your notes to me after the class and you are not allowed to use your device for other purposes, such as replying to emails or browsing the web.

Prerequisites

Programming: Fluency with programming and basic data structures is required (commensurate with CS-111, EK-125, or equivalent). Familiarity with C++, Java, or Python is recommended.
Algorithms: Familiarity with basic topics on algorithms and computation complexity (commensurate with CS-330, EC-330, or equivalent) is required. These topics are covered in textbooks such as Cormen, Leiserson, Rivest, and Stein "Introduction to Algorithms."
Mathematics: Familiarity with basics of linear algebra (commensurate with CS-132, MA-242, or equivalent) and probability (commensurate with MA-115, CS-237, EK-381, or equivalent) is required. These topics are covered in textbooks such as Strang "Introduction to Linear Algebra," Lay, Lay, and McDonald "Linear Algebra and Its Applications," and Pishro-Nik "Introduction to Probability, Statistics, and Random Processes."

Self–assessment questionnaire: Since advanced courses offered through the Faculty of Computing & Data Sciences are meant to be open to students from various disciplines, we provide an on-line questionnaire to assist students with self–assessment and placement. Please follow this link to access it.

Tentative Syllabus

Section 1: Data projections
- Lecture 1: Course overview. Frequency estimation (CountMin sketch).
- Lecture 2: Approximate counting. Estimation of data frequency moments.
- Lecture 3: Estimation of data frequency moments.
- Lecture 4: Applications to distributed monitoring.
- Lectures 5 & 6: Compressed graph representations with applications (graph sketches).
- Lecture 7: Data dimensionality reduction (Johnson–Lindenstrauss Lemma).
- Lecture 8: Data dimensionality reduction for clustering.
- Lectures 9 & 10: Finding similar data points (nearest–neighbor search).
Section 2: Selection of representative subsets
- Lecture 11: Simple geometric problems. Clustering via core sets.
- Lecture 12: Clustering via core sets.
- Lecture 13: Diversity maximization via core sets.
Section 3: Sampling from probability distributions
- Lecture 14: Estimation of distributions and their properties.
- Lecture 15: Verification of a distribution's uniformity.
- Lecture 16: Verification of other properties. Access methods beyond sampling.
Section 4: Querying and sampling subsets of data sets
- Lecture 17: Estimation of data parameters and approximate verification of properties.
- Lecture 18: Efficient local sparse graph exploration techniques. Estimating graph parameters.
Section 5: Distributed computation
- Lecture 19: MapReduce and the Massively Parallel Computation model. Sample MPC algorithms.
- Lecture 20: Clustering on MPC.
- Lecture 21: Graph algorithms on MPC.
- Lecture 22: Limitations of distributed algorithms.
Section 6: Closing lectures
- Lectures 23 & 24: Efficient sparse least–square regression.
- Lecture 25: Overview of additional topics not covered in detail.
- Lectures 26 & 27: Project presentations and discussions.
Possible additional and replacement topics:
- Learning–augmented algorithms
- Adversarially robust streaming algorithms