Algorithmic Techniques for Taming Big Data (DS-563/CS-543, Spring 2025)
Growing amounts of available data lead to significant challenges in processing them efficiently. In many cases, it is no longer possible to design feasible algorithms that can freely access the entire data set. Instead of that, we often have to resort to techniques that allow for reducing the amount of data such as sampling, sketching, dimensionality reduction, and core sets. Apart from these approaches, the course will also explore scenarios in which large data sets are distributed across several machines, or even geographical locations, and the goal is to design efficient communication protocols or MapReduce algorithms.
The course will include a final project and programming assignments in which we will explore the performance of our techniques when applied to publicly available data sets. Throughout the course, we will explore various strategies for implementing techniques that have theoretical guarantees in practice.
• Tuesdays: 3:30–4:45pm at CDS 701
• Wednesdays: 1:25–2:15pm at SOC B61 or 2:30–3:20pm at CGS 423
• Thursdays: 3:30–4:45pm at CDS 701
- Tuesday 1/21: Class topics and logistics. Introductions. [Syllabus]
- Wednesday 1/22: Review of useful probabilistic facts and inequalities. [Notes] [Handout with probabilistic inequalities]
- Thursday 1/23: Frequency estimation in small space via Count-Min Sketch. [Notes]
- Tuesday 1/28: Continue the analysis of Count-Min Sketch. Linear sketching algorithms. [Notes]
- Wednesday 1/29: Properties of linear sketching algorithms. Streaming algorithms. Frequency moments. [Notes]
- Thursday 1/30: AMS sketch for $F_2$. [Notes]
- Tuesday 2/04: Amplifying the probability of success. [Notes]
- Wednesday 2/05: Amplifying the probability of success and the Chernoff bound. [Notes]
- Thursday 2/06: Constructing $k$-wise independent hash functions. [Notes]
- Tuesday 2/11: Estimating the number of distinct element. [Notes]
- Wednesday 2/12: Estimating the number of distinct element (continued). (See previous notes.)
- Thursday 2/13: Graph connectivity sketches. [Notes]
[Graph connectivity sketches]
- Tuesday 2/18: No lecture. BU follows the Monday schedule.
- Wednesday 2/19: Graph connectivity sketches (continued). [Graph connectivity sketches]
- Thursday 2/20: Graph connectivity sketches. Adaptive usage of randomized algorithms. [Graph connectivity sketches]
- Tuesday 2/25: Adversarial robustness of streaming algorithms: algorithm switching for insertion–only streams. [Notes on adversarial robustness]
- Wednesday 2/26: Adversarial robustness of streaming algorithms: properties of the algorithm switching technique. [Notes on adversarial robustness]
- Thursday 2/27: Adversarial robustness of streaming algorithms: properties of the algorithm switching technique and a quick discussion of sparse–dense trade-offs for insertion–deletion streams. [Notes on adversarial robustness]
- Tuesday 3/04: Dimensionality reduction via the Johnson–Lindenstrauss Lemma. [Notes on JL Lemma]
- Wednesday 3/05: Wrapping up the proof of the Johnson-Lindenstrauss Lemma. [Notes on JL Lemma]
- Thursday 3/06: Nearest neighbor search via Locality Sensitive Hashing.