Course Description: Growing amounts of available data lead to significant challenges in processing them efficiently. In many cases, it is no longer feasible to design algorithms that freely access the entire data set. Instead, we often have to resort to techniques that reduce the amount of data, such as sampling, sketching, dimensionality reduction, and coresets. Beyond these approaches, the course will also explore scenarios in which large data sets are distributed across several machines, or even geographical locations, and the goal is to design efficient communication protocols or MapReduce algorithms.

The course will include a final project and programming assignments in which we will explore the performance of these techniques on publicly available data sets. Throughout the course, we will discuss strategies for turning methods with theoretical guarantees into practical implementations.

Instructor: Krzysztof Onak (konak@bu.edu)

Office Hours: Fridays 3–5pm, CCDS 1336 (or an adjacent common area)

Teaching Assistant: Ben Badnani (bbadnani@bu.edu)

Office Hours: Wednesdays 4–6pm, CCDS 1420B (yellow southwest corner)

Lecture: Tuesday/Thursday 2–3:15, MCS B29

Discussion sections:

• Wednesday 1:25–2:15pm, IEC B01

• Wednesday 2:30–3:20pm, CGS 123

Piazza (announcements and discussions): https://piazza.com/bu/spring2023/ds563cs543

Gradescope code for submitting homework: 4VYBJ6

Satisfies the Following HUB Units: Quantitative Reasoning II (QR2) and Creativity/Innovation (CRI)

| Date | Topic |
|---|---|
| 1/19 | Class overview. CountMin sketch. [Notes] External materials: [Roughgarden & Valiant] |
| 1/24 | CountMin sketch. Heavy hitters. [Notes] |
| 1/26 | Heavy hitters. Moments and AMS sketch. [Notes] |
| 1/31 | AMS sketch. Distinct elements. [Notes] External materials: [Chakrabarti: Units 2&3] |
| 2/2 | Distinct elements (continued). [Notes] |
| 2/7 | Adversarially robust streaming. [Notes] |
| 2/9 | Adversarially robust streaming. [Notes] Due: 2/10, programming assignment 1 |
| 2/14 | Guest lecture: Prof. Alina Ene, Submodular optimization in the streaming model |
| 2/16 | Guest lecture: Prof. Charalampos Tsourakakis, Implementing HyperLogLog++ |
| 2/23 | Graph connectivity sketches. [Notes 1] [Notes 2] Due: 2/24, homework 1 |
| 2/28 | Johnson–Lindenstrauss Lemma. [Notes 1] [Notes 2] |
| 3/2 | JL Lemma (continued). Locality sensitive hashing. [Notes] Due: 3/3, final project proposal |
| 3/6–3/10 | SPRING BREAK |
| 3/14 | Locality sensitive hashing (continued). Coresets for quantiles. [Notes] |
| 3/16 | Coresets for quantiles (continued). Due: 3/17, programming assignment 2 |
| 3/21 | Coresets for $k$-median. Learning general discrete distributions. [Notes] |
| 3/23 | Learning general and monotone discrete distributions. [Notes] |
| 3/28 | Uniformity testing. Other testable properties of distributions. External materials: [Goldreich's book: Section 11.2] |
| 3/30 | Estimating the maximum matching and minimum vertex cover size in sublinear time. [Notes] Due: 3/31, homework 2 |
| 4/4 | Estimating the maximum matching and minimum vertex cover size (continued). |
| 4/6 | Estimating the number of edges in a graph. |
| 4/11 | Estimating the number of edges in a graph (continued). [Notes] |
| 4/13 | Monotonicity testing. [Notes] Due: 4/14, programming assignment 3 |
| 4/18 | MPC: the Massively Parallel Computation model. |
| 4/20 | [predicted] More MPC: Sorting. Maximal matching in large local space. |
| 4/25 | [predicted] More MPC: Large matching in small local space. |
| 4/26 | [additional lecture] More MPC: random walks and PageRank. |
| 4/27 | Final project presentations |
| 5/2 | Final project presentations |
| 5/9 | Final project report due |

- Useful probabilistic inequalities
- There is no textbook. A good list of resources on many topics covered in this class (books, surveys, lecture notes, and presentations) can be found at `https://sublinear.info/index.php?title=Resources`.
- Another list of related courses: `https://www.sketchingbigdata.org`.

- Programming assignment 1 (due 2/10)
- Homework 1 (due 2/24)
- Final project proposal (due 3/3)
- Programming assignment 2 (due 3/17)
- Homework 2 (due 3/31)
- Programming assignment 3 (due 4/14)
- Final project presentations (4/27, 5/2, 5/3)
- Final project reports (due 5/9)

Note: The due date is subject to change until the assignment is handed out.

Apart from active participation, the class will require homework, three experimental programming assignments, and a final project. The overall grade will be based on the following factors:

- class participation: 5%
- homework: 25%
- programming assignments: 25%
- project proposal: 5%
- final project: 40%

Programming Assignments: The course will feature three programming assignments in which students will implement algorithms covered in class and apply them to data sets of their choice. Collaboration is not allowed beyond discussing high-level ideas; students are required to implement the algorithms and run the experiments on their own.

Final Project: Possible final project ideas include, but are not limited to:

- implementing an algorithm not covered in class and testing its practical performance on real-world data,
- creating an open-source implementation of one of the algorithms with easy-to-follow documentation,
- developing a new algorithm with good theoretical or practical guarantees.

The outcome of a project will be a short technical report describing the obtained results and conclusions. Unlike the programming assignments, the final project may be done in teams of two or three. A list of potential project topics will be provided, but students are encouraged to develop their own ideas. All projects have to be approved by the instructor.

Late Homework Policy: No extensions (except for extraordinary circumstances). If you submit a programming assignment or homework one day late, you may lose 10% of the points.

Using laptops, cellphones, tablets, and other similar electronic devices is generally not allowed. If you want to use your laptop or tablet for taking notes, you have to email a copy of your notes to the TA after the class. You are not allowed to use your device for other purposes, such as replying to emails or browsing the web.

- Programming: Fluency with programming and basic data structures is required (commensurate with DS-110, CS-111, EK-125, or equivalent). Familiarity with C++, Java, Rust, or Python is recommended.
- Algorithms: Familiarity with basic topics in algorithms and computational complexity (commensurate with DS-320, CS-330, EC-330, or equivalent) is required. These topics are covered, for instance, in Cormen, Leiserson, Rivest, and Stein, "Introduction to Algorithms."
- Mathematics: Familiarity with basics of linear algebra, probability, and calculus is required. These topics are covered in the DS math sequence (DS-120, DS-121, DS-122) and multiple other courses across the BU campus (including CS-132, MA-242, MA-115, CS-237, EK-381). Textbooks that discuss these topics include Strang "Introduction to Linear Algebra," Lay, Lay, and McDonald "Linear Algebra and Its Applications," and Pishro-Nik "Introduction to Probability, Statistics, and Random Processes."

Self-assessment questionnaire: Since advanced courses offered through the Faculty of Computing & Data Sciences are meant to be open to students from various disciplines, we provide an online questionnaire to assist students with self-assessment and placement. Please follow this link to access it.

The exact content may still be adjusted.

- Section 1: Data projections
- Lecture 1: Course overview. Frequency estimation (CountMin sketch).
- Lecture 2: Heavy hitters. AMS sketch.
- Lecture 3: Counting distinct elements.
- Lectures 4 & 5: Adversarially robust streaming algorithms.
- Lectures 6 & 7: Compressed graph representations with applications (graph sketches).
- Lecture 8: Johnson–Lindenstrauss lemma.
- Lecture 9: Applications of dimensionality reduction.
- Lectures 10 & 11: Locality sensitive hashing.
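As a small taste of the sketching techniques in this section, here is a toy CountMin sketch in Python. It is an illustrative sketch only: the salted SHA-256 hashing and the fixed width/depth are simplifications of the pairwise-independent hash families and the $(\varepsilon, \delta)$-driven parameter choices analyzed in class.

```python
import hashlib

class CountMinSketch:
    """Toy CountMin sketch: depth rows of width counters, min over rows on query."""

    def __init__(self, width=1024, depth=5):
        self.width, self.depth = width, depth
        self.table = [[0] * width for _ in range(depth)]

    def _buckets(self, item):
        # One bucket per row, derived from a salted SHA-256 (a stand-in for
        # the pairwise-independent hash families used in the analysis).
        for row in range(self.depth):
            h = hashlib.sha256(f"{row}:{item}".encode()).digest()
            yield row, int.from_bytes(h[:8], "big") % self.width

    def add(self, item, count=1):
        for row, col in self._buckets(item):
            self.table[row][col] += count

    def query(self, item):
        # Never underestimates; overestimates only due to hash collisions.
        return min(self.table[row][col] for row, col in self._buckets(item))

sketch = CountMinSketch()
for word in ["a", "b", "a", "c", "a"]:
    sketch.add(word)
print(sketch.query("a"))  # at least 3, the true count of "a"
```

Note the one-sided error: point queries can only overestimate, which is what makes the sketch useful for heavy hitters later in the section.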

- Section 2: Selection of representative subsets.
- Lecture 12: Core sets for $k$-center clustering.
- Lecture 13: Core sets for approximate median.
- Lecture 14: Diversity maximization via core sets.
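A building block behind Lecture 12's coresets is greedy farthest-first traversal (Gonzalez's algorithm) for $k$-center. Below is a minimal sketch under simplifying assumptions of my own: plain Python, Euclidean points, and a made-up toy data set.

```python
import math

def farthest_first(points, k):
    """Gonzalez's greedy traversal: repeatedly pick the point farthest from
    the centers chosen so far. Yields a 2-approximation for k-center, and the
    selected points serve as a small representative subset of the data."""
    centers = [points[0]]
    # dist[i] = distance from points[i] to its nearest chosen center
    dist = [math.dist(p, centers[0]) for p in points]
    while len(centers) < k:
        i = max(range(len(points)), key=dist.__getitem__)
        centers.append(points[i])
        for j, p in enumerate(points):
            dist[j] = min(dist[j], math.dist(p, points[i]))
    return centers

# Three well-separated clusters; the greedy picks one representative from each.
pts = [(0, 0), (0.1, 0), (10, 10), (10.2, 9.9), (-8, 7)]
centers = farthest_first(pts, 3)
print(centers)
```

Each iteration costs one pass over the data, so picking $k$ centers takes $O(nk)$ distance evaluations; the lectures discuss how such subsets stand in for the full data set.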

- Section 3: Sampling from probability distributions
- Lecture 15: Estimation of distributions and their properties.
- Lecture 16: Uniformity testing.
- Lecture 17: Estimation of monotone and unimodal distributions.
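As a taste of Lecture 16, here is a toy collision-based uniformity check. The sample sizes, domain, and the absence of a formal accept/reject threshold are my simplifications; the actual tester and its guarantees are covered in class.

```python
import random
from itertools import combinations

def collision_rate(samples):
    """Fraction of sample pairs that collide; an estimator of ||p||_2^2.
    For the uniform distribution over [n] it concentrates around 1/n, and it
    is noticeably larger for distributions far from uniform."""
    pairs = len(samples) * (len(samples) - 1) // 2
    hits = sum(a == b for a, b in combinations(samples, 2))
    return hits / pairs

random.seed(0)
n, s = 100, 2000
uniform = [random.randrange(n) for _ in range(s)]       # uniform over [n]
skewed = [random.randrange(n // 10) for _ in range(s)]  # support only n/10
print(collision_rate(uniform), collision_rate(skewed))
```

The skewed sample collides roughly ten times as often, which is exactly the gap a uniformity tester thresholds against.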

- Section 4: Querying and sampling subsets of data sets
- Lecture 18: Monotonicity testing.
- Lecture 19: Estimation of the number of edges.
- Lecture 20: Efficient local sparse graph exploration techniques. Estimating maximum matching size.

- Section 5: Distributed computation
- Lecture 21: MapReduce and the Massively Parallel Computation (MPC) model. Sample MPC algorithms.
- Lecture 22: Sample MPC algorithms: matching, sorting.
- Lecture 23: Random walks on MPC.
- Lecture 24: Limitations of distributed algorithms.
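The MPC sorting of Lecture 22 can be previewed with a toy single-machine simulation of sample-based (TeraSort-style) sorting. The machine count and data are made up, and real MPC analyses additionally bound local memory (e.g., $O(n^{1-\varepsilon})$ per machine) and round counts.

```python
import random

def mpc_sort(data, machines=4):
    """Toy simulation of sample-based MPC sorting: sample pivots to split the
    key range, route each item to the machine owning its range (one
    communication round), then sort locally and concatenate."""
    pivots = sorted(random.sample(data, machines - 1))
    local = [[] for _ in range(machines)]
    for x in data:
        local[sum(x > p for p in pivots)].append(x)  # route to its machine
    return [x for part in local for x in sorted(part)]

random.seed(1)
data = [random.randrange(1000) for _ in range(100)]
result = mpc_sort(data)
```

This sketch ignores load balancing; in the model studied in class, many more pivots are sampled so that each machine's share fits in local memory with high probability.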

- Section 6: Closing lectures
- Lecture 25: Overview of additional topics not covered in detail.
- Lectures 26 & 27: Project presentations and discussions.