Huge volumes of data are now generated continuously from heterogeneous sources. Managing, processing, and analyzing such data is challenging and requires high-performance computing frameworks and techniques. The course discusses the performance bottlenecks in traditional data processing and analytical tools and presents opportunities to design scalable solutions that leverage distributed cluster architectures. Emphasis is put on the Hadoop and Spark frameworks, with practical examples of big data processing and analytical tasks in real-world applications.
In this course, we will use the Python programming interface to the Apache Spark framework (PySpark) in combination with Databricks. You will be able to write and execute PySpark code directly in your browser, without worrying about standalone configuration, while leveraging free access to Databricks' powerful cloud infrastructure. This setup also facilitates collaborative coding.
Our High-Performance Computing course has joined the Databricks University Alliance, an active community of professors and educators who collaboratively share ideas to improve the teaching experience and provide students with the most recent and relevant developments in data science tools and concepts adopted in industry. To start using Databricks, you can set up a free personal account (Community Edition). This option lets you use Databricks on Amazon AWS for free. Instructions to sign up will be provided in the course.
Date | Topic | Module | Deadlines | ||
---|---|---|---|---|---|
Week 1 | |||||
Aug 30 | Introduction to HPC | S1 | |||
Sep 2 | Big Data Analytics / NoSQL | S2 - S3 | |||
Week 2 | |||||
Sep 6 | / | / | Labor Day | ||
Sep 9 | Hadoop I | S4 | Install VM | ||
Week 3 | |||||
Sep 13 | Hadoop II | S4 | |||
Sep 16 | Spark: Intro, RDDs & Databricks Platform | S5 | Create Databricks Account | ||
Week 4 | |||||
Sep 20 | Spark: Dataframes | S5 | |||
Sep 23 | Spark: Transformations | S6 | |||
Week 5 | |||||
Sep 27 | Spark: Internals | S7 | |||
Sep 30 | Midterm Exam I | ||||
Week 6 | |||||
Oct 4 | Spark: Structured Streaming / Delta Lakes | S8 | |||
Oct 7 | Spark: ML and MLlib / Linear Regression | S9 | |||
Week 7 | |||||
Oct 11 | Spark: MLflow / Decision Trees / Random Forest | S10 | Project Assignment (S) | ||
Oct 14 | Spark: HyperOpt / AutoML / XGBoost | S11 | |||
Week 8 | |||||
Oct 18 | Spark: MLlib Deployment / Pandas UDF / Koalas / Logistic Regression / Collaborative Filtering | S12 | |||
Oct 21 | Guest Lecture: Bartosz Krawczyk | ||||
Week 9 | |||||
Oct 25 | Guest Lecture: Eftim Zdravevski | ||||
Oct 28 | Time Series Forecasting / Graph Analysis: GraphX / Case Study / GraphFrames | S13 | |||
Week 10 | |||||
Nov 1 | Spark: Deep Learning I | S14 | Project: Selected dataset and tasks | ||
Nov 4 | Spark: Deep Learning II | S15 | |||
Week 11 | |||||
Nov 8 | Spark: Deep Learning III | S15 | |||
Nov 11 | Spark: NLP I | S16 | |||
Week 12 | |||||
Nov 15 | Midterm Exam II | | Project: Data Exploration / Pre-processing | ||
Nov 18 | Spark: NLP II | S16 | |||
Week 13 | |||||
Nov 22 | Spark: ML Deployment | S17 | |||
Nov 29 | Spark: ML in production | S18 | Project: Modeling | ||
Week 14 | |||||
Dec 2 | Spark: Performance Optimization I | S19 | Project: Report draft | ||
Dec 6 | Spark: Performance Optimization II | S19 | |||
Week 15 | |||||
Dec 9 | Guest Lecture: Herna Viktor | | Project: Submission | ||
Dec 13 | Final Exam | ||||
Component | Weight |
---|---|
Project | 40% |
Midterm Exams (2 × 15% each) | 30% |
Final Exam (cumulative: half old, half new) | 30% |
Students are encouraged to attend all lectures. Prolonged absences must be discussed with the instructor. If you cannot attend lectures regularly because of work or other obligations during remote learning, please let the instructor know.
Exams cover the material from the lectures, projects, and reading. While not necessarily cumulative, each exam will require understanding many of the concepts covered in the preceding exams. Exams consist of multiple choice, short answer, and long answer questions. Each exam, except the final, is weighted equally.
The final exam is cumulative: half of the final exam will be material covered for prior exams, half will be material that is new since the previous exam.
Range | Letter |
---|---|
>=93 | A |
>=90 | A- |
>=87 | B+ |
>=83 | B |
>=80 | B- |
>=77 | C+ |
>=73 | C |
>=70 | C- |
>=60 | D |
<60 | F |
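As a rough illustration of how the component weights and the letter scale combine, the following minimal Python sketch computes a weighted course percentage and looks up the letter grade. The function names, the component keys, and the example scores are hypothetical. The syllabus does not state a rounding policy, so the sketch compares the unrounded total against the cutoffs; treat that as an assumption rather than the instructor's rule.

```python
# Weights from the grading components table, as fractions of the final score.
# The two midterms are split 15% each, as stated in the table.
WEIGHTS = {"project": 0.40, "midterm_1": 0.15, "midterm_2": 0.15, "final": 0.30}

# Lower bounds from the letter-grade table, highest first.
CUTOFFS = [(93, "A"), (90, "A-"), (87, "B+"), (83, "B"), (80, "B-"),
           (77, "C+"), (73, "C"), (70, "C-"), (60, "D")]


def weighted_score(scores):
    """Combine per-component percentages (0-100) into one course percentage."""
    return sum(WEIGHTS[name] * scores[name] for name in WEIGHTS)


def letter_grade(score):
    """Return the first letter whose lower bound the score meets, else F."""
    # Assumption: no rounding is applied before comparing with the cutoffs.
    for lower_bound, letter in CUTOFFS:
        if score >= lower_bound:
            return letter
    return "F"


# Hypothetical scores, for illustration only.
example = {"project": 92, "midterm_1": 85, "midterm_2": 78, "final": 88}
total = weighted_score(example)
print(round(total, 2), letter_grade(total))
```

With these example scores the weighted total is 87.65, which falls in the B+ range.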
Although we encourage collaboration with a partner, sharing code between groups is strictly forbidden; it is a form of plagiarism, as is showing your work to other students, even for a second. There is rarely a single correct way to write code that solves a problem, so feel free to discuss your approach with your partner. Because a given problem usually has many solutions, it is typically obvious when one student has shared code with another. If you copy and paste code from the Internet (or even from the text), cite your source in your comments, and make sure you understand what the code does: not all code on the web is good! Assignments will be checked with plagiarism detection software and by hand to ensure the originality of the work.
Do not share your code with anyone other than a partner. Do not let someone look at your screen. You may fall behind, or a friend may ask for help, but the consequences of plagiarism are far worse than those of an incomplete submission, which will likely still earn some points. If I suspect that you have purposely shared code with another student or presented someone else's work as your own, the matter will be referred to the Academic Integrity Code Administrator for adjudication. If you are found responsible for an academic integrity violation, sanctions can include a failing grade for the course, suspension for one or more academic terms, dismissal from the university, or other measures as deemed appropriate by the Dean.
All students are expected to adhere to the American University Honor Code. If you have a question about whether or not something is permissible, ask the instructor or the TA first.