The free lunch is over! Until the turn of the century, computer systems became steadily faster without any particular effort, simply because the hardware they ran on increased its clock speed with every new release. This trend has ended: today's CPUs have plateaued at around 3 GHz. The size of modern computer systems in terms of transistors and processing elements (cores in CPUs/GPUs, CPUs/GPUs in compute nodes, compute nodes in clusters), however, still increases constantly. This has caused a paradigm shift in how software is written: instead of optimizing code for a single thread, applications now need to solve their tasks in parallel to achieve noticeable performance gains. Distributed computing, i.e., the distribution of work across (potentially) physically isolated compute nodes, is the most extreme form of parallelization.
Big Data Analytics is a multi-million dollar market that grows constantly! Data, and the ability to control and use it, is the most valuable asset of today's computer systems. Because data volumes grow so rapidly, and with them the complexity of the questions they are expected to answer, data analytics, i.e., extracting any kind of information from the data, becomes increasingly difficult. Since data analytics systems cannot count on faster hardware to solve their performance problems, they need to embrace new software trends that let their performance scale with the still increasing number of processing elements.
In this lecture, we take a look at various technologies involved in building distributed, data-intensive systems. We discuss theoretical concepts (data models, encoding, replication, ...) as well as some of their practical implementations (Akka, MapReduce, Spark, ...). Since workload distribution is a concept that is useful for many applications, we focus in particular on data analytics.

Introduction | 01:15:38 | |
---|---|---|
Examples Distributed Systems | 00:15:17 | |
Lecture Organization | 00:13:39 | |
Motivation "Distributed" | 00:14:45 | |
Motivation "Data" | 00:13:27 | |
Motivation "Management" | 00:18:30 |

Foundations | 01:21:37 | |
---|---|---|
Big Data | 00:33:50 | |
Data-Intensive Applications | 00:42:43 | |
Consistency Models | 00:19:31 | |
Distributed Computing | 00:19:48 |

Encoding and Communication | 01:29:36 | |
---|---|---|
Encoding | 00:08:20 | |
Language-Specific Encoding | 00:23:31 | |
JSON/XML Encoding | 00:08:11 | |
Binary Encoding | 00:30:19 | |
Communication | 00:19:15 |

Encoding and Communication 2 | 01:31:35 | |
---|---|---|
Models of Dataflow | 00:18:12 | |
Remote Procedure Call | 00:16:49 | |
Popular Service Protocols | 00:17:32 | |
Communication Principle | 00:17:04 | |
Examples | 00:21:58 |

Akka Actor Programming | 01:31:24 | |
---|---|---|
Actor Model (Recap) | 00:18:28 | |
Basic Concepts | 00:26:08 | |
Runtime Architecture | 00:26:03 | |
Demo | 00:20:45 |

Akka Actor Programming 2 | 01:30:36 | |
---|---|---|
Messaging | 00:34:47 | |
Parallelization | 00:08:25 | |
Remoting | 00:12:08 | |
Clustering | 00:35:16 |

Akka Actor Programming 3 - Patterns | 01:31:16 | |
---|---|---|
Master/Worker | 00:20:48 | |
Proxy | 00:21:39 | |
Singleton | 00:07:21 | |
Reaper | 00:16:46 | |
Homework | 00:24:41 |

Data Models and Query Languages | 01:28:29 | |
---|---|---|
The Relational Data Model | 00:19:20 | |
The Key-Value Data Model | 00:09:03 | |
The Column-Family Data Model | 00:19:43 | |
The Document Data Model | 00:29:51 | |
The Graph Data Model | 00:10:32 |

The Graph Data Model | 01:29:05 | |
---|---|---|
Querying: Cypher | 00:23:30 | |
Ways to Model Properties/Relationships | 00:14:13 | |
Querying: SPARQL | 00:15:35 | |
Storage and Retrieval | 00:13:44 | |
Segmentation | 00:22:03 |

Storage and Retrieval & Replication | 01:20:49 | |
---|---|---|
Fast Retrieval DBMS | 00:29:57 | |
Fast Storage and Retrieval | 00:27:57 | |
Excursus | 00:10:53 | |
Replication | 00:12:02 |

Replication 2 | 01:24:21 | |
---|---|---|
Single-Leader Replication | 00:38:35 | |
Multi-Leader Replication | 00:06:55 | |
Leaderless Replication | 00:38:51 |

Replication & Partitioning | 01:28:37 | |
---|---|---|
Leaderless Replication | 00:17:28 | |
Partitioning | 00:11:38 | |
Partitioning of Key-Value Data | 00:23:38 | |
Partitioning and Secondary Indexes | 00:09:51 | |
Rebalancing Partitions | 00:15:40 | |
Request Routing | 00:10:22 |

Distributed Systems | 01:32:13 | |
---|---|---|
Introduction | 00:23:16 | |
Unreliable Networks | 00:53:01 | |
Unreliable Clocks | 00:15:56 |

Distributed Systems & Consistency and Consensus | 01:29:52 | |
---|---|---|
Unreliable Clocks | 00:27:40 | |
Knowledge, Truth, Lies | 00:18:37 | |
Consistency and Consensus | 00:07:24 | |
Linearizability | 00:24:20 | |
Ordering Guarantees | 00:11:51 |

Consistency and Consensus & Transactions | 01:26:42 | |
---|---|---|
Ordering Guarantees | 00:21:53 | |
Consensus | 00:40:28 | |
Transactions | 00:24:21 |

Transactions & Batch Processing | 01:11:09 | |
---|---|---|
Consensus for Transaction Commits | 00:26:44 | |
Batch Processing | 00:19:22 | |
The Sushi Principle | 00:08:00 | |
Batch Processing | 00:17:03 |

Batch Processing: Distributed File Systems and MapReduce | 01:31:30 | |
---|---|---|
Distributed File System | 00:34:03 | |
Distributed Batch Processing | 00:12:57 | |
MapReduce | 00:34:20 | |
Apache Hadoop | 00:10:10 |

Beyond MapReduce | 01:29:04 | |
---|---|---|
Apache Hadoop | 00:16:13 | |
MapReduce's Workflow Design | 00:16:49 | |
Apache Spark | 00:37:47 | |
Graph Processing as Batch Job | 00:18:15 |

Exercise Evaluation Assignment 1-3 | 01:25:53 | |
---|---|---|
Exercise Evaluation | 00:05:38 | |
Task 1 - Akka Setup | 00:04:10 | |
Task 2 - Large Message Proxy | 00:32:36 | |
Task 3 - Password Cracking | 00:43:29 |

Spark Batch Processing | 01:28:12 | |
---|---|---|
Key Facts | 00:26:23 | |
Distributed Data | 00:18:50 | |
Transformation Pipelines | 00:15:53 | |
Parquet Files | 00:27:06 |

Spark Batch Processing 2 | 01:30:05 | |
---|---|---|
Exam Date | 00:09:30 | |
Spark SQL | 00:27:35 | |
Spark Tutorial | 00:53:00 |

Spark Batch Processing & Stream Processing | 01:27:06 | |
---|---|---|
Spark Tutorial | 00:58:33 | |
Stream Processing | 00:10:51 | |
Transmitting Event Streams | 00:17:42 |

Stream Processing | 01:26:02 | |
---|---|---|
Transmitting Event Streams | 01:08:57 | |
Databases and Streams | 00:13:54 | |
Processing Streams | 00:03:11 |

Stream Processing & Distributed DBMSs | 01:28:22 | |
---|---|---|
Processing Streams | 01:03:26 | |
Homework | 00:03:24 | |
Distributed DBMSs: Motivation | 00:09:31 | |
Distributed DBMSs | 00:12:01 |

Distributed DBMSs | 01:27:24 | |
---|---|---|
Materialized vs. Virtual | 00:44:11 | |
Data Warehouses | 00:43:13 |

Distributed DBMSs & Distributed Query Optimization | 01:29:20 | |
---|---|---|
Data Warehouses | 00:11:46 | |
Federated Database Management Systems | 00:11:18 | |
Distributed Query Optimization | 00:01:24 | |
Distributed Query Execution | 00:09:21 | |
Distributed Join Execution | 00:22:39 | |
Bloom-Filter-Optimized Joins | 00:11:53 | |
Multi-Relation Joins | 00:20:59 |

Exam Preparation | 01:31:01 | |
---|---|---|
Exercise Evaluation Assignment 4 | 00:36:25 | |
Lecture Summary | 00:54:36 |