Distributed Data Management (WT 2019/20)

Dr. Thorsten Papenbrock

The free lunch is over! Up until the turn of the century, computer systems became steadily faster without any particular effort, simply because the hardware they ran on increased its clock speed with every new release. This trend has ended, and today's CPUs stall at around 3 GHz. The size of modern computer systems in terms of processing elements (cores in CPUs/GPUs, CPUs/GPUs in compute nodes, compute nodes in clusters), however, still increases constantly. This has caused a paradigm shift in writing software: instead of optimizing code for a single thread, applications now need to solve their tasks in parallel to achieve noticeable performance gains. Distributed computing, i.e., the distribution of work across (potentially) physically isolated compute nodes, is the most extreme form of parallelization.

Big Data Analytics is a multi-million-dollar market that grows constantly! Data, and the ability to control and use it, is the most valuable asset of today's computer systems. Because data volumes grow so rapidly, and with them the complexity of the questions they should answer, data analytics, i.e., the ability to extract any kind of information from the data, becomes increasingly difficult. As data analytics systems cannot count on faster hardware to solve their performance problems, they need to embrace new software trends that let their performance scale with the still increasing number of processing elements.

In this lecture, we take a look at various technologies involved in building distributed, data-intensive systems. We discuss theoretical concepts (data models, encoding, replication, ...) as well as some of their practical implementations (Akka, MapReduce, Spark, ...). Since workload distribution is a concept that is useful for many applications, we focus in particular on data analytics.

Introduction	01:15:38
Examples Distributed Systems	00:15:17
Lecture Organization	00:13:39
Motivation "Distributed"	00:14:45
Motivation "Data"	00:13:27
Motivation "Management"	00:18:30

Foundations	01:21:37
Big Data	00:33:50
Data-Intensive Applications	00:42:43
Consistency Models	00:19:31
Distributed Computing	00:19:48

Encoding and Communication	01:29:36
Encoding	00:08:20
Language-Specific Encoding	00:23:31
JSON/XML Encoding	00:08:11
Binary Encoding	00:30:19
Communication	00:19:15

Encoding and Communication 2	01:31:35
Models of Dataflow	00:18:12
Remote Procedure Call	00:16:49
Popular Service Protocols	00:17:32
Communication Principle	00:17:04
Examples	00:21:58

Akka Actor Programming	01:31:24
Actor Model (Recap)	00:18:28
Basic Concepts	00:26:08
Runtime Architecture	00:26:03
Demo	00:20:45

Akka Actor Programming 2	01:30:36
Messaging	00:34:47
Parallelization	00:08:25
Remoting	00:12:08
Clustering	00:35:16

Akka Actor Programming 3 - Patterns	01:31:16
Master/Worker	00:20:48
Proxy	00:21:39
Singleton	00:07:21
Reaper	00:16:46
Homework	00:24:41

Data Models and Query Languages	01:28:29
The Relational Data Model	00:19:20
The Key-Value Data Model	00:09:03
The Column-Family Data Model	00:19:43
The Document Data Model	00:29:51
The Graph Data Model	00:10:32

The Graph Data Model	01:29:05
Querying: Cypher	00:23:30
Ways to Model Properties/Relationships	00:14:13
Querying: SPARQL	00:15:35
Storage and Retrieval	00:13:44
Segmentation	00:22:03

Storage and Retrieval & Replication	01:20:49
Fast Retrieval DBMS	00:29:57
Fast Storage and Retrieval	00:27:57
Excursus	00:10:53
Replication	00:12:02

Spark Batch Processing 2	01:30:05
Exam Date	00:09:30
Spark SQL	00:27:35
Spark Tutorial	00:53:00

Replication 2	01:24:21
Single-Leader Replication	00:38:35
Multi-Leader Replication	00:06:55
Leaderless Replication	00:38:51

Replication & Partitioning	01:28:37
Leaderless Replication	00:17:28
Partitioning	00:11:38
Partitioning of Key-Value Data	00:23:38
Partitioning and Secondary Indexes	00:09:51
Rebalancing Partitions	00:15:40
Request Routing	00:10:22

Distributed Systems	01:32:13
Introduction	00:23:16
Unreliable Networks	00:53:01
Unreliable Clocks	00:15:56

Distributed Systems & Consistency and Consensus	01:29:52
Unreliable Clocks	00:27:40
Knowledge, Truth, Lies	00:18:37
Consistency and Consensus	00:07:24
Linearizability	00:24:20
Ordering Guarantees	00:11:51

Consistency and Consensus & Transactions	01:26:42
Ordering Guarantees	00:21:53
Consensus	00:40:28
Transactions	00:24:21

Transactions & Batch Processing	01:11:09
Consensus for Transaction Commits	00:26:44
Batch Processing	00:19:22
The Sushi Principle	00:08:00
Batch Processing	00:17:03

Batch Processing: Distributed File Systems and MapReduce	01:31:30
Distributed File System	00:34:03
Distributed Batch Processing	00:12:57
MapReduce	00:34:20
Apache Hadoop	00:10:10

Beyond MapReduce	01:29:04
Apache Hadoop	00:16:13
MapReduce's Workflow Design	00:16:49
Apache Spark	00:37:47
Graph Processing as Batch Job	00:18:15

Exercise Evaluation Assignment 1-3	01:25:53
Exercise Evaluation	00:05:38
Task 1 - Akka Setup	00:04:10
Task 2 - Large Message Proxy	00:32:36
Task 3 - Password Cracking	00:43:29

Spark Batch Processing	01:28:12
Key Facts	00:26:23
Distributed Data	00:18:50
Transformation Pipelines	00:15:53
Parquet Files	00:27:06

Spark Batch Processing & Stream Processing	01:27:06
Spark Tutorial	00:58:33
Stream Processing	00:10:51
Transmitting Event Streams	00:17:42

Stream Processing	01:26:02
Transmitting Event Streams	01:08:57
Databases and Streams	00:13:54
Processing Streams	00:03:11

Stream Processing & Distributed DBMSs	01:28:22
Processing Streams	01:03:26
Homework	00:03:24
Distributed DBMSs: Motivation	00:09:31
Distributed DBMSs	00:12:01

Distributed DBMSs	01:27:24
Materialized vs. Virtual	00:44:11
Data Warehouses	00:43:13

Distributed DBMSs & Distributed Query Optimization	01:29:20
Data Warehouses	00:11:46
Federated Database Management Systems	00:11:18
Distributed Query Optimization	00:01:24
Distributed Query Execution	00:09:21
Distributed Join Execution	00:22:39
Bloom-Filter-Optimized Joins	00:11:53
Multi-Relation Joins	00:20:59

Exam Preparation	01:31:01
Exercise Evaluation Assignment 4	00:36:25
Lecture Summary	00:54:36