Distributed Data Management (WT 2018/19)

Prof. Dr. Felix Naumann, Dr. Thorsten Papenbrock

The free lunch is over! Until the turn of the century, computer systems became steadily faster without any particular effort, simply because the hardware they ran on increased its clock speed with every new release. This trend has ended, and today's CPUs stall at around 3 GHz. The size of modern computer systems in terms of contained transistors (cores in CPUs/GPUs, CPUs/GPUs in compute nodes, compute nodes in clusters), however, still increases constantly. This caused a paradigm shift in writing software: instead of optimizing code for a single thread, applications now need to solve their tasks in parallel to achieve noticeable performance gains. Distributed computing, i.e., the distribution of work across (potentially) physically isolated compute nodes, is the most extreme form of parallelization.

Big Data Analytics is a multi-million dollar market that grows constantly! Data, and the ability to control and use it, is the most valuable asset of today's computer systems. Because data volumes grow so rapidly, and with them the complexity of the questions they should answer, data analytics, i.e., the ability to extract any kind of information from the data, becomes increasingly difficult. As data analytics systems cannot hope for their hardware to get any faster to cope with performance problems, they need to embrace new software trends that let their performance scale with the still increasing number of processing elements.

In this lecture, we take a look at various technologies involved in building distributed, data-intensive systems. We discuss theoretical concepts (data models, encoding, replication, ...) as well as some of their practical implementations (Akka, MapReduce, Spark, ...). Since workload distribution is a concept that is useful for many applications, we focus in particular on data analytics.

Successor of this series: Distributed Data Management (WT 2019/20)

Introduction

Prof. Dr. Felix Naumann, Dr. Thorsten Papenbrock

Date: October 15, 2018
Language: English
Duration: 01:12:41

Introduction	01:12:41
Introduction	00:14:20
Audience	00:14:53
This Lecture	00:17:51
Data Management	00:11:34
Related Topics	00:14:03

Foundations	01:23:06
Definition	00:21:04
Correlation vs. Causation	00:16:36
Design Concerns	00:00:00
Consistency Models	00:32:20
Distributed Computing	00:13:06

Distributed DBMS	01:29:01
Introduction	00:10:00
Architectures of Distributed Database Systems	00:16:40
Materialized vs. Virtual	00:56:25
Federated Database Management Systems	00:05:56

Data Warehouses	01:17:07
Recap	00:06:24
OLTP vs. OLAP	00:16:49
Schema Design	00:28:19
Data Cube	00:15:53
Row- vs. Column-Orientation	00:09:53

Encoding and Evolution	01:11:08
Introduction	00:07:38
Formats for Encoding Data	00:20:13
JSON, XML, and Binary Variants	00:11:36
Thrift and Protocol Buffers	00:09:46
Avro	00:11:45
Models of Dataflow	00:10:10

Models of Dataflow	01:27:18
Dataflow Through Databases	00:04:18
Dataflow Through Services	00:42:07
Message-Passing Dataflow	00:40:53

Akka Actor-Programming Hands-on	01:28:43
Actor Model (Recap)	00:11:36
Basic Concepts	00:23:05
Runtime Architecture	00:19:36
Demo	00:14:36
Messaging	00:19:50

Akka Actor-Programming Part 2	01:29:57
Parallelization	00:10:42
Remoting	00:12:20
Clustering	00:30:07
Patterns	00:19:53
Homework	00:16:55

Patterns	01:29:00
Ask	00:19:51
Singleton	00:09:59
Reaper	00:19:28
Tutorial	00:39:42

Data Models and Query Languages	01:24:05
Introduction	00:06:18
The Relational Data Model	00:17:36
Non-Relational - Key-Value Model	00:09:55
Non-Relational - Column-Family Model	00:17:00
Non-Relational - Document Model	00:33:16

Storage and Retrieval	01:27:01
Layering Data Models	00:18:40
Segmentation	00:14:00
Sorted String Tables	00:10:30
B-Tree	00:18:34
LSM-Trees	00:15:23
Alternative Index Types	00:00:00

Replication	01:26:56
Distributing Data	00:14:28
Single-Leader	00:29:56
Multi-Leader	00:00:00
Leaderless	00:17:50
Gossip	00:18:24

Apache Spark	01:29:38
Degree of Parallelism	00:16:18
Spark SQL	00:09:19
Spark Streaming	00:20:53
Regular Expression Support	00:13:41
Typical Architectures	00:07:09
Homework	00:22:18

Spark - Hands On	01:28:41
Coding	01:28:11

Distributed Systems	01:33:19
Introduction	00:15:34
Unreliable Networks	00:46:25
Unreliable Clocks	00:28:33
Locking	00:02:47

Partitioning	01:19:06
Replication vs. Partitioning	00:08:34
Partitioning by Hash of Key	00:19:32
Secondary Indexes	00:13:21
Rebalancing	00:13:36
Partition Lookup	00:00:00

Batch Processing	01:29:20
Introduction	00:24:16
Batch Processing with Unix Tools	00:11:55
Distributed File Systems	00:21:32
Distributed Batch Processing	00:14:03
MapReduce	00:17:34

Distributed File Systems and MapReduce	01:27:15
MapReduce	00:21:04
Apache Hadoop	00:33:30
Hadoop vs. MPP Databases	00:09:54
Data Flow Engines	00:12:48
Apache Tez	00:09:59

Beyond MapReduce	01:29:41
Apache Spark vs. Apache Flink	00:24:37
Graph Processing as Batch Job	00:19:52
Spark Batch Processing	00:14:39
Distributed Data	00:16:36
Distributed Transformations and Actions	00:13:57

Consistency and Consensus	01:30:02
Knowledge, Truth, Lies	00:19:19
The Situation	00:18:52
Ordering Guarantees	00:37:17
Consensus	00:00:00

Transactions	01:29:57
Fault-Tolerant Consensus	00:13:08
Consensus for Leaderless Cryptocurrencies	00:24:48
An OLTP Topic	00:20:34
Isolation	00:20:30
Consensus for Transaction Commits	00:00:00

Stream Processing	01:27:31
Types of Systems	00:19:27
Transmitting Event Streams	00:11:05
Message Brokers	00:20:06
Partitioned Logs	00:30:01
Databases and Streams	00:06:52

Processing Streams	01:33:58
Kafka	00:22:00
Keeping Systems in Sync	00:09:03
Processing Streams	00:18:39
Challenges and Limits	00:11:38
Windows and Parallelization	00:14:08
Event Time vs. Processing Time	00:18:30

Distributed Query Optimization (1)	01:26:03
Hadoop versus MPP	00:24:11
Cost Comparison	00:17:28
Definition of Semi-Join	00:19:50
Effect on S	00:16:44
Some Theory	00:07:50

Distributed Query Optimization (2)	01:23:23
Some Theory	00:22:59
Complexity	00:16:11
Warning	00:05:22
Garlic	00:21:03
Mariposa	00:00:00