Distributed Data Analytics (WT 2017/18)

Dr. Thorsten Papenbrock

The free lunch is over! Until the turn of the century, computer systems became steadily faster without any particular effort, simply because the hardware they ran on increased its clock speed with every new release. This trend has ended, and today's CPUs stall at around 3 GHz. The size of modern computer systems in terms of processing elements (cores in CPUs/GPUs, CPUs/GPUs in compute nodes, compute nodes in clusters), however, still increases constantly. This has caused a paradigm shift in writing software: instead of optimizing code for a single thread, applications now need to solve their tasks in parallel to achieve noticeable performance gains. Distributed computing, i.e., the distribution of work across (potentially) physically isolated compute nodes, is the most extreme method of parallelization.

Big Data Analytics is a multi-million dollar market that grows constantly! Data, and the ability to control and use it, is the most valuable asset of today's computer systems. Because data volumes grow so rapidly, and with them the complexity of the questions they should answer, data analytics, i.e., the ability to extract any kind of information from the data, becomes increasingly difficult. As data analytics systems cannot count on faster hardware to solve their performance problems, they need to embrace new software trends that let their performance scale with the still increasing number of processing elements.

In this lecture, we take a look at various technologies involved in building distributed, data-intensive systems. We discuss theoretical concepts (data models, encoding, replication, ...) as well as some of their practical implementations (Akka, MapReduce, Spark, ...). Because workload distribution is useful for many kinds of applications, we focus in particular on data analytics.

Introduction

Introduction & Foundations

Dr. Thorsten Papenbrock

Date: October 18, 2017
Language: English
Duration: 01:29:32

Introduction & Foundations	01:29:32
Introduction	00:11:15
Distributed Data Analytics	00:16:03
Motivation: Data Analytics	00:23:13
Foundations	00:01:21
Big Data	00:19:32
Data-Intensive Applications	00:18:08

Data Models and Query Languages

Foundations & Data Models and Query Languages

Dr. Thorsten Papenbrock

Date: October 25, 2017
Language: English
Duration: 01:31:23

Foundations & Data Models and Query Languages	01:31:23
Consistency Models	00:11:45
OLAP and OLTP	00:04:32
Distributed Computing	00:16:07
Data Models and Query Languages	00:06:57
The Relational Data Model	00:20:06
The Key-Value Data Model	00:11:30
The Column-Family Data Model	00:20:26

The Document Data Model & The Graph Data Model

Dr. Thorsten Papenbrock

Date: November 1, 2017
Language: English
Duration: 01:32:00

The Document Data Model & The Graph Data Model	01:32:00
The Document Data Model	00:09:14
Querying: MongoDB API	00:18:26
The Graph Data Model	00:11:26
Querying: Cypher	00:22:25
Triple-Stores	00:13:39
Querying: SPARQL	00:16:50

Storage and Retrieval

Dr. Thorsten Papenbrock

Date: November 8, 2017
Language: English
Duration: 01:06:08

Storage and Retrieval	01:06:08
Index Structures	00:07:16
Hash Indexes & SSTables and LSM-Trees	00:20:14
B-Trees & Further Indexes	00:12:17
Online Analytical Processing	00:00:46
Data Warehousing & Star- and Snowflake Schemata	00:06:45
Column-Oriented Storage & Data Cubes and Materialized Views	00:18:50

Encoding and Evolution

Formats for Encoding Data & Models of Dataflow

Dr. Thorsten Papenbrock

Date: November 15, 2017
Language: English
Duration: 01:32:38

Formats for Encoding Data & Models of Dataflow	01:32:38
Formats for Encoding Data	00:02:42
Language-Specific Formats	00:07:52
JSON, XML and Binary Variants	00:06:09
Thrift and Protocol Buffers	00:08:15
Avro	00:06:57
Models of Dataflow	00:03:40
Dataflow Through Databases	00:03:13
Dataflow Through Services	00:39:16
Message-Passing Dataflow	00:14:34

Akka Actor Programming

Dr. Thorsten Papenbrock

Date: November 22, 2017
Language: English
Duration: 01:23:26

Akka Actor Programming	01:23:26
Actor Programming	00:07:05
Akka	00:13:39
Actor Hierarchies	00:12:04
Actor Systems	00:15:13
Application Shutdown	00:23:58
Live Demo	00:11:27

Replication

Dr. Thorsten Papenbrock

Date: November 29, 2017
Language: English
Duration: 01:23:21

Replication	01:23:21
Distributing Data	00:10:08
Single-Leader Replication	00:26:20
Multi-Leader Replication	00:08:30
Leaderless Replication	00:38:23

Partitioning & Transactions

Dr. Thorsten Papenbrock

Date: December 6, 2017
Language: German
Duration: 01:31:12

Partitioning & Transactions	01:31:12
Distributing Data	00:08:34
Partitioning of Key-Value Data	00:20:32
Partitioning and Secondary Indexes	00:08:14
Rebalancing Partitions	00:15:59
Request Routing	00:19:16
Transactions	00:18:37

Distributed Systems

Dr. Thorsten Papenbrock

Date: December 13, 2017
Language: English
Duration: 01:34:50

Distributed Systems	01:34:50
Introduction	00:09:04
Unreliable Networks	00:42:47
Unreliable Clocks	00:34:53
Knowledge, Truth and Lies	00:08:06

Consistency and Consensus

Dr. Thorsten Papenbrock

Date: December 20, 2017
Language: English
Duration: 01:35:38

Consistency and Consensus	01:35:38
Introduction	00:09:04
Linearizability	00:14:47
Ordering Guarantees	00:24:28
Consensus	00:47:19

Batch Processing

Dr. Thorsten Papenbrock

Date: January 10, 2018
Language: English
Duration: 01:39:08

Batch Processing	01:39:08
Introduction	00:08:31
Batch Processing with Unix Tools	00:07:37
Distributed File Systems	00:18:11
MapReduce	00:35:30
Beyond MapReduce	00:29:19