Data Profiling and Data Cleansing (WS 2014/15)

Prof. Dr. Felix Naumann

Data profiling is the set of activities and processes to determine the metadata about a given dataset. Profiling data is an important and frequent activity of any IT professional and researcher.

Data profiling encompasses a vast array of methods to examine datasets and produce metadata. Among the simpler results are statistics, such as the number of null values and distinct values in a column, its data type, or the most frequent patterns of its data values. Metadata that are more difficult to compute usually involve multiple columns, such as inclusion dependencies or functional dependencies between columns. More advanced techniques detect approximate or conditional properties of the dataset at hand. The first part of the lecture examines efficient detection methods for these properties.
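As a rough illustration of the simpler single-column statistics named above, the following Python sketch counts null values and distinct values and derives the most frequent value patterns. The function names (`value_pattern`, `profile_column`), the pattern alphabet (letters become `A`, digits become `9`), and the sample ZIP codes are assumptions made for this example; they are not taken from the lecture.

```python
from collections import Counter
import re


def value_pattern(value: str) -> str:
    """Abstract a value into a pattern: letters become 'A', digits become '9'."""
    pattern = re.sub(r"[A-Za-z]", "A", value)
    return re.sub(r"[0-9]", "9", pattern)


def profile_column(values):
    """Compute basic single-column metadata; None is treated as a null value."""
    non_null = [v for v in values if v is not None]
    patterns = Counter(value_pattern(str(v)) for v in non_null)
    return {
        "rows": len(values),
        "nulls": len(values) - len(non_null),
        "distinct": len(set(non_null)),
        "top_patterns": patterns.most_common(2),
    }


# Hypothetical sample column, used only for illustration.
zip_codes = ["14482", "10115", None, "14482", "D-14482"]
print(profile_column(zip_codes))
```

On the sample column, the single value with the pattern `A-99999` stands out against three values with the pattern `99999`, which is the kind of formatting inconsistency that profiling is meant to surface.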

Data profiling is relevant as a preparatory step to many use cases, such as query optimization, data mining, data integration, and data cleansing.

Many of the insights gained during data profiling point to deficiencies of the data. Profiling reveals data errors, such as inconsistent formatting within a column, missing values, or outliers. Profiling results can also be used to measure and monitor the general quality of a dataset, for instance by determining the number of records that do not conform to previously established constraints. The second part of the lecture examines various methods and algorithms to improve the quality of data, with an emphasis on the many existing duplicate detection approaches.
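To make the idea of counting non-conforming records concrete, here is a minimal Python sketch that checks one constraint, a functional dependency, and reports the records involved in a violation. The function name `fd_violations`, the assumed dependency `zip -> city`, and the sample records are hypothetical and serve only as an illustration; the discovery and checking algorithms covered in the lecture are considerably more efficient.

```python
from collections import defaultdict


def fd_violations(records, lhs, rhs):
    """Return all records whose lhs value is associated with more than one rhs value."""
    groups = defaultdict(list)
    for record in records:
        groups[record[lhs]].append(record)

    violating = []
    for group in groups.values():
        # The dependency lhs -> rhs holds in a group only if rhs is constant there.
        if len({r[rhs] for r in group}) > 1:
            violating.extend(group)
    return violating


# Hypothetical sample data; the constraint zip -> city is an assumption.
records = [
    {"zip": "14482", "city": "Potsdam"},
    {"zip": "14482", "city": "Potsdam"},
    {"zip": "10115", "city": "Berlin"},
    {"zip": "10115", "city": "Bärlin"},
]
bad = fd_violations(records, "zip", "city")
print(len(bad), "of", len(records), "records violate zip -> city")  # 2 of 4
```

Tracking this count over time is one simple way to monitor the quality of a dataset against a known constraint.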

Introduction

Prof. Dr. Felix Naumann

Date: October 13, 2014
Language: German
Duration: 01:29:33

Introduction	01:29:33
Introduction	00:06:53
Introduction to Research Group	00:10:42
Lecture Organization	00:13:26
Big Data	00:14:53
Big and Small	00:19:10
Big Data and Ethics	00:17:56
Profiling	00:06:33

An Introduction to Data Profiling

Prof. Dr. Felix Naumann

Date: October 20, 2014
Language: German
Duration: 01:31:44

An Introduction to Data Profiling	01:31:44
Profiling	00:20:18
Cleansing	00:15:15
Overview of Semester	00:04:57
Profiling Tasks	00:16:28
Uniqueness and Keys	00:19:54
Profiling Tools	00:14:52

Visualization, Next Generation Profiling & Profiling Challenges

Prof. Dr. Felix Naumann

Date: October 23, 2014
Language: German
Duration: 01:24:32

Visualization, Next Generation Profiling & Profiling Challenges	01:24:32
Checking vs. Discovery	00:21:36
Visualization	00:06:53
Next Generation Profiling	00:14:13
Together: Profiling for Integration	00:13:06
Profiling Challenges	00:15:20
Profiling Query Results	00:13:24

Unique Column Combinations

Arvid Heise

Date: October 27, 2014
Language: German
Duration: 01:02:12

Unique Column Combinations	01:02:12
Introduction and Problem Statement	00:14:10
Null Values & General Pruning Techniques	00:07:02
Discovery Algorithms	00:13:24
DUCC & Gordian	00:21:25
Dynamic Data	00:06:11

Detecting Inclusion Dependencies

Prof. Dr. Felix Naumann

Date: November 10, 2014
Language: German
Duration: 01:20:03

Detecting Inclusion Dependencies	01:20:03
Dependencies	00:16:58
Inclusion Dependencies	00:09:50
IND Types	00:20:31
SQL	00:15:14
De Marchi et al.	00:17:30

SPIDER, Foreign Key Extraction & Conditional Inclusion Dependencies

Prof. Dr. Felix Naumann

Date: November 13, 2014
Language: German
Duration: 01:27:04

SPIDER, Foreign Key Extraction & Conditional Inclusion Dependencies	01:27:04
Review	00:04:55
SPIDER	00:15:45
SPIDER by Example	00:15:58
Foreign Key Extraction	00:23:06
Conditional Inclusion Dependencies	00:16:55
Discovering cINDs	00:10:25

The Apriori Algorithm, Discovering cINDs & Detecting Functional Dependencies

Prof. Dr. Felix Naumann

Date: November 24, 2014
Language: German
Duration: 01:24:57

The Apriori Algorithm, Discovering cINDs & Detecting Functional Dependencies	01:24:57
Introduction	00:16:26
Apriori	00:16:24
Challenges of cIND Discovery	00:13:29
Creating Conditions	00:13:12
Detecting Functional Dependencies	00:11:33
FD Discussion	00:13:53

Detecting Functional Dependencies

TANE

Prof. Dr. Felix Naumann

Date: December 1, 2014
Language: German
Duration: 01:28:46

TANE	01:28:46
Naive Discovery Approach	00:05:56
TANE	00:05:31
Candidate Sets	00:21:03
Examples	00:15:47
Pruning Algorithm	00:14:22
Pruning	00:26:07

Dependency Checking, Approximate FDs, FD_Mine and DFD

Prof. Dr. Felix Naumann

Date: December 11, 2014
Language: German
Duration: 01:29:33

Dependency Checking, Approximate FDs, FD_Mine and DFD	01:29:33
Review	00:03:35
Dependency Checking	00:14:21
Stripped Partitions	00:13:41
Computing Partitions	00:27:10
Approximate FDs	00:09:45
FD_Mine and DFD	00:21:01

Conditional Uniques & IND Detection at Scale

Discovery of Conditional Unique Column Combination

Jens Ehrlich

Date: December 4, 2014
Language: German
Duration: 00:24:04

Discovery of Conditional Unique Column Combination	00:24:04
Definition & Motivation	00:08:06
DoCU Algorithm	00:14:49
Benchmarks	00:01:09

IND Detection on very many Tables

Fabian Tschirschnitz

Date: December 4, 2014
Language: German
Duration: 00:41:02

IND Detection on very many Tables	00:41:02
Introduction	00:05:16
The Web Table	00:03:34
Bloom Filter	00:16:51
Filter	00:07:25
Visualization	00:07:56

Data Quality and Data Cleansing

Prof. Dr. Felix Naumann

Date: December 15, 2014
Language: German
Duration: 01:21:07

Data Quality and Data Cleansing	01:21:07
Information Quality	00:13:19
Classification of Errors	00:18:18
IQ Criteria	00:16:10
IQ Assessment	00:03:54
Cleansing Tasks	00:06:29
IQ Anecdotes	00:22:57

Duplicate Detection

Prof. Dr. Felix Naumann

Date: December 18, 2014
Language: German
Duration: 01:30:07

Duplicate Detection	01:30:07
Duplicate Detection	00:16:13
Motivation	00:15:09
Similarity Measures	00:17:09
Algorithms	00:17:13
Data Sets and Evaluation	00:24:23

Similarity Measures

Prof. Dr. Felix Naumann

Date: January 5, 2015
Language: German
Duration: 01:29:06

Similarity Measures	01:29:06
Introduction	00:20:25
Levenshtein Distance	00:26:41
Jaro & Winkler Similarity	00:16:38
Token-based	00:12:17
Phonetic	00:13:05

Similarity Measures & Generic Entity Resolution with Swoosh

Prof. Dr. Felix Naumann

Date: January 8, 2015
Language: German
Duration: 01:26:54

Similarity Measures & Generic Entity Resolution with Swoosh	01:26:54
Hybrid	00:14:45
Extended Jaccard Similarity	00:19:20
SoftTFIDF	00:05:08
Domain-dependent	00:15:34
Generic Entity Resolution with Swoosh	00:21:38
Domination	00:10:29

Sorted Neighborhood Methods

Prof. Dr. Felix Naumann

Date: January 15, 2015
Language: German
Duration: 01:25:58

Sorted Neighborhood Methods	01:25:58
The Original	00:13:29
SNM - Example	00:19:08
Sorted Neighborhood - Multipass Approach	00:17:10
Unique Sorting Keys	00:10:17
Adaptive SNM Part 1	00:14:02
Adaptive SNM Part 2	00:11:52

Sorted Neighborhood Methods & Generic Entity Resolution with Swoosh

Prof. Dr. Felix Naumann

Date: January 19, 2015
Language: German
Duration: 01:25:48

Sorted Neighborhood Methods & Generic Entity Resolution with Swoosh	01:25:48
Adaptive SNM Part 2	00:21:10
Results Cora: Comparisons	00:10:35
Sorted Blocks	00:08:44
Domain-independent SNM	00:18:52
Generic Entity Resolution with Swoosh	00:20:43
Naive Algorithms	00:05:44

Generic Entity Resolution with Swoosh

Prof. Dr. Felix Naumann

Date: January 26, 2015
Language: German
Duration: 00:44:04

Generic Entity Resolution with Swoosh	00:44:04
Naive Algorithms	00:12:18
R-Swoosh	00:12:57
F-Swoosh	00:15:21
Further Swoosh Variants	00:03:28

Profiling Linked Data

Anja Jentzsch

Date: January 29, 2015
Language: English
Duration: 01:13:08

Profiling Linked Data	01:13:08
Introduction to Linked Data	00:23:56
Profiling Linked Data	00:08:56
ProLOD++	00:15:42
Uniqueness, Density and Keyness	00:10:52
Multi-Query Optimization for Linked Data Profiling Queries	00:13:42