Web Science & Technologies University of Koblenz Landau, Germany. Lecture Data Science. Statistics and Probabilities JProf. Dr.


Web Science & Technologies University of Koblenz Landau, Germany Lecture Data Science Statistics and Probabilities JProf. Dr. Claudia Wagner

Data Science Open Position @GESIS: Student assistant job in data science at GESIS in Cologne. Requirements: good programming skills in Python, some data mining experience, and the ability to work on a Unix server. Payment: 11,64 EUR per hour; 8-20 hours per week are possible. If interested, send me an email with CV and transcript of records.

Exam: WED 6.2., 6pm, E011; WED 27.3., 2pm, D018.

Science: Science is an evolutionary process which possibly allows us to gain knowledge about the world. Let's assume we have a question: Do first babies arrive later than other babies? How can we answer it scientifically? Collect data (e.g., via a survey), then analyze the data using statistics. (Chapter 1, Think Stats)

Science: We test our hypothesis by creating a null hypothesis which would falsify our hypothesis if it were true. Then we try to reject the null hypothesis (with a certain probability). If our hypothesis is true, we will be able to reject the null hypothesis in most experiments. Example: My hypothesis: first babies arrive later than other babies. Null hypothesis: first babies arrive at the same time as or earlier than other babies.

Where do hypotheses come from? Theory (deduction) and observations (induction).


WHAT SKILLS DO DATA SCIENTISTS NEED?

Data Science

Definition: "A data scientist is someone who knows more statistics than a computer scientist and more computer science than a statistician." (Josh Blumenstock) Skills needed: statistics, machine learning, the ability to handle big data; scientific curiosity and methodology, storytelling, creativity, visualization skills, and so on.

What will you learn? Statistics & Probability Theory; Descriptive Statistics; Probability Theory; Bayesian versus Frequentist thinking; Statistical Inference; Causal Inference; Probabilistic Graphical Models; Data Collection Methods; Visualizing Data, Interpretations and Data Storytelling.

Last Time: Compare pregnancy length for first babies and later babies.

Mode: Applies already to nominal data, so it can be used for all types of data. The mode is the value that appears most often in a set of data. What is the mode of X = [17, 19, 20, 21, 22, 23, 23, 23, 23]?

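Mean (expected value): Applies to interval and ratio scales. The mean is the sum of the values divided by their number, $\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$. Example: X = [17, 19, 20, 21, 22, 23, 23, 23, 23, 25] has mean 216/10 = 21.6.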

Median: For X = [17, 19, 20, 21, 22, 23, 23, 23, 23, 25] the median is 22.5 (the average of the two middle values). For X = [17, 19, 20, 21, 22, 23, 23, 23, 23] the median is 22 (the middle value). The median is useful for skewed distributions, where the mean can be misleading. Applies to ordinal, interval and ratio scales.
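The three measures of central tendency can be checked on the slide data with a short sketch. It uses Python's standard `statistics` module; the variable names are chosen here for illustration only.

```python
from statistics import mean, median, mode

# Data sets taken from the slides.
x_odd = [17, 19, 20, 21, 22, 23, 23, 23, 23]        # 9 values
x_even = [17, 19, 20, 21, 22, 23, 23, 23, 23, 25]   # 10 values

mode_x = mode(x_odd)          # most frequent value
mean_x = mean(x_even)         # sum of the values divided by their number
median_even = median(x_even)  # average of the two middle values
median_odd = median(x_odd)    # the single middle value

print(mode_x, mean_x, median_even, median_odd)
```

With an even number of values, `median` averages the two middle values, which is why the ten-value list gives a median that is not itself in the data.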

Variance and Standard Deviation: The variance is the mean squared deviation of the values from their mean. The standard deviation is just the square root of the variance.
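A minimal sketch of the computation follows. The transcript does not show which normalisation the slide used, so both common variants are computed: dividing by n (population variance) and dividing by n - 1 (sample variance). The data set is the one from the mean and median slides.

```python
from math import sqrt

x = [17, 19, 20, 21, 22, 23, 23, 23, 23, 25]
n = len(x)
mu = sum(x) / n

# Sum of squared deviations from the mean.
squared_dev = sum((v - mu) ** 2 for v in x)

pop_var = squared_dev / n           # population variance (divide by n)
sample_var = squared_dev / (n - 1)  # sample variance (divide by n - 1)
pop_std = sqrt(pop_var)             # standard deviation = square root of variance

print(pop_var, sample_var, pop_std)
```

The two variants differ noticeably for small samples and become almost identical as n grows.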

Mode, median, and mean of two log-normal distributions; https://en.wikipedia.org/wiki/file:comparison_mean_median_mode.svg

Probability Mass Function (PMF): Transform absolute frequencies into normalized ones (probabilities) by dividing each count by the total number of observations.
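As an illustration, an empirical PMF can be built by counting each value and dividing by the sample size. The helper name `pmf` is hypothetical and the data is the list used on the earlier slides.

```python
from collections import Counter

def pmf(values):
    """Map each distinct value to its relative frequency in `values`."""
    counts = Counter(values)
    n = len(values)
    return {value: count / n for value, count in sorted(counts.items())}

x = [17, 19, 20, 21, 22, 23, 23, 23, 23, 25]
p = pmf(x)

for value, prob in p.items():
    print(value, prob)
```

The probabilities sum to 1 by construction, since every observation contributes exactly 1/n.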

Limits of PMF: PMFs work well if the number of values is small. As the number of values increases, the probability associated with each value gets smaller and the effect of random noise increases. (Chapter 3, Think Stats)

Solutions: Choose different visualization techniques: boxplot, Cumulative Distribution Function (CDF).

Percentile

Boxplots: IQR = Q3 − Q1. Outliers are usually 3 IQR or more above the third quartile or 3 IQR or more below the first quartile. (Another widely used convention, due to Tukey, already flags values more than 1.5 IQR beyond the quartiles.) Image: http://www.gs.washington.edu/academics/courses/akey/56008/lecture/lecture2.pdf
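The quartile-based outlier rule can be sketched as follows. Two things here are assumptions: the quartiles are computed with the "inclusive" method of `statistics.quantiles` (several quartile definitions exist and they can give slightly different values), and the data set is the earlier example list with one extreme value, 60, appended for illustration. The multiplier `k` defaults to the 3 IQR stated on the slide.

```python
from statistics import quantiles

def iqr_outliers(values, k=3.0):
    """Return (q1, q3, iqr, outliers) using fences at k * IQR beyond the quartiles."""
    q1, _median, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower_fence = q1 - k * iqr
    upper_fence = q3 + k * iqr
    outliers = [v for v in values if v <= lower_fence or v >= upper_fence]
    return q1, q3, iqr, outliers

x = [17, 19, 20, 21, 22, 23, 23, 23, 23, 25, 60]  # 60 added as an artificial outlier
q1, q3, iqr, outliers = iqr_outliers(x)
print(q1, q3, iqr, outliers)
```

Calling `iqr_outliers(x, k=1.5)` applies the stricter 1.5 IQR fences instead.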

Cumulative Distribution Function (CDF): To find CDF(x) for a particular value of x, we compute the fraction of the values in the sample that are less than (or equal to) x. (Chapter 3, Think Stats)

CDF: Why are CDFs useful? We overcome the binning issue by grouping all values equal to or lower than x. We can easily answer the following questions: What is the probability of observing a value of x or lower? Given a probability p, what is the corresponding value x (that is, the inverse CDF of p)?
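Both questions can be answered with a small sketch of an empirical CDF and its inverse. The function names are hypothetical, and the inverse is defined here as the smallest sample value whose CDF is at least p, which is one of several possible percentile conventions.

```python
from bisect import bisect_right
from math import ceil

def ecdf(sample, x):
    """Fraction of sample values that are less than or equal to x."""
    s = sorted(sample)
    return bisect_right(s, x) / len(s)

def inverse_ecdf(sample, p):
    """Smallest sample value x with ecdf(sample, x) >= p, for 0 < p <= 1."""
    s = sorted(sample)
    index = max(ceil(p * len(s)) - 1, 0)
    return s[index]

x = [17, 19, 20, 21, 22, 23, 23, 23, 23, 25]
print(ecdf(x, 22))           # probability of observing 22 or lower
print(inverse_ecdf(x, 0.5))  # value at the 50th percentile
```

Because no bins are involved, the result does not depend on a choice of bin width, unlike a histogram or a binned PMF.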

Shape of Distribution: Skewness quantifies how asymmetrical a distribution is. A symmetrical distribution has a skewness of zero. Negative values for the skewness indicate data that is skewed left. Positive values for the skewness indicate data that is skewed right. Figure panels: skewness < 0 (left skew), skewness = 0, skewness > 0 (right skew).

Kurtosis: Kurtosis quantifies how peaked a distribution is compared to a normal distribution. A normal distribution has a kurtosis of 3, so 3 is subtracted to obtain the excess kurtosis, which is 0 for a normal distribution. A flatter distribution has a negative excess kurtosis; a distribution more peaked than a normal distribution has a positive excess kurtosis. Figure panels: kurtosis < 0, kurtosis = 0, kurtosis > 0.
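Skewness and kurtosis can be computed from central moments, as in the sketch below. It assumes the moment-based (population) definitions, that is, the third and fourth central moments divided by the appropriate power of the variance; statistical packages sometimes apply small-sample corrections that give slightly different numbers. The two example lists are made up for illustration.

```python
def central_moment(x, k):
    """k-th central moment: mean of (value - mean) ** k."""
    mu = sum(x) / len(x)
    return sum((v - mu) ** k for v in x) / len(x)

def skewness(x):
    """Third central moment divided by variance ** 1.5."""
    return central_moment(x, 3) / central_moment(x, 2) ** 1.5

def excess_kurtosis(x):
    """Fourth central moment divided by variance ** 2, minus 3."""
    return central_moment(x, 4) / central_moment(x, 2) ** 2 - 3

symmetric = [1, 2, 3, 4, 5]        # symmetric around 3
right_skewed = [1, 1, 1, 2, 10]    # long tail to the right

print(skewness(symmetric), excess_kurtosis(symmetric))
print(skewness(right_skewed), excess_kurtosis(right_skewed))
```

The evenly spread symmetric list has zero skewness and a negative excess kurtosis, matching the "flatter than normal" case.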

Normal Distribution: More peaked than a normal distribution! Positive kurtosis!

Normal Distribution: Flatter than a normal distribution! Negative kurtosis! Slight left skew! But almost normal.

Normal Distribution: Flatter than a normal distribution! Negative kurtosis! Left skew! But almost normal.

Normal Distribution: Flatter than a normal distribution! Negative kurtosis! Slight right skew! Positive skewness!

Statistics. Source: https://www.autodeskresearch.com/publications/samestats

Statistics (diagram labels: Simulation, Population, Probability, Sample, Descriptive Statistics, Inference). The sample mean is called a sample statistic; the population mean is called a parameter. Inference: find a good estimator.

Most of the time we do not know the parameter of the true distribution that generated our sample data. 1. But we can estimate the parameter from the observed sample data (inference!). If we observe 5 heads in 6 coin tosses, what was the parameter p of the coin? What is our best guess for p? How uncertain are we? 2. And we can test hypotheses about the parameter. If we observe 5 heads in 6 coin tosses, what is the probability that the coin was fair?
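One way to make "best guess for p" concrete is maximum likelihood estimation, sketched below as an illustration; the slides do not state which estimator the lecture used. The likelihood of 5 heads in 6 tosses is evaluated on a grid of candidate values for p, and the candidate with the highest likelihood is selected.

```python
from math import comb

def likelihood(p, k=5, n=6):
    """Probability of observing k heads in n tosses if P(head) = p."""
    return comb(n, k) * p ** k * (1 - p) ** (n - k)

# Candidate values 0.000, 0.001, ..., 1.000 (grid resolution is an arbitrary choice).
grid = [i / 1000 for i in range(1001)]
p_hat = max(grid, key=likelihood)

print(p_hat)             # close to 5/6
print(likelihood(0.5))   # likelihood of the data under a fair coin
```

The maximum lies at the observed fraction of heads, 5/6, up to the resolution of the grid.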

PARAMETER ESTIMATION

I flip a coin twice. What will come up next?

I flip a coin 100 times. What will come up next?

I flip a coin 100 times: 52 heads and 48 tails. What will come up next? P(head) = 52/100, but confidence is low.

Confidence: Confidence in our parameter estimates depends upon two things: the size of the sample (e.g., 100 versus 2) and the variance of the sample (e.g., all heads versus 52 heads). As the variance grows, we need larger samples to have the same degree of confidence.
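The dependence on sample size and variance can be illustrated with the standard error of a sample proportion, sqrt(p(1 - p) / n). This formula is a standard result brought in here for illustration; it does not appear in the transcript.

```python
from math import sqrt

def standard_error(p_hat, n):
    """Standard error of an estimated proportion p_hat from n observations."""
    return sqrt(p_hat * (1 - p_hat) / n)

se_2 = standard_error(0.5, 2)       # 1 head in 2 flips
se_100 = standard_error(0.52, 100)  # 52 heads in 100 flips

print(se_2, se_100)
```

The numerator p(1 - p) is the variance of a single flip and the denominator is the sample size, so the uncertainty shrinks with more flips and grows with more variable outcomes.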

Law of Large Numbers: In repeated independent tests with the same actual probability p of a particular outcome in each test, the chance that the fraction of times that outcome occurs differs from p by more than any given margin converges to zero as the number of trials goes to infinity.

Law of Large Numbers: In other words, the average of the results obtained from a large number of trials should be close to the expected value, and will tend to become closer as more trials are performed. https://en.wikipedia.org/wiki/law_of_large_numbers
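A small simulation can illustrate the law of large numbers. The sketch assumes a fair coin (p = 0.5) and a fixed random seed so that the run is reproducible; both choices are arbitrary.

```python
import random

def running_fraction(p, n_trials, seed=0):
    """Fraction of heads after each of n_trials simulated flips with P(head) = p."""
    rng = random.Random(seed)
    heads = 0
    fractions = []
    for i in range(1, n_trials + 1):
        heads += rng.random() < p   # True counts as 1, False as 0
        fractions.append(heads / i)
    return fractions

fractions = running_fraction(0.5, 100_000)
for n in (10, 100, 1_000, 10_000, 100_000):
    print(n, fractions[n - 1])
```

The printed fractions typically fluctuate strongly for small n and settle near 0.5 as n grows.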

Parameter Estimation: Based on empirical observations (trials, experiments): observe the outcome of coin flips, observe survey data. Based on simulations (Monte Carlo simulation): simulate the data generation process (e.g., flip coins, spin a roulette wheel, roll a die). Point estimates and confidence intervals.
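As a rough sketch of the simulation route, the code below repeats a 100-flip experiment many times and looks at how much the estimated fraction of heads varies. It assumes the true probability equals the point estimate 0.52 from the earlier coin example, and it reads an approximate 95% interval off the simulated estimates; this is one simple simulation-based approach, not necessarily the procedure used in the lecture.

```python
import random

def simulate_head_fractions(p, n_flips, n_experiments, seed=1):
    """Fraction of heads in each of n_experiments simulated runs of n_flips flips."""
    rng = random.Random(seed)
    return [
        sum(rng.random() < p for _ in range(n_flips)) / n_flips
        for _ in range(n_experiments)
    ]

fractions = sorted(simulate_head_fractions(0.52, n_flips=100, n_experiments=2000))

# Cut off 2.5% of the simulated estimates at each end.
lower = fractions[int(0.025 * len(fractions))]
upper = fractions[int(0.975 * len(fractions))]

print(lower, upper)
```

The width of the resulting interval shows how uncertain a point estimate from 100 flips is.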

HYPOTHESIS TESTING

Hypothesis Testing: Example: my hypothesis is that the coin is unfair (p ≠ 0.5). We create a null hypothesis which would falsify our hypothesis if it were true. H0: p = 0.5. Can I reject H0?

Hypothesis testing: When can we reject H0? The number of successes in n independent trials of a Bernoulli random variable follows a binomial distribution with parameters n and p.
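One common way to decide whether H0 can be rejected is an exact two-sided binomial test, sketched below. The test procedure and the function names are assumptions for illustration; the transcript does not say which test the lecture used. The p-value adds up the probabilities, under H0, of all outcomes that are no more likely than the observed one.

```python
from math import comb

def binom_pmf(k, n, p):
    """Probability of exactly k successes in n independent trials."""
    return comb(n, k) * p ** k * (1 - p) ** (n - k)

def two_sided_p_value(k_obs, n, p0=0.5):
    """Exact two-sided p-value for k_obs successes in n trials under H0: p = p0."""
    p_obs = binom_pmf(k_obs, n, p0)
    tolerance = 1 + 1e-9  # guards against floating-point ties
    return sum(
        binom_pmf(k, n, p0)
        for k in range(n + 1)
        if binom_pmf(k, n, p0) <= p_obs * tolerance
    )

print(two_sided_p_value(5, 6))      # 5 heads in 6 tosses
print(two_sided_p_value(52, 100))   # 52 heads in 100 tosses
```

H0 is rejected when the p-value falls below a chosen significance level such as 0.05; neither example reaches that threshold.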

Bernoulli Random Variable: Binary outcome. Probability mass function: P(X = 1) = p and P(X = 0) = 1 − p. The Bernoulli distribution has only one parameter, p (here p = 0.6).
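The Bernoulli PMF is small enough to write out directly. In this sketch the outcome is coded as 1 for success and 0 for failure, which is a common convention assumed here.

```python
def bernoulli_pmf(x, p):
    """P(X = x) for a Bernoulli random variable with success probability p."""
    if x not in (0, 1):
        raise ValueError("a Bernoulli variable only takes the values 0 and 1")
    return p if x == 1 else 1 - p

p = 0.6  # parameter value from the slide
print(bernoulli_pmf(1, p), bernoulli_pmf(0, p))
```

The two probabilities always sum to 1, so the single parameter p fully determines the distribution.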

The Bernoulli distribution with parameter p describes the probability distribution of a binary random variable (e.g., success/failure, yes/no, head/tail, red/not-red). The binomial distribution with parameters n and p is the discrete probability distribution of the number of successes in a sequence of n independent Bernoulli experiments (i.e., binary outcomes).

Single experiment: toss a coin multiple times. Repeat the experiment n times. The PMF of the binomial distribution gives the probability of k successes in n trials: $P(X = k) = \binom{n}{k} p^k (1-p)^{n-k}$. What is the probability of observing 3 heads when we toss a fair coin 4 times?

Binomial Coefficient: The number of ways to choose an (unordered) subset of k elements from a set of n elements, $\binom{n}{k} = \frac{n!}{k!\,(n-k)!}$. Number of outcomes that give 3 heads = 4!/(3! · 1!) = 4, so the probability is 4/16 = 0.25.
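The count of 4 can be verified in two ways: with the factorial formula and by listing every way to pick which 3 of the 4 tosses show heads. The variable names are illustrative.

```python
from itertools import combinations
from math import comb, factorial

n, k = 4, 3

# Factorial formula n! / (k! * (n - k)!).
by_formula = factorial(n) // (factorial(k) * factorial(n - k))

# Explicit enumeration: which toss positions (0..3) show heads.
positions = list(combinations(range(n), k))

print(by_formula, comb(n, k), len(positions))
print(positions)
```

Each tuple in `positions` corresponds to one sequence with exactly 3 heads, for example (0, 1, 2) is HHHT.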

Example: Probability of observing 3 heads when we toss a fair coin 4 times: $\frac{4!}{3!\,1!} \cdot 0.5^3 \cdot 0.5^1 = 0.25$.
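The same calculation generalises to any number of heads. The sketch below evaluates the binomial PMF for every possible count in 4 tosses of a fair coin; the function name is chosen for illustration.

```python
from math import comb

def binomial_pmf(k, n, p):
    """Probability of exactly k successes in n independent trials with success probability p."""
    return comb(n, k) * p ** k * (1 - p) ** (n - k)

n, p = 4, 0.5
distribution = {k: binomial_pmf(k, n, p) for k in range(n + 1)}

for k, prob in distribution.items():
    print(k, prob)
```

The entry for k = 3 reproduces the 0.25 from the example, and the five probabilities sum to 1.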

What is the probability of observing 3 heads when we toss the coin 4 times? Probability = (number of favorable outcomes) / (number of all outcomes).

Example: One experiment: toss one coin 4 times (n = 4). The coin shows either head H or tail T. Number of all possible outcomes (with order)? 2^4 = 16.

Discrete Random Variable X: What is the probability of observing 3 heads? Probability = (number of favorable outcomes) / (number of all outcomes). Probability of observing 3 heads: 4/16 = 0.25.
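The counting argument can be reproduced by brute force: list all ordered outcomes of 4 tosses and count those with exactly 3 heads. This assumes a fair coin, so that all 16 sequences are equally likely.

```python
from itertools import product

# All ordered sequences of H and T of length 4.
outcomes = list(product("HT", repeat=4))

# Sequences with exactly 3 heads.
favorable = [outcome for outcome in outcomes if outcome.count("H") == 3]

probability = len(favorable) / len(outcomes)
print(len(favorable), len(outcomes), probability)
```

For an unfair coin the sequences are no longer equally likely, and the simple ratio of counts has to be replaced by the binomial PMF.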

QUESTIONS