Web Science & Technologies, University of Koblenz Landau, Germany
Lecture Data Science: Statistics and Probabilities
JProf. Dr. Claudia Wagner

Data Science Open Position @GESIS
Student assistant job in data science at GESIS in Cologne.
Requirements: good programming skills in Python; some data mining experience; ability to work on a Unix server.
Payment: 11,64 EUR per hour; 8-20 hours per week are possible.
If interested, send me an email with your CV and transcript of records.

Exam
Wed 6.2., 6pm, E011
Wed 27.3., 2pm, D018

Science
Science is an evolutionary process that may allow us to gain knowledge about the world.
Let's assume we have a question: Do first babies arrive later than other babies?
How can we answer it scientifically?
Collect data, e.g. via a survey.
Analyze the data using statistics.
Chapter 1, Think Stats

Science
We test our hypothesis by formulating a null hypothesis that, if true, would falsify our hypothesis. Then we try to reject the null hypothesis (with a certain probability). If our hypothesis is true, we will be able to reject the null hypothesis in most experiments.
Example
My hypothesis: First babies arrive later than other babies.
Null hypothesis: First babies arrive at the same time as or earlier than other babies.

Where do hypotheses come from?
Theory (deduction)
Observations (induction)

WHAT SKILLS DO DATA SCIENTISTS NEED?

Data Science

Definition
"A data scientist is someone who knows more statistics than a computer scientist and more computer science than a statistician." - Josh Blumenstock
Skills needed
Statistics, machine learning, ability to handle big data
Scientific curiosity and methodology, storytelling, creativity, visualization skills, and so on

What will you learn?
Statistics & Probability Theory
Descriptive Statistics
Probability Theory
Bayesian versus Frequentist thinking
Statistical Inference
Causal Inference
Probabilistic Graphical Models
Data Collection Methods
Visualizing Data, Interpretations and Data Storytelling

Last Time
Compare pregnancy length for first babies and later babies

Mode
The mode is the value that appears most often in a set of data.
It can be used for all types of data, including nominal data.
What is the mode of X = [17, 19, 20, 21, 22, 23, 23, 23, 23]?
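A minimal Python sketch of the mode, assuming we count occurrences with `collections.Counter`. The helper name `modes` is hypothetical; it returns a list because a sample can have more than one mode.

```python
from collections import Counter

def modes(values):
    """Return all values that occur most often in the sample."""
    counts = Counter(values)
    highest = max(counts.values())
    return [value for value, count in counts.items() if count == highest]

X = [17, 19, 20, 21, 22, 23, 23, 23, 23]
print(modes(X))                        # [23]
print(modes(["red", "blue", "red"]))   # ['red'] (works for nominal data too)
```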

Mean (expected value)
Applies to interval and ratio scales:
$\bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i$
Example: X = [17, 19, 20, 21, 22, 23, 23, 23, 23, 25] has mean 216/10 = 21.6
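The arithmetic mean of the example can be checked with a short Python sketch (the function name `mean` is our own choice; the standard library's `statistics.mean` gives the same result):

```python
def mean(values):
    """Arithmetic mean: sum of the values divided by their count."""
    return sum(values) / len(values)

X = [17, 19, 20, 21, 22, 23, 23, 23, 23, 25]
print(mean(X))  # 21.6
```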

Median
X = [17, 19, 20, 21, 22, 23, 23, 23, 23, 25]: the median of X is 22.5
X = [17, 19, 20, 21, 22, 23, 23, 23, 23]: the median of X is 22
The median is useful for skewed distributions, where the mean can be misleading.
Applies to ordinal, interval and ratio scales.
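A small sketch of the median in Python, assuming the usual convention of averaging the two middle values when the sample size is even. The last lines add one hypothetical extreme value (1000) to illustrate why the median is preferred for skewed data.

```python
def median(values):
    """Middle value of the sorted sample (average of the two middle values if n is even)."""
    ordered = sorted(values)
    n = len(ordered)
    middle = n // 2
    if n % 2 == 1:
        return ordered[middle]
    return (ordered[middle - 1] + ordered[middle]) / 2

print(median([17, 19, 20, 21, 22, 23, 23, 23, 23, 25]))  # 22.5
print(median([17, 19, 20, 21, 22, 23, 23, 23, 23]))      # 22

# One extreme value moves the mean a lot but leaves the median unchanged.
skewed = [17, 19, 20, 21, 22, 23, 23, 23, 23, 1000]
print(median(skewed))              # 22.5
print(sum(skewed) / len(skewed))   # 119.1
```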

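Variance and Standard Deviation
Variance is the mean squared deviation from the mean: $\sigma^2 = \frac{1}{n} \sum_{i=1}^{n} (x_i - \mu)^2$
The standard deviation is the square root of the variance: $\sigma = \sqrt{\sigma^2}$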
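A sketch of both quantities in Python. It assumes the population form of the variance (dividing by n); the sample estimate divides by n - 1 instead, which is what `statistics.variance` computes. The function names are our own.

```python
import math

def variance(values):
    """Population variance: mean squared deviation from the mean."""
    mu = sum(values) / len(values)
    return sum((x - mu) ** 2 for x in values) / len(values)

def standard_deviation(values):
    """Square root of the variance."""
    return math.sqrt(variance(values))

X = [17, 19, 20, 21, 22, 23, 23, 23, 23, 25]
print(variance(X))            # approximately 5.04
print(standard_deviation(X))  # approximately 2.245
```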

Mode, median, mean
Figure: mode, median and mean of two log-normal distributions. Source: https://en.wikipedia.org/wiki/file:comparison_mean_median_mode.svg

Probability Mass Function (PMF)
Transform absolute frequencies into normalized ones (probabilities) by dividing each frequency by the sample size.
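The normalization step can be sketched in Python as follows. This is a minimal illustration using a plain dictionary, not the `Pmf` class from Think Stats; the function name `pmf` is our own.

```python
from collections import Counter

def pmf(values):
    """Map each distinct value to its relative frequency in the sample."""
    counts = Counter(values)
    n = len(values)
    return {value: count / n for value, count in counts.items()}

X = [17, 19, 20, 21, 22, 23, 23, 23, 23, 25]
probabilities = pmf(X)
print(probabilities[23])  # 0.4
print(probabilities[17])  # 0.1
```

The probabilities of all distinct values sum to 1.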

Limits of PMF
PMFs work well if the number of values is small. As the number of values increases, the probability associated with each value gets smaller and the effect of random noise increases.
Chapter 3, Think Stats

Solutions
Choose different visualization techniques:
Boxplot
Cumulative Distribution Function (CDF)

Percentile

Boxplots
$\mathrm{IQR} = Q_3 - Q_1$
Outliers are usually values 3 IQR or more above the third quartile or 3 IQR or more below the first quartile.
Image: http://www.gs.washington.edu/academics/courses/akey/56008/lecture/lecture2.pdf
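A sketch of the outlier rule in Python. Two assumptions: quartiles are taken from `statistics.quantiles` with its default method (other quartile definitions give slightly different values), and the multiplier `k` is a parameter, set to 3 as on the slide (1.5 is another common choice). The sample, including the value 60, is made up for illustration.

```python
import statistics

def outlier_fences(values, k=3.0):
    """Return (lower, upper) fences: Q1 - k * IQR and Q3 + k * IQR."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

def find_outliers(values, k=3.0):
    """Values lying k IQR or more beyond the first or third quartile."""
    low, high = outlier_fences(values, k)
    return [x for x in values if x <= low or x >= high]

X = [17, 19, 20, 21, 22, 23, 23, 23, 23, 25, 60]
print(find_outliers(X))  # [60]
```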

Cumulative Distribution Function (CDF)
To find CDF(x) for a particular value of x, we compute the fraction of the values in the sample that are less than or equal to x.
Chapter 3, Think Stats
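The definition translates almost directly into Python. This is a minimal sketch of the empirical CDF; the function name `cdf` is our own.

```python
def cdf(sample, x):
    """Fraction of sample values that are less than or equal to x."""
    return sum(1 for value in sample if value <= x) / len(sample)

X = [17, 19, 20, 21, 22, 23, 23, 23, 23, 25]
print(cdf(X, 16))  # 0.0
print(cdf(X, 22))  # 0.5
print(cdf(X, 23))  # 0.9
print(cdf(X, 25))  # 1.0
```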

CDF
Why are CDFs useful?
We overcome the binning issue by grouping all values equal to or lower than x.
We can easily answer the following questions:
What is the probability of observing a value of x or lower?
Given a probability p, what is the corresponding value x? That is, what is the inverse CDF of p?
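One way to sketch the inverse CDF in Python is the nearest-rank rule: return the smallest sample value whose CDF is at least p. This is only one of several percentile conventions, and the function name `inverse_cdf` is our own.

```python
import math

def inverse_cdf(sample, p):
    """Smallest sample value x with CDF(x) >= p, for 0 < p <= 1 (nearest-rank rule)."""
    if not 0 < p <= 1:
        raise ValueError("p must be in the interval (0, 1]")
    ordered = sorted(sample)
    index = math.ceil(p * len(ordered)) - 1
    return ordered[index]

X = [17, 19, 20, 21, 22, 23, 23, 23, 23, 25]
print(inverse_cdf(X, 0.25))  # 20
print(inverse_cdf(X, 0.5))   # 22
print(inverse_cdf(X, 1.0))   # 25
```

With this rule, the 50th percentile of X is 22, since half of the values are less than or equal to 22.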

Shape of Distribution
Skewness quantifies how symmetrical a distribution is.
A symmetrical distribution has a skewness of zero.
Negative values for the skewness indicate that the data is skewed left.
Positive values for the skewness indicate that the data is skewed right.
Figure panels: skewness < 0 (left skew), skewness = 0, skewness > 0 (right skew)

Kurtosis
Kurtosis quantifies how peaked a distribution is compared to a normal distribution.
A normal distribution has a kurtosis of 3, so the values below refer to excess kurtosis (kurtosis - 3), which is 0 for a normal distribution.
A flatter distribution has a negative excess kurtosis; a distribution more peaked than a normal distribution has a positive excess kurtosis.
Figure panels: kurtosis - 3 < 0, kurtosis - 3 = 0, kurtosis - 3 > 0
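Both shape measures can be sketched from central moments. The code below assumes the simple moment-based definitions (third and fourth standardized moments, without small-sample bias correction); statistics packages often apply corrections and may report slightly different numbers. The samples are made up for illustration.

```python
def central_moment(values, k):
    """k-th central moment: mean of (x - mean) ** k."""
    mu = sum(values) / len(values)
    return sum((x - mu) ** k for x in values) / len(values)

def skewness(values):
    """Third standardized moment; 0 for a symmetric sample."""
    return central_moment(values, 3) / central_moment(values, 2) ** 1.5

def excess_kurtosis(values):
    """Fourth standardized moment minus 3; 0 for a normal distribution."""
    return central_moment(values, 4) / central_moment(values, 2) ** 2 - 3

symmetric = [1, 2, 3, 4, 5]
right_skewed = [1, 1, 1, 2, 10]
print(skewness(symmetric))         # 0.0
print(skewness(right_skewed) > 0)  # True
print(excess_kurtosis(symmetric))  # approximately -1.3 (flatter than normal)
```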

Normal Distribution
More peaked than a normal distribution: positive kurtosis.

Normal Distribution
Flatter than a normal distribution: negative kurtosis. Slight left skew, but almost normal.

Normal Distribution
Flatter than a normal distribution: negative kurtosis. Left skew, but almost normal.

Normal Distribution
Flatter than a normal distribution: negative kurtosis. Slight right skew: positive skewness.

Statistics
src: https://www.autodeskresearch.com/publications/samestats

Statistics
Diagram: Population to Sample via probability and simulation; Sample to Population via inference (find a good estimator).
Descriptive statistics summarize the sample.
The sample mean is called a sample statistic.
The population mean is called a parameter.

Most of the time we do not know the parameter of the true distribution that generated our sample data.
1. But we can estimate the parameter from the observed sample data: inference!
If we observe 5 heads in 6 coin tosses, what was the parameter p of the coin? What is our best guess for p? How uncertain are we?
2. And we can test hypotheses about the parameter.
If we observe 5 heads in 6 coin tosses, what is the probability that the coin was fair?
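A small Python sketch for the coin example. It assumes the relative frequency of heads as the "best guess" for p (this is the maximum likelihood estimate), and it computes how probable the observed result is under a fair coin. Note that the latter is the probability of the data given p = 0.5, which is not the same quantity as the probability that the coin is fair.

```python
from math import comb

def binomial_pmf(k, n, p):
    """Probability of exactly k successes in n independent trials with success probability p."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)

heads, tosses = 5, 6

# Point estimate of p: relative frequency of heads.
p_hat = heads / tosses
print(p_hat)  # approximately 0.833

# Probability of exactly 5 heads, and of 5 or more heads, if the coin is fair.
exactly = binomial_pmf(heads, tosses, 0.5)
at_least = sum(binomial_pmf(k, tosses, 0.5) for k in range(heads, tosses + 1))
print(exactly)   # 0.09375
print(at_least)  # 0.109375
```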

PARAMETER ESTIMATION

I flip a coin twice. What will come up next?

I flip a coin 100 times. What will come up next?

I flip a coin 100 times: 52 heads and 48 tails. What will come up next?
P(head) = 52/100
But confidence is low.

Confidence
Confidence in our parameter estimates depends on two things:
Size of the sample (e.g., 100 versus 2)
Variance of the sample (e.g., all heads versus 52 heads)
As the variance grows, we need larger samples to have the same degree of confidence.
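One way to make both effects concrete is the standard error of an estimated proportion, sqrt(p(1 - p)/n). Using this formula as the measure of uncertainty is our assumption; the slide does not name a specific measure. The function name is hypothetical.

```python
import math

def standard_error(p_hat, n):
    """Standard error of a proportion estimated from n independent binary trials."""
    return math.sqrt(p_hat * (1 - p_hat) / n)

# Effect of sample size: same proportion of heads, 2 versus 100 tosses.
print(standard_error(0.52, 2))    # approximately 0.353
print(standard_error(0.52, 100))  # approximately 0.050

# Effect of sample variance: all heads has zero variance in the sample.
print(standard_error(1.0, 100))   # 0.0
```

A sample of only heads has zero sample variance, but with very few tosses this says little about the true p, which is why the sample size matters as well.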

Law of Large Numbers
In repeated independent trials with the same probability p of a particular outcome in each trial, the chance that the fraction of trials in which that outcome occurs differs from p by more than any fixed margin converges to zero as the number of trials goes to infinity.

Law of Large Numbers
In other words: the average of the results obtained from a large number of trials should be close to the expected value, and will tend to become closer as more trials are performed.
https://en.wikipedia.org/wiki/law_of_large_numbers
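The behaviour can be illustrated with a simulated fair coin. This is a sketch using Python's `random` module with an arbitrarily chosen fixed seed; the function name and the sample sizes are our own choices.

```python
import random

def running_fraction_of_heads(n_tosses, p=0.5, seed=0):
    """Simulate n_tosses coin flips and return the fraction of heads after each flip."""
    rng = random.Random(seed)  # fixed seed only for reproducibility
    heads = 0
    fractions = []
    for toss in range(1, n_tosses + 1):
        if rng.random() < p:
            heads += 1
        fractions.append(heads / toss)
    return fractions

fractions = running_fraction_of_heads(100_000)
for n in (10, 100, 1_000, 10_000, 100_000):
    # The fraction of heads tends to get closer to p = 0.5 as n grows.
    print(n, fractions[n - 1])
```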

Parameter Estimation
Based on empirical observations (trials, experiments):
Observe the outcome of coin flips
Observe survey data
Based on simulations (Monte Carlo simulation):
Simulate the data generation process (e.g. flip coins, spin a roulette wheel, roll a die)
Point estimates and confidence intervals

HYPOTHESIS TESTING

Hypothesis Testing
Example: my hypothesis is that the coin is unfair (p ≠ 0.5).
We formulate a null hypothesis that, if true, would falsify our hypothesis.
H0: p = 0.5
Can I reject H0?

Hypothesis testing
When can we reject H0?
The number of successes in n independent trials of a Bernoulli random variable follows a binomial distribution with parameters n and p.
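A sketch of how the binomial distribution can be used to decide about H0: p = 0.5. It assumes an exact two-sided binomial test that sums the probabilities of all outcomes no more likely than the observed one, and a significance level of 0.05; both choices are our assumptions, not prescribed by the slides.

```python
from math import comb

def binomial_pmf(k, n, p):
    """Probability of exactly k successes in n independent trials with success probability p."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)

def two_sided_p_value(k, n, p0=0.5):
    """Total probability, under H0, of outcomes at most as likely as the observed k."""
    observed = binomial_pmf(k, n, p0)
    tolerance = 1 + 1e-9  # guards against floating-point ties
    return sum(
        binomial_pmf(i, n, p0)
        for i in range(n + 1)
        if binomial_pmf(i, n, p0) <= observed * tolerance
    )

ALPHA = 0.05  # assumed significance level
for heads, tosses in ((5, 6), (52, 100), (90, 100)):
    p_value = two_sided_p_value(heads, tosses)
    print(heads, tosses, p_value, "reject H0" if p_value < ALPHA else "cannot reject H0")
```

Under these assumptions, neither 5 heads in 6 tosses nor 52 heads in 100 tosses is enough to reject H0, while 90 heads in 100 tosses is.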

Bernoulli Random Variable
Binary outcome
Probability mass function: $P(X = 1) = p$, $P(X = 0) = 1 - p$
The Bernoulli distribution has only one parameter, p (in the example, p = 0.6).

The Bernoulli distribution with parameter p describes the probability distribution of a binary random variable (e.g., success/failure, yes/no, head/tail, red/not-red).
The binomial distribution with parameters n and p is the discrete probability distribution of the number of successes in a sequence of n independent Bernoulli experiments (i.e., binary outcomes).
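The relation between the two distributions can be sketched by simulation: a binomial draw is just the number of successes in n Bernoulli trials. The function names, the seed and the number of repetitions are our own choices; p = 0.6 and n = 4 are used purely as illustration values.

```python
import random

def bernoulli_trial(p, rng):
    """Return 1 (success) with probability p, otherwise 0 (failure)."""
    return 1 if rng.random() < p else 0

def binomial_draw(n, p, rng):
    """Count the successes in n independent Bernoulli trials."""
    return sum(bernoulli_trial(p, rng) for _ in range(n))

rng = random.Random(42)  # arbitrary fixed seed for reproducibility
draws = [binomial_draw(4, 0.6, rng) for _ in range(10_000)]
average = sum(draws) / len(draws)
print(average)  # should be close to n * p = 2.4
```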

A single experiment consists of n coin tosses (trials).
The PMF of the binomial distribution gives the probability of observing k successes within n trials:
$P(X = k) = \binom{n}{k} p^k (1 - p)^{n - k}$
What is the probability of observing 3 heads when we toss a fair coin 4 times?

Binomial Coefficient
Number of ways to choose an (unordered) subset of k elements from a set of n elements: $\binom{n}{k} = \frac{n!}{k!\,(n - k)!}$
Number of outcomes that give 3 heads: $\frac{4!}{3!\,1!} = 4$
Probability: 4/16 = 0.25
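The count can be checked in three ways in Python: with the factorial formula, with `math.comb`, and by listing the subsets explicitly. Labeling the four tosses 1 to 4 is our own convention; each subset names the tosses that show heads.

```python
from itertools import combinations
from math import comb, factorial

n, k = 4, 3

# Factorial formula n! / (k! * (n - k)!)
by_formula = factorial(n) // (factorial(k) * factorial(n - k))

# Explicit list of all unordered subsets of size k
subsets = list(combinations(range(1, n + 1), k))

print(by_formula, comb(n, k), len(subsets))  # 4 4 4
print(subsets)  # [(1, 2, 3), (1, 2, 4), (1, 3, 4), (2, 3, 4)]
```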

Example
Probability of observing 3 heads when we toss a fair coin 4 times:
$\frac{4!}{3!\,1!} \cdot 0.5^3 \cdot 0.5^1 = 0.25$
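The same calculation as a short Python sketch, which also prints the whole distribution for n = 4 tosses of a fair coin. The function name `binomial_pmf` is our own.

```python
from math import comb

def binomial_pmf(k, n, p):
    """Probability of exactly k successes in n independent trials with success probability p."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)

print(binomial_pmf(3, 4, 0.5))  # 0.25

distribution = [binomial_pmf(k, 4, 0.5) for k in range(5)]
print(distribution)  # [0.0625, 0.25, 0.375, 0.25, 0.0625]
```

The probabilities for k = 0, ..., 4 sum to 1.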

What is the probability of observing 3 heads when we toss the coin 4 times?
For equally likely outcomes: $P = \frac{\#\text{favorable outcomes}}{\#\text{all outcomes}}$

Example
One experiment: toss one coin 4 times (n = 4).
The coin shows either head (H) or tail (T).
Number of all possible outcomes (with order)? $2^4 = 16$

Discrete Random Variable X
What is the probability of observing 3 heads?
$P = \frac{\#\text{favorable outcomes}}{\#\text{all outcomes}}$
Probability of observing 3 heads: 4/16 = 0.25
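The counting argument can be verified by brute force in Python: enumerate all ordered outcomes of 4 tosses and count those with exactly 3 heads. This sketch assumes a fair coin, so that all 16 outcomes are equally likely.

```python
from itertools import product

# All ordered outcomes of 4 tosses, e.g. ('H', 'H', 'T', 'H')
outcomes = list(product("HT", repeat=4))

# Favorable outcomes: exactly 3 heads
favorable = [outcome for outcome in outcomes if outcome.count("H") == 3]

print(len(outcomes))                     # 16
print(["".join(o) for o in favorable])   # ['HHHT', 'HHTH', 'HTHH', 'THHH']
print(len(favorable) / len(outcomes))    # 0.25
```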

QUESTIONS