Web Science & Technologies, University of Koblenz-Landau, Germany. Lecture Data Science: Statistics and Probabilities. JProf. Dr. Claudia Wagner
Data Science Open Position @GESIS: Student Assistant Job in Data Science at GESIS in Cologne. Requirements: good programming skills in Python; some data mining experience; ability to work on a Unix server. Payment: 11,64 EUR per hour; 8-20 hours per week are possible. If interested, send me an email with CV and transcript of records.
Exam: WED 6.2., 6pm, E011; WED 27.3., 2pm, D018
Science: Science is an evolutionary process which possibly allows us to gain knowledge about the world. Let's assume we have a question: Do first babies arrive later than other babies? How can we answer it scientifically? Collect data (e.g. via a survey), then analyze the data using statistics. (Chapter 1, Think Stats)
Science: We test our hypothesis by formulating a null hypothesis which, if true, would falsify our hypothesis. Then we try to reject the null hypothesis (with a certain probability). If our hypothesis is true, we will be able to reject the null hypothesis in most experiments. Example: My hypothesis: first babies arrive later than other babies. Null hypothesis: first babies arrive at the same time as or earlier than other babies.
Where do hypotheses come from? Theory (deduction); observations (induction).
WHAT SKILLS DO DATA SCIENTISTS NEED?
Data Science
Definition: "A data scientist is someone who knows more statistics than a computer scientist and more computer science than a statistician." - Josh Blumenstock. Skills needed: statistics, machine learning, ability to handle big data; scientific curiosity & methodology, storytelling, creativity, visualization skills and so on.
What will you learn? Statistics & Probability Theory; Descriptive Statistics; Probability Theory; Bayesian versus Frequentist thinking; Statistical Inference; Causal Inference; Probabilistic Graphical Models; Data Collection Methods; Visualizing Data, Interpretations and Data Storytelling.
Last Time: Compare pregnancy length for first babies and later babies
Mode: The mode is the value that appears most often in a set of data. It can be used for all types of data, including nominal data. What is the mode of X = [17, 19, 20, 21, 22, 23, 23, 23, 23]?
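The question above can be checked with a few lines of Python. This is a minimal sketch; `modes` is a hypothetical helper name, and it returns a list because a data set can have more than one mode.

```python
from collections import Counter

def modes(values):
    """Return all values that occur most often (there can be several)."""
    counts = Counter(values)
    top = max(counts.values())
    return [v for v, c in counts.items() if c == top]

X = [17, 19, 20, 21, 22, 23, 23, 23, 23]
print(modes(X))  # [23]
```

Because only counting is involved, the same function works for nominal data such as `["red", "blue", "red"]`.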
Mean (expected value): Applies to interval and ratio scales. $\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$. Example: X = [17, 19, 20, 21, 22, 23, 23, 23, 23, 25], so $\bar{x} = 216/10 = 21.6$.
Median: For X = [17, 19, 20, 21, 22, 23, 23, 23, 23, 25] the median of X is 22.5. For X = [17, 19, 20, 21, 22, 23, 23, 23, 23] the median of X is 22. The median is useful for skewed distributions, where the mean can be misleading. Applies to ordinal, interval and ratio scales.
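Variance and Standard Deviation: Variance $\sigma^2 = \frac{1}{n}\sum_{i=1}^{n}(x_i - \bar{x})^2$, where $\bar{x}$ is the mean (some texts divide by $n-1$ when estimating from a sample). The standard deviation $\sigma$ is just the square root of the variance.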
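The mean, median, variance and standard deviation of the example data can be computed with Python's standard `statistics` module. This is a sketch that assumes the population definitions (dividing by n); `statistics.variance` and `statistics.stdev` would divide by n - 1 instead.

```python
import statistics

X = [17, 19, 20, 21, 22, 23, 23, 23, 23, 25]

mean = statistics.fmean(X)          # sum of values divided by their number
median = statistics.median(X)       # average of the two middle values for even n
variance = statistics.pvariance(X)  # mean squared deviation from the mean
std = statistics.pstdev(X)          # square root of the variance

print(mean, median, variance, std)
```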
Mode, median and mean of two log-normal distributions; https://en.wikipedia.org/wiki/file:comparison_mean_median_mode.svg
Probability Mass Function (PMF): Transform absolute frequencies into normalized ones (probabilities) by dividing each count by the total number of observations.
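A PMF for a small sample can be sketched as a dictionary that maps each value to its relative frequency. The helper name `pmf` and the example data are illustrative assumptions, not part of the lecture material.

```python
from collections import Counter

def pmf(values):
    """Map each distinct value to its relative frequency in the sample."""
    counts = Counter(values)
    n = len(values)
    return {v: c / n for v, c in sorted(counts.items())}

X = [17, 19, 20, 21, 22, 23, 23, 23, 23, 25]
p = pmf(X)
print(p[23])            # 0.4, since 23 occurs 4 times out of 10
print(sum(p.values()))  # the probabilities add up to 1 (up to rounding)
```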
Limits of PMF: PMFs work well if the number of values is small. As the number of values increases, the probability associated with each value gets smaller and the effect of random noise increases. (Chapter 3, Think Stats)
Solutions: Choose different visualization techniques: boxplot, Cumulative Distribution Function (CDF).
Percentile
Boxplots: IQR = Q3 - Q1. Outliers are usually 3 IQR or more above the third quartile or 3 IQR or more below the first quartile. Image: http://www.gs.washington.edu/academics/courses/akey/56008/lecture/lecture2.pdf
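The IQR rule can be sketched as follows. Quartile conventions differ between tools; this example assumes the "inclusive" method of `statistics.quantiles`, and `iqr_outliers` is a hypothetical helper whose factor `k` defaults to the 3 IQR used on the slide.

```python
import statistics

def iqr_outliers(values, k=3.0):
    """Return values lying beyond k * IQR from the first or third quartile."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    # Strict comparison, so nothing is flagged when all values are identical (IQR = 0).
    return [v for v in values if v < lower or v > upper]

X = [17, 19, 20, 21, 22, 23, 23, 23, 23, 25]
print(iqr_outliers(X))         # []
print(iqr_outliers(X + [60]))  # [60]
```

Passing a smaller `k` (for example 1.5) flags more points; which factor is appropriate depends on the convention in use.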
Cumulative Distribution Function (CDF): To find CDF(x) for a particular value of x, we compute the fraction of the values in the sample that are less than (or equal to) x. (Chapter 3, Think Stats)
CDF: Why are CDFs useful? We overcome the binning issue by grouping all values equal to or lower than x. We can easily answer the following questions: What is the probability of observing a value of x or lower? Given a probability p, what is the corresponding value x (that is, the inverse CDF of p)?
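Both questions can be answered from a sorted sample. This is a minimal sketch: `cdf` and `inverse_cdf` are hypothetical helper names, and the inverse is defined here as the smallest sample value whose CDF reaches p, which is one of several common conventions.

```python
import math
from bisect import bisect_right

def cdf(sample, x):
    """Fraction of sample values that are less than or equal to x."""
    s = sorted(sample)
    return bisect_right(s, x) / len(s)

def inverse_cdf(sample, p):
    """Smallest sample value x with CDF(x) >= p, for 0 < p <= 1."""
    s = sorted(sample)
    index = max(math.ceil(p * len(s)) - 1, 0)
    return s[index]

X = [17, 19, 20, 21, 22, 23, 23, 23, 23, 25]
print(cdf(X, 22))           # 0.5: five of the ten values are <= 22
print(inverse_cdf(X, 0.5))  # 22
```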
Shape of Distribution: Skewness quantifies how symmetrical a distribution is. A symmetrical distribution has a skewness of zero. Negative values for the skewness indicate data is skewed left. Positive values for the skewness indicate data is skewed right. Figure panels: skewness < 0 (left skew), skewness = 0 (symmetrical), skewness > 0 (right skew).
Kurtosis: Kurtosis quantifies how peaked a distribution is compared to a normal distribution. A normal distribution has a kurtosis of 3, so it is common to report the excess kurtosis (kurtosis - 3), which is 0 for a normal distribution. A flatter distribution has a negative excess kurtosis; a distribution more peaked than a normal distribution has a positive excess kurtosis. Figure panels (after subtracting 3): kurtosis < 0, kurtosis = 0, kurtosis > 0.
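Skewness and kurtosis can be computed from central moments. The sketch below assumes the simple moment-based (population) definitions; statistical libraries often apply additional bias corrections, so their numbers can differ slightly. All function names are hypothetical.

```python
def central_moment(xs, k):
    """k-th central moment: mean of (x - mean)**k."""
    m = sum(xs) / len(xs)
    return sum((x - m) ** k for x in xs) / len(xs)

def skewness(xs):
    """Third central moment divided by the variance to the power 3/2."""
    return central_moment(xs, 3) / central_moment(xs, 2) ** 1.5

def excess_kurtosis(xs):
    """Fourth central moment divided by the squared variance, minus 3."""
    return central_moment(xs, 4) / central_moment(xs, 2) ** 2 - 3

print(skewness([1, 2, 3, 4, 5]))         # symmetric sample, skewness 0
print(skewness([1, 1, 1, 2, 10]))        # long right tail, positive skewness
print(excess_kurtosis([1, 2, 3, 4, 5]))  # flat sample, negative excess kurtosis
```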
Normal Distribution: More peaky than normal distribution! Positive kurtosis!
Normal Distribution: Flatter than normal distribution! Neg. kurtosis! Slight left skew! But almost normal.
Normal Distribution: Flatter than normal distribution! Neg. kurtosis! Left skew! But almost normal.
Normal Distribution: Flatter than normal distribution! Neg. kurtosis! Slight right skew! Pos. skewness!
Statistics src: https://www.autodeskresearch.com/publications/samestats
Statistics: [Diagram: population, sample, probability, simulation, descriptive statistics, inference.] The sample mean is called a sample statistic; the population mean is called a parameter. Inference: find a good estimator.
Most of the time we do not know the parameter of the true distribution that generated our sample data. 1. But we can estimate the parameter from the observed sample data: inference! If we observe 5 heads in 6 coin tosses, what was the parameter p of the coin? What is our best guess for p? How uncertain are we? 2. And we can test hypotheses about the parameter. If we observe 5 heads in 6 coin tosses, what is the probability that the coin was fair?
PARAMETER ESTIMATION
I flip a coin twice. What will come up next?
I flip a coin 100 times. What will come up next?
I flip a coin 100 times: 52 heads and 48 tails. What will come up next? P(head) = 52/100, but confidence is low.
Confidence: Confidence in our parameter estimates depends upon two things: the size of the sample (e.g., 100 versus 2) and the variance of the sample (e.g., all heads versus 52 heads). As the variance grows, we need larger samples to have the same degree of confidence.
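One way to make these two factors concrete is the standard error of an estimated proportion, sqrt(p(1 - p) / n). The slides do not state this formula, so treat it as an added assumption; `standard_error` is a hypothetical helper name.

```python
import math

def standard_error(p_hat, n):
    """Standard error of a proportion estimated from n independent flips."""
    return math.sqrt(p_hat * (1 - p_hat) / n)

# Same estimate, different sample sizes: a larger n gives a smaller uncertainty.
for n in (2, 100, 10_000):
    print(n, standard_error(0.5, n))

# Same sample size, different sample variance: "all heads" has zero sample variance.
print(standard_error(0.52, 100), standard_error(1.0, 100))
```

Note that a zero standard error for "all heads" only reflects the absence of variation in the sample; with very small samples it should not be read as certainty about p.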
Law of Large Numbers: In repeated independent trials with the same probability p of a particular outcome in each trial, the chance that the fraction of times that outcome occurs differs from p by more than any fixed margin converges to zero as the number of trials goes to infinity.
Law of Large Numbers: In other words: the average of the results obtained from a large number of trials should be close to the expected value, and will tend to become closer as more trials are performed. https://en.wikipedia.org/wiki/law_of_large_numbers
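The law of large numbers can be illustrated with a short simulation. This sketch assumes a fair coin (p = 0.5) and a fixed random seed so that the run is reproducible; `running_fraction_of_heads` is a hypothetical helper name.

```python
import random

def running_fraction_of_heads(n_flips, p=0.5, seed=0):
    """Simulate n_flips coin tosses and return the fraction of heads after each toss."""
    rng = random.Random(seed)
    heads = 0
    fractions = []
    for i in range(1, n_flips + 1):
        heads += rng.random() < p  # True counts as 1
        fractions.append(heads / i)
    return fractions

fractions = running_fraction_of_heads(100_000)
for n in (10, 100, 1_000, 10_000, 100_000):
    print(n, fractions[n - 1])
```

The printed fractions typically wander for small n and settle close to 0.5 as n grows.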
Parameter Estimation: Based on empirical observations (trials, experiments): observe outcome of coin flips, observe survey data. Based on simulations (Monte Carlo simulation): simulate the data generation process (e.g. flip coins, spin roulette wheel, roll die). Point estimates and confidence intervals.
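A Monte Carlo simulation of the coin example might look like the sketch below. It assumes a known true parameter `p_true`, repeats the experiment many times, and reports the average estimate together with the central 95% range of the simulated estimates. This range shows how much an estimate from 100 flips varies; it is an illustration, not a formal confidence-interval procedure. All names are hypothetical.

```python
import random

def simulate_estimates(p_true, n_flips, n_experiments, seed=0):
    """Repeat the coin-flip experiment and return the sorted estimates of p."""
    rng = random.Random(seed)
    estimates = []
    for _ in range(n_experiments):
        heads = sum(rng.random() < p_true for _ in range(n_flips))
        estimates.append(heads / n_flips)
    return sorted(estimates)

estimates = simulate_estimates(p_true=0.5, n_flips=100, n_experiments=2000)
point_estimate = sum(estimates) / len(estimates)
low = estimates[int(0.025 * len(estimates))]        # 2.5th percentile
high = estimates[int(0.975 * len(estimates)) - 1]   # 97.5th percentile
print(point_estimate, low, high)
```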
HYPOTHESIS TESTING
Hypothesis Testing: Example: my hypothesis is that the coin is unfair (p ≠ 0.5). We create a null hypothesis which, if true, would falsify our hypothesis. H0: p = 0.5. Can I reject H0?
Hypothesis testing: When can we reject H0? The number of successes in n independent trials of a Bernoulli random variable follows a binomial distribution with parameters n and p.
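One possible way to decide whether H0 can be rejected is an exact two-sided binomial test. The sketch below assumes the binomial PMF $\binom{n}{k} p^k (1-p)^{n-k}$, a significance level of 0.05, and the convention that the p-value sums the probabilities of all outcomes at most as likely as the observed one. The slides do not prescribe a specific test, so all of these are assumptions, and the function names are hypothetical.

```python
from math import comb

def binomial_pmf(k, n, p):
    """Probability of exactly k successes in n independent trials."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)

def two_sided_p_value(k_observed, n, p0=0.5):
    """Probability under H0: p = p0 of an outcome at most as likely as the observed one."""
    observed = binomial_pmf(k_observed, n, p0)
    # Small relative tolerance so that ties are not lost to floating-point rounding.
    return sum(
        binomial_pmf(k, n, p0)
        for k in range(n + 1)
        if binomial_pmf(k, n, p0) <= observed * (1 + 1e-9)
    )

alpha = 0.05  # assumed significance level
for k, n in [(5, 6), (52, 100)]:
    p_value = two_sided_p_value(k, n)
    print(k, n, p_value, "reject H0" if p_value < alpha else "cannot reject H0")
```

For 5 heads in 6 tosses the p-value is 14/64, about 0.22, so under these assumptions H0 cannot be rejected.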
Bernoulli Random Variable: Binary outcome. Probability mass function: $P(X=1) = p$, $P(X=0) = 1-p$. The Bernoulli distribution has only one parameter, here p = 0.6.
The Bernoulli distribution with parameter p describes the probability distribution of a binary random variable (e.g., success/failure, yes/no, head/tail, red/not-red). The binomial distribution with parameters n and p is the discrete probability distribution of the number of successes in a sequence of n independent Bernoulli experiments (i.e., binary outcomes).
Single experiment: toss a coin multiple times, i.e. repeat the Bernoulli trial n times. The PMF of the binomial distribution defines the probability that you have k successes within n trials: $P(X=k) = \binom{n}{k}\, p^k (1-p)^{n-k}$. What is the probability of observing 3 heads when we toss a fair coin 4 times?
Binomial Coefficient: Number of ways to choose an (unordered) subset of k elements from a set of n elements: $\binom{n}{k} = \frac{n!}{k!\,(n-k)!}$. Number of outcomes that give 3 heads = 4!/(3!*1!) = 4, so the probability is 4/16 = 0.25.
Example: Probability of observing 3 heads when we toss a fair coin 4 times: $\frac{4!}{3!\,1!} \cdot 0.5^3 \cdot 0.5^1 = 0.25$
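The worked example can be reproduced with `math.comb` from the Python standard library. This is a minimal sketch; `binomial_pmf` is a hypothetical helper name.

```python
from math import comb

def binomial_pmf(k, n, p):
    """Probability of exactly k successes in n independent trials with success probability p."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)

print(comb(4, 3))                # 4 ways to place 3 heads among 4 tosses
print(binomial_pmf(3, 4, 0.5))   # 0.25
```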
What is the probability of observing 3 heads when we toss the coin 4 times? P = #favorable outcomes / #all outcomes
Example: One experiment: toss one coin 4 times (n=4). The coin shows either head H or tail T. Number of all possible outcomes (with order)? $2^4 = 16$
Discrete Random Variable X: What is the probability of observing 3 heads? P = #favorable outcomes / #all outcomes. Probability of observing 3 heads: 4/16 = 0.25
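The counting argument can be verified by enumerating all ordered outcomes of 4 tosses. This is a small sketch using `itertools.product`; the variable names are illustrative.

```python
from itertools import product

# All ordered sequences of H and T of length 4.
outcomes = list(product("HT", repeat=4))
# Sequences that contain exactly 3 heads.
favorable = [o for o in outcomes if o.count("H") == 3]

print(len(outcomes))                   # 16
print(len(favorable))                  # 4
print(len(favorable) / len(outcomes))  # 0.25
```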
QUESTIONS