Introduction to Probability and Inference
HSSP Summer 2017, Instructor: Alexandra Ding
July 19, 2017


Please fill out the attendance sheet!

Suggestions Box: Feedback and suggestions are important to the success of this class and my experience as a teacher, so please send comments to alexawding@gmail.com!

2 Lecture 2 Recap: Random Variables and Distributions

Random Variable: maps the outcomes in the sample space to the real line. Random variables can be continuous (height, car mileage) or discrete (coin toss, die roll, number of raindrops falling on my car).

Probability Mass Function (PMF): for a discrete RV X, the PMF, denoted f_X(x) or f(x), tells us the probability that our random variable equals some value. In other words:

    f_X(x) = P(X = x)

Cumulative Distribution Function (CDF): the distribution of a discrete RV can also be described via the CDF. The CDF answers the question: what is the probability that my RV is less than or equal to some x? The CDF of a RV X is denoted F_X, and is essentially a cumulative sum of PMF values:

    F_X(x) = P(X <= x)

Bernoulli Distribution: a discrete distribution with one parameter, p. A Bernoulli-distributed RV has only two possible outcomes (1 or 0): it is 1 with probability p and 0 with probability 1 - p.

    Y ~ Bern(p)        "Y is distributed Bernoulli with parameter p"
    f_Y(1) = p         Bernoulli PMF

Binomial Distribution: a discrete distribution with two parameters, n and p. The Binomial counts the number of successes in n independent trials, where the probability of success on each trial is p. A Binomial RV is the sum of n independent Bernoulli RVs!

    X ~ Bin(n, p)      "X is distributed Binomial with some n and p"
    P(X = x) = C(n, x) p^x (1 - p)^(n - x)

where C(n, x) = n! / (x!(n - x)!) is the binomial coefficient.

Probability Density Function (PDF): for a continuous random variable, the PDF gives the distribution. This function gives the likelihood or density of observing some value of a RV (not the probability). We usually have to evaluate integrals to find the probability that a RV takes a value in a certain interval.
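The Binomial PMF and its CDF (the cumulative sum of PMF values) can be checked numerically; a minimal sketch using only Python's standard library, with n = 4 and p = 0.5 as an illustrative choice:

```python
from math import comb

def binomial_pmf(x, n, p):
    """P(X = x) for X ~ Bin(n, p): C(n, x) * p^x * (1 - p)^(n - x)."""
    return comb(n, x) * p**x * (1 - p)**(n - x)

def binomial_cdf(x, n, p):
    """F_X(x) = P(X <= x), a cumulative sum of PMF values."""
    return sum(binomial_pmf(k, n, p) for k in range(x + 1))

# Example: 4 independent trials with success probability 0.5
for x in range(5):
    print(x, binomial_pmf(x, 4, 0.5))

# The CDF evaluated at the largest possible outcome is 1:
print(binomial_cdf(4, 4, 0.5))  # 1.0
```

Note that each Binomial probability is a product of C(n, x) equally likely sequences, each with probability p^x (1 - p)^(n - x); the code mirrors that factorization directly.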
Normal Distribution: a VERY special continuous distribution. If X is a RV and is distributed Normal, the notation is:

    X ~ N(µ, σ^2)

where µ is the mean and σ^2 is the variance (a measure of spread). The density function is

    f_X(x) = (1 / sqrt(2π σ^2)) exp( -(x - µ)^2 / (2σ^2) )
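The density above can be evaluated directly, and its CDF written in terms of the error function; a small sketch (standard library only) that also recovers the "68" part of the Empirical Rule:

```python
from math import sqrt, pi, exp, erf

def normal_pdf(x, mu, var):
    """Density of N(mu, var) evaluated at x."""
    return exp(-(x - mu) ** 2 / (2 * var)) / sqrt(2 * pi * var)

def normal_cdf(x, mu, var):
    """P(X <= x) for X ~ N(mu, var), via the error function."""
    return 0.5 * (1 + erf((x - mu) / sqrt(2 * var)))

# P(mu - sigma < X < mu + sigma) is the same for every mu and var:
p = normal_cdf(1, 0, 1) - normal_cdf(-1, 0, 1)
print(round(p, 4))  # ≈ 0.6827
```

The same `normal_cdf` answers the blood-glucose warmup below: with µ = 15 and σ^2 = 4, the interval (13, 17) is exactly µ ± 1σ.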

2.1 Warmup Puzzles

1. Act Normal!: Recall that the Empirical Rule (also known as the 68-95-99.7 Rule) gives the probability that an observation of a RV falls within 1, 2 and 3 standard deviations of the mean. Suppose that blood glucose in a patient population is distributed Normally with mean 15 and variance 4. In other words:

    X ~ N(µ = 15, σ^2 = 4)

What is the probability that blood glucose is between 13 and 17? What is the probability that blood glucose is between 11 and 15?

2. Z Scores: A Z score tells you how many standard deviations σ your observation is from the mean µ:

    Z Score = (observation - µ) / σ

Suppose you, a sabermetrician (baseball statistician), have measured the number of home runs that every team in Major League Baseball has hit in 2017, and find that the mean is 115, with a standard deviation of 18. The Boston Red Sox have hit 94 home runs. [1] Calculate the Z score of this observation.

3. Simpson's Paradox: You're still a sabermetrician and are at home looking at batting averages of different players. A batting average is the number of safe hits divided by the number of total at-bats. Suppose you're looking at the batting averages of two players, Derek Jeter and David Justice, in the years 1995 and 1996, as well as in both years combined. [2] Here's what you observe:

               1995      1996      Combined
    Jeter      0.250     0.314     0.310
    Justice    0.253     0.321     0.270

Who has the higher average in 1995? Who has the higher average in 1996? Who has the higher combined average? Does this strike you as strange?

[1] SOURCE: http://www.espn.com/mlb/stats/team/ /stat/batting
[2] https://en.wikipedia.org/wiki/Simpson%27s_paradox
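Puzzles 2 and 3 can be checked with a few lines of Python. The hit/at-bat counts below are the ones commonly quoted for this example (e.g. in the Wikipedia article on Simpson's paradox); treat them as illustrative:

```python
# Puzzle 2: Z score of the Red Sox observation (94 HR; mean 115, SD 18)
z = (94 - 115) / 18
print(round(z, 2))  # -1.17

# Puzzle 3: (hits, at-bats) per season, as commonly quoted for this example
jeter   = {"1995": (12, 48),   "1996": (183, 582)}
justice = {"1995": (104, 411), "1996": (45, 140)}

for player, seasons in [("Jeter", jeter), ("Justice", justice)]:
    hits = sum(h for h, _ in seasons.values())
    at_bats = sum(a for _, a in seasons.values())
    per_year = {y: round(h / a, 3) for y, (h, a) in seasons.items()}
    print(player, per_year, "combined:", round(hits / at_bats, 3))
```

The resolution of the paradox is visible in the counts: the combined average is an at-bat-weighted mean of the yearly averages, and the two players' at-bats are weighted toward opposite years.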

3 Lecture 3: Introduction to Inference

3.1 Inference: Into the Wild

Motivation: When we collect data, we are usually interested in estimating real-world quantities and answering relevant questions. In statistical inference, we use observable data to make a statement or decision about a statistical model.

- Email Marketing Campaign: What is the estimated bounce rate of email type A vs. type B?
- Disease Incidence: Has the incidence of cholera in Pakistan changed over the past 5 years?
- Transportation: What is the average wait time of passengers at Downtown Crossing?
- Disease Treatment: What is the 5-year survival rate of patients on drug vs. control?

Survey Sampling: obtaining information about a larger population by examining a small fraction of observations. For a population of size N, we measure some characteristics of a subset of n < N.

Simple Random Sample: Out of a population of N, each sample of size n is equally likely to be selected. How many unique samples are there? Different sampling techniques exist; more on this in Experimental Design.

3.2 Random Variables and Measurement

Consider each measurement in our sample to be a Random Variable (capitalized). Let X_i be a RV denoting the height of the ith person in our sample. The observed value of height is x_i (lowercase).

Notation: Random Variables are CAPITALIZED. Observations are LOWERCASE.

Suppose we are interested in measuring the height in inches of n randomly selected people, denoted X_1, X_2, ..., X_n. Then we might observe that x_1 = 64, x_2 = 75, etc.

Categorical vs. Quantitative/Numerical Data (Variables): Examples? [3]

[3] from The Manga Guide to Statistics

3.3 Statistics summarize data

Statistic: A numerical summary of observed variables (data); formally, a real-valued function of Random Variables. Statistics are therefore also RVs. Statistics can be calculated on the whole population or on a sample of the population. Since the goal of inference is to validate statistical models, statistics are used to approximate parameters of statistical models.

Parameter: Generally notated as θ. A summary of a statistical model or of a population; it determines the shape of a distribution for a set of RVs of interest. Recall the parameters in X ~ Bin(n, p) and Y ~ N(µ, σ^2).

Estimator: A type of statistic that aims to estimate a model or population parameter.

Examples of population and sample statistics (most common):

    Statistic             Population    Sample
    Mean                  µ             X̄
    Proportion            p             p̂
    Variance              σ^2           S^2
    Standard Deviation    σ             S

Independence and Distribution Assumption: We assume that members of the same population are IID (independent and identically distributed). And if our sample is random, the members of our sample are also assumed to be IID. Note that IID does not always hold! (See the example below.)

Random Variables X and Y are independent if

    P(X = x, Y = y) = P(X = x) P(Y = y)          for discrete RVs X and Y
    P(X <= x, Y <= y) = P(X <= x) P(Y <= y)      for both discrete and continuous RVs X and Y

Independence Example [4]: Suppose you are an investment banker at Goldman Sachs and are evaluating a Collateralized Debt Obligation (CDO). A (simplified explanation of a) CDO is a bet taken on a pool of assets, for example, mortgages. In this case, imagine that you have a pool of 5 mortgages, each with an estimated 5% chance of defaulting (failing). You have a few options for the types of bets you can take:

- Bet Alpha: pays out K cash unless all five of the mortgages default
- Bet Epsilon: pays out K cash unless any one of the mortgages defaults

Which type of bet has greater risk? What is the probability of losing?
[4] From Nate Silver's The Signal and the Noise: Why So Many Predictions Fail, but Some Don't, 2012
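Under the independence assumption, the losing probabilities of the two bets follow directly from the product rule; a small sketch:

```python
p_default = 0.05  # estimated default probability per mortgage
n = 5             # mortgages in the pool

# Bet Alpha loses only if ALL five mortgages default.
# Under independence, probabilities multiply:
p_lose_alpha = p_default ** n

# Bet Epsilon loses if ANY mortgage defaults,
# i.e. 1 - P(none default):
p_lose_epsilon = 1 - (1 - p_default) ** n

print(p_lose_alpha)              # 3.125e-07 (about 1 in 3.2 million)
print(round(p_lose_epsilon, 4))  # 0.2262 (about 1 in 4.4)
```

The point of Silver's example is that these numbers are only valid if defaults really are independent; when mortgage defaults are correlated (as in 2008), Bet Alpha's true risk can be vastly higher than the independence calculation suggests.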

3.4 Population Statistics

Suppose we have a population of size N. Population statistics are computed on the measurements of all members of the population.

Population Mean: reflects the center of the dataset.

    µ = (1/N) Σ x_i

Population Variance: reflects how spread out the datapoints are around the population mean µ.

    σ^2 = (1/N) Σ (x_i - µ)^2

Population Proportion: for observations of a categorical Y_i, reflects the proportion of positive observations.

    p = (Σ y_i) / N

Suppose you, a guidance counselor, are interested in the average GPA of all 300 students in your class year. You have access to all of these measurements. What population parameter would you calculate?

Suppose that in the population of sea turtles on this beach, 900 eggs have been laid, and 88 hatch. What is the population proportion of eggs that have hatched?

3.5 Sample Statistics and Standard Error

Population statistics are often unknown or difficult to measure. Thus, taking a sample of the population and calculating statistics on that sample is often useful. For a population of size N, take a sample of size n. A good rule of thumb is that n > 30.

What makes a useful estimator of a population parameter? Unbiasedness and consistency.

Sample Mean: reflects the center of the measurements.

    X̄ = (1/n) Σ x_i

The sample mean is a good estimator of µ because E(X̄) = µ. Intuition: in the long run, we expect the sample mean to be centered around the population mean = unbiasedness! As we take more samples (as n gets larger), the standard error of our estimate will decrease (on average the sample mean gets closer to the population mean) = consistency.
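The population formulas above, applied to the sea-turtle question (coding each egg as y_i = 1 if it hatched, 0 otherwise, so the proportion is just the mean of the y_i):

```python
def pop_mean(xs):
    """Population mean: divide by N, the full population size."""
    return sum(xs) / len(xs)

def pop_variance(xs):
    """Population variance: average squared deviation, divided by N."""
    mu = pop_mean(xs)
    return sum((x - mu) ** 2 for x in xs) / len(xs)

# Sea turtles: 900 eggs laid, 88 hatched.
eggs = [1] * 88 + [0] * (900 - 88)
p = pop_mean(eggs)  # population proportion = (sum of y_i) / N
print(round(p, 4))  # 0.0978
```

For the guidance-counselor question, the same `pop_mean` applied to all 300 GPAs would give the population mean µ, since the full population is observed.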

Sample Proportion: reflects the proportion of positive measurements in our sample (for a categorical variable).

    p̂ = (Σ y_i) / n

Sample Variance: reflects how spread out the datapoints are around the sample mean X̄.

    S^2 = Σ (x_i - X̄)^2 / (n - 1)

Compare with the population variance, and take a closer look at that denominator!

Standard Error: Standard error is similar to a measurement error. Each sample of n from the population of N will be slightly different, and thus you'll get different values of your sample statistic. In other words, standard error is the error in our estimate of the parameter, NOT the spread of the parameter itself! Standard error decreases with sample size, and depends on the estimator you're using.

3.6 Practice with Sample Statistics

1. Jersey Shore Infrastructure: Suppose you, a transportation official, are interested in estimating the average number of cars traversing bridges in New Jersey at 2pm on a Tuesday. Of NJ's 6500 bridges, you get your employees to take a sample of 10 and count the number of cars that pass within 30 minutes:

    60, 40, 30, 35, 50, 10, 30, 15, 20, 65

(a) Calculate the sample mean, median and mode
(b) Calculate the sample variance

2. AB Testing: You are working for a marketing analytics company, and your client, who hosts a website and is interested in adding new features, is concerned about the website's bounce rate. The bounce rate is the proportion of visitors to a website who navigate away from the site after viewing only one page. Her web developers create two new versions of the website, A and B, and try these new versions on a sample of the website visitors. Out of the 1000 visitors who saw website A, 850 of them left after viewing one page. Out of the 1000 who saw B, 900 of them left.
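Exercise 1's answers (and Exercise 2's bounce rates) can be checked with Python's statistics module, whose `variance` uses the n - 1 denominator from the sample-variance formula above:

```python
import statistics

cars = [60, 40, 30, 35, 50, 10, 30, 15, 20, 65]

x_bar = statistics.mean(cars)      # sample mean
med = statistics.median(cars)      # sample median
mode = statistics.mode(cars)       # sample mode
s2 = statistics.variance(cars)     # sample variance (n - 1 denominator)
print(x_bar, med, mode, round(s2, 2))  # 35.5 32.5 30 341.39

# AB test: sample proportions (bounce rates)
p_hat_a = 850 / 1000
p_hat_b = 900 / 1000
print(p_hat_a, p_hat_b)  # 0.85 0.9
```

Note `statistics.pvariance` (divide by n) exists alongside `statistics.variance` (divide by n - 1), mirroring the population/sample distinction in the formulas above.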

Calculate the sample proportion (bounce rate) for websites A and B. Is one better than the other? Do you think they are significantly different from each other?

3.7 Linking Inference, Probability, and Distributions

REFER BACK TO WARMUP PUZZLE 1

Statistical Model: A collection of RVs that describe observable data, along with their distributions and distribution parameters.

Parameter θ: A summary of a statistical model or population; the characteristic(s) that determine the joint distribution for the RVs of interest. The Parameter Space describes the set of all possible values of a parameter (ex: λ in [0, ∞)).

In statistics, we assume that we are observing a random sample of data from a distribution with parameter θ, for example:

    X_i ~ N(µ, σ^2) for i = 1...N

We observe X_1...X_n and calculate some statistics (ex: the sample mean).

Every sample statistic has a Sampling Distribution. This sampling distribution is the distribution that a statistic will take given a specific population probability model with some set of parameters. For example, the distribution of the sample mean is (via the Central Limit Theorem):

    X̄ ~ N(µ, σ^2/n)

The sampling distribution is different from the population distribution. CRITICAL NOTE: This is NOT the SD of individuals; rather, it is the SD of our estimate of the mean, if we resample many times. The true mean is just a constant; what makes our RV random is the sampling procedure!

3.8 Normal Distribution is special

The Normal Distribution is a good model of many natural phenomena, due to the Central Limit Theorem.

Central Limit Theorem (less casual definition): the sum of IID random variables (with finite variance) tends toward a normal distribution, even if the RVs themselves are not normally distributed. In other words, the Normal Distribution can be used to approximate the distribution of the sample mean, even if the original observations are not normally distributed!
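The CLT is easy to see in simulation: take means of n draws from a clearly non-normal distribution (a die roll, uniform on {1, ..., 6}) and check that they cluster around µ with spread close to σ/sqrt(n). A sketch:

```python
import random
import statistics

random.seed(0)  # reproducibility

def sample_mean(n):
    """Mean of n fair die rolls; a single roll is far from normal."""
    return statistics.mean(random.randint(1, 6) for _ in range(n))

n, reps = 50, 2000
means = [sample_mean(n) for _ in range(reps)]

# A die roll has mean 3.5 and variance 35/12, so the CLT predicts
# X-bar is approximately N(3.5, (35/12)/50), i.e. SD ≈ 0.242.
print(round(statistics.mean(means), 2))   # ≈ 3.5
print(round(statistics.stdev(means), 3))  # ≈ 0.242
```

Plotting a histogram of `means` would show the familiar bell shape, even though the histogram of individual rolls is flat.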

Example: Amount of Sleep for Harvard Students. Define a RV: Suppose that out of N = 6700 undergrads, we sample n = 100 students. Let X_i denote the amount of sleep the ith student in our sample gets.

- Why is X_i a random variable?
- Draw a reasonable distribution for X_i.
- Draw a reasonable distribution for X̄, the sample mean.

Binomial Approximation: The Normal distribution can also be used to approximate the Binomial distribution under certain conditions.

3.9 Tools of Inference: Confidence Intervals and Hypothesis Testing

In inference, we want to measure population parameters using sample statistics. The 2 major tools of statistical inference are Confidence Intervals and Hypothesis Testing.

Conceptual Example of a CI: We take a survey of a random sample of 1500 teenagers in the United States, and ask them how much they spent that year on movie tickets. The sample mean is $55. Do you think the population mean (i.e., the average amount spent by teenagers in the US) is EXACTLY $55? If we took another random sample of the population, do you expect to get a sample mean of $55 again?

Confidence Interval: describes the uncertainty associated with a sampling method/sample statistic. A 95% confidence interval is interpreted as having a 95% chance that the true parameter value is contained within the interval. Note that the true parameter value is fixed, but the bounds of the CI are random variables.

Example of a CI: 55 ± 3.4, also written as (51.6, 58.4). How would you interpret this confidence interval?

The size of a CI depends on the sample size and confidence level.

3.10 Next Lecture: Constructing a Confidence Interval (in brief)

1. Choose a Sample Statistic: calculate a point estimate using your sample.
2. Select a Confidence Level.
3. Calculate the Standard Error.
4. Calculate the Confidence Interval:

    Estimate ± (Z-Value) x (SD of Estimate)

Hypothesis Testing: A statistical hypothesis is an assumption about a population parameter. This assumption may or may not be true. Hypothesis testing refers to the formal procedures used by statisticians to accept or reject statistical hypotheses. [5]

3.11 Executive Summary

- The goal of inference is to use observable data to make a statement about a statistical model.
- Because population statistics are often unknown or difficult to measure, we can take a sample of the population and calculate statistics on it.
- Sample statistics include the sample mean, proportion and variance.
- Sample statistics have sampling distributions, with an associated standard error.
- The Normal Distribution is a good model of many types of data.
- Confidence intervals reflect the range where we expect the true parameter value to be. The width of the CI depends on the sample size and confidence level.

Next Week: Computing Confidence Intervals, Hypothesis Testing

[5] Definition from: http://stattrek.com/hypothesis-test/hypothesis-testing.aspx
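The four steps of Section 3.10 can be sketched for the movie-ticket example (sample mean $55, n = 1500). The sample standard deviation of $67 is made up for illustration, chosen so the result roughly matches the lecture's example interval of 55 ± 3.4:

```python
from math import sqrt

n = 1500
x_bar = 55.0  # Step 1: point estimate (sample mean from the survey)
z = 1.96      # Step 2: z-value for a 95% confidence level
s = 67.0      # hypothetical sample standard deviation, for illustration

# Step 3: standard error of the sample mean (SD of the estimate)
se = s / sqrt(n)

# Step 4: confidence interval = estimate ± z * SE
ci = (x_bar - z * se, x_bar + z * se)
print(f"55 ± {z * se:.1f} -> ({ci[0]:.1f}, {ci[1]:.1f})")
```

Raising the confidence level (a larger z) or shrinking the sample (a larger SE) both widen the interval, which is exactly the dependence on sample size and confidence level noted above.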