Lecture 4: Parameter Estimation and Confidence Intervals
GENOME 560, Doug Fowler, GS (dfowler@uw.edu)

Review: Probability Distributions
Discrete:
  Binomial distribution
  Hypergeometric distribution
  Poisson distribution
Continuous:
  Uniform distribution
  Exponential distribution
  Gamma distribution
  Normal distribution
The sums or means of sufficiently large samples drawn from any distribution with finite variance are approximately normally distributed (central limit theorem)
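The last point can be checked by simulation. The sketch below is only an illustration: the exponential distribution, the sample size of 50, the number of repeats, and the seed are all assumptions chosen for the example, not values from the lecture.

```r
# Sketch: means of samples from a skewed distribution behave approximately normally.
set.seed(560)                 # arbitrary seed, for reproducibility only
n_per_sample <- 50            # assumed sample size
n_samples <- 10000            # assumed number of repeated samples

# Exponential(rate = 1) has mean 1 and standard deviation 1.
sample_means <- replicate(n_samples, mean(rexp(n_per_sample, rate = 1)))

mean_of_means <- mean(sample_means)   # should be close to 1
sd_of_means <- sd(sample_means)       # should be close to 1 / sqrt(50)

# For a normal distribution, about 95% of values lie within 1.96 SD of the mean.
within_196 <- mean(abs(sample_means - 1) < 1.96 / sqrt(n_per_sample))

c(mean = mean_of_means, sd = sd_of_means, within_1.96_sd = within_196)
```

Raising n_per_sample should bring the fraction closer to 0.95, and very small samples should drift further from it.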

Goals
Basic concepts of parameter estimation
Confidence intervals

What Is a Parameter?
Variables vs. Parameters
According to Bard, Yonathan (1974)*:
"Usually a probabilistic model is designed to explain the relationships that exist among quantities which can be measured independently in an experiment; these are the variables of the model. To formulate these relationships, however, one frequently introduces "constants" which stand for inherent properties of nature. These are the parameters."
We often denote parameters by θ
* Bard, Yonathan (1974). Nonlinear Parameter Estimation. New York

Which are parameters, variables?
Binomial distribution (coin tossing)
X: number of heads after n coin tosses
$P\{X = k\} = \binom{n}{k} p^{k} (1 - p)^{n-k}$
variable: X (taking the value k); parameter: θ = p
Poisson distribution
X: number of experiments within a week
$P\{X = k\} = \frac{e^{-\lambda} \lambda^{k}}{k!}$
variable: X (taking the value k); parameter: θ = λ
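As a small illustration of the variable/parameter distinction, the sketch below evaluates both pmfs with R's built-in dbinom and dpois, then repeats the calculation directly from the formulas. The specific values (10 tosses, k = 3, p = 0.5 or 0.2, λ = 4) are assumptions chosen only for the example.

```r
# Binomial: k is the value of the variable, p is the parameter.
n_tosses <- 10                                     # assumed number of coin tosses
binom_fair <- dbinom(3, size = n_tosses, prob = 0.5)   # P{X = 3} when p = 0.5
binom_biased <- dbinom(3, size = n_tosses, prob = 0.2) # same k, different parameter

# The same probability written out from the binomial formula.
binom_by_hand <- choose(n_tosses, 3) * 0.5^3 * (1 - 0.5)^(n_tosses - 3)

# Poisson: k is the value of the variable, lambda is the parameter.
pois_builtin <- dpois(2, lambda = 4)               # P{X = 2} when lambda = 4
pois_by_hand <- exp(-4) * 4^2 / factorial(2)       # same value from the formula

c(binom_fair, binom_by_hand, binom_biased, pois_builtin, pois_by_hand)
```

Changing the parameter changes the whole distribution, while changing k only asks about a different outcome under the same distribution.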

Parameters Can Tell Us About Samples
If we can describe a population using a parametric pdf or pmf f(x | θ), and we know the parameter values, then we can say what typical samples from the population will look like

...and Samples Can Tell Us About Parameters
We can use sample data to estimate parameter values
If we are tossing a coin, we would like to estimate the parameter p
If we are counting the number of experiments per week, we would like to estimate λ

Central Dogma of Statistics
If we can describe the population using a parametric distribution, and we know the parameter values, then we can say what typical samples from the population will look like

Parameter Estimation
Estimator: statistic whose calculated value is used to estimate a parameter, θ
Estimate: a particular realization of an estimator, $\hat{\theta}$
Types of estimates:
  Point estimate: single number that can be regarded as the most plausible value of θ, given the data we have
  Interval estimate: a range of numbers, called a confidence interval, that informs us about the quality of our estimate

Simple Example Estimators
Suppose we take a sample from a binomial distribution whose parameters are unknown. We get m successes from n samples. How can we estimate the parameter π (the population p)?
Method 1: We could just use m and n: $\hat{\pi} = m/n$
Method 2: Alternatively, we could just look in the literature for similar experiments. We could ignore our data and set $\hat{\pi}$ to a value reported there

Good Estimators Are:
Consistent: as sample size increases, $\hat{\theta}$ gets closer to θ
Would our example estimators be consistent?
  Estimator 1, yes (m/n will approach π, law of large numbers)
  Estimator 2, no (our data doesn't matter)
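Consistency can be seen in a quick simulation. This is a sketch under stated assumptions: true_pi stands in for the unknown population value and literature_pi is a hypothetical literature value for the second estimator; neither number comes from the lecture.

```r
set.seed(560)                    # arbitrary seed, for reproducibility only
true_pi <- 0.3                   # assumed population value (unknown in practice)
literature_pi <- 0.5             # hypothetical value taken from the literature
sample_sizes <- c(10, 100, 1000, 10000, 100000)

# Estimator 1: m / n, computed from one simulated experiment per sample size.
estimator1 <- sapply(sample_sizes,
                     function(n) rbinom(1, size = n, prob = true_pi) / n)

# Estimator 2: ignores the data, so it never changes with n.
estimator2 <- rep(literature_pi, length(sample_sizes))

data.frame(n = sample_sizes, estimator1 = estimator1, estimator2 = estimator2)
```

The first column should settle near true_pi as n grows, while the second stays wherever the literature value put it.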

What do we mean by unbiased?
A biased estimator diverges systematically from the true parameter value

Good Estimators Are:
Consistent: as sample size increases, $\hat{\theta}$ gets closer to θ
Unbiased: the expected value of $\hat{\theta}$ is equal to θ
Are our example estimators biased?
  Estimator 1 turns out to be unbiased (E[m/n] = nπ/n = π)
  Estimator 2 has a bias (unless the literature value happens to equal π)

What do we mean by precise?
An imprecise estimator is subject to large amounts of random variability

Good Estimators Are:
Consistent: as sample size increases, $\hat{\theta}$ gets closer to θ
Unbiased: the expected value of $\hat{\theta}$ is equal to θ
Precise: the variance of $\hat{\theta}$ should be minimal
  Estimator 1 has variance π(1 − π)/n, which shrinks as n grows
  Estimator 2 has zero variance
Bias and variance are intertwined, and often you will have to choose to minimize one or the other
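The trade-off between the two example estimators can be made concrete by repeating the experiment many times. The values below (true_pi, literature_pi, n = 20, 10,000 repeats) are assumptions for illustration only.

```r
set.seed(560)                    # arbitrary seed, for reproducibility only
true_pi <- 0.3                   # assumed population value
literature_pi <- 0.5             # hypothetical literature value
n <- 20                          # assumed number of trials per experiment
n_rep <- 10000                   # assumed number of repeated experiments

m <- rbinom(n_rep, size = n, prob = true_pi)   # successes in each experiment
est1 <- m / n                                  # Estimator 1: uses the data
est2 <- rep(literature_pi, n_rep)              # Estimator 2: ignores the data

bias1 <- mean(est1) - true_pi    # close to 0: unbiased
bias2 <- mean(est2) - true_pi    # literature_pi - true_pi: biased
var1 <- var(est1)                # close to true_pi * (1 - true_pi) / n
var2 <- var(est2)                # exactly 0: perfectly precise, but wrong

rbind(estimator1 = c(bias = bias1, variance = var1),
      estimator2 = c(bias = bias2, variance = var2))
```

Estimator 2 wins on variance and loses on bias, which is the sense in which the two properties have to be weighed against each other.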

Estimators for normally distributed data
Since we know that much experimental data is normally distributed, let's start here
General methods for estimating parameters (MLE, Bayesian) will be covered later.

Estimators for normally distributed data
What two parameters define a normal distribution?
  mean = μ
  standard deviation = σ
[Figure: normal density curve for x from -3 to 3 (y-axis labeled F(x), 0.0 to 0.4), annotated with μ and σ]

Estimators for normally distributed data
Given a sample from a normally distributed population, what estimators would you use for μ, σ?
$\hat{\mu} = \bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i$
$\hat{\sigma} = s = \sqrt{\frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2}$
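In R these two estimators are the built-in mean() and sd(). The sketch below computes them by hand and with the built-ins on a simulated sample; the population values used to simulate (mean 90, standard deviation 25) are assumptions for the example.

```r
set.seed(560)                          # arbitrary seed, for reproducibility only
x <- rnorm(25, mean = 90, sd = 25)     # assumed population, illustration only
n <- length(x)

# Estimators written out from their definitions.
mu_hat <- sum(x) / n
sigma_hat <- sqrt(sum((x - mu_hat)^2) / (n - 1))   # note the n - 1 denominator

# The built-in functions compute the same quantities.
c(mu_hat = mu_hat, mean_builtin = mean(x),
  sigma_hat = sigma_hat, sd_builtin = sd(x))
```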

Confidence intervals: how good is my parameter estimate?

Back to our fluorescent yeast
Let's say we measure the fluorescence of 25 yeast cells and find $\bar{x} = 89.1$; $s = 24.25$
How good is our estimate $\hat{\mu} = 89.1$?
On what will the goodness of the estimate depend?
  Sample size
  Variability of the population from which the samples were drawn

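A simple starting point
What is the probability that the mean of a second sample from the culture is between 79.1 and 99.1?
$P(\bar{x}_2 \text{ is within 10 of } \bar{x})$
Recall that $\bar{x}$ is a RV with its own sampling distribution
The sampling distribution of the sample mean:
  Is normal (by the central limit theorem)
  Has mean $\mu_{\bar{x}} = \mu \approx \bar{x}$
  Has standard deviation $\sigma_{\bar{x}} = \sigma/\sqrt{n} \approx s/\sqrt{n}$
[Figure: sampling distribution of the sample mean, centered at $\hat{\mu} = \mu_{\bar{x}}$]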

Standard Error of the Mean
SEM is the standard deviation of the sampling distribution of the mean
Often confused with the standard deviation of a sample in the literature.
The standard deviation is descriptive of the sample we took, but SEM describes the spread of the sampling distribution of the mean itself.
  SD of the sample is the degree to which individuals within a sample differ from the sample mean
  SEM reflects uncertainty about where the population mean might be located, given our sample

A simple starting point
Given that we have an estimate of the PDF of the sampling distribution of the sample mean, how might we calculate the probability that a second sample mean is within some distance of the mean of the sampling distribution?
We just need to find the area under the sampling distribution of the sample mean corresponding to the mean +/- 10
This will be the probability that $\bar{x}_2$ is within 10 of $\mu_{\bar{x}}$
$\sigma_{\bar{x}} \approx s/\sqrt{n} = 24.25/5 = 4.85$
  > pnorm(c(79.1, 99.1), mean = 89.1, sd = 4.85)
  [1] 0.01961074 0.98038926
  > 0.9804-0.01961
  [1] 0.96079

Generalization
We want to set a confidence interval such that 95% of sample means from the distribution are within the interval
Given that we can estimate the mean and standard deviation of the sampling distribution of the sample mean, how do we do this?

Generalization
We find the number of standard deviations (z) we must move away from the mean to encompass 95% of the sampling distribution of the sample mean
[Figure: sampling distribution of the sample mean centered at $\mu_{\bar{x}}$, with z marked and 5% of the total area labeled]

Generalization
Since the distribution is symmetric, we can just use the CDF to accomplish this
To find z such that $\mathrm{CI}_{95\%} = \mu_{\bar{x}} \pm z \, \frac{s}{\sqrt{n}}$, we can use the normal cumulative distribution function
[Figure: sampling distribution of the sample mean centered at $\mu_{\bar{x}}$, with 97.5% of the total area lying below z]
Now we can set a 95% CI for our fluorescence data
  > min(which(pnorm(seq(-3,3,0.01))>=0.975))
  [1] 497
  > seq(-3,3,0.01)[497]
  [1] 1.96
  > 89.1 - 4.85 * 1.96
  [1] 79.594
  > 89.1 + 4.85 * 1.96
  [1] 98.606
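The grid search over the CDF can also be done with the quantile function qnorm, which inverts pnorm directly. The sketch below wraps the calculation in a user-defined function; the name normal_ci and its arguments are hypothetical, chosen for this example rather than taken from the lecture.

```r
# Normal-approximation confidence interval for a mean from summary statistics.
# normal_ci is a hypothetical helper name, not part of base R.
normal_ci <- function(xbar, s, n, level = 0.95) {
  sem <- s / sqrt(n)                     # estimated standard error of the mean
  z <- qnorm(1 - (1 - level) / 2)        # e.g. the 97.5% quantile for a 95% CI
  c(lower = xbar - z * sem, upper = xbar + z * sem)
}

# Fluorescent yeast data from the slides: 25 cells, mean 89.1, s 24.25.
yeast_ci <- normal_ci(xbar = 89.1, s = 24.25, n = 25)
yeast_ci
```

Because qnorm(0.975) is about 1.96, this should agree with the interval computed on the slide to within rounding; passing level = 0.99 would give the wider 99% interval.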

A practical note
When sample sizes are greater than ~30, the sampling distribution of the sample mean is approximately normal and $\sigma_{\bar{x}} \approx s/\sqrt{n}$ is a good estimate
When sample sizes are smaller than ~30, $s/\sqrt{n}$ tends to underestimate $\sigma_{\bar{x}}$
Thus, in practice we use the t-distribution as opposed to the normal distribution (more on this later)
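As a preview, the sketch below compares the normal-based and t-based 95% intervals for the yeast summary statistics from the earlier slides (n = 25, mean 89.1, s 24.25). It assumes the usual t interval with n − 1 degrees of freedom, which the lecture covers later.

```r
xbar <- 89.1                        # sample mean from the yeast example
s <- 24.25                          # sample standard deviation
n <- 25                             # number of cells measured
sem <- s / sqrt(n)                  # 4.85

z_crit <- qnorm(0.975)              # normal critical value
t_crit <- qt(0.975, df = n - 1)     # t critical value with 24 degrees of freedom

z_ci <- xbar + c(-1, 1) * z_crit * sem
t_ci <- xbar + c(-1, 1) * t_crit * sem

rbind(normal = z_ci, t = t_ci)
```

The t critical value is larger than the normal one, so the t interval is wider; the gap shrinks as n grows.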

Wait a minute
We began by taking a sample and using it to estimate the sampling distribution of the sample mean.
Then, using the central limit theorem and the normal CDF, we computed the interval within which 95% of the area of our sample-based estimate of the sampling distribution of the sample mean falls.
We might conclude that there was a 95% chance that this interval contained the true population mean.
Does anyone have a problem with this?

Wait a minute
Philosophically, this makes no sense. The population mean is a fixed quantity, so it is either inside or outside the interval we calculated. Period.
Additionally, the process of sampling is subject to sampling variability. So, we might have drawn a really weird sample that poorly represents the population.

Interpretation of confidence intervals
In fact, it's better to build the idea of sampling variation into our interpretation of a confidence interval
If you repeatedly sample the same population, the X% CI (which differs for each sample) would contain the true population parameter X% of the time
  NOT the probability that this particular CI from this particular sample actually contains the population parameter
  NOT that there is an X% probability of a sample mean from a repeat experiment falling within the interval
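The repeated-sampling interpretation can be checked directly by simulation. In this sketch the population is assumed normal with mean 89 and standard deviation 24 (values invented for the example), and each interval is a t-based 95% CI because the samples are small.

```r
set.seed(560)            # arbitrary seed, for reproducibility only
mu <- 89                 # assumed true population mean
sigma <- 24              # assumed true population standard deviation
n <- 25                  # sample size per experiment
n_rep <- 10000           # number of repeated experiments

# For each repeated sample, build a 95% CI and record whether it contains mu.
covers <- replicate(n_rep, {
  x <- rnorm(n, mean = mu, sd = sigma)
  half_width <- qt(0.975, df = n - 1) * sd(x) / sqrt(n)
  abs(mean(x) - mu) <= half_width
})

coverage <- mean(covers)   # fraction of intervals that contain mu
coverage
```

The fraction should come out close to 0.95. Any single interval either contains mu or does not; the 95% describes the procedure across repeats.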

R Session Goals
Confidence interval calculations
User-defined functions

Standard Error of the Mean
$X_1, \dots, X_n$ are independent observations from a population with mean μ and standard deviation σ
[Derivation equations not recoverable from the slides; one step is annotated "is a property of RV"]
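For reference, a standard derivation of the standard error of the mean is sketched below. It is not necessarily the sequence of steps shown on the original slides; it assumes only independence of the observations, the property $\operatorname{Var}(aX) = a^2 \operatorname{Var}(X)$, and additivity of variance for independent random variables.

```latex
\begin{align*}
\operatorname{Var}(\bar{x})
  &= \operatorname{Var}\!\left(\frac{1}{n}\sum_{i=1}^{n} X_i\right) \\
  &= \frac{1}{n^2}\operatorname{Var}\!\left(\sum_{i=1}^{n} X_i\right)
     && \text{since } \operatorname{Var}(aX) = a^2\operatorname{Var}(X) \\
  &= \frac{1}{n^2}\sum_{i=1}^{n}\operatorname{Var}(X_i)
     && \text{since the } X_i \text{ are independent} \\
  &= \frac{1}{n^2}\, n\,\sigma^2 = \frac{\sigma^2}{n} \\[4pt]
\sigma_{\bar{x}} &= \sqrt{\operatorname{Var}(\bar{x})} = \frac{\sigma}{\sqrt{n}}
\end{align*}
```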

Multivariate Hypergeometric Dist
The HGD can be generalized to picking a sample of size n in which there are exactly $(k_1, k_2, \dots, k_c)$ items from each of c classes, drawn from a population of N items of c classes where there are $K_i$ items of class i
pmf: $P(k_1, \dots, k_c) = \frac{\prod_{i=1}^{c} \binom{K_i}{k_i}}{\binom{N}{n}}$
Example: There are 5 black, 10 white and 15 red balls in an urn. If you draw six without replacement, what is the probability that you pick 2 of each color?
$\frac{\binom{5}{2}\binom{10}{2}\binom{15}{2}}{\binom{30}{6}} \approx 0.08$
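The pmf translates almost directly into R using choose(). The function name dmvhyper below is a hypothetical helper written for this example, not a base R function.

```r
# Multivariate hypergeometric pmf.
# k: number drawn from each class; K: number available in each class.
# dmvhyper is a hypothetical helper name, not part of base R.
dmvhyper <- function(k, K) {
  stopifnot(length(k) == length(K), all(k >= 0), all(k <= K))
  # Product of per-class binomial coefficients over the total number of draws.
  prod(choose(K, k)) / choose(sum(K), sum(k))
}

# Urn example: 5 black, 10 white, 15 red; draw 6 and get 2 of each color.
p_two_each <- dmvhyper(k = c(2, 2, 2), K = c(5, 10, 15))
round(p_two_each, 2)
```

With two classes this reduces to the ordinary hypergeometric distribution, so it can be cross-checked against dhyper.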