Lecture III. 1. common parametric models 2. model fitting 2a. moment matching 2b. maximum likelihood 3. hypothesis testing 3a. p-values 3b.


Flora Porter
5 years ago

1 Lecture III 1. common parametric models 2. model fitting 2a. moment matching 2b. maximum likelihood 3. hypothesis testing 3a. p-values 3b. simulation

2 Parameters Parameters are knobs that control the amount of probability we assign to particular events. In the last homework, λ was a parameter. Parameters let us assign probabilities to many events (even an infinity of them) using just a few numbers. Simplicity, simplicity, simplicity!

3 Bernoulli In the Bernoulli coin flip model, the only parameter is the probability of a success. $p^x (1-p)^{1-x}$, $x \in \{0, 1\}$ A success could be defined as: 1. a win or a loss in a sporting event 2. a profit or a loss in a fiscal year 3. a male or female newborn...

4 Binomial When we go on to consider the sum of n Bernoulli trials, we get a binomial model, which has parameters n and p. $\Pr(X = x) = \binom{n}{x} p^x (1-p)^{n-x}$, $X \in \{0, 1, \dots, n\}$ Could be used to model votes in a population, season win-loss record, number of students who get an A in a class...

5 Binomial $\Pr(X = x) = \binom{n}{x} p^x (1-p)^{n-x}$, $X \in \{0, 1, \dots, n\}$
> n <- 10
> x <- seq(0,n,by=1)
> px <- dbinom(x,n,0.3)
> barplot(px,names.arg=x)

6 Binomial $\Pr(X = x) = \binom{n}{x} p^x (1-p)^{n-x}$, $X \in \{0, 1, \dots, n\}$
> n <- 10
> x <- seq(0,n,by=1)
> px <- dbinom(x,n,0.8)
> barplot(px,names.arg=x)

7 Binomial $\Pr(X = x) = \binom{n}{x} p^x (1-p)^{n-x}$, $X \in \{0, 1, \dots, n\}$
> n <- 55
> x <- seq(0,n,by=1)
> px <- dbinom(x,n,0.8)
> barplot(px,names.arg=x)

8 Poisson For large n we might consider a model which allows arbitrarily large integers. $\Pr(X = x) = \frac{\lambda^x}{x!} \exp(-\lambda)$, $E(X) = \lambda$, $\mathrm{Var}(X) = \lambda$ Could be used to model number of infections in a large population (epidemiology), number of web hits in a given day (e-commerce) or counts of a certain word in a document (text analysis). What purpose does the exponential term $\exp(-\lambda)$ serve in this expression?

9 Poisson $\Pr(X = x) = \frac{\lambda^x}{x!} \exp(-\lambda)$
> lambda <- 1
> x <- seq(0,15,by=1)
> px <- dpois(x,lambda)
> barplot(px,names.arg=x)

10 Poisson $\Pr(X = x) = \frac{\lambda^x}{x!} \exp(-\lambda)$
> lambda <- 3
> x <- seq(0,15,by=1)
> px <- dpois(x,lambda)
> barplot(px,names.arg=x)

11 Poisson $\Pr(X = x) = \frac{\lambda^x}{x!} \exp(-\lambda)$
> lambda <- 20
> x <- seq(0,50,by=1)
> px <- dpois(x,lambda)
> barplot(px,names.arg=x)

12 Normal distribution The renown of this distribution is well-deserved, as it arises in a variety of different contexts. We will see these connections in a moment. $f(x) = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left(-\frac{1}{2}\frac{(x-\mu)^2}{\sigma^2}\right)$ We can use this distribution to model log-returns of stocks, heights of individuals in the general population, or astronomical measurements taken by different individuals at different times, etc.

13 Normal distribution The parameters of the normal distribution are easy to interpret, because they give the mean and variance of the distribution, respectively: the expected value and the dispersion. $f(x) = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left(-\frac{1}{2}\frac{(x-\mu)^2}{\sigma^2}\right)$, $E(X) = \mu$, $\mathrm{Var}(X) = \sigma^2$, where $E(X) = \int x f(x)\,dx$ and $\mathrm{Var}(X) = \int (x - E(X))^2 f(x)\,dx$. Just remember, integrals are basically sums.

14 Normal distribution $f(x) = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left(-\frac{1}{2}\frac{(x-\mu)^2}{\sigma^2}\right)$
> mu <- -5
> sigma <- 1
> x <- seq(-10,10,by=0.1)
> fx <- dnorm(x,mu,sigma)
> plot(x,fx,type='l',bty='n')

15 Normal distribution $f(x) = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left(-\frac{1}{2}\frac{(x-\mu)^2}{\sigma^2}\right)$
> mu <- 1
> sigma <- 2.5
> x <- seq(-10,10,by=0.1)
> fx <- dnorm(x,mu,sigma)
> plot(x,fx,type='l',bty='n')

16 Central limit theorems As n gets large, the binomial distribution starts to look like a normal distribution with parameters $N(np,\, np(1-p))$. [Figure: p = 0.3, n = 10]

17 Central limit theorems As n gets large, the binomial distribution starts to look like a normal distribution with parameters $N(np,\, np(1-p))$. [Figure: p = 0.3, n = 20]

18 Central limit theorems As n gets large, the binomial distribution starts to look like a normal distribution with parameters $N(np,\, np(1-p))$. [Figure: p = 0.3, n = 100]

19 Central limit theorems One can also approximate the Poisson distribution using a normal distribution, again by making the mean and variance match.

20 Central limit theorems A CLT is a theoretical result which says that sums and averages of random variables with finite mean and variance come to be normally distributed as the number of terms in the sum grows large. The binomial and Poisson approximations are special cases of this. Recall that a binomial random variable is a sum of independent Bernoulli random variables. The Poisson distribution can be similarly characterized.
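The normal approximation to the binomial can be checked numerically. The sketch below uses p = 0.3 as on the slides; the helper name `approx_error` is an illustrative assumption, not something from the lecture. It compares the binomial probabilities with the density of the moment-matched normal and reports the largest gap for each n.

```r
# Largest gap between the Binomial(n, p) probabilities and the density of
# the moment-matched normal N(np, np(1 - p)), evaluated at x = 0, ..., n.
# approx_error is a hypothetical helper name, not from the lecture.
approx_error <- function(n, p) {
  x  <- 0:n
  px <- dbinom(x, n, p)
  fx <- dnorm(x, mean = n * p, sd = sqrt(n * p * (1 - p)))
  max(abs(px - fx))
}

# Same values of n as on the slides
err <- sapply(c(10, 20, 100), approx_error, p = 0.3)
names(err) <- c("n=10", "n=20", "n=100")
print(err)
```

The gap should shrink as n grows, which is the pattern the three plots illustrate.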

21 Quincunx

22 Quincunx

23 Linear combinations of normals A fact about the normal distribution which is related to the central limit theorem is that weighted sums of normal random variables are still normally distributed, just with different parameters. $X \sim N(\mu, \sigma^2)$, $Y = \alpha + \beta X$, $Y \sim N(\alpha + \beta\mu,\, \beta^2\sigma^2)$

24 Linear combinations of normals In the case of two independent random variates we have: $X_1 \sim N(\mu_1, v_1)$, $X_2 \sim N(\mu_2, v_2)$, $Y = \alpha + \beta_1 X_1 + \beta_2 X_2$, $Y \sim N(\alpha + \beta_1\mu_1 + \beta_2\mu_2,\, \beta_1^2 v_1 + \beta_2^2 v_2)$ Intuitively, this works because the normal distribution is uniquely characterized by its first and second moments (i.e., its mean and variance), and...

25 Mean and variance of linear combinations of RVs ... in general, the mean and variance of linear combinations may be calculated as follows. With $Y = \alpha + \beta_1 X_1 + \beta_2 X_2$ and $X_1$, $X_2$ independent: $\mathrm{Var}(Y) = \mathrm{Var}(\alpha + \beta_1 X_1 + \beta_2 X_2) = \mathrm{Var}(\alpha) + \mathrm{Var}(\beta_1 X_1) + \mathrm{Var}(\beta_2 X_2) = \beta_1^2 \mathrm{Var}(X_1) + \beta_2^2 \mathrm{Var}(X_2)$ and $E(Y) = E(\alpha + \beta_1 X_1 + \beta_2 X_2) = E(\alpha) + E(\beta_1 X_1) + E(\beta_2 X_2) = \alpha + \beta_1 E(X_1) + \beta_2 E(X_2)$
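These formulas do not require normality, which a quick simulation can illustrate. The following is a minimal sketch under assumed, purely illustrative choices: an exponential X1, a Poisson X2, and arbitrary coefficients. None of these values come from the lecture.

```r
set.seed(1)                      # arbitrary seed, for reproducibility only
n  <- 1e5
x1 <- rexp(n, rate = 1 / 2)      # E(X1) = 2, Var(X1) = 4
x2 <- rpois(n, lambda = 3)       # E(X2) = 3, Var(X2) = 3
alpha <- 1; beta1 <- 2; beta2 <- -1
y  <- alpha + beta1 * x1 + beta2 * x2

# Values predicted by the formulas (x1 and x2 are drawn independently)
mean_theory <- alpha + beta1 * 2 + beta2 * 3
var_theory  <- beta1^2 * 4 + beta2^2 * 3

# Simulated value next to the predicted value
c(mean(y), mean_theory)
c(var(y),  var_theory)
```

The simulated mean and variance should land close to the predicted ones, up to sampling noise.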

26 Exponential distribution For strictly positive quantities, an exponential distribution may be appropriate. $f(x) = \frac{\exp(-x/\lambda)}{\lambda}$, $E(X) = \lambda$, $\mathrm{Var}(X) = \lambda^2$ Look familiar?

27 Exponential distribution We have been reporting the probability density functions for these distributions, but we could as well have reported the cumulative distribution functions. $F(x) = 1 - \exp\left(-\frac{x}{\lambda}\right)$ What is the inverse of this?
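One way to answer the question: solving u = 1 - exp(-x/λ) for x gives x = -λ log(1 - u). The sketch below checks this against R's built-in quantile function and uses it to generate exponential draws from uniform ones. The value of `lambda` and the function name `Finv` are illustrative assumptions. Note that R parameterizes the exponential by its rate, which is 1/λ in the notation of the slide.

```r
lambda <- 2                                  # illustrative value
# Inverse of F(x) = 1 - exp(-x / lambda); Finv is a hypothetical name
Finv <- function(u, lambda) -lambda * log(1 - u)

u <- c(0.1, 0.5, 0.9)
Finv(u, lambda)
qexp(u, rate = 1 / lambda)                   # built-in quantile function, for comparison

# Inverse-CDF sampling: pushing uniform draws through Finv gives exponential draws
set.seed(1)
x <- Finv(runif(1e5), lambda)
mean(x)                                      # should be close to lambda
```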

28 Using the CDF Recall that CDFs allow us to compute the probability assigned to intervals. [Figure: Fx against x]

29 Using the CDF In terms of the pdf, this is the area under the curve on that interval. I think the CDF is easier to think about -- remember the darts! [Figure: fx against x]
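Both views can be computed side by side. This is a small sketch for the exponential distribution, with an illustrative λ and interval that are assumptions rather than values from the lecture: the probability of an interval is F(b) - F(a), and it matches the numerically integrated area under the density.

```r
lambda <- 2; a <- 1; b <- 3                  # illustrative values

# Via the CDF: Pr(a < X <= b) = F(b) - F(a)
p_cdf <- pexp(b, rate = 1 / lambda) - pexp(a, rate = 1 / lambda)

# Via the pdf: area under the density between a and b
p_pdf <- integrate(dexp, lower = a, upper = b, rate = 1 / lambda)$value

c(p_cdf, p_pdf)
```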

30 t distribution Recall the emerging markets data had a decent proportion of astronomical profits/losses. The normal distribution does not have this feature (shown in green for comparison). $f(x) = \frac{\Gamma\left(\frac{\nu+1}{2}\right)}{\sqrt{\nu\pi}\,\Gamma\left(\frac{\nu}{2}\right)} \left(1 + \frac{(x-\mu)^2}{\nu}\right)^{-\frac{\nu+1}{2}}$

31 Cauchy distribution With $\nu = 1$ we have the Cauchy distribution. The mean of this distribution doesn't exist! Practically, this means that if you collected a bunch of realizations of Cauchy random variables, the sample mean would never settle down no matter how many samples you collect. [Figure: cumulative mean against number of samples]
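A plot of this kind can be reproduced along the following lines. This is a sketch, not the code behind the original figure: the seed, the sample size, and the helper name `cum_mean` are assumptions. A normal sample is overlaid in green for contrast.

```r
set.seed(1)                                   # arbitrary seed
n <- 1e4
# Running average after 1, 2, ..., n draws; cum_mean is a hypothetical helper
cum_mean <- function(x) cumsum(x) / seq_along(x)

cm_cauchy <- cum_mean(rcauchy(n))             # t distribution with nu = 1
cm_normal <- cum_mean(rnorm(n))               # settles down near 0

plot(cm_cauchy, type = 'l', bty = 'n',
     xlab = 'Number of samples', ylab = 'Cumulative mean')
lines(cm_normal, col = 'green')
```

Rerunning with different seeds tends to show the Cauchy curve jumping at unpredictable points, while the normal curve flattens out.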

32 Model fitting So far we've considered the forward direction, where we specify a probability distribution from which we can calculate the probability of various events. There are two additional components to modeling real data. First is to figure out what to set the parameter values at, based on the observed data. This step is called estimation or model fitting. Last is the uncertainty assessment, where we try to characterize how confident we are in our estimates. This can be done with techniques called hypothesis testing and confidence intervals.

33 Moment matching One intuitive approach to model fitting is to select the parameter so that certain features of the observed data match up with features of the parametric distribution. Specifically, we can pick the parameter so that the population mean is equal to the observed sample mean. So in the case of a normal model, we might set $\hat{\mu} = \bar{x}_n$ and $\hat{\sigma}^2 = \frac{\sum_{j=1}^{n} (x_j - \bar{x}_n)^2}{n}$
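In R, the two moment-matching estimates for a normal model take a couple of lines. The data below are simulated stand-ins (an assumption, since no data set accompanies this slide). One detail worth noting: R's `var()` divides by n - 1 rather than n, so it differs from the formula above by a factor of (n - 1)/n.

```r
set.seed(1)
x <- rnorm(500, mean = 10, sd = 3)       # simulated stand-in for observed data
n <- length(x)

mu_hat     <- mean(x)                    # matches the first moment
sigma2_hat <- sum((x - mu_hat)^2) / n    # divides by n, as in the formula

c(mu_hat, sigma2_hat)
var(x) * (n - 1) / n                     # same value, recovered from var()
```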

34 Sample mean vs population mean These are not the same thing! A random variable is a curve that maps intervals in the domain to numbers between zero and one. The population mean is an attribute of this curve. The sample mean is a function of random draws generated according to the chance process described by the curve. So the sample mean itself is a random variable! At least until we observe it. Once we observe it, it's data.

35 Sample mean of normal RV Recall that the sum of independent normal random variables is also a normal random variable. This lets us figure out what the distribution of the sample mean of normal random variables is. $X_j \sim N(\mu, \sigma^2)$, $j = 1, \dots, n$, $\bar{X}_n \sim N\left(\mu, \frac{\sigma^2}{n}\right)$ Notice that its distribution depends on the parameter values of the data random variables.
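The sampling distribution of the sample mean can be seen directly by drawing many samples and computing the mean of each. The parameter values and the number of replications below are illustrative assumptions.

```r
set.seed(1)
mu <- 5; sigma <- 2; n <- 25             # illustrative values

# 10000 independent samples of size n, one sample mean from each
xbar <- replicate(10000, mean(rnorm(n, mu, sigma)))

c(mean(xbar), mu)                        # centre of the sampling distribution
c(var(xbar), sigma^2 / n)                # spread shrinks like 1 / n
```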

36 Statistics This is true in general of statistics of random variables: they are themselves random variables. The distribution of a statistic is called its sampling distribution. What is a statistic? Any function of a random variable or variables. Can you think of some examples? If you start with a Bernoulli model, what statistic has a binomial sampling distribution?

37 Example: Win-Loss The Bulls had a record at the end of the season. If we model each of their games as a Bernoulli random variable with probability p of winning, then 72 wins can be considered a draw from a binomial distribution with parameters p and n=82. The expected value of a binomial random variable is np, so we set np = 72, giving $\hat{p} = 72/82 \approx 0.878$

40 Example: Word Count If we go to major online news outlets (CNN, Fox, ABC, Slate, Washington Post, NY Times, etc), we can do a search for Romney. $\hat{\lambda} = 4.66$ Is this a good fit? $\hat{\lambda}_2 = 10.35$

41 Example: Daily Stock Prices What if we look at the daily change in stock returns measured in log dollars? Consider Apple. [Figure: differenced log close price of Apple Inc.; x-axis: change in log price, y-axis: density]

42 Example: Daily Stock Prices There are probably more unusual events than the normal model would suggest. [Figure: change in log close, in log dollars]

43 Example: Daily Stock Prices There are probably more unusual events than the normal model would suggest. [Figure: change in log close, in log dollars]

44 Example: Daily Stock Prices [Figure: normal Q-Q plot; x-axis: theoretical quantiles, y-axis: sample quantiles]

45 Maximum Likelihood Pia is 31 years old, single, outspoken, and smart. She was a philosophy major. When a student, she was an ardent supporter of Native American rights, and she picketed a department store that had no facilities for nursing mothers. Rank the following statements in order of probability from 1 (most probable) to 6 (least probable).
a) Pia is an active feminist.
b) Pia is a bank teller.
c) Pia works in a small bookstore.
d) Pia is a bank teller and an active feminist.
e) Pia is a bank teller and an active feminist who takes yoga classes.
f) Pia works in a small bookstore and is an active feminist who takes yoga classes.

46 Maximum Likelihood Principle: Pick the model (from some circumscribed class) that makes the data look best in the sense of being most likely under that model. In many cases, moment matching and maximum likelihood estimates coincide. This is true in the case of the Normal distribution, for example.

47 Maximum Likelihood Consider again the word count data. We can compute and plot the likelihood as a function of lambda. Why do/can we take logarithms here? Why is the log-likelihood written as a sum?
> # c is assumed to hold the vector of observed word counts
> lam <- seq(0.2,10,by=0.01)
> b <- rep(0,length(lam))
> for (j in 1:length(lam)){
+ b[j] <- sum(dpois(c,lam[j],log=TRUE))}
> plot(lam,b,xlab='lambda',ylab='log-likelihood',type='l')
> lines(c(mean(c),mean(c)),c(-350,-10),col='red',lwd=3)
[Figure: log-likelihood against λ, with a red vertical line at the sample mean]
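Instead of reading the maximum off a plot, one can hand the log-likelihood to a numerical optimizer. The sketch below uses hypothetical simulated counts (the lecture's word count data are not reproduced here) and R's one-dimensional optimizer `optimize`. For the Poisson model the maximizer should agree with the sample mean, which is where the slide draws the red line.

```r
set.seed(1)
counts <- rpois(50, lambda = 4)     # hypothetical counts, not the lecture's data

# Poisson log-likelihood: a sum of log-probabilities over the observations
loglik <- function(lam) sum(dpois(counts, lam, log = TRUE))

# Search the same range of lambda as the grid on the slide
fit <- optimize(loglik, interval = c(0.2, 10), maximum = TRUE)

c(fit$maximum, mean(counts))        # maximum likelihood estimate vs sample mean
```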

48 The Logic of Hypothesis Testing We keep hinting at the idea that we want to gauge our confidence in our parameter estimates. We begin with a special case where we have an a priori parameter value of interest. This parameter setting is called a null hypothesis, because it often encodes the idea that nothing interesting is going on or that the data could have happened by chance. The approach is this: consider the sampling distribution of a statistic of the data if the null hypothesis were true; if the observed statistic is unlikely under that distribution we reject the null hypothesis, otherwise we fail to reject.

49 p-values The way we measure unlikeliness is via a tail probability: the probability (under the null) of drawing a statistic as extreme as, or more extreme than, the one we actually saw.

50 Example: difference of two means Do I wake up earlier on Monday than I do on Friday? To answer this question, imagine I collected the time I woke up over 20 consecutive weeks. If every Friday time was later than that week's Monday time, what would we conclude? In general, some Fridays I will rise later, but some I will rise earlier. The question is how can I decide if this effect is real, or just happened by chance?

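51 Example: difference of two means Assume for simplicity that the time I wake up on Fridays is distributed as $F \sim N(\mu_F, 100\ \text{min}^2)$ and that my Monday wake-up time is independently distributed as $M \sim N(\mu_M, 100\ \text{min}^2)$, where the second parameter is the variance. Then the sample mean of the difference between my wake-up times is $\bar{X} = \frac{1}{20}\sum_{i=1}^{20} (M_i - F_i)$, with $\bar{X} \sim N\left(\mu_M - \mu_F, \frac{200\ \text{min}^2}{20}\right)$

52 Example: difference of two means $\bar{X} = \frac{1}{20}\sum_{i=1}^{20} (M_i - F_i)$, $\bar{X} \sim N\left(\mu_M - \mu_F, \frac{200\ \text{min}^2}{20}\right)$ How can we use this information to decide if there is a significant difference? Here is one way to judge: we want a rule that, if we follow it, will lead us to incorrectly reject the null only rarely. $H_0: \mu_M - \mu_F = 0$, $H_A: \mu_M - \mu_F \neq 0$

53 Example: difference of two means $\bar{X} = \frac{1}{20}\sum_{i=1}^{20} (M_i - F_i)$, $\bar{X} \sim N\left(\mu_M - \mu_F, \frac{200\ \text{min}^2}{20}\right)$, $H_0: \mu_M - \mu_F = 0$, $H_A: \mu_M - \mu_F \neq 0$ Rule: reject the null hypothesis if $\bar{x} > 1.96\sqrt{\frac{200\ \text{min}^2}{20}}$ or $\bar{x} < -1.96\sqrt{\frac{200\ \text{min}^2}{20}}$

54 Example: difference of two means Rule: reject the null hypothesis if $\bar{x} > 1.96\sqrt{\frac{200\ \text{min}^2}{20}}$ or $\bar{x} < -1.96\sqrt{\frac{200\ \text{min}^2}{20}}$ Why does this work? Under the null, the probability of landing in the rejection region is about 5%:
> 1 - (pnorm(1.96*sqrt(10),0,sqrt(10)) - pnorm(-1.96*sqrt(10),0,sqrt(10)))
[Figure: fx against x]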

55 Example: difference of two means Recall the goal: we want a rule that, if we follow it, will lead us to incorrectly reject the null only rarely. In the previous example we followed the conventional 5% significance level. There is nothing in principle special about 5%. Can anyone think of a problem with this approach? We will come back to this point in a minute.

56 Simulation This example -- and indeed, much of recorded statistical methodology -- relies on mathematical derivations of sampling distributions for particular statistics. Modern computing technology has made hypothesis testing much easier, by allowing us to simulate from the null distribution. Consider solving the same problem as before by simulation in R:
> xnull <- rnorm(10000,0,sqrt(10))
> quantile(xnull,0.025)
> quantile(xnull,0.975)
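To see that simulation and derivation agree, the simulated cutoffs can be placed next to the exact ones. This sketch uses more draws than the slide (an arbitrary choice to reduce simulation noise) and a fixed seed; both are assumptions.

```r
set.seed(1)
xnull <- rnorm(1e5, 0, sqrt(10))              # draws of the sample mean under the null

cutoffs_sim   <- quantile(xnull, c(0.025, 0.975))
cutoffs_exact <- c(-1.96, 1.96) * sqrt(10)
rbind(cutoffs_sim, cutoffs_exact)

# Share of null draws that the 1.96 rule would reject; should be near 5%
reject_rate <- mean(abs(xnull) > 1.96 * sqrt(10))
reject_rate
```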

57 Permutation tests Sometimes we can use the logic of hypothesis testing without making parametric assumptions. For example, if it were really the case that Mondays and Fridays were equivalent in terms of my wake-up times, then we would expect that the observed difference between the average wake-up times would not look much different if we shuffled the day labels.
> tf <- c(-20, 10, 2, 3, -2, -1, 1, 6,20,-1,-1,0,0,0,1,2,5,5,4,30)
> tm <- c(-17,0,5,0,-1,0,5,-20,5,5,3,3,-1,2,23,11,13,6,-4,-14)
What would the parametric model say about the null hypothesis in this case?

58 Permutation tests [Figure: two histograms, Monday wake-up times and Friday wake-up times; x-axis: minutes before or after alarm, y-axis: frequency]

59 Permutation tests [Figure: histogram of the difference in wake-up times; x-axis: minutes before or after alarm, y-axis: frequency] Do the data appear to support the null hypothesis?
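One way to answer that question is to apply the 1.96 rule from the earlier example to these data. The sketch below assumes the same parametric model holds here (independent normal wake-up times with variance 100 each), which is an assumption about the data rather than something checked.

```r
tf <- c(-20, 10, 2, 3, -2, -1, 1, 6, 20, -1, -1, 0, 0, 0, 1, 2, 5, 5, 4, 30)
tm <- c(-17, 0, 5, 0, -1, 0, 5, -20, 5, 5, 3, 3, -1, 2, 23, 11, 13, 6, -4, -14)

xbar <- mean(tm - tf)                 # observed mean Monday-minus-Friday difference
se   <- sqrt(200 / 20)                # sd of the sample mean under the assumed model

reject  <- abs(xbar) > 1.96 * se      # the 5% rule
p_value <- 2 * pnorm(-abs(xbar / se)) # two-sided tail probability under the null

c(xbar = xbar, p_value = p_value)
reject
```

Under these assumptions the observed difference falls well inside the cutoffs, so the rule does not reject the null.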

60 Permutation tests
> tall <- c(tm,tf)
> inds <- 1:40
> reps <- 10000
> m <- rep(0,reps)
> for (j in 1:reps){
+ idx <- sample(inds,20)
+ idx2 <- setdiff(inds,idx)
+ m[j] <- mean(tall[idx2]-tall[idx])}
[Figure: histogram of the null distribution; x-axis: difference in wake-up times, y-axis: frequency]
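The null distribution becomes a test once the observed difference is compared against it. Here is a self-contained sketch that repeats the shuffle with `replicate` and computes a two-sided permutation p-value. The seed, the helper name `perm_diff`, and the small tolerance for floating-point ties are assumptions of this sketch.

```r
set.seed(1)                                   # arbitrary seed
tf <- c(-20, 10, 2, 3, -2, -1, 1, 6, 20, -1, -1, 0, 0, 0, 1, 2, 5, 5, 4, 30)
tm <- c(-17, 0, 5, 0, -1, 0, 5, -20, 5, 5, 3, 3, -1, 2, 23, 11, 13, 6, -4, -14)
tall <- c(tm, tf)

obs <- mean(tm) - mean(tf)                    # observed difference in means

# One shuffle: relabel 20 of the 40 times as "Monday", the rest as "Friday".
# perm_diff is a hypothetical helper name.
perm_diff <- function() {
  idx <- sample(length(tall), length(tm))
  mean(tall[idx]) - mean(tall[-idx])
}
m <- replicate(10000, perm_diff())

# Share of shuffles at least as extreme as what we saw (two-sided)
p_value <- mean(abs(m) >= abs(obs) - 1e-12)
p_value
```

A large p-value here means the observed difference looks like a typical shuffled one, so there is little evidence against the null.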
