Machine Learning, Lesson 7. Ferdowsi University of Mashhad, Faculty of Engineering. Reza Monsefi.


Sampling Distributions and Point Estimation of Parameters

Outline
- Introduction
- General Concepts of Point Estimation
  - Unbiased Estimators
  - Proof that S is a Biased Estimator of σ
  - Variance of a Point Estimator
  - Standard Error: Reporting a Point Estimate
  - Bootstrap Estimate of the Standard Error
  - Mean Square Error of an Estimator
- Methods of Point Estimation
  - Method of Moments
  - Method of Maximum Likelihood
  - Bayesian Estimation of Parameters
- Sampling Distributions
  - Sampling Distributions of Means

Introduction
The field of statistical inference consists of methods used to make decisions and draw conclusions about a population. These methods use the information contained in a random sample from that population.
Statistical inference may be divided into two major areas:
1) Parameter estimation
   - Point estimation: use sample data to compute a single number that is, in some sense, a reasonable value (or guess) for the true unknown parameter (e.g., the mean).
   - Interval estimation: compute an interval that, with high confidence, contains the unknown population parameter.
2) Hypothesis testing: decide whether to accept or reject a statement about some parameter.

Terminology
Suppose that we want to obtain a point estimate of a population parameter. Before the data are collected, the observations X1, X2, ..., Xn are considered to be random variables. Therefore, any function of the observations, i.e., any statistic, is also a random variable. For example, the sample mean X̄ and the sample variance S² are statistics and so are random variables; after sampling they take numerical values, and we may also attach a confidence interval (CI) to them.

Terminology
Sampling distribution: since a statistic is a random variable, it has a probability distribution. The probability distribution of a statistic is called its sampling distribution. The notion of a sampling distribution is very important.
General symbol: when discussing inference problems, it is convenient to have a general symbol to represent the parameter of interest; we use the Greek letter θ (theta).
Objective of point estimation: to select a single number, based on sample data, that is the most plausible value for θ. A numerical value of a sample statistic is used as the point estimate.
Note: randomized search processes (Monte Carlo, PSO, ACO, GA, SA) likewise work with samples drawn from distributions.

Terminology
Let X be a random variable with probability distribution f(x; θ), characterized by the unknown parameter θ, and let X1, X2, ..., Xn be a random sample of size n from X.
The statistic Θ̂ = h(X1, X2, ..., Xn) is called a point estimator of θ.
Hint: Θ̂ is a random variable because it is a function of random variables.
Point estimate: after the sample has been selected, Θ̂ takes on a particular numerical value θ̂, called the point estimate of θ.
Definition (Point Estimator): a point estimate of some population parameter θ is a single numerical value θ̂ of a statistic Θ̂. The statistic Θ̂ is called the point estimator.
Example: suppose that the random variable X is normally distributed with an unknown mean µ. The sample mean is a point estimator of the unknown population mean µ; that is, Θ̂ = X̄. After the sample has been selected, the numerical value x̄ is the point estimate of µ. Thus, if x1 = 25, x2 = 30, x3 = 29, and x4 = 31, the point estimate of µ is x̄ = (25 + 30 + 29 + 31)/4 = 28.75.

Similarly, if the population variance σ² is also unknown, a point estimator for σ² is the sample variance S², and a numerical value such as s² = 6.9 calculated from the sample data is called the point estimate of σ².
Estimation problems occur frequently in engineering. We often need to estimate:
- the mean µ of a single population,
- the variance σ² (or standard deviation σ) of a single population,
- the proportion p of items in a population that belong to a class of interest,
- the difference in means of two populations, µ1 − µ2,
- the difference in two population proportions, p1 − p2.
Reasonable point estimates of these parameters are as follows (a small computational sketch is given below):
- For µ, the estimate is µ̂ = x̄, the sample mean.
- For σ², the estimate is σ̂² = s², the sample variance.
- For p, the estimate is p̂ = x/n, the sample proportion, where x is the number of items in a random sample of size n that belong to the class of interest.
- For µ1 − µ2, the estimate is µ̂1 − µ̂2 = x̄1 − x̄2, the difference between the sample means of two independent random samples.
- For p1 − p2, the estimate is p̂1 − p̂2, the difference between two sample proportions computed from two independent random samples.
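
As a minimal sketch (not part of the original slides), the following Python code computes these point estimates; the sample arrays and the class-membership count are hypothetical values chosen only for illustration.

```python
import numpy as np

# Hypothetical samples (illustrative values, not from the lecture)
sample1 = np.array([25.0, 30.0, 29.0, 31.0])
sample2 = np.array([27.0, 26.5, 31.2, 28.4, 30.1])

mu_hat = sample1.mean()            # point estimate of the mean mu
sigma2_hat = sample1.var(ddof=1)   # sample variance s^2 (divides by n - 1)

n, x = 50, 18                      # n items inspected, x of them in the class of interest
p_hat = x / n                      # sample proportion

diff_means = sample1.mean() - sample2.mean()   # estimate of mu1 - mu2

print(mu_hat, sigma2_hat, p_hat, diff_means)
```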

Note: there may be several different choices for the point estimator of a parameter. For example, to estimate the mean of a population we might consider the sample mean, the sample median, or perhaps the average of the smallest and largest observations in the sample. In order to decide which point estimator of a particular parameter is the best one to use, we need to examine their statistical properties and develop criteria for comparing estimators.

Sampling Distributions
Statistical inference is concerned with making decisions about a population based on the information contained in a random sample from that population. The sample mean is a statistic; that is, it is a random variable that depends on the results obtained in each particular sample, and so it has a probability distribution. The random variables X1, X2, ..., Xn are usually assumed to be independent and identically distributed (iid); such a collection is known as a random sample.
Definition (Sampling Distribution): the probability distribution of a statistic is called a sampling distribution. For example, the probability distribution of X̄ is called the sampling distribution of the mean. The sampling distribution of a statistic depends on the distribution of the population, the size of the sample, and the method of sample selection.

Sampling Distributions of Means (figure): if the population has mean µ and variance σ², the sampling distribution of the sample mean X̄ has mean µ and variance σ²/n.

The Central Limit Theorem: if X1, X2, ..., Xn is a random sample of size n taken from a population with mean µ and finite variance σ², and X̄ is the sample mean, then as n → ∞ the distribution of Z = (X̄ − µ)/(σ/√n) approaches the standard normal distribution.

Distributions of average scores from throwing dice (random-number generator demonstration). The normal approximation is usually adequate for n ≥ 30; if n < 30, the central limit theorem will still work well provided the distribution of the population is not severely non-normal. A small simulation of this demonstration is sketched below.
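
A minimal simulation sketch of the dice demonstration (not part of the original slides): averages of n fair-die throws are collected many times, and their empirical mean and variance approach the theoretical values µ = 3.5 and (35/12)/n as n grows.

```python
import numpy as np

rng = np.random.default_rng(0)

def dice_average_samples(n_dice, n_trials=10_000):
    """Average score of n_dice fair dice, repeated n_trials times."""
    rolls = rng.integers(1, 7, size=(n_trials, n_dice))
    return rolls.mean(axis=1)

for n in (1, 2, 5, 30):
    means = dice_average_samples(n)
    # Theory: E[mean] = 3.5, Var(mean) = (35/12)/n
    print(f"n={n:2d}  mean={means.mean():.3f}  var={means.var():.3f}  "
          f"theoretical var={35/12/n:.3f}")
```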


Definition (Approximate Sampling Distribution of a Difference in Sample Means): if we have two independent populations with means µ1 and µ2 and variances σ1² and σ2², and if X̄1 and X̄2 are the sample means of two independent random samples of sizes n1 and n2 from these populations, then the sampling distribution of Z = (X̄1 − X̄2 − (µ1 − µ2)) / √(σ1²/n1 + σ2²/n2) is approximately standard normal, provided the conditions of the central limit theorem apply.

General Concepts of Point Estimation: Unbiased Estimators
An estimator should be close, in some sense, to the true value of the unknown parameter. Θ̂ is an unbiased estimator of θ if the expected value of Θ̂ is equal to θ, i.e., E(Θ̂) = θ. This is equivalent to saying that the mean of the probability distribution of Θ̂ (or the mean of the sampling distribution of Θ̂) is equal to θ.
Definition (Bias of an Estimator): the point estimator Θ̂ is an unbiased estimator for the parameter θ if E(Θ̂) = θ. If the estimator is not unbiased, then the difference E(Θ̂) − θ is called the bias of the estimator Θ̂. When an estimator is unbiased, the bias is zero; that is, E(Θ̂) − θ = 0.

Example (Sample Mean and Variance are Unbiased): suppose that X is a random variable with mean µ and variance σ². Let X1, X2, ..., Xn be a random sample of size n from the population represented by X. Show that the sample mean X̄ and the sample variance S² are unbiased estimators of µ and σ², respectively.
Sample mean: E(X̄) = µ (shown in previous lessons), so the sample mean X̄ is an unbiased estimator of the population mean µ.
Sample variance: we have
E(S²) = E[ Σ(Xi − X̄)² / (n − 1) ] = (1/(n − 1)) E[ Σ Xi² − n X̄² ] = (1/(n − 1)) [ Σ E(Xi²) − n E(X̄²) ].

The last equality follows from the equation for the mean of a linear function. However, since E(Xi²) = µ² + σ² and E(X̄²) = µ² + σ²/n, we have
E(S²) = (1/(n − 1)) [ n(µ² + σ²) − n(µ² + σ²/n) ] = (1/(n − 1)) (nσ² − σ²) = σ².
Therefore, the sample variance S² is an unbiased estimator of the population variance σ². Although S² is unbiased for σ², S is a biased estimator of σ. For large samples the bias is very small. However, there are good reasons for using S as an estimator of σ in samples from normal distributions, as we will see when we discuss confidence intervals and hypothesis testing.
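
As an illustrative sketch (not in the slides), a quick Monte Carlo check that S² is unbiased for σ² while S is biased for σ; the normal population parameters here are made up.

```python
import numpy as np

rng = np.random.default_rng(1)
mu, sigma, n, reps = 5.0, 2.0, 10, 200_000

samples = rng.normal(mu, sigma, size=(reps, n))
s2 = samples.var(axis=1, ddof=1)   # sample variance S^2
s = np.sqrt(s2)                    # sample standard deviation S

print("E[S^2] ~", s2.mean(), "(sigma^2 =", sigma**2, ")")  # close to 4.0
print("E[S]   ~", s.mean(),  "(sigma   =", sigma, ")")     # slightly below 2.0
```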

Sometimes there are several unbiased estimators of the same population parameter. For example, suppose we take a random sample of size n = 10 from a normal population and obtain the data x1 = 12.8, x2 = 9.4, x3 = 8.7, x4 = 11.6, x5 = 13.1, x6 = 9.8, x7 = 14.1, x8 = 8.5, x9 = 12.1, x10 = 10.3. The sample mean is x̄ = 11.04, the sample median is 10.95, and a 10% trimmed mean (obtained by discarding the smallest and largest 10% of the sample before averaging) is 10.98. We can show that all of these are unbiased estimators of µ. Since there is not a unique unbiased estimator, we cannot rely on the property of unbiasedness alone to select our estimator; a method for choosing among unbiased estimators is suggested next. (The three estimates are recomputed in the short sketch below.)
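
A minimal sketch recomputing the three estimates from the data above (since n = 10, the 10% trimmed mean drops one observation from each end).

```python
import numpy as np

x = np.array([12.8, 9.4, 8.7, 11.6, 13.1, 9.8, 14.1, 8.5, 12.1, 10.3])

mean = x.mean()                 # 11.04
median = np.median(x)           # 10.95

xs = np.sort(x)
trimmed_mean = xs[1:-1].mean()  # drop smallest and largest 10% -> 10.975 (~10.98)

print(mean, median, trimmed_mean)
```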

Variance of a Point Estimator
Suppose that Θ̂1 and Θ̂2 are both unbiased estimators of θ. This indicates that the distribution of each estimator is centered at the true value of θ; however, the variances of these distributions may be different. If Θ̂1 has a smaller variance than Θ̂2, the estimator Θ̂1 is more likely to produce an estimate close to the true value θ.
Note: a logical principle of estimation, when selecting among several estimators, is to choose the estimator that has minimum variance.

Definition: if we consider all unbiased estimators of θ, the one with the smallest variance is called the Minimum Variance Unbiased Estimator (MVUE). In a sense, the MVUE is the unbiased estimator most likely to produce an estimate θ̂ that is close to the true value of θ. It has been possible to develop methodology to identify the MVUE in many practical situations.
Theorem: if X1, X2, ..., Xn is a random sample of size n from a normal distribution with mean µ and variance σ², the sample mean X̄ is the MVUE for µ.
In situations in which we do not know whether an MVUE exists, we can still use a minimum-variance principle to choose among competing estimators. Suppose, for example, that we wish to estimate the mean of a population (not necessarily a normal population). We have a random sample of n observations X1, X2, ..., Xn, and we wish to compare two possible estimators for µ: the sample mean X̄ and a single observation from the sample, say Xi. Both X̄ and Xi are unbiased estimators of µ; for the sample mean we have V(X̄) = σ²/n, and the variance of any single observation is V(Xi) = σ². Since V(X̄) < V(Xi) for sample sizes n ≥ 2, we conclude that the sample mean is a better estimator of µ than a single observation Xi. (A small simulation of this comparison follows below.)
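
An illustrative sketch (simulated data with made-up parameters, not from the slides) comparing the spread of X̄ with that of a single observation Xi across many repeated samples.

```python
import numpy as np

rng = np.random.default_rng(2)
mu, sigma, n, reps = 0.0, 3.0, 25, 100_000

samples = rng.normal(mu, sigma, size=(reps, n))
xbar = samples.mean(axis=1)   # sample mean of each sample
xi = samples[:, 0]            # a single observation from each sample

print("Var(X_bar) ~", xbar.var(), "(theory:", sigma**2 / n, ")")  # ~0.36
print("Var(X_i)   ~", xi.var(),   "(theory:", sigma**2, ")")      # ~9.0
```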

Methods of Point Estimation
The definitions of unbiasedness and other properties of estimators do not provide any guidance about how good estimators can actually be obtained. In this lesson, two methods for obtaining point estimators are discussed: the method of moments and the method of maximum likelihood (ML). Maximum likelihood estimates (MLE) are generally preferable to moment estimators (ME) because they have better efficiency properties; however, moment estimators are sometimes easier to compute. Both methods can produce unbiased point estimators.

Method of Moments
Idea behind the method of moments: equate population moments, which are defined in terms of expected values, to the corresponding sample moments. The population moments will be functions of the unknown parameters, and the resulting equations are solved to yield estimators of the unknown parameters.
Definition (Moments): let X1, X2, ..., Xn be a random sample from the probability distribution f(x), where f(x) can be a discrete probability mass function (PMF) or a continuous probability density function (PDF). The k-th population moment (or distribution moment) is E(X^k), k = 1, 2, ..., and the corresponding k-th sample moment is (1/n) Σ Xi^k, k = 1, 2, ....
To illustrate, the first population moment is E(X) = µ, and the first sample moment is (1/n) Σ Xi = X̄. Thus, by equating the population and sample moments, we find that µ̂ = X̄; that is, the sample mean is the moment estimator of the population mean. In the general case, the population moments will be functions of the unknown parameters of the distribution, say θ1, θ2, ..., θm.

Definition (Moment Estimators): let X1, X2, ..., Xn be a random sample from either a probability mass function or a probability density function with m unknown parameters θ1, θ2, ..., θm. The moment estimators Θ̂1, Θ̂2, ..., Θ̂m are found by equating the first m population moments to the first m sample moments and solving the resulting equations for the unknown parameters.
Example (Exponential Distribution Moment Estimators): for an exponential distribution with rate λ, E(X) = 1/λ; equating E(X) to X̄ gives the moment estimator λ̂ = 1/X̄.
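
A minimal sketch of the exponential moment estimator on simulated data (the true rate used here is made up for illustration).

```python
import numpy as np

rng = np.random.default_rng(3)
true_lambda = 0.5
x = rng.exponential(scale=1.0 / true_lambda, size=1_000)  # E[X] = 1/lambda

lambda_hat = 1.0 / x.mean()   # method-of-moments (and also ML) estimator
print(lambda_hat)             # close to 0.5
```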

Example (Normal Distribution Moment Estimators): for a normal distribution with parameters µ and σ², equating the first two population moments E(X) = µ and E(X²) = µ² + σ² to the first two sample moments gives µ̂ = X̄ and σ̂² = (1/n) Σ(Xi − X̄)². Note that this moment estimator of σ² divides by n rather than n − 1, so it is a biased estimate.
Example (Gamma Distribution Moment Estimators): for a gamma distribution with shape r and rate λ, E(X) = r/λ and E(X²) = r(r + 1)/λ²; solving the two moment equations gives r̂ = X̄² / [(1/n) Σ Xi² − X̄²] and λ̂ = X̄ / [(1/n) Σ Xi² − X̄²].

Moments and the shape of a distribution
Moment 0 is trivially 1, moment 1 is the mean, and moment 2 (about the mean) is the variance.
Skewness (moment 3): in probability theory and statistics, skewness is a measure of the extent to which the probability distribution of a real-valued random variable "leans" to one side of the mean. The skewness value can be positive, negative, or even undefined.
Kurtosis (moment 4): in probability theory and statistics, kurtosis (from the Greek word κυρτός, kyrtos or kurtos, meaning curved or arching) is any measure of the "peakedness" of the probability distribution of a real-valued random variable. As with skewness, kurtosis is a descriptor of the shape of a probability distribution, and there are different ways of quantifying it for a theoretical distribution and corresponding ways of estimating it from a sample. There are various interpretations of kurtosis and of how particular measures should be interpreted; these are primarily peakedness (width of the peak), tail weight, and lack of shoulders (a distribution that is primarily peak and tails, with little in between). (Sample versions of these moment-based quantities are sketched below.)
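
A small sketch (not from the slides) estimating the moment-based shape quantities from a sample, using the common standardized definitions of sample skewness and kurtosis; the exponential sample is chosen only because its theoretical values are easy to check.

```python
import numpy as np

rng = np.random.default_rng(4)
x = rng.exponential(scale=2.0, size=5_000)  # a right-skewed sample for illustration

mean = x.mean()                                  # first moment
var = x.var()                                    # second central moment
skew = ((x - mean) ** 3).mean() / var ** 1.5     # standardized third central moment
kurt = ((x - mean) ** 4).mean() / var ** 2       # standardized fourth central moment

# Theory for an exponential distribution: skewness = 2, kurtosis = 9
print(mean, var, skew, kurt)
```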

Gaussian (Normal) Distribution: mean µ, variance σ², skewness 0, kurtosis 3 (excess kurtosis 0).

Method of Maximum Likelihood
One of the best methods of obtaining a point estimator of a parameter is the method of maximum likelihood estimation (MLE). This technique was developed in the 1920s by a famous British statistician, Sir R. A. Fisher. As the name implies, the estimator will be the value of the parameter that maximizes the likelihood function.
Definition (Maximum Likelihood Estimator): suppose that X is a random variable with probability distribution f(x; θ), where θ is a single unknown parameter. Let x1, x2, ..., xn be the observed values in a random sample of size n. Then the likelihood function of the sample is
L(θ) = f(x1; θ) · f(x2; θ) · ... · f(xn; θ).
Note: the likelihood function is now a function of only the unknown parameter θ. The maximum likelihood estimator (MLE) of θ is the value of θ that maximizes the likelihood function L(θ).

In the case of a discrete random variable, the interpretation of the likelihood function is clear: the likelihood function of the sample, L(θ) = P(X1 = x1, X2 = x2, ..., Xn = xn), is just the probability of obtaining the sample values x1, x2, ..., xn. Therefore, in the discrete case, the maximum likelihood estimator is an estimator that maximizes the probability of occurrence of the sample values.
Example (Bernoulli Distribution MLE): for Bernoulli trials with success probability p, L(p) = p^(Σxi) (1 − p)^(n − Σxi); maximizing ln L(p) gives the MLE p̂ = (1/n) Σ xi, the sample proportion of successes. (A numerical sketch of this maximization follows below.)
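
A minimal numerical sketch (with assumed data) of the Bernoulli MLE: the log-likelihood is evaluated over a grid of p values, and its maximizer matches the closed-form answer p̂ = x̄.

```python
import numpy as np

x = np.array([1, 0, 0, 1, 1, 1, 0, 1, 0, 1])  # hypothetical Bernoulli observations

def log_likelihood(p, x):
    return np.sum(x * np.log(p) + (1 - x) * np.log(1 - p))

grid = np.linspace(0.01, 0.99, 981)
ll = np.array([log_likelihood(p, x) for p in grid])

p_hat_numeric = grid[ll.argmax()]
p_hat_closed = x.mean()
print(p_hat_numeric, p_hat_closed)   # both ~0.6
```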


Example (Normal Distribution MLE): for a normal distribution with unknown mean µ (and known variance σ²), maximizing the likelihood gives the MLE µ̂ = X̄.

Example (Exponential Distribution MLE): for an exponential distribution with rate λ, L(λ) = λ^n exp(−λ Σ xi); maximizing ln L(λ) gives the MLE λ̂ = n / Σ Xi = 1/X̄, the same as the moment estimator.

Example (Normal Distribution MLE for µ and σ²): maximizing the likelihood with both parameters unknown gives µ̂ = X̄ and σ̂² = (1/n) Σ(Xi − X̄)². Note that the MLE of σ² divides by n, not n − 1, so it is a biased estimator of σ². (A short numerical sketch follows below.)
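
An illustrative sketch (simulated data, made-up parameters) comparing the MLE of σ², which divides by n, with the unbiased sample variance S².

```python
import numpy as np

rng = np.random.default_rng(5)
mu, sigma, n = 10.0, 3.0, 15

x = rng.normal(mu, sigma, size=n)

mu_mle = x.mean()            # MLE of mu
sigma2_mle = x.var(ddof=0)   # MLE of sigma^2 (divides by n, biased)
s2 = x.var(ddof=1)           # sample variance (divides by n - 1, unbiased)

print(mu_mle, sigma2_mle, s2)   # sigma2_mle = (n - 1)/n * s2
```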

Maximum Likelihood Estimators
Estimator: any statistic used to estimate the value of an unknown parameter θ is called an estimator of θ.
Estimate: the observed value of the estimator is called the estimate.
For instance, as we shall see, the usual estimator of the mean of a normal population, based on a sample X1, ..., Xn from that population, is the sample mean X̄ = (1/n) Σ Xi. If a sample of size 3 yields the data X1 = 2, X2 = 3, X3 = 4, then the estimate of the population mean, resulting from the estimator X̄, is the value 3.

Suppose that the random variables X1, ..., Xn, whose joint distribution is assumed given except for an unknown parameter θ, are to be observed. The problem of interest is to use the observed values to estimate θ. For example, the Xi's might be independent exponential random variables, each having the same unknown mean θ. In this case, the joint density function of the random variables would be given by
f(x1, x2, ..., xn) = (1/θ) e^(−x1/θ) · (1/θ) e^(−x2/θ) · ... · (1/θ) e^(−xn/θ) = (1/θ^n) exp(−Σ xi / θ), for xi > 0.

Given: a random variable X has a probability distribution that is a function of a parameter θ, written in the form f(x | θ). This notation emphasizes that the exact form of the distribution of X is conditional on the value assigned to θ.
Classical approach to estimation: take a random sample of size n from this distribution and then substitute the sample values xi into the estimator of θ.
Additional information: suppose that we have some additional information about θ, and that we can summarize that information in the form of a probability distribution for θ, say f(θ). This probability distribution is often called the prior distribution for θ; its mean is µ0 and its variance is σ0², for example µ0 = 0 and σ0² = 1.

This is a very novel concept insofar as the rest of this lesson is concerned, because we are now viewing the parameter θ as a random variable. The probabilities associated with the prior distribution are often called subjective probabilities (degrees of belief), in that they usually reflect the analyst's degree of belief regarding the true value of θ. The Bayesian approach to estimation uses the prior distribution for θ, f(θ), and the joint probability distribution of the sample and θ, say f(x1, x2, ..., xn, θ), to find a posterior distribution for θ, say f(θ | x1, x2, ..., xn). This posterior distribution contains information both from the sample and from the prior distribution for θ. In a sense, it expresses our degree of belief regarding the true value of θ after observing the sample data. It is conceptually easy to find the posterior distribution.

Bayesian Estimation of Parameters
Name of the game: statistical inference based on the information in the sample data.
Two views of probability:
1) Relative frequencies: the objective-probabilities approach; this is the usual approach, e.g., estimation based on MLE.
2) Degree of belief: the subjective-probabilities approach; this is the Bayesian approach, which combines sample information with other information that may be available prior to collecting the sample.
Purpose of this section: briefly illustrate how the Bayesian approach may be used in parameter estimation.

The joint probability distribution of the sample X1, X2, ..., Xn and the parameter θ (remember that θ is a random variable) is
f(x1, x2, ..., xn, θ) = f(x1, x2, ..., xn | θ) f(θ),
and the marginal distribution of X1, X2, ..., Xn is
f(x1, x2, ..., xn) = Σ_θ f(x1, x2, ..., xn, θ) in the discrete case, or ∫ f(x1, x2, ..., xn, θ) dθ in the continuous case.
Therefore, the desired posterior distribution is
f(θ | x1, x2, ..., xn) = f(x1, x2, ..., xn, θ) / f(x1, x2, ..., xn) = f(x1, x2, ..., xn | θ) f(θ) / f(x1, x2, ..., xn),
i.e., posterior = likelihood × prior / evidence, where the numerator is the joint distribution and the denominator is the marginal ("evidence") distribution.

We define the Bayes estimator of θ as the mean of the posterior distribution f(θ | x1, x2, ..., xn). Sometimes the mean of the posterior distribution of θ can be determined easily: as a function of θ, f(θ | x1, x2, ..., xn) is a probability density function in which x1, x2, ..., xn are just constants. Because θ enters f(θ | x1, x2, ..., xn) only through f(x1, x2, ..., xn, θ), if f(x1, x2, ..., xn, θ) as a function of θ is recognized as a well-known probability function, the posterior mean of θ can be deduced from that well-known distribution without integration, or even without calculating f(x1, x2, ..., xn).

Example: Bayes Estimator for the Mean of a Normal Distribution
Let X1, X2, ..., Xn be a random sample from the normal distribution with mean µ and variance σ², where µ is unknown and σ² is known (a single unknown parameter). Assume that the prior distribution for µ is normal with mean µ0 and variance σ0²; that is,
f(µ) = (1 / (√(2π) σ0)) exp{ −(µ − µ0)² / (2σ0²) }.
The joint probability distribution of the sample, given µ, is (iid sample)
f(x1, x2, ..., xn | µ) = (1 / (2πσ²)^(n/2)) exp{ −(1/(2σ²)) Σ(xi − µ)² }
= (1 / (2πσ²)^(n/2)) exp{ −(1/(2σ²)) (Σ xi² − 2µ Σ xi + nµ²) }.

Thus, the joint probability distribution of the sample and µ is
f(x1, x2, ..., xn, µ) = f(x1, x2, ..., xn | µ) f(µ).
Upon completing the square in the exponent,
f(x1, x2, ..., xn, µ) = h1(x1, ..., xn, σ², µ0, σ0²) · exp{ −(1/2) (1/σ0² + n/σ²) [ µ − (µ0/σ0² + Σ xi/σ²) / (1/σ0² + n/σ²) ]² },
where h1(x1, ..., xn, σ², µ0, σ0²) is a function of the observed values, σ², µ0, and σ0². Now, because f(x1, ..., xn) does not depend on µ, the posterior f(µ | x1, x2, ..., xn) has the same form as a function of µ.

This is recognized as a normal probability density function with
posterior mean = (σ²/n · µ0 + σ0² x̄) / (σ²/n + σ0²)
and
posterior variance = (σ0² · σ²/n) / (σ0² + σ²/n).
Consequently, the Bayes estimate of µ is a weighted average of µ0 and x̄. For purposes of comparison, note that the maximum likelihood estimate of µ is x̄. (As n → 0 the posterior mean tends to µ0 and the posterior variance tends to σ0²; as n → ∞ the posterior mean tends to x̄ and the posterior variance tends to 0.)
To illustrate, suppose that we have a sample of size n = 10 from a normal distribution with unknown mean µ and variance σ² = 4. Assume that the prior distribution for µ is normal with mean µ0 and variance σ0² = 1. If the sample mean is 0.75, the Bayes estimate of µ is (0.4 µ0 + 0.75)/1.4, while the maximum likelihood estimate of µ is x̄ = 0.75. (A numerical sketch of this example follows below.)
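
A small numerical sketch of this example. The prior mean is not stated on this slide, so µ0 = 0 is assumed here, matching the earlier "for example µ0 = 0, σ0² = 1" remark.

```python
import numpy as np

sigma2 = 4.0               # known population variance
n = 10                     # sample size
xbar = 0.75                # observed sample mean
mu0, sigma0_2 = 0.0, 1.0   # prior mean and variance (mu0 = 0 is an assumption here)

post_mean = (sigma2 / n * mu0 + sigma0_2 * xbar) / (sigma2 / n + sigma0_2)
post_var = (sigma0_2 * sigma2 / n) / (sigma0_2 + sigma2 / n)

print(post_mean)   # ~0.536, pulled from the MLE 0.75 toward the prior mean 0
print(post_var)    # ~0.286, smaller than both the prior variance and sigma^2/n
```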

Bayesian Estimation of Parameters (another approach)

Working with precisions (precision = 1.0/variance = 1/σ²), the posterior parameters can be written as
precision_n = 1/σn² = 1/σ0² + n/σ², i.e., precision_n = precision_0 + n · precision;
for n = 0, precision_n = precision_0, and as n → ∞, precision_n approaches the data-only ML precision n/σ².
µn = σn² (µ0/σ0² + Σ xi/σ²) = (σn²/σ0²) µ0 + (σn²/σ²) n µ_ML, where µ_ML = x̄;
for n = 0, µn = µ0, and as n → ∞, µn approaches µ_ML. (A short sketch of this update follows below.)
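
A minimal sketch of the precision-form update on simulated data (made-up parameters), showing that the posterior mean starts at the prior mean and moves toward the ML estimate x̄ as n grows.

```python
import numpy as np

rng = np.random.default_rng(6)
mu_true, sigma2 = 1.5, 4.0   # data-generating parameters (illustrative)
mu0, sigma0_2 = 0.0, 1.0     # prior mean and variance

for n in (0, 1, 5, 20, 200):
    x = rng.normal(mu_true, np.sqrt(sigma2), size=n)
    precision_n = 1.0 / sigma0_2 + n / sigma2              # posterior precision
    sigma_n2 = 1.0 / precision_n                            # posterior variance
    mu_n = sigma_n2 * (mu0 / sigma0_2 + x.sum() / sigma2)   # posterior mean
    mu_ml = x.mean() if n > 0 else float("nan")
    print(f"n={n:3d}  posterior mean={mu_n:.3f}  ML={mu_ml:.3f}")
```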


Remarks: there is a relationship between the Bayes estimator for a parameter and the maximum likelihood estimator of the same parameter. For large sample sizes the two are nearly equivalent; in general, the difference between the two estimators is small compared to 1/√n. In practical problems, a moderate sample size will produce approximately the same estimate by either the Bayes or the maximum likelihood method if the sample results are consistent with the assumed prior information. If the sample results are inconsistent with the prior assumptions, the Bayes estimate may differ considerably from the maximum likelihood estimate. In these circumstances, if the sample results are accepted as being correct, the prior information must be incorrect, and the maximum likelihood estimate would then be the better estimate to use. If the sample results are very different from the prior information, the Bayes estimator will always tend to produce an estimate that lies between the maximum likelihood estimate and the prior assumptions. The more inconsistency there is between the prior information and the sample, the more difference there will be between the two estimates.

Suppose that we want to obtain a point estimate of a population parameter. We know that before the data are collected, the observations are considered to be random variables, say X1, X2, ..., Xn. Therefore, any function of the observations, i.e., any statistic, is also a random variable; for example, the sample mean X̄ and the sample variance S² are statistics and are also random variables. Since a statistic is a random variable, it has a probability distribution, which we call a sampling distribution. The notion of a sampling distribution is very important and will be discussed and illustrated later in the lesson. When discussing inference problems, it is convenient to have a general symbol to represent the parameter of interest; we will use the Greek symbol θ (theta). The objective of point estimation is to select a single number, based on sample data, that is the most plausible value for θ. A numerical value of a sample statistic will be used as the point estimate.

Parameter Estimation
Let X1, ..., Xn be a random sample from a distribution F_θ that is specified up to a vector of unknown parameters θ. The sample could be from a Poisson distribution whose mean value is unknown, or it could be from a normal distribution having an unknown mean and variance. It is usual in probability theory to suppose that all of the parameters of a distribution are known; the opposite is true in statistics, where a central problem is to use the observed data to make inferences about the unknown parameters. We present the maximum likelihood (ML) method for determining estimators of unknown parameters.

The estimates so obtained are called point estimates because they specify a single quantity as an estimate of θ. Later on, we consider the problem of obtaining interval estimates; in this case, rather than specifying a certain value as our estimate of θ, we specify an interval in which we estimate that θ lies. Additionally, we consider the question of how much confidence we can attach to such an interval estimate. We illustrate by showing how to obtain an interval estimate of the unknown mean of a normal distribution whose variance is specified, and we then consider a variety of interval estimation problems: an interval estimate of the mean of a normal distribution whose variance is unknown, and an interval estimate of the variance of a normal distribution.

We determine an interval estimate for the difference of two normal means, both when their variances are assumed to be known and when they are assumed to be unknown (although in the latter case we suppose that the unknown variances are equal). We present interval estimates of the mean of a Bernoulli random variable and of the mean of an exponential random variable. We then return to the general problem of obtaining point estimates of unknown parameters and show how to evaluate an estimator by considering its mean square error. The bias of an estimator is discussed, and its relationship to the mean square error is explored. Finally, we consider the problem of determining an estimate of an unknown parameter when there is some prior information available. This is the Bayesian approach, which supposes that, prior to observing the data, information about θ is always available to the decision maker, and that this information can be expressed in terms of a probability distribution on θ. In such a situation, we show how to compute the Bayes estimator, which is the estimator whose expected squared distance from θ is minimal.