2.1 Probability, stochastic variables and distribution functions
Chapter 2 Probability and statistics

2.1 Probability, stochastic variables and distribution functions

The defining characteristic of a stochastic experiment $E$ is that it produces different outcomes under ostensibly similar circumstances. Although it is not possible to know with certainty the outcome of a stochastic experiment, it is possible to describe all possible outcomes. We denote by $\Omega$ the set of all possible outcomes from an experiment. The set $\Omega$ can be finite, countable or uncountable. Here are some examples:

$E$: toss a coin, $\Omega = \{Heads, Tails\}$;
$E$: observe the number of days a light bulb lasts, $\Omega = \{0, 1, \dots\}$;
$E$: observe the price of a stock at time $t$, $\Omega = (0, \infty)$.

The collection $\mathcal{F}$ of subsets of $\Omega$ that are of interest are called events. The collection of events $\mathcal{F}$ must satisfy

1. $\Omega \in \mathcal{F}$;
2. if $E \in \mathcal{F}$ then $E^c \in \mathcal{F}$;
3. if $E_1, E_2, \dots \in \mathcal{F}$ then $\cup_{i=1}^{\infty} E_i \in \mathcal{F}$.

We associate with each event $E$ a probability by defining the set function

  $P(E): \mathcal{F} \to [0,1]$,   (2.1)

where $\mathcal{F}$ is the collection of events. $P$ must satisfy
1. $P(E) \geq 0$, $P(\Omega) = 1$;
2. for $E_1, E_2, \dots \in \mathcal{F}$ such that $E_i \cap E_j = \emptyset$ we have $P(\cup_{i=1}^{\infty} E_i) = \sum_{i=1}^{\infty} P(E_i)$.

The set function $P$ has the following properties:

1. for any event $E$, $P(E^c) = 1 - P(E)$;
2. $P(\emptyset) = 0$;
3. if $A \subseteq B$ then $P(A) \leq P(B)$ (monotonicity property);
4. for any two events $A$ and $B$, $P(A \cup B) = P(A) + P(B) - P(A \cap B) \leq P(A) + P(B)$;¹
5. if $A_1 \subseteq A_2 \subseteq A_3 \subseteq \cdots$ and we define $A = \cup_{i=1}^{\infty} A_i$, then $P(A) = \lim_{n \to \infty} P(A_n)$. Also, if $A_1 \supseteq A_2 \supseteq A_3 \supseteq \cdots$ and we define $A = \cap_{i=1}^{\infty} A_i$, then $P(A) = \lim_{n \to \infty} P(A_n)$ (continuity property).

For any two events $E_1$ and $E_2$, the conditional probability of $E_2$ given $E_1$ is denoted by

  $P(E_2 | E_1) = \dfrac{P(E_2 \cap E_1)}{P(E_1)}$

whenever $P(E_1) > 0$. For fixed $E_2$, $P(E_1 | E_2)$ is a legitimate probability measure in that it satisfies requirements 1 and 2 following (2.1).

The events $E_1, E_2, \dots, E_n$ are independent if for any $1 \leq i_1 < i_2 < \cdots < i_k \leq n$, where $k = 1, \dots, n$, we have

  $P(E_{i_1} \cap E_{i_2} \cap \cdots \cap E_{i_k}) = P(E_{i_1}) \cdots P(E_{i_k})$.   (2.2)

We call $\{B_1, B_2, \dots, B_K\}$ a partition of $\Omega$ if $\cup_{i=1}^{K} B_i = \Omega$ and for all $i \neq j$ we have $B_i \cap B_j = \emptyset$. For any event $A$, we have that

  $A = (A \cap B_1) \cup (A \cap B_2) \cup \cdots \cup (A \cap B_K)$

and by construction

  $P(A) = P(A \cap B_1) + P(A \cap B_2) + \cdots + P(A \cap B_K)$.

Thus,

  $P(B_j | A) = \dfrac{P(B_j \cap A)}{P(A)} = \dfrac{P(A | B_j) P(B_j)}{P(A)} = \dfrac{P(A | B_j) P(B_j)}{\sum_{i=1}^{K} P(A | B_i) P(B_i)}$.   (2.3)

Equation (2.3) is called Bayes' law. It allows the calculation of the probability associated with a particular event (in this case $B_j$) with knowledge that event $A$ has occurred. What follows is a simple example of how Bayes' law can be used to update our beliefs on probabilities.

¹ In fact, if $A_1, A_2, \dots$ is a sequence of events, $P(\cup_{i=1}^{\infty} A_i) \leq \sum_{i=1}^{\infty} P(A_i)$ (subadditivity property).
Example 2.1. Suppose the stochastic experiment under consideration is the tossing of a coin. Then $\Omega = \{H, T\}$ and let $\mathcal{F} = \{\emptyset, \{H\}, \{T\}, \{H, T\}\}$. Let $B_1 = \{H\}$, $B_2 = \{T\}$ and note that $B_1 \cap B_2 = \emptyset$ and $B_1 \cup B_2 = \Omega$. A fair coin is such that $a = P(B_1) = 0.5$ and consequently $P(B_2) = 1 - a = 0.5$. Suppose that our prior belief is that either $a = 0.5$ (coin fair) with probability 0.5 or $a = 0.8$ (coin unfair) with probability 0.5. Suppose that after tossing the coin three times we observe $H$ in each of the tosses. If $A$ represents such an occurrence and the tosses are independent, by Bayes' law

  $P(a = 0.5 | A) = \dfrac{0.5\, P(A | a = 0.5)}{0.5\, P(A | a = 0.5) + 0.5\, P(A | a = 0.8)} = \dfrac{0.5\,(0.5)^3}{0.5\,(0.5)^3 + 0.5\,(0.8)^3} \approx 0.196.$

$P(a = 0.5 | A)$ represents the probability that the coin is fair conditional, or updated, on the fact that three heads were observed after the tosses. It is commonly called the posterior probability (after observing the outcome of the tosses) that the coin is fair. Not surprisingly, it is smaller than the prior probability (0.5).

Let $X$ be a function $X(s): \Omega \to \mathbb{R}$. We call $X$ a stochastic variable if, and only if, the inverse image of any interval $(-\infty, x]$ under $X$ is an event. Formally, $X^{-1}((-\infty, x]) \in \mathcal{F}$ for all $x \in \mathbb{R}$. In this case, we can write

  $P(X^{-1}((-\infty, x])) = P \circ X^{-1}((-\infty, x]) = P_X((-\infty, x])$   (2.4)

for all $x \in \mathbb{R}$. We can now think of $\Omega$ as the real line $\mathbb{R}$, the events as (suitable) subsets of $\mathbb{R}$ and $P_X$ as a probability on subsets of $\mathbb{R}$. When we fix $X$ and $P$, that is, when we have a particular stochastic variable and probability in mind, we can think of $P_X((-\infty, x])$ as a function of $x$ and define the distribution function of $X$ as

  $F_X(x) = P_X((-\infty, x])$ for all $x \in \mathbb{R}$.   (2.5)

We have the following properties for $F_X$.

Theorem 2.1. Let $F_X(x): \mathbb{R} \to [0,1]$ be the distribution function associated with the stochastic variable $X$. Then, (a) $F_X$ is continuous from the right; (b) $F_X$ is monotonically nondecreasing; (c) $\lim_{x \to \infty} F_X(x) = 1$ and $\lim_{x \to -\infty} F_X(x) = 0$.

Proof. We first prove (b). Note that for $x < y$, $(-\infty, x] \subset (-\infty, y]$, hence $F_X(x) = P_X((-\infty, x]) \leq P_X((-\infty, y]) = F_X(y)$. For (c), consider a sequence $\{x_n\}$ such that $x_n \uparrow \infty$. Then
$F(x_n) = P_X((-\infty, x_n])$ and $\lim_{n \to \infty} F(x_n) = \lim_{n \to \infty} P_X((-\infty, x_n]) = P_X(\lim_{n \to \infty} (-\infty, x_n]) = P_X(\mathbb{R}) = 1$. Similarly, if $\{x_n\}$ is such that $x_n \downarrow -\infty$, then $F(x_n) = P_X((-\infty, x_n])$ and $\lim_{n \to \infty} F(x_n) =$
$\lim_{n \to \infty} P_X((-\infty, x_n]) = P_X(\lim_{n \to \infty} (-\infty, x_n]) = P_X(\emptyset) = 0$. Finally, to prove (a) we need to show that $F(x_n) \to F_X(x)$ as $x_n \downarrow x$, but this follows directly from the fact that $(-\infty, x_n] \downarrow (-\infty, x]$ and the continuity property of $P_X$.

It is convenient to classify stochastic variables as discrete or continuous. Discrete stochastic variables are those whose image forms a countable set. Continuous stochastic variables have uncountable image. Hence, for a discrete stochastic variable $X$ that takes values $\{x_1, x_2, \dots\}$ we have that $A_j = X^{-1}(x_j)$ for $j = 1, 2, \dots$ are events with $P(A_j) = P_X(X = x_j) = p_j > 0$ and $\sum_{j=1}^{\infty} p_j = 1$.

We say that $X$ (or $F_X$) is absolutely continuous if there exists a non-negative function $f_X: \mathbb{R} \to [0, \infty)$ that satisfies

  $F_X(a) = \int_{-\infty}^{a} f_X(x)\,dx$.   (2.6)

We call $f_X$ the density function associated with $X$. It is easy to verify that $P(a \leq X \leq b) = \int_{a}^{b} f_X(x)\,dx = F_X(b) - F_X(a)$, $\lim_{a \to \infty} F_X(a) = 1$ and $\lim_{a \to -\infty} F_X(a) = 0$.

If $F_X(x)$ is continuous and strictly increasing, it has an inverse function which we denote by $F_X^{-1}: (0,1) \to \mathbb{R}$. For each $q \in (0,1)$ there exists $x_q$ such that

  $x_q = F_X^{-1}(q) \iff F_X(x_q) = q$.   (2.7)

$x_q$ is called the $q$-quantile associated with the distribution of the stochastic variable $X$. For example, if $q = 0.95$, then $x_{0.95}$ is the value of the stochastic variable that will be exceeded with probability 5 percent. Or, alternatively, the value of the stochastic variable that will not be exceeded with probability 0.95. If $F$ is not strictly increasing, then there might exist several values of $X$ associated with a particular quantile $q$. If $X$ is a discrete stochastic variable, that is, a stochastic variable that takes on countably many values, then $F_X$ is a step function and does not have an inverse.

2.2 Expectation and variance

The expected value of a continuous stochastic variable $X$, denoted by $E(X)$, is given (whenever it exists) by

  $E(X) = \int_{\mathbb{R}} x f_X(x)\,dx$.   (2.8)

If the stochastic variable is discrete, taking on countably many values $\{x_1, x_2, \dots\}$, we write

  $E(X) = \sum_{i=1}^{\infty} x_i P(X = x_i)$   (2.9)
whenever the summation exists. The variance of a stochastic variable $X$, denoted by $V(X)$, is given (whenever it exists) by

  $V(X) = E((X - E(X))^2)$.   (2.10)

It is easy to show that $V(X) = E(X^2) - (E(X))^2$. The standard deviation of a stochastic variable is given by $\sqrt{V(X)}$.

Note that $E(X) = \int_{\mathbb{R}} x f_X(x)\,dx = \int_{-\infty}^{0} x f_X(x)\,dx + \int_{0}^{\infty} x f_X(x)\,dx = I_1 + I_2$. If $I_1$ and $I_2$ are both finite, then $E(X)$ exists as a real number. If $I_1$ ($I_2$) is a real number but $I_2 = \infty$ ($I_1 = -\infty$), then $E(X) = \infty$ ($-\infty$), and if $I_1 = -\infty$ and $I_2 = \infty$ then $E(X)$ is not defined, or does not exist. If $E(X)$ does not exist, neither does $V(X)$. Note also that even if $E(X)$ is finite, $V(X)$ can be infinite provided that $E(X^2) = \infty$.

2.3 Functions of stochastic variables

If $X$ is a stochastic variable and $g$ is a continuous function defined on the set in which $X$ takes values, then $Y = g(X)$ is also a stochastic variable. Furthermore, if $g$ is strictly increasing,

  $p = P(Y \leq y) = F_Y(y) = P(g(X) \leq y) = P(X \leq g^{-1}(y)) = F_X(g^{-1}(y))$,   (2.11)

and differentiating we have

  $\dfrac{d}{dy} F_Y(y) = \dfrac{d}{dy} \int_{-\infty}^{y} f_Y(z)\,dz = f_Y(y) = \dfrac{d}{dy} F_X(g^{-1}(y)) = f_X(g^{-1}(y)) \dfrac{d}{dy} g^{-1}(y)$.

If $g$ is strictly decreasing, we have

  $p = P(Y \leq y) = F_Y(y) = P(g(X) \leq y) = P(X \geq g^{-1}(y)) = 1 - F_X(g^{-1}(y))$,   (2.12)

and differentiating we have

  $\dfrac{d}{dy} F_Y(y) = \dfrac{d}{dy} \int_{-\infty}^{y} f_Y(z)\,dz = f_Y(y) = \dfrac{d}{dy} (1 - F_X(g^{-1}(y))) = -f_X(g^{-1}(y)) \dfrac{d}{dy} g^{-1}(y)$.

Hence, for strictly monotone functions we have

  $f_Y(y) = f_X(g^{-1}(y)) \left| \dfrac{d}{dy} g^{-1}(y) \right|$.   (2.13)

Also, note that if $p = F_Y(y)$, from equation (2.11) we have that

  $F_X^{-1}(p) = g^{-1}(y) \implies g(F_X^{-1}(p)) = y = F_Y^{-1}(p)$.   (2.14)

That is, the $p$-quantile of $Y$ is just the mapping under $g$ of the $p$-quantile of $X$.
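The change-of-variables formula (2.13) can be checked numerically. The sketch below (plain Python; the illustrative map $g(x) = x^2$ is our choice, not from the text) takes $X \sim U[0,1]$, so $f_X = 1$, and verifies that integrating $f_Y(y) = f_X(\sqrt{y})\,\frac{1}{2\sqrt{y}}$ reproduces $F_Y(y) = F_X(g^{-1}(y))$ from (2.11):

```python
import math

# With X ~ U[0,1] (f_X = 1) and the strictly increasing map g(x) = x^2,
# equation (2.13) gives f_Y(y) = 1 / (2*sqrt(y)) on (0, 1].
def f_Y(y):
    return 1.0 / (2.0 * math.sqrt(y))

# By (2.11), F_Y(0.25) = F_X(g^{-1}(0.25)) = F_X(0.5) = 0.5.
# Midpoint rule integrates f_Y from 0 to 0.25 (avoids the singularity at 0).
slices = 100_000
h = 0.25 / slices
F = sum(f_Y((k + 0.5) * h) * h for k in range(slices))
print(F)  # close to 0.5
```

The midpoint rule is used because $f_Y$ is unbounded (but integrable) at zero.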
Example 2.2. Let $g(x) = a + bx$ for $b \neq 0$. Then $g^{-1}(y) = \frac{y-a}{b}$ and $\frac{d}{dy} g^{-1}(y) = 1/b$ and

  $f_Y(y) = f_X\!\left(\dfrac{y-a}{b}\right) \dfrac{1}{|b|}$.

Also, $F_Y^{-1}(p) = a + b F_X^{-1}(p)$ if $b > 0$ and $F_Y^{-1}(p) = a + b F_X^{-1}(1-p)$ if $b < 0$.

2.4 Samples

A sample of size $n$ is a collection of values (realizations) of stochastic variables. We denote it by $\chi_n = \{x_1, x_2, \dots, x_n\}$ with $x_i$ being the realization of stochastic variable $X_i$. To avoid additional notation, we will use uppercase $X_i$ to denote both a stochastic variable (a function) and its realized value. As such, it will be clear from the context when $\{X_i\}_{i=1}^{n}$ represents a collection of stochastic variables or a collection of realizations of stochastic variables. If the stochastic variables $\{X_i\}_{i=1}^{n}$ are independent, i.e.,

  $P(\{X_1 \in A_1\} \cap \{X_2 \in A_2\} \cap \cdots \cap \{X_n \in A_n\}) = P(\{X_1 \in A_1\}) P(\{X_2 \in A_2\}) \cdots P(\{X_n \in A_n\})$,

and if $X_i \overset{d}{=} X$ for all $i$, we say that $\chi_n$ is a stochastic sample and $\{X_i\}_{i=1}^{n}$ is a sequence of independent and identically distributed stochastic variables.

The sample average, normally denoted by $\bar{X}$, is given by $\bar{X} = n^{-1} \sum_{i=1}^{n} X_i$. The sample variance, normally denoted by $s^2$, is given by $s^2 = (n-1)^{-1} \sum_{i=1}^{n} (X_i - \bar{X})^2$ and the sample standard deviation is given by $s = \sqrt{s^2}$.

2.5 Parametric models

Often it is assumed that the density $f_X(x)$ associated with a stochastic variable $X$ is an element of an indexed class of densities. Let the index be represented by $\theta$ and the set in which the index takes values be represented by $\Theta \subseteq \mathbb{R}^K$, $K$ a positive integer. Then, we write $f_X(x; \theta)$, call $\theta$ a finite dimensional parameter and $\Theta$ the parameter space. If there exists a one-to-one relation between $\Theta$ and the class of densities we say that the parameter $\theta$ is identified. As a consequence, knowledge of $\theta$ is equivalent to knowledge of $f_X$. In this case, $E(X) = m(\theta)$, $V(X) = h(\theta)$ and $F_X^{-1}(q) = x_q(\theta)$.

Example 2.3. Let $Y = X + \mu$ for $\mu \in \mathbb{R}$. From above we have that $f_Y(y) = f_X(y - \mu)$, $E(Y) = E(X) + \mu$, $V(Y) = V(X)$ and $F_Y^{-1}(q) = \mu + F_X^{-1}(q)$ for $q \in (0,1)$.
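The sample statistics defined above can be sketched in a few lines. The notes use MATLAB for numerical work; here is a minimal Python version (function name and data are ours):

```python
import math

# Sample average, sample variance (with the (n-1) divisor, as in the text)
# and sample standard deviation for an illustrative sample.
def sample_stats(xs):
    n = len(xs)
    xbar = sum(xs) / n
    s2 = sum((x - xbar) ** 2 for x in xs) / (n - 1)
    return xbar, s2, math.sqrt(s2)

xbar, s2, s = sample_stats([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])
print(xbar, s2)  # 5.0 and 32/7
```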
More generally, let $Y = \theta X + \mu$ for $\mu \in \mathbb{R}$ and $\theta > 0$. Then $E(Y) = \theta E(X) + \mu$ and $V(Y) = \theta^2 V(X)$. Hence, we have that $f_Y(y; \mu, \theta) = \frac{1}{\theta} f_X\!\left(\frac{y-\mu}{\theta}\right)$. Note that $f_X$ can be viewed as a special case ($\mu = 0$ and $\theta = 1$) of a family of distributions given by $\mathcal{F} = \{f_Y(y; \mu, \theta): \mu \in \mathbb{R}, \theta \in (0, \infty)\}$. This is called a location-scale family, where the location is given by $\mu$ (location parameter) and the scale is given by $\theta$ (scale parameter).

What follows are examples of parametrically indexed families of distributions:

Binomial - Let $n$ be the number of trials associated with a stochastic experiment that allows for two outcomes: success with probability $\theta$ and failure with probability $1 - \theta$. Let $X$ be the total number of successes in $n$ trials; then

  $P(X = k) = \dbinom{n}{k} \theta^k (1-\theta)^{n-k}$

for $k = 0, 1, \dots, n$. The probability distribution $P(\cdot)$ is called a binomial distribution with parameters $n$ and $\theta$ and is denoted $B(n, \theta)$. In this case we say that $X \sim B(n, \theta)$. If $n = 1$ we say that $X$ has a Bernoulli distribution.

Theorem 2.2. If $X \sim B(n, \theta)$, then $E(X) = n\theta$ and $V(X) = n\theta(1-\theta)$.

Proof. Recall that $(1+x)^n = \sum_{k=0}^{n} \binom{n}{k} x^k$. Differentiating both sides with respect to $x$ we have

  $n(1+x)^{n-1} = \sum_{k=0}^{n} \dbinom{n}{k} k x^{k-1}$

and multiplying both sides by $x$ we get

  $nx(1+x)^{n-1} = \sum_{k=0}^{n} \dbinom{n}{k} k x^{k}$.   (2.15)

Now,

  $E(X) = \sum_{k=0}^{n} k \dbinom{n}{k} \theta^k (1-\theta)^{n-k} = (1-\theta)^n \sum_{k=0}^{n} \dbinom{n}{k} k \left(\dfrac{\theta}{1-\theta}\right)^{k}$.
Letting $x = \frac{\theta}{1-\theta}$ in (2.15), we have

  $E(X) = (1-\theta)^n\, n \dfrac{\theta}{1-\theta} \left(1 + \dfrac{\theta}{1-\theta}\right)^{n-1} = n\theta$.

For the variance, we put $S_{n,k} = (1-\theta)^n \sum_{k=0}^{n} \binom{n}{k} k^2 \left(\frac{\theta}{1-\theta}\right)^k$ and note that $V(X) = S_{n,k} - n^2\theta^2$. Differentiating (2.15) with respect to $x$ we obtain

  $n(n-1)(1+x)^{n-2} x^2 = \sum_{k=0}^{n} \dbinom{n}{k} k(k-1) x^{k} = \sum_{k=0}^{n} \dbinom{n}{k} k^2 x^{k} - \sum_{k=0}^{n} \dbinom{n}{k} k x^{k}$.   (2.16)

Letting $x = \theta/(1-\theta)$, multiplying both sides of (2.16) by $(1-\theta)^n$ and noting that $(1-\theta)^n \sum_{k=0}^{n} \binom{n}{k} k x^k = n\theta$, we have $n(n-1)\theta^2 = S_{n,k} - n\theta$. Thus,

  $V(X) = n(n-1)\theta^2 + n\theta - n^2\theta^2 = n\theta(1-\theta)$.

Figure 2.1 provides a graph of a $B(5, 0.6)$ distribution. The height of each bar gives the probability of $X = k$, $k = 0, 1, \dots, 5$, in 5 trials (see MATLAB code binomial.m).

Figure 2.1: Plot of a Binomial probability function with $n = 5$ and $\theta = 0.6$.
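The moments in Theorem 2.2 can be checked directly from the probability function, without the combinatorial identities. A short Python sketch for the $B(5, 0.6)$ case plotted in Figure 2.1 (the notes use MATLAB; this translation is ours):

```python
from math import comb

# Check E(X) = n*theta and V(X) = n*theta*(1 - theta) for B(5, 0.6)
# by summing over the probability function directly.
n, theta = 5, 0.6
pmf = [comb(n, k) * theta**k * (1 - theta) ** (n - k) for k in range(n + 1)]
mean = sum(k * p for k, p in enumerate(pmf))
var = sum(k**2 * p for k, p in enumerate(pmf)) - mean**2
print(mean, var)  # 3.0 and 1.2, i.e. n*theta and n*theta*(1-theta)
```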
Uniform - Let $X$ be a continuous stochastic variable that takes values in the interval $[a, b]$ for $a, b \in \mathbb{R}$ with density

  $f(x) = \begin{cases} \frac{1}{b-a} & \text{if } x \in [a,b] \\ 0 & \text{if } x \notin [a,b]. \end{cases}$

In this case we say that $X \sim U[a,b]$ with parameters $a$ and $b$. It is easy to show that $E(X) = \frac{a+b}{2}$ and $V(X) = \frac{(b-a)^2}{12}$. Note that we can reparametrize this density by setting $\mu = \frac{a+b}{2}$ and $\sigma = (b-a)/\sqrt{12}$. The first parametrization emphasizes the endpoints of the set in which $X$ takes values and the second parametrization emphasizes the expected value and variance of the distribution.

Since strictly increasing cdfs ($F$) have inverses, it is always possible to generate a stochastic sample from any such distribution by first generating a stochastic sample from $U[0,1]$, say $\{u_1, \dots, u_n\}$, and then obtaining $x_i = F^{-1}(u_i)$, since cdfs take values in $[0,1]$.

Normal - Let $X$ be a continuous stochastic variable that takes values in the interval $(-\infty, \infty)$ with density

  $f(x) = \dfrac{1}{\sqrt{2\pi\sigma^2}} e^{-\frac{1}{2}\frac{(x-\mu)^2}{\sigma^2}}$,

with $\mu \in \mathbb{R}$ and $\sigma > 0$. In this case we say that $X \sim N(\mu, \sigma^2)$ with parameters $\mu$ and $\sigma^2$. It can be shown (the best way to do this is integrating using polar coordinates) that $E(X) = \mu$ and $V(X) = \sigma^2$. When $\mu = 0$ and $\sigma^2 = 1$ we write $X \sim N(0,1)$ and say that $X$ has a standard normal density. Note that if $Z \sim N(0,1)$, then $Y = \mu + \sigma Z$ where $\mu \in \mathbb{R}$ and $\sigma > 0$ is such that $E(Y) = \mu$, $V(Y) = \sigma^2$. Furthermore,

  $f_Y(y) = \dfrac{1}{\sigma} f_Z\!\left(\dfrac{y-\mu}{\sigma}\right) = \dfrac{1}{\sqrt{2\pi\sigma^2}} e^{-\frac{1}{2}\frac{(y-\mu)^2}{\sigma^2}}$.

Also, $F_Y(y) = \int_{-\infty}^{y} f_Y(\alpha)\,d\alpha = \int_{-\infty}^{y} \frac{1}{\sigma} f_Z\!\left(\frac{\alpha-\mu}{\sigma}\right) d\alpha$. Changing variables by letting $z = \frac{\alpha-\mu}{\sigma}$ we have that

  $F_Y(y) = \int_{-\infty}^{(y-\mu)/\sigma} \dfrac{1}{\sigma} f_Z(z)\,\sigma\,dz = \int_{-\infty}^{(y-\mu)/\sigma} f_Z(z)\,dz$.

Figures 2.2, 2.3 and 2.4 show the graphs of normal densities with different $\mu$ and $\sigma^2$, a normal distribution function and a normal quantile function (see MATLAB code normgen.m).

Log-normal - Let $Y \sim N(\mu, \sigma^2)$; then $X = \exp(Y)$ is said to have a Log-Normal density
Figure 2.2: Normal densities N(0,1), N(0,2) and N(2,1)

Figure 2.3: Standard normal density and distribution function

Figure 2.4: Standard normal quantile function
Figure 2.5: Log-normal densities

and we write $X \sim LN(\mu, \sigma^2)$. Clearly,

  $f_X(x) = \dfrac{1}{x} f_Y(\log x)$   (2.17)
  $\phantom{f_X(x)} = \dfrac{1}{x\sqrt{2\pi\sigma^2}} e^{-\frac{1}{2}\frac{(\log(x)-\mu)^2}{\sigma^2}}$   (2.18)

for $0 < x < \infty$. It can be shown that $E(X) = e^{\mu + (1/2)\sigma^2}$ and $V(X) = e^{2\mu+\sigma^2}(e^{\sigma^2} - 1)$. Figure 2.5 contains the graphs of three log-normal densities (see MATLAB code lognormgen.m).

Exponential - Let $X$ be a stochastic variable taking values in $(0, \infty)$ with density given by

  $f(x) = \dfrac{e^{-x/\theta}}{\theta}$ with $\theta > 0$.

In this case we say that $X$ has an exponential density with parameter $\theta$ and we write $X \sim \exp(\theta)$. It can be shown that $E(X) = \theta$ and $V(X) = \theta^2$.

Pareto (Type 1) - Let $X$ be a stochastic variable taking values in $(c, \infty)$, $c > 0$, such that $P(X > x) = \left(\frac{c}{x}\right)^\alpha$, $\alpha > 0$. Then $F(x) = 1 - \left(\frac{c}{x}\right)^\alpha$ and $f(x) = \alpha c^\alpha x^{-\alpha-1}$. Here $E(X) = \alpha c/(\alpha - 1)$ (for $\alpha > 1$) and $V(X) = \left(\frac{c}{\alpha-1}\right)^2 \frac{\alpha}{\alpha-2}$ (for $\alpha > 2$). An important characteristic of the Pareto distribution is that it decays at a slow polynomial rate compared to densities that decay exponentially.
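The inverse-cdf method described above (draw $u \sim U[0,1]$ and set $x = F^{-1}(u)$) can be combined with the exponential distribution just introduced, whose inverse cdf is available in closed form. A small Python sketch (seed and sample size are our arbitrary choices):

```python
import math
import random

# Inverse-cdf sampling: for the exponential, F(x) = 1 - exp(-x/theta),
# so F^{-1}(u) = -theta * log(1 - u). The sample mean should be near theta.
theta = 2.0
rng = random.Random(0)
n = 100_000
xs = [-theta * math.log(1.0 - rng.random()) for _ in range(n)]
print(sum(xs) / n)  # close to E(X) = theta = 2
```

The same recipe works for any strictly increasing cdf; when $F^{-1}$ has no closed form (the normal, for instance) it is evaluated numerically.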
Figure 2.6: Normal and t densities with $E(X) = 0$, $V(X) = 8/6$ and $v = 8$

2.6 Important densities related to the normal

Suppose that $\{Z_i\}_{i=1}^{v}$ is a sequence of independent and identically distributed stochastic variables with $Z_i \sim N(0,1)$. Then $X = \sum_{i=1}^{v} Z_i^2 \sim \chi^2(v)$ where $v$ is the parameter of the density. Note that $X$ takes values in $[0, \infty)$. This density is called a chi-squared density and the parameter $v$ is called its degrees of freedom. It can be shown that $E(X) = v$ and $V(X) = 2v$.

If $Z \sim N(0,1)$, $W \sim \chi^2(v)$ and $Z$ and $W$ are independent, then the ratio

  $Y = \dfrac{Z}{\sqrt{W/v}} \sim t(v)$ with density $f_Y(x) = \dfrac{\Gamma\!\left(\frac{v+1}{2}\right)}{\sqrt{v\pi}\,\Gamma(v/2)}\left(1 + \dfrac{x^2}{v}\right)^{-\frac{v+1}{2}}$,

where $\Gamma(x) = \int_0^\infty t^{x-1}\exp(-t)\,dt$ is the gamma function. It can be shown that $E(Y) = 0$ if $v > 1$ (if $v = 1$, $E(Y)$ does not exist) and $V(Y) = \frac{v}{v-2}$ if $v > 2$ (if $v = 2$, $V(Y)$ is infinite). If $v > 2$ and $X = \mu + \sigma\sqrt{\frac{v-2}{v}}\,Y$, then $E(X) = \mu$ and $V(X) = \sigma^2$. Also, $f_X(x) = \frac{1}{\sigma}\sqrt{\frac{v}{v-2}}\, f_Y\!\left(\sqrt{\frac{v}{v-2}}\,\frac{x-\mu}{\sigma}\right)$, which we denote by $t(v, \mu, \sigma)$. Figure 2.6 shows the graph of a t-density and a normal density (same expected value and variance) and Figure 2.7 shows the tails of the same densities in Figure 2.6.

If $V \sim \chi^2(v_1)$ and $W \sim \chi^2(v_2)$, and $V$ and $W$ are independent, then for $v_1, v_2 > 0$

  $X = \dfrac{V/v_1}{W/v_2} \sim F(v_1, v_2)$,

where $F(v_1, v_2)$ denotes a Fisher's $F$-distribution with $v_1$ and $v_2$ degrees of freedom.
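The chi-squared construction above is easy to verify by simulation: summing $v$ squared independent standard normal draws should give a variable with mean $v$ and variance $2v$. A Python sketch (degrees of freedom, seed and sample size are our arbitrary choices):

```python
import random

# X = Z_1^2 + ... + Z_v^2 with Z_i ~ N(0,1) i.i.d. should have
# E(X) = v and V(X) = 2v (here v = 8, so 8 and 16).
v, n = 8, 50_000
rng = random.Random(3)
xs = [sum(rng.gauss(0.0, 1.0) ** 2 for _ in range(v)) for _ in range(n)]
m = sum(xs) / n
var = sum((x - m) ** 2 for x in xs) / n
print(m, var)  # near 8 and 16
```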
Figure 2.7: Tails of the normal and t densities with $E(X) = 0$, $V(X) = 8/6$ and $v = 8$

2.7 Characteristic functions

If $X$ is a stochastic variable with density function $f_X$ we define the characteristic function associated with $f_X$ as the complex valued function

  $\phi_{f_X}(\tau) = \int_{-\infty}^{\infty} \exp(i\tau x) f_X(x)\,dx \equiv E(\exp(i\tau X))$   (2.19)

where $\tau \in \mathbb{R}$ and $i^2 = -1$. Characteristic functions always exist because by the triangle inequality

  $\left|\int \exp(i\tau x) f_X(x)\,dx\right| \leq \int |\exp(i\tau x)| f_X(x)\,dx$,

and since $|\exp(i\tau x)| = |\cos(\tau x) + i\sin(\tau x)| = \sqrt{\cos^2(\tau x) + \sin^2(\tau x)} = 1$, we have $\left|\int \exp(i\tau x) f_X(x)\,dx\right| \leq 1$. In mathematical parlance, the characteristic function is called the exponential Fourier transform of the density $f_X$. The following example gives the characteristic functions of some density functions.

Example 2.4.

Let $Y \sim U[-a, a]$; then $\phi_{f_Y}(\tau) = \dfrac{\sin(a\tau)}{a\tau}$;
Let $Y \sim N(\mu, \sigma^2)$; then $\phi_{f_Y}(\tau) = \exp\!\left(i\tau\mu - \frac{1}{2}\tau^2\sigma^2\right)$;
Let $Y \sim \exp(\theta)$; then $\phi_{f_Y}(\tau) = \dfrac{1}{1 - i\theta\tau}$.

There is a very important result that establishes that the characteristic function of a density function uniquely determines (or characterizes) the density function. Put differently, for every characteristic function there is one, and only one, density function. This result is known as the Uniqueness Theorem for characteristic functions and we will state it without proof.
Theorem 2.3. The characteristic function $\phi_{f_X}$ associated with $f_X$ uniquely determines $f_X$.

Proof. The proof is advanced for this course. If interested you may consult Jacod and Protter (2000, p. 17) or Resnick (2005, p. 32).

The usefulness of this theorem is illustrated by the following result, which shows that linear combinations of independent normally distributed stochastic variables are normally distributed.

Theorem 2.4. Let $\{Z_j\}_{j=1}^{n}$ be a collection of independent stochastic variables such that $Z_j \sim N(\mu_j, \sigma_j^2)$ and consider $Y = \sum_{j=1}^{n} a_j Z_j$ where $a_j \in \mathbb{R}$ are non-stochastic. Then,

  $Y \sim N\!\left(\sum_{j=1}^{n} a_j\mu_j,\; \sum_{j=1}^{n} \sigma_j^2 a_j^2\right)$.

Proof. $\phi_{f_Y}(\tau) = E(\exp(i\tau Y)) = E(\exp(i\tau \sum_{j=1}^{n} a_j Z_j)) = \prod_{j=1}^{n} E(\exp(i\tau a_j Z_j))$, where the last equality follows from the independence of the $Z_j$'s. Note that $E(\exp(i\tau a_j Z_j))$ is the characteristic function of a normally distributed stochastic variable evaluated at $\tau a_j$. From Example 2.4 we have

  $E(\exp(i\tau a_j Z_j)) = \exp\!\left(i\tau a_j\mu_j - \tfrac{1}{2}\sigma_j^2 a_j^2\tau^2\right)$,

and consequently,

  $\phi_{f_Y}(\tau) = \prod_{j=1}^{n} \exp\!\left(i\tau a_j\mu_j - \tfrac{1}{2}\sigma_j^2 a_j^2\tau^2\right) = \exp\!\left(i\tau \sum_{j=1}^{n} \mu_j a_j - \tfrac{1}{2}\tau^2 \sum_{j=1}^{n} \sigma_j^2 a_j^2\right)$.

This is a characteristic function that is uniquely associated with a normal density given by $N\!\left(\sum_{j=1}^{n} \mu_j a_j, \sum_{j=1}^{n} \sigma_j^2 a_j^2\right)$.

2.8 Order statistics and empirical distributions

Let $X$ be a stochastic variable with distribution function given by $F_X$ and $\chi_n = \{X_i\}_{i=1}^{n}$ be a stochastic sample of size $n$. The empirical distribution associated with $\chi_n$ is given by

  $F_n(x) = \dfrac{1}{n} \sum_{i=1}^{n} I_{\{\omega: X_i \leq x\}}(\omega)$,   (2.20)

where $I_A(\omega)$ is the indicator function associated with set $A$, that is, $I_A(\omega) = 1$ if $\omega \in A$ and $I_A(\omega) = 0$ if $\omega \notin A$. The empirical distribution is a stochastic variable (it depends on $\chi_n$) and it is easy to see that

  $E(F_n(x)) = \dfrac{1}{n} \sum_{i=1}^{n} E(I_{\{X_i \leq x\}}(X_i)) = \dfrac{1}{n}\, n F_X(x) = F_X(x)$   (2.21)
and, by independence of the $X_i$'s, we have

  $V(F_n(x)) = \dfrac{1}{n^2} \sum_{i=1}^{n} V(I_{\{X_i \leq x\}}(X_i)) = \dfrac{1}{n} F_X(x)(1 - F_X(x))$.   (2.22)

The order statistics associated with the sample $\chi_n$ are the elements of $\chi_n$ listed in ascending order, denoted by $X_{(1)}, X_{(2)}, \dots, X_{(n)}$. The empirical distribution is discontinuous with jumps at the order statistics. In this case, we define the quantile of order $a \in (0,1)$ to be $F_X^{-}(a) = \inf\{x: F_X(x) \geq a\}$. Then, we have that

  $F_n^{-}(a) = \inf\{x: F_n(x) \geq a\} = \inf\{X_{(j)}: F_n(X_{(j)}) \geq a\} = \inf\{X_{(j)}: j/n \geq a\} = \inf\{X_{(j)}: j \geq na\}$.

Thus, the quantile $F_n^{-}(a)$ is the smallest order statistic $X_{(j)}$ for which $j \geq na$. Since $j$ is a natural number, we write

  $q_n(a) = F_n^{-}(a) = \begin{cases} X_{(na)} & \text{if } na \in \mathbb{N} \\ X_{([na]+1)} & \text{if } na \notin \mathbb{N}, \end{cases}$   (2.23)

where $[x]$ denotes the integer part of the number $x$. For example, if $n = 100$ and $a = 0.95$, then the 0.95 quantile associated with the empirical distribution is $X_{(na)}$, which in this case is $X_{(95)}$. If $na$ is not a natural number, then we round to the next largest order statistic, that is, $X_{([na]+1)}$.

We note that if $X \sim N(\mu, \sigma^2)$ then $F_X^{-1}(a) = \mu + \sigma F_Z^{-1}(a)$ for $a \in (0,1)$. This equation says that the quantiles of $X$ can be written as a linear function of the quantiles of a standard normal distribution with intercept given by $\mu$ and slope given by $\sigma$. $F_Z^{-1}(a)$ can be easily calculated and $F_X^{-1}(a)$ can be approximated by order statistics of a sample of observations on $X$. If $X$ is indeed normally distributed, then plotting $q_n(a) \equiv F_n^{-}(a)$ against $F_Z^{-1}(a)$ should produce a graph close to a linear function. Deviations from a linear function should result only from the fact that $q_n(a)$ is an estimator for $F_X^{-1}(a)$. The resulting plot is called a Q-Q plot and significant deviations from a linear function are evidence of nonnormality. Figures 2.8 and 2.9 show Q-Q plots for stochastic samples from a $N(0, 8/6)$ and a $t(5, 0, 8/6)$.
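The empirical quantile rule (2.23) can be sketched directly. A minimal Python implementation (function name is ours) on data where $X_{(j)} = j$, so the answers are easy to read off:

```python
# Empirical quantile q_n(a) from equation (2.23): the order statistic
# X_(na) if na is a natural number, and X_([na]+1) otherwise.
def empirical_quantile(sample, a):
    xs = sorted(sample)                # order statistics X_(1) <= ... <= X_(n)
    na = len(xs) * a
    j = int(na) if na == int(na) else int(na) + 1
    return xs[j - 1]                   # X_(j); the text indexes from 1

data = list(range(1, 11))              # n = 10, X_(j) = j
print(empirical_quantile(data, 0.5))   # na = 5, so X_(5) = 5
print(empirical_quantile(data, 0.95))  # na = 9.5, so X_(10) = 10
```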
Figure 2.10 provides a Q-Q plot for a sequence of 527 daily log-returns for three-month future contracts for wheat at the Chicago Board of Trade, where the last trading day is September 8, 2014. Visual inspection suggests that the Q-Q plot for the daily log-returns for wheat contracts is more similar to that of a t-distribution than to that of a normal distribution, exhibiting much thicker tails than those associated with a normal distribution.
Figure 2.8: Q-Q plot for a normal density with $E(X) = 0$, $V(X) = 8/6$

Figure 2.9: Q-Q plot for a Student-t density with $E(X) = 0$, $V(X) = 8/6$ and $v = 5$

Figure 2.10: Q-Q plot for daily log-returns for wheat contracts at the Chicago Board of Trade
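The coordinates behind a Q-Q plot like Figure 2.8 are just pairs $(F_Z^{-1}(a), q_n(a))$. The sketch below (Python rather than the MATLAB used in the notes; probabilities, seed and sample size are our choices) builds a few such pairs for a normal sample and checks that consecutive pairs line up with slope close to $\sigma$:

```python
import random
import statistics

# Q-Q pairs for a N(mu, sigma^2) sample: empirical quantiles against
# standard normal quantiles should be nearly linear with slope sigma.
mu, sigma = 0.0, (8 / 6) ** 0.5
rng = random.Random(11)
xs = sorted(rng.gauss(mu, sigma) for _ in range(100_000))
z = statistics.NormalDist()
probs = [0.10, 0.25, 0.50, 0.75, 0.90]       # n*a is an integer for each
pairs = [(z.inv_cdf(a), xs[int(len(xs) * a) - 1]) for a in probs]
slopes = [(y2 - y1) / (x2 - x1)
          for (x1, y1), (x2, y2) in zip(pairs, pairs[1:])]
print(slopes)  # each close to sigma, about 1.155
```

For the wheat returns of Figure 2.10 the same construction bends away from a straight line in the tails, which is the visual evidence of nonnormality discussed above.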
2.9 Skewness and kurtosis

Skewness and kurtosis of a distribution are measures of its shape. A measure of skewness captures the extent to which a distribution is asymmetric. We say that a distribution is symmetric about a point $\mu$ if $P(X \geq \mu + x) = P(X \leq \mu - x)$ for all $x$ in the set where the stochastic variable takes values. When $\mu = E(X)$ we say that the distribution of $X$ is symmetric about the mean. When it exists, the skewness of a distribution is given by

  $Sk(X) = \dfrac{E((X - E(X))^3)}{\sigma^3}$, where $V(X) = \sigma^2$.   (2.24)

The next theorem shows that if a distribution is symmetric about $E(X)$ then $Sk(X) = 0$.

Theorem 2.5. Let $F$ be symmetric about $\mu = \int x\,dF(x)$. Then $\int (x-\mu)^3\,dF(x) = 0$.

Proof. Let $z = x - \mu$. Then

  $I = \int_{-\infty}^{\infty} (x-\mu)^3\,dF(x) = \int_{-\infty}^{\infty} z^3\,dF(z+\mu) = \int_{-\infty}^{0} z^3\,dF(z+\mu) + \int_{0}^{\infty} z^3\,dF(z+\mu)$.

Now, letting $z = -y$,

  $\int_{-\infty}^{0} z^3\,dF(z+\mu) = \int_{0}^{\infty} (-y)^3\,dF(-y+\mu)$,

and by symmetry of $F$, $F(\mu - y) = 1 - F(\mu + y)$. Consequently, $dF(\mu - y) = -dF(\mu + y)$. Hence,

  $\int_{-\infty}^{0} z^3\,dF(z+\mu) = -\int_{0}^{\infty} y^3\,dF(\mu+y)$.

Thus, $I = 0$.

The normal, Student-t and uniform densities are examples of densities which are symmetric about $\mu$. Examples of asymmetric distributions include the log-normal, where $Sk(X) = (e^{\sigma^2} + 2)(e^{\sigma^2} - 1)^{1/2} > 0$, and the binomial, where $Sk(X) = \frac{1 - 2\theta}{(n\theta(1-\theta))^{1/2}}$. When $Sk(X) > 0$ we speak of right skewness and when $Sk(X) < 0$ we speak of left skewness.

Kurtosis is a measure of the relative probability weight of the tails and center of a density relative to its shoulders. What constitutes the center, shoulders and tails of a distribution is arbitrary. Normally, for a distribution that is symmetric about $\mu$, the center is defined as $[\mu - \sigma, \mu + \sigma]$, the shoulders as $[\mu - 2\sigma, \mu - \sigma) \cup (\mu + \sigma, \mu + 2\sigma]$ and the tails as $(-\infty, \mu - 2\sigma) \cup (\mu + 2\sigma, \infty)$. Kurtosis is normally defined as

  $K(X) = \dfrac{E((X - E(X))^4)}{\sigma^4}$, where $V(X) = \sigma^2$.   (2.25)
The greater the probability mass on the center and (especially) the tails relative to the shoulders, the greater the kurtosis. If $X \sim N(0,1)$, then $K(X) = E(X^4) = \int x^4 f_X(x)\,dx$. Letting $Y = X^2$ we have that $K(X) = \int y^2 f_Y(y)\,dy = V(Y) + (E(Y))^2 = 2 + 1 = 3$, since $Y \sim \chi^2_1$. It is common to measure kurtosis relative to $K = 3$. Hence, if a distribution has $K(X) > 3$ we say that it has excess kurtosis (relative to the standard normal distribution it has more weight on the center and tails). Figures 2.6 and 2.7 show that this is the case for the Student-t distribution. In fact, the kurtosis for a Student-t distribution $t(v, 0, v/(v-2))$ is given by $K = 3 + \frac{6}{v-4}$ for $v > 4$. When a distribution is not symmetric, the measure of kurtosis embeds both a measure of asymmetry and relative probability weight, making it more difficult to interpret the meaning of $K(X)$.

2.9.1 Tail behavior

An important way in which the normal and Student-t densities differ has to do with their tail behavior. From the analytical expression for the normal, we see that it decays to zero as $x \to \pm\infty$ at an exponential rate, that is $f(x) \propto \exp(-0.5x^2)$, whereas the Student-t density decays at a polynomial rate, since $f(x) \propto |x|^{-(v+1)}$. Note that the rate of decay for the Student-t slows down as $v$ gets smaller. In this sense, the tail behavior of the Student-t is akin to the tail behavior of the Pareto density, with the constraint, in the case of the Student-t density, that $v$ is an integer rather than a continuous parameter.

An arbitrary distribution function $F(x)$ is said to have a Pareto right tail if

  $1 - F(x) = L(x)\, x^{-\alpha}$   (2.26)

for some $\alpha > 0$ where $L(x)$ is slowly varying at $\infty$. By slowly varying at $\infty$ it is meant that $\frac{L(\lambda x)}{L(x)} \to 1$ as $x \to \infty$ for all $\lambda > 0$. To understand what this condition means about $L(x)$, take $\lambda = 1/2$; then for $x$ sufficiently large $L(x/2)$ and $L(x)$ are nearly the same, that is, in parts of the domain where $x$ is sufficiently large, multiplying $x$ by 2 produces little change in $L$.
Put differently, for $x$ sufficiently large $L$ is nearly a constant, and as a result for $x$ large enough $1 - F(x)$ is nearly proportional to $x^{-\alpha}$. The probability $1 - F(x)$ is called the survival function of the stochastic variable $X$.
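The benchmark value $K = 3$ for the standard normal derived above can be checked by simulation. A Python sketch using the sample analogue of (2.25) (seed and sample size are our arbitrary choices):

```python
import random

# Sample kurtosis of a standard normal sample: fourth central moment
# divided by the squared second central moment; should be close to 3.
rng = random.Random(7)
n = 200_000
xs = [rng.gauss(0.0, 1.0) for _ in range(n)]
m = sum(xs) / n
m2 = sum((x - m) ** 2 for x in xs) / n
m4 = sum((x - m) ** 4 for x in xs) / n
K = m4 / m2**2
print(K)  # close to 3
```

Repeating this with heavy-tailed draws (a Student-t with small $v$, say) produces values well above 3, in line with the excess-kurtosis discussion above.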
2.9.2 Multivariate distributions

It is often the case that we are interested in multiple stochastic variables. Suppose, for example, that rather than dealing with a single stochastic variable $X$ we are interested in a collection of $d$ stochastic variables. It is convenient to collect them in a vector

  $X = (X_1, X_2, \dots, X_d)'$.

In this case we speak of a stochastic vector. We are interested in attaching probabilities to the event $A = \{X_1 \in (-\infty, x_1]\} \cap \cdots \cap \{X_d \in (-\infty, x_d]\}$, where $x = (x_1, \dots, x_d)'$. The joint cumulative distribution associated with the vector $X$ is given by $F_X(x) = P(X \in A)$. When $F_X$ admits a density $f_X(x): \mathbb{R}^d \to \mathbb{R}$, it must satisfy

  $F_X(y) = \int_{-\infty}^{y_1} \cdots \int_{-\infty}^{y_d} f_X(x_1, \dots, x_d)\,dx_1 \cdots dx_d$.   (2.27)

If the stochastic vector has independent components, then

  $f_X(x_1, \dots, x_d) = f_{X_1}(x_1) \cdots f_{X_d}(x_d)$

and $f_{X_i}(x_i)$ is called the marginal density of $X_i$.

One synthetic measure of how two stochastic variables behave relative to each other is called the covariance. Whenever it exists, we define it as

  $C(X_1, X_2) = E((X_1 - E(X_1))(X_2 - E(X_2)))$.

It is easy to show that $C(X_1, X_2) = E(X_1 X_2) - E(X_1)E(X_2)$. Furthermore, it follows that if $X_1$ and $X_2$ are independent, then $C(X_1, X_2) = 0$. In fact, for any two (continuous) functions $g$ and $h$, when $X_1$ and $X_2$ are independent, we have $E(g(X_1)h(X_2)) = E(g(X_1))E(h(X_2))$.

The correlation between two stochastic variables $X_1$ and $X_2$ is given by

  $\rho(X_1, X_2) = \dfrac{C(X_1, X_2)}{\sqrt{V(X_1)}\sqrt{V(X_2)}}$.

We note that for any $a \in \mathbb{R}$,

  $E\!\left((a(X_1 - E(X_1)) + (X_2 - E(X_2)))^2\right) = f(a) = a^2 V(X_1) + 2a C(X_1, X_2) + V(X_2)$.
This is a nonnegative quadratic function and consequently it must be that $4C^2(X_1, X_2) - 4V(X_1)V(X_2) \leq 0$, which implies that

  $|C(X_1, X_2)| \leq \sqrt{V(X_1)V(X_2)}$   (2.28)

and $|\rho(X_1, X_2)| \leq 1$. The inequality in (2.28) is a special case of a more general inequality called the Cauchy-Schwarz inequality.

Given a sample of two stochastic variables $\{(X_{1i}, X_{2i})\}_{i=1}^{n}$ we define the sample covariance as

  $\hat{C} = \dfrac{1}{n} \sum_{i=1}^{n} (X_{1i} - \bar{X}_1)(X_{2i} - \bar{X}_2)$

and the sample correlation as

  $\hat{\rho} = \dfrac{\hat{C}}{\sqrt{s^2_{X_1} s^2_{X_2}}}$.

The conditional density of $X_d$ given $X_{d-1}, \dots, X_1$ is given by

  $f_{X_d | X_{d-1} \cdots X_1}(x) = \dfrac{f_X(X_{-d}, x)}{f_{X_{-d}}(X_{-d})}$   (2.29)

where $X_{-d} = (X_1, \dots, X_{d-1})$ and $f_{X_{-d}}(X_{-d})$ is the joint marginal density of $X_{-d}$. We define

  $E(X_d | X_{d-1} \cdots X_1) = \int z\, f_{X_d | X_{d-1} \cdots X_1}(z)\,dz$

and

  $V(X_d | X_{d-1} \cdots X_1) = \int (z - E(X_d | X_{d-1} \cdots X_1))^2 f_{X_d | X_{d-1} \cdots X_1}(z)\,dz$.

Example 2.5. Let $f_{XY}(x, y) = 2$ if $0 < x < 1$ and $x < y < 1$. Then $f_X(x) = \int_{x}^{1} 2\,dy = 2(1 - x)$, $f_{Y|X}(y) = \frac{1}{1 - X}$ and $E(Y|X) = \int_{X}^{1} \frac{y}{1 - X}\,dy = \frac{1 + X}{2}$. Similar calculations yield $V(Y|X) = \frac{(1 - X)^2}{12}$.

2.9.3 Multivariate normal

Definition 2.1. The stochastic vector $X = (X_1, X_2, \dots, X_d)'$ is said to have a multivariate normal distribution if for any set of constants $a_1, \dots, a_d$, the stochastic variable

  $Y = \sum_{i=1}^{d} a_i X_i \sim N(\mu, \sigma^2)$, for some $\mu \in \mathbb{R}$ and some $\sigma^2 \in (0, \infty)$.   (2.30)

Clearly, if all $a_i = 0$ except for $i = j$, where $a_j = 1$, then we can conclude that $Y = X_j \sim N(E(X_j), V(X_j))$. Also, $E(Y) = \sum_{i=1}^{d} a_i E(X_i) = a' E(X)$, where

  $E(X) = (E(X_1), E(X_2), \dots, E(X_d))'$ and $a = (a_1, a_2, \dots, a_d)'$
and $a'$ represents the transposition of the vector $a$. The variance of $Y$ is given by

  $V(Y) = E\!\left(\left(\sum_{i=1}^{d} a_i (X_i - E(X_i))\right)^2\right)$   (2.31)
  $\phantom{V(Y)} = E\!\left(\sum_{i=1}^{d} a_i^2 (X_i - E(X_i))^2 + 2\sum_{i=1}^{d} \sum_{j<i} a_i a_j (X_i - E(X_i))(X_j - E(X_j))\right)$   (2.32)
  $\phantom{V(Y)} = \sum_{i=1}^{d} a_i^2 V(X_i) + 2\sum_{i=1}^{d} \sum_{j<i} a_i a_j C(X_i, X_j)$   (2.33)
  $\phantom{V(Y)} = a'\,\mathrm{cov}(X)\, a$   (2.34)

where

  $\mathrm{cov}(X) = \begin{pmatrix} V(X_1) & C(X_1, X_2) & \cdots & C(X_1, X_d) \\ C(X_2, X_1) & V(X_2) & \cdots & C(X_2, X_d) \\ \vdots & \vdots & \ddots & \vdots \\ C(X_d, X_1) & C(X_d, X_2) & \cdots & V(X_d) \end{pmatrix}$.

In this case we write $X \sim N(E(X), \mathrm{cov}(X))$.

It is useful to have an expression for the characteristic function of a stochastic vector that has a multivariate normal distribution.

Theorem 2.6. $X \sim N(E(X), \mathrm{cov}(X))$ if, and only if, the characteristic function of its joint density is written as

  $\phi_{f_X}(t) = \exp\!\left(i t' E(X) - \tfrac{1}{2} t'\,\mathrm{cov}(X)\, t\right)$   (2.35)

for $t \in \mathbb{R}^d$.

Proof. Suppose (2.35) holds; we need to show that $Y = a'X$ is univariate normal for any $a \in \mathbb{R}^d$. Note that $\phi_{f_Y}(u) = E(\exp(iuY)) = E(\exp(iua'X)) = \phi_{f_X}(ua) = \exp\!\left(iua'E(X) - \tfrac{1}{2}u^2 a'\,\mathrm{cov}(X)\,a\right)$, which implies that $Y \sim N(a'E(X), a'\,\mathrm{cov}(X)\,a)$ by Theorem 2.4. Now, suppose $X$ is multivariate normal; then $Y = t'X$ is univariate normal for any $t \in \mathbb{R}^d$ and $\phi_{f_Y}(u) = \exp(iu\mu - \tfrac{1}{2}u^2\sigma^2)$ where $\mu = \sum_{i=1}^{d} t_i E(X_i)$ and $\sigma^2 = t'\,\mathrm{cov}(X)\,t$. Then $\phi_{f_Y}(1) = \exp(it'E(X) - \tfrac{1}{2}t'\,\mathrm{cov}(X)\,t)$, which is (2.35).

If $X$ is partitioned as $X = (X_1, X_{-1}')'$, $E(X)$ as $E(X) = (E(X_1), E(X_{-1})')'$ and $\mathrm{cov}(X)$ as

  $\mathrm{cov}(X) = \begin{pmatrix} V(X_1) & \Sigma_{1,-1} \\ \Sigma_{-1,1} & \Sigma_{-1,-1} \end{pmatrix}$,

and if $X \sim N(E(X), \mathrm{cov}(X))$, then

  $X_1 \mid X_{-1} \sim N\!\left(E(X_1) + \Sigma_{1,-1}\Sigma_{-1,-1}^{-1}(X_{-1} - E(X_{-1})),\; V(X_1) - \Sigma_{1,-1}\Sigma_{-1,-1}^{-1}\Sigma_{-1,1}\right)$.   (2.36)

Equation (2.36) states that components of a multivariate normally distributed stochastic vector have normal conditional distributions.
2.10 Estimation

Given a sample $S_n = \{X_1, \dots, X_n\}$ of observations on the stochastic variable $X \sim F(x; \theta)$, for $\theta \in \Theta$, an estimator is a function $\hat{\theta}(X_1, \dots, X_n): S_n \to \Theta$. The bias of an estimator $\hat{\theta}$ is defined as $B(\hat{\theta}) = E(\hat{\theta}) - \theta$ and the mean squared error of the estimator is defined as $MSE(\hat{\theta}) = E((\hat{\theta} - \theta)^2)$. We normally seek estimators that are efficient, in that MSE is minimized. It is clearly the case that

  $MSE(\hat{\theta}) = V(\hat{\theta}) + B(\hat{\theta})^2$.   (2.37)

Hence, efficiency calls for estimators that have small variance and bias. When an estimator is such that $B(\hat{\theta}) = 0$, we call the estimator unbiased. When this is the case, efficiency involves variance minimization.

2.10.1 Two basic estimation procedures

We will consider two generic estimation procedures: maximum likelihood (ML) and method of moments (MM) estimation.

Maximum likelihood estimation: Let $S_n = \{X_i\}_{i=1}^{n}$ be a sample and let $X$ be a vector with components $X_i$ and assume that $X \sim f(x; \theta)$ where $\theta \in \Theta \subseteq \mathbb{R}^p$. The function $L(\theta) = f(X; \theta): \Theta \to \mathbb{R}$ (for fixed $X$) is called the likelihood function associated with the sample $S_n$. The maximum likelihood estimator for $\theta$, denoted by $\hat{\theta}_{ML}$, is (whenever it exists) defined as

  $\hat{\theta}_{ML} = \underset{\theta \in \Theta}{\mathrm{argmax}}\; f(X; \theta)$.   (2.38)

Often, it is easier to maximize the logarithm of $f(X; \theta)$. Since $\log(x)$ is a strictly increasing function of $x$, it follows that we can similarly define $\hat{\theta}_{ML}$ as

  $\hat{\theta}_{ML} = \underset{\theta \in \Theta}{\mathrm{argmax}}\; \log f(X; \theta)$.   (2.39)

It is often the case that enough assumptions are placed on the structure of the optimization in (2.39) to assure that $\hat{\theta}_{ML}$ is the unique solution of $\frac{\partial}{\partial \theta} \log f(X; \theta) = 0$. For example, if $\log f(X; \theta)$ is strictly concave in $\Theta$, differentiable and reaches a maximum in the interior of $\Theta$, then $\hat{\theta}_{ML}$ is indeed the solution of $\frac{\partial}{\partial \theta} \log f(X; \theta) = 0$. In this case, the maximum likelihood estimator can be defined as the value of $\theta$ that solves $\frac{\partial}{\partial \theta} \log f(X; \theta) = 0$. Note that the last equality defines a system of $p$ equations. The vector $\frac{\partial}{\partial \theta} \log f(X; \theta)$ is called the score.
Example 2.6. Suppose $S_n$ is a stochastic sample from $N(\mu, \sigma^2)$. Then

$$f(X_i; \mu, \sigma^2) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{1}{2}\frac{(X_i - \mu)^2}{\sigma^2}\right)$$

and $f(X; \mu, \sigma^2) = \prod_{i=1}^{n} f(X_i; \mu, \sigma^2)$. Hence,

$$\log f(X; \mu, \sigma^2) = \sum_{i=1}^{n} \log f(X_i; \mu, \sigma^2) = -\frac{n}{2}\left(\log \sigma^2 + \log 2\pi\right) - \frac{1}{2\sigma^2}\sum_{i=1}^{n}(X_i - \mu)^2.$$

The score vector in this case is given by

$$\begin{pmatrix} \frac{\partial}{\partial \mu} \log f(X; \mu, \sigma^2) \\ \frac{\partial}{\partial \sigma^2} \log f(X; \mu, \sigma^2) \end{pmatrix} = \begin{pmatrix} \frac{1}{2\sigma^2}\sum_{i=1}^{n} 2(X_i - \mu) \\ -\frac{n}{2\sigma^2} + \frac{1}{2\sigma^4}\sum_{i=1}^{n}(X_i - \mu)^2 \end{pmatrix} \quad (2.40)$$

and setting the score equal to zero and solving for $\mu$ and $\sigma^2$ gives $\hat{\mu}_{ML} = \frac{1}{n}\sum_{i=1}^{n} X_i$ and $\hat{\sigma}^2_{ML} = \frac{1}{n}\sum_{i=1}^{n}(X_i - \hat{\mu}_{ML})^2$.

Example 2.7. Suppose $S_n$ is a stochastic sample from a Pareto distribution with parameters $(c, \alpha)$, where $c$ is known (if $c$ is not known, it can be estimated by $\hat{c} = \min\{X_1, \ldots, X_n\}$). Since the density associated with a Pareto distribution is given by $f(X_i; c, \alpha) = \frac{\alpha c^{\alpha}}{X_i^{\alpha+1}}$, the likelihood function is given by

$$f(X; c, \alpha) = \prod_{i=1}^{n} \frac{\alpha c^{\alpha}}{X_i^{\alpha+1}} \quad (2.41)$$

and

$$\log f(X; c, \alpha) = \sum_{i=1}^{n}\left(\log \alpha + \alpha \log c - (\alpha + 1)\log X_i\right). \quad (2.42)$$

Taking the first derivative with respect to $\alpha$ and solving $\frac{\partial}{\partial \alpha} \log f(X; \alpha) = 0$ gives

$$\hat{\alpha}_{ML} = \frac{n}{\sum_{i=1}^{n} \log(X_i/c)}.$$

Often, the solution of optimization problems such as (2.39) cannot be obtained analytically. In this case, it is necessary to maximize $\log f(X; \theta)$ numerically. In MATLAB, the function fminsearch performs numerical minimization; maximization is carried out by minimizing the negative of the objective. See the code norm_mle.m, which conducts numerical optimization to obtain maximum likelihood estimates of the parameters $\mu$ and $\sigma^2$ of a normal density.

Method of Moments estimation: The main idea behind method of moments estimation is to substitute theoretical moments with their sample equivalents. Consider the following two examples.

Example 2.8. Consider a stochastic sample $S_n$ from a stochastic variable $X \sim N(\mu, \sigma^2)$. Since $E(X) = \mu$, we define the estimator $\hat{\mu}_M = \frac{1}{n}\sum_{i=1}^{n} X_i$. That is, $E(X)$ is estimated by the sample average. Also, since $\sigma^2 = V(X) = E\left((X - E(X))^2\right)$, we define the estimator $\hat{\sigma}^2_M = \frac{1}{n}\sum_{i=1}^{n}(X_i - \hat{\mu}_M)^2$.
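A minimal check (Python, with illustrative data) of the closed-form solutions in Example 2.6, which coincide with the method of moments estimates of Example 2.8:

```python
import math

# Closed-form ML estimates for N(mu, sigma^2) from Example 2.6 -- identical
# to the method of moments estimates of Example 2.8. Data are illustrative.

def normal_ml(xs):
    n = len(xs)
    mu_hat = sum(xs) / n
    sigma2_hat = sum((x - mu_hat) ** 2 for x in xs) / n  # 1/n, not 1/(n - 1)
    return mu_hat, sigma2_hat

def normal_loglik(xs, mu, sigma2):
    n = len(xs)
    return (-0.5 * n * (math.log(sigma2) + math.log(2 * math.pi))
            - sum((x - mu) ** 2 for x in xs) / (2 * sigma2))

xs = [1.0, 2.0, 2.0, 3.0]
mu_hat, s2_hat = normal_ml(xs)  # mu_hat = 2.0, s2_hat = 0.5
# The closed form maximizes the log-likelihood: perturbed values do worse.
assert normal_loglik(xs, mu_hat, s2_hat) >= normal_loglik(xs, mu_hat + 0.1, s2_hat)
assert normal_loglik(xs, mu_hat, s2_hat) >= normal_loglik(xs, mu_hat, s2_hat + 0.1)
print(mu_hat, s2_hat)
```

When no closed form exists, the same log-likelihood function can be handed to a numerical optimizer, as the notes do in MATLAB with fminsearch applied to the negative log-likelihood.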
Example 2.9. Consider a stochastic sample $S_n$ from a stochastic variable $X$ which has a Pareto distribution with parameters $(c, \alpha)$, where $c$ is known. Recall that $E(X) = \frac{\alpha c}{\alpha - 1}$; hence we write $\frac{1}{n}\sum_{i=1}^{n} X_i = \frac{\hat{\alpha}_M c}{\hat{\alpha}_M - 1}$, which implies that

$$\hat{\alpha}_M = \frac{\frac{1}{n}\sum_{i=1}^{n} X_i}{\frac{1}{n}\sum_{i=1}^{n} X_i - c}.$$

Note that whereas in the case of estimation of the parameters of a normal density the MM and ML estimators coincide, this is not the case when estimating the parameters of the Pareto distribution.

Evaluating estimators

Since estimators are functions of stochastic variables, they are themselves (in general) stochastic variables. As a result, it is of interest to ask what their distributions are. Ideally, an arbitrary estimator $\hat{\theta}$ ought to take values that are close to, and concentrated around, the true parameter value $\theta$. Unbiasedness, mentioned above, is a measure of the closeness of $\hat{\theta}$ to $\theta$; in essence, it says that the distribution of $\hat{\theta}$ is located at $\theta$. Furthermore, a small variance (and, in the case of unbiasedness, a small MSE) means that the distribution of $\hat{\theta}$ is largely concentrated around $\theta$.

There are other useful ways to ascertain how close $\hat{\theta}$ is to $\theta$. Some of the most useful concepts of closeness are related to the behavior of the estimator when the sample size $n$ grows, i.e., as $n \to \infty$. The collection of concepts and results that pertain to the behavior of $\hat{\theta}$ (or any sequence of stochastic variables $X_n$) as $n \to \infty$ is called asymptotic theory.

One of the most used asymptotic concepts of closeness between an estimator $\hat{\theta}(X_1, \ldots, X_n)$ and the true parameter value $\theta$ is that of convergence in probability. An estimator is said to converge in probability to $\theta$ if for all $\epsilon, \delta > 0$ there exists $N_{\epsilon,\delta}$ such that whenever $n > N_{\epsilon,\delta}$ we have $P(\{|\hat{\theta}(X_1, \ldots, X_n) - \theta| > \epsilon\}) < \delta$. If this is the case we say that $\hat{\theta}(X_1, \ldots, X_n) \stackrel{p}{\to} \theta$. If $\theta \in \mathbb{R}^q$, then we write $\hat{\theta}(X_1, \ldots, X_n) \stackrel{p}{\to} \theta$ if for all $\epsilon, \delta > 0$ there exists $N_{\epsilon,\delta}$ such that whenever $n > N_{\epsilon,\delta}$ we have $P(\|\hat{\theta}(X_1, \ldots, X_n) - \theta\| > \epsilon) < \delta$,
where $\|\hat{\theta}(X_1, \ldots, X_n) - \theta\|$ is the Euclidean distance between $\hat{\theta}(X_1, \ldots, X_n)$ and $\theta$.

Knowing that $Z_n \equiv \hat{\theta} - \theta$ gets arbitrarily close to zero with probability approaching 1 as the sample size grows is useful, but it conveys no useful information about the distribution $F_n(z)$ of $Z_n$, other than the fact that as $n \to \infty$ it degenerates to

$$F(z) = \begin{cases} 0, & \text{if } z < 0 \\ 1, & \text{if } z \geq 0. \end{cases}$$
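Convergence in probability of the estimators from Examples 2.7 and 2.9 can be illustrated by simulation. The sketch below (Python, not part of the notes; the true values $\alpha = 3$, $c = 1$ and the seed are illustrative assumptions) draws Pareto samples by inverse-transform sampling and shows both estimates settling near the true $\alpha$ as $n$ grows, while differing from each other on any finite sample:

```python
import math
import random

# Convergence in probability, illustrated with the Pareto estimators of
# Examples 2.7 (ML) and 2.9 (MM). The true parameters alpha = 3, c = 1 and
# the seed are illustrative assumptions, not values from the notes.

def pareto_sample(n, c, alpha, rng):
    # Inverse-transform sampling: if U ~ Uniform(0, 1], then c * U**(-1/alpha)
    # has the Pareto(c, alpha) distribution.
    return [c * (1.0 - rng.random()) ** (-1.0 / alpha) for _ in range(n)]

def alpha_ml(xs, c):
    # Example 2.7: n / sum(log(X_i / c))
    return len(xs) / sum(math.log(x / c) for x in xs)

def alpha_mm(xs, c):
    # Example 2.9: Xbar / (Xbar - c)
    xbar = sum(xs) / len(xs)
    return xbar / (xbar - c)

rng = random.Random(1)
c, alpha = 1.0, 3.0
for n in (100, 10000):
    xs = pareto_sample(n, c, alpha, rng)
    print(n, alpha_ml(xs, c), alpha_mm(xs, c))
# Both estimators settle near alpha = 3 as n grows, but on any finite sample
# they give different numbers: the ML and MM estimators do not coincide here.
```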
A more useful result would be to know the circumstances under which there exists a sequence $a_n$, which may be stochastic or non-stochastic, such that the distribution $F_{a_n Z_n}(z)$ of $a_n Z_n$ converges to a non-degenerate $F(z)$; that is, $F_{a_n Z_n}(z) \to F(z)$ as $n \to \infty$. Formally, we say that a sequence $\{X_n\}$ of stochastic variables with distribution functions $F_n$ converges in distribution to the stochastic variable $X$ with distribution function $F$ if $F_n(x) \to F(x)$ at every point $x$ where $F$ is continuous. In this case we write $X_n \stackrel{d}{\to} X$.

Theorem 2.7. Let $X : \Omega \to \mathbb{R}$ be a stochastic variable and $h : \mathbb{R} \to [0, \infty)$ be such that $E(h(X)) < \infty$. Then, for all $M > 0$,

$$P(\{\omega : h(X(\omega)) \geq M\}) \leq \frac{E(h(X))}{M}.$$

Proof. Let $A_M = \{\omega : h(X(\omega)) \geq M\}$ and note that for all $\omega \in \Omega$ we have $h(X(\omega)) \geq M I_{A_M}(\omega)$. Hence, $E(h(X)) \geq M P(A_M)$.

If we take $h(x) = |x|$ in Theorem 2.7 we conclude that

$$P(\{\omega : |X(\omega)| \geq M\}) \leq \frac{E(|X|)}{M},$$

provided $E(|X|) < \infty$. This is called Markov's inequality. Also, if we take $h(x) = (x - E(X))^2$ in Theorem 2.7 (with threshold $M^2$) we conclude that

$$P(\{\omega : |X(\omega) - E(X)| \geq M\}) \leq \frac{E((X - E(X))^2)}{M^2} = \frac{V(X)}{M^2},$$

provided $V(X) < \infty$. Thus we have the following corollary, called the Bienaymé-Chebyshev inequality.

Corollary 2.1. Let $h(x) = (x - E(X))^2$ in Theorem 2.7. Then,

$$P(|X - E(X)| \geq M) \leq \frac{V(X)}{M^2}.$$
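A quick numeric sanity check of Corollary 2.1 (not part of the notes) using exact rational arithmetic for a fair six-sided die:

```python
from fractions import Fraction

# Exact check of the Bienayme-Chebyshev inequality (Corollary 2.1) for a fair
# six-sided die. Illustrative sketch, not from the notes.
outcomes = [Fraction(k) for k in range(1, 7)]
p = Fraction(1, 6)

mean = sum(p * x for x in outcomes)               # 7/2
var = sum(p * (x - mean) ** 2 for x in outcomes)  # 35/12

M = Fraction(2)
exact = sum(p for x in outcomes if abs(x - mean) >= M)  # P(|X - 7/2| >= 2)
bound = var / M ** 2                                    # 35/48

print(exact, bound)  # 1/3 <= 35/48, so the bound holds
assert exact <= bound
```

As is typical of Chebyshev-type bounds, the inequality holds but is far from tight: the exact tail probability is 1/3 while the bound is 35/48.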
More information