LT 2202 Lecture 3 Random variables January 26, 2012
Recap of lecture 2
Basic laws of probability:
0 ≤ P(A) ≤ 1 for every event A
P(Ω) = 1
P(A ∪ B) = P(A) + P(B) if A and B are disjoint
Conditional probability: P(A|B) = P(A ∩ B)/P(B)
Bayes: P(B|A) = P(A|B)P(B)/P(A)
Independence if P(A ∩ B) = P(A) · P(B)
Today's lecture
Random variables
Probability distributions
Expectations and variances
The zoo of probability distributions
Random variables, informally
A variable that takes different values with different probabilities
The amount I win when buying a lottery ticket
The number of heads when throwing a coin n times
The gender of a newborn baby
The number of words in a random English sentence
The initial word in a random English sentence
Random variables, informally
The outcomes:
If I win, I'll get 1,000,000 SEK
Otherwise, I get nothing
P(nothing) = 0.99999
P(1,000,000 SEK) = 0.00001
P(something else) = 0
Throwing a coin twice
The number of heads when throwing a coin twice
The outcomes: (H,H), (H,T), (T,H), (T,T)
P(none) = 1/4
P(one) = 2/4
P(two) = 1/4
Random variables, formally
Definition: A random variable is a function from the sample space Ω to some set of values
Also called a stochastic variable (στοχαστικός)
Example: the number of heads
Describing a random variable
When discussing a random variable, we need to describe which values it takes and with which probabilities: the distribution
The probability mass function
For all values x that a random variable X may take, we define the function
p_X(x) = P(X takes the value x)
This is called the probability mass function (pmf) of X
Coins again
X = the number of heads when throwing a coin twice
p_X(0) = P(X = 0) = 1/4
p_X(1) = P(X = 1) = 2/4
p_X(2) = P(X = 2) = 1/4
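As a quick illustration (not part of the original slides), a minimal Python simulation that estimates this pmf by repeated sampling; the trial count and names are my own choices:

```python
# A minimal sketch: estimate the pmf of X = number of heads in two fair
# coin flips by simulation, and compare with the exact values 1/4, 2/4, 1/4.
import random
from collections import Counter

trials = 100_000
counts = Counter(sum(random.random() < 0.5 for _ in range(2)) for _ in range(trials))

for k in (0, 1, 2):
    print(f"p_X({k}) ~ {counts[k] / trials:.3f}")  # roughly 0.25, 0.50, 0.25
```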
Exams
X = the number of times I have to go to the exam (suppose I pass each attempt with probability 0.6, independently)
p_X(1) = P(X = 1) = 0.6
p_X(2) = P(X = 2) = 0.4 · 0.6
...
p_X(k) = P(X = k) = 0.4^(k−1) · 0.6
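A sketch of this example in Python (assuming, as the pmf above implies, a pass probability of 0.6 per attempt, independent across attempts; the function name is mine):

```python
# Simulate "retake the exam until passing" and check the simulated
# frequencies against the pmf p_X(k) = 0.4**(k-1) * 0.6.
import random
from collections import Counter

def attempts_until_pass(p=0.6):
    k = 1
    while random.random() >= p:  # this attempt fails with probability 1 - p
        k += 1
    return k

trials = 100_000
counts = Counter(attempts_until_pass() for _ in range(trials))
for k in range(1, 5):
    print(k, counts[k] / trials, 0.4 ** (k - 1) * 0.6)  # simulated vs. exact
```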
Independent random variables
We saw independent events previously
We may introduce a similar concept for random variables:
X and Y are independent random variables if
P(X = x, Y = y) = P(X = x) · P(Y = y) = p_X(x) · p_Y(y)
for all possible values x, y of X and Y.
Expectations: averages
We are rolling a die repeatedly. What will be the average result?
We are throwing a coin. If heads, we pay $10, if tails, we get $5. What is the average gain?
Expectations: averages
We are throwing a coin. If heads, we pay $10, if tails, we get $5. What is the average gain?
P($5) = 0.5
P(−$10) = 0.5
0.5 · $5 + 0.5 · (−$10) = −$2.50
The expected value
If X is a random variable taking numeric values, then
E[X] = Σ_k k · p_X(k)
is called the expected value, mean value, or expectation of X.
Dice rolling
X is the result of a die roll:
E[X] = Σ_k k · p_X(k) = (1 + ... + 6)/6 = 3.5
Coins again
X = the number of heads when throwing a coin twice
p_X(0) = P(X = 0) = 1/4
p_X(1) = P(X = 1) = 2/4
p_X(2) = P(X = 2) = 1/4
E[X] = Σ_k k · p_X(k) = 0 · 1/4 + 1 · 2/4 + 2 · 1/4 = 1
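The definition translates directly into code; a small sketch of my own, reusing the two examples above:

```python
# E[X] = sum over k of k * p_X(k), computed from a pmf given as a dict.
def expectation(pmf):
    return sum(k * p for k, p in pmf.items())

die = {k: 1 / 6 for k in range(1, 7)}   # fair die roll
heads = {0: 1 / 4, 1: 2 / 4, 2: 1 / 4}  # heads in two coin flips

print(expectation(die))    # 3.5
print(expectation(heads))  # 1.0
```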
Interpretation
Some interesting facts about expectations
The expectation is linear: E[aX + b] = a · E[X] + b
We can sum expectations directly: E[X + Y] = E[X] + E[Y]
Expectations of functions
If Y is a function of X, e.g. Y = g(X), then
E[Y] = Σ_k g(k) · p_X(k)
Example
X = the result of a die roll
Y = amount won or lost depending on X:
1: Lose $20
2: Lose $5
3: Nothing
4: Win $1
5: Win $2
6: Win $20
Example
E[Y] = Σ_k g(k) · p_X(k)
E[Y] = g(1) · p_X(1) + ... + g(6) · p_X(6)
E[Y] = (−20 − 5 + 0 + 1 + 2 + 20) · 1/6 = −2/6 ≈ −0.33
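The same computation as a sketch in Python (the dict g encodes the winnings table from the previous slide):

```python
# E[Y] = sum over k of g(k) * p_X(k) for Y = g(X), X a fair die roll.
g = {1: -20, 2: -5, 3: 0, 4: 1, 5: 2, 6: 20}  # winnings for each die result
p = 1 / 6                                     # uniform pmf of the die

expected_gain = sum(g[k] * p for k in g)
print(expected_gain)  # -2/6, approximately -0.33
```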
Variance
If m = E[X], then
V(X) = E[(X − m)²]
is called the variance of X, and
D(X) = √V(X)
is called the standard deviation of X.
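A minimal sketch of these definitions, applied to a fair die roll (the helper names are my own):

```python
# V(X) = E[(X - m)^2] with m = E[X]; D(X) = sqrt(V(X)).
import math

def expectation(pmf):
    return sum(k * p for k, p in pmf.items())

def variance(pmf):
    m = expectation(pmf)
    return sum((k - m) ** 2 * p for k, p in pmf.items())

die = {k: 1 / 6 for k in range(1, 7)}
print(variance(die))             # 35/12, approximately 2.92
print(math.sqrt(variance(die)))  # standard deviation, approximately 1.71
```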
Example (standard deviation)
The zoo of probability distributions
We'll exemplify a few. Try to remember the type cases! See table on course page
Examples:
Coin tossing with an uneven coin: Bernoulli
Die rolling: Uniform
Counting errors/successes: Binomial
Trying until success: Geometric
Word frequencies: Zipf
The Bernoulli distribution
Assume that we throw an unbalanced coin giving heads (1) with probability p and tails (0) with probability 1 − p:
p_X(0) = 1 − p
p_X(1) = p
Properties of the Bernoulli distribution
Expectation and variance (see exercise):
E[X] = p
V(X) = p(1 − p)
The uniform distribution: die rolling
p_X(k) = 1/n for k = 1, ..., n
E[X] = (n + 1)/2
Counting: the binomial distribution
The number of heads when throwing a coin (heads probability p) n times
n=1: H, T
n=2: (H,H), (H,T), (T,H), (T,T)
n=3: (H,H,H), (H,H,T), (H,T,H), (H,T,T), (T,H,H), (T,H,T), (T,T,H), (T,T,T)
Counting: the binomial distribution
The probability of getting k heads?
n=3, k=1: (H,T,T), (T,H,T), (T,T,H)
Each of these outcomes has probability p^k (1 − p)^(n−k)
How many such outcomes? The number of ways to pick k items from a set of n
Picking k items out of n
The number of ways to pick k items from a set of n is called the binomial coefficient:
(n choose k) = n! / (k! (n − k)!)
The binomial distribution
We get the pmf:
p_X(k) = (n choose k) p^k (1 − p)^(n−k)
The expectation: (think about it!)
E[X] = np
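A sketch of the binomial pmf in Python, checked against a coin-flip simulation (the parameters n = 10 and p = 0.3 are arbitrary choices of mine):

```python
import math
import random
from collections import Counter

n, p = 10, 0.3

def binom_pmf(k):
    # (n choose k) * p^k * (1 - p)^(n - k)
    return math.comb(n, k) * p ** k * (1 - p) ** (n - k)

trials = 100_000
counts = Counter(sum(random.random() < p for _ in range(n)) for _ in range(trials))
for k in range(n + 1):
    print(k, round(binom_pmf(k), 4), round(counts[k] / trials, 4))

print(sum(k * binom_pmf(k) for k in range(n + 1)))  # E[X] = n * p = 3.0
```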
The geometric distribution
Trying until success
p_X(1) = P(X = 1) = p
p_X(2) = P(X = 2) = (1 − p) · p
...
p_X(k) = P(X = k) = (1 − p)^(k−1) · p
E[X] = 1/p
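To make E[X] = 1/p plausible, a quick numerical check (truncating the infinite sum at 1000 terms is my choice; the tail is negligible for these values of p):

```python
# E[X] = sum over k >= 1 of k * (1 - p)**(k - 1) * p, which should equal 1/p.
for p in (0.6, 0.25):
    approx = sum(k * (1 - p) ** (k - 1) * p for k in range(1, 1001))
    print(p, approx, 1 / p)  # the truncated sum matches 1/p closely
```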
The Zipf distribution
For many phenomena in language, frequency tends to be approximately inversely proportional to frequency rank
Assuming such a distribution, the probability of drawing the kth most common item is
p_X(k) = C/k
(C is a normalizer so that the sum is 1.)
Must assume a finite vocabulary!
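A sketch of this truncated Zipf pmf, with the normalizer computed explicitly (the vocabulary size is one of the figures used on the next slide):

```python
# p_X(k) = C / k over a finite vocabulary of size V, where C is chosen
# so that the probabilities sum to 1.
V = 25_000
C = 1 / sum(1 / k for k in range(1, V + 1))

def zipf_pmf(k):
    return C / k

print(zipf_pmf(1), zipf_pmf(2))  # the most common item is twice as likely as the second
print(sum(zipf_pmf(k) for k in range(1, V + 1)))  # 1.0
```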
Example: the Codex Argenteus
Corpus size: about 60,000 words
How many hapax legomena? About 9%
A Zipf distribution with vocabulary size 25,000 would give us around 10%; with 50,000, around 15%
Continuous distributions
So far, our random variables have taken discrete values (1, 2, 3, ... or a, b, ...)
We may also define random variables that take continuous values
This requires slightly different machinery
Probability density function
For a continuous random variable X, we don't use a pmf
Instead we use a probability density function (pdf) f_X(x) such that
P(a < X ≤ b) = ∫_a^b f_X(x) dx
(Note that f_X(x) is not a probability by itself!)
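A sketch of this idea, using a simple made-up pdf f_X(x) = 2x on [0, 1] and a crude midpoint Riemann sum in place of the integral:

```python
def f(x):
    # pdf of an (assumed) continuous variable on [0, 1]; note f(1) = 2 > 1,
    # so the density value itself is not a probability.
    return 2 * x if 0 <= x <= 1 else 0.0

def prob(a, b, steps=100_000):
    # P(a < X <= b) approximated by a midpoint Riemann sum of f over [a, b].
    dx = (b - a) / steps
    return sum(f(a + (i + 0.5) * dx) for i in range(steps)) * dx

print(prob(0.0, 1.0))  # ~1.0: the whole density integrates to 1
print(prob(0.0, 0.5))  # ~0.25
```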
Expectations and such
The expectation (and most other stuff as well) becomes similar; we just need to replace sums with integrals:
Discrete: E[X] = Σ_k k · p_X(k)
Continuous: E[X] = ∫ x · f_X(x) dx
The normal distribution (Gaussian)
The normal distribution has the following pdf:
f_X(x) = (1 / (σ√(2π))) · e^(−(x − m)² / (2σ²))
E[X] = m
V(X) = σ²
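A sketch of this pdf in Python, with numerical checks of the total mass and the mean (the Riemann-sum check over [m − 10σ, m + 10σ] and the example parameters are my own additions):

```python
import math

def normal_pdf(x, m=0.0, sigma=1.0):
    return math.exp(-((x - m) ** 2) / (2 * sigma ** 2)) / (sigma * math.sqrt(2 * math.pi))

m, sigma = 1.5, 2.0  # arbitrary example parameters
lo, hi, steps = m - 10 * sigma, m + 10 * sigma, 200_000
dx = (hi - lo) / steps
xs = [lo + (i + 0.5) * dx for i in range(steps)]

print(sum(normal_pdf(x, m, sigma) for x in xs) * dx)      # ~1.0 (total mass)
print(sum(x * normal_pdf(x, m, sigma) for x in xs) * dx)  # ~m = 1.5 (mean)
```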
Some nice facts about the normal distribution
The normal distribution is very common in statistics because:
Often seen in nature (height, test scores, ...)
Has nice mathematical properties
Closed under scaling, translation, and sums
Central limit theorem: if we sum (or average) a large number of i.i.d. variables, the sum is approximately normal
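A small simulation sketch of the central limit theorem (n, the trial count, and the binning are my choices):

```python
# Averages of n i.i.d. uniform(0, 1) variables concentrate around 0.5 and
# their histogram looks increasingly bell-shaped as n grows.
import random
from collections import Counter

def avg_of_uniforms(n):
    return sum(random.random() for _ in range(n)) / n

n, trials = 30, 50_000
samples = [avg_of_uniforms(n) for _ in range(trials)]
bins = Counter(round(s, 1) for s in samples)
for b in sorted(bins):
    print(b, bins[b] / trials)  # peaked and roughly symmetric around 0.5
```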
Central limit theorem
Normal approximation of binomial
For large values of n, the binomial distribution can be approximated with a normal distribution (why?)
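One way to see why (a sketch, with parameters of my choosing): a Binomial(n, p) variable is a sum of n i.i.d. Bernoulli(p) variables, so by the central limit theorem it is approximately normal with mean np and variance np(1 − p):

```python
import math

n, p = 100, 0.3
mean, var = n * p, n * p * (1 - p)  # 30 and 21

def binom_pmf(k):
    return math.comb(n, k) * p ** k * (1 - p) ** (n - k)

def normal_pdf(x):
    return math.exp(-((x - mean) ** 2) / (2 * var)) / math.sqrt(2 * math.pi * var)

for k in (20, 25, 30, 35, 40):
    print(k, round(binom_pmf(k), 4), round(normal_pdf(k), 4))  # the two agree closely
```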
Recap
Random variables: variables taking values with some probability
Probability distributions
Probability mass function
Expectations (means) and variances
Distributions: try to remember the type cases
See table on course page