Probability Theory. Probability and Statistics for Data Science CSE594 - Spring 2016

Probability Theory Probability and Statistics for Data Science CSE594 - Spring 2016

What is Probability? 2

What is Probability? Examples: outcome of flipping a coin (seminal example); amount of snowfall; mentioning a word; mentioning a word a lot. 3

What is Probability? The chance that something will happen. Given infinite observations of an event, the proportion of observations where a given outcome happens. Strength of belief that something is true. Mathematical language for quantifying uncertainty - Wasserman 4

Probability (review) Ω: Sample Space, set of all outcomes of a random experiment. A: Event (A ⊆ Ω), collection of possible outcomes of an experiment. P(A): Probability of event A; P is a function: events → R 5

Probability (review) Ω: Sample Space, set of all outcomes of a random experiment. A: Event (A ⊆ Ω), collection of possible outcomes of an experiment. P(A): Probability of event A; P is a function: events → R P(Ω) = 1 P(A) ≥ 0, for all A If A_1, A_2, ... are disjoint events then P(∪_i A_i) = Σ_i P(A_i) 6

Probability (review) Ω: Sample Space, set of all outcomes of a random experiment. A: Event (A ⊆ Ω), collection of possible outcomes of an experiment. P(A): Probability of event A; P is a function: events → R P is a probability measure, if and only if: P(Ω) = 1 P(A) ≥ 0, for all A If A_1, A_2, ... are disjoint events then P(∪_i A_i) = Σ_i P(A_i) 7

Probability Examples: outcome of flipping a coin (seminal example); amount of snowfall; mentioning a word; mentioning a word a lot. 8

Probability (review) Some Properties: If B ⊆ A then P(A) ≥ P(B). P(A ∪ B) ≤ P(A) + P(B). P(A ∩ B) ≤ min(P(A), P(B)). P(¬A) = P(Ω \ A) = 1 - P(A), where \ is set difference. P(A ∩ B) will be notated as P(A, B). 9

Probability (Review) Independence Two Events: A and B Does knowing something about A tell us whether B happens (and vice versa)? 10

Probability (Review) Independence Two Events: A and B Does knowing something about A tell us whether B happens (and vice versa)? A: first flip of a fair coin; B: second flip of the same fair coin A: mention or not of the word happy B: mention or not of the word birthday 11

Probability (Review) Independence Two Events: A and B Does knowing something about A tell us whether B happens (and vice versa)? A: first flip of a fair coin; B: second flip of the same fair coin A: mention or not of the word happy B: mention or not of the word birthday Two events, A and B, are independent iff P(A, B) = P(A)P(B) 12

Probability (Review) Conditional Probability: P(A | B) = P(A, B) / P(B) 13

Probability (Review) Conditional Probability: P(A | B) = P(A, B) / P(B) H: mention happy in message, m B: mention birthday in message, m P(H) = .01 P(B) = .001 P(H, B) = .0005 P(H | B) = ?? 14

Probability (Review) Conditional Probability: P(A | B) = P(A, B) / P(B) H: mention happy in message, m B: mention birthday in message, m P(H) = .01 P(B) = .001 P(H, B) = .0005 P(H | B) = .50 H1: first flip of a fair coin is heads H2: second flip of the same coin is heads P(H2) = 0.5 P(H1) = 0.5 P(H2, H1) = 0.25 P(H2 | H1) = 0.5 15

Probability (Review) Conditional Probability: P(A | B) = P(A, B) / P(B) H1: first flip of a fair coin is heads H2: second flip of the same coin is heads P(H2) = 0.5 P(H1) = 0.5 P(H2, H1) = 0.25 P(H2 | H1) = 0.5 Two events, A and B, are independent iff P(A, B) = P(A)P(B) P(A, B) = P(A)P(B) iff P(A | B) = P(A) 16

Probability (Review) Conditional Probability: P(A | B) = P(A, B) / P(B) H1: first flip of a fair coin is heads H2: second flip of the same coin is heads P(H2) = 0.5 P(H1) = 0.5 P(H2, H1) = 0.25 P(H2 | H1) = 0.5 Two events, A and B, are independent iff P(A, B) = P(A)P(B) P(A, B) = P(A)P(B) iff P(A | B) = P(A) Interpretation of Independence: Observing B has no effect on the probability of A. 17
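
A minimal Python sketch of this independence claim (illustrative only; the simulation and variable names are not from the slides): simulate pairs of fair coin flips and compare P(H2) with P(H2 | H1).

import random

random.seed(0)
n = 100_000
flips = [(random.random() < 0.5, random.random() < 0.5) for _ in range(n)]  # (H1, H2) pairs

p_h1 = sum(h1 for h1, _ in flips) / n                 # estimate of P(H1)
p_h2 = sum(h2 for _, h2 in flips) / n                 # estimate of P(H2)
p_h1_h2 = sum(h1 and h2 for h1, h2 in flips) / n      # estimate of P(H1, H2)
print(p_h2, p_h1_h2 / p_h1)   # P(H2) and P(H2 | H1) are both close to 0.5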

Why Probability? 18

Why Probability? A formalism for making sense of the world. To quantify uncertainty: Should we believe something or not? Is it a meaningful difference? To be able to generalize from one situation or point in time to another: Can we rely on some information? What is the chance Y happens? To organize data into meaningful groups or dimensions: Where does X belong? What words are similar to X? 19

Random Variables X: A mapping from Ω to R that describes the question we care about in practice. 20

Random Variables X: A mapping from Ω to R that describes the question we care about in practice. Example: Ω = 5 coin tosses = {<HHHHH>, <HHHHT>, <HHHTH>, ... } 21

Random Variables X: A mapping from Ω to R that describes the question we care about in practice. Example: Ω = 5 coin tosses = {<HHHHH>, <HHHHT>, <HHHTH>, ... } We may just care about how many tails? Thus, X(<HHHHH>) = 0 X(<HHHTH>) = 1 X(<TTTHT>) = 4 X(<HTTTT>) = 4 22

Random Variables X: A mapping from Ω to R that describes the question we care about in practice. Example: Ω = 5 coin tosses = {<HHHHH>, <HHHHT>, <HHHTH>, ... } We may just care about how many tails? Thus, X(<HHHHH>) = 0 X(<HHHTH>) = 1 X(<TTTHT>) = 4 X(<HTTTT>) = 4 X only has 6 possible values: 0, 1, 2, 3, 4, 5 23

Random Variables X: A mapping from Ω to R that describes the question we care about in practice. Example: Ω = 5 coin tosses = {<HHHHH>, <HHHHT>, <HHHTH>, ... } We may just care about how many tails? Thus, X(<HHHHH>) = 0 X(<HHHTH>) = 1 X(<TTTHT>) = 4 X(<HTTTT>) = 4 X only has 6 possible values: 0, 1, 2, 3, 4, 5 What is the probability that we end up with k = 4 tails? P(X(ω) = k) where ω ∈ Ω 24

Random Variables X: A mapping from Ω to R that describes the question we care about in practice. Example: Ω = 5 coin tosses = {<HHHHH>, <HHHHT>, <HHHTH>, ... } We may just care about how many tails? Thus, X(<HHHHH>) = 0 X(<HHHTH>) = 1 X(<TTTHT>) = 4 X(<HTTTT>) = 4 X only has 6 possible values: 0, 1, 2, 3, 4, 5 What is the probability that we end up with k = 4 tails? P(X = k) := P( {ω : X(ω) = k} ) where ω ∈ Ω 25

Random Variables X: A mapping from Ω to R that describes the question we care about in practice. Example: Ω = 5 coin tosses = {<HHHHH>, <HHHHT>, <HHHTH>, ... } We may just care about how many tails? Thus, X(<HHHHH>) = 0 X(<HHHTH>) = 1 X(<TTTHT>) = 4 X(<HTTTT>) = 4 X only has 6 possible values: 0, 1, 2, 3, 4, 5 What is the probability that we end up with k = 4 tails? P(X = k) := P( {ω : X(ω) = k} ) where ω ∈ Ω X(ω) = 4 for 5 out of the 32 outcomes in Ω. Thus, assuming a fair coin, P(X = 4) = 5/32 26

Random Variables X: A mapping from Ω to R that describes the question we care about in practice. Example: Ω = 5 coin tosses = {<HHHHH>, <HHHHT>, <HHHTH>, ... } We may just care about how many tails? Thus, X(<HHHHH>) = 0 X(<HHHTH>) = 1 X(<TTTHT>) = 4 X(<HTTTT>) = 4 X only has 6 possible values: 0, 1, 2, 3, 4, 5 What is the probability that we end up with k = 4 tails? P(X = k) := P( {ω : X(ω) = k} ) where ω ∈ Ω X(ω) = 4 for 5 out of the 32 outcomes in Ω. Thus, assuming a fair coin, P(X = 4) = 5/32 (Not a variable, but a function that we end up notating a lot like a variable) 27

Random Variables X: A mapping from Ω to R that describes the question we care about in practice. Example: Ω = 5 coin tosses = {<HHHHH>, <HHHHT>, <HHHTH>, ... } We may just care about how many tails? Thus, X(<HHHHH>) = 0 X(<HHHTH>) = 1 X(<TTTHT>) = 4 X(<HTTTT>) = 4 X is a discrete random variable if it takes only a countable number of values. X only has 6 possible values: 0, 1, 2, 3, 4, 5 What is the probability that we end up with k = 4 tails? P(X = k) := P( {ω : X(ω) = k} ) where ω ∈ Ω X(ω) = 4 for 5 out of the 32 outcomes in Ω. Thus, assuming a fair coin, P(X = 4) = 5/32 (Not a variable, but a function that we end up notating a lot like a variable) 28
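
The 5/32 figure can be checked directly by enumerating Ω in Python (a small illustrative sketch, not part of the slides):

from itertools import product

omega = list(product("HT", repeat=5))     # all 32 outcomes of 5 coin tosses
X = lambda w: w.count("T")                # X(ω) = number of tails

# P(X = 4) under a fair coin: count the outcomes that X maps to 4, divide by |Ω|
print(sum(1 for w in omega if X(w) == 4) / len(omega))   # 0.15625 = 5/32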

Random Variables X: A mapping from Ω to R that describes the question we care about in practice. X is a continuous random variable if it can take on an infinite number of values between any two given values. X is a discrete random variable if it takes only a countable number of values. 29

Random Variables X: A mapping from Ω to R that describes the question we care about in practice. Example: Ω = inches of snowfall = [0, ∞) ⊆ R X is a continuous random variable if it can take on an infinite number of values between any two given values. 30

Random Variables X: A mapping from Ω to R that describes the question we care about in practice. Example: Ω = inches of snowfall = [0, ∞) ⊆ R X is a continuous random variable if it can take on an infinite number of values between any two given values. X: amount of inches in a snowstorm X(ω) = ω 31

Random Variables X: A mapping from Ω to R that describes the question we care about in practice. Example: Ω = inches of snowfall = [0, ∞) ⊆ R X is a continuous random variable if it can take on an infinite number of values between any two given values. X: amount of inches in a snowstorm X(ω) = ω What is the probability we receive (at least) a inches? P(X ≥ a) := P( {ω : X(ω) ≥ a} ) What is the probability we receive between a and b inches? P(a ≤ X ≤ b) := P( {ω : a ≤ X(ω) ≤ b} ) 32

Random Variables X: A mapping from Ω to R that describes the question we care about in practice. Example: Ω = inches of snowfall = [0, ∞) ⊆ R X is a continuous random variable if it can take on an infinite number of values between any two given values. X: amount of inches in a snowstorm X(ω) = ω P(X = i) := 0, for all i ∈ Ω What is the probability we receive (at least) a inches? P(X ≥ a) := P( {ω : X(ω) ≥ a} ) (probability of receiving exactly i inches of snowfall is zero) What is the probability we receive between a and b inches? P(a ≤ X ≤ b) := P( {ω : a ≤ X(ω) ≤ b} ) 33

Probability Review: 1-26 what constitutes a probability measure? independence conditional probability random variables discrete continuous 34

Language Models Review: 1-28 Why are language models (LMs) useful? Maximum Likelihood Estimation for Binomials Idea of Chain Rule, Markov assumptions Why is word sparsity an issue? Further interest: Laplace Smoothing, Good-Turing Smoothing, LMs in topic modeling. 35

Disjoint Sets vs. Independent Events Independence: iff P(A,B) = P(A)P(B) Disjoint Sets: If two events, A and B, come from disjoint sets, then P(A,B) = 0 36

Disjoint Sets vs. Independent Events Independence: iff P(A,B) = P(A)P(B) Disjoint Sets: If two events, A and B, come from disjoint sets, then P(A,B) = 0 Does independence imply disjoint? 37

Disjoint Sets vs. Independent Events Independence: iff P(A,B) = P(A)P(B) Disjoint Sets: If two events, A and B, come from disjoint sets, then P(A,B) = 0 Does independence imply disjoint? No Proof: A counterexample: A: first coin flip is heads, B: second coin flip is heads; P(A)P(B) = P(A,B), but .25 = P(A, B) ≠ 0 38

Disjoint Sets vs. Independent Events Independence: iff P(A,B) = P(A)P(B) Disjoint Sets: If two events, A and B, come from disjoint sets, then P(A,B) = 0 Does independence imply disjoint? No Proof: A counterexample: A: first coin flip is heads, B: second coin flip is heads; P(A)P(B) = P(A,B), but .25 = P(A, B) ≠ 0 Does disjoint imply independence? 39

Tools for Decomposing Probabilities Whiteboard Time! Table Tree Examples: urn with 3 balls (with and without replacement) conversation lengths championship bracket 40

Probabilities over >2 events... Independence: A_1, A_2, ..., A_n are independent iff P(A_1, A_2, ..., A_n) = ∏_i P(A_i) 41

Probabilities over >2 events... Independence: A_1, A_2, ..., A_n are independent iff P(A_1, A_2, ..., A_n) = ∏_i P(A_i) Conditional Probability: P(A_1, A_2, ..., A_{n-1} | A_n) = P(A_1, A_2, ..., A_{n-1}, A_n) / P(A_n) P(A_1, A_2, ..., A_{m-1} | A_m, A_{m+1}, ..., A_n) = P(A_1, A_2, ..., A_{m-1}, A_m, A_{m+1}, ..., A_n) / P(A_m, A_{m+1}, ..., A_n) (just think of multiple events happening as a single event) 42

Conditional Independence A and B are conditionally independent, given C, IFF P(A, B | C) = P(A | C)P(B | C) Equivalently, P(A | B, C) = P(A | C) Interpretation: Once we know C, B doesn't tell us anything useful about A. Example: Championship bracket 43

Bayes Theorem - Lite GOAL: Relate P(A | B) to P(B | A) Let's try: 44

Bayes Theorem - Lite GOAL: Relate P(A | B) to P(B | A) Let's try: (1) P(A | B) = P(A,B) / P(B), def. of conditional probability (2) P(B | A) = P(B,A) / P(A) = P(A,B) / P(A), def. of cond. prob.; symmetry of set intersection 45

Bayes Theorem - Lite GOAL: Relate P(A | B) to P(B | A) Let's try: (1) P(A | B) = P(A,B) / P(B), def. of conditional probability (2) P(B | A) = P(B,A) / P(A) = P(A,B) / P(A), def. of cond. prob.; symmetry of set intersection (3) P(A,B) = P(B | A)P(A), algebra on (2), known as the Multiplication Rule 46

Bayes Theorem - Lite GOAL: Relate P(A | B) to P(B | A) Let's try: (1) P(A | B) = P(A,B) / P(B), def. of conditional probability (2) P(B | A) = P(B,A) / P(A) = P(A,B) / P(A), def. of cond. prob.; symmetry of set intersection (3) P(A,B) = P(B | A)P(A), algebra on (2), known as the Multiplication Rule (4) P(A | B) = P(B | A)P(A) / P(B), substitute P(A,B) from (3) into (1) 47

Bayes Theorem - Lite GOAL: Relate P(A | B) to P(B | A) Let's try: (1) P(A | B) = P(A,B) / P(B), def. of conditional probability (2) P(B | A) = P(B,A) / P(A) = P(A,B) / P(A), def. of cond. prob.; symmetry of set intersection (3) P(A,B) = P(B | A)P(A), algebra on (2), known as the Multiplication Rule (4) P(A | B) = P(B | A)P(A) / P(B), substitute P(A,B) from (3) into (1) 48

Law of Total Probability and Bayes Theorem GOAL: Relate P(A_i | B) to P(B | A_i), for all i = 1 ... k, where A_1 ... A_k partition Ω 49

Law of Total Probability and Bayes Theorem GOAL: Relate P(A_i | B) to P(B | A_i), for all i = 1 ... k, where A_1 ... A_k partition Ω partition: A_1 ∪ A_2 ∪ ... ∪ A_k = Ω and P(A_i, A_j) = 0, for all i ≠ j 50

Law of Total Probability and Bayes Theorem GOAL: Relate P(A_i | B) to P(B | A_i), for all i = 1 ... k, where A_1 ... A_k partition Ω partition: A_1 ∪ A_2 ∪ ... ∪ A_k = Ω and P(A_i, A_j) = 0, for all i ≠ j law of total probability: If A_1 ... A_k partition Ω, then for any event B: P(B) = Σ_i P(B | A_i) P(A_i) 51

Law of Total Probability and Bayes Theorem GOAL: Relate P(A_i | B) to P(B | A_i), for all i = 1 ... k, where A_1 ... A_k partition Ω partition: A_1 ∪ A_2 ∪ ... ∪ A_k = Ω and P(A_i, A_j) = 0, for all i ≠ j law of total probability: If A_1 ... A_k partition Ω, then for any event B: P(B) = Σ_i P(B | A_i) P(A_i) 52

Law of Total Probability and Bayes Theorem GOAL: Relate P(A_i | B) to P(B | A_i), for all i = 1 ... k, where A_1 ... A_k partition Ω Let's try: 53

Law of Total Probability and Bayes Theorem GOAL: Relate P(A_i | B) to P(B | A_i), for all i = 1 ... k, where A_1 ... A_k partition Ω Let's try: (1) P(A_i | B) = P(A_i, B) / P(B) (2) P(A_i, B) / P(B) = P(B | A_i) P(A_i) / P(B), by the multiplication rule 54

Law of Total Probability and Bayes Theorem GOAL: Relate P(A_i | B) to P(B | A_i), for all i = 1 ... k, where A_1 ... A_k partition Ω Let's try: (1) P(A_i | B) = P(A_i, B) / P(B) (2) P(A_i, B) / P(B) = P(B | A_i) P(A_i) / P(B), by the multiplication rule but in practice, we might not know P(B) 55

Law of Total Probability and Bayes Theorem GOAL: Relate P(A_i | B) to P(B | A_i), for all i = 1 ... k, where A_1 ... A_k partition Ω Let's try: (1) P(A_i | B) = P(A_i, B) / P(B) (2) P(A_i, B) / P(B) = P(B | A_i) P(A_i) / P(B), by the multiplication rule but in practice, we might not know P(B) (3) P(B | A_i) P(A_i) / P(B) = P(B | A_i) P(A_i) / ( Σ_j P(B | A_j) P(A_j) ), by the law of total probability 56

Law of Total Probability and Bayes Theorem GOAL: Relate P(A_i | B) to P(B | A_i), for all i = 1 ... k, where A_1 ... A_k partition Ω Let's try: (1) P(A_i | B) = P(A_i, B) / P(B) (2) P(A_i, B) / P(B) = P(B | A_i) P(A_i) / P(B), by the multiplication rule but in practice, we might not know P(B) (3) P(B | A_i) P(A_i) / P(B) = P(B | A_i) P(A_i) / ( Σ_j P(B | A_j) P(A_j) ), by the law of total probability Thus, P(A_i | B) = P(B | A_i) P(A_i) / ( Σ_j P(B | A_j) P(A_j) ) 57
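
A small numeric sketch of this final formula; the three-event partition, priors, and likelihoods below are made up for illustration and are not from the slides.

# P(A_i | B) = P(B | A_i) P(A_i) / Σ_j P(B | A_j) P(A_j)
prior = [0.5, 0.3, 0.2]          # P(A_i) for a hypothetical 3-event partition (sums to 1)
likelihood = [0.9, 0.5, 0.1]     # made-up values of P(B | A_i)

p_b = sum(l * p for l, p in zip(likelihood, prior))            # law of total probability
posterior = [l * p / p_b for l, p in zip(likelihood, prior)]   # Bayes theorem: P(A_i | B)
print(p_b, posterior, sum(posterior))                          # the posterior sums to 1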

Probability Theory Review: 2-2 Conditional Independence How to derive Bayes Theorem Law of Total Probability Bayes Theorem in Practice 58

Working with data in python = refer to python notebook 59

Random Variables, Revisited X: A mapping from Ω to R that describes the question we care about in practice. X is a continuous random variable if it can take on an infinite number of values between any two given values. X is a discrete random variable if it takes only a countable number of values. 60

Random Variables, Revisited X: A mapping from Ω to R that describes the question we care about in practice. Example: Ω = inches of snowfall = [0, ∞) ⊆ R X is a continuous random variable if it can take on an infinite number of values between any two given values. X: amount of inches in a snowstorm X(ω) = ω P(X = i) := 0, for all i ∈ Ω What is the probability we receive (at least) a inches? P(X ≥ a) := P( {ω : X(ω) ≥ a} ) (probability of receiving exactly i inches of snowfall is zero) What is the probability we receive between a and b inches? P(a ≤ X ≤ b) := P( {ω : a ≤ X(ω) ≤ b} ) 61

Random Variables, Revisited X: A mapping from Ω to R that describes the question we care about in practice. Example: Ω = inches of snowfall = [0, ∞) ⊆ R X is a continuous random variable if it can take on an infinite number of values between any two given values. X: amount of inches in a snowstorm X(ω) = ω P(X = i) := 0, for all i ∈ Ω What is the probability we receive (at least) a inches? P(X ≥ a) := P( {ω : X(ω) ≥ a} ) How to model? What is the probability we receive between a and b inches? P(a ≤ X ≤ b) := P( {ω : a ≤ X(ω) ≤ b} ) (probability of receiving exactly i inches of snowfall is zero) 62

Continuous Random Variables Discretize them! (group into discrete bins) How to model? 63

Continuous Random Variables Discretize them! (group into discrete bins) How to model? Histograms 64

Continuous Random Variables 65

Continuous Random Variables P(bin=8) = .32 P(bin=12) = .08 66

Continuous Random Variables P(bin=8) = .32 P(bin=12) = .08 But aren't we throwing away information? 67
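
Discretizing a continuous variable into bins can be done with numpy's histogram; an illustrative sketch (the simulated "snowfall" sample is made up, not course data):

import numpy as np

rng = np.random.default_rng(0)
snowfall = rng.exponential(scale=4.0, size=1000)   # made-up sample of snowfall amounts

counts, edges = np.histogram(snowfall, bins=20)    # group into 20 discrete bins
p_bin = counts / counts.sum()                      # P(bin = k): fraction of points in bin k
print(p_bin[:5], edges[:6])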

Continuous Random Variables 68

Continuous Random Variables X is a continuous random variable if it can take on an infinite number of values between any two given values. X is a continuous random variable if there exists a function f_X such that: f_X(x) ≥ 0 for all x, ∫ f_X(x) dx = 1, and P(a ≤ X ≤ b) = ∫_a^b f_X(x) dx 69

Continuous Random Variables X is a continuous random variable if it can take on an infinite number of values between any two given values. X is a continuous random variable if there exists a function f_X such that: f_X(x) ≥ 0 for all x, ∫ f_X(x) dx = 1, and P(a ≤ X ≤ b) = ∫_a^b f_X(x) dx f_X: probability density function (pdf) 70

Continuous Random Variables X is a continuous random variable if it can take on an infinite number of values between any two given values. PDFs: X is a continuous random variable if there exists a function f_X such that: f_X(x) ≥ 0 for all x, ∫ f_X(x) dx = 1, and P(a ≤ X ≤ b) = ∫_a^b f_X(x) dx f_X: probability density function (pdf) 71

Continuous Random Variables 72

Continuous Random Variables 73

CRV Review: 2-4 Concept of PDF Formal definition of a pdf How to create a continuous random variable in python Plot Histograms Plot PDFs 74

Continuous Random Variables Common Trap: f_X(x) does not yield a probability; the integral of f_X over an interval does. f_X(x) may be any non-negative real number; thus, f_X(x) may be > 1 75

Continuous Random Variables Some Common Probability Density Functions 76

Continuous Random Variables Common pdfs: Normal(μ, σ²) = (1 / (σ √(2π))) exp( -(x - μ)² / (2σ²) ) 77

Continuous Random Variables Common pdfs: Normal(μ, σ²) = (1 / (σ √(2π))) exp( -(x - μ)² / (2σ²) ) μ: mean (or "center") = expectation σ²: variance, σ: standard deviation 78

Continuous Random Variables Common pdfs: Normal(μ, σ²) = (1 / (σ √(2π))) exp( -(x - μ)² / (2σ²) ) (figure credit: Wikipedia) μ: mean (or "center") = expectation σ²: variance, σ: standard deviation 79

Continuous Random Variables Common pdfs: Normal(μ, σ²) X ~ Normal(μ, σ²), examples: height; intelligence/ability; measurement error; averages (or sums) of lots of random variables 80

Continuous Random Variables Common pdfs: Normal(0, 1) (the "standard normal") How to standardize any normal distribution: subtract the mean, μ (aka "mean centering"); divide by the standard deviation, σ. z = (x - μ) / σ (aka "z score") Credit: MIT Open Courseware: Probability and Statistics 81
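
A short sketch of standardization in Python (the simulated "heights" and their parameters are made up for illustration):

import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
x = rng.normal(loc=170.0, scale=10.0, size=1000)   # e.g. heights ~ Normal(μ=170, σ=10)

z = (x - x.mean()) / x.std()    # subtract the mean, divide by the standard deviation
print(z.mean(), z.std())        # ≈ 0 and ≈ 1: the z scores follow the standard normal

print(norm.pdf(0.0))            # standard normal density at 0 (a density, not a probability)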

Continuous Random Variables Common pdfs: Normal(0, 1) Credit: MIT Open Courseware: Probability and Statistics 82

Continuous Random Variables Common pdfs: Uniform(a, b) = 1 / (b - a) for a ≤ x ≤ b (and 0 otherwise) 83

Continuous Random Variables Common pdfs: Uniform(a, b) = 1 / (b - a) for a ≤ x ≤ b (and 0 otherwise) X ~ Uniform(a, b), examples: spinner in a game; random number generator; analog-to-digital rounding error 84

Continuous Random Variables Common pdfs: Exponential(λ) = λ exp(-λx) for x ≥ 0 (figure credit: Wikipedia) λ: "rate" or inverse scale; 1/λ: scale 85

Continuous Random Variables Common pdfs: Exponential(λ) Credit: Wikipedia X ~ Exp(λ), examples: lifetime of electronics waiting times between rare events (e.g. waiting for a taxi) recurrence of words across documents 86

Continuous Random Variables How to decide which pdf is best for my data? Look at a non-parametric curve estimate: (If you have lots of data) Histogram Kernel Density Estimator 87

Continuous Random Variables How to decide which pdf is best for my data? Look at a non-parametric curve estimate (if you have lots of data): Histogram Kernel Density Estimator: f_hat(x) = (1 / (n h)) Σ_i K( (x - x_i) / h ) K: kernel function, h: bandwidth (for every data point, draw K and add to density) 88

Continuous Random Variables How to decide which pdf is best for my data? Look at a non-parametric curve estimate (if you have lots of data): Histogram Kernel Density Estimator: f_hat(x) = (1 / (n h)) Σ_i K( (x - x_i) / h ) K: kernel function, h: bandwidth (for every data point, draw K and add to density) 89

Continuous Random Variables 90

Continuous Random Variables Just like a pdf, this function takes in an x and returns the appropriate y on an estimated distribution curve. To figure out the y for a given x, take the sum of what each kernel (a density plot for each data point in the original X) puts at that x. 91
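
A minimal hand-rolled kernel density estimator matching this description (a sketch that assumes a Gaussian kernel; in practice one might instead use scipy.stats.gaussian_kde):

import numpy as np

def kde(x_grid, data, h):
    # For each grid point x, sum each kernel's contribution K((x - x_i) / h), then divide by n*h.
    K = lambda u: np.exp(-0.5 * u ** 2) / np.sqrt(2 * np.pi)   # Gaussian kernel
    return np.array([K((x - data) / h).sum() / (len(data) * h) for x in x_grid])

rng = np.random.default_rng(0)
data = rng.normal(size=200)
grid = np.linspace(-4, 4, 200)
density = kde(grid, data, h=0.4)
print(np.trapz(density, grid))    # ≈ 1, as a density estimate should integrate to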

Continuous Random Variables Analogies Funky dartboard Credit: MIT Open Courseware: Probability and Statistics 92

Continuous Random Variables Analogies Funky dartboard Random number generator 93

Cumulative Distribution Function Random number generator 94

Cumulative Distribution Function For a given random variable X, the cumulative distribution function (CDF), F_X: R → [0, 1], is defined by: F_X(x) = P(X ≤ x) 95

Cumulative Distribution Function For a given random variable X, the cumulative distribution function (CDF), F_X: R → [0, 1], is defined by: F_X(x) = P(X ≤ x) (plots: Uniform, Exponential, Normal CDFs) 96

Cumulative Distribution Function For a given random variable X, the cumulative distribution function (CDF), F_X: R → [0, 1], is defined by: F_X(x) = P(X ≤ x) (plots: Uniform, Exponential, Normal CDFs) 97

Cumulative Distribution Function For a given random variable X, the cumulative distribution function (CDF), F_X: R → [0, 1], is defined by: F_X(x) = P(X ≤ x) Pro: yields a probability! Con: not intuitively interpretable. (plots: Uniform, Exponential, Normal CDFs) 98
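
The CDF is also what lets us transform the output of a plain random number generator (a uniform distribution) into draws from another distribution, as mentioned in the review slide later; a small sketch using the Exponential CDF (the rate λ = 2 is an arbitrary choice):

import numpy as np

# If U ~ Uniform(0, 1) and F is a CDF, then F^{-1}(U) has CDF F.
# For Exponential(λ): F(x) = 1 - exp(-λx), so F^{-1}(u) = -ln(1 - u) / λ.
rng = np.random.default_rng(0)
lam = 2.0
u = rng.uniform(size=10_000)      # plain uniform random numbers
x = -np.log(1 - u) / lam          # exponential draws via the inverse CDF
print(x.mean())                   # ≈ 1 / λ = 0.5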

Random Variables, Revisited X: A mapping from Ω to R that describes the question we care about in practice. X is a continuous random variable if it can take on an infinite number of values between any two given values. X is a discrete random variable if it takes only a countable number of values. 99

Discrete Random Variables For a given random variable X, the cumulative distribution function (CDF), F_X: R → [0, 1], is defined by: F_X(x) = P(X ≤ x) X is a discrete random variable if it takes only a countable number of values. 100

Discrete Random Variables For a given random variable X, the cumulative distribution function (CDF), F_X: R → [0, 1], is defined by: F_X(x) = P(X ≤ x) (plots: Discrete Uniform; Binomial(n, p), whose CDF looks like the normal's) X is a discrete random variable if it takes only a countable number of values. 101

Discrete Random Variables For a given random variable X, the cumulative distribution function (CDF), F_X: R → [0, 1], is defined by: F_X(x) = P(X ≤ x) For a given discrete random variable X, the probability mass function (pmf), f_X: R → [0, 1], is defined by: f_X(x) = P(X = x) X is a discrete random variable if it takes only a countable number of values. 102

Discrete Random Variables (plot: Binomial(n, p)) For a given random variable X, the cumulative distribution function (CDF), F_X: R → [0, 1], is defined by: F_X(x) = P(X ≤ x) For a given discrete random variable X, the probability mass function (pmf), f_X: R → [0, 1], is defined by: f_X(x) = P(X = x) X is a discrete random variable if it takes only a countable number of values. 103

Discrete Random Variables (plot: Binomial(n, p)) For a given random variable X, the cumulative distribution function (CDF), F_X: R → [0, 1], is defined by: F_X(x) = P(X ≤ x) For a given discrete random variable X, the probability mass function (pmf), f_X: R → [0, 1], is defined by: f_X(x) = P(X = x) X is a discrete random variable if it takes only a countable number of values. 104

Discrete Random Variables (plot: Binomial(n, p)) Common Discrete Random Variables Binomial(n, p) example: number of heads after n coin flips (p, probability of heads) Bernoulli(p) = Binomial(1, p) example: one trial of success or failure 105

Discrete Random Variables (plot: Binomial(n, p)) Common Discrete Random Variables Binomial(n, p) example: number of heads after n coin flips (p, probability of heads) Bernoulli(p) = Binomial(1, p) example: one trial of success or failure Discrete Uniform(a, b) 106

Discrete Random Variables (plot: Binomial(n, p)) Common Discrete Random Variables Binomial(n, p) example: number of heads after n coin flips (p, probability of heads) Bernoulli(p) = Binomial(1, p) example: one trial of success or failure Discrete Uniform(a, b) Geometric(p): P(X = k) = p(1 - p)^(k-1), k ≥ 1 Geo(p) example: coin flips until first head 107

Discrete Random Variables (plot: Binomial(n, p)) Common Discrete Random Variables Binomial(n, p) example: number of heads after n coin flips (p, probability of heads) Bernoulli(p) = Binomial(1, p) example: one trial of success or failure Discrete Uniform(a, b) Geometric(p): P(X = k) = p(1 - p)^(k-1), k ≥ 1 Geo(p) example: coin flips until first head (plots: the above discrete random variables) 108
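
These pmfs are available in scipy.stats; a quick illustrative sketch (the parameter values are chosen arbitrarily):

from scipy.stats import binom, bernoulli, geom, randint

print(binom.pmf(4, n=5, p=0.5))        # Binomial(5, 0.5): P(X = 4) = 5/32
print(bernoulli.pmf(1, p=0.3))         # Bernoulli(p) = Binomial(1, p)
print(geom.pmf(3, p=0.5))              # Geometric: p * (1 - p)**(3 - 1) = 0.125
print(randint.pmf(2, low=1, high=7))   # Discrete Uniform on {1, ..., 6}: 1/6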

Maximum Likelihood Estimation (parameter estimation) Given data and a distribution, how does one choose the parameters? 109

Maximum Likelihood Estimation (parameter estimation) Given data and a distribution, how does one choose the parameters? likelihood function: L(θ) = ∏_i f(x_i; θ) maximum likelihood estimation: What is the θ that maximizes L? 110

Maximum Likelihood Estimation (parameter estimation) Given data and a distribution, how does one choose the parameters? likelihood function: L(θ) = ∏_i f(x_i; θ) log-likelihood function: l(θ) = log L(θ) = Σ_i log f(x_i; θ) maximum likelihood estimation: What is the θ that maximizes L? 111

Maximum Likelihood Estimation (parameter estimation) Given data and a distribution, how does one choose the parameters? likelihood function: L(θ) = ∏_i f(x_i; θ) log-likelihood function: l(θ) = log L(θ) = Σ_i log f(x_i; θ) maximum likelihood estimation: What is the θ that maximizes L? Example: X_1, X_2, ..., X_n ~ Bernoulli(p), then f(x; p) = p^x (1 - p)^(1-x), for x = 0, 1. 112

Maximum Likelihood Estimation (parameter estimation) Given data and a distribution, how does one choose the parameters? likelihood function: L(θ) = ∏_i f(x_i; θ) log-likelihood function: l(θ) = log L(θ) = Σ_i log f(x_i; θ) maximum likelihood estimation: What is the θ that maximizes L? Example: X_1, X_2, ..., X_n ~ Bernoulli(p), then f(x; p) = p^x (1 - p)^(1-x), for x = 0, 1. 113

Maximum Likelihood Estimation (parameter estimation) Given data and a distribution, how does one choose the parameters? likelihood function: L(θ) = ∏_i f(x_i; θ) log-likelihood function: l(θ) = log L(θ) = Σ_i log f(x_i; θ) maximum likelihood estimation: What is the θ that maximizes L? Example: X_1, X_2, ..., X_n ~ Bernoulli(p), then f(x; p) = p^x (1 - p)^(1-x), for x = 0, 1. 114

Maximum Likelihood Estimation (parameter estimation) Given data and a distribution, how does one choose the parameters? likelihood function: L(θ) = ∏_i f(x_i; θ) log-likelihood function: l(θ) = log L(θ) = Σ_i log f(x_i; θ) maximum likelihood estimation: What is the θ that maximizes L? Example: X_1, X_2, ..., X_n ~ Bernoulli(p), then f(x; p) = p^x (1 - p)^(1-x), for x = 0, 1. take the derivative and set to 0 to find: p_hat = (1/n) Σ_i x_i (the sample mean of the 0/1 outcomes) 115
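
A quick numeric check of this result (illustrative only; the data is simulated and p = 0.3 is arbitrary): the grid point that maximizes the log-likelihood matches the closed-form estimate, the sample mean.

import numpy as np

rng = np.random.default_rng(0)
x = rng.binomial(n=1, p=0.3, size=500)          # Bernoulli(p = 0.3) sample

def log_likelihood(p, x):
    # l(p) = Σ_i [ x_i log p + (1 - x_i) log(1 - p) ]
    return np.sum(x * np.log(p) + (1 - x) * np.log(1 - p))

grid = np.linspace(0.01, 0.99, 99)
p_best = grid[np.argmax([log_likelihood(p, x) for p in grid])]
print(x.mean(), p_best)    # agree (up to the 0.01 grid resolution)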

Probability Theory Review: 2-11 common pdfs: Normal, Uniform, Exponential how does kernel density estimation work? common pmfs: Binomial (Bernoulli), Discrete Uniform, Geometric cdfs (and how to transform output from a random number generator (i.e. uniform distribution) into another distribution) how to plot: pdfs, cdfs, and pmfs in python. MLE revisited: how to derive the parameter estimate from the likelihood function 116

Maximum Likelihood Estimation (parameter estimation) Given data and a distribution, how does one choose the parameters? likelihood function: L(θ) = ∏_i f(x_i; θ) log-likelihood function: l(θ) = log L(θ) = Σ_i log f(x_i; θ) maximum likelihood estimation: What is the θ that maximizes L? Example: X_1, X_2, ..., X_n ~ Bernoulli(p), then f(x; p) = p^x (1 - p)^(1-x), for x = 0, 1. take the derivative and set to 0 to find: p_hat = (1/n) Σ_i x_i (the sample mean of the 0/1 outcomes) 117

Maximum Likelihood Estimation Given data and a distribution, how does one choose the parameters? likelihood function: log-likelihood function: maximum likelihood estimation: What is the θ that maximizes L? Example: X ~ Normal(μ, σ), then GOAL: take the derivative and set to 0 to find: 118

Maximum Likelihood Estimation Given data and a distribution, how does one choose the parameters? likelihood function: log-likelihood function: maximum likelihood estimation: What is the θ that maximizes L? Example: X ~ Normal(μ, σ), then Normal pdf GOAL: take the derivative and set to 0 to find: 119

Maximum Likelihood Estimation Example: X ~ Normal(μ, σ), then GOAL: take the derivative and set to 0 to find: 120

Maximum Likelihood Estimation Example: X ~ Normal(μ, σ), then GOAL: take the derivative and set to 0 to find: 121

Maximum Likelihood Estimation Example: X ~ Normal(μ, σ), then first, we find μ using partial derivatives: GOAL: take the derivative and set to 0 to find: 122

Maximum Likelihood Estimation Example: X ~ Normal(μ, σ), then first, we find μ using partial derivatives: 123

Maximum Likelihood Estimation Example: X ~ Normal(μ, σ), then first, we find μ using partial derivatives: now σ: 124

Maximum Likelihood Estimation Example: X ~ Normal(μ, σ), then first, we find μ using partial derivatives: now σ: 125

Maximum Likelihood Estimation Example: X ~ Normal(μ, σ), then first, we find μ using partial derivatives: μ_hat = (1/n) Σ_i x_i (the sample mean) now σ: σ²_hat = (1/n) Σ_i (x_i - μ_hat)² (the sample variance) 126
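
In code, the two Normal MLEs are just the sample mean and the (1/n) sample variance; a small sketch with simulated data (the true μ = 2 and σ = 3 are arbitrary choices):

import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(loc=2.0, scale=3.0, size=1000)   # simulated Normal(μ=2, σ=3) data

mu_hat = x.mean()                          # MLE of μ: the sample mean
sigma2_hat = ((x - mu_hat) ** 2).mean()    # MLE of σ²: the 1/n sample variance
print(mu_hat, sigma2_hat)                  # close to 2 and 9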

Maximum Likelihood Estimation Try yourself: Example: X ~ Exponential(λ). Hint: you should arrive at something almost familiar; then recall that E[X] = 1/λ for the exponential. 127

Expectation, revisited Conceptually: Just given the distribution and no other information: what value should I expect? 128

Expectation, revisited Conceptually: Just given the distribution and no other information: what value should I expect? Formally: The expected value of X is E[X] = Σ_x x f_X(x) if X is discrete, or ∫ x f_X(x) dx if X is continuous; denoted: E[X] 129

Expectation, revisited Conceptually: Just given the distribution and no other information: what value should I expect? Formally: The expected value of X is E[X] = Σ_x x f_X(x) if X is discrete, or ∫ x f_X(x) dx if X is continuous; denoted: E[X], and called the expectation, mean, or first moment 130

Expectation, revisited Conceptually: Just given the distribution and no other information: what value should I expect? Formally: The expected value of X is E[X] = Σ_x x f_X(x) if X is discrete, or ∫ x f_X(x) dx if X is continuous; denoted: E[X], and called the expectation, mean, or first moment Alternative Conceptualization: If I had to summarize a distribution with only one number, what would do that best? (the average of a large number of randomly generated numbers from the distribution) 131

Expectation, revisited Examples: X ~ Bernoulli(p): E[X] = p X ~ Uniform(-3, 1): E[X] = (-3 + 1)/2 = -1 The expected value of X is E[X] = Σ_x x f_X(x) (discrete) or ∫ x f_X(x) dx (continuous); denoted: E[X] 132
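
Checking both examples in Python (a small sketch; p = 0.3 and the number of draws are arbitrary choices):

import numpy as np

# Discrete case: E[X] = Σ_x x * f_X(x).  Bernoulli(p): 0*(1 - p) + 1*p = p.
p = 0.3
print(sum(x * f for x, f in [(0, 1 - p), (1, p)]))      # 0.3

# Continuous case: E[X] = ∫ x f_X(x) dx.  Uniform(-3, 1): (a + b) / 2 = -1.
rng = np.random.default_rng(0)
print(rng.uniform(-3, 1, size=200_000).mean())          # ≈ -1 (average of many random draws)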

Probability Theory Review: 2-16 MLE over a continuous random variable mean and variance The concept of expectation Calculating expectation for discrete variables continuous variables 133