Sampling & Confidence Intervals

Mark Lunt
Arthritis Research UK Epidemiology Unit
University of Manchester
24/10/2017

Principles of Sampling
Often, it is not practical to measure every subject in a population. A reduced number of subjects, a sample, is measured instead. Measuring a sample is:
- Cheaper
- Quicker
- More thorough
The sample needs to be chosen in such a way as to be representative of the population.

Types of Sample
- Simple random
- Stratified
- Cluster
- Quota
- Convenience
- Systematic

Simple Random Sample
- Every subject has the same probability of being selected.
- This probability is independent of who else is in the sample.
- Needs a list of every subject in the population (the sampling frame).
- Statistical methods depend on the randomness of sampling.
- Refusals mean the sample is no longer random.

Stratified
- Divide the population into distinct sub-populations, e.g. by age band or by gender.
- Randomly sample from each sub-population:
  - the sampling probability is the same for everyone in a sub-population;
  - the sampling probability can differ between sub-populations.
- More efficient than a simple random sample if the variable of interest varies more between sub-populations than within sub-populations.
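The stratified procedure can be sketched in a few lines of Python. This is only an illustration: the function name `stratified_sample`, the age-band labels and the sampling fractions are all hypothetical, not taken from the slides.

```python
import random

def stratified_sample(population, stratum_of, fractions, seed=0):
    """Draw a simple random sample within each stratum.

    population: list of subjects
    stratum_of: function mapping a subject to its stratum label
    fractions:  dict mapping stratum label -> sampling fraction (0 to 1)
    """
    rng = random.Random(seed)
    # Group subjects by stratum
    strata = {}
    for subject in population:
        strata.setdefault(stratum_of(subject), []).append(subject)
    # Sample each stratum with its own sampling probability
    sample = []
    for label, members in strata.items():
        k = round(fractions[label] * len(members))
        sample.extend(rng.sample(members, k))
    return sample

# Hypothetical population of (id, age_band) pairs: 750 under 65, 250 aged 65+
population = [(i, "65_plus" if i % 4 == 0 else "under_65") for i in range(1000)]
sample = stratified_sample(
    population,
    stratum_of=lambda subject: subject[1],
    fractions={"under_65": 0.04, "65_plus": 0.20},  # assumed fractions
)
```

Here the smaller stratum is deliberately over-sampled, so any analysis would need to weight subjects by the inverse of their sampling probability.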

Cluster
- Randomly sample groups of subjects rather than individual subjects.
- Why?
  - A list of subjects is not available, but a list of groups is.
  - It is cheaper and easier to recruit a number of subjects at the same time.
  - In intervention studies, it may be easier to treat groups: randomise hospitals rather than patients.
- Need a reasonable number of clusters to assure representativeness.
- The more similar the clusters are, the better cluster sampling works.
- Cluster samples need special methods for analysis.

Quota
- A deliberate attempt to ensure that the proportions of subjects in each category in the sample match the proportions in the population.
- Often used in market research: quotas by age, gender, social status.
- Variables not used to define the quotas may be very different in the sample and the population.
- The proportions of men and of the elderly may be correct, but not the proportion of elderly men.
- The probability of inclusion is unknown and may vary greatly between categories.
- Cannot assume the sample is representative.

Systematic & Convenience Samples
Systematic
- Take every nth subject.
- If there is clustering (or periodicity) in the sampling frame, the sample may not be representative.
- Shared surnames can cause problems.
- Randomly ordering the frame and then taking every nth subject gives a random sample.
Convenience
- Take a random sample of easily accessible subjects.
- May not be representative of the entire population.
- E.g. people going to their G.P. with a sore throat are easy to identify, but are not representative of all people with a sore throat.
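A minimal sketch of systematic sampling, with and without first randomising the order of the sampling frame. The frame of 100 numbered subjects and the helper name `systematic_sample` are hypothetical, chosen only to make the periodicity problem visible.

```python
import random

def systematic_sample(frame, step, start=0):
    """Take every step-th subject from the sampling frame."""
    return frame[start::step]

# Hypothetical sampling frame: subjects numbered 0..99
frame = list(range(100))

# Plain systematic sample: every selected number ends in 0, so any
# pattern in the frame with period 10 is picked up by every subject.
plain = systematic_sample(frame, step=10)

# Randomly order the frame first, then take every 10th subject.
rng = random.Random(0)
shuffled = frame[:]
rng.shuffle(shuffled)
randomised = systematic_sample(shuffled, step=10)
```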

Estimating from Random Samples
- We are interested in what our sample tells us about the population.
- We use sample statistics to estimate population values.
- Need to keep clear whether we are talking about the sample or the population.
- Values in the population are given Greek letters ($\mu$, $\pi$, ...), whilst values in the sample are given the equivalent Roman letters ($m$, $p$, ...).
Suppose we have a population in which a variable $x$ has mean $\mu$ and standard deviation $\sigma$. We take a random sample of size $n$. Then:
- The sample mean $\bar{x}$ should be close to the population mean $\mu$.
- However, if several samples are taken, $\bar{x}$ in each sample will differ slightly.

Variation of $\bar{x}$ around $\mu$
How much the means of different samples differ depends on:
- Sample size: the mean of a small sample will vary more than the mean of a large sample.
- Variance in the population: if the variable measured varies little, the sample mean can only vary little.
I.e. the variance of $\bar{x}$ depends on the variance of $x$ and on the sample size $n$.

Example
Consider a population consisting of 1000 copies of each of the digits 0, 1, ..., 9. The distribution of the values in this population is uniform, with density 0.1 at each digit.
[Figure: histogram of the population values; x-axis: x, y-axis: density]

Example: Samples
- Samples of size 5, 25 and 100.
- 2000 samples of each size were randomly generated.
- The mean of $x$ ($\bar{x}$) was calculated for each sample.
- Histograms were created for each sample size separately.

Example: Distributions of $\bar{x}$
[Figure: three histograms of $\bar{x}$, one each for samples of size 5, size 25 and size 100; x-axis: mean of x, y-axis: density]

Properties of $\bar{x}$
- $E(\bar{x}) = \mu$, i.e. on average, the sample mean is the same as the population mean.
- Standard deviation of $\bar{x}$ = $\sigma/\sqrt{n}$, i.e. the uncertainty in $\bar{x}$ increases with $\sigma$ and decreases with $n$.
- The standard deviation of the mean is also called the standard error.
- $\bar{x}$ is normally distributed.
  - This is true whether or not $x$ is normally distributed, provided $n$ is sufficiently large.
  - Thanks to the Central Limit Theorem.
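The $\sigma/\sqrt{n}$ result can be derived in one line, assuming the $n$ observations are independent and each has variance $\sigma^2$ (an assumption added here, which holds to a good approximation for a simple random sample from a large population):

$$\operatorname{Var}(\bar{x}) = \operatorname{Var}\left(\frac{1}{n}\sum_{i=1}^{n} x_i\right) = \frac{1}{n^2}\sum_{i=1}^{n}\operatorname{Var}(x_i) = \frac{n\sigma^2}{n^2} = \frac{\sigma^2}{n}, \qquad \text{so} \qquad \operatorname{SD}(\bar{x}) = \frac{\sigma}{\sqrt{n}}.$$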

Standard Error
- The standard deviation of the sampling distribution of a statistic.
- Sampling distribution: the distribution of a statistic as sampling is repeated.
- All statistics have sampling distributions.
- Statistical inference is based on the standard error.

Example: Sampling Distribution of $\bar{x}$
$\mu = 4.5$, $\sigma = 2.87$

Size of samples   Mean of $\bar{x}$         S.D. of $\bar{x}$
                  Predicted   Observed     Predicted   Observed
5                 4.5         4.47         1.29        1.26
25                4.5         4.51         0.57        0.57
100               4.5         4.50         0.29        0.30
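The experiment behind this example can be repeated with a short simulation. The sketch below is not the original code used for the slides: the seed and function name are arbitrary, and samples are drawn without replacement from the 10,000-value population, so the observed figures will differ slightly from the table.

```python
import random
import statistics
from math import sqrt

# Population from the example: 1000 copies of each digit 0..9
population = [digit for digit in range(10) for _ in range(1000)]
mu = statistics.fmean(population)      # 4.5
sigma = statistics.pstdev(population)  # population S.D., about 2.87

def simulate_means(n, n_samples=2000, seed=1):
    """Return the means of n_samples random samples of size n."""
    rng = random.Random(seed)
    return [statistics.fmean(rng.sample(population, n)) for _ in range(n_samples)]

results = {}
for n in (5, 25, 100):
    means = simulate_means(n)
    results[n] = {
        "mean_predicted": mu,
        "mean_observed": statistics.fmean(means),
        "sd_predicted": sigma / sqrt(n),       # standard error sigma / sqrt(n)
        "sd_observed": statistics.stdev(means),
    }
    print(n, {key: round(value, 2) for key, value in results[n].items()})
```

The observed standard deviation of the sample means should track $\sigma/\sqrt{n}$ closely at every sample size.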

Estimating the Variance
In a population of size $N$, the variance of $x$ is given by
$$\sigma^2 = \frac{\sum (x_i - \mu)^2}{N} \qquad (1)$$
This is the population variance.
In a sample of size $n$, the variance of $x$ is given by
$$s^2 = \frac{\sum (x_i - \bar{x})^2}{n - 1} \qquad (2)$$
This is the sample variance.

Why $n - 1$ rather than $n$?
Population:
$$\sigma^2 = \frac{\sum (x_i - \mu)^2}{N}$$
Sample:
$$s^2 = \frac{\sum (x_i - \bar{x})^2}{n - 1}$$
- Use $n - 1$ rather than $n$ because we don't know $\mu$, only an imperfect estimate $\bar{x}$.
- Since $\bar{x}$ is calculated from the sample (i.e. from the $x_i$), the $x_i$ will tend to be closer to $\bar{x}$ than they are to $\mu$.
- Dividing by $n$ would underestimate the variance.
- With a reasonable sample size, it makes little difference.
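The underestimate from dividing by $n$ can be seen in a small simulation. This sketch assumes equally likely digits 0 to 9 (as in the earlier example, whose population variance is 8.25) and samples of size 5; the function name, seed and number of repetitions are arbitrary choices.

```python
import random

def average_variance_estimates(n, n_samples=10000, seed=0):
    """Average the divide-by-n and divide-by-(n-1) variance estimates
    over many samples of size n drawn from equally likely digits 0..9."""
    rng = random.Random(seed)
    total_div_n = total_div_n1 = 0.0
    for _ in range(n_samples):
        xs = [rng.randrange(10) for _ in range(n)]
        xbar = sum(xs) / n
        ss = sum((x - xbar) ** 2 for x in xs)  # sum of squares about the sample mean
        total_div_n += ss / n
        total_div_n1 += ss / (n - 1)
    return total_div_n / n_samples, total_div_n1 / n_samples

true_variance = sum((d - 4.5) ** 2 for d in range(10)) / 10  # 8.25
biased, unbiased = average_variance_estimates(n=5)
print(f"true {true_variance}, divide by n: {biased:.2f}, divide by n-1: {unbiased:.2f}")
```

On average the divide-by-$n$ estimate should come out near $\frac{n-1}{n}\sigma^2$ (6.6 here), while the divide-by-$(n-1)$ estimate should be close to 8.25.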

Proportions
- Suppose that you want to estimate $\pi$, the proportion of subjects in the population with a given characteristic.
- You take a random sample of size $n$, of whom $r$ have the characteristic.
- $p = r/n$ is a good estimator for $\pi$.
- If you create a variable $x$ which is 1 for subjects who have the characteristic and 0 for those who do not, then $p = \bar{x}$.
- If the sample is large, $p$ will be normally distributed, even though $x$ isn't.

Reference Ranges
If $x$ is normally distributed with mean $\mu$ and standard deviation $\sigma$, then we can find all of the percentiles of the distribution. E.g.
- Median = $\mu$
- 25th centile = $\mu - 0.674\sigma$
- 75th centile = $\mu + 0.674\sigma$
Commonly, we are interested in the interval in which 95% of the population lie, which is from $\mu - 1.96\sigma$ to $\mu + 1.96\sigma$. This is from the 2.5th centile to the 97.5th centile.
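The multipliers 0.674 and 1.96 come from the standard normal distribution. As a quick check, Python's standard-library `statistics.NormalDist` can reproduce them; the variable names and the helper `coverage` below are illustrative only.

```python
from statistics import NormalDist

z = NormalDist()  # standard normal: mean 0, S.D. 1

# Centiles expressed as multiples of sigma away from mu
centile_25 = z.inv_cdf(0.25)     # about -0.674
centile_75 = z.inv_cdf(0.75)     # about +0.674
centile_975 = z.inv_cdf(0.975)   # about +1.96

def coverage(k):
    """Fraction of a normal population lying within mu ± k*sigma."""
    return z.cdf(k) - z.cdf(-k)

print(round(centile_25, 3), round(centile_75, 3), round(centile_975, 2))
print(round(coverage(1.96), 3), round(coverage(1.645), 3))
```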

Reference Range Illustration
[Figure: normal density curve; x-axis: x, y-axis: density]
- Red lines cut off 5% of the data in each tail.
- 90% of the data lies between the lines.
- Blue lines are at $-1.645$ and $1.645$.

Non-normal distributions 1: Skewed distribution
[Figure: density of a $\chi^2$ distribution; x-axis: standardized values (z), y-axis: density]
- Red lines cut off 5% of the data in each tail.
- Mean ± 1.645 S.D. covers > 90% of the data.
- Only 2% < mean $-$ 1.645 S.D.
- 6.5% > mean + 1.645 S.D.

Non-normal distributions 2: Long-tailed distribution
[Figure: density of a t-distribution; x-axis: standardized values (z), y-axis: density]
- Symmetric, but not normal.
- Higher peak and longer tails than the normal distribution.
- Red lines cut off 5% of the data in each tail.
- Blue lines are at mean ± 1.645 S.D.
- Mean ± 1.645 S.D. covers > 94% of the data.

Reference Range Example
Bone mineral density (BMD) was measured at the spine in 1039 men. The mean value was 1.06 g/cm² and the standard deviation was 0.222 g/cm². Assuming BMD is normally distributed, calculate a 95% reference interval for BMD in men.
- Mean BMD = 1.06 g/cm²
- Standard deviation of BMD = 0.222 g/cm²
- 95% reference interval = 1.06 ± 1.96 × 0.222 = (0.62 g/cm², 1.50 g/cm²)
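The same arithmetic as a short Python check. The figures are those given in the example; the variable names are illustrative, and the multiplier is taken from `statistics.NormalDist` rather than rounded to 1.96.

```python
from statistics import NormalDist

mean_bmd = 1.06   # g/cm^2, from the example
sd_bmd = 0.222    # g/cm^2, from the example

z = NormalDist().inv_cdf(0.975)   # about 1.96
lower = mean_bmd - z * sd_bmd
upper = mean_bmd + z * sd_bmd
print(f"95% reference interval: {lower:.2f} to {upper:.2f} g/cm^2")
# prints: 95% reference interval: 0.62 to 1.50 g/cm^2
```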

Confidence Intervals
- The distribution of $\bar{x}$ approaches normality as $n$ gets bigger.
- The standard deviation of $\bar{x}$ is $\sigma/\sqrt{n}$.
- If samples could be taken repeatedly, 95% of the time $\bar{x}$ would lie between $\mu - 1.96\,\sigma/\sqrt{n}$ and $\mu + 1.96\,\sigma/\sqrt{n}$.
- As a consequence, 95% of the time, $\mu$ would lie between $\bar{x} - 1.96\,\sigma/\sqrt{n}$ and $\bar{x} + 1.96\,\sigma/\sqrt{n}$.
- This is a 95% confidence interval for the population mean.
- If, as is usually the case, $\sigma$ is unknown, we can use its estimate $s$.
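The "95% of the time" statement can be illustrated by simulation: draw many samples, build $\bar{x} \pm 1.96\,s/\sqrt{n}$ from each, and count how often the interval contains $\mu$. The sketch below assumes equally likely digits 0 to 9 (so $\mu = 4.5$) and samples of size 25; the function name `ci_coverage` and the seed are arbitrary.

```python
import random
from math import sqrt
from statistics import fmean, stdev

def ci_coverage(n=25, n_samples=2000, seed=2):
    """Fraction of simulated 95% confidence intervals that contain mu."""
    rng = random.Random(seed)
    mu = 4.5  # mean of equally likely digits 0..9
    hits = 0
    for _ in range(n_samples):
        xs = [rng.randrange(10) for _ in range(n)]
        xbar = fmean(xs)
        se = stdev(xs) / sqrt(n)  # sigma unknown, so use the estimate s
        if xbar - 1.96 * se <= mu <= xbar + 1.96 * se:
            hits += 1
    return hits / n_samples

print(ci_coverage())
```

With $s$ in place of $\sigma$ and a modest sample size, the observed coverage is usually a little below 95%.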

Confidence Interval Example
In 216 patients with primary biliary cirrhosis, serum albumin had a mean value of 34.46 g/l and a standard deviation of 5.84 g/l.
- Standard deviation of $x$ = 5.84
- Standard error of $\bar{x}$ = $5.84/\sqrt{216}$ = 0.397
- 95% confidence interval = 34.46 ± 1.96 × 0.397 = (33.68, 35.24)
So, the mean value of serum albumin in the population of patients with primary biliary cirrhosis is probably between 33.68 g/l and 35.24 g/l.
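A sketch of the same calculation in Python, using the figures from the example. The function name `mean_confidence_interval` and its `level` parameter are illustrative, and it uses the normal approximation described above.

```python
from math import sqrt
from statistics import NormalDist

def mean_confidence_interval(mean, sd, n, level=0.95):
    """Normal-approximation confidence interval for a population mean."""
    z = NormalDist().inv_cdf(1 - (1 - level) / 2)  # about 1.96 for 95%
    se = sd / sqrt(n)                              # standard error of the mean
    return mean - z * se, mean + z * se, se

lower, upper, se = mean_confidence_interval(mean=34.46, sd=5.84, n=216)
print(f"SE = {se:.3f}, 95% CI = ({lower:.2f}, {upper:.2f})")
# prints: SE = 0.397, 95% CI = (33.68, 35.24)
```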

Confidence Intervals for Proportions
- $p$ is normally distributed with standard error $\sqrt{p(1-p)/n}$, provided $n$ is large enough.
- This can be used to calculate a confidence interval for a proportion.
- Exact confidence intervals can be calculated for small $n$ (less than 20, say) from tables of the binomial distribution.
- A reference range for a proportion is meaningless: a subject either has the characteristic or they do not.

Confidence Interval around a Proportion: Example
100 subjects each receive two analgesics, X and Y, for one week each in a randomly determined order. They then state a preference for one drug. 65 prefer X, 35 prefer Y. Calculate a 95% confidence interval for the proportion preferring X.
- Standard error of $p$ = $\sqrt{\dfrac{0.65 \times 0.35}{100}}$ = 0.0477
- 95% confidence interval = 0.65 ± 1.96 × 0.0477 = (0.56, 0.74)
So, in the general population, it is likely that between 56% and 74% of people would prefer X.
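The proportion example can be checked the same way. The function name `proportion_confidence_interval` is illustrative; it implements only the large-sample normal approximation, not the exact binomial interval mentioned for small $n$.

```python
from math import sqrt
from statistics import NormalDist

def proportion_confidence_interval(r, n, level=0.95):
    """Normal-approximation confidence interval for a proportion r/n."""
    p = r / n
    se = sqrt(p * (1 - p) / n)                     # standard error of p
    z = NormalDist().inv_cdf(1 - (1 - level) / 2)  # about 1.96 for 95%
    return p - z * se, p + z * se, se

lower, upper, se = proportion_confidence_interval(r=65, n=100)
print(f"SE = {se:.4f}, 95% CI = ({lower:.2f}, {upper:.2f})")
# prints: SE = 0.0477, 95% CI = (0.56, 0.74)
```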

Confidence Intervals in Stata
- The ci command produces confidence intervals.
- For proportions, use the binomial option.

Confidence Intervals and Reference Ranges
- Confidence intervals tell us about the population mean.
- Reference ranges tell us about individual values.
- Reference ranges require the variable to be normally distributed.
- Confidence intervals do not, provided the sampling distribution of the statistic of interest is normal.
- Normality of the sampling distribution may require a reasonable sample size.

Sample Size Calculations
- The primary outcome of a study is a statistic (mean, proportion, relative risk, incidence rate, hazard ratio, etc.).
- The larger the study, the more precisely we can estimate our statistic.
- We can calculate how many subjects we need to achieve adequate precision if we:
  - know how the distribution of the statistic changes with increasing numbers of subjects;
  - have a definition of "adequate".
- Power-based calculations are more complicated.

Sample size for precision of mean
- Suppose that we want to know $\mu$ to a certain level of precision.
- We can be 95% certain that $\mu$ lies within $\bar{x} \pm 1.96\,\sigma/\sqrt{n}$.
- The width of this interval depends on $n$, which we control.
- Therefore, we can select $n$ to give our chosen width.
- We need an estimate for $\sigma$, for which we can use $s$.

Sample Size Formula
Suppose we want to fix the width of the 95% confidence interval to $2W$, i.e. 95% CI = $\bar{x} \pm W$. Then
$$W = 1.96 \times \text{Standard Error} = \frac{1.96\,\sigma}{\sqrt{n}}$$
$$W^2 = \frac{1.96^2\,\sigma^2}{n}$$
$$n = \left(\frac{1.96\,\sigma}{W}\right)^2$$

Sample Size Example
In the primary biliary cirrhosis example, suppose that we wish to know the mean serum albumin in cirrhosis patients to within 0.5 g/l. How many patients would we need to study (assuming a standard deviation of 5.84 g/l)?
$$W = 0.5, \qquad \sigma = 5.84$$
$$n = \left(\frac{1.96\,\sigma}{W}\right)^2 = \left(\frac{1.96 \times 5.84}{0.5}\right)^2 \approx 524$$
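The sample size formula can be turned into a small helper. This is a sketch with an illustrative function name; it returns the unrounded value so that the rounding decision stays visible.

```python
from math import ceil
from statistics import NormalDist

def sample_size_for_half_width(sigma, w, level=0.95):
    """Unrounded n giving a confidence interval of mean ± w, for known sigma."""
    z = NormalDist().inv_cdf(1 - (1 - level) / 2)  # about 1.96 for 95%
    return (z * sigma / w) ** 2

n_exact = sample_size_for_half_width(sigma=5.84, w=0.5)
print(round(n_exact, 1), ceil(n_exact))
```

The unrounded value is about 524.1, in line with the figure of 524 above; rounding up to 525 would be the conservative choice if the interval must be no wider than ± 0.5 g/l.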