Planning Sample Size for Randomized Evaluations Esther Duflo J-PAL povertyactionlab.org

Planning Sample Size for Randomized Evaluations. General question: how large does the sample need to be to credibly detect a given effect size? What does "credibly" mean here? It means that I can be reasonably sure that the difference between the group that received the program and the group that did not is due to the program. Randomization removes bias, but it does not remove noise: it works because of the law of large numbers. So how large must the sample be?

Basic set up: At the end of an experiment, we will compare the outcome of interest in the treatment and the comparison groups. We are interested in the difference: Mean in treatment - Mean in control = Effect size. For example: the mean number of wells in villages headed by women vs. the mean number of wells in villages headed by men.

Estimation: But we do not observe the entire population, just a sample. In each village of the sample, there is a given number of wells. It is more or less close to the mean in the population, as a function of all the other factors that affect the placement of wells. We estimate the population mean by computing the average in the sample. If we have very few villages, the averages are imprecise. When we see a difference in sample averages, we do not know whether it comes from the effect of the treatment or from something else.

Estimation: What determines how precise our estimate is? The size of the sample: Can we conclude anything if we have one treated village and one non-treated village? Can we conclude anything if we give textbooks to one classroom and not the other, even if the class size is large? What matters is the effective sample size, i.e. the number of treated units and control units (e.g. classrooms). What is the unit in the case of the Panchayats? The variability of the outcome we try to measure: If many other non-measured factors explain our outcome, it will be harder to say whether the treatment really changed it.
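To see concretely why one treated and one control unit are not enough, here is a minimal simulation sketch (not part of the original deck; the true effect of 10, the outcome standard deviation of 15, and the sample sizes are illustrative assumptions). It draws repeated "experiments" from two populations whose means truly differ and shows how the estimated difference swings when each arm is small.

    # Illustrative simulation: noise in the estimated effect at different sample sizes.
    # The true effect (10), outcome SD (15) and sample sizes are assumptions, not deck values.
    import numpy as np

    rng = np.random.default_rng(0)
    true_effect, sd = 10.0, 15.0

    for n_per_arm in (1, 5, 50, 500):
        estimates = []
        for _ in range(2000):                      # repeat the "experiment" many times
            control = rng.normal(50, sd, n_per_arm)
            treated = rng.normal(50 + true_effect, sd, n_per_arm)
            estimates.append(treated.mean() - control.mean())
        estimates = np.array(estimates)
        print(f"n per arm = {n_per_arm:4d}: average estimate = {estimates.mean():5.1f}, "
              f"spread across experiments (SD) = {estimates.std():5.1f}")

In this simulation the average estimate is always close to the true effect (randomization removes bias), but the spread across experiments shrinks roughly in proportion to one over the square root of the sample size (randomization does not remove noise).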

When the outcomes are very precise: Low Standard Deviation. (Histogram of frequency by value for two groups, one with mean 50 and one with mean 60; the two distributions are clearly separated.)

Less precision: Medium Standard Deviation. (Same two groups, mean 50 and mean 60; the distributions overlap more.)

Can we conclude? High Standard Deviation. (Same two groups, mean 50 and mean 60; the distributions overlap heavily.)

Confidence Intervals: The estimated effect size (the difference in the sample averages) is valid only for our sample. Each sample would give a slightly different answer. How do we use our sample to make statements about the overall population? A 95% confidence interval for an effect size tells us that, for 95% of the samples we could have drawn from the same population, the estimated effect would have fallen into this interval. The standard error (SE) of the estimate captures both the size of the sample and the variability of the outcome (it is larger with a small sample and with a variable outcome). Rule of thumb: a 95% confidence interval is roughly the estimated effect plus or minus two standard errors.

Hypothesis testing: Often we are interested in testing the hypothesis that the effect size is equal to zero (we want to be able to reject the hypothesis that the program had no effect). We want to test H0: Effect size = 0 against Ha: Effect size ≠ 0.

Two types of mistakes. First type of error: conclude that there is an effect, when in fact there is no effect. The level (α) of your test is the probability that you will falsely conclude that the program has an effect, when in fact it does not. So with a level of 5%, you can be 95% confident in the validity of your conclusion that the program had an effect. For policy purposes, you want to be very confident in the answer you give, so the level will be set fairly low. Common levels of α: 5%, 10%, 1%.

Relation with confidence intervals: If zero does not belong to the 95% confidence interval of the effect size we measured, then we can be at least 95% sure that the effect size is not zero. So the rule of thumb is that if the estimated effect is more than twice its standard error, you can conclude with more than 95% certainty that the program had an effect.
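As a hedged numerical illustration of the two-standard-error rule of thumb above (the data below are simulated placeholders, not taken from any J-PAL study), the sketch computes the standard error of a difference in means, the approximate 95% confidence interval, and whether zero lies outside it.

    # Rule-of-thumb check: effect +/- 2 SE, and "effect more than 2 SE away from zero"
    # as an approximate 5% significance test. Replace the simulated outcomes with real data.
    import numpy as np

    rng = np.random.default_rng(1)
    treatment = rng.normal(60, 15, 200)   # outcomes for 200 treated units (placeholder)
    control   = rng.normal(50, 15, 200)   # outcomes for 200 comparison units (placeholder)

    effect = treatment.mean() - control.mean()
    se = np.sqrt(treatment.var(ddof=1) / len(treatment) + control.var(ddof=1) / len(control))

    print(f"estimated effect = {effect:.2f}, standard error = {se:.2f}")
    print(f"approximate 95% CI: [{effect - 2*se:.2f}, {effect + 2*se:.2f}]")
    print("significant at roughly the 5% level?", abs(effect) > 2 * se)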

Two types of mistakes. Second type of error: you fail to reject that the program had no effect, when in fact it does have an effect. The power of a test is the probability that I will be able to find a significant effect in my experiment (higher power is better, since I am more likely to have an effect to report). Power is a planning tool: it tells me how likely it is that I find a significant effect for a given sample size. One minus the power is the probability of being disappointed.

Calculating Power: When planning an evaluation, with some preliminary research we can calculate the minimum sample size we need to: test a pre-specified hypothesis (the program effect was zero or not zero), for a pre-specified significance level (e.g. 5%), given a pre-specified effect size (what you think the program will do), to achieve a given power. A power of 80% tells us that, in 80% of the experiments of this sample size conducted in this population, if there is indeed an effect in the population, we will be able to say in our sample that there is an effect with the level of confidence desired. The larger the sample, the larger the power. Common power levels used: 80%, 90%.

Ingredients for a power calculation in a simple study (what we need, and where we get it):
- Significance level: often conventionally set at 5%. The lower it is, the larger the sample size needed for a given power.
- The mean and the variability of the outcome in the comparison group: from previous surveys conducted in similar settings. The larger the variability, the larger the sample needed for a given power.
- The effect size that we want to detect: what is the smallest effect that should prompt a policy response? The smaller the effect size we want to detect, the larger the sample size we need for a given power.
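Under standard simplifying assumptions (two equal-sized arms, a common outcome standard deviation, a two-sided test, and the normal approximation), these ingredients combine into the usual sample-size formula: n per arm = 2 (z_{1-alpha/2} + z_{power})^2 * sigma^2 / delta^2. The sketch below is mine, not from the deck, and the sigma and delta values are placeholders.

    # Approximate sample size per arm for detecting a difference in means.
    # n per arm = 2 * (z_{1-alpha/2} + z_{power})^2 * sigma^2 / delta^2
    # alpha, power, sigma and delta are placeholder choices, not values from the deck.
    import math
    from scipy.stats import norm

    def n_per_arm(delta, sigma, alpha=0.05, power=0.80):
        z_alpha = norm.ppf(1 - alpha / 2)
        z_power = norm.ppf(power)
        return 2 * (z_alpha + z_power) ** 2 * (sigma / delta) ** 2

    # Example: outcome SD of 15 and a smallest effect worth detecting of 3
    # (a standardized effect size of 0.2) gives roughly 393 units per arm.
    print(math.ceil(n_per_arm(delta=3, sigma=15)))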

Picking an effect size: What is the smallest effect that would justify adopting the program? Compare the cost of the program to the benefits it brings, and to the alternative uses of the money. If the effect is smaller than that, it might as well be zero: we are not interested in proving that a very small effect is different from zero. In contrast, any effect larger than that threshold would justify adopting the program: we want to be able to distinguish it from zero. Common danger: picking effect sizes that are too optimistic, so that the sample size is set too low!

Standardized Effect Sizes: How large an effect you can detect with a given sample depends on how variable the outcome is. Example: if all children have very similar learning levels without the program, a very small impact will be easy to detect. The standard deviation captures the variability of the outcome: the more variability, the higher the standard deviation. The standardized effect size is the effect size divided by the standard deviation of the outcome: standardized effect size = effect size / standard deviation. Common standardized effect sizes: 0.2 (small), 0.5 (medium), 0.8 (large).

The design factors that influence power: the level of randomization; the availability of a baseline; the availability of control variables, and stratification; the type of hypothesis that is being tested.

Level of Randomization: Clustered Design. Cluster-randomized trials are experiments in which social units or clusters, rather than individuals, are randomly allocated to intervention groups. Examples: PROGRESA (village), Gender Reservations (Panchayat), Flipcharts and Deworming (school), Iron supplementation (family).

Reasons for adopting cluster randomization: The need to minimize or remove contamination. Example: in the deworming program, the school was chosen as the unit because worms are contagious. Basic feasibility considerations. Example: the PROGRESA program would not have been politically feasible if some families had been included and not others. It is the only natural choice. Example: any education intervention that affects an entire classroom (e.g. flipcharts, teacher training).

Impact of Clustering: The outcomes for all the individuals within a unit may be correlated. All villagers are exposed to the same weather; all the members of a Panchayat share a common history; all students share a schoolmaster; the program affects all students at the same time; the members of a village interact with each other. The sample size needs to be adjusted for this correlation: the more correlation between the outcomes, the more we need to adjust the standard errors.

Example of group-effect multipliers: a table showing the required multiplier as a function of the randomized group size and the intraclass correlation; the multiplier grows with both.
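The multipliers in such a table follow from the standard design-effect formula for a clustered design, DEFF = 1 + (m - 1) * rho, where m is the randomized group size and rho is the intraclass correlation: the required sample size is the unclustered requirement multiplied by DEFF. The grid of m and rho values in the sketch below is my own choice for illustration.

    # Design effect (multiplier on the required sample size) for a clustered design.
    # DEFF = 1 + (m - 1) * rho; the m and rho grid below is illustrative only.
    def design_effect(m, rho):
        return 1 + (m - 1) * rho

    for rho in (0.01, 0.05, 0.10, 0.20):
        row = "  ".join(f"m={m:3d}: {design_effect(m, rho):5.2f}" for m in (10, 50, 100))
        print(f"rho = {rho:.2f}   {row}")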

Implications: It is extremely important to randomize an adequate number of groups. Often the number of individuals within groups matters less than the number of groups. Remember that the law of large numbers applies only as the number of groups that are randomized increases. You CANNOT randomize at the level of the district, with one treated district and one control district!

Availability of a Baseline: A baseline has three main uses: it can check whether control and treatment groups were the same or different before the treatment; it reduces the sample size needed, but requires that you do a survey before starting the intervention (typically the evaluation costs go up and the intervention costs go down); it can be used to stratify and form subgroups (e.g. balsakhi). To compute power with a baseline, you need to know the correlation between two subsequent measurements of the outcome (for example, between consumption in two consecutive years). The stronger the correlation, the bigger the gain. There are very big gains for very persistent outcomes such as test scores.
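One common approximation of that gain, assuming the analysis controls for the baseline value of the outcome (an ANCOVA-style specification), is that the required sample size shrinks by a factor of roughly 1 - r^2, where r is the correlation between the baseline and follow-up measurements. The sketch below is my own illustration of that rule; the starting sample size and the correlations are placeholders.

    # Rough gain from a baseline survey under an ANCOVA-style analysis:
    # n_with_baseline ~ n_without_baseline * (1 - r^2), r = baseline/follow-up correlation.
    # The starting sample size and the values of r are illustrative assumptions.
    def n_with_baseline(n_without, r):
        return n_without * (1 - r ** 2)

    n_without = 400
    for r in (0.2, 0.5, 0.8):   # weakly persistent to highly persistent outcomes (e.g. test scores)
        print(f"r = {r:.1f}: about {n_with_baseline(n_without, r):.0f} units instead of {n_without}")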

Control Variables: If we have control variables (e.g. village population, the block where the village is located, etc.), we can also control for them. What matters now for power is the residual variation after controlling for those variables. If the control variables explain a large part of the variance, precision increases and the sample size requirement decreases. Warning: control variables must only include variables that are not INFLUENCED by the treatment, i.e. variables that were collected BEFORE the intervention.

Stratified Samples: Stratification means creating BLOCKS by the values of the control variables and randomizing within each block. Stratification ensures that treatment and control groups are balanced in terms of these control variables. This reduces variance for two reasons: it reduces the variance of the outcome of interest within each stratum, and it reduces the correlation of units within clusters. Example: if you stratify by district for an agricultural extension program, agroclimatic factors are controlled for and the common district-magistrate effect disappears.
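As a minimal sketch of blocked randomization (the district names, number of villages, and fifty-fifty split are hypothetical), the code below shuffles villages within each district block and treats half of each block, so treatment and control are balanced district by district.

    # Stratified (blocked) random assignment: randomize within each district block
    # so that half of every block is treated. All names and counts are hypothetical.
    import random

    random.seed(42)
    districts = ["A"] * 6 + ["B"] * 4 + ["C"] * 8          # 18 hypothetical villages
    villages = [{"village_id": i, "district": d} for i, d in enumerate(districts)]

    blocks = {}
    for v in villages:
        blocks.setdefault(v["district"], []).append(v)

    for block in blocks.values():
        random.shuffle(block)
        half = len(block) // 2
        for j, v in enumerate(block):
            v["treated"] = j < half                         # first half of the shuffled block

    for d, block in sorted(blocks.items()):
        print(f"district {d}: {sum(v['treated'] for v in block)} treated out of {len(block)}")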

The Design factors that influence power Clustered design Availability of a Baseline Availability of Control Variables, and Stratification. The type of hypothesis that is being tested.

The hypothesis that is being tested: Are you interested in the difference between two treatments as well as the difference between treatment and control? Are you interested in the interaction between the treatments? Are you interested in testing whether the effect is different in different subpopulations? Does your design involve only partial compliance (e.g. an encouragement design)?

Power calculations using the OD software: Choose "Power vs. number of clusters" in the "cluster randomized trials" menu.

Choose the cluster size.

Choose the significance level, treatment effect, and correlation: pick the significance level (α), normally 0.05; pick the treatment effect (δ), and experiment with 0.20; pick the intraclass correlation (rho). You obtain the resulting graph showing power as a function of sample size.
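For readers without the OD software, here is a hedged sketch of the same exercise in code: approximate power for a cluster-randomized design with equal cluster sizes, using power ~ Phi(delta * sqrt(J*m / (2*DEFF)) - z_{1-alpha/2}) with DEFF = 1 + (m - 1)*rho. The significance level (0.05) and treatment effect (0.20) follow the slide; the cluster size and intraclass correlation are my own assumed inputs.

    # Approximate power for a cluster-randomized trial vs. number of clusters per arm
    # (normal approximation; ignores degrees-of-freedom corrections).
    # delta and alpha follow the slide; m (cluster size) and rho are assumed inputs.
    from scipy.stats import norm

    def cluster_power(clusters_per_arm, m, delta, rho, alpha=0.05):
        deff = 1 + (m - 1) * rho                      # design effect
        ncp = delta * (clusters_per_arm * m / (2 * deff)) ** 0.5
        return norm.cdf(ncp - norm.ppf(1 - alpha / 2))

    for j in (10, 20, 40, 80):
        print(f"{j:3d} clusters per arm: power ~ {cluster_power(j, m=40, delta=0.20, rho=0.05):.2f}")

As on the OD graph, power rises with the number of clusters randomized; once cluster sizes are moderately large, adding clusters helps far more than adding individuals within clusters.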

Power and Sample Size

Conclusions: Power Calculations in Practice. Power calculations involve some guesswork. Sometimes we do not have the right information to conduct them very properly. However, it is important to spend some effort on them: avoid launching studies that will have no power at all (a waste of time and money), and devote the appropriate resources (and not too many) to the studies that you decide to conduct.