Establishing a framework for statistical analysis via the Generalized Linear Model


PSY349: Lecture 1: INTRO & CORRELATION

The GLM provides a unified framework that incorporates a number of statistical methods (including regression). If you understand this framework, you won't have to puzzle over which test to use.

Common hypothesis tests and their underlying models [you don't need to know these in detail; they are the tests included in the GLM]
- We learn about each test individually, but once you understand the model, you'll know which test to use WHEN
- Each test has an associated formula
- Each test has a critical values table to look up
- But are there any underlying commonalities?

A system of linear models

Concept of the General Linear Model (GLM)

The GLM looks at the relationship between the IV and the DV.
- Relationship: what are the expected changes in the DV as a result of changes in the IVs? [does the DV go up or down significantly?]
- This is the regression way of thinking (talking) about and interpreting relationships (prediction)

As a diagram
- X1, X2, X3 = IVs [predictors]
- Y = DV
- b1, b2, b3 = relationships
- r = correlation

Note: Double-headed arrows signify correlation, so X1, X2 and X3 can be correlated. What's not shown is that X1, X2 and X3 can be numeric or categorical. Foreshadowing: b1, b2 and b3 represent the expected change in the DV given a 1-unit change in the corresponding IV. [unit increases/decreases]
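The diagram described above can also be written as a single equation. This is a sketch of the usual linear-model form, not something taken from the slides: the intercept b_0 and the error term e are assumptions added here for completeness (e corresponds to the residuals, the error in prediction).

```latex
Y = b_0 + b_1 X_1 + b_2 X_2 + b_3 X_3 + e
```

Read this way, each b_j is the expected change in Y for a 1-unit increase in X_j, holding the other predictors constant.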

Residuals = error in prediction
* Note: the generalized linear model includes tests for non-normal residuals as well as for normally distributed residuals [this is how it differs from the general linear model]

- Many common statistical methods are linked through the GLM concept
- All tests have a model underlying them
- When thinking about which test to use, thinking in terms of a GLM structure may actually make life simpler!

Correlation

Correlation = linear relationship between 2 continuous variables (numeric = continuous)
- -ve = a higher score on one variable is associated with a lower score on the other variable
- +ve = an increase in one is associated with an increase in the other

Before the Statistics (because we are taking a research vs analytic approach to correlation/regression)
- Step 1. Understand the research question
- Step 2. How are the DV and IV measured? [the type of statistical test depends on how the DV/IV are measured]
- Step 3. Choose method of analysis (= correlation/regression)

Conduct Correlation/Regression Analysis
- Part 1: Univariate
- Part 2: Bivariate [relationships between 2 variables]
- Part 3: Regression and Check Assumptions

Step 1: Understand the RQ

Research: studying carers of people on home haemodialysis
Research question: how much distress do home haemodialysis caregivers experience (univariate), is this distress related to the age of the carer (bivariate correlation), and can this distress be predicted from information about the age of the carer (regression)?
* predict = regression
Hypothesis: younger carers are expected to experience more distress

RQ in a diagram: 3 parts to the analysis
(1) Univariate: level of carer distress
(2) Bivariate: correlation [can only tell us whether there is a linear relationship]
(3) Regression

Step 2: How are the DV and IV measured?

- The study is non-experimental because we are not allocating anyone to groups
- Non-experimental = we can't infer anything causal (we are not saying age causes distress, only that it predicts distress)
- Regression doesn't infer causality

Understand the scale of measurement:
- Age = ordinal, but it can be treated as interval because there are enough cases in each category to treat it as CONTINUOUS
- SHQ is continuous

Step 3: Choose Method of Analysis

Before the Statistics (because we are taking a research vs analytic approach to correlation/regression)
- Step 1: Understand the research question
- Step 2: How are the DV and IV measured?
- Step 3: Choose method of analysis → Both the IV and DV are continuous = correlation/regression

Correlation/Regression Analysis Part 1: Univariate (Graphical and Numeric)

Tells us whether the DV (in particular) and the IV are normally distributed
- We are more interested in the DV than the IV [don't care about IV]
- This will help us describe the DV

GRAPHICAL: Histogram
Describe the histogram based on 5 features: Central Tendency, Variability, Kurtosis, Skewness, Modal Characteristics (modality)

Describing Univariate Characteristics (Graphical = histogram)

1. Central Tendency
- Typical or average score, centre of the distribution, peak in the distribution. Does it exist?
2. Variability
- Do all the cases tend to score at about the same point (low variability), or are they widely scattered (high variability)? Width of the distribution.
o No variability in DV = analysis not possible
3. Kurtosis
- Flatness or peakedness of a distribution
- Platykurtic (flat), leptokurtic (very peaked), and mesokurtic (in between)
o Want it to be mesokurtic
4. Skewness
- Symmetry vs lopsidedness of the distribution
- Positive (right) skew, negative (left) skew [tails]. Symmetric distributions have no skew
5. Modal Characteristics (Modality)
- Number of frequency peaks: unimodal, bimodal or multimodal [mode = most frequent score]
o Want unimodal
- A distribution with no mode is a uniform or rectangular distribution
- In general, the presence of more than one frequency peak (mode) in a distribution means that the data represent several relatively homogeneous subgroups within the larger sample being studied

SPSS: Analyze → Descriptive Statistics → Frequencies [Stats & charts]

Syntax version [frequencies = how many people have each score]:

  frequencies variables = ghtot perdist lifeups negfeel shtot
    /format = notable
    /statistics = stddev variance range minimum maximum mean median mode skewness seskew kurtosis sekurt
    /histogram = normal.

[Figure: histogram for Total Specific Health Questionnaire (SHTOT)]
Good choice for DV: most well behaved

Numeric Summaries for SHTOT
- Skewness = between -1 & +1 (but look at the distribution as well: graph & stats [skew could be caused by one outlier; exclude it?])

Now let's look at the actual distribution of ages
SPSS: Analyze → Descriptive Statistics → Frequencies [Display freq tables]

Syntax:

  frequencies variables = age
    /statistics = stddev variance range minimum maximum mean median mode skewness seskew kurtosis sekurt.

[Table: descriptive statistics for age]
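For readers without SPSS, the numeric summaries requested by the frequencies command can be approximated with a short Python sketch. The function name `describe` and the example scores are hypothetical and are not the study data. The skewness and kurtosis here are the simple moment-based versions; SPSS applies small-sample adjustments, so its values will differ slightly.

```python
import math
import statistics


def describe(scores):
    """Return basic univariate summaries for a list of numeric scores."""
    n = len(scores)
    mean = statistics.fmean(scores)
    sd = statistics.stdev(scores)  # sample SD (n - 1 denominator)

    # Central moments with an n denominator, used for the shape measures.
    m2 = sum((x - mean) ** 2 for x in scores) / n
    m3 = sum((x - mean) ** 3 for x in scores) / n
    m4 = sum((x - mean) ** 4 for x in scores) / n

    return {
        "n": n,
        "mean": mean,
        "median": statistics.median(scores),
        "mode": statistics.mode(scores),
        "sd": sd,
        "variance": sd ** 2,
        "minimum": min(scores),
        "maximum": max(scores),
        "range": max(scores) - min(scores),
        "skewness": m3 / m2 ** 1.5,          # 0 for a symmetric distribution
        "excess_kurtosis": m4 / m2 ** 2 - 3,  # 0 for a normal distribution
    }


# Hypothetical distress scores, for illustration only.
example_scores = [5, 10, 12, 15, 18, 20, 21, 21, 25, 28, 30, 35, 47]
summary = describe(example_scores)

# Rule of thumb from the notes: skewness between -1 and +1 is acceptable,
# but the histogram should still be inspected.
skew_ok = -1 <= summary["skewness"] <= 1
print(summary)
print("skewness within -1..+1:", skew_ok)
```

A positive skewness value means a longer right tail and a negative value a longer left tail, which matches the description of skew given for the histogram.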

Conduct Correlation/Regression Analysis Part 1: Univariate, i.e.
- Graphical = histogram (describe 5 features)
- Numeric = descriptive statistics
This answers the first part of our research question.

Describing Distress *project report
Because SHTOT is well behaved we can report the mean and standard deviation. [if it were not well behaved and we had to make it categorical, we couldn't mention the mean and SD]
- Mean = 21.10
- Standard deviation = 10.95
- (std. dev.)² = variance = 119.99. Therefore SHTOT varies! [only the variance tells us this, not the mean]
- Because SHTOT is normally distributed, we know that approximately 68% (roughly two-thirds) of carers have distress between:
- (mean - 1 std. dev.) and (mean + 1 std. dev.)
- (21.10 - 10.95) and (21.10 + 10.95)
- So, approx. 68% of carers fall between 10 and 32 [not reported in a research paper]

In theory the SHTOT score can range between 0 and 60. Our observed range is between 0 and 47. So we could conclude that carers' distress certainly varies, but that most carers display a level of distress in the lower half of the possible range.

Understand variation in the SHTOT
- If the RNS Administrator wants to know the level of distress in the caregiver community, our best guess at the moment is the mean of the sample (21.10).
- This is a reasonable estimate, and statistically the best estimate, but can we improve on this answer?
o Yes, if we can find variables with which distress is related (covariation).
- Let's look at the relationship between distress and age.

Conduct Correlation/Regression Analysis Part 2: Bivariate, i.e.
- Graphical = scatterplot (describe 7 features)
- Numeric = Pearson correlation (if appropriate)
- We are now moving from univariate (one at a time) statistics to bivariate (two at a time) statistics. Is distress related to age?
- Answers the 2nd part of our research question
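Before moving on to the scatterplot, the univariate arithmetic for SHTOT (the variance and the mean ± 1 SD band) can be checked with a short Python sketch. It uses only the rounded mean and SD reported in the notes. The small gap between the squared rounded SD and the reported variance of 119.99 is presumably a rounding effect, which is an assumption, since the unrounded SD is not given.

```python
import math

# Summary values as reported in the notes (rounded to 2 decimals).
mean = 21.10
sd = 10.95

variance = sd ** 2   # squared rounded SD; the notes report 119.99
lower = mean - sd    # mean - 1 SD
upper = mean + sd    # mean + 1 SD

# Proportion of a normal distribution lying within 1 SD of its mean.
within_one_sd = math.erf(1 / math.sqrt(2))

print(f"variance from rounded SD: {variance:.2f}")
print(f"mean +/- 1 SD band: {lower:.2f} to {upper:.2f}")
print(f"normal proportion within 1 SD: {within_one_sd:.3f}")
```

The band only describes the sample well if the distribution really is close to normal, which is why the histogram check comes first.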

Use SPSS point & click to produce a SCATTERPLOT
Graphs → Legacy Dialogs → Scatter/Dot → Simple → Define
- DV [outcome]: Y axis
- IV [predictor]: X axis

Syntax:

  graph /scatterplot(bivariate) = age with shtot.

Another of the great moments in data analysis: have I chosen a good variable to relate to my dependent variable? (yes, there's a trend)

Once we have our scatterplot there are 7 aspects to look at:

1. Monotonic [consistent upward or downward trend]
Is the relationship monotonic? In other words, does Y rise or fall consistently as X rises? A U-shaped relationship is not monotonic. Our scatterplot is monotonic [consistent decrease].
2. Direction
Are the variables positively or negatively related? Ours is negative: as Age increases, Distress reduces.
3. Linear, straight line
Can the relationship be summarised by a straight line, or will it need a curve? Our scatterplot can be summarised by a straight line.
4. Effect of X on Y
How much effect does X have on Y? In other words, how much does Y increase (or decrease) for every unit increase in X? Ours is moderate.
→ There is no standard; this is based on your own judgement [noticeable slope = moderate, angled = large, flat = small. Cut-offs are ambiguous]
→ We get this answer from the regression analysis
5. Correlation
How highly do the variables correlate? In other words, how tightly do the points cluster around a fitted line or curve? Ours is weak to moderate. [circle = weak, oval = moderate. This is not about the angle of the line but about how tightly the points cluster, which is different from the effect]
6. Gaps
Are there any gaps in the plot? Do we have examples smoothly ranged across the whole scale of X and Y, or are there gaps and discontinuities? In ours there are no gaps.
7. Outliers
Are there any obvious outliers? Draw attention to any unusual data points. Here there are no obvious outliers.

Hence from our scatterplot we can conclude that:
- Age and Distress are negatively, linearly related. Is this what you expected?
- Provided we are happy that (1) the relationship is monotonic, (3) it can be summarised by a straight line, (6) there are no gaps and (7) there are no obvious outliers, we can appropriately summarise the relationship numerically by calculating a (Pearson) correlation.
o If there were a U-shaped curve, gaps or outliers, we couldn't use Pearson r
- Let's use SPSS: Analyze → Correlate → Bivariate

Syntax:

  correlations variables = shtot age
    /print = twotail nosig.

[Table: numeric summaries for the linear relationship between IV & DV]
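Points 4 (effect) and 5 (correlation) are easy to mix up, so here is a minimal Python sketch that separates them. The data are constructed for illustration and are not from the study: both data sets share exactly the same fitted slope, but one has much more scatter around the line. The helper name `slope_and_r` is hypothetical.

```python
import math


def slope_and_r(x, y):
    """Return (least-squares slope of y on x, Pearson r) for paired lists."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    return sxy / sxx, sxy / math.sqrt(sxx * syy)


x = [1, 2, 3, 4, 5, 6]
# Scatter pattern chosen to sum to zero and to be uncorrelated with x,
# so adding it changes the tightness of the points but not the slope.
pattern = [1, -1, 0, 0, -1, 1]

y_tight = [2 * a + 0.5 * e for a, e in zip(x, pattern)]  # little scatter
y_loose = [2 * a + 5.0 * e for a, e in zip(x, pattern)]  # a lot of scatter

slope_tight, r_tight = slope_and_r(x, y_tight)
slope_loose, r_loose = slope_and_r(x, y_loose)

print(f"tight: slope={slope_tight:.2f}, r={r_tight:.2f}")
print(f"loose: slope={slope_loose:.2f}, r={r_loose:.2f}")
```

The effect of X on Y (the slope) is identical in both cases; only the correlation changes, because it measures how tightly the points cluster around the line.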

p < .01 [not p = 0]

The Correlation coefficient
By using the correlation command in SPSS we found that the Pearson correlation coefficient was -0.53. Here we have a numerical value for:
2. Direction (positive or negative): the sign (i.e. -0.53)
5. How highly do the variables correlate: the size (i.e. 0.53, ignoring the sign) [r lies between -1 and +1; the closer to 0, the weaker the relationship; +1/-1 = strong relationship]
→ Also tells us something about significance.
The answer to:
4. How much effect does X have on Y
is given when we conduct the regression analysis.

Correlation SIZE
Ours is -0.53, which I would say is a moderate correlation. As a rule of thumb I would say:
- Correlations between 0 and 0.29 (positive and negative) are weak,
- Correlations between 0.30 and 0.59 are moderate, and
- Correlations between 0.60 and 1.00 are strong.

Correlation SIGNIFICANCE
Significance depends on two things: the size of the relationship AND the sample size.
→ A large correlation in a small sample might not be significant
→ So keep the sample size in mind when interpreting significance

Strong vs. weak correlations
Don't confuse the steepness of the slope with how tightly clustered the points are!
- Perfect correlation = data points making a straight line
- Doesn't matter how steep the line is
- Weaker correlation = data points more dispersed
- Still doesn't matter how steep the line is
- No correlation = no linear pattern to the data points
- Slopes are the same; only the dispersion of the data points differs (so different correlations)

Which brings us to the question: what exactly is the correlation coefficient?

Pearson r, the correlation coefficient

$$ r = \frac{\sum (x - \bar{x})(y - \bar{y}) / n}{S_x S_y} $$

where
- Σ = sum
- x = score on first variable
- x̄ = mean of all scores on x
- y = score on second variable
- ȳ = mean of all scores on y
- n = sample size
- Sx = standard deviation of x scores
- Sy = standard deviation of y scores
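The Pearson formula (covariance divided by the product of the two standard deviations) and the rule of thumb for size can be sketched in Python as follows. The function names and the age/distress values are hypothetical and are not the study data. Because the numerator divides by n, the sketch also uses n (not n - 1) in the standard deviations, which is an assumption needed to keep r within -1 and +1.

```python
import math


def pearson_r(x, y):
    """Pearson r as covariance / (S_x * S_y), all with an n denominator."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    covariance = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y)) / n
    s_x = math.sqrt(sum((a - mean_x) ** 2 for a in x) / n)
    s_y = math.sqrt(sum((b - mean_y) ** 2 for b in y) / n)
    return covariance / (s_x * s_y)


def describe_size(r):
    """Apply the lecture's rule of thumb to the absolute value of r."""
    size = abs(r)
    if size < 0.30:
        return "weak"
    if size < 0.60:
        return "moderate"
    return "strong"


# Hypothetical carers: age in years and distress score.
age = [25, 32, 40, 47, 55, 63, 70]
distress = [35, 28, 30, 22, 18, 20, 10]

r = pearson_r(age, distress)
direction = "negative" if r < 0 else "positive"
print(f"r = {r:.2f} ({direction}, {describe_size(r)})")
print("reported r = -0.53 is", describe_size(-0.53))
```

The sign of r gives the direction and its absolute value gives the size, which is how the reported -0.53 comes out as a moderate negative correlation.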

Removal of an outlier leads to an increase in correlation
* Top line of the equation = covariance

How does this relate to the strong vs. weak correlations graphs?

5 important points about correlations

1. Note that SPSS reports that our correlation is significant (p < 0.0005). Our hypotheses here are:
- H0: ρ = 0 (where rho (ρ) is the population correlation): no relationship between the variables
- H1: ρ ≠ 0
So this significance is telling us that, with this sample size, we can conclude that the population correlation is not equal to zero. With large sample sizes even very weak correlations can be shown to be different from zero, so don't misinterpret the significance to mean that the correlation is important. It is just different from zero.

2. If asked, SPSS will always calculate a (linear) correlation, even when it is inappropriate to do so. Always inspect the bivariate scatterplot first to determine that a linear correlation is appropriate (particularly checking for gaps, outliers and nonlinearity).

3. r = 0.00 doesn't always mean no relationship. It means no linear correlation. In the example plot there is a strong relationship, but it is not linear, so the linear r would equal 0.

4. Always report the ranges of X and Y. We don't know what happens to the relationship beyond the range of our data.

5. Correlation does not imply causation
- The correlation between the number of fire trucks called to a fire and the damage done by the fire is very strong
- The correlation between the number of nesting storks and the number of births = 0.72 (Danish research).
- But fire trucks don't cause fire damage, storks don't bring babies, and there is no causal connection between beer drunk and cars on the bridge! There might be a 3rd variable.
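Points 1 and 3 can be illustrated with a small Python sketch. The t statistic used here is the standard test of H0: ρ = 0 (with n - 2 degrees of freedom); it is not taken from the notes, and the sample sizes and data are hypothetical. No p-values are computed, because that would need a t-distribution table or an extra library.

```python
import math


def t_statistic(r, n):
    """t statistic for testing H0: rho = 0, with n - 2 degrees of freedom."""
    return r * math.sqrt((n - 2) / (1 - r ** 2))


def pearson_r(x, y):
    """Pearson correlation for two equal-length lists of numbers."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


# Point 1: the same weak correlation gives very different test statistics
# depending on the sample size.
t_small_sample = t_statistic(0.10, 30)
t_large_sample = t_statistic(0.10, 1000)
print(f"r = 0.10, n = 30:   t = {t_small_sample:.2f}")
print(f"r = 0.10, n = 1000: t = {t_large_sample:.2f}")

# Point 3: a perfect U-shaped relationship has a linear correlation of zero.
x = [-3, -2, -1, 0, 1, 2, 3]
y = [value ** 2 for value in x]
r_u_shape = pearson_r(x, y)
print(f"U-shaped data: r = {r_u_shape:.2f}")
```

A larger absolute t means stronger evidence against H0, so a weak r can be "significant" in a large sample without being important, while the U-shaped data show why the scatterplot has to be inspected before trusting r.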