Exploring Data and Graphics Rick White Department of Statistics, UBC Graduate Pathways to Success Graduate & Postdoctoral Studies November 13, 2013
Outline
  Summarizing Data
  Types of Data
  Visualizing Data
  Questions
Moments
  Quantitative measures of the shape of a set of points
  1st raw moment: mean (center)
  2nd central moment: variance (width)
    Square root of the variance is the standard deviation
  3rd central moment (standardized): skewness (lopsidedness)
  4th central moment (standardized): kurtosis (how heavy the tails are)
  Mixed moments are co-moments (e.g. covariance)
    Covariance is standardized to correlation
  Moments are susceptible to outliers
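The last three points above can be illustrated with a minimal Python sketch. The helper names and the data are hypothetical, and the divisor is the number of observations (the population convention), which is an assumption rather than something the slides specify. It standardizes a covariance to a correlation and shows how a single outlier can flip the sign.

```python
import math

def mean(xs):
    """Arithmetic mean (1st raw moment)."""
    return sum(xs) / len(xs)

def covariance(xs, ys):
    """Mixed 2nd central moment, dividing by n (assumed convention)."""
    mx, my = mean(xs), mean(ys)
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / len(xs)

def correlation(xs, ys):
    """Covariance divided by the product of the two standard deviations."""
    return covariance(xs, ys) / math.sqrt(covariance(xs, xs) * covariance(ys, ys))

# Hypothetical data: y is exactly 2 * x.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 4.0, 6.0, 8.0, 10.0]

# Same data with the last y value replaced by an outlier.
y_outlier = [2.0, 4.0, 6.0, 8.0, -40.0]

r_clean = correlation(x, y)            # perfectly linear relationship
r_outlier = correlation(x, y_outlier)  # one point changes the picture

print(r_clean, r_outlier)
```

One changed observation out of five is enough to turn a perfect positive correlation into a negative one.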
Computing Moments
  Computing the kth raw moment:
    Raise each observation to the kth power
    Sum the resulting values
    Divide by the number of observations
  Computing the kth central moment:
    Compute the mean of the data
    Subtract the mean from each observation
    Proceed as for a raw moment
  Standardized kth moment (k > 2):
    Divide the kth central moment by the standard deviation (square root of the variance) raised to the kth power
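The steps above can be sketched in a few lines of Python. The function names and the data are hypothetical, and dividing by the number of observations follows the recipe on this slide (many software packages divide by n - 1 for the variance instead).

```python
def raw_moment(data, k):
    """kth raw moment: raise to the kth power, sum, divide by n."""
    return sum(x ** k for x in data) / len(data)

def central_moment(data, k):
    """kth central moment: subtract the mean, then proceed as for a raw moment."""
    m = raw_moment(data, 1)
    return raw_moment([x - m for x in data], k)

def standardized_moment(data, k):
    """kth central moment divided by the standard deviation to the kth power."""
    sd = central_moment(data, 2) ** 0.5
    return central_moment(data, k) / sd ** k

# Hypothetical data.
data = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]

data_mean = raw_moment(data, 1)
variance = central_moment(data, 2)
skewness = standardized_moment(data, 3)
# Subtracting 3 gives "excess" kurtosis, which is 0 for a normal distribution.
excess_kurtosis = standardized_moment(data, 4) - 3

print(data_mean, variance, skewness, excess_kurtosis)
```

For this data the mean is 5 and the variance is 4; the positive skewness reflects the longer right tail.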
The different means
  The arithmetic mean (or simply "mean") of a sample is the sum of the sampled values divided by the number of items in the sample.
  The geometric mean is an average that is useful for sets of positive numbers.
    Take the nth root of the product of the n data points.
    Or log the data, compute the arithmetic mean, and exponentiate the result.
  There are many other types of means.
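As a small illustration, the two routes to the geometric mean described above can be compared in Python. The function names and data are hypothetical; both routes assume strictly positive values.

```python
import math

def geometric_mean_root(data):
    """nth root of the product of the n data points."""
    return math.prod(data) ** (1 / len(data))

def geometric_mean_log(data):
    """Log the data, take the arithmetic mean, exponentiate the result."""
    return math.exp(sum(math.log(x) for x in data) / len(data))

# Hypothetical positive data.
data = [1.0, 4.0, 16.0]

arithmetic = sum(data) / len(data)
gm_root = geometric_mean_root(data)
gm_log = geometric_mean_log(data)

print(arithmetic, gm_root, gm_log)
```

The two geometric means agree up to floating-point rounding (about 4), while the arithmetic mean of the same data is 7. The log route is usually the safer choice for long data sets, because the raw product can overflow or underflow.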
Quantiles
  Points that split the ordered data into equal-sized groups
  2 groups: median (50%)
  3 groups: terciles (33%)
  4 groups: quartiles (25%)
  5 groups: quintiles (20%)
  10 groups: deciles (10%)
  100 groups: percentiles (1%)
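A minimal sketch of these cut points, using `statistics.quantiles` from the Python standard library (Python 3.8 or later). The data are hypothetical. Splitting into n groups yields n - 1 cut points, and the result depends on the estimation method: `method="exclusive"` is the library default and `"inclusive"` is the alternative it offers.

```python
import statistics

# Hypothetical data: the integers 1 to 100.
data = list(range(1, 101))

median = statistics.median(data)
quartiles = statistics.quantiles(data, n=4)      # 3 cut points
deciles = statistics.quantiles(data, n=10)       # 9 cut points
percentiles = statistics.quantiles(data, n=100)  # 99 cut points

# A different estimation method can give slightly different cut points.
quartiles_inclusive = statistics.quantiles(data, n=4, method="inclusive")

print(median, quartiles, quartiles_inclusive)
```

The middle quartile coincides with the median, as expected.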
Wikipedia
  A good source of information
  Website: http://en.wikipedia.org/wiki
    Moment (mathematics)
    Mean
    Variance
    Quantiles
      Many ways to estimate
Example Distributions
  Distribution        Mean  Variance  Skewness  Kurtosis (excess)
  Normal               3.0       9.0      0.00               0.00
  Negative Binomial    3.0       9.0      1.67               4.12
  Gamma                3.0       9.0      2.00               6.00
  Distribution           1%    25%    50%    75%    90%    99%
  Normal              -3.98   0.98   3.00   5.02   6.84   9.98
  Negative Binomial    0.00   1.00   2.00   4.00   7.00  13.00
  Gamma                0.03   0.86   2.07   4.16   6.91  13.81
Tabulating Data
  Only useful if the data take on relatively few distinct values.
  Data can be binned to reduce the number of distinct values.
  Count the number of items in each bin.
  To express as a percentage: divide the number of items in each bin by the total number of items and multiply by 100.
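The binning and percentage steps can be sketched as follows. The function name, the bin width, and the ages are all hypothetical; each bin is labelled by its lower edge.

```python
from collections import Counter

def tabulate(values, bin_width):
    """Count items per bin and express each count as a percentage of the total."""
    # Map every value to the lower edge of its bin, then count.
    counts = Counter((v // bin_width) * bin_width for v in values)
    total = len(values)
    return {edge: (n, 100 * n / total) for edge, n in sorted(counts.items())}

# Hypothetical data: ages binned into decades.
ages = [21, 25, 34, 37, 38, 42, 45, 51, 58, 63]
table = tabulate(ages, 10)

for edge, (count, percent) in table.items():
    print(f"{edge}-{edge + 9}: {count} ({percent:.0f}%)")
```

Ten distinct ages collapse into five bins, which is few enough to read as a table.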
Types of Data
  Nominal scale
    Differentiated by label, with no logical order
  Ordinal scale
    Differentiated by rank order: data can be sorted
  Interval scale
    Numeric data with an arbitrarily defined zero
  Ratio scale
    Numeric data with a meaningful zero
Nominal scale
  Gender, ethnicity, species, genre
  Moments are undefined
  Arithmetic is meaningless (+, -, *, /)
  Quantiles are undefined
  The only logical operation is equality
  Central tendency: mode (most common value)
    Generally not of much value
  Data can be tabulated
Ordinal scale
  Likert scales; descriptive size
  Moments are undefined
  Arithmetic is meaningless (+, -, *, /)
  Quantiles are defined
  Logical operations: =, <, >=, etc.
  Central tendency: median, mode
    Generally not of much value
  Data can be tabulated
Interval scale
  Celsius, date of event
  Moments and quantiles are defined
  Central tendency: mean (arithmetic), median
  Dispersion: standard deviation, inter-quartile range
  Data usually cannot be tabulated without some manipulation
  Ratio of observations is meaningless, but ratio of differences is interpretable
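The last point can be checked with a short worked example. The temperatures are hypothetical; the Celsius-to-Fahrenheit conversion is the standard one. Because the two scales place zero in different spots, the ratio of two readings changes with the scale, while the ratio of two differences does not.

```python
def celsius_to_fahrenheit(c):
    """Standard conversion between two interval scales."""
    return c * 9 / 5 + 32

# Hypothetical temperature readings in Celsius.
t1, t2, t3 = 10.0, 20.0, 30.0
f1, f2, f3 = (celsius_to_fahrenheit(t) for t in (t1, t2, t3))

# Ratio of observations: depends on where zero happens to be.
ratio_celsius = t2 / t1
ratio_fahrenheit = f2 / f1

# Ratio of differences: the arbitrary zero cancels out.
diff_ratio_celsius = (t3 - t2) / (t2 - t1)
diff_ratio_fahrenheit = (f3 - f2) / (f2 - f1)

print(ratio_celsius, ratio_fahrenheit)
print(diff_ratio_celsius, diff_ratio_fahrenheit)
```

20 degrees is "twice" 10 degrees only in Celsius (in Fahrenheit the ratio is 68/50), so the statement carries no physical meaning. The ratio of differences is 1 on both scales.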
Ratio scale
  Height, length, duration
  Moments and quantiles are defined
  Central tendency: arithmetic mean, geometric mean, median
  Dispersion: standard deviation, inter-quartile range, coefficient of variation
  Data usually cannot be tabulated without some manipulation
  Ratio of observations is meaningful
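A brief sketch of the coefficient of variation (standard deviation divided by the mean), which only makes sense when zero is meaningful. The heights are hypothetical, and the population standard deviation (`statistics.pstdev`) is an assumed choice.

```python
import statistics

def coefficient_of_variation(data):
    """Standard deviation relative to the mean; unit-free on a ratio scale."""
    return statistics.pstdev(data) / statistics.mean(data)

# Hypothetical heights, recorded in centimetres and converted to metres.
heights_cm = [150.0, 160.0, 170.0, 180.0, 190.0]
heights_m = [h / 100 for h in heights_cm]

cv_cm = coefficient_of_variation(heights_cm)
cv_m = coefficient_of_variation(heights_m)

print(cv_cm, cv_m)
```

Changing the unit rescales the mean and the standard deviation by the same factor, so the coefficient of variation is unchanged.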
Graphics
  The purpose of a graphic is to communicate information in a clear and effective manner.
  A graphic is more effective at conveying information than a table of numbers.
  A graphic need not be complex to be effective.
  If something is present on a graphic, it should serve a purpose.
One Variable
  Categorical variables (ordinal or nominal)
    Pie charts (not recommended)
      The eye is good at judging linear measures and bad at judging relative areas.
    Barcharts or dotplots
  Numeric variables (interval or ratio)
    Barchart with error bars (not recommended)
      Too much information is lost
    Boxplots, histograms, density plots
Pie Charts: Are any the same?
Barcharts: Are any the same?
Barchart with errors
  Barchart for numeric data
  Bar height represents the mean.
  Whisker represents the SD.
  Is the data the same?
Box and Whiskers plot
  Boxplots
  Center line: median
  Box: Q1 to Q3
  Whiskers: max and min
  Each has the same mean and SD, but is the data the same?
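The question on this slide can be explored numerically. The sketch below computes the five numbers a boxplot draws (min, Q1, median, Q3, max) for two hypothetical samples constructed to share the same mean and SD. The function name is illustrative, and the `"inclusive"` quantile method is one assumed choice among several.

```python
import statistics

def five_number_summary(data):
    """Min, Q1, median, Q3, max: the values drawn by a box and whiskers plot."""
    q1, q2, q3 = statistics.quantiles(data, n=4, method="inclusive")
    return (min(data), q1, q2, q3, max(data))

# Two hypothetical samples with identical mean (10) and variance (2).
sample_a = [8, 9, 9, 11, 11, 12]   # symmetric
sample_b = [9, 9, 9, 10, 10, 13]   # right-skewed

mean_a, mean_b = statistics.mean(sample_a), statistics.mean(sample_b)
sd_a, sd_b = statistics.pstdev(sample_a), statistics.pstdev(sample_b)

summary_a = five_number_summary(sample_a)
summary_b = five_number_summary(sample_b)

print(mean_a, sd_a, summary_a)
print(mean_b, sd_b, summary_b)
```

A bar with an SD whisker would look identical for both samples, while the boxplot summaries differ in median, box, and whiskers.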
Histogram and Density Plots
Two variables
  Both categorical
    Stacked barchart (not recommended)
    Side by side barchart
  One categorical, one numeric
    Side by side boxplots
    Histogram or density plot within each category
  Both numeric
    Scatterplot
Stacked Barchart Compare the relative size of each colour category within each bar and between the different bars. Are the different sizes obvious?
Side by side barchart Can you tell the difference if we make it a side by side bar chart?
Side by side boxplot
Adding more information
Same Scale Histogram/Density
Scatterplots Scatterplots show the type of relationship that exists between two numeric variables. This is an example of a linear relationship.
Other Relationships
Other Relationships
Outlier Detection
Overplotting
  Scatterplots with too many points can appear as a big blob of ink.
Overplotting option
Demonstration in R
  R is a free software programming language and a software environment for statistical computing and graphics.
    www.r-project.org
  RStudio IDE is a powerful and productive user interface for R. It's free and open source, and works great on Windows, Mac, and Linux.
    www.rstudio.com/
  www.stat.ubc.ca/~rickw/rw2013-11-13.html
Questions