Basic Procedure for Histograms

Adela Summers
5 years ago
1 Basic Procedure for Histograms
1. Compute the range of the observations (maximum value minus minimum value)
2. Choose an initial number of classes (most likely based on the range of values; try to find a number of classes that divides evenly into the range and still falls within the 6 to 12 class guideline)
3. Compute the class interval = range / number of classes, then round the result to the nearest convenient number (preferably an integer, adjusting as necessary)
4. Select a starting value for the classes that is less than or equal to the lowest value in the observations

2 Basic Procedure for Histograms
5. Adjust the range, width, and starting point if necessary
6. Compute the midpoint of each class (this is particularly useful if plotting a line-plot histogram rather than the bar-chart variety)
7. The actual bounds will depend on the precision and accuracy of the data (e.g. class limits of 1-2, 3-4, etc. might have actual limits of 0.5-2.5, 2.5-4.5, etc. because we have rounded)
8. Plot the data
At this point, you're not likely to be required to create a histogram by hand very often (as it is easily done using software), but it's good to know the theory
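The histogram procedure can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `histogram_classes` and the sample data are hypothetical, the class interval is rounded up to an integer, and an extra class is added if rounding leaves the maximum outside the last class.

```python
import math

def histogram_classes(values, n_classes):
    """Return (start, width, counts, midpoints) for equal-width classes."""
    lo, hi = min(values), max(values)
    # Step 3: class interval = range / number of classes, rounded up to an integer
    width = math.ceil((hi - lo) / n_classes) or 1
    # Step 4: starting value less than or equal to the lowest observation
    start = math.floor(lo)
    # Step 5: adjust the number of classes if the maximum falls outside the last one
    while start + width * n_classes <= hi:
        n_classes += 1
    counts = [0] * n_classes
    for v in values:
        counts[int((v - start) // width)] += 1
    # Step 6: midpoint of each class
    midpoints = [start + width * (i + 0.5) for i in range(n_classes)]
    return start, width, counts, midpoints

data = [3, 7, 8, 12, 15, 15, 18, 21, 22, 26, 29, 33, 35, 38, 41]  # hypothetical
start, width, counts, midpoints = histogram_classes(data, 6)
print(start, width, counts, midpoints)
```

Each class here is treated as closed on the left and open on the right, which is one common convention; the slides do not say which convention they intend.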

3 Simple Descriptive Statistics
Descriptive statistics provide an organization and summary of a dataset
A small number of summary measures replaces the entirety of a dataset
You're likely already familiar with some simple descriptive summary measures:
1. Ratios
2. Proportions
3. Percentages
4. Rates of Change
5. (Location Quotients)

4 Simple Descriptive Summary Measures
1. Ratios
$$\text{Ratio} = \frac{\text{\# of observations in A}}{\text{\# of observations in B}}$$
e.g. A = 6 overcast days, B = 24 mostly cloudy days, so the ratio is 6/24 = 0.25
2. Proportions
Relate one part or category of data to the entire set of observations, e.g. a box of marbles that contains 4 yellow, 6 red, 5 blue, and 2 green gives a yellow proportion of 4/17, or in general:
a_color = {yellow, red, blue, green}
a_count = {4, 6, 5, 2}
$$\text{proportion}_i = \frac{a_i}{\sum_i a_i}$$

5 Simple Descriptive Summary Measures
2. Proportions cont.
Sum of all proportions = 1
These are useful for comparing two sets of data with different sizes and category counts, e.g. a different box of marbles gives a yellow proportion of 2/23, and in order for this to be a reasonable comparison we need to know the totals for both samples
3. Percentages
Calculated as proportion x 100, e.g. 2/23 = 8.696%; use of these should be restricted to larger sample sizes, perhaps 20+ observations

6 Simple Descriptive Summary Measures
4. Rates of Change
Expressing the change in a variable with respect to its original value:
$$Z = \frac{x(t_2) - x(t_1)}{x(t_1)} = \frac{\text{change in } x}{\text{original value of } x}$$
e.g. if we had 20 marbles and then added 10, the rate of change = (30-20)/20 = 10/20 = 0.5
5. Location Quotients
An index of relative concentration in space, a comparison of a region's share of something to the total

7 Simple Descriptive Summary Measures
5. Location Quotients cont.
For example, suppose we have a region of 1000 km² which we subdivide into three smaller areas of 200, 300, and 500 km² respectively (labeled A, B, and C)
The region has an influenza outbreak with 150 cases in area A, 100 in area B, and 350 in area C (a total of 600 flu cases):

Area   Proportion of Area   Proportion of Cases   Location Quotient
A      200/1000 = 0.2       150/600 = 0.25        0.25/0.2 = 1.25
B      300/1000 = 0.3       100/600 = 0.167       0.167/0.3 = 0.56
C      500/1000 = 0.5       350/600 = 0.583       0.583/0.5 = 1.17

Location Quotient = Prop. of Cases / Prop. of Area
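The location quotient calculation for the influenza example can be reproduced with a short script. This is a minimal sketch using the areas and case counts given above; the variable names are illustrative. Carried at full precision, area B comes out at about 0.56 (rounding its proportion of cases to 0.17 first gives 0.57).

```python
areas = {"A": 200, "B": 300, "C": 500}   # km^2
cases = {"A": 150, "B": 100, "C": 350}   # flu cases

total_area = sum(areas.values())     # 1000
total_cases = sum(cases.values())    # 600

# Location quotient = proportion of cases / proportion of area
location_quotient = {
    name: (cases[name] / total_cases) / (areas[name] / total_area)
    for name in areas
}

for name, lq in location_quotient.items():
    print(f"{name}: LQ = {lq:.2f}")
```

A quotient above 1 (areas A and C) indicates a larger share of cases than of area; a quotient below 1 (area B) indicates a smaller share.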

8 Simple Descriptive Statistics
These are ways to summarize a set of numbers quickly and accurately
The most common way of describing the distribution of a variable is in terms of two of its properties:
Central tendency describes the central value of the distribution, around which the observations cluster
Dispersion describes how the observations are spread around that central value
First we'll look at measures of central tendency

9 Measures of Central Tendency - Review
1. Mode: the most frequently occurring value in the distribution
2. Median: the value of a variable such that half of the observations are above and half are below it, i.e. this value divides the distribution into two groups of equal size
3. Mean: a.k.a. the average, the most commonly used measure of central tendency

10 Measures of Central Tendency - Review
1. Mode: the most frequently occurring value in the distribution
Procedure for finding the mode of a data set:
1) Sort the data, putting the values in ascending order
2) Count the instances of each value (if this is continuous data with a high degree of precision and many decimal places, this may be quite tedious)
3) Find the value that has the most occurrences; this is the mode (if more than one value occurs an equal number of times and these counts exceed all others, we have multiple modes)
Use the mode for multi-modal or nominal data sets

11 Measures of Central Tendency - Review
2. Median: ½ of the values are above and ½ are below this value
Procedure for finding the median of a data set:
1) Sort the data, putting the values in ascending order
2) Find the value with an equal number of values above and below it (if there is an even number of values, you will need to average two values together):
Odd number of observations: take the value [(n-1)/2]+1 places from the lowest, e.g. n=19 gives [(19-1)/2]+1 = the 10th value
Even number of observations: average the (n/2)th and [(n/2)+1]th values, e.g. n=20 means averaging the 10th and 11th values
Use the median with asymmetric distributions, when you suspect outliers are present, or with ordinal data

12 Measures of Central Tendency - Review
3. Mean: a.k.a. the average, the most commonly used measure of central tendency
Procedure for finding the mean of a data set:
1) Sum all the values in the data set
2) Divide the sum by the number of values in the data set
$$\bar{x} = \frac{\sum_{i=1}^{n} x_i}{n}$$
Use the mean when you have interval or ratio data sets with a large sample size, few (or no?) outliers, and a reasonably symmetric unimodal distribution
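The three procedures above can be sketched together in Python. The data set is hypothetical (the value 40 is included to act as an outlier), and `statistics.multimode` from the standard library is used for the mode so that ties are reported as multiple modes.

```python
import statistics

data = [2, 3, 3, 5, 7, 8, 8, 9, 12, 40]   # hypothetical observations

# Mode: every value tied for the highest count
modes = statistics.multimode(data)

# Median: sort, then take the middle value or average the two middle values
ordered = sorted(data)
n = len(ordered)
if n % 2 == 1:
    median = ordered[(n - 1) // 2]          # the [(n-1)/2]+1 th value (zero-indexed)
else:
    median = (ordered[n // 2 - 1] + ordered[n // 2]) / 2   # n/2 th and (n/2)+1 th values

# Mean: sum of the values divided by the number of values
mean = sum(data) / n

print(modes, median, mean)
```

With these values the outlier pulls the mean (9.7) well above the median (7.5), which is the situation in which the slides recommend the median.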

13 Measures of Central Tendency - Mean
3. Mean cont.
We can also calculate a weighted mean using some weighting factor:
$$\bar{x} = \frac{\sum_{i=1}^{n} w_i x_i}{\sum_{i=1}^{n} w_i} \qquad \text{(weighted mean)}$$
e.g. What is the average income of all people in cities A, B, and C:

City   Avg. Income   Population
A      $23,          ,000
B      $20,000       50,000
C      $25,          ,000

Here, population is the weighting factor $w_i$ and the average income is the variable of interest $x_i$

14 Measures of Central Tendency - Mean
3. Mean cont.
We can also calculate a grouped mean using the midpoints and frequencies of groups:
$$\bar{x} = \frac{\sum_{i=1}^{n} m_i f_i}{\sum_{i=1}^{n} f_i} \qquad \text{(grouped mean)}$$
e.g. Suppose we had grouped some data in a frequency table and wanted to calculate the grouped mean:
Group   Freq   Midpoint
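The weighted mean and the grouped mean share the same form, so a single helper covers both. This is a sketch only: the function name `weighted_mean` is illustrative, and the incomes, populations, midpoints, and frequencies below are hypothetical values, not the figures from the slides' tables.

```python
def weighted_mean(values, weights):
    """Sum of w_i * x_i divided by the sum of w_i."""
    return sum(w * x for x, w in zip(values, weights)) / sum(weights)

# Hypothetical city data: average incomes (x_i) weighted by population (w_i)
incomes = [22_000, 20_000, 26_000]
populations = [100_000, 50_000, 150_000]
city_mean = weighted_mean(incomes, populations)

# Hypothetical frequency table: class midpoints (m_i) weighted by frequency (f_i)
midpoints = [5, 15, 25, 35]
frequencies = [4, 10, 5, 1]
grouped_mean = weighted_mean(midpoints, frequencies)

print(round(city_mean, 2), grouped_mean)
```

Note that a simple average of the three city incomes would ignore how many people live in each city, which is why the weighting matters.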

15 Measures of Central Tendency - Mean
3. Mean cont.
A standard geographic application of the mean is to locate the center (a.k.a. centroid) of a spatial distribution by assigning to each member of the spatial distribution a gridded coordinate and calculating the mean value in each coordinate direction
Bivariate mean or mean center
For a set of (x, y) coordinates, the mean center $(\bar{x}, \bar{y})$ is computed using:
$$\bar{x} = \frac{\sum_{i=1}^{n} x_i}{n} \qquad \bar{y} = \frac{\sum_{i=1}^{n} y_i}{n}$$

16 Measures of Central Tendency - Mean
3. Mean cont.
We can also calculate a weighted mean center in much the same way, but by using weights
For a set of (x, y) coordinates, the weighted mean center $(\bar{x}, \bar{y})$ is computed using:
$$\bar{x} = \frac{\sum_{i=1}^{n} w_i x_i}{\sum_{i=1}^{n} w_i} \qquad \bar{y} = \frac{\sum_{i=1}^{n} w_i y_i}{\sum_{i=1}^{n} w_i}$$
e.g. suppose we had the centroids and areas of 3 polygons
Here we weight by area, but other weightings are possible
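As a rough illustration of the mean center and weighted mean center formulas, the sketch below uses three hypothetical polygon centroids weighted by hypothetical areas. The function name `mean_center` is an assumption made for this example.

```python
def mean_center(points, weights=None):
    """Return the (weighted) mean center of a list of (x, y) points."""
    if weights is None:
        weights = [1] * len(points)      # unweighted case: every point counts equally
    total = sum(weights)
    x_bar = sum(w * x for (x, _), w in zip(points, weights)) / total
    y_bar = sum(w * y for (_, y), w in zip(points, weights)) / total
    return x_bar, y_bar

centroids = [(2.0, 3.0), (6.0, 1.0), (4.0, 8.0)]   # hypothetical polygon centroids
areas = [10.0, 30.0, 60.0]                         # hypothetical polygon areas

unweighted = mean_center(centroids)
weighted = mean_center(centroids, areas)
print(unweighted, weighted)
```

The weighted center shifts toward the third polygon because it has the largest area.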

17 Measures of Dispersion
In addition to measures of central tendency, we can also summarize data by characterizing its variability
Measures of dispersion are concerned with the distribution of values around the mean in data:
1. Range
2. Quartile range etc.
3. Mean deviation
4. Variance, standard deviation and z-scores
5. Coefficient of variation

18 Measures of Dispersion - Range
1. Range: this is the most simply formulated of all measures of dispersion
Given a set of measurements $x_1, x_2, x_3, \ldots, x_{n-1}, x_n$, the range is defined as the difference between the largest and smallest values:
$$\text{Range} = x_{max} - x_{min}$$
This is another descriptive measure that is vulnerable to the influence of outliers in a data set, which result in a range that is not really descriptive of most of the data

19 Measures of Dispersion - Quartile Range etc.
2. Quartile Range etc.
We can divide distributions into a number of parts, each containing an equal number of observations:
Quartiles: each contains 25% of all values
Quintiles: each contains 20% of all values
Deciles: each contains 10% of all values
Percentiles: each contains 1% of all values
A standard application of this approach for describing dispersion involves calculating the interquartile range (a.k.a. quartile deviation)

20 Measures of Dispersion - Quartile Range etc.
2. Quartile Range etc. cont.
Rogerson (p. 6) defines the interquartile range as the difference between the values of the 25th and 75th percentiles (i.e. the minimum value of the 2nd quartile and the maximum value of the 3rd quartile)
This is well suited to skewed distributions, since it measures deviation around the median
The interquartile range provides 2 of the 5 values displayed in a box plot, which is a convenient graphical summary of a data set

[Figure: box plot with the max. value, 75th percentile, median, 25th percentile, and min. value labeled. Rogerson, p. 8.]

21 Measures of Dispersion - Quartile Range etc.
2. Quartile Range etc. cont.
A box plot graphically displays the following five values:
Median
Minimum value
Maximum value
25th percentile value
75th percentile value
[Figure: box plot with the max., 75th %-ile, median, 25th %-ile, and min. labeled. Source: Rogerson.]
Under some circumstances, the whiskers are not used for the min. and max. because of outliers
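The five box-plot values and the interquartile range can be computed with Python's standard library. This is a sketch with hypothetical data; note that software packages differ in how they interpolate percentiles, and the `"inclusive"` method of `statistics.quantiles` used here is only one such convention, not necessarily the one Rogerson uses.

```python
import statistics

data = sorted([7, 15, 1, 9, 4, 12, 3, 10, 6])   # hypothetical observations

# Cut points dividing the data into four equal parts:
# 25th percentile, median, 75th percentile
q1, median, q3 = statistics.quantiles(data, n=4, method="inclusive")

five_number_summary = (data[0], q1, median, q3, data[-1])
interquartile_range = q3 - q1

print(five_number_summary, interquartile_range)
```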

22 Measures of Dispersion - Mean Deviation
3. Mean Deviation
Once we have calculated the mean value for a data set, we can assess the difference between any observation and that mean, and this is termed the statistical distance:
$$\text{Statistical distance} = x_i - \bar{x}$$
If we take the absolute values of these, sum them over all observations, and divide by the number of observations, we have calculated the mean deviation:
$$\text{Mean deviation} = \frac{\sum_{i=1}^{n} |x_i - \bar{x}|}{n}$$

23 Measures of Dispersion - Mean Deviation
3. Mean Deviation cont.
Why is it necessary to take absolute values of the statistical distances $(x_i - \bar{x})$ before summing them to get the mean deviation?
Because the statistical distances would be both positive and negative, and when summed using the mean deviation formula (without absolute values), they would sum to zero
$$\text{Mean deviation} = \frac{\sum_{i=1}^{n} |x_i - \bar{x}|}{n}$$
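A short numerical check makes the point about absolute values concrete. The data below are hypothetical; the sketch computes the signed statistical distances, shows that they sum to zero, and then computes the mean deviation.

```python
data = [4, 8, 6, 5, 3, 10]     # hypothetical observations
n = len(data)
mean = sum(data) / n           # 36 / 6 = 6.0

distances = [x - mean for x in data]      # statistical distances x_i - mean
signed_sum = sum(distances)               # positives and negatives cancel
mean_deviation = sum(abs(d) for d in distances) / n

print(distances, signed_sum, mean_deviation)
```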

24 Measures of Dispersion - Review
4. Standard Deviation
Standard deviation is calculated by taking the square root of variance:
$$\sigma = \sqrt{\frac{\sum_{i=1}^{N} (x_i - \mu)^2}{N}} \qquad \text{(population standard deviation)}$$
$$S = \sqrt{\frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}} \qquad \text{(sample standard deviation)}$$
Why do we prefer standard deviation over variance as a measure of dispersion?
Because its magnitude and units match those of the values and their mean.

25 Measures of Dispersion - Review
4. Standard Deviation
This is the most frequently used measure of dispersion because it has the same units as the values and their mean (unlike variance)
Procedure for finding the standard deviation of a data set:
1) Calculate the mean
2) Calculate the statistical distances $(x_i - \bar{x})$ for each value
3) Square each of the statistical distances, $(x_i - \bar{x})^2$
4) Sum the squared statistical distances; this is the sum of squares
5) Divide the sum of squares by N for a population or by (n-1) for a sample; this gives you the variance
6) Take the square root of the variance to get the standard deviation
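The six steps above map directly onto code. The sketch below uses hypothetical data and cross-checks the hand-computed results against `statistics.pstdev` (population) and `statistics.stdev` (sample) from the Python standard library.

```python
import math
import statistics

data = [2, 4, 4, 4, 5, 5, 7, 9]                  # hypothetical observations
n = len(data)

mean = sum(data) / n                              # step 1
distances = [x - mean for x in data]              # step 2
squared = [d ** 2 for d in distances]             # step 3
sum_of_squares = sum(squared)                     # step 4
population_variance = sum_of_squares / n          # step 5, dividing by N
sample_variance = sum_of_squares / (n - 1)        # step 5, dividing by n - 1
population_sd = math.sqrt(population_variance)    # step 6
sample_sd = math.sqrt(sample_variance)            # step 6

# Cross-check against the standard library
print(population_sd, statistics.pstdev(data))
print(sample_sd, statistics.stdev(data))
```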

26 Measures of Dispersion - Review
4. Z-scores
These express the difference of an individual value from the mean in terms of standard deviations, and thus can be compared to z-scores drawn from other data sets or distributions
Procedure for finding the z-score of an observation:
1) Calculate the mean
2) Calculate the statistical distance $(x_i - \bar{x})$ for each value for which we wish to find the z-score
3) Calculate the standard deviation
4) Calculate the z-score using the formula
$$\text{Z-score} = \frac{x - \bar{x}}{S}$$

27 Measures of Dispersion - Review
5. Coefficient of Variation
This is an overall measure of dispersion that is normalized with respect to the mean of the same distribution, and thus is comparable to coefficients of variation from other data sets
Procedure for finding the coefficient of variation for a data set:
1) Calculate the mean
2) Calculate the standard deviation
3) Calculate the coefficient of variation using the formula
$$\text{Coefficient of variation} = \frac{S}{\bar{x}} \ \text{ or } \ \frac{\sigma}{\mu} \quad (\times 100\%)$$
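Z-scores and the coefficient of variation both build on the mean and standard deviation, so they can be sketched together. The data are hypothetical and the sample standard deviation (n - 1 denominator) is assumed.

```python
import statistics

data = [2, 4, 4, 4, 5, 5, 7, 9]        # hypothetical observations
mean = statistics.mean(data)
s = statistics.stdev(data)             # sample standard deviation (n - 1 denominator)

# Z-score: statistical distance expressed in standard deviations
z_scores = [(x - mean) / s for x in data]

# Coefficient of variation: standard deviation normalized by the mean
coefficient_of_variation = s / mean    # multiply by 100 for a percentage

print([round(z, 2) for z in z_scores], round(coefficient_of_variation, 3))
```

Because both quantities are unitless, they can be compared across data sets measured on different scales.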

28 Skewness and Kurtosis - Review
1. Skewness: this statistic measures the degree of asymmetry exhibited by the data (i.e. whether there are more observations on one side of the mean than the other)
2. Kurtosis: this statistic measures the degree to which the distribution is flat or peaked

29 Skewness and Kurtosis - Review
1. Skewness
This statistic measures the degree of asymmetry exhibited by the data (i.e. whether there are more observations on one side of the mean than the other):
$$\text{Skewness} = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^3}{n s^3}$$
Because the exponent in this moment is odd, skewness can be positive or negative; positive skewness has more observations below the mean than above it (negative vice-versa)

30 Skewness and Kurtosis - Review
1. Skewness
This statistic measures the degree of asymmetry exhibited by the data
Procedure for finding the skewness of a data set:
1) Calculate the mean
2) Calculate the statistical distances $(x_i - \bar{x})$ for each value
3) Cube each of the statistical distances, $(x_i - \bar{x})^3$
4) Sum the cubed statistical distances; this sum of cubes is the numerator in the skewness formula
5) Divide the sum of cubes by the sample size multiplied by the standard deviation cubed (i.e. the denominator is $n s^3$ in $[\sum (x_i - \bar{x})^3] / [n s^3]$)

31 Skewness and Kurtosis - Review
2. Kurtosis
This statistic measures how flat or peaked the distribution is, and is formulated as:
$$\text{Kurtosis} = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^4}{n s^4} - 3$$
The 3 is subtracted in this formula so that the kurtosis of a normal distribution has the value 0 (a distribution with this property is termed mesokurtic)

32 Skewness and Kurtosis - Review
2. Kurtosis
This statistic measures how flat or peaked the distribution is
Procedure for finding the kurtosis of a data set:
1) Calculate the mean
2) Calculate the statistical distances $(x_i - \bar{x})$ for each value
3) Raise each of the statistical distances to the 4th power, i.e. $(x_i - \bar{x})^4$
4) Sum the statistical distances raised to the 4th power, $\sum (x_i - \bar{x})^4$
5) Divide the sum by the sample size multiplied by the standard deviation raised to the 4th power (i.e. the denominator is $n s^4$ in $[\sum (x_i - \bar{x})^4] / [n s^4]$)
6) Subtract 3 from $[\sum (x_i - \bar{x})^4] / [n s^4]$
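The skewness and kurtosis procedures differ only in the power applied to the statistical distances, so one function can follow both. This is a sketch under the assumption that s is the sample standard deviation (n - 1 denominator) as defined earlier in the slides; other texts and software packages use slightly different conventions. The function name and data are hypothetical.

```python
import math

def skewness_and_kurtosis(data):
    """Follow the slide procedures for skewness and (excess) kurtosis."""
    n = len(data)
    mean = sum(data) / n
    distances = [x - mean for x in data]
    # Sample standard deviation (assumed: n - 1 in the denominator)
    s = math.sqrt(sum(d ** 2 for d in distances) / (n - 1))
    skewness = sum(d ** 3 for d in distances) / (n * s ** 3)
    kurtosis = sum(d ** 4 for d in distances) / (n * s ** 4) - 3
    return skewness, kurtosis

skewness, kurtosis = skewness_and_kurtosis([2, 4, 4, 4, 5, 5, 7, 9])
print(round(skewness, 3), round(kurtosis, 3))
```

For this data set the skewness is positive (a longer tail above the mean) and the kurtosis is negative (flatter than a normal distribution under this formula).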

33 Probability - An Example, Part II
Here, the values of $x_i$ are drawn from the four outcomes, and their probabilities are the number of events with each outcome divided by the total number of events:
[Table: City and # of Malls, with each city assigned to one of Outcomes #1 to #4, alongside the resulting $x_i$ and $P(x_i)$ columns, e.g. $P(x_i)$ = 1/6 for $x_i$ = 1]
The probability of an outcome:
$$P(x_i) = \frac{\text{number of times an outcome occurred}}{\text{total number of events}}$$

34 Discrete Random Variables
We can calculate the mean of a discrete probability distribution by taking all possible values of the variable, multiplying each by its probability, and summing over the values:
$$\mu = \sum_{i=1}^{k} x_i \, P(x_i)$$
The symbol µ is used here rather than $\bar{x}$ because the basic idea of a probability distribution is to use a large number of values to approach a stable estimate of the parameter

35 Discrete Random Variables
We can also calculate the variance of a discrete probability distribution by taking the squared deviation from the mean for each possible value of the variable, multiplying it by that value's probability, and summing over the values:
$$\sigma^2 = \sum_{i=1}^{k} (x_i - \mu)^2 \, P(x_i)$$
These formulae are only useful for discrete probability distributions; for continuous probability distributions a different method is required
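As an illustration of the mean and variance of a discrete probability distribution, the sketch below uses a fair six-sided die (an example chosen here, not one from the slides) and exact fractions so that the results are not affected by floating-point rounding.

```python
from fractions import Fraction

# Hypothetical distribution: a fair six-sided die, P(x_i) = 1/6 for each face
distribution = {x: Fraction(1, 6) for x in range(1, 7)}

# The probabilities of all outcomes must sum to 1
total_probability = sum(distribution.values())

# Mean: sum of x_i * P(x_i)
mu = sum(x * p for x, p in distribution.items())

# Variance: sum of (x_i - mu)^2 * P(x_i)
variance = sum((x - mu) ** 2 * p for x, p in distribution.items())

print(total_probability, mu, variance)
```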

36 Probability Rules
If the sets A and B do overlap in the Venn diagram, the sets are not mutually exclusive; here we treat them as independent, but not exclusive, events
The union of sets A and B here is
$$P(A \cup B) = P(A) + P(B) - P(A \cap B)$$
because we do not wish to count the intersection area twice, so we subtract it from the sum of the areas of A and B when taking the union of a pair of overlapping sets
The intersection of sets A and B here is calculated by taking the product of the two probabilities, a.k.a. the multiplication rule (which holds for independent events):
$$P(A \cap B) = P(A) \cdot P(B)$$
[Figure: Venn diagrams of overlapping sets A and B illustrating the union and the intersection]

37 Probability Rules
Consider set A to give the chance of precipitation at P(A) = 0.4 and set B to give the chance of below-freezing temperatures at P(B) = 0.7
The complement of set A is
$$P(A') = 1 - P(A) = 1 - 0.4 = 0.6$$
This expresses the chance of it not raining or snowing, P(A') = 0.6
The complement of the union of sets A and B is
$$P((A \cup B)') = 1 - [P(A) + P(B) - P(A \cap B)] = 1 - [0.4 + 0.7 - 0.28] = 0.18$$
where $P(A \cap B) = 0.4 \cdot 0.7 = 0.28$ by the multiplication rule
This expresses the chance of it neither precipitating nor being below freezing, $P((A \cup B)') = 0.18$
[Figure: Venn diagrams shading the complement of A and the complement of the union of A and B]
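The precipitation and freezing example can be checked numerically. This sketch assumes, as the slides do, that the two events are independent so that the multiplication rule applies; the variable names are illustrative.

```python
p_a = 0.4    # chance of precipitation
p_b = 0.7    # chance of below-freezing temperatures

# Multiplication rule (assumes A and B are independent)
p_intersection = p_a * p_b

# Union: subtract the intersection so it is not counted twice
p_union = p_a + p_b - p_intersection

# Complements
p_not_a = 1 - p_a
p_neither = 1 - p_union

print(round(p_intersection, 2), round(p_union, 2),
      round(p_not_a, 2), round(p_neither, 2))
```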

38 Bernoulli Trials
We can provide a general formula for calculating the probability of x successes, given n trials and a probability p of success:
$$P(x) = C(n,x) \, p^{x} (1 - p)^{n - x}$$
where C(n,x) is the number of possible combinations of x successes and (n - x) failures:
$$C(n,x) = \frac{n!}{x! \, (n - x)!}$$
where $n! = n \cdot (n - 1) \cdot (n - 2) \cdots 3 \cdot 2 \cdot 1$
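The formula for x successes in n Bernoulli trials translates almost directly into Python, with `math.comb` supplying C(n, x). The function name `binomial_pmf` and the choice of five fair coin tosses are assumptions made for this example.

```python
import math

def binomial_pmf(x, n, p):
    """Probability of exactly x successes in n trials with success probability p."""
    return math.comb(n, x) * p ** x * (1 - p) ** (n - x)

# Hypothetical example: number of heads in 5 tosses of a fair coin
probabilities = [binomial_pmf(x, 5, 0.5) for x in range(6)]
print(probabilities)
```

The probabilities over all possible values of x (0 through n) sum to 1, which is a useful sanity check.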

39 The Poisson Distribution
Procedure for finding Poisson probabilities and expected frequencies:
1. Set up a table with five columns as on the previous slide
2. Multiply the values of x by their observed frequencies ($x \cdot f_{obs}$)
3. Sum the columns of $f_{obs}$ (observed frequency) and $x \cdot f_{obs}$
4. Compute $\lambda = \sum (x \cdot f_{obs}) / \sum f_{obs}$
5. Compute P(x) values using the equation or a table
6. Compute the values of $f_{exp} = P(x) \cdot \sum f_{obs}$
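The Poisson procedure can be sketched as follows. The observed frequencies are hypothetical, and step 5 uses the standard Poisson probability function $P(x) = e^{-\lambda} \lambda^x / x!$, which is assumed to be the equation the slides refer to.

```python
import math

x_values = [0, 1, 2, 3, 4]
f_obs = [20, 15, 10, 4, 1]       # hypothetical observed frequencies

x_times_f = [x * f for x, f in zip(x_values, f_obs)]     # step 2
total_f = sum(f_obs)                                      # step 3
total_x_times_f = sum(x_times_f)                          # step 3
lam = total_x_times_f / total_f                           # step 4

# Step 5: Poisson probability of each value of x
p_x = [math.exp(-lam) * lam ** x / math.factorial(x) for x in x_values]

# Step 6: expected frequency of each value of x
f_exp = [p * total_f for p in p_x]

for row in zip(x_values, f_obs, x_times_f, p_x, f_exp):
    print(row)
```

The expected frequencies sum to slightly less than the observed total because the table stops at x = 4 while the Poisson distribution assigns a small probability to larger values.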

40 The Normal Distribution
You will recall the z-score (a.k.a. standard normal variate, standard normal deviate, or just the standard score), which is calculated by subtracting the mean from the observation, and then dividing that difference by the standard deviation:
$$\text{Z-score} = \frac{x - \mu}{\sigma}$$
The z-score is the means by which we transform our normal distribution into a standard normal distribution, which is simply a normal distribution that has been standardized to have µ = 0 and σ = 1

41 Standard Normal Tables Using our example z-score of -1.75, we find the position of 1.75 in the table and use the value found there; because the normal distribution is symmetric the table does not need to repeat positive and negative values

42 Standard Normal Tables
This table is defined to give the area under the curve from the specified value through to the rest of the tail of the distribution (theoretically to an infinite z-score)
Looking up z = 1.75 gives P(x) = 0.0401 for the tail below z = -1.75, and using this sort of information we can retrieve the probability of any interval (up to z = 3.09, where the table ends, and to a precision of 0.01 in z)

43 Finding the P(x) for Various Intervals
1. $P(0 \le Z \le a) = 0.5 - (\text{table value})$
Total area under the curve = 1, thus the area above the mean (z = 0) is equal to 0.5, and we subtract the area of the tail
2. $P(Z \le a) = 1 - (\text{table value})$
Total area under the curve = 1, and we subtract the area of the tail
3. $P(Z \ge a) = (\text{table value})$
The table gives the value of P(x) in the tail above a

44 Finding the P(x) for Various Intervals
4. $P(-a \le Z \le 0) = 0.5 - (\text{table value})$
This is equivalent to $P(0 \le Z \le a)$ when a is positive
5. $P(Z \ge -a) = 1 - (\text{table value})$
This is equivalent to $P(Z \le a)$ when a is positive
6. $P(Z \le -a) = (\text{table value})$
The table gives the value of P(x) in the tail below -a, equivalent to $P(Z \ge a)$ when a is positive

45 Finding the P(x) for Various Intervals
7. $P(a \le Z \le b)$ if a < 0 and b > 0
= [0.5 - (table value for a)] + [0.5 - (table value for b)]
= 1 - [(table value for a) + (table value for b)]
Since the table gives us the value of P for the tail beyond the specified z-score, simply subtract the area of the two tails from the total area of 1
With this set of building blocks, you should be able to calculate the probability for any interval using a standard normal table
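The table-based building blocks can be mimicked in code by replacing the printed table with a function that returns the same tail area. This is a sketch, not the table itself: `upper_tail` and `prob_between` are hypothetical helper names, and the tail area is computed from the complementary error function `math.erfc` rather than looked up.

```python
import math

def upper_tail(z):
    """Area under the standard normal curve from |z| out to infinity
    (the quantity a tail-area table lists)."""
    return 0.5 * math.erfc(abs(z) / math.sqrt(2))

def prob_between(a, b):
    """P(a <= Z <= b) for a < 0 < b: total area 1 minus the two tails."""
    return 1 - (upper_tail(a) + upper_tail(b))

tail_175 = upper_tail(1.75)              # tail beyond z = 1.75 (or below z = -1.75)
zero_to_one = 0.5 - upper_tail(1.0)      # P(0 <= Z <= 1)
below_one = 1 - upper_tail(1.0)          # P(Z <= 1)
middle_95 = prob_between(-1.96, 1.96)    # P(-1.96 <= Z <= 1.96)

print(round(tail_175, 4), round(zero_to_one, 4),
      round(below_one, 4), round(middle_95, 4))
```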

46 Standard Error
The standard deviation of the sampling distribution of the sample mean $\bar{X}$ is formulated as:
$$\sigma_{\bar{X}} = \frac{\sigma}{\sqrt{n}}$$
This is the standard error, and it is the unit of measurement of a confidence interval, used to express the closeness of a statistic to a parameter
When we construct a confidence interval we are finding how many standard errors away from the mean we have to go to find the area under the curve equal to the confidence level

47 Constructing a Confidence Interval
1. Select our desired level of confidence
2. Transform that confidence level into a probability, usually denoted by the symbol α
3. Calculate α/2, since we want an interval that is symmetric about the mean of the distribution and has α partitioned equally between the two tails
4. Look up the corresponding z-score in a standard normal table
5. Multiply the z-score by the standard error, then find the interval by adding this product to and subtracting it from the mean

48 Constructing a Confidence Interval - Steps
1. Select our desired level of confidence
Let's suppose we want to construct an interval using the 95% level of confidence
2. Transform that confidence level into a probability
The confidence level is 100 * (1 - α)%, i.e. if the confidence level is 95%, then α is 5% or α = 0.05
3. Calculate α/2
We're going to find the interval on either side of the mean, so we need a z-score for α/2, here 0.025

49 Constructing a Confidence Interval - Steps
4. Look up the corresponding z-score
We use a standard normal table (like Table A.2 on page 214 of Rogerson) to find the z-score that corresponds with our α/2. In this case, α/2 = 0.025 corresponds to a z-score of 1.96
5. Multiply the z-score by the standard error etc.
We are finding an interval which corresponds to 95% of the area under the curve, which is the interval from z = -1.96 to z = 1.96, which is equal to
$$[\mu - (Z_{\alpha/2} \cdot \text{std. error}),\ \mu + (Z_{\alpha/2} \cdot \text{std. error})]$$
when we have the population µ and σ

50 Constructing a Confidence Interval - Steps
5. Multiply the z-score by the standard error cont.
I.e. if the population µ and σ are known because we are working with a known distribution, the standard error is
$$\sigma_{\bar{X}} = \frac{\sigma}{\sqrt{n}}$$
and the interval can then be expressed as:
$$\left[\mu - Z_{\alpha/2} \cdot \frac{\sigma}{\sqrt{n}},\ \mu + Z_{\alpha/2} \cdot \frac{\sigma}{\sqrt{n}}\right]$$
Using sample statistics, we can make the interval:
$$\left[\bar{x} - Z_{\alpha/2} \cdot \frac{s}{\sqrt{n}},\ \bar{x} + Z_{\alpha/2} \cdot \frac{s}{\sqrt{n}}\right]$$
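The five steps for constructing a confidence interval can be sketched in Python. Instead of a printed table, this example obtains the z-score from `statistics.NormalDist().inv_cdf`; the function name `confidence_interval` and the sample statistics (mean 50, standard deviation 10, n = 100) are hypothetical.

```python
import math
from statistics import NormalDist

def confidence_interval(mean, sd, n, confidence=0.95):
    """z-based confidence interval for a mean."""
    alpha = 1 - confidence                       # step 2
    half_alpha = alpha / 2                       # step 3
    z = NormalDist().inv_cdf(1 - half_alpha)     # step 4: z-score leaving alpha/2 in the upper tail
    standard_error = sd / math.sqrt(n)
    margin = z * standard_error                  # step 5
    return mean - margin, mean + margin

lower, upper = confidence_interval(mean=50.0, sd=10.0, n=100)
print(round(lower, 2), round(upper, 2))
```

With a standard error of 10 / sqrt(100) = 1, the 95% interval extends about 1.96 units on either side of the mean.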

Some Characteristics of Data

Some Characteristics of Data Not all data is the same, and depending on some characteristics of a particular dataset, there are some limitations as to what can and cannot be done with that data. Some key