Some Characteristics of Data

Not all data is the same, and depending on the characteristics of a particular dataset, there are limitations on what can and cannot be done with that data. Some key characteristics that must be considered are:
A. Scale of Measurement
B. Continuous vs. Discrete
C. Grouped vs. Individual
Sigma Notation

The mathematical notation used most often in the formulation of statistics is summation notation. The uppercase Greek letter Σ (sigma) is used as shorthand, as a way to indicate that a sum is to be taken:

Σ_{i=1}^{i=n} x_i

The notation above is equivalent to writing x_1 + x_2 + x_3 + ... + x_{n-1} + x_n
Sigma Notation - Components

Looking at sigma notation in more detail:

Σ_{i=1}^{i=n} x_i

- The i=n above the sigma refers to where the sum of terms ends
- The Σ indicates we are taking a sum
- The x_i indicates what we are summing up
- The i=1 below the sigma refers to where the sum of terms begins
Pi Notation

Just as we have summation notation that we can use as a shorthand for sums, there is also product notation for multiplication. The uppercase Greek letter Π (pi) is used as shorthand, as a way to indicate that a product is to be calculated:

Π_{i=1}^{i=n} x_i

The notation above is equivalent to writing x_1 * x_2 * x_3 * ... * x_{n-1} * x_n
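Both notations translate directly into code. As a quick sketch (the data values here are hypothetical), the same sum and product can be computed in Python:

```python
import math

x = [2, 3, 5, 7]   # a small hypothetical sample, x_1 .. x_n
n = len(x)

# Sigma notation: sum of x_i for i = 1 .. n
total = sum(x[i] for i in range(n))   # 2 + 3 + 5 + 7 = 17

# Pi notation: product of x_i for i = 1 .. n
product = math.prod(x)                # 2 * 3 * 5 * 7 = 210

print(total, product)
```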
Measures of Central Tendency

Think of this from the following point of view: we have some distribution in which we want to locate the center, and we need to choose an appropriate measure of central tendency. We can choose from:
1. Mode
2. Median
3. Mean
Each of these measures is appropriate to different distributions / under different circumstances
Measures of Central Tendency - Mode

1. Mode
This is the most frequently occurring value in the distribution. In the event that multiple values tie for the highest frequency, we have a problem; a potential solution in this situation involves constructing frequency classes and identifying the most frequently occurring class. This is the only measure of central tendency that can be used with nominal data. The mode allows the distribution's peak to be located quickly.
Measures of Central Tendency - Median

2. Median
This is the value of a variable such that half of the observations are above and half are below this value, i.e. this value divides the distribution into two groups of equal size.
Note: When the distribution has an even number of observations, finding the median requires averaging the two middle values.
The key advantage of the median is that its value is unaffected by extreme values at the ends of a distribution (which are potentially outliers).
Measures of Central Tendency - Median

2. Median cont.
When we find the median of a distribution, we do so by dividing it into two equal parts. We can divide distributions into a greater number of parts as well:
Quartiles - each contains 25% of all values
Quintiles - each contains 20% of all values
Deciles - each contains 10% of all values
Percentiles - each contains 1% of all values
Except for quintiles (because they give an odd number of parts), each of these also gives us the median.
Measures of Central Tendency - Mean

3. Mean
A.k.a. the average, this is the most commonly used measure of central tendency:

Sample mean: x̄ = (Σ_{i=1}^{i=n} x_i) / n

Population mean: µ = (Σ_{i=1}^{i=N} x_i) / N

When we compute a mean using these basic formulae, we are assuming that each observation is equally significant.
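All three measures of central tendency are available in Python's standard library; the sketch below uses a hypothetical sample:

```python
import statistics

data = [2, 3, 3, 5, 7, 9, 9, 9, 11]   # hypothetical sample

mode = statistics.mode(data)       # most frequent value -> 9
median = statistics.median(data)   # middle value (odd n here) -> 7
mean = statistics.mean(data)       # sum(data) / n

print(mode, median, mean)
```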
Selecting a Measure of Central Tendency

Most often, the mean is selected by default. The mean's key advantage is that it is sensitive to any change in the value of any observation. However, we really must consider the nature of the distribution to choose properly:
1. Multi-modal distribution - the mode must be used, as the median or mean would be rather meaningless
2. Unimodal symmetric - the mean is fine if the distribution approaches being symmetric
Selecting a Measure of Central Tendency

3. Unimodal skewed - a skewed distribution creates significant differences between the measures of central tendency:
Negatively skewed: Mean < Median < Mode
Positively skewed: Mode < Median < Mean
In both cases of skew, the median is appropriate, especially in cases with extreme outliers (e.g. the distribution of salaries of UNC Geography graduates)
4. Ordinal data - the median is well applied
5. Nominal data - the mode is the only choice
Measures of Dispersion

In addition to measures of central tendency, we can also summarize data by characterizing its variability. Measures of dispersion are concerned with the distribution of values around the mean:
1. Range
2. Quartile range etc.
3. (Mean deviation)
4. (Variance, standard deviation and z-scores)
5. Coefficient of variation
Measures of Dispersion - Range

1. Range
This is the most simply formulated of all measures of dispersion. Given a set of measurements x_1, x_2, x_3, ..., x_{n-1}, x_n, the range is defined as the difference between the largest and smallest values:

Range = x_max − x_min

This is another descriptive measure that is vulnerable to the influence of outliers in a data set, which result in a range that is not really descriptive of most of the data.
Measures of Dispersion - Quartile Range etc.

2. Quartile Range etc.
We can divide distributions into a number of parts, each containing an equal number of observations:
Quartiles - each contains 25% of all values
Quintiles - each contains 20% of all values
Deciles - each contains 10% of all values
Percentiles - each contains 1% of all values
A standard application of this approach for describing dispersion involves calculating the interquartile range (a.k.a. quartile deviation)
Measures of Dispersion - Quartile Range etc.

2. Quartile Range etc. cont.
Rogerson (p. 6) defines the interquartile range as the difference between the values of the 25th and 75th percentiles (i.e. the minimum value of the 2nd quartile and the maximum value of the 3rd quartile). This is well applied to skewed distributions, since it measures deviation around the median. The interquartile range provides 2 of the 5 values displayed in a box plot, which is a convenient graphical summary of a data set.
Measures of Dispersion - Quartile Range etc.

2. Quartile Range etc. cont.
A box plot graphically displays the following five values: the minimum value, the 25th percentile value, the median, the 75th percentile value, and the maximum value.
[Figure: a box plot with whiskers at the minimum and maximum, a box spanning the 25th to 75th percentiles, and a line at the median. Rogerson, p. 8.]
Under some circumstances, the whiskers are not used for the min. and max. because of outliers.
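The five box-plot values and the interquartile range can be computed with the standard library; this is a sketch with a hypothetical sample, using the `statistics.quantiles` "inclusive" method (other quartile conventions give slightly different cut points):

```python
import statistics

data = [4, 7, 8, 10, 12, 15, 16, 21, 24, 29]   # hypothetical sample

# n=4 returns the three quartile cut points (25th, 50th, 75th percentiles)
q1, q2, q3 = statistics.quantiles(data, n=4, method="inclusive")

five_number = (min(data), q1, q2, q3, max(data))   # the box plot's five values
iqr = q3 - q1                                      # interquartile range

print(five_number, iqr)
```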
Measures of Dispersion - Coefficient of Variation

5. Coefficient of Variation
We cannot directly compare the standard deviations of frequency distributions with different means, because a distribution with a higher mean is likely to have a larger deviation. In addition to z-scores (which describe the deviation of an individual observation), we need an overall measure of dispersion that is normalized with respect to the mean of the same distribution:

Coefficient of variation = S / x̄ or σ / µ (optionally ×100%)
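The point of the normalization can be seen with two hypothetical samples where one is just the other scaled up by a factor of 10: the standard deviations differ by that factor, but the coefficients of variation match.

```python
import statistics

a = [10, 12, 11, 13, 14]        # low-mean sample
b = [100, 120, 110, 130, 140]   # same relative spread, 10x the mean

def coef_var(sample):
    """Coefficient of variation: sample standard deviation over the mean."""
    return statistics.stdev(sample) / statistics.mean(sample)

# b's standard deviation is 10x a's, yet the CVs agree,
# making the two dispersions directly comparable
print(coef_var(a), coef_var(b))
```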
Further Moments of the Distribution - Skewness

Skewness
This statistic measures the degree of asymmetry exhibited by the data (i.e. whether there are more observations on one side of the mean than the other):

Skewness = Σ_{i=1}^{i=n} (x_i − x̄)³ / (n·s³)

Because the exponent in this moment is odd, skewness can be positive or negative; positive skewness has more observations below the mean than above it (and vice-versa for negative skewness).
Further Moments of the Distribution - Skewness

Skewness cont.
Skewness can also be assessed by comparing the mean and the median:
Positive skewness: Median < Mean
Negative skewness: Mean < Median
This can also be assessed by calculating Pearson's coefficient of skewness:

Sk = 3(x̄ − Md) / S

where x̄ is the mean, Md is the median, and S is the standard deviation. Sk follows the sign convention above, and values with an absolute value less than 3 indicate moderate skewness.
Further Moments of the Distribution - Kurtosis

Kurtosis
This statistic measures how flat or peaked the distribution is, and is formulated as:

Kurtosis = Σ_{i=1}^{i=n} (x_i − x̄)⁴ / (n·s⁴) − 3

The 3 is included in this formula because it makes the kurtosis of a normal distribution equal 0 (this condition is also termed a mesokurtic distribution).
Further Moments of the Distribution - Kurtosis Kurtosis cont. When the kurtosis < 0, the frequencies throughout the curve are closer to equal (i.e. the curve is more flat and wide) and this condition is termed platykurtic When the kurtosis > 0, there are high frequencies in only a small part of the curve (i.e. the curve is more peaked) and this condition is termed leptokurtic NOTE: Both skewness and kurtosis are sensitive to the size of n; when n is small and there are outliers, they are less useful
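Both moment formulas can be sketched directly from their definitions; the sample below is hypothetical, and the population-style standard deviation (n in the denominator) is used so the divisors match the n·s³ and n·s⁴ terms above:

```python
import statistics

def moments(data):
    """Sample skewness and excess kurtosis from the basic moment formulas."""
    n = len(data)
    mean = statistics.mean(data)
    s = statistics.pstdev(data)   # n-denominator standard deviation
    skew = sum((x - mean) ** 3 for x in data) / (n * s ** 3)
    kurt = sum((x - mean) ** 4 for x in data) / (n * s ** 4) - 3
    return skew, kurt

# A positively skewed sample: a long tail above the mean
sample = [1, 2, 2, 3, 3, 3, 4, 10]
skew, kurt = moments(sample)
print(skew, kurt)   # skew > 0 for this right-tailed sample
```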
Random Variables and Probability Distributions

The concept of probability is the key to making statistical inferences by sampling a population. What we are doing is trying to ascertain the probability of an event having a given outcome, e.g. we summarize a sample statistically and want to make some inferences about the population, such as what proportion of it has values within a given range; we could do this by finding the area under the curve in a frequency distribution. This requires us to be able to specify the distribution of a variable before we can make inferences.
Probability - Some Definitions

Probability: the likelihood that something (an event) will have a certain outcome.
Event: any phenomenon you can observe that can have more than one outcome (e.g. flipping a coin).
Outcome: any unique condition that can be the result of an event (e.g. the available outcomes when flipping a coin are heads and tails); a.k.a. simple events or sample points.
Sample space: the set of all possible outcomes associated with an event (e.g. the sample space for flipping a coin includes heads and tails).
Probability - An Example

For example, suppose we have a data set in which we count the number of malls present in each of six cities:

City | # of Malls
1 | 1
2 | 4
3 | 4
4 | 4
5 | 2
6 | 3

Each count of the # of malls in a city is an event, and the sample space contains four outcomes: 1, 2, 3, and 4 malls. We might wonder: if we randomly pick one of these six cities, what is the chance that it will have n malls?
Random Variables and Probability Functions

What we have here is a random variable: a variable X whose values x_i are sampled randomly from a population. To put this another way, a random variable is a function defined on the sample space; this means that we are interested in all the possible outcomes. The question was: if we randomly pick one of the six cities, what is the chance that it will have n malls?
Random Variables and Probability Functions

To answer this question, we need to form a probability function (a.k.a. probability distribution) from the sample space that gives all values of a random variable and their probabilities. A probability distribution expresses the relative number of times we expect a random variable to assume each and every possible value. We base a probability function either on a very large empirically gathered set of outcomes, or else we determine the shape of the probability function mathematically.
Probability Mass Functions

Probability mass functions have the following rules that dictate their possible values:
1. The probability of any outcome must be greater than or equal to zero and less than or equal to one, i.e. 0 ≤ P(x_i) ≤ 1 for i = {1, 2, 3, ..., k-1, k}
2. The sum of all probabilities in the sample space must total one, i.e. Σ_{i=1}^{i=k} P(x_i) = 1
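Using the six-city mall counts from the earlier example, a probability mass function can be built by dividing each value's frequency by the number of cities; the sketch below also checks both rules:

```python
from collections import Counter

# Mall counts for the six cities from the earlier example
malls = [1, 4, 4, 4, 2, 3]

n = len(malls)
counts = Counter(malls)

# Probability mass function: P(x) = frequency of x / number of cities
pmf = {x: c / n for x, c in counts.items()}

# Rule 1: every probability lies in [0, 1]
assert all(0 <= p <= 1 for p in pmf.values())
# Rule 2: the probabilities sum to 1
assert abs(sum(pmf.values()) - 1) < 1e-12

print(pmf)   # P(4 malls) = 3/6 = 0.5
```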
Continuous Random Variables

A continuous random variable can assume all real number values within an interval, for example measurements of precipitation, pH, etc. Some random variables that are technically discrete exhibit such a tremendous range of values that it is desirable to treat them as if they were continuous variables (e.g. population). Discrete random variables are described by probability mass functions, and continuous random variables are described by probability density functions.
Probability Density Functions

Probability density functions are defined using the same rules required of probability mass functions, with some additional requirements:
1. The function must have a non-negative value throughout the interval a to b, i.e. f(x) ≥ 0 for a ≤ x ≤ b
2. The area under the curve defined by f(x), within the interval a to b, must equal 1, i.e. ∫_a^b f(x) dx = 1
[Figure: a curve f(x) over the interval a to b enclosing an area equal to 1.]
The Poisson Distribution

The usual application of probability distributions is to find a theoretical distribution that reflects a process such that it explains what we see in some observed sample of a geographic phenomenon. The theoretical distribution does this by virtue of the fact that the sampled information and the theoretical distribution can be compared and found to be similar through a test of significance. One theoretical concept that we often study in geography concerns discrete random events in space and time (e.g. where will a tornado occur?)
The Poisson Distribution

The discrete random events in question happen rarely (if at all), and the time and place of these events are independent and random. The greatest probability is zero occurrences at a certain time or place, with a small chance of one occurrence, an even smaller chance of two occurrences, etc. A distribution with these characteristics will be heavily peaked and skewed:
[Figure: a heavily peaked, positively skewed probability mass function P(x_i), highest at x_i = 0 and declining as x_i increases.]
The Poisson Distribution The Poisson distribution is sometimes known as the Law of Small Numbers, because it describes the behavior of events that are rare, despite there being many opportunities for them to occur We can observe the frequency of some rare phenomenon, find its mean occurrence, and then construct a Poisson distribution and compare our observed values to those from the distribution (effectively expected values) to see the degree to which our observed phenomenon is obeying the Law of Small Numbers:
The Poisson Distribution

Procedure for finding Poisson probabilities and expected frequencies:
1. Set up a table with five columns as on the previous slide
2. Multiply the values of x by their observed frequencies (x * f_obs)
3. Sum the columns of f_obs (observed frequency) and x * f_obs
4. Compute λ = Σ(x * f_obs) / Σ f_obs
5. Compute P(x) values using the equation or a table
6. Compute the values of F_exp = P(x) * Σ f_obs
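The procedure above can be sketched in a few lines; the observed frequencies here are hypothetical, and the Poisson probabilities use the standard formula P(x) = e^(−λ) λ^x / x!:

```python
import math

# Hypothetical observed data: x = count per interval, f_obs = times observed
x_values = [0, 1, 2, 3, 4]
f_obs = [35, 40, 19, 5, 1]

total = sum(f_obs)                                          # step 3: sum of f_obs
lam = sum(x * f for x, f in zip(x_values, f_obs)) / total   # step 4: lambda

# Step 5: Poisson probabilities P(x) = e^(-lambda) * lambda^x / x!
p = [math.exp(-lam) * lam ** x / math.factorial(x) for x in x_values]

# Step 6: expected frequencies to compare against f_obs
f_exp = [prob * total for prob in p]

print(lam, f_exp)
```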
The Poisson Distribution

One characteristic of the Poisson distribution is that we expect variance ≈ mean (i.e. the two should have approximately the same value). When we apply the Poisson distribution to geographic patterns, we can see how a variance-to-mean ratio (σ²:x̄) of about 1 corresponds to a random pattern that is distributed according to Poisson probabilities. Suppose we have a point pattern in an (x,y) coordinate space and we lay down quadrats and count the number of points per quadrat.
The Poisson Distribution

Here, the counts of points per quadrat form the frequencies we use to check Poisson probabilities:
Regular pattern: low variance relative to the mean, so σ²:x̄ is low (< 1)
Random pattern: variance ≈ mean, so σ²:x̄ ≈ 1
Clustered pattern: high variance relative to the mean, so σ²:x̄ is high (> 1)
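A quick sketch with hypothetical quadrat counts shows the two extremes of the variance-to-mean ratio:

```python
import statistics

def var_mean_ratio(quadrat_counts):
    """Variance-to-mean ratio of points-per-quadrat counts."""
    return statistics.pvariance(quadrat_counts) / statistics.mean(quadrat_counts)

regular = [2, 2, 2, 2, 2, 2]      # every quadrat alike: ratio near 0
clustered = [0, 0, 0, 0, 0, 12]   # all points in one quadrat: ratio >> 1

print(var_mean_ratio(regular), var_mean_ratio(clustered))
```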
The Normal Distribution

You will recall the z-score (a.k.a. standard normal variate, standard normal deviate, or just the standard score), which is calculated by subtracting the mean from the observation, and then dividing that difference by the standard deviation:

Z-score = (x − µ) / σ

The z-score is the means used to transform a normal distribution into the standard normal distribution, which is simply a normal distribution that has been standardized to have µ = 0 and σ = 1.
The Standard Normal Distribution

For example, if we have a data set with µ = 55 and σ = 20, we calculate z-scores using:

Z-score = (x − µ) / σ = (x − 55) / 20

If one of our data values is x = 20, then:

Z-score = (20 − 55) / 20 = −35 / 20 = −1.75

Using z-scores in conjunction with standard normal tables (like Table A.2 on page 214 of Rogerson) we can look up areas under the curve associated with intervals, and thus probabilities.
Standard Normal Tables Using our example z-score of -1.75, we find the position of 1.75 in the table and use the value found there; because the normal distribution is symmetric the table does not need to repeat positive and negative values
Standard Normal Tables

This table is defined to give the area under the curve from the specified value out through the rest of the tail of the distribution (theoretically to an infinite z-score). Looking up z = −1.75 gives P(x) = 0.0401 for the tail below z = −1.75; using this sort of information, we can retrieve the probability of any interval (out to z = 3.09, where the table ends, with z values given to a precision of 0.01).
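The table lookup can be reproduced with the standard-library error function, since the standard normal lower-tail area is 0.5 · (1 + erf(z/√2)); the sketch below reruns the worked example:

```python
import math

def z_score(x, mu, sigma):
    """Standardize an observation: (x - mu) / sigma."""
    return (x - mu) / sigma

def lower_tail_area(z):
    """Area under the standard normal curve below z (the table lookup)."""
    return 0.5 * (1 + math.erf(z / math.sqrt(2)))

z = z_score(20, 55, 20)     # -1.75, as in the worked example
print(z, lower_tail_area(z))  # tail area ~0.0401, matching the table
```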