MBEJ 1023 Planning Analytical Methods
Dr. Mehdi Moeinaddini
Dept. of Urban & Regional Planning, Faculty of Built Environment

Contents
- What is statistics?
- Population and Sample
- Descriptive Statistics
- Inferential Statistics
- Statistical data analysis
- Scales of Measurement
- Skewness
- Measure of Dispersion
- Using Excel

What is statistics?
Statistics consists of a body of methods for collecting and analyzing data (Agresti & Finlay, 1997).

Statistics
Raw data -> Quantitative techniques -> Meaningful information
- What kind and how much data need to be collected?
- How should we organize and summarize the data?
- How can we analyse the data and draw conclusions from it?
- How can we assess the strength of the conclusions and evaluate their uncertainty?

Population and Sample
What kind and how much data need to be collected?
Population: the collection of all individuals or items under consideration in a statistical study (Weiss, 1999).
Sample: that part of the population from which information is collected (Weiss, 1999).

Population and Sample
Ideal survey: the sampled population = the target population. For obvious reasons this is impossible.
A perfect sample: a scaled-down version of the target population, mirroring every characteristic of the target population. This is also impossible.
A good sample: reproduces the characteristics of the target population as closely as possible.

Descriptive Statistics
How should we organize and summarize the data?
Descriptive statistics consist of methods for organizing and summarizing information (Weiss, 1999).

Descriptive Statistics
- Central Tendency: Mean, Median, Mode, Sum, ...
- Dispersion: Std. deviation, Variance, Range, Minimum, Maximum, ...
- Distribution: Normal, Chi-square, Binomial, Poisson, Geometric, ...
- Percentile: Quartiles, Percentiles, ...

Inferential Statistics
How can we assess the strength of the conclusions and evaluate their uncertainty?
Inferential statistics consist of methods for drawing and measuring the reliability of conclusions about a population based on information obtained from a sample of the population (Weiss, 1999).

Inferential Statistics
- Point Estimation
- Interval Estimation
- Hypothesis Testing
Key terms: confidence level, margin of error.

Statistical data analysis
How can we analyse the data and draw conclusions from it?
The choice of method depends on:
- Scale of measurement
- Number of groups
- Nature of the relationship between groups
- Number of variables
- Assumptions of statistical tests

Statistical data analysis
Begin
1. Formulate the research problem
2. Define population and sample
3. Collect the data
4. Do descriptive data analysis
5. Use appropriate statistical methods to solve the research problem
6. Report the results
End

Basic mathematical notations
- Variable
- Number of Observations, n
- Counter

Variable
A variable can be defined as a known characteristic or phenomenon of a population or sample.
- Quantitative (e.g. height, income, etc.)
- Qualitative (e.g. gender, religion, etc.)

Example: Student's Weight (in kg) = {56, 45, 65, 47, 50}, written as w = {56, 45, 65, 47, 50}
Uppercase variable: a population's characteristic
Lowercase variable: a sample's characteristic

Number of Observations, n
x = {44, 71, 55, 32, 27}
y = {3.5, 2.7, 3.0, 4.5, 5.2}
z = {-8, 6, -4}
For x and y, n = 5. For z, n = 3.

Counter
b = {90, 85, 76, 92, 85, 53, 74, 85, 90, 66}, n = 10
Different observations can share the same value (85 and 90 both repeat in b). To avoid confusing them, mathematicians use a counter to refer to an individual value by its position in the set of observations.
A counter is normally represented by the letters i, j and k, written as a subscript: $b_i$.
i = 2: $b_2 = 85$
i = 3: $b_3 = 76$

Example: Counter
c = {91, 86, 77, 93, 86, 54, 75, 86, 95, 67, 80}
a) i = 5: $c_5 = 86$
b) i = n: $c_n = c_{11} = 80$
c) i = 1, 2, ..., n: 91, 86, 77, 93, 86, 54, 75, 86, 95, 67, 80
d) $c_9 = 95$
e) a = {$c_3$, $c_8$}: a = {77, 86}

Descriptive Statistics
When your data are described correctly and adequately, everybody can gain insight into the features of your data.
Descriptive statistics help us to simplify large amounts of data in a sensible way. For instance, the Grade Point Average (GPA) describes the general performance of a student across a potentially wide range of course experiences.
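
The counter notation from the example above maps directly onto list indexing. The sketch below is illustrative only and is not part of the course material; it assumes Python, and the helper name `obs` is hypothetical. The one point to watch is that the slides count from 1 while Python lists count from 0.

```python
# Observations from the counter example (set c).
c = [91, 86, 77, 93, 86, 54, 75, 86, 95, 67, 80]
n = len(c)  # number of observations


def obs(values, i):
    """Return the i-th observation using the slides' 1-based counter."""
    return values[i - 1]  # Python lists are 0-based, so shift by one


c_5 = obs(c, 5)             # part (a)
c_n = obs(c, n)             # part (b)
c_9 = obs(c, 9)             # part (d)
a = [obs(c, 3), obs(c, 8)]  # part (e)

print(n, c_5, c_n, c_9, a)
```

Running it reproduces the answers to parts (a), (b), (d) and (e) of the example.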

Scales of Measurement
The ways that numbers are assigned to observations.
Measurement is basically the process of assigning numbers to observations according to certain rules (Sprinthall, 2000).

Scales of Measurement
- Nominal: establishes identity (Apartment has pool = 1, Apartment does not have pool = 0)
- Ordinal: places observations into an order and ranking (apartments are ranked according to their prices)
- Interval: position along a continuous scale. The scale does not have an absolute zero (we cannot talk about no temperature).
- Ratio: measured on a ratio scale (floor area). Zero has meaning: it denotes the absence of something.

1. Magnitude: the ability to be counted
2. Order: the ability to be ranked
3. Interval: having equal distance
4. Rational zero: a number zero on the scale that is meaningful

Central Tendency
The location of the distribution is assessed by its central tendency.
Central tendency, by definition, is a typical or representative score.
- Mode
- Median
- Mean

Mode
The mode is the most frequently occurring score value.
Distance, d, between home and workplace (in km):
d: 9 11 12 12 13 14 14 14 15 16 17
$M_o = 14$
The mode may be seen on a frequency distribution as the score value which corresponds to the highest point.

Mode
A distribution may have more than one mode.
Age, a, when started working (in years):
a: 19 20 20 21 22 23 23 24
$M_o = 20, 23$
Such distributions are called bimodal.

Median
The median is the score value which cuts the distribution in half, such that half the scores fall above the median and half fall below it.
(1) Order the scores from lowest to highest.
(2) If there is an odd number of scores: $M_d = X_i$, where $i = (N + 1) / 2$
(3) If there is an even number of scores: $M_d = (X_{i_1} + X_{i_2}) / 2$, where $i_1 = N / 2$ and $i_2 = (N + 2) / 2$

d: 9 11 12 12 13 14 14 14 15 16 17
There is an odd number of scores (11).
$i = (11 + 1) / 2 = 6$
$M_d = d_6 = 14$

a: 19 20 20 21 22 23 23 24
There is an even number of scores (8).
$i_1 = 8 / 2 = 4$, $i_2 = (8 + 2) / 2 = 5$
$M_d = (a_4 + a_5) / 2 = (21 + 22) / 2 = 21.5$
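
The mode and median rules above can be checked with a few lines of code. This is a minimal sketch, assuming Python 3.8 or later and its standard `statistics` module; the function name `median_by_position` is hypothetical and simply follows the positional rules (1) to (3), again shifting the 1-based counter to Python's 0-based indexing.

```python
from statistics import median, multimode

d = [9, 11, 12, 12, 13, 14, 14, 14, 15, 16, 17]  # distance to work (km)
a = [19, 20, 20, 21, 22, 23, 23, 24]             # age when started working


def median_by_position(scores):
    """Median using the positional rules from the slides."""
    x = sorted(scores)                 # (1) order the scores
    n = len(x)
    if n % 2 == 1:                     # (2) odd number of scores
        i = (n + 1) // 2
        return x[i - 1]
    i1, i2 = n // 2, (n + 2) // 2      # (3) even number of scores
    return (x[i1 - 1] + x[i2 - 1]) / 2


mode_d = multimode(d)  # every value tied for the highest frequency
mode_a = multimode(a)  # two values here, so the distribution is bimodal

print(mode_d, median_by_position(d))
print(mode_a, median_by_position(a))
```

`multimode` is used instead of `mode` because it returns all modes, which matches the bimodal example. The library function `median` gives the same results as the positional rule for both data sets.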

Calculating the median for the class interval data
Monthly household income (in RM)

Household Income (X)      No. of Households (f)
< RM1,000                 10
RM1,001 - RM2,000         18
RM2,001 - RM3,000         45
RM3,001 - RM4,000         22
> RM4,000                 5

$M_d = ll + \frac{n(0.50) - cf}{f_i} \times w$

$M_d$ = the median
$ll$ = lower exact limit of the interval containing the n(0.50) score
$n$ = total number of scores
$cf$ = cumulative frequency of the intervals below the one containing the n(0.50) score (the rows above it in the table)
$f_i$ = frequency of scores in the interval containing the n(0.50) score
$w$ = width of the class interval

Calculating the median for the class interval data
Monthly household income (in RM)

Household Income (X)   Exact Limits        No. of Households (f)   Cumulative Frequency (cf)
< 1,000                0 - 1000.5          10                      10
1,001 - 2,000          1000.5 - 2000.5     18                      28
2,001 - 3,000          2000.5 - 3000.5     45                      73
3,001 - 4,000          3000.5 - 4000.5     22                      95
> 4,000                4000.5 -            5                       100

Class boundaries (exact limits) are the midpoints between the upper class limit of a class and the lower class limit of the next class in the sequence.

n = 100, so n(0.50) = 50. The 50th score falls in the interval 2,001 - 3,000.
$ll = 2000.5$, $cf = 28$, $f_i = 45$, $w = 1000$
$M_d = 2000.5 + \frac{50 - 28}{45} \times 1000 = 2489.4$

Mean
The population mean is the sum of the observations divided by the population size:
$\mu = \frac{\sum X}{N}$
$\mu$ = the population mean, $N$ = population size

The sample mean is computed the same way from the sample:
$\bar{x} = \frac{\sum x}{n}$
$\bar{x}$ = sample mean, $n$ = sample size

a: 19 20 20 21 22 23 23 24
$\bar{x} = (19 + 20 + 20 + 21 + 22 + 23 + 23 + 24) / 8 = 21.5$
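
The grouped-median calculation for the household income table can be sketched in code. This is an illustration only, assuming Python; the function name `grouped_median` and its parameters are hypothetical. It also assumes a single class width `w` for the interval that contains the median, which holds for this table (2000.5 to 3000.5) but would need adjusting for unequal classes.

```python
def grouped_median(lower_limits, freqs, width):
    """Median of class-interval data: ll + ((n * 0.50 - cf) / f_i) * w."""
    n = sum(freqs)
    target = n * 0.50          # position of the n(0.50) score
    cf = 0                     # cumulative frequency below the current class
    for ll, f_i in zip(lower_limits, freqs):
        if cf + f_i >= target:  # this class contains the n(0.50) score
            return ll + (target - cf) / f_i * width
        cf += f_i
    raise ValueError("frequency table is empty")


# Lower exact limits and frequencies from the income table.
lower_limits = [0, 1000.5, 2000.5, 3000.5, 4000.5]
freqs = [10, 18, 45, 22, 5]

md = grouped_median(lower_limits, freqs, width=1000)
print(round(md, 1))
```

The loop accumulates `cf` class by class, so when it stops `cf` is 28 and `ll` is 2000.5, the same values used in the worked example.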

Mean
Sometimes, we are given data arranged in a frequency table.

Age (x)    No. of Respondents (f)
19         5
20         4
21         5
22         3
23         1
24         2

$\bar{x} = \frac{\sum_{i=1}^{n} f_i x_i}{\sum_{i=1}^{n} f_i}$
$\bar{x}$ = the sample mean
$x_i$ = individual observation
$f_i$ = class frequency
$n$ = the number of classes

$\bar{x} = (19(5) + 20(4) + 21(5) + 22(3) + 23(1) + 24(2)) / (5 + 4 + 5 + 3 + 1 + 2) = 417 / 20 = 20.85$

Mean
Sometimes, data are further summarised using class intervals.

Age, x     Frequency, f
18 - 20    9
21 - 23    9
24 - 26    2

Class Interval   Midpoint, x_i   Frequency, f_i   x_i f_i
18 - 20          19              9                171
21 - 23          22              9                198
24 - 26          25              2                50
Sum                              20               419

$\bar{x} = \frac{\sum_{i=1}^{n} f_i x_i}{\sum_{i=1}^{n} f_i}$
$\bar{x}$ = the sample mean
$x_i$ = the midpoint of the class interval
$f_i$ = class frequency
$n$ = the number of class intervals

$\bar{x} = 419 / 20 = 20.95$

Choosing Appropriate Measure of Central Tendency

Measurement Scale   Measure of Central Tendency
Nominal             Mode: the value that appears most often in a distribution.
Ordinal             Median: the value that divides the distribution of responses into two equal-size groups (the value of the 50th percentile).
Interval/Ratio      Mode and Median; Mean: the arithmetic average of a distribution.
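
Both frequency-table means above use the same weighted-sum formula, so one small function covers them. This is a sketch assuming Python; the name `frequency_mean` is hypothetical.

```python
def frequency_mean(values, freqs):
    """Mean of tabulated data: sum(f_i * x_i) / sum(f_i)."""
    return sum(x * f for x, f in zip(values, freqs)) / sum(freqs)


# Ungrouped frequency table: age and number of respondents.
ages = [19, 20, 21, 22, 23, 24]
counts = [5, 4, 5, 3, 1, 2]
mean_ungrouped = frequency_mean(ages, counts)           # 417 / 20

# Class intervals: the midpoint stands in for every age in the class.
midpoints = [19, 22, 25]
class_counts = [9, 9, 2]
mean_grouped = frequency_mean(midpoints, class_counts)  # 419 / 20

print(mean_ungrouped, mean_grouped)
```

The two results differ slightly (20.85 against 20.95) because grouping replaces each age with its class midpoint, which loses some detail.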

Skewness
The three measures of location, the mode ($M_o$), the median ($M_d$) and the mean, can be used together to describe the central tendency of a distribution.

Skewness
Symmetrical distribution: a symmetrical distribution has zero skewness. The normal distribution is symmetrical, and the concept of the normal distribution is used in many statistical analyses and tests.
Positively skewed distribution: both the mode and the median are located to the left of the mean.
Negatively skewed distribution: both the mode and the median are located to the right of the mean.

Skewness
Pearson's Index of Skewness (I):
$I = \frac{3(\bar{x} - M_d)}{s}$
$\bar{x}$ = the sample mean
$M_d$ = the median
$s$ = the sample standard deviation

Skewness
Example
The sample mean of a set of data is 3.45, the median is 4.00 and the sample standard deviation is 1.22. Compute Pearson's Index of Skewness and determine whether the data are symmetrically distributed.
$\bar{x} = 3.45$, $M_d = 4$, $s = 1.22$
$I = 3(3.45 - 4) / 1.22 = -1.35$
The index is not zero, so the distribution is not symmetric around the mean. Because the index is negative, the distribution is negatively skewed.

Measure of Dispersion
- The range
- The variance
- The standard deviation
Measures of dispersion express quantitatively the degree of variation or dispersion of values in a population or in a sample.

Measure of Dispersion: The range
Range = Largest - Smallest
Distribution 1: 32 35 36 37 38 40 42 42 43 43 45
Range = 45 - 32 = 13
Distribution 2: 32 32 32 32 34 34 34 34 34 35 45
Range = 45 - 32 = 13
The range is greatly affected by extreme scores.
The range is not the most important measure of variability.
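
The skewness example and the two range calculations can be reproduced with a short script. This is a sketch assuming Python; the function names `pearson_skewness` and `value_range` are hypothetical.

```python
def pearson_skewness(mean, median, std_dev):
    """Pearson's Index of Skewness: I = 3 * (mean - median) / s."""
    return 3 * (mean - median) / std_dev


def value_range(scores):
    """Range = largest score - smallest score."""
    return max(scores) - min(scores)


index = pearson_skewness(mean=3.45, median=4.00, std_dev=1.22)
print(round(index, 2))  # negative, so the data are negatively skewed

dist1 = [32, 35, 36, 37, 38, 40, 42, 42, 43, 43, 45]
dist2 = [32, 32, 32, 32, 34, 34, 34, 34, 34, 35, 45]
print(value_range(dist1), value_range(dist2))
```

Both distributions have the same range even though the second is far more tightly clustered, which illustrates why the range alone is a weak measure of variability.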

Measure of Dispersion: The variance and the standard deviation
The population variance ($\sigma^2$) is a measure of variability between observations within the population.
$\sigma^2 = \frac{\sum (X - \mu)^2}{N}$
$X$ = the individual observation in the population
$\mu$ = the population mean
$N$ = the size of the population
The population standard deviation ($\sigma$) is the positive square root of the variance:
$\sigma = \sqrt{\sigma^2}$
A small variance indicates that the data tend to be very close to the mean and hence to each other, while a high variance indicates that the data are very spread out around the mean and from each other.

Measure of Dispersion: The variance and the standard deviation
The sample variance:
$s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}$
The sample standard deviation:
$s = \sqrt{s^2}$
$\bar{x}$ = sample mean
$x_i$ = individual observation
$n$ = sample size

Steps to compute the variance
Step 1 - Find the mean of the scores.
Step 2 - Subtract the mean from every score.
Step 3 - Square the results of Step 2.
Step 4 - Sum the results of Step 3.
Step 5 - Divide the result of Step 4 by n - 1.
Step 6 - Take the square root of Step 5.
The result at Step 5 is the sample variance. The sample standard deviation is obtained in Step 6.

Measure of Dispersion: The variance and the standard deviation
Example

i        $x_i$    $x_i - \bar{x}$    $(x_i - \bar{x})^2$
1        3        -4                 16
2        6        -1                 1
3        8        1                  1
4        8        1                  1
5        10       3                  9
Total    35       0                  28

Step 1 - Find the mean of the scores: $\bar{x} = 35 / 5 = 7$
Step 2 - Subtract the mean from every score: $x_i - \bar{x}$
Step 3 - Square the results of Step 2: $(x_i - \bar{x})^2$
Step 4 - Sum the results of Step 3: $\sum (x_i - \bar{x})^2 = 28$
Step 5 - Divide the result of Step 4 by n - 1: $s^2 = 28 / (5 - 1) = 7$
Step 6 - Take the square root of Step 5: $s = \sqrt{7} \approx 2.65$
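
The six steps above translate almost line for line into code. This is a minimal sketch assuming Python; the variable names are illustrative. The final lines compare the hand computation against the standard `statistics` module, whose `variance` and `stdev` functions also divide by n - 1.

```python
from statistics import stdev, variance

x = [3, 6, 8, 8, 10]
n = len(x)

mean = sum(x) / n                        # Step 1: mean of the scores
deviations = [xi - mean for xi in x]     # Step 2: subtract the mean
squared = [dev ** 2 for dev in deviations]  # Step 3: square each deviation
total = sum(squared)                     # Step 4: sum of squares
sample_variance = total / (n - 1)        # Step 5: divide by n - 1
sample_sd = sample_variance ** 0.5       # Step 6: square root

print(mean, total, sample_variance, round(sample_sd, 2))

# Cross-check against the library's sample statistics.
print(variance(x), round(stdev(x), 2))
```

The deviations in Step 2 always sum to zero, which is a quick sanity check on the mean before squaring.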

Measure of Dispersion: The variance and the standard deviation
The standard deviation measures variability in units of measurement, while the variance does so in units of measurement squared. For example, if one measured height in inches, then the standard deviation would be in inches, while the variance would be in inches squared.
For this reason, the standard deviation is usually the preferred measure when describing the variability of distributions. The variance, however, has some unique properties which make it very useful later on in the course.

Exercise: Use the following data and calculate the variance and the standard deviation.

Age (x)    No. of Respondents (f)
19         5
20         4
21         5
22         3
23         1
24         2

Once the amount of data we need to process exceeds a certain limit, even the simplest data analysis becomes troublesome by hand.
Applications that solve this problem range from general spreadsheet applications like MS Excel and Lotus 1-2-3 to more advanced statistical applications like SPSS, Minitab, S-Plus and SAS.
Spreadsheet applications are easier to learn but lack advanced statistical functions.
Statistical applications like SPSS have more data analysis capabilities but require advanced mathematical knowledge.
These applications share some common steps in performing statistical analysis.

Perform Data Entry
To enter a fresh set of data, you can select the Type in data radio button and click OK. An easier approach, however, is just to click CANCEL.

The data editor window: title bar, menu bar, tool bar, column headings.
Once you are in the data editor, you can enter your data.

To name a column, just go to the Variable View. Each variable has the following properties:
- Type: a variable can take many forms, such as numeric, string, date, ...
- Decimals: the number of decimal places that SPSS will display.
- Values: which numbers represent which categories (for discrete data of both nominal and ordinal levels of measurement). For example, you could assign the labels 'Male' and 'Female' to the numeric values 1 and 2.
- Width: the number of characters SPSS will allow to be entered for the variable.
- Label: a string of text to identify in more detail what a variable represents.

- Missing: tells SPSS what to do when it encounters missing values in our data file.
- Columns: tells SPSS how wide the column should be for each variable. Don't confuse this one with Width, which indicates how many digits of the number will be displayed. The column size indicates how much space is allocated rather than the degree to which it is filled.
- Align: indicates whether the information in the Data View should be left-justified, right-justified, or centered.
- Measure: whether our data are scale, ordinal, or nominal.

It is always good computing practice to save your data frequently.

In SPSS, almost all the statistical analyses that you may want to perform are located in the Analyze menu. The same goes for the descriptive statistical analysis that we want to conduct.
To proceed with our descriptive statistical analysis, select the menu Analyze > Descriptive Statistics > Descriptives.

A dialog box will now appear, prompting you to select the variable(s) that you want to describe. To do this, click on the variable. With the variable x selected, click on the arrow to move it into the selected variable(s) box on the right.
To customize the analysis to be performed, click on Options.
For this exercise, we will select several statistics that we have covered thus far: Mean, Standard (Std.) Deviation, Variance, Range, Minimum and Maximum. Among these options only Mean measures the central tendency, whereas the other statistics measure the degree of dispersion.

Using Excel
Both Excel and SPSS use the cell paradigm.
Excel also uses the workbook paradigm, where a single workbook can contain many data sheets.
Excel is a spreadsheet application, and its greatest use is when you have a lot of data to manipulate.
Spreadsheet applications are easier to learn but lack advanced statistical functions.

Using Excel: Data Entry
Simply enter your data.
The data sheet in this example has no space left at the top of the sheet for a column header. We can solve this by inserting a new row at the top.

Using Excel
To insert a new row in Excel, first right-click on the number of the row above which you want to insert the new row. Among the options available in the pop-up menu is one called Insert. Click Insert and a new row is automatically inserted above that row.
Double-clicking the column border will increase the width of the column to fit the widest text in that column.
You can save your data by selecting the menu File > Save or by clicking the Save button.

Using Excel
In Excel, data manipulation is achieved by entering a set of formulas.
To enter a formula in a cell, you must start by typing the equal ( = ) sign. If the equal sign is not entered, Excel will treat the formula as text, which means that no computation will be performed.

Functions in Excel for computing descriptive statistics

Function              Formula            Example
Mean                  =AVERAGE(Cells)    =AVERAGE(A2:A6)
Mode                  =MODE(Cells)       =MODE(A2:A6)
Median                =MEDIAN(Cells)     =MEDIAN(A2:A6)
Minimum               =MIN(Cells)        =MIN(A2:A6)
Maximum               =MAX(Cells)        =MAX(A2:A6)
Variance              =VAR(Cells)        =VAR(A2:A6)
Standard Deviation    =STDEV(Cells)      =STDEV(A2:A6)

To perform a computation in Excel, first select the cell where you want the result to appear. Then write the formula, starting with the equal ( = ) sign, in the cell (or in the formula box). Press Enter and the result of the computation will be shown in the cell that you selected earlier.
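
For readers who want to cross-check a spreadsheet result outside Excel, the functions in the table have close counterparts in Python's standard `statistics` module. This is a sketch only and not part of the course software. The cell contents are an assumption: the slides do not say what A2:A6 holds, so the five scores from the variance example are used here. Excel's VAR and STDEV are the sample versions (dividing by n - 1), as are `variance` and `stdev` below.

```python
from statistics import mean, median, mode, stdev, variance

# Hypothetical contents of cells A2:A6 (five values).
cells = [3, 6, 8, 8, 10]

summary = {
    "AVERAGE": mean(cells),
    "MODE": mode(cells),
    "MEDIAN": median(cells),
    "MIN": min(cells),
    "MAX": max(cells),
    "VAR": variance(cells),   # sample variance, divides by n - 1
    "STDEV": stdev(cells),    # sample standard deviation
}

for name, value in summary.items():
    print(f"{name:8s} {value}")
```

The dictionary keys mirror the Excel function names only to make the comparison easy to read; they have no special meaning in Python.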

Using Excel: Formula box
To refer to a cell, you need not type the cell reference. Instead, you can click the cell that you want to use and its reference will be inserted in the formula. This way, you can avoid referring to the wrong cell.
We can also write our own formulas to compute statistics that Excel does not provide as built-in functions.

Thank you
Dr. Mehdi Moeinaddini
mehdi@utm.my