CSC 223 - Advanced Scientific Programming, Spring 2018 Descriptive Statistics

Overview

Statistics is the science of collecting, organizing, analyzing, and interpreting data in order to make decisions. Data consists of information coming from observations, counts, measurements, or responses.

A population is the collection of all outcomes, responses, measurements, or counts that are of interest. A sample is a subset of the population.

A parameter is a numerical description of a population characteristic. A statistic is a numerical description of a sample characteristic.

Branches of Statistics

Descriptive statistics is the branch of statistics that involves the organization, summarization, and display of data.

Inferential statistics is the branch of statistics that involves using a sample to draw conclusions about a population. A basic tool in the study of inferential statistics is probability.

Data Classification

Types of data:
- Qualitative data consist of attributes, labels, or nonnumerical entries.
- Quantitative data consist of numerical measurements or counts.

Levels of measurement:
- Nominal: categorized using names, labels, or qualities.
- Ordinal: can be arranged in order or ranked.
- Interval: can be ordered, and meaningful differences between entries can be calculated.
- Ratio: similar to interval, but there is a zero entry that is an inherent zero (implies none).

Frequency Distributions

A frequency distribution is a table that shows classes or intervals of data entries with a count of the number of entries in each class. The frequency $f$ of a class is the number of data entries in the class.

Constructing a frequency distribution:
1. Decide on the number of classes.
2. Find the class width: the range of the data divided by the number of classes, rounded up to a convenient number.
3. Find the class limits.
4. Count the number of data entries that fall within the class boundaries to determine the frequency $f$ for each class.

Frequency Distribution Example

Number of classes: 5
Data set: 7, 39, 13, 9, 25, 8, 22, 0, 2, 18, 2, 30, 7, 35, 12, 15, 8, 6, 5, 29, 0, 11, 39, 16, 15
Range = 39 - 0 = 39
Class width = 39 / 5 = 7.8, rounded up to 8

Frequency distribution:

Class    f
0-7      8
8-15     8
16-23    3
24-31    3
32-39    3

$\sum f = 25$
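The construction steps can be sketched in Python as follows. This is a minimal sketch for integer data, not a general-purpose routine: the helper name `frequency_distribution` is hypothetical, and the width rule (integer quotient plus one) is an assumption chosen so that the maximum entry always falls inside the last class.

```python
data = [7, 39, 13, 9, 25, 8, 22, 0, 2, 18, 2, 30, 7,
        35, 12, 15, 8, 6, 5, 29, 0, 11, 39, 16, 15]


def frequency_distribution(values, num_classes):
    """Return a list of ((lower, upper), frequency) pairs for integer data."""
    lowest = min(values)
    data_range = max(values) - lowest
    # Assumed rounding rule: take the integer quotient and add one, so that
    # 39 / 5 = 7.8 becomes 8 and an exact quotient is still bumped up.
    width = data_range // num_classes + 1
    table = []
    for i in range(num_classes):
        lower = lowest + i * width
        upper = lower + width - 1  # class limits are inclusive
        f = sum(1 for v in values if lower <= v <= upper)
        table.append(((lower, upper), f))
    return table


table = frequency_distribution(data, 5)
for (lower, upper), f in table:
    print(f"{lower}-{upper}: {f}")
print("total:", sum(f for _, f in table))
```

With the example data and five classes, the printed frequencies should match the table above.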

Frequency Distributions

The midpoint of a class is the sum of the lower and upper limits of the class divided by two. The midpoint is sometimes called the class mark.

The relative frequency of a class is the portion or percentage of the data that falls in that class. To find the relative frequency of a class, divide the frequency $f$ by the sample size $n$.

The cumulative frequency of a class is the sum of the frequency for that class and all previous classes. The cumulative frequency of the last class is equal to the sample size $n$.

Frequency Distribution Example

Number of classes: 5
Data set: 7, 39, 13, 9, 25, 8, 22, 0, 2, 18, 2, 30, 7, 35, 12, 15, 8, 6, 5, 29, 0, 11, 39, 16, 15

Frequency distribution:

Class    f    Midpoint    Relative frequency    Cumulative frequency
0-7      8    3.5         0.32                  8
8-15     8    11.5        0.32                  16
16-23    3    19.5        0.12                  19
24-31    3    27.5        0.12                  22
32-39    3    35.5        0.12                  25

$\sum f = 25$, $\sum \frac{f}{n} = 1$
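As a rough illustration, the three derived columns can be computed from the class limits and frequencies alone. The variable names below are illustrative choices, not part of the lecture material.

```python
from itertools import accumulate

# Class limits and frequencies taken from the example table.
classes = [(0, 7), (8, 15), (16, 23), (24, 31), (32, 39)]
freqs = [8, 8, 3, 3, 3]

n = sum(freqs)                                        # sample size
midpoints = [(lo + hi) / 2 for lo, hi in classes]     # class marks
relative = [f / n for f in freqs]                     # f / n per class
cumulative = list(accumulate(freqs))                  # running total of f

for (lo, hi), f, m, r, c in zip(classes, freqs, midpoints, relative, cumulative):
    print(f"{lo}-{hi}\t{f}\t{m}\t{r:.2f}\t{c}")
```

The last cumulative frequency equals `n`, and the relative frequencies sum to 1 up to floating-point rounding.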

Measures of Central Tendency

The mean of a data set is the sum of the data entries divided by the number of entries.

Population mean: $\mu = \frac{\sum x}{N}$

Sample mean: $\bar{x} = \frac{\sum x}{n}$

The median of a data set is the value that lies in the middle of the data when the data is in sorted order.

The mode of a data set is the data entry that occurs with the greatest frequency.
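One way to compute these measures in Python is with the standard-library `statistics` module. The sketch below applies it to the data set used in the frequency distribution examples; `statistics.multimode` requires Python 3.8 or later and is used here because a data set can have more than one mode.

```python
import statistics

data = [7, 39, 13, 9, 25, 8, 22, 0, 2, 18, 2, 30, 7,
        35, 12, 15, 8, 6, 5, 29, 0, 11, 39, 16, 15]

mean = statistics.mean(data)        # sum of entries / number of entries
median = statistics.median(data)    # middle value of the sorted data
modes = statistics.multimode(data)  # every entry with the greatest frequency

print("mean:", mean)
print("median:", median)
print("modes:", sorted(modes))
```

For this data set several entries tie for the greatest frequency, so `multimode` returns a list with more than one value.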

Measures of Central Tendency

An outlier is a data entry that is far removed from the other entries in the data set.

A weighted mean is the mean of a data set whose entries have varying weights. A weighted mean is given by

$$\bar{x} = \frac{\sum (x \cdot w)}{\sum w}$$

where $w$ is the weight of each entry $x$.
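A minimal sketch of the weighted mean formula follows. The function name `weighted_mean` and the scores and weights are hypothetical values chosen only for illustration.

```python
def weighted_mean(values, weights):
    """Return sum(x * w) / sum(w) for paired entries and weights."""
    if len(values) != len(weights):
        raise ValueError("values and weights must have the same length")
    total_weight = sum(weights)
    if total_weight == 0:
        raise ValueError("weights must not sum to zero")
    return sum(x * w for x, w in zip(values, weights)) / total_weight


# Hypothetical example: three scores weighted 50%, 30%, and 20%.
scores = [86, 96, 82]
weights = [0.5, 0.3, 0.2]
print(weighted_mean(scores, weights))
```

When all weights are equal, the result reduces to the ordinary mean.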

Measures of Central Tendency

The mean of a frequency distribution for a sample is approximated by

$$\bar{x} = \frac{\sum (x \cdot f)}{\sum f}$$

where $f$ is the frequency and $x$ is the midpoint of each class.

Frequency Distribution Example

Number of classes: 5
Data set: 7, 39, 13, 9, 25, 8, 22, 0, 2, 18, 2, 30, 7, 35, 12, 15, 8, 6, 5, 29, 0, 11, 39, 16, 15

Frequency distribution:

Class    f    Midpoint, x    x · f
0-7      8    3.5            28
8-15     8    11.5           92
16-23    3    19.5           58.5
24-31    3    27.5           82.5
32-39    3    35.5           106.5

$\sum f = 25$, $\sum (x \cdot f) = 367.5$

Mean of frequency distribution: $\frac{367.5}{25} = 14.7$
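The same calculation can be sketched in a few lines of Python, using the midpoints and frequencies from the table. The variable names are illustrative.

```python
# Midpoints and frequencies taken from the example table.
midpoints = [3.5, 11.5, 19.5, 27.5, 35.5]
freqs = [8, 8, 3, 3, 3]

sum_xf = sum(x * f for x, f in zip(midpoints, freqs))  # sum of x * f
n = sum(freqs)                                         # sum of f
grouped_mean = sum_xf / n

print(sum_xf, n, grouped_mean)
```

Because every entry in a class is replaced by the class midpoint, this value is only an approximation of the mean of the raw data.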

Measures of Variation

The range of a data set is the difference between the maximum and minimum data entries in the set.

The deviation of an entry $x$ in a population data set is the difference between the entry and the mean $\mu$ of the data set:

$$\text{Deviation of } x = x - \mu$$

The population variance of a population data set of $N$ entries is

$$\sigma^2 = \frac{\sum (x - \mu)^2}{N}$$

where the symbol $\sigma$ is the lowercase Greek letter sigma.

Measures of Variation

The population standard deviation of a population data set of $N$ entries is the square root of the population variance:

$$\sigma = \sqrt{\sigma^2} = \sqrt{\frac{\sum (x - \mu)^2}{N}}$$

Finding Population Variance and Standard Deviation

1. Find the mean of the population data set: $\mu = \frac{\sum x}{N}$
2. Find the deviation of each entry: $x - \mu$
3. Square each deviation: $(x - \mu)^2$
4. Add to get the sum of squares: $SS_x = \sum (x - \mu)^2$
5. Divide by $N$ to get the population variance: $\sigma^2 = \frac{\sum (x - \mu)^2}{N}$
6. Find the square root of the variance to get the population standard deviation: $\sigma = \sqrt{\frac{\sum (x - \mu)^2}{N}}$
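The six steps can be followed one at a time in Python. The sketch below uses a small illustrative population that is not taken from the lecture data, and cross-checks the result against `statistics.pvariance` and `statistics.pstdev` from the standard library.

```python
import math
import statistics

# Illustrative population (hypothetical values, chosen for easy arithmetic).
population = [2, 4, 4, 4, 5, 5, 7, 9]
N = len(population)

mu = sum(population) / N                    # step 1: population mean
deviations = [x - mu for x in population]   # step 2: deviation of each entry
squared = [d ** 2 for d in deviations]      # step 3: squared deviations
ss_x = sum(squared)                         # step 4: sum of squares
variance = ss_x / N                         # step 5: population variance
sigma = math.sqrt(variance)                 # step 6: population standard deviation

print(mu, ss_x, variance, sigma)
print(statistics.pvariance(population), statistics.pstdev(population))
```

Both printed lines should agree on the variance and standard deviation.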

Measures of Variation

The sample variance and sample standard deviation of a sample data set of $n$ entries are

$$\text{Sample variance} = s^2 = \frac{\sum (x - \bar{x})^2}{n - 1}$$

$$\text{Sample standard deviation} = s = \sqrt{\frac{\sum (x - \bar{x})^2}{n - 1}}$$
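For comparison with the population formulas, here is a short sketch of the sample versions. The data values are hypothetical; the only change from the population calculation is the divisor $n - 1$ in place of $N$. The standard-library functions `statistics.variance` and `statistics.stdev` use the same $n - 1$ divisor.

```python
import math
import statistics

# Illustrative sample (hypothetical values).
sample = [2, 4, 4, 4, 5, 5, 7, 9]
n = len(sample)

x_bar = sum(sample) / n                       # sample mean
ss = sum((x - x_bar) ** 2 for x in sample)    # sum of squared deviations
s2 = ss / (n - 1)                             # sample variance
s = math.sqrt(s2)                             # sample standard deviation

print(s2, s)
print(statistics.variance(sample), statistics.stdev(sample))
```

For the same entries, the sample variance is larger than the population variance because the sum of squares is divided by a smaller number.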

Measures of Variation

Symbols               Population            Sample
Variance              $\sigma^2$            $s^2$
Standard deviation    $\sigma$              $s$
Mean                  $\mu$                 $\bar{x}$
Number of entries     $N$                   $n$
Deviation             $x - \mu$             $x - \bar{x}$
Sum of squares        $\sum (x - \mu)^2$    $\sum (x - \bar{x})^2$

Measures of Variation

The standard deviation of a frequency distribution is

$$s = \sqrt{\frac{\sum (x - \bar{x})^2 f}{n - 1}}$$

where $n = \sum f$.
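As a sketch, this formula can be applied to the five-class frequency distribution from the earlier examples, with $x$ taken as the class midpoint and $\bar{x}$ as the mean of the frequency distribution. The variable names are illustrative.

```python
import math

# Midpoints and frequencies from the five-class example distribution.
midpoints = [3.5, 11.5, 19.5, 27.5, 35.5]
freqs = [8, 8, 3, 3, 3]

n = sum(freqs)                                              # n = sum of f
x_bar = sum(x * f for x, f in zip(midpoints, freqs)) / n    # grouped mean
# Each squared deviation is weighted by the class frequency f.
ss = sum((x - x_bar) ** 2 * f for x, f in zip(midpoints, freqs))
s = math.sqrt(ss / (n - 1))

print(x_bar, ss, s)
```

For this distribution the weighted sum of squares is 2944 and the standard deviation comes out to roughly 11.08, up to floating-point rounding.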