Dimension Reduction using Principal Components Analysis (PCA)


Applications of dimension reduction
Computational advantage for other algorithms
Face recognition: image data (pixels) projected along new axes works better for recognizing faces
Image compression

Data on 25 undergraduate programs at business schools in US universities in 1995. Use PCA to:
1) Reduce the number of columns
Additional benefits:
2) Identify relations between columns
3) Visualize the universities in 2D

Univ          SAT   Top10  Accept  SFRatio  Expenses  GradRate
Brown         1310   89     22      13      22,704     94
CalTech       1415  100     25       6      63,575     81
CMU           1260   62     59       9      25,026     72
Columbia      1310   76     24      12      31,510     88
Cornell       1280   83     33      13      21,864     90
Dartmouth     1340   89     23      10      32,162     95
Duke          1315   90     30      12      31,585     95
Georgetown    1255   74     24      12      20,126     92
Harvard       1400   91     14      11      39,525     97
JohnsHopkins  1305   75     44       7      58,691     87
MIT           1380   94     30      10      34,870     91
Northwestern  1260   85     39      11      28,052     89
NotreDame     1255   81     42      13      15,122     94
PennState     1081   38     54      18      10,185     80
Princeton     1375   91     14       8      30,220     95
Purdue        1005   28     90      19       9,066     69
Stanford      1360   90     20      12      36,450     93
TexasA&M      1075   49     67      25       8,704     67
UCBerkeley    1240   95     40      17      15,140     78
UChicago      1290   75     50      13      38,380     87
UMichigan     1180   65     68      16      15,470     85
UPenn         1285   80     36      11      27,553     90
UVA           1225   77     44      14      13,349     92
UWisconsin    1085   40     69      15      11,857     71
Yale          1375   95     19      11      43,514     96

Source: US News & World Report, Sept 18, 1995

PCA: Input and Output

Input columns (the slide shows the first ten universities from the table above): Univ, SAT, Top10, Accept, SFRatio, Expenses, GradRate.
Output columns to be computed: PC 1, PC 2, PC 3, PC 4, PC 5, PC 6.

The hope is that a few columns may capture most of the information from the original dataset.

The Primitive Idea: Intuition First

How do we compress the data while losing the least amount of information?

Input -> PCA -> Output

Input: p measurements / original columns, correlated
Output: p principal components (= weighted averages of the original measurements), uncorrelated, ordered by variance

Keep the top principal components; drop the rest.

Mechanism

(The slide repeats the input table with the output columns PC 1 through PC 6 alongside it.)

The i-th principal component is a weighted average of the original measurements/columns:

PC_i = a_i1 X_1 + a_i2 X_2 + ... + a_ip X_p

The weights (a_ij) are chosen such that:
1. PCs are ordered by their variance (PC 1 has the largest variance, followed by PC 2, PC 3, and so on)
2. Pairs of PCs have correlation = 0
3. For each PC, the sum of squared weights = 1

PC_i = a_i1 X_1 + a_i2 X_2 + ... + a_ip X_p

Demystifying weight computation

Main idea: high variance = lots of information.

Var(PC_i) = a_i1^2 Var(X_1) + a_i2^2 Var(X_2) + ... + a_ip^2 Var(X_p)
            + 2 a_i1 a_i2 Cov(X_1, X_2) + ... + 2 a_i(p-1) a_ip Cov(X_(p-1), X_p)

Also want Cov(PC_i, PC_j) = 0 when i ≠ j.

Goal: find weights a_ij that maximize the variance of PC_i while keeping PC_i uncorrelated with the other PCs. The covariance matrix of the X's is needed.
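Under the hood (a minimal sketch, not from the slides): the weights above are the eigenvectors of the correlation matrix of the inputs, and the PC variances are the corresponding eigenvalues. Here 'X' is assumed to hold the six numeric university columns (e.g., mydata[1:25, 2:7] from the assignment code further below).

## OPTIONAL R sketch: PC weights as eigenvectors of the correlation matrix
corr_mat <- cor(X)                       ## correlation matrix of the inputs
eig <- eigen(corr_mat)                   ## eigen decomposition
weights <- eig$vectors                   ## column i holds a_i1, ..., a_ip for PC i
variances <- eig$values                  ## Var(PC 1) >= Var(PC 2) >= ...
colSums(weights^2)                       ## sum of squared weights = 1 for each PC
round(cor(scale(X) %*% weights), 2)      ## PC scores are (essentially) uncorrelated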

Standardize the inputs

Why? Variables with large variances will have a bigger influence on the result.
Solution: standardize before applying PCA.

Univ          Z_SAT    Z_Top10  Z_Accept  Z_SFRatio  Z_Expenses  Z_GradRate
Brown          0.4020   0.6442  -0.8719    0.0688    -0.3247      0.8037
CalTech        1.3710   1.2103  -0.7198   -1.6522     2.5087     -0.6315
CMU           -0.0594  -0.7451   1.0037   -0.9146    -0.1637     -1.6251
Columbia       0.4020  -0.0247  -0.7705   -0.1770     0.2858      0.1413
Cornell        0.1251   0.3355  -0.3143    0.0688    -0.3829      0.3621
Dartmouth      0.6788   0.6442  -0.8212   -0.6687     0.3310      0.9141
Duke           0.4481   0.6957  -0.4664   -0.1770     0.2910      0.9141
Georgetown    -0.1056  -0.1276  -0.7705   -0.1770    -0.5034      0.5829
Harvard        1.2326   0.7471  -1.2774   -0.4229     0.8414      1.1349
JohnsHopkins   0.3559  -0.0762   0.2433   -1.4063     2.1701      0.0309
MIT            1.0480   0.9015  -0.4664   -0.6687     0.5187      0.4725
Northwestern  -0.0594   0.4384  -0.0101   -0.4229     0.0460      0.2517
NotreDame     -0.1056   0.2326   0.1419    0.0688    -0.8503      0.8037
PennState     -1.7113  -1.9800   0.7502    1.2981    -1.1926     -0.7419
Princeton      1.0018   0.7471  -1.2774   -1.1605     0.1963      0.9141
Purdue        -2.4127  -2.4946   2.5751    1.5440    -1.2702     -1.9563
Stanford       0.8634   0.6957  -0.9733   -0.1770     0.6282      0.6933
TexasA&M      -1.7667  -1.4140   1.4092    3.0192    -1.2953     -2.1771
UCBerkeley    -0.2440   0.9530   0.0406    1.0523    -0.8491     -0.9627
UChicago       0.2174  -0.0762   0.5475    0.0688     0.7620      0.0309
UMichigan     -0.7977  -0.5907   1.4599    0.8064    -0.8262     -0.1899
UPenn          0.1713   0.1811  -0.1622   -0.4229     0.0114      0.3621
UVA           -0.3824   0.0268   0.2433    0.3147    -0.9732      0.5829
UWisconsin    -1.6744  -1.8771   1.5106    0.5606    -1.0767     -1.7355
Yale           1.0018   0.9530  -1.0240   -0.4229     1.1179      1.0245

Excel: =STANDARDIZE(cell, AVERAGE(column), STDEV(column))

Standardization shortcut for PCA

Rather than standardizing the data manually, you can use the correlation matrix (instead of the covariance matrix) as the input. Note that PCA with and without standardization gives different results!
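A minimal sketch of both routes in R (again assuming 'X' stands for the six numeric columns): standardizing manually with scale() and then running PCA gives the same loadings, up to sign, as passing princomp the correlation-matrix option directly.

## OPTIONAL R sketch: manual standardization vs. the correlation-matrix shortcut
Z <- scale(X)                                 ## R analogue of the Excel STANDARDIZE step
pca_std <- princomp(Z)                        ## PCA on the standardized data
pca_cor <- princomp(X, cor = TRUE)            ## same analysis via the correlation matrix
round(abs(unclass(pca_std$loadings)) - abs(unclass(pca_cor$loadings)), 6)  ## ~0: loadings agree up to sign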

PCA Transform (the correlation matrix has been used here): Scaled Data -> Principal Components, PC Scores

Principal components:
PCs are uncorrelated
Var(PC 1) > Var(PC 2) > ...
PC_i = a_i1 X_1 + a_i2 X_2 + ... + a_ip X_p

Computing principal scores

For each record, we can compute its score on each PC: multiply each weight (a_ij) by the appropriate X_ij and sum.

Example for Brown University (using the standardized numbers):

PC 1 score for Brown University
= (-0.458)(0.40) + (-0.427)(0.64) + (0.424)(-0.87) + (0.391)(0.07) + (-0.363)(-0.32) + (-0.379)(0.80)
= -0.989
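A minimal sketch of the same calculation in R, assuming 'pcaobj' is the princomp object created in the assignment code below and 'Z' the standardized data (Brown is the first row of the table):

## OPTIONAL R sketch: recomputing one PC score by hand
w1 <- pcaobj$loadings[, 1]        ## weights a_11, ..., a_1p for PC 1
z_brown <- as.numeric(Z[1, ])     ## standardized values for Brown
sum(w1 * z_brown)                 ## manual PC 1 score for Brown
pcaobj$scores[1, 1]               ## princomp's score; close, up to its internal scaling (divisor n vs n-1)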

R Code for PCA (Assignment) OPTIONAL R Code

install.packages("gdata")   ## for reading xls files
install.packages("xlsx")    ## for reading xlsx files
library(xlsx)               ## load the package before calling read.xlsx
mydata <- read.xlsx("university Ranking.xlsx", 1)   ## use read.csv for csv files
mydata                      ## make sure the data is loaded correctly
help(princomp)              ## to understand the API for princomp
pcaobj <- princomp(mydata[1:25, 2:7], cor = TRUE, scores = TRUE, covmat = NULL)
## the first column in mydata has the university names
## princomp(mydata, cor = TRUE) is not the same as prcomp(mydata, scale = TRUE); similar, but different
summary(pcaobj)
loadings(pcaobj)
plot(pcaobj)
biplot(pcaobj)
pcaobj$loadings
pcaobj$scores

Goal #1: Reduce data dimension

PCs are ordered by their variance (= information). Choose the top few PCs and drop the rest!

Example: PC 1 captures most of the information: ??%. The first 2 PCs capture ??%.
Data reduction: use only two variables instead of 6.
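A minimal sketch for filling in those percentages, assuming 'pcaobj' is the princomp object from the assignment code above:

## OPTIONAL R sketch: proportion of variance captured by each PC
pc_var <- pcaobj$sdev^2                 ## variances of the PCs
prop <- pc_var / sum(pc_var)            ## share of total variance per PC
round(cumsum(prop), 3)                  ## cumulative share: PC 1 alone, first 2 PCs, ...
screeplot(pcaobj, type = "lines")       ## visual aid for deciding how many PCs to keep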

Matrix Transpose OPTIONAL: R code

help(matrix)
A <- matrix(c(1, 2), nrow = 1, ncol = 2, byrow = TRUE)
A
t(A)
B <- matrix(c(1, 2, 3, 4), nrow = 2, ncol = 2, byrow = TRUE)
B
t(B)
C <- matrix(c(1, 2, 3, 4, 5, 6), nrow = 3, ncol = 2, byrow = TRUE)
C
t(C)

Matrix Multiplication OPTIONAL R Code

A <- matrix(c(1, 2, 3, 4, 5, 6), nrow = 3, ncol = 2, byrow = TRUE)
A
B <- matrix(c(1, 2, 3, 4, 5, 6, 7, 8), nrow = 2, ncol = 4, byrow = TRUE)
B
C <- A %*% B
D <- t(B) %*% t(A)
## note, B %*% A is not possible; what does D look like?

Matrix Inverse

If A B = I (the identity matrix), then B = A^(-1).

Identity matrix:
1 0 ... 0 0
0 1 ... 0 0
...
0 0 ... 1 0
0 0 ... 0 1

OPTIONAL R Code
## How to create an n x n identity matrix?
help(diag)
A <- diag(5)
## find the inverse of a matrix
solve(A)

Data Compression

[PC Scores]_(N x p) = [Scaled Data]_(N x p) [Principal Components]_(p x p)

[Scaled Data]_(N x p) = [PC Scores]_(N x p) [Principal Components]^T_(p x p)

c = number of components kept; c <= p

Approximation:
[Approximated Scaled Data]_(N x p) = [PC Scores]_(N x c) [Principal Components]^T_(c x p)
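A minimal sketch of that approximation, assuming 'Z' is the standardized data matrix and 'pcaobj' the princomp object from the assignment code:

## OPTIONAL R sketch: rebuild the scaled data from only the first k components
k <- 2                                       ## number of components kept (c in the slide)
V <- unclass(pcaobj$loadings)                ## p x p matrix of PC weights
S <- as.matrix(Z) %*% V                      ## N x p matrix of PC scores
Z_approx <- S[, 1:k] %*% t(V[, 1:k])         ## N x p approximation using k components
round(head(Z - Z_approx), 2)                 ## reconstruction error carried by the dropped PCs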

Goal #2: Learn relationships with PCA by interpreting the weights

a_i1, ..., a_ip are the coefficients for PC_i. They describe the role of the original X variables in computing PC_i. This is useful for providing a context-specific interpretation of each PC.

PC 1 scores (choose one or more):
1. are approximately a simple average of the 6 variables
2. measure the degree of high Accept and SFRatio, but low Expenses, GradRate, SAT, and Top10

Goal #3: Use PCA for visualization

The first 2 (or 3) PCs provide a way to project the data from a p-dimensional space onto a 2D (or 3D) space.

Scatter Plot: PC 2 vs. PC 1 scores
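A minimal sketch of that plot, assuming 'pcaobj' and 'mydata' from the assignment code (the first column of mydata holds the university names):

## OPTIONAL R sketch: universities projected onto the first two PCs
plot(pcaobj$scores[, 1], pcaobj$scores[, 2],
     xlab = "PC 1 score", ylab = "PC 2 score",
     main = "Scatter plot: PC 2 vs. PC 1 scores")
text(pcaobj$scores[, 1], pcaobj$scores[, 2],
     labels = mydata[1:25, 1], pos = 3, cex = 0.7)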

Monitoring batch processes using PCA

Multivariate data are collected at different time points. A historical database of successful batches is used. The multivariate trajectory data are projected onto a low-dimensional space, which yields simple monitoring charts for spotting outliers.
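A purely illustrative sketch (not the slides' implementation): fit PCA on historical good batches, project new batches onto the first two PCs, and flag scores falling outside simple plus/minus 3 standard-deviation limits. 'historical' and 'new_batches' are hypothetical numeric matrices with identical columns.

## OPTIONAL R sketch: crude PCA-based monitoring chart
pca_hist <- princomp(historical, cor = TRUE)                   ## model built on successful batches only
new_scores <- predict(pca_hist, newdata = new_batches)[, 1:2]  ## project new batches onto PC 1 and PC 2
limits <- 3 * pca_hist$sdev[1:2]                               ## rough control limits for PC 1 and PC 2
flags <- abs(new_scores) > matrix(limits, nrow(new_scores), 2, byrow = TRUE)
which(rowSums(flags) > 0)                                      ## batches worth investigating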

Your Turn!
1. If we use a subset of the principal components, is this useful for prediction? For explanation?
2. What are the advantages and weaknesses of PCA compared to choosing a subset of the variables?
3. PCA vs. clustering