Classification and Regression Trees


In unsupervised classification (clustering), there is no response variable (dependent variable); the regions corresponding to a given node are based on the similarity of the observations to each other. In classification and regression trees, the region at each node is based on some similarity of the response variables to each other.

Classification and regression trees are formed divisively, based on a response variable. If a node has more than one group, we may divide the node into multiple nodes that are more pure.

We will use the same notation as we used for clustering trees. Each node in the tree corresponds to some specific region $X_m$ of the feature space (the space of independent variables). We will also occasionally abuse the notation slightly and use $X_m$ for the set of indexes $i$ such that $x_i \in X_m$.

The branches from a node are defined in terms of rules to split the feature space.

If the response variable is a group category, a classification tree is formed. Each node in a classification tree corresponds to the predominant value of the response within the subdomain of the features corresponding to that node.

If the response is a numeric variable, a regression tree is formed. Each node in a regression tree corresponds to the average of the response within a given region, rather than to a predominant value of the response.

Impurity of Nodes of Trees

Nodes are split based on their impurity. Impurity is a measure of how badly the observations at a given node fit the model.

In a regression tree, for example, the impurity may be measured by the residual sum of squares within that node. In a classification tree, there are various ways of measuring the impurity, such as the misclassification error, the Gini index, and the entropy.

Deviance

The term deviance is used in various ways in statistics. In general, it is a measure of variability that is not accounted for by the fitted model.

It is usually not scaled either to account for the number of observations or to account for their magnitude; thus, a larger set of observations will usually have a larger deviance than a smaller set in the same situation, and likewise, data with larger values will usually have a larger deviance than the same data measured in units that make the values smaller.

The fact that a deviance is not scaled for the number of observations yields an additive property for the nodes in a tree.
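A small numerical check of the two scaling remarks, as a sketch only. The helper `dev` is hypothetical (it is not from the slides) and measures deviance as the sum of squared deviations from the mean, which is the form used for regression trees in these notes; the data are simulated.

```r
# Hypothetical helper: deviance as the sum of squared deviations from the mean.
dev <- function(y) sum((y - mean(y))^2)

set.seed(1)
y_small <- rnorm(10)               # 10 observations
y_large <- c(y_small, rnorm(90))   # the same 10 observations plus 90 more

dev_small  <- dev(y_small)
dev_large  <- dev(y_large)         # more observations: a larger deviance here
dev_scaled <- dev(10 * y_small)    # same data with values 10 times larger
```

With this measure, multiplying the data by 10 multiplies the deviance by exactly 100, which is why deviances are only comparable between nodes of the same tree fitted to the same response.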

Regression Trees

Our general model for regression (in one form) has been

$$y_i = \beta_0 + x_i^{\mathrm{T}} \beta + \epsilon_i,$$

where $\epsilon_i$ is assumed to be a random variable with $\mathrm{E}(\epsilon_i) = 0$. This yields an expression for $y_i$ conditional on the corresponding $x_i$; hence we may write the left-hand side as $y_i \mid x_i$.

The model for a regression tree is of the form

$$y_i \mid x_i = \mu_m + \epsilon_i,$$

where $m$ is determined by the index of the region $X_m$ of the feature space such that $x_i \in X_m$, $\mu_m = \mathrm{E}(Y \mid x_i \in X_m)$, and $\epsilon_i$ is assumed to be a random variable with $\mathrm{E}(\epsilon_i) = 0$ as before.

At node $m$ the model fitted is just the average response at that node,

$$\hat{y}_i = \frac{1}{n_m} \sum_{k \in X_m} y_k \quad \text{for every } i \in X_m,$$

where $n_m$ is the number of observations at node $m$.

Measure of Impurity in Regression Trees

The obvious unscaled measure of impurity of any node in a regression tree is the residual sum of squares, RSS. At node $m$, it is just

$$\sum_{i \in X_m} (y_i - \hat{y}_i)^2.$$

This is called the deviance.
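The node average and the node deviance can be computed directly in base R. The following is a minimal sketch on made-up data; the names `node_fit` and `node_dev`, the toy values, and the split point 1.0 are all assumptions for illustration, not part of the original example.

```r
# Hypothetical toy data: one feature x and a numeric response y.
x <- c(0.1, 0.4, 0.5, 0.9, 1.2, 1.6, 1.8, 2.3)
y <- c(1.0, 1.2, 0.9, 1.1, 2.1, 1.9, 2.2, 2.0)

# Fitted value at a node: the mean response of the observations in it.
node_fit <- function(idx) mean(y[idx])
# Deviance (RSS) at a node: squared deviations from the node mean.
node_dev <- function(idx) sum((y[idx] - node_fit(idx))^2)

root  <- seq_along(y)
left  <- which(x < 1.0)    # assumed split point, chosen by eye
right <- which(x >= 1.0)

dev_root     <- node_dev(root)
dev_children <- node_dev(left) + node_dev(right)   # deviances add across nodes
reduction    <- dev_root - dev_children
```

For these data the root deviance is 2.10 and the two child deviances sum to 0.10, so the split reduces the deviance by 2.00. Because the deviance is unscaled, the deviance of the split tree is simply the sum over its terminal nodes.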

Example

    set.seed(5)
    n <- 20
    n1 <- 12
    n2 <- n-n1
    # two groups of observations; the second has shifted features and a larger mean response
    exreg <- data.frame(cbind(x1=c(rnorm(n1),rnorm(n2)+1.0),
                              x2=c(rnorm(n1),rnorm(n2)+1.5),
                              y=c(1+0.25*rnorm(n1),2+0.25*rnorm(n2))))
    exreg
    attach(exreg)
    plot(x1,x2,main="Population Means of Responses")
    # label each point with the population mean of its response
    text(x1[1:n1]+.05,x2[1:n1],"1")
    text(x1[(n1+1):n]+.05,x2[(n1+1):n],"2")

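The Data for the Regression Tree

[Table: the 20 observations, with columns x1, x2, and y; values not available]

[Figure: "Population Means of Responses", a scatterplot of x2 versus x1 with each point labeled by the population mean of its response]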

R on the Regression Tree Example

    library(tree)
    regtree <- tree(y~x1+x2)

This produces

    node), split, n, deviance, yval
          * denotes terminal node

    1) root
      2) x1 <
        4) x2 <  *
        5) x2 >  *
      3) x1 >
        6) x2 <  *
        7) x2 >  *

[The numeric values of the split points, n, deviance, and yval are not available.]

Prediction in the Regression Tree Example

Now, let's predict the response for some new observations using the regression tree that R computed.

    newdata <- data.frame(cbind(x1=c(1,-1),x2=c(3,0)))
    predict.tree(regtree, newdata)

This produces one predicted value for each of the two new observations [values not available]. Each prediction is the average value of the response within the region that contains the new observation.

Classification Trees in R

Example

    set.seed(5)
    n <- 20
    n1 <- 12
    n2 <- n-n1
    exclass <- data.frame(cbind(x1=c(rnorm(n1),rnorm(n2)+1.0),
                                x2=c(rnorm(n1),rnorm(n2)+1.5),
                                y=c(rep(1,n1),rep(2,n2))))
    attach(exclass)
    plot(x1,x2,col=y)

[Figure: scatterplot of x2 versus x1, with points colored by class y]

Classification Trees

A classification tree is formed by recursively dividing up the space of the data. A simple procedure is to choose one feature at a time and make a split at a particular value of that feature.

Let's just do this by eye.
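The "one feature, one split value" step can also be automated. The sketch below searches the midpoints between consecutive distinct values of a single feature and keeps the cut with the smallest size-weighted Gini index of the two child nodes. This is one common criterion, not necessarily what the `tree` package does internally; `gini`, `best_split`, and the toy data are hypothetical names and values introduced for illustration.

```r
# Gini index of a vector of class labels.
gini <- function(y) {
  p <- as.vector(table(y)) / length(y)
  sum(p * (1 - p))
}

# Hypothetical helper: best threshold on one feature, chosen to minimize
# the size-weighted Gini index of the two child nodes.
best_split <- function(x, y) {
  xs <- sort(unique(x))
  cuts <- (head(xs, -1) + tail(xs, -1)) / 2   # midpoints between distinct values
  score <- sapply(cuts, function(cut) {
    left  <- y[x < cut]
    right <- y[x >= cut]
    (length(left) * gini(left) + length(right) * gini(right)) / length(y)
  })
  list(cut = cuts[which.min(score)], score = min(score))
}

# Toy data, assumed for illustration: the classes separate cleanly on x.
x <- c(0.2, 0.5, 0.7, 1.1, 1.4, 1.9)
y <- c(1, 1, 1, 2, 2, 2)
s <- best_split(x, y)
```

Here the search returns the cut 0.9 with a weighted Gini index of 0, because both children are pure. Applying the same search to each feature in turn, and then recursively within each child, gives the usual greedy tree-growing procedure.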

[Figure: scatterplot of x2 versus x1 showing the splits chosen by eye]

Pruning Trees

[Figure: scatterplots of x2 versus x1 illustrating pruning]

Nodes in Classification Trees

For any node $m$, we consider the proportion of observations that we have assigned to class $j$ (that is, estimated to be in class $j$).

We denote the region of the feature space representing node $m$ as $R_m$, and the number of observations at that node as $n_m$. Initially, of course, $R_1$ is the full space of the features in all observations and $n_1 = n$.

Define

$$\hat{p}_{mj} = \frac{1}{n_m} \sum_{x_i \in R_m} I(y_i = j).$$

The class of a node is

$$j(m) = \arg\max_j \hat{p}_{mj}.$$

This is not well-defined if there is no unique maximum. The most common way of dealing with this is to make a random (or arbitrary) choice.

In the case of only two classes with numeric labels, another way of assigning a class to a node is to take the average value of the class labels.

Let's identify these in the figure on the previous slide. (The numbering of the nodes is arbitrary, but it should be done systematically.)
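As a sketch of these definitions in base R: the class proportions at a node are a normalized table of the labels, and the class of the node is the label with the largest proportion, with ties broken at random. The function names `node_props` and `node_class` and the label vectors are hypothetical.

```r
# Class proportions p-hat_mj at a node, from the labels of its observations.
node_props <- function(y) table(y) / length(y)

# Class of the node: argmax over j of p-hat_mj, with ties broken at random.
node_class <- function(y) {
  p <- node_props(y)
  winners <- names(p)[as.vector(p) == max(p)]
  if (length(winners) > 1) winners <- sample(winners, 1)
  winners
}

y_node <- c(1, 1, 2, 1, 2)     # hypothetical labels at one node
p_hat  <- node_props(y_node)   # proportions for classes "1" and "2"
cls    <- node_class(y_node)   # unique maximum, so no tie-breaking is needed

y_tied    <- c(1, 1, 1, 2, 2, 2)
cls_tied  <- node_class(y_tied)   # either class may be returned
avg_label <- mean(y_tied)         # the average-label alternative for two numeric classes
```

Note that `node_class` returns the label as a character string, because it is taken from the names of the table.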

Impurity of Nodes in Classification Trees

Nodes are split based on their impurity. A pure node has only one class, and obviously would not be split. There are various measures of impurity.

Misclassification error:

$$\frac{1}{n_m} \sum_{i \in R_m} I(y_i \ne j(m)) = 1 - \hat{p}_{mj(m)}.$$

(Note that we also use $R_m$ to denote the set of indices of observations at node $m$.)

Gini index, where $k$ is the number of classes:

$$\sum_{j \ne j'} \hat{p}_{mj} \hat{p}_{mj'} = \sum_{j=1}^{k} \hat{p}_{mj} (1 - \hat{p}_{mj}).$$

Cross-entropy:

$$-\sum_{j:\, \hat{p}_{mj} > 0} \hat{p}_{mj} \log(\hat{p}_{mj}).$$
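The three measures are easy to compute from the labels at a node. The following is a minimal sketch; the function name `impurity` is hypothetical, natural logarithms are assumed for the cross-entropy, and the labels are assumed to be a plain vector (not a factor with unused levels), so that every class in the table has a positive proportion.

```r
# Impurity measures computed from the class labels at one node.
impurity <- function(y) {
  p <- as.vector(table(y)) / length(y)   # p-hat_mj for the classes present
  c(misclass = 1 - max(p),
    gini     = sum(p * (1 - p)),
    entropy  = -sum(p * log(p)))         # only classes with p > 0 appear in table(y)
}

pure  <- impurity(c(1, 1, 1, 1))   # a pure node
mixed <- impurity(c(1, 1, 1, 2))   # proportions 0.75 and 0.25
```

All three measures are zero for the pure node. For the mixed node the misclassification error is 0.25 and the Gini index is 0.375.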

Let's identify these in the figure on the earlier slide. Let's number the nodes so that

    2  bottom part of graph
    3  top part of graph
    4  bottom left part of graph
    5  bottom right part of graph

Notice that for node 5, which has 6 observations, half and half, the misclassification error requires us to choose a class for the node. There is no unique $\arg\max_j \hat{p}_{mj}$. The misclassification error is invariant to our choice, however.

Note that if we try to use some kind of average class value, the misclassification error would need to be defined differently.

For example, in node 5, which has 6 observations, half and half, we get

    Misclassification error: 0.5
    Gini index: 0.5
    Cross-entropy: log 2 (about 0.693 with natural logarithms)
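The arithmetic behind these values, written out for the two class proportions $\hat{p}_{51} = \hat{p}_{52} = 1/2$ at node 5. The natural logarithm is assumed for the cross-entropy; with base-2 logarithms the value would be 1.

```latex
\begin{align*}
\text{Misclassification error: } & 1 - \hat{p}_{5j(5)} = 1 - \tfrac{1}{2} = 0.5 \\
\text{Gini index: } & \sum_{j=1}^{2} \hat{p}_{5j}\,(1 - \hat{p}_{5j})
    = 2 \cdot \tfrac{1}{2} \cdot \tfrac{1}{2} = 0.5 \\
\text{Cross-entropy: } & -\sum_{j=1}^{2} \hat{p}_{5j} \log \hat{p}_{5j}
    = -2 \cdot \tfrac{1}{2} \log \tfrac{1}{2} = \log 2 \approx 0.693
\end{align*}
```

An even split between two classes is the most impure case, so each of these is the maximum of its measure for a two-class node.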

Classification and Regression Trees in R

There are some R packages that provide functions for classification trees and regression trees.

The R package tree was the first and probably still the most common one. The main function is tree. This function has several options. Other functions are tree.control, predict.tree, and prune.tree. The arguments of tree.control, such as minsize, can also be included in the invocation of tree.

Another R package is rpart. It is probably the best one. The main function is rpart. This function has several options.

The technical report by Therneau and Atkinson (2011, and subsequent dates) remains the best documentation for rpart.

Methods in rpart

The R function rpart allows different splitting criteria, which can be specified in the argument method.

For a regression tree, the obvious criterion is the reduction in the sum of squares. Since this is the idea in analysis of variance, this method is called "anova". This method is the default unless the response is a factor.

For a classification tree, one common criterion is the Gini index. Use of the Gini index is specified by the method called "class". This method is the default when the response is a factor.
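A minimal sketch of both methods in rpart, using data simulated in the same way as the examples in these notes. The data frame `dat` and its column names `y_num` and `y_cls` are hypothetical, and the default control settings of rpart are assumed.

```r
library(rpart)   # a recommended package, included with standard R distributions

set.seed(5)
n <- 20; n1 <- 12; n2 <- n - n1
dat <- data.frame(
  x1    = c(rnorm(n1), rnorm(n2) + 1.0),
  x2    = c(rnorm(n1), rnorm(n2) + 1.5),
  y_num = c(1 + 0.25 * rnorm(n1), 2 + 0.25 * rnorm(n2)),
  y_cls = factor(c(rep(1, n1), rep(2, n2)))
)

# Regression tree: method = "anova" (also the default for a numeric response).
fit_reg <- rpart(y_num ~ x1 + x2, data = dat, method = "anova")

# Classification tree: method = "class" (also the default for a factor response).
fit_cls <- rpart(y_cls ~ x1 + x2, data = dat, method = "class")

newdata  <- data.frame(x1 = c(1, -1), x2 = c(3, 0))
pred_reg <- predict(fit_reg, newdata)                   # node averages
pred_cls <- predict(fit_cls, newdata, type = "class")   # predicted classes
```

Printing `fit_reg` or `fit_cls` shows the fitted splits. With only 20 observations and the default control settings, rpart may produce a very small tree.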

Data Frames in R

Over the years, as data frames have been developed in R, I have been more aggravated by the non-intuitive aspects of the structure and by its limited uses in R functions than I have been pleased with its usefulness.

In the data frame of my example, even if we use y=factor(c(rep(1,n1),rep(2,n2))) or y=as.factor(c(rep(1,n1),rep(2,n2))), y is not of mode factor. The reason is that cbind first builds a numeric matrix, which reduces a factor to its integer codes.

However, yy=factor(example$y) yields a variable of mode factor. Of course, we could put this variable in the data frame.
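The behavior is easy to confirm. The sketch below builds the same kind of data frame in two ways; the names `labels`, `via_cbind`, and `direct` are hypothetical, and passing the columns straight to data.frame is an alternative not used in these notes.

```r
n1 <- 12; n2 <- 8
labels <- factor(c(rep(1, n1), rep(2, n2)))

# cbind() builds a numeric matrix, so the factor is reduced to its integer codes.
via_cbind <- data.frame(cbind(x = rnorm(n1 + n2), y = labels))
is.factor(via_cbind$y)   # FALSE

# Passing the columns directly to data.frame() keeps the factor.
direct <- data.frame(x = rnorm(n1 + n2), y = labels)
is.factor(direct$y)      # TRUE
```

Either construction works with the coercion-at-point-of-use style, since as.factor applied to a variable that is already a factor leaves it unchanged.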

Classification Trees in R

Rather than fool with the vagaries required to coerce the variable in the data frame to be of mode factor, I prefer to do the coercion at the point of usage. The advantage of this is that nothing is hidden in the code. Hiding properties in an object is great, unless you want to be sure of what is being done.

    library(tree)
    attach(exclass)
    classtree <- tree(as.factor(y)~x1+x2)

This produces

    node), split, n, deviance, yval, (yprob)
          * denotes terminal node

    1) root ( )
      2) x1 < ( ) *
      3) x1 > ( )
        6) x2 < ( ) *
        7) x2 > ( ) *

[The numeric values of the split points, n, deviance, yval, and yprob are not available.]

This is better than what we got by our crude eye method.

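Let's plot it:

    minx1 <- min(x1)
    maxx1 <- max(x1)
    minx2 <- min(x2)
    maxx2 <- max(x2)
    plot(x1,x2,col=y)
    # vertical line at the x1 split (split values are missing in the source)
    lines(c( , ),c(minx2,maxx2))
    # horizontal line at the x2 split, from the x1 split to the right edge
    lines(c( ,maxx1),c( , ))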

Example: Classification Trees

[Figure: scatterplot of x2 versus x1, colored by class, with the splits from the fitted classification tree]

Prediction in the Classification Tree Example

Now, let's classify some new observations using the classification tree that R computed.

    newdata <- data.frame(cbind(x1=c(1,-1),x2=c(3,0)))
    predict.tree(classtree, newdata)

This produces a matrix with columns [,1] and [,2] and one row for each new observation [values not available]. Each row is based on the proportion of the classes within the region that contains the new observation.
