Universität Potsdam Institut für Informatik Lehrstuhl Maschinelles Lernen. Evaluation of Models. Niels Landwehr

1 Universität Potsdam Institut für Informatik Lehrstuhl Maschinelles Lernen
Evaluation of Models
Niels Landwehr

2 Learning and Prediction
Classification and regression are learning problems. Input: training data $\{(x_1, y_1), \ldots, (x_m, y_m)\}$. Output: a model $f : X \to Y$. The model will be used to obtain predictions $f(x) = \,?$ for novel test instances $x$. We have seen several different types of models: linear models, decision trees.

3 Evaluation of Models
Question: after having implemented a learning algorithm, trained a model, etc.: how accurate are our predictions? What exactly do we mean by accurate? How do we calculate / measure / estimate accuracy? We care about the accuracy of predictions when applying the model to unseen, novel test data (not about accuracy on the observable training data). Evaluation of models: estimate the accuracy of the predictions of learned models.

4 Evaluation of models: Assumptions
In order to study the evaluation of models formally, we have to make assumptions about the properties of training and test data. Central assumption: all data are drawn from a fixed (unknown) distribution $p(x,y) = p(x)\,p(y \mid x)$, where $p(x)$ is the distribution over instances and $p(y \mid x)$ is the distribution over labels given an instance. Example spam filtering: $p(x)$ is the probability to see email $x$, and $p(y \mid x)$ is the probability to see label $y \in \{\text{Spam}, \text{Ok}\}$ for $x$.

5 Evaluation of models: Assumptions
i.i.d. assumption: examples are independent and identically distributed. Training instances are drawn independently from the distribution $p(x,y)$: $(x_i, y_i) \sim p(x,y)$. Test instances (seen when applying the model) are also drawn independently from this distribution: $(x, y) \sim p(x,y)$. Is this always realistic? In the following, we will always assume i.i.d. data.

6 Loss functions
We have made assumptions about the instances (x, y) that we will see. A new test instance (x, y) arrives and the model predicts f(x). A loss function $\ell(y, f(x))$ defines how good or bad the prediction is: it is the loss of the prediction f(x) for the instance (x, y). It is non-negative, $\ell(y, y') \ge 0$ for all $y, y'$, problem specific, and given a priori. Loss functions for classification: the zero-one loss $\ell(y, y') = 0$ if $y = y'$ and $1$ otherwise, or a class-dependent cost matrix. Loss function for regression: the squared loss $\ell(y, y') = (y - y')^2$.
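To make these definitions concrete, here is a minimal Python sketch of the two loss functions (the function names are illustrative, not part of the lecture):

```python
def zero_one_loss(y, y_pred):
    """Zero-one loss for classification: 0 if the prediction is correct, 1 otherwise."""
    return 0.0 if y == y_pred else 1.0

def squared_loss(y, y_pred):
    """Squared loss for regression: (y - y')^2."""
    return (y - y_pred) ** 2
```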

7 Evaluating models: Risk of a model
The central definition when evaluating models is the risk. The risk of a model is the expected loss for a novel test instance: for $(x, y) \sim p(x,y)$, a test instance $x$ with label $y$ (a random variable), the loss $\ell(y, f(x))$ for the test instance is also a random variable, and $R(f) = E[\ell(y, f(x))] = \int \ell(y, f(x))\, p(x,y)\, dx\, dy$. For zero-one loss the risk is also called the error rate; for squared loss it is also called the mean squared error. The main goal when evaluating models is to determine the risk of the model. The risk cannot be determined exactly, because p(x,y) is unknown: it is an estimation problem.

8 Evaluation of models: Risk estimate
Estimating the risk from data: if data sampled from p(x,y) is available, $T = \{(x_1, y_1), \ldots, (x_m, y_m)\}$ with $(x_i, y_i) \sim p(x,y)$, we can estimate the risk by the "empirical risk" $\hat R_T(f) = \frac{1}{m} \sum_{j=1}^{m} \ell(y_j, f(x_j))$. Important: which data T should we use? The training data (T = L)? Split the available data into L and T? Cross-validation?
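The empirical risk is just the average loss of a fixed model over a sample; a minimal sketch, assuming the loss functions above and a model given as a plain Python callable:

```python
import numpy as np

def empirical_risk(f, sample, loss):
    """Empirical risk: average loss of the model f over sample = [(x_1, y_1), ..., (x_m, y_m)]."""
    return float(np.mean([loss(y, f(x)) for x, y in sample]))
```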

9 Estimator as a random variable
The estimator $\hat R_T(f) = \frac{1}{m} \sum_{j=1}^{m} \ell(y_j, f(x_j))$ is a random variable: the instances in T have been drawn randomly, $(x_j, y_j) \sim p(x,y)$, so which $(x_j, y_j)$ were drawn is random. The value of the estimator depends on the randomly sampled instances, thus it is the result of a random process. The estimator has an expected value $E[\hat R_T(f)]$. The estimator is unbiased if and only if the expectation of the empirical risk equals the true risk.

10 Bias of estimator
The estimator $\hat R(f)$ is unbiased if and only if $E[\hat R(f)] = R(f)$. Otherwise, $\hat R(f)$ has a bias: $\mathrm{Bias} = E[\hat R(f)] - R(f)$. The estimator is optimistic if $E[\hat R(f)] < R(f)$. The estimator is pessimistic if $E[\hat R(f)] > R(f)$.

11 Variance of estimator
The estimator $\hat R(f)$ has a variance $\mathrm{Var}[\hat R(f)] = E[\hat R(f)^2] - E[\hat R(f)]^2$. The larger the sample T used for computing the estimate, the lower the resulting variance. Variance vs. bias: high variance means a large random component in the empirical risk estimate; a large bias means a systematic error in the empirical risk estimate. (Figure: distributions of the value of $\hat R(f)$ around the true risk $R(f)$ when bias dominates vs. when variance dominates.)

12 Risk estimate on training data
Which set T should we use? First try: the training data. Model $f_L$, trained on $L = \{(x_1, y_1), \ldots, (x_m, y_m)\}$. Empirical risk measured on the training data: $\hat R_L(f_L) = \frac{1}{m} \sum_{j=1}^{m} \ell(y_j, f_L(x_j))$, the risk estimated on L. Is this risk estimate an unbiased / optimistic / pessimistic estimator of the true risk $R(f_L)$?

13 Risk estimate on training data
The empirical risk on the training data is an optimistic estimator of the true risk. Consider the empirical risk of all possible models for a fixed training set L: due to random effects it holds for some models f that $\hat R_L(f) < R(f)$ and for other models f that $\hat R_L(f) > R(f)$. The learning algorithm chooses a model $f_L$ with small empirical risk $\hat R_L(f_L)$. It is therefore likely that $\hat R_L(f_L) < R(f_L)$ (an optimistic risk estimate).

14 Risk estimate on training data
The empirical risk of the model chosen by the learning algorithm, measured on the training data (the "training error"), is an optimistic estimator of the true risk: $E[\hat R_L(f_L)] < R(f_L)$. The problem is caused by the dependency of the chosen model on the data used for evaluation. Approach to fix the problem: use test data that are independent of the training data.

15 Holdout-Testing
Idea: estimate the risk on independent test data. Given: data $D = \{(x_1, y_1), \ldots, (x_d, y_d)\}$. Split the data into training data $L = \{(x_1, y_1), \ldots, (x_m, y_m)\}$ and test data $T = \{(x_{m+1}, y_{m+1}), \ldots, (x_d, y_d)\}$.

16 Holdout-Testing
Run the learning algorithm on the training data L; this yields the model $f_L$. Compute the empirical risk $\hat R_T(f_L)$ on the test data T. Run the learning algorithm on all data D; this yields the model $f_D$. Output: the model $f_D$, using $\hat R_T(f_L)$ as the estimator for the true risk of the model $f_D$.
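A minimal sketch of this holdout procedure, assuming a `learn` function that maps a list of (x, y) pairs to a model callable and a `loss` function as above (all names are illustrative):

```python
import numpy as np

def holdout_evaluation(learn, data, loss, test_fraction=0.3, seed=0):
    """Holdout testing: split data into training data L and test data T, learn f_L on L,
    estimate its risk on T, and return the final model f_D trained on all data."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(data))
    n_test = int(test_fraction * len(data))
    T = [data[i] for i in idx[:n_test]]   # test data
    L = [data[i] for i in idx[n_test:]]   # training data
    f_L = learn(L)                        # model used only for the risk estimate
    risk_estimate = float(np.mean([loss(y, f_L(x)) for x, y in T]))
    f_D = learn(data)                     # final model, trained on all data
    return f_D, risk_estimate
```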

17 Holdout-Testing: Analysis
Is the estimator $\hat R_T(f_L)$ for the risk of the model $f_D$ unbiased, optimistic, or pessimistic?

18 Holdout-Testing: Analysis
The estimator $\hat R_T(f_L)$ is pessimistic for $R(f_D)$: $\hat R_T(f_L)$ is unbiased for $R(f_L)$, but $f_L$ was learned on fewer training examples than $f_D$ and therefore has a higher risk (in expectation). Still, the estimator $\hat R_T(f_L)$ is useful in practice, while the estimator $\hat R_L(f_L)$ is usually wildly optimistic (often close to 0). Why do we train and return the model $f_D$? We return the final model $f_D$ rather than $f_L$ because $f_D$ has a lower risk (in expectation) and is therefore better.

19 Holdout-Testing: Analysis
What are the advantages and disadvantages of choosing the test set T large or small? T should be large to ensure that the risk estimate $\hat R_T(f_L)$ has low variance. T should be small to ensure that $\hat R_T(f_L)$ has low bias, that is, is not too pessimistic. We need a lot of data in order to obtain good estimates. In practice, holdout testing is only used when data is plentiful. Cross-validation (next slide) usually gives better results.

20 Cross-Validation
Given: data $D = \{(x_1, y_1), \ldots, (x_d, y_d)\}$. Split D into n equally sized blocks $D_1, \ldots, D_n$ with $\bigcup_{i=1}^{n} D_i = D$ and $D_i \cap D_j = \emptyset$ for $i \ne j$. Repeat for $i = 1, \ldots, n$: learn the model $f_i$ on $L_i = D \setminus D_i$, and compute the empirical risk $\hat R_{D_i}(f_i)$ on $D_i$. (Figure: the training examples split into blocks $D_1, D_2, \ldots, D_n$.)

21 Cross-Validation
Average the empirical risk estimates from the different test sets $D_i$: $R = \frac{1}{n} \sum_{i=1}^{n} \hat R_{D_i}(f_i)$. Learn the model $f_D$ on the complete data set D. Return the model $f_D$ and the estimator $R$.
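A minimal sketch of n-fold cross-validation under the same assumptions as the holdout sketch above (a `learn` function and a `loss` function; names are illustrative):

```python
import numpy as np

def cross_validation(learn, data, loss, n_folds=5):
    """n-fold cross-validation: average the test errors over the blocks D_1, ..., D_n,
    then train and return the final model f_D on the complete data set."""
    blocks = np.array_split(np.arange(len(data)), n_folds)
    fold_risks = []
    for i in range(n_folds):
        D_i = [data[j] for j in blocks[i]]                                # held-out test block
        L_i = [data[j] for k in range(n_folds) if k != i for j in blocks[k]]
        f_i = learn(L_i)                                                  # model trained without block i
        fold_risks.append(np.mean([loss(y, f_i(x)) for x, y in D_i]))
    f_D = learn(data)                                                     # final model on all data
    return f_D, float(np.mean(fold_risks))
```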

22 Cross-Validation: Analysis
Is the estimator optimistic / pessimistic / unbiased?

23 Cross-Validation: Analysis
Is the estimator optimistic / pessimistic / unbiased? The estimator is pessimistic: the models $f_i$ are trained on a fraction $(n-1)/n$ of the overall data, while the model $f_D$ will be trained on all data. (Figure: training set sizes in cross-validation vs. holdout testing.)

24 Cross-Validation: Analysis
Bias and variance compared to holdout testing? The variance is lower than for holdout testing: we average over several holdout experiments, which reduces variance, and the estimator is based on all data, because every instance appears as a test instance in some block. The bias is similar to holdout testing; it depends on the number of blocks.

25 Example: regularized polynomial regression
Polynomial model: $f_w(x) = \sum_{i=0}^{M} w_i x^i$. Learn the model by minimizing the regularized loss $w^* = \arg\min_w \sum_{i=1}^{m} (f_w(x_i) - y_i)^2 + \lambda \sum_i w_i^2$, given training data $\{(x_1, y_1), \ldots, (x_m, y_m)\}$. (Figure: training data, learned model, and true model for $\ln \lambda = -18$.)
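A minimal sketch of this example, using the closed-form ridge solution for the regularized least-squares problem (the degree M and the value of λ are illustrative, not the lecture's settings):

```python
import numpy as np

def fit_polynomial_ridge(x, y, M=9, lam=1e-3):
    """Fit f_w(x) = sum_{i=0}^{M} w_i x^i by minimizing
    sum_j (f_w(x_j) - y_j)^2 + lam * sum_i w_i^2 (closed-form ridge solution)."""
    Phi = np.vander(np.asarray(x), M + 1, increasing=True)        # columns x^0, ..., x^M
    w = np.linalg.solve(Phi.T @ Phi + lam * np.eye(M + 1), Phi.T @ np.asarray(y))
    return lambda x_new: np.vander(np.atleast_1d(x_new), M + 1, increasing=True) @ w
```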

26 Tune regularization parameter
We have to determine a good regularization parameter $\lambda$. The regularization parameter controls the complexity of the model. (Figure: learned models for different amounts of regularization, e.g. $\ln \lambda = -18$ and $\ln \lambda = 0$.)

27 Tune regularization parameter
Perform cross-validation for different parameters $\lambda$ and save the corresponding risk estimates. When training the final model on all of the data, use the parameter $\lambda^*$ that resulted in the smallest risk estimate. The training error is minimal for the unregularized model, but the test error is better for moderate regularization. (Figure: training and test error as functions of the regularization parameter, with the optimum $\lambda^*$ marked.)

28 Tune regularization parameter
Algorithm: learn a model with the optimal regularization parameter.
Function trainModelOptimalLambda(D):
  For $\lambda \in \{2^{-k}, 2^{-k+1}, \ldots, 2^{k-1}, 2^{k}\}$: determine error($\lambda$) = crossValidation($\lambda$, D), the cross-validation risk estimate for a model with parameter $\lambda$ on data D.
  Set $\lambda^* = \arg\min_{\lambda}$ error($\lambda$).
  Learn $f_{\lambda^*}$ = trainModel($\lambda^*$, D), i.e. learn the model with parameter $\lambda^*$ on data D.
  Output: model $f_{\lambda^*}$.
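A minimal sketch of this tuning algorithm, assuming a `learn_with_lambda(lam, data)` function that trains a model for a given λ and a `cv_risk_estimate(lam, data)` helper that returns its cross-validation risk estimate (e.g. built from the cross-validation sketch above); all names are illustrative:

```python
def train_model_optimal_lambda(data, learn_with_lambda, cv_risk_estimate, k=10):
    """Try lambda on a grid {2^-k, ..., 2^k}, keep the value with the smallest
    cross-validation risk estimate, and retrain the final model on all of the data."""
    candidates = [2.0 ** e for e in range(-k, k + 1)]
    errors = {lam: cv_risk_estimate(lam, data) for lam in candidates}
    lam_star = min(errors, key=errors.get)
    return learn_with_lambda(lam_star, data), lam_star
```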

29 Estimating error of model with tuned regularization parameter
How do we estimate the error of the model with the tuned regularization parameter $\lambda^*$? Warning: we cannot simply use the error estimate error($\lambda^*$)! The parameter $\lambda^*$ was chosen such that error($\lambda^*$) is as small as possible; the error estimate error($\lambda^*$) is therefore optimistic. Compare with the earlier argument: the training error is optimistic, because the model parameters have been chosen based on the training data. Instead, what we need is a nested cross-validation (see next slide).

30 Nested Cross-Validation
Algorithm: risk estimate with tuned regularization parameter.
Function trainAndEvaluateOptimalLambda(D):
  Split D into n equally sized blocks $D_1, \ldots, D_n$ with $\bigcup_{i=1}^{n} D_i = D$ and $D_i \cap D_j = \emptyset$.
  For $i \in \{1, \ldots, n\}$: learn $f_i^*$ = trainModelOptimalLambda($D \setminus D_i$), and determine the empirical risk $\hat R_{D_i}(f_i^*)$ on the data $D_i$.
  Average the different empirical risk estimates: $R = \frac{1}{n} \sum_{i=1}^{n} \hat R_{D_i}(f_i^*)$.
  Learn $f^*$ = trainModelOptimalLambda(D).
  Output: model $f^*$ and risk estimate $R$.
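A minimal sketch of the nested procedure, assuming a `tune_and_learn(data)` function that performs the inner cross-validation and returns a tuned model together with its λ (e.g. `train_model_optimal_lambda` above with its other arguments bound); names are illustrative:

```python
import numpy as np

def train_and_evaluate_optimal_lambda(data, tune_and_learn, loss, n_folds=5):
    """Nested cross-validation: the outer loop holds out each block D_i, tunes lambda and
    trains a model on the remaining data via tune_and_learn (which runs its own inner
    cross-validation), and averages the outer test errors into the risk estimate."""
    blocks = np.array_split(np.arange(len(data)), n_folds)
    outer_risks = []
    for i in range(n_folds):
        D_i = [data[j] for j in blocks[i]]
        rest = [data[j] for k in range(n_folds) if k != i for j in blocks[k]]
        f_i, _ = tune_and_learn(rest)        # inner CV and tuning happen inside tune_and_learn
        outer_risks.append(np.mean([loss(y, f_i(x)) for x, y in D_i]))
    f_star, _ = tune_and_learn(data)         # final model with tuned lambda on all data
    return f_star, float(np.mean(outer_risks))
```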

31 Evaluation: Summary
We studied the problem of risk estimation: the expected loss on novel test data. The training error is optimistic and cannot be used as a risk estimate. Appropriate approaches are holdout testing and cross-validation. Cross-validation is also used to tune hyperparameters such as the regularization parameter. The error estimate for a model with tuned hyperparameters requires nested cross-validation.
