Oracle inequalities for computationally budgeted model selection

Alekh Agarwal, John C. Duchi (University of California, Berkeley; {alekh,jduchi}@cs.berkeley.edu), Peter L. Bartlett (QUT and UC Berkeley; bartlett@cs.berkeley.edu), Clement Levrard (École Normale Supérieure; clement@ens.fr)

Abstract

We analyze general model selection procedures using penalized empirical loss minimization under computational constraints. While classical model selection approaches do not consider computational aspects of performing model selection, we argue that any practical model selection procedure must not only trade off estimation and approximation error, but also the effects of the computational effort required to compute empirical minimizers for different function classes. We provide a framework for analyzing such problems, and we give algorithms for model selection under a computational budget. These algorithms satisfy oracle inequalities that show that the risk of the selected model is not much worse than if we had devoted all of our computational budget to the best function class.

1 Introduction

In the standard statistical prediction setting, one receives samples {z_1, ..., z_n} ⊆ Z drawn i.i.d. from some unknown distribution P over a sample space Z, and given a loss function l, seeks a function f to minimize the risk

    R(f) := E[l(z, f)].    (1)

Since R(f) is unknown, the typical approach is to approximately minimize the empirical risk, R̂_n(f) := (1/n) Σ_{i=1}^n l(z_i, f), over a function class F. We seek a function f_n with a risk close to the Bayes risk, the minimal risk over all measurable functions, which is R_0 := inf_f R(f). There is a natural tradeoff based on the class F one chooses, since

    R(f_n) − R_0 = [R(f_n) − inf_{f∈F} R(f)] + [inf_{f∈F} R(f) − R_0],

which decomposes the excess risk of f_n into estimation error (left) and approximation error (right). A common approach to addressing this tradeoff is to express F as a union of classes F_1, ..., F_K. The model selection problem is to choose a class F_i and a function f ∈ F_i to give the best tradeoff between estimation error and approximation error.
A common approach to the model selection problem is the now classical idea of complexity regularization, which arose out of early works by Mallows (1973) and Akaike (1974). The complexity regularization approach balances two competing objectives: the minimum empirical risk of a model class F_i (approximation error) and a complexity penalty to control estimation error for the class. Different choices of the complexity penalty give rise to different model selection criteria and algorithms (see e.g. Massart, 2003, and the references therein). Results of several authors (e.g. Bartlett et al., 2002; Lugosi and Wegkamp, 2004; Massart, 2003) show that given a dataset of size n, the output f̂_n of the procedure roughly satisfies

    E[R(f̂_n)] − R_0 ≤ min_i { inf_{f∈F_i} R(f) − R_0 + γ_i(n) } + O_p(√(1/n)),    (2)

where γ_i(n) is a complexity penalty for class i, which is usually decreasing to zero in n and increasing in i. Several approaches to complexity regularization are possible, and an incomplete bibliography[1]

[1] In general, the number of classes K can be infinite, though we restrict attention to finitely many classes for this paper.
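In its computationally unconstrained form, the penalized selection behind criterion (2) is a one-line minimization. The sketch below is illustrative only: the minimum empirical risks are invented numbers, and a VC-style penalty √(d_i/n) stands in for a generic γ_i(n).

```python
import math

def penalized_selection(min_emp_risk, dims, n):
    """Pick the class minimizing empirical risk + complexity penalty, as in (2).

    min_emp_risk[i] stands in for inf_{f in F_i} of the empirical risk; the
    VC-style penalty sqrt(d_i / n) stands in for gamma_i(n).  All values are
    invented for illustration.
    """
    def objective(i):
        return min_emp_risk[i] + math.sqrt(dims[i] / n)
    return min(range(len(dims)), key=objective)

risks = [0.30, 0.22, 0.20]  # richer classes fit the data better...
dims = [2, 20, 200]         # ...but carry larger VC dimension d_i
print(penalized_selection(risks, dims, n=1000))    # → 0: penalty dominates
print(penalized_selection(risks, dims, n=100000))  # → 1: more data favors a richer class
```

The toy run shows the intended behavior of the criterion: with little data the penalty term dominates and a simple class is selected, while with more data a richer class becomes worthwhile.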

includes Vapnik and Chervonenkis, 1974; Geman and Hwang, 1982; Rissanen, 1983; Barron, 1991; Bartlett et al., 2002; Lugosi and Wegkamp, 2004. These oracle inequalities show that, for a given sample size, the model selection procedure gives the best trade-off between the approximation and estimation errors.

A drawback with the above-mentioned approaches is that we need to be able to optimize over each model in the hierarchy on the entire data in order to prove guarantees on the result of the model selection procedure. This is natural when the sample size is the key limitation, and it is computationally feasible when the sample size is small and the samples are low-dimensional. However, the cost of training K different model classes on the entire data sequence can be prohibitive when the datasets become large and high-dimensional, as is common in modern settings. In these cases, it is computational resources, rather than the sample size, that are the key constraint. In this paper, we consider model selection from this computational perspective, viewing the amount of computation, rather than the sample size, as the parameter which will enter our oracle inequalities. Specifically, we consider model selection methods that work within a given computational budget. An interesting and difficult aspect of the problem that we must address is the interaction between model class complexity and computation time. It is natural to assume that for a fixed sample size, it is more expensive to estimate a model from a complex class than from a simple class. Put inversely, given a computational bound, a simple model class can fit a model to a much larger sample size than a rich model class can. So any strategy for model selection under a computational budget constraint should trade off two criteria: the relative training cost of different model classes, which allows simpler classes to receive far more data (thus making them resilient to overfitting), and the lower approximation error in the more complex model classes. In addressing these computational and statistical issues, this paper makes three main contributions.

First, we propose a novel computational perspective on the model selection problem, which we believe should be a natural consideration in statistical learning problems. Secondly, within this framework, we provide an algorithm (exploiting algorithms for multi-armed bandit problems) that uses confidence bounds based on concentration inequalities to select a good model under a given computational budget. We also prove a minimax optimal oracle inequality on the performance of the selected model. Our third main contribution is another algorithm, based on a coarse-grid search, for model hierarchies that are structured by inclusion, that is, F_1 ⊆ F_2 ⊆ ... ⊆ F_K. Under natural assumptions regarding the growth of the complexity penalties as we go to more complex classes, the coarse-grid search procedure satisfies better oracle inequalities than the earlier bandit algorithm. Both of our algorithms are computationally simple and efficient.

The remainder of this paper is organized as follows. In the next section, we formalize our setting and present the algorithms. Section 3 presents our main results as well as some consequences for specific problems and examples. We provide proofs in Sections 4 through 5; Section 4 contains the proof of the result for unstructured model selection problems, while Sec. 5 contains the proofs of oracle inequalities for model selection problems with nested classes F_i.

2 Setup and algorithms

In this section, we will describe our statistical and computational assumptions about the problem, giving examples of classes of problems and statistical procedures that satisfy the assumptions. We will follow this with descriptions of our algorithms, including intuitive explanations of the procedures.

2.1 Setup and Goals

Recall from the introduction that we have a collection of K model classes F_1, ..., F_K. Let us begin by describing our computational assumptions. First, we assume as our basic unit of measure a computational quantum; within this quantum, a model can be trained on any single class F_i using n_i samples. That is, we associate with each class F_i a number of samples n_i ∈ N, where n_i is chosen so that training a model from class F_i on n_i examples requires the same amount of time as training a model from class F_j on n_j samples. We assume an overall time budget of T quanta, so that if we devote the entire computational budget to class i, we could use T n_i samples to train a model.[2] Our high-level goal is to derive algorithms that perform nearly as well as if an oracle gave us the best model class i* in advance and we could devote the entire computational budget T to class i*. For the statistical assumptions in our problem, we take an approach similar to that of Bartlett et al. (2002), restricting our attention to complexity penalties based on concentration inequalities.

[2] The linearity assumption is essentially no loss of generality. In addition, several algorithms satisfy it. We can work with general non-linear scalings too, at the cost of significant notational burden, which we choose to avoid here.
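The sample-per-quantum accounting just described is simple arithmetic; the sketch below makes it concrete (the per-sample training costs, quantum size, and budget are all invented for illustration):

```python
# Illustrative accounting for the computational quantum of Section 2.1. If
# training class i costs cost[i] time units per sample, a quantum of size Q
# lets class i process n_i = Q // cost[i] samples, and a budget of T quanta
# allows T * n_i samples in total. All numbers are invented.
def samples_per_quantum(cost, quantum):
    """n_i such that training class i on n_i samples fills one quantum."""
    return [quantum // c for c in cost]

cost = [1, 4, 16]                        # richer classes cost more per sample
n = samples_per_quantum(cost, quantum=64)
T = 500                                  # total budget, in quanta
print(n, [T * ni for ni in n])           # → [64, 16, 4] [32000, 8000, 2000]
```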

Each of our model selection procedures uses a black-box algorithm A_i for fitting functions from the model class F_i to the data. We require that these algorithms be statistically well-behaved, in the sense that the empirical risk of A_i's output model f is near the true risk of f. Recalling the definitions of R, R̂_n from the introduction, and defining [K] = {1, ..., K}, we state our main concentration assumption:

Assumption A. Let A(i, n) ∈ F_i denote the output of algorithm A_i on a sample of n data points.

(a) For each i ∈ [K], there is a function γ_i and constants κ_1, κ_2 > 0 such that for any n ∈ N,

    P( |R̂_n(A(i, n)) − R(A(i, n))| > γ_i(n) + κ_2 ε ) ≤ κ_1 exp(−4nε²).    (3)

(b) The output A(i, n) is a γ_i(n)-minimizer of R̂_n, that is,

    R̂_n(A(i, n)) ≤ inf_{f∈F_i} R̂_n(f) + γ_i(n).

(c) The function γ_i(n) ≤ c_i n^{−α_i} for some α_i > 0.

(d) For any fixed function f ∈ F_i, P( |R̂_n(f) − R(f)| > κ_2 ε ) ≤ κ_1 exp(−4nε²).

There are many classes of functions and corresponding algorithms that satisfy Assumption A. For one simple example, let {F_i} be VC-classes of functions, where each F_i has VC-dimension d_i, and let l be the hinge loss, where l(z, f) = [1 − y f(x)]_+. Assuming that l(z, f) ≤ B for all f ∈ F_i, Dudley's entropy integral in this case gives (Dudley, 1978)

    γ_i(n) = O(√(d_i / n)) and κ_1 ≤ 2, κ_2 = O(B).    (4)

Similar results hold for other convex losses and problems, for example regression and density estimation problems with squared or log losses. For function classes of bounded complexity, such as VC, Sobolev, or Besov classes, penalty functions γ_i(n) can be computed that satisfy Assumption A using many techniques; some relevant approaches include Rademacher and Gaussian complexities of the function classes F_i, metric entropy, Dudley's entropy integral, or localization techniques (e.g. Pollard, 1984; Bartlett and Mendelson, 2002; Dudley, 1967). In many concrete cases, such as parametric models or VC classes, Assumption A(c) is satisfied with α_i = 1/2. Our approach, similar to the idea of complexity regularization, is to perform a kind of penalized model selection.

If we knew the true risk functional R, we could minimize a combination of the risk and a complexity penalty based on the number of samples our computational budget allows for the class. In particular, given penalty functions γ_i, we define the best class in hindsight as

    i* := argmin_{i∈[K]} { inf_{f∈F_i} R(f) + γ_i(T n_i) }.    (5)

The idea is that an algorithm performing model selection while taking into account its computational limitations should choose the best class considering the total number of samples it could possibly have seen for the class. We note that this is also closely related to the criterion (2) minimized in the absence of a computational budget, but in the classical case it is assumed that each function class can be evaluated on an identical and fixed number of samples n.

2.2 Upper-confidence bound algorithm without structure

We now turn to outlining the first of the two main scenarios analyzed in this paper. For now, we do not assume any structure relating the collection of model classes F_1, ..., F_K. The main idea of our algorithm in this case is to incrementally allocate our computational quota amongst the function classes, where we trade off receiving samples for classes that have good risk performance against exploring classes for which we have received few data points. We view the budgeted model selection problem as a repeated game with T rounds. At iteration t, the procedure allocates one additional quantum of computation to a (to be specified) function class. We assume that the computational complexity of fitting a model grows linearly and incrementally with the number of samples, which means that allocating an additional quantum of training time allows the black-box training algorithm A_i to process an additional n_i samples for class F_i. The linear growth assumption is satisfied, for instance, when the loss function l is convex and the black-box learning algorithm A_i is a stochastic or online convex optimization procedure (e.g. Cesa-Bianchi and Lugosi, 2006; Nemirovski et al., 2009).
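To make the incremental-allocation idea concrete before its formal statement, here is a minimal noiseless simulation in the spirit of Algorithm 1 and its optimistic criterion (6): the black-box trainer is idealized (its empirical risk is just the true risk), and the risks, VC dimensions, and per-quantum sample counts are all invented.

```python
import math
from collections import Counter

def budgeted_bandit(true_risk, dims, n_per_quantum, T):
    """Noiseless sketch in the spirit of Algorithm 1 / criterion (6).

    The black-box trainer is idealized: the empirical risk of A(i, n) is
    taken to be true_risk[i].  The VC-style penalty gamma_i(n) = sqrt(d_i/n)
    and every number below are invented for illustration.
    """
    K = len(true_risk)
    pulls = Counter({i: 1 for i in range(K)})  # one initial quantum per class

    def gamma(i, n):
        return math.sqrt(dims[i] / n)

    def index(i):
        n = pulls[i] * n_per_quantum[i]        # samples class i has seen
        # Optimistic criterion: penalized risk minus exploration bonuses.
        return (true_risk[i] - gamma(i, n) - math.sqrt(math.log(K) / n)
                + gamma(i, T * n_per_quantum[i]))

    for _ in range(K, T):                      # spend the remaining quanta
        pulls[min(range(K), key=index)] += 1
    return pulls.most_common(1)[0][0]          # most frequently chosen class

# Simple classes process more samples per quantum; the middle class, with
# moderate risk and moderate penalty, ends up most frequently selected.
print(budgeted_bandit(true_risk=[0.50, 0.30, 0.50],
                      dims=[1, 4, 16], n_per_quantum=[16, 4, 1], T=1000))  # → 1
```

The exploration bonuses make even the costly rich class attractive while its sample count is small, but once its optimistic index rises above the penalized risk of the middle class, the remaining budget flows to the latter.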

Algorithm 1: Multi-armed bandit algorithm for selection of the best class î.
    For each i ∈ [K]: query n_i examples from class F_i.
    For t = K + 1 to T:
        Set n_i(t) to be the number of examples seen for class i at time t.
        Let i_t = argmin_{i∈[K]} R̂(i, n_i(t)).
        Query n_{i_t} examples for class i_t.
    Output î, the index of the most frequently selected class.

Using our previously defined notation, we now define the criterion we use in our procedure to select the class to which we allocate a quantum. The optimistic selection criterion for class i, assuming that F_i has seen n samples at this point in the game, is

    R̂(i, n) = R̂_n(A(i, n)) − γ_i(n) − √(log K / n) + γ_i(T n_i).    (6)

The intuition behind the definition of R̂(i, n) is that we would like the algorithm to choose functions f and classes i that minimize R̂_n(f) + γ_i(T n_i) ≈ R(f) + γ_i(T n_i), but the negative γ_i(n) and √(log K / n) terms lower the criterion significantly when n is small and thus encourage initial exploration. The criterion (6) essentially combines the penalized model-selection objective used by Bartlett et al. (2002) (though we use a log K term, as we assume a finite number of classes) with an optimistic criterion similar to those used in multi-armed bandit algorithms (Auer et al., 2002). Algorithm 1 contains our bandit procedure for model selection. We run Alg. 1 for T rounds, where T is such that the entire computational budget is exhausted. Our results in Section 3.1 show that Alg. 1 satisfies our twofold goals of selecting the class i* with high probability and of outputting a function f with good risk performance.

2.3 Coarse-grid search algorithm for inclusion hierarchy

In practice, the classes F_i are rarely completely unrelated; perhaps the most common scenario in model selection is structural risk minimization, where the model classes F_i are subsets, ordered in increasing complexity, of a larger model space. To that end, our second main scenario involves studying computationally constrained model selection procedures under the following assumption.

Assumption B. The function classes F_i satisfy an inclusion hierarchy:

    F_1 ⊆ F_2 ⊆ ... ⊆ F_K.    (7)

One simple example satisfying the above assumption is classes of functions of the form x ↦ f(x) = ⟨θ, x⟩, where each function class F_i is identified with an increasing bound on ‖θ‖. A second simple family of examples consists of scenarios in which f ∈ F_i is of the form x ↦ f(x) = ⟨θ, φ_i(x)⟩, where φ_i is a feature mapping of the input data Z and φ_i is a projection of φ_{i+1}. For example, functions in class i + 1 observe more features than those in class i, or the different classes F_i may consist of an increasing sequence of wavelet bases. Intuitively, we expect the structure assumed above to help our model selection procedure because the minimum expected risks of different function classes are no longer independent of each other. It is easy to see that under our assumption,

    R_i ≥ R_j for i ≤ j,    (8)

where R_i denotes inf_{f∈F_i} R(f). Clearly, under Assumption B, the penalties can always be chosen to be increasing as a function of the class complexity:

    γ_i(n) ≤ γ_j(n) for i ≤ j.    (9)

Since our approach involves giving a different number of samples to each class, we require a slightly stronger ordering than the above equation. We assume that for any budget T, we have

    γ_i(T n_i) ≤ γ_j(T n_j) for i ≤ j.    (10)

This assumption is reasonable since we expect that γ_i(n) is a decreasing function of n and n_i ≥ n_j for i ≤ j, so that γ_i(T n_i) ≤ γ_i(T n_j) ≤ γ_j(T n_j). We now show a simple grid-search based algorithm that gives oracle inequalities depending only logarithmically on the number of classes for this inclusion hierarchy, under natural conditions on the

growth of the complexity penalties as a function of the class index. The method takes inspiration from the naïve strategy that splits the budget T uniformly across the K classes and finds the class with the smallest penalized empirical risk, using T n_i / K samples for class i. Of course, the naïve approach has the drawback that the computational budget available to each class is reduced by a factor of K, which yields very poor scaling with the number K of classes. The key observation we exploit is that under the nesting structure (7), we do not need to find the smallest regularized empirical risk for each class. We can instead pick a small subset S of classes and perform model selection only over the classes in S, then use the inclusion assumption B to reason about the classes not in S, for appropriate choices of S. With this intuition, we now define a good choice for S:

Definition 1 (Coarse grid). For a set S ⊆ [K], we say that S satisfies the coarse grid conditions with parameters s ∈ N and λ > 0 if |S| = s and for each i ∈ [K] there is an index j ∈ S with j ≥ i such that

    γ_i(T n_i / s) ≤ γ_j(T n_j / s) ≤ (1 + λ) γ_i(T n_i / s).    (11)

We define s* to be the size of the smallest set S satisfying condition (11), noting that s* ≤ K. In general, for a given λ there may be no small set S satisfying Definition 1; however, we are interested in settings where a set S of size s* = O(log K) exists.

Example 1. Let {F_i} be an increasing collection of VC-classes, say f ∈ F_i is of the form x ↦ f(x) = ⟨θ, φ_i(x)⟩, where φ_i is a d_i-dimensional mapping and F_i has VC-dimension d_i. In this case, recalling the VC-bound (4), we know that up to constant factors γ_i(n) = √(d_i / n). Making the reasonable assumption that training time is linearly dependent on the VC dimension, we have n_i = n_K d_K / d_i for i ∈ [K], so

    γ_i(T n_i) = √( d_i / (T n_K d_K / d_i) ) = √( d_i² / (T n_K d_K) ) = d_i / √(T n_K d_K).

Example 1 is suggestive of a pattern common to many hierarchies of function classes (including parametric and VC-classes with indexing VC-dimension) where the penalty functions interact with the sample sizes n_i so that γ_i splits naturally into a product γ_i(T n_i) = g(T) h(i) for some functions g and h (which may depend on K). For such cases, the condition (11) reduces to ensuring

    g(T/s) h(i) ≤ g(T/s) h(j) ≤ (1 + λ) g(T/s) h(i),

which amounts to showing h(i) ≤ h(j) ≤ (1 + λ) h(i), independent of the setting of s; since h is non-decreasing, we need only show the latter inequality. Let S = {j_1, ..., j_{s*}}. We construct S by setting j_{s*} = K and recursively defining j_i to be the smallest index j < j_{i+1} such that h(j_{i+1}) ≤ (1 + λ) h(j). Then the number of classes s* can be bounded by using the relation

    h(K) = h(j_{s*}) ≤ (1 + λ) h(j_{s*−1}) ≤ ... ≤ (1 + λ)^{s*−1} h(1),

so that so long as s ≥ log(h(K)/h(1)) / log(1 + λ), we can choose a set S satisfying condition (11) with |S| = s. In particular, s* is logarithmic in K as long as the function h grows sub-exponentially. Other natural examples of function classes satisfying such growth conditions include Besov or Sobolev function classes, nested by degree or smoothness, as well as wavelet bases. We refer the reader to the work of Barron et al. (1999) for a compendium of results where γ_i(T n_i) = g(T) h(i).

Given the above, our algorithm has a simple description. We fix a desired accuracy λ and find the smallest set S satisfying Definition 1. We then pick the class î satisfying

    î ∈ argmin_{i∈S} { R̂_{T n_i/s*}(A(i, T n_i / s*)) + γ_i(T n_i / s*) },    (12)

where |S| = s*. We observe that the penalty functions are typically known in closed form (with the exception of data-dependent complexity penalties), and hence computation of the set S can be efficient and is generally much cheaper than training the models. In Section 3.2, we give an oracle inequality on the performance of the estimate î from the procedure (12) that has only mild dependence on the number of classes so long as s* does not grow too fast with K.
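Under the separable form γ_i(T n_i) = g(T) h(i), the recursive construction of S described above is easy to implement. The sketch below is illustrative: h(i) = √(d_i) with d_i = i is an invented hierarchy, and the fall-back to the adjacent class when no smaller index qualifies is an implementation choice, not part of the paper's construction.

```python
import math

def coarse_grid(h, lam):
    """Build a coarse grid S as in Definition 1 for a separable penalty
    gamma_i(T n_i) = g(T) h(i) with h non-decreasing (h[i-1] = h(i)).

    Walk down from class K, each time jumping to the smallest lower index
    whose h-value is within a (1 + lam) factor, as in Section 2.3.
    """
    K = len(h)
    S, c = [K], K
    while c > 1:
        js = [j for j in range(1, c) if h[c - 1] <= (1 + lam) * h[j - 1]]
        c = min(js) if js else c - 1    # fall back to the adjacent class
        S.append(c)
    return sorted(S)

# Invented hierarchy: h(i) = sqrt(d_i) with d_i = i, over K = 1000 classes.
h = [math.sqrt(i) for i in range(1, 1001)]
S = coarse_grid(h, lam=0.6)
print(len(S), S[:4])  # → 9 [1, 2, 4, 10]
```

The grid size grows like log(h(K)/h(1)) / log(1 + λ): here 9 grid points cover all 1000 classes, and every class i has a grid point j ≥ i with h(j) ≤ 1.6 h(i).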

3 Main results and their consequences

In this section, we come to the description of the performance guarantees for Algorithms 1 and (12). To build intuition, we also specialize the theorems to specific statistical problems and model classes.

3.1 Oracle inequalities for unstructured model classes

In this section we give performance guarantees on the class picked by Algorithm 1. We define the excess penalized risk

    Δ_i := R_i + γ_i(T n_i) − R_{i*} − γ_{i*}(T n_{i*}) ≥ 0.    (13)

Essentially without loss of generality, we assume that the infimum in the equation R_i = inf_{f∈F_i} R(f) is attained by a function f_i ∈ F_i. If the infimum is not attained, we simply choose some fixed f_i such that R(f_i) ≤ inf_{f∈F_i} R(f) + δ for an arbitrarily small δ > 0. We first perform the analysis under the assumption that Δ_i > 0 strictly for i ≠ i*, but we will then relax this to allow non-unique i*. The gains of a computationally adaptive strategy over naïve strategies are best seen when the gap (13) is non-zero. Under this assumption, we can follow the ideas of Auer et al. (2002) and show that the fraction of the computational budget allocated to any suboptimal class goes quickly to zero as T grows. We provide the proof of the following theorem in Section 4.

Theorem 1. Let Alg. 1 be run for T rounds and T_i(T) be the number of times class i is queried. Let Δ_i be defined as in (13), let the conditions of Assumption A hold, and assume that T ≥ K. Define β_i = max{1/α_i, 2}. There is a constant C such that

    E[T_i(T)] ≤ C (log T / n_i) ((c_i + κ_2)/Δ_i)^{β_i}   and   P( T_i(T) > C (log T / n_i) ((c_i + κ_2)/Δ_i)^{β_i} ) ≤ κ_1/(T K)^4,

where c_i and α_i come from the definition of the concentration function γ_i in Assumption A(c).

At a high level, this result shows that the fraction of the budget allocated to any suboptimal class i goes to 0 at the rate (log T)/(T n_i Δ_i^{β_i}). Hence, asymptotically in T, we will receive exponentially more samples for i* than for any other class and will perform almost as well as if we had known i* in advance. To see an example of the concrete rates that can be concluded from the above result, let F_1, ..., F_K be model classes with finite VC-dimension,[3] so that Assumption A is satisfied with α_i = 1/2. Then we have

Corollary 1. Under the conditions of Theorem 1, assume F_1, ..., F_K are model classes of finite VC-dimension, where F_i has dimension d_i. Then there is a constant C such that

    E[T_i(T)] ≤ C max{d_i, κ_2²} (log T)/(Δ_i² n_i)   and   P( T_i(T) > C max{d_i, κ_2²} (log T)/(Δ_i² n_i) ) ≤ κ_1/(T K)^4.

The result of Corollary 1 is nearly optimal in general due to a lower bound for the special case of multi-armed bandit problems (Lai and Robbins, 1985). To see the connection, let F_i correspond to the i-th arm in a multi-armed bandit problem and the risk R_i be the expected reward of arm i. In this case, the complexity penalty γ_i for each class is 0. Lai and Robbins give a lower bound that shows that the expected number of pulls of any suboptimal arm i is at least

    E[T_i(T)] = Ω( log T / KL(p_i ‖ p_{i*}) ),

where p_i and p_{i*} are the reward distributions for the i-th and optimal arms, respectively. Unfortunately, the condition that Δ_i > 0 may not always be satisfied, or Δ_i may be so small as to render the bound in Theorem 1 vacuous. Nevertheless, we intuitively believe that our algorithm can quickly find a small set of good classes (those with small penalized risk) and spend its computational budget trying to distinguish amongst them. In this case, though, Algorithm 1 will not visit suboptimal classes and so can still output a function f satisfying good oracle bounds. In order to prove a result quantifying this intuition, we first upper bound the regret of Algorithm 1, that is, the average excess risk suffered by the algorithm over all iterations, and then show how to use this bound to obtain a model with a small risk. We state our results for the case where α_i ≡ α and define β = max{1/α, 2}.

Proposition 1. Use the same assumptions as Theorem 1, but further assume that α_i ≡ α for all i. With probability at least 1 − κ_1/(T K)³, the regret (average excess risk) of Algorithm 1 is bounded as

    Σ_{i=1}^K T_i(T) Δ_i ≤ 2e T^{1−1/β} C ( Σ_{i=1}^K (c_i + κ_2)^β / n_i )^{1/β}

[3] Similar corollaries hold for any model class whose metric entropy grows as polylog(1/ε).

for a constant C dependent on α. In order to obtain a model with a small risk, we need to make an additional assumption: that the models are compatible, in the sense that one can define the addition operator f + g meaningfully for f ∈ F_i, g ∈ F_j. We also assume that the risk functional R(f) is convex in f. In such a setting, we can average the functions minimizing the objective R̂(i, n), that is, f_t = argmin_{f∈F_{i_t}} R̂_{n_t}(f), to obtain a function satisfying the desired oracle inequality. For this theorem, we also assume that the constants c_i from Assumption A(c) satisfy c_i = O(1).

Theorem 2. Use the same assumptions as Proposition 1. Let f_t be the function chosen by algorithm A at round t of Alg. 1 and define the average function f̄_T = (1/T) Σ_{t=1}^T f_t. If the risk functional R is convex, there are constants C, C′ dependent on α such that, with probability greater than 1 − 2κ_1/(T K)³,

    R(f̄_T) ≤ R_{i*} + γ_{i*}(T n_{i*}) + 2eκ_2 T^{−1/β} + C′ T^{−1/β} ( Σ_{i=1}^K [ c_i n_i^{−α} + κ_2 √(log K)/√n_i + κ_2/n_i ]^β )^{1/β}.

Let us interpret the above bound and discuss its optimality. When α = 1/2 (e.g., for VC classes), we have β = 2; moreover, writing C_i for the constants associated with class i, it is clear that Σ_{i=1}^K C_i/n_i = O(K). Thus, to within constant factors, we have

    R(f̄_T) ≤ R_{i*} + γ_{i*}(T n_{i*}) + O( √( K max{1, log K} / T ) ).

Ignoring logarithmic factors, the above bound is minimax optimal, which follows by a reduction of our model selection problem to the special case of a multi-armed bandit problem. In this case, Theorem 5.1 of Auer et al. (2003) shows that for any set of K, T values, there is a distribution over the rewards of the arms which forces Ω(√(KT)) regret; that is, the average excess risk of the classes chosen by Alg. 1 must be Ω(√(K/T)). We provide proofs of Proposition 1 and Theorem 2 in the long version of the paper.

3.2 Oracle inequalities for nested hierarchies

In this section we provide an oracle inequality on the output of the procedure (12) that has a more favorable dependence on the number of classes K than our bounds for unstructured function classes F_i.
The main idea is to use Assumption B along with Definition 1 to show that performing a coarse grid search over S is sufficient to deduce an oracle inequality over the entire hierarchy. The next theorem provides an oracle inequality for the risk of the function f = A(î, T n_î / s*), which is the output of the learning algorithm A applied to the class î picked by our algorithm.

Theorem 3. Let f = A(î, T n_î / s*) be the output of the algorithm A for the class î specified by the procedure (12). Let Assumptions A and B hold. With probability at least 1 − 3κ_1 exp(−4m²),

    R(f) ≤ min_{i∈[K]} { R_i + 2(1 + λ) γ_i(T n_i / s*) } + κ_2 √( s* log K / (2 T n_K) ) + κ_2 m √( s* / (T n_K) ).

Remark: It is possible to reduce the κ_2 √(s* log K / (2 T n_K)) term in the bound above to a (1 + λ) √(s* log i / (2 T n_i)) term appearing inside the minimum over classes i ∈ [K], by requiring the coarse grid condition (11) to hold over terms of the form γ_i(T n_i / s*) + √(s* log i / (2 T n_i)). This stronger bound applies, for example, to sequences of VC-classes as described in Example 1.

The above result makes it clear that the excess risk of the algorithm, outside of the minimum over all the classes, scales as O(T^{−1/2}). It is of interest to contrast Theorem 3 with the results of the previous section. In the completely general case, we have a dependence on K better than √K only when there is constant separation between the penalized risks of different classes. Since s* ≤ K, the result of the above theorem is essentially as strong as any of the results from the previous section, as we would hope when we know F_i ⊆ F_{i+1}. Nonetheless, the main strength of Theorem 3 is in scenarios where s* = O(log K), such as VC-classes (e.g. Example 1) with at most polynomial growth in VC-dimension. In such scenarios, the

function $\hat{f}$ that the procedure outputs is competitive, up to logarithmic factors, with an oracle that devotes the entire computational budget to the optimal class. We note that model selection procedures suffer a penalty of $\log K$ or $\log i$ even in computationally unconstrained settings (see, e.g., Bartlett et al., 2002), so our computationally restricted procedure suffers at most an additional penalty of $O(\sqrt{\log K})$. We conclude by recalling that many common model selection scenarios satisfy $|S| = O(\log K)$, as noted in Section 2.3.

4 Proof of Theorem 1

At a high level, the proof of this theorem involves combining the techniques for the analysis of multi-armed bandits developed by Auer et al. (2002) with Assumption A. We start by giving a lemma which will be useful in proving the theorem. The lemma states that after a sufficient number of initial iterations $\tau_i$, the probability that Algorithm 1 chooses to receive samples for a sub-optimal function class $i$ is extremely small. Recall also our notational convention that $\beta = \max\{1/\alpha, 2\}$.

Lemma 1. For any class $i \ne i^\star$, any $s \in [1, T]$ and $s_i \in [\tau_i, T]$, where $\tau_i > 0$ satisfies
$$\tau_i > \frac{2^{\beta}\left(c_i + \kappa_2\sqrt{\log T} + \kappa_2\sqrt{\log K}\right)^{\beta}}{n_i\Delta_i^{\beta}},$$
under Assumption A we have
$$P\left(\widehat{R}(i, n_is_i) - \kappa_2\sqrt{\frac{\log K}{n_is_i}} - \kappa_2\sqrt{\frac{\log T}{n_is_i}} \le \widehat{R}(i^\star, n_{i^\star}s) - \kappa_2\sqrt{\frac{\log K}{n_{i^\star}s}} - \kappa_2\sqrt{\frac{\log T}{n_{i^\star}s}}\right) \le \frac{2\kappa_1}{(TK)^4}.$$

We defer the proof of the lemma to Appendix A, though at a high level the proof works as follows. The bad event in Lemma 1, that is, that Algorithm 1 selects a sub-optimal class, occurs only if one of the following three errors occurs: the empirical risk of class $i$ is much lower than its true risk, the empirical risk of class $i^\star$ is higher than its true risk, or $s_i$ is not large enough to actually separate the true penalized risks from one another. Under the assumptions of the lemma, however, coupled with the uniform convergence properties in Assumption A, each of these three sub-events is quite unlikely.

Now we turn to the proof of Theorem 1 assuming the lemma. Let $i_t$ denote the model class index chosen by Algorithm 1 at time $t$, and let $s_i(t)$ denote the number of times class $i$ has been selected at round $t$ of the algorithm. When no time index is needed, $s_i$ will denote the same quantity.
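To make the role of Lemma 1 concrete, the following toy simulation implements a lower-confidence-bound selection rule of the shape used in criterion (6): at each round, query the class whose empirical risk, shifted down by a $\kappa_2\sqrt{\log(TK)/\text{pulls}}$ exploration term, is smallest. The reward model, noise level, and constants are our illustrative assumptions, not the paper's.

```python
import math
import random

# Toy simulation of the lower-confidence-bound selection underlying
# Alg. 1 / Lemma 1.  The sampling model and constants are illustrative.

random.seed(0)
K = 3
true_risk = [0.5, 0.4, 0.3]   # class 2 (0-indexed) is optimal
kappa2 = 0.1                  # confidence multiplier, same order as the noise
T = 5000

pulls = [0] * K
risk_sum = [0.0] * K

# Initialization: query each class once.
for i in range(K):
    pulls[i] += 1
    risk_sum[i] += true_risk[i] + random.gauss(0, 0.1)

for t in range(K + 1, T + 1):
    def lcb(i):
        # Empirical risk minus the exploration bonus: small either when the
        # class looks good or when it has been queried only rarely.
        emp = risk_sum[i] / pulls[i]
        return emp - kappa2 * math.sqrt(math.log(T * K) / pulls[i])

    i_t = min(range(K), key=lcb)
    pulls[i_t] += 1
    risk_sum[i_t] += true_risk[i_t] + random.gauss(0, 0.1)

# The optimal class receives the bulk of the budget; suboptimal classes
# are queried only about log(TK)/Delta_i^2 times, as the lemma predicts.
assert sum(pulls) == T
assert pulls[2] == max(pulls)
```

The qualitative behavior matches the lemma: once a suboptimal class has been queried past its threshold $\tau_i$, its lower confidence bound rarely dips below that of the optimal class again.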
Note that if $i_t = i$ and the number of times class $i$ has been queried exceeds $\tau_i > 0$, then by the definition of the selection criterion (6) and the choice of $i_t$ in Alg. 1, for some $s_i \in \{\tau_i, \ldots, t-1\}$ and $s \in \{1, \ldots, t-1\}$ we have
$$\widehat{R}(i, n_is_i) - \kappa_2\sqrt{\frac{\log K}{n_is_i}} - \kappa_2\sqrt{\frac{\log T}{n_is_i}} \le \widehat{R}(i^\star, n_{i^\star}s) - \kappa_2\sqrt{\frac{\log K}{n_{i^\star}s}} - \kappa_2\sqrt{\frac{\log T}{n_{i^\star}s}}.$$
Here we interpret $\widehat{R}(i, n_is)$ to mean a random realization of the observed risk consistent with the samples we observe. Using the above implication, we thus have
$$\frac{T_i(T)}{n_i} = 1 + \sum_{t=K+1}^{T} I\{i_t = i\} \le \tau_i + \sum_{t=K+1}^{T} I\{i_t = i,\ s_i(t-1) \ge \tau_i\}$$
$$\le \tau_i + \sum_{t=K+1}^{T} I\left\{\min_{\tau_i \le s_i < t}\left[\widehat{R}(i, n_is_i) - \kappa_2\sqrt{\tfrac{\log K}{n_is_i}} - \kappa_2\sqrt{\tfrac{\log T}{n_is_i}}\right] \le \max_{0 < s < t}\left[\widehat{R}(i^\star, n_{i^\star}s) - \kappa_2\sqrt{\tfrac{\log K}{n_{i^\star}s}} - \kappa_2\sqrt{\tfrac{\log T}{n_{i^\star}s}}\right]\right\}$$
$$\le \tau_i + \sum_{t=K+1}^{T}\sum_{s=1}^{t-1}\sum_{s_i=\tau_i}^{t-1} I\left\{\widehat{R}(i, n_is_i) - \kappa_2\sqrt{\tfrac{\log K}{n_is_i}} - \kappa_2\sqrt{\tfrac{\log T}{n_is_i}} \le \widehat{R}(i^\star, n_{i^\star}s) - \kappa_2\sqrt{\tfrac{\log K}{n_{i^\star}s}} - \kappa_2\sqrt{\tfrac{\log T}{n_{i^\star}s}}\right\}. \quad (14)$$
To control the last term, we invoke Lemma 1 and obtain that for
$$\tau_i > \frac{2^{\beta}\left(c_i + \kappa_2\sqrt{\log T} + \kappa_2\sqrt{\log K}\right)^{\beta}}{n_i\Delta_i^{\beta}},$$
$$E\left[\frac{T_i(T)}{n_i}\right] \le \tau_i + \sum_{t=1}^{T}\sum_{s=1}^{t-1}\sum_{s_i=\tau_i}^{t-1}\frac{2\kappa_1}{(TK)^4} \le \tau_i + \frac{\kappa_1}{TK^4}.$$
Hence for any suboptimal class $i$, $E[T_i(T)/n_i] \le \tau_i + \kappa_1/(TK^4)$, where $\tau_i$ satisfies the lower bound of Lemma 1 and is thus logarithmic in $T$. Under the assumption that $T \ge K$, for $i \ne i^\star$,
$$E[T_i(T)] \le C\,\frac{(c_i + \kappa_2)^{\max\{1/\alpha,2\}}}{\Delta_i^{\max\{1/\alpha,2\}}} \quad (15)$$

for a constant $C \le 2\cdot4^{\max\{1/\alpha,2\}}$. Now we prove the high-probability bound. For this part, we need only concern ourselves with the sum of indicators from (14). Markov's inequality shows that
$$P\left(\sum_{t=K+1}^{T} I\{i_t = i,\ s_i(t-1) \ge \tau_i\} \ge 1\right) \le \frac{\kappa_1}{TK^4}.$$
Thus we can assert that the bound (15) on $T_i(T)$ holds with high probability.

Remark: By examining the proof of Theorem 1, it is straightforward to see that if we modify the multipliers on the confidence terms in the criterion (6) to $m\kappa_2$ instead of $\kappa_2$, then the probability bound is of the order $T^{3-4m^2}K^{-4m^2}$, while the bound on $T_i(T)$ is scaled by $m^{1/\alpha}$.

5 Model selection over nested hierarchies

In this section, we prove Theorem 3. The following proposition states that the class returned by the procedure (12) satisfies an oracle inequality over the set $S$.

Proposition 2. Let $\hat{f} = A(\hat{\imath}, n_{\hat{\imath}}T/|S|)$ be the output of the algorithm $A$ for the class $\hat{\imath}$ specified in Equation (12). Under the conditions of Theorem 3, with probability at least $1 - 3\kappa_1\exp(-m)$,
$$R(\hat{f}) \le \min_{i\in S}\left\{R^*_i + 2\gamma_i\!\left(\frac{Tn_i}{|S|}\right) + \kappa_2\sqrt{\frac{|S|\log i}{2Tn_i}}\right\} + \kappa_2\sqrt{\frac{m|S|}{Tn_K}}.$$

The proof of the proposition follows from an argument similar to that given by Bartlett et al. (2002). We present a proof at the end of this section, since our setting is slightly different: each class receives a different number of independent samples. First, however, we complete the proof of Theorem 3 using the proposition.

Proof of Theorem 3. Let $i \in [K]$ be any class, not necessarily in $S$, and let $j \in S$ be the smallest class in $S$ satisfying $j \ge i$. Then by the construction of $S$, we know that
$$\gamma_j\!\left(\frac{Tn_j}{|S|}\right) \le (1+\lambda)\,\gamma_i\!\left(\frac{Tn_i}{|S|}\right).$$
Thus we can lower bound the penalized risk of class $i$ as
$$R^*_i + 2(1+\lambda)\gamma_i\!\left(\frac{Tn_i}{|S|}\right) \ge R^*_j + 2\gamma_j\!\left(\frac{Tn_j}{|S|}\right),$$
where we used the nesting Assumption B to conclude that $j \ge i$ implies $R^*_j \le R^*_i$. Now combining the above lower bound with the inequality in Proposition 2 yields that with probability at least $1 - 3\kappa_1\exp(-m)$, the risk $R(\hat{f})$ is bounded by the minimum over classes $j \in S$ of
$$R^*_j + 2\gamma_j\!\left(\frac{Tn_j}{|S|}\right) + \kappa_2\sqrt{\frac{|S|\log j}{2Tn_j}} + \kappa_2\sqrt{\frac{m|S|}{Tn_K}} \le \min_{i\in[K]}\left\{R^*_i + 2(1+\lambda)\gamma_i\!\left(\frac{Tn_i}{|S|}\right) + \kappa_2\sqrt{\frac{|S|\log K}{2Tn_K}}\right\} + \kappa_2\sqrt{\frac{m|S|}{Tn_K}},$$
since $j \le K$ and $n_j \ge n_K$. ∎

Proof of Proposition 2. To prove the proposition, we would like to control the probability
$$P\left(R(\hat{f}) > \min_{i\in S}\left\{R^*_i + 2\gamma_i\!\left(\frac{Tn_i}{|S|}\right) + \kappa_2\sqrt{\frac{|S|\log i}{2Tn_i}}\right\} + \epsilon\right)$$
$$\le \underbrace{P\left(R(\hat{f}) > \min_{i\in S}\left\{\widehat{R}(i, Tn_i/|S|) + \gamma_i\!\left(\frac{Tn_i}{|S|}\right)\right\} + \frac{\epsilon}{2}\right)}_{T_1} \quad (16)$$
$$\;+\; \underbrace{P\left(\min_{i\in S}\left\{\widehat{R}(i, Tn_i/|S|) + \gamma_i\!\left(\frac{Tn_i}{|S|}\right)\right\} > \min_{i\in S}\left\{R^*_i + 2\gamma_i\!\left(\frac{Tn_i}{|S|}\right) + \kappa_2\sqrt{\frac{|S|\log i}{2Tn_i}}\right\} + \frac{\epsilon}{2}\right)}_{T_2},$$

where the inequality follows from a union bound. We now bound the terms $T_1$ and $T_2$ separately. To bound the terms, we first observe that by the construction (12), the minimum of the penalized empirical risk is attained by the class $\hat{\imath}$. We thus simplify $T_1$ as
$$P\left(R(\hat{f}) > \min_{i\in S}\left\{\widehat{R}(i, Tn_i/|S|) + \gamma_i\!\left(\frac{Tn_i}{|S|}\right)\right\} + \frac{\epsilon}{2}\right) = P\left(R(\hat{f}) > \widehat{R}(\hat{f}) + \gamma_{\hat{\imath}}\!\left(\frac{Tn_{\hat{\imath}}}{|S|}\right) + \frac{\epsilon}{2}\right) \le \kappa_1\exp\left(-\frac{Tn_{\hat{\imath}}\epsilon^2}{\kappa_2^2|S|}\right),$$
where the inequality follows by an application of Assumption A(a). To bound $T_2$ in the sum (16), we define $f_i^\star = \operatorname{argmin}_{f\in F_i} R(f)$, so that $R^*_i = R(f_i^\star)$. Noting that the event in $T_2$ implies that
$$\max_{i\in S}\left\{\widehat{R}(i, Tn_i/|S|) - R^*_i - \gamma_i\!\left(\frac{Tn_i}{|S|}\right) - \kappa_2\sqrt{\frac{|S|\log i}{2Tn_i}}\right\} > \frac{\epsilon}{2},$$
we can use the union bound to see that
$$T_2 \le \sum_{i\in S} P\left(\widehat{R}(i, Tn_i/|S|) - R^*_i - \gamma_i\!\left(\frac{Tn_i}{|S|}\right) - \kappa_2\sqrt{\frac{|S|\log i}{2Tn_i}} > \frac{\epsilon}{2}\right) \le \sum_{i\in S} P\left(\widehat{R}(f_i^\star) - R(f_i^\star) > \frac{\epsilon}{2} + \kappa_2\sqrt{\frac{|S|\log i}{2Tn_i}}\right),$$
where the final inequality uses Assumption A(b), which states that $A$ outputs a $\gamma_i$-minimizer of the empirical risk. Now we can bound the deviations using Assumption A(d), since $f_i^\star$ is non-random:
$$T_2 \le \sum_{i\in S}\kappa_1\exp\left(-\frac{Tn_i\epsilon^2}{\kappa_2^2|S|}\right)\exp(-2\log i).$$
Setting $\epsilon = \kappa_2\sqrt{m|S|/(Tn_K)}$, we see that the first exponential in the bound on $T_2$ reduces to $\exp(-mn_i/n_K) \le \exp(-m)$, since $n_i \ge n_K$. Then we get
$$T_2 \le \sum_{i\in S}\kappa_1\exp(-m)\exp(-2\log i) \le 2\kappa_1\exp(-m),$$
where the last step uses $\sum_{i=1}^{\infty} 1/i^2 = \pi^2/6 \le 2$. Finally, plugging the stated setting of $\epsilon$ into the bound on $T_1$ completes the proof. ∎

Acknowledgements

In performing this research, AA was supported by a Microsoft Research Fellowship, and JCD was supported by the National Defense Science and Engineering Graduate (NDSEG) Fellowship Program. AA and PB gratefully acknowledge the support of the NSF under award DMS-0830410.

References

H. Akaike. A new look at the statistical model identification. IEEE Transactions on Automatic Control, 19(6):716–723, December 1974.

P. Auer, N. Cesa-Bianchi, Y. Freund, and R. E. Schapire. The nonstochastic multiarmed bandit problem. SIAM Journal on Computing, 32(1):48–77, 2003.

P. Auer, N. Cesa-Bianchi, and P. Fischer. Finite-time analysis of the multiarmed bandit problem. Machine Learning, 47(2-3):235–256, 2002.

A. R. Barron. Complexity regularization with application to artificial neural networks. In Nonparametric Functional Estimation and Related Topics, pages 561–576. Kluwer Academic, 1991.

A. Barron, L. Birgé, and P. Massart. Risk bounds for model selection via penalization. Probability Theory and Related Fields, 113:301–413, 1999.

P. L. Bartlett and S. Mendelson. Rademacher and Gaussian complexities: Risk bounds and structural results. Journal of Machine Learning Research, 3:463–482, 2002.

P. L. Bartlett, S. Boucheron, and G. Lugosi. Model selection and error estimation. Machine Learning, 48:85–113, 2002.

N. Cesa-Bianchi and G. Lugosi. Prediction, Learning, and Games. Cambridge University Press, 2006.

R. M. Dudley. The sizes of compact subsets of Hilbert space and continuity of Gaussian processes. Journal of Functional Analysis, 1:290–330, 1967.

R. M. Dudley. Central limit theorems for empirical measures. The Annals of Probability, 6(6):899–929, 1978.

S. Geman and C. R. Hwang. Nonparametric maximum likelihood estimation by the method of sieves. Annals of Statistics, 10:401–414, 1982.

T. L. Lai and H. Robbins. Asymptotically efficient adaptive allocation rules. Advances in Applied Mathematics, 6:4–22, 1985.

G. Lugosi and M. Wegkamp. Complexity regularization via localized random penalties. Annals of Statistics, 32(4):1679–1697, 2004.

C. L. Mallows. Some comments on C_p. Technometrics, 15(4):661–675, 1973.

P. Massart. Concentration inequalities and model selection. In J. Picard, editor, École d'Été de Probabilités de Saint-Flour XXXIII – 2003. Springer, 2003.

A. Nemirovski, A. Juditsky, G. Lan, and A. Shapiro. Robust stochastic approximation approach to stochastic programming. SIAM Journal on Optimization, 19(4):1574–1609, 2009.

D. Pollard. Convergence of Stochastic Processes. Springer-Verlag, 1984.

J. Rissanen. A universal prior for integers and estimation by minimum description length. The Annals of Statistics, 11(2):416–431, 1983.

V. N. Vapnik and A. Ya. Chervonenkis. Theory of Pattern Recognition. Nauka, Moscow, 1974. In Russian.

A Proof of Lemma 1

Following Auer et al. (2002), we show that the event in the lemma occurs with very low probability by breaking it up into smaller events more amenable to analysis. Recall that we are interested in controlling the probability of the event
$$\widehat{R}(i, n_is_i) - \kappa_2\sqrt{\frac{\log K}{n_is_i}} - \kappa_2\sqrt{\frac{\log T}{n_is_i}} \le \widehat{R}(i^\star, n_{i^\star}s) - \kappa_2\sqrt{\frac{\log K}{n_{i^\star}s}} - \kappa_2\sqrt{\frac{\log T}{n_{i^\star}s}}. \quad (17)$$
For this bad event to happen, at least one of the following three events must happen:
$$\widehat{R}_{n_is_i}(A(i, n_is_i)) \le \inf_{f\in F_i} R(f) - \gamma_i(n_is_i) - \kappa_2\sqrt{\frac{\log K}{n_is_i}} - \kappa_2\sqrt{\frac{\log T}{n_is_i}} \quad (18a)$$
$$\widehat{R}_{n_{i^\star}s}(A(i^\star, n_{i^\star}s)) \ge \inf_{f\in F_{i^\star}} R(f) + \gamma_{i^\star}(n_{i^\star}s) + \kappa_2\sqrt{\frac{\log K}{n_{i^\star}s}} + \kappa_2\sqrt{\frac{\log T}{n_{i^\star}s}} \quad (18b)$$
$$R^*_i + \gamma_i(Tn_i) \le R^*_{i^\star} + \gamma_{i^\star}(Tn_{i^\star}) + 2\left(\gamma_i(n_is_i) + \kappa_2\sqrt{\frac{\log K}{n_is_i}} + \kappa_2\sqrt{\frac{\log T}{n_is_i}}\right). \quad (18c)$$
Temporarily use the shorthand $f_i = A(i, n_is_i)$ and $f_{i^\star} = A(i^\star, n_{i^\star}s)$. The relationship between Eqs. (18a)–(18c) and the event in (17) follows from the fact that if none of (18a)–(18c) occur, then
$$\widehat{R}(i, n_is_i) - \kappa_2\sqrt{\frac{\log K}{n_is_i}} - \kappa_2\sqrt{\frac{\log T}{n_is_i}} = \widehat{R}_{n_is_i}(f_i) + \gamma_i(Tn_i) - \gamma_i(n_is_i) - \kappa_2\sqrt{\frac{\log K}{n_is_i}} - \kappa_2\sqrt{\frac{\log T}{n_is_i}}$$
$$> \inf_{f\in F_i} R(f) + \gamma_i(Tn_i) - 2\left(\gamma_i(n_is_i) + \kappa_2\sqrt{\frac{\log K}{n_is_i}} + \kappa_2\sqrt{\frac{\log T}{n_is_i}}\right) \qquad \text{[failure of (18a)]}$$
$$> \inf_{f\in F_{i^\star}} R(f) + \gamma_{i^\star}(Tn_{i^\star}) \qquad \text{[failure of (18c)]}$$
$$> \widehat{R}_{n_{i^\star}s}(f_{i^\star}) + \gamma_{i^\star}(Tn_{i^\star}) - \gamma_{i^\star}(n_{i^\star}s) - \kappa_2\sqrt{\frac{\log K}{n_{i^\star}s}} - \kappa_2\sqrt{\frac{\log T}{n_{i^\star}s}} \qquad \text{[failure of (18b)]}$$
$$= \widehat{R}(i^\star, n_{i^\star}s) - \kappa_2\sqrt{\frac{\log K}{n_{i^\star}s}} - \kappa_2\sqrt{\frac{\log T}{n_{i^\star}s}}.$$
From the above string of inequalities, to show that the event (17) has low probability, we need simply show that each of (18a), (18b), and (18c) has low probability.

To prove that each of the bad events has low probability, we note the following consequences of Assumption A. Recall the definition of $f_i^\star$ as the minimizer of $R(f)$ over the class $F_i$. Then by Assumption A(a),
$$R(f_i^\star) - \gamma_i(n) - \kappa_2\epsilon \le R(A(i,n)) - \gamma_i(n) - \kappa_2\epsilon < \widehat{R}_n(A(i,n)),$$
while Assumptions A(b) and A(d) imply
$$\widehat{R}_n(A(i,n)) \le \widehat{R}_n(f_i^\star) + \gamma_i(n) \le R(f_i^\star) + \gamma_i(n) + \kappa_2\epsilon,$$
each with probability at least $1 - \kappa_1\exp(-4n\epsilon^2)$. In particular, choosing $\epsilon = \sqrt{\log K/(n_is_i)} + \sqrt{\log T/(n_is_i)}$, we see that the events (18a) and (18b) have low probability:
$$P\left(\widehat{R}_{n_is_i}(A(i, n_is_i)) \le R^*_i - \gamma_i(n_is_i) - \kappa_2\sqrt{\frac{\log K}{n_is_i}} - \kappa_2\sqrt{\frac{\log T}{n_is_i}}\right) \le \kappa_1\exp\left(-4n_is_i\epsilon^2\right) \le \frac{\kappa_1}{(TK)^4},$$
$$P\left(\widehat{R}_{n_{i^\star}s}(A(i^\star, n_{i^\star}s)) \ge R^*_{i^\star} + \gamma_{i^\star}(n_{i^\star}s) + \kappa_2\sqrt{\frac{\log K}{n_{i^\star}s}} + \kappa_2\sqrt{\frac{\log T}{n_{i^\star}s}}\right) \le \kappa_1\exp\left(-4n_{i^\star}s\,\epsilon^2\right) \le \frac{\kappa_1}{(TK)^4}.$$

What remains is to show that for large enough $\tau_i$, (18c) does not happen. Recalling the definition $\Delta_i = R^*_i + \gamma_i(Tn_i) - R^*_{i^\star} - \gamma_{i^\star}(Tn_{i^\star})$, we see that for (18c) to fail it is sufficient that
$$\Delta_i > 2\gamma_i(\tau_in_i) + 2\kappa_2\sqrt{\frac{\log K}{n_i\tau_i}} + 2\kappa_2\sqrt{\frac{\log T}{n_i\tau_i}}.$$
Let $x \wedge y := \min\{x, y\}$ and $x \vee y := \max\{x, y\}$. Since $\gamma_i(n) \le c_in^{-\alpha}$, the above is satisfied when
$$\frac{\Delta_i}{2} > c_i(\tau_in_i)^{-(\alpha\wedge\frac{1}{2})} + \kappa_2\sqrt{\log K}\,(\tau_in_i)^{-(\alpha\wedge\frac{1}{2})} + \kappa_2\sqrt{\log T}\,(\tau_in_i)^{-(\alpha\wedge\frac{1}{2})}. \quad (19)$$
We can solve (19) and see immediately that if
$$\tau_i > \frac{2^{(1/\alpha)\vee2}\left(c_i + \kappa_2\sqrt{\log T} + \kappa_2\sqrt{\log K}\right)^{(1/\alpha)\vee2}}{n_i\,\Delta_i^{(1/\alpha)\vee2}},$$
then
$$R^*_i + \gamma_i(Tn_i) > R^*_{i^\star} + \gamma_{i^\star}(Tn_{i^\star}) + 2\left(\gamma_i(n_i\tau_i) + \kappa_2\sqrt{\frac{\log K}{n_i\tau_i}} + \kappa_2\sqrt{\frac{\log T}{n_i\tau_i}}\right). \quad (20)$$
Thus the event in (18c) fails to occur, completing the proof of the lemma. ∎
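As a sanity check on this last step, the short script below plugs a threshold just above the lemma's lower bound on $\tau_i$ into the right-hand side of (19) and verifies that the separation $\Delta_i/2$ strictly exceeds it. All parameter values are our illustrative assumptions, not numbers from the paper.

```python
import math

# Numerical check of the final step of Lemma 1's proof: a tau_i just above
# the stated threshold makes the separation condition (19) hold, so event
# (18c) cannot occur.  Parameter values are illustrative.

def check_separation(c_i, kappa2, K, T, n_i, Delta_i, alpha=0.5):
    """Return True if Delta_i/2 exceeds the RHS of (19) at the threshold tau_i."""
    beta = max(1.0 / alpha, 2.0)   # beta = max{1/alpha, 2} = (1/alpha) v 2
    S = c_i + kappa2 * math.sqrt(math.log(T)) + kappa2 * math.sqrt(math.log(K))
    # tau_i slightly above the lemma's threshold 2^beta S^beta / (n_i Delta_i^beta):
    tau = 1.1 * 2.0 ** beta * S ** beta / (n_i * Delta_i ** beta)
    m = tau * n_i                  # effective sample count tau_i * n_i
    # Right-hand side of (19), using gamma_i(n) <= c_i n^{-alpha}:
    rhs = S * m ** (-min(alpha, 0.5))
    return Delta_i / 2.0 > rhs

assert check_separation(c_i=1.0, kappa2=0.5, K=10, T=1000, n_i=4, Delta_i=0.2)
assert check_separation(c_i=2.0, kappa2=1.0, K=100, T=10**6, n_i=8, Delta_i=0.05)
```

For $\alpha = 1/2$ the algebra is transparent: with $\tau_in_i = 1.1\cdot4S^2/\Delta_i^2$ the right-hand side of (19) equals $\Delta_i/(2\sqrt{1.1}) < \Delta_i/2$, so the check succeeds for any positive gap.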