High Dimensional Bayesian Optimisation and Bandits via Additive Models
Kirthevasan Kandasamy, Jeff Schneider, Barnabás Póczos
ICML 2015, July 8 2015
Bandits & Optimisation

Motivating example: maximum likelihood inference in computational astrophysics. A cosmological simulator maps parameters (e.g. the Hubble constant, the baryonic density) to an observation, and we wish to find the parameters that best match the data.

More generally, the goal is to optimise an expensive blackbox function. Other examples: hyper-parameter tuning in ML, optimal control strategies in robotics.
Bandits & Optimisation

$f : [0, 1]^D \to \mathbb{R}$ is an expensive, blackbox, nonconvex function. Let $x_\star = \mathrm{argmax}_x f(x)$.

[Figure: a 1-D function $f$ on $[0,1]$ with its maximiser $x_\star$ marked.]

Optimisation = minimise the simple regret,
\[ S_T = f(x_\star) - \max_{t = 1, \dots, T} f(x_t). \]

Bandits = minimise the cumulative regret,
\[ R_T = \sum_{t=1}^{T} \big( f(x_\star) - f(x_t) \big). \]
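To make the two objectives concrete, here is a minimal Python sketch (not from the talk's code) computing both regrets from a query history; it assumes $f(x_\star)$ is known, which is only the case in synthetic experiments.

```python
def simple_regret(f_star, f_values):
    """S_T = f(x*) - max_t f(x_t): how far the best query is from the optimum."""
    return f_star - max(f_values)

def cumulative_regret(f_star, f_values):
    """R_T = sum_t (f(x*) - f(x_t)): every suboptimal query is penalised."""
    return sum(f_star - f for f in f_values)
```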
Gaussian Process (Bayesian) Optimisation

Model $f \sim \mathcal{GP}(0, \kappa)$. Condition on the queries so far to obtain the posterior GP.

[Figure: posterior mean and confidence band for $f$ on $[0,1]$.]

Maximise an acquisition function $\varphi_t$: $x_t = \mathrm{argmax}_x\, \varphi_t(x)$.

GP-UCB: $\varphi_t(x) = \mu_{t-1}(x) + \beta_t^{1/2}\, \sigma_{t-1}(x)$ (Srinivas et al. 2010). [Figure: $\varphi_t$ with its maximiser $x_t = 0.828$ marked.]

Other choices of $\varphi_t$: Expected Improvement (GP-EI), Thompson Sampling, etc.
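A minimal sketch of one GP-UCB step in Python, assuming a squared exponential kernel and a grid search in place of a global optimiser; the helper names (se_kernel, gp_posterior, gp_ucb_step) and the choice of $\beta_t$ are illustrative, not from the authors' code.

```python
import numpy as np

def se_kernel(X1, X2, A=1.0, h=0.1):
    """SE kernel kappa(x, x') = A exp(-(x - x')^2 / (2 h^2)) for 1-D inputs."""
    d2 = (X1[:, None] - X2[None, :]) ** 2
    return A * np.exp(-d2 / (2 * h ** 2))

def gp_posterior(X, y, Xs, noise=1e-4):
    """Posterior mean and std of a zero-mean GP at test points Xs."""
    K = se_kernel(X, X) + noise * np.eye(len(X))
    Ks = se_kernel(X, Xs)                        # cross-covariances, (n, m)
    L = np.linalg.cholesky(K)
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))
    mu = Ks.T @ alpha
    v = np.linalg.solve(L, Ks)
    var = se_kernel(Xs, Xs).diagonal() - np.sum(v ** 2, axis=0)
    return mu, np.sqrt(np.maximum(var, 0.0))

def gp_ucb_step(X, y, t, grid):
    """x_t = argmax_x mu_{t-1}(x) + beta_t^{1/2} sigma_{t-1}(x), over a grid."""
    beta_t = 2.0 * np.log(t ** 2 + 1.0)          # illustrative beta_t schedule
    mu, sigma = gp_posterior(X, y, grid)
    return grid[np.argmax(mu + np.sqrt(beta_t) * sigma)]

# Usage on a toy 1-D blackbox:
# f = lambda x: np.sin(5 * x) * x
# X = np.array([0.1, 0.5, 0.9]); y = f(X)
# x_next = gp_ucb_step(X, y, t=4, grid=np.linspace(0, 1, 1000))
```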
Scaling to Higher Dimensions

Two key challenges:
- Statistical difficulty: nonparametric sample complexity is exponential in $D$.
- Computational difficulty: optimising $\varphi_t$ to within $\zeta$ accuracy requires $O(\zeta^{-D})$ effort.

Existing work:
- Chen et al. 2012: $f$ depends on a small number of variables; find the variables, then run GP-UCB.
- Wang et al. 2013: $f$ varies along a lower dimensional subspace; run GP-EI on a random subspace.
- Djolonga et al. 2013: $f$ varies along a lower dimensional subspace; find the subspace, then run GP-UCB.

All of these perform BO on a low dimensional subspace, but the assumption is too strong in realistic settings.
Additive Functions

Structural assumption:
\[ f(x) = f^{(1)}(x^{(1)}) + f^{(2)}(x^{(2)}) + \dots + f^{(M)}(x^{(M)}), \]
where $x^{(j)} \in \mathcal{X}^{(j)} = [0,1]^d$, $d \ll D$, and $x^{(i)} \cap x^{(j)} = \emptyset$ (disjoint groups of coordinates).

E.g. $f(x_{\{1,\dots,10\}}) = f^{(1)}(x_{\{1,3,9\}}) + f^{(2)}(x_{\{2,4,8\}}) + f^{(3)}(x_{\{5,6,10\}})$. Call $\{\mathcal{X}^{(j)}\}_{j=1}^{M} = \{(1,3,9), (2,4,8), (5,6,10)\}$ the decomposition.

Assume each $f^{(j)} \sim \mathcal{GP}(0, \kappa^{(j)})$. Then $f \sim \mathcal{GP}(0, \kappa)$ where
\[ \kappa(x, x') = \kappa^{(1)}(x^{(1)}, x'^{(1)}) + \dots + \kappa^{(M)}(x^{(M)}, x'^{(M)}). \]

Given observations $(X, Y) = \{(x_i, y_i)\}_{i=1}^{T}$ and a test point $x$, each component has a Gaussian posterior: $f^{(j)}(x^{(j)}) \mid X, Y \sim \mathcal{N}\big(\mu^{(j)}, \sigma^{(j)2}\big)$.
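The slides state the per-component posterior without the formulas; by standard GP conditioning it should take the following form, where $\eta^2$ denotes the observation noise variance and $K$ the $T \times T$ kernel matrix of the full additive kernel on $X$ (this notation is mine, not the slides'):
\[ \mu^{(j)} = \kappa^{(j)}(x^{(j)}, X^{(j)})\,(K + \eta^2 I)^{-1} Y, \qquad \sigma^{(j)2} = \kappa^{(j)}(x^{(j)}, x^{(j)}) - \kappa^{(j)}(x^{(j)}, X^{(j)})\,(K + \eta^2 I)^{-1}\,\kappa^{(j)}(X^{(j)}, x^{(j)}). \]

And a minimal Python sketch of the additive kernel itself, assuming squared exponential group kernels; the decomposition is a list of (0-indexed) coordinate tuples, and all names are illustrative:

```python
import numpy as np

def se_kernel(X1, X2, A=1.0, h=0.1):
    """SE kernel between rows of X1 (n, d) and X2 (m, d)."""
    d2 = np.sum((X1[:, None, :] - X2[None, :, :]) ** 2, axis=-1)
    return A * np.exp(-d2 / (2 * h ** 2))

def additive_kernel(X1, X2, decomposition):
    """kappa(x, x') = sum_j kappa^(j)(x^(j), x'^(j)) over the groups."""
    return sum(se_kernel(X1[:, list(g)], X2[:, list(g)]) for g in decomposition)

# e.g. the decomposition from the slide, shifted to 0-indexing:
# decomposition = [(0, 2, 8), (1, 3, 7), (4, 5, 9)]
# K = additive_kernel(X, X, decomposition)   # X has shape (n, 10)
```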
Outline
1. GP-UCB
2. The Add-GP-UCB algorithm
   - Bounds on $S_T$: exponential in $D$ → linear in $D$.
   - An easy-to-optimise acquisition function.
   - Performs well even when $f$ is not additive.
3. Experiments
4. Conclusion & some open questions
GP-UCB

\[ x_t = \mathrm{argmax}_{x \in \mathcal{X}}\; \mu_{t-1}(x) + \beta_t^{1/2}\, \sigma_{t-1}(x). \]

Squared exponential kernel:
\[ \kappa(x, x') = A \exp\!\left( -\frac{\|x - x'\|^2}{2h^2} \right). \]

Theorem (Srinivas et al. 2010): Let $f \sim \mathcal{GP}(0, \kappa)$. Then w.h.p.,
\[ S_T \leq O\!\left( \sqrt{\frac{D^D (\log T)^D}{T}} \right). \]
GP-UCB on an additive $\kappa$

Suppose $f \sim \mathcal{GP}(0, \kappa)$ where $\kappa(x, x') = \kappa^{(1)}(x^{(1)}, x'^{(1)}) + \dots + \kappa^{(M)}(x^{(M)}, x'^{(M)})$, with each $\kappa^{(j)}$ an SE kernel.

Can be shown: if each $\kappa^{(j)}$ is an SE kernel,
\[ S_T \leq O\!\left( \sqrt{\frac{D\, 2^d d^d (\log T)^d}{T}} \right). \]

But $\varphi_t = \mu_{t-1} + \beta_t^{1/2} \sigma_{t-1}$ is still a $D$-dimensional maximisation problem!
Add-GP-UCB

\[ \varphi_t(x) = \sum_{j=1}^{M} \underbrace{\mu^{(j)}_{t-1}(x^{(j)}) + \beta_t^{1/2}\, \sigma^{(j)}_{t-1}(x^{(j)})}_{\varphi^{(j)}_t(x^{(j)})}. \]

Maximise each $\varphi^{(j)}_t$ separately (see the sketch below). Requires only $O(\mathrm{poly}(D)\, \zeta^{-d})$ effort (vs $O(\zeta^{-D})$ for GP-UCB).

Theorem: Let $f^{(j)} \sim \mathcal{GP}(0, \kappa^{(j)})$ and $f = \sum_j f^{(j)}$. Then w.h.p.,
\[ S_T \leq O\!\left( \sqrt{\frac{D\, 2^d d^d (\log T)^d}{T}} \right). \]
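A minimal sketch of the decomposed maximisation, assuming the per-group posteriors are available as callables; the interface and the use of random search (the talk's experiments use DiRect instead) are illustrative:

```python
import numpy as np

def add_gp_ucb_step(posteriors, decomposition, beta_t, n_cand=1000, seed=0):
    """One Add-GP-UCB step: maximise each phi^(j) over its own d-dimensional
    group and concatenate the coordinates into a single D-dimensional point.

    posteriors[j] = (mu_j, sigma_j): callables mapping an (n, d) array of
    points in the j-th group's cube to n posterior means / std deviations.
    """
    rng = np.random.default_rng(seed)
    D = sum(len(g) for g in decomposition)
    x_t = np.empty(D)
    for (mu_j, sigma_j), group in zip(posteriors, decomposition):
        cand = rng.random((n_cand, len(group)))      # random search in [0,1]^d
        phi_j = mu_j(cand) + np.sqrt(beta_t) * sigma_j(cand)
        x_t[list(group)] = cand[np.argmax(phi_j)]    # best point for group j
    return x_t
```

Each group contributes a $d$-dimensional search, so the overall cost is $M$ low-dimensional maximisations rather than one $D$-dimensional one.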
Summary of Theoretical Results (for the SE kernel)

GP-UCB with no assumption on $f$:
  $S_T \leq O\big( D^{D/2} (\log T)^{D/2}\, T^{-1/2} \big)$. Maximising $\varphi_t$: $O(\zeta^{-D})$ effort.

GP-UCB on additive $f$:
  $S_T \leq O\big( (D/T)^{1/2} \big)$. Maximising $\varphi_t$: $O(\zeta^{-D})$ effort.

Add-GP-UCB on additive $f$:
  $S_T \leq O\big( (D/T)^{1/2} \big)$. Maximising $\varphi_t$: $O(\mathrm{poly}(D)\, \zeta^{-d})$ effort.
Add-GP-UCB: a 2-D illustration

$f(x_{\{1,2\}}) = f^{(1)}(x_{\{1\}}) + f^{(2)}(x_{\{2\}})$.

[Figures: posteriors of $f^{(1)}$ and $f^{(2)}$ on $[0,1]$; each $\varphi^{(j)}$ is maximised separately, giving $x_t^{(1)} = 0.869$ and $x_t^{(2)} = 0.141$, which are concatenated into $x_t = (0.869, 0.141)$.]
Additive Modelling in Non-additive Settings

Additive models are common in high dimensional regression, e.g. backfitting, MARS, COSSO, RODEO, SpAM:
\[ f(x_{\{1,\dots,D\}}) = f(x_{\{1\}}) + f(x_{\{2\}}) + \dots + f(x_{\{D\}}). \]

Additive models are statistically simpler ⇒ worse bias, but much better variance in the low-sample regime. In BO applications queries are expensive, so we usually cannot afford many queries.

Observation: Add-GP-UCB does well even when $f$ is not additive.
- Better bias/variance trade-off in high dimensional regression.
- Easy-to-maximise acquisition function.
Unknown Kernel / Decomposition in Practice

Learn the kernel hyper-parameters and the decomposition $\{\mathcal{X}^{(j)}\}$ by periodically maximising the GP marginal likelihood (a sketch follows below).
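A minimal sketch of one way to do this, reusing `additive_kernel` from the earlier sketch: sample random partitions of the coordinates into $M$ groups of size $d$ and keep the one with the highest log marginal likelihood. The helper names are mine, the partition search assumes $d$ divides $D$, and in practice the kernel hyper-parameters would be optimised jointly:

```python
import numpy as np

def log_marginal(X, y, decomposition, noise=1e-4):
    """log p(y | X, decomposition) for the zero-mean additive GP."""
    K = additive_kernel(X, X, decomposition) + noise * np.eye(len(X))
    L = np.linalg.cholesky(K)
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))
    return (-0.5 * y @ alpha - np.log(np.diag(L)).sum()
            - 0.5 * len(y) * np.log(2 * np.pi))

def pick_decomposition(X, y, D, d, n_tries=20, seed=0):
    """Best of n_tries random partitions of {0,...,D-1} into groups of size d."""
    rng = np.random.default_rng(seed)
    best, best_ll = None, -np.inf
    for _ in range(n_tries):
        perm = rng.permutation(D)
        dec = [tuple(perm[i:i + d]) for i in range(0, D, d)]
        ll = log_marginal(X, y, dec)
        if ll > best_ll:
            best, best_ll = dec, ll
    return best
```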
Experiments (synthetic)

[Plot: simple regret (log scale, $10^0$ to $10^2$) against the number of queries (0 to 800).]

One Add variant knows the true decomposition; Add-$d/M$ denotes $M$ groups of size $d$. 1000 to 4000 DiRect evaluations were used to maximise the acquisition function, depending on the experiment. DiRect: Dividing Rectangles (Jones et al. 1993).
SDSS Luminous Red Galaxies

A cosmological simulator maps parameters (e.g. the Hubble constant, the baryonic density) to an observation. Task: find the maximum likelihood cosmological parameters. 20 dimensions, but only 9 parameters are relevant. Each query takes 2-5 seconds. 500 DiRect evaluations were used to maximise the acquisition function.
SDSS Luminous Red Galaxies: results

[Plot: results on the SDSS task (log scale, $10^1$ to $10^3$) against the number of queries (0 to 400), with REMBO (Wang et al. 2013) among the baselines.]
Viola & Jones Face Detection

A cascade of 22 weak classifiers; an image is classified negative if its score falls below the threshold at any stage. Task: find the optimal threshold values on a training set of 1000 images. 22 dimensions. Each query takes 30-40 seconds. 1000 DiRect evaluations were used to maximise the acquisition function.
Viola & Jones Face Detection: results

[Plot: classification accuracy (roughly 65 to 95%) against the number of queries (0 to 300).]
Summary

- The additive assumption improves regret: exponential in $D$ → linear in $D$.
- The acquisition function is easy to maximise.
- Even when $f$ is not additive, Add-GP-UCB does well in practice.
- Similar results hold for Matérn kernels and in the bandit setting.

Some open questions:
- How to choose $(d, M)$?
- Can we generalise to other acquisition functions?

Code available: github.com/kirthevasank/add-gp-bandits
Jeff's Talk: Friday 2pm @ Van Gogh

Thank You.