High Dimensional Bayesian Optimisation and Bandits via Additive Models
Kirthevasan Kandasamy, Jeff Schneider, Barnabás Póczos
ICML 2015, July 8 2015
Bandits & Optimisation

Motivating example: maximum likelihood inference in computational astrophysics. A cosmological simulator maps parameters (e.g. the Hubble constant, the baryonic density) to an observation, and we wish to find the parameters that best match the data.

More generally, the goal is to optimise an expensive blackbox function. Other examples: hyper-parameter tuning in ML, optimal control strategies in robotics.
Bandits & Optimisation

$f : [0, 1]^D \to \mathbb{R}$ is an expensive, blackbox, nonconvex function. Let $x_\star = \mathrm{argmax}_x f(x)$.

[Figure: a 1-D function $f$ on $[0,1]$ with its maximiser $x_\star$ marked.]

Optimisation = minimise the simple regret,
\[ S_T = f(x_\star) - \max_{t = 1, \dots, T} f(x_t). \]

Bandits = minimise the cumulative regret,
\[ R_T = \sum_{t=1}^{T} \big( f(x_\star) - f(x_t) \big). \]
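To make the two objectives concrete, here is a minimal Python sketch (not from the talk's code) computing both regrets from a query history; it assumes $f(x_\star)$ is known, which is only the case in synthetic experiments.

```python
def simple_regret(f_star, f_values):
    """S_T = f(x*) - max_t f(x_t): how far the best query is from the optimum."""
    return f_star - max(f_values)

def cumulative_regret(f_star, f_values):
    """R_T = sum_t (f(x*) - f(x_t)): every suboptimal query is penalised."""
    return sum(f_star - f for f in f_values)
```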
Gaussian Process (Bayesian) Optimisation

Model $f \sim \mathcal{GP}(0, \kappa)$. Condition on the queries so far to obtain the posterior GP.

[Figure: posterior mean and confidence band for $f$ on $[0,1]$.]

Maximise an acquisition function $\varphi_t$: $x_t = \mathrm{argmax}_x\, \varphi_t(x)$.

GP-UCB: $\varphi_t(x) = \mu_{t-1}(x) + \beta_t^{1/2}\, \sigma_{t-1}(x)$ (Srinivas et al. 2010). [Figure: $\varphi_t$ with its maximiser $x_t = 0.828$ marked.]

Other choices of $\varphi_t$: Expected Improvement (GP-EI), Thompson Sampling, etc.
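A minimal sketch of one GP-UCB step in Python, assuming a squared exponential kernel and a grid search in place of a global optimiser; the helper names (se_kernel, gp_posterior, gp_ucb_step) and the choice of $\beta_t$ are illustrative, not from the authors' code.

```python
import numpy as np

def se_kernel(X1, X2, A=1.0, h=0.1):
    """SE kernel kappa(x, x') = A exp(-(x - x')^2 / (2 h^2)) for 1-D inputs."""
    d2 = (X1[:, None] - X2[None, :]) ** 2
    return A * np.exp(-d2 / (2 * h ** 2))

def gp_posterior(X, y, Xs, noise=1e-4):
    """Posterior mean and std of a zero-mean GP at test points Xs."""
    K = se_kernel(X, X) + noise * np.eye(len(X))
    Ks = se_kernel(X, Xs)                        # cross-covariances, (n, m)
    L = np.linalg.cholesky(K)
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))
    mu = Ks.T @ alpha
    v = np.linalg.solve(L, Ks)
    var = se_kernel(Xs, Xs).diagonal() - np.sum(v ** 2, axis=0)
    return mu, np.sqrt(np.maximum(var, 0.0))

def gp_ucb_step(X, y, t, grid):
    """x_t = argmax_x mu_{t-1}(x) + beta_t^{1/2} sigma_{t-1}(x), over a grid."""
    beta_t = 2.0 * np.log(t ** 2 + 1.0)          # illustrative beta_t schedule
    mu, sigma = gp_posterior(X, y, grid)
    return grid[np.argmax(mu + np.sqrt(beta_t) * sigma)]

# Usage on a toy 1-D blackbox:
# f = lambda x: np.sin(5 * x) * x
# X = np.array([0.1, 0.5, 0.9]); y = f(X)
# x_next = gp_ucb_step(X, y, t=4, grid=np.linspace(0, 1, 1000))
```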
Scaling to Higher Dimensions

Two key challenges:
- Statistical difficulty: nonparametric sample complexity is exponential in $D$.
- Computational difficulty: optimising $\varphi_t$ to within $\zeta$ accuracy requires $O(\zeta^{-D})$ effort.

Existing work:
- Chen et al. 2012: $f$ depends on a small number of variables; find the variables, then run GP-UCB.
- Wang et al. 2013: $f$ varies along a lower dimensional subspace; run GP-EI on a random subspace.
- Djolonga et al. 2013: $f$ varies along a lower dimensional subspace; find the subspace, then run GP-UCB.

All of these perform BO on a low dimensional subspace, but the assumption is too strong in realistic settings.
Additive Functions

Structural assumption:
\[ f(x) = f^{(1)}(x^{(1)}) + f^{(2)}(x^{(2)}) + \dots + f^{(M)}(x^{(M)}), \]
where $x^{(j)} \in \mathcal{X}^{(j)} = [0,1]^d$, $d \ll D$, and $x^{(i)} \cap x^{(j)} = \emptyset$ (disjoint groups of coordinates).

E.g. $f(x_{\{1,\dots,10\}}) = f^{(1)}(x_{\{1,3,9\}}) + f^{(2)}(x_{\{2,4,8\}}) + f^{(3)}(x_{\{5,6,10\}})$. Call $\{\mathcal{X}^{(j)}\}_{j=1}^{M} = \{(1,3,9), (2,4,8), (5,6,10)\}$ the decomposition.

Assume each $f^{(j)} \sim \mathcal{GP}(0, \kappa^{(j)})$. Then $f \sim \mathcal{GP}(0, \kappa)$ where
\[ \kappa(x, x') = \kappa^{(1)}(x^{(1)}, x'^{(1)}) + \dots + \kappa^{(M)}(x^{(M)}, x'^{(M)}). \]

Given observations $(X, Y) = \{(x_i, y_i)\}_{i=1}^{T}$ and a test point $x$, each component has a Gaussian posterior: $f^{(j)}(x^{(j)}) \mid X, Y \sim \mathcal{N}\big(\mu^{(j)}, \sigma^{(j)2}\big)$.
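The slides state the per-component posterior without the formulas; by standard GP conditioning it should take the following form, where $\eta^2$ denotes the observation noise variance and $K$ the $T \times T$ kernel matrix of the full additive kernel on $X$ (this notation is mine, not the slides'):
\[ \mu^{(j)} = \kappa^{(j)}(x^{(j)}, X^{(j)})\,(K + \eta^2 I)^{-1} Y, \qquad \sigma^{(j)2} = \kappa^{(j)}(x^{(j)}, x^{(j)}) - \kappa^{(j)}(x^{(j)}, X^{(j)})\,(K + \eta^2 I)^{-1}\,\kappa^{(j)}(X^{(j)}, x^{(j)}). \]

And a minimal Python sketch of the additive kernel itself, assuming squared exponential group kernels; the decomposition is a list of (0-indexed) coordinate tuples, and all names are illustrative:

```python
import numpy as np

def se_kernel(X1, X2, A=1.0, h=0.1):
    """SE kernel between rows of X1 (n, d) and X2 (m, d)."""
    d2 = np.sum((X1[:, None, :] - X2[None, :, :]) ** 2, axis=-1)
    return A * np.exp(-d2 / (2 * h ** 2))

def additive_kernel(X1, X2, decomposition):
    """kappa(x, x') = sum_j kappa^(j)(x^(j), x'^(j)) over the groups."""
    return sum(se_kernel(X1[:, list(g)], X2[:, list(g)]) for g in decomposition)

# e.g. the decomposition from the slide, shifted to 0-indexing:
# decomposition = [(0, 2, 8), (1, 3, 7), (4, 5, 9)]
# K = additive_kernel(X, X, decomposition)   # X has shape (n, 10)
```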
Outline
1. GP-UCB
2. The Add-GP-UCB algorithm
   - Bounds on $S_T$: exponential in $D$ → linear in $D$.
   - An easy-to-optimise acquisition function.
   - Performs well even when $f$ is not additive.
3. Experiments
4. Conclusion & some open questions
GP-UCB

\[ x_t = \mathrm{argmax}_{x \in \mathcal{X}}\; \mu_{t-1}(x) + \beta_t^{1/2}\, \sigma_{t-1}(x). \]

Squared exponential kernel:
\[ \kappa(x, x') = A \exp\!\left( -\frac{\|x - x'\|^2}{2h^2} \right). \]

Theorem (Srinivas et al. 2010): Let $f \sim \mathcal{GP}(0, \kappa)$. Then w.h.p.,
\[ S_T \leq O\!\left( \sqrt{\frac{D^D (\log T)^D}{T}} \right). \]
GP-UCB on an additive $\kappa$

Suppose $f \sim \mathcal{GP}(0, \kappa)$ where $\kappa(x, x') = \kappa^{(1)}(x^{(1)}, x'^{(1)}) + \dots + \kappa^{(M)}(x^{(M)}, x'^{(M)})$, with each $\kappa^{(j)}$ an SE kernel.

Can be shown: if each $\kappa^{(j)}$ is an SE kernel,
\[ S_T \leq O\!\left( \sqrt{\frac{D\, 2^d d^d (\log T)^d}{T}} \right). \]

But $\varphi_t = \mu_{t-1} + \beta_t^{1/2} \sigma_{t-1}$ is still a $D$-dimensional maximisation problem!
Add-GP-UCB

\[ \varphi_t(x) = \sum_{j=1}^{M} \underbrace{\mu^{(j)}_{t-1}(x^{(j)}) + \beta_t^{1/2}\, \sigma^{(j)}_{t-1}(x^{(j)})}_{\varphi^{(j)}_t(x^{(j)})}. \]

Maximise each $\varphi^{(j)}_t$ separately (see the sketch below). Requires only $O(\mathrm{poly}(D)\, \zeta^{-d})$ effort (vs $O(\zeta^{-D})$ for GP-UCB).

Theorem: Let $f^{(j)} \sim \mathcal{GP}(0, \kappa^{(j)})$ and $f = \sum_j f^{(j)}$. Then w.h.p.,
\[ S_T \leq O\!\left( \sqrt{\frac{D\, 2^d d^d (\log T)^d}{T}} \right). \]
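A minimal sketch of the decomposed maximisation, assuming the per-group posteriors are available as callables; the interface and the use of random search (the talk's experiments use DiRect instead) are illustrative:

```python
import numpy as np

def add_gp_ucb_step(posteriors, decomposition, beta_t, n_cand=1000, seed=0):
    """One Add-GP-UCB step: maximise each phi^(j) over its own d-dimensional
    group and concatenate the coordinates into a single D-dimensional point.

    posteriors[j] = (mu_j, sigma_j): callables mapping an (n, d) array of
    points in the j-th group's cube to n posterior means / std deviations.
    """
    rng = np.random.default_rng(seed)
    D = sum(len(g) for g in decomposition)
    x_t = np.empty(D)
    for (mu_j, sigma_j), group in zip(posteriors, decomposition):
        cand = rng.random((n_cand, len(group)))      # random search in [0,1]^d
        phi_j = mu_j(cand) + np.sqrt(beta_t) * sigma_j(cand)
        x_t[list(group)] = cand[np.argmax(phi_j)]    # best point for group j
    return x_t
```

Each group contributes a $d$-dimensional search, so the overall cost is $M$ low-dimensional maximisations rather than one $D$-dimensional one.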
Summary of Theoretical Results (for the SE kernel)

GP-UCB with no assumption on $f$:
  $S_T \leq O\big( D^{D/2} (\log T)^{D/2}\, T^{-1/2} \big)$. Maximising $\varphi_t$: $O(\zeta^{-D})$ effort.

GP-UCB on additive $f$:
  $S_T \leq O\big( (D/T)^{1/2} \big)$. Maximising $\varphi_t$: $O(\zeta^{-D})$ effort.

Add-GP-UCB on additive $f$:
  $S_T \leq O\big( (D/T)^{1/2} \big)$. Maximising $\varphi_t$: $O(\mathrm{poly}(D)\, \zeta^{-d})$ effort.
Add-GP-UCB: a 2-D illustration

$f(x_{\{1,2\}}) = f^{(1)}(x_{\{1\}}) + f^{(2)}(x_{\{2\}})$.

[Figures: posteriors of $f^{(1)}$ and $f^{(2)}$ on $[0,1]$; each $\varphi^{(j)}$ is maximised separately, giving $x_t^{(1)} = 0.869$ and $x_t^{(2)} = 0.141$, which are concatenated into $x_t = (0.869, 0.141)$.]
Additive Modelling in Non-additive Settings

Additive models are common in high dimensional regression, e.g. backfitting, MARS, COSSO, RODEO, SpAM:
\[ f(x_{\{1,\dots,D\}}) = f(x_{\{1\}}) + f(x_{\{2\}}) + \dots + f(x_{\{D\}}). \]

Additive models are statistically simpler ⇒ worse bias, but much better variance in the low-sample regime. In BO applications queries are expensive, so we usually cannot afford many queries.

Observation: Add-GP-UCB does well even when $f$ is not additive.
- Better bias/variance trade-off in high dimensional regression.
- Easy-to-maximise acquisition function.
Unknown Kernel / Decomposition in Practice

Learn the kernel hyper-parameters and the decomposition $\{\mathcal{X}^{(j)}\}$ by periodically maximising the GP marginal likelihood (a sketch follows below).
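A minimal sketch of one way to do this, reusing `additive_kernel` from the earlier sketch: sample random partitions of the coordinates into $M$ groups of size $d$ and keep the one with the highest log marginal likelihood. The helper names are mine, the partition search assumes $d$ divides $D$, and in practice the kernel hyper-parameters would be optimised jointly:

```python
import numpy as np

def log_marginal(X, y, decomposition, noise=1e-4):
    """log p(y | X, decomposition) for the zero-mean additive GP."""
    K = additive_kernel(X, X, decomposition) + noise * np.eye(len(X))
    L = np.linalg.cholesky(K)
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))
    return (-0.5 * y @ alpha - np.log(np.diag(L)).sum()
            - 0.5 * len(y) * np.log(2 * np.pi))

def pick_decomposition(X, y, D, d, n_tries=20, seed=0):
    """Best of n_tries random partitions of {0,...,D-1} into groups of size d."""
    rng = np.random.default_rng(seed)
    best, best_ll = None, -np.inf
    for _ in range(n_tries):
        perm = rng.permutation(D)
        dec = [tuple(perm[i:i + d]) for i in range(0, D, d)]
        ll = log_marginal(X, y, dec)
        if ll > best_ll:
            best, best_ll = dec, ll
    return best
```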
Experiments (synthetic)

[Plot: simple regret (log scale, $10^0$ to $10^2$) against the number of queries (0 to 800).]

One Add variant knows the true decomposition; Add-$d/M$ denotes $M$ groups of size $d$. 1000 to 4000 DiRect evaluations were used to maximise the acquisition function, depending on the experiment. DiRect: Dividing Rectangles (Jones et al. 1993).
SDSS Luminous Red Galaxies

A cosmological simulator maps parameters (e.g. the Hubble constant, the baryonic density) to an observation. Task: find the maximum likelihood cosmological parameters. 20 dimensions, but only 9 parameters are relevant. Each query takes 2-5 seconds. 500 DiRect evaluations were used to maximise the acquisition function.
SDSS Luminous Red Galaxies: results

[Plot: results on the SDSS task (log scale, $10^1$ to $10^3$) against the number of queries (0 to 400), with REMBO (Wang et al. 2013) among the baselines.]
Viola & Jones Face Detection

A cascade of 22 weak classifiers; an image is classified negative if its score falls below the threshold at any stage. Task: find the optimal threshold values on a training set of 1000 images. 22 dimensions. Each query takes 30-40 seconds. 1000 DiRect evaluations were used to maximise the acquisition function.
Viola & Jones Face Detection: results

[Plot: classification accuracy (roughly 65 to 95%) against the number of queries (0 to 300).]
Summary

- The additive assumption improves regret: exponential in $D$ → linear in $D$.
- The acquisition function is easy to maximise.
- Even when $f$ is not additive, Add-GP-UCB does well in practice.
- Similar results hold for Matérn kernels and in the bandit setting.

Some open questions:
- How to choose $(d, M)$?
- Can we generalise to other acquisition functions?

Code available: github.com/kirthevasank/add-gp-bandits
Jeff's Talk: Friday 2pm @ Van Gogh

Thank You.