High Dimensional Bayesian Optimisation and Bandits via Additive Models

1 1/20 High Dimensional Bayesian Optimisation and Bandits via Additive Models Kirthevasan Kandasamy, Jeff Schneider, Barnabás Póczos ICML 15 July

2 2/20 Bandits & Optimisation
Maximum likelihood inference in computational astrophysics.
[Figure: parameters (e.g. Hubble constant, baryonic density), cosmological simulator, observation.]

5 2/20 Bandits & Optimisation
Expensive black-box function.
Examples: hyper-parameter tuning in ML; optimal control strategy in robotics.

8 3/20 Bandits & Optimisation
$f : [0, 1]^D \to \mathbb{R}$ is an expensive, black-box, nonconvex function. Let $x_* = \operatorname{argmax}_x f(x)$.
Optimisation = minimise simple regret: $S_T = f(x_*) - \max_{t = 1, \dots, T} f(x_t)$.

9 3/20 Bandits & Optimisation
$f : [0, 1]^D \to \mathbb{R}$ is an expensive, black-box, nonconvex function. Let $x_* = \operatorname{argmax}_x f(x)$.
Bandits = minimise cumulative regret: $R_T = \sum_{t=1}^{T} \big( f(x_*) - f(x_t) \big)$.
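
The two regret notions can be compared on a small worked example. This is a minimal sketch; the optimum value and the queried function values are made up for illustration and do not come from the talk.

```python
import numpy as np

def simple_regret(f_opt, f_queries):
    """S_T = f(x_*) - max_t f(x_t): gap of the best query so far."""
    return float(f_opt - np.max(f_queries))

def cumulative_regret(f_opt, f_queries):
    """R_T = sum_t (f(x_*) - f(x_t)): total gap over all queries."""
    return float(np.sum(f_opt - np.asarray(f_queries)))

# Hypothetical values: f(x_*) = 1.0 and four queried function values.
f_opt = 1.0
f_queries = np.array([0.2, 0.7, 0.5, 0.9])

print(simple_regret(f_opt, f_queries))      # depends only on the best query
print(cumulative_regret(f_opt, f_queries))  # penalises every poor query
```

Simple regret only rewards finding one good point, while cumulative regret charges for every exploratory query along the way.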

12 4/20 Gaussian Process (Bayesian) Optimisation
Model $f \sim \mathcal{GP}(0, \kappa)$.
Obtain the posterior GP.

13 4/20 Gaussian Process (Bayesian) Optimisation
Model $f \sim \mathcal{GP}(0, \kappa)$.
Maximise an acquisition function $\varphi_t$: $x_t = \operatorname{argmax}_x \varphi_t(x)$.
GP-UCB: $\varphi_t(x) = \mu_{t-1}(x) + \beta_t^{1/2} \sigma_{t-1}(x)$ (Srinivas et al. 2010).
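
The loop of fitting a posterior and maximising the GP-UCB acquisition can be sketched in one dimension as follows. This is an illustrative sketch, not the authors' code: the kernel scale `A`, bandwidth `H`, noise variance `NOISE`, the constant `beta_t = 4.0`, the toy objective `f`, and the grid search over $[0, 1]$ are all assumptions chosen to keep the example short.

```python
import numpy as np

A, H, NOISE = 1.0, 0.2, 1e-3  # hypothetical kernel scale, bandwidth, noise variance

def se_kernel(X1, X2):
    """Squared exponential kernel A * exp(-||x - x'||^2 / (2 H^2))."""
    sq = np.sum((X1[:, None, :] - X2[None, :, :]) ** 2, axis=-1)
    return A * np.exp(-sq / (2.0 * H ** 2))

def gp_posterior(X, y, Xs):
    """Posterior mean and standard deviation of a zero-mean GP at Xs."""
    K = se_kernel(X, X) + NOISE * np.eye(len(X))
    Ks = se_kernel(X, Xs)                       # shape (n, m)
    L = np.linalg.cholesky(K)
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))
    mu = Ks.T @ alpha
    V = np.linalg.solve(L, Ks)
    var = A - np.sum(V ** 2, axis=0)            # prior variance of the SE kernel is A
    return mu, np.sqrt(np.maximum(var, 0.0))

def gp_ucb_step(X, y, grid, beta_t):
    """Return the grid point maximising mu + sqrt(beta_t) * sigma."""
    mu, sigma = gp_posterior(X, y, grid)
    return grid[np.argmax(mu + np.sqrt(beta_t) * sigma)]

def f(x):
    """Toy objective on [0, 1], standing in for an expensive black box."""
    return np.sin(6.0 * x[0]) * x[0]

grid = np.linspace(0.0, 1.0, 201)[:, None]
X = np.array([[0.1], [0.9]])
y = np.array([f(x) for x in X])
for t in range(1, 11):
    x_t = gp_ucb_step(X, y, grid, beta_t=4.0)
    X = np.vstack([X, x_t])
    y = np.append(y, f(x_t))
print(X[np.argmax(y)], y.max())
```

A grid is only workable here because the domain is one-dimensional; the cost of this inner maximisation is what grows quickly with $D$.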

14 4/20 Gaussian Process (Bayesian) Optimisation
Model $f \sim \mathcal{GP}(0, \kappa)$.
Maximise an acquisition function $\varphi_t$: $x_t = \operatorname{argmax}_x \varphi_t(x)$.
Other choices of $\varphi_t$: Expected Improvement (GP-EI), Thompson Sampling etc.

16 5/20 Scaling to Higher Dimensions
Two key challenges:
Statistical difficulty: nonparametric sample complexity is exponential in $D$.
Computational difficulty: optimising $\varphi_t$ to within $\zeta$ accuracy requires $O(\zeta^{-D})$ effort.
Existing work:
(Chen et al. 2012): $f$ depends on a small number of variables. Find the variables and then run GP-UCB.
(Wang et al. 2013): $f$ varies along a lower dimensional subspace. Run GP-EI on a random subspace.
(Djolonga et al. 2013): $f$ varies along a lower dimensional subspace. Find the subspace and then run GP-UCB.

17 5/20 Scaling to Higher Dimensions
Two key challenges:
Statistical difficulty: nonparametric sample complexity is exponential in $D$.
Computational difficulty: optimising $\varphi_t$ to within $\zeta$ accuracy requires $O(\zeta^{-D})$ effort.
Existing work (Chen et al. 2012, Wang et al. 2013, Djolonga et al. 2013):
Assumes $f$ varies only along a low dimensional subspace.
Performs BO on a low dimensional subspace.
This assumption is too strong in realistic settings.

19 6/20 Additive Functions
Structural assumption: $f(x) = f^{(1)}(x^{(1)}) + f^{(2)}(x^{(2)}) + \dots + f^{(M)}(x^{(M)})$,
where $x^{(j)} \in \mathcal{X}^{(j)} = [0, 1]^d$, $d \ll D$, and $x^{(i)} \cap x^{(j)} = \varnothing$.
E.g. $f(x_{\{1,\dots,10\}}) = f^{(1)}(x_{\{1,3,9\}}) + f^{(2)}(x_{\{2,4,8\}}) + f^{(3)}(x_{\{5,6,10\}})$.
Call $\{\mathcal{X}^{(j)}\}_{j=1}^{M} = \{(1, 3, 9), (2, 4, 8), (5, 6, 10)\}$ the decomposition.
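
The structural assumption can be made concrete with the decomposition from the example. In this sketch the component functions are hypothetical quadratics picked only so that the maximiser is easy to read off; indices are shifted to 0-based.

```python
import numpy as np

# Decomposition {(1,3,9), (2,4,8), (5,6,10)} from the example, 0-based.
decomposition = [(0, 2, 8), (1, 3, 7), (4, 5, 9)]

def make_additive(components, decomposition):
    """Return f with f(x) = sum_j f_j(x restricted to group j)."""
    def f(x):
        x = np.asarray(x)
        return sum(fj(x[list(g)]) for fj, g in zip(components, decomposition))
    return f

# Hypothetical component functions, each maximised at a constant vector.
components = [
    lambda z: -np.sum((z - 0.3) ** 2),
    lambda z: -np.sum((z - 0.5) ** 2),
    lambda z: -np.sum((z - 0.7) ** 2),
]
f = make_additive(components, decomposition)

# The groups are disjoint, so each group can be maximised on its own
# and the pieces assembled into a maximiser of f.
x_star = np.zeros(10)
for g, c in zip(decomposition, [0.3, 0.5, 0.7]):
    x_star[list(g)] = c
print(f(x_star))
```

Because the groups do not share coordinates, a $D$-dimensional search splits into $M$ searches of dimension $d$, which is the property the rest of the talk exploits.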

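21 6/20 Additive Functions
Structural assumption: $f(x) = f^{(1)}(x^{(1)}) + f^{(2)}(x^{(2)}) + \dots + f^{(M)}(x^{(M)})$,
where $x^{(j)} \in \mathcal{X}^{(j)} = [0, 1]^d$, $d \ll D$, and $x^{(i)} \cap x^{(j)} = \varnothing$.
Assume each $f^{(j)} \sim \mathcal{GP}(0, \kappa^{(j)})$. Then $f \sim \mathcal{GP}(0, \kappa)$ where
$\kappa(x, x') = \kappa^{(1)}(x^{(1)}, x'^{(1)}) + \dots + \kappa^{(M)}(x^{(M)}, x'^{(M)})$.
Given $(X, Y) = \{(x_i, y_i)\}_{i=1}^{T}$ and a test point $x$, $f^{(j)}(x^{(j)}) \mid X, Y \sim \mathcal{N}\big(\mu^{(j)}, \sigma^{(j)2}\big)$.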
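
The per-group posterior follows from standard Gaussian conditioning: the observations are of the sum $f$, so the Gram matrix uses the full additive kernel, while the cross-covariance with $f^{(j)}$ uses only $\kappa^{(j)}$. The sketch below assumes SE kernels with a shared hypothetical bandwidth and a hypothetical noise variance; function names are illustrative.

```python
import numpy as np

def se_kernel(X1, X2, A=1.0, h=0.3):
    """Squared exponential kernel on the columns passed in."""
    sq = np.sum((X1[:, None, :] - X2[None, :, :]) ** 2, axis=-1)
    return A * np.exp(-sq / (2.0 * h ** 2))

def additive_kernel(X1, X2, groups):
    """kappa(x, x') = sum_j kappa_j(x^(j), x'^(j))."""
    return sum(se_kernel(X1[:, g], X2[:, g]) for g in groups)

def group_posterior(X, y, Xs_j, j, groups, noise=1e-2):
    """Posterior mean and variance of f^(j) at group-j test inputs Xs_j."""
    g = groups[j]
    K = additive_kernel(X, X, groups) + noise * np.eye(len(X))
    Kj = se_kernel(X[:, g], Xs_j)                    # shape (n, m)
    mu = Kj.T @ np.linalg.solve(K, y)
    var = np.diag(se_kernel(Xs_j, Xs_j)) - np.sum(Kj * np.linalg.solve(K, Kj), axis=0)
    return mu, var

# Toy data: D = 4 with two groups of size d = 2 (values are random).
rng = np.random.default_rng(0)
groups = [[0, 1], [2, 3]]
X = rng.uniform(size=(8, 4))
y = rng.normal(size=8)
Xs = rng.uniform(size=(5, 4))

mu_1, var_1 = group_posterior(X, y, Xs[:, groups[0]], 0, groups)
mu_2, var_2 = group_posterior(X, y, Xs[:, groups[1]], 1, groups)
print(mu_1 + mu_2)  # equals the posterior mean of f under the additive kernel
```

The group means add up to the posterior mean of $f$, whereas the group variances do not add up to the posterior variance of $f$, since the components are correlated after conditioning.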

22 7/20 Outline
1. GP-UCB
2. The Add-GP-UCB algorithm
   Bounds on $S_T$: from exponential in $D$ to linear in $D$.
   An easy-to-optimise acquisition function.
   Performs well even when $f$ is not additive.
3. Experiments
4. Conclusion & some open questions

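24 8/20 GP-UCB
$x_t = \operatorname{argmax}_{x \in \mathcal{X}} \; \mu_{t-1}(x) + \beta_t^{1/2} \sigma_{t-1}(x)$
Squared exponential kernel: $\kappa(x, x') = A \exp\left( -\frac{\|x - x'\|^2}{2h^2} \right)$
Theorem (Srinivas et al. 2010): Let $f \sim \mathcal{GP}(0, \kappa)$. Then w.h.p., $S_T \in O\left( \sqrt{\frac{D^D (\log T)^D}{T}} \right)$.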

27 9/20 GP-UCB on additive κ
If $f \sim \mathcal{GP}(0, \kappa)$ where $\kappa(x, x') = \kappa^{(1)}(x^{(1)}, x'^{(1)}) + \dots + \kappa^{(M)}(x^{(M)}, x'^{(M)})$ and each $\kappa^{(j)}$ is an SE kernel,
it can be shown that $S_T \in O\left( \sqrt{\frac{D^2 d^d (\log T)^d}{T}} \right)$.
But $\varphi_t = \mu_{t-1} + \beta_t^{1/2} \sigma_{t-1}$ is $D$-dimensional!

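30 10/20 Add-GP-UCB
$\tilde{\varphi}_t(x) = \sum_{j=1}^{M} \underbrace{\mu^{(j)}_{t-1}(x^{(j)}) + \beta_t^{1/2} \sigma^{(j)}_{t-1}(x^{(j)})}_{\tilde{\varphi}^{(j)}_t(x^{(j)})}$
Maximise each $\tilde{\varphi}^{(j)}_t$ separately. Requires only $O(\mathrm{poly}(D)\, \zeta^{-d})$ effort (vs $O(\zeta^{-D})$ for GP-UCB).
Theorem: Let $f^{(j)} \sim \mathcal{GP}(0, \kappa^{(j)})$ and $f = \sum_j f^{(j)}$. Then w.h.p., $S_T \in O\left( \sqrt{\frac{D^2 d^d (\log T)^d}{T}} \right)$.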
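
One step of this group-wise maximisation can be sketched as below. This is an illustration under stated assumptions rather than the released implementation: each group's acquisition is maximised over a coarse grid (the experiments in the talk use DiRect), the decomposition is taken as known and is assumed to cover every coordinate, and the kernel parameters, noise variance, `beta_t`, and toy objective `f` are hypothetical.

```python
import numpy as np
from itertools import product

def se_kernel(X1, X2, A=1.0, h=0.3):
    """Squared exponential kernel on the columns passed in."""
    sq = np.sum((X1[:, None, :] - X2[None, :, :]) ** 2, axis=-1)
    return A * np.exp(-sq / (2.0 * h ** 2))

def add_gp_ucb_step(X, y, groups, beta_t, D, n_grid=11, noise=1e-2, A=1.0):
    """Build x_t by maximising each group's UCB over a d-dimensional grid."""
    K = sum(se_kernel(X[:, g], X[:, g]) for g in groups) + noise * np.eye(len(X))
    K_inv_y = np.linalg.solve(K, y)
    axis = np.linspace(0.0, 1.0, n_grid)
    x_next = np.zeros(D)
    for g in groups:
        cand = np.array(list(product(axis, repeat=len(g))))  # n_grid^d candidates
        Kj = se_kernel(X[:, g], cand)
        mu = Kj.T @ K_inv_y
        var = A - np.sum(Kj * np.linalg.solve(K, Kj), axis=0)
        ucb = mu + np.sqrt(beta_t) * np.sqrt(np.maximum(var, 0.0))
        x_next[g] = cand[np.argmax(ucb)]                      # fill in group j's coordinates
    return x_next

def f(x):
    """Toy additive objective with groups {0,1} and {2,3}."""
    return -np.sum((x[:2] - 0.3) ** 2) - np.sum((x[2:] - 0.7) ** 2)

rng = np.random.default_rng(0)
D, groups = 4, [[0, 1], [2, 3]]
X = rng.uniform(size=(3, D))
y = np.array([f(x) for x in X])
for t in range(10):
    x_t = add_gp_ucb_step(X, y, groups, beta_t=2.0, D=D)
    X = np.vstack([X, x_t])
    y = np.append(y, f(x_t))
print(X[np.argmax(y)], y.max())
```

Each step scores $M \cdot 11^2 = 242$ candidates here, against $11^4 = 14641$ for a grid over the full four-dimensional domain.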

31 11/20 Summary of Theoretical Results (for SE Kernel)
GP-UCB with no assumption on $f$: $S_T \in O\big( D^{D/2} (\log T)^{D/2}\, T^{-1/2} \big)$.
GP-UCB on additive $f$: $S_T \in O\big( D\, T^{-1/2} \big)$. Maximising $\varphi_t$: $O(\zeta^{-D})$ effort.
Add-GP-UCB on additive $f$: $S_T \in O\big( D\, T^{-1/2} \big)$. Maximising $\tilde{\varphi}_t$: $O(\mathrm{poly}(D)\, \zeta^{-d})$ effort.
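
A rough feel for the gap in maximisation effort comes from counting grid points. This is a simplified illustration that assumes plain grid search at resolution $\zeta$ and takes the $\mathrm{poly}(D)$ factor to be the number of groups $M$; the values of $\zeta$, $D$ and $d$ are arbitrary.

```python
def grid_points_full(zeta, D):
    """Points in a grid of spacing zeta over [0, 1]^D."""
    return round(1 / zeta) ** D

def grid_points_additive(zeta, d, M):
    """Points needed when M groups of dimension d are searched separately."""
    return M * round(1 / zeta) ** d

zeta, D, d = 0.1, 10, 2
M = D // d
print(grid_points_full(zeta, D))         # 10000000000
print(grid_points_additive(zeta, d, M))  # 500
```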

32 12/20 Add-GP-UCB
$f(x_{\{1,2\}}) = f^{(1)}(x_{\{1\}}) + f^{(2)}(x_{\{2\}})$
[Figure: the one-dimensional acquisition functions $\tilde{\varphi}^{(1)}(x_{\{1\}})$ and $\tilde{\varphi}^{(2)}(x_{\{2\}})$ are maximised separately, and their maximisers $x^{(1)}_t$ and $x^{(2)}_t$ are combined into the next query $x_t = (0.869, 0.141)$.]

40 13/20 Additive modeling in non-additive settings
Additive models are common in high dimensional regression, e.g. Backfitting, MARS, COSSO, RODEO, SpAM etc.:
$f(x_{\{1,\dots,D\}}) = f^{(1)}(x_{\{1\}}) + f^{(2)}(x_{\{2\}}) + \dots + f^{(D)}(x_{\{D\}})$.
Additive models are statistically simpler: worse bias, but much better variance in the low sample regime.
In BO applications queries are expensive, so we usually cannot afford many of them.
Observation: Add-GP-UCB does well even when $f$ is not additive.
Better bias/variance trade-off in high dimensional regression.
Easy to maximise acquisition function.

41 14/20 Unknown Kernel/Decomposition in practice
Learn the kernel hyper-parameters and the decomposition $\{\mathcal{X}^{(j)}\}$ by periodically maximising the GP marginal likelihood.
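
One simple way to act on this is to score candidate decompositions by the GP log marginal likelihood and keep the best one. The sketch below is an assumption about how such a search could look, not a description of the released code: candidates are drawn as random partitions, the kernel hyper-parameters and noise variance are held fixed at hypothetical values, and all function names are illustrative.

```python
import numpy as np

def se_kernel(X1, X2, A=1.0, h=0.3):
    """Squared exponential kernel on the columns passed in."""
    sq = np.sum((X1[:, None, :] - X2[None, :, :]) ** 2, axis=-1)
    return A * np.exp(-sq / (2.0 * h ** 2))

def log_marginal_likelihood(X, y, groups, noise=1e-2):
    """log p(y | X) for a zero-mean GP with the additive kernel over `groups`."""
    K = sum(se_kernel(X[:, g], X[:, g]) for g in groups) + noise * np.eye(len(X))
    L = np.linalg.cholesky(K)
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))
    return (-0.5 * y @ alpha
            - np.sum(np.log(np.diag(L)))        # equals -0.5 * log det K
            - 0.5 * len(X) * np.log(2.0 * np.pi))

def random_decomposition(D, d, rng):
    """Random partition of {0, ..., D-1} into groups of size at most d."""
    perm = rng.permutation(D)
    return [sorted(perm[i:i + d].tolist()) for i in range(0, D, d)]

def choose_decomposition(X, y, D, d, n_candidates=20, seed=0):
    """Keep the candidate decomposition with the highest marginal likelihood."""
    rng = np.random.default_rng(seed)
    candidates = [random_decomposition(D, d, rng) for _ in range(n_candidates)]
    return max(candidates, key=lambda g: log_marginal_likelihood(X, y, g))

# Toy data from an additive function with true groups {0,1}, {2,3}, {4,5}.
rng = np.random.default_rng(1)
X = rng.uniform(size=(30, 6))
y = np.sin(3 * X[:, 0] * X[:, 1]) + np.cos(3 * X[:, 2] * X[:, 3]) + X[:, 4] * X[:, 5]
print(choose_decomposition(X, y, D=6, d=2))
```

Re-running such a selection every few queries, as more data arrive, matches the "periodically" in the slide; the kernel hyper-parameters could be tuned with the same score.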

42 15/20 Experiments
Add-⋆: knows the decomposition. Add-d/M: M groups of size d.
Use 1000 DiRect evaluations to maximise the acquisition function.
DiRect: Dividing Rectangles (Jones et al. 1993)

43 15/20 Experiments
Add-⋆: knows the decomposition. Add-d/M: M groups of size d.
Use 4000 DiRect evaluations to maximise the acquisition function.

44 16/20 SDSS Luminous Red Galaxies
[Figure: parameters (e.g. Hubble constant, baryonic density), cosmological simulator, observation.]
Task: find the maximum likelihood cosmological parameters.
20 dimensions, but only 9 parameters are relevant.
Each query takes 2-5 seconds.
Use 500 DiRect evaluations to maximise the acquisition function.

45 17/20 SDSS Luminous Red Galaxies REMBO: (Wang et al. 2013)

46 18/20 Viola & Jones Face Detection
A cascade of 22 weak classifiers. An image is classified negative if the score < threshold at any stage.
Task: find optimal threshold values on a training set of 1000 images.
22 dimensions. Each query takes seconds.
Use 1000 DiRect evaluations to maximise the acquisition function.

47 19/20 Viola & Jones Face Detection

51 20/20 Summary
The additive assumption improves regret: from exponential in $D$ to linear in $D$.
The acquisition function is easy to maximise.
Even when $f$ is not additive, Add-GP-UCB does well in practice.
Similar results hold for Matérn kernels and in the bandit setting.
Some open questions: How to choose $(d, M)$? Can we generalise to other acquisition functions?
Code available: github.com/kirthevasank/add-gp-bandits
Jeff's talk: Friday, Van Gogh.
Thank you.
