High Dimensional Bayesian Optimisation and Bandits via Additive Models


1/20 High Dimensional Bayesian Optimisation and Bandits via Additive Models. Kirthevasan Kandasamy, Jeff Schneider, Barnabás Póczos. ICML '15, July 8, 2015.

2/20 Bandits & Optimisation. Maximum-likelihood inference in computational astrophysics, e.g. estimating the Hubble constant and the baryonic density by feeding candidate parameters into a cosmological simulator and comparing against observation.

2/20 Bandits & Optimisation. More generally: optimising an expensive black-box function. Examples: hyper-parameter tuning in ML, finding an optimal control strategy in robotics.

3/20 Bandits & Optimisation. f : [0,1]^D → R is an expensive, black-box, non-convex function. Let x* = argmax_x f(x). [Plot: f(x) over x ∈ [0,1], with the maximiser x* and value f(x*) marked.]
Optimisation = minimise the simple regret: S_T = f(x*) − max_{t=1,...,T} f(x_t).
Bandits = minimise the cumulative regret: R_T = Σ_{t=1}^{T} ( f(x*) − f(x_t) ).
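To make the two performance measures concrete, here is a minimal sketch (not from the talk) that computes the simple and cumulative regret of a sequence of queried values; f_opt and f_queries are hypothetical stand-ins for f(x*) and the observed f(x_t).

```python
import numpy as np

def simple_regret(f_opt, f_queries):
    # S_T = f(x*) - max_t f(x_t): how far the best query so far is from the optimum.
    return f_opt - np.max(f_queries)

def cumulative_regret(f_opt, f_queries):
    # R_T = sum_t (f(x*) - f(x_t)): total shortfall accumulated over all queries.
    return np.sum(f_opt - np.asarray(f_queries))

# Example: f(x*) = 1.0 and three queries.
print(simple_regret(1.0, [0.2, 0.7, 0.9]))      # 0.1
print(cumulative_regret(1.0, [0.2, 0.7, 0.9]))  # 1.2
```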

4/20 Gaussian Process (Bayesian) Optimisation. Model f ~ GP(0, κ). [Plot: GP over x ∈ [0,1].] After each query, obtain the posterior GP.
Maximise an acquisition function φ_t and query x_t = argmax_x φ_t(x). [Plot: φ_t(x) with its maximiser, e.g. x_t = 0.828.]
GP-UCB: φ_t(x) = μ_{t-1}(x) + β_t^{1/2} σ_{t-1}(x) (Srinivas et al. 2010).
Other choices of φ_t: Expected Improvement (GP-EI), Thompson Sampling, etc.
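To illustrate the GP-UCB rule, the following is a minimal NumPy sketch (not from the talk) of the posterior mean and variance of a zero-mean GP with a squared-exponential kernel, and of the resulting UCB acquisition maximised over a grid. The kernel parameters, noise level and beta_t are illustrative choices, not values from the paper.

```python
import numpy as np

def se_kernel(X1, X2, A=1.0, h=0.2):
    # Squared-exponential kernel: A * exp(-||x - x'||^2 / (2 h^2)).
    d2 = np.sum((X1[:, None, :] - X2[None, :, :]) ** 2, axis=-1)
    return A * np.exp(-d2 / (2 * h ** 2))

def gp_posterior(X, y, Xtest, noise=1e-3):
    # Standard GP regression posterior for f ~ GP(0, kappa).
    K = se_kernel(X, X) + noise * np.eye(len(X))
    Ks = se_kernel(Xtest, X)
    mu = Ks @ np.linalg.solve(K, y)
    var = se_kernel(Xtest, Xtest).diagonal() - np.sum(Ks * np.linalg.solve(K, Ks.T).T, axis=1)
    return mu, np.maximum(var, 0.0)

def gp_ucb_next_point(X, y, grid, beta_t=2.0):
    # phi_t(x) = mu_{t-1}(x) + beta_t^{1/2} * sigma_{t-1}(x); pick its maximiser on the grid.
    mu, var = gp_posterior(X, y, grid)
    phi = mu + np.sqrt(beta_t) * np.sqrt(var)
    return grid[np.argmax(phi)]

# Example on [0, 1]: three observations, next query chosen on a 1-d grid.
X = np.array([[0.1], [0.5], [0.9]])
y = np.array([0.2, 0.8, 0.1])
grid = np.linspace(0, 1, 200)[:, None]
print(gp_ucb_next_point(X, y, grid))
```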

5/20 Scaling to Higher Dimensions. Two key challenges:
- Statistical difficulty: the nonparametric sample complexity is exponential in D.
- Computational difficulty: optimising φ_t to within ζ accuracy requires O(ζ^{-D}) effort (a back-of-the-envelope comparison is sketched after this slide).
Existing work:
- Chen et al. 2012: f depends only on a small number of variables; find those variables, then run GP-UCB on them.
- Wang et al. 2013: f varies only along a lower-dimensional subspace; run GP-EI on a random subspace (REMBO).
- Djolonga et al. 2013: f varies only along a lower-dimensional subspace; find the subspace, then run GP-UCB on it.
These assumptions are often too strong in realistic settings.
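As a back-of-the-envelope illustration of the two maximisation costs above, assuming a simple grid search at resolution ζ; the numbers D = 20, d = 2, M = 10 and ζ = 0.1 are hypothetical, not from the talk.

```python
# Acquisition evaluations needed by a grid search at resolution zeta.
zeta, D, d, M = 0.1, 20, 2, 10

full_dim_cost = (1 / zeta) ** D        # full D-dimensional search: ~ zeta^{-D} = 1e20 evaluations
additive_cost = M * (1 / zeta) ** d    # one d-dimensional search per group: ~ M * zeta^{-d} = 1000

print(f"{full_dim_cost:.0e} vs {additive_cost:.0f}")
```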

6/20 Additive Functions. Structural assumption: f(x) = f^{(1)}(x^{(1)}) + f^{(2)}(x^{(2)}) + ... + f^{(M)}(x^{(M)}), where each x^{(j)} lies in a group of coordinates X^{(j)} = [0,1]^d with d ≪ D, and the groups are disjoint: x^{(i)} ∩ x^{(j)} = ∅.
E.g. f(x^{{1,...,10}}) = f^{(1)}(x^{{1,3,9}}) + f^{(2)}(x^{{2,4,8}}) + f^{(3)}(x^{{5,6,10}}). Call {X^{(j)}}_{j=1}^{M} = {(1,3,9), (2,4,8), (5,6,10)} the decomposition.
Assume each f^{(j)} ~ GP(0, κ^{(j)}). Then f ~ GP(0, κ), where κ(x, x') = κ^{(1)}(x^{(1)}, x'^{(1)}) + ... + κ^{(M)}(x^{(M)}, x'^{(M)}).
Given observations (X, Y) = {(x_i, y_i)}_{i=1}^{T} and a test point x, each component posterior is Gaussian: f^{(j)}(x^{(j)}) | X, Y ~ N(μ^{(j)}, σ^{(j)2}).
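A minimal sketch (not from the talk) of how the additive kernel κ can be assembled from per-group SE kernels. The decomposition below is the (1, 3, 9), (2, 4, 8), (5, 6, 10) example from the slide, rewritten with 0-based indices; the kernel parameters are illustrative.

```python
import numpy as np

def se_kernel(X1, X2, A=1.0, h=0.2):
    d2 = np.sum((X1[:, None, :] - X2[None, :, :]) ** 2, axis=-1)
    return A * np.exp(-d2 / (2 * h ** 2))

def additive_kernel(X1, X2, decomposition):
    # kappa(x, x') = sum_j kappa^{(j)}(x^{(j)}, x'^{(j)}); each term sees only its coordinate group.
    return sum(se_kernel(X1[:, group], X2[:, group]) for group in decomposition)

# 0-based version of the decomposition {(1, 3, 9), (2, 4, 8), (5, 6, 10)} from the slide.
decomposition = [[0, 2, 8], [1, 3, 7], [4, 5, 9]]
X = np.random.rand(5, 10)
K = additive_kernel(X, X, decomposition)
print(K.shape)  # (5, 5)
```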

7/20 Outline: 1. GP-UCB. 2. The Add-GP-UCB algorithm: bounds on S_T improve from exponential in D to linear in D; an easy-to-optimise acquisition function; performs well even when f is not additive. 3. Experiments. 4. Conclusion and some open questions.

8/20 GP-UCB: x_t = argmax_{x ∈ X} μ_{t-1}(x) + β_t^{1/2} σ_{t-1}(x).
Squared-exponential (SE) kernel: κ(x, x') = A exp( -‖x - x'‖^2 / (2h^2) ).
Theorem (Srinivas et al. 2010). Let f ~ GP(0, κ) with an SE kernel. Then, with high probability, S_T ≤ O( √( D^D (log T)^D / T ) ).

9/20 GP-UCB on an additive κ. If f ~ GP(0, κ) with κ(x, x') = κ^{(1)}(x^{(1)}, x'^{(1)}) + ... + κ^{(M)}(x^{(M)}, x'^{(M)}) and each κ^{(j)} an SE kernel, it can be shown that S_T ≤ O( √( D 2^d d (log T)^d / T ) ).
But φ_t = μ_{t-1} + β_t^{1/2} σ_{t-1} is still a D-dimensional maximisation problem!

10/20 Add-GP-UCB: φ_t(x) = Σ_{j=1}^{M} [ μ^{(j)}_{t-1}(x^{(j)}) + β_t^{1/2} σ^{(j)}_{t-1}(x^{(j)}) ], where the j-th summand φ^{(j)}_t(x^{(j)}) depends only on the j-th group of coordinates.
Maximise each φ^{(j)}_t separately (sketched below). This requires only O(poly(D) ζ^{-d}) effort (vs O(ζ^{-D}) for GP-UCB).
Theorem. Let f^{(j)} ~ GP(0, κ^{(j)}) and f = Σ_j f^{(j)}. Then, with high probability, S_T ≤ O( √( D 2^d d (log T)^d / T ) ).
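A minimal sketch (not from the talk) of the computational trick: each φ^{(j)}_t is maximised over its own d-dimensional group and the group maximisers are stitched together into the full query x_t. The per-group posterior is stubbed out as a hypothetical posterior_j callable, and random search stands in for a proper d-dimensional optimiser such as DiRect.

```python
import numpy as np

def add_gp_ucb_next_point(decomposition, posterior_j, D, beta_t=2.0, n_candidates=400):
    # Maximise each phi^{(j)}_t(x^{(j)}) = mu^{(j)}_{t-1}(x^{(j)}) + beta_t^{1/2} sigma^{(j)}_{t-1}(x^{(j)})
    # separately over [0,1]^d, then assemble the full D-dimensional query.
    x_next = np.zeros(D)
    for j, group in enumerate(decomposition):
        candidates = np.random.rand(n_candidates, len(group))   # random search over [0,1]^d
        mu, sigma = posterior_j(j, candidates)                   # per-group posterior mean / std
        phi_j = mu + np.sqrt(beta_t) * sigma
        x_next[group] = candidates[np.argmax(phi_j)]
    return x_next

# Usage with a dummy posterior (replace with the real additive-GP posterior):
dummy = lambda j, Z: (np.zeros(len(Z)), np.ones(len(Z)))
print(add_gp_ucb_next_point([[0, 2], [1, 3]], dummy, D=4))
```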

11/20 Summary of Theoretical Results (for the SE kernel).
- GP-UCB with no assumption on f: S_T ≤ O( D^{D/2} (log T)^{D/2} T^{-1/2} ); maximising φ_t takes O(ζ^{-D}) effort.
- GP-UCB on additive f: S_T ≤ O( (D/T)^{1/2} ); maximising φ_t still takes O(ζ^{-D}) effort.
- Add-GP-UCB on additive f: S_T ≤ O( (D/T)^{1/2} ); maximising φ_t takes only O(poly(D) ζ^{-d}) effort.

12/20 Add-GP-UCB illustration on f(x^{{1,2}}) = f^{(1)}(x^{{1}}) + f^{(2)}(x^{{2}}). [Plots: f^{(1)} over x^{{1}} ∈ [0,1] and f^{(2)} over x^{{2}} ∈ [0,1], their posterior GPs, and the acquisition functions φ^{(1)}, φ^{(2)}.] Each φ^{(j)} is maximised separately, giving x^{(1)}_t = 0.869 and x^{(2)}_t = 0.141, so the next query is x_t = (0.869, 0.141).

13/20 Additive modelling in non-additive settings. Additive models are common in high-dimensional regression (e.g. backfitting, MARS, COSSO, RODEO, SpAM): f(x^{{1,...,D}}) = f(x^{{1}}) + f(x^{{2}}) + ... + f(x^{{D}}).
Additive models are statistically simpler, so they incur worse bias but much better variance in the low-sample regime. In BO applications queries are expensive, so we usually cannot afford many of them.
Observation: Add-GP-UCB does well even when f is not additive, thanks to the better bias/variance trade-off in high-dimensional regression and the easy-to-maximise acquisition function.

14/20 Unknown kernel / decomposition in practice: learn the kernel hyper-parameters and the decomposition {X^{(j)}} by periodically maximising the GP marginal likelihood (a simple selection sketch follows below).
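The talk does not spell out the search procedure, so here is a simple sketch under the assumption that one scores a handful of randomly drawn decompositions by the GP marginal likelihood and keeps the best; the kernel parameters, noise level and number of tries are illustrative.

```python
import numpy as np

def additive_se_kernel(X1, X2, decomposition, A=1.0, h=0.2):
    # kappa(x, x') = sum_j A * exp(-||x^{(j)} - x'^{(j)}||^2 / (2 h^2)).
    K = np.zeros((len(X1), len(X2)))
    for group in decomposition:
        G1, G2 = X1[:, group], X2[:, group]
        K += A * np.exp(-np.sum((G1[:, None, :] - G2[None, :, :]) ** 2, axis=-1) / (2 * h ** 2))
    return K

def log_marginal_likelihood(X, y, decomposition, noise=1e-3):
    # log p(y | X) for f ~ GP(0, kappa) with the additive kernel above.
    K = additive_se_kernel(X, X, decomposition) + noise * np.eye(len(X))
    L = np.linalg.cholesky(K)
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))
    return -0.5 * y @ alpha - np.sum(np.log(np.diag(L))) - 0.5 * len(X) * np.log(2 * np.pi)

def pick_decomposition(X, y, D, d, M, n_tries=20, seed=0):
    # Score a few random partitions of the D coordinates into M groups of size d; keep the best.
    rng = np.random.default_rng(seed)
    best, best_ll = None, -np.inf
    for _ in range(n_tries):
        perm = rng.permutation(D)
        decomposition = [perm[j * d:(j + 1) * d] for j in range(M)]
        ll = log_marginal_likelihood(X, y, decomposition)
        if ll > best_ll:
            best, best_ll = decomposition, ll
    return best
```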

15/20 Experiments. [Plot: simple regret (log scale) vs number of queries.] Add- : knows the true decomposition. Add-d/M: uses M groups of size d. 1000 DiRect evaluations are used to maximise the acquisition function. DiRect: Dividing Rectangles (Jones et al. 1993).

15/20 Experiments. [Plot: simple regret (log scale) vs number of queries.] Add- : knows the true decomposition. Add-d/M: uses M groups of size d. 4000 DiRect evaluations are used to maximise the acquisition function.

16/20 SDSS Luminous Red Galaxies. E.g. Hubble constant, baryonic density → cosmological simulator → compared against observation. Task: find the maximum-likelihood cosmological parameters. 20 dimensions, but only 9 parameters are relevant. Each query takes 2-5 seconds. 500 DiRect evaluations are used to maximise the acquisition function.

17/20 SDSS Luminous Red Galaxies. [Plot (log scale): performance vs number of queries; the comparison includes REMBO (Wang et al. 2013).]

18/20 Viola & Jones Face Detection. A cascade of 22 weak classifiers; an image is classified as negative if its score falls below the threshold at any stage. Task: find optimal threshold values on a training set of 1000 images. 22 dimensions. Each query takes 30-40 seconds. 1000 DiRect evaluations are used to maximise the acquisition function.

19/20 Viola & Jones Face Detection. [Plot: performance (y-axis 65-95) vs number of queries.]

20/20 Summary. The additive assumption improves the regret from exponential in D to linear in D. The acquisition function is easy to maximise. Even when f is not additive, Add-GP-UCB does well in practice. Similar results hold for Matérn kernels and in the bandit (cumulative-regret) setting.
Some open questions: How should we choose (d, M)? Can we generalise to other acquisition functions?
Code available: github.com/kirthevasank/add-gp-bandits. Jeff's talk: Friday 2pm @ Van Gogh. Thank you.