Multi-armed bandits in dynamic pricing

Multi-armed bandits in dynamic pricing
Arnoud den Boer, University of Twente / Centrum Wiskunde & Informatica Amsterdam
Lancaster, January 11, 2016

Dynamic pricing

A firm sells a product, with abundant inventory, during $T \in \mathbb{N}$ discrete time periods. Each period $t = 1, \ldots, T$:

(i) choose selling price $p_t$;
(ii) observe demand $d_t = \theta_1 + \theta_2 p_t + \epsilon_t$, where $\theta = (\theta_1, \theta_2)$ are unknown parameters in a known set $\Theta$, and $\epsilon_t$ is an unobservable random disturbance term;
(iii) collect revenue $p_t d_t$.

Which non-anticipating prices $p_1, \ldots, p_T$ maximize the worst-case cumulative expected revenue $\min_{\theta \in \Theta} \mathbb{E}\left[\sum_{t=1}^{T} p_t d_t\right]$?

Intractable problem.

Myopic pricing

An intuitive solution: choose arbitrary initial prices $p_1 \neq p_2$. For each $t \geq 2$:

(i) determine the least-squares estimate $\hat{\theta}_t$ of $\theta$, based on the available sales data;
(ii) set $p_{t+1} = \arg\max_p \, (\hat{\theta}_{t,1} + \hat{\theta}_{t,2} p)\, p$ (the perceived optimal decision).

Always choose the perceived optimal action.
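Below is a minimal simulation sketch of this myopic policy (not from the talk): the true parameters $\theta = (10, -1)$, the price bounds, and the unit-variance Gaussian noise are all hypothetical choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
theta = np.array([10.0, -1.0])   # true (theta_1, theta_2); hypothetical values
p_lo, p_hi = 1.0, 9.0            # assumed admissible price range
T = 1000

prices = [2.0, 4.0]              # two distinct initial prices: p_1 != p_2
demands = [theta[0] + theta[1] * p + rng.normal(0, 1) for p in prices]

for t in range(2, T):
    # least-squares estimate of (theta_1, theta_2) from sales data so far
    X = np.column_stack([np.ones(len(prices)), prices])
    theta_hat, *_ = np.linalg.lstsq(X, np.array(demands), rcond=None)
    # greedy price maximizes (th1 + th2 * p) * p, i.e. p = -th1 / (2 * th2)
    if theta_hat[1] < 0:
        p_next = np.clip(-theta_hat[0] / (2 * theta_hat[1]), p_lo, p_hi)
    else:
        p_next = p_hi            # degenerate estimate: fall back to the highest price
    prices.append(p_next)
    demands.append(theta[0] + theta[1] * p_next + rng.normal(0, 1))

print("final estimate:", theta_hat, "true:", theta)
print("last price:", prices[-1], "optimal price:", -theta[0] / (2 * theta[1]))
```

Running this repeatedly, the price path typically settles down while $\hat{\theta}_t$ stays biased, which is exactly the convergence failure discussed next.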

Convergence

Does $\hat{\theta}_t$ converge to $\theta$ as $t \to \infty$? No. It appears that $\hat{\theta}_t$ always converges, but with probability zero to the true $\theta$ (open problem).

This is caused by the prevalence of indeterminate equilibria: parameter estimates such that the true expected demand at the myopic optimal price equals the predicted expected demand.

Indeterminate equilibria

If $\hat{\theta}$ is sufficiently close to $\theta$, then $\arg\max_p \, (\hat{\theta}_1 + \hat{\theta}_2 p)\, p = -\hat{\theta}_1/(2\hat{\theta}_2)$. Then:

True expected demand: $\theta_1 - \theta_2 \hat{\theta}_1/(2\hat{\theta}_2)$.  (1)

Predicted expected demand: $\hat{\theta}_1 - \hat{\theta}_2 \hat{\theta}_1/(2\hat{\theta}_2)$.  (2)

If (1) equals (2), then $\hat{\theta}$ is an indeterminate equilibrium: the model output confirms the correctness of the (incorrect) estimates.
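For concreteness, a small worked instance of such an equilibrium, with hypothetical numbers: take true $\theta = (10, -1)$ (so $p^* = 5$, optimal revenue $25$) and suppose the estimate has $\hat{\theta}_2 = -2$. Solving (1) = (2) for $\hat{\theta}_1$:

```latex
\begin{align*}
\hat{p} &= -\tfrac{\hat{\theta}_1}{2\hat{\theta}_2} = \tfrac{\hat{\theta}_1}{4}, \\
(1) = (2):\quad 10 - \tfrac{\hat{\theta}_1}{4} &= \hat{\theta}_1 - 2\cdot\tfrac{\hat{\theta}_1}{4}
  \;\Longrightarrow\; \hat{\theta}_1 = \tfrac{40}{3},\quad \hat{p} = \tfrac{10}{3}, \\
\text{true demand at } \hat{p}:\; 10 - \tfrac{10}{3} &= \tfrac{20}{3}
  = \tfrac{40}{3} - 2\cdot\tfrac{10}{3} \;=\; \text{predicted demand.}
\end{align*}
```

So the data generated at $\hat{p}$ never contradicts $\hat{\theta} = (40/3, -2)$, even though the resulting revenue $\tfrac{20}{3}\cdot\tfrac{10}{3} \approx 22.2$ is strictly below the optimum $25$.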

Indeterminate equilibria: example

Back to original problem

Which non-anticipating prices $p_1, \ldots, p_T$ maximize $\min_{\theta \in \Theta} \mathbb{E}\left[\sum_{t=1}^{T} p_t d_t\right]$, or, equivalently, minimize the regret

$\mathrm{Regret}(T) = \max_{\theta \in \Theta} \mathbb{E}\left[ T \max_p \, (\theta_1 + \theta_2 p)\, p - \sum_{t=1}^{T} p_t d_t \right]$?

Exact solution intractable. Myopic pricing not optimal.

Let's find asymptotically optimal policies: smallest growth rate of $\mathrm{Regret}(T)$ in $T$.

Asymptotically optimal policy

Important observation: variation in controls $\Rightarrow$ better estimates:

$\|\hat{\theta}_t - \theta\|^2 = O\left( \dfrac{\log t}{t \, \mathrm{Var}(p_1, \ldots, p_t)} \right)$ a.s.

(Lai and Wei, Annals of Statistics, 1982.)

To ensure convergence of $\hat{\theta}_t$, some amount of experimentation is necessary. But not too much.
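The role of price variance is easy to see numerically. The sketch below (hypothetical parameters again) fits least squares on $t$ observations generated with low versus high price dispersion and compares the estimation error:

```python
import numpy as np

rng = np.random.default_rng(1)
theta = np.array([10.0, -1.0])   # hypothetical true parameters
t = 500

def ls_error(price_spread):
    # prices drawn around 5 with the given dispersion
    p = 5.0 + price_spread * rng.uniform(-1, 1, size=t)
    d = theta[0] + theta[1] * p + rng.normal(0, 1, size=t)
    X = np.column_stack([np.ones(t), p])
    theta_hat, *_ = np.linalg.lstsq(X, d, rcond=None)
    return np.sum((theta_hat - theta) ** 2)

print("error, low price variance :", ls_error(0.05))
print("error, high price variance:", ls_error(2.0))
```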

Controlled Variance pricing

Choose arbitrary initial prices $p_1 \neq p_2$. For each $t \geq 2$:

(i) determine the least-squares estimate $\hat{\theta}_t$ of $\theta$, based on the available sales data;
(ii) set $p_{t+1} = \arg\max_p \, (\hat{\theta}_{t,1} + \hat{\theta}_{t,2} p)\, p$ (perceived optimal decision),
subject to $t \, \mathrm{Var}(p_1, \ldots, p_{t+1}) \geq f(t)$ (information constraint),

for some increasing $f : \mathbb{N} \to (0, \infty)$.

Always choose the perceived optimal action that induces sufficient experimentation.
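A minimal sketch of this policy, under the same hypothetical setup as before. The choice $f(t) \approx \sqrt{t \log t}$ anticipates the performance result on the next slide; selecting, among feasible prices, the one closest to the greedy price is a simple implementation choice, not part of the policy's definition.

```python
import numpy as np

rng = np.random.default_rng(2)
theta = np.array([10.0, -1.0])            # hypothetical true parameters
p_lo, p_hi = 1.0, 9.0                     # assumed admissible price range
T = 500
f = lambda t: np.sqrt(t * np.log(t + 1))  # f(t) ~ sqrt(t log t)

prices = [2.0, 4.0]
demands = [theta[0] + theta[1] * p + rng.normal(0, 1) for p in prices]

for t in range(2, T):
    X = np.column_stack([np.ones(len(prices)), prices])
    th, *_ = np.linalg.lstsq(X, np.array(demands), rcond=None)
    greedy = np.clip(-th[0] / (2 * th[1]), p_lo, p_hi) if th[1] < 0 else p_hi
    # information constraint: t * Var(p_1, ..., p_{t+1}) >= f(t).
    # Update mean/variance in closed form for every candidate price on a grid.
    grid = np.linspace(p_lo, p_hi, 201)
    n, m, v = len(prices), np.mean(prices), np.var(prices)
    new_mean = (n * m + grid) / (n + 1)
    new_var = (n * (v + m * m) + grid ** 2) / (n + 1) - new_mean ** 2
    feasible = grid[t * new_var >= f(t)]
    # perceived optimal price among those inducing sufficient experimentation
    p_next = feasible[np.argmin(np.abs(feasible - greedy))] if feasible.size else greedy
    prices.append(p_next)
    demands.append(theta[0] + theta[1] * p_next + rng.normal(0, 1))

print("estimate:", th, "true:", theta)
```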

Controlled Variance pricing - performance

$\mathrm{Regret}(T) = O\left( f(T) + \sum_{t=1}^{T} \dfrac{\log t}{f(t)} \right)$.

$f$ balances between exploration and exploitation. The optimal $f$ gives $\mathrm{Regret}(T) = O(\sqrt{T \log T})$. No policy beats $\sqrt{T}$.

Thus, you can characterize the asymptotically (near-)optimal amount of experimentation (the optimal constant is not yet known, in general).
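The optimal order of $f$ follows from balancing the two terms in the bound; a short derivation, taking $f(t) = \sqrt{t \log t}$:

```latex
\begin{align*}
f(T) &= \sqrt{T \log T}, \\
\sum_{t=2}^{T} \frac{\log t}{f(t)}
  &= \sum_{t=2}^{T} \sqrt{\frac{\log t}{t}}
  = O\!\left(\sqrt{T \log T}\right),
\end{align*}
```

so both terms are of the same order and $\mathrm{Regret}(T) = O(\sqrt{T \log T})$; any slower-growing $f$ inflates the second term, any faster-growing $f$ inflates the first.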

Extension: multiple products

$K$ products: price vector $p_t = (p_t(1), \ldots, p_t(K))$, demand vector $d_t = \theta \begin{pmatrix} 1 \\ p_t \end{pmatrix} + \epsilon_t$, with matrix $\theta$ and noise vector $\epsilon_t$.

Convergence rates of the LS estimator:

$\|\hat{\theta}_t - \theta\|^2 = O\left( \dfrac{\log t}{\lambda_{\min}(t)} \right)$ a.s.,

where $\lambda_{\min}(t)$ is the smallest eigenvalue of the information matrix

$\sum_{i=1}^{t} \begin{pmatrix} 1 & p_i^\top \\ p_i & p_i p_i^\top \end{pmatrix}$.

Extension: multiple products

Same type of policy:

$p_{t+1} = \arg\max_p \; p^\top \hat{\theta}_t \begin{pmatrix} 1 \\ p \end{pmatrix}$ (perceived optimal decision),
subject to $\lambda_{\min}(t+1) \geq f(t)$ (information constraint),

for some increasing $f : \mathbb{N} \to (0, \infty)$.

Problem: $\lambda_{\min}(t+1)$ is a complicated object. Convertible to a non-convex but tractable quadratic constraint.

$\mathrm{Regret}(T) = O\left( f(T) + \sum_{t=1}^{T} \dfrac{\log t}{f(t)} \right)$; the optimal $f$ gives $\mathrm{Regret}(T) = O(\sqrt{T \log T})$.
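To make the information constraint concrete, here is how $\lambda_{\min}(t)$ can be computed from a price history (a numpy sketch with hypothetical two-product prices):

```python
import numpy as np

def lambda_min(price_history):
    """Smallest eigenvalue of sum_i [[1, p_i^T], [p_i, p_i p_i^T]]."""
    # each term equals the outer product of the vector (1, p_i)
    info = sum(np.outer(np.r_[1.0, p], np.r_[1.0, p]) for p in price_history)
    return np.linalg.eigvalsh(info)[0]  # eigvalsh returns ascending eigenvalues

prices = [np.array([3.0, 5.0]), np.array([4.0, 4.5]), np.array([3.5, 6.0])]
print(lambda_min(prices))
```

A feasible $p_{t+1}$ must keep this eigenvalue above $f(t)$; as the slide notes, that requirement can be rewritten as a non-convex but tractable quadratic constraint.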

Many more extensions

- Non-linear demand functions (generalized linear models): $E[D(p)] = h(\theta_1 + \theta_2 p)$;
- Time-varying markets (how much data to use for inference?);
- Strategic customer behavior (can you detect this from data?);
- Competition (repeated games with incomplete information? Mean-field games with learning?).

den Boer (2015), Surveys in Operations Research and Management Science 20(1).

Why a parametric demand model? $d_t = \theta_1 + \theta_2 p_t + \epsilon_t$...

Preferred by price managers. By smartly choosing experimentation prices converging to the optimal price, you can hedge against a misspecified linear demand model.

Can't this log-term be removed? $\mathrm{Regret}(T) = O(\sqrt{T \log T})$.

Convergence rates of LS estimators: not completely understood. Does more data lead to better estimators?

Pricing airline tickets

Sell $C \in \mathbb{N}$ perishable products during a (consecutive) selling season of $S \in \mathbb{N}$ periods. Demand in period $t$ is Bernoulli $h(\beta_0 + \beta_1 p_t)$, with unknown $\beta_0, \beta_1$. Goal of the firm: maximize total expected revenue.

Full-information solution

If the demand distribution is known: Markov decision problem. [Figure: state space with remaining inventory $c$ from $0$ to $C$ and stage $s$ from $1$ to $S$.]

Optimal prices $\pi^*_\beta(c, s) \in [p_l, p_h]$ for each pair $(c, s)$ of remaining inventory $c \in \{0, 1, \ldots, C\}$ and stage $s \in \{1, \ldots, S\}$.
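The full-information problem can be solved by backward induction over (inventory, stage). A minimal sketch, assuming a logistic purchase probability $h(x) = 1/(1+e^{-x})$ and a discrete price grid; both are illustrative assumptions, the talk only requires $\pi^*_\beta(c,s) \in [p_l, p_h]$.

```python
import numpy as np

def optimal_policy(beta, C, S, price_grid):
    """Backward induction:
    V[c, s] = max_p  h(b0 + b1*p) * (p + V[c-1, s+1]) + (1 - h) * V[c, s+1]."""
    h = lambda p: 1.0 / (1.0 + np.exp(-(beta[0] + beta[1] * p)))
    V = np.zeros((C + 1, S + 2))       # V[c, s]: value with c units left at stage s
    policy = np.zeros((C + 1, S + 1))
    for s in range(S, 0, -1):          # stages S, S-1, ..., 1
        for c in range(1, C + 1):
            q = h(price_grid)          # purchase probability at each candidate price
            vals = q * (price_grid + V[c - 1, s + 1]) + (1 - q) * V[c, s + 1]
            best = np.argmax(vals)
            V[c, s], policy[c, s] = vals[best], price_grid[best]
    return V, policy

beta = np.array([5.0, -1.0])           # hypothetical true (beta_0, beta_1)
V, pi = optimal_policy(beta, C=10, S=50, price_grid=np.linspace(1.0, 9.0, 81))
print("expected revenue:", V[10, 1])
print("price at (c=10, s=1):", pi[10, 1], " at (c=1, s=50):", pi[1, 50])
```

The marginal value of inventory, $V(c, s+1) - V(c-1, s+1)$, enters the maximization; this quantity is what drives the price dispersion discussed under endogenous learning below.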

Pricing airline tickets: incomplete information

Neglecting some technicalities, certainty-equivalent pricing performs well! I.e., if in period $t$ the state is $(c_t, s_t)$, use price $\pi^*_{\hat{\beta}_t}(c_t, s_t)$.
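A sketch of the certainty-equivalent loop, reusing optimal_policy from the sketch above; fitting $\hat{\beta}_t$ by maximum likelihood via scipy, and the warm-start observations, are implementation choices not prescribed by the talk.

```python
import numpy as np
from scipy.optimize import minimize

def mle(prices, sales):
    """Logistic-regression MLE of (beta_0, beta_1) from observed Bernoulli sales."""
    p, y = np.asarray(prices), np.asarray(sales)
    def nll(b):
        z = b[0] + b[1] * p
        return np.sum(np.logaddexp(0, z) - y * z)   # negative log-likelihood of logit
    return minimize(nll, x0=np.array([0.0, -0.1]), method="BFGS").x

# one selling season under certainty-equivalent pricing (hypothetical setup)
rng = np.random.default_rng(3)
beta_true, C, S = np.array([5.0, -1.0]), 10, 50
grid = np.linspace(1.0, 9.0, 81)
prices, sales, c = [2.0, 6.0, 3.0, 5.0], [1, 0, 0, 1], C   # assumed warm start
for s in range(1, S + 1):
    if c == 0:
        break
    beta_hat = mle(prices, sales)
    _, pi_hat = optimal_policy(beta_hat, C, S, grid)   # re-solve the DP at the estimate
    p = pi_hat[c, s]
    sale = rng.random() < 1.0 / (1.0 + np.exp(-(beta_true[0] + beta_true[1] * p)))
    prices.append(p); sales.append(int(sale)); c -= int(sale)
```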

Pricing airline tickets: endogenous learning

Reason for good performance: the endogenous learning property.

- The optimal price $\pi^*_\beta(c, s)$ depends on the marginal value of inventory.
- This quantity changes throughout the selling season.
- Thus, there is natural price dispersion if $\pi^*_\beta$ is used.
- By continuity arguments: price dispersion if $\hat{\beta}_t$ is close to $\beta$, for all $t$ in the selling season.

Endogenous learning causes fast convergence of the estimates:

$\mathbb{E}\left[ \|\hat{\beta}(t) - \beta^{(0)}\|^2 \right] = O\left( \dfrac{\log t}{t} \right)$