Multi-armed bandits in dynamic pricing

Multi-armed bandits in dynamic pricing
Arnoud den Boer, University of Twente / Centrum Wiskunde & Informatica Amsterdam
Lancaster, January 11, 2016

Dynamic pricing

A firm sells a product, with abundant inventory, during $T \in \mathbb{N}$ discrete time periods. Each period $t = 1, \ldots, T$:

(i) choose selling price $p_t$;
(ii) observe demand $d_t = \theta_1 + \theta_2 p_t + \epsilon_t$, where $\theta = (\theta_1, \theta_2)$ are unknown parameters in a known set $\Theta$, and $\epsilon_t$ is an unobservable random disturbance term;
(iii) collect revenue $p_t d_t$.

Which non-anticipating prices $p_1, \ldots, p_T$ maximize the worst-case cumulative expected revenue $\min_{\theta \in \Theta} \mathbb{E}\left[\sum_{t=1}^{T} p_t d_t\right]$?

Intractable problem.

Myopic pricing

An intuitive solution: choose arbitrary initial prices $p_1 \neq p_2$. For each $t \geq 2$:

(i) determine the least-squares estimate $\hat{\theta}_t$ of $\theta$, based on the available sales data;
(ii) set $p_{t+1} = \arg\max_p \, (\hat{\theta}_{t,1} + \hat{\theta}_{t,2} p)\, p$ (the perceived optimal decision).

Always choose the perceived optimal action.
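Below is a minimal simulation sketch of this myopic policy (not from the talk): the true parameters $\theta = (10, -1)$, the price bounds, and the unit-variance Gaussian noise are all hypothetical choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
theta = np.array([10.0, -1.0])   # true (theta_1, theta_2); hypothetical values
p_lo, p_hi = 1.0, 9.0            # assumed admissible price range
T = 1000

prices = [2.0, 4.0]              # two distinct initial prices: p_1 != p_2
demands = [theta[0] + theta[1] * p + rng.normal(0, 1) for p in prices]

for t in range(2, T):
    # least-squares estimate of (theta_1, theta_2) from sales data so far
    X = np.column_stack([np.ones(len(prices)), prices])
    theta_hat, *_ = np.linalg.lstsq(X, np.array(demands), rcond=None)
    # greedy price maximizes (th1 + th2 * p) * p, i.e. p = -th1 / (2 * th2)
    if theta_hat[1] < 0:
        p_next = np.clip(-theta_hat[0] / (2 * theta_hat[1]), p_lo, p_hi)
    else:
        p_next = p_hi            # degenerate estimate: fall back to the highest price
    prices.append(p_next)
    demands.append(theta[0] + theta[1] * p_next + rng.normal(0, 1))

print("final estimate:", theta_hat, "true:", theta)
print("last price:", prices[-1], "optimal price:", -theta[0] / (2 * theta[1]))
```

Running this repeatedly, the price path typically settles down while $\hat{\theta}_t$ stays biased, which is exactly the convergence failure discussed next.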

Convergence

Does $\hat{\theta}_t$ converge to $\theta$ as $t \to \infty$? No. It appears that $\hat{\theta}_t$ always converges, but with probability zero to the true $\theta$ (open problem).

This is caused by the prevalence of indeterminate equilibria: parameter estimates such that the true expected demand at the myopic optimal price equals the predicted expected demand.

Indeterminate equilibria

If $\hat{\theta}$ is sufficiently close to $\theta$, then $\arg\max_p \, (\hat{\theta}_1 + \hat{\theta}_2 p)\, p = -\hat{\theta}_1/(2\hat{\theta}_2)$. Then:

True expected demand: $\theta_1 - \theta_2 \hat{\theta}_1/(2\hat{\theta}_2)$.  (1)

Predicted expected demand: $\hat{\theta}_1 - \hat{\theta}_2 \hat{\theta}_1/(2\hat{\theta}_2)$.  (2)

If (1) equals (2), then $\hat{\theta}$ is an indeterminate equilibrium: the model output confirms the correctness of the (incorrect) estimates.
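For concreteness, a small worked instance of such an equilibrium, with hypothetical numbers: take true $\theta = (10, -1)$ (so $p^* = 5$, optimal revenue $25$) and suppose the estimate has $\hat{\theta}_2 = -2$. Solving (1) = (2) for $\hat{\theta}_1$:

```latex
\begin{align*}
\hat{p} &= -\tfrac{\hat{\theta}_1}{2\hat{\theta}_2} = \tfrac{\hat{\theta}_1}{4}, \\
(1) = (2):\quad 10 - \tfrac{\hat{\theta}_1}{4} &= \hat{\theta}_1 - 2\cdot\tfrac{\hat{\theta}_1}{4}
  \;\Longrightarrow\; \hat{\theta}_1 = \tfrac{40}{3},\quad \hat{p} = \tfrac{10}{3}, \\
\text{true demand at } \hat{p}:\; 10 - \tfrac{10}{3} &= \tfrac{20}{3}
  = \tfrac{40}{3} - 2\cdot\tfrac{10}{3} \;=\; \text{predicted demand.}
\end{align*}
```

So the data generated at $\hat{p}$ never contradicts $\hat{\theta} = (40/3, -2)$, even though the resulting revenue $\tfrac{20}{3}\cdot\tfrac{10}{3} \approx 22.2$ is strictly below the optimum $25$.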

Indeterminate equilibria: example

Back to original problem

Which non-anticipating prices $p_1, \ldots, p_T$ maximize $\min_{\theta \in \Theta} \mathbb{E}\left[\sum_{t=1}^{T} p_t d_t\right]$, or, equivalently, minimize the regret

$\mathrm{Regret}(T) = \max_{\theta \in \Theta} \mathbb{E}\left[ T \max_p \, (\theta_1 + \theta_2 p)\, p - \sum_{t=1}^{T} p_t d_t \right]$?

Exact solution intractable. Myopic pricing not optimal.

Let's find asymptotically optimal policies: smallest growth rate of $\mathrm{Regret}(T)$ in $T$.

Asymptotically optimal policy

Important observation: variation in controls $\Rightarrow$ better estimates:

$\|\hat{\theta}_t - \theta\|^2 = O\left( \dfrac{\log t}{t \, \mathrm{Var}(p_1, \ldots, p_t)} \right)$ a.s.

(Lai and Wei, Annals of Statistics, 1982.)

To ensure convergence of $\hat{\theta}_t$, some amount of experimentation is necessary. But not too much.
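The role of price variance is easy to see numerically. The sketch below (hypothetical parameters again) fits least squares on $t$ observations generated with low versus high price dispersion and compares the estimation error:

```python
import numpy as np

rng = np.random.default_rng(1)
theta = np.array([10.0, -1.0])   # hypothetical true parameters
t = 500

def ls_error(price_spread):
    # prices drawn around 5 with the given dispersion
    p = 5.0 + price_spread * rng.uniform(-1, 1, size=t)
    d = theta[0] + theta[1] * p + rng.normal(0, 1, size=t)
    X = np.column_stack([np.ones(t), p])
    theta_hat, *_ = np.linalg.lstsq(X, d, rcond=None)
    return np.sum((theta_hat - theta) ** 2)

print("error, low price variance :", ls_error(0.05))
print("error, high price variance:", ls_error(2.0))
```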

Controlled Variance pricing

Choose arbitrary initial prices $p_1 \neq p_2$. For each $t \geq 2$:

(i) determine the least-squares estimate $\hat{\theta}_t$ of $\theta$, based on the available sales data;
(ii) set $p_{t+1} = \arg\max_p \, (\hat{\theta}_{t,1} + \hat{\theta}_{t,2} p)\, p$ (perceived optimal decision),
subject to $t \, \mathrm{Var}(p_1, \ldots, p_{t+1}) \geq f(t)$ (information constraint),

for some increasing $f : \mathbb{N} \to (0, \infty)$.

Always choose the perceived optimal action that induces sufficient experimentation.
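A minimal sketch of this policy, under the same hypothetical setup as before. The choice $f(t) \approx \sqrt{t \log t}$ anticipates the performance result on the next slide; selecting, among feasible prices, the one closest to the greedy price is a simple implementation choice, not part of the policy's definition.

```python
import numpy as np

rng = np.random.default_rng(2)
theta = np.array([10.0, -1.0])            # hypothetical true parameters
p_lo, p_hi = 1.0, 9.0                     # assumed admissible price range
T = 500
f = lambda t: np.sqrt(t * np.log(t + 1))  # f(t) ~ sqrt(t log t)

prices = [2.0, 4.0]
demands = [theta[0] + theta[1] * p + rng.normal(0, 1) for p in prices]

for t in range(2, T):
    X = np.column_stack([np.ones(len(prices)), prices])
    th, *_ = np.linalg.lstsq(X, np.array(demands), rcond=None)
    greedy = np.clip(-th[0] / (2 * th[1]), p_lo, p_hi) if th[1] < 0 else p_hi
    # information constraint: t * Var(p_1, ..., p_{t+1}) >= f(t).
    # Update mean/variance in closed form for every candidate price on a grid.
    grid = np.linspace(p_lo, p_hi, 201)
    n, m, v = len(prices), np.mean(prices), np.var(prices)
    new_mean = (n * m + grid) / (n + 1)
    new_var = (n * (v + m * m) + grid ** 2) / (n + 1) - new_mean ** 2
    feasible = grid[t * new_var >= f(t)]
    # perceived optimal price among those inducing sufficient experimentation
    p_next = feasible[np.argmin(np.abs(feasible - greedy))] if feasible.size else greedy
    prices.append(p_next)
    demands.append(theta[0] + theta[1] * p_next + rng.normal(0, 1))

print("estimate:", th, "true:", theta)
```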

Controlled Variance pricing - performance

$\mathrm{Regret}(T) = O\left( f(T) + \sum_{t=1}^{T} \dfrac{\log t}{f(t)} \right)$.

$f$ balances between exploration and exploitation. The optimal $f$ gives $\mathrm{Regret}(T) = O(\sqrt{T \log T})$. No policy beats $\sqrt{T}$.

Thus, you can characterize the asymptotically (near-)optimal amount of experimentation (the optimal constant is not yet known, in general).
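The optimal order of $f$ follows from balancing the two terms in the bound; a short derivation, taking $f(t) = \sqrt{t \log t}$:

```latex
\begin{align*}
f(T) &= \sqrt{T \log T}, \\
\sum_{t=2}^{T} \frac{\log t}{f(t)}
  &= \sum_{t=2}^{T} \sqrt{\frac{\log t}{t}}
  = O\!\left(\sqrt{T \log T}\right),
\end{align*}
```

so both terms are of the same order and $\mathrm{Regret}(T) = O(\sqrt{T \log T})$; any slower-growing $f$ inflates the second term, any faster-growing $f$ inflates the first.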

Extension: multiple products

$K$ products: price vector $p_t = (p_t(1), \ldots, p_t(K))$, demand vector $d_t = \theta \begin{pmatrix} 1 \\ p_t \end{pmatrix} + \epsilon_t$, with matrix $\theta$ and noise vector $\epsilon_t$.

Convergence rates of the LS estimator:

$\|\hat{\theta}_t - \theta\|^2 = O\left( \dfrac{\log t}{\lambda_{\min}(t)} \right)$ a.s.,

where $\lambda_{\min}(t)$ is the smallest eigenvalue of the information matrix

$\sum_{i=1}^{t} \begin{pmatrix} 1 & p_i^\top \\ p_i & p_i p_i^\top \end{pmatrix}$.

Extension: multiple products

Same type of policy:

$p_{t+1} = \arg\max_p \; p^\top \hat{\theta}_t \begin{pmatrix} 1 \\ p \end{pmatrix}$ (perceived optimal decision),
subject to $\lambda_{\min}(t+1) \geq f(t)$ (information constraint),

for some increasing $f : \mathbb{N} \to (0, \infty)$.

Problem: $\lambda_{\min}(t+1)$ is a complicated object. Convertible to a non-convex but tractable quadratic constraint.

$\mathrm{Regret}(T) = O\left( f(T) + \sum_{t=1}^{T} \dfrac{\log t}{f(t)} \right)$; the optimal $f$ gives $\mathrm{Regret}(T) = O(\sqrt{T \log T})$.
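To make the information constraint concrete, here is how $\lambda_{\min}(t)$ can be computed from a price history (a numpy sketch with hypothetical two-product prices):

```python
import numpy as np

def lambda_min(price_history):
    """Smallest eigenvalue of sum_i [[1, p_i^T], [p_i, p_i p_i^T]]."""
    # each term equals the outer product of the vector (1, p_i)
    info = sum(np.outer(np.r_[1.0, p], np.r_[1.0, p]) for p in price_history)
    return np.linalg.eigvalsh(info)[0]  # eigvalsh returns ascending eigenvalues

prices = [np.array([3.0, 5.0]), np.array([4.0, 4.5]), np.array([3.5, 6.0])]
print(lambda_min(prices))
```

A feasible $p_{t+1}$ must keep this eigenvalue above $f(t)$; as the slide notes, that requirement can be rewritten as a non-convex but tractable quadratic constraint.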

Many more extensions

- Non-linear demand functions (generalized linear models): $E[D(p)] = h(\theta_1 + \theta_2 p)$;
- Time-varying markets (how much data to use for inference?);
- Strategic customer behavior (can you detect this from data?);
- Competition (repeated games with incomplete information? Mean-field games with learning?).

den Boer (2015), Surveys in Operations Research and Management Science 20(1).

Why a parametric demand model? $d_t = \theta_1 + \theta_2 p_t + \epsilon_t$...

Preferred by price managers. By smartly choosing experimentation prices converging to the optimal price, you can hedge against a misspecified linear demand model.

Can't this log-term be removed? $\mathrm{Regret}(T) = O(\sqrt{T \log T})$.

Convergence rates of LS estimators: not completely understood. Does more data lead to better estimators?

Pricing airline tickets

Sell $C \in \mathbb{N}$ perishable products during a (consecutive) selling season of $S \in \mathbb{N}$ periods. Demand in period $t$ is Bernoulli $h(\beta_0 + \beta_1 p_t)$, with unknown $\beta_0, \beta_1$. Goal of the firm: maximize total expected revenue.

Full-information solution

If the demand distribution is known: Markov decision problem. [Figure: state space with remaining inventory $c$ from $0$ to $C$ and stage $s$ from $1$ to $S$.]

Optimal prices $\pi^*_\beta(c, s) \in [p_l, p_h]$ for each pair $(c, s)$ of remaining inventory $c \in \{0, 1, \ldots, C\}$ and stage $s \in \{1, \ldots, S\}$.
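The full-information problem can be solved by backward induction over (inventory, stage). A minimal sketch, assuming a logistic purchase probability $h(x) = 1/(1+e^{-x})$ and a discrete price grid; both are illustrative assumptions, the talk only requires $\pi^*_\beta(c,s) \in [p_l, p_h]$.

```python
import numpy as np

def optimal_policy(beta, C, S, price_grid):
    """Backward induction:
    V[c, s] = max_p  h(b0 + b1*p) * (p + V[c-1, s+1]) + (1 - h) * V[c, s+1]."""
    h = lambda p: 1.0 / (1.0 + np.exp(-(beta[0] + beta[1] * p)))
    V = np.zeros((C + 1, S + 2))       # V[c, s]: value with c units left at stage s
    policy = np.zeros((C + 1, S + 1))
    for s in range(S, 0, -1):          # stages S, S-1, ..., 1
        for c in range(1, C + 1):
            q = h(price_grid)          # purchase probability at each candidate price
            vals = q * (price_grid + V[c - 1, s + 1]) + (1 - q) * V[c, s + 1]
            best = np.argmax(vals)
            V[c, s], policy[c, s] = vals[best], price_grid[best]
    return V, policy

beta = np.array([5.0, -1.0])           # hypothetical true (beta_0, beta_1)
V, pi = optimal_policy(beta, C=10, S=50, price_grid=np.linspace(1.0, 9.0, 81))
print("expected revenue:", V[10, 1])
print("price at (c=10, s=1):", pi[10, 1], " at (c=1, s=50):", pi[1, 50])
```

The marginal value of inventory, $V(c, s+1) - V(c-1, s+1)$, enters the maximization; this quantity is what drives the price dispersion discussed under endogenous learning below.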

Pricing airline tickets: incomplete information

Neglecting some technicalities, certainty-equivalent pricing performs well! I.e., if in period $t$ the state is $(c_t, s_t)$, use price $\pi^*_{\hat{\beta}_t}(c_t, s_t)$.
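A sketch of the certainty-equivalent loop, reusing optimal_policy from the sketch above; fitting $\hat{\beta}_t$ by maximum likelihood via scipy, and the warm-start observations, are implementation choices not prescribed by the talk.

```python
import numpy as np
from scipy.optimize import minimize

def mle(prices, sales):
    """Logistic-regression MLE of (beta_0, beta_1) from observed Bernoulli sales."""
    p, y = np.asarray(prices), np.asarray(sales)
    def nll(b):
        z = b[0] + b[1] * p
        return np.sum(np.logaddexp(0, z) - y * z)   # negative log-likelihood of logit
    return minimize(nll, x0=np.array([0.0, -0.1]), method="BFGS").x

# one selling season under certainty-equivalent pricing (hypothetical setup)
rng = np.random.default_rng(3)
beta_true, C, S = np.array([5.0, -1.0]), 10, 50
grid = np.linspace(1.0, 9.0, 81)
prices, sales, c = [2.0, 6.0, 3.0, 5.0], [1, 0, 0, 1], C   # assumed warm start
for s in range(1, S + 1):
    if c == 0:
        break
    beta_hat = mle(prices, sales)
    _, pi_hat = optimal_policy(beta_hat, C, S, grid)   # re-solve the DP at the estimate
    p = pi_hat[c, s]
    sale = rng.random() < 1.0 / (1.0 + np.exp(-(beta_true[0] + beta_true[1] * p)))
    prices.append(p); sales.append(int(sale)); c -= int(sale)
```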

Pricing airline tickets: endogenous learning

Reason for good performance: the endogenous learning property.

- The optimal price $\pi^*_\beta(c, s)$ depends on the marginal value of inventory.
- This quantity changes throughout the selling season.
- Thus, there is natural price dispersion if $\pi^*_\beta$ is used.
- By continuity arguments: price dispersion if $\hat{\beta}_t$ is close to $\beta$, for all $t$ in the selling season.

Endogenous learning causes fast convergence of the estimates:

$\mathbb{E}\left[ \|\hat{\beta}(t) - \beta^{(0)}\|^2 \right] = O\left( \dfrac{\log t}{t} \right)$