Multi-armed bandits in dynamic pricing
Arnoud den Boer
University of Twente, Centrum Wiskunde & Informatica Amsterdam
Lancaster, January 11, 2016
Dynamic pricing

A firm sells a product, with abundant inventory, during T ∈ N discrete time periods. Each period t = 1, ..., T:

(i) choose a selling price p_t;
(ii) observe demand d_t = θ_1 + θ_2 p_t + ɛ_t, where θ = (θ_1, θ_2) are unknown parameters in a known set Θ, and ɛ_t is an unobservable random disturbance term;
(iii) collect revenue p_t d_t.

Which non-anticipating prices p_1, ..., p_T maximize the worst-case cumulative expected revenue

    min_{θ ∈ Θ} E[ Σ_{t=1}^{T} p_t d_t ] ?

This is an intractable problem.
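This demand environment is easy to simulate. A minimal sketch, assuming illustrative parameter values θ = (10, −1.5) and Gaussian noise (both assumptions, not from the slides):

```python
import random

def simulate_revenue(prices, theta1=10.0, theta2=-1.5, sigma=0.5, seed=0):
    """Simulate the linear demand model d_t = theta1 + theta2*p_t + eps_t
    and return the realized cumulative revenue sum_t p_t * d_t."""
    rng = random.Random(seed)
    revenue = 0.0
    for p in prices:
        d = theta1 + theta2 * p + rng.gauss(0.0, sigma)  # noisy demand
        revenue += p * d
    return revenue

# With theta = (10, -1.5), the expected-revenue-maximizing price is
# p* = -theta1 / (2 * theta2) = 10/3.
```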
Myopic pricing

An intuitive solution: choose arbitrary initial prices p_1 ≠ p_2. For each t ≥ 2:

(i) determine the least-squares estimate θ̂_t of θ, based on the available sales data;
(ii) set p_{t+1} = arg max_p (θ̂_{t1} + θ̂_{t2} p) p, the perceived optimal decision.

Always choose the perceived optimal action.
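The two steps above can be sketched in code. This is an illustrative implementation under the assumed linear model, with hand-picked true parameters, noise level, and initial prices (all assumptions):

```python
import random

def myopic_pricing(T, theta=(10.0, -1.5), sigma=0.5, p_init=(2.0, 4.0), seed=1):
    """Myopic (certainty-equivalent) pricing: after each period, refit
    least squares on the history and charge the perceived optimal price."""
    rng = random.Random(seed)
    demand = lambda p: theta[0] + theta[1] * p + rng.gauss(0.0, sigma)
    prices = list(p_init)
    demands = [demand(p) for p in prices]
    for _ in range(2, T):
        # Least-squares estimates (th1, th2) of (theta1, theta2).
        n = len(prices)
        pbar, dbar = sum(prices) / n, sum(demands) / n
        sxx = sum((p - pbar) ** 2 for p in prices)
        sxy = sum((p - pbar) * (d - dbar) for p, d in zip(prices, demands))
        th2 = sxy / sxx
        th1 = dbar - th2 * pbar
        # Perceived optimal price: argmax_p (th1 + th2*p)*p = -th1/(2*th2).
        p_next = -th1 / (2 * th2) if th2 < 0 else max(prices)
        prices.append(p_next)
        demands.append(demand(p_next))
    return prices
```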
Convergence

Does θ̂_t converge to θ as t → ∞? No. It seems that θ̂_t always converges, but with probability zero to the true θ. (Open problem.)

This is caused by the prevalence of indeterminate equilibria: parameter estimates such that the true expected demand at the myopic optimal price equals the predicted expected demand.
Indeterminate equilibria

If θ̂ is sufficiently close to θ, then arg max_p (θ̂_1 + θ̂_2 p) p = −θ̂_1 / (2 θ̂_2). Then:

    True expected demand:      θ_1 + θ_2 · (−θ̂_1 / (2 θ̂_2)).   (1)
    Predicted expected demand: θ̂_1 + θ̂_2 · (−θ̂_1 / (2 θ̂_2)).  (2)

If (1) equals (2), then θ̂ is an indeterminate equilibrium: the model output confirms the correctness of the (incorrect) estimates.
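The fixed-point condition (1) = (2) is straightforward to check numerically. In this hypothetical example (values chosen by hand, not from the slides), the wrong estimate θ̂ = (8, −1) is self-confirming under the true θ = (10, −1.5):

```python
def is_indeterminate_equilibrium(theta, theta_hat, tol=1e-9):
    """Check whether theta_hat is an indeterminate equilibrium: true and
    predicted expected demand agree at the myopic optimal price."""
    p_hat = -theta_hat[0] / (2 * theta_hat[1])               # myopic price
    true_demand = theta[0] + theta[1] * p_hat                # equation (1)
    predicted_demand = theta_hat[0] + theta_hat[1] * p_hat   # equation (2)
    return abs(true_demand - predicted_demand) < tol

# theta_hat = (8, -1) is wrong, yet self-confirming: p_hat = 4, and both
# the true and the predicted expected demand at price 4 equal 4.
print(is_indeterminate_equilibrium((10.0, -1.5), (8.0, -1.0)))  # True
```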
Indeterminate equilibria: example

[Figure not transcribed.]
Back to original problem

Which non-anticipating prices p_1, ..., p_T maximize min_{θ ∈ Θ} E[ Σ_{t=1}^{T} p_t d_t ], or, equivalently, minimize the regret

    Regret(T) = max_{θ ∈ Θ} E[ T · max_p (θ_1 + θ_2 p) p − Σ_{t=1}^{T} p_t d_t ].

The exact solution is intractable, and myopic pricing is not optimal. Let's find asymptotically optimal policies: those with the smallest growth rate of Regret(T) in T.
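For a fixed θ (dropping the max over Θ for illustration), the expected regret of a deterministic price sequence can be evaluated directly from the definition above. A sketch, with the same illustrative parameters as before:

```python
def expected_regret(prices, theta=(10.0, -1.5)):
    """Expected regret of a fixed price sequence under known theta:
    T * max_p (th1 + th2*p)*p  minus  sum_t p_t * E[d_t]."""
    th1, th2 = theta
    p_star = -th1 / (2 * th2)                      # optimal price
    opt_per_period = (th1 + th2 * p_star) * p_star
    realized = sum(p * (th1 + th2 * p) for p in prices)
    return len(prices) * opt_per_period - realized

# Charging p* every period gives (essentially) zero regret; any other
# constant price incurs linearly growing regret.
```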
Asymptotically optimal policy

Important observation: variation in controls gives better estimates.

    ‖θ̂_t − θ‖² = O( log t / (t · Var(p_1, ..., p_t)) )  a.s.

(Lai and Wei, Annals of Statistics, 1982.)

To ensure convergence of θ̂_t, some amount of experimentation is necessary. But not too much.
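A toy illustration of why the variance term matters (this is not the Lai and Wei argument, just the degenerate case): with identical prices, Var(p_1, ..., p_t) = 0, the bound above blows up, and indeed intercept and slope cannot be separated.

```python
def price_variance(prices):
    """Sample variance Var(p_1, ..., p_t) appearing in the Lai-Wei bound."""
    n = len(prices)
    mean = sum(prices) / n
    return sum((p - mean) ** 2 for p in prices) / n

print(price_variance([3.0, 3.0, 3.0]))  # 0.0: theta not identifiable
print(price_variance([2.0, 4.0, 3.0]))  # positive: theta identifiable
```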
Controlled Variance pricing

Choose arbitrary initial prices p_1 ≠ p_2. For each t ≥ 2:

(i) determine the least-squares estimate θ̂_t of θ, based on the available sales data;
(ii) set p_{t+1} = arg max_p (θ̂_{t1} + θ̂_{t2} p) p   (perceived optimal decision)
    subject to t · Var(p_1, ..., p_{t+1}) ≥ f(t)   (information constraint),

for some increasing f : N → (0, ∞).

Always choose the perceived optimal action that induces sufficient experimentation.
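One way to enforce the information constraint in code: charge the myopic price when it already keeps t·Var(p_1, ..., p_{t+1}) ≥ f(t), and otherwise perturb it. The fixed +1.0 perturbation and the choice f(t) = √t below are illustrative assumptions, not the paper's exact construction:

```python
import random

def cvp(T, theta=(10.0, -1.5), sigma=0.5, seed=2, f=lambda t: t ** 0.5):
    """Controlled Variance pricing sketch: charge the myopic price unless
    that would violate the constraint t * Var(p_1..p_{t+1}) >= f(t)."""
    rng = random.Random(seed)
    demand = lambda p: theta[0] + theta[1] * p + rng.gauss(0.0, sigma)
    prices = [2.0, 4.0]
    demands = [demand(p) for p in prices]

    def var_with(p_new):
        ps = prices + [p_new]
        m = sum(ps) / len(ps)
        return sum((p - m) ** 2 for p in ps) / len(ps)

    for t in range(2, T):
        n = len(prices)
        pbar, dbar = sum(prices) / n, sum(demands) / n
        sxx = sum((p - pbar) ** 2 for p in prices)
        sxy = sum((p - pbar) * (d - dbar) for p, d in zip(prices, demands))
        th2 = sxy / sxx
        th1 = dbar - th2 * pbar
        p_myopic = -th1 / (2 * th2) if th2 < 0 else max(prices)
        # Deviate when the myopic price would not generate enough
        # price dispersion (the +1.0 perturbation is an assumption).
        p_next = p_myopic if t * var_with(p_myopic) >= f(t) else p_myopic + 1.0
        prices.append(p_next)
        demands.append(demand(p_next))
    return prices
```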
Controlled Variance pricing - performance

    Regret(T) = O( f(T) + Σ_{t=1}^{T} log t / f(t) ).

f balances exploration and exploitation. The optimal f gives Regret(T) = O(√(T log T)). No policy beats √T. Thus, you can characterize the asymptotically (near-)optimal amount of experimentation (the optimal constant is not yet known, in general).
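The balance can be made explicit. A heuristic calculation (not spelled out on the slides) shows that f(t) = √(t log t) equates the order of the two terms in the regret bound:

```latex
\mathrm{Regret}(T) = O\!\left( f(T) + \sum_{t=1}^{T} \frac{\log t}{f(t)} \right),
\qquad f(t) = \sqrt{t \log t} \;\Rightarrow\; f(T) = \sqrt{T \log T},
\]
\[
\sum_{t=2}^{T} \frac{\log t}{\sqrt{t \log t}}
  = \sum_{t=2}^{T} \sqrt{\frac{\log t}{t}}
  \le \sqrt{\log T} \sum_{t=2}^{T} t^{-1/2}
  \le 2\sqrt{T \log T},
```

so both the exploration cost f(T) and the estimation-error cost are O(√(T log T)).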
Extension: multiple products

K products: price vector p_t = (p_t(1), ..., p_t(K)), demand vector d_t = θ (1, p_t)ᵀ + ɛ, with a parameter matrix θ and noise vector ɛ.

Convergence rates of the LS estimator:

    ‖θ̂_t − θ‖² = O( log t / λ_min(t) )  a.s.,

where λ_min(t) is the smallest eigenvalue of the information matrix Σ_{i=1}^{t} (1, p_i)(1, p_i)ᵀ.
Extension: multiple products

Same type of policy:

    p_{t+1} = arg max_p pᵀ θ̂_t (1, p)ᵀ   (perceived optimal decision)
    subject to λ_min(t + 1) ≥ f(t)   (information constraint),

for some increasing f : N → (0, ∞).

Problem: λ_min(t + 1) is a complicated object. It is convertible to a non-convex but tractable quadratic constraint.

    Regret(T) = O( f(T) + Σ_{t=1}^{T} log t / f(t) );  the optimal f gives Regret(T) = O(√(T log T)).
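For a single product (K = 1) the information matrix is 2×2, so λ_min(t) has a closed form; for general K one would call an eigenvalue routine such as numpy.linalg.eigvalsh. A toy computation:

```python
import math

def lambda_min_2x2(prices):
    """Smallest eigenvalue of the 2x2 information matrix
    sum_i (1, p_i)(1, p_i)^T for a single-product price history."""
    a = len(prices)                  # sum of 1 * 1
    b = sum(prices)                  # sum of 1 * p_i
    c = sum(p * p for p in prices)   # sum of p_i^2
    tr, det = a + c, a * c - b * b
    return (tr - math.sqrt(tr * tr - 4 * det)) / 2

print(lambda_min_2x2([3.0, 3.0]))  # 0.0: identical prices -> singular matrix
```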
Many more extensions

- Non-linear demand functions (generalized linear models): E[D(p)] = h(θ_1 + θ_2 p);
- Time-varying markets (how much data to use for inference?);
- Strategic customer behavior (can you detect this from data?);
- Competition (repeated games with incomplete information? Mean-field games with learning?).

See den Boer (2015), Surveys in Operations Research and Management Science 20(1).
Why a parametric demand model?

d_t = θ_1 + θ_2 p_t + ɛ_t ...

It is preferred by price managers. Moreover, by smartly choosing experimentation prices converging to the optimal price, you can hedge against misspecified linear demand.
Can't this log-term be removed?

Regret(T) = O(√(T log T)). The convergence rates of LS estimators are not completely understood: does more data always lead to better estimators?
Pricing airline tickets

Sell C ∈ N perishable products during a (consecutive) selling season of S ∈ N periods. Demand in period t is Bernoulli(h(β_0 + β_1 p_t)), with unknown β_0, β_1. Goal of the firm: maximize total expected revenue.
Full-information solution

If the demand distribution is known, this is a Markov decision problem over remaining inventory c ∈ {0, 1, ..., C} and stage s ∈ {1, ..., S}. It yields optimal prices π_β(c, s) ∈ [p_l, p_h] for each state (c, s).
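The full-information prices π_β(c, s) can be computed by backward induction over the states (c, s). A sketch with a logistic link h and a coarse price grid (the link, parameter values, and grid are all illustrative assumptions):

```python
import math

def optimal_policy(C, S, beta0=5.0, beta1=-1.0, p_grid=None):
    """Backward induction for the full-information MDP: V[c][s] is the
    optimal expected revenue-to-go with c units left and s periods to go."""
    if p_grid is None:
        p_grid = [1.0 + 0.5 * k for k in range(9)]   # prices 1.0, 1.5, ..., 5.0
    h = lambda x: 1.0 / (1.0 + math.exp(-x))          # logistic link (assumption)
    V = [[0.0] * (S + 1) for _ in range(C + 1)]
    policy = {}
    for s in range(1, S + 1):
        for c in range(1, C + 1):
            best, best_p = -1.0, None
            for p in p_grid:
                q = h(beta0 + beta1 * p)              # P(sale at price p)
                val = q * (p + V[c - 1][s - 1]) + (1 - q) * V[c][s - 1]
                if val > best:
                    best, best_p = val, p
            V[c][s] = best
            policy[(c, s)] = best_p
    return V, policy
```

With these toy parameters, scarce inventory and a long remaining season push the optimal price upward, reflecting the marginal value of inventory.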
Pricing airline tickets: incomplete information

Neglecting some technicalities, certainty-equivalent pricing performs well! That is, if the state in period t is (c_t, s_t), use the price π_{β̂_t}(c_t, s_t).
Pricing airline tickets: endogenous learning

The reason for the good performance is the endogenous learning property:

- The optimal price π_β(c, s) depends on the marginal value of inventory.
- This quantity changes throughout the selling season.
- Thus, there is natural price dispersion if π_β is used.
- By continuity arguments: price dispersion if β̂_t is close to β, for all t in the selling season.

Endogenous learning causes fast convergence of the estimates:

    E[ ‖β̂(t) − β^(0)‖² ] = O( log t / t ),

where β^(0) denotes the true parameter.
More informationReinforcement Learning (1): Discrete MDP, Value Iteration, Policy Iteration
Reinforcement Learning (1): Discrete MDP, Value Iteration, Policy Iteration Piyush Rai CS5350/6350: Machine Learning November 29, 2011 Reinforcement Learning Supervised Learning: Uses explicit supervision
More informationNotes on Macroeconomic Theory II
Notes on Macroeconomic Theory II Chao Wei Department of Economics George Washington University Washington, DC 20052 January 2007 1 1 Deterministic Dynamic Programming Below I describe a typical dynamic
More informationPh.D. Preliminary Examination MICROECONOMIC THEORY Applied Economics Graduate Program August 2017
Ph.D. Preliminary Examination MICROECONOMIC THEORY Applied Economics Graduate Program August 2017 The time limit for this exam is four hours. The exam has four sections. Each section includes two questions.
More informationEfficient Market Making via Convex Optimization, and a Connection to Online Learning
Efficient Market Making via Convex Optimization, and a Connection to Online Learning by J. Abernethy, Y. Chen and J.W. Vaughan Presented by J. Duraj and D. Rishi 1 / 16 Outline 1 Motivation 2 Reasonable
More informationECON 6022B Problem Set 2 Suggested Solutions Fall 2011
ECON 60B Problem Set Suggested Solutions Fall 0 September 7, 0 Optimal Consumption with A Linear Utility Function (Optional) Similar to the example in Lecture 3, the household lives for two periods and
More informationModeling the extremes of temperature time series. Debbie J. Dupuis Department of Decision Sciences HEC Montréal
Modeling the extremes of temperature time series Debbie J. Dupuis Department of Decision Sciences HEC Montréal Outline Fig. 1: S&P 500. Daily negative returns (losses), Realized Variance (RV) and Jump
More informationResolution of a Financial Puzzle
Resolution of a Financial Puzzle M.J. Brennan and Y. Xia September, 1998 revised November, 1998 Abstract The apparent inconsistency between the Tobin Separation Theorem and the advice of popular investment
More informationTwo hours. To be supplied by the Examinations Office: Mathematical Formula Tables and Statistical Tables THE UNIVERSITY OF MANCHESTER
Two hours MATH20802 To be supplied by the Examinations Office: Mathematical Formula Tables and Statistical Tables THE UNIVERSITY OF MANCHESTER STATISTICAL METHODS Answer any FOUR of the SIX questions.
More informationExpected utility theory; Expected Utility Theory; risk aversion and utility functions
; Expected Utility Theory; risk aversion and utility functions Prof. Massimo Guidolin Portfolio Management Spring 2016 Outline and objectives Utility functions The expected utility theorem and the axioms
More informationResource Allocation within Firms and Financial Market Dislocation: Evidence from Diversified Conglomerates
Resource Allocation within Firms and Financial Market Dislocation: Evidence from Diversified Conglomerates Gregor Matvos and Amit Seru (RFS, 2014) Corporate Finance - PhD Course 2017 Stefan Greppmair,
More informationIdentification and Estimation of Dynamic Games when Players Belief Are Not in Equilibrium
Identification and Estimation of Dynamic Games when Players Belief Are Not in Equilibrium A Short Review of Aguirregabiria and Magesan (2010) January 25, 2012 1 / 18 Dynamics of the game Two players, {i,
More informationLecture outline W.B.Powell 1
Lecture outline What is a policy? Policy function approximations (PFAs) Cost function approximations (CFAs) alue function approximations (FAs) Lookahead policies Finding good policies Optimizing continuous
More informationA potentially useful approach to model nonlinearities in time series is to assume different behavior (structural break) in different subsamples
1.3 Regime switching models A potentially useful approach to model nonlinearities in time series is to assume different behavior (structural break) in different subsamples (or regimes). If the dates, the
More informationPoint Estimators. STATISTICS Lecture no. 10. Department of Econometrics FEM UO Brno office 69a, tel
STATISTICS Lecture no. 10 Department of Econometrics FEM UO Brno office 69a, tel. 973 442029 email:jiri.neubauer@unob.cz 8. 12. 2009 Introduction Suppose that we manufacture lightbulbs and we want to state
More informationEco504 Spring 2010 C. Sims FINAL EXAM. β t 1 2 φτ2 t subject to (1)
Eco54 Spring 21 C. Sims FINAL EXAM There are three questions that will be equally weighted in grading. Since you may find some questions take longer to answer than others, and partial credit will be given
More informationHeterogeneous Hidden Markov Models
Heterogeneous Hidden Markov Models José G. Dias 1, Jeroen K. Vermunt 2 and Sofia Ramos 3 1 Department of Quantitative methods, ISCTE Higher Institute of Social Sciences and Business Studies, Edifício ISCTE,
More informationSupplemental Online Appendix to Han and Hong, Understanding In-House Transactions in the Real Estate Brokerage Industry
Supplemental Online Appendix to Han and Hong, Understanding In-House Transactions in the Real Estate Brokerage Industry Appendix A: An Agent-Intermediated Search Model Our motivating theoretical framework
More informationJEFF MACKIE-MASON. x is a random variable with prior distrib known to both principal and agent, and the distribution depends on agent effort e
BASE (SYMMETRIC INFORMATION) MODEL FOR CONTRACT THEORY JEFF MACKIE-MASON 1. Preliminaries Principal and agent enter a relationship. Assume: They have access to the same information (including agent effort)
More informationSequential Decision Making
Sequential Decision Making Dynamic programming Christos Dimitrakakis Intelligent Autonomous Systems, IvI, University of Amsterdam, The Netherlands March 18, 2008 Introduction Some examples Dynamic programming
More informationNotes on the EM Algorithm Michael Collins, September 24th 2005
Notes on the EM Algorithm Michael Collins, September 24th 2005 1 Hidden Markov Models A hidden Markov model (N, Σ, Θ) consists of the following elements: N is a positive integer specifying the number of
More informationLecture 14 Consumption under Uncertainty Ricardian Equivalence & Social Security Dynamic General Equilibrium. Noah Williams
Lecture 14 Consumption under Uncertainty Ricardian Equivalence & Social Security Dynamic General Equilibrium Noah Williams University of Wisconsin - Madison Economics 702 Extensions of Permanent Income
More informationDynamic Portfolio Execution Detailed Proofs
Dynamic Portfolio Execution Detailed Proofs Gerry Tsoukalas, Jiang Wang, Kay Giesecke March 16, 2014 1 Proofs Lemma 1 (Temporary Price Impact) A buy order of size x being executed against i s ask-side
More informationCSCI 1951-G Optimization Methods in Finance Part 07: Portfolio Optimization
CSCI 1951-G Optimization Methods in Finance Part 07: Portfolio Optimization March 9 16, 2018 1 / 19 The portfolio optimization problem How to best allocate our money to n risky assets S 1,..., S n with
More information6.207/14.15: Networks Lecture 10: Introduction to Game Theory 2
6.207/14.15: Networks Lecture 10: Introduction to Game Theory 2 Daron Acemoglu and Asu Ozdaglar MIT October 14, 2009 1 Introduction Outline Review Examples of Pure Strategy Nash Equilibria Mixed Strategies
More informationChapter 3. Dynamic discrete games and auctions: an introduction
Chapter 3. Dynamic discrete games and auctions: an introduction Joan Llull Structural Micro. IDEA PhD Program I. Dynamic Discrete Games with Imperfect Information A. Motivating example: firm entry and
More informationAdaptive Experiments for Policy Choice. March 8, 2019
Adaptive Experiments for Policy Choice Maximilian Kasy Anja Sautmann March 8, 2019 Introduction The goal of many experiments is to inform policy choices: 1. Job search assistance for refugees: Treatments:
More informationMicroeconomic Theory May 2013 Applied Economics. Ph.D. PRELIMINARY EXAMINATION MICROECONOMIC THEORY. Applied Economics Graduate Program.
Ph.D. PRELIMINARY EXAMINATION MICROECONOMIC THEORY Applied Economics Graduate Program May 2013 *********************************************** COVER SHEET ***********************************************
More informationOptimally Thresholded Realized Power Variations for Lévy Jump Diffusion Models
Optimally Thresholded Realized Power Variations for Lévy Jump Diffusion Models José E. Figueroa-López 1 1 Department of Statistics Purdue University University of Missouri-Kansas City Department of Mathematics
More information