Dynamic Pricing with Varying Cost


L. Jeff Hong, College of Business, City University of Hong Kong
Joint work with Ying Zhong and Guangwu Liu

Outline

1 Introduction
2 Problem Formulation
3 Pricing Policy
4 Regret Analysis
5 Numerical Results

Online Pricing Problem

A company sells a product online. Customers arrive one at a time and buy one unit of the product if the price is lower than their willingness to pay (WTP). Customers are homogeneous, sharing the same WTP distribution. The company has a menu of prices, e.g., 3.99, 4.99 and 5.99, to choose from.

Question: How should the price be set?

Learning and Earning

The objective is to maximize the cumulative profit by adaptively offering different prices to different customers. The decision maker faces a tradeoff between exploration of the acceptance probabilities at different prices (learning) and exploitation of the immediate profit (earning). The problem was first introduced by Rothschild (1974). Without any assumptions on the WTP distribution, the problem is typically formulated as a multi-armed bandit (MAB) problem.

Multi-armed Bandit

Originally formulated by Robbins (1952), the MAB is an important class of sequential optimization problems.

Objective: Devise a sampling policy among a group of $K \ge 2$ statistical populations (arms) that maximizes the expected cumulative reward over a finite time horizon.

Multi-armed Bandit

Policies are evaluated by their regret,
$$R[T] = \sum_{t=1}^{T} \mathbb{E}\left[\mu_{i^*} - \mu_{I_t}\right],$$
where $i^*$ is the optimal arm and $I_t$ is the arm chosen in period $t$. Lai and Robbins (1985) proved that the regret for the MAB problem must grow at least at rate $\log T$. The upper-confidence-bound (UCB) policy of Auer et al. (2002) achieves $R[T] \le C \log T$ for some constant $C > 0$.

UCB Policy

1. Initialization: Play each arm once.
2. Loop: Play the arm $j$ that maximizes
$$\bar{\mu}_j + \sqrt{\frac{2 \log t}{T_j(t-1)}},$$
where $\bar{\mu}_j$ is the average reward obtained from arm $j$, $T_j(t-1)$ is the number of times arm $j$ has been played so far, and $t$ is the overall number of plays done so far.
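For concreteness, here is a minimal Python sketch of the UCB policy above. The two-armed Bernoulli reward setup at the bottom is an illustrative assumption, not part of the slides.

```python
import math
import random

def ucb1(reward_fns, horizon):
    """Minimal UCB1: play each arm once, then maximize the upper confidence bound."""
    K = len(reward_fns)
    counts = [0] * K    # T_j(t-1): number of plays of arm j
    means = [0.0] * K   # running average reward of arm j
    total = 0.0
    for t in range(1, horizon + 1):
        if t <= K:
            j = t - 1   # initialization: play each arm once
        else:
            j = max(range(K), key=lambda k:
                    means[k] + math.sqrt(2 * math.log(t) / counts[k]))
        r = reward_fns[j]()                      # observe a reward in [0, 1]
        counts[j] += 1
        means[j] += (r - means[j]) / counts[j]   # incremental average
        total += r
    return total

# Illustrative two-armed Bernoulli bandit (assumed means 0.5 and 0.6).
arms = [lambda: float(random.random() < 0.5),
        lambda: float(random.random() < 0.6)]
print(ucb1(arms, horizon=10_000))
```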

Varying Cost

In some practical applications, costs may vary across customers.

Online sales of an insurance product: Potential customers are usually asked to fill in questionnaires before getting quotes for the product. The insurance company is able to assess the potential risk (cost) of each individual customer through these questionnaires, and this cost typically varies from customer to customer. To maximize the cumulative profit, different premiums (prices) should be charged to different customers based on their costs.

Other examples include the sales of some perishable goods, e.g., gasoline, fresh fruit, etc.

Notation

- $T$: the total number of selling periods (customers).
- $c_t$: the cost observed in period $t$. We assume the costs are i.i.d. samples from a fixed (unknown) distribution on $\mathcal{C}$.
- $p_1 < p_2 < \cdots < p_K$: the price choices. We assume $p_1 > \max_{c \in \mathcal{C}} c$.
- $\mathcal{K} = \{1, 2, \ldots, K\}$: the index set of all the prices.
- $\mu_k(c)$: the profit function of price $k$ when the cost is $c$,
$$\mu_k(c) = \mathbb{E}[D(p_k)]\,(p_k - c) = \pi_k \left(1 - \frac{c}{p_k}\right),$$
where $\pi_k = \mathbb{E}[D(p_k)]\,p_k$ is the expected revenue at $p_k$. We also assume that the observed revenue $D(p_k)\,p_k \in [0, 1]$.

Problem Formulation

Consider a company selling a product over $T$ (unknown) periods. At the beginning of each period $t$, upon observing a cost $c_t$, the decision maker needs to choose a price $p_k$ with $k \in \mathcal{K}$. The index of the true optimal price at time $t$ is
$$i^*(c_t) = \arg\max_{k \in \mathcal{K}} \mu_k(c_t).$$
Let $I_t(c_t)$ be the index of the price chosen by a pricing policy.

Objective: Find a pricing policy that minimizes the cumulative regret
$$R[T] = \sum_{t=1}^{T} \mathbb{E}\left[\mu_{i^*(c_t)}(c_t) - \mu_{I_t(c_t)}(c_t)\right].$$

Why Consider Varying Cost?

Suppose one ignores the cost variation and uses the expected cost $\mathbb{E}[c_t]$ in making pricing decisions. The problem becomes a MAB problem:
$$\max_{k \in \mathcal{K}} \mu_k(\mathbb{E}[c_t]).$$
Taking the varying cost into account, the problem is instead
$$\max_{k \in \mathcal{K}} \mu_k(c_t).$$
Since each $\mu_k$ is linear in $c$, the maximum over $k$ is convex in $c$, so by Jensen's inequality
$$\max_{k \in \mathcal{K}} \mu_k(\mathbb{E}[c_t]) \le \mathbb{E}\left[\max_{k \in \mathcal{K}} \mu_k(c_t)\right].$$
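A quick numerical check of this inequality, using the prices and revenues from the numerical-results slide ($\pi_1 = 0.6$, $p_1 = 4.0$, $\pi_2 = 0.59$, $p_2 = 4.1$) and an assumed two-point cost distribution with equal weights on $\{1, 3\}$:

```python
def profit(pi, p, c):
    # mu_k(c) = pi_k * (1 - c / p_k)
    return pi * (1 - c / p)

prices = [(0.6, 4.0), (0.59, 4.1)]  # (pi_k, p_k) pairs
costs = [1.0, 3.0]                  # equally likely costs

exp_c = sum(costs) / len(costs)
lhs = max(profit(pi, p, exp_c) for pi, p in prices)
rhs = sum(max(profit(pi, p, c) for pi, p in prices)
          for c in costs) / len(costs)
print(lhs, "<=", rhs)  # prints 0.3022... <= 0.3041...
```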

Special Features

Notice that, for any $k \in \mathcal{K}$,
$$\mu_k(c) = \pi_k \left(1 - \frac{c}{p_k}\right),$$
so the straight line $\mu_k(c)$ always passes through the two fixed points $(p_k, 0)$ and $(0, \pi_k)$. Since $(p_k, 0)$ is known, the entire line is determined by $\pi_k$ alone; precisely estimating $\pi_k$ is therefore crucial.

Pricing Policy

1. Initialization: For $t \le K$, choose each price once.
2. Loop: For $t > K$:
   - Estimate the revenue for each $p_k$:
$$\hat{\pi}_{k,t} = \frac{1}{T_k(t-1)} \sum_{i=1}^{T_k(t-1)} \Pi(p_k)_i,$$
where $\Pi(p_k)_i$ is the $i$-th realization of the revenue of $p_k$.
   - Write down the upper bound of the profit function for each $p_k$ in the UCB manner:
$$\bar{\mu}_{k,t}(c) = \left(\hat{\pi}_{k,t} + \sqrt{\frac{2 \log t}{T_k(t-1)}}\right)\left(1 - \frac{c}{p_k}\right).$$
   - Choose the price with index
$$I_t(c_t) = \arg\max_{k \in \mathcal{K}} \bar{\mu}_{k,t}(c_t).$$
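In code, the decision rule amounts to inflating each revenue estimate by a UCB bonus and pivoting the profit line around its known zero at $(p_k, 0)$. A minimal sketch (my rendering, not the authors' code):

```python
import math

def ucb_price_index(c, t, prices, pi_hat, counts):
    """Choose the price index I_t(c_t) by maximizing the upper-bounded profit line.

    c:       cost observed in period t
    prices:  price levels p_1 < ... < p_K
    pi_hat:  average observed revenue of each price so far
    counts:  number of times each price has been offered, T_k(t-1)
    """
    return max(range(len(prices)), key=lambda k:
               (pi_hat[k] + math.sqrt(2 * math.log(t) / counts[k]))
               * (1 - c / prices[k]))
```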

Main Results

Theorem 1. If $K = 2$ and $\pi_1 > \pi_2$, the cumulative regret is bounded by
$$R[T] \le C_1 (\log T)^2,$$
where $C_1$ is a positive constant that depends on the configuration of $\mu_1(c)$ and $\mu_2(c)$.

The regret is mainly caused by inaccurate estimation of the intersection point of $\mu_1(c)$ and $\mu_2(c)$, and it comes from costs in the neighborhood of that intersection. For each $t$, the expected regret is bounded by a constant times $\frac{\log t}{t}$. The result can be extended to $K > 2$ under some conditions.

Illustration

[Figure not transcribed.]

Intuitions

The information learned at one value of the cost can also be used at other values of the cost. Our problem is significantly more difficult than the standard MAB problem, because the profits of different prices can be arbitrarily close, making the selection very difficult. Yet the regret is not much worse: $O\big((\log T)^2\big)$ compared to $O(\log T)$. The regret comes from inaccurate estimation of the intersection point, which causes wrong decisions.

A Special Case: $|\mathcal{C}| < \infty$

If $|\mathcal{C}| < \infty$, i.e., the cost takes only finitely many values, then at any cost $c \in \mathcal{C}$ there is a gap between $\mu_1(c)$ and $\mu_2(c)$. Then inaccurate estimation of the intersection point will not cause wrong decisions infinitely often. What is the implication?

A Constant Bound

Theorem 3. If $|\mathcal{C}| < \infty$ and none of the feasible prices is inferior (i.e., every price is optimal for some cost in $\mathcal{C}$), then there exists a constant $C_2$ such that
$$R[T] \le C_2.$$

One would expect the problem with varying cost to be more difficult than the one with constant cost. It is not! Because every price is the best choice for some costs, one does not have to conduct extra exploration on prices that do not currently look good.

Numerical Results

The cumulative regret (normalized) with respect to $T$, for $\mathcal{C} = \{1, 3\}$ and $\mathcal{C} = [0, 4]$.

Figure: $p_1 = 4.0$, $p_2 = 4.1$, $\pi_1 = 0.6$, $\pi_2 = 0.59$.
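The shape of this experiment can be approximated with a small simulation. The driver below reuses ucb_price_index from the earlier sketch and assumes Bernoulli revenue realizations with means $\pi_1$ and $\pi_2$ (an illustrative assumption; the slides do not specify the revenue distribution):

```python
import random

def simulate(prices, true_pi, draw_cost, horizon, seed=0):
    """Run the pricing policy and accumulate regret against the true optimum."""
    rng = random.Random(seed)
    K = len(prices)
    counts, pi_hat, regret = [0] * K, [0.0] * K, 0.0
    for t in range(1, horizon + 1):
        c = draw_cost(rng)
        mu = [pi * (1 - c / p) for pi, p in zip(true_pi, prices)]  # true profits
        k = t - 1 if t <= K else ucb_price_index(c, t, prices, pi_hat, counts)
        regret += max(mu) - mu[k]
        rev = float(rng.random() < true_pi[k])  # Bernoulli revenue in {0, 1}
        counts[k] += 1
        pi_hat[k] += (rev - pi_hat[k]) / counts[k]
    return regret

prices, true_pi = [4.0, 4.1], [0.6, 0.59]
print("C = {1, 3}:", simulate(prices, true_pi, lambda r: r.choice([1.0, 3.0]), 20_000))
print("C = [0, 4]:", simulate(prices, true_pi, lambda r: r.uniform(0.0, 4.0), 20_000))
```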

Q & A

Thank you!