Lecture 5 January 30


EE 223: Stochastic Estimation and Control, Spring 2007
Lecture 5, January 30
Lecturer: Venkat Anantharam    Scribe: Maryam Kamgarpour

5.1 Secretary Problem

The problem set-up is explained in Lecture 4. We review the notation and then study the optimal solution.

Notation

Let M be the total number of secretaries. The set-up is over the duration of time 0 through N, where N = M + 1. For the state space we have x_0 ∈ {0}, a dummy state; x_k ∈ {(∗, k, 1), (∗, k, 0), (k, 1), (k, 0)} for k = 1, ..., N−1; and x_N ∈ {T}, a terminal state. In the above, ∗ indicates that a secretary was picked earlier, k refers to the index of the secretary currently being considered, (∗, k, 1) (resp. (∗, k, 0)) means the secretary picked is the best (resp. not the best) of the k secretaries seen so far, and (k, 1) (resp. (k, 0)) means that the secretary currently being considered is the best (resp. not the best) of the k secretaries seen so far.

The possible control actions at each non-terminal state are: u = 0, which for non-∗ states means "pick the current secretary" and leads to a ∗ state; and u = 1, which for non-∗ states means "don't pick the current secretary" and leads to a non-∗ state. In ∗ states, the control action is irrelevant. The problem can be put into our canonical framework via independent {0, 1}-valued random variables w_0, w_1, ..., w_{N−1}, as discussed in Lecture 4.

DP Recursion

The DP recursion evaluates the reward-to-go function:

J_N(T) = 0.

J_k(∗, k, 1) = k/M, k = 1, ..., N−1. This is the probability that the secretary who was picked, and who happens to be the best among the first k secretaries (this is what it means to be in state (∗, k, 1)), is actually the best overall.

J_k(∗, k, 0) = 0, k = 1, ..., N−1. This is because if the secretary that was picked is not the best among the first k secretaries, he or she cannot possibly be the best overall.

J_k(k, 1) = max{ k/M, (1/(k+1)) J_{k+1}(k+1, 1) + (k/(k+1)) J_{k+1}(k+1, 0) }, where 1/(k+1) is the probability that the secretary at time k+1 is better than the current best secretary at time k, and hence better than all previous ones. In this maximum, the first term corresponds to the choice u = 0 of picking the current secretary, and the second term corresponds to the choice u = 1 of deciding to keep interviewing secretaries.

J_k(k, 0) = max{ 0, (1/(k+1)) J_{k+1}(k+1, 1) + (k/(k+1)) J_{k+1}(k+1, 0) }. Here again, the first term in the maximum corresponds to the choice u = 0 of picking the current secretary, and the second term corresponds to the choice u = 1 of deciding to keep interviewing secretaries.

J_0(0) = max{ 0, J_1(1, 1) }. To understand the second term in the max, note that the first secretary seen will always be the best so far.

Observations

1. In state (k, 0) the control u = 1 is an optimizer. This can be seen from the update equation for J_k(k, 0) by noting that the reward-to-go functions are nonnegative. The intuitive meaning of this observation is that if the current secretary is not the best so far, you won't gain anything by choosing this person, but you may still have a chance of choosing the best one if you play along. In fact u = 1 can be seen to be the unique optimizer in state (k, 0) for 1 ≤ k ≤ N−2, while in state (N−1, 0) either control action is an optimizer.

2. If J_k(k, 1) > k/M, then J_{k−1}(k−1, 1) > (k−1)/M. Derivation:

J_k(k, 1) > k/M
⇒ (1/(k+1)) J_{k+1}(k+1, 1) + (k/(k+1)) J_{k+1}(k+1, 0) > k/M
⇒ J_k(k, 0) > k/M
⇒ J_{k−1}(k−1, 1) = max{ (k−1)/M, (1/k) J_k(k, 1) + ((k−1)/k) J_k(k, 0) } ≥ (1/k) J_k(k, 1) + ((k−1)/k) J_k(k, 0) > k/M > (k−1)/M.

This result confirms the intuition that if u = 1 (don't pick the current secretary) is an optimizer in state (k, 1), it must also have been an optimizer in states (l, 1) for all l ≤ k.

3. Based on the above, the optimal strategy is of threshold type: there exists some threshold time L such that one lets the first L−1 secretaries go by and then picks the first secretary after that who is the best seen so far. Hence, the optimal Markov strategy is of the following type:
1. If the state is 0, choose u = 1.
2. If the current state is (k, 0), choose u = 1, k = 1, ..., N−1.
3. If the current state is (k, 1) and k < L, choose u = 1. If the current state is (k, 1) and k ≥ L, choose u = 0.

Evaluating the Threshold

We look for L to maximize

Σ_{k=L}^{M} P(kth secretary is the best and you have selected this person) = Σ_{k=L}^{M} (1/M) (L−1)/(k−1) = ((L−1)/M) ( 1/(L−1) + 1/L + ... + 1/(M−1) ).

To understand the expression (1/M) (L−1)/(k−1) that is the k-th term in the summation above, L ≤ k ≤ M, note that 1/M is the probability that the kth secretary is the absolute best, and that if we condition on this event then the relative ordering of all the other secretaries is uniformly distributed. With the threshold strategy we end up picking the absolute best secretary precisely if at times L through k−1 we are not fooled into picking the current best secretary. Since (L−1)/(k−1) is the probability that the best among secretaries 1, ..., k−1 occurred at one of the times 1, ..., L−1, this is precisely the conditional probability that we are not fooled.

Now consider M → ∞. Define x := (L−1)/M. The above summation approaches

x ∫_x^1 (1/t) dt = −x log_e x,

which is maximized at x = 1/e. Hence, as the number of secretaries increases, the optimal strategy is to let a fraction 1/e of them go by and then pick the first one after that who is the best seen so far.
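The recursion and the threshold structure above are easy to check numerically. The following sketch is not part of the original notes; it is a plain Python rendering of the backward recursion, with J1 and J0 standing for J_k(k, 1) and J_k(k, 0) (the states (∗, k, ·) have the closed-form rewards k/M and 0 and need no recursion). It reports the threshold L and the optimal success probability, which approach M/e and 1/e as M grows.

```python
import math

def secretary_dp(M):
    """Backward induction for the secretary problem with M secretaries.
    Returns the optimal success probability J_0(0) and the threshold L."""
    # At k = M (the last secretary) the continuation value is J_N(T) = 0, so
    # J_M(M, 1) = max{M/M, 0} = 1 and J_M(M, 0) = 0.
    J1_next, J0_next = 1.0, 0.0
    L = M
    for k in range(M - 1, 0, -1):
        cont = J1_next / (k + 1) + J0_next * k / (k + 1)   # value of u = 1 (keep interviewing)
        if k / M >= cont:          # picking a best-so-far secretary is optimal at time k;
            L = k                  # by Observation 2 this set of times is an interval {L, ..., M}
        J1_next, J0_next = max(k / M, cont), max(0.0, cont)
    return max(0.0, J1_next), L    # J_0(0) = max{0, J_1(1, 1)}

if __name__ == "__main__":
    for M in (10, 100, 1000):
        value, L = secretary_dp(M)
        print(f"M = {M:4d}: value {value:.4f} (1/e = {1/math.e:.4f}), "
              f"threshold L = {L} (M/e = {M/math.e:.1f})")
```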

Summary

This problem indicates how to set up a problem as a DP problem. It illustrates that optimal strategies can be found within a small class of strategies (here, threshold strategies), and that once you determine this class it is relatively easy to find an actual optimal strategy. This is typical of how dynamic programming is used in practice. Here the optimal strategy within the identified class was also found analytically, but in practice you may use simulation and numerical techniques to find the best strategy within the class (after having identified which class of strategies to work with through analysis of the dynamic programming recursion).

We now turn to another example. The point is to illustrate the importance of correctly modeling a real-world problem.

5.2 Asset Selling Problem

This problem is discussed in the textbook, Section 4.4. The set-up is:
1. You have an asset that you would like to sell, e.g. a house with a Bay view.
2. You receive N offers, w_0, ..., w_{N−1}, one after another, modeled as i.i.d. with a known distribution.
3. If you accept an offer, you invest the cash at an interest rate r until the end of the process, at time N. If you reject an offer, it is gone once and for all.

Objective: maximize the expected reward at the end of the process. Note that this problem can be solved directly without using DP, but we will use a DP approach.

State Space

x_0 ∈ {0}, a dummy state; x_1 ∈ {w_0}; x_k ∈ {w_{k−1}, T} for k = 2, ..., N. At time 0 you move from the dummy state to x_1. At each time 1 ≤ k ≤ N−1, there are two control actions: either accept the current offer w_{k−1} and move to the terminal state, or keep going. If you reach a nonterminal state at time N you are looking at the last offer x_N = w_{N−1} and you have to accept it (this is not treated as a control action).

Note that, in contrast to our discussion of the secretary problem, we are abusing notation by not carrying the notion of time in the terminal state. We attribute the reward of terminating (including the investment gain) to the time at which we choose to accept an offer, so that moving between terminal states from one time to the next earns zero reward; there is then no point in distinguishing between terminal states at different times.

DP Recursion

J_k(T) = 0 for all 1 ≤ k ≤ N.

J_k(x_k) = max{ (1+r)^{N−k} x_k, E[J_{k+1}(w_k)] } for 0 ≤ k ≤ N−1, where x_k ≠ T. Here the maximization is taken over the two possible control actions. To understand this equation, note that for 1 ≤ k ≤ N−1 the decision to accept the offer x_k = w_{k−1} allows you to invest it for N−k time steps; this reward is paid up front and you move to the terminal state, whose reward-to-go is 0. The decision to reject the offer moves you to state w_k at time k+1; you get no immediate reward, and the expected reward-to-go is E[J_{k+1}(w_k)].

J_N(x_N) = x_N for x_N ≠ T. To understand this equation, note that we assume you have to accept the last offer if you have not yet accepted any offer, so we treat the resulting reward (with no investment gain, since there is no time left to invest) as a reward in the final state.

Observations

1. An optimal strategy is given by a moving threshold. The strategy is given, for 1 ≤ k ≤ N−1, by:
accept the offer x_k if x_k > α_k;
reject the offer x_k if x_k < α_k,
where α_k = E[J_{k+1}(w_k)] / (1+r)^{N−k}. In case x_k = α_k, both decisions result in the same reward. Note that α_k is decreasing in k. This requires proof, and the proof is in the book, but the intuition is that as k increases there is less chance of seeing a better offer later. Hence, if an offer is good enough to be accepted at time k, it should also be acceptable at time k+1.

2. Why did we bother to discuss this example in class? Let's compare this problem to the secretary problem. In many ways it describes the same kind of situation: you must pick one of N options which are offered to you in sequence, and if you reject an offer you can never go back to it. However, the nature of the optimal strategy in the asset selling problem (a moving threshold) is very different from that in the secretary problem (let roughly a fraction 1/e of the offers go by and then pick the next one that is best so far). This seems odd. The reason is that the model is different in the two cases. Contrary to the secretary problem, here we know the distribution of the offers, hence we have some absolute notion of how good they are. Moreover, there is a reward associated with accepting each offer, not just the best offer.

3. The message is that the model is very important. Unless you model the problem well, you don't know what you are getting. As in all engineering: junk in, junk out.
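As a numerical illustration of the moving threshold (not from the original notes), the sketch below computes α_k by backward induction. The offer distribution, horizon, and interest rate are arbitrary choices for the example: offers uniform on {0, 1, ..., 100}, N = 20, r = 0.05. Since J_k(x) = max{(1+r)^{N−k} x, E[J_{k+1}(w)]}, only the scalar E[J_{k+1}(w)] needs to be propagated.

```python
import numpy as np

def selling_thresholds(N=20, r=0.05):
    """Backward induction for the acceptance thresholds alpha_k, assuming the
    offers w_k are i.i.d. uniform on {0, 1, ..., 100} (an illustrative choice)."""
    offers = np.arange(0, 101, dtype=float)
    probs = np.full(len(offers), 1.0 / len(offers))
    # E[J_N(w)] = E[w]: the last offer must be accepted, with no time left to invest.
    expected_next = float(probs @ offers)
    alpha = {}
    for k in range(N - 1, 0, -1):
        growth = (1 + r) ** (N - k)            # value of investing an accepted offer until time N
        alpha[k] = expected_next / growth      # accept offer x_k iff x_k > alpha[k]
        # E[J_k(w)] = E[max{(1+r)^(N-k) w, E[J_{k+1}(w)]}] feeds the threshold at time k-1
        expected_next = float(probs @ np.maximum(growth * offers, expected_next))
    return alpha

if __name__ == "__main__":
    alpha = selling_thresholds()
    for k in sorted(alpha):
        print(f"k = {k:2d}   alpha_k = {alpha[k]:7.2f}")   # alpha_k decreases as k increases
```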

5.3 Warehouse Restocking Problem

This problem is also in the book, in Section 4.2. Its importance is that it illustrates another general, widely used methodology for deriving qualitative properties of optimal strategies in problems amenable to the DP approach.

The set-up is: You have a warehouse. At each time k you observe a random demand w_k and you must place a restocking order u_k. We assume that u_k ≥ 0. Let x_k denote the amount of supplies in the warehouse at time k. Then

x_{k+1} = x_k + u_k − w_k, k = 0, ..., N−1.

Here we allow x_k to be an arbitrary real number, with the convention that x_k < 0 denotes borrowing from somebody else. The objective is to minimize the cost

E{ Σ_{k=0}^{N−1} ( r(x_k) + c u_k ) + R(x_N) },

where r(x_k) is a piecewise linear function: for x_k > 0 it comes from a penalty per unit amount for keeping supplies in the warehouse, and for x_k < 0 it comes from a penalty per unit amount for borrowing from someone else; R(x_N) is similar. Thus we take r(x_k) = p max(0, −x_k) + h max(0, x_k), i.e. a piecewise linear cost with slope h when the supply is positive and slope p when the supply is negative, and the same for R(x_N). Further, c denotes the cost per unit amount of restocking.

DP Recursion

J_N(x_N) = 0.

J_k(x_k) = min_{u ≥ 0} E{ c u + r(x_k + u − w_k) + J_{k+1}(x_k + u − w_k) },

where the minimization is taken over all possible controls at time k. In this problem we will show by induction that J_k(x_k) is a nonnegative convex function which approaches ∞ as x_k → ±∞. From this the optimal solution is derived. This property of J_k(x_k) will be proved by backwards induction, starting with J_{N−1}(x_{N−1}). We will look at this example in more detail in the next lecture.

Observation

Often one can identify qualitative properties of the optimal cost-to-go functions, for example convexity, monotonicity, or multimodularity, and prove that these hold by backwards induction. Such properties can then show that the optimizing control strategies lie in some class of strategies, for example threshold strategies, time-varying threshold strategies, strategies based on some index rule, or strategies based on some threshold function, and hence one can determine optimal strategies for the problem at hand.
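As a preview of the structure that the convexity argument will justify, here is a rough numerical sketch, again not part of the notes: the costs, demand distribution, horizon, and state grid are arbitrary assumptions, and next-period states are clipped to the grid, so boundary values are only approximate. It runs the recursion above on a small integer grid; the computed policy exhibits the "order up to a fixed level" threshold form suggested by the convexity of J_k.

```python
import numpy as np

c, p, h = 1.0, 4.0, 1.0                    # per-unit order cost, shortage penalty, holding cost (assumed)
demand_vals = np.array([0, 1, 2, 3, 4])    # support of the i.i.d. demand w_k (assumed)
demand_probs = np.array([0.1, 0.2, 0.4, 0.2, 0.1])
states = np.arange(-10, 21)                # stock levels x_k; x_k < 0 means borrowing
orders = np.arange(0, 16)                  # admissible restocking orders u >= 0
N = 8                                      # horizon

def r_cost(y):
    """Piecewise-linear stage cost r(y) = p*max(0, -y) + h*max(0, y)."""
    return p * np.maximum(0.0, -y) + h * np.maximum(0.0, y)

def lookup(J, x):
    """Evaluate the cost-to-go at next states, clipping them to the grid (a crude boundary fix)."""
    return J[np.clip(x - states[0], 0, len(states) - 1)]

J = np.zeros(len(states))                  # J_N = 0, as in the recursion above
for k in range(N - 1, -1, -1):
    J_new = np.empty_like(J)
    policy = np.empty(len(states), dtype=int)
    for i, x in enumerate(states):
        nxt = x + orders[:, None] - demand_vals[None, :]                  # next stock for each (u, w)
        q = c * orders + (r_cost(nxt) + lookup(J, nxt)) @ demand_probs    # E[cu + r(x+u-w) + J_{k+1}(x+u-w)]
        best = int(np.argmin(q))
        J_new[i], policy[i] = q[best], orders[best]
    J = J_new

# The time-0 policy orders up to (roughly) a fixed level whenever the stock is below it.
for x, u in zip(states, policy):
    print(f"x_0 = {x:3d}   optimal order u = {u:2d}")
```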