EE266 Homework 5 Solutions


EE266, Spring 15-16, Professor S. Lall. EE266 Homework 5 Solutions

1. A refined inventory model. In this problem we consider an inventory model that is more refined than the one you've seen in the lectures. The amount of inventory at time t is denoted by q_t \in {0, 1, ..., C}, where C > 0 is the maximum capacity. New stock ordered at time t is u_t \in {0, 1, ..., C - q_t}. This new stock is added to the inventory instantaneously. The demand that arrives after the new inventory, before the next time period, is d_t \in {0, 1, ..., D}. The dynamics are

q_{t+1} = (q_t + u_t - d_t)_+.

The unmet demand is (q_t + u_t - d_t)_- = \max(-q_t - u_t + d_t, 0). The demands d_0, d_1, ... are IID.

We now describe the stage cost, which does not depend on t, as a sum of terms. The ordering cost is

g_{\text{order}}(u) = \begin{cases} 0 & u = 0 \\ p_{\text{fixed}} + p_{\text{whole}} u & 1 \le u \le u_{\text{disc}} \\ p_{\text{fixed}} + p_{\text{whole}} u_{\text{disc}} + p_{\text{disc}}(u - u_{\text{disc}}) & u > u_{\text{disc}} \end{cases}

where p_fixed is the fixed ordering price, p_whole is the wholesale price for ordering between one and u_disc units, and p_disc < p_whole is the discount price for any amount ordered above u_disc. The storage cost is g_store(q) = s_lin q + s_quad q^2, where s_lin and s_quad are positive. The negative revenue is g_rev(q, u, d) = -p_rev min(q + u, d), where p_rev > 0 is the retail price. The cost for unmet demand is g_unmet(q, u, d) = p_unmet (q + u - d)_-, where p_unmet > 0. The terminal cost is a salvage cost, which is g_sal(q) = -p_sal q, where p_sal > 0 is the salvage price.

We now consider a specific problem with

T = 5, C = , D = , q_0 = 1,

and demand distribution

Prob(d_t = 0, 1, ..., D) = (. , .5, .5, . , .1).

The cost function parameters are

p_fixed = , p_whole = , p_disc = 1., u_disc = , s_lin = .1, s_quad = .5, p_rev = 3, p_unmet = 3, p_sal = 1.5.

(a) Solve the MDP and report J*.

(b) Plot the optimal policy for several interesting values of t, and describe what you see. Does the optimal policy converge as t goes to zero? If so, give the steady-state optimal policy.

(c) Plot E g_order(u_t), E g_store(q_t), E g_rev(q_t, u_t, d_t), E g_unmet(q_t, u_t, d_t), and E g_sal(q_t), versus t, all on the same plot.

Solution:

(a) We solve the MDP using value iteration, which results in J* = V_0(1) = 1.39. We notice that the value function converges in shape at around t = (with an offset of around . per time period), and the policy converges at around t = 7.

(b) The policy converges at time t = 7 to that of ordering 7 units when the inventory is empty, and 5 units when we only have 1 unit left in the inventory. We have plotted the optimal policy for t = 9, t = , t = 7, and t = below.

[Figure: optimal policy plots at several values of t.]

(c) We formed the closed-loop Markov chain under the optimal policy mu_t, and evaluated the expected cost using distribution propagation. The plot is shown below.

[Figure: expected stage costs (order, storage, revenue, unmet, salvage) versus t.]
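For concreteness, the sketch below shows one way the value iteration of part (a) and the distribution propagation of part (c) could be implemented in Python. The horizon, capacity, demand pmf, and price parameters in it are assumed placeholder values, not the ones from the problem statement, so the numbers it produces will not match those reported above.

```python
import numpy as np

# Assumed problem data (placeholders, not the values from the problem statement).
T, C, D = 50, 20, 4
q0 = 1
p_fixed, p_whole, p_disc, u_disc = 4.0, 2.0, 1.0, 6
s_lin, s_quad = 0.1, 0.05
p_rev, p_unmet, p_sal = 3.0, 3.0, 1.5
d_pmf = np.array([0.2, 0.25, 0.25, 0.2, 0.1])   # Prob(d_t = 0, ..., D), assumed

def g_order(u):
    """Ordering cost: fixed cost plus wholesale price, with a discount above u_disc."""
    if u == 0:
        return 0.0
    if u <= u_disc:
        return p_fixed + p_whole * u
    return p_fixed + p_whole * u_disc + p_disc * (u - u_disc)

def stage_cost(q, u, d):
    """Ordering + storage - revenue + unmet-demand penalty."""
    return (g_order(u) + s_lin * q + s_quad * q**2
            - p_rev * min(q + u, d) + p_unmet * max(d - q - u, 0))

# Value iteration (backward recursion), with terminal salvage value -p_sal * q.
V = np.zeros((T + 1, C + 1))
V[T] = -p_sal * np.arange(C + 1)
mu = np.zeros((T, C + 1), dtype=int)            # optimal order quantity mu_t(q)
for t in range(T - 1, -1, -1):
    for q in range(C + 1):
        vals = []
        for u in range(C - q + 1):
            vals.append(sum(pd * (stage_cost(q, u, d) + V[t + 1, max(q + u - d, 0)])
                            for d, pd in enumerate(d_pmf)))
        mu[t, q] = int(np.argmin(vals))
        V[t, q] = vals[mu[t, q]]
print("J* = V_0(q0) =", V[0, q0])

# Distribution propagation under the optimal policy, as used for the
# expected stage costs in part (c).
pi = np.zeros(C + 1)
pi[q0] = 1.0
for t in range(T):
    pi_next = np.zeros(C + 1)
    for q in range(C + 1):
        u = mu[t, q]
        for d, pd in enumerate(d_pmf):
            pi_next[max(q + u - d, 0)] += pi[q] * pd
    pi = pi_next
```

Each expected stage cost in part (c) is then a sum over q of pi[q] times the corresponding cost term evaluated at the optimal action mu[t, q].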

2. Exiting the market with incomplete fulfillment. You have S identical stocks that you can sell in one of T periods. The price fluctuates randomly; we model this as the prices being IID from a known distribution. At each period, you are told the price, and then decide whether to sell one stock or wait. However, the market is not necessarily liquid, so if you decide to sell a stock, there is some chance that no buyer is willing to buy your stock. This random event is modeled as a Bernoulli random variable, with probability . that no buyer is willing to buy your stock, and probability . that you sell it. Note that you can decide to sell a maximum of one stock per period. Your goal is to maximize the expected revenue from the sales.

(a) Model this problem as a Markov decision problem that you can use to find the optimal selling policy: that is, give the set of states X, the set of actions U, and the set of disturbances W. Give the dynamics function f and the reward function g. There is no terminal reward. What is the information pattern of the policy?

(b) The stock prices are independent random variables following a discretized log-normal distribution, with the log-prices having mean  and standard deviation . . In particular, the prices take 15 values, ranging from . to . in increments of .1, and the probability mass function of p is proportional to the probability density function of the stated log-normal distribution. Compute the optimal policy and the optimal expected revenue for T = 5 periods and S = 1. Plot value functions at times t = , 5, , 9, 5, and policies at times t = , , , 5. What do you observe?

(c) Let us define a threshold policy in which you decide to sell only when the stock's price is greater than the expected value of the prices. With the same parameters as in question (b), compute the expected revenue of this threshold policy. What

can you notice in comparison to question (b)?

(d) For the optimal policy and the threshold policy described in (c), compute the probability that we have unsold stocks after the final time T.

(e) Find a policy for which the probability of having unsold stock after the final time T is less than .5. Compute the expected revenue of this policy, and compare it to the expected revenues of the two previous policies you calculated in this question.

Solution:

(a) The set of states X for this Markov decision problem represents the number of stocks we can own at each time step t, so X = {0, 1, 2, ..., S - 1, S} and x_t \in X for t = 0, ..., T. The stock's owner can only take two types of actions: they can decide to sell one stock, i.e., u_t = 1, or they can decide to wait, i.e., u_t = 0. Therefore U = {0, 1}. The information pattern of the policy is a split-w pattern, where we measure x and part of w before determining the chosen action u. Thus W = W^1 \times W^2, where W^1 contains all the possible prices that a stock can take (we know this information before taking action u), and W^2 = {0, 1}. Indeed, if w_t^2 = 0 then no buyer is willing to buy the stock, and if w_t^2 = 1, a buyer is willing to buy a stock at the current price w_t^1. The value w_t^2 is only known after the seller decides to sell or wait, i.e., after the action is chosen. Thus, the dynamics f of this problem are

x_{t+1} = f_t(x_t, u_t, w_t^2) = \begin{cases} x_t - w_t^2 & x_t > 0,\ u_t = 1, \\ x_t & \text{otherwise}, \end{cases}

and the reward function g for t = 0, ..., T - 1 is

g_t(x_t, u_t, w_t^1, w_t^2) = \begin{cases} w_t^1 & x_t > 0,\ u_t = 1,\ w_t^2 = 1, \\ 0 & x_t = 0,\ u_t = 1, \\ 0 & \text{otherwise}. \end{cases}

Finally, g_T = 0, as there is no terminal reward.
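A minimal sketch of the backward recursion for this split-w information pattern is given below: at each period the price w_t^1 is observed first, the action is chosen, and the sale then succeeds with the Bernoulli probability. The horizon, initial stock, price grid, log-price standard deviation, and sell probability used here are assumed values, not the (partly unspecified) ones from the problem statement.

```python
import numpy as np
from scipy.stats import lognorm

# Assumed parameters.
T, S = 50, 10          # horizon and initial number of stocks
p_sell = 0.8           # probability a buyer shows up when we attempt a sale
sigma = 0.4            # standard deviation of the log-price
prices = 0.2 + 0.1 * np.arange(15)      # 15 price levels (assumed grid)
pmf = lognorm.pdf(prices, s=sigma)      # pmf proportional to the log-normal density
pmf /= pmf.sum()

# Backward recursion: V[t, x] is the expected revenue-to-go with x stocks left.
V = np.zeros((T + 1, S + 1))            # V_T = 0: no terminal reward
sell = np.zeros((T, S + 1, len(prices)), dtype=bool)
for t in range(T - 1, -1, -1):
    for x in range(1, S + 1):           # with x = 0 nothing can be sold, V stays 0
        total = 0.0
        for i, (p, w) in enumerate(zip(prices, pmf)):
            hold = V[t + 1, x]
            try_sell = p_sell * (p + V[t + 1, x - 1]) + (1 - p_sell) * V[t + 1, x]
            sell[t, x, i] = try_sell > hold
            total += w * max(hold, try_sell)
        V[t, x] = total
print("optimal expected revenue:", V[0, S])
```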

(b) The optimal expected revenue for T = 5 and S = 1 is 11.3. The plots of the value function at times t = , 5, , 9, 5 are shown below.

The plots of the policies at times t = , , , 5 are shown below. Green means that the seller attempts to sell, and red means that the seller holds the stock.

We observe that the closer we get to T, the more the value function converges to a single value for every positive number of stocks remaining to sell. This seems logical, as there is no terminal reward at time T, so the seller has no interest in having any stock left at the terminal time T. We observe the same trend in the policy plots. In particular, we see that for a fixed number of stocks, the price threshold at which the policy is to sell decreases as the time t increases. Along the same lines, for a fixed price, the quantity threshold at which the policy is to sell decreases as the time t increases.

(c) Implementing the described threshold policy with the same parameters as in question (b), the expected revenue is 9.39, which is less than the expected revenue found in question (b). This is to be expected, as the policy computed in question (b) is optimal, whereas the policy used in this question is a heuristic.

(d) The probability that the seller has unsold stocks after the final time T is .3 with the optimal policy computed in question (b). This unsold-stock probability is .7 for the threshold policy computed in question (c).

(e) Implementing another threshold policy, in which the seller always decides to sell one stock if the price w_t^1 is greater than .7 (the second-lowest price), the probability of having unsold stocks at the final time T is .7 < .5. The expected revenue of this new threshold policy is 1., which is between the expected revenues found in questions (b) and (c). This new policy necessarily has a lower expected revenue than the one computed in question (b), because the policy in question (b) is optimal. However, at first glance it can be surprising that this risk-averse policy generates more revenue than the previous threshold policy; but comparing the relatively high probability of unsold stocks of .7 found in question (c) with the .7 found in question (d), it makes sense that reducing the risk of possessing unsold stocks at the final time T also increases the expected revenue.
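A threshold policy like the ones in parts (c) and (e) can be evaluated by propagating the distribution of the remaining stock, which gives both the expected revenue and the probability of unsold stock at time T. The sketch below continues the previous one (it reuses the assumed prices, pmf, p_sell, T, and S); evaluate_threshold is a hypothetical helper written for this sketch, not part of the course code.

```python
import numpy as np

def evaluate_threshold(threshold, prices, pmf, p_sell, T, S):
    """Expected revenue and P(unsold stock at time T) for the policy that
    attempts to sell one stock whenever the quoted price exceeds `threshold`."""
    dist = np.zeros(S + 1)          # distribution of the remaining stock x_t
    dist[S] = 1.0
    exp_rev = 0.0
    for _ in range(T):
        new = np.zeros(S + 1)
        for x, px in enumerate(dist):
            if px == 0.0:
                continue
            for p, w in zip(prices, pmf):
                if x > 0 and p > threshold:
                    exp_rev += px * w * p_sell * p      # revenue if the sale succeeds
                    new[x - 1] += px * w * p_sell
                    new[x] += px * w * (1 - p_sell)
                else:
                    new[x] += px * w                    # hold, or nothing left to sell
        dist = new
    return exp_rev, 1.0 - dist[0]

# Part (c): threshold at the mean price; part (e): a lower, more risk-averse threshold.
rev_c, p_unsold_c = evaluate_threshold(float(prices @ pmf), prices, pmf, p_sell, T, S)
```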

3. Appliance scheduling with fluctuating real-time prices. An appliance has C cycles, c = 1, ..., C, that must be run, in order, in T \ge C time periods, t = 0, ..., T - 1. A schedule consists of a sequence t_1 < ... < t_C \le T - 1, where t_c is the time period in which cycle c is run. Each cycle c uses a (known) amount of energy e_c > 0, c = 1, ..., C, and, in each period t, there is an energy price p_t. The total energy cost is then J = \sum_{c=1}^{C} e_c p_{t_c}. In the lecture on deterministic finite-state control, we considered an example of this type of problem, where the prices are known ahead of time. Here, however, we assume that the prices are independent log-normal random variables, with known means, \bar p_t, and variances, \sigma_t^2, t = 0, ..., T - 1. You can think of \bar p_t as the predicted energy price (say, from historical data), and p_t as the actual realized real-time energy price. The following questions pertain to the specific problem instance defined in appliance_sched_data.json.

(a) Minimum mean cost schedule. Find the schedule that minimizes E J. Give the optimal value of E J, and show a histogram of J (using Monte Carlo simulation). Here you do not know the real-time prices; you only know their distributions.

(b) Optimal policy with real-time prices. Now suppose that right before each time period t, you are told the real-time price p_t, and then you can choose whether or not to run the next cycle in time period t. (If you have already run all cycles, there is nothing you can do.) Find the optimal policy, \mu*. Find the optimal value of E J, and compare it to the value found in part (a). Give a histogram of J. You may use Monte Carlo (or simple numerical integration) to evaluate any integrals that appear in your calculations. For simulations, the following facts will be helpful: If z ~ N(\mu, \sigma^2), then w = exp z is log-normal with mean \bar\mu and variance \bar\sigma^2 given by

\bar\mu = e^{\mu + \sigma^2/2}, \qquad \bar\sigma^2 = (e^{\sigma^2} - 1) e^{2\mu + \sigma^2}.

We can solve these equations for

\mu = \log\left( \frac{\bar\mu^2}{\sqrt{\bar\mu^2 + \bar\sigma^2}} \right), \qquad \sigma^2 = \log(1 + \bar\sigma^2/\bar\mu^2).

Solution:

(a) We use state variable x_t \in {0, ..., C}, where x_t is the number of cycles run prior to time period t, so we start with x_0 = 0 and we require x_T = C. The action u_t \in {0, 1} indicates whether or not we run a cycle at time t; we require u_t = 0 when x_t = C. The state transition function is then x_{t+1} = x_t + u_t. The stage cost is

g_t(x_t, u_t) = \begin{cases} e_{x_t+1} \bar p_t & u_t = 1,\ x_t \ne C \\ +\infty & u_t = 1,\ x_t = C \\ 0 & \text{otherwise}, \end{cases}

for t = 0, ..., T - 1, and the terminal cost is

g_T(x_T) = \begin{cases} 0 & x_T = C \\ +\infty & \text{otherwise}. \end{cases}

This is a deterministic finite-state control problem. The dynamic programming iteration has the form

V_t(x_t) = \min\big( g_t(x_t, 0) + V_{t+1}(x_t),\ g_t(x_t, 1) + V_{t+1}(x_t + 1) \big),

which we can write as

V_t(x_t) = \begin{cases} \min\big( V_{t+1}(x_t),\ e_{x_t+1} \bar p_t + V_{t+1}(x_t + 1) \big) & x_t \ne C \\ V_{t+1}(x_t) & x_t = C. \end{cases}

The dynamic programming recursion is initialized with V_T(x) = g_T(x). For the given problem instance, we obtain an optimal value of E J = 9., which is achieved by the schedule (1, , 3, , 5, , 7, , 3, ). A histogram of J under this policy is shown below.
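For reference, a minimal sketch of this deterministic DP is given below. The JSON field names "e" and "p_bar" are assumptions about the layout of appliance_sched_data.json, not its actual format.

```python
import json
import numpy as np

# Load the problem data; the field names "e" and "p_bar" are assumptions.
with open("appliance_sched_data.json") as f:
    data = json.load(f)
e = np.asarray(data["e"], dtype=float)          # energy per cycle, e_1, ..., e_C
p_bar = np.asarray(data["p_bar"], dtype=float)  # mean prices, t = 0, ..., T-1
C, T = len(e), len(p_bar)

INF = float("inf")
V = np.full((T + 1, C + 1), INF)
V[T, C] = 0.0                                   # terminal cost: all C cycles must be done
run = np.zeros((T, C + 1), dtype=bool)
for t in range(T - 1, -1, -1):
    for x in range(C + 1):
        wait = V[t + 1, x]
        do = e[x] * p_bar[t] + V[t + 1, x + 1] if x < C else INF
        run[t, x] = do < wait
        V[t, x] = min(wait, do)

# Recover the optimal schedule by running the policy forward from x_0 = 0.
x, schedule = 0, []
for t in range(T):
    if run[t, x]:
        schedule.append(t)
        x += 1
print("E[J] =", V[0, 0], " schedule:", schedule)
```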

(b) Here we have a stochastic control problem in which we know the disturbance before determining the action. The stage cost is identical to the cost in part (a), with the mean price \bar p_t replaced by the real-time price p_t. The DP iteration has the form

V_t(x_t) = \begin{cases} E \min\big( V_{t+1}(x_t),\ e_{x_t+1} p_t + V_{t+1}(x_t + 1) \big) & x_t \ne C, \\ V_{t+1}(x_t) & x_t = C, \end{cases}

where the expectation is taken with respect to p_t. The optimal policy has the form

\mu_t(x_t, p_t) = \begin{cases} 1 & e_{x_t+1} p_t + V_{t+1}(x_t + 1) \le V_{t+1}(x_t), \\ 0 & \text{otherwise}. \end{cases}

This has a very nice interpretation: it says we should run the next cycle if the energy price is cheaper than a threshold,

p_t \le \frac{V_{t+1}(x_t) - V_{t+1}(x_t + 1)}{e_{x_t+1}}.

We can evaluate the expectation in the iteration for V_t analytically, using the formula for the CDF of a log-normal variable, or (more simply) by Monte Carlo simulation. (Our code does the latter.) We obtain an optimal value of E J = 7.1, which is less than the average cost found in part (a).

[Figure: histograms of J under the minimum mean cost schedule and under the optimal policy with real-time prices.]
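The recursion for part (b) can be approximated by sampling the real-time prices, using the mean/variance conversion given in the problem statement. The sketch below continues the previous one (it reuses data, e, p_bar, C, and T); the "sigma2" field name and the sample count are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
sigma2 = np.asarray(data["sigma2"], dtype=float)   # price variances; field name assumed

# Convert the (mean, variance) of each log-normal price to the parameters of
# the underlying normal, using the formulas from the problem statement.
mu_n = np.log(p_bar**2 / np.sqrt(p_bar**2 + sigma2))
sig_n = np.sqrt(np.log(1.0 + sigma2 / p_bar**2))

N = 10_000                                         # Monte Carlo samples per period
V = np.zeros((T + 1, C + 1))
V[T, :C] = np.inf                                  # infeasible: not all cycles were run
for t in range(T - 1, -1, -1):
    p_samples = rng.lognormal(mu_n[t], sig_n[t], N)
    for x in range(C + 1):
        if x == C:
            V[t, x] = V[t + 1, C]
        else:
            # After seeing p_t, run the next cycle iff that is cheaper than waiting.
            best = np.minimum(V[t + 1, x], e[x] * p_samples + V[t + 1, x + 1])
            V[t, x] = best.mean()
print("E[J] with real-time prices:", V[0, 0])
```

Simulating the closed-loop system with fresh price samples, and recording the realized cost on each run, gives the histogram of J under this policy.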