CS360 Homework 14 Solution


Markov Decision Processes

1) Invent a simple Markov decision process (MDP) with the following properties: a) it has a goal state, b) its immediate action costs are all positive, c) all of its actions can result with some probability in the start state, and d) the optimal policy without discounting differs from the optimal policy with discounting and a discount factor of 0.9. Prove d) using value iteration.

(MDP figure, described in words: S1 is the start state and S3 the goal state. In S1, action a1 costs 100 and reaches S3 with probability 0.999, returning to S1 with probability 0.001; action a2 costs 90 and reaches S2 with probability 0.999, returning to S1 with probability 0.001. In S2, the only action a3 costs 11 and reaches S3 with probability 0.999, returning to S1 with probability 0.001.)

With no discount factor, action a1 is preferred over action a2 in state S1 (the table lists the expected cost of each action after iteration i):

i        0    1      2         3            4             ...
S1, a1   0    100    100.09    100.10009    100.1001001
S1, a2   0    90     101.079   101.179      101.18909
S2, a3   0    11     11.09     11.10009     11.10010009
S3       0    0      0         0            0

With a discount factor of 0.9, action a2 is preferred over action a1 in state S1:

i        0    1      2         3            4             ...
S1, a1   0    100    100.081   100.089974   100.0900476
S1, a2   0    90     99.9711   100.0529011  100.0610432
S2, a3   0    11     11.081    11.08997399  11.09004761
S3       0    0      0         0            0
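As a cross-check, here is a small value-iteration program written in the style of the C code given for problem 3 below. It is only a sketch under the assumptions spelled out in the figure description above (costs 100, 90 and 11, success probability 0.999); the variable names and the iteration count are invented for this illustration. With DISCOUNT set to 1.0 it reproduces the first table, and with 0.9 the second.

#include <stdio.h>

#define min(x, y) (((x) < (y)) ? (x) : (y))
#define DISCOUNT 0.9              /* set to 1.0 for the undiscounted case */

int main(void)
{
    double s1 = 0.0, s2 = 0.0, s3 = 0.0;   /* current state values; S3 is the goal */
    int i;

    for (i = 1; i <= 10; i++) {
        /* expected cost of each action: immediate cost + discounted successor values */
        double a1 = 100.0 + DISCOUNT * (0.999 * s3 + 0.001 * s1);
        double a2 =  90.0 + DISCOUNT * (0.999 * s2 + 0.001 * s1);
        double a3 =  11.0 + DISCOUNT * (0.999 * s3 + 0.001 * s1);
        printf("%d: a1=%.7f a2=%.7f a3=%.7f\n", i, a1, a2, a3);
        s1 = min(a1, a2);         /* S1 takes the cheaper of a1 and a2 */
        s2 = a3;                  /* S2 has only one action            */
        /* s3 stays 0: no further cost is incurred in the goal state   */
    }
    return 0;
}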

2) Consider the following problem (with thanks to V. Conitzer): Consider a rover that operates on a slope and uses solar panels to recharge. It can be in one of three states: high, medium and low on the slope. If it spins its wheels, it climbs the slope in each time step (from low to medium or from medium to high) or stays high. If it does not spin its wheels, it slides down the slope in each time step (from high to medium or from medium to low) or stays low. Spinning its wheels uses one unit of energy per time step. Being high or medium on the slope gains three units of energy per time step via the solar panels, while being low on the slope does not gain any energy per time step. The rover wants to gain as much energy as possible. a) Draw the MDP graphically. b) Solve the MDP using value iteration with a discount factor of 0.8. c) Describe the optimal policy.

(MDP figure, described in words, where L = low, M = medium and H = high: "spin" moves L to M and M to H and keeps H at H, with immediate rewards -1, 2 and 2, respectively; "don't spin" keeps L at L and moves M to L and H to M, with immediate rewards 0, 3 and 3, respectively.)

Starting with 0 as initial values, value iteration calculates the following:

ITR   L:spin   L:don't   M:spin   M:don't   H:spin   H:don't
 1    -1.00     0.00*     2.00     3.00*     2.00     3.00*
 2     1.40*    0.00      4.40*    3.00      4.40     5.40*
 3     2.52*    1.12      6.32*    4.12      6.32     6.52*
 4     4.06*    2.02      7.22*    5.02      7.22     8.06*
 5     4.77*    3.24      8.44*    6.24      8.44     8.77*
 6     5.76*    3.82      9.02*    6.82      9.02     9.76*
 7     6.21*    4.60      9.80*    7.60      9.80    10.21*
 8     6.84*    4.97     10.17*    7.97     10.17    10.84*
 9     7.14*    5.47     10.67*    8.47     10.67    11.14*
10     7.54*    5.71     10.91*    8.71     10.91    11.54*
...
20     8.64*    6.88     12.08*    9.88     12.08    12.64*
...
28     8.76*    7.00     12.20*   10.00     12.20    12.76*
29     8.76*    7.00     12.20*   10.00     12.20    12.76*

At each iteration, the value of a state is the value of the maximizing action in that state (since we are trying to maximize energy) and is marked with an asterisk. For instance, in iteration 4, the value of L, v4(L), is computed as follows:

v4(L, spin)  = 0.8 * v3(M) - 1 = 0.8 * 6.32 - 1 ≈ 4.06
v4(L, don't) = 0.8 * v3(L) + 0 = 0.8 * 2.52     ≈ 2.02
v4(L) = max(v4(L, spin), v4(L, don't)) = 4.06

The optimal policy is to spin its wheels when the rover is low or medium on the slope and not to spin them when it is high on the slope.

Now answer the three questions above for the following variant of the rover problem: If it spins its wheels, it climbs the slope in each time step (from low to medium or from medium to high) or stays high, all with probability 0.3. It stays where it is with probability 0.7. If it does not spin its wheels, it slides down the slope to low with probability 0.4 and stays where it is with probability 0.6. Everything else remains unchanged from the previous problem.

(MDP figure, described in words, where L = low, M = medium and H = high: "spin" moves L to M and M to H with probability 0.3 and stays put with probability 0.7, and keeps H at H; its immediate rewards are -1, 2 and 2, respectively. "Don't spin" moves M and H to L with probability 0.4 and stays put with probability 0.6, and keeps L at L; its immediate rewards are 0, 3 and 3, respectively.)

Starting with 0 as initial values, value iteration calculates the following:

ITR   L:spin   L:don't   M:spin   M:don't   H:spin   H:don't
 1    -1.00     0.00*     2.00     3.00*     2.00     3.00*
 2    -0.28     0.00*     4.40     4.44*     4.40     4.44*
 3     0.07*    0.00      5.55*    5.13      5.55*    5.13
 4     0.37*    0.05      6.44*    5.69      6.44*    5.69
 5     0.75*    0.30      7.15*    6.21      7.15*    6.21
 6     1.14*    0.60      7.72*    6.67      7.72*    6.67
 7     1.49*    0.91      8.18*    7.07      8.18*    7.07
 8     1.80*    1.19      8.54*    7.40      8.54*    7.40
 9     2.06*    1.44      8.83*    7.68      8.83*    7.68
10     2.27*    1.65      9.07*    7.90      9.07*    7.90
...
20     3.08*    2.45      9.90*    8.72      9.90*    8.72
...
30     3.17*    2.53      9.99*    8.81      9.99*    8.81
31     3.17*    2.54      9.99*    8.81      9.99*    8.81
32     3.17*    2.54      9.99*    8.81      9.99*    8.81

In this variant, in iteration 4, the value of L, v4(L), is computed as follows:

v4(L, spin)  = 0.8 * (0.3 * v3(M) + 0.7 * v3(L)) - 1 = 0.8 * (0.3 * 5.55 + 0.7 * 0.07) - 1 ≈ 0.37
v4(L, don't) = 0.8 * v3(L) + 0 = 0.8 * 0.07 = 0.056
v4(L) = max(v4(L, spin), v4(L, don't)) = 0.37

The optimal policy is to spin the wheels wherever the rover is on the slope.
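The following C program reproduces the table for this variant. It is only a sketch under the stated assumptions (discount factor 0.8 and the transition probabilities and rewards above); the variable names and the fixed iteration count are invented for this illustration.

#include <stdio.h>

#define max(x, y) (((x) > (y)) ? (x) : (y))

int main(void)
{
    double l = 0.0, m = 0.0, h = 0.0;   /* current values of low, medium, high */
    int i;

    for (i = 1; i <= 32; i++) {
        /* action values: immediate reward + 0.8 * expected successor value */
        double l_spin = -1 + 0.8 * (0.3 * m + 0.7 * l);
        double l_dont =  0 + 0.8 * l;
        double m_spin =  2 + 0.8 * (0.3 * h + 0.7 * m);
        double m_dont =  3 + 0.8 * (0.4 * l + 0.6 * m);
        double h_spin =  2 + 0.8 * h;
        double h_dont =  3 + 0.8 * (0.4 * l + 0.6 * h);
        printf("%d: %.2f %.2f  %.2f %.2f  %.2f %.2f\n",
               i, l_spin, l_dont, m_spin, m_dont, h_spin, h_dont);
        l = max(l_spin, l_dont);        /* take the maximizing action in each state */
        m = max(m_spin, m_dont);
        h = max(h_spin, h_dont);
    }
    return 0;
}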

3) Consider the following problem (with thanks to V. Conitzer): Consider a rover that operates on a slope. It can be in one of four states: top, high, medium and low on the slope. If it spins its wheels slowly, it climbs the slope in each time step (from low to medium or from medium to high or from high to top) with probability 0.3. It slides down the slope to low with probability 0.7. If it spins its wheels rapidly, it climbs the slope in each time step (from low to medium or from medium to high or from high to top) with probability 0.5. It slides down the slope to low with probability 0.5. Spinning its wheels slowly uses one unit of energy per time step. Spinning its wheels rapidly uses two units of energy per time step. The rover is low on the slope and aims to reach the top with minimum expected energy consumption. a) Draw the MDP graphically. b) Solve the MDP using undiscounted value iteration (that is, value iteration with a discount factor of 1). c) Describe the optimal policy.

(MDP figure, described in words, where L = low, M = medium, H = high and T = top: from each of L, M and H, action "(1) slowly" costs 1, climbs one level with probability 0.3 and moves to L with probability 0.7; action "(2) rapidly" costs 2, climbs one level with probability 0.5 and moves to L with probability 0.5. T is the goal state.)

Starting with 0 as initial values, value iteration calculates the following:

ITR   L:slowly   L:rapidly   M:slowly   M:rapidly   H:slowly   H:rapidly
  1     1.00*      2.00        1.00*      2.00        1.00*      2.00
  2     2.00*      3.00        2.00*      3.00        1.70*      2.50
  3     3.00*      4.00        2.91*      3.85        2.40*      3.00
  4     3.97*      4.96        3.82*      4.70        3.10*      3.50
  5     4.93*      5.90        4.71*      5.54        3.78*      3.99
  6     5.86*      6.82        5.58*      6.35        4.45*      4.46
  7     6.78*      7.72        6.44*      7.16        5.10       4.93*
  8     7.68*      8.61        7.22*      7.85        5.75       5.39*
  9     8.54*      9.45        7.99*      8.53        6.37       5.84*
...
196    25.33*     25.67       23.13      22.00*      18.73      14.67*
197    25.33*     25.67       23.13      22.00*      18.73      14.67*
198    25.33*     25.67       23.13      22.00*      18.73      14.67*
199    25.33*     25.67       23.13      22.00*      18.73      14.67*

At each iteration, the value of a state is the value of the minimizing action in that state (since we are trying to minimize cost) and is marked with an asterisk. For instance, in iteration 4, the value of L, v4(L), is computed as follows:

v4(L, slowly)  = 0.3 * v3(M) + 0.7 * v3(L) + 1 ≈ 3.97
v4(L, rapidly) = 0.5 * v3(M) + 0.5 * v3(L) + 2 ≈ 4.96
v4(L) = min(v4(L, slowly), v4(L, rapidly)) = 3.97

The optimal policy is to spin slowly in the low state and to spin rapidly in the other states. Here is sample C code for this value iteration:

#include <stdio.h>

#define min(x, y) (((x) < (y)) ? (x) : (y))

int main(void)
{
    int iteration = 0;
    float l = 0.0;                      /* value of low                     */
    float m = 0.0;                      /* value of medium                  */
    float h = 0.0;                      /* value of high                    */
    float t = 0.0;                      /* value of top (the goal), stays 0 */
    float l_slow, l_rapid, m_slow, m_rapid, h_slow, h_rapid;

    while (1) {                         /* stop once the printed values no longer change */
        l_slow  = 1 + 0.7 * l + 0.3 * m;
        l_rapid = 2 + 0.5 * l + 0.5 * m;
        m_slow  = 1 + 0.7 * l + 0.3 * h;
        m_rapid = 2 + 0.5 * l + 0.5 * h;
        h_slow  = 1 + 0.7 * l + 0.3 * t;
        h_rapid = 2 + 0.5 * l + 0.5 * t;
        printf("%d: ", ++iteration);
        printf("%.2f %.2f ", l_slow, l_rapid);
        printf("%.2f %.2f ", m_slow, m_rapid);
        printf("%.2f %.2f\n", h_slow, h_rapid);
        l = min(l_slow, l_rapid);
        m = min(m_slow, m_rapid);
        h = min(h_slow, h_rapid);
    }
}

4) You won the lottery and they will pay you one million dollars each year for 20 years (starting this year). If the interest rate is 5 percent, how much money do you need to get right away to be indifferent between this amount of money and the annuity?

A million dollars we get right away is worth a million dollars to us now. A million dollars we get a year from now is worth γ = 1/(1 + 0.05) million dollars to us now because, with interest, it would grow back to (1/1.05) * 1.05 = 1 million dollars in a year. Similarly, a million dollars we get 19 years from now (at the beginning of the 20th year) is worth only (1/1.05)^19 ≈ 0.4 million dollars to us now. Therefore, getting paid a million dollars each year for 20 years is worth

1 + γ + γ^2 + ... + γ^19 = (1 - γ^20)/(1 - γ) ≈ 0.623/0.0476 ≈ 13.08

million dollars to us now.
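A few lines of C confirm this geometric-series calculation (only an illustrative sketch; the loop accumulates 1 + γ + γ^2 + ... + γ^19 in Horner style):

#include <stdio.h>

int main(void)
{
    double gamma = 1.0 / 1.05;    /* one-year discount factor at 5 percent interest */
    double worth = 0.0;
    int year;

    for (year = 0; year < 20; year++)
        worth = 1.0 + gamma * worth;      /* 1 + gamma * (previous partial sum) */

    printf("present value: %.3f million\n", worth);   /* prints 13.085 */
    return 0;
}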

5) Assume that you use undiscounted value iteration (that is, value iteration with a discount factor of 1) for a Markov decision process with goal states, where the action costs are greater than or equal to zero. Give a simple example that shows that the values that value iteration converges to can depend on the initial values of the states; in other words, the values that value iteration converges to are not necessarily equal to the expected goal distances of the states.

(MDP figure, described in words: S is the start state and G the goal state. Action a1 leads from S to G with cost 1, and action a2 is a self-loop at S with cost 0.)

Consider the initial values v0(S) = 0 and v0(G) = 0. Value iteration determines the values after convergence to be v∞(S) = 0 and v∞(G) = 0, yet the (expected) goal distance of S is 1, not 0. Now consider the initial values v0(S) = 2 and v0(G) = 0. Value iteration determines the values after convergence to be v∞(S) = 1 and v∞(G) = 0.
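A minimal C sketch of this example (the array of initial values and the fixed iteration count are invented for this illustration):

#include <stdio.h>

#define min(x, y) (((x) < (y)) ? (x) : (y))

int main(void)
{
    double initial[] = { 0.0, 2.0 };    /* the two initializations from the example */
    int which, i;

    for (which = 0; which < 2; which++) {
        double vS = initial[which];
        double vG = 0.0;                /* the goal state G keeps value 0 */
        for (i = 0; i < 100; i++)
            vS = min(1.0 + vG,          /* action a1: cost 1, reaches G  */
                     0.0 + vS);         /* action a2: cost 0, stays in S */
        printf("v0(S) = %.0f converges to v(S) = %.0f\n", initial[which], vS);
    }
    return 0;
}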

6) An MDP with a single goal state (S3) is given below. a) Given the expected goal distances c(S1) = 7, c(S2) = 4.2, and c(S3) = 0, calculate the optimal policy. b) Suppose that we want to follow a policy where we pick action a2 in state S1 and action a3 in state S2. Calculate the expected goal distances of S1 and S2 for this policy.

(MDP figure, as recovered from the calculations below: in S1, action a1 reaches S3 with probability 0.25 at cost 1 and returns to S1 with probability 0.75 at cost 2; action a2 reaches S2 with probability 0.5 at cost 1 and returns to S1 with probability 0.5 at cost 2. In S2, the only action a3 reaches S3 with probability 0.6 at cost 1 and returns to S1 with probability 0.4 at cost 2.)

a) We use c(s, a) to denote the expected cost of reaching a goal state if one starts in state s, executes action a and then acts according to the policy. Since S3 is a goal state and S2 has only one available action, we only need to calculate c(S1, a1) and c(S1, a2) in order to decide whether to execute a1 or a2 at S1.

c(S1, a1) = 0.25(1 + c(S3)) + 0.75(2 + c(S1)) = 0.25(1 + 0) + 0.75(2 + 7) = 7
c(S1, a2) = 0.5(1 + c(S2)) + 0.5(2 + c(S1)) = 0.5(1 + 4.2) + 0.5(2 + 7) = 7.1

Since c(S1, a1) < c(S1, a2), the optimal policy executes a1 at S1.

b) Since the given policy executes a2 at S1, we simply ignore a1 during our computation. We first generate the following set of equations:

c(S1) = c(S1, a2) = 0.5(1 + c(S2)) + 0.5(2 + c(S1))
c(S2) = c(S2, a3) = 0.6(1 + c(S3)) + 0.4(2 + c(S1))
c(S3) = 0

Plugging c(S3) = 0 into the second equation, we get:

c(S2) = 0.6(1 + 0) + 0.4(2 + c(S1)) = 1.4 + 0.4 c(S1)

Plugging this into the first equation, we get:

c(S1) = 0.5(1 + 1.4 + 0.4 c(S1)) + 0.5(2 + c(S1)) = 2.2 + 0.7 c(S1)
c(S1) = 2.2/0.3 ≈ 7.33

Finally, we get:

c(S2) = 1.4 + 0.4 c(S1) = 1.4 + 0.4 * 7.33 ≈ 4.33
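The same expected goal distances can be obtained numerically by iterating the two policy equations instead of solving them by hand. The sketch below is only an illustration; the iteration count is arbitrary but large enough to converge.

#include <stdio.h>

int main(void)
{
    double c1 = 0.0, c2 = 0.0;          /* c(S1) and c(S2); c(S3) = 0 */
    int i;

    for (i = 0; i < 200; i++) {
        double new_c1 = 0.5 * (1 + c2) + 0.5 * (2 + c1);   /* a2 in S1 */
        double new_c2 = 0.6 * (1 + 0)  + 0.4 * (2 + c1);   /* a3 in S2 */
        c1 = new_c1;
        c2 = new_c2;
    }
    printf("c(S1) = %.2f, c(S2) = %.2f\n", c1, c2);        /* about 7.33 and 4.33 */
    return 0;
}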

Adversarial Search

7) What is the minimax value of the root node for the game tree below? Cross out the node(s) whose value(s) the alpha-beta method never determines, assuming that it performs a depth-first search that always generates the leftmost child node first and that a loss (and win) of MAX corresponds to a value of -∞ (and +∞, respectively). Determine the alpha and beta values of the remaining node(s).

(Game tree figure omitted; the node values that appear in it include 5, 1, 5, 4, 2, 4 and 3, with MIN nodes below the MAX root.)

The minimax value is 5.

(Figure omitted: the game tree annotated with the alpha and beta values of the nodes whose values the alpha-beta method determines, with the never-evaluated node(s) crossed out.)
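For reference, below is a generic depth-first alpha-beta sketch in C. It does not reproduce the homework tree above, whose exact shape is not preserved in this write-up; the node structure, the small example tree in main and the initial window of -1000/+1000 are all invented for this illustration.

#include <stdio.h>

#define MAX_CHILDREN 4

struct node {
    int value;                          /* used only at leaves */
    int num_children;                   /* 0 for a leaf        */
    struct node *children[MAX_CHILDREN];
};

/* Depth-first alpha-beta search; maximizing and minimizing levels alternate. */
static int alphabeta(struct node *n, int alpha, int beta, int maximizing)
{
    int i;
    if (n->num_children == 0)
        return n->value;
    for (i = 0; i < n->num_children; i++) {
        int v = alphabeta(n->children[i], alpha, beta, !maximizing);
        if (maximizing && v > alpha) alpha = v;
        if (!maximizing && v < beta) beta = v;
        if (alpha >= beta)              /* the remaining children are pruned */
            break;
    }
    return maximizing ? alpha : beta;
}

int main(void)
{
    /* an invented example: MAX root, two MIN children, two leaves each */
    struct node l1 = { 1, 0, { 0 } }, l2 = { 5, 0, { 0 } };
    struct node l3 = { 4, 0, { 0 } }, l4 = { 2, 0, { 0 } };
    struct node min1 = { 0, 2, { &l1, &l2 } };
    struct node min2 = { 0, 2, { &l3, &l4 } };
    struct node root = { 0, 2, { &min1, &min2 } };

    printf("minimax value: %d\n", alphabeta(&root, -1000, 1000, 1));  /* prints 2 */
    return 0;
}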

8) Assume that you are given a version of the alpha-beta method that is able to take advantage of the information that all node values are integers that are at least 1 and at most 6. Determine ALL values for X that require the algorithm to determine the values of ALL nodes of the following game tree, assuming that the alpha-beta method performs a depth-first search that always generates the leftmost child node first. (Remember to initialize α = 1 and β = 6 for the root node of the minimax tree.)

(Game tree figure, as recovered from the answer below: a MAX root with two MIN children; the left MIN child has the leaves X and 3, and the right MIN child has the leaves 3 and 2.)

Let a be the sibling of the leaf with value X, and let b be the leaf with value 2. If we assign X = 1, then only node a can be pruned, because the left MIN node's β drops to the initial α = 1. If we assign X = 2, then all nodes must be expanded. For higher values of X, the left MIN node's value is 3, the root's α rises to 3, and node b can be pruned. Therefore, the answer is X = 2.

9) The minimax algorithm returns the best move for MAX under the assumption that MIN plays optimally. What happens if MIN plays suboptimally? Is it still a good idea to use the minimax algorithm?

The outcome for MAX can only be the same or better if MIN plays suboptimally than if MIN plays optimally. So, in general, it seems like a good idea to use minimax. However, suppose MAX assumes that MIN plays optimally and minimax determines that MIN will win. In that case, all moves are losing and are considered equally good, including those that lose immediately. A better algorithm would make the moves for which it is most difficult for MIN to find the winning line.