Chapter 3: The Reinforcement Learning Problem. The Agent-Environment Interface. Getting the Degree of Abstraction Right. The Agent Learns a Policy


Camron Anderson
6 years ago

Chapter 3: The Reinforcement Learning Problem

Objectives of this chapter:
- describe the RL problem we will be studying for the remainder of the course;
- present an idealized form of the RL problem for which we have precise theoretical results;
- introduce key components of the mathematics: value functions and Bellman equations;
- describe trade-offs between applicability and mathematical tractability.

The Agent-Environment Interface

Agent and environment interact at discrete time steps: $t = 0, 1, 2, \ldots$
Agent observes state at step $t$: $s_t \in S$
produces action at step $t$: $a_t \in A(s_t)$
gets resulting reward: $r_{t+1} \in \mathbb{R}$
and resulting next state: $s_{t+1}$

The resulting trajectory: $\ldots, s_t, a_t, r_{t+1}, s_{t+1}, a_{t+1}, r_{t+2}, s_{t+2}, a_{t+2}, r_{t+3}, s_{t+3}, \ldots$

The Agent Learns a Policy

Policy at step $t$, $\pi_t$: a mapping from states to action probabilities.
$\pi_t(s, a)$ = probability that $a_t = a$ when $s_t = s$.
Reinforcement learning methods specify how the agent changes its policy as a result of experience.
Roughly, the agent's goal is to get as much reward as it can over the long run.

Getting the Degree of Abstraction Right

Time steps need not refer to fixed intervals of real time.
Actions can be low level (e.g., voltages to motors), high level (e.g., accept a job offer), "mental" (e.g., shift in focus of attention), etc.
States can be low-level "sensations", or they can be abstract, symbolic, based on memory, or subjective (e.g., the state of being "surprised" or "lost").
An RL agent is not like a whole animal or robot, which consist of many RL agents as well as other components.
The environment is not necessarily unknown to the agent, only incompletely controllable.
Reward computation is in the agent's environment because the agent cannot change it arbitrarily.
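The interaction loop and the stochastic policy described above can be sketched in a few lines of Python. This is only an illustration: the two-state environment, its action names, and its reward rule are hypothetical and do not come from the slides; only the loop structure (observe $s_t$, sample $a_t$ from $\pi_t(s_t, \cdot)$, receive $r_{t+1}$ and $s_{t+1}$) follows the text.

```python
import random

# Hypothetical two-state environment, invented for illustration only.
STATES = ["s0", "s1"]
ACTIONS = {"s0": ["stay", "go"], "s1": ["stay", "go"]}


def env_step(state, action):
    """Return (reward, next_state). 'go' switches state; landing in s1 pays +1."""
    other = "s1" if state == "s0" else "s0"
    next_state = other if action == "go" else state
    reward = 1.0 if next_state == "s1" else 0.0
    return reward, next_state


def sample_action(policy, state, rng):
    """Draw a_t from pi(s, .), given as a dict mapping action -> probability."""
    actions = list(policy[state])
    weights = [policy[state][a] for a in actions]
    return rng.choices(actions, weights=weights, k=1)[0]


def run(policy, start_state, num_steps, seed=0):
    """Run the agent-environment loop and record (s_t, a_t, r_{t+1}, s_{t+1})."""
    rng = random.Random(seed)
    state = start_state
    trajectory = []
    for _ in range(num_steps):
        action = sample_action(policy, state, rng)
        reward, next_state = env_step(state, action)
        trajectory.append((state, action, reward, next_state))
        state = next_state
    return trajectory


# Equiprobable random policy: pi(s, a) = 1 / |A(s)|.
uniform_policy = {s: {a: 1.0 / len(ACTIONS[s]) for a in ACTIONS[s]} for s in STATES}
trajectory = run(uniform_policy, "s0", num_steps=5)
```

A learning method would sit on top of this loop and adjust the probabilities in the policy dictionary as experience accumulates; the sketch keeps the policy fixed.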

Goals and Rewards

Is a scalar reward signal an adequate notion of a goal? Maybe not, but it is surprisingly flexible.
A goal should specify what we want to achieve, not how we want to achieve it.
A goal must be outside the agent's direct control, thus outside the agent.
The agent must be able to measure success: explicitly; frequently during its life-span.

Returns

Suppose the sequence of rewards after step $t$ is: $r_{t+1}, r_{t+2}, r_{t+3}, \ldots$
What do we want to maximize?
In general, we want to maximize the expected return, $E\{R_t\}$, for each step $t$.
Episodic tasks: interaction breaks naturally into episodes, e.g., plays of a game, trips through a maze.
$R_t = r_{t+1} + r_{t+2} + \cdots + r_T$,
where $T$ is a final time step at which a terminal state is reached, ending an episode.

Returns for Continuing Tasks

Continuing tasks: interaction does not have natural episodes.
Discounted return:
$R_t = r_{t+1} + \gamma r_{t+2} + \gamma^2 r_{t+3} + \cdots = \sum_{k=0}^{\infty} \gamma^k r_{t+k+1}$,
where $\gamma$, $0 \le \gamma \le 1$, is the discount rate.
shortsighted $0 \leftarrow \gamma \rightarrow 1$ farsighted

An Example

Avoid failure: the pole falling beyond a critical angle or the cart hitting the end of the track.
As an episodic task where the episode ends upon failure: reward = $+1$ for each step before failure, so return = number of steps before failure.
As a continuing task with discounted return: reward = $-1$ upon failure; 0 otherwise, so return is related to $-\gamma^k$, for $k$ steps before failure.
In either case, return is maximized by avoiding failure for as long as possible.
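The two formulations of the pole-balancing return can be checked numerically. The sketch below assumes a hypothetical episode with $k = 3$ steps before failure and a discount rate of 0.5 (chosen only because the arithmetic is exact); the helper name `discounted_return` is not from the slides.

```python
def discounted_return(rewards, gamma):
    """Compute R_t = sum_k gamma**k * r_{t+k+1} for a finite list of rewards.

    rewards[k] plays the role of r_{t+k+1}. Accumulating from the end uses
    the recursion R_t = r_{t+1} + gamma * R_{t+1}.
    """
    total = 0.0
    for reward in reversed(rewards):
        total = reward + gamma * total
    return total


k = 3  # assumed number of steps before failure

# Episodic formulation: +1 for each step before failure, no discounting.
episodic_rewards = [1.0] * k
episodic_return = discounted_return(episodic_rewards, gamma=1.0)

# Continuing formulation: 0 on every step, then -1 upon failure.
continuing_rewards = [0.0] * k + [-1.0]
continuing_return = discounted_return(continuing_rewards, gamma=0.5)

print(episodic_return)    # 3.0, the number of steps before failure
print(continuing_return)  # -0.125, which equals -(0.5 ** 3)
```

In both cases a larger $k$ gives a larger return (3.0 grows with $k$, and $-\gamma^k$ moves toward zero), which is the sense in which both formulations reward avoiding failure for as long as possible.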

Another Example

Get to the top of the hill as quickly as possible.
reward = $-1$ for each step where not at top of hill, so return = $-$(number of steps before reaching top of hill).
Return is maximized by minimizing the number of steps to reach the top of the hill.

A Unified Notation

In episodic tasks, we number the time steps of each episode starting from zero.
We usually do not have to distinguish between episodes, so we write $s_t$ instead of $s_{t,j}$ for the state at step $t$ of episode $j$.
Think of each episode as ending in an absorbing state that always produces a reward of zero.
We can cover all cases by writing
$R_t = \sum_{k=0}^{\infty} \gamma^k r_{t+k+1}$,
where $\gamma$ can be 1 only if a zero-reward absorbing state is always reached.

The Markov Property

By "the state" at step $t$, the book means whatever information is available to the agent at step $t$ about its environment.
The state can include immediate "sensations," highly processed sensations, and structures built up over time from sequences of sensations.
Ideally, a state should summarize past sensations so as to retain all "essential" information, i.e., it should have the Markov Property:
$\Pr\{s_{t+1} = s', r_{t+1} = r \mid s_t, a_t, r_t, s_{t-1}, a_{t-1}, \ldots, r_1, s_0, a_0\} = \Pr\{s_{t+1} = s', r_{t+1} = r \mid s_t, a_t\}$
for all $s'$, $r$, and histories $s_t, a_t, r_t, s_{t-1}, a_{t-1}, \ldots, r_1, s_0, a_0$.

Markov Decision Processes

If a reinforcement learning task has the Markov Property, it is basically a Markov Decision Process (MDP).
If the state and action sets are finite, it is a finite MDP.
To define a finite MDP, you need to give:
- state and action sets;
- one-step "dynamics" defined by transition probabilities:
  $P^a_{ss'} = \Pr\{s_{t+1} = s' \mid s_t = s, a_t = a\}$ for all $s, s' \in S$, $a \in A(s)$;
- reward expectations:
  $R^a_{ss'} = E\{r_{t+1} \mid s_t = s, a_t = a, s_{t+1} = s'\}$ for all $s, s' \in S$, $a \in A(s)$.
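One possible way to hold the ingredients of a finite MDP in code is a small container with a consistency check. The class name `FiniteMDP`, its field layout, and the toy two-state example are assumptions made for this sketch; the slides only specify what must be given (state and action sets, $P^a_{ss'}$, and $R^a_{ss'}$).

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass
class FiniteMDP:
    """Hypothetical container for the quantities that define a finite MDP."""
    states: List[str]
    actions: Dict[str, List[str]]                  # s -> A(s)
    transitions: Dict[Tuple[str, str], Dict[str, float]]  # (s, a) -> {s': P^a_{ss'}}
    rewards: Dict[Tuple[str, str, str], float]     # (s, a, s') -> R^a_{ss'}

    def validate(self, tol: float = 1e-9) -> bool:
        """Check that every (s, a) has a proper distribution over next states."""
        for s in self.states:
            for a in self.actions[s]:
                probs = self.transitions[(s, a)]
                if any(p < 0.0 for p in probs.values()):
                    raise ValueError(f"negative probability for {(s, a)}")
                if abs(sum(probs.values()) - 1.0) > tol:
                    raise ValueError(f"probabilities for {(s, a)} do not sum to 1")
                for s_next in probs:
                    if s_next not in self.states:
                        raise ValueError(f"unknown next state {s_next}")
                    if (s, a, s_next) not in self.rewards:
                        raise ValueError(f"missing reward for {(s, a, s_next)}")
        return True

    def expected_reward(self, s: str, a: str) -> float:
        """Expected immediate reward: sum over s' of P^a_{ss'} * R^a_{ss'}."""
        return sum(p * self.rewards[(s, a, s_next)]
                   for s_next, p in self.transitions[(s, a)].items())


# Toy example with made-up numbers.
mdp = FiniteMDP(
    states=["x", "y"],
    actions={"x": ["left", "right"], "y": ["left"]},
    transitions={
        ("x", "left"): {"x": 1.0},
        ("x", "right"): {"x": 0.25, "y": 0.75},
        ("y", "left"): {"x": 1.0},
    },
    rewards={
        ("x", "left", "x"): 0.0,
        ("x", "right", "x"): 0.0,
        ("x", "right", "y"): 2.0,
        ("y", "left", "x"): 1.0,
    },
)
```

Keying the transition table by state-action pair mirrors the notation $P^a_{ss'}$: each entry is a distribution over next states, so the validation step reduces to checking that each one sums to 1.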

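An Example Finite MDP: Recycling Robot

At each step, the robot has to decide whether it should (1) actively search for a can, (2) wait for someone to bring it a can, or (3) go to home base and recharge.
Searching is better but runs down the battery; if the robot runs out of power while searching, it has to be rescued (which is bad).
Decisions are made on the basis of the current energy level: high, low.

Recycling Robot MDP

$S = \{\text{high}, \text{low}\}$
$A(\text{high}) = \{\text{search}, \text{wait}\}$
$A(\text{low}) = \{\text{search}, \text{wait}, \text{recharge}\}$
$R^{\text{search}}$ = expected no. of cans while searching
$R^{\text{wait}}$ = expected no. of cans while waiting
$R^{\text{search}} > R^{\text{wait}}$

Value Functions

The value of a state is the expected return starting from that state; it depends on the agent's policy.
State-value function for policy $\pi$:
$V^\pi(s) = E_\pi\{R_t \mid s_t = s\} = E_\pi\left\{\sum_{k=0}^{\infty} \gamma^k r_{t+k+1} \,\middle|\, s_t = s\right\}$
The value of taking an action in a state under policy $\pi$ is the expected return starting from that state, taking that action, and thereafter following $\pi$.
Action-value function for policy $\pi$:
$Q^\pi(s, a) = E_\pi\{R_t \mid s_t = s, a_t = a\} = E_\pi\left\{\sum_{k=0}^{\infty} \gamma^k r_{t+k+1} \,\middle|\, s_t = s, a_t = a\right\}$

Bellman Equation for a Policy $\pi$

The basic idea:
$R_t = r_{t+1} + \gamma r_{t+2} + \gamma^2 r_{t+3} + \gamma^3 r_{t+4} + \cdots$
$\phantom{R_t} = r_{t+1} + \gamma \left(r_{t+2} + \gamma r_{t+3} + \gamma^2 r_{t+4} + \cdots\right)$
$\phantom{R_t} = r_{t+1} + \gamma R_{t+1}$
So:
$V^\pi(s) = E_\pi\{R_t \mid s_t = s\} = E_\pi\{r_{t+1} + \gamma V^\pi(s_{t+1}) \mid s_t = s\}$
Or, without the expectation operator:
$V^\pi(s) = \sum_a \pi(s, a) \sum_{s'} P^a_{ss'} \left[R^a_{ss'} + \gamma V^\pi(s')\right]$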
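The Bellman equation for $V^\pi$ can be turned directly into an iterative update and applied to the recycling robot. The sketch below is hedged in two ways. First, the slide text does not give the transition probabilities or numeric rewards, so the transition structure follows the textbook's formulation of this example (search keeps a high battery high with probability `ALPHA`, keeps a low battery low with probability `BETA`, and otherwise the robot is rescued at a penalty), and every number is a hypothetical value picked for illustration. Second, repeated sweeps of the equation are one way of approximating its solution, not the only one.

```python
GAMMA = 0.9                 # assumed discount rate
ALPHA, BETA = 0.8, 0.6      # assumed battery-transition probabilities
R_SEARCH, R_WAIT, R_RESCUE = 2.0, 1.0, -3.0  # assumed expected rewards

# DYNAMICS[s][a] is a list of (P^a_{ss'}, s', R^a_{ss'}) triples.
DYNAMICS = {
    "high": {
        "search": [(ALPHA, "high", R_SEARCH), (1 - ALPHA, "low", R_SEARCH)],
        "wait": [(1.0, "high", R_WAIT)],
    },
    "low": {
        "search": [(BETA, "low", R_SEARCH), (1 - BETA, "high", R_RESCUE)],
        "wait": [(1.0, "low", R_WAIT)],
        "recharge": [(1.0, "high", 0.0)],
    },
}


def backup(values, state, policy):
    """Right-hand side of the Bellman equation for V^pi at one state."""
    return sum(
        policy[state].get(action, 0.0)
        * sum(p * (r + GAMMA * values[s_next]) for p, s_next, r in outcomes)
        for action, outcomes in DYNAMICS[state].items()
    )


def evaluate_policy(policy, tol=1e-10):
    """Sweep the Bellman equation until the values stop changing."""
    values = {s: 0.0 for s in DYNAMICS}
    while True:
        new_values = {s: backup(values, s, policy) for s in DYNAMICS}
        delta = max(abs(new_values[s] - values[s]) for s in DYNAMICS)
        values = new_values
        if delta < tol:
            return values


# Equiprobable random policy and a policy that always waits.
random_policy = {s: {a: 1.0 / len(acts) for a in acts} for s, acts in DYNAMICS.items()}
always_wait = {"high": {"wait": 1.0}, "low": {"wait": 1.0}}

v_random = evaluate_policy(random_policy)
v_wait = evaluate_policy(always_wait)
```

Under the always-wait policy the robot collects `R_WAIT` forever without changing state, so both values should approach $1 / (1 - 0.9) = 10$, which gives a quick sanity check on the update.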

More on the Bellman Equation

$V^\pi(s) = \sum_a \pi(s, a) \sum_{s'} P^a_{ss'} \left[R^a_{ss'} + \gamma V^\pi(s')\right]$
This is a set of equations (in fact, linear), one for each state. The value function for $\pi$ is its unique solution.
Backup diagrams (figure): one for $V^\pi$ and one for $Q^\pi$.

Gridworld

Actions: north, south, east, west; deterministic.
If an action would take the agent off the grid: no move but reward = $-1$.
Other actions produce reward = 0, except actions that move the agent out of special states A and B as shown.
(Figure: state-value function for the equiprobable random policy; $\gamma = 0.9$.)

Golf

State is ball location.
Reward of $-1$ for each stroke until the ball is in the hole.
Value of a state?
Actions: putt (use putter), driver (use driver).
putt succeeds anywhere on the green.

Optimal Value Functions

For finite MDPs, policies can be partially ordered:
$\pi \ge \pi'$ if and only if $V^\pi(s) \ge V^{\pi'}(s)$ for all $s \in S$.
There is always at least one (and possibly many) policy that is better than or equal to all the others. This is an optimal policy. We denote them all $\pi^*$.
Optimal policies share the same optimal state-value function:
$V^*(s) = \max_\pi V^\pi(s)$ for all $s \in S$.
Optimal policies also share the same optimal action-value function:
$Q^*(s, a) = \max_\pi Q^\pi(s, a)$ for all $s \in S$ and $a \in A(s)$.
This is the expected return for taking action $a$ in state $s$ and thereafter following an optimal policy.
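Because the Bellman equation for a fixed policy is linear, the gridworld values can be obtained by solving one linear system, $(I - \gamma P^\pi) V = R^\pi$. The sketch below makes assumptions the slide text does not state, since its figure did not survive: a 5 by 5 grid, special state A in the top row sending the agent to A' with reward +10, special state B sending it to B' with reward +5 (the layout used in the textbook's gridworld example), and a reward of $-1$ for attempted moves off the grid. A small Gaussian elimination routine is written out so the example has no dependencies.

```python
GAMMA = 0.9
SIZE = 5  # assumed grid size
MOVES = [(-1, 0), (1, 0), (0, 1), (0, -1)]  # north, south, east, west
# Assumed special states: A -> A' with +10, B -> B' with +5.
SPECIAL = {(0, 1): ((4, 1), 10.0), (0, 3): ((2, 3), 5.0)}


def step(state, move):
    """Deterministic dynamics: return (next_state, reward)."""
    if state in SPECIAL:
        return SPECIAL[state]  # every action leaves a special state the same way
    row, col = state[0] + move[0], state[1] + move[1]
    if 0 <= row < SIZE and 0 <= col < SIZE:
        return (row, col), 0.0
    return state, -1.0  # off the grid: no move, reward -1


def solve_linear(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        tail = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - tail) / m[r][r]
    return x


states = [(r, c) for r in range(SIZE) for c in range(SIZE)]
index = {s: i for i, s in enumerate(states)}
n = len(states)

# Build (I - gamma * P) and R for the equiprobable random policy.
a_matrix = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
b_vector = [0.0] * n
for s in states:
    for move in MOVES:
        next_state, reward = step(s, move)
        a_matrix[index[s]][index[next_state]] -= 0.25 * GAMMA
        b_vector[index[s]] += 0.25 * reward

solution = solve_linear(a_matrix, b_vector)
values = {s: solution[index[s]] for s in states}
```

Under these assumptions the solution should reproduce the values reported for the textbook gridworld, roughly 8.8 at A and 5.3 at B. The value at A is below its immediate reward of 10 because A' sits on the bottom edge, where the random policy often bumps into the wall.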

Optimal Value Function for Golf

We can hit the ball farther with the driver than with the putter, but with less accuracy.
$Q^*(s, \text{driver})$ gives the value of using the driver first, then using whichever actions are best.

Bellman Optimality Equation for $V^*$

The value of a state under an optimal policy must equal the expected return for the best action from that state:
$V^*(s) = \max_{a \in A(s)} Q^{\pi^*}(s, a)$
$\phantom{V^*(s)} = \max_{a \in A(s)} E\{r_{t+1} + \gamma V^*(s_{t+1}) \mid s_t = s, a_t = a\}$
$\phantom{V^*(s)} = \max_{a \in A(s)} \sum_{s'} P^a_{ss'} \left[R^a_{ss'} + \gamma V^*(s')\right]$
(Figure: the relevant backup diagram.)
$V^*$ is the unique solution of this system of nonlinear equations.

Bellman Optimality Equation for $Q^*$

$Q^*(s, a) = E\{r_{t+1} + \gamma \max_{a'} Q^*(s_{t+1}, a') \mid s_t = s, a_t = a\}$
$\phantom{Q^*(s, a)} = \sum_{s'} P^a_{ss'} \left[R^a_{ss'} + \gamma \max_{a'} Q^*(s', a')\right]$
(Figure: the relevant backup diagram.)
$Q^*$ is the unique solution of this system of nonlinear equations.

Why Optimal State-Value Functions are Useful

Any policy that is greedy with respect to $V^*$ is an optimal policy.
Therefore, given $V^*$, one-step-ahead search produces the long-term optimal actions.
E.g., back to the gridworld (figure).
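One common way to approximate the solution of the Bellman optimality equation is to apply its right-hand side repeatedly as an update (value iteration, a dynamic programming method). The sketch below does this for a recycling-robot MDP and then reads off $Q^*$ and a greedy policy by one-step-ahead search. As before, the transition structure follows the textbook's version of the robot and all numbers (`ALPHA`, `BETA`, the rewards, the discount rate) are hypothetical values chosen for illustration, not taken from the slides.

```python
GAMMA = 0.9                 # assumed discount rate
ALPHA, BETA = 0.8, 0.6      # assumed battery-transition probabilities
R_SEARCH, R_WAIT, R_RESCUE = 2.0, 1.0, -3.0  # assumed expected rewards

# DYNAMICS[s][a] is a list of (P^a_{ss'}, s', R^a_{ss'}) triples.
DYNAMICS = {
    "high": {
        "search": [(ALPHA, "high", R_SEARCH), (1 - ALPHA, "low", R_SEARCH)],
        "wait": [(1.0, "high", R_WAIT)],
    },
    "low": {
        "search": [(BETA, "low", R_SEARCH), (1 - BETA, "high", R_RESCUE)],
        "wait": [(1.0, "low", R_WAIT)],
        "recharge": [(1.0, "high", 0.0)],
    },
}


def q_from_v(values, state, action):
    """One-step lookahead: sum over s' of P * (R + gamma * V(s'))."""
    return sum(p * (r + GAMMA * values[s_next])
               for p, s_next, r in DYNAMICS[state][action])


def value_iteration(tol=1e-10):
    """Repeat V(s) <- max_a sum_s' P [R + gamma V(s')] until convergence."""
    values = {s: 0.0 for s in DYNAMICS}
    while True:
        new_values = {
            s: max(q_from_v(values, s, a) for a in DYNAMICS[s]) for s in DYNAMICS
        }
        delta = max(abs(new_values[s] - values[s]) for s in DYNAMICS)
        values = new_values
        if delta < tol:
            return values


v_star = value_iteration()

# Q* follows from V* by one-step lookahead; the greedy policy is its arg max.
q_star = {(s, a): q_from_v(v_star, s, a) for s in DYNAMICS for a in DYNAMICS[s]}
greedy_policy = {
    s: max(DYNAMICS[s], key=lambda a: q_star[(s, a)]) for s in DYNAMICS
}
```

With these assumed numbers the greedy policy is to search when the battery is high and recharge when it is low. Different reward or probability choices can change that answer, which is why the parameters are flagged as assumptions.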

What About Optimal Action-Value Functions?

Given $Q^*$, the agent does not even have to do a one-step-ahead search:
$\pi^*(s) = \arg\max_{a \in A(s)} Q^*(s, a)$

Solving the Bellman Optimality Equation

Finding an optimal policy by solving the Bellman Optimality Equation requires the following:
- accurate knowledge of environment dynamics;
- enough space and time to do the computation;
- the Markov Property.
How much space and time do we need?
- polynomial in the number of states (via dynamic programming methods; Chapter 4),
- BUT, the number of states is often huge (e.g., backgammon has about $10^{20}$ states).
We usually have to settle for approximations.
Many RL methods can be understood as approximately solving the Bellman Optimality Equation.

Summary

- Agent-environment interaction: states, actions, rewards
- Policy: stochastic rule for selecting actions
- Return: the function of future rewards the agent tries to maximize
- Episodic and continuing tasks
- Markov Property
- Markov Decision Process: transition probabilities, expected rewards
- Value functions: state-value function for a policy, action-value function for a policy, optimal state-value function, optimal action-value function
- Optimal value functions
- Optimal policies
- Bellman Equations
- The need for approximation
