Reinforcement Learning

Pamela Elliott
5 years ago

1 Reinforcement Learning
Alan Fern
* Based in part on slides by Daniel Weld

2 So far...
Given an MDP model, we know how to find optimal policies (for moderately sized MDPs): Value Iteration or Policy Iteration.
Given just a simulator of an MDP, we know how to select actions: Monte-Carlo Planning.
What if we don't have a model or simulator? Like when we were babies... Like in many real-world applications.
All we can do is wander around the world observing what happens, getting rewarded and punished.
Enter reinforcement learning.

3 Reinforcement Learning
No knowledge of the environment: we can only act in the world and observe states and rewards.
Many factors make RL difficult:
Actions have non-deterministic effects, which are initially unknown.
Rewards / punishments are infrequent, often at the end of long sequences of actions. How do we determine what action(s) were really responsible for reward or punishment? (credit assignment)
The world is large and complex.
Nevertheless the learner must decide what actions to take.
We will assume the world behaves as an MDP.

4 Pure Reinforcement Learning vs. Monte-Carlo Planning
In pure reinforcement learning the agent begins with no knowledge and wanders around the world observing outcomes.
In Monte-Carlo planning the agent begins with no declarative knowledge of the world, but has an interface to a world simulator that allows observing the outcome of taking any action in any state.
The simulator gives the agent the ability to "teleport" to any state, at any time, and then apply any action.
A pure RL agent does not have the ability to teleport. It can only observe the outcomes that it happens to reach.

5 Pure Reinforcement Learning vs. Monte-Carlo Planning
MC planning is sometimes called RL with a "strong simulator", i.e. a simulator where we can set the current state to any state at any moment.
Pure RL is sometimes called RL with a "weak simulator", i.e. a simulator where we cannot set the state.
A strong simulator can emulate a weak simulator, so pure RL can be used in the MC planning framework, but not vice versa.

6 Passive vs. Active learning
Passive learning: the agent has a fixed policy and tries to learn the utilities of states by observing the world go by. Analogous to policy evaluation. Often serves as a component of active learning algorithms. Often inspires active learning algorithms.
Active learning: the agent attempts to find an optimal (or at least good) policy by acting in the world. Analogous to solving the underlying MDP, but without first being given the MDP model.

7 Model-Based vs. Model-Free RL
Model-based approach to RL: learn the MDP model, or an approximation of it, and use it for policy evaluation or to find the optimal policy.
Model-free approach to RL: derive the optimal policy without explicitly learning the model. Useful when the model is difficult to represent and/or learn.
We will consider both types of approaches.

8 Small vs. Huge MDPs
We will first cover RL methods for small MDPs: MDPs where the number of states and actions is reasonably small. These algorithms will inspire more advanced methods.
Later we will cover algorithms for huge MDPs: Function Approximation Methods, Policy Gradient Methods, Least-Squares Policy Iteration.

9 Example: Passive RL
Suppose we are given a stationary policy (shown by arrows). Actions can stochastically lead to an unintended grid cell.
We want to determine how good the policy is.

10 Objective: Value Function

11 Passive RL
Estimate $V^\pi(s)$. We are not given the transition matrix or the reward function!
Follow the policy for many epochs, giving training sequences:
(1,1) (1,2) (1,3) (1,2) (1,3) (2,3) (3,3) (3,4) +1
(1,1) (1,2) (1,3) (2,3) (3,3) (3,2) (3,3) (3,4) +1
(1,1) (2,1) (3,1) (3,2) (4,2) -1
Assume that after entering the +1 or -1 state the agent enters a zero-reward terminal state, so we don't bother showing those transitions.

12 Approach 1: Direct Estimation
Direct estimation (also called Monte Carlo): estimate $V^\pi(s)$ as the average total reward of epochs containing s (calculating from s to the end of the epoch).
The reward-to-go of a state s is the sum of the (discounted) rewards from that state until a terminal state is reached.
Key: use the observed reward-to-go of the state as direct evidence of the actual expected utility of that state.
Averaging the reward-to-go samples will converge to the true value at the state.
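A minimal sketch of direct estimation, under stated assumptions: each epoch is a list of (state, reward) pairs, every visit to a state contributes a sample (the slide does not say whether first-visit or every-visit averaging is meant), and the episode data and the name `direct_estimate` are made up for illustration.

```python
from collections import defaultdict

def direct_estimate(episodes, gamma=1.0):
    """Average the observed reward-to-go over every visit to each state.

    Each episode is a list of (state, reward) pairs in the order visited.
    """
    totals = defaultdict(float)
    counts = defaultdict(int)
    for episode in episodes:
        reward_to_go = 0.0
        # Walk backwards so the reward-to-go accumulates in a single pass.
        for state, reward in reversed(episode):
            reward_to_go = reward + gamma * reward_to_go
            totals[state] += reward_to_go
            counts[state] += 1
    return {s: totals[s] / counts[s] for s in totals}

# Hypothetical data: a small step cost and a terminal reward of +1 or -1.
episodes = [
    [("A", -0.04), ("B", -0.04), ("goal", 1.0)],
    [("A", -0.04), ("pit", -1.0)],
]
values = direct_estimate(episodes)
```

State A is seen once with reward-to-go 0.92 and once with -1.04, so its estimate is their average. Nothing ties the estimate for A to the estimates of its successors.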

13 Direct Estimation
Converges very slowly to the correct utility values (requires a lot of sequences).
Doesn't exploit the Bellman constraints on policy values:
$V^\pi(s) = R(s) + \gamma \sum_{s'} T(s, a, s')\, V^\pi(s')$, with $a = \pi(s)$.
It is happy to consider value function estimates that violate this property badly.
How can we incorporate the Bellman constraints?

14 Approach 2: Adaptive Dynamic Programming (ADP)
ADP is a model-based approach:
Follow the policy for a while.
Estimate the transition model based on observations.
Learn the reward function.
Use the estimated model to compute the utility of the policy:
$V^\pi(s) = R(s) + \gamma \sum_{s'} T(s, a, s')\, V^\pi(s')$, where $R$ and $T$ are learned.
How can we estimate the transition model $T(s, a, s')$? Simply the fraction of times we see s' after taking a in state s.
NOTE: We can bound the error with Chernoff bounds if we want.
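The ADP steps can be sketched as follows. This is an illustration under assumptions, not the course's implementation: observations are (s, a, r, s_next) tuples where r is the reward seen in s, policy evaluation is done by repeated sweeps rather than by solving the linear system, and the function names and the toy data are hypothetical.

```python
from collections import defaultdict

def estimate_model(transitions):
    """Estimate T(s, a, s') as observed frequencies and R(s) as the mean reward seen in s."""
    counts = defaultdict(lambda: defaultdict(int))
    reward_sum = defaultdict(float)
    reward_n = defaultdict(int)
    for s, a, r, s_next in transitions:
        counts[(s, a)][s_next] += 1
        reward_sum[s] += r
        reward_n[s] += 1
    T = {sa: {s2: c / sum(nexts.values()) for s2, c in nexts.items()}
         for sa, nexts in counts.items()}
    R = {s: reward_sum[s] / reward_n[s] for s in reward_sum}
    return T, R

def evaluate_policy(T, R, policy, gamma=0.9, sweeps=200):
    """Iterate V(s) = R(s) + gamma * sum_s' T(s, policy(s), s') V(s')."""
    V = defaultdict(float)  # states never acted in (e.g. terminal) stay at 0
    for _ in range(sweeps):
        for s, a in policy.items():
            V[s] = R[s] + gamma * sum(p * V[s2] for s2, p in T[(s, a)].items())
    return dict(V)

# Hypothetical observations gathered while following the policy "go" everywhere.
transitions = [("A", "go", 0.0, "B"), ("A", "go", 0.0, "B"),
               ("A", "go", 0.0, "A"), ("B", "go", 1.0, "end")]
T, R = estimate_model(transitions)
V = evaluate_policy(T, R, policy={"A": "go", "B": "go"})
```

Here the estimated model has T(A, go, B) = 2/3 and T(A, go, A) = 1/3, and the resulting value of A satisfies the Bellman constraint for that estimated model.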

15 ADP learning curves
[Figure: learning curves for states (4,3), (3,3), (2,3), (1,1), (3,1), (4,1), (4,2)]

16 Approach 3: Temporal Difference Learning (TD)
Can we avoid the computational expense of full DP policy evaluation?
Temporal Difference Learning (model free): do local updates of the utility/value function on a per-action basis. Don't try to estimate the entire transition function!
For each transition from s to s', we perform the following update:
$V^\pi(s) \leftarrow V^\pi(s) + \alpha \left( R(s) + \gamma V^\pi(s') - V^\pi(s) \right)$
The left-hand side is the updated estimate, $\alpha$ is the learning rate, and $\gamma$ is the discount factor.
Intuitively this moves us closer to satisfying the Bellman constraint
$V^\pi(s) = R(s) + \gamma \sum_{s'} T(s, a, s')\, V^\pi(s')$
Why?

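17 Aside: Online Mean Estimation
Suppose that we want to incrementally compute the mean of a sequence of numbers $(x_1, x_2, x_3, \ldots)$, e.g. to estimate the expected value of a random variable from a sequence of samples.
$\hat{X}_{n+1} = \frac{1}{n+1} \sum_{i=1}^{n+1} x_i = \frac{1}{n} \sum_{i=1}^{n} x_i + \frac{1}{n+1} \left( x_{n+1} - \frac{1}{n} \sum_{i=1}^{n} x_i \right) = \hat{X}_n + \frac{1}{n+1} \left( x_{n+1} - \hat{X}_n \right)$
Here $\hat{X}_{n+1}$ is the average of $n+1$ samples.
Given a new sample $x_{n+1}$, the new mean is the old estimate (for $n$ samples) plus the weighted difference between the new sample and the old estimate.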


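19 Aside: Online Mean Estimation
Suppose that we want to incrementally compute the mean of a sequence of numbers $(x_1, x_2, x_3, \ldots)$, e.g. to estimate the expected value of a random variable from a sequence of samples.
$\hat{X}_{n+1} = \frac{1}{n+1} \sum_{i=1}^{n+1} x_i = \frac{1}{n} \sum_{i=1}^{n} x_i + \frac{1}{n+1} \left( x_{n+1} - \frac{1}{n} \sum_{i=1}^{n} x_i \right) = \hat{X}_n + \frac{1}{n+1} \left( x_{n+1} - \hat{X}_n \right)$
Here $\hat{X}_{n+1}$ is the average of $n+1$ samples, $x_{n+1}$ is sample $n+1$, and $\frac{1}{n+1}$ plays the role of a learning rate.
Given a new sample $x_{n+1}$, the new mean is the old estimate (for $n$ samples) plus the weighted difference between the new sample and the old estimate.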
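The incremental mean can be checked with a few lines of Python. This is only a sketch; the name `running_mean` is illustrative.

```python
def running_mean(samples):
    """Compute the mean one sample at a time, without storing the samples."""
    mean = 0.0
    for n, x in enumerate(samples):
        # n samples have been seen so far, so the new one gets weight 1 / (n + 1).
        mean += (x - mean) / (n + 1)
    return mean

result = running_mean([2.0, 4.0, 9.0])
```

The estimate moves 2.0, then 3.0, then 5.0, which is the ordinary average of the three samples.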

20 Approach 3: Temporal Difference Learning (TD)
TD update for a transition from s to s':
$V^\pi(s) \leftarrow V^\pi(s) + \alpha \left( R(s) + \gamma V^\pi(s') - V^\pi(s) \right)$
The left-hand side is the updated estimate, $\alpha$ is the learning rate, and $R(s) + \gamma V^\pi(s')$ is a (noisy) sample of the value at s based on the next state s'.
So the update is maintaining a "mean" of the (noisy) value samples.
If the learning rate decreases appropriately with the number of samples (e.g. 1/n) then the value estimates will converge to the true values! (non-trivial)
$V^\pi(s) = R(s) + \gamma \sum_{s'} T(s, a, s')\, V^\pi(s')$
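A single TD update is small enough to write out directly. The sketch below assumes the value table is a plain dict with unseen states valued at 0; the function name and the example numbers are hypothetical.

```python
def td_update(V, s, reward, s_next, alpha=0.1, gamma=1.0):
    """Move V[s] toward the noisy sample reward + gamma * V[s_next]."""
    old = V.get(s, 0.0)
    sample = reward + gamma * V.get(s_next, 0.0)
    V[s] = old + alpha * (sample - old)
    return V[s]

V = {"B": 1.0}
td_update(V, "A", reward=0.0, s_next="B", alpha=0.5)
```

With a learning rate of 0.5 the estimate for A moves halfway from 0 toward the sample value 1.0.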

21 Approach 3: Temporal Difference Learning (TD)
TD update for a transition from s to s':
$V^\pi(s) \leftarrow V^\pi(s) + \alpha \left( R(s) + \gamma V^\pi(s') - V^\pi(s) \right)$
Here $\alpha$ is the learning rate and $R(s) + \gamma V^\pi(s')$ is a (noisy) sample of utility based on the next state.
Intuition about convergence: when $V^\pi$ satisfies the Bellman constraints
$V^\pi(s) = R(s) + \gamma \sum_{s'} T(s, a, s')\, V^\pi(s')$
the expected update is 0.
We can use results from stochastic optimization theory to prove convergence in the limit.

22 The TD learning curve
Tradeoff: TD requires more training experience (epochs) than ADP but much less computation per epoch.
The choice depends on the relative cost of experience vs. computation.

23 Passive RL: Comparisons
Monte-Carlo Direct Estimation (model free): Simple to implement. Each update is fast. Does not exploit Bellman constraints. Converges slowly.
Adaptive Dynamic Programming (model based): Harder to implement. Each update is a full policy evaluation (expensive). Fully exploits Bellman constraints. Fast convergence (in terms of updates).
Temporal Difference Learning (model free): Update speed and implementation similar to direct estimation. Partially exploits Bellman constraints: adjusts a state to agree with its observed successor, not all possible successors as in ADP. Convergence in between direct estimation and ADP.

24 Between ADP and TD
Moving TD toward ADP: at each step perform TD updates based on the observed transition and "imagined" transitions.
Imagined transitions are generated using the estimated model.
The more imagined transitions used, the more like ADP, making the estimate more consistent with the next-state distribution.
Converges to ADP in the limit of infinite imagined transitions.

25 Active Reinforcement Learning
So far, we've assumed the agent has a policy. We just learned how good it is.
Now, suppose the agent must learn a good policy (ideally optimal) while acting in an uncertain world.

26 Naïve Approach
1. Act randomly for a (long) time, or systematically explore all possible actions
2. Learn the transition function and the reward function
3. Use value iteration, policy iteration, ...
4. Follow the resulting policy thereafter
Will this work? Yes (if we do step 1 long enough and there are no "dead-ends").
Any problems? We will act randomly for a long time before exploiting what we know.

27 Revision of Naïve Approach
1. Start with an initial (uninformed) model
2. Solve for the optimal policy given the current model (using value or policy iteration)
3. Execute the action suggested by the policy in the current state
4. Update the estimated model based on the observed transition
5. Goto 2
This is just ADP, but we follow the greedy policy suggested by the current value estimate.
Will this work? No. It can get stuck in local minima.
What can be done?

28 Exploration versus Exploitation
Two reasons to take an action in RL:
Exploitation: to try to get reward. We exploit our current knowledge to get a payoff.
Exploration: to get more information about the world. How do we know there is not a pot of gold around the corner?
To explore we typically need to take actions that do not seem best according to our current model.
Managing the trade-off between exploration and exploitation is a critical issue in RL.
Basic intuition behind most approaches: explore more when knowledge is weak, exploit more as we gain knowledge.

29 ADP-based RL
1. Start with an initial model
2. Solve for the optimal policy given the current model (using value or policy iteration)
3. Take an action according to an explore/exploit policy (explores more early on and gradually uses the policy from 2)
4. Update the estimated model based on the observed transition
5. Goto 2
This is just ADP, but we follow the explore/exploit policy.
Will this work? Depends on the explore/exploit policy. Any ideas?

30 Explore/Exploit Policies
The greedy action is the action maximizing the estimated Q-value
$Q(s, a) = R(s) + \gamma \sum_{s'} T(s, a, s')\, V(s')$
where $V$ is the current optimal value function estimate (based on the current model), and $R$, $T$ are the current estimates of the model.
$Q(s, a)$ is the expected value of taking action a in state s and then getting the estimated value $V(s')$ of the next state s'.
We want an exploration policy that is greedy in the limit of infinite exploration (GLIE). This guarantees convergence.
GLIE Policy 1: on time step t select a random action with probability p(t) and the greedy action with probability 1 - p(t). p(t) = 1/t will lead to convergence, but is slow.
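GLIE Policy 1 can be sketched as below. It assumes the Q-values for the current state are given as a dict from action to estimate and that time steps are counted from 1; the function name and the `rng` parameter are illustrative.

```python
import random

def glie_action(q_values, t, rng=random):
    """Pick a random action with probability p(t) = 1/t, else the greedy action.

    q_values maps each action to its current estimated Q-value; t starts at 1.
    """
    if rng.random() < 1.0 / t:
        return rng.choice(list(q_values))      # explore
    return max(q_values, key=q_values.get)     # exploit

rng = random.Random(0)
early = glie_action({"a1": 1.0, "a2": 2.0}, t=1, rng=rng)
late = glie_action({"a1": 1.0, "a2": 2.0}, t=10**9, rng=rng)
```

At t = 1 the choice is always random, while for very large t the greedy action is chosen almost surely.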

31 Explore/Exploit Policies
The greedy action is the action maximizing the estimated Q-value
$Q(s, a) = R(s) + \gamma \sum_{s'} T(s, a, s')\, V(s')$
where $V$ is the current value function estimate, and $R$, $T$ are the current estimates of the model.
GLIE Policy 2: Boltzmann Exploration. Select action a with probability
$\Pr(a \mid s) = \frac{\exp\left( Q(s, a) / T \right)}{\sum_{a' \in A} \exp\left( Q(s, a') / T \right)}$
Here T is the temperature. Large T means that each action has about the same probability. Small T leads to more greedy behavior.
Typically start with large T and decrease it with time.

32 The Impact of Temperature
$\Pr(a \mid s) = \frac{\exp\left( Q(s, a) / T \right)}{\sum_{a' \in A} \exp\left( Q(s, a') / T \right)}$
Suppose we have two actions and that Q(s,a1) = 1, Q(s,a2) = 2.
T = 10 gives Pr(a1 | s) = 0.48, Pr(a2 | s) = 0.52. Almost equal probability, so we will explore.
T = 1 gives Pr(a1 | s) = 0.27, Pr(a2 | s) = 0.73. Probabilities are more skewed, so we explore a1 less.
T = 0.25 gives Pr(a1 | s) = 0.02, Pr(a2 | s) = 0.98. We almost always exploit a2.
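The three temperature settings can be reproduced with a short sketch of the Boltzmann distribution. The function name is illustrative, and subtracting the maximum Q-value before exponentiating is a standard numerical-stability step that does not change the probabilities.

```python
import math

def boltzmann_probs(q_values, temperature):
    """Return Pr(a | s) proportional to exp(Q(s, a) / T) for each action."""
    q_max = max(q_values.values())
    weights = {a: math.exp((q - q_max) / temperature) for a, q in q_values.items()}
    total = sum(weights.values())
    return {a: w / total for a, w in weights.items()}

q = {"a1": 1.0, "a2": 2.0}
table = {T: boltzmann_probs(q, T) for T in (10, 1, 0.25)}
for T, probs in table.items():
    print(T, round(probs["a1"], 2), round(probs["a2"], 2))
```

Rounded to two places, the printed probabilities match the values quoted for T = 10, T = 1, and T = 0.25.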

33 Alternative Approach: Optimistic Exploration
1. Start with an initial model
2. Solve for an "optimistic policy" (uses an optimistic variant of value iteration that inflates the value of actions leading to unexplored regions)
3. Take the greedy action according to the optimistic policy
4. Update the estimated model
5. Goto 2
Basically, act as if all "unexplored" state-action pairs are maximally rewarding.

34 Optimistic Exploration
Recall that value iteration iteratively performs the following update at all states:
$V(s) \leftarrow R(s) + \gamma \max_a \sum_{s'} T(s, a, s')\, V(s')$
The optimistic variant adjusts the update to make actions that lead to unexplored regions look good.
Implement a variant of VI that assigns the highest possible value $V^{\max}$ to any state-action pair that has not been explored enough.
The maximum value is when we get the maximum reward forever:
$V^{\max} = \sum_{t=0}^{\infty} \gamma^t R^{\max} = \frac{R^{\max}}{1 - \gamma}$
What do we mean by "explored enough"? $N(s, a) > N_e$, where $N(s, a)$ is the number of times action a has been tried in state s and $N_e$ is a user-selected parameter.

35 Optimistic Exploration
Standard value iteration update:
$V(s) \leftarrow R(s) + \gamma \max_a \sum_{s'} T(s, a, s')\, V(s')$
Optimistic value iteration computes an optimistic value function $V^+$ using the updates
$V^+(s) \leftarrow R(s) + \gamma \max_a \begin{cases} V^{\max}, & N(s, a) < N_e \\ \sum_{s'} T(s, a, s')\, V^+(s'), & N(s, a) \ge N_e \end{cases}$
The agent will initially behave as if there were wonderful rewards scattered all around: it is optimistic.
But after actions have been tried enough times we will perform standard "non-optimistic" value iteration.
Some recent theoretical results show how to set $N_e$ to arrive at provably optimal learning (with high probability) in polynomial time (the RMAX algorithm).
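A minimal sketch of the optimistic update, under assumptions: the estimated model is stored in dicts, the discount factor is below 1 so that the maximum value is finite, and under-explored pairs are those with fewer than N_e tries. All names here are hypothetical.

```python
def optimistic_value_iteration(states, actions, T, R, N, n_e, r_max,
                               gamma=0.9, sweeps=500):
    """Value iteration in which under-explored (s, a) pairs are valued at V_max.

    T[(s, a)] maps next states to estimated probabilities, R[s] is the
    estimated reward, and N[(s, a)] counts how often a was tried in s.
    """
    v_max = r_max / (1.0 - gamma)  # value of receiving r_max forever
    V = {s: 0.0 for s in states}
    for _ in range(sweeps):
        for s in states:
            backups = []
            for a in actions:
                if N.get((s, a), 0) < n_e:
                    backups.append(v_max)  # optimistic: not explored enough
                else:
                    backups.append(sum(p * V[s2] for s2, p in T[(s, a)].items()))
            V[s] = R[s] + gamma * max(backups)
    return V

# One state with a well-explored self-loop "a" and an untried action "b".
V_plus = optimistic_value_iteration(
    ["s0"], ["a", "b"], T={("s0", "a"): {"s0": 1.0}}, R={"s0": 0.0},
    N={("s0", "a"): 5}, n_e=3, r_max=1.0, gamma=0.5)
```

Although every observed reward is 0, the untried action "b" makes state s0 look valuable, so a greedy agent would try "b".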

36 Another View on Optimistic Exploration: The Rmax Algorithm
1. Start with an optimistic model (assign the largest possible reward to "unexplored states"; actions from "unexplored states" only self-transition)
2. Solve for the optimal policy in the optimistic model
3. Take the greedy action according to the policy
4. Update the optimistic estimated model (if a state becomes "known" then use its true statistics)
5. Goto 2
The agent always acts greedily according to a model that assumes all "unexplored" states are maximally rewarding.

37 Rmax: Optimistic Model
Keep track of the number of times a state-action pair is tried.
If $N(s, a) < N_e$ then $T(s, a, s) = 1$ and $R(s) = R^{\max}$; otherwise $T(s, a, s')$ and $R(s)$ are based on estimates obtained from the $N_e$ experiences (the estimate of the true model).
For large enough $N_e$ these will be accurate estimates.
An optimal policy for this optimistic model will try to reach unexplored states (those with unexplored actions), since it can stay at those states and accumulate maximum reward.
Never explicitly explores. Is always greedy, but with respect to an optimistic outlook.
Theoretically efficient algorithm.

38 TD-based Active RL
1. Start with an initial value function
2. Take an action from the explore/exploit policy, giving new state s' (the policy should converge to the greedy policy, i.e. GLIE)
3. Update the estimated model
4. Perform the TD update $V(s) \leftarrow V(s) + \alpha \left( R(s) + \gamma V(s') - V(s) \right)$. $V(s)$ is the new estimate of the optimal value function at state s.
5. Goto 2
Just like TD for passive RL, but we follow the explore/exploit policy.
Given the usual assumptions about the learning rate and GLIE, TD will converge to an optimal value function!

39 TD-based Active RL
1. Start with an initial value function
2. Take an action from the explore/exploit policy, giving new state s' (the policy should converge to the greedy policy, i.e. GLIE)
3. Update the estimated model
4. Perform the TD update $V(s) \leftarrow V(s) + \alpha \left( R(s) + \gamma V(s') - V(s) \right)$. $V(s)$ is the new estimate of the optimal value function at state s.
5. Goto 2
Requires an estimated model. Why? To compute $Q(s, a)$ for greedy policy execution.
Can we construct a model-free variant?

40 Q-Learning: Model-Free RL
Instead of learning the optimal value function V, directly learn the optimal Q-function.
Recall that $Q(s, a)$ is the expected value of taking action a in state s and then following the optimal policy thereafter.
The optimal Q-function satisfies $V(s) = \max_{a'} Q(s, a')$, which gives:
$Q(s, a) = R(s) + \gamma \sum_{s'} T(s, a, s')\, V(s') = R(s) + \gamma \sum_{s'} T(s, a, s') \max_{a'} Q(s', a')$
Given the Q-function we can act optimally by selecting actions greedily according to $Q(s, a)$, without a model.
How can we learn the Q-function directly?

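41 Q-Learning: Model-Free RL
Bellman constraints on the optimal Q-function:
$Q(s, a) = R(s) + \gamma \sum_{s'} T(s, a, s') \max_{a'} Q(s', a')$
We can perform updates after each action just like in TD. After taking action a in state s and reaching state s', do the following (note that we directly observe the reward R(s)):
$Q(s, a) \leftarrow Q(s, a) + \alpha \left( R(s) + \gamma \max_{a'} Q(s', a') - Q(s, a) \right)$
Here $R(s) + \gamma \max_{a'} Q(s', a')$ is a (noisy) sample of the Q-value based on the next state.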

42 Q-Learning
1. Start with an initial Q-function (e.g. all zeros)
2. Take an action from the explore/exploit policy, giving new state s' (the policy should converge to the greedy policy, i.e. GLIE)
3. Perform the TD update $Q(s, a) \leftarrow Q(s, a) + \alpha \left( R(s) + \gamma \max_{a'} Q(s', a') - Q(s, a) \right)$. $Q(s, a)$ is the current estimate of the optimal Q-function.
4. Goto 2
Does not require a model since we learn Q directly!
Uses an explicit |S| x |A| table to represent Q.
The explore/exploit policy directly uses Q-values, e.g. use Boltzmann exploration.
The book uses an exploration function for exploration (Figure 21.8).
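The loop above can be sketched in tabular form. This is an illustration under assumptions, not the course's reference code: exploration is fixed-rate epsilon-greedy rather than a true GLIE schedule, the environment is reached through a hypothetical `step(s, a)` hook returning the reward observed in s, the next state, and a terminal flag, and `chain_step` is a made-up three-state chain.

```python
import random
from collections import defaultdict

def q_learning(step, start, actions, episodes=500, alpha=0.5, gamma=0.9,
               epsilon=0.1, seed=0):
    """Tabular Q-learning with epsilon-greedy exploration."""
    rng = random.Random(seed)
    Q = defaultdict(float)  # Q[(state, action)], initialised to zero
    for _ in range(episodes):
        s, done = start, False
        while not done:
            if rng.random() < epsilon:
                a = rng.choice(actions)                     # explore
            else:
                a = max(actions, key=lambda b: Q[(s, b)])   # exploit
            reward, s_next, done = step(s, a)
            best_next = 0.0 if done else max(Q[(s_next, b)] for b in actions)
            Q[(s, a)] += alpha * (reward + gamma * best_next - Q[(s, a)])
            s = s_next
    return Q

def chain_step(s, a):
    """Hypothetical chain 0 - 1 - 2: reward 10 is observed in state 2, which then terminates."""
    if s == 2:
        return 10.0, "terminal", True
    return 0.0, (s + 1 if a == "right" else max(s - 1, 0)), False

Q = q_learning(chain_step, start=0, actions=["right", "left"])
```

On this chain the learned values approach 10, 9, and 8.1 for moving right from states 2, 1, and 0, so the greedy policy heads for the goal without ever estimating a transition model.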

43 Q-Learning: Speedup for Goal-Based Problems
Goal-based problem: receive a big reward in the goal state and then transition to a terminal state. Mini-project 2 is goal based.
Consider initializing Q(s,a) to zeros and then observing the following sequence of (state, reward, action) triples:
(s0,0,a0) (s1,0,a1) (s2,10,a2) (terminal,0)
The sequence of Q-value updates would result in: Q(s0,a0) = 0, Q(s1,a1) = 0, Q(s2,a2) = 10.
So nothing was learned at s0 and s1.
Next time this trajectory is observed we will get a non-zero value for Q(s1,a1) but still Q(s0,a0) = 0.

44 Q-Learning: Speedup for Goal-Based Problems
From the example we see that it can take many learning trials for the final reward to "back propagate" to early state-action pairs.
Two approaches for addressing this problem:
1. Trajectory replay: store each trajectory and do several iterations of Q-updates on each one
2. Reverse updates: store the trajectory and do Q-updates in reverse order
In our example (with learning rate and discount factor equal to 1 for ease of illustration) reverse updates would give Q(s2,a2) = 10, Q(s1,a1) = 10, Q(s0,a0) = 10.
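The forward and reverse orderings can be compared on the example trajectory with a small sketch. It assumes the trajectory is stored as (state, reward, action, next state) tuples and that Q is a dict with missing entries treated as 0; the name `replay` is illustrative.

```python
def replay(transitions, Q, alpha=1.0, gamma=1.0, reverse=False):
    """Apply one Q-update per stored transition, in forward or reverse order."""
    order = reversed(transitions) if reverse else transitions
    for s, reward, a, s_next in order:
        # Best known Q-value at the next state (0 if nothing is stored for it).
        next_values = [q for (s2, _), q in Q.items() if s2 == s_next]
        best_next = max(next_values, default=0.0)
        old = Q.get((s, a), 0.0)
        Q[(s, a)] = old + alpha * (reward + gamma * best_next - old)
    return Q

trajectory = [("s0", 0.0, "a0", "s1"),
              ("s1", 0.0, "a1", "s2"),
              ("s2", 10.0, "a2", "terminal")]
forward = replay(trajectory, {})
backward = replay(trajectory, {}, reverse=True)
```

One forward pass leaves Q(s0,a0) and Q(s1,a1) at 0, while one reverse pass sets all three entries to 10.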

45 Q-Learning: Suggestions for Mini Project 2
A very simple exploration strategy is $\epsilon$-greedy exploration (generally called "epsilon greedy").
Select a small value for $\epsilon$ (perhaps 0.1).
On each step: with probability $\epsilon$ select a random action, and with probability $1 - \epsilon$ select a greedy action.
But it might be interesting to play with exploration a bit (e.g. compare to a decreasing exploration rate).
You can use a discount factor of one.

46 Active Reinforcement Learning Summary
Methods: ADP, Temporal Difference Learning, Q-learning.
All converge to the optimal policy assuming a GLIE exploration strategy.
Optimistic exploration with ADP can be shown to converge in polynomial time with high probability.
All methods assume the world is not too dangerous (no cliffs to fall off during exploration).
So far we have assumed small state spaces.

47 ADP vs. TD vs. Q
Different opinions. (My opinion:) when the state space is small this is not such an important issue.
Computation time: ADP-based methods use more computation time per step.
Memory usage (with n states and m actions): ADP-based methods use $O(mn^2)$ memory. Active TD-learning uses $O(mn^2)$ memory (must store the model). Q-learning uses $O(mn)$ memory for the Q-table.
Learning efficiency (performance per unit experience): ADP-based methods make more efficient use of experience by storing a model that summarizes the history and then reasoning about the model (e.g. via value iteration or policy iteration).

48 What about large state spaces?
One approach is to map the original state space S to a much smaller state space S' via some hashing function. Ideally "similar" states in S are mapped to the same state in S'.
Then do learning over S' instead of S.
Note that the world may not look Markovian when viewed through the lens of S', so convergence results may not apply.
But the approach can still work if a good enough S' is engineered (requires careful design), e.g.
Empirical Evaluation of a Reinforcement Learning Spoken Dialogue System. With S. Singh, D. Litman, M. Walker. Proceedings of the 17th National Conference on Artificial Intelligence, 2000.
We will now study three other approaches for dealing with large state spaces: value function approximation, policy gradient methods, and Least-Squares Policy Iteration.
