Speeding Up Exact Solutions of Interactive Dynamic Influence Diagrams Using Action Equivalence


Yifeng Zeng, Aalborg University, Denmark
Prashant Doshi, University of Georgia, USA

Outline
- Motivation
- Background
  - Interactive Influence Diagrams (I-ID)
  - Interactive Dynamic Influence Diagrams (I-DID)
- Exact Method
  - Behavioral Equivalence
  - Action Equivalence
- Theoretical Results
- Experimental Results

Motivation: State of the Art
- Exact solutions of I-DIDs up to 4 time horizons on standard problems (Doshi et al., JAAMAS 09)
  - Multiagent Tiger example: |S| = 2, |Ω| = 6, |A| = 3
  - Multiagent Machine example: |S| = 4, |Ω| = 2, |A| = 4
- Approximate solutions using model clustering (Zeng and Doshi, AAAI 07)
  - Solves I-DIDs up to 8 time horizons
- Improved approximation using discriminative model updates (Doshi and Zeng, AAMAS 09)
  - Solves I-DIDs up to [value missing] time horizons
- Critical need for exact solutions over longer time horizons
  - Provides optimal benchmarks for approximations

Background: Interactive Influence Diagram (I-ID) (Doshi et al., JAAMAS 09)
- A generic level l I-ID for agent i situated with one other agent j
- Model node M_{j,l-1}: models of agent j at level l-1
- Policy link (dashed arrow): distribution over agent j's actions given its models
- Beliefs on M_{j,l-1}: P(M_{j,l-1} | s)


Background: Details of the Model Node
- Members of the model node
  - Different chance nodes: solutions of the models m_{j,l-1}
  - Mod[M_j] represents the different models of agent j
- The CPT of the chance node A_j is a multiplexer
  - It assumes the distribution of each of the action nodes (A_j^1, A_j^2) depending on the value of Mod[M_j]
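The multiplexer behavior of A_j can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the model names `m1` and `m2`, the action labels (L = listen, OL = open left door, OR = open right door), and the belief weights are all hypothetical, and the belief is taken for one fixed physical state.

```python
# Hypothetical solved models of agent j: each yields a distribution over
# j's actions (L = listen, OL = open left door, OR = open right door).
ACTION_NODES = {
    "m1": {"L": 1.0, "OL": 0.0, "OR": 0.0},
    "m2": {"L": 0.0, "OL": 1.0, "OR": 0.0},
}

def multiplexer_cpt(mod_value):
    """CPT row of A_j: copy the action node selected by the value of Mod[M_j]."""
    return ACTION_NODES[mod_value]

def predict_actions(belief_over_models):
    """Marginal over A_j: sum over models of Pr(model) * Pr(action | model)."""
    marginal = {}
    for model, weight in belief_over_models.items():
        for action, prob in multiplexer_cpt(model).items():
            marginal[action] = marginal.get(action, 0.0) + weight * prob
    return marginal

# Assumed belief of agent i over j's models in one fixed state s.
prediction = predict_actions({"m1": 0.7, "m2": 0.3})
```

With these assumed numbers, agent i expects j to listen with probability 0.7 and to open the left door with probability 0.3.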

Background: Interactive Dynamic Influence Diagram (I-DID) (Doshi et al., JAAMAS 09)
- A generic two time-slice level l I-DID for agent i
[Figure: two time-slice I-DID with nodes R_i, A_i^t, A_j^t, S^t, M_{j,l-1}^t, O_i^t and their counterparts at t+1]

Background: Semantics of the Model Update Link
- The CPT of Mod[M_j^{t+1}] is an indicator function: τ(b_{j,l-1}^t, a_j^t, o_j^{t+1}, b_{j,l-1}^{t+1})
[Figure: model update from M_{j,l-1}^t to M_{j,l-1}^{t+1}; the models m_{j,l-1}^{t,1} and m_{j,l-1}^{t,2} at time t are updated, for observations O_j^1 and O_j^2, into the models m_{j,l-1}^{t+1,1} to m_{j,l-1}^{t+1,4} with action nodes A_j^1 to A_j^4]

Background: Exact Solutions, Computational Complexity
- The primary complexity of solving I-DIDs is due to the exponential number of models that must be solved over time
  - At time step t: |M^0| (|A| |Ω|)^t
- Nested modeling adds to the complexity
- Multiple agents
  - (N+1)-agent setting: (NM)^l models (M is the bounded number of models at each level)

Behavioral Equivalence
- Two models are behaviorally equivalent if they prescribe the same policy across the whole time horizon
- E.g., the expanded models at t=1 below exhibit the same behaviors
- Actions (node labels): L = Listen, OL = Open left door, OR = Open right door
- Observations (edge labels): GL = Growl from left door, GR = Growl from right door
[Figure: policy trees of the models m_{j,l-1}^1 and m_{j,l-1}^2, expanded from t=0]
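The definition of behavioral equivalence can be illustrated by grouping models whose policy trees are identical. This is a sketch under stated assumptions: the tree encoding, the helper names `tree` and `behavioral_classes`, and the three example models are hypothetical, with L, OL, OR standing for listen, open left door, open right door and GL, GR for a growl from the left or right door.

```python
from collections import defaultdict

def tree(action, **children):
    """Build a hashable policy tree: an action plus one subtree per observation."""
    return (action, tuple(sorted(children.items())))

def behavioral_classes(policy_trees):
    """Group model names whose policy trees are identical over the whole horizon."""
    classes = defaultdict(list)
    for name, policy in policy_trees.items():
        classes[policy].append(name)
    return sorted(classes.values())

# Hypothetical two-step policies for three models of agent j.
models = {
    "m1": tree("L", GL=tree("L"), GR=tree("L")),
    "m2": tree("L", GL=tree("L"), GR=tree("L")),    # same policy as m1
    "m3": tree("L", GL=tree("OR"), GR=tree("OL")),  # differs at the second step
}
groups = behavioral_classes(models)
```

Here `m1` and `m2` fall into one class, while `m3` forms its own class even though all three models listen at the first step.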

Behavioral Equivalence: Build the policy graph
[Figure: the policy trees of the models m_{j,l-1}^1, m_{j,l-1}^2, and m_{j,l-1}^3, expanded from time t=0, are merged in two steps, shown in panels (a) and (b), to form the policy graph in panel (c)]
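One way to merge policy trees into a policy graph is to intern subtrees bottom-up so that identical subtrees share a single node. The sketch below uses this generic technique (often called hash-consing); it is not claimed to be the authors' merge procedure. The encoding of a tree as `(action, ((observation, subtree), ...))` and the example models are assumptions for illustration.

```python
def merge_into_graph(policy_trees):
    """Merge policy trees bottom-up; identical subtrees share one graph node."""
    node_ids = {}   # (action, ((obs, child_id), ...)) -> node id
    roots = {}      # model name -> id of its root node

    def intern(tree):
        action, children = tree
        # Children are interned first, so equal subtrees map to equal ids.
        key = (action, tuple((obs, intern(sub)) for obs, sub in children))
        if key not in node_ids:
            node_ids[key] = len(node_ids)
        return node_ids[key]

    for name, tree in policy_trees.items():
        roots[name] = intern(tree)
    return node_ids, roots

# Hypothetical two-step policy trees (leaves have no children).
leaf_l, leaf_ol, leaf_or = ("L", ()), ("OL", ()), ("OR", ())
trees = {
    "m1": ("L", (("GL", leaf_l), ("GR", leaf_l))),
    "m2": ("L", (("GL", leaf_l), ("GR", leaf_l))),
    "m3": ("L", (("GL", leaf_or), ("GR", leaf_ol))),
}
nodes, roots = merge_into_graph(trees)
```

The three trees contain nine nodes in total, but the merged graph needs only five, and `m1` and `m2` end up pointing at the same root.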

Action Equivalence
- Two models are actionally equivalent if they prescribe identical actions at a particular time step
- They may still differ in their behavior over the whole time horizon
- Actions (node labels): L = Listen, OL = Open left door, OR = Open right door
- Observations (edge labels): GL = Growl from left door, GR = Growl from right door
[Figure: policy trees of the models m_{j,l-1}^1, m_{j,l-1}^2, and m_{j,l-1}^3 from time t=0]

Action Equivalence: Aggregate Action Equivalence Models
- Each class contains models prescribing an identical action at a particular time step
- E.g.: M_{j,l-1}^{t=0} = {M_{j,l-1}^{t=0,1}, M_{j,l-1}^{t=0,2}}
[Figure: policy graph from time t=0 with models grouped into action equivalence classes]
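Aggregation by action equivalence amounts to grouping models by the single action each prescribes at the current time step. A minimal sketch, assuming hypothetical models `m1` to `m3` and the action labels L (listen) and OL (open left door):

```python
from collections import defaultdict

def action_classes(prescribed_action):
    """Group models by the action they prescribe at one time step."""
    classes = defaultdict(list)
    for model, action in prescribed_action.items():
        classes[action].append(model)
    return dict(classes)

# Hypothetical actions prescribed at t=0 by three models of agent j.
at_t0 = {"m1": "L", "m2": "L", "m3": "OL"}
classes_t0 = action_classes(at_t0)
```

Because there is one class per prescribed action, the number of classes at a time step can never exceed the number of actions available to agent j.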

Action Equivalence: Model Update Issue
- The CPT of Mod[M_{j,l-1}^{t+1}] is no longer deterministic (an indicator function) but probabilistic
- Modify the indicator function: τ(b_{j,l-1}^t, a_j^t, o_j^{t+1}, b_{j,l-1}^{t+1})
[Figure: panel (a), policy graph from time t=0]

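Action Equivalence: Revise CPT

\[
\begin{aligned}
\Pr(M^{t+1,p}_{j,l-1} \mid M^{t,q}_{j,l-1}, a^t_j, o^{t+1}_j)
&= \sum_{m^{t+1}_{j,l-1} \in M^{t+1,p}_{j,l-1}} \Pr(m^{t+1}_{j,l-1} \mid M^{t,q}_{j,l-1}, a^t_j, o^{t+1}_j) \\
&= \sum_{m^{t+1}_{j,l-1} \in M^{t+1,p}_{j,l-1}} \frac{\Pr(m^{t+1}_{j,l-1}, M^{t,q}_{j,l-1}, a^t_j, o^{t+1}_j)}{\Pr(M^{t,q}_{j,l-1}, a^t_j, o^{t+1}_j)} \\
&= \frac{\sum_{m^{t+1}_{j,l-1} \in M^{t+1,p}_{j,l-1}} \sum_{m^t_{j,l-1} \in M^{t,q}_{j,l-1}} \Pr(m^{t+1}_{j,l-1}, M^{t,q}_{j,l-1} \mid m^t_{j,l-1}, a^t_j, o^{t+1}_j)\, \Pr(m^t_{j,l-1} \mid a^t_j, o^{t+1}_j)}{\sum_{m^t_{j,l-1} \in M^{t,q}_{j,l-1}} \Pr(m^t_{j,l-1} \mid a^t_j, o^{t+1}_j)} \\
&= \frac{\sum_{m^{t+1}_{j,l-1} \in M^{t+1,p}_{j,l-1},\; m^t_{j,l-1} \in M^{t,q}_{j,l-1}} b_i(m^t_{j,l-1})\, \tau(b^t_{j,l-1}, a^t_j, o^{t+1}_j, b^{t+1}_{j,l-1})}{\sum_{m^t_{j,l-1} \in M^{t,q}_{j,l-1}} b_i(m^t_{j,l-1})}
\end{aligned}
\]

The probability is the proportion of the probability mass of individual models in a class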
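The revised CPT can be computed directly from agent i's belief over the individual models in a class. The sketch below is an assumption-laden illustration rather than the authors' code: a lookup table `next_model` stands in for the indicator function τ (it gives the single updated model for each model, action, and observation), and all model names, class labels, and belief values are hypothetical.

```python
def revised_cpt(belief, next_model, class_of_next, source_class, action, obs):
    """Pr(M^{t+1,p} | M^{t,q}, a, o) for every target class p.

    belief: b_i(m^t) for each individual model m^t
    next_model: deterministic update (m^t, a, o) -> m^{t+1}, playing the role of tau
    class_of_next: m^{t+1} -> its action equivalence class at t+1
    source_class: the models m^t that make up the class M^{t,q}
    """
    total = sum(belief[m] for m in source_class)
    cpt = {}
    for m in source_class:
        target = class_of_next[next_model[(m, action, obs)]]
        # Each model contributes its share of the class's probability mass.
        cpt[target] = cpt.get(target, 0.0) + belief[m] / total
    return cpt

# Hypothetical class {m1, m2} whose members both listen (L) at time t.
belief = {"m1": 0.3, "m2": 0.1}
next_model = {("m1", "L", "GL"): "n1", ("m2", "L", "GL"): "n2"}
class_of_next = {"n1": "L", "n2": "OR"}
cpt = revised_cpt(belief, next_model, class_of_next, ["m1", "m2"], "L", "GL")
```

With these assumed weights, the class moves to the listening class with probability of about 0.75 and to the open-right class with about 0.25, which is each member's share of the class's belief mass.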

Action Equivalence: Example
[Figure, built up over panels (a) to (d): policy graphs from time t=0 aggregated by action equivalence; in panels (c) and (d) each edge is labeled with an observation and a transition probability, with values such as 0.54, 0.46, 0.27, 0.73, 0.79, 0.21, and 1.0]

Theoretical Discussions: Computational Savings
- Exact method: at most |M_{j,l-1}^0| (|A_j| |Ω_j|)^t models
- Behavioral equivalence: the number of behavioral equivalence classes grows over time
- Action equivalence: |A_j| model classes
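To make the gap between the two bounds concrete, the following sketch tabulates them for a few time steps. The sizes used (10 initial models, 3 actions, 2 observations) are hypothetical and chosen only for illustration, and the action equivalence figure is read as an upper bound of one class per action at each time step.

```python
def exact_bound(m0, n_actions, n_obs, t):
    """Upper bound on candidate models at step t without aggregation."""
    return m0 * (n_actions * n_obs) ** t

def action_equivalence_bound(n_actions):
    """At most one class per action of agent j at each time step."""
    return n_actions

# Each row: (time step, exact bound, action equivalence bound).
rows = [(t, exact_bound(10, 3, 2, t), action_equivalence_bound(3)) for t in range(4)]
```

Under these assumptions the exact bound grows from 10 to 2160 models within three steps, while the number of action equivalence classes stays at no more than 3.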

Theoretical Discussions: Complexity and Optimality of Behavioral Equivalence
- Proposition (Complexity of tree merge): The worst-case complexity of the procedure for merging policy trees to form a policy graph is O((|Ω_j|^{T-1})^{|M_{j,l-1}^0|}), where T is the horizon.
- Proposition (Optimality of solutions): Model update using the policy graph preserves the I-DID's solution.

Theoretical Discussions: Optimality of Action Equivalence
- Proposition (Optimality): The predictive distribution over j's actions is preserved when the model space is aggregated using action equivalence.
- I-DID solution optimality is preserved.

Experimental Results: Comparison
- Exact method (Exact), action equivalence (Exact-AE), and behavioral equivalence (Exact-BE)
- Two standard domains: Multiagent Tiger and Machine Maintenance problems
- Average rewards obtained in the same amount of time
- Computational savings
  - Model space
  - Timing

Experimental Results: Comparison on the Multiagent Tiger problem
[Plots, by I-DID level: average reward vs. time (s) and number of model classes vs. horizon, with series Exact, Exact-BE, and Exact-AE]

Experimental Results: Comparison on the Multiagent Machine problem
[Plots, by I-DID level: average reward vs. time (s) and number of model classes vs. horizon, with series Exact, Exact-BE, and Exact-AE]

Experimental Results: Run Times over Horizons
[Table: run times in seconds of Exact-AE, Exact-BE, and Exact at horizons T for Level 2 I-DIDs in the Tiger and MM domains; the numeric entries are not recoverable]
Table: Aggregation using action equivalence scales significantly better to larger horizons. All experiments are run on a WinXP platform with a dual processor Xeon 2.0GHz and 2GB memory.

Summary: Conclusions and Future Work
- Summary
  - Using behavioral equivalence significantly reduces the model space
  - Using action equivalence provides further reduction while preserving solution optimality
- Further work
  - Approximation using action equivalence
  - Application to more scalable multiagent problems
