Monte-Carlo tree search for multi-player, no-limit Texas hold'em poker. Guy Van den Broeck

1 Monte-Carlo tree search for multi-player, no-limit Texas hold'em poker Guy Van den Broeck

3-10 Questions at the poker table, and the challenge behind each one:
  Should I bluff? (Deceptive play)
  Is he bluffing? (Opponent modeling)
  Who has the Ace? (Incomplete information)
  What are the odds? (Game of chance)
  I'll bet because he always calls. (Exploitation)
  What can happen next? (Huge state space)
  Should I bet $5 or $10? (Risk management & continuous action space)
  Take-away message: We can solve all these problems!

11 Problem Statement
  A bot for Texas hold'em poker: No-Limit & > 2 players. Not done before!
  Exploitative, not game theoretic
  Game tree search + opponent modeling
  Applies to any problem with incomplete information, non-determinism, or continuous actions

12 Outline
  Overview approach
    The Poker game tree
    Opponent model
    Monte-Carlo tree search
  Research challenges
    Search
    Opponent model
  Conclusion

15-17 Poker Game Tree
  Minimax trees (deterministic): tic-tac-toe, checkers, chess, Go. Node types: max, min.
  Expecti(mini)max trees (chance): backgammon. Node types: max, min, mix.
  Miximax trees (hidden information). Node types: max, mix, mix + opponent model.
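The three tree types differ only in how a node combines the values of its children. The following is a minimal Python sketch of that idea, not the author's implementation. The nested-tuple tree encoding and the toy payoffs are assumptions made for illustration: max nodes take the best child, min nodes the worst, and mix nodes a probability-weighted average whose probabilities come from chance or from an opponent model.

```python
def evaluate(node):
    """Return the backed-up value of a game-tree node.

    Hypothetical encoding (an assumption of this sketch):
      ("leaf", value)
      ("max", [child, ...])           # my decision
      ("min", [child, ...])           # adversarial opponent
      ("mix", [(prob, child), ...])   # chance, or an opponent described by a model
    """
    kind, payload = node
    if kind == "leaf":
        return payload
    if kind == "max":
        return max(evaluate(child) for child in payload)
    if kind == "min":
        return min(evaluate(child) for child in payload)
    if kind == "mix":
        return sum(prob * evaluate(child) for prob, child in payload)
    raise ValueError(f"unknown node kind: {kind}")


# Toy miximax tree: I can fold (value 0) or call. After a call, the modeled
# opponent folds with probability 0.6 (I win 1) or calls with 0.4 (I lose 1).
tree = ("max", [
    ("leaf", 0.0),
    ("mix", [(0.6, ("leaf", 1.0)), (0.4, ("leaf", -1.0))]),
])
value = evaluate(tree)
```

In this toy tree calling is worth about 0.2, which beats folding, so the max node at the root takes that value.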

18-34 [Figure, built up over slides 18-34: an example poker game tree. The root is "my action" with edges fold, call, and raise; folding has value 0. Below it are "Resolve" and "Reveal Cards" nodes and opponent nodes "opp-1 action" and "opp-2 action", whose fold/call/raise edges carry opponent-model probabilities (0.6, 0.3, 0.1). Leaf values such as -1, 0, 1, 2, and 3 are backed up step by step until the root receives the value 3.]

36 Short Experiment

37 Opponent Model
  Set of probability trees (Weka's M5')
  Separate models for actions and for hand cards at showdown

38 Fold Probability
  [Figure: learned tree for the probability of folding. Most thresholds and leaf values were lost in extraction. Surviving splits: nballplayerraises <= 1.5, nbactionsthisround <= 2.5, potodds <= 0.28, round=flop <= 0.5, nbseatedplayers <= 7.5. Other split attributes: callfrequency, foldfrequency, AF, potsize, stacksize. One surviving leaf value: 0.921.]

39 (Can also be relational) Tilde probability tree [Ponsen08]

40 Opponent Ranks
  Learn distribution of hand ranks at showdown
  [Figure: plots with axes Probability, Rank Bucket, and Number of Raises]

42 Traversing the tree
  Limit Texas Hold'em: 10^18 nodes. Fully traversable.
  No-limit: > 10^71 nodes. Too large to traverse. Sampled, not searched.
  Monte-Carlo Tree Search

43 Monte-Carlo Tree Search [Chaslot08]

44-48 Selection
  In each node: $\hat{V}$ is an estimate of the reward, $n$ is the number of samples.
  UCT (Multi-Armed Bandit): select the child $i$ that maximizes $\hat{V}_i + C \sqrt{\ln n_{\mathrm{parent}} / n_i}$. The first term is exploitation, the second is exploration.
  CrazyStone
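A small sketch of UCT selection may make the exploitation/exploration trade-off concrete. It uses the standard UCB1-style score; the `(mean_reward, visits)` layout, the default constant `c`, and the use of the summed child visits as the parent count are assumptions of this sketch rather than details taken from the slides.

```python
import math


def uct_select(children, c=1.0):
    """Return the index of the child with the highest UCT score.

    `children` is a list of (mean_reward, visits) pairs. This layout and
    the default exploration constant `c` are assumptions of the sketch.
    """
    total = sum(visits for _, visits in children)
    best_index, best_score = None, -math.inf
    for i, (mean, visits) in enumerate(children):
        if visits == 0:
            return i  # always try an unvisited child first
        # exploitation term + exploration term
        score = mean + c * math.sqrt(math.log(total) / visits)
        if score > best_score:
            best_index, best_score = i, score
    return best_index


# A rarely sampled child wins on exploration even with a lower mean.
explore_choice = uct_select([(0.5, 100), (0.4, 1)])
# With c = 0 the rule is purely greedy.
greedy_choice = uct_select([(0.5, 100), (0.4, 1)], c=0.0)
```

Raising `c` shifts samples toward rarely visited children; lowering it concentrates samples on the child that currently looks best.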

49 Expansion Simulation

50-52 Backpropagation
  $\hat{V}$ is an estimate of the reward, $n$ is the number of samples.
  Two strategies for updating a parent from its children:
    Sample-weighted average
    Maximum child
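The two backpropagation rules can be compared on a tiny example. This is a hedged sketch: the `(estimate, samples)` pairs and the numbers are illustrative assumptions, not data from the talk.

```python
def sample_weighted_average(children):
    """Mean of the child estimates, weighted by each child's sample count."""
    total = sum(n for _, n in children)
    return sum(v * n for v, n in children) / total


def maximum_child(children):
    """Value of the best child, ignoring sample counts."""
    return max(v for v, _ in children)


# Two children with estimates 3 and 4 and equal sample counts (illustrative).
children = [(3.0, 10), (4.0, 10)]
avg = sample_weighted_average(children)
best = maximum_child(children)
```

The average pulls the parent toward children that were sampled often, even if they are poor choices; the maximum assumes the best-looking child will always be played, even if its estimate is still noisy.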

53 Initial experiments
  1 MCTS bot + 2 rule-based bots
  Exploitative!
  [Figure: results for the MCTS bot]

55 Outline
  Overview approach
    The Poker game tree
    Opponent model
    Monte-Carlo tree search
  Research challenges
    Search
      Uncertainty in MCTS
      Continuous action spaces
    Opponent model
      Online learning
      Concept drift
  Conclusion

58 MCTS for games with uncertainty? Expected reward distributions (ERD) Sample selection using ERD Backpropagation of ERD [VandenBroeck09]

59-74 Expected reward distribution
  [Figure, built up over slides 59-74: reward distributions estimated from 10 samples, 100 samples, and a third sample count lost in extraction, shown in three columns. The "Estimating" row named the quantity estimated in each column; only T(P) survives.]
  MiniMax: variance comes from sampling.
  ExpectiMax/MixiMax: variance comes from uncertainty + sampling.
  ExpectiMax/MixiMax (second column): variance comes from sampling.

75-77 ERD selection strategy
  Objective? Find maximum expected reward.
  Sample more in subtrees with (1) high expected reward and (2) an uncertain estimate.
  UCT does (1) but not really (2).
  CrazyStone does (1) and (2) for deterministic games (Go).
  UCT+ selection combines (1) the expected value under perfect play with (2) a measure of uncertainty due to sampling.
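The UCT+ formula itself did not survive on these slides, so the following is only a sketch of a selection rule with the two ingredients named above. It assumes the score is the sample mean plus a constant times the standard error of that mean; that choice, the function name, and the constant `c` are assumptions for illustration.

```python
import math
import statistics


def uct_plus_select(samples_per_child, c=1.0):
    """Return the index of the child maximizing mean + c * standard error.

    `samples_per_child` holds one list of sampled rewards per child.
    Using the standard error of the mean as the uncertainty term is an
    assumption of this sketch; each child needs at least two samples.
    """
    def score(samples):
        mean = statistics.fmean(samples)
        std_err = statistics.stdev(samples) / math.sqrt(len(samples))
        return mean + c * std_err

    scores = [score(samples) for samples in samples_per_child]
    return scores.index(max(scores))


certain = [1.0, 1.0, 1.0, 1.0]   # higher mean, no spread
uncertain = [-4.0, 5.0]          # lower mean, very noisy estimate
choice = uct_plus_select([certain, uncertain])
greedy = uct_plus_select([certain, uncertain], c=0.0)
```

Unlike the UCT exploration term, which depends only on visit counts, this uncertainty term shrinks when the observed rewards agree with each other.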

78-83 ERD max-distribution backpropagation
  Example: a max node P with two children, A (estimate 3) and B (estimate 4).
  Sample-weighted: 3.5
  Max: 4. When the game reaches P, we'll have more time to find the real
  Max-distribution: 4.5
  Given P(A<4) = 0.8, P(A>4) = 0.2, P(B<4) = 0.5, P(B>4) = 0.5:
  $P(\max(A,B) > 4) = 1 - P(A<4)\,P(B<4) = 1 - 0.8 \times 0.5 = 0.6$
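The arithmetic behind the max-distribution can be checked with a short sketch. It assumes A and B are independent and uses hypothetical two-point distributions, chosen here only so that they reproduce the probabilities and the estimates 3, 4, and 4.5 quoted above; the real reward distributions are not given in the slides.

```python
from itertools import product


def max_distribution(dist_a, dist_b):
    """Distribution of max(A, B) for independent discrete A and B.

    Each distribution is a dict mapping an outcome to its probability.
    """
    result = {}
    for (a, pa), (b, pb) in product(dist_a.items(), dist_b.items()):
        m = max(a, b)
        result[m] = result.get(m, 0.0) + pa * pb
    return result


def expectation(dist):
    """Expected value of a discrete distribution."""
    return sum(x * p for x, p in dist.items())


# Hypothetical outcomes with P(A<4) = 0.8 and P(B<4) = 0.5.
dist_a = {2.125: 0.8, 6.5: 0.2}   # mean 3.0
dist_b = {3.0: 0.5, 5.0: 0.5}     # mean 4.0
dist_max = max_distribution(dist_a, dist_b)
p_above_4 = sum(p for x, p in dist_max.items() if x > 4)
expected_max = expectation(dist_max)
```

The expected maximum (about 4.5 here) exceeds the larger of the two means because either child may turn out to be the better one once more samples arrive.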

84 Experiments
  [Figure: 2 MCTS bots, max-distribution vs. sample-weighted backpropagation]
  [Figure: 2 MCTS bots, UCT+ (stddev) vs. UCT selection]

86 Dealing with continuous actions
  Sample discrete actions
  Progressive unpruning [Chaslot08] (ignores smoothness of EV function)
  ...
  Tree learning search (work in progress)
  [Figure: plot over relative betsize]

87 Tree learning search
  Based on regression tree induction from data streams:
    training examples arrive quickly
    nodes split when there is a significant reduction in stddev
    training examples are immediately forgotten
  Edges in the TLS tree are not actions, but sets of actions, e.g., (raise in [2,40]), (fold or call)
  MCTS provides a stream of (action, ev) examples
  Split action sets to reduce stddev of EV (when significant)
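The split criterion can be illustrated with a small batch version. This is a simplified sketch, not the tree learning search algorithm itself: it keeps all examples in memory instead of forgetting them, omits the significance test, and the function name, the `min_side` parameter, and the toy data are assumptions.

```python
import statistics


def best_split(examples, min_side=2):
    """Find the bet-size threshold that most reduces the stddev of EV.

    `examples` is a list of (bet_size, ev) pairs. Returns
    (threshold, reduction), or (None, 0.0) if no split helps.
    """
    examples = sorted(examples)
    evs = [ev for _, ev in examples]
    base = statistics.pstdev(evs)
    best_threshold, best_reduction = None, 0.0
    for i in range(min_side, len(examples) - min_side + 1):
        if examples[i - 1][0] == examples[i][0]:
            continue  # cannot split between equal bet sizes
        left, right = evs[:i], evs[i:]
        # Size-weighted stddev of the two candidate action sets.
        weighted = (len(left) * statistics.pstdev(left)
                    + len(right) * statistics.pstdev(right)) / len(evs)
        reduction = base - weighted
        if reduction > best_reduction:
            best_threshold = (examples[i - 1][0] + examples[i][0]) / 2
            best_reduction = reduction
    return best_threshold, best_reduction


# Toy (bet, ev) examples: small bets have EV near 1, large bets EV near 3.
stream = [(1, 1.0), (2, 1.1), (3, 0.9), (5, 3.0), (7, 3.1), (9, 2.9)]
threshold, reduction = best_split(stream)
```

On this toy data the best threshold falls between bets 3 and 5, so a set such as "Bet in [0,10]" would be split into two sets around 4. A streaming learner would instead keep running counts, sums, and sums of squares per candidate split.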

88-91 Tree learning search
  [Figure, built up over slides 88-91: a max node with two edges, "Bet in [0,10]" and "{Fold, Call}", leading to max nodes with unknown values (??). The optimal split is found at 4, after which the bet edge is replaced by "Bet in [0,4]" and "Bet in [4,10]", each leading to its own max node.]

92 Tree learning search one action of P1 one action of P2

93 Selection Phase P1 Sample 2.4 Each node has EV estimate, which generalizes over actions

94 Expansion P1 P2 Selected Node

95 Expansion P1 P2 P3 Expanded node Represents any action of P3

96 Backpropagation New sample; Split becomes significant

100 Online learning of opponent model Start from (safe) model of general opponent Exploit weaknesses of specific opponent Start to learn model of specific opponent (exploration of opponent behavior)

101-104 Multi-agent interaction
  [Figure: results for three players, Yellow, Blue, and Green]
  Yellow learns model for Blue and changes strategy
  Yellow doesn't profit!
  Green profits without changing strategy!!

106 Concept drift
  While learning from a stream, the training examples in the stream change
  In opponent model: changing strategy
  "Changing gears is not just about bluffing, it's about changing strategy to achieve a goal."
  Learning with concept drift:
    adapt quickly to changes
    yet robust to noise
    (recognize recurrent concepts)

107 Basic approach to concept drift
  Maintain a window of training examples:
    large enough to learn
    small enough to adapt quickly without 'old' concepts
  Heuristics to adjust window size, based on FLORA2 framework [Widmer92]
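The windowing idea can be sketched with a fixed-size buffer. This example is deliberately simpler than the approach on the slide: it does not implement the FLORA2 heuristics for adjusting the window size, and the class name, the fold-frequency statistic, and the action labels are assumptions made for illustration.

```python
from collections import deque


class WindowedFrequencyModel:
    """Estimate how often an opponent folds, using only recent actions."""

    def __init__(self, window_size):
        # deque(maxlen=...) silently drops the oldest example when full.
        self.window = deque(maxlen=window_size)

    def observe(self, action):
        self.window.append(action)

    def fold_probability(self):
        if not self.window:
            return None  # nothing observed yet
        return sum(a == "fold" for a in self.window) / len(self.window)


# The opponent folds four times, then "changes gears" and calls four times.
model = WindowedFrequencyModel(window_size=4)
for action in ["fold"] * 4 + ["call"] * 4:
    model.observe(action)
recent_fold_rate = model.fold_probability()
```

With a window of 4 the estimate forgets the old folding behavior entirely; a larger window would adapt more slowly but be less sensitive to noise, which is the trade-off the window-size heuristics are meant to manage.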

108 4 components of a single opponent model
  [Figure: plots of Accuracy and Window size, annotated with "Start online learning" and "Concept drift"]

109 Bad parameters for heuristic: NOT ROBUST
  [Figure: plots of Accuracy and Window size]

111 Conclusions
  First exploitative poker bot for No-limit Hold'em with > 2 players
  Apply in other games: backgammon, computational pool, ...
  Challenge for MCTS:
    games with uncertainty
    continuous action space
  Challenge for ML:
    online learning
    concept drift
    (relational learning)

112 Thanks for listening!
