
Foundations of Artificial Intelligence
44. Monte-Carlo Tree Search: Introduction
Thomas Keller, Universität Basel
May 27, 2016

Board Games: Overview
Chapter overview:
41. Introduction and State of the Art
42. Minimax Search and Evaluation Functions
43. Alpha-Beta Search
44. Monte-Carlo Tree Search: Introduction
45. Monte-Carlo Tree Search: Advanced Topics
46. AlphaGo and Outlook

Introduction

Monte-Carlo Tree Search: Brief History
Starting in the 1930s: first researchers experiment with Monte-Carlo methods
1998: Ginsberg's GIB player competes with expert Bridge players (this chapter)
2002: Kearns et al. propose Sparse Sampling (this chapter)
2002: Auer et al. present UCB1 action selection for multi-armed bandits (Chapter 45)
2006: Coulom coins the term Monte-Carlo Tree Search (MCTS) (this chapter)
2006: Kocsis and Szepesvári combine UCB1 and MCTS into the most famous MCTS variant, UCT (Chapter 45)

Monte-Carlo Tree Search: Applications
Examples of successful applications of MCTS in games:
board games (e.g., Go, Chapter 46)
card games (e.g., Poker)
AI for computer games (e.g., for real-time strategy games or Civilization)
story generation (e.g., for dynamic dialogue generation in computer games)
General Game Playing
Also many applications in other areas, e.g., MDPs (planning with stochastic effects) or POMDPs (MDPs with partial observability).

Monte-Carlo Methods

Monte-Carlo Methods: Idea
summarize a broad family of algorithms
decisions are based on random samples
results of samples are aggregated by computing the average
apart from that, algorithms can differ significantly

Monte-Carlo Methods: Example
Bridge player GIB, based on Hindsight Optimization (HOP)
perform samples as long as resources (deliberation time, memory) allow:
  sample a hand for all players that is consistent with the current knowledge about the game state
  for each legal action, compute whether the perfect information game that starts with executing that action is won or lost
compute the win percentage for each action over all samples
play the card with the highest win percentage
(a Python sketch of this loop follows below)
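The loop above can be written down compactly. The following is a hedged Python sketch, not GIB itself; legal_actions, sample_consistent_hand, and solve_perfect_information are hypothetical helpers standing in for the Bridge-specific machinery:

def hop_choose_card(game_state, num_samples=100):
    # Pick the action with the highest win percentage over sampled deals.
    wins = {action: 0 for action in legal_actions(game_state)}
    for _ in range(num_samples):
        # sample a deal for all players that is consistent with the
        # current knowledge about the game state
        deal = sample_consistent_hand(game_state)
        for action in wins:
            # solve the perfect-information game that starts with this
            # action: returns 1 if it is won, 0 if it is lost
            wins[action] += solve_perfect_information(deal, first_action=action)
    # play the card with the highest win percentage
    return max(wins, key=lambda action: wins[action] / num_samples)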

Hindsight Optimization: Example
[Figure sequence: the win percentages of three candidate moves are updated after each sample, e.g., 0% / 100% / 0% after one sample, 50% / 100% / 0% after two, and 67% / 100% / 33% after three.]

Hindsight Optimization: Restrictions
HOP is well-suited for imperfect information games like most card games (Bridge, Skat, Klondike Solitaire)
it must be possible to solve or approximate the sampled game efficiently
HOP is often not optimal, even if provided with infinite resources

Hindsight Optimization: Suboptimality
[Figure: example position with a "safe" move and a "gamble" move whose outcomes are labeled "hit" and "miss"; it illustrates that HOP can be suboptimal.]

Sparse Sampling

Reminder: Minimax for Games
Minimax: alternate maximization and minimization

Excursion: Expectimax for MDPs
Expectimax: alternate maximization and expectation (expectation = probability-weighted sum)
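To make the alternation concrete, a minimal expectimax recursion might look as follows. This is a hedged sketch; the state interface (is_terminal, utility, is_max_node, successors, outcomes) is an assumption for illustration, not from the slides:

def expectimax(state):
    # terminal states return their utility
    if state.is_terminal():
        return state.utility()
    if state.is_max_node():
        # decision node: the agent maximizes over successor values
        return max(expectimax(s) for s in state.successors())
    # chance node: probability-weighted sum over (probability, successor) pairs
    return sum(p * expectimax(s) for p, s in state.outcomes())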

Sparse Sampling: Idea
search tree creation: in each state, sample a constant number of outcomes according to their probability and ignore the rest
update values by replacing probability-weighted updates with the average
near-optimal: utility of the resulting policy is close to the utility of the optimal policy
runtime independent of the number of states
(a sketch of the recursion follows below)
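As a hedged illustration of this idea (the mdp interface, the sample width, and the fixed horizon are assumptions following Kearns et al.'s scheme, not details from the slides):

def sparse_sampling_value(mdp, state, horizon, width):
    # Estimate the value of `state` by sampling `width` outcomes per action.
    if horizon == 0 or mdp.is_terminal(state):
        return mdp.utility(state)
    best = float("-inf")
    for action in mdp.actions(state):
        # sample a constant number of successor states according to their
        # probability and ignore the rest (immediate rewards omitted here)
        samples = [mdp.sample_successor(state, action) for _ in range(width)]
        # replace the probability-weighted update with a plain average
        value = sum(sparse_sampling_value(mdp, s, horizon - 1, width)
                    for s in samples) / width
        best = max(best, value)
    return best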

Sparse Sampling: Search Tree Without Sparse Sampling

Sparse Sampling: Search Tree With Sparse Sampling

Sparse Sampling: Problems
runtime is independent of the number of states, but still exponential in the lookahead horizon
the constant that gives the number of sampled outcomes must be large for good bounds on near-optimality
search time is difficult to predict
the tree is symmetric: resources are wasted in non-promising parts of the tree

MCTS

Monte-Carlo Tree Search: Idea
perform iterations as long as resources (deliberation time, memory) allow
builds a search tree of nodes n annotated with
  utility estimate ˆQ(n)
  visit counter N(n)
initially, the tree contains only the root node
finally, execute the action that leads to the node with the highest utility estimate

Monte-Carlo Tree Search: Iterations
Each iteration consists of four phases:
selection: traverse the tree by applying the tree policy
expansion: add to the tree the first visited state that is not in the tree
simulation: continue by applying the default policy until a terminal state is reached (which yields the utility of the current iteration)
backpropagation: for all visited nodes n, increase N(n) and extend the current average ˆQ(n) with the yielded utility (see the update formula below)
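The backpropagation phase can maintain ˆQ(n) as a running average without storing past utilities. Assuming the usual incremental-mean formulation (the slide does not spell it out), each visited node n with observed utility u is updated as

N(n) ← N(n) + 1
ˆQ(n) ← ˆQ(n) + (u − ˆQ(n)) / N(n)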

Monte-Carlo Tree Search
[Figure sequence: one iteration on a search tree whose nodes are annotated with utility estimate ˆQ(n) and visit counter N(n).
Selection: apply tree policy to traverse the tree.
Expansion: create a node (ˆQ = 0, N = 0) for the first state beyond the tree.
Simulation: apply default policy until a terminal state is reached, here yielding utility 39.
Backpropagation: update utility estimates of visited nodes, e.g., a node with ˆQ = 18 and N = 2 becomes ˆQ = 25 and N = 3, and the root with ˆQ = 11 and N = 13 becomes ˆQ = 13 and N = 14.]

Monte-Carlo Tree Search: Pseudo-Code

Monte-Carlo Tree Search
    tree := new SearchTree
    n0 = tree.add_root_node()
    while time_allows():
        visit_node(tree, n0)
    n = arg max_{n ∈ succ(n0)} ˆQ(n)
    return n.get_action()

Monte-Carlo Tree Search: Pseudo-Code

function visit_node(tree, n):
    if is_final(n.state):
        return u(n.state)
    s = tree.get_unvisited_successor(n)
    if s ≠ none:
        n' = tree.add_child_node(n, s)
        utility = apply_default_policy()
        backup(n', utility)
    else:
        n' = apply_tree_policy(n)
        utility = visit_node(tree, n')
        backup(n', utility)
    return utility
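For readers who prefer running code, here is a hedged Python companion to the pseudocode above; the game interface (actions, result, is_final, utility), the purely greedy tree policy, and the random default policy are illustrative assumptions rather than part of the slides:

import random
import time

class Node:
    def __init__(self, state, action=None):
        self.state, self.action = state, action
        self.children = []
        self.Q, self.N = 0.0, 0              # utility estimate and visit counter

def backup(node, utility):
    # extend the running average of observed utilities
    node.N += 1
    node.Q += (utility - node.Q) / node.N

def default_policy(game, state):
    # simulation: play randomly until a terminal state is reached
    while not game.is_final(state):
        state = game.result(state, random.choice(game.actions(state)))
    return game.utility(state)

def visit_node(game, node):
    if game.is_final(node.state):
        return game.utility(node.state)
    tried = {child.action for child in node.children}
    untried = [a for a in game.actions(node.state) if a not in tried]
    if untried:                                            # expansion
        action = random.choice(untried)
        child = Node(game.result(node.state, action), action)
        node.children.append(child)
        utility = default_policy(game, child.state)        # simulation
    else:                                                  # selection
        child = max(node.children, key=lambda c: c.Q)      # greedy placeholder (UCT: Chapter 45)
        utility = visit_node(game, child)
    backup(child, utility)                                 # backpropagation
    return utility

def mcts(game, state, seconds=1.0):
    root = Node(state)
    deadline = time.time() + seconds
    while time.time() < deadline:                          # as long as resources allow
        visit_node(game, root)
    return max(root.children, key=lambda c: c.Q).action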

Summary

Summary
Simple Monte-Carlo methods like Hindsight Optimization perform well in some games, but are suboptimal even with unbounded resources.
Sparse Sampling allows near-optimal solutions with runtime independent of the number of states, but it wastes time in non-promising parts of the tree.
Monte-Carlo Tree Search algorithms iteratively build a search tree. They are specified in terms of a tree policy and a default policy. (We analyze their theoretical properties in the next chapter.)