Inverse reinforcement learning from summary data


Inverse reinforcement learning from summary data
Antti Kangasrääsiö, Samuel Kaski
Aalto University, Finland
ECML PKDD 2018 journal track; published in Machine Learning (2018), 107:1517-1535
September 12, 2018

Modelling human decision-making: Motivation

Our overarching goal is to have accurate white-box models of human decision-making.

Applications of high-fidelity user models:
- Replicating demonstrated behavior (imitation learning)
- Optimizing user interfaces (human-computer interaction)
- Estimating cognitive states and goals of humans (chatbots)
- Understanding human cognition (cognitive science)

Modelling human decision-making: Problem

How can we infer the parameters of sequential decision-making models when the available observation data is limited?

Main contribution: We demonstrate that posterior inference is possible for realistic models of decision-making, even with very limited observations of human behavior.

Reinforcement learning models

We use the RL framework for modelling sequential decision-making.
The main assumption is that human decisions can be approximated by an optimal policy trained for a certain decision problem (e.g. an MDP or POMDP): humans make rational decisions within the limitations they have.

Inverse reinforcement learning (IRL)

Inverse reinforcement learning: given a set of observations, which MDP has a matching optimal policy?

Traditional IRL problem. Given
- an MDP with reward function R(s; θ), where θ is unknown
- a set of state-action trajectories Ξ = {ξ_1, ..., ξ_N} demonstrating optimal behavior, where ξ_i = (s_0^i, a_0^i, ..., a_{T_i-1}^i, s_{T_i}^i)
- a prior P(θ),
determine a point estimate θ̂ or the posterior P(θ | Ξ).

Existing solutions

The traditional IRL solution has been gradient-based optimization of the likelihood

\[ L(\theta \mid \Xi) = \prod_{i=1}^{N} P(s_0^i) \prod_{t=0}^{T_i-1} \pi_\theta(s_t^i, a_t^i) \, P(s_{t+1}^i \mid s_t^i, a_t^i) \]

This is tractable when all states and actions are observed; what about when this is not the case?

Previous work: If state observations are corrupted with i.i.d. noise [1] or part of them are missing [2], an EM approach can be used to estimate the true states, after which standard IRL methods apply.

However, this approach is not feasible in the more realistic cases with complex non-i.i.d. noise or with most of the states and actions missing.

[1] Activity forecasting, Kitani et al. 2012
[2] EM for IRL with hidden data, Bogert et al. 2016
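
To make the likelihood above concrete, here is a minimal sketch (an illustration, not the paper's implementation) of evaluating it for fully observed trajectories in a tabular MDP; the array names `initial`, `policy` and `transition` are assumptions for illustration.

```python
import numpy as np

def log_likelihood_full_obs(trajectories, initial, policy, transition):
    """Log-likelihood of fully observed state-action trajectories.

    trajectories: list of (states, actions) pairs; states has length T_i + 1,
                  actions has length T_i
    initial:      P(s_0), shape (S,)
    policy:       pi_theta(s, a), shape (S, A)
    transition:   P(s' | s, a), shape (S, A, S)
    """
    total = 0.0
    for states, actions in trajectories:
        total += np.log(initial[states[0]])
        for t, a in enumerate(actions):
            total += np.log(policy[states[t], a])
            total += np.log(transition[states[t], a, states[t + 1]])
    return total
```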

IRL from summary data (IRL-SD)

We ask whether IRL is possible in realistic cases, where the true trajectories ξ_i are filtered through a generic summarizing function σ, yielding summaries ξ_i^σ = σ(ξ_i).

Example: Alice walks to work every day along her preferred secret route. Could we infer Alice's scenery preferences given only the durations of her commutes and the locations of her work and home?

IRL from summary data (IRL-SD) problem. Given
- an MDP with unknown parameters θ
- a set of summaries Ξ_σ = {ξ_1^σ, ..., ξ_N^σ} from optimal behavior
- the summary function σ
- a prior P(θ),
determine a point estimate θ̂ or the posterior P(θ | Ξ_σ).

Exact solution

The likelihood corresponding to an IRL-SD problem is

\[ L(\theta \mid \Xi_\sigma) = \prod_{i=1}^{N} \sum_{\xi_i \in \Xi_{ap}} P(\xi_i^\sigma \mid \xi_i) \, P(\xi_i \mid \theta), \]

where we marginalize over the unobserved true trajectories ξ_i.
- The set of all plausible true trajectories is Ξ_ap ⊆ S^{T_max+1} × A^{T_max}
- P(ξ_i^σ | ξ_i) is determined by the summary function σ
- The likelihood of a trajectory is as before:

\[ P(\xi_i \mid \theta) = P(s_0^i) \prod_{t=0}^{T_i-1} \pi_\theta(s_t^i, a_t^i) \, P(s_{t+1}^i \mid s_t^i, a_t^i) \]

Takeaway: L(θ | Ξ_σ) can be evaluated, but doing so is very expensive because Ξ_ap is generally large or challenging to determine.
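
For intuition on why exact evaluation is expensive, the following brute-force sketch (an illustration under assumed helper functions, not the paper's code) enumerates every state-action sequence up to a horizon `t_max`; the inner sum grows as |S|^{T+1} |A|^T.

```python
import itertools
import numpy as np

def exact_irl_sd_log_likelihood(summaries, summary_prob, initial, policy,
                                transition, t_max):
    """Exact IRL-SD log-likelihood by brute-force marginalization.

    summaries:    observed summaries xi_sigma, one per demonstration
    summary_prob: function (summary, states, actions) -> P(xi_sigma | xi)
    Enumerates every state-action sequence of length t_max, so it is only
    feasible for tiny MDPs.
    """
    S, A = policy.shape
    total = 0.0
    for summary in summaries:
        marginal = 0.0
        for states in itertools.product(range(S), repeat=t_max + 1):
            for actions in itertools.product(range(A), repeat=t_max):
                p = initial[states[0]]
                for t, a in enumerate(actions):
                    p *= policy[states[t], a] * transition[states[t], a, states[t + 1]]
                marginal += summary_prob(summary, states, actions) * p
        total += np.log(marginal)
    return total
```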

Monte-Carlo approximation

We can estimate L(θ | Ξ_σ) by solving π_θ and then sampling N_MC trajectories Ξ_MC from it, leading to the Monte-Carlo estimate

\[ \hat{L}(\theta \mid \Xi_\sigma) = \prod_{i=1}^{N} \left( \frac{1}{N_{MC}} \sum_{\xi_n \in \Xi_{MC}} P(\xi_i^\sigma \mid \xi_n) + \eta \right) \]

- However, P(ξ_i^σ | ξ_n) may be 0 for all ξ_n ∈ Ξ_MC, forcing the estimate to 0; this can be fixed with a prior η, as above
- σ needs to be known as a distribution P(ξ_i^σ | ξ_n)

Takeaway: L(θ | Ξ_σ) can be estimated with Monte Carlo, but there are a few technical issues we would like to avoid.
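
A minimal sketch of this estimator (illustration only; `simulate_trajectory` and `summary_prob` are hypothetical callables standing in for sampling under π_θ and for the summary distribution):

```python
import numpy as np

def mc_log_likelihood(summaries, summary_prob, simulate_trajectory,
                      n_mc=1000, eta=1e-12, rng=None):
    """Monte-Carlo estimate of the IRL-SD log-likelihood.

    simulate_trajectory: rng -> (states, actions), sampled from the MDP
                         under the optimal policy pi_theta
    summary_prob:        (summary, states, actions) -> P(xi_sigma | xi)
    eta:                 small constant preventing log(0) when no sampled
                         trajectory is compatible with a summary
    """
    rng = rng or np.random.default_rng()
    samples = [simulate_trajectory(rng) for _ in range(n_mc)]
    total = 0.0
    for summary in summaries:
        avg = np.mean([summary_prob(summary, s, a) for s, a in samples])
        total += np.log(avg + eta)
    return total
```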

Approximate Bayesian computation (ABC)

ABC also performs inference using Monte-Carlo sampling. Instead of estimating the likelihood of each trajectory ξ_i separately, the likelihood of the entire observation set Ξ_σ is estimated together.

How ABC works:
- Simulate observations using the MC sample: Ξ_σ^sim = {σ(ξ_MC,n)} (only requires us to sample from σ)
- Estimate the discrepancy δ(Ξ_σ, Ξ_σ^sim) ∈ [0, ∞) (matches distributions; reduces the effect of individual rare observations)
- The ε-approximate ABC likelihood is

\[ L_\varepsilon(\theta \mid \Xi_\sigma) = P(\delta(\Xi_\sigma, \Xi_\sigma^{sim}) \le \varepsilon \mid \theta) \]

Intuition: If simulating observations with θ leads to a small prediction error, then the likelihood of θ is high, and vice versa.

Takeaway: The issues with MC (numerical problems with rare observations, σ having to be known as a distribution) can be avoided by using ABC.
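
As a rough illustration of the ε-approximate likelihood (a generic rejection-ABC sketch under assumed helpers, not the paper's exact procedure, which combines ABC with a GP surrogate as described in the Inference slide that follows):

```python
import numpy as np

def mean_difference_discrepancy(observed, simulated):
    """A simple illustrative discrepancy: distance between summary means."""
    return float(np.linalg.norm(np.mean(observed, axis=0) - np.mean(simulated, axis=0)))

def abc_accept(theta, observed_summaries, simulate_summary, epsilon,
               discrepancy=mean_difference_discrepancy, n_sim=100, rng=None):
    """One rejection-ABC step: simulate summaries under theta and accept theta
    if the discrepancy to the observed summaries is at most epsilon.

    simulate_summary: (theta, rng) -> one simulated summary, i.e. sigma applied
                      to a trajectory sampled under pi_theta
    """
    rng = rng or np.random.default_rng()
    simulated = [simulate_summary(theta, rng) for _ in range(n_sim)]
    return discrepancy(np.asarray(observed_summaries), np.asarray(simulated)) <= epsilon
```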

Inference

Now we can estimate L(θ | Ξ_σ) at any θ, but how do we find the best θ ∈ Θ?
- Evaluating the functions is still expensive
- The functions don't have accessible gradients
- Due to the limited observability (σ), parameter uncertainty is likely large

We estimate the log-likelihoods using a GP surrogate model, fit using Bayesian optimization. The mean and shape of the distribution are estimated from MCMC samples.
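
A minimal sketch of this loop, assuming scikit-learn for the GP and a simple upper-confidence-bound acquisition over random candidates (the paper's exact acquisition function and MCMC step are not reproduced here):

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import Matern

def fit_loglik_surrogate(log_likelihood, bounds, n_init=10, n_iter=40, rng=None):
    """Fit a GP surrogate to an expensive, gradient-free log-likelihood.

    log_likelihood: theta (1-D array) -> estimated log L(theta | Xi_sigma)
    bounds:         array of shape (d, 2) with [low, high] per parameter
    """
    rng = rng or np.random.default_rng()
    bounds = np.asarray(bounds, dtype=float)
    sample = lambda n: rng.uniform(bounds[:, 0], bounds[:, 1], size=(n, len(bounds)))
    X = sample(n_init)
    y = np.array([log_likelihood(x) for x in X])
    gp = GaussianProcessRegressor(kernel=Matern(nu=2.5), normalize_y=True)
    for _ in range(n_iter):
        gp.fit(X, y)
        candidates = sample(500)
        mean, std = gp.predict(candidates, return_std=True)
        x_next = candidates[np.argmax(mean + 2.0 * std)]  # favour high or uncertain regions
        X = np.vstack([X, x_next])
        y = np.append(y, log_likelihood(x_next))
    gp.fit(X, y)
    return gp  # an MCMC sampler can then explore the posterior over this surrogate
```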

Simulation experiment

We used grid-world environments to validate our approach.
- The task was to infer reward weights for state features: R(s) = φ(s)^T θ
- We only knew the start and end locations of the agent and the length of the trajectory: ξ_σ = (s_0, s_T, T)

Miniature example: What kind of terrain might the agent prefer, given that moving from A to B took it T steps?
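
In code, the reward parameterization and the summary function of this experiment are straightforward (a sketch with assumed array shapes):

```python
import numpy as np

def linear_reward(features, theta):
    """R(s) = phi(s)^T theta for every grid cell.

    features: array of shape (n_states, n_features), row s is phi(s)
    theta:    reward weights, shape (n_features,)
    """
    return features @ theta

def summarize(states):
    """Summary used in the grid-world experiment: only the start state,
    the end state and the trajectory length are observed."""
    return (states[0], states[-1], len(states) - 1)
```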

Inferred distributions (example)

Takeaways:
- The parameter values can be inferred based on summary observations
- The approximate distributions are similar to the true distribution

Efficiency

Takeaways:
- Summing over all plausible trajectories is expensive with larger MDPs
- The approximate methods scale significantly better

Accuracy and model fit

Takeaways:
- Good approximation performance while outperforming a random baseline
- The approximate methods continue performing well even with larger MDPs

Realistic experiment

We performed experiments using an RL model from cognitive science:
- The user searched repeatedly for target items in drop-down menus
- The MDP contained a simple model of human vision and short-term memory

Goal: infer the values of three model parameters based on observing task completion times (TCT) and whether the target item was present in the menu: ξ_σ = (target present?, TCT)

The three parameters:
- visual fixation duration f_dur
- item selection duration d_sel
- menu layout recall probability p_rec

Model fit

                             ABC       Hold-out data
Task Completion Time (abs)   430 ms    470 ms
Task Completion Time (pre)   980 ms    970 ms
Number of Saccades (abs)     1.4       1.9
Number of Saccades (pre)     3.1       2.2

(abs = target absent from menu, pre = target present in menu)

Takeaways:
- Predictions with parameters inferred by ABC match the hold-out observation data, indicating good model fit
- Even unobserved features (number of saccades) approximately match the predictions

Approximate posterior

Takeaways:
- The posterior indicates good identification of the model parameter values
- The remaining parameter uncertainty is easy to visualize

Conclusions

We proposed two approximate methods (MC, ABC) for solving the problem of trajectory-level observation noise in IRL:
- More scalable than the exact likelihood
- Good approximation quality
- Full posterior inference, which is important due to the noisy observations

We demonstrated applicability for a realistic cognitive science model based on real observation data.

Next steps: improve scalability
- The method still requires solving RL problems in the inner loop
- Scalability of GP and BO to high dimensions

More details at the poster tomorrow.