Supplementary Material: Strategies for exploration in the domain of losses

Paul M. Krueger1,*, Robert C. Wilson2,*, and Jonathan D. Cohen3,4

1 Department of Psychology, University of California, Berkeley 94720
2 Department of Psychology and Cognitive Science Program, University of Arizona 85721
3 Princeton Neuroscience Institute, Princeton University 08544
4 Department of Psychology, Princeton University 08544
* equal contribution

Full instructions for the task

Before beginning the task, participants read a set of illustrated on-screen instructions. Each bullet point below shows the text from a single screen (illustrations are omitted here to save space). The order in which participants were introduced to the gains and losses conditions, all subsequent references to the tasks, and the final example reflected the block order of gains and losses for each particular participant. The example below is one in which the losses condition came first.

- Welcome! Thank you for participating in this experiment.
- In this experiment we would like you to choose between two one-armed bandits of the sort you might find in a casino. The one-armed bandits will be represented like this
- For the first half of the experiment, your task is to minimize how many points you lose overall. This is called the LOSSES task.
- For the LOSSES task, every time you choose to play a particular bandit, the lever will be pulled like this...
- ... and the amount of points lost will be shown like this. For example, in this case, the left bandit has been played and is subtracting 23 points.
- For the second half of the experiment, your task is to maximize how many points you gain overall. This is called the GAINS task.

- The GAINS task is played similarly to the LOSSES task, but with points added to your overall payment... For example, in this case, the left bandit has been played and is adding 77 points.
- The points you lose and gain by playing the bandits will be converted into REAL money at the end of the experiment. Therefore, the fewer points you lose and the more points you gain, the more money you will earn.
- A given bandit tends to subtract (in the LOSSES task) or add (in the GAINS task) the same amount of points on average, but there is variability in the amount on any given play.
- For example, if you're playing the LOSSES task, the average points subtracted for the bandit on the right might be, but on the first play we might see -48 points because of the variability...
- ... on the second play we might see -44 points...
- ... if we open a third box on the right we might see - points this time...
- ... and so on, such that if we were to play the right bandit 10 times in a row we might see these points...
- If you're playing the GAINS task, the average points added for the bandit on the right might be, but on the first play we might see 2 points because of the variability...
- ... on the second play we might see 6 points...
- ... if we open a third box on the right we might see 4 points this time...
- ... and so on, such that if we were to play the right bandit 10 times in a row we might see these points...
- Both bandits will have the same kind of variability, and this variability will stay constant throughout the experiment.
- One of the bandits will always subtract fewer points (in the LOSSES task) or add more points (in the GAINS task) and hence be the better option to choose on average.
- When you move on to a new game, the average amount of points of each bandit will change.
- To make your choice: Press < to play the left bandit. Press > to play the right bandit.
- On any trial you can only play one bandit, and the number of trials in each game is determined by the height of the bandits. For example, when the bandits are 10 boxes high, there are 10 trials in each game...
- ... when the stacks are 5 boxes high there are only 5 trials in the game.
- The first 4 choices in each game are instructed trials where we will tell you which option to play. This will give you some experience with each option before you make your first choice.

- These instructed trials will be indicated by a green square inside the box we want you to open, and you must press the button to choose this option in order to see the outcome and move on to the next trial.
- For example, if you are instructed to choose the left box on the first trial, you will see this:
- If you are instructed to choose the right box on the second trial, you will see this:
- Once these instructed trials are complete, you will have a free choice between the two stacks, indicated by two green squares inside the two boxes you are choosing between.
- The first half of the experiment will be the LOSSES task, so remember to try to minimize the overall number of points lost. You will be notified when you're halfway through the experiment, before the task changes.
- Press space when you are ready to begin. Good luck!

Reward magnitude model

[Figure S1. Graphical representation of the reward magnitude model. The plate diagram covers conditions n = 1:N, subjects s = 1:S, and games g = 1:G. Group-level parameters: µ^A_n, µ^B_n, and µ^γ_n have Gaussian priors; σ^A_n, σ^B_n, and σ^γ_n have Gamma priors; and k^σ_n and λ^σ_n have Exponential priors. Subject-level parameters: A_ns ~ Gaussian(µ^A_n, σ^A_n), B_ns ~ Gaussian(µ^B_n, σ^B_n), σ_ns ~ Gamma(k^σ_n, λ^σ_n), γ_ns ~ Gaussian(µ^γ_n, σ^γ_n). Observed choices: c_nsg ~ Bernoulli(p_nsg), with p_nsg = [1 + exp((R_nsg + A_ns I_nsg + B_ns + γ_ns I_nsg M_nsg)/σ_ns)]^{-1}.]
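To make the structure in Figure S1 concrete, below is a minimal generative sketch of the reward magnitude model in Python/NumPy. It assumes the logistic form of p_nsg shown above; the hyperparameter values, the helper name simulate_condition, and the input arrays dR, dI, and M are illustrative placeholders, not the fitted specification.

```python
import numpy as np

rng = np.random.default_rng(0)

def simulate_condition(n_subjects, n_games, dR, dI, M):
    """Generative sketch of one condition of the reward magnitude model.

    dR, dI, M : arrays of shape (n_subjects, n_games) giving, per first free
    choice, the observed reward difference, information difference, and
    reward magnitude (illustrative inputs).
    """
    # Group-level parameters (hyperparameter values are placeholders).
    mu_A, sd_A = rng.normal(0.0, 1.0), rng.gamma(1.0, 1.0)        # information bonus
    mu_B, sd_B = rng.normal(0.0, 1.0), rng.gamma(1.0, 1.0)        # spatial bias
    k_sig, lam_sig = rng.exponential(1.0), rng.exponential(1.0)   # decision noise
    mu_g, sd_g = rng.normal(0.0, 1.0), rng.gamma(1.0, 1.0)        # magnitude effect

    choices = np.zeros((n_subjects, n_games), dtype=int)
    for s in range(n_subjects):
        # Subject-level parameters drawn from the group-level distributions.
        A = rng.normal(mu_A, sd_A)
        B = rng.normal(mu_B, sd_B)
        sigma = rng.gamma(k_sig, 1.0 / lam_sig)   # Gamma(shape k, rate lambda)
        gamma = rng.normal(mu_g, sd_g)
        # Choice probability: logistic in reward, information, and magnitude terms.
        p = 1.0 / (1.0 + np.exp((dR[s] + A * dI[s] + B + gamma * dI[s] * M[s]) / sigma))
        choices[s] = rng.binomial(1, p)
    return choices
```

In the fitted model these parameters are, of course, inferred from the observed choices (via MCMC, as described below) rather than sampled forward; the sketch is only meant to show how the pieces of Figure S1 fit together.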

Model of optimal behavior

Adapted from Wilson et al. (2014). We modeled optimal behavior by solving a dynamic programming problem that computes the action that will produce the maximum expected outcome over the course of a game. The model knows that the mean outcomes are generated from a truncated Gaussian distribution with a given variance. It treats the gains and losses conditions equivalently.

The optimal model solves a dynamic programming problem (Bellman, 1957; Duff, 2002) to compute the action that will maximize the expected total reward over the course of each game. To do this, the model first infers a distribution over the mean of each option given the observed rewards. We write r_t to denote the reward on trial t in the game, c_t to denote the choice on trial t, and D_t to denote the set of choices and rewards up to and including time t. We assume that the model knows that the rewards are generated from a truncated Gaussian distribution, and we further assume that it knows the standard deviation of this distribution, σ_n. In this case, the inferred distribution over the mean of option a, µ^a, given the history of choices and rewards, is

(1)   p(\mu^a \mid D_t) \propto \frac{\sqrt{n^a_t}}{\sqrt{2\pi}\,\sigma_n} \exp\!\left(-\frac{n^a_t\,(\mu^a - R^a_t/n^a_t)^2}{2\sigma_n^2}\right) p(\mu^a)

where n^a_t is the number of times option a has been played, R^a_t is the cumulative sum of the rewards obtained from playing option a, and p(µ^a) is the prior over the mean. In our model we assumed an improper, uniform prior on µ^a (although we note that it is straightforward to include a Gaussian prior instead). With this prior, equation (1) shows that the model's state of knowledge about option a is summarized by the two numbers n^a_t and R^a_t. We can thus define the hyperstate (Duff, 2002), S_t, the state of information that the model has about both options, as

(2)   S_t = (n^A_t, R^A_t, n^B_t, R^B_t).

With the hyperstates defined in this way, we can now specify a Markov decision process within this state space. In particular, we can define a transition matrix, T(S_{t+1} \mid S_t, a), which describes the probability of transitioning from state S_t to state S_{t+1} given action a. To compute this, we note that if action a = A is chosen on trial t and reward r_t is observed, then the new state on the next trial will be

(3)   S_{t+1} = (n^A_t + 1, R^A_t + r_t, n^B_t, R^B_t).

Further, given the distribution over the mean in equation (1), we can predict that this outcome will occur with probability

(4)   p(r_t \mid S_t, a = A) = \int d\mu^A\, p(r_t \mid \mu^A)\, p(\mu^A \mid S_t) = \sqrt{\frac{n^A_t}{2\pi\,(1 + n^A_t)}}\,\frac{1}{\sigma_n} \exp\!\left(-\frac{n^A_t\,(r_t - R^A_t/n^A_t)^2}{2\sigma_n^2\,(1 + n^A_t)}\right)

Note that this result follows because both p(r_t | µ^a) and p(µ^a | D_t) are Gaussian, with p(µ^a | D_t) defined in equation (1) and

(5)   p(r_t \mid \mu^a) = \frac{1}{\sqrt{2\pi}\,\sigma_n} \exp\!\left(-\frac{(r_t - \mu^a)^2}{2\sigma_n^2}\right).

In practice, to make the algorithm tractable, we only consider a subset of possible outcomes, focusing on a set of 1 possible outcomes between and 1 for the horizon 1 case and 21 possible outcomes in the horizon 6 case. Given this approximation we can then compute the set of possible states encountered during the task and solve the dynamic program by iterating the equations for the state values

(6)   V(S_t) = \max_a Q(a, S_t)

and the action values

(7)   Q(a, S_t) = \sum_{S_{t+1}} T(S_{t+1} \mid S_t, a)\,\big(r_t(S_{t+1}) + V(S_{t+1})\big).

In particular, we start at the last trial, t = H, and work backwards in time to the first trial. Here, by definition, the action value is just the expected value of the reward from each option; i.e.,

(8)   Q(a, S_H) = \frac{R^a_H}{n^a_H}.

Finally, the optimal action is to choose the option that has the highest value on the first free trial, i.e.,

(9)   c_1 = \arg\max_a Q(a, S_1).

This analysis allows us to compute the optimal behavior on the task. To compute the optimal performance shown in Figure 3, we simulated choices from this optimal model on the same set of problems faced by the participants. We then computed performance in the same way as we did for humans (see Methods).
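The dynamic program described by equations (1)-(9) can be sketched in Python as a memoized backward recursion over hyperstates. This is a minimal illustration under the stated assumptions (uniform prior on the mean, known σ_n, a discretized grid of outcomes); the grid, the value of SIGMA, and the function names are assumptions, not the exact discretization used above.

```python
import numpy as np
from functools import lru_cache

SIGMA = 8.0                       # assumed known outcome standard deviation (illustrative)
OUTCOMES = tuple(range(1, 101))   # discretized grid of possible outcomes (illustrative)

def predictive(n, R):
    """Discretized p(r | S_t, a) of equation (4) for an option played n times
    with cumulative reward R, under the improper uniform prior on the mean."""
    r = np.array(OUTCOMES, dtype=float)
    var = SIGMA ** 2 * (1.0 + 1.0 / n)
    p = np.exp(-(r - R / n) ** 2 / (2.0 * var))
    return p / p.sum()            # renormalize over the finite grid

@lru_cache(maxsize=None)
def V(nA, RA, nB, RB, trials_left):
    """Equation (6): value of hyperstate S_t = (nA, RA, nB, RB)."""
    return max(Q(a, nA, RA, nB, RB, trials_left) for a in ("A", "B"))

@lru_cache(maxsize=None)
def Q(a, nA, RA, nB, RB, trials_left):
    """Equation (7); on the last trial it reduces to equation (8)."""
    n, R = (nA, RA) if a == "A" else (nB, RB)
    if trials_left == 1:
        return R / n              # equation (8): expected immediate reward
    total = 0.0
    for r, p in zip(OUTCOMES, predictive(n, R)):
        if a == "A":
            total += p * (r + V(nA + 1, RA + r, nB, RB, trials_left - 1))
        else:
            total += p * (r + V(nA, RA, nB + 1, RB + r, trials_left - 1))
    return total

# Equation (9): optimal first free choice after the four forced trials (horizon 6 example):
# c1 = max(("A", "B"), key=lambda a: Q(a, nA, RA, nB, RB, trials_left=6))
```

In practice one would enumerate the reachable hyperstates and sweep backwards from t = H as described above; the recursive, memoized form here is just a compact way to express the same computation.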

Choice curves analysis

Focusing our analyses on the first free-choice trial, we computed p_a, the probability of choosing bandit a over bandit b, as a function of the difference in the observed means of the two bandits, using Equation 2. The parameters in Equation 2 were set to the mean of the estimated posterior distribution across participants. In the [1 3] unequal uncertainty condition, bandit a was defined as the lesser-known bandit (i.e., the bandit that had been observed only once during the forced trials); in the [2 2] equal uncertainty condition, bandit a was arbitrarily defined as the bandit on the right. The resulting choice curves are shown in Figure S2, along with empirical averages across participants. The error bars on the empirical data points indicate the standard error of the mean across participants.

[Figure S2: panel (A) unequal information [1 3], probability of choosing the more informative option versus the difference in means between the more and less informative options; panel (B) equal information [2 2], probability of choosing the option on the right versus the difference in means between the right and left options; separate curves for horizon 1 gains, horizon 1 losses, horizon 6 gains, and horizon 6 losses.]

Figure S2. Choice curves for the first free-choice trial in the (A) [1 3] unequal and (B) [2 2] equal uncertainty conditions. Filled circles show experimental data averaged across participants, with error bars indicating the standard error of the mean across participants. Curved lines show model-derived probability functions averaged across participants. (A) The fraction of times the more informative bandit is chosen, as a function of the difference in means between the more and less informative options. Compared to horizon 1 trials (gray-scale curves), horizon 6 trials (orange curves) show a greater information bonus, indicated by a shift in the indifference point (the point at which participants are equally likely to choose either option) further away from zero on the x-axis, as well as an increase in decision noise, indicated by a flattening of the slope of the curve. Within each horizon condition, the shift in indifference point is greater for the losses condition (light curves) than for the gains condition (dark curves), indicating greater uncertainty seeking in the losses condition. However, the slope of the curves within each horizon condition is no different between the gains and losses conditions, indicating no change in decision noise. (B) In the equal uncertainty condition, there is less decision noise than in the unequal uncertainty condition, as indicated by the steeper slopes of the curves within each horizon condition. No difference was observed between the gains and losses conditions in the equal uncertainty condition. There is no information bonus in the equal uncertainty condition, since both options have been sampled twice. Participants' choices were sensitive to the difference in means between the two options, such that when the difference was large, participants were likely to choose the more rewarding (or less punishing) option, but as the difference became smaller, participants were more likely to choose either of the bandits.
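For reference, the empirical points in a figure like S2 can be computed by binning first free choices by the difference in observed means and averaging within participants before averaging across them. The sketch below is one way to do this; the variable names and bin edges are illustrative assumptions.

```python
import numpy as np

def choice_curve(delta_mean, chose_a, subject, bins=np.arange(-30, 31, 10)):
    """Empirical choice curve for the first free choice.

    delta_mean : observed mean of bandit a minus bandit b on each game
    chose_a    : 1 if bandit a was chosen on the first free trial, 0 otherwise
    subject    : subject identifier for each game
    Returns bin centers, the across-participant mean choice probability,
    and its standard error (the error bars in Figure S2).
    """
    centers = (bins[:-1] + bins[1:]) / 2.0
    per_subject = []
    for s in np.unique(subject):
        m = subject == s
        idx = np.digitize(delta_mean[m], bins) - 1     # bin index for each game
        per_subject.append([chose_a[m][idx == b].mean() if np.any(idx == b) else np.nan
                            for b in range(len(centers))])
    per_subject = np.array(per_subject)
    n = np.sum(~np.isnan(per_subject), axis=0)          # participants contributing per bin
    mean = np.nanmean(per_subject, axis=0)
    sem = np.nanstd(per_subject, axis=0) / np.sqrt(n)
    return centers, mean, sem
```

The model-derived curves in Figure S2 would instead be obtained by evaluating Equation 2 of the main text at the posterior-mean parameters over the same range of mean differences.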

In line with our previous findings for gains alone (Wilson et al., 2014), in the [1 3] unequal uncertainty condition there was a shift in the indifference point of the choice curves (the point at which participants were equally likely to choose either option) between horizon 1 and horizon 6. This was true for both the gains and losses conditions, and is consistent with directed exploration driven by an information bonus on the value of the lesser-known option. That is, when participants had a longer time horizon in which to explore, they were biased towards the lesser-known option, in the hope that acquiring more information about it would allow them to make more informed decisions later on, and hence improve their overall outcome. In addition to directed exploration, participants also showed random exploration, indicated by a flattening of the choice curve between horizons 1 and 6. This is also consistent with previous findings for gains (Wilson et al., 2014), and was equally true for both gains and losses. Comparing the gains and losses conditions, there was an overall increased bias toward the uncertain option in the losses condition, indicated by the overall leftward shift of the curves for the losses condition (light orange and grey curves) relative to the curves for the gains condition (dark orange and black curves; Figure S2A). Decision noise, indicated by the slope of the curves, does not change between gains and losses (Figure S2B).

MCMC sampling convergence

As noted in the main text, all parameters were fit simultaneously using a Markov chain Monte Carlo (MCMC) approach to sample from the joint posterior. We ran 4 separate Markov chains with burn-in steps to generate 1 samples from each chain with a thin rate of. Below are serial plots of samples from one chain (after the burn-in) for the parameters shown in Figure 5: information bonus, [1 3] decision noise, and [2 2] decision noise.

Information bonus (µ^A): [trace plots of post-burn-in samples for horizon 1 gains, horizon 1 losses, horizon 6 gains, and horizon 6 losses]

[1 3] decision noise (k^σ/λ^σ): [trace plots of post-burn-in samples for horizon 1 gains, horizon 1 losses, horizon 6 gains, and horizon 6 losses]

[2 2] decision noise (k^σ/λ^σ): [trace plots of post-burn-in samples for horizon 1 gains, horizon 1 losses, horizon 6 gains, and horizon 6 losses]
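Serial plots of this kind can be reproduced from saved samples along the following lines. This is a minimal sketch assuming the post-burn-in samples for one chain are stored as an (n_samples x 4) array, one column per horizon-by-valence condition; the function and argument names are illustrative.

```python
import numpy as np
import matplotlib.pyplot as plt

def trace_plots(samples, title,
                labels=("horizon 1, gains", "horizon 1, losses",
                        "horizon 6, gains", "horizon 6, losses")):
    """Serial (trace) plots of post-burn-in samples from one MCMC chain.

    samples : array of shape (n_samples, 4), one column per condition
    """
    fig, axes = plt.subplots(len(labels), 1, sharex=True, figsize=(6, 8))
    for ax, column, name in zip(axes, samples.T, labels):
        ax.plot(column, lw=0.5)   # well-mixed, roughly stationary traces suggest convergence
        ax.set_title(name)
    axes[-1].set_xlabel("sample (after burn-in)")
    fig.suptitle(title)
    fig.tight_layout()
    plt.show()

# e.g. trace_plots(info_bonus_samples, "Information bonus (mu_A)")
```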