Policy Improvement for Repeated Zero-Sum Games with Asymmetric Information


Malachi Jones and Jeff S. Shamma

This research was supported by ARO/MURI Project W911NF-09-1-0553. M. Jones and J. S. Shamma are with the School of Electrical and Computer Engineering, College of Engineering, Georgia Institute of Technology, {kye4u, shamma}@gatech.edu.

Abstract— In a repeated zero-sum game, two players repeatedly play the same zero-sum game over several stages. We assume that while both players can observe the actions of the other, only one player knows the actual game, which was randomly selected from a set of possible games according to a known distribution. The dilemma faced by the informed player is how to trade off the short-term reward versus the long-term consequence of exploiting information, since exploitation also risks revelation. Classic work by Aumann and Maschler derives the recursive value equation, which quantifies this tradeoff, and derives a formula for optimal policies of the informed player. However, using this model for explicit computations can be computationally prohibitive as the number of game stages increases. In this paper, we derive a suboptimal policy based on the concept of policy improvement. The baseline policy is a non-revealing policy, i.e., one that completely ignores superior information. The improved policy, which is implemented in a receding horizon manner, strategizes for the current stage while assuming a non-revealing policy for future stages. We show that the improved policy can be computed by solving a linear program, and the computational complexity of this linear program is constant with respect to the length of the game. We derive bounds on the guaranteed performance of the improved policy and establish that the bounds are tight.

I. INTRODUCTION

We consider an asymmetric zero-sum game, in which one player is informed about the true state of the world. This state determines the specific game that will be played repeatedly. The other player, the uninformed player, has uncertainty about the true state. Since both players can observe the actions of their opponent, the uninformed player can use his observations of the informed player's actions to estimate the true state of the world. Exploiting/revealing information to achieve a short-term reward risks helping the uninformed player better estimate the true state of the world. A better estimate of the true state can allow the uninformed player to make better decisions that will cost the informed player over the long term. A natural question is how the informed player should exploit his information.

Aumann and Maschler [1] introduced a recursive formulation that characterizes the optimal payoff the informed player can achieve, which is referred to as the value of the game. This formulation evaluates the tradeoff between short-term rewards and long-term costs for all possible decisions of the informed player, and the optimal decision is the decision that provides the best overall game payoff. Determining the optimal decision becomes increasingly difficult as the length of the game grows. Therefore, computing an optimal decision for games of non-trivial length can be computationally prohibitive. This difficulty extends also to the simplest zero-sum games, which have two states and two possible actions for each player in a given state.

Much of the current work to address the information exploitation issue, which includes the work of Domansky and Kreps [2] and Heuer [3], has been limited to finding optimal strategies for special cases of the simplest zero-sum games. Zamir provided a method to generate optimal strategies under certain conditions for 2x2 and 3x3 matrix games [4]. Gilpin and Sandholm [5] propose an algorithm for computing strategies in games by using a non-convex optimization formulation for only the infinitely repeated case. There is work that considers using suboptimal strategies to address the information exploitation issue for all classes of games [6]. In that work, the informed player never uses his information throughout the game, and accordingly is non-revealing. While the suboptimal strategies are readily computable, only under special circumstances do these strategies offer strong suboptimal payoffs.

In this paper, after introducing basic repeated zero-sum game concepts and definitions, we introduce a suboptimal strategy that we refer to as the one-time policy improvement strategy. We show that the computational complexity of constructing this strategy is constant with respect to the length of the game. Next, we provide tight bounds on the guaranteed performance of the constructed suboptimal policy. We then show that the policy improvement strategy can be computed by solving a linear programming problem. Finally, we present an illustrative simulation.

II. ZERO-SUM DEFINITIONS AND CONCEPTS

In this section, we will introduce basic zero-sum repeated game definitions relevant to this paper. The first and most important concept is Aumann and Maschler's dynamic programming equation for evaluating the tradeoff between short-term and long-term payoff. We will then discuss the notion of non-revealing strategies, which will be exploited in our construction of suboptimal policies to reduce computational complexity.

A. Setup

1) Game Play: Two players repeatedly play a zero-sum matrix game over stages m = 1, 2, ..., N. The row player is the maximizer, and the column player is the minimizer. The specific game is selected from a finite set of possible games (or states of the world), K. Let Δ(L) denote the set of probability distributions over some finite set L. Define S to be the set of pure actions of the row player, and similarly define J to be the set of pure actions of the column player. The game matrix at state k ∈ K is denoted M^k ∈ R^{|S|×|J|}. Before stage m = 1, nature selects the specific game according to a probability distribution p ∈ Δ(K), which is common knowledge. This selection remains fixed over all stages. The row player is informed of the outcome, whereas the column player is not.

2) Strategies: Mixed strategies are distributions over pure strategies for each player. Since the row player is informed of the state of the world, he is allowed a mixed strategy for each state k. Let x^k_m ∈ Δ(S) denote the mixed strategy of the row player in state k at stage m, and denote x^k_m(s) to be the probability that the row player plays pure move s at stage m and state k. In repeated play, this strategy can be a function of the actions of both players during stages 1, ..., m−1. Likewise, let y_m ∈ Δ(J) denote the mixed strategy of the column player at stage m, which again can depend on the players' actions over stages 1, ..., m−1. Let x_m = {x^1_m, ..., x^{|K|}_m} denote the collection of the row player's mixed strategies for all states at stage m, and x = {x_1, ..., x_N} denote mixed strategies over all states and stages. Likewise, let y = {y_1, ..., y_N} denote the column player's mixed strategies over all stages.

Define H_m = (S × J)^{m−1} to be the set of possible histories, where an element h_m ∈ H_m is a sequence (s_1, j_1; s_2, j_2; ...; s_{m−1}, j_{m−1}) of the players' moves in the first m−1 stages of the game, and let h^I_m denote the history of Player 1's moves. Each player can perfectly observe the moves of the other player. Therefore, the histories of each player at stage m are identical.

Behavioral strategies are mappings from states and histories to mixed strategies. Let σ_m : k × h → Δ(S) denote a behavioral strategy of the row player, and denote σ^k_m(h)(s) to be the probability that the row player plays pure move s at stage m, history h, and state k. The column player's behavioral strategy can only depend on histories and is denoted by τ_m : h → Δ(J). Define σ = {σ_1, ..., σ_N} to be the collection of behavioral strategies of the row player over all stages. Likewise, define τ = {τ_1, ..., τ_N} to be the collection of behavioral strategies of the column player over all stages. Aumann established that behavioral strategies can be equivalently represented as mixed strategies [7].

3) Beliefs: Since the column player is not informed of the selected state k, he can build beliefs on which state was selected. These beliefs are a function of the initial distribution p and the observed moves of the row player. Therefore, the row player must carefully consider his actions at each stage, as they could potentially reveal the true state of the world to the column player. In order to get a worst-case estimate of how much information the row player transmits through his moves, he models the column player as a Bayesian player and assumes that the column player knows his mixed strategy. The updated belief p_+ is computed as

    p^k_+(p, x, s) = p^k x^k(s) / x̄(p, x, s),                                (1)

where x̄(p, x, s) := Σ_{k∈K} p^k x^k(s) and x^k(s) is the probability of playing pure action s at state k.
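
For concreteness, the update (1) takes only a few lines of code. The following is a minimal sketch, assuming Python with numpy; the function name and array conventions are ours, not the paper's: p is a length-|K| prior and x is a |K| × |S| matrix whose k-th row is the mixed strategy x^k.

```python
import numpy as np

def belief_update(p, x, s):
    """Posterior p_+ of equation (1) after the informed player's move s is observed."""
    numer = p * x[:, s]            # p^k * x^k(s) for each state k
    xbar = numer.sum()             # xbar(p, x, s) = sum_k p^k x^k(s)
    if xbar == 0.0:                # move s has probability zero; belief is unchanged
        return p.copy()
    return numer / xbar

# A non-revealing strategy (identical rows) leaves the belief fixed:
p = np.array([0.5, 0.5])
x_nr = np.array([[0.5, 0.5], [0.5, 0.5]])
assert np.allclose(belief_update(p, x_nr, 0), p)
```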
4) Payoffs: Let γ^p_m(σ, τ) = E_{p,σ,τ}[g_m] denote the expected payoff for the pair of behavioral strategies (σ, τ) at stage m. The payoff for the n-stage game is then defined as

    γ^p_n(σ, τ) = (1/n) Σ_{m=1}^{n} γ^p_m(σ, τ).                              (2)

Similarly, the payoff for the λ-discounted game is defined as

    γ^p_λ(σ, τ) = Σ_{m=1}^{∞} λ(1−λ)^{m−1} γ^p_m(σ, τ).                       (3)

B. Short-term vs. long-term tradeoff

The dynamic programming recursive formula

    v_{n+1}(p) = (1/(n+1)) max_x min_y [ Σ_{k∈K} p^k x^k M^k y + n Σ_{s∈S} x̄_s v_n(p_+(p, x, s)) ]   (4)

introduced by Aumann and Maschler characterizes the value of the zero-sum repeated game of incomplete information. Note that n is a non-negative integer, and for the case where n = 0, the problem reduces to

    v_1(p) = max_x min_y Σ_{k∈K} p^k x^k M^k y,                               (5)

which is the value of the one-shot zero-sum incomplete information game.

A key interpretation of this formulation is that it also serves as a model of the tradeoff between short-term gains and the long-term informational advantage. For each decision x of the informed player, the model evaluates the payoff for the current stage, which is represented by the expression Σ_{k∈K} p^k x^k M^k y, and the long-term cost of decision x, which is represented by Σ_{s∈S} x̄_s v_n(p_+(p, x, s)). It is worth pointing out that the computational complexity of finding the optimal decision x can be attributed to the cost of calculating the long-term payoff. Since the long-term payoff is a recursive optimization problem that grows with respect to the game length, it can be difficult to find optimal strategies for games of arbitrary length. This difficulty is because the number of decision variables in the recursive optimization problem grows exponentially with respect to the game length.
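
The base case (5), by contrast, is computationally benign: for a fixed collection x = {x^k}, the inner minimization can be restricted to pure columns j ∈ J, so v_1(p) is the value of a small linear program. Below is a minimal sketch, assuming Python with numpy/scipy (the function and flag names are ours). Passing tie_states=True forces the per-state strategies to coincide, which instead computes the non-revealing value u(p) defined in Section II-C below.

```python
import numpy as np
from scipy.optimize import linprog

def one_shot_value(p, M_k, tie_states=False):
    """LP for v_1(p) in (5); with tie_states=True it computes u(p) in (6)."""
    K = len(M_k)
    S, J = M_k[0].shape
    n_x = S if tie_states else K * S               # strategy variables, plus scalar t
    c = np.zeros(n_x + 1)
    c[-1] = -1.0                                   # maximize t  <=>  minimize -t
    A_ub = np.zeros((J, n_x + 1))
    A_ub[:, -1] = 1.0                              # t - sum_{k,s} p^k x^k(s) M^k(s,j) <= 0
    for j in range(J):
        for k in range(K):
            off = 0 if tie_states else k * S
            A_ub[j, off:off + S] += -p[k] * M_k[k][:, j]
    n_simplex = 1 if tie_states else K             # each x^k lies on the simplex
    A_eq = np.zeros((n_simplex, n_x + 1))
    for r in range(n_simplex):
        A_eq[r, r * S:(r + 1) * S] = 1.0
    res = linprog(c, A_ub=A_ub, b_ub=np.zeros(J),
                  A_eq=A_eq, b_eq=np.ones(n_simplex),
                  bounds=[(0, None)] * n_x + [(None, None)])
    return -res.fun
```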

Ponssard and Sorin [8] showed that zero-sum repeated games of incomplete information can be formulated as a linear program (LP) to compute optimal strategies. However, in the LP formulation, it can be immediately seen that the computational complexity is also exponential with respect to the game length. One of the goals of this research is to present a method to compute suboptimal policies with tight lower bounds, and a feature of this method is that its computational complexity remains constant for games of any length.

C. Non-revealing strategies

Revealing information is defined as using a different mixed strategy in each state k at stage m. From (1), it follows that a mixed strategy x_m at stage m does not change the current beliefs of the column player if x^k_m = x^{k'}_m for all k, k'. As a consequence, not revealing information is equivalent to not changing the column player's beliefs about the true state of the world. In stochastic games, it is possible for the column player's beliefs to change even if the row player uses an identical mixed strategy for each state k.

An optimal non-revealing strategy can be computed by solving

    u(p) = max_{x∈NR} min_y Σ_{k∈K} p^k x^k M^k y,                            (6)

where the set of non-revealing strategies is defined as NR = {x_m | x^k_m = x^{k'}_m ∀ k, k' ∈ K}. By playing an optimal non-revealing strategy at each stage of the game, the row player can guarantee a game payoff of u(p). Note that the optimal game payoff for the n-stage game, v_n(p), is equal to u(p) only under special conditions.

III. POLICY IMPROVEMENT STRATEGY

We have previously stated in Section II-B that the difficulty of explicitly computing optimal strategies for an arbitrary game grows exponentially with respect to the number of stages in the game. In this paper, we consider suboptimal strategies whose computational complexity remains constant for games of arbitrary length. One such strategy that we introduce is called one-time policy improvement.

In incomplete information games, the row player always has the option of not using his superior information. A simple policy for the n-stage game would be for the row player to never use his superior information. As noted in Section II-C, if he uses such a policy, he can only guarantee a payoff of at most u(p). In a one-time policy improvement strategy, the simple policy becomes the baseline policy. The key difference is that with the one-time policy improvement strategy, the row player is allowed to deviate from the simple policy at the first stage of the game. After the first stage, he is obliged to use the simple policy for the next n−1 stages. The guaranteed payoff of the one-time policy improvement strategy for the n-stage game is denoted by v̂_n(p). Determining v̂_n(p) can be achieved by solving

    v̂_n(p) = max_x min_y [ (1/n) Σ_{k∈K} p^k x^k M^k y + ((n−1)/n) Σ_{s∈S} x̄_s u(p_+(p, x, s)) ]   (7)

for n ≥ 1.

Definition 1: Let cav u(p) denote the point-wise smallest concave function g on Δ(K) satisfying g(p) ≥ u(p) ∀ p ∈ Δ(K).

Definition 2: Denote the one-time policy improvement behavioral strategy by σ̂, where for each stage m ≥ 2, σ̂^k_m(h)(s) = σ̂^{k'}_m(h)(s) ∀ k, k' ∈ K.

Definition 3: A perpetual policy improvement strategy is a strategy that is implemented in a receding horizon manner and strategizes for the current stage while assuming a non-revealing policy for future stages.

Theorem 1: One-time policy improvement guarantees a payoff of at least cav u(p) for the n-stage zero-sum repeated game of incomplete information on one side.

We will devote the remainder of this section to the proof of this theorem. The proof will proceed as follows.

1) We will first prove in Proposition 3.5 that for any initial distribution p there exists a one-time policy improvement behavioral strategy whose lower bound is greater than or equal to Σ_{l∈L} α_l u(p_l), where |L| < ∞, α ∈ Δ(L), p, p_l ∈ Δ(K), and p = Σ_{l∈L} α_l p_l.
2) Next we note that cav u(p) = Σ_{e∈E} α_e u(p_e), where α ∈ Δ(E), |E| < ∞, p, p_e ∈ Δ(K), and p = Σ_{e∈E} α_e p_e.

3) It then follows that the one-time policy improvement behavioral strategy has a lower bound of cav u(p), which is tight.

4) Recall that behavioral strategies can be equivalently represented as mixed strategies. Therefore a one-time policy improvement behavioral strategy can be equivalently represented as a one-time policy improvement mixed strategy. As a consequence, v̂_n(p) ≥ cav u(p).

Corollary 2: Perpetual policy improvement guarantees a payoff of at least cav u(p) for the n-stage zero-sum repeated game of incomplete information on one side. This can be shown by using standard dynamic programming arguments regarding policy improvement.

Lemma 3.1 ([9]): Let L be a finite set and p = Σ_{l∈L} α_l p_l with α ∈ Δ(L) and p, p_l ∈ Δ(K) for all l ∈ L. Then there exists a transition probability µ from (K, p) to L such that P(l) = α_l and P(·|l) = p_l, where P = p ⊗ µ is the probability induced by p and µ on K × L: P(k, l) = p^k µ^k(l).

Lemma 3.2: Fix p arbitrarily. Let p be represented as p = Σ_{l∈L} α_l p_l, where |L| < ∞, α ∈ Δ(L), and p, p_l ∈ Δ(K). Then there exists a strategy σ, which will be referred to as the splitting strategy, such that

    γ^p_n(σ, τ) ≥ Σ_{l∈L} α_l u(p_l) ∀ τ.                                     (8)

Proof:
1) Introduce strategy σ as follows. Let σ_l be the strategy that guarantees u(p_l) for the n-stage game. Define µ^k(l) = α_l p^k_l / p^k. If the state is k, use the lottery µ^k, and if the outcome is l, play σ_l.
2) To get a lower bound on Player 1's payoff, assume even that Player 2 is informed of l. He is then facing strategy σ_l.
3) By Lemma 3.1, this occurs with probability α_l and the conditional probability on K is p_l; hence the game is γ^{p_l}_n, so that γ^p_n(σ, τ) ≥ Σ_{l∈L} α_l u(p_l) ∀ τ.

Lemma 3.3: Consider a mixed strategy x_1, where x_1 = {x^1_1, x^2_1, ..., x^{|K|}_1}. There exists a one-time policy improvement behavioral strategy σ̂ where σ̂^k_m(s) = x^k_1(s) ∀ k, m.

Proof:
1) Consider the following behavioral strategy. At stage m = 1, Player 1 uses mixed strategy x^k_1, where k is the true state of the world (i.e., σ̂_1 : k → {x^k_1}). Whatever move s ∈ S is realized in stage m = 1, Player 1 plays s at each stage for the remainder of the game (i.e., σ̂_m : h_m → s_1, where m ≥ 2 and h_m is Player 1's history at stage m).
2) Clearly σ̂^k_1(s) = x^k_1(s) by definition.
3) Consider stage 2. Suppose the probability of playing move s in state k at stage 1 is α. Recall that whatever move the row player realizes in stage 1 is played for the rest of the game. It follows that if the state of the world is k, the probability of playing move s in stage 2 is also α.
4) The same argument can be applied to any stage m > 2.
5) We have now established that the equality σ̂^k_m(s) = x^k_1(s) holds for all m, where σ̂ is the one-time policy improvement behavioral strategy.

Proposition 3.4: Fix p arbitrarily. Let p be represented as p = Σ_{l∈L} α_l p_l, where |L| < ∞, α ∈ Δ(L), and p, p_l ∈ Δ(K). Suppose we construct a behavioral strategy σ̄ as follows. Define σ_l to be the optimal behavioral strategy of the non-revealing n-stage game u(p_l). Let σ̄^k = Σ_{l∈L} µ^k(l) σ^k_l define the mixed behavioral strategy for the n-stage game, where µ^k(l) = α_l p^k_l / p^k. Then the lower bound on the payoff of behavioral strategy σ̄ is Σ_{l∈L} α_l u(p_l). Explicitly,

    γ^p_n(σ̄, τ) = (1/n) Σ_{m=1}^{n} Σ_{k∈K} p^k E[g_m | k, σ̄^k, τ]
                = (1/n) Σ_{m=1}^{n} Σ_{k∈K} p^k Σ_{l∈L} µ^k(l) E[g_m | k, σ^k_l, τ]          (9)
                ≥ Σ_{l∈L} α_l u(p_l) ∀ τ,

where τ is the behavioral strategy of Player 2.

Proof: We will first establish that the splitting strategy has a payoff equivalent to that of the mixed behavioral strategy σ̄ for arbitrary behavioral strategies τ of the column player. We then note that the splitting strategy has a lower bound of Σ_{l∈L} α_l u(p_l). We conclude by making the following observation: since the splitting strategy has an identical payoff to strategy σ̄ for each strategy τ, it also has the same lower bound Σ_{l∈L} α_l u(p_l).

1) Recall the splitting strategy as defined in Lemma 3.2.
2) The payoff for this strategy can be expressed as follows:

    Σ_{k∈K} p^k Σ_{l∈L} µ^k(l) (1/n) Σ_{m=1}^{n} E[g_m | k, σ^k_l, τ]
        = (1/n) Σ_{m=1}^{n} Σ_{k∈K} p^k Σ_{l∈L} µ^k(l) E[g_m | k, σ^k_l, τ].                 (10)

3) Observe that the payoff for strategy σ̄ in (9) is equivalent to the payoff for the splitting strategy σ in (10) for arbitrary τ. Explicitly, γ^p_n(σ̄, τ) = γ^p_n(σ, τ) ∀ τ.
4) Recall Lemma 3.2, which states that the splitting strategy has a payoff with lower bound Σ_{l∈L} α_l u(p_l).
5) Conclusion: Given a behavioral strategy τ of Player 2, we have established that the payoff of strategy σ̄ is equivalent to that of the splitting strategy. We showed that the payoff of the splitting strategy is lower bounded by Σ_{l∈L} α_l u(p_l). This implies that strategy σ̄ also has this lower bound.

Proposition 3.5: There exists a one-time policy improvement strategy σ̂ such that the following inequality holds:
    γ^p_n(σ̂, τ) ≥ Σ_{l∈L} α_l u(p_l) ∀ τ.

Proof:
1) Recall the mixed behavioral strategy σ̄ as defined in Proposition 3.4, where σ̄^k = Σ_{l∈L} µ^k(l) σ^k_l.
2) Note first that since σ_l is an optimal non-revealing strategy, the behavioral strategy σ^k_l is the same for every state k (i.e., σ^k_l = σ^{k'}_l ∀ k, k'). Furthermore, the NR mixed strategy x_l is constant over the stages.
3) Therefore σ^k_l = x_l ∀ k and all stages.
4) Define α^k = Σ_{l∈L} µ^k(l) x_l.
5) We can then express σ̄ as σ̄ : K × H → α^k, which is a stationary strategy.
6) Define x̄_1 as follows: x̄_1 = {α^1, α^2, ..., α^{|K|}}.
7) By Lemma 3.3, there exists a one-time policy improvement strategy σ̂ such that σ̂^k_m(s) = x̄^k_1(s) ∀ s, m.
8) We have now established that γ^p_n(σ̂, τ) = γ^p_n(σ̄, τ) ∀ τ.
9) Therefore it follows that

    γ^p_n(σ̂, τ) = γ^p_n(σ̄, τ) ≥ Σ_{l∈L} α_l u(p_l) ∀ τ.                      (11)

IV. POLICY IMPROVEMENT IN INFINITE HORIZON GAMES

In the previous section, we showed that the one-time policy improvement strategy guarantees cav u(p) for the n-stage game. In this section we show that the guarantee also holds in infinite horizon games. The guaranteed payoff of the one-time policy improvement strategy in the λ-discounted infinite horizon game can be computed by solving

    v̂_λ(p) = max_x min_y [ λ Σ_{k∈K} p^k x^k M^k y + (1−λ) Σ_{s∈S} x̄_s u(p_+(p, x, s)) ]   (12)

for λ ∈ (0, 1).

Theorem 3: One-time policy improvement guarantees a payoff of at least cav u(p) for the λ-discounted infinite horizon zero-sum repeated game of incomplete information on one side.

We will show the existence of a one-time policy improvement strategy that guarantees at least cav u(p).
1) Fix λ ∈ (0, 1) arbitrarily.
2) There exists N such that λ > 1/N. Consider this N.
3) Let λ' = 1/N; then v̂_{λ'}(p) = v̂_N(p).
4) Invoking Theorem 1 yields v̂_{λ'}(p) = v̂_N(p) ≥ cav u(p).
5) Claim: The optimal one-time policy improvement strategy for the discounted game v̂_{λ'}(p) also guarantees at least cav u(p) for v̂_λ(p). Since v̂_{λ'}(p) ≥ cav u(p) and Σ_{s∈S} x̄_s u(p_+(p, x, s)) ≤ cav u(p), it follows that the optimal stage strategy x for v̂_{λ'}(p) has the following lower bound: Σ_{k∈K} p^k x^k M^k y ≥ cav u(p). Note that λ > λ'. Therefore

    λ Σ_{k∈K} p^k x^k M^k y + (1−λ) Σ_{s∈S} x̄_s u(p_+(p, x, s))
        ≥ λ' Σ_{k∈K} p^k x^k M^k y + (1−λ') Σ_{s∈S} x̄_s u(p_+(p, x, s)) ≥ cav u(p).

Remark: If Σ_{s∈S} x̄_s u(p_+(p, x, s)) = cav u(p) and Σ_{k∈K} p^k x^k M^k y = cav u(p), then v̂_λ(p) = v̂_{λ'}(p) = cav u(p).
6) By using the optimal stage strategy x obtained from v̂_{λ'}(p) and playing an optimal non-revealing strategy thereafter, we have constructed a one-time policy update strategy for v̂_λ(p) that guarantees cav u(p).

Theorem 4: One-time policy improvement is an optimal strategy for infinitely repeated zero-sum games of incomplete information on one side.

Using an argument similar to that of Theorem 3, one can establish that there exists a one-time policy improvement strategy that guarantees a payoff of at least cav u(p). Note that cav u(p) is the optimal payoff for the infinitely repeated game.

V. LP FORMULATION

A one-time policy improvement strategy that guarantees cav u(p) can be computed by solving a linear programming problem, and the computational complexity of the linear program is constant with respect to the number of stages of the game. The following outlines a procedure to construct the appropriate linear program; a concrete sketch is given after the list.

1) Let Σ̄ denote the set of pure one-time policy improvement behavioral strategies (i.e., σ_1 : k → s; σ_m : h^I → s for m ≥ 2; and σ_m = σ_{m'} ∀ m, m' ≥ 2).
2) Let T̄ denote the set of pure strategies for the uninformed player (i.e., τ_1 is a fixed move j; τ_m : h^I → j for m ≥ 2; and τ_m = τ_{m'} ∀ m, m' ≥ 2).
3) Note that the sizes of the strategy sets Σ̄ and T̄ are invariant with respect to the number of stages of the game.
4) Observe that since a one-time policy improvement strategy is used, the strategy and the corresponding payoff remain constant for m ≥ 2, so that γ^p_m(σ̄, τ̄) = γ^p_2(σ̄, τ̄) ∀ m ≥ 2.
5) Therefore, the game payoff for a strategy pair is γ^p_λ(σ̄^i, τ̄^l) = λ γ^p_1(σ̄^i, τ̄^l) + (1−λ) γ^p_2(σ̄^i, τ̄^l). For the n-stage game, set λ = 1/n.
6) Consider a matrix M̄, where element (i, l) denotes the game payoff γ^p_λ(σ̄^i, τ̄^l) for the strategy pair (σ̄^i, τ̄^l).
7) Since this is a zero-sum game with finite strategies for each player and a payoff matrix M̄, a classic zero-sum game result can be used to solve this zero-sum game as an LP.
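
Under these assumptions, the construction can be sketched directly. The code below (Python with numpy/scipy assumed; all helper names are ours) enumerates Σ̄ as maps k → s whose realized move is repeated, enumerates T̄ as a first move j_1 plus a reaction f : s → j used for every stage m ≥ 2, builds the matrix M̄ of step 6, and solves it as a standard zero-sum LP (step 7). Note the sizes |Σ̄| = |S|^|K| and |T̄| = |J|·|J|^{|S|} do not depend on the horizon.

```python
import itertools
import numpy as np
from scipy.optimize import linprog

def solve_matrix_game(M):
    """Value and optimal row mixture of max_z min_y z^T M y, via the classic LP."""
    n_rows, n_cols = M.shape
    c = np.zeros(n_rows + 1)
    c[-1] = -1.0                                     # maximize t  <=>  minimize -t
    A_ub = np.hstack([-M.T, np.ones((n_cols, 1))])   # t - (M^T z)_l <= 0 for every column l
    A_eq = np.hstack([np.ones((1, n_rows)), np.zeros((1, 1))])
    res = linprog(c, A_ub=A_ub, b_ub=np.zeros(n_cols), A_eq=A_eq, b_eq=np.array([1.0]),
                  bounds=[(0, None)] * n_rows + [(None, None)])
    return -res.fun, res.x[:n_rows]

def one_time_improvement_lp(p, M_k, lam):
    """Build Mbar over pure one-time improvement strategies versus pure
    uninformed strategies and solve it; complexity is horizon-independent."""
    K = len(M_k)
    S, J = M_k[0].shape
    sigmas = list(itertools.product(range(S), repeat=K))     # pure sigma_1 : k -> s
    taus = [(j1, f) for j1 in range(J)
            for f in itertools.product(range(J), repeat=S)]  # (first move, reaction)
    Mbar = np.zeros((len(sigmas), len(taus)))
    for i, sig in enumerate(sigmas):
        for l, (j1, f) in enumerate(taus):
            g1 = sum(p[k] * M_k[k][sig[k], j1] for k in range(K))         # stage 1
            g2 = sum(p[k] * M_k[k][sig[k], f[sig[k]]] for k in range(K))  # stages m >= 2
            Mbar[i, l] = lam * g1 + (1.0 - lam) * g2
    return solve_matrix_game(Mbar)
```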
As a consequence, a perpetual policy improvement strategy that guarantees cav u(p) can be computed by solving a linear programming problem at each stage of the game, and the computational complexity of the linear program is constant with respect to the number of stages of the game. The following outlines a procedure for this construction; a sketch of the resulting loop is given after the list.

1) Note that the stage-1 behavioral strategy σ̄_1 is the one-time policy improvement strategy that can be computed by solving an LP.
2) Consider stage 2. Since this is a game of perfect recall, the behavioral strategy σ̄_1 can be equivalently represented as a mixed strategy x̄_1.
3) A move of Player 1 was realized in stage 1, and since the mixed strategy x̄_1 is known, the posterior probability p_+ can be computed.
4) Compute the stage-2 strategy by solving the optimization problem v̂_λ(p_+). If it is an N-stage game, set λ = 1/N.
5) The same technique that was used to compute the stage-1 strategy by solving an LP can be used for stage 2.
6) By a similar argument, the stage-m strategy can be computed by solving an LP.
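
The loop can be written down in a few lines, reusing belief_update and one_time_improvement_lp from the earlier sketches. This is a hedged illustration: the per-stage λ schedule and the sampling of the informed player's move are our choices, not prescribed by the procedure above.

```python
import itertools
import numpy as np

def perpetual_policy_improvement(p, M_k, n_stages, true_state, rng):
    """Receding horizon: re-solve the one-time LP at the current belief each stage."""
    K = len(M_k)
    S = M_k[0].shape[0]
    maps = list(itertools.product(range(S), repeat=K))    # same ordering as the LP sketch
    belief = np.array(p, dtype=float)
    for m in range(1, n_stages + 1):
        lam = 1.0 / (n_stages - m + 1)                    # stage weight (our assumption)
        _, z = one_time_improvement_lp(belief, M_k, lam)  # mixture over pure maps k -> s
        x = np.zeros((K, S))                              # marginal stage strategy x^k(s)
        for weight, sig in zip(z, maps):
            for k in range(K):
                x[k, sig[k]] += weight
        x = np.clip(x, 0.0, None)
        x /= x.sum(axis=1, keepdims=True)                 # guard against LP round-off
        s = rng.choice(S, p=x[true_state])                # realized move in the true state
        belief = belief_update(belief, x, s)              # posterior via equation (1)
        yield m, s, belief
```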

VI. SIMULATION: CYBER SECURITY EXAMPLE

The network administrator, the row player, manages two web applications, wapp_1 and wapp_2 (e.g., e-mail and remote login). Each web application is run on its own dedicated server (H_1 and H_2). The attacker would like to prevent users from accessing the web applications via a Denial of Service (DoS) attack. In order to help mitigate DoS attacks, a spare server (H_spare) can be dynamically configured daily to also run either wapp_1 or wapp_2. We assume that the attacker can observe which web application the network admin decides to run on the spare server. The attacker has a choice of which web application to execute a Denial of Service attack against. The stage payoff matrices (rows: the application run on the spare server; columns: the application attacked) are

    State α:                      State β:
              wapp_1   wapp_2               wapp_1   wapp_2
    wapp_1      1        0        wapp_1      2        0
    wapp_2      0        2        wapp_2      0        1

In state α, all of the legitimate users on the network are using only wapp_1. Conversely, in state β, they are all using only wapp_2. The payoffs for each of the states are in terms of quality of service. Suppose that p_α = 0.5 and p_β = 0.5, where p_α denotes the initial probability of being in state α, and let p = (p_α, p_β). In this example, we will set the number of stages of the game to 2.

If the network admin plays his dominant strategy on day 1, he will have fully revealed the state of the network to the attacker on day 2. The dominant strategy for the network admin is to run web application 1 on the backup server in state α and to run web application 2 on the backup server in state β. The network admin will achieve an expected payoff of 0.5 on day 1 and a payoff of 0 on day 2, for a total payoff of 0.25.

If the network admin uses a one-time policy improvement strategy, will he do better? Solving numerically, we compute a mixed strategy of x^α = (0.50, 0.50) and x^β = (0.50, 0.50), with an expected game payoff of 0.375 over the two-day period. Suppose that on day 2 the network admin decides to again improve upon the baseline policy of playing non-revealing. By exploiting his information on day 2, he can yield an expected stage payoff of 0.50 and a game payoff of 0.4375.
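
The example can be explored numerically with the routines sketched earlier (one_shot_value from Section II, plus belief_update as needed). The matrix entries below follow the tables as reconstructed above from a damaged extraction, so the printed values should be read as illustrating the method rather than certifying the reported numbers.

```python
import numpy as np

M_alpha = np.array([[1.0, 0.0], [0.0, 2.0]])    # State alpha table (reconstructed)
M_beta  = np.array([[2.0, 0.0], [0.0, 1.0]])    # State beta table (reconstructed)
p = np.array([0.5, 0.5])                        # common-knowledge prior (p_alpha, p_beta)

u_p  = one_shot_value(p, [M_alpha, M_beta], tie_states=True)  # non-revealing value u(p), eq. (6)
v1_p = one_shot_value(p, [M_alpha, M_beta])                   # informed one-shot value v_1(p), eq. (5)

# Fully revealing "dominant" play: expected day-1 payoff against a uniform attacker,
# followed by 0 on day 2 once the attacker best-responds to the revealed state.
y_uniform = np.array([0.5, 0.5])
day1 = 0.5 * M_alpha[0] @ y_uniform + 0.5 * M_beta[1] @ y_uniform
print(f"u(p) = {u_p:.4f}, v_1(p) = {v1_p:.4f}, revealing 2-day average = {(day1 + 0.0) / 2:.4f}")
```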
VII. CONCLUSION

Computing optimal strategies for zero-sum repeated games with asymmetric information can be computationally prohibitive. Much of the current work to address this issue has been limited to special cases. In this work, we present policy improvement methods to compute suboptimal strategies that have tight lower bounds. We show that the policy improvement strategy can be computed by solving a linear program, and the computational complexity of the linear program remains constant with respect to the number of stages in the game.

In this paper, we focused on computational results for repeated games, where the state of the world remains fixed once it has been selected by nature. Repeated games are a special case of stochastic games. In stochastic games, the state of the world can change at each stage of the game, and the state transitions can be a function of the actions of the players. We would like to consider computational results for stochastic games and also extend our current results to stochastic games.

REFERENCES

[1] R. J. Aumann and M. Maschler, Repeated Games with Incomplete Information. MIT Press, 1995.
[2] V. C. Domansky and V. L. Kreps, "Eventually revealing repeated games of incomplete information," International Journal of Game Theory, vol. 23, pp. 89–109, 1994.
[3] M. Heuer, "Optimal strategies for the uninformed player," International Journal of Game Theory, vol. 20, pp. 33–51, 1991.
[4] S. Zamir, "On the relation between finitely and infinitely repeated games with incomplete information," International Journal of Game Theory, vol. 1, pp. 179–198, 1971.
[5] A. Gilpin and T. Sandholm, "Solving two-person zero-sum repeated games of incomplete information," in International Joint Conference on Autonomous Agents and Multiagent Systems, vol. 2, pp. 903–910, 2008.
[6] S. Zamir, "Repeated games of incomplete information: Zero-sum," in Handbook of Game Theory, vol. 1, pp. 109–154, 1992.
[7] R. Aumann, Mixed and Behavior Strategies in Infinite Extensive Games. Princeton University, 1961.
[8] J.-P. Ponssard and S. Sorin, "The LP formulation of finite zero-sum games with incomplete information," International Journal of Game Theory, vol. 9, pp. 99–105, 1980.
[9] S. Sorin, A First Course on Zero-Sum Repeated Games. Springer, 2002.
[10] D. Blackwell, "An analog of the minimax theorem for vector payoffs," Pacific Journal of Mathematics, vol. 6, no. 1, pp. 1–8, 1956.
[11] Y. Freund and R. E. Schapire, "Game theory, on-line prediction and boosting," in Proceedings of the Ninth Annual Conference on Computational Learning Theory, ser. COLT '96. New York, NY, USA: ACM, 1996, pp. 325–332. [Online]. Available: http://doi.acm.org/10.1145/238061.238163
[12] D. Rosenberg, E. Solan, and N. Vieille, "Stochastic games with a single controller and incomplete information," Northwestern University, Center for Mathematical Studies in Economics and Management Science, Tech. Rep. 1346, May 2002.
[13] J.-F. Mertens and S. Zamir, "The value of two-person zero-sum repeated games with lack of information on both sides," Institute of Mathematics, The Hebrew University of Jerusalem, 1970, pp. 405–433.
[14] J.-F. Mertens, "The speed of convergence in repeated games with incomplete information on one side," Université catholique de Louvain, Center for Operations Research and Econometrics (CORE), Tech. Rep. 1995006, Jan. 1995.