Online Appendix for Mislearning from Censored Data: The Gambler's Fallacy in Optimal-Stopping Problems


Kevin He

December 31, 2018

OA 1 Proofs Omitted from the Appendix

OA 1.1 Completing the Proof of Lemma A.15

Proof. From the hypothesis that g's magnitude is bounded, there is some B < ∞ so that 0 < g(µ1, µ2) < B for all (µ1, µ2) ∈ R². I check conditions A1 through A5 in Bunke and Milhaud (1998). The lemma follows from their theorem when these conditions are satisfied.

The parameter space is Θ = R². The data-generating density of observation (x, y) is

f*(x, y) = φ(x; µ1*, σ²) φ(y; µ2*, σ²) if x < c,
f*(x, y) = φ(x; µ1*, σ²) φ(y; 0, 1) if x ≥ c,

where φ(·; µ, σ²) is the Gaussian density with mean µ and variance σ². Under the subjective model Ψ(µ̂1, µ̂2; γ), the same observation has density

f_{µ̂1,µ̂2}(x, y) = φ(x; µ̂1, σ²) φ(y; µ̂2 − γ(x − µ̂1), σ²) if x < c,
f_{µ̂1,µ̂2}(x, y) = φ(x; µ̂1, σ²) φ(y; 0, 1) if x ≥ c.

A1. The parameter space is a closed, convex set in R² with nonempty interior. The density f_{µ̂1,µ̂2}(x, y) is bounded over (µ̂1, µ̂2, x, y), and its carrier {(x, y) : f_{µ̂1,µ̂2}(x, y) > 0} is the same for all (µ̂1, µ̂2).

Evidently R² is closed in itself. The density f_{µ̂1,µ̂2}(x, y) is bounded by the product of the modes of Gaussian densities with variance σ² and variance 1. The density f_{µ̂1,µ̂2}(x, y) is strictly positive on R² for any parameter values (µ̂1, µ̂2).

A2. For all (µ̂1, µ̂2), there is a sphere S[(µ̂1, µ̂2), η] with center (µ̂1, µ̂2) and radius η > 0 such that

E_{f*}[ sup_{(µ1,µ2) ∈ S[(µ̂1,µ̂2),η]} |ln( f*(X, Y) / f_{µ1,µ2}(X, Y) )| ] < ∞.
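As a minimal sketch, the two densities can be written directly in code. The helper names and parameter values below are my own (illustrative), with `var` playing the role of σ² and `c` the censoring threshold:

```python
import math

def phi(x, mu, var):
    # Gaussian density with mean mu and variance var
    return math.exp(-(x - mu) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

def f_true(x, y, mu1s, mu2s, var, c):
    # data-generating density: below the threshold c, y is a genuine second
    # draw; at or above c, y is pure standard-Gaussian noise
    second = phi(y, mu2s, var) if x < c else phi(y, 0.0, 1.0)
    return phi(x, mu1s, var) * second

def f_subjective(x, y, mu1, mu2, gamma, var, c):
    # subjective model Psi(mu1, mu2; gamma): the gambler's fallacy shifts the
    # predicted mean of the second draw down by gamma * (x - mu1)
    second = phi(y, mu2 - gamma * (x - mu1), var) if x < c else phi(y, 0.0, 1.0)
    return phi(x, mu1, var) * second

# sanity check: at gamma = 0 the subjective model is correctly specified
assert abs(f_subjective(0.3, 0.8, 0.0, 1.0, 0.0, 1.2, 2.0)
           - f_true(0.3, 0.8, 0.0, 1.0, 1.2, 2.0)) < 1e-15
```

Note that for x ≥ c the two densities agree for any parameters, which is exactly why censored observations carry no information about µ2.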

Pick, say, η = 1. Consider the rectangle R[(µ̂1, µ̂2), η] consisting of points (µ1, µ2) such that |µ1 − µ̂1| < η and |µ2 − µ̂2| < η. Since the Gaussian distribution is single-peaked, for any (x, y) ∈ R² the absolute value of the log-likelihood ratio |ln(f*(X, Y)/f_{µ1,µ2}(X, Y))| on all of R[(µ̂1, µ̂2), η] must be bounded by its values at the 4 corners. That is to say,

sup_{(µ1,µ2) ∈ S[(µ̂1,µ̂2),η]} |ln(f*(X, Y)/f_{µ1,µ2}(X, Y))|
≤ sup_{(µ1,µ2) ∈ R[(µ̂1,µ̂2),η]} |ln(f*(X, Y)/f_{µ1,µ2}(X, Y))|
≤ |ln(f*(X, Y)/f_{µ̂1−η,µ̂2−η}(X, Y))| + |ln(f*(X, Y)/f_{µ̂1−η,µ̂2+η}(X, Y))| + |ln(f*(X, Y)/f_{µ̂1+η,µ̂2−η}(X, Y))| + |ln(f*(X, Y)/f_{µ̂1+η,µ̂2+η}(X, Y))|.

It is easy to see that for any fixed parameter, E_{f*}[ |ln(f*(X, Y)/f_{µ1,µ2}(X, Y))| ] is finite, so the sum of these 4 terms gives a finite upper bound.

A3. For all fixed (x0, y0) ∈ R², the map from parameters to density (µ1, µ2) ↦ f_{µ1,µ2}(x0, y0) has continuous derivatives with respect to the parameters, (µ1, µ2) ↦ (∂f/∂µ1)(x0, y0; µ1, µ2) and (µ1, µ2) ↦ (∂f/∂µ2)(x0, y0; µ1, µ2). There exist positive constants κ0 and b0 such that

∫∫ ( [(∂f/∂µ1)(x, y; µ1, µ2)]² + [(∂f/∂µ2)(x, y; µ1, µ2)]² ) · [f_{µ1,µ2}(x, y)]^{−1} dy dx < κ0 (1 + ||(µ1, µ2)||^{b0})

is satisfied for every (µ1, µ2) ∈ R², where || · || is a norm on R². Let us choose the max norm, ||v|| = max(|v1|, |v2|). For uncensored data (x0, y0) with x0 < c, we can compute

(∂f/∂µ1)(x0, y0; µ1, µ2) = f_{µ1,µ2}(x0, y0) [ (1 + γ²)(x0 − µ1)/σ² + γ(y0 − µ2)/σ² ]

and

(∂f/∂µ2)(x0, y0; µ1, µ2) = f_{µ1,µ2}(x0, y0) [ γ(x0 − µ1)/σ² + (y0 − µ2)/σ² ].

For censored data (x0, y0) where x0 ≥ c, the likelihood of the data is unchanged by the parameter µ2, since it changes neither the distribution of the early draw quality nor the distribution of the white-noise term, meaning (∂f/∂µ2)(x0, y0; µ1, µ2) = 0. Also, for the censored case,

(∂f/∂µ1)(x0, y0; µ1, µ2) = f_{µ1,µ2}(x0, y0) (x0 − µ1)/σ².
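These derivative formulas can be cross-checked against central finite differences. The sketch below is self-contained (function names and parameter values are my own, illustrative choices):

```python
import math

def phi(x, mu, var):
    return math.exp(-(x - mu) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

def f(x, y, mu1, mu2, gamma=0.4, var=1.0, c=1.5):
    # subjective density of an observation (x, y); y is noise when x >= c
    second = phi(y, mu2 - gamma * (x - mu1), var) if x < c else phi(y, 0.0, 1.0)
    return phi(x, mu1, var) * second

def scores(x, y, mu1, mu2, gamma=0.4, var=1.0, c=1.5):
    # analytic partial derivatives of f with respect to (mu1, mu2)
    v = f(x, y, mu1, mu2, gamma, var, c)
    if x < c:
        s1 = v * ((1 + gamma ** 2) * (x - mu1) / var + gamma * (y - mu2) / var)
        s2 = v * (gamma * (x - mu1) / var + (y - mu2) / var)
    else:
        s1 = v * (x - mu1) / var   # censored: mu2 drops out entirely
        s2 = 0.0
    return s1, s2

h = 1e-6
for (x, y) in [(0.7, -0.2), (2.3, 0.9)]:   # one uncensored, one censored point
    s1, s2 = scores(x, y, 0.1, 0.5)
    fd1 = (f(x, y, 0.1 + h, 0.5) - f(x, y, 0.1 - h, 0.5)) / (2 * h)
    fd2 = (f(x, y, 0.1, 0.5 + h) - f(x, y, 0.1, 0.5 - h)) / (2 * h)
    assert abs(s1 - fd1) < 1e-6 and abs(s2 - fd2) < 1e-6
```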

3 This means the integral to be bounded is: x=c x= (1+γ σ (x µ 1 + γ σ (y µ 1 f µ1,µ (x, y dy dx γ (x µ σ 1 1 (y µ σ x= [ + ( 1 σ (x µ 1 1 f µ1,µ (x, y dy x=c ] dx. Since the inner integrals are non-negative, this expression is smaller than the version where the domains of the outer integrals are expanded and the densities f µ1,µ (x, y are simply replaced with the joint density on R of the subjective model for Ψ(µ 1, µ ; γ, which I denote as f µ1,µ (x, y. (1+γ σ (x µ 1 + γ σ (y µ 1 f µ1,µ (x, y dy dx γ (x µ σ 1 1 (y µ σ [ + ( 1 σ (x µ 1 1 f µ1,µ (x, y dy ] dx. The second summand is a 1th moment of the joint normal random variable with distribution Ψ(µ 1, µ ; γ, so for all µ 1, µ it is given by some 1th order polynomial P (µ 1, µ. Similarly the first summand is also given by a 1th order polynomial P 1 (µ 1, µ. Therefore by choosing b 0 = 1 and choosing κ 0 appropriately according to the coefficients in P 1 and P, we achieved the desired bound. A4. For some positive constants b 1 and κ 1, the affinity function A(µ 1, µ := [f µ1,µ (x, y f (x, y] 1/ dydx satisfies A(µ 1, µ < κ 1 (µ 1, µ b 1 for all µ 1, µ. We have A(µ 1, µ [ [f µ1,µ (x, y f (x, y]dydx] 1/, so it s sufficient to find some κ 1 3

4 and b 1 that works to bound [f µ1,µ (x, y f (x, y]dydx. We have: [f µ1,µ (x, y f (x, y]dydx c = φ(x; µ 1, σ φ(x; µ 1, σ φ(y; µ γ(x µ 1, σ φ(y; µ, σ dydx x= + φ(x; µ 1, σ φ(x; µ 1, σ φ(y; 0, 1 φ(y; 0, 1dydx x=c φ(x; µ 1, σ φ(x; µ 1, σ φ(y; µ γ(x µ 1, σ φ(y; µ, σ dydx + φ(x; µ 1, σ φ(x; µ 1, σ φ(y; 0, 1 φ(y; 0, 1dydx. I show how to find κ 1 and b 1 to bound the first summand in the last expression above. It is easy to similarly bound the second summand. By Bromiley (003, the product of Gaussian densities φ(y; µ 0 γ(x µ 1, σ φ(y; µ, σ is itself a Gaussian density in y, φ(y, multiplied by a scaling factor equal to (4πσ 1/ exp ( γ [x (µ 4σ 1 µ γ + µ γ ]. So we have = φ(x; µ 1, σ φ(x; µ 1, σ φ(y; µ γ(x µ 1, σ φ(y; µ, σ dydx ( ( φ(x; µ 1, σ φ(x; µ 1, σ 4πσ 1/ exp γ ( = 4πσ 1/ φ(x; µ 1, σ φ(x; µ 1, σ exp ( 4σ [x (µ 1 µ γ + µ γ ] γ 4σ [x (µ 1 µ γ + µ γ ] φ(ydydx Again applying Bromiley (003, product of the two Gaussian densities φ(x; µ 1, σ φ(x; µ 1, σ is another Gaussian density with mean µ 1 +µ 1, variance σ, and multiplied to a scaling factor of (4πσ 1/ exp ( (µ 1 µ 1 4σ. So above expression is: K 1 exp ( (µ 1 µ 1 4σ dx. ( φ(x; µ 1 + µ 1, σ exp γ 4σ [x (µ 1 µ γ + µ γ ] dx where K 1 is a constant not dependent on µ 1, µ. Also, we may write exp ( γ 4σ [x (µ 1 µ γ + µ γ ] = K φ(x; (µ 1 µ γ + µ γ, σ B where σb = σ and K γ = (πσb 1/. Applying Bromiley (003 one final time, the product φ(x; µ 1 +µ 1, σ φ(x; (µ 1 µ γ + µ, γ σ B is a Gaussian density in x scaled by K 4 exp( K 3 where K 3, K 4 > 0 are constants not dependent on µ 1, µ. So altogether, ( µ 1 µ 1 µ µ γ the second summand we are bounding is a constant multiple of exp ( (µ 1 µ 1 4σ exp( K3 ( µ 1 µ 1 µ µ. For µ γ 1 µ, the max norm (µ 1, µ = µ 1 and exp ( (µ 1 µ 1 4σ 4

5 decreases exponentially fast in the norm. For µ 1 < µ, and µ µ 1 + µ γ > 0, exp( K 3 ( µ 1 µ 1 µ µ exp( K 3 ( µ γ µ 1 + µ γ. So for large enough µ, exp( K 3 ( µ 1 µ 1 µ µ γ will decrease exponentially fast in the norm. These two facts imply that there is some K > 0 so that whenever (µ 1, µ > K, φ(x; µ 1, σ φ(x; µ 1, σ φ(y; µ γ(x µ 1, σ φ(y; µ, σ dydx < (µ 1, µ 1. Now put κ 1 = K 1 and we can ensure for any value of (µ 1, µ we will have φ(x; µ 1, σ φ(x; µ 1, σ φ(y; µ γ(x µ 1, σ φ(y; µ, σ dydx < κ 1 (µ 1, µ 1. A5. There are positive constants b, b 3 so that for all (µ 1, µ and r > 0 it holds that g(s[(µ 1, µ, r] cr b (1 + ( (µ 1, µ + r b 3. Moreover, g assigns positive mass to every sphere with positive radius. Since we have assumed that density g is bounded by B, the prior mass assigned to the sphere S[(µ 1, µ, r] is bounded by B times its Euclidean volume. So, take b = and c = πb and the first statement is satisfied. Since we have assumed that g is strictly positive everywhere, the second statement is satisfied. OA 1. Proof of Lemma A.16 Proof. Let (µ 1, µ R be given. For any µ 1, µ, c R, we have U(c; µ 1, µ = c c + u 1 (x 1 φ(x 1 ; µ 1, σ dx 1 [ ] u (x 1, x φ(x ; µ γ(x 1 µ 1, σ dx φ(x 1 ; µ 1, σ dx 1. We first bound c u 1 (x 1 φ(x 1 ; µ 1, σ dx 1 c u 1 (x 1 φ(x 1 ; µ 1, σ dx 1 by a multiple of µ 1 µ 1. Suppose first µ 1 = µ 1 + for some > 0. We have c u 1 (x 1 φ(x 1 ; µ 1, σ dx 1 = c u 1 (x 1 + φ(x 1 ; µ 1, σ dx 1. By Lipschitz continuity of u 1, u 1 (x 1 u 1 (x 1 + K 1 for all x 1 R. Thus we conclude u 1 (x 1 φ(x 1 ; µ 1, σ dx 1 u 1 (x 1 φ(x 1 ; µ 1, σ c dx 1 K 1 + u 1 (x 1 φ(x 1 ; µ 1, σ dx 1. c c c 5
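The Lipschitz-plus-bounded-density argument here controls how a mean shift of size Δ moves the censored integral. A numerical illustration (a sketch under my own assumptions: hypothetical payoff u1(x) = |x| with Lipschitz constant K1 = 1, σ = 1, and the crude bound J1 ≤ 1 on sup_x |u1(x) φ(x; µ1, σ²)|):

```python
import math

def phi(x, mu):
    # unit-variance Gaussian density (sigma = 1 assumed for this sketch)
    return math.exp(-(x - mu) ** 2 / 2) / math.sqrt(2 * math.pi)

def censored_integral(u1, mu, c, step=1e-3, hi=40.0):
    # midpoint-rule approximation of  integral from c to infinity of u1(x) phi(x; mu) dx
    n = int((hi - c) / step)
    return step * sum(u1(c + (i + 0.5) * step) * phi(c + (i + 0.5) * step, mu)
                      for i in range(n))

u1 = abs                    # hypothetical payoff, Lipschitz with K1 = 1
K1, J1_bound = 1.0, 1.0     # J1 = sup_x |u1(x) phi(x; mu)| < 1 for these values
mu, c = 0.3, 1.1
base = censored_integral(u1, mu, c)
for delta in (0.05, 0.2, 0.5):
    gap = abs(censored_integral(u1, mu + delta, c) - base)
    assert gap <= (K1 + J1_bound) * delta   # the (K1 + J1) * delta bound holds
```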

6 Again by Lipschitz continuity of u 1, for any x 1 R, u 1 (x 1 φ(x 1 ; µ 1, σ ( u 1 (0 + K 1 x 1 φ(x 1 ; µ 1, σ. Since the Gaussian density decreases to 0 exponentially fast as x 1 ±, the RHS is uniformly bounded for all x 1 R by some constant, say J 1 > 0. (Note that the RHS is not a function of c, so J 1 does not depend on c. This shows that c c c u 1 (x 1 φ(x 1 ; µ 1, σ dx 1 u 1 (x 1 φ(x 1 ; µ 1, σ dx 1 J 1 dx 1 = J 1. c c c So altogether, u 1 (x 1 φ(x 1 ; µ 1, σ dx 1 u 1 (x 1 φ(x 1 ; µ 1, σ dx 1 (K 1 + J 1. c c If instead µ 1 = µ 1, then a similar argument shows that c+ u 1 (x 1 φ(x 1 ; µ 1, σ dx 1 u 1 (x 1 φ(x 1 ; µ 1, σ dx 1 K 1 + u 1 (x 1 φ(x 1 ; µ 1, σ dx 1, c c c and again we may bound the second term by J 1 as before. We now turn to bounding the difference in the second summand making up U(c; µ 1, µ. First consider the case where µ = µ. For each x 1 R, let I(x 1 ; µ 1 := u (x 1, x φ(x ; µ γ(x 1 µ 1, σ dx, the expected continuation utility after X 1 = x 1, in the subjective model Ψ(µ 1, µ ; γ. The second summand in U(c; µ 1, µ is given by c I(x 1; µ 1 φ(x 1 ; µ 1, σ dx 1. For x 1 = x 1 + d 1, µ 1 = µ 1 + d, we have I(x 1; µ 1 = = Lipschitz continuity of u implies that u (x 1, x φ(x ; µ γ(x 1 µ 1, σ dx u (x 1 + d 1, x γ(d 1 d φ(x ; µ γ(x 1 µ 1, σ dx. u (x 1 + d 1, x γ(d 1 d u (x 1, x K ((1 + γ d 1 + γ d K (1 + γ ( d 1 + d, which shows I(x 1; µ 1 I(x 1; µ 1 K (1+γ ( x 1 x 1 + x x. That is, I is Lipschitz continuous. Suppose µ 1 = µ 1 + for some > 0. Similar to the above argument bounding the first 6

7 summand in (c; µ 1, µ, we have c I(x 1 ; µ 1 φ(x 1 ; µ 1, σ dx 1 = c I(x 1 + ; µ 1 + φ(x 1 ; µ 1, σ dx 1. By Lipschitz continuity of I, I(x 1 ; µ 1 I(x 1 + ; µ 1 + K (1 + γ for all x 1 R. Thus we conclude c I(x 1 ; µ 1 φ(x 1 ; µ 1, σ dx 1 K (1 + γ + c c c I(x 1 ; µ 1φ(x 1 ; µ 1, σ dx 1 I(x 1 ; µ 1φ(x 1 ; µ 1, σ dx 1. Since x 1 I(x 1 ; µ 1 is Lipschitz continuous, there exists J > 0 so that I(x 1 ; µ 1φ(x 1 ; µ 1, σ J for all x 1 R, which means c c I(x 1; µ 1φ(x 1 ; µ 1, σ dx 1 J. (Once again, J does not depend on c. The case of µ 1 = µ 1 is symmetric and we have shown that c I(x 1 ; µ 1 φ(x 1 ; µ 1, σ dx 1 I(x 1 ; µ 1φ(x 1 ; µ 1, σ dx 1 (K (1 + γ + J µ 1 µ 1. Finally, we investigate the difference in the second summand of U(c; µ 1, µ between parameters (µ 1, µ and (µ 1, µ for µ 1, µ R. This difference is bounded by c u (x 1, x φ(x ; µ γ(x 1 µ 1, σ dx u (x 1, x φ(x ; µ γ(x 1 µ 1, σ dx φ(x 1 ; µ 1, σ dx 1. (3 But for every x 1 R, u (x 1, x φ(x ; µ γ(x 1 µ 1, σ dx = u (x 1, x +(µ µ φ(x ; µ γ(x 1 µ 1, σ dx, and u (x 1, x + (µ µ u (x 1, x K µ µ by Lipschitz continuity of u. This shows that, for all values µ 1, µ R, (3 is bounded by K µ µ. Applying the triangle inequality to the second term, we conclude that U(c; µ 1, µ U(c; µ 1, µ (K 1 + J 1 µ 1 µ 1 + (K (1 + γ + J µ 1 µ 1 + K µ µ. So we see that setting K = K 1 + J 1 + (K (1 + γ + J establishes the lemma. OA 1.3 Proof of Lemma A.5 Proof. Consider the payoff difference between accepting x 1 and continuing under belief ν, D(x 1 ; ν := u 1 (x 1 E X N (µ γ(x 1 µ 1 [u,σ (x 1, X ]dν(µ. 7
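To build intuition for D, here is a sketch under a point belief and illustrative linear payoffs u1(x1) = x1 and u2(x1, x2) = x2 (assumptions made for this example only). Then D is affine in x1, the cutoff has the closed form c = (µ2 + γµ1)/(1 + γ), and bisection recovers it from monotonicity alone:

```python
def D(x1, mu1, mu2, gamma):
    # payoff gap between stopping at x1 and continuing, for a point belief
    # with the illustrative payoffs u1(x) = x, u2(x1, x2) = x2 (assumed here):
    # E[X2 | X1 = x1] = mu2 - gamma * (x1 - mu1) under Psi(mu1, mu2; gamma)
    return x1 - (mu2 - gamma * (x1 - mu1))

def cutoff(mu1, mu2, gamma, lo=-1e6, hi=1e6, iters=200):
    # D is strictly increasing in x1 and crosses zero, so bisection finds
    # the unique indifference point c with D(c) = 0
    for _ in range(iters):
        mid = (lo + hi) / 2
        if D(mid, mu1, mu2, gamma) < 0:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

c = cutoff(mu1=0.2, mu2=1.0, gamma=0.5)
# closed form for this linear case: c = (mu2 + gamma * mu1) / (1 + gamma)
assert abs(c - (1.0 + 0.5 * 0.2) / 1.5) < 1e-6
```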

8 Note that D(x 1, ν = D(x 1 ; µ 1, µ, γdν(µ. Lemma A. shows that for every µ R, D(x 1 ; µ 1, µ, γ is strictly increasing in x 1. Hence the same must hold for D(x 1, ν. Also, Lemma A. implies there exists some x 1 R so that D(x 1; µ 1, µ, γ < 0, and that there exists some x 1 R satisfying D(x 1; µ 1, µ, γ > 0. Since u increases in its second argument, we also get D(x 1; µ 1, µ, γ < 0 and D(x 1; µ 1, µ, γ > 0 for all µ [µ, µ ]. This implies D(x 1; ν < 0 and D(x 1; ν > 0, as ν is supported on (a subset of [µ, µ ]. Finally, I show D(x 1 ; ν is continuous in x 1. Fix x 1 R. Since u 1 is continuous, find δ > 0 so that whenever x 1 x 1 < 1, u 1 (x 1 u 1 ( x 1 < δ. Consider the function f : R R 0 defined by f(x, µ := u ( x 1, x γ + µ + u ( x 1, x + γ + µ + δ. Claim OA.1. Whenever x 1 x 1 < 1, u (x 1, x + γ( x 1 x 1 + µ f(x for every x, µ R. Proof. This is the same as the proof of Claim A.1. Claim OA.. µ µ ( f(x, µ φ(x ; γ( x 1 µ 1, σ dx dν(µ <. Proof. We may write f(x, µ := u +γ,+(x, µ + u +γ, (x, µ + u γ,+(x, µ + u γ, (x, µ + δ where u +γ,+ and u +γ, are the positive and negative parts of (x, µ u ( x 1, x + γ + µ, and u γ,+ and u γ, are the positive and negative parts of (x, µ u ( x 1, x γ + µ. From Assumption 1(d, for every µ [µ, µ ], each of u +γ,+(, µ, u +γ, (, µ,u γ,+(, µ,and u γ, (, µ is integrable over R with respect to the Gaussian density for N ( γ( x 1 µ 1, σ. These integrals are maximized at µ = µ for u +γ,+(, µ and u γ,+(, µ, and maximized at µ = µ for u +γ, (, µ and u γ, (, µ. In other words, for every µ [µ, µ ], + f(x, µ φ(x ; γ( x 1 µ 1, σ dx ( u +γ,+(x, µ + u γ,+(x, µ φ(x ; γ( x 1 µ 1, σ dx ( u +γ, (x, µ + u γ, (x, µ φ(x ; γ( x 1 µ 1, σ dx. This bound is finite and does not depend on µ, so the overall integral over dν(µ is also finite. 8

9 Consider a sequence x (n 1 x 1. We have D(x (n 1 ; ν = u 1 (x (n 1 = u 1 (x (n 1 = u 1 (x (n 1 E X N (µ γ(x (n 1 µ 1,σ [u (x (n 1, X]dν(µ E X N ( γ( x 1 µ 1,σ [u (x (n 1, X + γ( x 1 x (n µ µ 1 + µ ]dν(µ u (x (n 1, x + γ( x 1 x (n 1 + µ φ(x ; γ( x 1 µ 1, σ dx dν(µ. The sequence of functions (x, µ u (x (n 1, x + γ( x 1 x (n 1 + µ pointwise converge to u ( x 1, x + µ as n. From the two claims, for all large enough n, this sequence of functions are pointwise dominated by f, an absolutely integrable function on the same domain. Therefore continuity follows from dominated convergence theorem, as in the proof of Lemma A.. This means there exists a unique c so that D(c = 0. The cutoff strategy S c is optimal, because it stops at every x 1 whose stopping payoff exceeds expected continuation payoff, and continues at every x 1 where expected continuation payoff is higher than stopping payoff. For any c = c + δ for some δ > 0, the difference in expected payoffs of S c and S c is c +δ c D(x 1; ν > 0 since D(x 1 ; ν is strictly positive on the interval (c, c + δ]. So every strictly higher cutoff than c is strictly suboptimal. A similar argument shows every strictly lower cutoff than c is also strictly suboptimal. OA 1.4 Proof of Lemma A.18 Proof. For Assumption A.1(a, the marginal of F ( ; θ 1, θ on X 1 is simply Q 1 ( ; θ 1, which I assumed is strictly increasing in mean with respect to θ 1. For Assumptions A.1(b, it is well-known that by the copula construction, for all u, v [0, 1] P F ( ;θ1,θ [X Q 1 (v; θ 1 X 1 = Q 1 1 (u; θ ] = W W (u, v. This means (u, v is u u increasing in v. Fixing some x 1 I 1 and θ 1 Θ 1, put u = Q 1 (x 1 ; θ 1. Now for every θ and x I, we have P F ( ;θ1,θ [X x X 1 = x 1 ] = W (u, Q 1 u (x ; θ. Since the family of marginals Q ( ; θ increases in FOSD order as θ increases, Q 1 (x ; θ is decreasing in θ. 
Since W increases in its second argument, P u F ( ;θ 1,θ [X x X 1 = x 1 ] must then decrease in θ, that is to say the conditional distribution X X 1 = x 1 is increasing in FOSD order in θ. So in particular Assumption A.1(b is satisfied. For Assumption A.1(c, again start with the expression P F ( ;θ1,θ [X Q 1 (v; θ X 1 = Q 1 1 (u; θ 1 ] = W (u, v. u 9

For x1′ > x1, put u′ = Q1(x1′; θ1) > Q1(x1; θ1) = u. We have for every v ∈ [0, 1] that

P_{F(·;θ1,θ2)}[X2 ≤ Q2^{−1}(v; θ2) | X1 = x1′] = (∂W/∂u)(Q1(x1′; θ1), v),

while

P_{F(·;θ1,θ2)}[X2 ≤ Q2^{−1}(v; θ2) | X1 = x1] = (∂W/∂u)(Q1(x1; θ1), v).

Since the distribution function Q1(·; θ1) has full support, Q1(x1′; θ1) > Q1(x1; θ1). And since we assumed ∂W/∂u is increasing in its first argument, we see that P_{F(·;θ1,θ2)}[X2 ≤ x2 | X1 = x1] is increasing in x1. That is, the conditional distribution X2 | X1 = x1 is decreasing in FOSD order in x1. So Assumption A.1(c) is satisfied.

OA 1.5 Proof of Lemma A.19

Proof. Suppose (θ1^M, θ2^M) is an MOM estimator. I show any other MOM estimator (θ̂1, θ̂2) must be equal to it. We may rewrite the moments as:

m1[H(θ1^M, θ2^M; c)] = E_{F1(·;θ1^M)}[X1],
m2[H(θ1^M, θ2^M; c)] = E_{F(·;θ1^M,θ2^M)}[X2 | X1 < c].

The unconditional mean of X1, namely E_{F1(·;θ1)}[X1], is strictly increasing in θ1 by Assumption A.1(a). So at most one value of θ1 ∈ Θ1 can generate an unconditional mean that matches m1[H*(c)], meaning we must have θ̂1 = θ1^M.

Given this unique θ1^M, Assumption A.1(b) implies the conditional mean E_{F(·;θ1^M,θ2)}[X2 | X1 = x1] is strictly increasing in θ2 for every x1 < c. The conditional mean E_{F(·;θ1^M,θ2)}[X2 | X1 < c] is given by an integral of E_{F(·;θ1^M,θ2)}[X2 | X1 = x1] across the values x1 < c, therefore E_{F(·;θ1^M,θ2)}[X2 | X1 < c] is also strictly increasing in θ2. So there is at most one value of θ2 such that m2[H(θ1^M, θ2; c)] = m2[H*(c)], which gives θ̂2 = θ2^M.

OA 1.6 Proof of Proposition A.4

Proof. Since the marginal distribution of X1 in F(·; θ1, θ2) only depends on θ1 and is strictly increasing in it, and since m1[H*(c)] does not depend on c, we must have θ1^M(c′) = θ1^M(c′′). I denote this common value by θ1^M.

In seeking to match the moment m2[H*(c′)], we may break down the conditioning event

11 {X 1 < c } into the union {X 1 < c } {c X 1 < c }, so E F ( ;θ M 1,θ M (c [X X 1 < c ] = P F ( ;θ M 1,θM (c [X 1 < c ] P F ( ;θ M 1,θ M (c [X 1 < c ] E F ( ;θ M 1,θM (c [X X 1 < c ] + P F ( ;θ1 M,θM (c [c X 1 < c ] P F ( ;θ M 1,θ M (c [X E 1 < c ] F ( ;θ M 1,θ M (c [X c X 1 < c ]. Suppose by way of contradiction that θ M (c θ M (c. Then since E F 1 ( ;θ1 M strictly increasing in θ for every x 1 by Assumption A.1(b, we get,θ x 1 [X ] is E F ( ;θ M 1,θ M (c [X X 1 < c ] E F ( ;θ M 1,θ M (c [X X 1 < c ] = m [H (c ], where the equality comes from the second moment condition of the MOM estimator (θ1 M (c, θ M (c. Similarly E F ( ;θ M 1,θ M (c [X c X 1 < c ] E F ( ;θ M 1,θ M (c [X c X 1 < c ]. Now E F 1 ( ;θ1 M,θM (c x 1 [X ] is strictly decreasing in x 1 by Assumption A.1(c, so for every x 1 [c, c the expectation is smaller than for every x 1 < c. This shows that E F ( ;θ M 1,θ M (c [X c X 1 < c ] < E F ( ;θ M 1,θ M (c [X X 1 < c ] = m [H (c ]. Since F ( ; θ1 M, θ M (c has full support on I 1 I, the probability P F ( ;θ M 1,θ M (c [c X 1 < c ] is strictly positive since [c, c is an interval in the interior of I 1. So we see that E F ( ;θ M 1,θ M (c [X X 1 < c ] is a convex combination between a term that is no larger than m [H (c ] and another term that is strictly smaller than m [H (c ], with strictly positive weight on the latter. Since m [H (c ] = m [H (c ], we see that (θ1 M (c, θ M (c cannot match the second moment condition of m [H(θ1 M (c, θ M (c ; c ] = m [H (c ], contradiction. Hence we conclude θ M (c > θ M (c. OA 1.7 Proof of Corollary A.1 Proof. I first show that under any of the models F ( ; θ 1, θ, agent s subjectively optimal stopping rule is a cutoff rule (possibly involving never stopping or always stopping. It suffices to show that x 1 (u 1 (x 1 E F 1 ( ;θ 1,θ x 1 [u (x 1, X ] 11

12 is strictly increasing in x 1. By linearity of u in its second argument, this expression is equal to Suppose x 1 > x 1. By Assumption 1(b, x 1 (u 1 (x 1 u (x 1, E F 1 ( ;θ 1,θ x 1. u 1 (x 1 u (x 1, E F 1 ( ;θ 1,θ x 1 u 1(x 1 u (x 1, E F 1 ( ;θ 1,θ x 1. By Assumption A.1(c, E F 1 ( ;θ 1,θ x 1 < E F 1 ( ;θ 1,θ x 1. Combined with Assumption 1(a, it gives u (x 1, E F 1 ( ;θ 1,θ x < u (x 1, E 1 F 1 ( ;θ 1,θ x, hence showing u 1 (x 1 u (x 1, E F 1 ( ;θ 1,θ x 1 > u 1(x 1 u (x 1, E F 1 ( ;θ 1,θ x 1. Also, suppose F ( ; θ 1, θ induces either a stopping threshold which is an interior point of I 1, or always stopping. Then F ( ; θ 1, θ induces a higher stopping threshold or always stopping whenever θ θ. To see this, if there is an indifference point x 1 in the interior of I 1 with u 1 ( x 1 = u ( x 1, E F 1 ( ;θ 1,θ x 1, then we have E F 1 ( ;θ 1,θ x 1 > E F 1 ( ;θ 1,θ x due to 1 Assumption A.1(b, so u 1 ( x 1 < u ( x 1, E F 1 ( ;θ 1,θ x 1. This shows under F ( ; θ 1, θ the agent strictly prefers continuing at x 1, so the acceptance threshold must be higher. Similarly, if the agent prefers always stopping at every x 1 I 1 under F ( ; θ 1, θ. then she prefers strictly stopping at every x 1 under F ( ; θ 1, θ. I now show that µ M 1,[t], µm,[t], and cm [t] are well defined for every t 1. MOM agents in generation t 1 face t sub-datasets of censored histories, with the distribution H (c [0],..., H (c [t 1] where c [0] int(i 1. The moments to match are m 1 (H (c [0],..., H (c [K 1] = E F [X 1 ], 1 m (H (c [0],..., H (c [K 1] = E F [X ], where the second-period moment is well defined because c [0] is interior, so a positive fraction of histories in at least one sub-dataset contain uncensored X. These moments are interior values in I 1, I respectively, since F has full-support marginal distributions. By Assumption A.(b, there exists θ 1 Θ 1, independent of K and (c [0],..., c [K 1], so that E F1 ( ; θ 1 [X 1 ] = E F [X 1 ]. 
By combining Assumption A.1(b and A.(c, we get that θ m (H( θ 1, θ ; c [0],..., H( θ 1, θ ; c [K 1] is increasing, continuous on Θ with a range of I. (This uses the fact that c [0] is in the interior 1

13 of I 1. Since MOM agents are matching an interior value E F [X ] int(i, this shows that for any K and (c [0],..., c [K 1] with c [0] int(i 1, θ M 1 (c [0],..., c [K 1] and θ M (c [0],..., c [K 1] exist, and furthermore θ M 1 (c [0],..., c [K 1] = θ 1. By uniqueness of MOM estimators in Lemma A.19, µ M 1,[t], µm,[t] are well defined for each t 1. Also, c M [t] is also well defined for each t 1, given that we have shown the optimal strategy in the model F ( ; µ M 1,[t], µm,[t] is a cutoff strategy. To prove monotonicity, first suppose that c [1] c [0]. I have argued that we must have θ1,[] M = θm 1,[1] = θ 1, so now I rule out θ,[] M > θm,[1]. Note that m (H( θ 1, θ ; c [0], H( θ 1, θ ; c [1] = w 0 w 0 + w 1 m (H( θ 1, θ ; c [0] + w 1 w 0 + w 1 m (H( θ 1, θ ; c [1] where w 0 = P F [X 1 c [0] ] > 0 and w 1 = P F [X 1 c [1] ] 0. The moment-matching condition for generation 1 implies m (H( θ 1, θ M,[1] ; c [0] = E F [X ]. For any θ M,[] > θm,[1], we have m (H( θ 1, θ M,[]; c [0] > E F [X ] from Assumption A.(b. If c [1] = inf(i 1, we have found a contradiction since the weight w 1 is 0. When c [1] > inf(i 1, we get m (H( θ 1, θ M,[]; c [1] m (H( θ 1, θ M,[]; c [0] > E F [X ] by combining Assumption A.(c with the fact that c [1] c [0]. Both w 0 and w 1 are strictly positive, and they are multiplied to terms both strictly larger than E F [X ]. This shows again contradicting the moment condition. m (H( θ 1, θ M,[]; c [0], H( θ 1, θ M,[]; c [1] > E F [X ], Hence we must have θ M,[] θ M,[1], and thus cm [] c M [1] by monotonicity of the cutoff threshold in belief as discuss before. Similar argument establishes (µ M 1,[t] t 1, (µ M,[t] t 1, and (c M [t] t 1 are decreasing sequences. The case of c [1] > c [0] is symmetric. OA 1.8 Proof of Proposition A.6 Let J 1 = (µ 1 µ 1 σ 13

14 and for i, let J i be c1 ci 1 (x 1,...,x i... i 1 j=1 [ (µ φ(x j ; µ j, σ i µ i + i 1 j=1 γ i,j (x j µ j ] dx σ i 1...dx 1. The expression in square brackets is the KL divergence from the agent s subjective model for X i (X 1 = x 1,..., X i 1 = x i 1 to the true distribution of X i, under fundamentals µ 1,..., µ i. So, the integral J i is a weighted average of this divergence, taken across different realizations of previous draws (x 1,..., x i i with weights given by the true likelihood of observing such a sequence of draws in periods 1 through i 1 under the stopping strategy S c. Note that for each i, J i (and I i depends on µ 1,..., µ i. of J i. I first develop an alternative expression of D KL (H(Ψ ; S c H(Ψ(µ; γ; S c as the sum Lemma OA.1. L i=1 I i = L i=1 J i. Proof. Let Ĩi be a slightly modified version of I i, where the inner-most integral over x i has the range (,, so Ĩi is ci 1 (x 1,...,x i ( i... φ(x k ; µ k, σ Π i ln k=1φ(x k ; µ k, σ k=1 Π i k=1 φ(x k; µ k k 1 dx j=1 γ k,j (x j µ j, σ i...dx 1. c1 Observe that ĨL = I L. Inductively I will show ĨL + L 1 i=1 I i = L i=1 J i for every 1 L L. When L = 1, this just says Ĩ1 = J 1, which is true by definition. Now suppose the statement holds for some L = S L 1. I show it also holds when L = S + 1. We have ( S S 1 Ĩ S+1 + I i = ĨS+1 + (I S ĨS S + Ĩ S + I i = ĨS+1 + (I S ĨS + J i i=1 i=1 i=1 where the last equality comes from the inductive hypothesis. Since I S and ĨS simply differ in terms of the bounds of the inner-most integral, I S ĨS is c1 ( cs (x 1,...,x S 1 S... φ(x k ; µ k, σ Π S ln k=1φ(x k ; µ k, σ k=1 Π S k=1 φ(x k; µ k k 1 dx j=1 γ k,j (x j µ j, σ S...dx 1. Now, decompose the ln the sum ( Π S+1 k=1 φ(x k;µ k,σ Π S+1 k=1 φ(x k 1 k;µ k j=1 γ k,j (x j µ j,σ term in the integrand of ĨS+1 into ( Π S ln k=1φ(x k ; µ k, σ ( φ(x S+1 ; µ S+1,, σ Π S k=1 φ(x k; µ k k 1 +ln j=1 γ k,j (x j µ j, σ φ(x S+1 ; µ S+1 S. j=1 γ S+1,j (x j µ j, σ 14

15 We know that c1 cs (.. ( S+1... φ(x k ; µ k, σ Π S k=1 ln φ(x k; µ k, σ k=1 Π S k=1 φ(x k; µ k k 1 j=1 γ dx S+1...dx 1 k,j (x j µ j, σ c1 cs (.. S ( =... φ(x k ; µ k, σ φ(x S+1 ; µ S+1, σ Π S k=1 ln φ(x k; µ k, σ k=1 Π S k=1 φ(x k; µ k k 1 j=1 γ dx S+1...dx 1 k,j (x j µ j, σ c1 ( cs (.. S =... φ(x k ; µ k, σ Π S k=1 ln φ(x k; µ k, σ Π S k=1 φ(x k; µ k k 1 j=1 γ dx S...dx 1 k,j (x j µ j, σ = (I S ĨS k=1 where c S (.. abbreviates the bound of integration c S (x 1,..., x S 1. At the same time, c1... c1 =... c1 =... cs (.. cs (.. k=1 cs (.. c1 cs (.. =... =J S+1 ( S+1 φ(x k ; µ k, σ φ(x S+1 ; µ S+1 ln, σ k=1 φ(x S+1 ; µ S+1 S j=1 γ dx S+1...dx 1 S+1,j (x j µ j, σ S ( φ(x k ; µ k, σ φ(x S+1 ; µ S+1, σ φ(x S+1 ; µ S+1 ln, σ φ(x S+1 ; µ S+1 S j=1 γ dx S+1...dx 1 S+1,j (x j µ j, σ S S φ(x k ; µ k, σ D KL [N (µ S+1, σ, N (µ S+1 γ S+1,j (x j µ j, σ ]dx S...dx 1 k=1 S k=1 φ(x k ; µ k, σ (µ S+1 µ S+1 + S j=1 γ S+1,j (x j µ j σ dx S...dx 1 j=1 where we used the closed-form expression of the KL divergence between two Gaussian distributions, S D KL N (µ S+1, σ N (µ S+1 γ S+1,j (x j µ j, σ = (µ S+1 µ S+1 + S j=1 γ S+1,j (x j µ j. j=1 σ So by induction, ĨL + L 1 i=1 I i = L i=1 J i. As ĨL = I L, we are done. Using Lemma OA.1, I can now give the proof of Proposition A.6. Proof. Abbreviate D KL (H(Ψ ; S c H(Ψ(µ; γ; S c as ξ(µ 1,..., µ L. By Lemma OA.1, ξ(µ 1,..., µ L = Li=1 J i (µ 1,..., µ i. We show that the recursively defined parameters are the only ones satisfying the first-order condition, ξ µ i (ˆµ 1,..., ˆµ L = 0 for each i. In the integrand for J i, each µ j where 1 j i appears once in the term (µ i µ i+ i 1 j=1 γ i,j (x j µ j σ. For any (x 1,..., x i 1, the partial derivative of this term with respect to µ j for j < i is γ i,j times its partial derivative with respect to µ i. That is, at any values of ˆµ 1,..., ˆµ i, we get J i µ j (ˆµ 1,..., ˆµ i = γ i,j J i µ i (ˆµ 1,..., ˆµ i 15

16 for each 1 j < i. At any (µ 1,..., µ L satisfying the first-order condition for µ L, we must have ξ µ L (µ 1,..., µ L = J L µ L (µ 1,..., µ L = 0. By above, this also implies for each 1 j < L, either J L µ j (µ 1,..., µ L = 0, or γ L,k = 0 (in which case J L is not actually a function of µ j and J L µ j = 0 everywhere. Either way, this shows for the case of j = L 1, ξ µ L 1 (µ 1,..., µ L = J L µ L 1 (µ 1,..., µ L + J L 1 µ L 1 (µ 1,..., µ L 1 = J L 1 µ L 1 (µ 1,..., µ L 1. If (µ 1,..., µ L also satisfies the first-order condition for µ L 1, then J L 1 µ L 1 (µ 1,..., µ L 1 = 0. Continuing this telescoping argument, we conclude if (µ 1,..., µ L satisfies the first-order condition for all µ i, 1 i L, then J i µ i (µ 1,..., µ i = 0 for every 1 i L. Given the form of J 1, it is clear that J 1 µ 1 (µ 1 = 0 implies µ 1 = µ 1. Also, J i (µ µ 1,..., µ c1 ci 1 (x 1,...,x i i =... i i 1 j=1 [ (µ φ(x j ; µ j, σ i µ i + γ i,j (x j µ ] j dx i 1...dx 1. Using the fact that J i µ i (µ 1,..., µ i = 0, we multiply the integrand by the constant σ c1 ci 1 (x 1,...,x i (... i 1 j=1 σ φ(x j ; µ j, σ dx i 1...dx 1 1 and get E Ψ µ i µ i 1 i + γ i,j (X j µ j (X k k=1 i 1 R i 1 = 0. j=1 Rearranging, we have µ i = µ i i 1 j=1 γ i,j (µ j E Ψ [X j (X k i 1 k=1 R i 1] as desired. This means the only (µ 1,..., µ L satisfying the first-order condition for minimizing KL divergence is the one iteratively given in this proposition. OA 1.9 Proof of Proposition A.7 Proof. This clearly holds for i = 1. By induction assume this holds for all i K for some K L 1. I show that this also holds for i = K

17 From Proposition A.6, i 1 µ i = µ i j=1 γ i,j (µ j E Ψ [X j (X k i 1 k=1 R i 1]. The continuation region R i 1 is the rectangle (, c 1... (, c i 1 R i 1. As (X 1,..., X i 1 are objectively independent, the events {X k c k } for k j are independent of X j, so the expression simplifies to i 1 µ i = µ i γ i,j (µ j E Ψ [X j X j c j ]. Expanding each µ j for 1 j i 1 using the inductive hypothesis, j=1 K µ K+1 =µ K+1 γ K+1,j (µ j E Ψ [X j X j c j ] j=1 K j 1 + γ K+1,j W (p (µ k E Ψ [X k X k c k ] j=1 k=1 p P [j k] K K =µ K+1 + ( γ K+1,j + γ K+1,k W (p (µ j E Ψ [X j X j c j ]. j=1 k=j+1 p P [k j] Paths in P [K + 1 j] come in two types. The first type is the direct path consisting of just one edge (K + 1, j, with weight γ K+1,j. The second type consists of the indirect paths p = ((K + 1, k, p where p P [k j]. We have W (p = γ K+1,k W (p. We therefore see that the expression K j=1 [ ( γk+1,j + K k=j+1 γ K+1,k ( p P [k j] W (p ] in fact gives the sum of weights for all paths in P [K + 1 j]. So, we have shown that the claim holds also for i = K + 1. By induction it holds for all 1 i L. OA 1.10 Proof of Corollary A. Proof. First suppose δ > α. By Proposition A.7, since µ j E[X j X j c j ] > 0 for any c j R, I only need to show that p P [i j] W (p < 0 for every i > j pair. Due to the stationarity of γ under the γ i,j = α δ i j 1 functional form, it suffices to prove p P [i 1] W (p < 0 for every i L. When i =, P [ 1] consists of a single path with weight α < 0. By induction suppose p P [i 1] W (p < 0 for all i S for S L 1. We can exhaustively enumerate p P [S + 1 1] by relating each path in P [S 1] to a pair of paths in P [S + 1 1]. Relate p = ((S, i 1,..., (i M 1, 1 P [S 1] to the pair p = ((S + 1, i 1,..., (i M 1, 1 and p = ((S + 1, S, (S, i 1,..., (i M 1, 1. That is, p modifies the first edge in p from (S, i 1 to 17

(S + 1, i1), while p′′ simply concatenates the extra edge (S + 1, S) in front of p. We have W(p′) = δ·W(p), because the weight of (S, i1) is −αδ^{S−i1−1} while the weight of (S + 1, i1) is −αδ^{S−i1}, and the two paths are otherwise identical. We have W(p′′) = −α·W(p), since the newly concatenated edge has weight −α.

This argument shows Σ_{p ∈ P[S+1 → 1]} W(p) = (δ − α) Σ_{p ∈ P[S → 1]} W(p). Since δ − α > 0 and Σ_{p ∈ P[S → 1]} W(p) < 0 by the inductive hypothesis, we also have Σ_{p ∈ P[S+1 → 1]} W(p) < 0. By induction, we have shown that Σ_{p ∈ P[i → 1]} W(p) < 0 for every i ≤ L.

Next, suppose δ < α. By Proposition A.7,

µ3 = µ3* + (−αδ + α²)(µ1* − E[X1 | X1 ≤ c1]) + (−α)(µ2* − E[X2 | X2 ≤ c2]).

The coefficient in front of µ1* − E[X1 | X1 ≤ c1] comes from the fact that there are two paths from 3 to 1, with weights −γ3,1 = −αδ and (−γ3,2)(−γ2,1) = (−α)(−α) = α². We have −αδ + α² = α(α − δ) > 0 since α > 0 and δ < α. So, fixing c2, as c1 → −∞ we get µ1* − E[X1 | X1 ≤ c1] → ∞ and therefore µ3 → ∞.

OA 1.11 Proof of Proposition A.9

Proof. Write φ(x; a, b) for the Gaussian density with mean a and variance b, evaluated at x. Without loss of generality, suppose h2,n ≠ ∅ for all 1 ≤ n ≤ N1, and h2,n = ∅ for all n > N1. I show that the posterior density over (µ1, µ2) after the dataset (hn)_{n=1}^N only depends on N1, (1/N) Σ_{n=1}^N h1,n, and (1/N1) Σ_{n=1}^{N1} (h2,n + γh1,n). Indeed,

g(µ1, µ2 | (hn)_{n=1}^N) ∝ g(µ1, µ2) Π_{n=1}^{N1} [ φ(h1,n; µ1, σ²) φ(h2,n; µ2 − γ(h1,n − µ1), σ²) ] Π_{n=N1+1}^{N} φ(h1,n; µ1, σ²)
= g(µ1, µ2) [ Π_{n=1}^{N} φ(h1,n; µ1, σ²) ] [ Π_{n=1}^{N1} φ(h2,n; µ2 − γ(h1,n − µ1), σ²) ]
= g(µ1, µ2) [ Π_{n=1}^{N} φ(h1,n; µ1, σ²) ] [ Π_{n=1}^{N1} φ(h2,n + γh1,n; µ2 + γµ1, σ²) ].

It is well known that under the Gaussian likelihood, (h1,n)_{n=1}^N ↦ Π_{n=1}^N φ(h1,n; µ1, σ²) depends on µ1 only through (1/N) Σ_{n=1}^N h1,n, and for the same reason (h2,n + γh1,n)_{n=1}^{N1} ↦ Π_{n=1}^{N1} φ(h2,n + γh1,n; µ2 + γµ1, σ²) depends on (µ1, µ2) only through (1/N1) Σ_{n=1}^{N1} (h2,n + γh1,n). Since the posterior belief g(· | (hn)_{n=1}^N) only depends on N1 and the two statistics Λ1^{(N)}((hn)_{n=1}^N), Λ2^{(N)}((hn)_{n=1}^N) ∈ R, the optimal cutoff rule may be expressed as a function of these two statistics, N1, and the cutoffs c of the predecessors.
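The factorization in the proof of Proposition A.9 — censored histories contribute only through h1, uncensored ones through h2 + γh1 — can be checked numerically. Below, two hypothetical datasets share N, N1, the mean of h1, and the mean of h2 + γh1; their likelihoods then differ only by a factor that does not depend on (µ1, µ2), so they induce identical posteriors under any prior (σ² = 1 and γ = 0.4 assumed):

```python
import math

def phi(x, mu, var=1.0):
    # Gaussian density with mean mu and variance var
    return math.exp(-(x - mu) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

def likelihood(data, mu1, mu2, gamma=0.4):
    # data: list of (h1, h2) pairs, with h2 = None for censored histories.
    # Uses phi(h2; mu2 - gamma*(h1 - mu1)) = phi(h2 + gamma*h1; mu2 + gamma*mu1).
    L = 1.0
    for h1, h2 in data:
        L *= phi(h1, mu1)
        if h2 is not None:
            L *= phi(h2 + gamma * h1, mu2 + gamma * mu1)
    return L

# Hypothetical datasets: same N = 3, N1 = 2, mean of h1 (= 2.0), and mean
# of h2 + 0.4*h1 over uncensored histories (= 2.0), but different raw draws
A = [(1.0, 0.5), (3.0, 1.9), (2.0, None)]
B = [(2.5, 1.2), (1.5, 1.2), (2.0, None)]

r1 = likelihood(A, 0.1, 0.7) / likelihood(B, 0.1, 0.7)
r2 = likelihood(A, -1.3, 2.2) / likelihood(B, -1.3, 2.2)
# the ratio is constant in (mu1, mu2): identical posteriors under any prior
assert abs(r1 - r2) < 1e-12
```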

OA 1.12 Proof of Proposition A.10

Proof. To see the first two equations, let c ∈ R and µ1, µ2 ∈ R, and write Ψ = Ψ(µ1, µ2; γ). We have

E_{h∼H(Ψ;c)}[h2 + γh1 | h2 ≠ ∅]
= E_{h∼H(Ψ;c)}[h2 | h2 ≠ ∅] + γ E_{h∼H(Ψ;c)}[h1 | h2 ≠ ∅]
= E_Ψ[X2 | X1 ≤ c] + γ E_Ψ[X1 | X1 ≤ c]
= E_Ψ[µ2 − γ(X1 − µ1) | X1 ≤ c] + γ E_Ψ[X1 | X1 ≤ c]
= E_Ψ[µ2 + γµ1 | X1 ≤ c]
= µ2 + γµ1.

Since this holds for any c, we must get that on the mixed history distribution, Λ2(H(Ψ(µ1, µ2; γ); c1, ..., cL)) = µ2 + γµ1 as well. It is easy to see that we must have Λ1(H(Ψ(µ1, µ2; γ); c1, ..., cL)) = µ1.

To obtain the final equation, first note that we can rewrite the second statistic under the true distribution of histories H*(c1, ..., cL) as a weighted average,

E_{h∼H*(c1,...,cL)}[h2 + γh1 | h2 ≠ ∅] = Σ_{l=1}^L w_l E_{h∼H*(c_l)}[h2 + γh1 | h2 ≠ ∅].

This is because the event h2 ≠ ∅ happens only when h1 falls below the censoring threshold, so the posterior probability of h being generated from the sub-distribution H*(c_l), given that h2 ≠ ∅, depends on the relative likelihoods of X1 falling under the L different censoring thresholds. For each l,

E_{h∼H*(c_l)}[h2 + γh1 | h1 ≤ c_l] = µ2* + γ E[X1 | X1 ≤ c_l],

where the conditional expectation of h2 given h1 ≤ c_l is simply µ2* by the independence of X2 and X1 under Ψ*. Putting this into the weighted-average expression,

E_{h∼H*(c1,...,cL)}[h2 + γh1 | h2 ≠ ∅] = µ2* + γ Σ_{l=1}^L w_l E[X1 | X1 ≤ c_l].

In order to match the statistics s1 = µ1* and s2 = µ2* + γ Σ_{l=1}^L w_l E[X1 | X1 ≤ c_l] produced by Λ(H*(c1, ..., cL)), we must therefore have µ1 = µ1* and

µ2* + γ Σ_{l=1}^L w_l E[X1 | X1 ≤ c_l] = µ2 + γµ1,

which rearranges to
$$\hat\mu_2 = \mu_2 - \gamma\sum_{l=1}^L w_l\,(\mu_1 - \mathbb{E}[X_1 \mid X_1 \le c_l]) = \mu_2^*(c_1,\dots,c_L).$$

OA 2 Proof of Theorem 1

In this section I prove the almost-sure convergence of beliefs and behavior when biased agents act one at a time and entertain uncertainty over both $\mu_1$ and $\mu_2$. For $\underline{\mu}_1 < \bar\mu_1$, $\underline{\mu}_2 < \bar\mu_2$, let $\Diamond([\underline{\mu}_1,\bar\mu_1],[\underline{\mu}_2,\bar\mu_2])$ refer to the parallelogram in $\mathbb{R}^2$ with the vertices
$$\left(\underline{\mu}_1,\ \underline{\mu}_2 + \tfrac{\gamma}{2}(\bar\mu_1-\underline{\mu}_1)\right),\quad \left(\underline{\mu}_1,\ \bar\mu_2 + \tfrac{\gamma}{2}(\bar\mu_1-\underline{\mu}_1)\right),\quad \left(\bar\mu_1,\ \underline{\mu}_2 - \tfrac{\gamma}{2}(\bar\mu_1-\underline{\mu}_1)\right),\quad \left(\bar\mu_1,\ \bar\mu_2 - \tfrac{\gamma}{2}(\bar\mu_1-\underline{\mu}_1)\right).$$
In other words, $\Diamond([\underline{\mu}_1,\bar\mu_1],[\underline{\mu}_2,\bar\mu_2])$ is the parallelogram constructed by starting with the rectangle $[\underline{\mu}_1,\bar\mu_1]\times[\underline{\mu}_2,\bar\mu_2]$, then replacing the top and bottom edges with lines of slope $-\gamma$ (and adjusting the left and right edges accordingly to connect with the new top and bottom edges).

Consider a sequence of short-lived agents playing the stage game in rounds $t = 1, 2, 3, \dots$ They are uncertain about both $\mu_1$ and $\mu_2$, with a prior density $g(\mu_1,\mu_2)$ supported on the set of feasible fundamentals $M = \Diamond([\underline{\mu}_1,\bar\mu_1],[\underline{\mu}_2,\bar\mu_2])$ as in Remark 1(b). I abbreviate this support as $\Diamond$ when no confusion arises. Each agent $t$ chooses the optimal cutoff $\hat C_t$ maximizing expected payoff based on the posterior belief formed from all past histories. I show the almost-sure convergence of the stochastic processes $(\hat C_t)$ and $(\hat G_t)$ to the unique steady state under the hypotheses of Theorem 1.

OA 2.1 Preliminary Results

First, I consider how the predicted second-period payoff after $X_1 = x_1$ depends on the parameters of the subjective model $\Psi(\hat\mu_1,\hat\mu_2;\gamma)$.
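The pseudo-true parameter $\mu_2^*(c_1,\dots,c_L)$ derived above depends on the censored means $\mathbb{E}[X_1\mid X_1\le c_l]$, which for Gaussian $X_1$ have the closed form $\mu_1-\sigma\,\phi(\alpha)/\Phi(\alpha)$ with $\alpha=(c-\mu_1)/\sigma$ (the inverse Mills ratio). A small sketch (the parameter values, cutoffs, and weights are arbitrary illustrative choices, not from the paper) computes $\mu_2^*$ and confirms the direction of the mislearning, $\mu_2^* < \mu_2$:

```python
import math, random

mu1, mu2, gamma, sigma = 0.0, 1.0, 0.5, 1.0  # illustrative values only

def norm_pdf(z): return math.exp(-z * z / 2) / math.sqrt(2 * math.pi)
def norm_cdf(z): return 0.5 * (1 + math.erf(z / math.sqrt(2)))

def censored_mean(c):
    """E[X1 | X1 <= c] for X1 ~ N(mu1, sigma^2), via the inverse Mills ratio."""
    a = (c - mu1) / sigma
    return mu1 - sigma * norm_pdf(a) / norm_cdf(a)

def mu2_star(cutoffs, weights):
    """mu2 - gamma * sum_l w_l * (mu1 - E[X1 | X1 <= c_l])."""
    return mu2 - gamma * sum(w * (mu1 - censored_mean(c))
                             for c, w in zip(cutoffs, weights))

# Monte Carlo check of the closed-form censored mean.
random.seed(0)
draws = [random.gauss(mu1, sigma) for _ in range(400_000)]
below = [x for x in draws if x <= 0.5]
assert abs(censored_mean(0.5) - sum(below) / len(below)) < 0.01

# Since E[X1 | X1 <= c] < mu1 and gamma > 0, the fitted second-period mean
# is biased downward: the gambler's fallacy plus censoring makes inference
# about mu2 unduly pessimistic.
assert mu2_star([0.0, 1.0], [0.5, 0.5]) < mu2
```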

Lemma OA.2. For every $\hat\mu_1, \hat\mu_1', \hat\mu_2, x_1 \in \mathbb{R}$, the conditional distribution of $X_2$ given $X_1 = x_1$ is the same under $\Psi(\hat\mu_1', \hat\mu_2 + \gamma(\hat\mu_1-\hat\mu_1'); \gamma)$ and $\Psi(\hat\mu_1,\hat\mu_2;\gamma)$. So in particular, $C(\hat\mu_1,\hat\mu_2;\gamma) = C(\hat\mu_1', \hat\mu_2+\gamma(\hat\mu_1-\hat\mu_1');\gamma)$.

Proof. Under the subjective model $\Psi(\hat\mu_1', \hat\mu_2+\gamma(\hat\mu_1-\hat\mu_1');\gamma)$, the conditional distribution of $X_2$ given $X_1 = x_1$ is $\mathcal{N}(\hat\mu_2+\gamma(\hat\mu_1-\hat\mu_1') - \gamma(x_1-\hat\mu_1'),\ \sigma^2)$, which simplifies to $\mathcal{N}(\hat\mu_2-\gamma(x_1-\hat\mu_1),\ \sigma^2)$. It is easy to see that this is also the expression for the same conditional distribution under $\Psi(\hat\mu_1,\hat\mu_2;\gamma)$.

Suppose $C(\hat\mu_1,\hat\mu_2;\gamma) = c$. This implies the indifference condition
$$u_1(c) = \mathbb{E}_{\Psi(\hat\mu_1,\hat\mu_2;\gamma)}[u_2(c, X_2) \mid X_1 = c].$$
But by the equivalence of conditional distributions given above,
$$u_1(c) = \mathbb{E}_{\Psi(\hat\mu_1',\hat\mu_2+\gamma(\hat\mu_1-\hat\mu_1');\gamma)}[u_2(c,X_2)\mid X_1 = c].$$
This means $c$ is also the indifference threshold for the model $\Psi(\hat\mu_1',\hat\mu_2+\gamma(\hat\mu_1-\hat\mu_1');\gamma)$.

As a corollary, this lemma shows the restriction to cutoff strategies is without loss, and that $\hat C_t$ is well defined. That is, for any belief given by a density on $M$, there exists a cutoff strategy that is weakly optimal among the class of all stopping strategies, and further this cutoff strategy is strictly optimal among the class of cutoff strategies. This is because for any $x_1 \in \mathbb{R}$ and any density $g$ on $M$,
$$\int_M \mathbb{E}_{\Psi(\hat\mu_1,\hat\mu_2;\gamma)}[u_2(x_1,X_2)\mid X_1=x_1]\,g(\hat\mu_1,\hat\mu_2)\,d(\hat\mu_1,\hat\mu_2) = \int_{\underline{\mu}_2'}^{\bar\mu_2'} \mathbb{E}_{\Psi(\mu_1,\hat\mu_2;\gamma)}[u_2(x_1,X_2)\mid X_1=x_1]\,g_V(\hat\mu_2)\,d\hat\mu_2,$$
where $\bar\mu_2' := \max\{\hat\mu_2 : (\mu_1,\hat\mu_2)\in\Diamond\}$ and $\underline{\mu}_2' := \min\{\hat\mu_2 : (\mu_1,\hat\mu_2)\in\Diamond\}$, and $g_V(\hat\mu_2)$ is the integral of $g$ over the line in $\Diamond$ with slope $-\gamma$ that passes through $(\mu_1,\hat\mu_2)$. This equality holds because, by Lemma OA.2, all fundamentals on that line imply the same continuation payoff after $X_1 = x_1$ as the fundamentals $(\mu_1,\hat\mu_2)$. The proof of Lemma A.5 shows that
$$x_1 \mapsto u_1(x_1) - \int_{\underline{\mu}_2'}^{\bar\mu_2'} \mathbb{E}_{\Psi(\mu_1,\hat\mu_2;\gamma)}[u_2(x_1,X_2)\mid X_1=x_1]\,g_V(\hat\mu_2)\,d\hat\mu_2$$
is a strictly increasing, continuous function that crosses 0.

Now, the key step is to separate the two-dimensional inference problem into a pair of one-dimensional problems.
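The algebra behind Lemma OA.2 can be sanity-checked numerically. The sketch below (arbitrary illustrative values; `cond_mean` is a hypothetical helper, not the paper's notation) verifies that shifting $(\hat\mu_1,\hat\mu_2)$ along the line of slope $-\gamma$ leaves the subjective conditional mean of $X_2$ given $X_1 = x_1$ unchanged:

```python
import random

def cond_mean(m1, m2, gamma, x1):
    """Conditional mean of X2 given X1 = x1 under Psi(m1, m2; gamma)."""
    return m2 - gamma * (x1 - m1)

random.seed(1)
for _ in range(1000):
    mu1 = random.uniform(-2, 2)
    mu2 = random.uniform(-2, 2)
    gamma = random.uniform(0.1, 0.9)
    mu1p = random.uniform(-2, 2)   # any other first-period mean
    x1 = random.uniform(-5, 5)
    # (mu1p, mu2 + gamma*(mu1 - mu1p)) lies on the slope -gamma line
    # through (mu1, mu2); both models predict the same continuation.
    assert abs(cond_mean(mu1, mu2, gamma, x1)
               - cond_mean(mu1p, mu2 + gamma * (mu1 - mu1p), gamma, x1)) < 1e-9
```

Since the conditional variance is $\sigma^2$ in both cases, equality of the conditional means gives equality of the whole conditional distributions.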

OA 2.2 Learning $\mu_1$

I define the stochastic process of data log-likelihoods (for a given fundamental). For each $(\hat\mu_1,\hat\mu_2)\in\mathrm{supp}(g)$, let $\ell_t(\hat\mu_1,\hat\mu_2)(\omega)$ be the log-likelihood that the fundamentals are $(\hat\mu_1,\hat\mu_2)$ and the histories $(\bar H_s)_{s\le t}(\omega)$ are generated by the end of round $t$. It is given by
$$\ell_t(\hat\mu_1,\hat\mu_2)(\omega) := \ln(g(\hat\mu_1,\hat\mu_2)) + \sum_{s=1}^t \ln(\mathrm{lik}(\bar H_s(\omega);\hat\mu_1,\hat\mu_2)),$$
where $\mathrm{lik}(x_1,\varnothing;\hat\mu_1,\hat\mu_2) := \phi(x_1;\hat\mu_1,\sigma^2)$ and $\mathrm{lik}(x_1,x_2;\hat\mu_1,\hat\mu_2) := \phi(x_1;\hat\mu_1,\sigma^2)\cdot\phi(x_2;\hat\mu_2-\gamma(x_1-\hat\mu_1),\sigma^2)$. By simple algebra, we may expand
$$\ell_t(\hat\mu_1,\hat\mu_2)(\omega) = \ln(g(\hat\mu_1,\hat\mu_2)) + \sum_{s=1}^t\left[\ln((2\pi\sigma^2)^{-1/2}) - \frac{(X_{1,s}(\omega)-\hat\mu_1)^2}{2\sigma^2}\right] + \sum_{s=1}^t \mathbf{1}\{X_{1,s}(\omega)\le \hat C_s(\omega)\}\left[\ln((2\pi\sigma^2)^{-1/2}) - \frac{(X_{2,s}(\omega)-\hat\mu_2+\gamma(X_{1,s}(\omega)-\hat\mu_1))^2}{2\sigma^2}\right].$$
I first establish that, without knowing anything about the process $(\hat C_t)$, we can conclude agents learn $\mu_1$ arbitrarily well.

Lemma OA.3. For every $\epsilon > 0$, almost surely $\lim_{t\to\infty} \hat G_t\left(\Diamond \cap ([\mu_1-\epsilon,\mu_1+\epsilon]\times\mathbb{R})\right) = 1$.

Proof. I first calculate the directional derivative $\nabla_v \frac{1}{t}\ell_t(\hat\mu_1,\hat\mu_2)$, where $v = \left(\frac{1}{\sqrt{1+\gamma^2}},\ \frac{-\gamma}{\sqrt{1+\gamma^2}}\right)$ is the unit vector with slope $-\gamma$. We have
$$\frac{\partial(\ell_t/t)}{\partial\hat\mu_1}(\hat\mu_1,\hat\mu_2) = \frac{1}{t}\frac{D_1 g(\hat\mu_1,\hat\mu_2)}{g(\hat\mu_1,\hat\mu_2)} + \frac{1}{\sigma^2}\frac{1}{t}\sum_{s=1}^t (X_{1,s}-\hat\mu_1) + \frac{\gamma}{\sigma^2}\frac{1}{t}\sum_{s=1}^t \mathbf{1}\{X_{1,s}\le\hat C_s\}\,(X_{2,s}-\hat\mu_2+\gamma(X_{1,s}-\hat\mu_1)),$$
$$\frac{\partial(\ell_t/t)}{\partial\hat\mu_2}(\hat\mu_1,\hat\mu_2) = \frac{1}{t}\frac{D_2 g(\hat\mu_1,\hat\mu_2)}{g(\hat\mu_1,\hat\mu_2)} + \frac{1}{\sigma^2}\frac{1}{t}\sum_{s=1}^t \mathbf{1}\{X_{1,s}\le\hat C_s\}\,(X_{2,s}-\hat\mu_2+\gamma(X_{1,s}-\hat\mu_1)),$$
where $D_1 g$ and $D_2 g$ are the two partial derivatives of $g$. At every $\omega$ and every $(\hat\mu_1,\hat\mu_2)$, note the last summand in $\frac{\partial(\ell_t/t)}{\partial\hat\mu_1}$ is $\gamma$ times the last summand in $\frac{\partial(\ell_t/t)}{\partial\hat\mu_2}$. Therefore,
$$\nabla_v \frac{1}{t}\ell_t(\hat\mu_1,\hat\mu_2) = \frac{1}{\sqrt{1+\gamma^2}}\,\frac{1}{\sigma^2}\,\frac{1}{t}\sum_{s=1}^t (X_{1,s}-\hat\mu_1) + \frac{1}{\sqrt{1+\gamma^2}}\,\frac{1}{t}\,\frac{D_1 g(\hat\mu_1,\hat\mu_2)}{g(\hat\mu_1,\hat\mu_2)} - \frac{\gamma}{\sqrt{1+\gamma^2}}\,\frac{1}{t}\,\frac{D_2 g(\hat\mu_1,\hat\mu_2)}{g(\hat\mu_1,\hat\mu_2)}.$$

Since $g$, $D_1 g$, $D_2 g$ are continuous on the compact set $\Diamond$, there exists some $0 < B < \infty$ so that $\left|\frac{D_1 g(\hat\mu_1,\hat\mu_2)}{g(\hat\mu_1,\hat\mu_2)}\right| < B$ and $\left|\frac{D_2 g(\hat\mu_1,\hat\mu_2)}{g(\hat\mu_1,\hat\mu_2)}\right| < B$ for all $(\hat\mu_1,\hat\mu_2)\in\Diamond$. This means for every $\omega$,
$$\inf_{(\hat\mu_1,\hat\mu_2)\in\Diamond_L} \nabla_v \frac{1}{t}\ell_t(\hat\mu_1,\hat\mu_2)\ \ge\ \frac{1}{\sqrt{1+\gamma^2}}\,\frac{1}{\sigma^2}\left[\frac{1}{t}\sum_{s=1}^t X_{1,s} - (\mu_1-\epsilon)\right] - \frac{1}{t}\,\frac{1+\gamma}{\sqrt{1+\gamma^2}}\,B,$$
where $\Diamond_L := \Diamond\cap([\underline{\mu}_1,\ \mu_1-\epsilon]\times\mathbb{R})$ is the sub-parallelogram to the left of $\mu_1-\epsilon$. By the law of large numbers applied to the i.i.d. sequence $(X_{1,s})$, almost surely $\frac{1}{t}\sum_{s=1}^t X_{1,s}\to\mu_1$, therefore almost surely
$$\liminf_{t\to\infty}\ \inf_{(\hat\mu_1,\hat\mu_2)\in\Diamond_L} \nabla_v \frac{1}{t}\ell_t(\hat\mu_1,\hat\mu_2)\ \ge\ \frac{\epsilon}{\sigma^2\sqrt{1+\gamma^2}}.$$
We may divide $\Diamond_L$ further into two halves:
$$\Diamond_{L,1} := \Diamond\cap\left(\left[\underline{\mu}_1,\ \underline{\mu}_1+d/2\right]\times\mathbb{R}\right), \qquad \Diamond_{L,2} := \Diamond\cap\left(\left[\underline{\mu}_1+d/2,\ \mu_1-\epsilon\right]\times\mathbb{R}\right),$$
where $d := \mu_1-\epsilon-\underline{\mu}_1$. I will show that $\lim_{t\to\infty}\hat G_t(\Diamond_{L,1}) = 0$ almost surely. The idea is that we can map every point in $\Diamond_{L,1}$ to another point in $\Diamond_{L,2}$ in the direction of $v$. For every point, its image under the map will have much higher posterior probability, since we have a uniform, strictly positive lower bound on the directional derivative of the log-likelihood $\ell_t$ in the direction of $v$. Almost surely,
$$\begin{aligned}
\hat G_t(\Diamond_{L,1}) &= \int_{\Diamond_{L,1}} \hat g_t(\hat\mu_1,\hat\mu_2)\,d\hat\mu \\
&= \int_{\Diamond_{L,2}} \frac{\hat g_t(\hat\mu_1-d/2,\ \hat\mu_2+\gamma d/2)}{\hat g_t(\hat\mu_1,\hat\mu_2)}\,\hat g_t(\hat\mu_1,\hat\mu_2)\,d\hat\mu \\
&= \int_{\Diamond_{L,2}} \hat g_t(\hat\mu_1,\hat\mu_2)\exp\left(\ell_t(\hat\mu_1-d/2,\ \hat\mu_2+\gamma d/2) - \ell_t(\hat\mu_1,\hat\mu_2)\right)d\hat\mu \\
&= \int_{\Diamond_{L,2}} \hat g_t(\hat\mu_1,\hat\mu_2)\exp\left(-\int_0^{d/2}\nabla_v\ell_t\left(\hat\mu_1-d/2+z,\ \hat\mu_2+\gamma(d/2-z)\right)dz\right)d\hat\mu.
\end{aligned}$$
By the lower bound above,
$$\liminf_{t\to\infty}\ \inf_{(\hat\mu_1,\hat\mu_2)\in\Diamond_{L,2},\,z\in[0,d/2]} \frac{1}{t}\nabla_v\ell_t\left(\hat\mu_1-d/2+z,\ \hat\mu_2+\gamma(d/2-z)\right)\ \ge\ \frac{\epsilon}{\sigma^2\sqrt{1+\gamma^2}},$$
so almost surely
$$\limsup_{t\to\infty}\ \hat G_t(\Diamond_{L,1})\ \le\ \limsup_{t\to\infty}\ \int_{\Diamond_{L,2}} \hat g_t(\hat\mu_1,\hat\mu_2)\exp\left(-\frac{dt\epsilon}{2\sigma^2\sqrt{1+\gamma^2}}\right)d\hat\mu.$$

But for every $\omega$ and $t$, the RHS is bounded by $\exp\left(-\frac{dt\epsilon}{2\sigma^2\sqrt{1+\gamma^2}}\right)$, which tends to 0 as $t\to\infty$. So in fact $\hat G_t(\Diamond_{L,1})\to 0$ almost surely. Now by dividing $\Diamond_{L,2}$ into two equal halves and iterating this argument, we eventually show $\lim_{t\to\infty}\hat G_t(\Diamond\cap([\mu_1-\epsilon,\infty)\times\mathbb{R})) = 1$. A symmetric argument also shows $\lim_{t\to\infty}\hat G_t(\Diamond\cap((-\infty,\mu_1+\epsilon]\times\mathbb{R})) = 1$.

OA 2.3 Decomposing the Partial Derivative of the Log-Likelihood With Respect to $\mu_2$

I record a decomposition of $\frac{\partial\ell_t}{\partial\hat\mu_2}(\hat\mu_1,\hat\mu_2)$, the partial derivative of the log-likelihood process with respect to its second argument. Define two stochastic processes:
$$\varphi_s(\hat\mu_1,\hat\mu_2) := \sigma^{-2}\,(X_{2,s}-\hat\mu_2+\gamma(X_{1,s}-\hat\mu_1))\cdot\mathbf{1}\{X_{1,s}\le\hat C_s\},$$
$$\bar\varphi_s(\hat\mu_1,\hat\mu_2) := \sigma^{-2}\,\mathbb{P}[X_1\le\hat C_s]\cdot\left((\mu_2-\hat\mu_2) - \gamma(\hat\mu_1-\mathbb{E}[X_1\mid X_1\le\hat C_s])\right),$$
where, with a slight abuse of notation, $\mathbb{P}[X_1\le x]$ means the probability that each first-period draw falls below $x$, and $\mathbb{E}[X_1\mid X_1\le x]$ the conditional expectation of the first draw given that it falls below $x$. Note that $\bar\varphi_s(\hat\mu_1,\hat\mu_2)$ is measurable with respect to $\mathcal{F}_{s-1}$, since $(\hat C_t)$ is a predictable process. Write $\xi_s(\hat\mu_1,\hat\mu_2) := \varphi_s(\hat\mu_1,\hat\mu_2) - \bar\varphi_s(\hat\mu_1,\hat\mu_2)$ and $y_t(\hat\mu_1,\hat\mu_2) := \sum_{s=1}^t \xi_s(\hat\mu_1,\hat\mu_2)$. Write $z_t(\hat\mu_1,\hat\mu_2) := \sum_{s=1}^t \bar\varphi_s(\hat\mu_1,\hat\mu_2)$.

Lemma OA.4. $\frac{\partial\ell_t}{\partial\hat\mu_2}(\hat\mu_1,\hat\mu_2) = \frac{D_2 g(\hat\mu_1,\hat\mu_2)}{g(\hat\mu_1,\hat\mu_2)} + y_t(\hat\mu_1,\hat\mu_2) + z_t(\hat\mu_1,\hat\mu_2)$.

Proof. This comes from expanding $\ell_t(\hat\mu_1,\hat\mu_2)$ and taking its derivative as in the proof of Lemma OA.3.

Now I derive two results about the $\xi_t(\hat\mu_1,\hat\mu_2)$ processes for different pairs $(\hat\mu_1,\hat\mu_2)$.

Lemma OA.5. There exists $\kappa_\xi < \infty$ so that for every $(\hat\mu_1,\hat\mu_2)\in\Diamond$ and for every $t\ge 1$, $\omega\in\Omega$, $\mathbb{E}[\xi_t^2(\hat\mu_1,\hat\mu_2)\mid\mathcal{F}_{t-1}](\omega)\le\kappa_\xi$.

Proof. Note that $\bar\varphi_t(\hat\mu_1,\hat\mu_2)$ is measurable with respect to $\mathcal{F}_{t-1}$. Also, the conditional distribution of $\varphi_t(\hat\mu_1,\hat\mu_2)$ given $\mathcal{F}_{t-1}$ depends only on $\hat C_t$, because by independence of $X_t$ from $(X_s)_{s=1}^{t-1}$, the only information that $\mathcal{F}_{t-1}$ contains about $\varphi_t(\hat\mu_1,\hat\mu_2)$ is in determining the cutoff threshold $\hat C_t$. At a sample path $\omega$ so that $\hat C_t(\omega) = c\in\mathbb{R}$,
$$\begin{aligned}
\mathbb{E}[\varphi_t(\hat\mu_1,\hat\mu_2)\mid\mathcal{F}_{t-1}](\omega) &= \mathbb{E}\left[\sigma^{-2}(X_2-\hat\mu_2+\gamma(X_1-\hat\mu_1))\,\mathbf{1}\{X_1\le c\}\right] \\
&= \sigma^{-2}\left(\mathbb{E}[(X_2-\hat\mu_2)\mathbf{1}\{X_1\le c\}] + \mathbb{E}[\gamma(X_1-\hat\mu_1)\mathbf{1}\{X_1\le c\}]\right) \\
&= \sigma^{-2}\,\mathbb{P}[X_1\le c]\left((\mu_2-\hat\mu_2) - \gamma(\hat\mu_1-\mathbb{E}[X_1\mid X_1\le c])\right),
\end{aligned}$$

where we used the fact that $X_{1,t}$ and $X_{2,t}$ are independent. This shows that $\mathbb{E}[\varphi_t(\hat\mu_1,\hat\mu_2)\mid\mathcal{F}_{t-1}](\omega) = \bar\varphi_t(\hat\mu_1,\hat\mu_2)(\omega)$. Since this holds regardless of $c$, we get that $\mathbb{E}[\varphi_t(\hat\mu_1,\hat\mu_2)\mid\mathcal{F}_{t-1}] = \bar\varphi_t(\hat\mu_1,\hat\mu_2)$ for all $\omega$, that is to say
$$\begin{aligned}
\mathbb{E}[\xi_t^2(\hat\mu_1,\hat\mu_2)\mid\mathcal{F}_{t-1}] &= \mathrm{Var}[\varphi_t(\hat\mu_1,\hat\mu_2)\mid\mathcal{F}_{t-1}] \\
&\le \mathbb{E}[\varphi_t^2(\hat\mu_1,\hat\mu_2)\mid\mathcal{F}_{t-1}] \\
&= \sigma^{-4}\,\mathbb{E}\left[(X_{2,t}-\hat\mu_2+\gamma(X_{1,t}-\hat\mu_1))^2\,\mathbf{1}\{X_{1,t}\le\hat C_t\}\right] \\
&\le \sigma^{-4}\,\mathbb{E}\left[(X_2-\hat\mu_2+\gamma(X_1-\hat\mu_1))^2\right].
\end{aligned}$$
The RHS of the final line is independent of $\omega$ and $t$, while $(\hat\mu_1,\hat\mu_2)\mapsto\mathbb{E}[(X_2-\hat\mu_2+\gamma(X_1-\hat\mu_1))^2]$ is a continuous function on $\Diamond$. Therefore it is bounded uniformly by some $\kappa_\xi<\infty$, which also provides a bound for $\mathbb{E}[\xi_t^2(\hat\mu_1,\hat\mu_2)\mid\mathcal{F}_{t-1}](\omega)$ for every $t$, $\omega$ and $(\hat\mu_1,\hat\mu_2)$.

OA 2.4 Heidhues, Koszegi, and Strack (2018)'s Law of Large Numbers

I use a statistical result from Heidhues, Koszegi, and Strack (2018) to show that the $y_t/t$ term in the decomposition of $\frac{1}{t}\frac{\partial\ell_t}{\partial\hat\mu_2}$ almost surely converges to 0 in the long run, and furthermore this convergence is uniform on $\Diamond$. This lets me focus on terms of the form $\bar\varphi_s(\hat\mu_1,\hat\mu_2)$, which can be interpreted as the expected contribution to the log-likelihood derivative from round $s$ data. This lends tractability to the problem, as $\bar\varphi_s(\hat\mu_1,\hat\mu_2)$ only depends on $\hat C_s$, but not on $X_{1,s}$ or $X_{2,s}$.

Lemma OA.6. For every $(\hat\mu_1,\hat\mu_2)\in\Diamond$, $\lim_{t\to\infty}\frac{y_t(\hat\mu_1,\hat\mu_2)}{t} = 0$ almost surely.

Proof. Heidhues, Koszegi, and Strack (2018)'s Proposition 10 shows that if $(y_t)$ is a martingale such that there exists some constant $v\ge 0$ satisfying $[y]_t\le vt$ almost surely, where $[y]_t$ is the quadratic variation of $(y_t)$, then almost surely $\lim_{t\to\infty}\frac{y_t}{t} = 0$.

Consider the process $y_t(\hat\mu_1,\hat\mu_2)$ for a fixed $(\hat\mu_1,\hat\mu_2)$. By definition, $y_t = \sum_{s=1}^t \left(\varphi_s(\hat\mu_1,\hat\mu_2) - \bar\varphi_s(\hat\mu_1,\hat\mu_2)\right)$. As established in the proof of Lemma OA.5, for every $s$, $\bar\varphi_s(\hat\mu_1,\hat\mu_2) = \mathbb{E}[\varphi_s(\hat\mu_1,\hat\mu_2)\mid\mathcal{F}_{s-1}]$.

So for $t' < t$,
$$\begin{aligned}
\mathbb{E}[y_t(\hat\mu_1,\hat\mu_2)\mid\mathcal{F}_{t'}] &= \sum_{s=1}^{t'}\left(\varphi_s(\hat\mu_1,\hat\mu_2)-\bar\varphi_s(\hat\mu_1,\hat\mu_2)\right) + \mathbb{E}\left[\sum_{s=t'+1}^{t}\left(\varphi_s(\hat\mu_1,\hat\mu_2)-\bar\varphi_s(\hat\mu_1,\hat\mu_2)\right)\,\middle|\,\mathcal{F}_{t'}\right] \\
&= \sum_{s=1}^{t'}\left(\varphi_s(\hat\mu_1,\hat\mu_2)-\bar\varphi_s(\hat\mu_1,\hat\mu_2)\right) + \sum_{s=t'+1}^{t}\mathbb{E}\left[\mathbb{E}[\varphi_s(\hat\mu_1,\hat\mu_2)-\bar\varphi_s(\hat\mu_1,\hat\mu_2)\mid\mathcal{F}_{s-1}]\,\middle|\,\mathcal{F}_{t'}\right] \\
&= \sum_{s=1}^{t'}\left(\varphi_s(\hat\mu_1,\hat\mu_2)-\bar\varphi_s(\hat\mu_1,\hat\mu_2)\right) + 0 = y_{t'}(\hat\mu_1,\hat\mu_2).
\end{aligned}$$
This shows $(y_t(\hat\mu_1,\hat\mu_2))_t$ is a martingale. Also,
$$[y(\hat\mu_1,\hat\mu_2)]_t = \sum_{s=1}^{t}\mathbb{E}[(y_s(\hat\mu_1,\hat\mu_2)-y_{s-1}(\hat\mu_1,\hat\mu_2))^2\mid\mathcal{F}_{s-1}] = \sum_{s=1}^{t}\mathbb{E}[\xi_s^2(\hat\mu_1,\hat\mu_2)\mid\mathcal{F}_{s-1}] \le \kappa_\xi\, t$$
by Lemma OA.5. Therefore Heidhues, Koszegi, and Strack (2018)'s Proposition 10 applies.

Lemma OA.7. $\lim_{t\to\infty}\sup_{(\hat\mu_1,\hat\mu_2)\in\Diamond}\left|\frac{y_t(\hat\mu_1,\hat\mu_2)}{t}\right| = 0$ almost surely.

Proof. This argument is similar to Lemma 11 in Heidhues, Koszegi, and Strack (2018). I apply a lemma of Andrews (1992), which says that to prove this result I just need to check conditions BD, P-SLLN, and S-LIP from Andrews (1992). BD holds because $\Diamond$ is a bounded subset of $\mathbb{R}^2$. P-SLLN holds by Lemma OA.6, which shows for all $(\hat\mu_1,\hat\mu_2)\in\Diamond$, $\lim_{t\to\infty}\frac{y_t(\hat\mu_1,\hat\mu_2)}{t} = 0$ almost surely.

Condition S-LIP is essentially a Lipschitz continuity condition. It requires finding a sequence of random variables $B_t$ such that $|\xi_t(\hat\mu_1,\hat\mu_2)-\xi_t(\hat\mu_1',\hat\mu_2')|\le B_t\,(|\hat\mu_1-\hat\mu_1'|+|\hat\mu_2-\hat\mu_2'|)$ almost surely, and such that these random variables satisfy $\sup_t \frac{1}{t}\sum_{s=1}^t \mathbb{E}[B_s]<\infty$ and $\lim_{t\to\infty}\frac{1}{t}\sum_{s=1}^t (B_s-\mathbb{E}[B_s]) = 0$ almost surely. But for every $\omega$,
$$\begin{aligned}
|\xi_t(\hat\mu_1,\hat\mu_2)-\xi_t(\hat\mu_1',\hat\mu_2')| &\le \mathbf{1}\{X_{1,t}\le\hat C_t\}\,\sigma^{-2}\left(|\hat\mu_2-\hat\mu_2'|+\gamma|\hat\mu_1-\hat\mu_1'|\right) + \sigma^{-2}\,\mathbb{P}[X_1\le\hat C_t]\left(|\hat\mu_2-\hat\mu_2'|+\gamma|\hat\mu_1-\hat\mu_1'|\right) \\
&\le 2\sigma^{-2}\left(|\hat\mu_2-\hat\mu_2'|+\gamma|\hat\mu_1-\hat\mu_1'|\right).
\end{aligned}$$
Setting $B_s$ as the constant $2\sigma^{-2}(1+\gamma)$ for every $s$ satisfies S-LIP.
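The martingale law of large numbers used in Lemmas OA.6 and OA.7 is easy to see in simulation. The toy sketch below (a stand-in for $\xi_s = \varphi_s - \bar\varphi_s$, not the paper's process) builds a martingale whose increments are censored Gaussian scores with a predictable, time-varying cutoff, centered by their $\mathcal{F}_{s-1}$-conditional means; because the conditional variances are bounded, $y_t/t$ shrinks toward 0:

```python
import math, random

def norm_pdf(z):
    return math.exp(-z * z / 2) / math.sqrt(2 * math.pi)

def simulate(T, seed=0):
    """y_t = sum of xi_s, where xi_s = phi_s - E[phi_s | F_{s-1}]:
    phi_s = X_s * 1{X_s <= c_s} with X_s ~ N(0, 1) and a predictable
    cutoff c_s; E[X * 1{X <= c}] = -norm_pdf(c) for standard normal X."""
    rng = random.Random(seed)
    y, c = 0.0, 0.5
    for s in range(1, T + 1):
        x = rng.gauss(0.0, 1.0)
        phi = x if x <= c else 0.0
        y += phi - (-norm_pdf(c))     # center by the conditional mean
        c = 0.5 if s % 2 else -0.5    # predictable, time-varying cutoff
    return y / T

# bounded conditional variances imply y_t / t -> 0 almost surely
# (cf. Heidhues, Koszegi, and Strack (2018), Proposition 10)
assert abs(simulate(200_000)) < 0.02
```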

OA 2.5 Bounds on Asymptotic Beliefs and Asymptotic Cutoffs

Recall that Lemma OA.2 implies that if we draw the line with slope $-\gamma$ through the point $(\hat\mu_1,\hat\mu_2)$, all pairs of fundamentals on this line have the same optimal cutoff threshold. Then against any feasible model $\Psi(\hat\mu_1,\hat\mu_2;\gamma)$ with $(\hat\mu_1,\hat\mu_2)\in\Diamond$, the best cutoff strategy is between $\underline{c} := C(\mu_1,\underline{\mu}_2;\gamma)$ and $\bar c := C(\mu_1,\bar\mu_2;\gamma)$. For $\mu_2^l\le\mu_2^h$ in the interval $[\underline{\mu}_2,\bar\mu_2]$, let $\Diamond_{[\mu_2^l,\mu_2^h]}$ be constructed from $\Diamond$ by translating its bottom and top edges towards the center, so that they pass through $(\mu_1,\mu_2^l)$ and $(\mu_1,\mu_2^h)$, respectively.

Lemma OA.8. For $c\ge\underline{c}$, if $\liminf_{t\to\infty}\hat C_t\ge c$ almost surely, then $\lim_{t\to\infty}\hat G_t\left(\Diamond_{[\underline{\mu}_2,\ \mu_2^*(c))}\right) = 0$ almost surely. Also, for $\tilde c\le\bar c$, if $\limsup_{t\to\infty}\hat C_t\le\tilde c$ almost surely, then $\lim_{t\to\infty}\hat G_t\left(\Diamond_{(\mu_2^*(\tilde c),\ \bar\mu_2]}\right) = 0$ almost surely.

Proof. I first show that for all $\epsilon>0$, there exists $\delta>0$ such that almost surely,
$$\liminf_{t\to\infty}\ \inf_{(\hat\mu_1,\hat\mu_2)\in\Diamond_{[\underline{\mu}_2,\ \mu_2^*(c)-\epsilon]}}\ \frac{1}{t}\frac{\partial\ell_t}{\partial\hat\mu_2}(\hat\mu_1,\hat\mu_2)\ \ge\ \delta.$$
From Lemma OA.4, we may rewrite the LHS as
$$\liminf_{t\to\infty}\ \inf_{(\hat\mu_1,\hat\mu_2)\in\Diamond_{[\underline{\mu}_2,\ \mu_2^*(c)-\epsilon]}}\ \left[\frac{1}{t}\frac{D_2 g(\hat\mu_1,\hat\mu_2)}{g(\hat\mu_1,\hat\mu_2)} + \frac{y_t(\hat\mu_1,\hat\mu_2)}{t} + \frac{z_t(\hat\mu_1,\hat\mu_2)}{t}\right],$$
which is no smaller than taking the inf separately across the three terms in the bracket,
$$\liminf_{t\to\infty}\inf_{(\hat\mu_1,\hat\mu_2)\in\Diamond_{[\underline{\mu}_2,\mu_2^*(c)-\epsilon]}}\frac{1}{t}\frac{D_2 g(\hat\mu_1,\hat\mu_2)}{g(\hat\mu_1,\hat\mu_2)}\ +\ \liminf_{t\to\infty}\inf_{(\hat\mu_1,\hat\mu_2)\in\Diamond_{[\underline{\mu}_2,\mu_2^*(c)-\epsilon]}}\frac{y_t(\hat\mu_1,\hat\mu_2)}{t}\ +\ \liminf_{t\to\infty}\inf_{(\hat\mu_1,\hat\mu_2)\in\Diamond_{[\underline{\mu}_2,\mu_2^*(c)-\epsilon]}}\frac{z_t(\hat\mu_1,\hat\mu_2)}{t}.$$
Since $D_2 g/g$ is bounded on $\Diamond$ (as $D_2 g$ is continuous and $g$ is continuous and strictly positive on the compact set $\Diamond$), the first term is 0 for every $\omega$. To deal with the second term,
$$\liminf_{t\to\infty}\inf_{(\hat\mu_1,\hat\mu_2)\in\Diamond_{[\underline{\mu}_2,\mu_2^*(c)-\epsilon]}}\frac{y_t(\hat\mu_1,\hat\mu_2)}{t}\ \ge\ \liminf_{t\to\infty}\inf_{(\hat\mu_1,\hat\mu_2)\in\Diamond}\frac{y_t(\hat\mu_1,\hat\mu_2)}{t}\ \ge\ \liminf_{t\to\infty}\left\{-\sup_{(\hat\mu_1,\hat\mu_2)\in\Diamond}\left|\frac{y_t(\hat\mu_1,\hat\mu_2)}{t}\right|\right\}.$$
Lemma OA.7 gives $\lim_{t\to\infty}\sup_{(\hat\mu_1,\hat\mu_2)\in\Diamond}\left|\frac{y_t(\hat\mu_1,\hat\mu_2)}{t}\right| = 0$ almost surely. Hence, we conclude
$$\liminf_{t\to\infty}\inf_{(\hat\mu_1,\hat\mu_2)\in\Diamond_{[\underline{\mu}_2,\mu_2^*(c)-\epsilon]}}\frac{y_t(\hat\mu_1,\hat\mu_2)}{t}\ \ge\ 0$$
almost surely.

It suffices then to find $\delta>0$ and show $\liminf_{t\to\infty}\inf_{(\hat\mu_1,\hat\mu_2)\in\Diamond_{[\underline{\mu}_2,\mu_2^*(c)-\epsilon]}}\frac{z_t(\hat\mu_1,\hat\mu_2)}{t}\ge\delta$ almost surely. Since $z_t$ is the sum of the $\bar\varphi_s$ terms, which are decreasing functions of $\hat\mu_2+\gamma\hat\mu_1$, the inner inf is always achieved at $(\hat\mu_1,\hat\mu_2) = (\mu_1,\mu_2^*(c)-\epsilon)$. So we get
$$\liminf_{t\to\infty}\inf_{(\hat\mu_1,\hat\mu_2)\in\Diamond_{[\underline{\mu}_2,\mu_2^*(c)-\epsilon]}}\frac{z_t(\hat\mu_1,\hat\mu_2)}{t} = \liminf_{t\to\infty}\frac{z_t(\mu_1,\mu_2^*(c)-\epsilon)}{t} = \liminf_{t\to\infty}\left[\frac{1}{t}\sum_{s=1}^t\bar\varphi_s(\mu_1,\mu_2^*(c)-\epsilon)\right].$$
The definition of $\mu_2^*(c)$ is such that
$$\mu_2 - \mu_2^*(c) - \gamma\left(\mu_1-\mathbb{E}[X_1\mid X_1\le c]\right) = 0.$$
So for any $\tilde c\ge c$, since $\gamma>0$,
$$\mu_2-\mu_2^*(c)-\gamma\left(\mu_1-\mathbb{E}[X_1\mid X_1\le\tilde c]\right)\ \ge\ 0,\qquad \mu_2-(\mu_2^*(c)-\epsilon)-\gamma\left(\mu_1-\mathbb{E}[X_1\mid X_1\le\tilde c]\right)\ \ge\ \epsilon.$$
So at any $\tilde c\ge c$,
$$\sigma^{-2}\,\mathbb{P}[X_1\le\tilde c]\left(\mu_2-(\mu_2^*(c)-\epsilon)-\gamma\left(\mu_1-\mathbb{E}[X_1\mid X_1\le\tilde c]\right)\right)\ \ge\ \sigma^{-2}\,\mathbb{P}[X_1\le c]\,\epsilon.$$
Along any $\omega$ where $\liminf_{t\to\infty}\hat C_t\ge c$, we therefore have $\liminf_{s\to\infty}\bar\varphi_s(\mu_1,\mu_2^*(c)-\epsilon)\ge\sigma^{-2}\,\mathbb{P}[X_1\le c]\,\epsilon$. Put $\delta = \sigma^{-2}\,\mathbb{P}[X_1\le c]\,\epsilon$. This shows almost surely,
$$\liminf_{t\to\infty}\left[\frac{1}{t}\sum_{s=1}^t\bar\varphi_s(\mu_1,\mu_2^*(c)-\epsilon)\right]\ \ge\ \delta.$$
From here, the same argument as in the proof of Lemma OA.3 shows $\lim_{t\to\infty}\hat G_t\left(\Diamond_{[\underline{\mu}_2,\ \mu_2^*(c)-\epsilon]}\right) = 0$ almost surely. Since the choice of $\epsilon>0$ is arbitrary, this establishes the first part of the lemma.

The proof of the second part of the statement is exactly symmetric. To sketch the argument, we need to show that for all $\epsilon>0$, there exists $\delta>0$ such that almost surely,
$$\limsup_{t\to\infty}\ \sup_{(\hat\mu_1,\hat\mu_2)\in\Diamond_{[\mu_2^*(\tilde c)+\epsilon,\ \bar\mu_2]}}\ \frac{1}{t}\frac{\partial\ell_t}{\partial\hat\mu_2}(\hat\mu_1,\hat\mu_2)\ \le\ -\delta.$$

This essentially reduces to analyzing
$$\limsup_{t\to\infty}\left[\frac{1}{t}\sum_{s=1}^t\bar\varphi_s(\mu_1,\mu_2^*(\tilde c)+\epsilon)\right].$$
For any $c'\le\tilde c$, since $\gamma>0$,
$$\mu_2-\mu_2^*(\tilde c)-\gamma\left(\mu_1-\mathbb{E}[X_1\mid X_1\le c']\right)\ \le\ 0,\qquad \mu_2-(\mu_2^*(\tilde c)+\epsilon)-\gamma\left(\mu_1-\mathbb{E}[X_1\mid X_1\le c']\right)\ \le\ -\epsilon.$$
For every $t$ and along every $\omega$, $\hat C_t(\omega)\ge\underline{c}$, as cutoffs below this value cannot be myopically optimal given any belief about the second-period fundamental supported on $\Diamond$. So along any $\omega$ such that $\limsup_{t\to\infty}\hat C_t(\omega)\le\tilde c$, we have $\limsup_{s\to\infty}\bar\varphi_s(\mu_1,\mu_2^*(\tilde c)+\epsilon)\ \le\ -\sigma^{-2}\,\mathbb{P}[X_1\le\underline{c}]\,\epsilon$. Setting $\delta := \sigma^{-2}\,\mathbb{P}[X_1\le\underline{c}]\,\epsilon$, we get $\limsup_{t\to\infty}\left[\frac{1}{t}\sum_{s=1}^t\bar\varphi_s(\mu_1,\mu_2^*(\tilde c)+\epsilon)\right]\le-\delta$ almost surely.

Now, I use a bound on agents' asymptotic beliefs about $\mu_2$ to deduce asymptotic restrictions on their cutoffs.

Lemma OA.9. Suppose that there are $\underline{\mu}_2\le\mu_2^l<\mu_2^h\le\bar\mu_2$ such that $\lim_{t\to\infty}\hat G_t\left(\Diamond_{[\mu_2^l,\mu_2^h]}\right) = 1$ almost surely. Then $\liminf_{t\to\infty}\hat C_t\ge C(\mu_1,\mu_2^l;\gamma)$ and $\limsup_{t\to\infty}\hat C_t\le C(\mu_1,\mu_2^h;\gamma)$ almost surely.

Proof. I show $\liminf_{t\to\infty}\hat C_t\ge C(\mu_1,\mu_2^l;\gamma)$ almost surely. The argument establishing $\limsup_{t\to\infty}\hat C_t\le C(\mu_1,\mu_2^h;\gamma)$ is symmetric.

Let $c^l = C(\mu_1,\mu_2^l;\gamma)$, and recall that before we defined $\underline{c} := C(\mu_1,\underline{\mu}_2;\gamma)$ and $\bar c := C(\mu_1,\bar\mu_2;\gamma)$. For $(\mu_1,\mu_2')\in\Diamond$, let $L(\mu_2')$ be the line segment in $\mathrm{supp}(g)$ with slope $-\gamma$ that contains the point $(\mu_1,\mu_2')$. By Lemma OA.2, $C(\hat\mu_1,\hat\mu_2;\gamma) = C(\mu_1,\mu_2';\gamma)$ for all $(\hat\mu_1,\hat\mu_2)\in L(\mu_2')$. Since $c\mapsto U(c;\hat\mu_1,\hat\mu_2)$ is single-peaked for every $(\hat\mu_1,\hat\mu_2)$, and since $c^l\le C(\mu_1,\mu_2';\gamma)$ for all $\mu_2'\in[\mu_2^l,\mu_2^h]$, we also get $c^l\le C(\hat\mu_1,\hat\mu_2;\gamma)$ for every $(\hat\mu_1,\hat\mu_2)\in\Diamond_{[\mu_2^l,\mu_2^h]}$, since $\Diamond_{[\mu_2^l,\mu_2^h]}$ is the union of the line segments, $\Diamond_{[\mu_2^l,\mu_2^h]} = \bigcup_{\mu_2'\in[\mu_2^l,\mu_2^h]} L(\mu_2')$.

Fix some $\epsilon>0$. We get $U(c^l;\hat\mu_1,\hat\mu_2) - U(c^l-\epsilon;\hat\mu_1,\hat\mu_2) > 0$ for every $(\hat\mu_1,\hat\mu_2)\in\Diamond_{[\mu_2^l,\mu_2^h]}$. As $(\hat\mu_1,\hat\mu_2)\mapsto\left(U(c^l;\hat\mu_1,\hat\mu_2)-U(c^l-\epsilon;\hat\mu_1,\hat\mu_2)\right)$ is continuous, there exists some $\kappa>0$ so that $U(c^l;\hat\mu_1,\hat\mu_2)-U(c^l-\epsilon;\hat\mu_1,\hat\mu_2)>\kappa$ for all $(\hat\mu_1,\hat\mu_2)\in\Diamond_{[\mu_2^l,\mu_2^h]}$. In particular, if $\nu\in\Delta\left(\Diamond_{[\mu_2^l,\mu_2^h]}\right)$ is a belief about fundamentals, then
$$\int \left(U(c^l;\hat\mu_1,\hat\mu_2)-U(c^l-\epsilon;\hat\mu_1,\hat\mu_2)\right)d\nu(\hat\mu)\ >\ \kappa.$$
Now, let
$$\bar\kappa := \sup_{c\in[\underline{c},\,\bar c]}\ \sup_{(\hat\mu_1,\hat\mu_2)\in\Diamond} U(c;\hat\mu_1,\hat\mu_2),\qquad \underline{\kappa} := \inf_{c\in[\underline{c},\,\bar c]}\ \inf_{(\hat\mu_1,\hat\mu_2)\in\Diamond} U(c;\hat\mu_1,\hat\mu_2).$$

Find $p\in(0,1)$ so that $p\kappa - (1-p)(\bar\kappa-\underline{\kappa}) = 0$. At any belief $\hat\nu\in\Delta(\Diamond)$ that assigns more than probability $p$ to the sub-parallelogram $\Diamond_{[\mu_2^l,\mu_2^h]}$, the optimal cutoff is larger than $c^l-\epsilon$. To see this, take any $\hat c\le c^l-\epsilon$; I will show $\hat c$ is suboptimal. If $\hat c<\underline{c}$, then it is suboptimal after any belief on $\Diamond$. If $\underline{c}\le\hat c\le c^l-\epsilon$, I show that
$$\int \left(U(c^l;\hat\mu_1,\hat\mu_2)-U(\hat c;\hat\mu_1,\hat\mu_2)\right)d\hat\nu(\hat\mu)\ >\ 0.$$
To see this, we may decompose $\hat\nu$ as the mixture of a probability measure $\nu$ on $\Diamond_{[\mu_2^l,\mu_2^h]}$ and another probability measure $\nu^c$ on $\Diamond\setminus\Diamond_{[\mu_2^l,\mu_2^h]}$. Let $\hat p>p$ be the probability that $\hat\nu$ assigns to $\Diamond_{[\mu_2^l,\mu_2^h]}$. The above integral is equal to
$$\hat p\int_{\Diamond_{[\mu_2^l,\mu_2^h]}}\left(U(c^l;\hat\mu_1,\hat\mu_2)-U(\hat c;\hat\mu_1,\hat\mu_2)\right)d\nu + (1-\hat p)\int_{\Diamond\setminus\Diamond_{[\mu_2^l,\mu_2^h]}}\left(U(c^l;\hat\mu_1,\hat\mu_2)-U(\hat c;\hat\mu_1,\hat\mu_2)\right)d\nu^c.$$
Since $c^l$ is to the left of the optimal cutoff for all $(\hat\mu_1,\hat\mu_2)\in\Diamond_{[\mu_2^l,\mu_2^h]}$ and $\hat c\le c^l-\epsilon$, then $U(\hat c;\hat\mu_1,\hat\mu_2)\le U(c^l-\epsilon;\hat\mu_1,\hat\mu_2)$ for all $(\hat\mu_1,\hat\mu_2)\in\Diamond_{[\mu_2^l,\mu_2^h]}$. The first summand is no less than
$$\hat p\int_{\Diamond_{[\mu_2^l,\mu_2^h]}}\left(U(c^l;\hat\mu_1,\hat\mu_2)-U(c^l-\epsilon;\hat\mu_1,\hat\mu_2)\right)d\nu\ \ge\ \hat p\,\kappa.$$
Also, the integrand in the second summand is no smaller than $-(\bar\kappa-\underline{\kappa})$, therefore
$$\int\left(U(c^l;\hat\mu_1,\hat\mu_2)-U(\hat c;\hat\mu_1,\hat\mu_2)\right)d\hat\nu(\hat\mu)\ \ge\ \hat p\kappa-(1-\hat p)(\bar\kappa-\underline{\kappa}).$$
Since $\hat p>p$, we get $\hat p\kappa-(1-\hat p)(\bar\kappa-\underline{\kappa})>0$ as desired.

Along any sample path $\omega$ where $\lim_{t\to\infty}\hat G_t\left(\Diamond_{[\mu_2^l,\mu_2^h]}\right)(\omega) = 1$, we have $\hat G_t\left(\Diamond_{[\mu_2^l,\mu_2^h]}\right)(\omega)>p$ for all large enough $t$, meaning $\liminf_{t\to\infty}\hat C_t(\omega)\ge c^l-\epsilon$. Since $\lim_{t\to\infty}\hat G_t\left(\Diamond_{[\mu_2^l,\mu_2^h]}\right) = 1$ almost surely, this shows $\liminf_{t\to\infty}\hat C_t\ge C(\mu_1,\mu_2^l;\gamma)-\epsilon$ almost surely. Since the choice of $\epsilon>0$ was arbitrary, we in fact conclude $\liminf_{t\to\infty}\hat C_t\ge C(\mu_1,\mu_2^l;\gamma)$ almost surely.

OA 2.6 The Contraction Map

I now combine the results established so far to prove the convergence statement in Theorem 1.

Proof. Let $\mu_2^{l,[1]} := \underline{\mu}_2$, $\mu_2^{h,[1]} := \bar\mu_2$. For $k = 2, 3, \dots$, iteratively define $\mu_2^{l,[k]} := I(\mu_2^{l,[k-1]};\gamma)$ and $\mu_2^{h,[k]} := I(\mu_2^{h,[k-1]};\gamma)$. From Lemma OA.9, if $\lim_{t\to\infty}\hat G_t\left(\Diamond_{[\mu_2^{l,[k]},\,\mu_2^{h,[k]}]}\right) = 1$ almost surely, then $\liminf_{t\to\infty}\hat C_t\ge C(\mu_1,\mu_2^{l,[k]};\gamma)$ and $\limsup_{t\to\infty}\hat C_t\le C(\mu_1,\mu_2^{h,[k]};\gamma)$ almost surely. But using these conclusions


More information

How do Variance Swaps Shape the Smile?

How do Variance Swaps Shape the Smile? How do Variance Swaps Shape the Smile? A Summary of Arbitrage Restrictions and Smile Asymptotics Vimal Raval Imperial College London & UBS Investment Bank www2.imperial.ac.uk/ vr402 Joint Work with Mark

More information

Best-Reply Sets. Jonathan Weinstein Washington University in St. Louis. This version: May 2015

Best-Reply Sets. Jonathan Weinstein Washington University in St. Louis. This version: May 2015 Best-Reply Sets Jonathan Weinstein Washington University in St. Louis This version: May 2015 Introduction The best-reply correspondence of a game the mapping from beliefs over one s opponents actions to

More information

Chapter 4: Asymptotic Properties of MLE (Part 3)

Chapter 4: Asymptotic Properties of MLE (Part 3) Chapter 4: Asymptotic Properties of MLE (Part 3) Daniel O. Scharfstein 09/30/13 1 / 1 Breakdown of Assumptions Non-Existence of the MLE Multiple Solutions to Maximization Problem Multiple Solutions to

More information

Equity correlations implied by index options: estimation and model uncertainty analysis

Equity correlations implied by index options: estimation and model uncertainty analysis 1/18 : estimation and model analysis, EDHEC Business School (joint work with Rama COT) Modeling and managing financial risks Paris, 10 13 January 2011 2/18 Outline 1 2 of multi-asset models Solution to

More information

Optimal stopping problems for a Brownian motion with a disorder on a finite interval

Optimal stopping problems for a Brownian motion with a disorder on a finite interval Optimal stopping problems for a Brownian motion with a disorder on a finite interval A. N. Shiryaev M. V. Zhitlukhin arxiv:1212.379v1 [math.st] 15 Dec 212 December 18, 212 Abstract We consider optimal

More information

12 The Bootstrap and why it works

12 The Bootstrap and why it works 12 he Bootstrap and why it works For a review of many applications of bootstrap see Efron and ibshirani (1994). For the theory behind the bootstrap see the books by Hall (1992), van der Waart (2000), Lahiri

More information

Chapter 7: Portfolio Theory

Chapter 7: Portfolio Theory Chapter 7: Portfolio Theory 1. Introduction 2. Portfolio Basics 3. The Feasible Set 4. Portfolio Selection Rules 5. The Efficient Frontier 6. Indifference Curves 7. The Two-Asset Portfolio 8. Unrestriceted

More information

Cumulants and triangles in Erdős-Rényi random graphs

Cumulants and triangles in Erdős-Rényi random graphs Cumulants and triangles in Erdős-Rényi random graphs Valentin Féray partially joint work with Pierre-Loïc Méliot (Orsay) and Ashkan Nighekbali (Zürich) Institut für Mathematik, Universität Zürich Probability

More information

Viability, Arbitrage and Preferences

Viability, Arbitrage and Preferences Viability, Arbitrage and Preferences H. Mete Soner ETH Zürich and Swiss Finance Institute Joint with Matteo Burzoni, ETH Zürich Frank Riedel, University of Bielefeld Thera Stochastics in Honor of Ioannis

More information

Part 3: Trust-region methods for unconstrained optimization. Nick Gould (RAL)

Part 3: Trust-region methods for unconstrained optimization. Nick Gould (RAL) Part 3: Trust-region methods for unconstrained optimization Nick Gould (RAL) minimize x IR n f(x) MSc course on nonlinear optimization UNCONSTRAINED MINIMIZATION minimize x IR n f(x) where the objective

More information

3.2 No-arbitrage theory and risk neutral probability measure

3.2 No-arbitrage theory and risk neutral probability measure Mathematical Models in Economics and Finance Topic 3 Fundamental theorem of asset pricing 3.1 Law of one price and Arrow securities 3.2 No-arbitrage theory and risk neutral probability measure 3.3 Valuation

More information

Weierstrass Institute for Applied Analysis and Stochastics Maximum likelihood estimation for jump diffusions

Weierstrass Institute for Applied Analysis and Stochastics Maximum likelihood estimation for jump diffusions Weierstrass Institute for Applied Analysis and Stochastics Maximum likelihood estimation for jump diffusions Hilmar Mai Mohrenstrasse 39 1117 Berlin Germany Tel. +49 3 2372 www.wias-berlin.de Haindorf

More information

Extended Libor Models and Their Calibration

Extended Libor Models and Their Calibration Extended Libor Models and Their Calibration Denis Belomestny Weierstraß Institute Berlin Vienna, 16 November 2007 Denis Belomestny (WIAS) Extended Libor Models and Their Calibration Vienna, 16 November

More information

PORTFOLIO THEORY. Master in Finance INVESTMENTS. Szabolcs Sebestyén

PORTFOLIO THEORY. Master in Finance INVESTMENTS. Szabolcs Sebestyén PORTFOLIO THEORY Szabolcs Sebestyén szabolcs.sebestyen@iscte.pt Master in Finance INVESTMENTS Sebestyén (ISCTE-IUL) Portfolio Theory Investments 1 / 60 Outline 1 Modern Portfolio Theory Introduction Mean-Variance

More information

Pakes (1986): Patents as Options: Some Estimates of the Value of Holding European Patent Stocks

Pakes (1986): Patents as Options: Some Estimates of the Value of Holding European Patent Stocks Pakes (1986): Patents as Options: Some Estimates of the Value of Holding European Patent Stocks Spring 2009 Main question: How much are patents worth? Answering this question is important, because it helps

More information

Dynamic Portfolio Execution Detailed Proofs

Dynamic Portfolio Execution Detailed Proofs Dynamic Portfolio Execution Detailed Proofs Gerry Tsoukalas, Jiang Wang, Kay Giesecke March 16, 2014 1 Proofs Lemma 1 (Temporary Price Impact) A buy order of size x being executed against i s ask-side

More information

Lecture 10: Point Estimation

Lecture 10: Point Estimation Lecture 10: Point Estimation MSU-STT-351-Sum-17B (P. Vellaisamy: MSU-STT-351-Sum-17B) Probability & Statistics for Engineers 1 / 31 Basic Concepts of Point Estimation A point estimate of a parameter θ,

More information

6. Martingales. = Zn. Think of Z n+1 as being a gambler s earnings after n+1 games. If the game if fair, then E [ Z n+1 Z n

6. Martingales. = Zn. Think of Z n+1 as being a gambler s earnings after n+1 games. If the game if fair, then E [ Z n+1 Z n 6. Martingales For casino gamblers, a martingale is a betting strategy where (at even odds) the stake doubled each time the player loses. Players follow this strategy because, since they will eventually

More information

Lecture 23: April 10

Lecture 23: April 10 CS271 Randomness & Computation Spring 2018 Instructor: Alistair Sinclair Lecture 23: April 10 Disclaimer: These notes have not been subjected to the usual scrutiny accorded to formal publications. They

More information

Chapter 7: Estimation Sections

Chapter 7: Estimation Sections 1 / 40 Chapter 7: Estimation Sections 7.1 Statistical Inference Bayesian Methods: Chapter 7 7.2 Prior and Posterior Distributions 7.3 Conjugate Prior Distributions 7.4 Bayes Estimators Frequentist Methods:

More information

1 Residual life for gamma and Weibull distributions

1 Residual life for gamma and Weibull distributions Supplement to Tail Estimation for Window Censored Processes Residual life for gamma and Weibull distributions. Gamma distribution Let Γ(k, x = x yk e y dy be the upper incomplete gamma function, and let

More information

Rough paths methods 2: Young integration

Rough paths methods 2: Young integration Rough paths methods 2: Young integration Samy Tindel Purdue University University of Aarhus 2016 Samy T. (Purdue) Rough Paths 2 Aarhus 2016 1 / 75 Outline 1 Some basic properties of fbm 2 Simple Young

More information

Limit Theorems for the Empirical Distribution Function of Scaled Increments of Itô Semimartingales at high frequencies

Limit Theorems for the Empirical Distribution Function of Scaled Increments of Itô Semimartingales at high frequencies Limit Theorems for the Empirical Distribution Function of Scaled Increments of Itô Semimartingales at high frequencies George Tauchen Duke University Viktor Todorov Northwestern University 2013 Motivation

More information

Hints on Some of the Exercises

Hints on Some of the Exercises Hints on Some of the Exercises of the book R. Seydel: Tools for Computational Finance. Springer, 00/004/006/009/01. Preparatory Remarks: Some of the hints suggest ideas that may simplify solving the exercises

More information

Advanced Probability and Applications (Part II)

Advanced Probability and Applications (Part II) Advanced Probability and Applications (Part II) Olivier Lévêque, IC LTHI, EPFL (with special thanks to Simon Guilloud for the figures) July 31, 018 Contents 1 Conditional expectation Week 9 1.1 Conditioning

More information

Risk Measurement in Credit Portfolio Models

Risk Measurement in Credit Portfolio Models 9 th DGVFM Scientific Day 30 April 2010 1 Risk Measurement in Credit Portfolio Models 9 th DGVFM Scientific Day 30 April 2010 9 th DGVFM Scientific Day 30 April 2010 2 Quantitative Risk Management Profit

More information

MATH 5510 Mathematical Models of Financial Derivatives. Topic 1 Risk neutral pricing principles under single-period securities models

MATH 5510 Mathematical Models of Financial Derivatives. Topic 1 Risk neutral pricing principles under single-period securities models MATH 5510 Mathematical Models of Financial Derivatives Topic 1 Risk neutral pricing principles under single-period securities models 1.1 Law of one price and Arrow securities 1.2 No-arbitrage theory and

More information

Chapter 3: Black-Scholes Equation and Its Numerical Evaluation

Chapter 3: Black-Scholes Equation and Its Numerical Evaluation Chapter 3: Black-Scholes Equation and Its Numerical Evaluation 3.1 Itô Integral 3.1.1 Convergence in the Mean and Stieltjes Integral Definition 3.1 (Convergence in the Mean) A sequence {X n } n ln of random

More information

PAULI MURTO, ANDREY ZHUKOV

PAULI MURTO, ANDREY ZHUKOV GAME THEORY SOLUTION SET 1 WINTER 018 PAULI MURTO, ANDREY ZHUKOV Introduction For suggested solution to problem 4, last year s suggested solutions by Tsz-Ning Wong were used who I think used suggested

More information

Strategies for Improving the Efficiency of Monte-Carlo Methods

Strategies for Improving the Efficiency of Monte-Carlo Methods Strategies for Improving the Efficiency of Monte-Carlo Methods Paul J. Atzberger General comments or corrections should be sent to: paulatz@cims.nyu.edu Introduction The Monte-Carlo method is a useful

More information

Optimal Delay in Committees

Optimal Delay in Committees Optimal Delay in Committees ETTORE DAMIANO University of Toronto LI, HAO University of British Columbia WING SUEN University of Hong Kong May 2, 207 Abstract. In a committee of two members with ex ante

More information

IEOR E4703: Monte-Carlo Simulation

IEOR E4703: Monte-Carlo Simulation IEOR E4703: Monte-Carlo Simulation Other Miscellaneous Topics and Applications of Monte-Carlo Martin Haugh Department of Industrial Engineering and Operations Research Columbia University Email: martin.b.haugh@gmail.com

More information

SHORT-TERM RELATIVE ARBITRAGE IN VOLATILITY-STABILIZED MARKETS

SHORT-TERM RELATIVE ARBITRAGE IN VOLATILITY-STABILIZED MARKETS SHORT-TERM RELATIVE ARBITRAGE IN VOLATILITY-STABILIZED MARKETS ADRIAN D. BANNER INTECH One Palmer Square Princeton, NJ 8542, USA adrian@enhanced.com DANIEL FERNHOLZ Department of Computer Sciences University

More information

On Complexity of Multistage Stochastic Programs

On Complexity of Multistage Stochastic Programs On Complexity of Multistage Stochastic Programs Alexander Shapiro School of Industrial and Systems Engineering, Georgia Institute of Technology, Atlanta, Georgia 30332-0205, USA e-mail: ashapiro@isye.gatech.edu

More information

Online Appendix for The Political Economy of Municipal Pension Funding

Online Appendix for The Political Economy of Municipal Pension Funding Online Appendix for The Political Economy of Municipal Pension Funding Jeffrey Brinkman Federal eserve Bank of Philadelphia Daniele Coen-Pirani University of Pittsburgh Holger Sieg University of Pennsylvania

More information

Optimal Stopping. Nick Hay (presentation follows Thomas Ferguson s Optimal Stopping and Applications) November 6, 2008

Optimal Stopping. Nick Hay (presentation follows Thomas Ferguson s Optimal Stopping and Applications) November 6, 2008 (presentation follows Thomas Ferguson s and Applications) November 6, 2008 1 / 35 Contents: Introduction Problems Markov Models Monotone Stopping Problems Summary 2 / 35 The Secretary problem You have

More information

Convergence of trust-region methods based on probabilistic models

Convergence of trust-region methods based on probabilistic models Convergence of trust-region methods based on probabilistic models A. S. Bandeira K. Scheinberg L. N. Vicente October 24, 2013 Abstract In this paper we consider the use of probabilistic or random models

More information

Outline. 1 Introduction. 2 Algorithms. 3 Examples. Algorithm 1 General coordinate minimization framework. 1: Choose x 0 R n and set k 0.

Outline. 1 Introduction. 2 Algorithms. 3 Examples. Algorithm 1 General coordinate minimization framework. 1: Choose x 0 R n and set k 0. Outline Coordinate Minimization Daniel P. Robinson Department of Applied Mathematics and Statistics Johns Hopkins University November 27, 208 Introduction 2 Algorithms Cyclic order with exact minimization

More information

The rth moment of a real-valued random variable X with density f(x) is. x r f(x) dx

The rth moment of a real-valued random variable X with density f(x) is. x r f(x) dx 1 Cumulants 1.1 Definition The rth moment of a real-valued random variable X with density f(x) is µ r = E(X r ) = x r f(x) dx for integer r = 0, 1,.... The value is assumed to be finite. Provided that

More information

Web Appendix: Proofs and extensions.

Web Appendix: Proofs and extensions. B eb Appendix: Proofs and extensions. B.1 Proofs of results about block correlated markets. This subsection provides proofs for Propositions A1, A2, A3 and A4, and the proof of Lemma A1. Proof of Proposition

More information

Speculative Bubbles, Heterogeneous Beliefs, and Learning

Speculative Bubbles, Heterogeneous Beliefs, and Learning Speculative Bubbles, Heterogeneous Beliefs, and Learning Jan Werner University of Minnesota October 2017. Abstract: Speculative bubble arises when the price of an asset exceeds every trader s valuation

More information

MAT25 LECTURE 10 NOTES. = a b. > 0, there exists N N such that if n N, then a n a < ɛ

MAT25 LECTURE 10 NOTES. = a b. > 0, there exists N N such that if n N, then a n a < ɛ MAT5 LECTURE 0 NOTES NATHANIEL GALLUP. Algebraic Limit Theorem Theorem : Algebraic Limit Theorem (Abbott Theorem.3.3) Let (a n ) and ( ) be sequences of real numbers such that lim n a n = a and lim n =

More information

CS340 Machine learning Bayesian model selection

CS340 Machine learning Bayesian model selection CS340 Machine learning Bayesian model selection Bayesian model selection Suppose we have several models, each with potentially different numbers of parameters. Example: M0 = constant, M1 = straight line,

More information

Microeconomic Theory II Preliminary Examination Solutions

Microeconomic Theory II Preliminary Examination Solutions Microeconomic Theory II Preliminary Examination Solutions 1. (45 points) Consider the following normal form game played by Bruce and Sheila: L Sheila R T 1, 0 3, 3 Bruce M 1, x 0, 0 B 0, 0 4, 1 (a) Suppose

More information

MATH 121 GAME THEORY REVIEW

MATH 121 GAME THEORY REVIEW MATH 121 GAME THEORY REVIEW ERIN PEARSE Contents 1. Definitions 2 1.1. Non-cooperative Games 2 1.2. Cooperative 2-person Games 4 1.3. Cooperative n-person Games (in coalitional form) 6 2. Theorems and

More information

A class of coherent risk measures based on one-sided moments

A class of coherent risk measures based on one-sided moments A class of coherent risk measures based on one-sided moments T. Fischer Darmstadt University of Technology November 11, 2003 Abstract This brief paper explains how to obtain upper boundaries of shortfall

More information

Universität Regensburg Mathematik

Universität Regensburg Mathematik Universität Regensburg Mathematik Modeling financial markets with extreme risk Tobias Kusche Preprint Nr. 04/2008 Modeling financial markets with extreme risk Dr. Tobias Kusche 11. January 2008 1 Introduction

More information

STAT/MATH 395 PROBABILITY II

STAT/MATH 395 PROBABILITY II STAT/MATH 395 PROBABILITY II Distribution of Random Samples & Limit Theorems Néhémy Lim University of Washington Winter 2017 Outline Distribution of i.i.d. Samples Convergence of random variables The Laws

More information

Week 2 Quantitative Analysis of Financial Markets Hypothesis Testing and Confidence Intervals

Week 2 Quantitative Analysis of Financial Markets Hypothesis Testing and Confidence Intervals Week 2 Quantitative Analysis of Financial Markets Hypothesis Testing and Confidence Intervals Christopher Ting http://www.mysmu.edu/faculty/christophert/ Christopher Ting : christopherting@smu.edu.sg :

More information

10.1 Elimination of strictly dominated strategies

10.1 Elimination of strictly dominated strategies Chapter 10 Elimination by Mixed Strategies The notions of dominance apply in particular to mixed extensions of finite strategic games. But we can also consider dominance of a pure strategy by a mixed strategy.

More information

Information Acquisition under Persuasive Precedent versus Binding Precedent (Preliminary and Incomplete)

Information Acquisition under Persuasive Precedent versus Binding Precedent (Preliminary and Incomplete) Information Acquisition under Persuasive Precedent versus Binding Precedent (Preliminary and Incomplete) Ying Chen Hülya Eraslan March 25, 2016 Abstract We analyze a dynamic model of judicial decision

More information

Laws of probabilities in efficient markets

Laws of probabilities in efficient markets Laws of probabilities in efficient markets Vladimir Vovk Department of Computer Science Royal Holloway, University of London Fifth Workshop on Game-Theoretic Probability and Related Topics 15 November

More information

CS134: Networks Spring Random Variables and Independence. 1.2 Probability Distribution Function (PDF) Number of heads Probability 2 0.

CS134: Networks Spring Random Variables and Independence. 1.2 Probability Distribution Function (PDF) Number of heads Probability 2 0. CS134: Networks Spring 2017 Prof. Yaron Singer Section 0 1 Probability 1.1 Random Variables and Independence A real-valued random variable is a variable that can take each of a set of possible values in

More information

STOCHASTIC CALCULUS AND BLACK-SCHOLES MODEL

STOCHASTIC CALCULUS AND BLACK-SCHOLES MODEL STOCHASTIC CALCULUS AND BLACK-SCHOLES MODEL YOUNGGEUN YOO Abstract. Ito s lemma is often used in Ito calculus to find the differentials of a stochastic process that depends on time. This paper will introduce

More information