CEC login. Student Details Name SOLUTIONS

Instructions: You have roughly 1 minute per point, so schedule your time accordingly. There is only one correct answer per question. Good luck!

Question 1. Searching
Circle the correct answer for each question (there is exactly one):

a) [2] What is the total number of nodes that iterative deepening visits? (If a node is visited multiple times, count it multiple times.) Assume the tree has a branching factor of "b", depth "d", and the goal is at depth "g".
   a. O(bd)
   b. O(b^d)
   c. O(bg)
   d. O(b^(g+1)) - - - This answer could be O(b^g), but in class we derived it loosely as O(b^(g+1)). By the definition of O-notation, anything that is O(b^g) is also O(b^(g+1)) (but not the other way around).

b) [2] Which of the following statements are true about Breadth First graph search? Assume the tree has a branching factor of "b", depth "d", and the goal is at depth "g".
   a. Breadth First graph search is complete on problems with finite search graphs.
   b. Breadth First graph search uses O(bg) space.
   c. Both (a) and (b) are true.
   d. Both (a) and (b) are false.
   BFS searches the complete state space, so on finite graphs, using graph search to avoid re-exploring nodes, it will explore all nodes and therefore find a solution.

c) [4] Consider A* search. g(n) is the cumulative path cost of a node n, h(n) is a lower bound on the shortest path from n to a goal state, and n' is the parent of n. Assume all costs are positive.
   Note: Enqueuing == putting a node onto the fringe; dequeuing == removing a node from the fringe and then expanding it.
   Which of the following search algorithms are guaranteed to be optimal?
   i.   A*, but apply the goal test before enqueuing nodes rather than after dequeuing.
   ii.  A*, but prioritize n by g(n) only.
   iii. A*, but prioritize n by h(n) only.
   iv.  A*, but prioritize n by g(n) + h(n').
   v.   A*, but prioritize n by g(n') + h(n).
   a. All but i.
   b. ii. and v.
   c. iv. and ii.
   d. iv. and iii.
   e. Only iv.
   f. Only ii.
   ii. is equivalent to A* with the heuristic h = 0 for all nodes. v. is equivalent to A* with the heuristic h'(n) = h(n) - step(n', n), where step(n', n) is the cost of the step from n' to n. So if h is admissible, so is h'. iv. is wrong because it uses the true cost to node n plus the heuristic from the parent of n. This could overestimate the total path length through node n, and therefore fail to expand a node on the optimal path. [IS THIS TRUE?]

We apply a variety of queue-based graph-search algorithms to the state-graph on the right. Initially the fringe contains the start state A. When there are multiple options, nodes are enqueued (put onto the fringe) in alphabetical order.

d) [2] In the BFS algorithm, we perform the goal test when we enqueue a node. How many nodes have been dequeued when we find the solution?
   a. 2
   b. 3
   c. 4
   d. 5 - - - In this case, all nodes A, B, C, D, E will be dequeued.
   e. 6

e) [2] In the DFS algorithm, we perform the goal test when we enqueue a node. What is the sequence of dequeued nodes?
   a. A,B,E,G
   b. A,B,C,D,E
   c. A,B,E
   d. A,D,E - - - Dequeue A, add B, C, D. Dequeue D, add A, C, E. Dequeue E, add B, C, D, G. As we add G, we perform the goal test. So the dequeued nodes are A, D, E.
   e. None of the above

f) [2] In the UCS algorithm, we perform the goal test when we dequeue (expand) a node. How many nodes have been dequeued when we find the solution? (Do not count the dequeuing of the goal state itself.)
   a. 3
   b. 4 (explanation below)
   c. 5
   d. 6
   e. None of the above
   Expand A to get fringe: (B,1), (C,3), (D,7).
   Expand B to get fringe: (E,2), (C,3), (D,7).
   Expand E to get fringe: (C,3), (G,4), (D,7).
   Expand C to get fringe: (G,4), (D,6).
   Dequeue G and it passes the goal test.

        A   B   C   D   E   G
   H1   2   2   2   2   0   0
   H2   3   2   2   2   1   0
   H3   5   4   3   2   1   0

g) [2] The above table shows 3 heuristics, H1, H2, H3, and their values at each node. (For example, H1(A) = 2, H1(B) = 2.) Which of these heuristics are admissible? (The graph is copied again below for your convenience.)
   a. H1 and H2 - - - H3 is not admissible because H3(B) = 4 but B is at distance 3 from the goal.
   b. H2 only
   c. H3 only
   d. All are admissible
   e. None are admissible

h) [2] Which of these heuristics are consistent?
   a. H1 and H2
   b. H2 only - - - H1 and H3 are not consistent: H1(B) - H1(E) = 2 and H3(B) - H3(E) = 3, and each exceeds the cost of the step from B to E.
   c. H3 only
   d. All are consistent
   e. None are consistent
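The consistency answer in h) can be spot-checked with a short sketch. The state graph itself is not reproduced in this text, so the edge costs below are an assumption: they are only the ones that can be read off the UCS trace in f), and the edges are treated as undirected. Edges whose costs the trace does not reveal are left out, so this is a partial check rather than a full verification.

```python
# Edge costs recovered from the UCS trace in f) only (assumption: the figure
# is missing, so any edge whose cost the trace does not reveal is omitted).
EDGES = {
    ("A", "B"): 1, ("A", "C"): 3, ("A", "D"): 7,
    ("B", "E"): 1, ("E", "G"): 2, ("C", "D"): 3,
}

# Heuristic values copied from the table, in node order A B C D E G.
HEURISTICS = {
    "H1": dict(zip("ABCDEG", [2, 2, 2, 2, 0, 0])),
    "H2": dict(zip("ABCDEG", [3, 2, 2, 2, 1, 0])),
    "H3": dict(zip("ABCDEG", [5, 4, 3, 2, 1, 0])),
}


def violations(h):
    """Return the known edges on which h changes by more than the step cost."""
    return [(u, v) for (u, v), cost in EDGES.items() if abs(h[u] - h[v]) > cost]


# A heuristic passes this partial check if no known edge violates consistency.
consistent = {name: not violations(h) for name, h in HEURISTICS.items()}
print(consistent)
```

On these edges only H2 passes, and both H1 and H3 fail on the B-E edge, which matches the differences quoted in the solution.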

Question 2. CSPs
Circle the correct answer for each question (there is exactly one):

a) [2] Which of the following statements are true about the runtime for CSPs?
   a. Tree-structured CSPs may be solved in time that is linear in the number of variables.
   b. Arc-consistency may be checked in time that is polynomial in the number of variables.
   c. Both (a) and (b) are true.
      i. Tree-structured CSPs can be solved in time O(n d^2); arc-consistency is O(n^2 d^2), although in class we showed an algorithm that is O(n^2 d^3).
   d. Both (a) and (b) are false.

b) [2] When solving a CSP by backtracking, which of the following are good heuristics?
   a. Pick the value that is least likely to fail.
   b. Pick the variable that is least constrained.
   c. Both (a) and (b) are good heuristics.
   d. Both (a) and (b) are bad heuristics.

c) [2] Suppose you have a highly efficient solver for tree-structured CSPs. Given a CSP with the following binary constraints, for which re-formulations will you find a fast and correct solution?
   [Figure: constraint graph over the variables A, B, C, D, E, F, G, H, I]
   a. Set the value of E, solve the remaining CSP, and try another value for E if no solution is found.
   b. Replace variables D and E with a variable DE ∈ {(d,e) | d ∈ D and e ∈ E}, then solve.
   c. Ignore either variable D or E, solve, then pick a consistent value.
   d. Both (a) and (b).
      i. These both create a tree-like structure that can be solved quickly.
   e. Both (a) and (c).

d) [2] Which of the following statements are true?
   a. Additional constraints always make CSPs easier to solve.
   b. CSP solvers incorporating randomness are always a bad idea.
   c. Both (a) and (b) are true.
   d. Both (a) and (b) are false.
      i. [WE NEVER DISCUSSED RANDOMNESS IN CLASS, SO EVERYONE GOT CREDIT FOR ALL ANSWERS TO THIS QUESTION]

e) [2] Which of the following are true about CSPs?
   a. If a CSP is arc-consistent, it can be solved without backtracking.
   b. A CSP with only binary constraints can be solved in time polynomial in n (the number of variables) and d (the number of options per variable).
   c. Both (a) and (b) are true.
   d. Both (a) and (b) are false.
      i. (a) is not true: every arc could be consistent while no solution exists. (b) is not true: a general CSP with binary constraints is NP-hard.
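The claim in e) that a CSP can be arc-consistent and still have no solution can be illustrated with a small sketch. The instance is an assumption chosen for illustration (it is not from the exam): three variables X, Y, Z, each with domain {0, 1}, and a "not equal" constraint between every pair.

```python
from itertools import product

# Hypothetical instance: a triangle of "not equal" constraints, two values each.
VARIABLES = ["X", "Y", "Z"]
DOMAINS = {var: [0, 1] for var in VARIABLES}
ARCS = [(a, b) for a in VARIABLES for b in VARIABLES if a != b]


def is_arc_consistent():
    """Every value of a has at least one supporting value of b, for every arc (a, b)."""
    return all(
        any(x != y for y in DOMAINS[b])
        for a, b in ARCS
        for x in DOMAINS[a]
    )


def all_solutions():
    """Brute-force every complete assignment and keep those satisfying all constraints."""
    found = []
    for values in product(*(DOMAINS[var] for var in VARIABLES)):
        assignment = dict(zip(VARIABLES, values))
        if all(assignment[a] != assignment[b] for a, b in ARCS):
            found.append(assignment)
    return found


print(is_arc_consistent(), all_solutions())
```

Every arc is consistent, yet three variables cannot take pairwise different values from a two-element domain, so the solution list is empty.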

Question 3. Adversarial Search
Circle the correct answer for each question (there is exactly one):

a) [2] Which statement is true about reflex agents?
   a. Reflex agents can be learned with Q-learning.
   b. You can design reflex agents that play optimally.
   c. Both a) and b) are true.
      i. Q-learning defines optimal moves in every state, and therefore defines a table for a reflex agent to follow. Q-learning is one way to define reflex agents that play optimally.
   d. Both a) and b) are false.

b) [2] Which statement is true about multi-player games?
   a. Each multi-player game is also a search problem.
   b. Multi-player games are easier (in complexity) than general search problems.
   c. Both a) and b) are true.
      i. Easier because things like alpha-beta pruning let you explore less of the search tree.
   d. Both a) and b) are false.

c) [2] When doing alpha-beta pruning on a game tree visited left to right,
   a. the right-most branch will always be pruned.
   b. the left-most branch will always be pruned.
   c. Both a) and b) are true.
   d. Both a) and b) are false.
      i. The left-most branch is NEVER pruned, but otherwise there are no guarantees.

d) [2] When applying alpha-beta pruning to minimax game trees:
   a. Pruning nodes does not change the value of the root to the max player.
   b. Alpha-beta pruning can prune different numbers of nodes if the children of the root node are reordered.
   c. Both a) and b) are true.
   d. Both a) and b) are false.

e) [2] Normally, alpha-beta pruning is not used with expectimax. Which one of the following conditions allows you to perform pruning with expectimax?
   a. All values are positive.
   b. Children of expectation nodes have values within a finite pre-specified range.
      i. The finite range allows lower and upper bounds to be computed and used as in alpha-beta pruning.
   c. All transition probabilities are within a finite pre-specified range.
   d. The probabilities sum up to one, and you only ever prune the last child.

f) [2] You have a game that you sometimes play against a talented opponent and sometimes against a random opponent, so you have implemented both Minimax and Expectimax. You discover that your evaluation function had a bug: instead of returning the actual value of a terminal state, it returns the square root of the value of the terminal state. All terminal states have positive values. Which of the following statements is true?
   a. The resulting policy might be sub-optimal for Minimax.
   b. The resulting policy might be sub-optimal for Expectimax.
      i. Minimax just looks at the sorted order, which doesn't change if you take the square root of positive values, whereas expectimax needs to compute the average, which is affected by the square root.
   c. Both a) and b) are true.
   d. Both a) and b) are false.
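The square-root argument in f) can be seen on a tiny example. The game below is hypothetical (not from the exam): the player picks action "A" or "B", after which the opponent, either adversarial or uniformly random, picks one of two positive terminal values.

```python
import math

# Hypothetical game: each action leads to an opponent node with two terminal values.
GAME = {"A": [1, 100], "B": [36, 36]}


def mean(values):
    return sum(values) / len(values)


def best_action(opponent_value, leaf=lambda v: v):
    """Pick the action whose opponent node has the highest backed-up value.

    opponent_value is min (adversarial opponent) or mean (uniform random opponent);
    leaf is the evaluation applied to terminal values.
    """
    return max(GAME, key=lambda action: opponent_value([leaf(v) for v in GAME[action]]))


minimax_true = best_action(min)
minimax_buggy = best_action(min, math.sqrt)
expectimax_true = best_action(mean)
expectimax_buggy = best_action(mean, math.sqrt)
print(minimax_true, minimax_buggy, expectimax_true, expectimax_buggy)
```

The minimax choice is the same with and without the bug because the square root preserves ordering, while the expectimax choice flips from "A" (average 50.5 versus 36) to "B" (average 5.5 versus 6).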

Consider the following game tree (which is evaluated left-to-right):

   max (root)
      min (LEFT)
         max: leaves 4, 0
         max: leaves 5, 8
      min (RIGHT)
         max: leaves 0, 2
         max: leaves 1, 8

g) [2] What is the minimax value at the root node of this tree? 4

h) [4] How many leaf nodes would alpha-beta pruning prune? 3

i) [4] Suppose player #2 (formerly min) switches to a new strategy and picks the left action with probability (1/4) and right with probability (3/4). What is the maximum expected utility of player #1? 7
   Left side: Max's choices give 4 and 8, so the expected value is (1/4)(4) + (3/4)(8) = 7.
   Right side: Max's choices give 2 and 8, so the expected value is (1/4)(2) + (3/4)(8) = 6.5.
   Max chooses the better of these to get 7.
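The answers to g) and h) can be reproduced with a minimal alpha-beta sketch. The nested-list encoding of the tree and the function names are illustrative choices; the leaf order is the left-to-right order given in the question.

```python
import math

# The game tree as nested lists: root (max) -> min -> max -> leaves.
TREE = [[[4, 0], [5, 8]], [[0, 2], [1, 8]]]


def count_leaves(node):
    """Total number of leaves under a node."""
    if not isinstance(node, list):
        return 1
    return sum(count_leaves(child) for child in node)


def alphabeta(node, is_max, visited, alpha=-math.inf, beta=math.inf):
    """Return the minimax value of node, recording each evaluated leaf in visited."""
    if not isinstance(node, list):
        visited.append(node)
        return node
    value = -math.inf if is_max else math.inf
    for child in node:
        child_value = alphabeta(child, not is_max, visited, alpha, beta)
        if is_max:
            value = max(value, child_value)
            alpha = max(alpha, value)
        else:
            value = min(value, child_value)
            beta = min(beta, value)
        if alpha >= beta:
            break  # remaining children cannot change the result at the root
    return value


visited_leaves = []
root_value = alphabeta(TREE, True, visited_leaves)
pruned_leaves = count_leaves(TREE) - len(visited_leaves)
print(root_value, visited_leaves, pruned_leaves)
```

Only the leaves 4, 0, 5, 0, 2 are evaluated: the 8 in the left subtree is cut off once 5 exceeds the min node's bound of 4, and the last max node on the right is skipped once the right min node is known to be worth at most 2.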

Question 4. MDP + RL
Circle the correct answer for each question (there is exactly one):

a) [1] A rational agent (who uses dollar amounts as utility) prefers to be given an envelope containing $X rather than one containing either $0 or $10 (with equal probability). What is the smallest $X such that the agent may be acting in accordance with the principle of maximum expected utility?
   a. There is no such minimum $X.
   b. $0.
   c. $5. - - - Money is not always a good model for the utility of choices for people, but this problem explicitly states that dollar amounts are the utility function.
   d. $10.

b) [2] Which of the following are true about Markov Decision Processes?
   a. If the only difference between two MDPs is the value of the discount factor, then they must have the same optimal policy.
   b. Rational policies can be learned before values converge.
      i. This is why we use policy iteration: the policy may stay constant while the value is still changing. (a) is false - - - if the discount factor is small, the optimal policy may avoid long paths with the (larger) payoff only at the end. [CHECK THIS ANSWER]
   c. Both (a) and (b) are true.
   d. Neither (a) nor (b) are true.

c) [2] Which of the following are true about Q-learning?
   a. Q-learning will only learn the optimal Q-values if actions are eventually selected according to the optimal policy.
   b. In a deterministic MDP (i.e. one in which each state / action leads to a single deterministic next state), the Q-learning update with a learning rate of α = 1 will correctly learn the optimal Q-values.
      i. True. The learning rate is only there because we are trying to approximate a summation with a single sample. In a deterministic MDP, where s' is always the state we reach after applying action a in state s, the update rule becomes:
         1. Q(s,a) = R(s,a,s') + γ max_a' Q(s',a'), which is exactly the update we make.
      ii. (a) is false because any strategy that keeps visiting all states and actions will eventually allow Q-learning to converge to the optimal Q-values.
   c. Both (a) and (b) are true.
   d. Neither (a) nor (b) are true.
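The α = 1 claim in c) can be checked on a toy deterministic MDP. Everything in this sketch is an assumption made for illustration: a three-state chain (0, 1, terminal 2), the actions "stay" and "right", a reward of 1 for entering the terminal state, and γ = 0.9. Actions are tried in a fixed sweep rather than chosen by the optimal policy.

```python
GAMMA = 0.9          # hypothetical discount factor
TERMINAL = 2         # hypothetical chain: 0 -> 1 -> 2 (terminal)
ACTIONS = ("stay", "right")


def step(state, action):
    """Deterministic transition: 'right' moves one state forward, 'stay' does not move."""
    next_state = state + 1 if action == "right" else state
    reward = 1.0 if next_state == TERMINAL else 0.0
    return next_state, reward


def q_learning_alpha_one(sweeps=50):
    """Q-learning with learning rate 1: each update overwrites the old estimate."""
    q = {(s, a): 0.0 for s in (0, 1) for a in ACTIONS}
    for _ in range(sweeps):
        for (s, a) in q:  # visit every state-action pair, not just the greedy one
            next_state, reward = step(s, a)
            future = 0.0 if next_state == TERMINAL else max(q[(next_state, b)] for b in ACTIONS)
            q[(s, a)] = reward + GAMMA * future
    return q


q_values = q_learning_alpha_one()
print(q_values)
```

The learned values agree with the optimal ones for this chain: Q(1,right) = 1, Q(1,stay) = 0.9, Q(0,right) = 0.9 and Q(0,stay) = 0.81.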

d) The above MDP has states 1, 2, 3, 4, G, and M, where G and M are terminal states. The reward for transitioning from any state to G is 1 (you scored a goal!). The reward for transitioning from any state to M is 0. All other rewards are zero. There is no discounting (so γ = 1). The transition distributions are:
   From state i, if you shoot (S), you have a probability i/4 of scoring, so T(i,S,G) = i/4, and otherwise you miss: T(i,S,M) = 1 - i/4.
   If you dribble (D) from state i, you have a 3/4 probability of getting to state i+1 and a 1/4 probability of losing the ball and going to state M (unless you are in state 4, where the goalie stops you every time), so T(i,D,i+1) = 3/4 and T(i,D,M) = 1/4 for i = 1, 2, 3, and T(4,D,M) = 1.

   a. [3] Let π be the policy that always shoots. What is V^π(1)? 1/4
      V^π(1) = T(1,S,G) * R(1,S,G) + T(1,S,M) * R(1,S,M) = 1/4 * 1 + 3/4 * 0 = 1/4

   b. [3] Define Q* to be the Q-values under the optimal policy; what is Q*(3,D)? 3/4
      Q*(3,D) is dribbling from state 3, so the actions will be dribble, then shoot. Rewards for missing are zero, so those terms can be dropped as soon as we see them. Thus:
      Q*(3,D) = T(3,D,4) * T(4,S,G) * R(4,S,G)
      Plugging in values: T(3,D,4) = 3/4, T(4,S,G) = 1, R(4,S,G) = 1, so Q*(3,D) = 3/4.

   c. [3] If you use value iteration to compute the values V* for each node, what is the sequence of values after the first three iterations for V*(1)? (Your answer should be a set of three values, such as 1/12, 1/3, 1/2, and you may have to compute the value iteration values for all states to compute these.)
      1/4, 6/16 OR 3/8, 27/64
      Box below is just for your work. The only thing that will be graded is what you put here ^^^^.
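The value-iteration sequence for this MDP can be reproduced with a short sketch that uses exact fractions. The function and variable names are illustrative; the transition probabilities and rewards are the ones stated in the problem, with γ = 1 and terminal states worth 0.

```python
from fractions import Fraction

STATES = (1, 2, 3, 4)


def q_values(i, v):
    """One-step lookahead from state i given current value estimates v."""
    shoot = Fraction(i, 4)  # reward 1 with probability i/4, otherwise 0
    # Dribbling reaches state i+1 with probability 3/4; from state 4 it always fails.
    dribble = Fraction(3, 4) * v[i + 1] if i < 4 else Fraction(0)
    return {"S": shoot, "D": dribble}


def value_iteration(iterations):
    """Return the list of value tables after each synchronous iteration."""
    v = {i: Fraction(0) for i in STATES}
    history = []
    for _ in range(iterations):
        v = {i: max(q_values(i, v).values()) for i in STATES}
        history.append(v)
    return history


history = value_iteration(3)
v1_sequence = [table[1] for table in history]
q_star_3_dribble = q_values(3, history[-1])["D"]
print(v1_sequence, q_star_3_dribble)
```

The sequence for state 1 comes out as 1/4, 3/8, 27/64, and the lookahead for dribbling from state 3 gives 3/4, in line with the answers to b. and c.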

   Iteration#           V*(1)   V*(2)   V*(3)   V*(4)
   0 (initialization)   0       0       0       0
   1                    1/4     2/4     3/4     4/4
   2                    6/16    9/16    3/4     4/4
   3                    27/64   9/16    3/4     4/4

Question 5. Probabilities
We continue the same soccer problem, but now imagine that sometimes there is a defender D between the agent A and the goal. A has no way of detecting whether D is present, but does know statistics of the environment:
   D is present 2/3 of the time.
   D does not affect shots at all, only dribbling.
   When D is absent, the chance of dribbling forward successfully is 3/4 (as it was in the problem above).
   When D is present, the chance of dribbling forward successfully is 1/4.
   In either case, if dribbling forward fails, the game goes to the M (missed) state.

a. [2] If the defender is present, what is the optimal action from state 1? S

b. [4] Suppose that A dribbles twice successfully from state 1 to state 3, then shoots and scores. Given this observation, what is the probability that the defender D was present? 2/11
   We can use Bayes' rule, where d is a random variable denoting the presence of the defender, and e is the evidence that A dribbled twice and then scored. We want to compute P(d | e). By Bayes' rule, that can be expressed as:
      P(d | e) = P(e | d) P(d) / P(e)
   Building up these pieces, we have:
      P(e) = P(e | d) P(d) + P(e | ~d) P(~d)
      P(e | d) = probability of our actions given the defender = 1/4 * 1/4 * 3/4 = 3/64
      P(e | ~d) = probability of our actions without the defender = 3/4 * 3/4 * 3/4 = 27/64
      P(e) = 3/64 * 2/3 + 27/64 * 1/3 = 2/64 + 9/64 = 11/64
      P(d | e) = (3/64) * (2/3) / (11/64) = 3 * (2/3) / 11 = 2/11
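The Bayes' rule computation in b. can be double-checked with exact fractions. This is only a sketch; the variable names are illustrative and the probabilities are those stated in the problem.

```python
from fractions import Fraction

P_DEFENDER = Fraction(2, 3)      # prior probability that the defender is present
SHOT_FROM_3 = Fraction(3, 4)     # scoring probability when shooting from state 3


def evidence_likelihood(dribble_success):
    """P(two successful dribbles, then a goal from state 3) for a given dribble success rate."""
    return dribble_success * dribble_success * SHOT_FROM_3


like_present = evidence_likelihood(Fraction(1, 4))   # defender present
like_absent = evidence_likelihood(Fraction(3, 4))    # defender absent

# Total probability of the evidence, then Bayes' rule.
p_evidence = like_present * P_DEFENDER + like_absent * (1 - P_DEFENDER)
posterior_present = like_present * P_DEFENDER / p_evidence
print(like_present, like_absent, p_evidence, posterior_present)
```

Successful dribbling is much more likely without the defender, so the observation pulls the probability of the defender being present from the prior of 2/3 down to 2/11.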