Machine Learning in Computer Vision Markov Random Fields Part II


1 Machine Learning in Computer Vision: Markov Random Fields, Part II. Oren Freifeld. Computer Science, Ben-Gurion University. March 22, 2018.

2 Outline: 1. Some MRF Computations

3 Few Things
Plan for what is coming up next (not just today):
Computations on MRFs (i.e., inference):
- where we can compute things exactly, using Dynamic Programming;
- Markov Chain Monte Carlo (MCMC) and Gibbs sampling;
- a brief discussion of additional methods for inference.
Several specific MRF models (e.g., the Ising model) and their applications in computer vision.
HW #2 will focus on implementing exact sampling from the Ising-model prior (but DP is also applicable to the posterior).
HW #3 will focus on implementing MCMC sampling (particularly Gibbs sampling) from the Ising-model prior and posterior (given observations corrupted by Gaussian noise).

4 Some MRF Computations: Computing on General Graphs
Things we may want to compute:
The most likely state (possibly conditioned on some observations). Example: the most probable binary segmentation given a noisy image.
Marginals, normalizing constants, and expectations. Example: for $p(x) = \frac{1}{Z} \exp\left(-\sum_{c \in \mathcal{C}} E_c(x_c)\right)$, compute $Z = \sum_x \exp\left(-\sum_{c \in \mathcal{C}} E_c(x_c)\right)$.
Sampling from $p$. Example: sampling an image from a distribution over images (Ulf Grenander's Pattern Theory: "model synthesis = model analysis").
In all cases, exploiting the conditional independencies captured by $G$ is key.

5 Some MRF Computations: Many Approaches for Computing on MRFs
If $R$ (the range of each $x_s$) is finite and $G$ is special (e.g., linear, or a tree), there are efficient methods with theoretical guarantees; in fact, these methods are often used even when there is no special structure (hence no guarantees).
In Gaussian MRFs, things can often be done either in closed form or very efficiently (exploiting sparse linear algebra and the sparsity of $Q = \Sigma^{-1}$).
If $R$ is finite and $G$ is either small or big-but-not-too-complicated, things can be done exactly using Dynamic Programming. If the clique functions have special structure, this can often be exploited.
If $R$ is finite, graph-theoretic methods can often efficiently (locally) maximize $p$.
Markov Chain Monte Carlo (MCMC).
Methods that approximate $G$ and/or $p$ (e.g., using a simpler graph).
We will touch upon only some of these.

6 Dynamic Programming
Not directly applicable to continuous RVs.
Impractical if the associated complexity of the MRF is too high.
When applicable, the computations are exact.
Consider, as a prototype problem, working with $p(x_A \mid y_B)$ (more generally, working with $p(x)$, an MRF w.r.t. some graph $G$).
Let us start with finding most-likely (argmax) configurations.

7 Dynamic Programming for Argmax: Example
$S = \{s, t, u, v, w\}$, $x = (x_s, x_t, x_u, x_v, x_w)$
$p(x) = H_{uvw}(x_u, x_v, x_w) H_u(x_u) H_{ut}(x_u, x_t) H_{vt}(x_v, x_t) H_t(x_t) H_{us}(x_u, x_s)$
$\mathcal{C} = \{\{u,v,w\}, \{u\}, \{u,t\}, \{v,t\}, \{t\}, \{u,s\}\}$
[Figure: dependency graph for $p(x)$ over the nodes $x_s, x_t, x_u, x_v, x_w$]

8 Dynamic Programming for Argmax: Example
Step 1: Absorb sub-cliques (optional; facilitates bookkeeping) and redefine $\mathcal{C}$ (the set of relevant cliques) accordingly. E.g., with
$F_{uvw} = H_{uvw}$, $F_{ut} = H_u H_{ut}$, $F_{vt} = H_{vt} H_t$, $F_{us} = H_{us}$,
we convert
$p(x) = H_{uvw}(x_u, x_v, x_w) H_u(x_u) H_{ut}(x_u, x_t) H_{vt}(x_v, x_t) H_t(x_t) H_{us}(x_u, x_s)$, $\mathcal{C} = \{\{u,v,w\}, \{u\}, \{u,t\}, \{v,t\}, \{t\}, \{u,s\}\}$
to
$p(x) = F_{uvw}(x_u, x_v, x_w) F_{ut}(x_u, x_t) F_{vt}(x_v, x_t) F_{us}(x_u, x_s)$, $\mathcal{C} = \{\{u,v,w\}, \{u,t\}, \{v,t\}, \{u,s\}\}$

9 Dynamic Programming for Argmax: Example
$S = \{s, t, u, v, w\}$, $x = (x_s, x_t, x_u, x_v, x_w)$
$p(x) = F_{uvw}(x_u, x_v, x_w) F_{ut}(x_u, x_t) F_{vt}(x_v, x_t) F_{us}(x_u, x_s)$
$\mathcal{C} = \{\{u,v,w\}, \{u,t\}, \{v,t\}, \{u,s\}\}$
[Figure: dependency graph for $p(x)$]
Remark: we could have done better with $\mathcal{C} = \{\{u,v,w\}, \{u,v,t\}, \{u,s\}\}$, but that would have hidden some of the structure of the factorization.

10 Dynamic Programming for Argmax: Example
Step 2: Pick a site-visitation schedule, i.e., an ordering (and rename the sites accordingly):
$S = \{1, 2, 3, 4, 5\}$, $x = (x_1, x_2, x_3, x_4, x_5)$
$p(x) = F_{123}(x_1, x_2, x_3) F_{24}(x_2, x_4) F_{34}(x_3, x_4) F_{35}(x_3, x_5)$
$\mathcal{C} = \{\{1,2,3\}, \{2,4\}, \{3,4\}, \{3,5\}\}$
[Figure: dependency graph for $p(x)$ over the nodes $x_1, \ldots, x_5$]

11 Dynamic Programming for Argmax: Example
During the blood and gore of the next steps, keep in mind that the only thing that is really happening here is this:
$$\max_x p(x) = \max_{x_5} \max_{x_4} \max_{x_3} \max_{x_2} \max_{x_1} p(x) = \max_{x_5} \max_{x_4} \max_{x_3} \max_{x_2} \max_{x_1} F_{123} F_{24} F_{34} F_{35}$$
$$= \underbrace{\max_{x_5} \underbrace{\max_{x_4} \underbrace{\max_{x_3} F_{34} F_{35} \underbrace{\max_{x_2} F_{24} \underbrace{\max_{x_1} F_{123}}_{C_1(x_2, x_3)}}_{C_2(x_3, x_4)}}_{C_3(x_4, x_5)}}_{C_4(x_5)}}_{C_5}$$
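To make the identity concrete, here is a minimal numpy sketch (numpy and the random stand-in factor tables are additions for illustration, not part of the slides): each $C_k$ is just a table over the current boundary variables, and eliminating innermost-first reproduces the brute-force maximum.

```python
import numpy as np

rng = np.random.default_rng(0)
r = 3                         # |R|: size of the common state space (toy choice)
F123 = rng.random((r, r, r))  # F123[x1, x2, x3]
F24 = rng.random((r, r))      # F24[x2, x4]
F34 = rng.random((r, r))      # F34[x3, x4]
F35 = rng.random((r, r))      # F35[x3, x5]

# Brute force: materialize all r**5 products, then max over everything.
full = (F123[:, :, :, None, None] * F24[None, :, None, :, None]
        * F34[None, None, :, :, None] * F35[None, None, :, None, :])

# Eliminate sites innermost-first, exactly as in the display above.
C1 = F123.max(axis=0)                                # C1[x2, x3] = max_{x1} F123
C2 = (C1[:, :, None] * F24[:, None, :]).max(axis=0)  # C2[x3, x4]
C3 = (C2[:, :, None] * F34[:, :, None] * F35[:, None, :]).max(axis=0)  # C3[x4, x5]
C4 = C3.max(axis=0)                                  # C4[x5]
C5 = C4.max()                                        # = max_x of the product
assert np.isclose(full.max(), C5)
```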

12 Dynamic Programming for Argmax: Example
Step 3: Define boundaries and innovation cliques, $k \in \{1, 2, \ldots, |S|\}$.
Boundaries: $b_k$ = boundary of $\{1, 2, \ldots, k\}$ in $G$ $= \{l \in S : l > k \text{ and } l \in c \text{ for some } c \in \mathcal{C} \text{ s.t. } c \cap \{1, 2, \ldots, k\} \neq \emptyset\}$.
E.g., for $\mathcal{C} = \{\{1,2,3\}, \{2,4\}, \{3,4\}, \{3,5\}\}$ (instantiating the definition with $k = 1, \ldots, 5$):
$b_1 = \{2, 3\}$, $b_2 = \{3, 4\}$, $b_3 = \{4, 5\}$, $b_4 = \{5\}$, $b_5 = \emptyset$.

13 Dynamic Programming for Argmax: Example
Step 3 (cont.): Innovation cliques: $\mathcal{C}_k$ = the relevant new cliques at step $k$ $= \{c \in \mathcal{C} : k \in c \text{ and } l \notin c \ \forall l < k\}$.
E.g., for $\mathcal{C} = \{\{1,2,3\}, \{2,4\}, \{3,4\}, \{3,5\}\}$ (instantiating the definition with $k = 1, \ldots, 5$):
$\mathcal{C}_1 = \{\{1,2,3\}\}$, $\mathcal{C}_2 = \{\{2,4\}\}$, $\mathcal{C}_3 = \{\{3,4\}, \{3,5\}\}$, $\mathcal{C}_4 = \emptyset$, $\mathcal{C}_5 = \emptyset$.

14 Dynamic Programming for Argmax: Example
Step 4: Loop through the sites, computing conditional optima. Initialize:
$$R_1(x_{b_1}) = \operatorname*{arg\,max}_{x_1} \prod_{c \in \mathcal{C}_1} F_c(x_c), \qquad C_1(x_{b_1}) = \max_{x_1} \prod_{c \in \mathcal{C}_1} F_c(x_c) = \left.\prod_{c \in \mathcal{C}_1} F_c(x_c)\right|_{x_1 = R_1(x_{b_1})}$$
E.g., for $b_1 = \{2, 3\}$ and $\mathcal{C}_1 = \{\{1,2,3\}\}$:
$$R_1(x_2, x_3) = \operatorname*{arg\,max}_{x_1} F_{123}(x_1, x_2, x_3), \qquad C_1(x_2, x_3) = F_{123}(x_1, x_2, x_3)\big|_{x_1 = R_1(x_2, x_3)}$$

15 Dynamic Programming for Argmax: Example
Step 4 (cont.): Iterate, for $k = 2, 3, \ldots, |S|$ (breaking ties arbitrarily):
$$R_k(x_{b_k}) = \operatorname*{arg\,max}_{x_k} C_{k-1}(x_{b_{k-1}}) \prod_{c \in \mathcal{C}_k} F_c(x_c)$$
$$C_k(x_{b_k}) = \max_{x_k} C_{k-1}(x_{b_{k-1}}) \prod_{c \in \mathcal{C}_k} F_c(x_c) = \left. C_{k-1}(x_{b_{k-1}}) \prod_{c \in \mathcal{C}_k} F_c(x_c) \right|_{x_k = R_k(x_{b_k})}$$

16 Dynamic Programming for Argmax: Example
E.g.:
$$R_2(x_3, x_4) = \operatorname*{arg\,max}_{x_2} C_1(x_2, x_3) F_{24}(x_2, x_4), \qquad C_2(x_3, x_4) = C_1(R_2(x_3, x_4), x_3)\, F_{24}(R_2(x_3, x_4), x_4)$$
$$R_3(x_4, x_5) = \operatorname*{arg\,max}_{x_3} C_2(x_3, x_4) F_{34}(x_3, x_4) F_{35}(x_3, x_5)$$
$$C_3(x_4, x_5) = C_2(R_3(x_4, x_5), x_4)\, F_{34}(R_3(x_4, x_5), x_4)\, F_{35}(R_3(x_4, x_5), x_5)$$
$$R_4(x_5) = \operatorname*{arg\,max}_{x_4} C_3(x_4, x_5), \qquad C_4(x_5) = C_3(R_4(x_5), x_5)$$
$$R_5 = \operatorname*{arg\,max}_{x_5} C_4(x_5), \qquad C_5 = C_4(R_5)$$

17 Dynamic Programming for Argmax: Example
Step 5: Compute the global optimum, backwards:
$$\hat{x}_{|S|} = R_{|S|}, \qquad \hat{x}_{k-1} = R_{k-1}(\hat{x}_{b_{k-1}}), \quad k = |S|, \ldots, 2$$
E.g.:
$$\hat{x}_5 = R_5, \quad \hat{x}_4 = R_4(\hat{x}_5), \quad \hat{x}_3 = R_3(\hat{x}_4, \hat{x}_5), \quad \hat{x}_2 = R_2(\hat{x}_3, \hat{x}_4), \quad \hat{x}_1 = R_1(\hat{x}_2, \hat{x}_3)$$
$(\hat{x}_1, \hat{x}_2, \hat{x}_3, \hat{x}_4, \hat{x}_5)$ maximizes $p$ (i.e., it is the most likely state).
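An end-to-end sketch of the forward and backward passes on this toy example (again with made-up random factor tables as an assumption): the forward pass stores the argmax tables $R_k$ alongside the $C_k$ tables, and the backward pass reads off the maximizer, verified here against brute force.

```python
import itertools
import numpy as np

rng = np.random.default_rng(0)
r = 3
F123, F24 = rng.random((r, r, r)), rng.random((r, r))
F34, F35 = rng.random((r, r)), rng.random((r, r))

# Forward pass: for each k, R_k = argmax table, C_k = max table (axis 0 = x_k).
A1 = F123                                                 # A1[x1, x2, x3]
R1, C1 = A1.argmax(axis=0), A1.max(axis=0)                # tables over (x2, x3)
A2 = C1[:, :, None] * F24[:, None, :]                     # A2[x2, x3, x4]
R2, C2 = A2.argmax(axis=0), A2.max(axis=0)                # tables over (x3, x4)
A3 = C2[:, :, None] * F34[:, :, None] * F35[:, None, :]   # A3[x3, x4, x5]
R3, C3 = A3.argmax(axis=0), A3.max(axis=0)                # tables over (x4, x5)
R4, C4 = C3.argmax(axis=0), C3.max(axis=0)                # tables over (x5,)
R5, C5 = int(C4.argmax()), C4.max()                       # scalars

# Backward pass: plug the optimizers in, last site first.
x5 = R5
x4 = int(R4[x5])
x3 = int(R3[x4, x5])
x2 = int(R2[x3, x4])
x1 = int(R1[x2, x3])

# Sanity check against brute force over all r**5 configurations.
def score(x):
    a, b, c, d, e = x
    return F123[a, b, c] * F24[b, d] * F34[c, d] * F35[c, e]

best = max(itertools.product(range(r), repeat=5), key=score)
assert (x1, x2, x3, x4, x5) == best and np.isclose(C5, score(best))
```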

18 Dynamic Programming for Argmax: Example
Computational cost: assume a common state space $R$ (i.e., $x_k \in R\ \forall k$). Step $k$ involves $O(|R|^{|b_k| + 1})$ operations. Hence the cost of the entire procedure is no more than $O(|S| \cdot |R|^{B+1})$, where $B = \max_k |b_k|$. E.g., $B = 2$ in the example.
If the visitation schedule were reordered, exchanging 3 and 4, $B$ would still be 2, but the third step would require $O(|R|^2)$ instead of $O(|R|^3)$ operations. In either case, the cost is about $5|R|^3$.
Finding the ordering that minimizes $B$ is NP-hard.
Recall that in an HMM, $p(x_A \mid y_B)$ is a Markov chain. Using the left-to-right visitation schedule, $B = 1$, and the number of computations is $O(n|R|^2)$. Brute force is $O(n|R|^n)$.

19 Dynamic Programming for Normalization-constant: Example
$S = \{s, t, u, v, w\}$, $x = (x_s, x_t, x_u, x_v, x_w)$
$p(x) = \frac{1}{Z} H_{uvw}(x_u, x_v, x_w) H_u(x_u) H_{ut}(x_u, x_t) H_{vt}(x_v, x_t) H_t(x_t) H_{us}(x_u, x_s)$
$\mathcal{C} = \{\{u,v,w\}, \{u\}, \{u,t\}, \{v,t\}, \{t\}, \{u,s\}\}$
Same example, but now compute the normalizer, $Z = \sum_x \prod_{c \in \mathcal{C}} H_c(x_c)$.
[Figure: dependency graph for $p(x)$]

20 Dynamic Programming for Normalization-constant: Example
Similar to the argmax case.
Step 1: Absorb sub-cliques (optional; facilitates bookkeeping) and redefine $\mathcal{C}$ (the set of relevant cliques) accordingly. E.g., with
$F_{uvw} = H_{uvw}$, $F_{ut} = H_u H_{ut}$, $F_{vt} = H_{vt} H_t$, $F_{us} = H_{us}$,
we convert
$p(x) \propto H_{uvw}(x_u, x_v, x_w) H_u(x_u) H_{ut}(x_u, x_t) H_{vt}(x_v, x_t) H_t(x_t) H_{us}(x_u, x_s)$, $\mathcal{C} = \{\{u,v,w\}, \{u\}, \{u,t\}, \{v,t\}, \{t\}, \{u,s\}\}$
to
$p(x) \propto F_{uvw}(x_u, x_v, x_w) F_{ut}(x_u, x_t) F_{vt}(x_v, x_t) F_{us}(x_u, x_s)$, $\mathcal{C} = \{\{u,v,w\}, \{u,t\}, \{v,t\}, \{u,s\}\}$

21 Dynamic Programming for Normalization-constant: Example
Similar to the argmax case.
$S = \{s, t, u, v, w\}$, $x = (x_s, x_t, x_u, x_v, x_w)$
$p(x) \propto F_{uvw}(x_u, x_v, x_w) F_{ut}(x_u, x_t) F_{vt}(x_v, x_t) F_{us}(x_u, x_s)$
$\mathcal{C} = \{\{u,v,w\}, \{u,t\}, \{v,t\}, \{u,s\}\}$
[Figure: dependency graph for $p(x)$]
Remark: we could have done better with $\mathcal{C} = \{\{u,v,w\}, \{u,v,t\}, \{u,s\}\}$, but that would have hidden some of the structure of the factorization.

22 Dynamic Programming for Normalization-constant: Example
Similar to the argmax case.
Step 2: Pick a site-visitation schedule, i.e., an ordering (and rename the sites accordingly):
$S = \{1, 2, 3, 4, 5\}$, $x = (x_1, x_2, x_3, x_4, x_5)$
$p(x) \propto F_{123}(x_1, x_2, x_3) F_{24}(x_2, x_4) F_{34}(x_3, x_4) F_{35}(x_3, x_5)$
$\mathcal{C} = \{\{1,2,3\}, \{2,4\}, \{3,4\}, \{3,5\}\}$
[Figure: dependency graph for $p(x)$]

23 Dynamic Programming for Normalization-constant: Example
Similar to the argmax case. During the blood and gore of the next steps, keep in mind that the only thing that is really happening here is this:
$$\sum_x \prod_{c \in \mathcal{C}} F_c(x_c) = \sum_{x_5} \sum_{x_4} \sum_{x_3} \sum_{x_2} \sum_{x_1} F_{123} F_{24} F_{34} F_{35} = \underbrace{\sum_{x_5} \underbrace{\sum_{x_4} \underbrace{\sum_{x_3} F_{34} F_{35} \underbrace{\sum_{x_2} F_{24} \underbrace{\sum_{x_1} F_{123}}_{T_1(x_2, x_3)}}_{T_2(x_3, x_4)}}_{T_3(x_4, x_5)}}_{T_4(x_5)}}_{T_5}$$
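In code, this whole chain of sums is a single tensor contraction. A minimal sketch, with the same random stand-in factor tables as before (an assumption for illustration), checks np.einsum against the brute-force sum:

```python
import numpy as np

rng = np.random.default_rng(0)
r = 3
F123, F24 = rng.random((r, r, r)), rng.random((r, r))
F34, F35 = rng.random((r, r)), rng.random((r, r))

# Indices a,b,c,d,e stand for x1,...,x5; an empty output means "sum out everything".
Z = np.einsum('abc,bd,cd,ce->', F123, F24, F34, F35)

# Brute force: build the full joint table of products, then sum it.
full = (F123[:, :, :, None, None] * F24[None, :, None, :, None]
        * F34[None, None, :, :, None] * F35[None, None, :, None, :])
assert np.isclose(Z, full.sum())
```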

24 Dynamic Programming for Normalization-constant: Example
Identical to the argmax case.
Step 3: Define boundaries and innovation cliques, $k \in \{1, 2, \ldots, |S|\}$.
Boundaries: $b_k = \{l \in S : l > k \text{ and } l \in c \text{ for some } c \in \mathcal{C} \text{ s.t. } c \cap \{1, 2, \ldots, k\} \neq \emptyset\}$.
E.g., for $\mathcal{C} = \{\{1,2,3\}, \{2,4\}, \{3,4\}, \{3,5\}\}$: $b_1 = \{2, 3\}$, $b_2 = \{3, 4\}$, $b_3 = \{4, 5\}$, $b_4 = \{5\}$, $b_5 = \emptyset$.

25 Dynamic Programming for Normalization-constant: Example
Identical to the argmax case.
Step 3 (cont.): Innovation cliques: $\mathcal{C}_k$ = the relevant new cliques at step $k$ $= \{c \in \mathcal{C} : k \in c \text{ and } l \notin c\ \forall l < k\}$.
E.g., for $\mathcal{C} = \{\{1,2,3\}, \{2,4\}, \{3,4\}, \{3,5\}\}$: $\mathcal{C}_1 = \{\{1,2,3\}\}$, $\mathcal{C}_2 = \{\{2,4\}\}$, $\mathcal{C}_3 = \{\{3,4\}, \{3,5\}\}$, $\mathcal{C}_4 = \emptyset$, $\mathcal{C}_5 = \emptyset$.

26 Dynamic Programming for Normalization-constant: Example
Slightly simpler than the argmax case (less bookkeeping).
Step 4: Loop through the sites, computing partial sums. Initialize:
$$T_1(x_{b_1}) = \sum_{x_1} \prod_{c \in \mathcal{C}_1} F_c(x_c)$$
E.g., for $b_1 = \{2, 3\}$ and $\mathcal{C}_1 = \{\{1,2,3\}\}$:
$$T_1(x_2, x_3) = \sum_{x_1} F_{123}(x_1, x_2, x_3)$$

27 Dynamic Programming for Normalization-constant: Example
Slightly simpler than the argmax case (less bookkeeping).
Step 4 (cont.): Iterate, for $k = 2, 3, \ldots, |S|$:
$$T_k(x_{b_k}) = \sum_{x_k} T_{k-1}(x_{b_{k-1}}) \prod_{c \in \mathcal{C}_k} F_c(x_c)$$
Then $T_{|S|} = Z$.

28 Dynamic Programming for Normalization-constant: Example
E.g.:
$$T_2(x_3, x_4) = \sum_{x_2} T_1(x_2, x_3) F_{24}(x_2, x_4)$$
$$T_3(x_4, x_5) = \sum_{x_3} T_2(x_3, x_4) F_{34}(x_3, x_4) F_{35}(x_3, x_5)$$
$$T_4(x_5) = \sum_{x_4} T_3(x_4, x_5), \qquad T_5 = \sum_{x_5} T_4(x_5) = Z$$
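The same $T_k$ recursion, written out as numpy tables (a sketch with the same made-up stand-in factors as before); each line corresponds to one equation above:

```python
import numpy as np

rng = np.random.default_rng(0)
r = 3
F123, F24 = rng.random((r, r, r)), rng.random((r, r))
F34, F35 = rng.random((r, r)), rng.random((r, r))

T1 = F123.sum(axis=0)                                   # T1[x2, x3]
T2 = (T1[:, :, None] * F24[:, None, :]).sum(axis=0)     # T2[x3, x4]
T3 = (T2[:, :, None] * F34[:, :, None] * F35[:, None, :]).sum(axis=0)  # T3[x4, x5]
T4 = T3.sum(axis=0)                                     # T4[x5]
Z = T4.sum()                                            # T5 = Z
```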

29 Dynamic Programming for Normalization-constant: Example
The computational-complexity analysis is similar to the argmax case.

30 Dynamic Programming for Expectation
If $p(x) = \prod_{c \in \mathcal{C}} F_c(x_c)$ and $J(x) = J(x_A)$, for some function $J$ and some $A \subset S$, then computing
$$E(J(X_A)) = \sum_x J(x_A) \prod_{c \in \mathcal{C}} F_c(x_c)$$
is identical to the normalization-constant case, except that $\mathcal{C}$ is augmented with the clique $A$.
The computational-complexity analysis is similar to the argmax case.
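A sketch of the augmentation trick on the toy example, with a hypothetical table J[x3, x5] playing the role of $J(x_A)$ for $A = \{3, 5\}$ (both J and the random factors are illustrative assumptions); since the factors here are unnormalized, the augmented sum is divided by $Z$ at the end:

```python
import numpy as np

rng = np.random.default_rng(0)
r = 3
F123, F24 = rng.random((r, r, r)), rng.random((r, r))
F34, F35 = rng.random((r, r)), rng.random((r, r))
J = rng.random((r, r))  # hypothetical J[x3, x5] = J(x_A), A = {3, 5}

def eliminate(extra=None):
    """The T_k recursion; `extra`, if given, is absorbed as the clique {3, 5}."""
    T1 = F123.sum(axis=0)
    T2 = (T1[:, :, None] * F24[:, None, :]).sum(axis=0)
    A3 = T2[:, :, None] * F34[:, :, None] * F35[:, None, :]
    if extra is not None:
        A3 = A3 * extra[:, None, :]   # treat J(x3, x5) like any other factor
    return A3.sum(axis=0).sum()       # sum out x3, then x4 and x5

Z = eliminate()
EJ = eliminate(J) / Z   # E[J(X_A)] = (1/Z) sum_x J(x_A) prod_c F_c(x_c)
```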

31 Dynamic Programming for Marginals
Let $n = |S|$ and $p = \frac{1}{Z} \prod_{c \in \mathcal{C}} F_c(x_c)$. By induction,
$$T_k(x_{b_k}) = \sum_{x_{1:k}} \prod_{c \in \mathcal{C} : c \cap \{1, \ldots, k\} \neq \emptyset} F_c(x_c), \quad k = 1, 2, \ldots, n$$
$$\Rightarrow\ p(x_{(k+1):n}) = \sum_{x_{1:k}} p(x) = \frac{1}{Z} \sum_{x_{1:k}} \prod_{c \in \mathcal{C}} F_c(x_c) = \frac{1}{Z} \prod_{c \in \mathcal{C} : c \cap \{1, \ldots, k\} = \emptyset} F_c(x_c) \sum_{x_{1:k}} \prod_{c \in \mathcal{C} : c \cap \{1, \ldots, k\} \neq \emptyset} F_c(x_c) = \frac{1}{Z} \prod_{c \in \mathcal{C} : c \cap \{1, \ldots, k\} = \emptyset} F_c(x_c)\, T_k(x_{b_k})$$
There are ways to organize the computations so that any set of marginals can be computed while avoiding repeated sub-computations.
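For instance, in the toy example the $k = 4$ case of this identity gives $p(x_5) = T_4(x_5)/Z$, since every clique intersects $\{1, \ldots, 4\}$. A sketch (same random stand-in factors), checked against marginalizing the explicitly normalized joint table:

```python
import numpy as np

rng = np.random.default_rng(0)
r = 3
F123, F24 = rng.random((r, r, r)), rng.random((r, r))
F34, F35 = rng.random((r, r)), rng.random((r, r))

T1 = F123.sum(axis=0)
T2 = (T1[:, :, None] * F24[:, None, :]).sum(axis=0)
T3 = (T2[:, :, None] * F34[:, :, None] * F35[:, None, :]).sum(axis=0)
T4 = T3.sum(axis=0)        # T4[x5]
Z = T4.sum()
p_x5 = T4 / Z              # the marginal p(x5): no clique avoids {1,...,4}

# Check against the marginal of the full normalized joint table.
full = (F123[:, :, :, None, None] * F24[None, :, None, :, None]
        * F34[None, None, :, :, None] * F35[None, None, :, None, :])
assert np.allclose(p_x5, full.sum(axis=(0, 1, 2, 3)) / full.sum())
```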

32 Dynamic Programming for Marginals: Example
[Figure: dependency graph for $p(x_2, x_3, x_4, x_5) = \sum_{x_1} p(x)$]
$\eta_1 = \{2, 3\}$ was already fully connected.

33 Dynamic Programming for Marginals: Example
[Figure: dependency graph for $p(x_3, x_4, x_5) = \sum_{x_2} \sum_{x_1} p(x)$]
$\eta_2 = \{3, 4\}$ was already fully connected.

34 Dynamic Programming for Marginals: Example
[Figure: dependency graph for $p(x_4, x_5) = \sum_{x_3} \sum_{x_2} \sum_{x_1} p(x)$]
$\eta_3 = \{4, 5\}$ was not fully connected, so we added an edge.

35 Dynamic Programming for Marginals: Example
[Figure: dependency graph for $p(x_5) = \sum_{x_4} \sum_{x_3} \sum_{x_2} \sum_{x_1} p(x)$]
$\eta_4 = \{5\}$ is (trivially) already fully connected.

36 Sampling from p
Assume we know how to sample $u \sim U(0, 1)$ (a continuous uniform RV).
Suppose we want to sample from some pmf $p$ over a finite set of states. If we can enumerate all the states according to some ordering, there is a simple way to do it using $u \sim U(0, 1)$ and the inverse of the CDF. But this is impractical if there are too many states (e.g., exponentially many, as in distributions over images).
If $p$ is an MRF, we can again resort to Dynamic Programming, with the same caveats as we had for argmax, normalization constants, marginals, etc.
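For a small, enumerable state space, the inverse-CDF recipe is just a few lines; a minimal sketch with a made-up 3-state pmf (the numbers are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
pmf = np.array([0.2, 0.5, 0.3])        # a made-up pmf over states {0, 1, 2}
cdf = np.cumsum(pmf)
u = rng.uniform()                      # u ~ U(0, 1)
state = int(np.searchsorted(cdf, u))   # smallest state whose cumulative prob. reaches u
```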

37 Dynamic Programming for Sampling
Note that
$$p(x) = p(x_n)\, p(x_{n-1} \mid x_n)\, p(x_{n-2} \mid x_{(n-1):n})\, p(x_{n-3} \mid x_{(n-2):n}) \cdots p(x_1 \mid x_{2:n})$$
If $p = \frac{1}{Z} \prod_{c \in \mathcal{C}} F_c(x_c)$, then, using what we saw for marginals,
$$p(x_k \mid x_{(k+1):n}) = \frac{p(x_{k:n})}{p(x_{(k+1):n})} = \frac{\prod_{c \in \mathcal{C} : c \cap \{1, \ldots, k-1\} = \emptyset} F_c(x_c)\, T_{k-1}(x_{b_{k-1}})}{\prod_{c \in \mathcal{C} : c \cap \{1, \ldots, k\} = \emptyset} F_c(x_c)\, T_k(x_{b_k})} = \prod_{c \in \mathcal{C}_k} F_c(x_c)\, \frac{T_{k-1}(x_{b_{k-1}})}{T_k(x_{b_k})}, \quad k = 2, 3, \ldots, n$$
$$p(x_n) = \frac{1}{Z} \left( \prod_{c \in \mathcal{C}_n} F_c(x_c) \right) T_{n-1}(x_{b_{n-1}}) = \frac{1}{Z} \left( \prod_{c \in \mathcal{C}_n} F_c(x_c) \right) T_{n-1}(x_n)$$

38 Dynamic Programming for Sampling
Since we have explicit representations for $p(x_n)$ and $p(x_k \mid x_{(k+1):n})$, $k = 1, 2, \ldots, n-1$, we can sample from $p$:
$$x_n \sim p(x_n), \qquad x_k \sim p(x_k \mid x_{(k+1):n}), \quad k = n-1, n-2, \ldots, 1$$
If all the variables are scalars, these sampling operations are particularly easy.
Note that $p(x_k \mid x_{(k+1):n})$ depends only on $x_k$ and $x_{b_k}$. Thus $p(x_k \mid x_{(k+1):n}) = p(x_k \mid x_{b_k})$.
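Putting the pieces together for the toy example: a sketch (random stand-in factors again, not from the slides) that draws $x_5$ from $T_4/Z$ and then each earlier site from its explicit conditional; per the remark above, each conditional involves only $x_{b_k}$.

```python
import numpy as np

rng = np.random.default_rng(0)
r = 3
F123, F24 = rng.random((r, r, r)), rng.random((r, r))
F34, F35 = rng.random((r, r)), rng.random((r, r))

T1 = F123.sum(axis=0)
T2 = (T1[:, :, None] * F24[:, None, :]).sum(axis=0)
T3 = (T2[:, :, None] * F34[:, :, None] * F35[:, None, :]).sum(axis=0)
T4 = T3.sum(axis=0)
Z = T4.sum()

def draw(pmf):
    """Inverse-CDF draw from a 1D pmf (each conditional below sums to 1)."""
    return int(np.searchsorted(np.cumsum(pmf), rng.uniform()))

x5 = draw(T4 / Z)                                             # p(x5)
x4 = draw(T3[:, x5] / T4[x5])                                 # p(x4 | x5)
x3 = draw(F34[:, x4] * F35[:, x5] * T2[:, x4] / T3[x4, x5])   # p(x3 | x4, x5)
x2 = draw(F24[:, x4] * T1[:, x3] / T2[x3, x4])                # p(x2 | x3, x4)
x1 = draw(F123[:, x2, x3] / T1[x2, x3])                       # p(x1 | x2, x3)
sample = (x1, x2, x3, x4, x5)                                 # an exact draw from p
```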

39 Dynamic Programming
Excellent bookkeeping methods exist for reusing calculations.
Example (HMM):
$$p(x \mid y) \propto \prod_{k=2}^n F_{k,k-1}(x_k, x_{k-1}) \prod_{k=1}^n G_k(x_k, y_k) \propto \prod_{k=2}^n \widetilde{F}_{k,k-1}(x_k, x_{k-1})$$
For a fixed $y$ (absorbing the $G_k$'s into the pairwise factors) and the visit order $1, 2, \ldots, n$:
$$T_k(x_{k+1}) = \sum_{x_{1:k}} \prod_{i=2}^{k+1} F_{i-1,i}(x_{i-1}, x_i) \quad \text{(forward)}$$
$$\widetilde{T}_k(x_k) = \sum_{x_{(k+1):n}} \prod_{i=k+1}^{n} F_{i-1,i}(x_{i-1}, x_i) \quad \text{(backward)}$$
$$p(x_i \mid y) = \frac{T_{i-1}(x_i)\, \widetilde{T}_i(x_i)}{T_n}, \quad i = n-1, n-2, \ldots, 1$$
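A minimal sketch of this chain (forward/backward) bookkeeping under the reconstruction above, with random stand-in pairwise tables (the observation factors assumed already absorbed); every single-site marginal comes out of one forward and one backward sweep:

```python
import numpy as np

rng = np.random.default_rng(0)
n, r = 6, 3
F = [None, None] + [rng.random((r, r)) for _ in range(n - 1)]  # F[i][u, v] ~ F_{i-1,i}(u, v), i = 2..n

fwd = [None] * (n + 1)   # fwd[i][v]: sum over x_{1:i-1} of the prefix factors, with x_i = v
fwd[1] = np.ones(r)
for i in range(2, n + 1):
    fwd[i] = fwd[i - 1] @ F[i]       # sum_{x_{i-1}} fwd[i-1](x_{i-1}) F_i(x_{i-1}, x_i)

bwd = [None] * (n + 1)   # bwd[i][v]: sum over x_{(i+1):n} of the suffix factors, with x_i = v
bwd[n] = np.ones(r)
for i in range(n - 1, 0, -1):
    bwd[i] = F[i + 1] @ bwd[i + 1]   # sum_{x_{i+1}} F_{i+1}(x_i, x_{i+1}) bwd[i+1](x_{i+1})

Z = fwd[n].sum()                     # the normalizer (T_n in the slide's notation)
marg = [fwd[i] * bwd[i] / Z for i in range(1, n + 1)]  # marg[i-1][v] = p(x_i = v | y)
assert all(np.isclose(m.sum(), 1.0) for m in marg)     # each marginal sums to 1
```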

40 Sampling When Dynamic Programming is Inapplicable
Examples:
The RVs are continuous (easy for Gaussians, but usually not in other cases).
Discrete RVs with a countably infinite state space.
Discrete RVs where the graph is too large and complicated.
The good news: we can resort to MCMC procedures that, asymptotically, produce a sample that behaves like a sample from the $p$ we want. This is, in fact, true in general for a complicated $p$, regardless of whether it is an MRF or not. In the MRF case, we can do it while exploiting the local structure. This is what a method called Gibbs sampling (and its variants) is based on.
