RESEARCH ARTICLE. The Penalized Biclustering Model And Related Algorithms Supplemental Online Material

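Journal of Applied Statistics, Taylor & Francis (ISSN 0266-4763 print, 1360-0532 online)

Thierry Chekouo and Alejandro Murua
Département de mathématiques et de statistique, Université de Montréal, CP 6128, succ. centre-ville, Montréal, Québec H3C 3J7, Canada

Corresponding author: A. Murua. E-mail: murua@dms.umontreal.ca

A. Introduction

In these sections we provide further details on the EM and Bayesian algorithms described in the main body of the paper. Section B provides the EM updating equations for the plaid model. Section C displays the full conditional distributions of the labels and parameters associated with the penalized plaid model of Section 4. The procedure to initialize the parameters and labels for the Markov chain Monte Carlo implementation of the penalized plaid model is described in Section D. Section E gives the URLs of the biclustering packages used in this study, including our own package implementing the penalized plaid model.

B. The EM updating equations

Note that the bicluster and combination bicluster probabilities $p(\rho_i \kappa_j)$ are constants depending only on the combination bicluster. We will denote them by $\pi_k$, with $k$ indexing the combination biclusters. Observe that $p(\rho, \kappa \mid Y, \theta) = \prod_{i,j} p(\rho_i, \kappa_j \mid y_{ij}, \theta)$. It is straightforward to verify that

\[
p(\rho_i, \kappa_j \mid y_{ij}, \theta) = p(\rho_i \kappa_j \mid y_{ij}, \theta)
= \frac{\pi_{cb}\, \sigma(\rho_i \kappa_j)^{-1}\, \phi\big( (y_{ij} - \mu(\rho_i \kappa_j, \theta)) / \sigma(\rho_i \kappa_j) \big)}
{\sum_{\rho_i' \kappa_j'} \pi_{cb'}\, \sigma(\rho_i' \kappa_j')^{-1}\, \phi\big( (y_{ij} - \mu(\rho_i' \kappa_j', \theta)) / \sigma(\rho_i' \kappa_j') \big)},
\]

where $\phi$ is the standard normal density, and $cb$ and $cb'$ denote the combination biclusters associated with $(\rho_i, \kappa_j)$ and $(\rho_i', \kappa_j')$, respectively. The maximizer of the EM objective $Q$ for the plaid model is obtained by taking the derivatives with respect to the parameters. Writing $E_\theta$ for the conditional expectation given the data under the current parameter value, these yield:

\[
\mu_k = \frac{1}{\sum_{i,j} E_\theta(\rho_{ik}\kappa_{jk})} \sum_{i,j} \Big\{ E_\theta(\rho_{ik}\kappa_{jk})\, y_{ij} - \sum_{l \neq k} E_\theta(\rho_{ik}\kappa_{jk}\rho_{il}\kappa_{jl}) (\mu_l + \alpha_{il} + \beta_{jl}) \Big\},
\]
\[
\alpha_{ik} = \frac{1}{\sum_{j} E_\theta(\rho_{ik}\kappa_{jk})} \sum_{j} \Big\{ E_\theta(\rho_{ik}\kappa_{jk})\, y_{ij} - \sum_{l \neq k} E_\theta(\rho_{ik}\kappa_{jk}\rho_{il}\kappa_{jl}) (\mu_l + \alpha_{il} + \beta_{jl}) \Big\} - \mu_k,
\]
\[
\beta_{jk} = \frac{1}{\sum_{i} E_\theta(\rho_{ik}\kappa_{jk})} \sum_{i} \Big\{ E_\theta(\rho_{ik}\kappa_{jk})\, y_{ij} - \sum_{l \neq k} E_\theta(\rho_{ik}\kappa_{jk}\rho_{il}\kappa_{jl}) (\mu_l + \alpha_{il} + \beta_{jl}) \Big\} - \mu_k,
\]
\[
\pi_{cb_k} \propto \sum_{i,j} p(\rho_i \kappa_j = \rho_{cb_k} \kappa_{cb_k} \mid y_{ij}, \theta),
\]

where $\rho_{cb_k} \kappa_{cb_k}$ denotes the corresponding $k$-th combination bicluster, and

\[
\sigma^2 = \frac{1}{pq} \sum_{i,j} E_\theta\Big[ \Big( y_{ij} - \sum_{k} \rho_{ik}\kappa_{jk} (\mu_k + \alpha_{ik} + \beta_{jk}) \Big)^2 \Big]
\]
\[
\phantom{\sigma^2} = \frac{1}{pq} \sum_{i,j} \Big\{ y_{ij}^2 - 2 \sum_{k} E_\theta(\rho_{ik}\kappa_{jk})\, y_{ij} (\mu_k + \alpha_{ik} + \beta_{jk}) + \sum_{k,l} E_\theta(\rho_{ik}\kappa_{jk}\rho_{il}\kappa_{jl}) (\mu_k + \alpha_{ik} + \beta_{jk})(\mu_l + \alpha_{il} + \beta_{jl}) \Big\}.
\]

Note that the updating equations are recursive. The parameters can be estimated using a Gauss-Seidel relaxation scheme over $k = 1, \ldots, K$. For example, let the superscript $(t+1)$ denote the coefficients recently updated, and the superscript $(t)$ the coefficients not yet updated. Then, in order to solve the system, say for the $\alpha_{ik}$'s, we iterate over $k$ within the EM iterations:

\[
\alpha_{ik}^{(t+1)} = \frac{1}{\sum_{j} E_\theta(\rho_{ik}\kappa_{jk})} \sum_{j} \Big\{ E_\theta(\rho_{ik}\kappa_{jk})\, y_{ij}
- \sum_{l < k} E_\theta(\rho_{ik}\kappa_{jk}\rho_{il}\kappa_{jl}) \big(\mu_l^{(t+1)} + \alpha_{il}^{(t+1)} + \beta_{jl}^{(t+1)}\big)
- \sum_{l > k} E_\theta(\rho_{ik}\kappa_{jk}\rho_{il}\kappa_{jl}) \big(\mu_l^{(t)} + \alpha_{il}^{(t)} + \beta_{jl}^{(t)}\big) \Big\} - \mu_k^{(t+1)}.
\]

Also note that the expectation is intractable if the number of biclusters $K$ is large. For example, one needs to compute

\[
E_\theta(\rho_{ik}\kappa_{jk}) = \sum_{\rho_i \kappa_j :\, \rho_{ik}\kappa_{jk} = 1} p(\rho_i \kappa_j \mid y_{ij}, \theta),
\]

which is a sum over combination biclusters whose number of terms grows exponentially with $K$. In the non-overlapping bicluster model, the sum reduces to one term, $E_\theta(\rho_{ik}\kappa_{jk}) = p(\rho_{ik}\kappa_{jk} = 1,\ \rho_{il}\kappa_{jl} = 0 \text{ for } l \neq k \mid y_{ij}, \theta)$. In this latter case, the updating equations simplify to

\[
\mu_k = \frac{\sum_{i,j} E_\theta(\rho_{ik}\kappa_{jk})\, y_{ij}}{\sum_{i,j} E_\theta(\rho_{ik}\kappa_{jk})}, \qquad
\alpha_{ik} = \frac{\sum_{j} E_\theta(\rho_{ik}\kappa_{jk})\, y_{ij}}{\sum_{j} E_\theta(\rho_{ik}\kappa_{jk})} - \mu_k, \qquad
\beta_{jk} = \frac{\sum_{i} E_\theta(\rho_{ik}\kappa_{jk})\, y_{ij}}{\sum_{i} E_\theta(\rho_{ik}\kappa_{jk})} - \mu_k,
\]
\[
\sigma_k^2 = \frac{1}{\sum_{i,j} E_\theta(\rho_{ik}\kappa_{jk})} \sum_{i,j} E_\theta(\rho_{ik}\kappa_{jk}) (y_{ij} - \mu_k - \alpha_{ik} - \beta_{jk})^2, \qquad
\pi_k \propto \sum_{i,j} p(\rho_{ik}\kappa_{jk} = 1 \mid y_{ij}, \theta).
\]

C. The penalized plaid model: the full conditionals

The labels

Note that the likelihood may be written as

\[
\exp\Big\{ -\frac{1}{2} \sum_{i,j} (1 - \gamma_{ij}) \Big[ \frac{\big( y_{ij} - \sum_{k} \rho_{ik}\kappa_{jk} (\mu_k + \alpha_{ik} + \beta_{jk}) \big)^2}{\sigma^2(\rho_i, \kappa_j)} + \log \sigma^2(\rho_i, \kappa_j) \Big]
- \frac{1}{2} \sum_{i,j} \gamma_{ij} \Big[ \frac{(y_{ij} - \mu_0)^2}{\sigma_0^2} + \log \sigma_0^2 \Big] \Big\}.
\]

Let the bicluster $k$ be fixed. Define the variables $z_{ijk} = y_{ij} - \sum_{l \neq k} \rho_{il}\kappa_{jl} (\mu_l + \alpha_{il} + \beta_{jl})$, $\alpha_k = (\alpha_{ik})_{i \in I_k} \in R^{r_k}$, and $\beta_k = (\beta_{jk})_{j \in J_k} \in R^{c_k}$. To find the full conditional of the labels, say $\rho_{ik}$, we use the fact that

\[
\Big( y_{ij} - \sum_{k=1}^{K} \rho_{ik}\kappa_{jk} (\mu_k + \alpha_{ik} + \beta_{jk}) \Big)^2
= \big( z_{ijk} - \rho_{ik}\kappa_{jk} (\mu_k + \alpha_{ik} + \beta_{jk}) \big)^2
= \rho_{ik}\kappa_{jk} (z_{ijk} - \mu_k - \alpha_{ik} - \beta_{jk})^2 + (1 - \rho_{ik}) z_{ijk}^2 + \rho_{ik} (1 - \kappa_{jk}) z_{ijk}^2.
\]

Note that $\gamma_{ij} = \prod_{k=1}^{K} (1 - \rho_{ik}\kappa_{jk})$. For a given $k$, we will write $\gamma_{ij}^{-k} = \prod_{l \neq k} (1 - \rho_{il}\kappa_{jl})$. Then $\rho_{ik}\kappa_{jk} (1 - \gamma_{ij}) = \rho_{ik}\kappa_{jk} - \gamma_{ij}^{-k} (1 - \rho_{ik}\kappa_{jk}) \rho_{ik}\kappa_{jk} = \rho_{ik}\kappa_{jk}$. We have

\[
\sum_{j} (1 - \gamma_{ij}) \frac{\big( z_{ijk} - \rho_{ik}\kappa_{jk} (\mu_k + \alpha_{ik} + \beta_{jk}) \big)^2}{\sigma^2(\rho_i, \kappa_j)}
= \rho_{ik} \sum_{j \in J_k} \frac{(z_{ijk} - \mu_k - \alpha_{ik} - \beta_{jk})^2}{\sigma^2(\rho_i, \kappa_j)}
+ \rho_{ik} \sum_{j \notin J_k} \frac{z_{ijk}^2 (1 - \gamma_{ij}^{-k})}{\sigma^2(\rho_i, \kappa_j)}
+ (1 - \rho_{ik}) \Big[ \sum_{j \in J_k} \frac{z_{ijk}^2 (1 - \gamma_{ij}^{-k})}{\sigma^2(\rho_i, \kappa_j)} + \sum_{j \notin J_k} \frac{z_{ijk}^2 (1 - \gamma_{ij}^{-k})}{\sigma^2(\rho_i, \kappa_j)} \Big].
\]

As before, let $\theta$ denote the set of parameters of the model. Define

\[
A_{ik} = \exp\Big\{ -\frac{1}{2} \sum_{j \in J_k} \frac{(z_{ijk} - \mu_k - \alpha_{ik} - \beta_{jk})^2}{\sigma^2(\rho_i, \kappa_j)} \Big\} \Big/ \prod_{j \in J_k} \sigma(\rho_i, \kappa_j),
\]
\[
B_{ik} = \exp\Big\{ \frac{1}{2\sigma_0^2} \sum_{j \in J_k} \gamma_{ij}^{-k} (y_{ij} - \mu_0)^2 \Big\}\, (\sigma_0^2)^{\sum_{j \in J_k} \gamma_{ij}^{-k} / 2},
\]
\[
C_{ik} = \exp\Big\{ \frac{1}{2} \sum_{j \in J_k} (1 - \gamma_{ij}^{-k}) \Big[ \frac{z_{ijk}^2}{\sigma^2(\rho_i, \kappa_j)} + \log \sigma^2(\rho_i, \kappa_j) \Big] \Big\},
\]
\[
D_{ik,\rho_{ik}} = \exp\Big\{ -\frac{1}{2} \sum_{j \notin J_k} (1 - \gamma_{ij}^{-k}) \Big[ \frac{z_{ijk}^2}{\sigma^2(\rho_i, \kappa_j)} + \log \sigma^2(\rho_i, \kappa_j) \Big] \Big\}.
\]

Also let $\rho_{(-ik)}$ denote the set of all row labels except $\rho_{ik}$. From the above equation it is straightforward to verify that the full conditionals of $\rho_{ik}$ satisfy

\[
p(\rho_{ik} \mid \{y_{ij}\}, \rho_{(-ik)}, \kappa, \theta) \propto A_{ik}^{\rho_{ik}}\, B_{ik}^{\rho_{ik}}\, C_{ik}^{\rho_{ik}}\, D_{ik,\rho_{ik}}\, \pi(\rho_{ik}),
\]

where

\[
\pi(\rho_{ik}) = \exp\Big\{ -\lambda \sum_{j} \Big( \sum_{l \neq k} \rho_{il}\kappa_{jl} - 1 + \gamma_{ij}^{-k} + (1 - \gamma_{ij}^{-k}) \kappa_{jk} \rho_{ik} \Big) \Big\}.
\]

In particular, the ratio $p(\rho_{ik} = 1 \mid \{y_{ij}\}, \rho_{(-ik)}, \kappa, \theta) / p(\rho_{ik} = 0 \mid \{y_{ij}\}, \rho_{(-ik)}, \kappa, \theta)$ is given by

\[
A_{ik}\, B_{ik}\, C_{ik}\, \frac{D_{ik,1}}{D_{ik,0}}\, \exp\Big\{ -\lambda \sum_{j} (1 - \gamma_{ij}^{-k}) \kappa_{jk} \Big\}.
\]

The term $D_{ik,\rho_{ik}}$ may be ignored for models whose variances do not depend on $(i, j)$. In particular, for the plaid model, the logarithm of this ratio is

\[
\frac{1}{2\sigma^2} \sum_{j \in J_k} \Big\{ -(z_{ijk} - \mu_k - \alpha_{ik} - \beta_{jk})^2 + (1 - \gamma_{ij}^{-k}) z_{ijk}^2 + \gamma_{ij}^{-k} (y_{ij} - \mu_0)^2 \Big\} - \lambda \sum_{j} (1 - \gamma_{ij}^{-k}) \kappa_{jk}.
\]

The full conditionals of the $\kappa_{jk}$'s are found in a similar way, by symmetry.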
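The plaid-model log-ratio for a row label can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function and argument names are hypothetical, the quantities $z_{ijk}$ and $\gamma_{ij}^{-k}$ for row $i$ are assumed to be precomputed, and all variances (including that of the zero-bicluster) are taken equal to a common $\sigma^2$, as in the plaid model. It is not the authors' Java implementation.

```python
import math


def plaid_label_log_ratio(z, y, gamma_minus_k, kappa_k, beta_k,
                          mu_k, alpha_ik, mu0, sigma2, lam):
    """Log of p(rho_ik = 1 | rest) / p(rho_ik = 0 | rest) for the plaid model.

    All list arguments are indexed by column j and refer to one fixed row i
    and one fixed bicluster k (hypothetical interface):
      z[j]             : z_ijk, data minus the fit of the other biclusters
      y[j]             : y_ij
      gamma_minus_k[j] : 1 if cell (i, j) belongs to no bicluster other than k
      kappa_k[j]       : 1 if column j belongs to bicluster k
      beta_k[j]        : column effect beta_jk
    """
    fit = 0.0       # likelihood part, summed over columns in J_k
    overlap = 0.0   # number of columns of J_k already covered elsewhere
    for j, in_bicluster in enumerate(kappa_k):
        if not in_bicluster:
            continue
        g = gamma_minus_k[j]
        resid = z[j] - mu_k - alpha_ik - beta_k[j]
        fit += -resid ** 2 + (1 - g) * z[j] ** 2 + g * (y[j] - mu0) ** 2
        overlap += 1 - g
    return fit / (2.0 * sigma2) - lam * overlap


def prob_label_one(log_ratio):
    """Turn the log-ratio into P(rho_ik = 1 | rest), avoiding overflow."""
    if log_ratio >= 0:
        return 1.0 / (1.0 + math.exp(-log_ratio))
    e = math.exp(log_ratio)
    return e / (1.0 + e)
```

In a Gibbs sweep one would draw $\rho_{ik}$ as a Bernoulli variable with success probability `prob_label_one(...)`. The penalty term only counts columns of bicluster $k$ whose cells in row $i$ already belong to another bicluster, so larger values of $\lambda$ discourage overlap.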

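The row and column effects

Define the matrices $R_k = \mathrm{diag}\big( \sum_{j \in J_k} \sigma^{-2}(\rho_i, \kappa_j) \big)_{i \in I_k}$ and $C_k = \mathrm{diag}\big( \sum_{i \in I_k} \sigma^{-2}(\rho_i, \kappa_j) \big)_{j \in J_k}$. Let $1_m$ denote the vector of all 1s in $R^m$. Since the variance of $\alpha_k$ is given by $\sigma_\alpha^2 V_k = \sigma_\alpha^2 \big( I_{r_k} - \frac{1}{r_k} 1_{r_k} 1_{r_k}^T \big)$, we may write $\alpha_k = V_k a_k$ for a random vector $a_k \sim N(0, \sigma_\alpha^2 I_{r_k})$. It is easy to verify that the full conditional of $a_k$ is a multivariate normal with mean $\mu_{a,k}$ and variance $\Sigma_{a,k}$ given by

\[
\mu_{a,k} = \big( V_k R_k V_k + \sigma_\alpha^{-2} I_{r_k} \big)^{-1} V_k z_{\alpha,k}, \qquad
\Sigma_{a,k} = \big( V_k R_k V_k + \sigma_\alpha^{-2} I_{r_k} \big)^{-1},
\]

where $z_{\alpha,k} = \big( \sum_{j \in J_k} (z_{ijk} - \mu_k - \beta_{jk}) / \sigma^2(\rho_i, \kappa_j) \big)_{i \in I_k}$. Similarly, let $U_k = I_{c_k} - \frac{1}{c_k} 1_{c_k} 1_{c_k}^T$. We may write $\beta_k = U_k b_k$ for a random vector $b_k \sim N(0, \sigma_\beta^2 I_{c_k})$. It is easy to verify that the full conditional of $b_k$ is a multivariate normal with mean $\mu_{b,k}$ and variance $\Sigma_{b,k}$ given by

\[
\mu_{b,k} = \big( U_k C_k U_k + \sigma_\beta^{-2} I_{c_k} \big)^{-1} U_k z_{\beta,k}, \qquad
\Sigma_{b,k} = \big( U_k C_k U_k + \sigma_\beta^{-2} I_{c_k} \big)^{-1},
\]

where $z_{\beta,k} = \big( \sum_{i \in I_k} (z_{ijk} - \mu_k - \alpha_{ik}) / \sigma^2(\rho_i, \kappa_j) \big)_{j \in J_k}$.

For the plaid model $\sigma^2(\rho_i, \kappa_j) = \sigma^2$, and for the model of Cheng and Church [1], $\sigma^2(\rho_i, \kappa_j) = \sigma_k^2$. In both cases the variance is constant on each bicluster. Therefore, for these models, $R_k = \sigma_k^{-2} c_k I_{r_k}$ and $C_k = \sigma_k^{-2} r_k I_{c_k}$ (with $\sigma_k = \sigma$ for the plaid model). Hence the conditional means and variances for $a_k$ and $b_k$ become

\[
\mu_{a,k} = \Big( \frac{c_k}{\sigma_k^2} + \frac{1}{\sigma_\alpha^2} \Big)^{-1} \frac{c_k}{\sigma_k^2} \big( \bar z_{i \cdot k} - \bar z_k \big)_{i \in I_k}, \qquad
\Sigma_{a,k} = \Big( \frac{c_k}{\sigma_k^2} + \frac{1}{\sigma_\alpha^2} \Big)^{-1} \Big( I_{r_k} + \frac{c_k \sigma_\alpha^2}{r_k \sigma_k^2} 1_{r_k} 1_{r_k}^T \Big),
\]
\[
\mu_{b,k} = \Big( \frac{r_k}{\sigma_k^2} + \frac{1}{\sigma_\beta^2} \Big)^{-1} \frac{r_k}{\sigma_k^2} \big( \bar z_{\cdot j k} - \bar z_k \big)_{j \in J_k}, \qquad
\Sigma_{b,k} = \Big( \frac{r_k}{\sigma_k^2} + \frac{1}{\sigma_\beta^2} \Big)^{-1} \Big( I_{c_k} + \frac{r_k \sigma_\beta^2}{c_k \sigma_k^2} 1_{c_k} 1_{c_k}^T \Big),
\]

where $\bar z_k$ denotes the mean of the values of $z_{ijk}$ in the bicluster $k$, $\bar z_{i \cdot k} = \sum_{j \in J_k} z_{ijk} / c_k$, and $\bar z_{\cdot j k} = \sum_{i \in I_k} z_{ijk} / r_k$.

It can be easily shown that the full conditionals of the means $\mu_k$, $k = 0, 1, \ldots, K$, are also normal distributions, with means and variances given by

\[
\mu_{\mu,k} = \Big( \frac{1}{\sigma_\mu^2} + \sum_{(i,j) \in B_k} \frac{1}{\sigma^2(\rho_i, \kappa_j)} \Big)^{-1} \sum_{(i,j) \in B_k} \frac{z_{ijk} - \alpha_{ik} - \beta_{jk}}{\sigma^2(\rho_i, \kappa_j)}, \qquad
\Sigma_{\mu,k} = \Big( \frac{1}{\sigma_\mu^2} + \sum_{(i,j) \in B_k} \frac{1}{\sigma^2(\rho_i, \kappa_j)} \Big)^{-1},
\]

where $B_k$ denotes the set of cells of bicluster $k$ and $n_k$ its number of cells. Again, for the plaid and Cheng and Church models, the means and variances simplify to

\[
\mu_{\mu,k} = \Big( \frac{1}{\sigma_\mu^2} + \frac{n_k}{\sigma_k^2} \Big)^{-1} \frac{n_k}{\sigma_k^2}\, \bar z_k, \qquad
\Sigma_{\mu,k} = \Big( \frac{1}{\sigma_\mu^2} + \frac{n_k}{\sigma_k^2} \Big)^{-1}.
\]

Note that when $\sigma_\mu^2$, $\sigma_\alpha^2$, and $\sigma_\beta^2$ tend to infinity we obtain the hard-EM (or ICM) estimators.

The variances

Let $z_{ij} = y_{ij} - \sum_{k} \rho_{ik}\kappa_{jk} (\mu_k + \alpha_{ik} + \beta_{jk})$. The full conditionals of the variances $\sigma_k^2$ are proportional to

\[
\exp\Big\{ -\frac{1}{2\sigma_k^2} \Big( \sum_{(i,j) \in B_k} z_{ij}^2 + \nu s^2 \Big) - \frac{n_k}{2} \log \sigma_k^2 - \Big( \frac{\nu}{2} + 1 \Big) \log \sigma_k^2 \Big\}.
\]

If we suppose that there is no overlapping among the biclusters, we obtain the Cheng and Church model [1]. The corresponding full conditional of $\sigma_k^2$ is an inverse-$\chi^2$ distribution with scale $\big( \nu s^2 + \sum_{(i,j) \in B_k} z_{ij}^2 \big) / (\nu + n_k)$ and $\nu + n_k$ degrees of freedom. If instead we suppose that $\sigma^2(\rho_i, \kappa_j) = \sigma^2$ independently of the cell $(i, j)$ (i.e., $\sigma_k = \sigma$ for all $k = 0, 1, \ldots, K$), then we obtain the full conditional of $\sigma^2$ for the plaid model. This is also an inverse-$\chi^2$ distribution, but this time with scale $\big( \nu s^2 + \sum_{i,j} z_{ij}^2 \big) / (\nu + pq)$ and $\nu + pq$ degrees of freedom.

D. The penalized plaid model: initial values

Finding the initial membership labels $(\rho, \kappa)$ is a difficult task. Several procedures have been suggested in the literature. We have adopted a technique similar to that of Turner et al. [3]. We run two independent $k$-means algorithms [4] with $k = 2$: once for the rows and once for the columns. Using the Cartesian product of the resulting 2-means row and column labels, we divide the data matrix into four groups. A single initial bicluster is chosen among these four groups according to a variance criterion explained a few lines below. The procedure is repeated as many times as the number of initial biclusters needed, so that one initial bicluster is chosen after each application of the independent row and column 2-means algorithms. The elements of the biclusters already chosen are masked by replacing their original values $y_{ij}$ by random values. This is done so that a different group may be chosen in the next iteration. The masking procedure is not new: it has been used before by Sheng, Moreau and De Moor [2] to determine multiple biclusters.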
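One selection step of this initialization can be sketched in Python as follows. The sketch assumes that the 2-means row and column labels are already available, and it scores each of the four groups with the variance criterion stated in the text that follows, namely the ratio $(\hat\sigma_{g\alpha}^2 + \hat\sigma_{g\beta}^2) / \hat\sigma_{ge}^2$ built from two-way ANOVA mean squares. The function names, the list-of-lists data layout, and the choice to skip degenerate groups (fewer than two rows or columns, or zero error mean square) are assumptions made for illustration only.

```python
from itertools import product


def four_groups(row_labels, col_labels):
    """Cartesian product of row and column labels -> {(a, b): (rows, cols)}."""
    groups = {}
    for a, b in product(sorted(set(row_labels)), sorted(set(col_labels))):
        rows = [i for i, lab in enumerate(row_labels) if lab == a]
        cols = [j for j, lab in enumerate(col_labels) if lab == b]
        groups[(a, b)] = (rows, cols)
    return groups


def variance_ratio(Y, rows, cols):
    """Moment-based criterion (sigma_alpha^2 + sigma_beta^2) / sigma_e^2."""
    r, c = len(rows), len(cols)
    if r < 2 or c < 2:
        return float("-inf")  # assumption: degenerate groups are never chosen
    grand = sum(Y[i][j] for i in rows for j in cols) / (r * c)
    row_mean = {i: sum(Y[i][j] for j in cols) / c for i in rows}
    col_mean = {j: sum(Y[i][j] for i in rows) / r for j in cols}
    # Mean sums of squares of the additive two-way layout without replication.
    mss_alpha = c * sum((row_mean[i] - grand) ** 2 for i in rows) / (r - 1)
    mss_beta = r * sum((col_mean[j] - grand) ** 2 for j in cols) / (c - 1)
    mss_e = sum((Y[i][j] - row_mean[i] - col_mean[j] + grand) ** 2
                for i in rows for j in cols) / ((r - 1) * (c - 1))
    if mss_e <= 0.0:
        return float("-inf")  # assumption: skip groups with no residual variation
    var_alpha = (mss_alpha - mss_e) / c   # row-effect variance estimate
    var_beta = (mss_beta - mss_e) / r     # column-effect variance estimate
    return (var_alpha + var_beta) / mss_e


def pick_initial_bicluster(Y, row_labels, col_labels):
    """Return the key of the group with the largest criterion, and its value."""
    groups = four_groups(row_labels, col_labels)
    scores = {key: variance_ratio(Y, rows, cols)
              for key, (rows, cols) in groups.items()}
    best = max(scores, key=scores.get)
    return best, scores[best]
```

After a group is selected, its cells would be masked with random values before the two 2-means runs are repeated; that step is left out of the sketch.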
The criterion to choose an initial bicluster among the four groups yielded by each iteration is the following. Suppose that the cells of each group follow a random effect additive ANOVA model. That is, on each group $g$,

\[
y_{ij} = \mu_g + \alpha_{ig} + \beta_{jg} + \epsilon_{ij}, \qquad g = 1, 2, 3, 4.
\]

The standard moment estimates of the variances are

\[
\hat\sigma_{g\alpha}^2 = \frac{MSS_g(\alpha) - MSS_g(e)}{c_g}, \qquad
\hat\sigma_{g\beta}^2 = \frac{MSS_g(\beta) - MSS_g(e)}{r_g}, \qquad
\hat\sigma_{ge}^2 = MSS_g(e),
\]

where $r_g$ is the number of rows in the $g$-th group, $c_g$ is the number of columns, and $MSS_g(e)$, $MSS_g(\alpha)$ and $MSS_g(\beta)$ are the corresponding mean sums of squares for error, rows and columns, respectively. We select as an initial bicluster the group $g$ that maximizes $(\hat\sigma_{g\alpha}^2 + \hat\sigma_{g\beta}^2) / \hat\sigma_{ge}^2$.

For each initial bicluster, the parameters $\mu_g$, $\alpha_{ig}$ and $\beta_{jg}$ are initialized as $\bar y_{\cdot\cdot}$, $\bar y_{i\cdot} - \bar y_{\cdot\cdot}$, and $\bar y_{\cdot j} - \bar y_{\cdot\cdot}$, respectively, where $\bar y_{\cdot\cdot}$, $\bar y_{i\cdot}$, $\bar y_{\cdot j}$ stand for the overall bicluster mean, the bicluster $i$-th row mean, and the bicluster $j$-th column mean, respectively. The parameter $\mu_0$ is estimated as the arithmetic mean of the zero-bicluster. The variance $\sigma^2$ is initialized as the mean sum of squares of all the residuals.

E. Biclustering algorithm packages

The following table gives the URLs associated with the biclustering packages used in the comparison study presented in Section 5 of the main body of this paper. The package names are given in parentheses after the associated algorithm names.

Algorithm (package): description and address

FABIA (fabia)
    R package: Factor Analysis for Bicluster Acquisition.
    www.bioconductor.org/packages/release/bioc/html/fabia.html

SAMBA (EXPANDER)
    EXpression Analyzer and DisplayER: Java-based tool for analysis of gene expression and NGS (Next Generation Sequencing) data.
    acgt.cs.tau.ac.il/expander/overview.html

TUR (biclust)
    R package: the plaid model implementation of Turner et al.
    cran.r-project.org/web/packages/biclust/

CC (biclust)
    R package: original method suggested by Cheng and Church to fit a mixture model.
    cran.r-project.org/web/packages/biclust/

Spectral (biclust)
    R package: uses a singular value decomposition.
    cran.r-project.org/web/packages/biclust/

Penalized plaid (penalizedplaid)
    Java package. The models are: (1) the non-overlapping bicluster model; (2) the plaid model; and (3) the penalized plaid model. The estimation methods are: (a) the Gibbs sampler and (b) the Metropolis-Hastings algorithm.
    www.dms.umontreal.ca/~murua/

References

[1] Y. Cheng and G. Church, Biclustering of expression data, Int. Conf. Intelligent Systems for Molecular Biology (2000), pp. 93-103.
[2] Q. Sheng, Y. Moreau, and B. De Moor, Biclustering microarray data by Gibbs sampling, Bioinformatics 19 (2003), pp. ii196-ii205.
[3] H. Turner, T. Bailey, and W. Krzanowski, Improved biclustering of microarray data demonstrated through systematic performance tests, Computational Statistics & Data Analysis 48 (2005), pp. 235-254.
[4] J.H. Ward, Hierarchical groupings to optimize an objective function, J. American Statistical Association 58 (1963), pp. 236-244.