Biclustering of gene expression data by non-smooth non-negative matrix factorization

Biomed Central LTD
Background: The extended use of microarray technologies has enabled the generation and accumulation of gene expression datasets that contain expression levels of thousands of genes across tens or hundreds of different experimental conditions. One of the major challenges in the analysis of such datasets is to discover local structures composed of sets of genes that show coherent expression patterns across subsets of experimental conditions. These patterns may provide clues about the main biological processes associated with different physiological states.

Results: In this work we present a methodology that clusters genes and conditions that are highly related within sub-portions of the data. Our approach is based on a new data mining technique, Non-smooth Non-Negative Matrix Factorization (nsNMF), which is able to identify localized patterns in large datasets. We assessed the potential of this methodology by analyzing several synthetic datasets as well as two large and heterogeneous sets of gene expression profiles. In all cases the method was able to identify localized features related to sets of genes that show consistent expression patterns across subsets of experimental conditions. The uncovered structures showed a clear biological meaning in terms of relationships among the functional annotations of genes and the phenotypes or physiological states of the associated conditions.

Conclusion: The proposed approach can be a useful tool for analyzing large and heterogeneous gene expression datasets. The method is able to identify complex relationships among genes and conditions that are difficult to identify with standard clustering algorithms.
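The abstract does not spell out the factorization, so the following is only a minimal sketch of how an nsNMF-style biclustering could look, not the authors' implementation. It assumes the commonly cited nsNMF model V ≈ W S H with smoothing matrix S = (1 − θ)I + (θ/k)11ᵀ, and it swaps in Frobenius-norm multiplicative updates for simplicity. The function names `nsnmf` and `top_members`, the parameter values, and the synthetic matrix are all hypothetical choices made for illustration.

```python
import numpy as np


def nsnmf(V, k, theta=0.5, n_iter=200, seed=0, eps=1e-9):
    """Sketch of non-smooth NMF: V ~ W @ S @ H with W, H >= 0.

    Assumption: Frobenius-norm multiplicative updates are used here;
    this is an illustrative simplification, not the published algorithm.
    """
    rng = np.random.default_rng(seed)
    n_genes, n_conds = V.shape
    W = rng.random((n_genes, k)) + eps
    H = rng.random((k, n_conds)) + eps
    # Smoothing matrix: theta = 0 gives plain NMF, larger theta pushes
    # W and H toward sparser (more localized) factors.
    S = (1.0 - theta) * np.eye(k) + (theta / k) * np.ones((k, k))
    for _ in range(n_iter):
        WS = W @ S
        H *= (WS.T @ V) / (WS.T @ WS @ H + eps)   # update H with W @ S fixed
        SH = S @ H
        W *= (V @ SH.T) / (W @ SH @ SH.T + eps)   # update W with S @ H fixed
    return W, S, H


def top_members(weights, n):
    """Indices of the n largest entries of a factor (hypothetical helper)."""
    return np.argsort(-weights)[:n]


# Synthetic example: low background plus one planted gene/condition block.
rng = np.random.default_rng(1)
V_demo = rng.random((30, 20)) * 0.1
V_demo[:10, :5] += 5.0

W_demo, S_demo, H_demo = nsnmf(V_demo, k=2)
rel_err = np.linalg.norm(V_demo - W_demo @ S_demo @ H_demo) / np.linalg.norm(V_demo)

# A bicluster candidate for factor j pairs the genes with the largest
# coefficients in column j of W with the conditions with the largest
# coefficients in row j of H.
for j in range(2):
    genes = top_members(W_demo[:, j], 10)
    conds = top_members(H_demo[j, :], 5)
    print(f"factor {j}: genes {sorted(genes.tolist())}, conditions {sorted(conds.tolist())}")
```

In this sketch each factor yields one candidate bicluster. How many genes and conditions to keep per factor is left as a fixed cutoff here; in practice that threshold would have to be chosen by the analyst.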
© 2006 Carmona-Saez et al; licensee BioMed Central Ltd. This work has been supported by the Spanish grants GR/SAL/0653/2004, CICYT BFU2004-00217/BMC, GEN2003-20235-c05-05, TIN2005-5619, PR27/05-13964-BSCH and a collaborative grant between the Spanish Research Council and the National Research Council of Canada (CSIC050402040003). The authors also thank the KEY Foundation for Brain-Mind Research in Zurich for partial financial support of this work. P.C.S. is the recipient of a fellowship from Comunidad de Madrid (CAM). A.P.M. acknowledges the support of the Spanish Ramón y Cajal program.