• G. Povoleri, E. Nova-Lamperti, C. Scottà, G. Fanelli, Y. Chen, P. Becker, D. Boardman, B. Costantini, M. Romano, P. Pavlidis, R. McGregor, E. Pantazi, D. Chauss, H. Sun, H. Shih, D. Cousins, N. Cooper, N. Powell, C. Kemper, M. Pirooznia, A. Laurence, S. Kordasti, M. Kazemian, G. Lombardi, B. Afzali, “Retinoic acid-regulated CD161+ Tregs support wound repair in intestinal mucosa”, Nature Immunology 2018.

Repair of tissue damaged during inflammatory processes is key to the return of local homeostasis and restoration of epithelial integrity. Here we describe CD161+ regulatory T (Treg) cells as a distinct, highly suppressive population of Treg cells that mediate wound healing. These Treg cells were enriched in intestinal lamina propria, particularly in Crohn’s disease. CD161+ Treg cells had an all-trans retinoic acid (ATRA)-regulated gene signature, and CD161 expression on Treg cells was induced by ATRA, which directly regulated the CD161 gene. CD161 was co-stimulatory, and ligation with the T cell antigen receptor induced cytokines that accelerated the wound healing of intestinal epithelial cells. We identified a transcription-factor network, including BACH2, RORγt, FOSL2, AP-1 and RUNX1, that controlled expression of the wound-healing program, and found a CD161+ Treg cell signature in Crohn’s disease mucosa associated with reduced inflammation. These findings identify CD161+ Treg cells as a population involved in controlling the balance between inflammation and epithelial barrier healing in the gut.

  • J. Lin, N. Du, P. Li, M. Kazemian, T. Gebregiorgis, R. Spolski, WJ Leonard, “Critical roles for STAT5 tetramers in the maturation and survival of natural killer cells”, Nature Communications, 8 (1), 1320, (2017).

Interleukin-15 (IL-15) is essential for the development and maintenance of natural killer (NK) cells. IL-15 activates STAT5 proteins, which can form dimers or tetramers. We previously found that NK cell numbers are decreased in Stat5a-Stat5b tetramer-deficient double knockin (DKI) mice, but the mechanism was not investigated. Here we show that STAT5 dimers are sufficient for NK cell development, whereas STAT5 tetramers mediate NK cell maturation and the expression of maturation-associated genes. Unlike the defective proliferation of Stat5 DKI CD8+ T cells, Stat5 DKI NK cells have normal proliferation to IL-15 but are susceptible to death upon cytokine withdrawal, with lower Bcl2 and increased active caspases. These findings underscore the importance of STAT5 tetramers in maintaining NK cell homoeostasis. Moreover, defective STAT5 tetramer formation could represent a cause of NK cell immunodeficiency, and interrupting STAT5 tetramer formation might serve to control NK leukaemia.

  • B. Afzali, J. Grönholm, J. Vandrovcova, C. O'Brien, H. Sun, I. Vanderleyden, FP Davis, A. Khoder, Y. Zhang, AN Hegazy, AV Villarino, IW Palmer, J. Kaufman, NR Watts, M. Kazemian, O. Kamenyeva, J. Keith, A. Sayed, D. Kasperaviciute, M. Mueller, JD Hughes, IJ Fuss, M F Sadiyah, K Montgomery-Recht, J McElwee, NP Restifo, W. Strober, MA Linterman, PT Wingfield, HH Uhlig, R. Roychoudhuri, TJ Aitman, P. Kelleher, MJ Lenardo, JJ O'Shea, N. Cooper, ADJ Laurence, “BACH2 immunodeficiency illustrates an association between super-enhancers and haploinsufficiency”, Nature Immunology 2017 Jul;18(7):813-823. doi: 10.1038/ni.3753 (2017).

The transcriptional programs that guide lymphocyte differentiation depend on the precise expression and timing of transcription factors (TFs). The TF BACH2 is essential for T and B lymphocytes and is associated with an archetypal super-enhancer (SE). Single-nucleotide variants in the BACH2 locus are associated with several autoimmune diseases, but BACH2 mutations that cause Mendelian monogenic primary immunodeficiency have not previously been identified. Here we describe a syndrome of BACH2-related immunodeficiency and autoimmunity (BRIDA) that results from BACH2 haploinsufficiency. Affected subjects had lymphocyte-maturation defects that caused immunoglobulin deficiency and intestinal inflammation. The mutations disrupted protein stability by interfering with homodimerization or by causing aggregation. We observed analogous lymphocyte defects in Bach2-heterozygous mice. More generally, we observed that genes that cause monogenic haploinsufficient diseases were substantially enriched for TFs and SE architecture. These findings reveal a previously unrecognized feature of SE architecture in Mendelian diseases of immunity: heterozygous mutations in SE-regulated genes identified by whole-exome/genome sequencing may have greater significance than previously recognized.

  • EE West, R. Spolski, M. Kazemian, C. Kemper, W. J. Leonard, “TSLP acts on neutrophils to drive complement-mediated killing of methicillin-resistant Staphylococcus aureus”, Science Immunology, 18 Nov 2016:Vol. 1, Issue 5, eaaf8471, DOI: 10.1126/sciimmunol.aaf8471 (2016).

Community-acquired Staphylococcus aureus infections often present as serious skin infections in otherwise healthy individuals and have become a worldwide epidemic problem fueled by the emergence of strains with antibiotic resistance, such as methicillin-resistant S. aureus (MRSA). The cytokine thymic stromal lymphopoietin (TSLP) is highly expressed in the skin and in other barrier surfaces and plays a deleterious role by promoting T helper cell type 2 (TH2) responses during allergic diseases; however, its role in host defense against bacterial infections has not been well elucidated. We describe a previously unrecognized non-TH2 role for TSLP in enhancing neutrophil killing of MRSA during an in vivo skin infection. Specifically, we demonstrate that TSLP acts directly on both mouse and human neutrophils to augment control of MRSA. Additionally, we show that TSLP also enhances killing of Streptococcus pyogenes, another clinically important cause of human skin infections. Unexpectedly, TSLP mechanistically mediates its antibacterial effect by directly engaging the complement C5 system to modulate production of reactive oxygen species by neutrophils. Thus, TSLP increases MRSA killing in a neutrophil- and complement-dependent manner, revealing a key connection between TSLP and the innate complement system, with potentially important therapeutic implications for control of MRSA infection.

  • Y. Sun, H. Zhang, M. Kazemian, JM. Troy, C. Seward, X. Lu, L Stubbs, “ZSCAN5B and primate-specific paralogs bind RNA polymerase III genes and extra-TFIIIC (ETC) sites to modulate mitotic progression”, Oncotarget. 7(45): 72571–72592 (2016).

Mammalian genomes contain hundreds of genes transcribed by RNA Polymerase III (Pol III), encoding noncoding RNAs and especially the tRNAs specialized to carry specific amino acids to the ribosome for protein synthesis. In addition to this well-known function, tRNAs and their genes (tDNAs) serve a variety of other critical cellular functions. For example, tRNAs and other Pol III transcripts can be cleaved to yield small RNAs with potent regulatory activities. Furthermore, from yeast to mammals, active tDNAs and related “extra-TFIIIC” (ETC) loci provide the DNA scaffolds for the most ancient known mechanism of three-dimensional chromatin architecture. Here we identify the ZSCAN5 TF family - including mammalian ZSCAN5B and its primate-specific paralogs - as proteins that occupy mammalian Pol III promoters and ETC sites. We show that ZSCAN5B binds with high specificity to a conserved subset of Pol III genes in human and mouse. Furthermore, primate-specific ZSCAN5A and ZSCAN5D also bind Pol III genes, although ZSCAN5D preferentially localizes to MIR SINE- and LINE2-associated ETC sites. ZSCAN5 genes are expressed in proliferating cell populations and are cell-cycle regulated, and siRNA knockdown experiments suggested a cooperative role in regulation of mitotic progression. Consistent with this prediction, ZSCAN5A knockdown led to increasing numbers of cells in mitosis and the appearance of cells. Together, these data implicate the role of ZSCAN5 genes in regulation of Pol III genes and nearby Pol II loci, ultimately influencing cell cycle progression and differentiation in a variety of tissues.

  • M. Kazemian, M. Ren, J. Lin, W. Liao, R. Spolski, W. J. Leonard, “Comprehensive assembly of novel transcripts from unmapped human RNA-Sequencing data and their association with cancer”, Molecular Systems Biology 11 (8), 826, (2015).

Crucial parts of the genome including genes encoding microRNAs and noncoding RNAs went unnoticed for years, and even now, despite extensive annotation and assembly of the human genome, RNA‐sequencing continues to yield millions of unmappable and thus uncharacterized reads. Here, we examined > 300 billion reads from 536 normal donors and 1,873 patients encompassing 21 cancer types, identified ~300 million such uncharacterized reads, and using a distinctive approach de novo assembled 2,550 novel human transcripts, which mainly represent long noncoding RNAs. Of these, 230 exhibited relatively specific expression or non‐expression in certain cancer types, making them potential markers for those cancers, whereas 183 exhibited tissue specificity. Moreover, we used lentiviral‐mediated expression of three selected transcripts that had higher expression in normal than in cancer patients and found that each inhibited the growth of HepG2 cells. Our analysis provides a comprehensive and unbiased resource of unmapped human transcripts and reveals their associations with specific cancers, providing potentially important new genes for therapeutic targeting.

  • C. Wan, A.B. Andraski, R. Spolski, P. Li, M. Kazemian, J. Oh, L. Samsel, P.A. Swanson, D.B. McGavern, E.P. Sampaio, A.F. Freeman, J.D. Milner, S.M. Holland, W.J. Leonard, “Opposing Roles of STAT1 and STAT3 in IL-21 Function in CD4+ T cells”, PNAS, accepted (2015).

IL-21 is a type I cytokine important for immune cell differentiation and function. We found that transcription factors STAT1 and STAT3 play partially opposing roles in IL-21 function in CD4+ T cells. Both STAT1 and STAT3 control IL-21-mediated gene regulation, with some genes including Ifng, Tbx21, and Il21 reciprocally regulated by these STATs. IFN-γ production was also differentially regulated by these STATs in vitro during CD4+ T cell differentiation and in vivo during chronic lymphocytic choriomeningitis infection. Importantly, IL-21-induced IFNG and TBX21 expression was higher in CD4+ T cells from patients with autosomal dominant hyper-IgE syndrome or with STAT1 gain-of-function mutations, suggesting that dysregulated IL-21-STAT signaling partially explains the clinical manifestations of these patients.

  • M. Kazemian, M. Ren, JX Lin, W. Liao, R. Spolski, W.J. Leonard, “Possible HPV38 contamination of endometrial cancer RNA-Seq samples in The Cancer Genome Atlas database”, J. Virology, doi: 10.1128/JVI.00822-15 (2015).

Viruses are causally associated with a number of human malignancies. In this study, we sought to identify new viral-cancer associations by searching RNA-Sequencing datasets from >2000 patients, encompassing 21 cancers from The Cancer Genome Atlas (TCGA), for the presence of viral sequences. In agreement with previous studies, we found human papillomavirus type 16 (HPV16) and HPV18 in oropharyngeal cancer and hepatitis B and C viruses in liver cancer. Unexpectedly, however, we found HPV38, a cutaneous form of HPV associated with skin cancer, in 32 of 168 samples with endometrial cancer. In 12 of the HPV38+ samples, we observed at least one paired read that mapped to both human and HPV38 genomes, indicative of viral integration into host DNA, something not previously demonstrated for HPV38. The expression levels of HPV38 transcripts were relatively low, and all 32 HPV38+ samples belonged to the same experimental batch of 40 samples, whereas none of the other 128 endometrial carcinoma samples were HPV38+, raising doubts about the significance of the HPV38 association. Moreover, the HPV38+ samples contained the same 10 novel single nucleotide variations (SNVs), leading us to hypothesize that one patient was infected with this new isolate of HPV38, which was integrated into his/her genome and may have cross-contaminated other TCGA samples within batch #228. Based on our analysis, we propose guidelines to examine batch effect, virus expression level, and SNVs as part of NGS data analysis for evaluating the significance of viral/pathogen sequences in clinical samples.

  • C. Blatti, M. Kazemian, S. Wolfe, M. Brodsky, S. Sinha, Integrating motif, DNA accessibility, and gene expression data to build regulatory maps in an organism. Nucleic Acids Research. (2015), 43: 3998-4012. Selected as breakthrough paper.

Characterization of cell type specific regulatory networks and elements is a major challenge in genomics, and emerging strategies frequently employ high-throughput genome-wide assays of transcription factor (TF) to DNA binding, histone modifications or chromatin state. However, these experiments remain too difficult/expensive for many laboratories to apply comprehensively to their system of interest. Here, we explore the potential of elucidating regulatory systems in varied cell types using computational techniques that rely on only data of gene expression, low-resolution chromatin accessibility, and TF–DNA binding specificities (‘motifs’). We show that static computational motif scans overlaid with chromatin accessibility data reasonably approximate experimentally measured TF–DNA binding. We demonstrate that predicted binding profiles and expression patterns of hundreds of TFs are sufficient to identify major regulators of ∼200 spatiotemporal expression domains in the Drosophila embryo. We are then able to learn reliable statistical models of enhancer activity for over 70 expression domains and apply those models to annotate domain specific enhancers genome-wide. Throughout this work, we apply our motif and accessibility based approach to comprehensively characterize the regulatory network of fruitfly embryonic development and show that the accuracy of our computational method compares favorably to approaches that rely on data from many experimental assays.
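
As a rough illustration of overlaying a static motif scan with accessibility data, the Python sketch below scores every window of a sequence against a position weight matrix and keeps only the hits that fall inside accessible intervals. The toy motif, the uniform background, the threshold and all function names are hypothetical; none of them are taken from the paper or its software.

```python
import math

# Hypothetical 4-bp motif given as per-position base probabilities (toy values).
PWM = [
    {"A": 0.85, "C": 0.05, "G": 0.05, "T": 0.05},
    {"A": 0.05, "C": 0.05, "G": 0.85, "T": 0.05},
    {"A": 0.05, "C": 0.85, "G": 0.05, "T": 0.05},
    {"A": 0.05, "C": 0.05, "G": 0.05, "T": 0.85},
]
BACKGROUND = 0.25  # assumed uniform background base frequency


def window_score(window):
    """Log-odds score (base 2) of one window against the motif."""
    return sum(math.log2(col[base] / BACKGROUND) for col, base in zip(PWM, window))


def scan(sequence, accessible, threshold=4.0):
    """Return (start, score) for motif hits lying fully inside an accessible interval.

    accessible: list of half-open (start, end) intervals on the sequence.
    """
    width = len(PWM)
    hits = []
    for start in range(len(sequence) - width + 1):
        score = window_score(sequence[start:start + width])
        inside = any(a <= start and start + width <= b for a, b in accessible)
        if score >= threshold and inside:
            hits.append((start, round(score, 2)))
    return hits


seq = "TTAGCTAAAGCTTT"  # contains the toy motif AGCT at positions 2 and 8
print(scan(seq, accessible=[(0, 8)]))   # only the first occurrence is accessible
print(scan(seq, accessible=[(0, 14)]))  # both occurrences are accessible
```

Restricting the interval list changes which motif matches are reported, which is the sense in which accessibility is "overlaid" on the scan in this sketch.
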

  • M. Kazemian, K. Suryamohan, J. Chen, Y. Zhang, Md. A. H. Samee, M. S. Halfon, S. Sinha, Evidence for deep regulatory similarities in early developmental programs across highly diverged insects. Genome Biology and Evolution. 6 (9): 2301-2320. doi: 10.1093/gbe/evu184.

Many genes familiar from Drosophila development, such as the so-called gap, pair-rule, and segment polarity genes, play important roles in the development of other insects and in many cases appear to be deployed in a similar fashion, despite the fact that Drosophila-like “long germband” development is highly derived and confined to a subset of insect families. Whether or not these similarities extend to the regulatory level is unknown. Identification of regulatory regions beyond the well-studied Drosophila has been challenging as even within the Diptera (flies, including mosquitoes) regulatory sequences have diverged past the point of recognition by standard alignment methods. Here, we demonstrate that methods we previously developed for computational cis-regulatory module (CRM) discovery in Drosophila can be used effectively in highly diverged (250–350 Myr) insect species including Anopheles gambiae, Tribolium castaneum, Apis mellifera, and Nasonia vitripennis. In Drosophila, we have successfully used small sets of known CRMs as “training data” to guide the search for other CRMs with related function. We show here that although species-specific CRM training data do not exist, training sets from Drosophila can facilitate CRM discovery in diverged insects. We validate in vivo over a dozen new CRMs, roughly doubling the number of known CRMs in the four non-Drosophila species. Given the growing wealth of Drosophila CRM annotation, these results suggest that extensive regulatory sequence annotation will be possible in newly sequenced insects without recourse to costly and labor-intensive genome-scale experiments. We develop a new method, Regulus, which computes a probabilistic score of similarity based on binding site composition (despite the absence of nucleotide-level sequence alignment), and demonstrate similarity between functionally related CRMs from orthologous loci. Our work represents an important step toward being able to trace the evolutionary history of gene regulatory networks and defining the mechanisms underlying insect evolution.

  • T. Duque, Md. A. H. Samee, M. Kazemian, H. N. Pham, M. H. Brodsky, S. Sinha, “Simulations of enhancer evolution provide mechanistic insights into gene regulation”, Mol Biol Evol, doi:10.1093/molbev/mst170, (2013)

There is growing interest in models of regulatory sequence evolution. However, existing models specifically designed for regulatory sequences consider the independent evolution of individual transcription factor (TF) binding sites, ignoring that the function and evolution of a binding site depends on its context, typically the cis-regulatory module (CRM) in which the site is located. Moreover, existing models do not account for the gene-specific roles of TF-binding sites, primarily because their roles often are not well-understood. We introduce two models of regulatory sequence evolution that address some of the shortcomings of existing models and implement simulation frameworks based on them. One model simulates the evolution of an individual binding site in the context of a CRM, while the other evolves an entire CRM. Both models use a state-of-the-art sequence-to-expression model to predict the effects of mutations on the regulatory output of the CRM and determine the strength of selection. We use the new framework to simulate the evolution of TF-binding sites in 37 well-studied CRMs belonging to the anterior-posterior patterning system in Drosophila embryos. We show that these simulations provide accurate fits to evolutionary data from 12 Drosophila genomes, which includes statistics of binding site conservation on relatively short evolutionary scales and site loss across larger divergence times. The new framework allows us, for the first time, to test hypotheses regarding the underlying cis-regulatory code by directly comparing the evolutionary implications of the hypothesis to observed evolutionary dynamics of binding sites. Using this capability, we find that explicitly modeling self-cooperative DNA-binding by the TF Caudal (CAD) provides significantly better fits than an otherwise identical evolutionary simulation that lacks this mechanistic aspect. This hypothesis is further supported by a statistical analysis of the distribution of inter-site spacing between adjacent CAD sites. Experimental tests confirm direct homodimeric interaction between CAD molecules as well as self-cooperative DNA-binding by CAD. We note that computational modeling of the D. melanogaster CRMs alone did not yield significant evidence to support CAD self-cooperativity. We thus demonstrate how specific mechanistic details encoded in CRMs can be revealed by modeling their evolution and fitting such models to multi-species data.

  • M. Kazemian, H. Pham, M. Brodsky, S. Sinha, “Widespread and distinct sequence signatures of combinatorial transcriptional regulation”, Nucleic Acids Research, doi: 10.1093/nar/gkt598, (2013)

Regulation of eukaryotic gene transcription is often combinatorial in nature, with multiple transcription factors (TFs) regulating common target genes, often through direct or indirect mutual interactions. Many individual examples of cooperative binding by directly interacting TFs have been identified, but it remains unclear how pervasive this mechanism is during animal development. Cooperative TF binding should be manifest in genomic sequences as biased arrangements of TF-binding sites. Here, we explore the extent and diversity of such arrangements related to gene regulation during Drosophila embryogenesis. We used the DNA-binding specificities of 322 TFs along with chromatin accessibility information to identify enriched spacing and orientation patterns of TF-binding site pairs. We developed a new statistical approach for this task, specifically designed to accurately assess inter-site spacing biases while accounting for the phenomenon of homotypic site clustering commonly observed in developmental regulatory regions. We observed a large number of short-range distance preferences between TF-binding site pairs, including examples where the preference depends on the relative orientation of the binding sites. To test whether these binding site patterns reflect physical interactions between the corresponding TFs, we analyzed 27 TF pairs whose binding sites exhibited short distance preferences. In vitro protein-protein binding experiments revealed that >65% of these TF pairs can directly interact with each other. For five pairs, we further demonstrate that they bind cooperatively to DNA if both sites are present with the preferred spacing. This study demonstrates how DNA-binding motifs can be used to produce a comprehensive map of sequence signatures for different mechanisms of combinatorial TF action.
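
To make the notion of spacing and orientation patterns concrete, here is a minimal Python sketch that tallies signed offsets between predicted sites of two factors, split by relative orientation. It is only a naive count: it does not implement the paper's statistical approach and makes no correction for homotypic site clustering. The site coordinates, the `max_gap` cutoff and the function name are made up for illustration.

```python
from collections import Counter


def spacing_counts(sites_a, sites_b, max_gap=20):
    """Count signed start-to-start offsets between sites of factor A and factor B.

    sites_a, sites_b: lists of (position, strand) tuples on one sequence,
    with strand given as "+" or "-". Only pairs within max_gap bp are counted.
    Keys are (offset, orientation), where orientation is "same" or "opposite".
    """
    counts = Counter()
    for pos_a, strand_a in sites_a:
        for pos_b, strand_b in sites_b:
            offset = pos_b - pos_a
            if offset != 0 and abs(offset) <= max_gap:
                orientation = "same" if strand_a == strand_b else "opposite"
                counts[(offset, orientation)] += 1
    return counts


# Hypothetical predicted sites for two factors (positions are invented).
a_sites = [(100, "+"), (250, "+"), (400, "-")]
b_sites = [(107, "+"), (257, "+"), (390, "+"), (900, "-")]
print(spacing_counts(a_sites, b_sites).most_common(3))
```

An offset that recurs far more often than others would be a candidate spacing preference, but judging whether it is enriched requires a proper null model, which this sketch leaves out.
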

Available source code for iTF:

iTFs_v1.0_Mac.tar.gz (Size: 608.243 Kb, Type: gz)
iTFs_v1.0_Linux.tar.gz (Size: 614.78 Kb, Type: gz)

  • Q. Cheng, M. Kazemian, H. Pham, C. Blatti, S. E. Celniker, S. A. Wolfe, M. H. Brodsky, S. Sinha, “Computational identification of diverse mechanisms underlying transcription factor-DNA occupancy”, PLoS Genetics. Aug;9(8):e1003571, (2013)

ChIP-based genome-wide assays of transcription factor (TF) occupancy have emerged as a powerful, high-throughput method to understand transcriptional regulation, especially on a global scale. This has led to great interest in the underlying biochemical mechanisms that direct TF-DNA binding, with the ultimate goal of computationally predicting a TF's occupancy profile in any cellular condition. In this study, we examined the influence of various potential determinants of TF-DNA binding on a much larger scale than previously undertaken. We used a thermodynamics-based model of TF-DNA binding, called “STAP,” to analyze 45 TF-ChIP data sets from Drosophila embryonic development. We built a cross-validation framework that compares a baseline model, based on the ChIP'ed (“primary”) TF's motif, to more complex models where binding by secondary TFs is hypothesized to influence the primary TF's occupancy. Candidate interacting TFs were chosen based on RNA-SEQ expression data from the time point of the ChIP experiment. We found widespread evidence of both cooperative and antagonistic effects by secondary TFs, and explicitly quantified these effects. We were able to identify multiple classes of interactions, including (1) long-range interactions between primary and secondary motifs (separated by ≤150 bp), suggestive of indirect effects such as chromatin remodeling, (2) short-range interactions with specific inter-site spacing biases, suggestive of direct physical interactions, and (3) overlapping binding sites suggesting competitive binding. Furthermore, by factoring out the previously reported strong correlation between TF occupancy and DNA accessibility, we were able to categorize the effects into those that are likely to be mediated by the secondary TF's effect on local accessibility and those that utilize accessibility-independent mechanisms. Finally, we conducted in vitro pull-down assays to test model-based predictions of short-range cooperative interactions, and found that seven of the eight TF pairs tested physically interact and that some of these interactions mediate cooperative binding to DNA.

  • M. S. Enuameh, Y. Asriyan, A. Richards, R. G. Christensen, V. L. Hall, M. Kazemian, C. Zhu, H. Pham, Q. Cheng, C. Blatti, J. A. Brasefield, M. D. Basciotta, J. Ou, J. C. McNulty, L. J. Zhu, S. E. Celniker, S. Sinha, G. D. Stormo, M. H. Brodsky, and S. Wolfe, “Global analysis of Drosophila Cys2-His2 zinc finger proteins reveals a multitude of novel recognition motifs and binding determinants”, Genome Research, doi:10.1101/gr.151472.112, (2013)

Cys2-His2 zinc finger proteins (ZFPs) are the largest group of transcription factors in higher metazoans. A complete characterization of these ZFPs and their associated target sequences is pivotal to fully annotate transcriptional regulatory networks in metazoan genomes. As a first step in this process, we have characterized the DNA-binding specificities of 129 zinc finger sets from Drosophila using a bacterial one-hybrid system. This data set contains the DNA-binding specificities for at least one encoded ZFP from 70 unique genes and 23 alternate splice isoforms representing the largest set of characterized ZFPs from any organism described to date. These recognition motifs can be used to predict genomic binding sites for these factors within the fruit fly genome. Subsets of fingers from these ZFPs were characterized to define their orientation and register on their recognition sequences, thereby allowing us to define the recognition diversity within this finger set. We find that the characterized fingers can specify 47 of the 64 possible DNA triplets. To confirm the utility of our finger recognition models, we employed subsets of Drosophila fingers in combination with an existing archive of artificial zinc finger modules to create ZFPs with novel DNA-binding specificity. These hybrids of natural and artificial fingers can be used to create functional zinc finger nucleases for editing vertebrate genomes.

  • S. Shahinfar, H. Mehrabani-Yeganeh, C. Lucas, A. Kalhor, M. Kazemian, and K. A. Weigel, “Prediction of Breeding Values for Dairy Cattle Using Artificial Neural Networks and Neuro-Fuzzy Systems”, Computational and Mathematical Methods in Medicine, Volume 2012, Article ID 127130, 9 pages, doi:10.1155/2012/127130,  (2012)

Developing machine learning and soft computing techniques has provided many opportunities for researchers to establish new analytical methods in different areas of science. The objective of this study is to investigate the potential of two types of intelligent learning methods, artificial neural networks and neuro-fuzzy systems, in order to estimate breeding values (EBV) of Iranian dairy cattle. Initially, the breeding values of lactating Holstein cows for milk and fat yield were estimated using conventional best linear unbiased prediction (BLUP) with an animal model. Once that was established, a multilayer perceptron was used to build ANN to predict breeding values from the performance data of selection candidates. Subsequently, fuzzy logic was used to form an NFS, a hybrid intelligent system that was implemented via a local linear model tree algorithm. For milk yield the correlations between EBV and EBV predicted by the ANN and NFS were 0.92 and 0.93, respectively. Corresponding correlations for fat yield were 0.93 and 0.93, respectively. Correlations between multitrait predictions of EBVs for milk and fat yield when predicted simultaneously by ANN were 0.93 and 0.93, respectively, whereas corresponding correlations with reference EBV for multitrait NFS were 0.94 and 0.95, respectively, for milk and fat production.

  • M. Kazemian, Q. Zhu, M. S. Halfon, S. Sinha, “Improved accuracy of supervised CRM discovery with interpolated Markov models and cross-species comparison”, Nucl. Acids Res. (2011) first published online August 5, doi:10.1093/nar/gkr621, (2011)

Despite recent advances in experimental approaches for identifying transcriptional cis-regulatory modules (CRMs, 'enhancers'), direct empirical discovery of CRMs for all genes in all cell types and environmental conditions is likely to remain an elusive goal. Effective methods for computational CRM discovery are thus a critically needed complement to empirical approaches. However, existing computational methods that search for clusters of putative binding sites are ineffective if the relevant TFs and/or their binding specificities are unknown. Here, we provide a significantly improved method for 'motif-blind' CRM discovery that does not depend on knowledge or accurate prediction of TF-binding motifs and is effective when limited knowledge of functional CRMs is available to 'supervise' the search. We propose a new statistical method, based on 'Interpolated Markov Models', for motif-blind, genome-wide CRM discovery. It captures the statistical profile of variable length words in known CRMs of a regulatory network and finds candidate CRMs that match this profile. The method also uses orthologs of the known CRMs from closely related genomes. We perform in silico evaluation of predicted CRMs by assessing whether their neighboring genes are enriched for the expected expression patterns. This assessment uses a novel statistical test that extends the widely used Hypergeometric test of gene set enrichment to account for variability in intergenic lengths. We find that the new CRM prediction method is superior to existing methods. Finally, we experimentally validate 12 new CRM predictions by examining their regulatory activity in vivo in Drosophila; 10 of the tested CRMs were found to be functional, while 6 of the top 7 predictions showed the expected activity patterns. We make our program available as downloadable source code, and as a plugin for a genome browser installed on our servers. 
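
For readers unfamiliar with the Hypergeometric test of gene set enrichment mentioned above, the following Python sketch computes the standard upper-tail p-value. It is the plain test only: the length-aware extension described in the abstract is not reproduced here, and the function name, parameter names and example counts are illustrative assumptions.

```python
from math import comb  # available in Python 3.8+


def hypergeom_enrichment_pvalue(population, annotated, sample, overlap):
    """P(X >= overlap) when drawing `sample` genes without replacement.

    population: total number of genes considered
    annotated:  genes in the population with the expression pattern of interest
    sample:     genes selected, e.g. those neighboring predicted CRMs
    overlap:    selected genes that carry the annotation
    """
    total = comb(population, sample)
    upper = min(annotated, sample)
    tail = sum(
        comb(annotated, i) * comb(population - annotated, sample - i)
        for i in range(overlap, upper + 1)
    )
    return tail / total


# Made-up counts purely for illustration.
print(hypergeom_enrichment_pvalue(population=1000, annotated=50, sample=20, overlap=5))
```

The standard test treats every gene as equally likely to be selected; the abstract's extension is motivated by the fact that genes with longer intergenic regions are more likely to sit next to a prediction by chance.
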

Available source codes for enhancer prediction methods and Loci Length-aware Hypergeometric Test:

LLHT.tar.tar.gz (Size: 36.912 Kb, Type: gz)
HexMCD.zip (Size: 64.066 Kb, Type: zip)
IMM.zip (Size: 285.467 Kb, Type: zip)
PAC-rc.zip (Size: 876.802 Kb, Type: zip)

  • M. Kazemian, M. H. Brodsky, S. Sinha, “Genome surveyor 2.0: cis-regulatory analysis in Drosophila”, Nucl. Acids Res. (2011) first published online May 18, doi:10.1093/nar/gkr291, (2011)

Genome Surveyor 2.0 is a web-based tool for discovery and analysis of cis-regulatory elements in Drosophila, built on top of the GBrowse genome browser for convenient visualization. Genome Surveyor was developed as a tool for predicting transcription factor (TF) binding targets and cis-regulatory modules (CRMs/enhancers), based on motifs representing experimentally determined DNA binding specificities. Since its first publication, we have added substantial new functionality (e.g. phylogenetic averaging of motif scores from multiple species, and a novel CRM discovery technique), increased the number of supported motifs about 4-fold (from ∼100 to ∼400), added provisions for evolutionary comparison across many more Drosophila species (from 2 to 12), and improved the user-interface. The server is free and open to all users, and there is no login requirement. Address: http://veda.cs.uiuc.edu/gs.

  • L. J. Zhu, R. G. Christensen, M. Kazemian, C. J. Hull, M. S. Enuameh, M. D. Basciotta, J. A. Brasefield, C. Zhu, Y. Asriyan, D. S. Lapointe, S. Sinha, S. A. Wolfe, and M. H. Brodsky, “FlyFactorSurvey: a database of Drosophila transcription factor binding specificities determined using the bacterial one-hybrid system”. Nucl. Acids Res. (2010) first published online November 19, doi:10.1093/nar/gkq858, (2010)

FlyFactorSurvey (http://pgfe.umassmed.edu/TFDBS/) is a database of DNA binding specificities for Drosophila transcription factors (TFs) primarily determined using the bacterial one-hybrid system. The database provides community access to over 400 recognition motifs and position weight matrices for over 200 TFs, including many unpublished motifs. Search tools and flat file downloads are provided to retrieve binding site information (as sequences, matrices and sequence logos) for individual TFs, groups of TFs or for all TFs with characterized binding specificities. Linked analysis tools allow users to identify motifs within our database that share similarity to a query matrix or to view the distribution of occurrences of an individual motif throughout the Drosophila genome. Together, this database and its associated tools provide computational and experimental biologists with resources to predict interactions between Drosophila TFs and target cis-regulatory sequences.

  • M. Kazemian, C. Blatti, A. Richards, M. McCutchan, N. Wakabayashi-Ito, A. Hammonds, S. Celniker, S. Kumar, S. Wolfe, M. Brodsky, and S. Sinha. “Quantitative analysis of the Drosophila segmentation regulatory network using pattern generating potentials”. PLoS Biology. 8(8): e1000456. doi:10.1371/journal.pbio.1000456, (2010)
             † co-first authors

Cis-regulatory modules that drive precise spatial-temporal patterns of gene expression are central to the process of metazoan development. We describe a new computational strategy to annotate genomic sequences based on their "pattern generating potential" and to produce quantitative descriptions of transcriptional regulatory networks at the level of individual protein-module interactions. We use this approach to convert the qualitative understanding of interactions that regulate Drosophila segmentation into a network model in which a confidence value is associated with each transcription factor-module interaction. Sequence information from multiple Drosophila species is integrated with transcription factor binding specificities to determine conserved binding site frequencies across the genome. These binding site profiles are combined with transcription factor expression information to create a model to predict module activity patterns. This model is used to scan genomic sequences for the potential to generate all or part of the expression pattern of a nearby gene, obtained from available gene expression databases. Interactions between individual transcription factors and modules are inferred by a statistical method to quantify a factor's contribution to the module's pattern generating potential. We use these pattern generating potentials to systematically describe the location and function of known and novel cis-regulatory modules in the segmentation network, identifying many examples of modules predicted to have overlapping expression activities. Surprisingly, conserved transcription factor binding site frequencies were as effective as experimental measurements of occupancy in predicting module expression patterns or factor-module interactions. Thus, unlike previous module prediction methods, this method predicts not only the location of modules but also their spatial activity pattern and the factors that directly determine this pattern. 
As databases of transcription factor specificities and in vivo gene expression patterns grow, analysis of pattern generating potentials provides a general method to decode transcriptional regulatory sequences and networks.

  • M. Kazemian, B. Moshiri, C. Lucas, H. Nikbakht, V. Palade, "Using classifier fusion techniques for protein secondary structure prediction", Int. J. Comput. Intelligence in Bioinformatics and Systems Biology, Vol. 1, No. 4, pp. 418-434 (2010)

Classifier fusion techniques are gaining popularity for their ability to improve on the accuracy achieved by individual classifiers. A common approach is to combine the classifiers' outputs using simple methods, such as majority voting. In this paper, we build a meta-classifier by fusing some already well-known classifiers for protein structure prediction. Each individual classifier outputs a unique structure for every input residue. We have used the confusion matrix of each protein secondary structure classifier, which is representative of the classifier's expertness, as a general reusable pattern for converting its simple class-label assignment to class-preference scores. The results obtained using several classifier fusion operators have been compared, on some standard datasets from the EVA server, with simple majority voting and with the results provided by the individual classifiers. The comparative analysis showed that the Choquet fuzzy integral operator gave the largest improvement in accuracy, multi-class sensitivity and specificity over both the best-performing individual classifier and the other fusion operators, while all of the classifier fusion techniques yielded some improvement.
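
To make the idea of turning a hard class label into preference scores concrete, here is a minimal Python sketch. It assumes one plausible reading of the abstract: the column of a classifier's confusion matrix for the predicted label is normalized into an estimate of how likely each true class is. The matrices are invented, and the scores are fused with a plain mean rather than the Choquet fuzzy integral used in the paper.

```python
CLASSES = ("H", "E", "C")  # helix, strand, coil

def preference_from_confusion(confusion, predicted):
    """confusion[true][pred] holds counts; return estimated P(true | predicted)."""
    column = {t: confusion[t][predicted] for t in CLASSES}
    total = sum(column.values())
    if total == 0:
        # The classifier never emitted this label: fall back to a uniform guess.
        return {t: 1.0 / len(CLASSES) for t in CLASSES}
    return {t: column[t] / total for t in CLASSES}

def majority_vote(labels):
    """Pick the most frequent label (ties go to the first class in CLASSES)."""
    return max(CLASSES, key=labels.count)

def fuse_by_mean(confusions, labels):
    """Average the preference scores of all classifiers (a stand-in fusion operator)."""
    prefs = [preference_from_confusion(c, l) for c, l in zip(confusions, labels)]
    fused = {t: sum(p[t] for p in prefs) / len(prefs) for t in CLASSES}
    return max(fused, key=fused.get), fused

# Hypothetical confusion matrices: one reliable classifier and two weak ones.
strong = {"H": {"H": 95, "E": 3, "C": 2},
          "E": {"H": 3, "E": 90, "C": 7},
          "C": {"H": 2, "E": 5, "C": 93}}
weak = {"H": {"H": 50, "E": 45, "C": 5},
        "E": {"H": 20, "E": 40, "C": 40},
        "C": {"H": 10, "E": 15, "C": 75}}

labels = ["H", "E", "E"]  # per-classifier predictions for one residue
vote_label = majority_vote(labels)
fused_label, fused_scores = fuse_by_mean([strong, weak, weak], labels)
```

In this toy case the two weak classifiers outvote the reliable one under majority voting, while the score-based fusion follows the reliable classifier because the weak classifiers' "E" predictions are often wrong.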

  • M. R. Kantorovitz, M. Kazemian, S. Kinston, D. Miranda-Saavedra, Q. Zhu, G. E. Robinson, B. Göttgens, M. S. Halfon, S. Sinha, “Motif-Blind, genome-wide discovery of cis-regulatory modules in Drosophila and mouse”, Developmental Cell, Volume 17, Issue 4, 568-579, 20 October, (2009)
             † co-first authors

We present new approaches to cis-regulatory module (CRM) discovery in the common scenario where relevant transcription factors and/or motifs are unknown. Beginning with a small list of CRMs mediating a common gene expression pattern, we search genome-wide for CRMs with similar functionality, using new statistical scores and without requiring known motifs or accurate motif discovery. We cross-validate our predictions on 31 regulatory networks in Drosophila and through correlations with gene expression data. Five predicted modules tested using an in vivo reporter gene assay all show tissue-specific regulatory activity. We also demonstrate our methods' ability to predict mammalian tissue-specific enhancers. Finally, we predict human CRMs that regulate early blood and cardiovascular development. In vivo transgenic mouse analysis of two predicted CRMs demonstrates that both have appropriate enhancer activity. Overall, 7/7 predictions were validated successfully in vivo, demonstrating the effectiveness of our approach for insect and mammalian genomes.

  • A.H. Keyhanipoor, B. Moshiri, M. Kazemian, C. Lucas, “Aggregation of web search engines based on users' preferences in WebFusion”, Knowledge-based Systems, 20(4): 321-328, (2007)

The information users need is distributed across the databases of various search engines. It is inconvenient and inefficient for an ordinary user to invoke multiple search engines and identify useful documents from the returned results. Meta-search engines can provide unified access for their users. In this paper, a novel meta-search engine, named WebFusion, is introduced. WebFusion learns the expertness of the underlying search engines in a certain category based on the users’ preferences. It also uses the “click-through data concept” to give a content-oriented ranking score to each result page. Click-through data is implicit feedback on the users’ preferences; it is also used as a reinforcement signal in the learning process to predict the users’ preferences and reduce the seeking time in the returned results list. The decision lists of the underlying search engines have been fused using the ordered weighted averaging (OWA) approach, and the application of an optimistic operator as the weighting function has been investigated. Moreover, the results of this approach have been compared with those achieved by some popular meta-search engines such as ProFusion and MetaCrawler. Experimental results demonstrate a significant improvement in average click rate, the variance of clicks, and the average relevancy criterion.

  • M. Kazemian, B. Moshiri, H. Nikbakht, C. Lucas. “A new expertness index for assessment of secondary structure prediction engines”, Journal of Computational Biology and Chemistry 31(1): 44-47, (2007)

Improvement of prediction accuracy of the protein secondary structure is essential for further developments of the whole field of protein research. In this paper, the expertness of protein secondary structure prediction engines has been studied at three levels, and a new criterion has been introduced at the third level. This criterion can be considered an extension of the previous ones based on amino acid index. Using this new criterion, the expertness of some high-scoring secondary structure prediction engines has been reanalyzed and some hidden facts have been discovered. The results of this new assessment demonstrate that a noticeable harmony exists in the prediction behavior of each amino acid across all engines. This harmony is also seen between single global propensity and prediction accuracy of amino acid types in each secondary structure class. Moreover, it is shown that proline and glycine are predicted with lower accuracy in alpha helices and beta strands. In addition, regardless of the different approaches used in the prediction engines, beta strands are predicted with lower accuracy.

  • M. Kazemian, B. Moshiri, H. Nikbakht, C. Lucas. “Architecture for biological database integration”, Special Issue on AI & Specific Applications, ICGST International Journal on Artificial Intelligence and Machine Learning, AIML, Volume 6, pp.15-19, (2006)

Work in the laboratory involves integrating various data sources to solve biological problems. Our philosophy is that different types of data sources give us more information than a single one. By combining data sources intelligently, we are able to obtain a more complete picture of the problem. Here we introduce a general architecture for Bio Meta Search Engines based on the Decision Fusion concept. This architecture has seven stages. In addition, it has three databases for keeping the underlying engines' statistics, biological insights, and users’ preferences, which evolve as the system is used.

  • M. Kazemian, Y. Ramezani, C. Lucas, B. Moshiri, “Swarm clustering based on flowers pollination by artificial bees”, Studies in computational intelligence, Swarm Intelligence and Data Mining, Springer, chap. 8, pp. 191-203, (2006)

This chapter presents a new swarm data clustering method based on flower pollination by artificial bees, which we named FPAB. FPAB does not require any parameter settings or any initial information such as the number of classes or the number of partitions of the input data. Initially, in FPAB, bees move the pollens and pollinate them. Each pollen grows in proportion to its garden's flowers, and better growth occurs in better conditions. After some iterations, natural selection reduces the pollens and flowers to form gardens containing the same type of flowers. The prototypes of each garden are taken as the initial cluster centers for the Fuzzy C-Means (FCM) algorithm, which is used to reduce obvious misclassification errors. In the next stage, the prototype of each garden is treated as a single flower and FPAB is applied to them again. Results from three small data sets show that the partitions produced by FPAB are competitive with those obtained from FCM or AntClass.
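
The refinement step mentioned above, seeding Fuzzy C-Means with prototypes produced by an earlier stage, can be sketched in Python as follows. This is the textbook FCM update only; the FPAB stage itself is not reproduced, and the points, the seed prototypes, the fuzzifier m=2 and the fixed iteration count are all assumptions chosen for illustration.

```python
import math

def memberships_for(points, centers, m=2.0, eps=1e-9):
    """Textbook FCM membership update: u_ik = 1 / sum_j (d_ik / d_jk)^(2/(m-1))."""
    exponent = 2.0 / (m - 1.0)
    table = []
    for p in points:
        # eps keeps the ratios defined when a point coincides with a center.
        dists = [max(math.dist(p, c), eps) for c in centers]
        table.append([
            1.0 / sum((d_i / d_j) ** exponent for d_j in dists)
            for d_i in dists
        ])
    return table

def fuzzy_c_means(points, initial_centers, m=2.0, iterations=50):
    """Alternate membership and center updates, starting from the given centers."""
    centers = [list(c) for c in initial_centers]
    dims = len(points[0])
    for _ in range(iterations):
        u = memberships_for(points, centers, m)
        for i in range(len(centers)):
            weights = [row[i] ** m for row in u]
            total = sum(weights)
            centers[i] = [
                sum(w * p[d] for w, p in zip(weights, points)) / total
                for d in range(dims)
            ]
    # Recompute memberships so they match the final centers.
    return centers, memberships_for(points, centers, m)

# Hypothetical data: two well-separated groups and two seed prototypes
# standing in for the output of an earlier clustering stage.
toy_points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
seed_prototypes = [(1, 1), (9, 9)]
centers, u = fuzzy_c_means(toy_points, seed_prototypes)
hard_labels = [row.index(max(row)) for row in u]
```

Each row of `u` sums to one, and taking the largest membership per point gives a hard partition that can be compared with other clustering results.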

  • M. Kazemian, B. Moshiri, H. Nikbakht, C. Lucas. “Protein secondary structure classifiers fusion using OWA”, Lecture Notes in Computer Science, Springer-Verlag Berlin Heidelberg, 3745, pp. 338-345, (2005)

The combination of classifiers has been proposed as a method to improve on the accuracy achieved by a single classifier. In this study, the performance of optimistic and pessimistic ordered weighted averaging (OWA) operators for the fusion of protein secondary structure classifiers has been investigated. Each secondary structure classifier outputs a unique structure for each input residue. We used the confusion matrix of each secondary structure classifier as a general reusable pattern for converting this unique label to the measurement level. The results of the optimistic and pessimistic OWA operators have been compared with majority voting and with the five common classifiers used in the fusion process. Using a benchmark set from the EVA server, the results showed a significant improvement in the average Q3 prediction accuracy, of up to 1.69% relative to the best classifier's results.
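
For readers unfamiliar with OWA fusion, the following Python sketch shows the general mechanism: per-class scores from several classifiers are sorted in descending order and combined with a fixed weight vector. The weights use one common exponential family of optimistic and pessimistic OWA operators; whether this matches the exact operators of the paper is an assumption, and the scores and alpha values are invented.

```python
def optimistic_weights(n, alpha):
    """Exponential optimistic OWA weights: most weight on the largest values for alpha near 1."""
    weights = [alpha * (1 - alpha) ** i for i in range(n - 1)]
    weights.append((1 - alpha) ** (n - 1))
    return weights

def pessimistic_weights(n, alpha):
    """Exponential pessimistic OWA weights: most weight on the smallest values for alpha near 0."""
    weights = [alpha ** (n - 1)]
    weights += [(1 - alpha) * alpha ** (n - 1 - i) for i in range(1, n)]
    return weights

def owa(values, weights):
    """Ordered weighted average: weights apply to the values sorted in descending order."""
    ordered = sorted(values, reverse=True)
    return sum(w * v for w, v in zip(weights, ordered))

def fuse(scores_per_classifier, weights):
    """Apply OWA per class across classifiers and return the winning class and all fused scores."""
    classes = list(scores_per_classifier[0])
    fused = {
        c: owa([scores[c] for scores in scores_per_classifier], weights)
        for c in classes
    }
    return max(fused, key=fused.get), fused

# Hypothetical measurement-level scores from three classifiers for one residue.
scores = [
    {"H": 0.60, "E": 0.30, "C": 0.10},
    {"H": 0.20, "E": 0.50, "C": 0.30},
    {"H": 0.25, "E": 0.45, "C": 0.30},
]
opt_label, opt_scores = fuse(scores, optimistic_weights(3, alpha=0.8))
pess_label, pess_scores = fuse(scores, pessimistic_weights(3, alpha=0.2))
```

With these toy numbers the optimistic operator follows the single strongest vote for helix, while the pessimistic operator prefers strand because no classifier scores it very low.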