- M. Kazemian, M. Ren, J. Lin, W. Liao, R. Spolski, W. J. Leonard
Crucial parts of the genome including genes encoding microRNAs and noncoding RNAs went unnoticed for years, and even now, despite extensive annotation and assembly of the human genome, RNA‐sequencing continues to yield millions of unmappable and thus uncharacterized reads. Here, we examined > 300 billion reads from 536 normal donors and 1,873 patients encompassing 21 cancer types, identified ~300 million such uncharacterized reads, and using a distinctive approach de novo assembled 2,550 novel human transcripts, which mainly represent long noncoding RNAs. Of these, 230 exhibited relatively specific expression or non‐expression in certain cancer types, making them potential markers for those cancers, whereas 183 exhibited tissue specificity. Moreover, we used lentiviral‐mediated expression of three selected transcripts that had higher expression in normal than in cancer patients and found that each inhibited the growth of HepG2 cells. Our analysis provides a comprehensive and unbiased resource of unmapped human transcripts and reveals their associations with specific cancers, providing potentially important new genes for therapeutic targeting.
- C. Wan, A.B. Andraski, R. Spolski, P. Li, M. Kazemian, J. Oh, L. Samsel, P.A. Swanson, D.B. McGavern, E.P. Sampaio, A.F. Freeman, J.D. Milner, S.M. Holland, W.J. Leonard, “Opposing Roles of STAT1 and STAT3 in IL-21 Function in CD4+ T cells”, PNAS, accepted (2015).
IL-21 is a type I cytokine important for immune cell differentiation and function. We found that transcription factors STAT1 and STAT3 play partially opposing roles in IL-21 function in CD4+ T cells. Both STAT1 and STAT3 control IL-21-mediated gene regulation, with some genes including Ifng, Tbx21, and Il21 reciprocally regulated by these STATs. IFN-γ production was also differentially regulated by these STATs in vitro during CD4+ T cell differentiation and in vivo during chronic lymphocytic choriomeningitis infection. Importantly, IL-21-induced IFNG and TBX21 expression was higher in CD4+ T cells from patients with autosomal dominant hyper-IgE syndrome or with STAT1 gain-of-function mutations, suggesting that dysregulated IL-21-STAT signaling partially explains the clinical manifestations of these patients.
- M. Kazemian, M. Ren, JX Lin, W. Liao, R. Spolski, W.J. Leonard, “Possible HPV38 contamination of endometrial cancer RNA-Seq samples in The Cancer Genome Atlas database”, J. Virology, doi: 10.1128/JVI.00822-15 (2015).
- C. Blatti, M. Kazemian, S. Wolfe, M. Brodsky, S. Sinha, Integrating motif, DNA accessibility, and gene expression data to build regulatory maps in an organism. Nucleic Acids Research. (2015), 43: 3998-4012. Selected as breakthrough paper.
- M. Kazemian, K. Suryamohan, J. Chen, Y. Zhang, Md. A. H. Samee, M. S. Halfon, S. Sinha, Evidence for deep regulatory similarities in early developmental programs across highly diverged insects. Genome Biology and Evolution. 6 (9): 2301-2320. doi: 10.1093/gbe/evu184.
Many genes familiar from Drosophila development, such as the so-called gap, pair-rule, and segment polarity genes, play important roles in the development of other insects and in many cases appear to be deployed in a similar fashion, despite the fact that Drosophila-like “long germband” development is highly derived and confined to a subset of insect families. Whether or not these similarities extend to the regulatory level is unknown. Identification of regulatory regions beyond the well-studied Drosophila has been challenging as even within the Diptera (flies, including mosquitoes) regulatory sequences have diverged past the point of recognition by standard alignment methods. Here, we demonstrate that methods we previously developed for computational cis-regulatory module (CRM) discovery in Drosophila can be used effectively in highly diverged (250–350 Myr) insect species including Anopheles gambiae, Tribolium castaneum, Apis mellifera, and Nasonia vitripennis. In Drosophila, we have successfully used small sets of known CRMs as “training data” to guide the search for other CRMs with related function. We show here that although species-specific CRM training data do not exist, training sets from Drosophila can facilitate CRM discovery in diverged insects. We validate in vivo over a dozen new CRMs, roughly doubling the number of known CRMs in the four non-Drosophila species. Given the growing wealth of Drosophila CRM annotation, these results suggest that extensive regulatory sequence annotation will be possible in newly sequenced insects without recourse to costly and labor-intensive genome-scale experiments. We develop a new method, Regulus, which computes a probabilistic score of similarity based on binding site composition (despite the absence of nucleotide-level sequence alignment), and demonstrate similarity between functionally related CRMs from orthologous loci. Our work represents an important step toward being able to trace the evolutionary history of gene regulatory networks and defining the mechanisms underlying insect evolution.
- T. Duque, Md. A. H. Samee, M. Kazemian, H. N. Pham, M. H. Brodsky, S. Sinha, “Simulations of enhancer evolution provide mechanistic insights into gene regulation”, Mol Biol Evol, doi:10.1093/molbev/mst170, (2013)
There is growing interest in models of regulatory sequence evolution. However, existing models specifically designed for regulatory sequences consider the independent evolution of individual transcription factor (TF) binding sites, ignoring that the function and evolution of a binding site depends on its context, typically the cis-regulatory module (CRM) in which the site is located. Moreover, existing models do not account for the gene-specific roles of TF-binding sites, primarily because their roles often are not well-understood. We introduce two models of regulatory sequence evolution that address some of the shortcomings of existing models and implement simulation frameworks based on them. One model simulates the evolution of an individual binding site in the context of a CRM, while the other evolves an entire CRM. Both models use a state-of-the-art sequence-to-expression model to predict the effects of mutations on the regulatory output of the CRM and determine the strength of selection. We use the new framework to simulate the evolution of TF-binding sites in 37 well-studied CRMs belonging to the anterior-posterior patterning system in Drosophila embryos. We show that these simulations provide accurate fits to evolutionary data from 12 Drosophila genomes, which include statistics of binding site conservation on relatively short evolutionary scales and site loss across larger divergence times. The new framework allows us, for the first time, to test hypotheses regarding the underlying cis-regulatory code by directly comparing the evolutionary implications of the hypothesis to observed evolutionary dynamics of binding sites. Using this capability, we find that explicitly modeling self-cooperative DNA-binding by the TF Caudal (CAD) provides significantly better fits than an otherwise identical evolutionary simulation that lacks this mechanistic aspect. This hypothesis is further supported by a statistical analysis of the distribution of inter-site spacing between adjacent CAD sites. Experimental tests confirm direct homodimeric interaction between CAD molecules as well as self-cooperative DNA-binding by CAD. We note that computational modeling of the D. melanogaster CRMs alone did not yield significant evidence to support CAD self-cooperativity. We thus demonstrate how specific mechanistic details encoded in CRMs can be revealed by modeling their evolution and fitting such models to multi-species data.
- M. Kazemian, Hannah Pham, M. Brodsky, S. Sinha, “Widespread and distinct sequence signatures of combinatorial transcriptional regulation”, Nucleic Acids Research, doi: 10.1093/nar/gkt598, (2013)
Regulation of eukaryotic gene transcription is often combinatorial in nature, with multiple transcription factors (TFs) regulating common target genes, often through direct or indirect mutual interactions. Many individual examples of cooperative binding by directly interacting TFs have been identified, but it remains unclear how pervasive this mechanism is during animal development. Cooperative TF binding should be manifest in genomic sequences as biased arrangements of TF-binding sites. Here, we explore the extent and diversity of such arrangements related to gene regulation during Drosophila embryogenesis. We used the DNA-binding specificities of 322 TFs along with chromatin accessibility information to identify enriched spacing and orientation patterns of TF-binding site pairs. We developed a new statistical approach for this task, specifically designed to accurately assess inter-site spacing biases while accounting for the phenomenon of homotypic site clustering commonly observed in developmental regulatory regions. We observed a large number of short-range distance preferences between TF-binding site pairs, including examples where the preference depends on the relative orientation of the binding sites. To test whether these binding site patterns reflect physical interactions between the corresponding TFs, we analyzed 27 TF pairs whose binding sites exhibited short distance preferences. In vitro protein-protein binding experiments revealed that >65% of these TF pairs can directly interact with each other. For five pairs, we further demonstrate that they bind cooperatively to DNA if both sites are present with the preferred spacing. This study demonstrates how DNA-binding motifs can be used to produce a comprehensive map of sequence signatures for different mechanisms of combinatorial TF action.
Available source code for iTF
- Q. Cheng, M. Kazemian, H. Pham, C. Blatti, S. E. Celniker, S. A. Wolfe, M. H. Brodsky, S. Sinha, “Computational identification of diverse mechanisms underlying transcription factor-DNA occupancy”, PLoS Genetics. Aug;9(8):e1003571, (2013)
ChIP-based genome-wide assays of transcription factor (TF) occupancy have emerged as a powerful, high-throughput method to understand transcriptional regulation, especially on a global scale. This has led to great interest in the underlying biochemical mechanisms that direct TF-DNA binding, with the ultimate goal of computationally predicting a TF's occupancy profile in any cellular condition. In this study, we examined the influence of various potential determinants of TF-DNA binding on a much larger scale than previously undertaken. We used a thermodynamics-based model of TF-DNA binding, called “STAP,” to analyze 45 TF-ChIP data sets from Drosophila embryonic development. We built a cross-validation framework that compares a baseline model, based on the ChIP'ed (“primary”) TF's motif, to more complex models where binding by secondary TFs is hypothesized to influence the primary TF's occupancy. Candidate interacting TFs were chosen based on RNA-Seq expression data from the time point of the ChIP experiment. We found widespread evidence of both cooperative and antagonistic effects by secondary TFs, and explicitly quantified these effects. We were able to identify multiple classes of interactions, including (1) long-range interactions between primary and secondary motifs (separated by ≤150 bp), suggestive of indirect effects such as chromatin remodeling, (2) short-range interactions with specific inter-site spacing biases, suggestive of direct physical interactions, and (3) overlapping binding sites suggesting competitive binding. Furthermore, by factoring out the previously reported strong correlation between TF occupancy and DNA accessibility, we were able to categorize the effects into those that are likely to be mediated by the secondary TF's effect on local accessibility and those that utilize accessibility-independent mechanisms. Finally, we conducted in vitro pull-down assays to test model-based predictions of short-range cooperative interactions, and found that seven of the eight TF pairs tested physically interact and that some of these interactions mediate cooperative binding to DNA.
- M. S. Enuameh, Y. Asriyan, A. Richards, R. G. Christensen, V. L. Hall, M. Kazemian, C. Zhu, H. Pham, Q. Cheng, C. Blatti, J. A. Brasefield, M. D. Basciotta, J. Ou, J. C. McNulty, L. J. Zhu, S. E. Celniker, S. Sinha, G. D. Stormo, M. H. Brodsky, and S. Wolfe, “Global analysis of Drosophila Cys2-His2 zinc finger proteins reveals a multitude of novel recognition motifs and binding determinants”, Genome Research, doi:10.1101/gr.151472.112, (2013)
Cys2-His2 zinc finger proteins (ZFPs) are the largest group of transcription factors in higher metazoans. A complete characterization of these ZFPs and their associated target sequences is pivotal to fully annotate transcriptional regulatory networks in metazoan genomes. As a first step in this process, we have characterized the DNA-binding specificities of 129 zinc finger sets from Drosophila using a bacterial one-hybrid system. This data set contains the DNA-binding specificities for at least one encoded ZFP from 70 unique genes and 23 alternate splice isoforms representing the largest set of characterized ZFPs from any organism described to date. These recognition motifs can be used to predict genomic binding sites for these factors within the fruit fly genome. Subsets of fingers from these ZFPs were characterized to define their orientation and register on their recognition sequences, thereby allowing us to define the recognition diversity within this finger set. We find that the characterized fingers can specify 47 of the 64 possible DNA triplets. To confirm the utility of our finger recognition models, we employed subsets of Drosophila fingers in combination with an existing archive of artificial zinc finger modules to create ZFPs with novel DNA-binding specificity. These hybrids of natural and artificial fingers can be used to create functional zinc finger nucleases for editing vertebrate genomes.
- S. Shahinfar, H. Mehrabani-Yeganeh, C. Lucas, A. Kalhor, M. Kazemian, and K. A. Weigel, “Prediction of Breeding Values for Dairy Cattle Using Artificial Neural Networks and Neuro-Fuzzy Systems”, Computational and Mathematical Methods in Medicine, Volume 2012, Article ID 127130, 9 pages, doi:10.1155/2012/127130, (2012)
The development of machine learning and soft computing techniques has provided many opportunities for researchers to establish new analytical methods in different areas of science. The objective of this study is to investigate the potential of two types of intelligent learning methods, artificial neural networks (ANN) and neuro-fuzzy systems (NFS), to estimate breeding values (EBV) of Iranian dairy cattle. Initially, the breeding values of lactating Holstein cows for milk and fat yield were estimated using conventional best linear unbiased prediction (BLUP) with an animal model. Once that was established, a multilayer perceptron was used to build an ANN to predict breeding values from the performance data of selection candidates. Subsequently, fuzzy logic was used to form an NFS, a hybrid intelligent system that was implemented via a local linear model tree algorithm. For milk yield the correlations between EBV and EBV predicted by the ANN and NFS were 0.92 and 0.93, respectively. Corresponding correlations for fat yield were 0.93 and 0.93, respectively. Correlations between multitrait predictions of EBVs for milk and fat yield when predicted simultaneously by ANN were 0.93 and 0.93, respectively, whereas corresponding correlations with reference EBV for multitrait NFS were 0.94 and 0.95, respectively, for milk and fat production.
- M. Kazemian, Q. Zhu, M. S. Halfon, S. Sinha, “Improved accuracy of supervised CRM discovery with interpolated Markov models and cross-species comparison”, Nucl. Acids Res. (2011) first published online August 5, doi:10.1093/nar/gkr621, (2011)
Despite recent advances in experimental approaches for identifying transcriptional cis-regulatory modules (CRMs, 'enhancers'), direct empirical discovery of CRMs for all genes in all cell types and environmental conditions is likely to remain an elusive goal. Effective methods for computational CRM discovery are thus a critically needed complement to empirical approaches. However, existing computational methods that search for clusters of putative binding sites are ineffective if the relevant TFs and/or their binding specificities are unknown. Here, we provide a significantly improved method for 'motif-blind' CRM discovery that does not depend on knowledge or accurate prediction of TF-binding motifs and is effective when limited knowledge of functional CRMs is available to 'supervise' the search. We propose a new statistical method, based on 'Interpolated Markov Models', for motif-blind, genome-wide CRM discovery. It captures the statistical profile of variable length words in known CRMs of a regulatory network and finds candidate CRMs that match this profile. The method also uses orthologs of the known CRMs from closely related genomes. We perform in silico evaluation of predicted CRMs by assessing whether their neighboring genes are enriched for the expected expression patterns. This assessment uses a novel statistical test that extends the widely used Hypergeometric test of gene set enrichment to account for variability in intergenic lengths. We find that the new CRM prediction method is superior to existing methods. Finally, we experimentally validate 12 new CRM predictions by examining their regulatory activity in vivo in Drosophila; 10 of the tested CRMs were found to be functional, while 6 of the top 7 predictions showed the expected activity patterns. We make our program available as downloadable source code, and as a plugin for a genome browser installed on our servers.
Available source code for enhancer prediction methods and Loci Length-aware Hypergeometric Test
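For orientation, the widely used hypergeometric test of gene set enrichment that the abstract above says is extended can be sketched as follows. This is only the standard test, not the length-aware extension described in the paper. The function name and the example numbers are hypothetical and chosen for illustration.

```python
from math import comb


def hypergeom_enrichment_pvalue(population, annotated, selected, observed):
    """Upper-tail hypergeometric p-value, P(X >= observed).

    population: total number of genes considered
    annotated:  genes in the population with the expression pattern of interest
    selected:   genes picked (e.g. genes next to predicted CRMs)
    observed:   selected genes that carry the annotation
    """
    upper = min(annotated, selected)
    total = comb(population, selected)
    # Sum the probabilities of seeing `observed` or more annotated genes.
    tail = sum(
        comb(annotated, k) * comb(population - annotated, selected - k)
        for k in range(observed, upper + 1)
    )
    return tail / total


# Hypothetical toy numbers: 10 genes, 4 annotated, 3 selected, 2 of them annotated.
p_value = hypergeom_enrichment_pvalue(10, 4, 3, 2)
```

The standard test treats every gene as equally likely to be selected. The abstract notes that the authors' extension accounts for variability in intergenic lengths, which this sketch does not attempt.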
- M. Kazemian, M. H. Brodsky, S. Sinha, “Genome surveyor 2.0: cis-regulatory analysis in Drosophila”, Nucl. Acids Res. (2011) first published online May 18, doi:10.1093/nar/gkr291, (2011)
- L. J. Zhu, R. G. Christensen, M. Kazemian, C. J. Hull, M. S. Enuameh, M. D. Basciotta, J. A. Brasefield, C. Zhu, Y. Asriyan, D. S. Lapointe, S. Sinha, S. A. Wolfe, and M. H. Brodsky, “FlyFactorSurvey: a database of Drosophila transcription factor binding specificities determined using the bacterial one-hybrid system”. Nucl. Acids Res. (2010) first published online November 19, doi:10.1093/nar/gkq858, (2010)
FlyFactorSurvey (http://pgfe.umassmed.edu/TFDBS/) is a database of DNA binding specificities for Drosophila transcription factors (TFs) primarily determined using the bacterial one-hybrid system. The database provides community access to over 400 recognition motifs and position weight matrices for over 200 TFs, including many unpublished motifs. Search tools and flat file downloads are provided to retrieve binding site information (as sequences, matrices and sequence logos) for individual TFs, groups of TFs or for all TFs with characterized binding specificities. Linked analysis tools allow users to identify motifs within our database that share similarity to a query matrix or to view the distribution of occurrences of an individual motif throughout the Drosophila genome. Together, this database and its associated tools provide computational and experimental biologists with resources to predict interactions between Drosophila TFs and target cis-regulatory sequences.
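As a rough illustration of how a position weight matrix can be used to locate motif occurrences in a sequence, the sketch below builds a log-odds matrix from per-position base counts and scans a sequence with it. It is not the database's own tooling. The count matrix, pseudocount, uniform background, and threshold are all assumptions made for the example, and the sequence is assumed to contain only A, C, G, and T.

```python
from math import log2

BASES = "ACGT"


def pwm_from_counts(counts, pseudocount=1.0, background=0.25):
    """Turn per-position base counts into a log-odds position weight matrix."""
    pwm = []
    for column in counts:
        total = sum(column.get(b, 0) for b in BASES) + len(BASES) * pseudocount
        pwm.append({
            b: log2(((column.get(b, 0) + pseudocount) / total) / background)
            for b in BASES
        })
    return pwm


def scan(sequence, pwm, threshold):
    """Return (start, window, score) for every window scoring at least threshold."""
    width = len(pwm)
    hits = []
    for start in range(len(sequence) - width + 1):
        window = sequence[start:start + width]
        score = sum(pwm[i][base] for i, base in enumerate(window))
        if score >= threshold:
            hits.append((start, window, score))
    return hits


# Hypothetical counts for a four-base motif with consensus TAAT.
toy_counts = [{"T": 18}, {"A": 18}, {"A": 18}, {"T": 18}]
toy_pwm = pwm_from_counts(toy_counts)
toy_hits = scan("GGTAATCC", toy_pwm, threshold=5.0)
```

Only the forward strand is scanned here; a fuller scan would also score the reverse complement of each window.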
- M. Kazemian†, C. Blatti†, A. Richards, M. McCutchan, N. Wakabayashi-Ito, A. Hammonds, S. Celniker, S. Kumar, S. Wolfe, M. Brodsky, and S. Sinha. “Quantitative analysis of the Drosophila segmentation regulatory network using pattern generating potentials”. PLoS Biology. 8(8): e1000456. doi:10.1371/journal.pbio.1000456, (2010)
Cis-regulatory modules that drive precise spatial-temporal patterns of gene expression are central to the process of metazoan development. We describe a new computational strategy to annotate genomic sequences based on their "pattern generating potential" and to produce quantitative descriptions of transcriptional regulatory networks at the level of individual protein-module interactions. We use this approach to convert the qualitative understanding of interactions that regulate Drosophila segmentation into a network model in which a confidence value is associated with each transcription factor-module interaction. Sequence information from multiple Drosophila species is integrated with transcription factor binding specificities to determine conserved binding site frequencies across the genome. These binding site profiles are combined with transcription factor expression information to create a model to predict module activity patterns. This model is used to scan genomic sequences for the potential to generate all or part of the expression pattern of a nearby gene, obtained from available gene expression databases. Interactions between individual transcription factors and modules are inferred by a statistical method to quantify a factor's contribution to the module's pattern generating potential. We use these pattern generating potentials to systematically describe the location and function of known and novel cis-regulatory modules in the segmentation network, identifying many examples of modules predicted to have overlapping expression activities. Surprisingly, conserved transcription factor binding site frequencies were as effective as experimental measurements of occupancy in predicting module expression patterns or factor-module interactions. Thus, unlike previous module prediction methods, this method predicts not only the location of modules but also their spatial activity pattern and the factors that directly determine this pattern. As databases of transcription factor specificities and in vivo gene expression patterns grow, analysis of pattern generating potentials provides a general method to decode transcriptional regulatory sequences and networks.
- M. Kazemian, B. Moshiri, C. Lucas, H. Nikbakht, V. Palade, "Using classifier fusion techniques for protein secondary structure prediction", Int. J. Comput. Intelligence in Bioinformatics and Systems Biology, Vol. 1, No. 4, pp. 418-434 (2010)
Classifier fusion techniques are gaining more popularity for their capability of improving the accuracy achieved by individual classifiers. A common approach is to combine the classifiers' outcome using simple methods, such as majority voting. In this paper, we build a meta-classifier by fusing some already well-known classifiers for protein structure prediction. Each individual classifier outputs a unique structure for every input residue. We have used the confusion matrix of each protein secondary structure classifier, which is representative of classifiers' expertness, as a general reusable pattern for converting its simple class-label assignment to class-preference score. The results obtained using several classifier fusion operators have been compared, on some standard datasets from the EVA server, with simple majority voting and with the results provided by the individual classifiers. The comparative analysis showed that the Choquet fuzzy integral operator had the highest improvement with respect to accuracy, multi-class sensitivity and specificity criteria over both the best performing individual classifier and the other fusion operators, while all of the classifier fusion techniques yielded some improvements too.
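The idea of turning a hard class label into class-preference scores through a classifier's confusion matrix can be sketched as below. The normalization (fraction of each true class among residues that received a given predicted label) and the mean fusion operator are assumptions made for illustration. The sketch does not implement the Choquet fuzzy integral or the paper's exact scoring, and the confusion counts are hypothetical.

```python
CLASSES = ("H", "E", "C")  # helix, strand, coil


def label_to_scores(confusion, predicted):
    """Convert one predicted label into a preference score per class.

    confusion[true][pred] holds counts from a labelled evaluation set.
    """
    column_total = sum(confusion[t][predicted] for t in CLASSES)
    return {t: confusion[t][predicted] / column_total for t in CLASSES}


def fuse_by_mean(confusions, predictions):
    """Average the preference scores of all classifiers and pick the top class."""
    fused = {c: 0.0 for c in CLASSES}
    for confusion, predicted in zip(confusions, predictions):
        scores = label_to_scores(confusion, predicted)
        for c in CLASSES:
            fused[c] += scores[c] / len(predictions)
    return max(fused, key=fused.get), fused


# Hypothetical confusion counts for two classifiers.
conf_a = {
    "H": {"H": 80, "E": 5, "C": 15},
    "E": {"H": 10, "E": 70, "C": 20},
    "C": {"H": 10, "E": 25, "C": 65},
}
conf_b = {
    "H": {"H": 60, "E": 10, "C": 30},
    "E": {"H": 5, "E": 90, "C": 5},
    "C": {"H": 35, "E": 0, "C": 65},
}

# Classifier A says helix, classifier B says strand: a tie under majority voting.
fused_label, fused_scores = fuse_by_mean([conf_a, conf_b], ["H", "E"])
```

In this toy case the two classifiers disagree, so simple majority voting cannot decide, while the confusion-matrix scores favour the label from the classifier that is more reliable when it makes that call.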
- M. R. Kantorovitz†, M. Kazemian†, S. Kinston, D. Miranda-Saavedra, Q. Zhu, G. E. Robinson, B. Göttgens, M. S. Halfon, S. Sinha, “Motif-Blind, genome-wide discovery of cis-regulatory modules in Drosophila and mouse”, Developmental Cell, Volume 17, Issue 4, 568-579, 20 October, (2009)
- A.H. Keyhanipoor, B. Moshiri, M. Kazemian, C. Lucas, “Aggregation of web search engines based on users' preferences in WebFusion”, Knowledge-based Systems, 20(4): 321-328, (2007)
- M. Kazemian, B. Moshiri, H. Nikbakht, C. Lucas. “A new expertness index for assessment of secondary structure prediction engines”, Journal of Computational Biology and Chemistry 31(1): 44-47, (2007)
Improving the prediction accuracy of protein secondary structure is essential for further development of the whole field of protein research. In this paper, the expertness of protein secondary structure prediction engines has been studied at three levels, and a new criterion has been introduced at the third level. This criterion can be considered an extension of the previous ones based on the amino acid index. Using this new criterion, the expertness of some high-scoring secondary structure prediction engines has been reanalyzed and some hidden facts have been discovered. The results of this new assessment demonstrate that a noticeable harmony exists in the prediction behavior of each amino acid across all engines. This harmony is also seen between the single global propensity and the prediction accuracy of amino acid types in each secondary structure class. Moreover, it is shown that the amino acids Proline and Glycine are predicted with less accuracy in alpha helices and beta strands. In addition, regardless of the different approaches used in the prediction engines, beta strands are predicted with less accuracy.
- M. Kazemian, B. Moshiri, H. Nikbakht, C. Lucas. “Architecture for biological database integration”, Special Issue on AI & Specific Applications, ICGST International Journal on Artificial Intelligence and Machine Learning, AIML, Volume 6, pp.15-19, (2006)
Work in the laboratory involves integrating various data sources to solve biological problems. Our philosophy is that different types of data sources give us more information than a single one. By combining data sources intelligently, we are able to obtain a more complete picture of the problem. Here we introduce a general architecture for Bio Meta Search Engines based on the Decision Fusion concept. This architecture has seven stages. In addition, it has three databases that store the underlying engines' statistics, biological insights, and users’ preferences, which evolve as the system is used.
- M. Kazemian, Y. Ramezani, C. Lucas, B. Moshiri, “Swarm clustering based on flowers pollination by artificial bees”, Studies in computational intelligence, Swarm Intelligence and Data Mining, Springer, chapt.8, pp. 191-203, (2006)
This chapter presents a new swarm data clustering method based on flower pollination by artificial bees, which we named FPAB. FPAB does not require any parameter settings or any initial information such as the number of classes or the number of partitions of the input data. Initially, in FPAB, bees move the pollens and pollinate them. Each pollen grows in proportion to its garden flowers, and better growth occurs in better conditions. After some iterations, natural selection reduces the pollens and flowers to form gardens of the same type of flowers. The prototypes of each garden are taken as the initial cluster centers for the Fuzzy C Means (FCM) algorithm, which is used to reduce obvious misclassification errors. In the next stage, the prototype of each garden is treated as a single flower and FPAB is applied to them again. Results from three small data sets show that the partitions produced by FPAB are competitive with those obtained from FCM or AntClass.
- M. Kazemian, B. Moshiri, H. Nikbakht, C. Lucas. “Protein secondary structure classifiers fusion using OWA”, Lecture Notes in Computer Science. Springer-Verlag Berlin Heidelberg 3745, pp. 338-345, (2005)