• Next-Gen Sequencing and disease 
Next-generation sequencing has shifted our understanding of the molecular biology underlying both complex and Mendelian diseases. It provides a time- and cost-effective approach to monitoring expression patterns, transcriptional regulatory footprints, the epigenome, and genomic variation. Moreover, computational tools provide a bridge to connect these heterogeneous data and elucidate causal mechanisms. We have recently performed RNA sequencing on lymph node samples from several Finnish patients diagnosed with angioimmunoblastic T-cell lymphoma (AITL). We are analyzing mRNA-level expression to determine the key genes and pathways involved in AITL. We are now looking for genomic variations that may contribute to a significant number of AITL cases. In the process, we have developed several new computational tools for teasing out relevant variants, which may also apply to other types of complex disorders. In another case, involving a child with a severe immune disease, we have performed RNA and exome sequencing on the child and both healthy parents. We are using the healthy parents' data to guide the search for genes, pathways, and variations relevant to the child's disease.

  • Long non-coding RNAs and their role in cell differentiation
With the completion of the Human Genome Project, it became clear that the coding fraction of the genome (comprising less than 2%) cannot explain the complexity of human physiology. Shortly after, researchers observed that a large fraction (>75%) of the genome is in fact transcribed without coding for proteins. This observation raised many questions, along with skepticism, about the role of these reproducible, non-coding transcripts. It is worth noting that several critical components of the cellular machinery have long been known to be non-coding. Recent studies demonstrate that these transcripts are highly cell-type specific. Here, we analyze cell-type-specific RNA-seq libraries for the presence of such transcripts. Using statistical approaches, we investigate their potential mechanism(s) of action by overlaying various types of information.

  • Predicting novel cis-regulatory modules with prior knowledge of related CRMs
A major challenge in understanding metazoan genomes is to find and annotate the regions that control the precise spatial and temporal expression of genes. Cis-regulatory modules (CRMs), the main players in this regulatory process, are typically short (<1kb) sequences embedded in non-coding regions of the genome. They harbor cis-elements (binding sites) for one or more related transcription factors (TFs) and mediate a discrete aspect of the expression pattern of their nearby gene. Although decades of research in biology have provided scientists with a few hundred such sequences, we are far from completing the search and from understanding the underlying mechanisms of these regulatory regions.

Most current CRM discovery approaches rely on knowledge of the related motifs and search genomic regions for clusters of such motifs. Their accuracy is therefore bounded by how well, if at all, the motifs are characterized. Other methods search for motifs and CRMs simultaneously, but the “unsolved” nature of the motif discovery problem casts doubt on the scalability of such methods.
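
As a rough illustration of the motif-cluster idea, the sketch below slides a fixed-size window along a sequence and reports windows that contain several motif matches. Exact string matching stands in for motif scoring here, and the function names, window size, and hit threshold are hypothetical choices made for the example.

```python
def find_motif_hits(sequence, motifs):
    """Return sorted start positions of exact matches to any motif."""
    hits = []
    for motif in motifs:
        start = sequence.find(motif)
        while start != -1:
            hits.append(start)
            start = sequence.find(motif, start + 1)
    return sorted(hits)


def motif_clusters(sequence, motifs, window=20, min_hits=3):
    """Return (window_start, hit_count) for windows holding at least min_hits match starts."""
    hits = find_motif_hits(sequence, motifs)
    clusters = []
    for w_start in range(0, max(1, len(sequence) - window + 1)):
        # Count motif matches whose start position falls inside this window.
        count = sum(1 for h in hits if w_start <= h < w_start + window)
        if count >= min_hits:
            clusters.append((w_start, count))
    return clusters
```

The limitation described above is visible in the sketch: if the motif list is incomplete or poorly characterized, no window can reach the threshold, however the window size is tuned.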


Supervised CRM prediction pipeline

Experimentally validated fruit fly CRMs 

Thanks to the extensive work of biologists over the last few decades, who have tested many sequence fragments for regulatory activity in reporter gene assays, we now have an invaluable collection of known enhancers in a variety of species and tissues.

Our goal here is to use a small set of known CRMs participating in a transcriptional network as “training data” to guide the search for other CRMs with similar functionality in the network. We call this task “supervised CRM prediction”. To this end, we continually develop novel statistical and probabilistic models that capture the similarity between any given sequence and the training data set. We use these models to locate high-scoring regions of the genome as candidate CRMs of the same network. Experimentally validated candidates in fly and mouse confirm the power of these techniques.
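
One simple way to picture this task is a k-mer frequency comparison: build a k-mer profile from the training CRMs, then score each genomic window by its similarity to that profile. The sketch below uses cosine similarity over k-mer counts. It is an illustrative stand-in rather than the models referred to above, and the function names and the values of k, window, and step are assumptions.

```python
from collections import Counter
import math


def kmer_counts(sequence, k=3):
    """Count overlapping k-mers in a sequence."""
    return Counter(sequence[i:i + k] for i in range(len(sequence) - k + 1))


def cosine_similarity(a, b):
    """Cosine similarity between two k-mer count vectors (0.0 if either is empty)."""
    dot = sum(a[key] * b[key] for key in a if key in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def score_windows(genome, training_crms, k=3, window=12, step=4):
    """Return (start, score) for each window, scored against the pooled training profile."""
    profile = Counter()
    for crm in training_crms:
        profile.update(kmer_counts(crm, k))
    scores = []
    for start in range(0, len(genome) - window + 1, step):
        segment = genome[start:start + window]
        scores.append((start, cosine_similarity(kmer_counts(segment, k), profile)))
    return scores
```

High-scoring windows from `score_windows` would play the role of candidate CRMs. A realistic model would also need a background set to separate network-specific signal from general sequence composition.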

Experimentally validated human CRMs in mouse 

  • Modeling gene/CRM expression pattern

Finding when, where, and how intensely genes are expressed is a challenging problem. Thermodynamic models address some aspects of the problem in the fruit fly anterior-posterior (A/P) segmentation network by modeling the expression patterns of nearly 50 of its known CRMs (as training data) during the embryonic stages. These models can be invoked to predict the expression profile of any sequence, given the necessary context. However, such models are too slow to be used in a genome-wide search for CRMs that recapitulate their nearby gene’s expression.

Here, we develop a model that maps the binding sites and the concentration information of relevant transcription factors to the expression pattern driven by several known CRMs involved in anterior-posterior (A/P) patterning. We deliberately sacrifice a negligible amount of accuracy for simplicity and efficiency by replacing the thermodynamic model with a logistic regression that has fewer parameters.
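
The idea can be sketched with a very small logistic regression. In the toy example below, each feature is a TF's binding-site strength in the CRM multiplied by that TF's concentration at a position along the A/P axis, and the target is the expression level at that position. The feature definition, the toy gradients, and the training settings are all assumptions made for illustration, not the fitted model described above.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def features(site_strengths, concentrations):
    """One feature per TF: binding-site strength in the CRM times TF concentration."""
    return [s * c for s, c in zip(site_strengths, concentrations)]


def predict(weights, bias, x):
    """Predicted expression level in (0, 1) for one feature vector."""
    return sigmoid(bias + sum(w * xi for w, xi in zip(weights, x)))


def fit(samples, targets, lr=0.5, epochs=2000):
    """Fit weights by batch gradient descent on the cross-entropy loss."""
    n_features = len(samples[0])
    weights = [0.0] * n_features
    bias = 0.0
    n = len(samples)
    for _ in range(epochs):
        grad_w = [0.0] * n_features
        grad_b = 0.0
        for x, y in zip(samples, targets):
            err = predict(weights, bias, x) - y
            for j in range(n_features):
                grad_w[j] += err * x[j]
            grad_b += err
        weights = [w - lr * g / n for w, g in zip(weights, grad_w)]
        bias -= lr * grad_b / n
    return weights, bias


# Toy A/P axis with 11 positions (hypothetical data): an activator that fades
# toward the posterior and a repressor that rises toward it.
positions = range(11)
activator = [1.0 - i / 10 for i in positions]
repressor = [i / 10 for i in positions]
site_strengths = [1.0, 1.0]  # assumed binding-site strength per TF in this CRM

samples = [features(site_strengths, [a, r]) for a, r in zip(activator, repressor)]
targets = [1.0 if a > r else 0.0 for a, r in zip(activator, repressor)]
weights, bias = fit(samples, targets)
```

With only one weight per TF plus a bias, a model of this kind is cheap enough to evaluate on every window of a genome, which is the trade-off the paragraph above describes.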

Modeling the expression pattern 

Cis-regulatory module discovery



Predicted regulatory network 

Our model not only enables a genome-wide search for other CRMs of the network, but also provides a simple mechanism to statistically infer the effect of each TF on each CRM. The video on the left shows how the model is used to scan the region around a gene (e.g. hkb) for segments that could drive expression similar to that of the gene. The middle panel plots the expected and predicted expression patterns for each window, and the bottom panel plots the similarity between the expected and predicted expression patterns for that window.
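
The window scan can be sketched as follows. The similarity measure used in the actual scan is not stated here, so Pearson correlation is assumed purely for illustration. `predict_pattern` is a hypothetical callable that stands for whatever model turns a sequence window into an expression profile.

```python
import math


def pearson(x, y):
    """Pearson correlation between two equal-length profiles (0.0 if either is flat)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / math.sqrt(var_x * var_y)


def scan_region(region, expected, predict_pattern, window=500, step=100):
    """Score each window by how well its predicted pattern matches the gene's pattern."""
    results = []
    for start in range(0, len(region) - window + 1, step):
        predicted = predict_pattern(region[start:start + window])
        results.append((start, pearson(expected, predicted)))
    return results
```

Windows whose score approaches 1 would be the segments most likely to drive expression similar to that of the gene.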


  • Transcription factor interactome (iTFs)
Tissue-specific gene expression in eukaryotes is often mediated by interactions between multiple transcription factors that are recruited by regulatory regions. Although protein-protein interaction networks have provided invaluable information about the interactions between pairs of factors, they do not elaborate on the mechanisms of interaction, such as the preferred distance or orientation of their DNA binding sites. Moreover, the sensitivity and specificity of such networks leave much room for improvement. Recently, intense efforts have characterized the binding specificities of many TFs, which have been used to computationally search for their genomic targets. The iTFs program statistically searches the shared genomic targets of each pair of factors for signatures of interaction and infers the mechanisms of such interactions.
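
To give a feel for what a distance signature might look like, the sketch below measures how often the sites of one factor lie close to a site of another factor, then compares that fraction with random placements of the second factor's sites. It is a hypothetical illustration and not the iTFs method: the function names, the distance cutoff, and the permutation scheme are all assumptions, and both site lists are assumed to be non-empty.

```python
import random


def nearest_distances(sites_a, sites_b):
    """Distance from each A site to the closest B site."""
    return [min(abs(a - b) for b in sites_b) for a in sites_a]


def close_fraction(sites_a, sites_b, cutoff):
    """Fraction of A sites that have a B site within the cutoff distance."""
    distances = nearest_distances(sites_a, sites_b)
    return sum(1 for d in distances if d <= cutoff) / len(distances)


def permutation_p_value(sites_a, sites_b, region_length,
                        cutoff=20, n_perm=1000, seed=0):
    """Share of random B placements that look at least as close as the observed sites."""
    rng = random.Random(seed)
    observed = close_fraction(sites_a, sites_b, cutoff)
    as_extreme = 0
    for _ in range(n_perm):
        shuffled_b = [rng.randrange(region_length) for _ in sites_b]
        if close_fraction(sites_a, shuffled_b, cutoff) >= observed:
            as_extreme += 1
    # Add-one correction keeps the estimate away from an impossible p = 0.
    return (as_extreme + 1) / (n_perm + 1)
```

A small p-value would suggest that the two factors bind closer together than chance placement explains. Orientation preferences could be tested in the same spirit by recording strand alongside position.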

TF-TF interaction signature 

 Genome Surveyor scheme

  • Annotating genome (Genome Surveyor)
Genome Surveyor is a collection of web-based computational tools for the discovery and analysis of cis-regulatory elements in the fruit fly, built on top of the Generic Genome Browser for convenient visualization. It predicts transcription factor (TF) binding targets and cis-regulatory modules (CRMs/enhancers) based on motifs representing experimentally determined DNA binding specificities.
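
Motif-based prediction of binding targets is commonly done by scoring a sequence against a position weight matrix. The sketch below shows that general technique under simple assumptions (a uniform background, a pseudocount of 1, forward strand only). It does not reproduce Genome Surveyor's own scoring, and the function names and threshold are hypothetical.

```python
import math

BACKGROUND = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}


def log_odds_matrix(counts, pseudocount=1.0):
    """Convert per-position base counts into log2 odds against a uniform background."""
    matrix = []
    for column in counts:
        total = sum(column.values()) + 4 * pseudocount
        matrix.append({
            base: math.log2(((column.get(base, 0) + pseudocount) / total) / BACKGROUND[base])
            for base in "ACGT"
        })
    return matrix


def scan(sequence, matrix, threshold):
    """Return (position, score) for every window scoring at or above the threshold."""
    width = len(matrix)
    hits = []
    for start in range(len(sequence) - width + 1):
        window = sequence[start:start + width]
        if any(base not in BACKGROUND for base in window):
            continue  # skip windows containing N or other ambiguous symbols
        score = sum(matrix[i][base] for i, base in enumerate(window))
        if score >= threshold:
            hits.append((start, score))
    return hits
```

A fuller version would also scan the reverse complement and choose the threshold from a background score distribution rather than a fixed value.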