A Hybrid Data Mining Framework for Efficient Characterization of Insertion Preferences of Retrotransposons: a Comprehensive Study on Alu Elements
Principal Investigator: Kun Zhang, Ph.D. Assistant Professor, Department of Computer Sciences
Mentor: Dr. Prescott Deininger, Tulane Cancer Center
Alu elements are primate-specific short interspersed elements (SINEs). Each Alu is approximately 300 bp in length and derives its name from a single recognition site for the restriction enzyme AluI. Alu elements have amplified to more than one million copies in primate genomes over the last 65 million years, now comprise roughly 11% of the human genome, and have given rise to a series of Alu subfamilies of different ages. Although Alu elements have no known biological function, their propagation has contributed a great deal to the evolution, structure, and dynamics of the human genome. Numerous human genetic diseases, including cancers, have been ascribed to disruptive, random Alu insertions and mutations.
Almost all current understanding of the Alu insertion mechanism comes from costly, time-consuming laboratory studies and from preliminary small-scale multiple sequence alignments of restricted regions. The emergence of automated high-throughput sequencing technologies has greatly increased the amount of DNA sequence in public databases. The growing availability of large-scale genomic sequence data from various organisms makes it possible to investigate computationally how Alu insertions occur, as well as the causal links between Alu elements and cancer. In close collaboration with Dr. Prescott Deininger’s group, we propose that the hybrid data mining framework described below can be used to automatically determine the larger-scale features of DNA sequence, on the order of hundreds or thousands of base pairs, that facilitate or hinder Alu insertions. This knowledge, if acquired, would in turn assist not only in identifying which genes might be especially susceptible to Alu insertion mutations, but also in the treatment and prevention of cancers and other related diseases.
Although our approach focuses on Alu insertions, Alus are inserted using the activities of the L1 ORF2 protein, so our observations should be useful for all currently active human mobile elements. In addition, the key step that determines the point of insertion of an Alu (or L1) element is cleavage by the endonuclease activity of L1 ORF2. Data from the Deininger laboratory suggest that this endonuclease activity may be a major source of double-strand breaks in DNA. Because tumors generally overexpress L1, they would be expected to be subject to this endonucleolytic activity, which in turn would be expected to contribute to DNA instability and tumor progression. This work will help us predict the regions of the genome that are most susceptible to damage by the L1 endonuclease and therefore most likely to vary during tumor progression.
The proposed study will assist not only in understanding Alu elements themselves and their effects on human genetic diseases, but also in integrating and developing a series of advanced data mining techniques for biological sequence analysis. Significant progress towards this goal will contribute to the overall understanding of Alu biology and the genetic basis of human diseases.
In particular, our specific aims are:
Aim 1. To develop an effective, scalable and unbiased discriminative data mining framework for Alu insertion site prediction. The three essential and distinct components to be investigated are:
- Specialized feature generation methods for Alu sequence data;
- A divide-and-conquer based feature selection and refinement mechanism which is driven by frequent-itemset mining; and
- A scalable and unbiased discriminative model augmented by probabilistic tree ensembles.
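The first two components above can be illustrated with a minimal sketch. The proposal does not specify the feature encoding or the mining algorithm, so this example assumes k-mer presence features and a plain Apriori-style search. The function names, the value of k, the support threshold, and the toy flanking sequences are all hypothetical. The tree-ensemble model of the third component is not shown.

```python
from itertools import combinations


def kmer_set(seq, k=3):
    """Return the set of k-mers present in a DNA sequence.

    Assumption: k-mer presence stands in for the 'specialized features';
    the actual feature generation method is not given in the proposal.
    """
    seq = seq.upper()
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def frequent_itemsets(transactions, min_support, max_size=2):
    """Apriori-style search over sets of k-mers.

    Returns a dict mapping each itemset (a frozenset of k-mers) to its
    support, i.e. the fraction of sequences that contain every k-mer in it.
    """
    n = len(transactions)

    # Level 1: count single k-mers.
    counts = {}
    for t in transactions:
        for item in t:
            key = frozenset([item])
            counts[key] = counts.get(key, 0) + 1
    current = {s: c / n for s, c in counts.items() if c / n >= min_support}
    result = dict(current)

    # Higher levels: build candidates only from itemsets that survived.
    size = 1
    while current and size < max_size:
        size += 1
        items = sorted({i for s in current for i in s})
        candidates = [
            frozenset(c)
            for c in combinations(items, size)
            if all(frozenset(sub) in current
                   for sub in combinations(c, size - 1))
        ]
        current = {}
        for c in candidates:
            support = sum(1 for t in transactions if c <= t) / n
            if support >= min_support:
                current[c] = support
        result.update(current)
    return result


# Toy flanking sequences (illustrative only, not real insertion sites).
flanks = ["TTAAAA", "CTTAAAG", "GGCCGG"]
itemsets = frequent_itemsets([kmer_set(s) for s in flanks], min_support=0.6)
for itemset, support in sorted(itemsets.items(), key=lambda kv: sorted(kv[0])):
    print(sorted(itemset), round(support, 2))
```

In a divide-and-conquer setting, a search like this could be run separately on subsets of the sequences and the surviving itemsets merged afterwards. How the subsets are formed and merged is left open here.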
Aim 2. To design a biological testing framework focused on thoroughly validating the proposed Alu insertion site prediction model through phylogenetic footprinting.
1. K. Zhang, W. Fan, P. Deininger, A. Edwards, Z. Xu and D. Zhu, “Breaking the computational barrier: a divide-conquer and aggregate based approach for Alu insertion site characterization”, International Journal of Computational Biology and Drug Design, 2009;2(4):302-322.
doi: 10.1504/IJCBDD.2009.030763. Epub 2009 Jan 4. PMID: 20090173
2. W. Zhang, A. Edwards, W. Fan, D. Zhu and K. Zhang, “Identification of conserved and diverged co-expression modules in different biological categories”, BMC Bioinformatics 2010, 11:338. doi: 10.1186/1471-2105-11-338, http://www.biomedcentral.com/1471-2105/11/338
3. W. Fan, E. Zhong, J. Peng, O. Verscheure, K. Zhang, J. Ren, R. Yan and Q. Yang, “Generalized and Heuristic-Free Feature Construction”, Proceedings of the Tenth SIAM International Conference on Data Mining, 2010. Peer-reviewed; acceptance rate < 20%
1. “Enhancement of microRNA Research through Bioinformatics Tool Development”, LBRN NIH INBRE Program
PI; Awarded funds: $635,000; 7/2010 - 4/2015
2. “New Informatics Paradigm for Reconstructing Signaling Pathways in Human Disease”, NIH 1R21LM010137-01, Administrative Supplements
Co-PI; Awarded funds: $76,412; 7/2010 - 6/2011; PI: Dr. Dongxiao Zhu, UNO
1. “A Divide-Conquer and Aggregate Based Approach for Alu Insertion Site Characterization”, Jan. 2010, Louisiana Biomedical Research Network 8th Annual meeting
2. “Generalized and Heuristic-Free Feature Construction”, SIAM International Conference on Data Mining 2010, April, Ohio
3. Panel discussion at the 4th IEEE International Conference on Bioinformatics and Biomedical Engineering, June, 2010, Chengdu, China