 Research Article Open Access 
Mining Unique-m Substrings from Genomes
Kai Ye1*, Zhenyu Jia2, Yipeng Wang2,3, Paul Flicek4 and Rolf Apweiler5
1Molecular Epidemiology section, Medical Statistics and Bioinformatics, Leiden University Medical Center, The Netherlands
2Department of Pathology & Laboratory Medicine, University of California, Irvine, CA 92697, USA
3Vaccine Research Institute of San Diego, San Diego, CA 92121, USA
4Leiden Institute of Advanced Computer Science, Leiden University, Leiden, The Netherlands
5EMBL Outstation, European Bioinformatics Institute (EBI), Wellcome Trust Genome Campus, Hinxton, Cambridge, UK
*Corresponding author: Kai Ye, Ph.D.,
Molecular Epidemiology section, Medical Statistics and Bioinformatics, Leiden University Medical Center, The Netherlands,
E-mail : K.Ye@lumc.n
Citation: Ye K, Jia Z, Wang Y, Flicek P, Apweiler R (2010) Mining Unique-m Substrings from Genomes. J Proteomics Bioinform 3: 099-100. doi: 10.4172/jpb.1000127
Copyright: © 2010 Ye K, et al. This is an open-access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
Unique substrings in genomes may indicate a high level of specificity, which is crucial and fundamental to many genetics studies, such as PCR, microarray hybridization, Southern and Northern blotting, RNA interference (RNAi), and genome (re)sequencing. However, being a unique sequence in the genome is not by itself adequate to guarantee high specificity. For example, nucleotide mismatches within a certain tolerance may impair specificity even if a substring of interest occurs only once in the genome. In this study we propose the concept of unique-m substrings of genomes for controlling specificity in genome-wide assays. A substring is unique-m if it has only a single perfect match on one strand of the entire genome and all other approximate matches have more than m mismatches. We developed a pattern growth approach to systematically mine such unique-m substrings from a given genome. Our algorithm does not need the pre-processing step to extract sequential information that most rival methods require. The search for unique-m substrings is performed as a single regular data mining task, so that the similarities among queries are exploited to achieve a tremendous speedup. The runtime of our algorithm is linear in the size of the input genome and the length of the unique-m substrings. In addition, the unique-m mining algorithm has been parallelized to facilitate genome-wide computation on a cluster or on a single machine with multiple CPUs and shared memory.
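The unique-m definition can be illustrated with a small brute-force check. The sketch below is not the pattern growth algorithm described in the abstract; it is a naive scan whose cost grows with genome length times substring length for every query, intended only to make the definition concrete. The function names (`reverse_complement`, `hamming`, `is_unique_m`) are hypothetical, and the sketch assumes that approximate matches are counted on both the forward strand and its reverse complement, with mismatches measured as Hamming distance (substitutions only).

```python
def reverse_complement(seq):
    """Return the reverse complement of a DNA string over A, C, G, T."""
    comp = {"A": "T", "C": "G", "G": "C", "T": "A"}
    return "".join(comp[base] for base in reversed(seq))


def hamming(a, b):
    """Number of mismatching positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))


def is_unique_m(genome, start, length, m):
    """Naive check of the unique-m property for genome[start:start+length].

    Assumption: the substring is unique-m if its own occurrence is the only
    window, on either strand, with at most m mismatches to it.
    """
    query = genome[start:start + length]
    if len(query) < length:
        raise ValueError("substring runs past the end of the genome")
    hits = 0
    for strand in (genome, reverse_complement(genome)):
        for i in range(len(strand) - length + 1):
            if hamming(query, strand[i:i + length]) <= m:
                hits += 1
                if hits > 1:
                    # A second window within m mismatches breaks unique-m.
                    return False
    return hits == 1


genome = "AAAAAACAAAAA"
print(is_unique_m(genome, 5, 3, 0))  # "ACA" with no mismatches allowed
print(is_unique_m(genome, 5, 3, 1))  # "ACA" with one mismatch allowed
```

In this toy genome, "ACA" occurs exactly once, so it is unique for m = 0, but the neighbouring "AAA" windows differ from it by a single mismatch, so it is no longer unique-m for m = 1. Under the both-strands assumption, a substring that equals its own reverse complement is also rejected, because it matches at the same locus on the opposite strand.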