by
Veli Mäkinen(Department of Computer Science, UNIVERSITY OF HELSINKI)
→
Europe/Stockholm
RB35
RB35
Description
A repetitive sequence collection is one where portions of a base
sequence of length n are repeated many times with small variations,
forming a collection of total length N. Examples of such collections are
version control data and genome sequences of individuals, where
the differences can be expressed by lists of basic edit operations.
Flexible and efficient data analysis on a such typically huge collection
is plausible using suffix trees. However, suffix tree occupies O(N log
N) bits, which very soon inhibits in-memory analyses. Recent advances in
full-text self-indexing reduce the space of suffix tree to O(N log c)
bits, where c is the alphabet size. In practice, the space reduction is
more than 10-fold, for example on suffix tree of Human Genome. However,
this reduction factor remains constant when more sequences are
added to the collection.
We develop a new family of self-indexes suited for the repetitive
sequence collection setting. Their expected space requirement depends
mostly on the length n of the base sequence and the number s of
variations in its repeated copies. That is, the space reduction factor
is no longer constant, but depends on N/n.
The structures developed in this work are aimed at providing a
fundamental basis for storage and retrieval of individual genomes as
they become available due to rapid progress in the sequencing technologies.
Joint work with G. Navarro, J. Sirén, and N. Välimäki.