Minimal absent words (MAWs) of a genomic sequence are words that do not occur in the sequence themselves but whose proper subwords all do. The distribution of genomic MAWs as a function of their length has been observed to be qualitatively similar for all living organisms: the bulk are rather short, and only relatively few are long. It has been an open question whether this phenomenon is purely statistical or reflects a biological mechanism, and what biological information absent words contain.
In this work we demonstrate that the bulk can be described by a probabilistic model of sampling words from random sequences, while the tail of long MAWs is of biological origin. We show that the long MAWs are caused by evolutionarily fine-tuned regions of the genome that are important for genetic regulation, and that properties of the tail of the distribution are related to high-level phenotypic features.
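To make the definition of a minimal absent word concrete, the following is a minimal brute-force sketch in Python. It is an illustration only, not the method used in the paper: the function name `minimal_absent_words`, the `max_len` cutoff, and the convention of starting at length 2 are assumptions made here. It relies on the fact that a word is a MAW exactly when it is absent while its longest proper prefix and longest proper suffix are both present.

```python
from itertools import product


def minimal_absent_words(seq, alphabet="ACGT", max_len=6):
    """Enumerate minimal absent words of seq with length 2..max_len.

    Brute force over all candidate words, so only practical for short
    sequences and small max_len (illustrative assumption, not the
    paper's algorithm).
    """
    # Every substring of seq of length 1..max_len.
    present = {
        seq[i:i + k]
        for k in range(1, max_len + 1)
        for i in range(len(seq) - k + 1)
    }
    maws = []
    for k in range(2, max_len + 1):
        for letters in product(alphabet, repeat=k):
            w = "".join(letters)
            # Absent itself, but both maximal proper subwords occur;
            # this implies that every proper subword occurs.
            if w not in present and w[:-1] in present and w[1:] in present:
                maws.append(w)
    return maws


if __name__ == "__main__":
    print(minimal_absent_words("AAC", alphabet="AC", max_len=4))
```

For the toy sequence `AAC` over the alphabet `{A, C}`, the words `CA`, `CC`, and `AAA` are absent while all their shorter subwords occur, so they are the MAWs up to length 4. Genome-scale analyses would need a suffix-array or similar index rather than this enumeration.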
The talk is based on:
Erik Aurell, Nicolas Innocenti, Hai-Jun Zhou, The Bulk and The Tail of Minimal Absent Words in Genome Sequences [arXiv:1509.05188].