Speaker
Roberta Sinatra
(University of Catania)
Description
There are many examples in biology, linguistics and the
theory of dynamical systems where information resides in,
and has to be extracted from, corpora of raw data consisting
of sequences of symbols. For instance, a written text in
English or in another language is a collection of sentences,
each sentence being a sequence of the letters from a given
alphabet. Not all sequences of letters are possible, since
sentences are built from a lexicon containing a certain
number of words. In addition, different words are used
together in a structured and conventional way. Similarly, in
biology, DNA nucleotide or amino acid sequence data can be
seen as corpora of strings. Many results have shown that
proteins are far from being random assemblies of peptides,
and that DNA sequences show non-trivial statistical
properties. All this
gives meaning to the metaphor of DNA and protein sequences
regarded as texts written in a still unknown language.
Sequences of symbols can also be found in time series
generated by dynamical systems. In fact, a trajectory in
phase space can be transformed into a sequence of symbols by
the so-called “symbolic dynamics” approach. In all the
examples mentioned above, the main challenge is to decipher
the message contained in the corpora of data sequences, and
to infer the underlying rules that govern their production.
We propose a general method to construct networks out of any
symbolic sequential data. The method consists of two
steps: first, it extracts in a “natural” way the motifs,
i.e. the recurrent short strings that play the same role as
words do in language; then, it represents correlations
between motifs within sequences as a network.
Important information from the original data is embedded in
such a network and can be easily retrieved, as we will show
through diverse applications to social dialogs, biological
examples and dynamical systems. Compared with previous
linguistic methods, our approach does not require a priori
knowledge of a dictionary. This makes the method very
general and opens up a wide range of applications, from the
study of written text to the analysis of different
trajectories in dynamical systems.
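
The two-step idea described above (motif extraction, then a network of motif correlations) can be illustrated with a minimal Python sketch. The abstract does not specify how motifs are selected or how correlations are measured, so everything concrete here is an assumption made for illustration only: motifs are taken to be fixed-length strings that occur more often than expected if symbols were independent, and two motifs are linked when they occur in the same sequence. The names symbolize, extract_motifs and motif_network, and the parameters k, min_ratio and min_count, are hypothetical and not part of the method presented in the talk.

```python
from collections import Counter
from itertools import combinations
from math import prod


def symbolize(trajectory, threshold=0.5):
    """Turn a 1-D trajectory into a binary string (assumed two-cell partition)."""
    return "".join("1" if x >= threshold else "0" for x in trajectory)


def extract_motifs(sequences, k=3, min_ratio=1.5, min_count=2):
    """Return {motif: observed/expected ratio} for over-represented k-letter strings.

    Assumption: the null model treats symbols as independent draws
    with their empirical frequencies.
    """
    symbol_counts = Counter(ch for seq in sequences for ch in seq)
    total_symbols = sum(symbol_counts.values())
    kmer_counts = Counter(
        seq[i:i + k] for seq in sequences for i in range(len(seq) - k + 1)
    )
    total_kmers = sum(kmer_counts.values())

    motifs = {}
    for kmer, observed in kmer_counts.items():
        # Expected count of this string if symbols were independent.
        probability = prod(symbol_counts[ch] / total_symbols for ch in kmer)
        expected = probability * total_kmers
        ratio = observed / expected
        if observed >= min_count and ratio >= min_ratio:
            motifs[kmer] = ratio
    return motifs


def motif_network(sequences, motifs):
    """Return {(motif_a, motif_b): weight}, where weight counts the sequences
    in which both motifs occur (assumed co-occurrence measure)."""
    edges = Counter()
    for seq in sequences:
        present = sorted(m for m in motifs if m in seq)
        for a, b in combinations(present, 2):
            edges[(a, b)] += 1
    return dict(edges)


# Toy corpus without word boundaries: no dictionary is supplied.
corpus = ["thecatsat", "thedogsat", "acatandadog"]
motifs = extract_motifs(corpus, k=3)
network = motif_network(corpus, motifs)
```

The same two functions can be applied to strings produced by symbolize from a time series, or to nucleotide or amino acid strings, since nothing in the sketch depends on the alphabet.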