Direct Coupling Analysis (DCA) has emerged as a powerful tool for finding pairwise dependencies in large biological data sets, with particularly striking results in the in silico prediction of contacts in protein structures. DCA amounts to fitting the coefficients of an Ising model or a Potts model to data, and then using the largest inferred coefficients as predictors of the dependencies of interest.
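As a minimal illustration of this infer-then-rank recipe, the sketch below uses naive mean-field inversion for an Ising model on +/-1 data, one common approximation in which the couplings are estimated as the negative inverse of the covariance matrix. The function names, the `ridge` regularizer and the choice of mean-field inversion are assumptions made for illustration; the abstract does not say which inference method is used.

```python
import numpy as np

def mean_field_couplings(samples, ridge=1e-2):
    """Estimate Ising couplings from a (n_samples, n_sites) array of +/-1 spins.

    Assumption: naive mean-field inversion, J = -C^{-1} off the diagonal.
    The ridge term is a hypothetical regularizer that keeps C invertible.
    """
    c = np.cov(samples, rowvar=False)       # site-by-site covariance matrix
    c = c + ridge * np.eye(c.shape[0])      # regularize before inverting
    j = -np.linalg.inv(c)                   # mean-field coupling estimate
    np.fill_diagonal(j, 0.0)                # self-couplings are not of interest
    return j

def top_pairs(j, k):
    """Return the k site pairs (i < j) with the largest absolute coupling."""
    rows, cols = np.triu_indices(j.shape[0], k=1)
    order = np.argsort(-np.abs(j[rows, cols]))[:k]
    return [(int(rows[m]), int(cols[m])) for m in order]
```

Calling `top_pairs(mean_field_couplings(samples), k)` returns the k site pairs with the largest absolute inferred couplings, which play the role of predicted dependencies.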
In an earlier contribution (Skwark et al., PLoS Genetics 2017) we showed that DCA can be used on whole-genome bacterial data to predict links between genes involved in antibiotic resistance. The main computational bottleneck is then the inference step.
Recently we have investigated whether DCA can be sped up by first filtering the data on correlations, an approach we call Correlation-Compressed Direct Coupling Analysis (CC-DCA). The computational bottleneck then moves from DCA to the more standard task of finding the subset of most strongly correlated vectors in a large data set. I will describe the results obtained so far, and outline what it would take to do CC-DCA on whole-genome data from higher organisms.
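The filtering step can be pictured with the following sketch, which keeps only those sites whose strongest absolute correlation with any other site reaches a threshold. The selection rule, the function name `correlation_filter` and the `threshold` parameter are hypothetical choices made for illustration and are not taken from the paper; the brute-force correlation matrix used here is also only practical for modest numbers of sites.

```python
import numpy as np

def correlation_filter(samples, threshold):
    """Return indices of sites strongly correlated with at least one other site.

    samples: (n_samples, n_sites) array, one column per site.
    Assumption: a site is kept if its largest absolute Pearson correlation
    with any other site is at least `threshold` (hypothetical rule).
    """
    corr = np.corrcoef(samples, rowvar=False)   # site-by-site correlations
    corr = np.nan_to_num(corr)                  # constant columns yield NaN
    np.fill_diagonal(corr, 0.0)                 # ignore self-correlation
    strongest = np.abs(corr).max(axis=1)        # best partner for each site
    return np.flatnonzero(strongest >= threshold)
```

DCA would then be run only on the reduced array `samples[:, kept]`, where `kept` is the returned index array, and the resulting pair indices mapped back to the original sites through `kept`.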
This is joint work with Chen-Yi Gao and Hai-Jun Zhou, available as arXiv:1710.04819.