Description
One common issue across vastly different fields of research and industry is the ever-increasing need for data storage. With experiments recording more complex data at higher rates, the recorded data are quickly outgrowing available storage capacity. This issue is especially prominent in LHC experiments such as ATLAS, where within five years the resources needed are expected to exceed the available storage many times over (assuming a flat budget model and current technology trends) [1]. Since the data formats used are already highly compressed, storage constraints could require more drastic measures such as lossy compression, where some data accuracy is lost during the compression process.
In our work, building on a number of undergraduate projects [2,3,4,5,6,7], we have developed an interdisciplinary open-source tool for machine learning-based lossy compression. The tool utilizes an autoencoder neural network, which is trained to compress and decompress data based on correlations between the different variables in the dataset. The process is lossy, meaning that the original data values and distributions cannot be reconstructed precisely. However, for variables and observables where the precision loss is tolerable, the high compression ratio allows more data to be stored, yielding greater statistical power.
The tool we have developed is called Baler and is available as an open-source project [8][9].
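
As an illustration of the general approach, the sketch below shows a minimal autoencoder-based lossy compression loop in PyTorch. The layer sizes, latent dimension, training loop, and the 24-variable placeholder dataset are illustrative assumptions, not Baler's actual configuration; the sketch only demonstrates how an encoder can map correlated variables to a smaller latent representation that is stored in place of the original values.

    # Minimal sketch of autoencoder-based lossy compression.
    # All sizes and hyperparameters here are assumptions for illustration only.
    import torch
    import torch.nn as nn

    class AutoEncoder(nn.Module):
        def __init__(self, n_features: int, latent_dim: int):
            super().__init__()
            # Encoder maps each event's variables to a smaller latent vector ...
            self.encoder = nn.Sequential(
                nn.Linear(n_features, 64), nn.ReLU(),
                nn.Linear(64, latent_dim),
            )
            # ... and the decoder reconstructs the original variables from it.
            self.decoder = nn.Sequential(
                nn.Linear(latent_dim, 64), nn.ReLU(),
                nn.Linear(64, n_features),
            )

        def forward(self, x: torch.Tensor) -> torch.Tensor:
            return self.decoder(self.encoder(x))

    data = torch.randn(10_000, 24)                     # placeholder for a real dataset
    model = AutoEncoder(n_features=24, latent_dim=6)   # 4x fewer stored values per event
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
    loss_fn = nn.MSELoss()

    # Train so the network learns the correlations between variables.
    for epoch in range(20):
        optimizer.zero_grad()
        reconstruction = model(data)
        loss = loss_fn(reconstruction, data)  # reconstruction error = information loss
        loss.backward()
        optimizer.step()

    # "Compression": store only the latent representation (plus the decoder weights).
    # "Decompression": run the decoder to approximately recover the original values.
    compressed = model.encoder(data)
    decompressed = model.decoder(compressed)

The trade-off is governed by the size of the latent representation: a smaller latent vector gives a higher compression ratio at the cost of a larger reconstruction error, which is why the method is best suited to variables where some loss of precision is acceptable.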
[1] - https://cerncourier.com/a/time-to-adapt-for-big-data/
[2] - http://lup.lub.lu.se/student-papers/record/9049610
[3] - http://lup.lub.lu.se/student-papers/record/9012882
[4] - http://lup.lub.lu.se/student-papers/record/9004751
[5] - http://lup.lub.lu.se/student-papers/record/9075881
[6] - https://zenodo.org/record/5482611#.Y3Yysy2l3Jz
[7] - https://zenodo.org/record/4012511#.Y3Yyny2l3Jz
[8] - https://zenodo.org/record/7817467#.ZED-65FBzmE
[9] - https://github.com/baler-collaboration/baler