High Performance Compressive Algorithms for Big Genomics Data

Sequencing of genomes for numerous species, including humans, has become increasingly affordable due to next-generation high-throughput genome sequencing (NGS) technologies. This opens up new perspectives for the diagnosis and treatment of genetic diseases and is increasingly effective for conducting systems-biology studies. However, many computational challenges need to be addressed before these sequencing technologies find their way into everyday health and human care. The first daunting challenge is the flood of sequencing data produced by these technologies (several terabytes for a few experimental runs), which can reach the petabyte level for comprehensive systems-biology studies.

One direction for dealing with such Big Data is the compression of genomic information and the ability to analyze the data in its compressed form. Significant genomic data compression is needed to reduce storage size, to increase the speed of transmission to cloud infrastructure, and to reduce the cost of the I/O bandwidth required for connecting databases to these processing facilities. It is, therefore, desirable to develop high-performance computing solutions for efficient and fast compression of Big NGS Data sets. Further, a framework that allows analysis of the compressed form of NGS data would be of immense advantage in terms of time and resources. Since I/O cost is a bottleneck for scalable solutions on large supercomputers, compressed and encoded data that exploits the redundancy in NGS data sets is highly desirable. To this end, we have investigated this problem on two fronts: 1) HPC algorithms for compression of NGS data sets on distributed-memory supercomputers, and 2) encoding of data for better transmission over networks and better I/O utilization.

HPC algorithms for compression of NGS data sets on distributed-memory supercomputers

We have presented phyNGSC, a hybrid MPI-OpenMP strategy to accelerate the compression of big NGS datasets by combining the best features of distributed- and shared-memory architectures: it balances the workload among processes, alleviates memory latency by exploiting locality, and accelerates I/O by reducing excessive read/write operations and inter-node message exchange. Our algorithm introduces a novel timestamp-based approach that allows concurrent writing of compressed data in a non-deterministic order, letting us exploit a high degree of parallelism.
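The block-parallel structure underlying such strategies can be illustrated with a minimal Python sketch (threads and `zlib` stand in for MPI/OpenMP processes and the genomic compressor; this is not the phyNGSC implementation itself):

```python
import zlib
from concurrent.futures import ThreadPoolExecutor

def compress_blocks(data: bytes, block_size: int = 1 << 16, workers: int = 4):
    """Split the input into fixed-size blocks and compress them in parallel."""
    blocks = [data[i:i + block_size] for i in range(0, len(data), block_size)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # Each block is compressed independently, so the work is
        # embarrassingly parallel across workers.
        return list(pool.map(zlib.compress, blocks))

def decompress_blocks(compressed_blocks):
    """Reassemble the original byte stream from the compressed blocks."""
    return b"".join(zlib.decompress(b) for b in compressed_blocks)
```

Because blocks are independent, each worker can proceed without synchronizing with the others; the ordering question only arises when the compressed blocks are written to a single output file, which is the subject of the following paragraphs.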

One of the challenges for parallel applications is concurrent reading and writing. The effects of concurrent memory writes have been extensively studied; however, the specific issues encountered when performing I/O operations on big data are relatively new in the field. A common way of writing to a file in parallel is to dedicate processes to reading and writing and to use the remaining processes for computation, in a master/slave approach. In these parallel compression schemes, a shared memory buffer is required so that the writer can fetch the data. This guarantees a writing order, since only one process accesses the output file and the data to be written is stored in the shared memory buffer in an identifiable order. However, this read/write scheme limits the scalability of parallel genomic compression algorithms, because the process in charge of writing has to wait for all of the workers to finish processing the data before it can write to the output file.
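The dedicated-writer scheme described above can be sketched as follows. This is an illustrative model of the pattern using Python threads and queues, with `zlib` standing in for the genomic compressor; note how the single writer must buffer out-of-order results and stall until the next block in sequence arrives:

```python
import io
import queue
import threading
import zlib

def compress_with_dedicated_writer(chunks, num_workers=4):
    """Workers compress chunks; one dedicated writer emits them in order."""
    work = queue.Queue()   # chunks waiting to be compressed
    done = queue.Queue()   # compressed chunks waiting to be written
    out = io.BytesIO()     # stands in for the single output file

    def worker():
        while True:
            item = work.get()
            if item is None:          # sentinel: no more work
                return
            idx, raw = item
            done.put((idx, zlib.compress(raw)))

    def writer(total):
        pending, next_idx = {}, 0
        while next_idx < total:
            idx, blob = done.get()
            pending[idx] = blob
            # Write strictly in chunk order, stalling until the
            # next expected chunk id has been produced.
            while next_idx in pending:
                out.write(pending.pop(next_idx))
                next_idx += 1

    threads = [threading.Thread(target=worker) for _ in range(num_workers)]
    wt = threading.Thread(target=writer, args=(len(chunks),))
    for t in threads:
        t.start()
    wt.start()
    for i, c in enumerate(chunks):
        work.put((i, c))
    for _ in threads:
        work.put(None)
    for t in threads:
        t.join()
    wt.join()
    return out.getvalue()
```

The `pending` dictionary makes the bottleneck visible: compressed blocks that finish early sit idle in memory until every earlier block has been written.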

For a parallel compression algorithm to be more scalable, the compressed form of the data should be written as soon as its processing is finished. To overcome the resulting non-determinism and still guarantee a recoverable write order, we integrated the concept of time-stamping, which we call Timestamp-based Ordering for Concurrent Writing, or TOC-W. Our method uses a global clock and takes a snapshot of the completion time of each writing operation to the output file. These parallel data structures and methods yield highly scalable parallel compression algorithms, demonstrated on the large XSEDE SDSC Gordon supercomputer, while giving compression ratios similar to other systems. All of the code is available as open source on our lab web portal.
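A simplified sketch of the TOC-W idea follows, with chunk ids standing in for the global-clock timestamps and threads standing in for MPI processes (illustrative only, not the phyNGSC code). Each worker appends its compressed record to the output as soon as it finishes, in whatever order completion happens; because every record carries its id, the decompressor can restore the original order afterwards:

```python
import io
import struct
import threading
import zlib

def tocw_compress(chunks, num_workers=4):
    """Workers append tagged records in completion order (non-deterministic)."""
    out = io.BytesIO()
    lock = threading.Lock()      # stands in for an atomic file append
    work = list(enumerate(chunks))

    def worker():
        while True:
            with lock:
                if not work:
                    return
                idx, raw = work.pop()
            blob = zlib.compress(raw)
            # Tag the record with its chunk id and length so the
            # reader can locate and reorder it later.
            record = struct.pack(">II", idx, len(blob)) + blob
            with lock:           # append as soon as ready -- no global ordering
                out.write(record)

    threads = [threading.Thread(target=worker) for _ in range(num_workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return out.getvalue()

def tocw_decompress(blob):
    """Read the tagged records and reassemble chunks by their original id."""
    parts, off = {}, 0
    while off < len(blob):
        idx, n = struct.unpack_from(">II", blob, off)
        off += 8
        parts[idx] = zlib.decompress(blob[off:off + n])
        off += n
    return b"".join(parts[i] for i in range(len(parts)))
```

Unlike the dedicated-writer scheme, no worker ever waits for another worker's block to be written; the cost of reordering is deferred to decompression, where the tags make it cheap.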

We have also made progress in developing efficient encoding techniques that allow transmission and movement of Big NGS Data over networks. These include an efficient sampling-based encoding as well as deep-learning algorithms for encoding genomic data sets. We plan to extend this line of research toward efficient data movement on large supercomputers. Such efficient, compressive encoding of big NGS data will alleviate I/O bottlenecks when analyzing Big NGS Data on petascale supercomputers. All implementations of our parallel algorithms are available on the lab portal as open-source code, free for non-commercial use.
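As one concrete example of the redundancy such encodings exploit, the four-letter DNA alphabet allows sequences to be packed at 2 bits per base, a fourfold reduction over 8-bit ASCII. The sketch below is a hypothetical minimal illustration of this idea, not our sampling-based or deep-learning encoders:

```python
def pack_2bit(seq: str) -> bytes:
    """Pack an A/C/G/T sequence at 2 bits per base (4x smaller than ASCII)."""
    code = {"A": 0, "C": 1, "G": 2, "T": 3}
    out = bytearray()
    acc = bits = 0
    for base in seq:
        acc = (acc << 2) | code[base]
        bits += 2
        if bits == 8:                # a full byte's worth of bases
            out.append(acc)
            acc = bits = 0
    if bits:
        out.append(acc << (8 - bits))  # left-pad the final partial byte
    return bytes(out)

def unpack_2bit(data: bytes, length: int) -> str:
    """Inverse of pack_2bit; `length` is the original base count."""
    bases = "ACGT"
    seq = []
    for byte in data:
        for shift in (6, 4, 2, 0):
            seq.append(bases[(byte >> shift) & 3])
    return "".join(seq[:length])     # drop any padding bases
```

Real NGS files also carry read identifiers and per-base quality scores, which call for different encodings; the point here is only that the sequence stream itself is highly redundant and amenable to compact representations.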

Recent Publications

  1. Sandino Vargas-Pérez and Fahad Saeed*, “A Hybrid MPI-OpenMP Strategy to Speedup the Compression of Big Next-Generation Sequencing Datasets“, IEEE Transactions on Parallel and Distributed Systems, March 2017 Tech Report | IEEE Xplore
  2. Mohammed Aledhari, Marianne Di Pierro, Mohamed Hefeida, and Fahad Saeed*, “A Deep Learning-Based Data Minimization Algorithm for Fast and Secure Transfer of Big Genomic Datasets“, IEEE Transactions on Big Data, Feb 2018 IEEE Xplore
  3. Mohammed Aledhari, Marianne Di Pierro, and Fahad Saeed*, “A Fourier-Based Data Minimization Algorithm for Fast and Secure Transfer of Big Genomic Datasets”, Proceedings of IEEE Big Data Congress, San Francisco CA USA, July 2-7, 2018
  4. Sandino Vargas-Pérez and Fahad Saeed*, “Scalable Data Structure to Compress Next-Generation Sequencing Files and its Application to Compressive Genomics“, Proceedings of IEEE International Conference on Bioinformatics and Biomedicine (IEEE BIBM), Kansas City, MO, USA, Nov 13-16, 2017 Tech Report | IEEE Xplore
  5. Sandino N. V. Perez and Fahad Saeed*,  “A Parallel Algorithm for Compression of Big Next-Generation Sequencing (NGS) Datasets“, Proceedings of Parallel and Distributed Processing with Applications (IEEE ISPA-15), Vol.3. pp. 196-201 Helsinki Finland, Aug 2015 Tech Report | IEEE Xplore