High Performance Computing Algorithms for Big Data Proteomics, Proteogenomics, and Meta-Proteomics
Proteogenomics is an emerging systems-biology research field at the intersection of proteomics and genomics. Proteogenomics studies require two high-throughput technologies: mass spectrometry (MS) for proteomics and next-generation sequencing (NGS) for genomics. Each technology independently produces a data deluge that creates problems of storage, transfer, analysis, and visualization, and integrating the two big data sets (NGS+MS) for proteogenomics studies compounds all of these problems. Existing sequential algorithms for proteogenomics analyses are inadequate for big data, and high performance computing (HPC) solutions are almost non-existent. The computational challenges are more pronounced for proteogenomics studies of non-model organisms with previously unsequenced or partially sequenced genomes. Note that in order to achieve the goals below, we also contribute efficient HPC algorithms for proteomics as well as more fundamental HPC strategies (e.g., sorting and indexing).
The goal of our proposed research is to design and develop algorithmic and high-performance computing infrastructure for big proteogenomics data. Essential to achieving high performance for such high-volume data applications is the ability of the proposed solutions to quickly narrow down the solution space, operate on a reduced number of data points, and exploit massive data parallelism without sacrificing accuracy.
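As an illustration of what "narrowing down the solution space" can look like in practice, the following is a minimal sketch, not the project's implementation. It assumes a common database-search pattern in which candidate peptides are pre-sorted by mass, so that only those within a tolerance of a spectrum's precursor mass are considered. The function names, the toy masses, and the tolerance value are all hypothetical.

```python
from bisect import bisect_left, bisect_right


def build_mass_index(peptides):
    """Sort (mass, sequence) pairs by mass once, so each lookup is O(log n).

    `peptides` is an iterable of (mass, sequence) tuples (illustrative format).
    """
    ordered = sorted(peptides, key=lambda p: p[0])
    masses = [mass for mass, _ in ordered]
    sequences = [seq for _, seq in ordered]
    return masses, sequences


def candidates_in_window(index, precursor_mass, tolerance):
    """Return sequences whose mass lies within precursor_mass +/- tolerance."""
    masses, sequences = index
    lo = bisect_left(masses, precursor_mass - tolerance)
    hi = bisect_right(masses, precursor_mass + tolerance)
    return sequences[lo:hi]


# Toy data: the masses are made-up numbers, not real peptide masses.
index = build_mass_index([
    (1200.0, "PEPTIDED"),
    (800.3, "PEPTIDEB"),
    (500.0, "PEPTIDEA"),
    (800.5, "PEPTIDEC"),
])
hits = candidates_in_window(index, precursor_mass=800.4, tolerance=0.25)
```

Only the candidates returned by the window query would need to be scored against the spectrum, which is one simple way to shrink the search space before any expensive comparison is done.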
This research aims to design and build infrastructure that is useful to the broadest biological and ecological community. The project will accomplish the following:
- Design and development of novel algorithms for analysis of big proteogenomics data sets. This includes reductive algorithms that allow computations in sub-linear time, compression algorithms that operate in sub-linear space, and algorithms that operate on a lossy compressed form of the data.
- Design and development of high performance algorithms for a variety of proteogenomics big data problems. Novel sketching, sampling and dimensionality reduction strategies will be established to analyze proteogenomics datasets in a reasonable time.
- Design and development of interfaces between the proposed HPC algorithms and familiar bioinformatics frameworks. This will make the proposed tools accessible to a large community of experimental and systems biologists.
- Performance evaluation of the proposed algorithms using simulated as well as real experimental data sets. Traditional HPC metrics such as scalability, memory, and communication costs will also be studied for peta-scale problems.
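The data-reduction idea in the list above can be sketched with a deliberately simple example. This is not any of the project's published algorithms. It assumes a spectrum is a list of (m/z, intensity) pairs and keeps only the most intense fraction of peaks, so that downstream steps operate on fewer data points. The function name and the `keep_fraction` parameter are hypothetical.

```python
import heapq


def reduce_spectrum(peaks, keep_fraction=0.3):
    """Keep the most intense `keep_fraction` of peaks, returned in m/z order.

    `peaks` is a list of (mz, intensity) tuples (illustrative format).
    At least one peak is kept for any non-empty spectrum.
    """
    if not 0 < keep_fraction <= 1:
        raise ValueError("keep_fraction must be in (0, 1]")
    if not peaks:
        return []
    k = max(1, int(len(peaks) * keep_fraction))
    # nlargest avoids a full sort when k is much smaller than len(peaks).
    top = heapq.nlargest(k, peaks, key=lambda p: p[1])
    return sorted(top)


# Toy spectrum with made-up values.
spectrum = [(100.0, 5.0), (150.0, 50.0), (200.0, 1.0), (250.0, 30.0), (300.0, 2.0)]
reduced = reduce_spectrum(spectrum, keep_fraction=0.4)
```

A filter like this trades some information for speed. Whether the discarded peaks affect identification accuracy would have to be checked in the kind of performance evaluation described above.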
The proposed research brings together tools from multiple disciplines such as systems biology, proteomics, genomics, algorithms, and high performance computing. The proposed research activity will create novel high-performance software infrastructure that will reveal previously unobservable phenomena of genome-proteome interaction at a much larger scale than is currently possible. Such proteogenomics data analysis may help us study disease identification, plant adaptation to climate change, venom, pathogens, and megaviruses. All of the implementations will be available to the academic community via open-source code and web services.
- Haseeb, Muhammad, and Fahad Saeed. “High performance computing framework for tera-scale database search of mass spectrometry data.” Nature Computational Science 1, no. 8 (2021): 550-561 Nature
- Fahad Saeed, and Muhammad Haseeb. “Methods and systems for compressing data.” U.S. Patent 10,810,180, issued October 20, 2020. Google Patents
- Haseeb, Muhammad, and Fahad Saeed. “Efficient Shared Peak Counting in Database Peptide Search Using Compact Data Structure for Fragment-Ion Index.” In 2019 IEEE International Conference on Bioinformatics and Biomedicine (BIBM), pp. 275-278. IEEE, 2019. Xplore
- Haseeb, Muhammad*, Fatima Afzali, and Fahad Saeed. “LBE: A Computational Load Balancing Algorithm for Speeding up Parallel Peptide Search in Mass-Spectrometry based Proteomics.” In 2019 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW), pp. 191-198. IEEE, 2019. Xplore
- Muaaz Awan, and Fahad Saeed*, “MaSS‐Simulator: A Highly Configurable Simulator for Generating MS/MS Datasets for Benchmarking of Proteomics Algorithms”, Wiley Proteomics, Oct 2018 Wiley | PubMed
- Muaaz Awan, Taban Eslami, and Fahad Saeed*, “GPU-DAEMON: GPU Algorithm Design, Data Management & Optimization template for array based big omics data”, Elsevier Computers in Biology and Medicine, Aug 2018 Elsevier | PubMed
- Muaaz Gul Awan and Fahad Saeed*, “An Out-of-Core GPU based dimensionality reduction algorithm for Big Mass Spectrometry Data and its application in bottom-up Proteomics”, Proceedings of ACM Conference on Bioinformatics, Computational Biology, and Health Informatics (ACM BCB), Boston MA, August 2017 Tech Report | ACM | PubMed
- Muaaz Awan and Fahad Saeed*, “MS-REDUCE: An ultrafast technique for reduction of Big Mass Spectrometry Data for high-throughput processing”, accepted in Oxford Bioinformatics, Jan 2016 Tech Report | PubMed | Oxford
- Fahad Saeed*, “Big Data Proteogenomics and High Performance Computing: Challenges and Opportunities”, Symposium on Signal and Information Processing for Software-Defined Ecosystems, and Green Computing, Proceedings of IEEE Global Conference on Signal and Information Processing (IEEE GlobalSIP), Orlando Florida, Dec 2015 Tech Report | IEEE Xplore
- Fahad Saeed*, Jason Hoffert and Mark Knepper, “CAMS-RS: Clustering Algorithm for Large-Scale Mass Spectrometry Data using Restricted Search Space and Intelligent Random-Sampling”, IEEE/ACM Transactions on Computational Biology and Bioinformatics, Vol. 11, No. 1, pp. 128-141, Jan. 2014 Tech Report | PubMed | IEEE Xplore