Named Data Networking for Genomics Data Management and Integrated Workflows

2021
Advanced imaging and DNA sequencing technologies now enable the diverse biology community to routinely generate and analyze terabytes of high-resolution biological data. The community is rapidly heading towards the petascale in single-investigator laboratory settings. As evidence, the NCBI SRA central DNA sequence repository alone contains over 42 petabytes of biological data. Given the geometric growth of this and other genomics repositories, an exabyte of mineable biological data is imminent. The challenges of effectively utilizing these datasets are enormous, as they are not only large in size but also stored in geographically distributed repositories such as the National Center for Biotechnology Information (NCBI), the DNA Data Bank of Japan (DDBJ), the European Bioinformatics Institute (EBI), and NASA's GeneLab. In this work, we first systematically point out the data-management challenges of the genomics community. We then show that Named Data Networking (NDN), a novel but well-researched Internet architecture, is capable of solving these challenges at the network layer. NDN performs all operations, such as forwarding requests to data sources, content discovery, access, and retrieval, using content names (which are similar to traditional filenames or filepaths) and eliminates the need for a location layer (the IP address) for data management. Utilizing NDN for genomics workflows simplifies data discovery, speeds up data retrieval through in-network caching of popular datasets, and allows the community to create infrastructure that supports operations such as federating content repositories, retrieving from multiple sources, and remote data subsetting. Name-based operations also streamline the deployment and integration of workflows with various cloud platforms.
Our contribution in this work is twofold: (a) we enumerate the cyberinfrastructure challenges of the genomics community that NDN can alleviate, and (b) we describe our efforts in applying NDN to a contemporary genomics workflow (GEMmaker) and quantify the improvements. The preliminary evaluation shows a sixfold speedup in data insertion into the workflow. We also discuss our continued effort in integrating NDN with cloud computing platforms, such as the Pacific Research Platform (PRP).
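To make the name-based retrieval and in-network caching described in the abstract concrete, the following is a minimal toy sketch, not the authors' implementation and not a real NDN library. It models a single node that forwards a request by longest-prefix match on the content name and answers repeated requests from its local cache. The class `ToyForwarder`, the producer `sra_repository`, the prefix `/genomics/sra`, and the accession `SRR0000001` are all hypothetical placeholders chosen for illustration.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, Optional

# A producer maps a content name to its bytes (stand-in for a data repository).
Producer = Callable[[str], bytes]


@dataclass
class ToyForwarder:
    """Toy model of one NDN node: a name-prefix table plus a content cache."""
    fib: Dict[str, Producer] = field(default_factory=dict)
    content_store: Dict[str, bytes] = field(default_factory=dict)
    upstream_fetches: int = 0

    def register_prefix(self, prefix: str, producer: Producer) -> None:
        # Requests are routed by name prefix, not by host address.
        self.fib[prefix] = producer

    def _longest_prefix_match(self, name: str) -> Optional[Producer]:
        components = name.strip("/").split("/")
        # Try the full name first, then progressively shorter prefixes.
        for end in range(len(components), 0, -1):
            prefix = "/" + "/".join(components[:end])
            if prefix in self.fib:
                return self.fib[prefix]
        return None

    def express_interest(self, name: str) -> bytes:
        # Serve from the local cache when the named content is already held.
        if name in self.content_store:
            return self.content_store[name]
        producer = self._longest_prefix_match(name)
        if producer is None:
            raise KeyError(f"no route for name {name}")
        data = producer(name)
        self.upstream_fetches += 1
        self.content_store[name] = data  # cache for later requests
        return data


def sra_repository(name: str) -> bytes:
    # Hypothetical repository: returns placeholder bytes for any name.
    return f"reads for {name}".encode()


node = ToyForwarder()
node.register_prefix("/genomics/sra", sra_repository)
first = node.express_interest("/genomics/sra/SRR0000001/chunk=0")
second = node.express_interest("/genomics/sra/SRR0000001/chunk=0")
```

In this sketch the second request never reaches the repository, which is the effect the abstract attributes to in-network caching of popular datasets; a real deployment would also involve data signing, cache eviction, and multi-hop forwarding, none of which are modeled here.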