Exploiting Common Patterns for Tree-Structured Data

2017 
Tree-structured data formats, such as JSON and Protocol Buffers, are capable of expressing sophisticated data types, including nested, repeated, and missing values. While such expressing power contributes to their popularity in real-world applications, it presents a significant challenge for systems supporting tree-structured data. Existing systems have focused on general-purpose solutions either extending RDBMSs or designing native systems. However, the general-purpose approach often results in sophisticated data structures and algorithms, which may not reflect and optimize for the actual structure patterns in the real world. In this paper, we aim to better understand tree-structured data types in real uses and optimize for the common patterns. We present an in-depth study of five types of real-world use cases of tree-structured data. We find that a majority of the root-to-leaf paths in the tree structures are simple, containing up to one repeated node. Given this insight, we design and implement Steed, a native analytical database system for tree-structured data. Steed implements the baseline general-purpose support for storing and querying data in both row and column layouts. Then we enhance the baseline design with a set of optimizations to simplify and improve the processing of simple paths. Experimental evaluation shows that our optimization improves the baseline by a factor of up to 1.74x. Compared to three representative state-of-the-art systems (i.e. PostgreSQL, MongoDB, and Hive+Parquet), Steed achieves orders of magnitude better performance in both cold cache and hot cache scenarios.
    • Correction
    • Source
    • Cite
    • Save
    • Machine Reading By IdeaReader
    21
    References
    9
    Citations
    NaN
    KQI
    []
    Baidu
    map