Improving the usability and scalability of FINN, a DNN compiler for FPGAs

2021
FINN is a framework developed by Xilinx Research Labs that compiles Deep Neural Network software descriptions into fast and scalable dataflow architectures for inference acceleration on FPGAs. The dataflow architectures are network dependent, sized according to user-defined throughput requirements, and constrained by the resources available on the user-specified FPGA board. Synthesising large neural network designs with a high degree of configurability leads to long build times, spanning hours to days for an entire network. Thus, the first objective of this thesis is to explore and propose a modified FINN accelerator construction methodology that can substantially reduce build times. The main idea behind our proposal is to reduce the granularity of the architecture, which shrinks individual synthesis jobs and enables logic reuse within and across neural network layers. Using this method, up to 12× speedup in High-Level Synthesis times and up to 2× speedup in end-to-end build times of accelerator networks are achieved.
The second limitation that this work addresses relates to the performance scalability of FINN-generated architectures. FINN currently provides two modes of parallelism for scaling the performance of convolution operations. The first factor, which modifies the number of Processing Elements (PEs), parallelises along the output channels of a convolutional layer, and the second factor, which modifies the number of Single Instruction Multiple Data (SIMD) lanes in each PE, parallelises along the input channels of the convolution. Computations are currently not parallelisable across the spatial dimensions of feature maps, i.e., their height and width. This limitation can restrict the achievable performance for networks that contain layers with large spatial dimensions and a shallow channel depth.
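The effect of the two folding factors can be sketched with some back-of-the-envelope arithmetic. The function below is illustrative only (the names and formula are assumptions modelled on FINN-style matrix-vector folding, not the actual FINN API): each output pixel requires a matrix-vector product that is folded over the PE and SIMD factors, so per-layer latency scales inversely with both.

```python
# Hypothetical sketch of FINN-style folding arithmetic; names and
# formula are illustrative, not the real FINN implementation.

def layer_cycles(out_h, out_w, in_ch, out_ch, k, pe, simd):
    """Approximate cycle count for one convolutional layer.

    Each of the out_h * out_w output pixels needs a matrix-vector
    product of shape (out_ch) x (k*k*in_ch), folded over the PE
    (output-channel) and SIMD (input-channel) parallelism factors.
    """
    assert in_ch % simd == 0 and out_ch % pe == 0
    folds_per_pixel = (k * k * in_ch // simd) * (out_ch // pe)
    return out_h * out_w * folds_per_pixel

# Doubling either factor halves the cycle count for this layer:
base = layer_cycles(32, 32, 64, 64, 3, pe=4, simd=4)   # 2,359,296 cycles
fast = layer_cycles(32, 32, 64, 64, 3, pe=8, simd=4)   # 1,179,648 cycles
assert fast * 2 == base
```

Note that neither factor touches the `out_h * out_w` term, which is exactly the spatial limitation described above.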
The second part of this work leverages the fine-grained construction methodology to augment FINN performance scaling. The proposed approach introduces a generic FINN modification that enables pixel-level parallelism, i.e., multiple output pixels of a convolutional layer can be processed simultaneously by performing Multiple Matrix Vector (MMV) multiplications at the same time. Using this generic method, MMV pixels can be processed simultaneously, yielding an MMV-fold throughput increase at the cost of less than MMV× additional resources.
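The MMV scaling claim can likewise be sketched numerically. This is illustrative arithmetic under the stated assumption (not FINN code): processing `mmv` output pixels per step divides the pixel loop by `mmv`, so throughput scales roughly MMV-fold, while shared control and weight-storage logic keeps resource growth below MMV×.

```python
# Illustrative MMV throughput arithmetic; function name and model
# are assumptions, not part of FINN.

def cycles_with_mmv(out_pixels, folds_per_pixel, mmv):
    """Cycle count when mmv output pixels are computed in parallel."""
    assert out_pixels % mmv == 0
    return (out_pixels // mmv) * folds_per_pixel

base = cycles_with_mmv(1024, 2304, mmv=1)
par = cycles_with_mmv(1024, 2304, mmv=4)
assert base == 4 * par  # 4x throughput for MMV = 4
```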