Neural Network Accelerator

Custom FPGA-based hardware accelerator for deep learning inference

Project Overview

This project presents a custom hardware accelerator designed specifically for efficient deep neural network inference on FPGAs. The accelerator achieves high throughput and energy efficiency through specialized compute units, optimized memory hierarchies, and intelligent dataflow architectures.

Architecture

System Overview

Key Components

  1. Processing Elements (PEs)
    • Systolic array architecture
    • 256 MAC units per PE
    • INT8/INT16 precision support
    • Peak performance: 2 TOPS (see the throughput sketch after this list)
  2. Memory Hierarchy
    • On-chip SRAM: 4 MB
    • DDR4 interface: 25.6 GB/s
    • Weight compression: 4:1 ratio
    • Activation sparsity exploitation
  3. Control Unit
    • RISC-V based controller
    • DMA engines for data movement
    • Hardware scheduling optimization
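
The peak-throughput figure above follows directly from the number of MAC units and the clock rate. A back-of-the-envelope Python sketch of that relationship, where the total MAC count (4,096) and the 250 MHz clock are illustrative assumptions rather than numbers taken from the RTL:

# Rough throughput/roofline model; MAC count and clock frequency are assumed values
def peak_tops(num_macs, f_clk_hz):
    """Peak throughput in TOPS: each MAC contributes 2 ops (multiply + add) per cycle."""
    return num_macs * 2 * f_clk_hz / 1e12

# Example: 4,096 INT8 MACs at an assumed 250 MHz gives ~2.05 TOPS
print(peak_tops(4096, 250e6))

# A layer is bandwidth-bound when its arithmetic intensity (ops per DRAM byte)
# falls below peak throughput divided by the 25.6 GB/s DDR4 bandwidth
print(peak_tops(4096, 250e6) * 1e12 / 25.6e9)   # ~80 ops/byte break-even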

Supported Operations

Core Layers

  • Convolution (including depthwise/pointwise)
  • Fully connected layers
  • Pooling (max, average, global)
  • Batch normalization (fused into the preceding convolution; see the folding sketch after this list)
  • Activation functions (ReLU, Sigmoid, Tanh)
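
The fused batch normalization listed above typically means folding the BN scale and shift into the preceding convolution's weights and bias at compile time, so no separate normalization pass runs on the hardware. A minimal NumPy sketch of that folding; the function and parameter names are illustrative, not the toolchain's actual API:

import numpy as np

def fold_batchnorm(conv_w, conv_b, gamma, beta, mean, var, eps=1e-5):
    """Fold a BatchNorm layer into the preceding convolution.

    conv_w: (C_out, C_in, K, K) weights, conv_b: (C_out,) bias (zeros if absent),
    gamma/beta/mean/var: (C_out,) BatchNorm parameters and running statistics.
    """
    scale = gamma / np.sqrt(var + eps)              # per-output-channel scale
    w_folded = conv_w * scale[:, None, None, None]  # scale each output channel
    b_folded = (conv_b - mean) * scale + beta       # shift absorbed into the bias
    return w_folded, b_folded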

Advanced Features

  • Skip connections
  • Element-wise operations
  • Dynamic shape support
  • Multi-branch networks

Implementation Details

Convolution Engine

module conv_engine #(
    parameter PE_ROWS = 16,
    parameter PE_COLS = 16,
    parameter DATA_WIDTH = 8
)(
    input clk,
    input rst_n,
    input [DATA_WIDTH-1:0] input_data,
    input [DATA_WIDTH-1:0] weight_data,
    output [DATA_WIDTH*2-1:0] output_data
);

    // Operand buses: activations broadcast along rows, weights along columns.
    // (The logic that loads these from input_data/weight_data and drains
    // output_data is omitted from this excerpt.)
    wire [DATA_WIDTH-1:0] row_data [0:PE_ROWS-1];
    wire [DATA_WIDTH-1:0] col_data [0:PE_COLS-1];
    // Partial sums chain through each row, so one extra column is declared.
    wire [DATA_WIDTH*2-1:0] partial_sum [0:PE_ROWS-1][0:PE_COLS];

    // Systolic array for matrix multiplication
    genvar i, j;
    generate
        for (i = 0; i < PE_ROWS; i = i + 1) begin : row
            for (j = 0; j < PE_COLS; j = j + 1) begin : col
                processing_element #(
                    .DATA_WIDTH(DATA_WIDTH)
                ) pe (
                    .clk(clk),
                    .rst_n(rst_n),
                    .a_in(row_data[i]),
                    .b_in(col_data[j]),
                    .c_in(partial_sum[i][j]),
                    .c_out(partial_sum[i][j+1])
                );
            end
        end
    endgenerate
endmodule
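
The generate block above instantiates the 16x16 array that carries out convolutions as matrix multiplications. One common way to lower a convolution to such a product is im2col; the NumPy sketch below shows that mapping for reference only, since how the compiler actually tiles the resulting matrices onto the array is not shown in this excerpt:

import numpy as np

def im2col_conv2d(x, w, stride=1):
    """Reference lowering of a 2D convolution (valid padding) to a matrix multiply.

    x: input feature map, shape (C_in, H, W)
    w: weights, shape (C_out, C_in, K, K)
    Returns the output feature map, shape (C_out, H_out, W_out).
    """
    c_in, h, width = x.shape
    c_out, _, k, _ = w.shape
    h_out = (h - k) // stride + 1
    w_out = (width - k) // stride + 1

    # Each column holds one receptive field flattened to C_in*K*K values
    cols = np.empty((c_in * k * k, h_out * w_out), dtype=x.dtype)
    for i in range(h_out):
        for j in range(w_out):
            patch = x[:, i*stride:i*stride+k, j*stride:j*stride+k]
            cols[:, i * w_out + j] = patch.reshape(-1)

    # (C_out x C_in*K*K) @ (C_in*K*K x H_out*W_out) -- the product the array computes
    out = w.reshape(c_out, -1) @ cols
    return out.reshape(c_out, h_out, w_out)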

Dataflow Optimization

# Compiler optimization for layer fusion
def optimize_graph(model):
    """Fuse adjacent operations to minimize off-chip memory transfers."""
    optimized = []
    i = 0

    while i < len(model.layers):
        # Fuse the current layer with its successor when the hardware supports it
        if i + 1 < len(model.layers) and can_fuse(model.layers[i], model.layers[i + 1]):
            optimized.append(fuse_layers(model.layers[i], model.layers[i + 1]))
            i += 2  # Skip the layer that was just fused
        else:
            optimized.append(model.layers[i])
            i += 1

    return optimized
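
The can_fuse and fuse_layers helpers are part of the compiler and are not shown in this README. As a purely hypothetical sketch of the kind of check such a pass performs (the op_type and consumers fields are assumed, not the toolchain's real data model):

# Hypothetical fusion predicate; field names are illustrative only
FUSIBLE_PAIRS = {
    ("conv2d", "batch_norm"),
    ("conv2d", "relu"),
    ("batch_norm", "relu"),
    ("fully_connected", "relu"),
}

def can_fuse(producer, consumer):
    """Fuse only when the pair is supported and the producer feeds the consumer exclusively."""
    if (producer.op_type, consumer.op_type) not in FUSIBLE_PAIRS:
        return False
    # If another layer also reads the producer's output, fusing would force recomputation
    return len(producer.consumers) == 1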

Performance Results

Benchmark Networks

Network        FPS    Latency   Power    Efficiency
ResNet-50       312   3.2 ms     8.5 W   235 GOPS/W
MobileNet-V2   1840   0.54 ms    4.2 W   380 GOPS/W
YOLO-V3 Tiny    125   8.0 ms    12.3 W   162 GOPS/W

Comparison with Other Platforms

Software Stack

Compiler Toolchain

# Model compilation workflow
┌─────────────────┐     ┌─────────────────┐     ┌─────────────────┐
│   ONNX Model    │ --> │  Quantization   │ --> │  Graph Optimizer│
└─────────────────┘     └─────────────────┘     └─────────────────┘
                                                          │
                                                          v
┌─────────────────┐     ┌─────────────────┐     ┌─────────────────┐
│  FPGA Bitstream │ <-- │  Place & Route  │ <-- │  HLS Generation │
└─────────────────┘     └─────────────────┘     └─────────────────┘
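
The Quantization stage in the flow above maps FP32 model parameters to the INT8 format the processing elements consume. A minimal sketch of symmetric per-tensor quantization, written independently of the actual toolchain implementation:

import numpy as np

def quantize_int8(x):
    """Symmetric per-tensor quantization: q = round(x / scale) with scale = max|x| / 127."""
    scale = float(np.abs(x).max()) / 127.0
    q = np.clip(np.round(x / scale), -128, 127).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    """Recover an FP32 approximation of the original tensor."""
    return q.astype(np.float32) * scale

# The round-off error of each element is bounded by half the quantization step
w = np.random.randn(64, 64).astype(np.float32)
q, s = quantize_int8(w)
assert np.abs(dequantize(q, s) - w).max() <= s / 2 + 1e-6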

Runtime API

// C++ API for inference
class NNAccelerator {
public:
    // Load compiled model
    void load_model(const std::string& model_path);

    // Synchronous inference
    Tensor infer(const Tensor& input);

    // Asynchronous inference with callback
    void infer_async(const Tensor& input,
                     std::function<void(Tensor)> callback);

    // Batch inference for higher throughput
    std::vector<Tensor> infer_batch(const std::vector<Tensor>& inputs);

    // Performance profiling
    ProfileData get_profile_data();
};

Applications

Computer Vision

  • Real-time object detection
  • Image segmentation
  • Face recognition
  • Video analytics

Edge AI

  • Autonomous vehicles
  • Drone navigation
  • Smart cameras
  • Industrial inspection

Healthcare

  • Medical image analysis
  • Real-time patient monitoring
  • Diagnostic assistance

Resource Utilization

Target Device: Xilinx ZCU104 (Zynq UltraScale+ MPSoC)

Resource   Used      Available   Utilization
LUTs       168,432   230,400     73%
FFs        201,984   460,800     44%
BRAM       280       312         90%
DSP48      1,248     1,728       72%

Future Enhancements

  1. Sparse Computing: Hardware support for sparse matrices
  2. Mixed Precision: FP16 and INT4 support
  3. Transformer Support: Specialized attention mechanisms
  4. Multi-FPGA Scaling: Distributed inference

Publications

  1. “Energy-Efficient Neural Network Accelerator with Systolic Array Architecture” - Submitted to FPGA 2024
  2. “Optimizing Memory Access Patterns in FPGA-based DNN Accelerators” - Workshop Paper, ISCA 2023

Open Source Release

The project will be open-sourced after publication; stay tuned! The release will include:

  • HDL sources
  • Compiler toolchain
  • Pre-trained models
  • Benchmark suite

Acknowledgments

This work was inspired by various academic and industry accelerators including Google’s TPU, NVIDIA’s DLA, and Microsoft’s Brainwave.