# Neural Network Accelerator

Custom FPGA-based hardware accelerator for deep learning inference.
## Project Overview
This project presents a custom hardware accelerator for efficient deep neural network inference on FPGAs. The accelerator achieves high throughput and energy efficiency through a systolic array of specialized compute units, an optimized on-chip memory hierarchy, and compiler-driven dataflow optimization.
## Architecture
### Key Components
- **Processing Elements (PEs)**
  - Systolic array architecture
  - 256 MAC units per PE
  - INT8/INT16 precision support
  - Peak performance: 2 TOPS
- **Memory Hierarchy**
  - On-chip SRAM: 4 MB
  - DDR4 interface: 25.6 GB/s (see the bandwidth sketch after this list)
  - Weight compression: 4:1 ratio
  - Activation sparsity exploitation
- **Control Unit**
  - RISC-V based controller
  - DMA engines for data movement
  - Hardware scheduling optimization
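For intuition about why the on-chip SRAM and the 4:1 weight compression matter, here is a rough roofline-style check relating the 2 TOPS peak to the 25.6 GB/s DDR4 bandwidth. It is a back-of-the-envelope sketch that ignores activation traffic and treats every DRAM byte as a compressed weight byte:

```cpp
#include <cstdio>

// Back-of-the-envelope roofline check using the figures listed above.
int main() {
    const double peak_ops_per_s   = 2.0e12;  // 2 TOPS peak (INT8)
    const double dram_bytes_per_s = 25.6e9;  // DDR4 interface bandwidth
    const double compression      = 4.0;     // 4:1 weight compression

    // Ops the array must perform per DRAM byte to stay compute-bound.
    const double ops_per_dram_byte = peak_ops_per_s / dram_bytes_per_s;  // ~78

    // Compression delivers 4 effective weight bytes per DRAM byte,
    // so the required reuse per decompressed weight drops accordingly.
    const double ops_per_weight_byte = ops_per_dram_byte / compression;  // ~20

    std::printf("reuse needed: %.0f ops/DRAM byte, %.0f ops/weight byte\n",
                ops_per_dram_byte, ops_per_weight_byte);
}
```

Layers with less data reuse than this are bandwidth-bound, which is what the 4 MB of on-chip buffering and activation sparsity exploitation are there to mitigate.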
## Supported Operations

### Core Layers

- Convolution (including depthwise/pointwise)
- Fully connected layers
- Pooling (max, average, global)
- Batch normalization (fused into the preceding convolution; see the sketch after this list)
- Activation functions (ReLU, Sigmoid, Tanh)
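Fused batch normalization means the compiler folds the BN scale and shift into the preceding convolution's weights and bias ahead of time, so no separate normalization pass runs on the hardware. A minimal sketch of the folding algebra (function and variable names are illustrative, not the toolchain's actual data structures):

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Fold y = gamma * (conv(x) - mean) / sqrt(var + eps) + beta
// into the conv weights and bias, one scale factor per output channel.
void fold_batchnorm(std::vector<std::vector<float>>& weights, // [out_ch][taps]
                    std::vector<float>& bias,                 // [out_ch]
                    const std::vector<float>& gamma,
                    const std::vector<float>& beta,
                    const std::vector<float>& mean,
                    const std::vector<float>& var,
                    float eps = 1e-5f) {
    for (std::size_t oc = 0; oc < weights.size(); ++oc) {
        float scale = gamma[oc] / std::sqrt(var[oc] + eps);
        for (float& w : weights[oc]) w *= scale;              // scale every tap
        bias[oc] = scale * (bias[oc] - mean[oc]) + beta[oc];  // fold shift into bias
    }
}
```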
### Advanced Features
- Skip connections
- Element-wise operations
- Dynamic shape support
- Multi-branch networks
## Implementation Details

### Convolution Engine
```verilog
module conv_engine #(
    parameter PE_ROWS    = 16,
    parameter PE_COLS    = 16,
    parameter DATA_WIDTH = 8
)(
    input                     clk,
    input                     rst_n,
    input  [DATA_WIDTH-1:0]   input_data,
    input  [DATA_WIDTH-1:0]   weight_data,
    output [DATA_WIDTH*2-1:0] output_data
);
    // Operand distribution and partial-sum wiring for the systolic array.
    // (The staging logic that drives these wires from input_data/weight_data
    // and collects output_data is omitted here.)
    wire [DATA_WIDTH-1:0]   row_data    [0:PE_ROWS-1];
    wire [DATA_WIDTH-1:0]   col_data    [0:PE_COLS-1];
    wire [DATA_WIDTH*2-1:0] partial_sum [0:PE_ROWS-1][0:PE_COLS];

    // Systolic array for matrix multiplication: partial sums propagate
    // left-to-right through each row of PEs.
    genvar i, j;
    generate
        for (i = 0; i < PE_ROWS; i = i + 1) begin : row
            for (j = 0; j < PE_COLS; j = j + 1) begin : col
                processing_element #(
                    .DATA_WIDTH(DATA_WIDTH)
                ) pe (
                    .clk   (clk),
                    .rst_n (rst_n),
                    .a_in  (row_data[i]),
                    .b_in  (col_data[j]),
                    .c_in  (partial_sum[i][j]),
                    .c_out (partial_sum[i][j+1])
                );
            end
        end
    endgenerate
endmodule
```
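Since the array's behavior follows directly from this wiring, a small software golden model is useful for checking the RTL in simulation. The sketch below is an illustrative reference, not part of the repository: it mirrors one combinational pass through the grid, assumes the leftmost partial sums enter as zero, and widens the accumulator to 32 bits for clarity (the RTL ports are DATA_WIDTH*2 bits):

```cpp
#include <array>
#include <cstdint>

constexpr int PE_ROWS = 16;
constexpr int PE_COLS = 16;

// One pass through the array, matching the Verilog wiring: each PE adds
// a_in * b_in onto the partial sum arriving from its left neighbour, so
// column PE_COLS holds the row's accumulated dot contribution.
std::array<int32_t, PE_ROWS> array_pass(
        const std::array<int8_t, PE_ROWS>& row_data,
        const std::array<int8_t, PE_COLS>& col_data) {
    std::array<int32_t, PE_ROWS> out{};
    for (int i = 0; i < PE_ROWS; ++i) {
        int32_t acc = 0;  // partial_sum[i][0], assumed zero
        for (int j = 0; j < PE_COLS; ++j)
            acc += int32_t(row_data[i]) * col_data[j];  // PE(i,j): c_out = c_in + a*b
        out[i] = acc;     // partial_sum[i][PE_COLS]
    }
    return out;
}
```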
### Dataflow Optimization
```python
# Compiler pass for layer fusion
def optimize_graph(model):
    """Fuse adjacent operations to minimize off-chip memory transfers."""
    layers = model.layers
    optimized = []
    i = 0
    while i < len(layers):
        # Fuse with the successor where possible (e.g. conv + batchnorm + ReLU).
        if i + 1 < len(layers) and can_fuse(layers[i], layers[i + 1]):
            optimized.append(fuse_layers(layers[i], layers[i + 1]))
            i += 2  # both layers consumed
        else:
            optimized.append(layers[i])
            i += 1
    return optimized
```
## Performance Results

### Benchmark Networks
| Network      | Throughput (FPS) | Latency (ms) | Power (W) | Efficiency (GOPS/W) |
|--------------|------------------|--------------|-----------|---------------------|
| ResNet-50    | 312              | 3.2          | 8.5       | 235                 |
| MobileNet-V2 | 1840             | 0.54         | 4.2       | 380                 |
| YOLO-V3 Tiny | 125              | 8.0          | 12.3      | 162                 |
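One figure worth deriving from the table: power divided by throughput gives energy per inference, about 27 mJ/frame for ResNet-50 and under 3 mJ/frame for MobileNet-V2:

```cpp
#include <cstdio>

// Energy per inference = power / throughput, using the table above.
int main() {
    struct { const char* net; double fps, watts; } rows[] = {
        {"ResNet-50",    312.0,  8.5},
        {"MobileNet-V2", 1840.0, 4.2},
        {"YOLO-V3 Tiny", 125.0,  12.3},
    };
    for (const auto& r : rows)
        std::printf("%-13s %.1f mJ/frame\n", r.net, r.watts / r.fps * 1e3);
}
```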
## Software Stack

### Compiler Toolchain
Model compilation workflow:

```
┌─────────────────┐     ┌─────────────────┐     ┌─────────────────┐
│   ONNX Model    │ --> │  Quantization   │ --> │ Graph Optimizer │
└─────────────────┘     └─────────────────┘     └─────────────────┘
                                                         │
                                                         v
┌─────────────────┐     ┌─────────────────┐     ┌─────────────────┐
│ FPGA Bitstream  │ <-- │ Place & Route   │ <-- │ HLS Generation  │
└─────────────────┘     └─────────────────┘     └─────────────────┘
```
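The quantization stage maps FP32 tensors to the INT8 values the PEs consume. Below is a minimal sketch of symmetric per-tensor quantization, one common scheme; the names are illustrative and the toolchain's actual calibration pass is not shown:

```cpp
#include <algorithm>
#include <cmath>
#include <cstdint>
#include <vector>

// Symmetric per-tensor INT8 quantization: q = round(x / scale),
// with scale chosen so the largest |x| maps to 127.
struct Quantized {
    std::vector<int8_t> data;
    float scale;  // dequantize with x ≈ q * scale
};

Quantized quantize_int8(const std::vector<float>& x) {
    float max_abs = 0.0f;
    for (float v : x) max_abs = std::max(max_abs, std::fabs(v));
    float scale = (max_abs > 0.0f) ? max_abs / 127.0f : 1.0f;

    Quantized q;
    q.scale = scale;
    q.data.reserve(x.size());
    for (float v : x) {
        long r = std::lround(v / scale);         // round to nearest
        r = std::min(127L, std::max(-127L, r));  // clamp to the symmetric range
        q.data.push_back(static_cast<int8_t>(r));
    }
    return q;
}
```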
### Runtime API
```cpp
// C++ API for inference
#include <functional>
#include <string>
#include <vector>

class NNAccelerator {
public:
    // Load a compiled model (bitstream + weights) onto the device.
    void load_model(const std::string& model_path);

    // Synchronous inference.
    Tensor infer(const Tensor& input);

    // Asynchronous inference with completion callback.
    void infer_async(const Tensor& input,
                     std::function<void(Tensor)> callback);

    // Batch inference for higher throughput.
    std::vector<Tensor> infer_batch(const std::vector<Tensor>& inputs);

    // Performance profiling.
    ProfileData get_profile_data();
};
```
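A typical synchronous use of this API might look as follows. The model path is a placeholder, and the `Tensor` construction is SDK-specific; only the methods declared above are assumed:

```cpp
int main() {
    NNAccelerator accel;
    accel.load_model("models/resnet50_int8.bin");  // hypothetical compiled-model path

    Tensor input;                     // populate with a preprocessed frame (SDK-specific)
    Tensor out = accel.infer(input);  // blocks until the FPGA finishes

    // Queue the next frame without blocking the host thread.
    accel.infer_async(input, [](Tensor result) {
        // consume `result` here, e.g. decode detections
        (void)result;
    });
    return 0;
}
```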
## Applications

### Computer Vision
- Real-time object detection
- Image segmentation
- Face recognition
- Video analytics
### Edge AI
- Autonomous vehicles
- Drone navigation
- Smart cameras
- Industrial inspection
### Healthcare
- Medical image analysis
- Real-time patient monitoring
- Diagnostic assistance
## Resource Utilization

Target device: Xilinx ZCU104 (Zynq UltraScale+ MPSoC)
| Resource     | Used    | Available | Utilization |
|--------------|---------|-----------|-------------|
| LUTs         | 168,432 | 230,400   | 73%         |
| FFs          | 201,984 | 460,800   | 44%         |
| BRAM (36 Kb) | 280     | 312       | 90%         |
| DSP48        | 1,248   | 1,728     | 72%         |
## Future Enhancements

- **Sparse Computing**: Hardware support for sparse matrices
- **Mixed Precision**: FP16 and INT4 support
- **Transformer Support**: Specialized attention mechanisms
- **Multi-FPGA Scaling**: Distributed inference
## Publications
- “Energy-Efficient Neural Network Accelerator with Systolic Array Architecture” - Submitted to FPGA 2024
- “Optimizing Memory Access Patterns in FPGA-based DNN Accelerators” - Workshop Paper, ISCA 2023
## Open Source Release

The project will be open-sourced after publication. Stay tuned! The planned release includes:
- HDL sources
- Compiler toolchain
- Pre-trained models
- Benchmark suite
## Acknowledgments
This work was inspired by various academic and industry accelerators including Google’s TPU, NVIDIA’s DLA, and Microsoft’s Brainwave.