Xilinx INT8 optimization delivers the highest performance and energy efficiency for deep learning inference. The Xilinx integrated DSP architecture provides a 1.75x improvement in solution-level performance for INT8 deep learning operations compared with other FPGA DSP architectures. This white paper explores the implementation of INT8 deep learning on Xilinx's DSP48E2 slice and compares it with other FPGAs. With similar resource usage, Xilinx's DSP architecture achieves 1.75x higher peak performance, in operations per second (OPS), on INT8 deep learning tasks. Because deep learning inference can use lower bit precision without compromising accuracy, an efficient INT8 implementation is essential. Xilinx's DSP architecture and libraries are optimized specifically for INT8 deep learning inference. This document explains how to use the DSP48E2 in Xilinx UltraScale and UltraScale+ FPGAs to perform two parallel INT8 multiply-accumulate (MACC) operations that share the same core weights, and why this technique requires one of the multiplier inputs to be at least 24 bits wide. Finally, it uses INT8 optimization as a case study to show how the technique maps onto fundamental neural network operations.

INT8 for Deep Learning

Deep neural networks (DNNs) have transformed machine learning by bringing human-level AI capabilities to a wide range of applications. As more accurate deep learning models are developed, their growing complexity creates challenges in computational demand and memory bandwidth. Energy efficiency has become a key driver of innovations that reduce the computation and memory bandwidth required for deep learning inference, though sometimes at the cost of accuracy or throughput. Reducing these overheads ultimately improves energy efficiency and lowers overall power consumption. Beyond saving power in the computation itself, lower bit-width calculations also reduce the power spent on memory bandwidth, because fewer bits are transferred in each memory transaction.

Research has shown that full-precision floating-point calculations are not required to maintain accuracy [Reference 1][Reference 2][Reference 3]. Many applications, such as image classification, need only INT8 or lower fixed-point precision to reach acceptable inference accuracy [Reference 2][Reference 3]. Table 1 lists fine-tuned networks along with the dynamic fixed-point parameters and outputs of their convolutional and fully connected layers. The numbers in parentheses represent the unadjusted accuracy.

Table 1: CNN Model with Fixed-Point Accuracy

[Image: DSP48E2 slice optimization INT8 deep learning operation analysis]

INT8 Deep Learning on Xilinx DSP Slice

The Xilinx DSP48E2 slice is designed to execute a multiply-accumulate operation efficiently in a single clock cycle, supporting up to an 18x27-bit multiplication and a 48-bit accumulation, as illustrated in Figure 1. Besides operating standalone, multiple DSP slices can be linked together, allowing MACC operations to be executed efficiently on Xilinx devices.

[Image: Figure 1: DSP slice using MACC mode]

Figure 1: DSP Slice Using MACC Mode

The wider 27-bit input is particularly valuable for INT8 calculations. In traditional applications, the pre-adder is typically used to implement (A + B) × C efficiently, but that form of computation is rare in deep learning. Instead, the result of (A + B) × C is separated into A × C and B × C, which are then accumulated in independent data streams, matching the structure of typical deep learning computations.
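To make this separation concrete, the short C sketch below emulates how a single wide multiplication can yield A × C and B × C at once when A and B are packed into one operand and C is shared. The 18-bit low field, the shift amount, the sext18 helper, and the sample values are assumptions chosen for illustration; they mirror the idea rather than the exact DSP48E2 encoding.

```c
#include <stdint.h>
#include <stdio.h>

/* Sign-extend the low 18 bits of a value (hypothetical helper). */
static int32_t sext18(int64_t v)
{
    v &= 0x3FFFF;                   /* keep the 18-bit low field       */
    if (v & 0x20000)                /* if the field's sign bit is set  */
        v -= 0x40000;               /* make the value negative again   */
    return (int32_t)v;
}

int main(void)
{
    /* Two INT8 operands a and b that share the third operand c. */
    int8_t a = -57, b = 93, c = -21;

    /* Pack a into the upper bits and b into the lower bits of one wide
     * operand, as a pre-adder computing (a << 18) + b could do.        */
    int64_t packed = ((int64_t)a << 18) + b;     /* fits in 27 bits */

    /* One wide multiplication, like the slice's 18x27 multiply. */
    int64_t product = packed * c;                /* fits in 45 bits */

    /* Separate the product into the two independent results. */
    int32_t bc = sext18(product);                    /* low field:  b*c */
    int32_t ac = (int32_t)((product - bc) >> 18);    /* high field: a*c */

    printf("a*c = %d (expected %d)\n", ac, a * c);   /* 1197  */
    printf("b*c = %d (expected %d)\n", bc, b * c);   /* -1953 */
    return 0;
}
```

Because each product lands in its own bit field, the two results can be routed to separate accumulation streams, which is what lets one multiplier serve two INT8 MACC operations per cycle.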
For INT8 deep learning operations, the 18x27-bit multiplier is the key resource. At least one of the multiplier inputs must be at least 24 bits wide, and the carry accumulator must be 32 bits wide, to perform two INT8 MACC operations simultaneously on a single DSP slice. Combining the 27-bit input with the 48-bit accumulator improves deep learning solution performance by 1.75x (an INT8 MACC to DSP multiplier ratio of 1.75:1). Other FPGA vendors typically offer only an 18x19-bit multiplier per DSP block, a 1:1 ratio between the DSP multiplier and the INT8 MACC.

Scalable INT8 Optimization

The goal is to find an efficient way to encode the inputs a, b, and c so that the result of one multiplication can easily be separated into a×c and b×c. In lower-precision calculations such as INT8, the upper bits of the inputs (the top 10 bits of the 18-bit port and the top 19 bits of the 27-bit port) are merely padded with zeros or ones and carry only one bit of information. The same is true of the top 29 bits of the final 45-bit product. Another calculation can therefore be carried in the upper 19 bits without affecting the lower 8 bits of the input or the lower 16 bits of its product. In general, two rules must be followed when the unused upper bits are used for an additional calculation:

1. The upper bits must not interfere with the calculation in the lower bits.
2. Any influence of the lower-bit calculation on the upper bits must be detectable and recoverable.
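The C sketch below, a continuation of the earlier one, shows how both rules can play out when several packed products are accumulated, in the spirit of the 48-bit accumulator. It assumes the same hypothetical 18-bit low field and keeps the term count small so the accumulated low-field sum stays within its signed range; the borrow that a negative low-field sum leaves in the upper field is exactly the effect that rule 2 requires to be detectable and recoverable.

```c
#include <stdint.h>
#include <stdio.h>

#define N 4  /* kept small so the accumulated low-field sum stays within 18 signed bits */

/* Sign-extend the low 18 bits of a value (same hypothetical helper as before). */
static int32_t sext18(int64_t v)
{
    v &= 0x3FFFF;
    if (v & 0x20000)
        v -= 0x40000;
    return (int32_t)v;
}

int main(void)
{
    /* Two streams of INT8 operands a[i] and b[i] sharing the operands c[i]. */
    int8_t a[N] = { 12, -97, 45, -128 };
    int8_t b[N] = { -3, 118, -77,   64 };
    int8_t c[N] = { 55, -19, 101,  -42 };

    int64_t acc = 0;               /* plays the role of the 48-bit accumulator */
    int32_t ref_ac = 0, ref_bc = 0;

    for (int i = 0; i < N; ++i) {
        /* Rule 1: the shifted high field is a multiple of 2^18 and cannot
         * disturb the low field of any individual product.               */
        int64_t packed = ((int64_t)a[i] << 18) + b[i];
        acc += packed * c[i];      /* one MACC per term */
        ref_ac += a[i] * c[i];
        ref_bc += b[i] * c[i];
    }

    /* Rule 2: the borrow caused by a negative low-field sum is detected
     * through the field's sign bit and added back when separating.      */
    int32_t sum_bc = sext18(acc);
    int32_t sum_ac = (int32_t)((acc - sum_bc) >> 18);

    printf("sum(a*c) = %d (expected %d)\n", sum_ac, ref_ac);
    printf("sum(b*c) = %d (expected %d)\n", sum_bc, ref_bc);
    return 0;
}
```

In a real design, the number of accumulations that can be performed before the two fields must be separated is limited by the headroom of the low field and the width of the accumulator.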
