At the 2020 Symposia on VLSI Technology and Circuits this week, Intel is presenting a body of research on the computing transformation being driven by data that is increasingly distributed across the edge, the core, and endpoints. Several of the studies Intel will present explore techniques for higher-level intelligence and energy-efficient performance, both at the edge and across network and cloud systems.
Digital binary AI accelerator
In power- and resource-constrained edge devices, where low-precision outputs are acceptable for some applications, analog binary neural networks (BNNs) are coming into use as an alternative to higher-precision, more computationally demanding and memory-intensive AI algorithms. However, analog BNNs tend to have lower prediction accuracy, as they’re less tolerant of variability and noise. An Intel-authored paper describes a potential solution in a digital BNN chip, a 10-nanometer chip that implements 1-bit activation functions and weights. (Activation functions define the outputs of nodes, the building blocks of machine learning algorithms, given an input or a set of inputs. Weights are the variables that transform inputs within an algorithm’s layers of nodes.) The researchers claim their all-digital approach enables the chip to achieve 617 TOPS/W, where TOPS (tera operations per second) is a measure of throughput and TOPS/W is throughput per watt, a measure of energy efficiency. Compared with previous digital implementations, this approach achieves a claimed 2.8 to 135 times higher compute density and 2.7 times higher energy efficiency.
The paper’s coauthors assert the chip has additional advantages over analog approaches, as it doesn’t increase transistor count or introduce capacitors, which would decrease the compute area efficiency. Instead, it packs 161KB of memory and memory execution units (MEUs) that support things like output activation and comparison operations. A centralized controller orchestrates the flow of data from four memory banks that store inputs, outputs, and weights. It’s connected to eight MEU arrays (which have 16 MEUs each) to complete the overall design.
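To make the idea of binary weights and activations concrete, here is a minimal sketch of the XNOR-and-popcount formulation commonly used for BNNs, followed by a threshold comparison. It is an illustration under assumptions, not Intel's circuit: the function names, the bit-packing layout, and the threshold value are all hypothetical.

```python
def pack_bits(values):
    """Pack a list of +1/-1 values into an int; bit i is 1 when values[i] == +1."""
    word = 0
    for i, v in enumerate(values):
        if v == 1:
            word |= 1 << i
    return word


def binary_dot(x_bits, w_bits, n):
    """Dot product of two {-1, +1} vectors of length n stored as bit masks."""
    mask = (1 << n) - 1
    agree = ~(x_bits ^ w_bits) & mask   # XNOR: 1 wherever the two signs match
    matches = bin(agree).count("1")     # popcount of the matching positions
    # Each match contributes +1 and each mismatch -1.
    return 2 * matches - n


def binary_neuron(x_bits, w_bits, n, threshold=0):
    """Binary activation: compare the dot product against a threshold (assumed 0)."""
    return 1 if binary_dot(x_bits, w_bits, n) >= threshold else -1
```

For example, `binary_dot(pack_bits([1, -1, 1, 1]), pack_bits([1, 1, -1, 1]), 4)` evaluates to 0, the same as the ordinary dot product of the two vectors. Because the whole operation reduces to bitwise logic and counting, it maps naturally onto digital hardware without multipliers.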
Ray casting accelerator
In simultaneous localization and mapping, or SLAM, accelerator chips are often used for both odometry (the use of data from motion sensors to estimate change in position over time) and path planning. But these chips typically struggle with “dense” SLAM tasks, like accurate surface estimation and 3D scene reconstruction, that entail processing enormous volumes of real-time visual data. Ray casting is a popular technique for executing dense SLAM on low-power edge processing systems-on-chip: a ray is cast for each pixel of the current frame, and the surrounding 3D map is sampled at every step until the ray first intersects a solid object. Ray casting requires reading at least eight surrounding voxels (objects representing values on a grid in 3D space) for each sampling point of the ray. Fortunately, rays in close spatial proximity pass through overlapping voxel regions, creating opportunities to optimize a chip’s memory access and data movements.
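The stepping procedure described above can be sketched in a few lines. This is a simplified illustration, not the method from the Intel paper: it assumes a cubic grid indexed as `grid[x][y][z]`, assumes the eight surrounding voxels are combined by trilinear interpolation, and uses a hypothetical fixed step size and hit threshold.

```python
import math


def trilinear(grid, x, y, z):
    """Interpolate the grid at a continuous point from its eight surrounding voxels."""
    x0, y0, z0 = math.floor(x), math.floor(y), math.floor(z)
    fx, fy, fz = x - x0, y - y0, z - z0
    value = 0.0
    for dx in (0, 1):
        for dy in (0, 1):
            for dz in (0, 1):
                weight = ((fx if dx else 1 - fx) *
                          (fy if dy else 1 - fy) *
                          (fz if dz else 1 - fz))
                value += weight * grid[x0 + dx][y0 + dy][z0 + dz]
    return value


def cast_ray(grid, origin, direction, step=0.25, max_steps=200, threshold=0.5):
    """March along a ray; return (hit_point, steps_taken) or None if nothing is hit."""
    size = len(grid)  # assumes a cubic grid of size x size x size
    norm = math.sqrt(sum(c * c for c in direction))
    unit = [c / norm for c in direction]
    point = list(origin)
    for i in range(max_steps):
        # All eight neighbours must exist, so stop one voxel short of the edge.
        if not all(0 <= c < size - 1 for c in point):
            return None
        if trilinear(grid, *point) >= threshold:
            return tuple(point), i
        point = [c + step * u for c, u in zip(point, unit)]
    return None
```

Every sampling point reads eight voxels, so two rays that start one pixel apart will fetch many of the same voxels as they advance. Processing such rays together is what lets those reads be shared.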
Taking advantage of this, an Intel paper proposes what the coauthors call a “ray casting accelerator,” a 10-nanometer complementary metal-oxide-semiconductor (CMOS) chip that casts multiple rays in spatial proximity to exploit the locality of voxels. The researchers report that it demonstrates 320 x 240-pixel ray casting, with an average latency of 23.2 milliseconds per frame, while achieving maximum energy efficiency of 115.3 giga ray-steps (a giga ray-step being 1 billion steps taken along cast rays) per watt.
Event-driven visual data processor
Real-time AI-based visual analytics require not only quick object detection from multiple video streams, but also substantial compute cycles and hardware memory bandwidth. Exacerbating the challenge, frames from the cameras capturing the data are typically downsampled to minimize that load, which degrades accuracy. Intel researchers propose an event-driven visual data processing unit (EPU) to address this issue. In conjunction with novel algorithms, it can instruct AI accelerator chips to process only the parts of the visual input that fall within motion-based “regions of interest.” The EPU pipeline supports Full HD video at 70 frames per second and workloads like event detection and event clustering, improving the end-to-end energy efficiency of AI-based vision hardware by 5 times while boosting throughput by 4.3 times.
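One simple way to derive a motion-based region of interest is to compare pixel intensities between consecutive frames and draw a box around the pixels that changed. The sketch below illustrates that general idea only; it is not the EPU's algorithm, and the function names, the intensity threshold, and the margin are hypothetical. Frames are assumed to be lists of rows of grayscale values.

```python
def detect_events(prev_frame, curr_frame, threshold=16):
    """Return (row, col) for every pixel whose intensity changed by more than threshold."""
    events = []
    for r, (prev_row, curr_row) in enumerate(zip(prev_frame, curr_frame)):
        for c, (before, after) in enumerate(zip(prev_row, curr_row)):
            if abs(after - before) > threshold:
                events.append((r, c))
    return events


def region_of_interest(events, height, width, margin=1):
    """Bounding box (top, left, bottom, right), inclusive, padded and clamped to the frame."""
    if not events:
        return None  # no motion: the downstream accelerator can skip this frame
    rows = [r for r, _ in events]
    cols = [c for _, c in events]
    top = max(min(rows) - margin, 0)
    left = max(min(cols) - margin, 0)
    bottom = min(max(rows) + margin, height - 1)
    right = min(max(cols) + margin, width - 1)
    return top, left, bottom, right
```

Only the pixels inside the returned box would need to be passed to a detector, which is how this kind of scheme can avoid downsampling the whole frame.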
Every EPU clock cycle, an algorithm called an Eventifier running on the EPU compares the intensities of batches of 16 pixels from current and previous frames, and a separate module clusters events correlated in space and time. A Convolver algorithm skips regions of the frame that don’t contain motion, and it can analyze indoor, outdoor, daytime, and nighttime footage.
“The sheer volume of data flowing across distributed edge, network, and cloud infrastructure demands energy-efficient, powerful processing to happen close to where the data is generated but is often limited by bandwidth, memory, and power resources,” Intel fellow and director of circuit technology research Vivek K. De said in a statement. “The research Intel is showcasing at the VLSI Symposia highlights several novel approaches to more efficient computation that show promise for a range of applications, from robotics and augmented reality to machine vision and video analytics. This body of research is focused on addressing barriers to the movement and computation of data, which represent the biggest data challenges of the future.”