Workshop 5


Architectural Benchmarking of Compute-in-Memory Systems

Organizer: Pritish Narayanan (IBM Research)

Deep Neural Networks (DNNs) have demonstrated unparalleled capabilities in recent years across applications such as image processing, natural language understanding, and content generation. As DNNs have evolved over time, from convolutional and recurrent neural networks to transformers and beyond, precision requirements, performance bottlenecks, and hardware design considerations have changed along with them. While Compute-in-Memory (CIM) is a promising approach for accelerating the workhorse multiply-accumulate operations of DNNs, architecting future DNN systems goes well beyond CIM tile design, as macro-level efficiency does not necessarily translate into system-level efficiency. Amdahl's law cannot be ignored: as the MAC operations are accelerated, auxiliary operations such as attention and LayerNorm become more important and can nullify tile efficiency gains. The von Neumann bottleneck could make this worse, as increasing DNN model sizes may preclude full weight stationarity and force weight movement. In this workshop, we focus on application benchmarking for CIM systems, translating application requirements into circuit, device, architecture, and manufacturing requirements. Topics of interest include pipeline design and scheduling for CIM, data-transport topologies, architectural tools, and 3D approaches to address weight capacity requirements.
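
To make the Amdahl's law point concrete, here is a minimal sketch (with purely illustrative numbers, not figures from any of the talks) of how end-to-end speedup saturates when CIM accelerates only the MAC fraction of a workload:

```python
# First-order Amdahl's law model: CIM accelerates only the MAC fraction
# of total runtime; auxiliary ops (attention softmax, LayerNorm, ...) are
# untouched. All numbers below are illustrative assumptions.

def system_speedup(mac_fraction: float, mac_speedup: float) -> float:
    """End-to-end speedup when only the MAC fraction is accelerated."""
    return 1.0 / ((1.0 - mac_fraction) + mac_fraction / mac_speedup)

# Even with a 50x faster CIM tile, a workload that is 90% MACs
# gains only ~8.5x overall; the hard ceiling is 1 / 0.10 = 10x.
print(system_speedup(mac_fraction=0.90, mac_speedup=50.0))  # ~8.47
print(system_speedup(mac_fraction=0.90, mac_speedup=1e9))   # approaches 10.0
```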


About the Organizer

Dr. Pritish Narayanan is a Principal Research Scientist at IBM Research, working on hardware acceleration of AI using compute-in-memory. He has been the design lead for many accelerator prototype tapeouts and has an extensive publication record in top journals and conferences, including Nature, VLSI, IEDM, and ICML. He has given several keynote, invited, and tutorial talks at venues such as CICC, COOLCHIPS, and IMW; is an IEEE Senior Member; and until recently was an Associate Editor for TED.

1. GainSight: Fine-Grained Memory Access Profiling for GCRAM-Based AI Accelerators, Thierry Tambe, Stanford University
Abstract

The exponentially increasing memory capacity and bandwidth requirements of data-intensive compute workloads, such as transformer-based generative AI models, call for increasing amounts of low-latency, low-cost, and high-density on-chip storage for AI/ML accelerators. Given the challenges in scaling SRAM cells for on-chip memory, we explore alternative on-chip memory devices that provide better scalability and higher density at similar latencies, with a particular focus on gain-cell memories. The major potential drawback of gain-cell RAM (GCRAM), however, is its shorter data retention time and higher refresh cost. To address this, we are on a mission to co-design accelerator hardware and software such that cached data lifetimes align with gain-cell RAM retention times to minimize refreshes. As a starting point in our design space exploration process, we are developing a simulation-based profiler, GainSight, to capture and analyze fine-grained memory access patterns and data lifetimes of various AI benchmarking workloads on cycle-accurate accelerator hardware models, such as GPGPUs and systolic arrays. GainSight offers insights beyond those of traditional coarse-grained software profilers, including application-specific cache lifetime statistics, GCRAM retention requirements, optimal GCRAM topology choices, and more. Ultimately, our work guides the development of GCRAM-based AI accelerator architectures that strategically exploit transitory data to achieve significant improvements in area and energy efficiency when compared to conventional SRAM-based systems.
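
As a rough illustration of the retention/lifetime trade-off that this kind of profiling quantifies, the toy model below (not GainSight's actual interface; the lifetimes and retention value are made up) counts how many refreshes a profiled data lifetime would incur on a given GCRAM cell:

```python
import math

# Toy GCRAM refresh model (illustrative; not GainSight's real model).
# A cached value that stays live longer than the cell's retention time
# must be refreshed periodically until it dies.

def refreshes_needed(lifetime_ns: float, retention_ns: float) -> int:
    """Number of refresh operations for one cached value."""
    if lifetime_ns <= retention_ns:
        return 0  # value dies before the cell leaks: refresh-free
    return math.floor(lifetime_ns / retention_ns)

# Hypothetical profile: data lifetimes (ns) observed for one workload.
lifetimes = [120, 450, 80, 2_000, 35, 900]
retention = 500.0  # assumed GCRAM retention time in ns

total = sum(refreshes_needed(t, retention) for t in lifetimes)
free = sum(t <= retention for t in lifetimes)
print(f"{free}/{len(lifetimes)} values need no refresh; {total} refreshes total")
```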

2. CIM-based processing of DNNs, Xiaoyu Sun, TSMC
Abstract

The Compute-In-Memory (CIM) concept has been extensively studied over the past decade as an energy- and area-efficient solution for the matrix multiplications involved in DNN processing. In this presentation, we will begin with a brief overview of CIM, including the evolution of TSMC's CIM macros at advanced nodes since 2020. We will then focus on CIM in the accelerator context and discuss workload-level latency and energy estimation, layer- and application-specific challenges, and their potential solutions through dataflow, architecture, and technology optimizations.
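
As a hedged sketch of what workload-level estimation can look like (a first-order tiling model with invented macro parameters, not TSMC figures), a layer's latency and energy scale with the number of macro invocations its weight matrix requires:

```python
import math

# First-order CIM workload model (all parameters are illustrative assumptions).
ARRAY_ROWS, ARRAY_COLS = 512, 512  # assumed CIM macro dimensions
T_MACRO_NS = 20.0                  # assumed latency of one macro invocation (ns)
E_MACRO_NJ = 1.5                   # assumed energy of one macro invocation (nJ)

def layer_cost(in_dim: int, out_dim: int, n_vectors: int):
    """Latency (ns) and energy (nJ), assuming tiles are invoked sequentially."""
    tiles = math.ceil(in_dim / ARRAY_ROWS) * math.ceil(out_dim / ARRAY_COLS)
    invocations = tiles * n_vectors
    return invocations * T_MACRO_NS, invocations * E_MACRO_NJ

# Example: a 4096x4096 projection layer over a 128-token sequence.
lat, en = layer_cost(4096, 4096, 128)
print(f"{lat / 1e3:.1f} us, {en / 1e3:.2f} uJ")  # ~163.8 us, ~12.29 uJ
```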

3. Recent Development of NeuroSim Benchmark Framework towards Angstrom Nodes and Heterogeneous 3D Integrated System, Shimeng Yu, Georgia Institute of Technology
Abstract

This presentation will discuss recent progress on the system-technology co-design (STCO) enabling tool "NeuroSim" for memory-centric compute systems, covering the following topics:
1) digital compute-in-memory (DCIM) scaling trends towards the 5 Angstrom node for on-chip AI/ML acceleration;
2) "Active" backside power delivery, where an on-chip voltage converter is integrated at the point-of-load for the 3D stacked memory/logic system;
3) large language model (LLM) acceleration that takes advantage of 3D stackable DRAM on top of logic compute dies with co-packaged optical interconnect.
Besides the power/performance/area (PPA) metrics, additional measures such as heat dissipation and power density are included in the benchmark framework.
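
As a back-of-the-envelope illustration of why power density matters in 3D (assumed numbers, not NeuroSim output), stacked tiers sum their power over a single shared footprint:

```python
# Back-of-the-envelope power density for a 3D stack (illustrative numbers only).
# Stacking dies sums their power over a shared footprint, so W/cm^2 climbs
# with tier count even when each die's own power density is modest.
die_powers_w = [4.0, 6.0, 2.5]  # assumed per-tier power (logic, DCIM, DRAM)
footprint_cm2 = 0.8             # assumed shared die footprint

density = sum(die_powers_w) / footprint_cm2
print(f"{density:.1f} W/cm^2 across {len(die_powers_w)} tiers")  # 15.6 W/cm^2
```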

4. Tile Efficiency is not System Efficiency: CIM architecture studies of LLMs and other large DNNs, Pritish Narayanan, IBM Research Almaden
Abstract

To achieve system-level benefits, compute-in-memory tiles need to be integrated into heterogeneous architectures alongside general and application-specific digital compute cores, together with a high-bandwidth and reconfigurable on-chip routing fabric that can deliver the right vectors to the right locations for just-in-time DNN compute. In the first part of my talk, I will review some of IBM's work in developing weight-stationary analog compute cores, with a focus on the design choices and optimizations for high tile efficiency. I will then provide a brief introduction to heterogeneous architectures for CIM systems, followed by architectural studies of DNNs identifying the auxiliary operations that bottleneck performance. Finally, I will highlight the challenge of achieving true weight-stationarity in large models such as Mixture-of-Experts (MoE) Transformer models, and the system-level benefits such an architecture can achieve.
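
To see why full weight-stationarity becomes difficult for MoE Transformers, consider a rough capacity estimate (hypothetical model and tile parameters, not IBM's design point):

```python
# Rough weight-capacity check for weight-stationary CIM (assumed parameters).
D_MODEL, D_FF = 4096, 16384  # assumed hidden and feed-forward widths
N_LAYERS = 32                # assumed layer count
N_EXPERTS = 8                # assumed MoE experts per feed-forward block

# Per layer: attention projections (4 * d^2) plus E expert FFNs (2 * d * d_ff each).
attn = 4 * D_MODEL * D_MODEL
ffn = N_EXPERTS * 2 * D_MODEL * D_FF
total_weights = N_LAYERS * (attn + ffn)

TILE_WEIGHTS = 512 * 512  # weights held by one assumed 512x512 CIM tile
tiles_needed = -(-total_weights // TILE_WEIGHTS)  # ceiling division
print(f"{total_weights / 1e9:.1f}B weights -> {tiles_needed} tiles to stay stationary")
```

Under these assumptions the model needs roughly 36.5 billion stationary weights, i.e. well over a hundred thousand 512x512 tiles, which is why weight movement becomes hard to avoid at MoE scale.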

5. HISIM: Efficient Design Space Exploration of 2.5D/3D Heterogeneous Integration for AI Computing, Yu Cao, University of Minnesota
Abstract

Monolithic designs face significant challenges in fabrication cost and data movement, especially when executing larger and more complex DNN models. While recent advancements, such as near-memory and in-memory computing (IMC), aim to address these issues, the scaling trend of monolithic design still lags behind the ever-increasing demands of AI algorithms and other data-intensive applications. In this context, technological innovations, particularly 2.5D/3D integration through advanced packaging techniques, are critical to enabling heterogeneous integration (HI) and unlocking significant performance, energy, and cost benefits beyond conventional chip design approaches. Such a paradigm shift requires tight collaboration between packaging and chiplet design throughout the entire design cycle. In this talk, we will introduce HISIM, a new system performance benchmark tool for efficient design space exploration of 2.5D/3D heterogeneous systems for energy-efficient AI computing. HISIM incorporates a suite of analytical performance models for various computing units (e.g., IMCs and systolic arrays), network-on-chip, 2.5D/3D interconnections, and thermal simulations, running 10⁶× faster than state-of-the-art AI benchmark tools. We will demonstrate HISIM on various DNN models, shedding light on the potential and research needs of future chiplet-package co-design.
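
In the spirit of such analytical models (a toy sketch with assumed link parameters, not HISIM's actual API), inter-chiplet transfer cost can be evaluated in closed form instead of by cycle-accurate simulation:

```python
# Toy analytical model of 2.5D chiplet-to-chiplet data movement
# (assumed parameters; HISIM's real models are far more detailed).
BW_GBPS = 256.0       # assumed die-to-die link bandwidth (GB/s)
E_PJ_PER_BIT = 0.5    # assumed interposer transfer energy (pJ/bit)
HOP_NS = 5.0          # assumed per-hop link latency (ns)

def transfer_cost(bytes_moved: int, hops: int):
    """Closed-form latency (ns) and energy (uJ) of one inter-chiplet transfer."""
    latency_ns = hops * HOP_NS + bytes_moved / (BW_GBPS * 1e9) * 1e9
    energy_uj = bytes_moved * 8 * E_PJ_PER_BIT * 1e-6
    return latency_ns, energy_uj

# Example: moving a 2 MB activation tensor across 3 interposer hops.
lat, en = transfer_cost(2 * 1024 * 1024, hops=3)
print(f"{lat:.0f} ns, {en:.2f} uJ")  # ~8207 ns, ~8.39 uJ
```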