Workshop 5


Architectural Benchmarking of Compute-in-Memory Systems

Organizer: Pritish Narayanan (IBM Research)

Deep Neural Networks (DNNs) have demonstrated unparalleled capabilities in recent years across applications such as image processing, natural language understanding, and content generation. As DNNs have evolved over time, from convolutional and recurrent neural networks to transformers and beyond, precision requirements, performance bottlenecks, and hardware design considerations have changed along with them. While Compute-in-Memory (CIM) is a promising approach for accelerating the workhorse multiply-accumulate operations of DNNs, architecting future DNN systems goes well beyond CIM tile design, as macro-level efficiency does not necessarily translate into system-level efficiency. Amdahl's law cannot be ignored: as the MAC operations are accelerated, auxiliary operations such as attention and LayerNorm become more important and can nullify tile efficiency gains. The von Neumann bottleneck could make this worse, as increasing DNN model sizes may preclude full weight stationarity and force weight movement. In this workshop, we focus on application benchmarking for CIM systems, translating application requirements into circuit, device, architecture, and manufacturing requirements. Topics of interest include pipeline design and scheduling for CIM, data-transport topologies, architectural tools, and 3D approaches to address weight capacity requirements.
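
To make the Amdahl's law point concrete, here is a minimal sketch (with purely illustrative numbers, not figures from any of the talks) of how end-to-end speedup saturates when CIM accelerates only the MAC fraction of a workload:

```python
# First-order Amdahl's law model: CIM accelerates only the MAC fraction
# of total runtime; auxiliary ops (attention softmax, LayerNorm, ...) are
# untouched. All numbers below are illustrative assumptions.

def system_speedup(mac_fraction: float, mac_speedup: float) -> float:
    """End-to-end speedup when only the MAC fraction is accelerated."""
    return 1.0 / ((1.0 - mac_fraction) + mac_fraction / mac_speedup)

# Even with a 50x faster CIM tile, a workload that is 90% MACs
# gains only ~8.5x overall; the hard ceiling is 1 / 0.10 = 10x.
print(system_speedup(mac_fraction=0.90, mac_speedup=50.0))  # ~8.47
print(system_speedup(mac_fraction=0.90, mac_speedup=1e9))   # approaches 10.0
```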


About the Organizer

Dr. Pritish Narayanan is a Principal Research Scientist at IBM Research, working on hardware acceleration of AI using compute-in-memory. He has been the design lead for many accelerator prototype tapeouts and has an extensive publication record in top journals and conferences, including Nature, VLSI, IEDM, and ICML. He has given several keynote, invited, and tutorial talks at venues such as CICC, COOLCHIPS, and IMW; is an IEEE Senior Member; and until recently was an Associate Editor for TED.

1. GainSight: Fine-Grained Memory Access Profiling for GCRAM-Based AI Accelerators, Thierry Tambe, Stanford University
Abstract

The exponentially increasing memory capacity and bandwidth requirements of data-intensive compute workloads, such as transformer-based generative AI models, call for increasing amounts of low-latency, low-cost, and high-density on-chip storage for AI/ML accelerators. Given the challenges in scaling SRAM cells for on-chip memory, we explore alternative on-chip memory devices that provide better scalability and higher density at similar latencies, with a particular focus on gain-cell memories. The major potential drawback of gain-cell RAM (GCRAM), however, is its shorter data retention time and higher refresh cost. To address this, we are on a mission to co-design accelerator hardware and software such that cached data lifetimes align with gain-cell RAM retention times to minimize refreshes. As a starting point in our design space exploration process, we are developing a simulation-based profiler, GainSight, to capture and analyze fine-grained memory access patterns and data lifetimes of various AI benchmarking workloads on cycle-accurate accelerator hardware models, such as GPGPUs and systolic arrays. GainSight offers insights beyond those of traditional coarse-grained software profilers, including application-specific cache lifetime statistics, GCRAM retention requirements, optimal GCRAM topology choices, and more. Ultimately, our work guides the development of GCRAM-based AI accelerator architectures that strategically exploit transitory data to achieve significant improvements in area and energy efficiency when compared to conventional SRAM-based systems.
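
As a rough illustration of the retention/lifetime trade-off that this kind of profiling quantifies, the toy model below (not GainSight's actual interface; the lifetimes and retention value are made up) counts how many refreshes a profiled data lifetime would incur on a given GCRAM cell:

```python
import math

# Toy GCRAM refresh model (illustrative; not GainSight's real model).
# A cached value that stays live longer than the cell's retention time
# must be refreshed periodically until it dies.

def refreshes_needed(lifetime_ns: float, retention_ns: float) -> int:
    """Number of refresh operations for one cached value."""
    if lifetime_ns <= retention_ns:
        return 0  # value dies before the cell leaks: refresh-free
    return math.floor(lifetime_ns / retention_ns)

# Hypothetical profile: data lifetimes (ns) observed for one workload.
lifetimes = [120, 450, 80, 2_000, 35, 900]
retention = 500.0  # assumed GCRAM retention time in ns

total = sum(refreshes_needed(t, retention) for t in lifetimes)
free = sum(t <= retention for t in lifetimes)
print(f"{free}/{len(lifetimes)} values need no refresh; {total} refreshes total")
```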

2. CIM-based processing of DNNs, Xiaoyu Sun, TSMC
Abstract

The Compute-In-Memory (CIM) concept has been extensively studied over the past decade as an energy- and area-efficient solution for the matrix multiplications involved in DNN processing. In this presentation, we will begin with a brief overview of CIM, including the evolution of TSMC's CIM macros at advanced nodes since 2020. We will then focus on CIM in the accelerator context and discuss workload-level latency and energy estimation, layer- and application-specific challenges, and their potential solutions through dataflow, architecture, and technology optimizations.
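
As a hedged sketch of what workload-level estimation can look like (a first-order tiling model with invented macro parameters, not TSMC figures), a layer's latency and energy scale with the number of macro invocations its weight matrix requires:

```python
import math

# First-order CIM workload model (all parameters are illustrative assumptions).
ARRAY_ROWS, ARRAY_COLS = 512, 512  # assumed CIM macro dimensions
T_MACRO_NS = 20.0                  # assumed latency of one macro invocation (ns)
E_MACRO_NJ = 1.5                   # assumed energy of one macro invocation (nJ)

def layer_cost(in_dim: int, out_dim: int, n_vectors: int):
    """Latency (ns) and energy (nJ), assuming tiles are invoked sequentially."""
    tiles = math.ceil(in_dim / ARRAY_ROWS) * math.ceil(out_dim / ARRAY_COLS)
    invocations = tiles * n_vectors
    return invocations * T_MACRO_NS, invocations * E_MACRO_NJ

# Example: a 4096x4096 projection layer over a 128-token sequence.
lat, en = layer_cost(4096, 4096, 128)
print(f"{lat / 1e3:.1f} us, {en / 1e3:.2f} uJ")  # ~163.8 us, ~12.29 uJ
```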

3. Recent Development of NeuroSim Benchmark Framework towards Angstrom Nodes and Heterogeneous 3D Integrated System, Shimeng Yu, Georgia Institute of Technology
Abstract

This presentation will discuss recent progress on the system-technology co-design (STCO) enabling tool "NeuroSim" for memory-centric compute systems, covering the following topics:
1) digital compute-in-memory (DCIM) scaling trends towards the 5 Angstrom node for on-chip AI/ML acceleration;
2) "Active" backside power delivery, where an on-chip voltage converter is integrated at the point-of-load for the 3D stacked memory/logic system;
3) large language model (LLM) acceleration that takes advantage of 3D stackable DRAM on top of logic compute dies with co-packaged optical interconnect.
Besides the power/performance/area (PPA) metrics, additional measures such as heat dissipation and power density are included in the benchmark framework.
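
As a back-of-the-envelope illustration of why power density matters in 3D (assumed numbers, not NeuroSim output), stacked tiers sum their power over a single shared footprint:

```python
# Back-of-the-envelope power density for a 3D stack (illustrative numbers only).
# Stacking dies sums their power over a shared footprint, so W/cm^2 climbs
# with tier count even when each die's own power density is modest.
die_powers_w = [4.0, 6.0, 2.5]  # assumed per-tier power (logic, DCIM, DRAM)
footprint_cm2 = 0.8             # assumed shared die footprint

density = sum(die_powers_w) / footprint_cm2
print(f"{density:.1f} W/cm^2 across {len(die_powers_w)} tiers")  # 15.6 W/cm^2
```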

4. Tile Efficiency is not System Efficiency: CIM architecture studies of LLMs and other large DNNs, Pritish Narayanan, IBM Research Almaden
Abstract

To achieve system-level benefits, compute-in-memory tiles need to be integrated into heterogeneous architectures alongside general and application-specific digital compute cores, together with a high-bandwidth and reconfigurable on-chip routing fabric that can deliver the right vectors to the right locations for just-in-time DNN compute. In the first part of my talk, I will review some of IBM's work in developing weight-stationary analog compute cores, with a focus on the design choices and optimizations for high tile efficiency. I will then provide a brief introduction to heterogeneous architectures for CIM systems, followed by architectural studies of DNNs identifying the auxiliary operations that bottleneck performance. Finally, I will highlight the challenge of achieving true weight-stationarity in large models such as Mixture-of-Experts (MoE) Transformer models, and the system-level benefits such an architecture can achieve.
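
To see why full weight-stationarity becomes difficult for MoE Transformers, consider a rough capacity estimate (hypothetical model and tile parameters, not IBM's design point):

```python
# Rough weight-capacity check for weight-stationary CIM (assumed parameters).
D_MODEL, D_FF = 4096, 16384  # assumed hidden and feed-forward widths
N_LAYERS = 32                # assumed layer count
N_EXPERTS = 8                # assumed MoE experts per feed-forward block

# Per layer: attention projections (4 * d^2) plus E expert FFNs (2 * d * d_ff each).
attn = 4 * D_MODEL * D_MODEL
ffn = N_EXPERTS * 2 * D_MODEL * D_FF
total_weights = N_LAYERS * (attn + ffn)

TILE_WEIGHTS = 512 * 512  # weights held by one assumed 512x512 CIM tile
tiles_needed = -(-total_weights // TILE_WEIGHTS)  # ceiling division
print(f"{total_weights / 1e9:.1f}B weights -> {tiles_needed} tiles to stay stationary")
```

Under these assumptions the model needs roughly 36.5 billion stationary weights, i.e. well over a hundred thousand 512x512 tiles, which is why weight movement becomes hard to avoid at MoE scale.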

5. HISIM: Efficient Design Space Exploration of 2.5D/3D Heterogeneous Integration for AI Computing, Yu Cao, University of Minnesota
Abstract

Monolithic designs face significant challenges in fabrication cost and data movement, especially when executing larger and more complex DNN models. While recent advancements, such as near-memory and in-memory computing (IMC), aim to address these issues, the scaling trend of monolithic design still lags behind the ever-increasing demands of AI algorithms and other data-intensive applications. In this context, technological innovations, particularly 2.5D/3D integration through advanced packaging techniques, are critical to enabling heterogeneous integration (HI) and unlocking significant performance, energy, and cost benefits beyond conventional chip design approaches. Such a paradigm shift requires tight collaboration between packaging and chiplet design throughout the entire design cycle. In this talk, we will introduce HISIM, a new system performance benchmark tool for efficient design space exploration of 2.5D/3D heterogeneous systems for energy-efficient AI computing. HISIM incorporates a suite of analytical performance models for various computing units (e.g., IMCs and systolic arrays), network-on-chip, 2.5D/3D interconnections, and thermal simulations, running 10⁶× faster than state-of-the-art AI benchmark tools. We will demonstrate HISIM on various DNN models, shedding light on the potential and research needs of future chiplet-package co-design.
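
In the spirit of such analytical models (a toy sketch with assumed link parameters, not HISIM's actual API), inter-chiplet transfer cost can be evaluated in closed form instead of by cycle-accurate simulation:

```python
# Toy analytical model of 2.5D chiplet-to-chiplet data movement
# (assumed parameters; HISIM's real models are far more detailed).
BW_GBPS = 256.0       # assumed die-to-die link bandwidth (GB/s)
E_PJ_PER_BIT = 0.5    # assumed interposer transfer energy (pJ/bit)
HOP_NS = 5.0          # assumed per-hop link latency (ns)

def transfer_cost(bytes_moved: int, hops: int):
    """Closed-form latency (ns) and energy (uJ) of one inter-chiplet transfer."""
    latency_ns = hops * HOP_NS + bytes_moved / (BW_GBPS * 1e9) * 1e9
    energy_uj = bytes_moved * 8 * E_PJ_PER_BIT * 1e-6
    return latency_ns, energy_uj

# Example: moving a 2 MB activation tensor across 3 interposer hops.
lat, en = transfer_cost(2 * 1024 * 1024, hops=3)
print(f"{lat:.0f} ns, {en:.2f} uJ")  # ~8207 ns, ~8.39 uJ
```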