A 28-nm 64-kb 31.6-TFLOPS/W Digital-Domain Floating-Point-Computing-Unit and Double-Bit 6T-SRAM Computing-in-Memory Macro for Floating-Point CNNs, IEEE Journal of Solid-State Circuits



A 28-nm 64-kb 31.6-TFLOPS/W Digital-Domain Floating-Point-Computing-Unit and Double-Bit 6T-SRAM Computing-in-Memory Macro for Floating-Point CNNs
IEEE Journal of Solid-State Circuits (IF 4.6), Pub Date: 2024-03-25, DOI: 10.1109/jssc.2024.3375359
An Guo, Xi Chen, Fangyuan Dong, Xingyu Pu, Dongqi Li, Jingmin Zhang, Xueshan Dong, Hui Gao, Yiran Zhang, Bo Wang, Jun Yang, Xin Si


With the rapid advancement of artificial intelligence (AI), computing-in-memory (CIM) structures have been proposed to improve energy efficiency (EF). However, previous CIM designs often rely on INT8 data types, which pose challenges for more complex networks, larger datasets, and increasingly intricate tasks. This work presents a double-bit 6T static random-access memory (SRAM)-based floating-point CIM macro using: 1) a cell array with double-bitcells (DBcells) and floating-point computing units (FCUs) to improve throughput without sacrificing inference accuracy; 2) an FCU with a high-bit full-precision multiply cell (HFMC) and a low-bit approximate-calculation multiply cell (LAMC) to reduce internal bandwidth and area cost; 3) a CIM macro architecture with FP processing circuits to support both floating-point multiply-and-accumulate (FP-MAC) and integer multiply-and-accumulate (INT-MAC) operations; 4) a new ShareFloatv2 data type to map floating-point values onto the CIM array; and 5) a lookup table (LUT)-based TensorFlow training method to improve inference accuracy. A fabricated 28-nm 64-kb digital-domain SRAM-CIM macro achieved the best EF (31.6 TFLOPS/W) and the highest area efficiency (2.05 TFLOPS/mm²) for FP-MAC with Brain Float16 (BF16) IN/W/OUT on three AI tasks: classification@CIFAR100, detection@COCO, and segmentation@VOC2012.
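As a rough illustration of why a shared-exponent mapping lets floating-point MACs run on integer hardware, the sketch below rounds values to BF16 (1 sign bit, 8 exponent bits, 7 mantissa bits), aligns every significand in a vector to that vector's largest exponent, and accumulates plain integer products. The abstract does not specify how ShareFloatv2, the HFMC, or the LAMC work, so this is a generic block-floating-point sketch under that assumption and not the paper's scheme; all function names are hypothetical, and NaN, infinity, and subnormal handling are omitted.

```python
import struct

def float_to_bf16_bits(x: float) -> int:
    """Round a float to BF16 (round-to-nearest-even) and return the 16-bit pattern."""
    bits32 = struct.unpack("<I", struct.pack("<f", x))[0]
    # BF16 is the upper half of an FP32 word; add a bias so truncation rounds.
    rounding_bias = 0x7FFF + ((bits32 >> 16) & 1)
    return ((bits32 + rounding_bias) >> 16) & 0xFFFF

def bf16_bits_to_float(b: int) -> float:
    """Expand a BF16 bit pattern back to a Python float."""
    return struct.unpack("<f", struct.pack("<I", (b & 0xFFFF) << 16))[0]

def bf16_fields(b: int):
    """Split a BF16 pattern into (sign, biased exponent, 7-bit mantissa)."""
    return (b >> 15) & 1, (b >> 7) & 0xFF, b & 0x7F

def aligned_significand(b: int, shared_exp: int) -> int:
    """Signed integer significand shifted to a shared exponent (hypothetical helper)."""
    sign, exp, man = bf16_fields(b)
    if exp == 0:
        return 0  # zeros and subnormals are flushed to zero in this sketch
    sig = (1 << 7) | man           # restore the hidden leading 1
    sig >>= (shared_exp - exp)     # truncating shift: small values lose low bits
    return -sig if sign else sig

def shared_exponent_dot(xs, ws) -> float:
    """Dot product using one shared exponent per vector and integer MACs."""
    xb = [float_to_bf16_bits(x) for x in xs]
    wb = [float_to_bf16_bits(w) for w in ws]
    ex = max(bf16_fields(b)[1] for b in xb)
    ew = max(bf16_fields(b)[1] for b in wb)
    acc = sum(aligned_significand(a, ex) * aligned_significand(b, ew)
              for a, b in zip(xb, wb))
    # Each aligned significand carries a scale of 2^(shared_exp - 127 - 7).
    return acc * 2.0 ** ((ex - 127 - 7) + (ew - 127 - 7))

if __name__ == "__main__":
    print(shared_exponent_dot([1.0, 2.0, -0.5], [0.5, 0.25, 4.0]))
```

The truncating shift in `aligned_significand` is where accuracy can be lost: elements much smaller than the largest one in the vector keep fewer significant bits, which is the kind of error an accuracy-aware training method would need to compensate for.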


Updated: 2024-03-25
