A 28-nm Floating-Point Computing-in-Memory Processor Using Intensive-CIM Sparse-Digital Architecture
IEEE Journal of Solid-State Circuits (IF 4.6), Pub Date: 2024-02-16, DOI: 10.1109/jssc.2024.3363871
Shengzhe Yan, Jinshan Yue, Chaojie He, Zi Wang, Zhaori Cong, Yifan He, Mufeng Zhou, Wenyu Sun, Xueqing Li, Chunmeng Dou, Feng Zhang, Huazhong Yang, Yongpan Liu, Ming Liu

Computing-in-memory (CIM) chips have demonstrated promising high energy efficiency for multiply–accumulate (MAC) operations in artificial intelligence (AI) applications. Although integer (INT) CIM chips are emerging, floating-point (FP) CIM chips have not been well explored. The high-accuracy demands of larger models and complex tasks require FP computation, and most neural network (NN) training tasks still rely on it. This work presents an energy-efficient FP CIM processor. It is observed that most exponent values of FP data are concentrated in a small region; the FP computations are therefore divided into intensive and sparse parts and executed on an intensive-CIM sparse-digital architecture. First, an FP-to-INT CIM workflow is designed for the intensive FP operations to reduce the number of CIM execution cycles. Second, a flexible sparse-digital core is proposed for the remaining sparse FP operations. Utilizing both the intensive-CIM and sparse-digital cores, this work achieves high energy efficiency together with accuracy identical to the FP algorithm baseline. Considering the FP CIM execution flow, a CIM-friendly low-bit FP training method is proposed to further reduce the execution cycles. In addition, a low-MAC-value (MACV) CIM macro is designed to exploit the more randomly distributed sparsity introduced by FP alignment. The fabricated 28-nm chip achieves a macro energy efficiency of 275–1615 TOPS/W at INT4 and 17.2–91.3 TOPS/W at FP16, ranging from fully dense inputs to the average sparsity of the tested models.
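To make the intensive/sparse decomposition concrete, the following is a minimal NumPy sketch of the idea, not the paper's hardware flow: the function names, the exp_window parameter, the roughly 8-bit INT grid, and the residual-based sparse remainder are all illustrative assumptions. Values whose exponents fall near the dominant exponent are aligned onto a shared INT grid (the intensive part a CIM macro would process); everything left over stays in FP (the part a sparse-digital core would handle), and recombining the two reproduces the FP dot product.

```python
import numpy as np

def split_intensive_sparse(x, exp_window=4):
    # Toy model, not the paper's circuit: x == int_part * scale + sparse_part.
    m, e = np.frexp(x)                      # x = m * 2**e, with m in [0.5, 1)
    e_max = int(e[x != 0].max())            # exponents cluster near the top
    in_window = (e >= e_max - exp_window) & (x != 0)
    scale = 2.0 ** (e_max - 7)              # hypothetical ~8-bit INT grid
    int_part = np.where(in_window, np.round(x / scale), 0.0).astype(np.int32)
    sparse_part = x - int_part * scale      # residuals + out-of-window values
    return int_part, scale, sparse_part

def fp_mac(x, w, exp_window=4):
    # Dot product executed as intensive INT MACs plus a sparse FP fix-up.
    xi, sx, xs = split_intensive_sparse(x, exp_window)
    return float(xi @ w) * sx + float(xs @ w)

rng = np.random.default_rng(0)
x = rng.standard_normal(256)
w = rng.standard_normal(256)
print(np.allclose(fp_mac(x, w), float(x @ w)))  # True: matches the FP baseline
```

In this sketch the INT MACs (xi @ w) stand in for the CIM macro's work and the FP remainder (xs @ w) for the sparse-digital core's; widening exp_window shifts more of the computation into the INT part at the cost of a wider accumulator.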
