A 28-nm Floating-Point Computing-in-Memory Processor Using Intensive-CIM Sparse-Digital Architecture
IEEE Journal of Solid-State Circuits (IF 4.6), Pub Date: 2024-02-16, DOI: 10.1109/jssc.2024.3363871
Shengzhe Yan, Jinshan Yue, Chaojie He, Zi Wang, Zhaori Cong, Yifan He, Mufeng Zhou, Wenyu Sun, Xueqing Li, Chunmeng Dou, Feng Zhang, Huazhong Yang, Yongpan Liu, Ming Liu

Computing-in-memory (CIM) chips have demonstrated promising high energy efficiency for multiply–accumulate (MAC) operations in artificial intelligence (AI) applications. Although integer (INT) CIM chips are emerging, floating-point (FP) CIM chips have not been well explored. The high-accuracy demands of larger models and complex tasks require FP computation, and most neural network (NN) training tasks still rely on it. This work presents an energy-efficient FP CIM processor. It is observed that most exponent values of FP data are concentrated in a small region; the FP computations are therefore divided into intensive and sparse parts and executed on an intensive-CIM sparse-digital architecture. First, an FP-to-INT CIM workflow is designed for the intensive FP operations to reduce the number of CIM execution cycles. Second, a flexible sparse-digital core is proposed for the remaining sparse FP operations. Utilizing both the intensive-CIM and sparse-digital cores, this work achieves high energy efficiency together with accuracy identical to the FP algorithm baseline. Considering the FP CIM execution flow, a CIM-friendly low-bit FP training method is proposed to further reduce the execution cycles. In addition, a low-MAC-value (MACV) CIM macro is designed to exploit the more randomly distributed sparsity introduced by FP alignment. The fabricated 28-nm chip achieves a macro energy efficiency of 275–1615 TOPS/W at INT4 and 17.2–91.3 TOPS/W at FP16, ranging from fully dense inputs to the average sparsity of the tested models.
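To make the intensive/sparse decomposition concrete, the following is a minimal NumPy sketch of the idea, not the paper's hardware flow: the function names, the exp_window parameter, the roughly 8-bit INT grid, and the residual-based sparse remainder are all illustrative assumptions. Values whose exponents fall near the dominant exponent are aligned onto a shared INT grid (the intensive part a CIM macro would process); everything left over stays in FP (the part a sparse-digital core would handle), and recombining the two reproduces the FP dot product.

```python
import numpy as np

def split_intensive_sparse(x, exp_window=4):
    # Toy model, not the paper's circuit: x == int_part * scale + sparse_part.
    m, e = np.frexp(x)                      # x = m * 2**e, with m in [0.5, 1)
    e_max = int(e[x != 0].max())            # exponents cluster near the top
    in_window = (e >= e_max - exp_window) & (x != 0)
    scale = 2.0 ** (e_max - 7)              # hypothetical ~8-bit INT grid
    int_part = np.where(in_window, np.round(x / scale), 0.0).astype(np.int32)
    sparse_part = x - int_part * scale      # residuals + out-of-window values
    return int_part, scale, sparse_part

def fp_mac(x, w, exp_window=4):
    # Dot product executed as intensive INT MACs plus a sparse FP fix-up.
    xi, sx, xs = split_intensive_sparse(x, exp_window)
    return float(xi @ w) * sx + float(xs @ w)

rng = np.random.default_rng(0)
x = rng.standard_normal(256)
w = rng.standard_normal(256)
print(np.allclose(fp_mac(x, w), float(x @ w)))  # True: matches the FP baseline
```

In this sketch the INT MACs (xi @ w) stand in for the CIM macro's work and the FP remainder (xs @ w) for the sparse-digital core's; widening exp_window shifts more of the computation into the INT part at the cost of a wider accumulator.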
