SP-PIM: A Super-Pipelined Processing-In-Memory Accelerator With Local Error Prediction for Area/Energy-Efficient On-Device Learning
IEEE Journal of Solid-State Circuits (IF 4.6), Pub Date: 2024-03-12, DOI: 10.1109/jssc.2024.3369326
Jaehoon Heo, Jung-Hoon Kim, Wontak Han, Jaeuk Kim, Joo-Young Kim

Over the past few years, on-device learning (ODL) has become integral to the success of edge devices that embrace machine learning (ML), since it plays a crucial role in restoring ML model accuracy when the edge environment changes. However, implementing ODL on battery-limited edge devices poses significant challenges: ML training generates large intermediate data and requires frequent data movement between the processor and memory, resulting in substantial power consumption. To address this limitation, certain ML accelerators in edge devices have adopted a processing-in-memory (PIM) paradigm, integrating computing logic into memory. Nevertheless, these accelerators still face hurdles such as long latency caused by the lack of a pipelined approach in the training process, notable power and area overheads related to floating-point arithmetic, and incomplete handling of data sparsity during training. This article presents a high-throughput super-pipelined PIM accelerator, named SP-PIM, designed to overcome the limitations of existing PIM-based ODL accelerators. To this end, SP-PIM implements a holistic multi-level pipelining scheme based on local error prediction (EP), enhancing training speed by 7.31×. In addition, SP-PIM introduces a local EP unit (LEPU), a lightweight circuit that performs accurate EP leveraging power-of-two (PoT) random weights. This strategy significantly reduces power-hungry external memory access (EMA) by 59.09%. Moreover, SP-PIM fully exploits sparsity in both activation and error data during training, facilitated by a highly optimized PIM macro design. Finally, the SP-PIM chip, fabricated in 28-nm CMOS technology, achieves a training speed of 8.81 epochs/s. It occupies a die area of 5.76 mm² and consumes between 6.91 and 433.25 mW at operating frequencies of 20–450 MHz with a supply voltage of 0.56–1.05 V.
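The LEPU's core idea, predicting the backpropagated error with fixed power-of-two random weights so that every multiply reduces to a bit-shift in hardware, can be sketched in a few lines of Python. This is a minimal illustrative sketch (function names and the direct-feedback-style projection are assumptions, not the chip's circuit):

```python
import numpy as np

def pot_quantize(w):
    """Round each weight to the nearest signed power of two.
    In hardware, multiplying by such a weight is a shift, not a multiply."""
    sign = np.sign(w)
    mag = np.abs(w)
    # Guard against log2(0); zeros are masked back out at the end.
    exp = np.round(np.log2(np.where(mag > 0, mag, 1.0)))
    return sign * np.exp2(exp) * (mag > 0)

def predict_local_error(activations, out_dim, seed=0):
    """Project activations through fixed PoT random weights to obtain a
    local error estimate, avoiding a full backward pass through the network."""
    rng = np.random.default_rng(seed)
    w = pot_quantize(rng.standard_normal((activations.shape[-1], out_dim)) * 0.1)
    return activations @ w
```

Because the random weights are fixed, each layer can predict its error as soon as its forward pass finishes, which is what makes the multi-level pipelining across layers possible.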
We demonstrate that it can successfully execute end-to-end ODL on the CIFAR10 and CIFAR100 datasets. Consequently, it achieves state-of-the-art area efficiency (560.6 GFLOPS/mm²) and competitive power efficiency (22.4 TFLOPS/W), marking a 3.95× higher figure-of-merit (area efficiency × power efficiency × capacity) than previous work. Furthermore, we implemented a cycle-level simulator in Python to investigate and validate the scalability of SP-PIM. Through architectural experiments across various hardware configurations, we verified that the core computing unit within SP-PIM possesses both scale-up and scale-out capabilities.
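The kind of first-order model such a cycle-level simulator rests on can be sketched as follows. This is a toy illustration of why pipelining the training stages yields a large throughput gain (the functions are hypothetical, not the paper's simulator):

```python
def sequential_cycles(n_items, stage_latencies):
    """Cycles when each item traverses all stages before the next one starts."""
    return n_items * sum(stage_latencies)

def pipeline_cycles(n_items, stage_latencies):
    """Cycles for a linear pipeline: one fill of the whole pipe, then
    one new item per initiation interval (the slowest stage's latency)."""
    fill = sum(stage_latencies)
    ii = max(stage_latencies)
    return fill + (n_items - 1) * ii
```

For many items and balanced stages, the speedup approaches the number of stages, which is the effect SP-PIM's multi-level pipelining exploits across forward, error-prediction, and weight-update phases.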

Updated: 2024-03-12