Journal of Systems Architecture (IF 3.7), Pub Date: 2019-08-27, DOI: 10.1016/j.sysarc.2019.101637. Marwa Naveed Sheikh, Wei-Ming Lin
Simultaneous Multithreading (SMT) improves the performance of superscalar CPUs by allowing multiple threads to execute over a shared path. Instruction throughput improves because shared resources are better utilized: the newly available thread-level parallelism is exploited in addition to the intrinsic instruction-level parallelism. The physical register file is one of the most critical shared resources in SMT systems because only a limited number of rename registers are available. Registers held by long-latency instructions of some threads block the progress of other, faster threads, resulting in inefficient resource utilization and performance degradation. In this paper, we present an algorithm in which each thread is allotted a portion of the rename registers (i.e., a cap) in real time according to its run-time behavior, namely the utilization ratio of its allotted quota and the pace of its deallocation. The proposed method differs from state-of-the-art capping techniques in allowing each thread to adjust its own individual cap value in real time. To preclude over-adjustment, a global lower limit on the cap values is also established in real time to accommodate potentially drastic variations among different mixes of ongoing threads. The proposed method improves IPC by up to 53.8% in a 4-threaded system, and by 43.8% and 41.6% in a 6-threaded and an 8-threaded system, respectively.
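To make the capping idea concrete, the following is a minimal Python sketch of a per-thread cap update driven by the two signals named in the abstract: the utilization ratio of the allotted quota and the pace of deallocation. The abstract does not give the actual update rule, so everything specific here is an assumption for illustration only: the names `ThreadState` and `adjust_cap`, the thresholds `hi_util`, `lo_util` and `slow_free`, the fixed `step`, and the choice to pass the global lower limit in as a precomputed `floor` value. It should not be read as the authors' algorithm.

```python
from dataclasses import dataclass


@dataclass
class ThreadState:
    """Hypothetical per-thread bookkeeping for one sampling interval."""
    cap: int          # current rename-register cap for this thread
    in_use: int = 0   # rename registers currently held by the thread
    freed: int = 0    # registers deallocated during the last interval


def adjust_cap(t, floor, total_regs, step=4,
               hi_util=0.9, lo_util=0.5, slow_free=2):
    """Update one thread's cap from its own run-time behavior.

    All thresholds and the step size are assumed values, not taken
    from the paper. `floor` stands in for the global lower limit; how
    that limit is derived at run time is not described in the abstract.
    """
    util = t.in_use / t.cap  # utilization ratio of the allotted quota

    if util >= hi_util and t.freed > slow_free:
        # Thread fills its quota and releases registers quickly: grow.
        new_cap = t.cap + step
    elif util < lo_util or t.freed <= slow_free:
        # Thread under-uses its quota or holds registers too long: shrink.
        new_cap = t.cap - step
    else:
        new_cap = t.cap

    # Clamp between the global lower limit and the register file size.
    t.cap = max(floor, min(total_regs, new_cap))
    return t.cap


def can_rename(t):
    """A thread may claim another rename register only below its cap."""
    return t.in_use < t.cap


if __name__ == "__main__":
    fast = ThreadState(cap=32, in_use=30, freed=10)
    slow = ThreadState(cap=10, in_use=10, freed=0)
    print(adjust_cap(fast, floor=8, total_regs=128))
    print(adjust_cap(slow, floor=8, total_regs=128))
```

In this sketch the floor prevents a slow thread from being squeezed to a cap so small that it can no longer make progress, which is one plausible reading of "to preclude over-adjustment". The sketch does not enforce that the caps of all threads sum to the register file size; a real design would need some policy for that.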
Title (translated from Chinese):
Dynamic Capping of Rename Registers for SMT Processors