
High-Bandwidth Memory (HBM3) Specifications Study (1)

by veritedemoi 2023. 7. 26.
The JEDEC (Joint Electron Device Engineering Council) standard for third-generation HBM (HBM3) was published in January 2022. This post covers the latest revision of that standard.

Because of copyright restrictions, some figures and details in this post are my own reinterpretation of the JEDEC standard.

 

In chip design verification, you need to understand the chip's specifications before writing test vector code (or scenarios). The digital logic inside the chip is mostly verified with SystemVerilog and UVM (Universal Verification Methodology) testbenches. To create and apply these testbenches, all signal interactions across the channels, along with the architectural efficiency, must be taken into account.

 

In this post, I will review the JEDEC standards focusing on the architecture of HBM3, the characteristics of main signals, and additional circuitry techniques used in HBM designs. To start with, it is highly recommended to look through the basic architecture of DRAM before getting into multi-level stacked DRAM.

 

 

 

HBM Architecture


 

Bus Transmission & DRAM Architecture (CC: B. Jacob, ISCA '02)

As the figure above shows, the DRAM architecture is quite similar to that of most main memory systems. It has row and column decoders, sense amplifiers, a memory array consisting of bitcells, and data I/O circuitry. DRAM has the advantage of large storage capacity thanks to its 1T1C (one transistor, one capacitor) bitcell. Beyond that, the basic principle of reading and writing data in the memory array is identical to other memory systems.

HBM 3D Stack For Maximum Data Throughput (CC: Rambus)

The point to focus on is how HBM builds on the basic DRAM architecture to achieve a larger bandwidth. Because conventional interconnect technology constrains bandwidth, HBM made a breakthrough with 3D stacking. This approach cannot avoid variation within the interconnect either, but it still outperforms conventional DRAMs in data bandwidth.

 

In addition to the bandwidth advantage, customers (AMD, NVIDIA, Microsoft, etc.) also require a large storage capacity. HBM therefore uses multiple channels and taller stacks to meet the capacity requirements.

Example of DRAM Die Stack with Channels (CC: Intel)

 

According to the JEDEC standard, the division of channels within each stack is left to the vendor (Samsung, SK Hynix, Micron, etc.). Vendors may manufacture products that flexibly support 1, 2, 4, or 8 channels, enabling 16-channel configurations with stacks of 4 to 16 dies. Furthermore, each channel operates independently within the HBM system.

 

More specifically, the architectural features of HBM3 are listed at the beginning of the JEDEC standard. Before going through them, the terminology used in that description should be reviewed first.


Prefetch

CC: Lecture Notes of CSE 502 Stony Brook Univ. (Spring '15)
Internal Structure of Virtual Channel (CC: B. Jacob, ISCA '02)

Prefetching literally means fetching data ahead of demand. It was introduced to raise bandwidth, which is constrained by the performance gap between the core (CPU) and the memory system. This operation is usually controlled by controller software or instruction frequencies.

 

Prefetching takes place while data moves from the row decoder to the channel. In this way, prefetching benefits the core and the DRAM at the same time: bandwidth increases, and the DRAM can run at a lower speed, which gives higher yields [Link].

BL (Burst Length)

The DRAM burst length refers to the number of consecutive memory locations that can be accessed with a single command. A higher burst length allows for more efficient use of the memory bus and can improve performance by reducing the number of memory accesses required to fetch a given amount of data.

 

For instance, a system with a DRAM burst length of 8 accesses 8 consecutive memory locations with a single command, while a system with a burst length of 4 accesses only 4. The burst-length-8 system therefore needs fewer commands, and less time, to move the same amount of data, which can improve performance.
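The burst-length comparison above can be put into numbers. This is a minimal sketch, not something from the JEDEC standard: the function names, the 64-bit bus width, and the 256-byte transfer size are all hypothetical values chosen for illustration.

```python
def bytes_per_burst(bus_width_bits: int, burst_length: int) -> int:
    """Bytes moved by one command: one bus-wide beat per burst step."""
    return bus_width_bits * burst_length // 8


def commands_needed(total_bytes: int, bus_width_bits: int, burst_length: int) -> int:
    """Number of commands required to move total_bytes, rounded up."""
    per_burst = bytes_per_burst(bus_width_bits, burst_length)
    return -(-total_bytes // per_burst)  # ceiling division


# Hypothetical example: a 64-bit bus moving 256 bytes.
for bl in (4, 8):
    print(f"BL{bl}: {bytes_per_burst(64, bl)} B per command, "
          f"{commands_needed(256, 64, bl)} commands for 256 B")
```

With these assumed numbers, doubling the burst length halves the command count for the same payload.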

Prefetch vs. Burst Length

Prefetch → set by the ratio between the I/O frequency and the DRAM core frequency.
Burst Length → related to the cache line size of the CPU.

Both prefetch and burst length are techniques used in DRAM to optimize memory access and data transfer. The two are closely related because the prefetch size often determines the number of data elements transferred in a burst.
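The relation between prefetch and the core and I/O frequencies can be sketched as simple arithmetic. This assumes an idealized model in which the core delivers `prefetch_n` bits per pin on every core cycle. The function names and the example frequencies are hypothetical.

```python
def io_data_rate_mtps(core_freq_mhz: float, prefetch_n: int) -> float:
    """Per-pin data rate (MT/s) when the core supplies prefetch_n bits per pin per core cycle."""
    return core_freq_mhz * prefetch_n


def required_core_freq_mhz(data_rate_mtps: float, prefetch_n: int) -> float:
    """Core frequency (MHz) needed to sustain a given per-pin data rate."""
    return data_rate_mtps / prefetch_n


# Hypothetical example: the same 200 MHz core with different prefetch depths.
for n in (2, 4, 8):
    print(f"{n}n prefetch: {io_data_rate_mtps(200, n):.0f} MT/s per pin")
```

Under this model, a deeper prefetch lets the I/O run faster while the core frequency stays the same.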

PC (Pseudo Channel)

HBM2 Pseudo Channel (CC: SK Hynix Seminar @ NVIDIA)

A pseudo channel is a technique that divides a single physical channel into several equal-sized virtual channels. For instance, two pseudo channels in a single DRAM die channel share the same AWORD (address buffer), but pseudo channel mode lets the bank groups of each pseudo channel transfer data independently over the physical channel I/O.

 

By having multiple pseudo channels, the memory controller can perform simultaneous memory operations, thereby increasing the effective memory bandwidth.
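As a rough illustration of the split, the sketch below divides the DQ lanes of one channel into equal groups, one per pseudo channel. The contiguous lane assignment and the function name are assumptions made for illustration, not taken from the JEDEC standard.

```python
def pseudo_channel_lanes(channel_dq_width: int, num_pseudo_channels: int) -> list:
    """Split a channel's DQ lanes into equal, contiguous groups (assumed layout)."""
    if channel_dq_width % num_pseudo_channels != 0:
        raise ValueError("DQ width must divide evenly among pseudo channels")
    width = channel_dq_width // num_pseudo_channels
    return [range(pc * width, (pc + 1) * width) for pc in range(num_pseudo_channels)]


# A 64-DQ channel split into two pseudo channels.
for pc, lanes in enumerate(pseudo_channel_lanes(64, 2)):
    print(f"PC{pc}: DQ[{lanes[-1]}:{lanes[0]}] ({len(lanes)} lanes)")
```

Each group can carry its own burst, while both groups are still addressed through the shared command interface.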

Bank

Bank refers to a subsection of the memory array within a DRAM chip.

 

The entire DRAM chip is divided into multiple banks, and each bank is capable of independently accessing and storing data. When the processor or memory controller requests data from the DRAM, the memory controller activates a specific row in a selected bank, and then it accesses the desired column to read or write the data. This process is known as a DRAM access cycle. 
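The access cycle described above can be sketched as a toy model of one bank. The `Bank` class and the command strings are hypothetical. The PRECHARGE step, which closes an open row before another row is activated, is standard DRAM behavior that the paragraph above does not mention.

```python
class Bank:
    """Toy model of one DRAM bank that tracks which row is currently open."""

    def __init__(self):
        self.open_row = None  # no row is active after reset

    def read(self, row: int, col: int) -> list:
        """Return the command sequence needed to read (row, col)."""
        commands = []
        if self.open_row != row:
            if self.open_row is not None:
                commands.append("PRECHARGE")  # close the currently open row
            commands.append(f"ACTIVATE row={row}")
            self.open_row = row
        commands.append(f"READ col={col}")
        return commands


bank = Bank()
print(bank.read(3, 0))  # first access opens the row
print(bank.read(3, 1))  # same row: only a column read is needed
print(bank.read(5, 0))  # different row: close, then open the new row
```

Because each bank tracks its own open row, accesses to different banks can overlap, which is where the parallelism comes from.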

 

The concept of banks in DRAM is a fundamental part of the memory organization that contributes to improving memory access speed, parallelism, and overall system performance.

MRS (Mode Register Set)

MRS is a command used to configure various operating modes and settings for the HBM devices. It's part of the JEDEC standard for DRAM devices, which includes HBM.

 

The MRS command allows the memory controller to write specific data patterns to special mode registers within the HBM device. These mode registers control various parameters and settings, such as timing, data bus width, refresh rate, and other operational characteristics of the memory.

 

By configuring the mode registers through the MRS command, the memory controller can optimize the performance and operation of the HBM for the specific requirements of the system or application.
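A mode register write can be pictured as storing an opcode in a small register file, with individual settings packed into bit fields. The sketch below is a generic model. The register count, register width, and field positions are hypothetical and do not reflect the actual HBM3 mode register map.

```python
class ModeRegisters:
    """Generic mode register file; the sizes here are illustrative assumptions."""

    def __init__(self, count: int = 16, width: int = 8):
        self.width = width
        self.regs = [0] * count

    def mrs(self, index: int, opcode: int) -> None:
        """Model an MRS command: write opcode into mode register `index`."""
        if not 0 <= opcode < (1 << self.width):
            raise ValueError("opcode does not fit in the register width")
        self.regs[index] = opcode

    def field(self, index: int, msb: int, lsb: int) -> int:
        """Extract bits [msb:lsb] of a mode register."""
        mask = (1 << (msb - lsb + 1)) - 1
        return (self.regs[index] >> lsb) & mask


mr = ModeRegisters()
mr.mrs(1, 0b1010_0110)     # hypothetical opcode
print(mr.field(1, 4, 0))   # hypothetical field in bits [4:0]
print(mr.field(1, 7, 5))   # hypothetical field in bits [7:5]
```

A testbench can use a model like this to check that the settings it programmed match what the device is expected to hold.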


With this terminology in place, here are a few architectural features that HBM3 adds over previous generations of HBM.

 

First of all, a 256-bit prefetch per memory read/write access is supported. Because the storage capacity and the number of channels differ from previous HBM devices, the prefetch access scheme has changed as well. (⇒ 64 DQ width plus ECC/SEV pin support per channel, 32 DQ width in PC mode)
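The link between the 256-bit prefetch and the 32-DQ pseudo channel width can be checked with a short calculation. It assumes the whole prefetch is sent over the pseudo channel's DQ pins in one burst. The names below are hypothetical.

```python
PREFETCH_BITS = 256  # prefetch per read/write access, from the text above


def beats_per_access(dq_width: int, prefetch_bits: int = PREFETCH_BITS) -> int:
    """Number of data beats needed to move one prefetch over dq_width pins."""
    if prefetch_bits % dq_width != 0:
        raise ValueError("prefetch must be a multiple of the DQ width")
    return prefetch_bits // dq_width


pc_width = 32  # DQ width of one pseudo channel
print(f"{beats_per_access(pc_width)} beats per access, "
      f"{PREFETCH_BITS // 8} bytes per access")
```

Under that assumption, one access on a 32-DQ pseudo channel takes 8 beats and moves 32 bytes.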

 

In addition, other architectural features define the capacity of HBM3: a channel density of 2 Gb to 32 Gb, 16/32/48/64 banks per channel, and a 1 KB page size per pseudo channel.
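The channel density range translates into a stack capacity once it is multiplied by the number of channels. The sketch below assumes the 16-channel configuration mentioned earlier in this post. The function name is hypothetical.

```python
def stack_capacity_gbytes(channel_density_gbit: float, num_channels: int = 16) -> float:
    """Total stack capacity in gigabytes for a given per-channel density in gigabits."""
    return channel_density_gbit * num_channels / 8  # 8 bits per byte


for density in (2, 32):
    print(f"{density} Gb per channel x 16 channels = "
          f"{stack_capacity_gbytes(density):.0f} GB per stack")
```

With 16 channels, the 2 Gb to 32 Gb density range corresponds to 4 GB to 64 GB per stack.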


 

 

 

Single Channel Signals


According to the architectural specifications of HBM3, each channel has an independent command and data interface. The signals required for each channel are therefore identical, except for the global signals of the HBM3 device.

 

DQ[63:0] | C[7:0] | R[9:0] | 1 DBI / 8 DQs
2 ECC / 32 DQs | 2 SEV / 32 DQs | 1 PAR / 32 DQs (DPAR) | 1 PAR / AWORD (APAR)
1 DERR / 32 DQs | 1 RDQS_t/c, WDQS_t/c / 32 DQs | CK_t/c | AERR / AWORD
RD[3:0] | Redundant row/column | 1 RFU / AWORD

To put it simply, the bit ranges inside the brackets [] indicate the number of microbumps. DQ is the data signal, and C and R are the column and row command/address signals. The remaining signals serve specific operations, such as parity checking and strobe control.
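The "per N DQs" entries in the signal list can be turned into per-channel counts with a short calculation. The ratios are copied from the list above. The dictionary and function names are hypothetical, and the strobe, clock, and AWORD signals are left out to keep the sketch simple.

```python
# (number of signals, per this many DQs), copied from the signal list above
RATIOS = {
    "DBI": (1, 8),
    "ECC": (2, 32),
    "SEV": (2, 32),
    "DPAR": (1, 32),
    "DERR": (1, 32),
}


def signal_counts(dq_width: int = 64) -> dict:
    """Scale each per-DQ-group ratio up to the full channel DQ width."""
    return {
        name: (dq_width // per_dqs) * count
        for name, (count, per_dqs) in RATIOS.items()
    }


for name, count in signal_counts(64).items():
    print(f"{name}: {count} per 64-DQ channel")
```

A count like this is a quick sanity check when declaring signal widths in a testbench interface.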

 

Specific operations will be addressed in the next post.


 
