Memory innovations must keep pace with compute to achieve best performance

Media coverage of AI chips and hardware accelerators often focuses on compute and TOPS (tera operations per second). Memory gets a mention in the article but rarely makes the headline. In reality, memory is just as important a part of overall acceleration and deserves more attention. As compute performance increases, it becomes increasingly important to have the right data available at the right time to keep performance optimal. When the data is not available, the compute units sit idle and resources are underutilized.

There are many moving parts when it comes to memory within a data center. First, there is the RAM itself. RAM can sit within the chip in the form of cache, on-board in the form of accelerator RAM, or within the server as system RAM. Then there is the connectivity between compute and memory. The link speed and bus width together set the bandwidth, which determines how fast data moves from memory to compute. Finally, data has to be brought from system storage (primarily Flash for AI applications) to the server via the network interface. This can involve multiple hops before the data reaches compute.
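As a rough illustration of how link speed and bus width combine into bandwidth, here is a minimal Python sketch. The function names and the example values (16 Gbit/s per pin, a 384-bit bus, a 1 GB transfer) are assumptions chosen for illustration and are not taken from any particular data sheet. The result is a theoretical peak; sustained bandwidth is lower in practice.

```python
def peak_bandwidth_gb_per_s(data_rate_gbit_per_pin, bus_width_bits):
    """Theoretical peak bandwidth in GB/s.

    Each pin moves data_rate_gbit_per_pin gigabits per second, and the bus
    has bus_width_bits pins in parallel. Dividing by 8 converts bits to bytes.
    """
    return data_rate_gbit_per_pin * bus_width_bits / 8


def transfer_time_ms(size_gb, bandwidth_gb_per_s):
    """Best-case time in milliseconds to move size_gb at the given bandwidth."""
    return size_gb / bandwidth_gb_per_s * 1000


# Hypothetical example values, for illustration only.
bandwidth = peak_bandwidth_gb_per_s(data_rate_gbit_per_pin=16, bus_width_bits=384)
print(f"Peak bandwidth: {bandwidth:.0f} GB/s")
print(f"Best-case time to move 1 GB: {transfer_time_ms(1, bandwidth):.2f} ms")
```

Doubling either the per-pin speed or the bus width doubles the peak bandwidth, which is why both show up as separate levers in memory interface design.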

The hardware aspect of RAM involves its size, location, and connectivity protocol. The software aspect involves managing the overall process of getting data from wherever it resides to the compute unit as smoothly as possible. Table 1 shows the different types of memory in a typical data center server environment and how they connect to the compute block.

Memory	Connectivity to compute	Innovations
Register Files	Custom	None
Cache	Custom	via Chiplets
On-chip/board RAM	GDDR6, HBM	Chiplets, 2.5D architectures, new protocols
System RAM	PCIe	New protocols, better CPU-accelerator connectivity, software
Off-system memory	Ethernet, InfiniBand	System architecture and software

Table 1: Memory within a typical data center environment

If you compare the highest compute available in the H100 (3958 INT8 TOPS) to the V100 (130 Tensor TFLOPS), the increase over two generations is about 30X. Accelerator RAM over the same two generations has increased from 32 GB to 80 GB (2.5X). Memory bandwidth has increased from 1.1 TB/s to 2 TB/s, an increase of 1.8X. System connectivity has increased from 32 GB/s to 128 GB/s (4X). Clearly, the increases in memory size and connectivity are far smaller than the increase in compute. Of course, this isn’t a strict apples-to-apples comparison, and memory faces different issues than compute, but it gives an indication of the relative increase in compute versus memory parameters. (Note: The data is collected from the V100 and H100 data sheets.)

Parameter	H100	V100	Increase	Comments
Max compute	3958 TOPS	130 TFLOPS	30X	H100 INT8 TOPS and V100 Tensor TFLOPS considered
GPU RAM	80 GB	32 GB	2.5X	From data sheets
GPU RAM Bandwidth	2.0 TB/s	1.1 TB/s	1.8X	From data sheets
System Connectivity	128 GB/s	32 GB/s	4X	PCIe speed improvement

Table 2: Increase in compute vs memory parameters in V100 vs H100
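
The ratios in Table 2 can be recomputed with a few lines of Python. The sketch below uses only the figures quoted in this article. The "bytes per operation" metric is an added illustration, not something taken from the data sheets: it divides memory bandwidth by peak compute to show how much memory traffic each operation can be fed. Note that it mixes H100 INT8 TOPS with V100 Tensor TFLOPS, so it inherits the same apples-to-apples caveat as the table.

```python
# Figures quoted in this article (from the V100 and H100 data sheets).
specs = {
    "Max compute (tera-ops/s)": {"V100": 130.0, "H100": 3958.0},
    "GPU RAM (GB)": {"V100": 32.0, "H100": 80.0},
    "GPU RAM bandwidth (TB/s)": {"V100": 1.1, "H100": 2.0},
    "System connectivity (GB/s)": {"V100": 32.0, "H100": 128.0},
}


def increase(parameter):
    """Return the H100 / V100 ratio for one parameter."""
    row = specs[parameter]
    return row["H100"] / row["V100"]


def bytes_per_op(gpu):
    """Memory bandwidth divided by peak compute.

    Both values carry a tera prefix (TB/s and tera-ops/s), so the prefix
    cancels and the result is bytes of memory traffic per operation.
    """
    bandwidth = specs["GPU RAM bandwidth (TB/s)"][gpu]
    compute = specs["Max compute (tera-ops/s)"][gpu]
    return bandwidth / compute


for parameter in specs:
    print(f"{parameter}: {increase(parameter):.1f}X")

for gpu in ("V100", "H100"):
    print(f"{gpu}: {bytes_per_op(gpu):.5f} bytes per operation")
```

The bytes-per-operation figure drops by more than an order of magnitude between the two parts, which is another way of saying that compute has grown much faster than the memory system feeding it.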

The memory architecture, protocols, manufacturing technology, and system architecture may all need revamping to bring memory improvements on par with compute. Memcon, an upcoming conference in Santa Clara, focuses on issues pertaining to memory in the data center and has a good collection of sessions to discuss such open issues. We need more such gatherings to ensure that memory architecture stays in sync with compute architecture and that maximum performance is achieved. In time, I hope that somewhat ‘out-of-the-box’ ideas will lead to tighter memory-compute integration and, eventually, the best possible accelerator performance.