Memory innovations must keep pace with compute to achieve best performance

Media coverage of AI chips and hardware accelerators often focuses on compute and TOPS (tera operations per second). Memory is mentioned in the article but rarely makes the headlines. In reality, memory is just as important a part of overall acceleration and deserves more attention. As compute performance increases, it becomes increasingly important to have the right data available at the right time to keep performance optimal. When the data is not available, the compute sits idle and resource utilization drops.

There are many moving parts when it comes to memory within a data center. First there is the RAM, which can sit within the chip as cache, on the board as accelerator RAM, or within the server as system RAM. Then there is the connectivity between compute and memory: the speed and bus width of the link determine its bandwidth, and therefore how fast data moves from memory to compute. Finally, data has to be brought from system storage (primarily flash for AI applications) to the server via the network interface, which may involve multiple hops before the data reaches compute.
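The relationship between speed, bus width and bandwidth can be sketched in a few lines of Python. This is a minimal illustration only: the function names are hypothetical, and the per-pin rates and bus widths below are assumed example values, not figures taken from any particular data sheet.

```python
def peak_bandwidth_gb_per_s(data_rate_gbps_per_pin: float, bus_width_bits: int) -> float:
    """Peak bandwidth in GB/s: per-pin rate (Gbit/s) times bus width (bits), divided by 8."""
    return data_rate_gbps_per_pin * bus_width_bits / 8


def transfer_time_ms(size_gb: float, bandwidth_gb_per_s: float) -> float:
    """Best-case time in milliseconds to move size_gb over a link, ignoring latency and overhead."""
    return size_gb / bandwidth_gb_per_s * 1000


# Assumed example: a wide but slower interface vs. a narrow but faster one.
wide_slow = peak_bandwidth_gb_per_s(data_rate_gbps_per_pin=2.0, bus_width_bits=4096)
narrow_fast = peak_bandwidth_gb_per_s(data_rate_gbps_per_pin=16.0, bus_width_bits=384)

print(f"wide/slow   : {wide_slow:.0f} GB/s")
print(f"narrow/fast : {narrow_fast:.0f} GB/s")

# Moving the same 1 GB of data over on-board memory vs. a 32 GB/s system link.
print(f"1 GB over on-board memory: {transfer_time_ms(1.0, wide_slow):.2f} ms")
print(f"1 GB over a 32 GB/s link : {transfer_time_ms(1.0, 32.0):.2f} ms")
```

Real transfers are slower than this best case, since every hop adds latency and protocol overhead, but the sketch shows why both the width and the speed of each link matter.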

The hardware aspect of RAM involves its size, location, and connectivity protocol. The software aspect involves managing the overall process of getting data from wherever it resides to the compute unit as smoothly as possible. Table 1 shows the different types of memory within a typical data center server environment and how they connect to the compute block.

| Memory | Connectivity to compute | Innovations |
|---|---|---|
| Register files | Custom | None |
| Cache | Custom | Via chiplets |
| On-chip/board RAM | GDDR6, HBM | Chiplets, 2.5D architectures, new protocols |
| System RAM | PCIe | New protocols, better CPU-accelerator connectivity, software |
| Off-system memory | Ethernet, InfiniBand | System architecture and software |

Table 1: Memory within a typical data center environment

Compare the highest compute figure for the H100 (3958 INT8 TOPS) with that of the V100 (130 Tensor TFLOPS): the increase over two generations is roughly 30X. Over the same two generations, accelerator RAM has grown from 32 GB to 80 GB (2.5X), memory bandwidth from 1.1 TB/s to 2 TB/s (1.8X), and system connectivity from 32 GB/s to 128 GB/s (4X). The increases in memory size and connectivity are clearly far smaller than the increase in compute. Of course, this is not an apples-to-apples comparison, and memory faces different issues than compute, but it gives an indication of the relative increase in compute versus memory parameters. (Note: the data is collected from the V100 and H100 data sheets.)

| Parameter | H100 | V100 | Increase | Notes |
|---|---|---|---|---|
| Max compute | 3958 TOPS | 130 TOPS | ~30X | H100 INT8 and V100 Tensor TOPS considered |
| GPU RAM | 80 GB | 32 GB | 2.5X | From data sheets |
| GPU RAM bandwidth | 2.0 TB/s | 1.1 TB/s | 1.8X | From data sheets |
| System connectivity | 128 GB/s | 32 GB/s | 4X | PCIe speed improvement |

Table 2: Increase in compute vs memory parameters in V100 vs H100
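The ratios can be recomputed directly from the figures quoted above. The short Python sketch below does that and also derives a rough "operations per byte of memory bandwidth" number for each GPU. The dictionary layout and function names are my own, and the ops-per-byte figure is only an indicative measure, since it mixes INT8 TOPS with Tensor TFLOPS just as the comparison above does. Dividing 3958 by 130 gives a compute increase of roughly 30X.

```python
# Figures as quoted in the text: (V100, H100)
specs = {
    "Max compute (TOPS)": (130.0, 3958.0),
    "GPU RAM (GB)": (32.0, 80.0),
    "GPU RAM bandwidth (TB/s)": (1.1, 2.0),
    "System connectivity (GB/s)": (32.0, 128.0),
}


def increase(v100: float, h100: float) -> float:
    """Generation-over-generation ratio (H100 divided by V100)."""
    return h100 / v100


def ops_per_byte(tops: float, bandwidth_tb_per_s: float) -> float:
    """Operations available per byte of GPU RAM bandwidth (the tera prefixes cancel)."""
    return tops / bandwidth_tb_per_s


ratios = {name: increase(v100, h100) for name, (v100, h100) in specs.items()}
for name, ratio in ratios.items():
    print(f"{name:28s} {ratio:5.1f}X")

v100_balance = ops_per_byte(130.0, 1.1)
h100_balance = ops_per_byte(3958.0, 2.0)
print(f"V100: ~{v100_balance:.0f} ops per byte of GPU RAM bandwidth")
print(f"H100: ~{h100_balance:.0f} ops per byte of GPU RAM bandwidth")
```

The higher the ops-per-byte figure, the more work each byte fetched from memory has to feed, which is another way of saying that memory has more ground to make up.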

Memory architecture, protocols, manufacturing technology, and system architecture may all need revamping to bring memory on par with the improvement in compute. Memcon, an upcoming conference in Santa Clara, focuses on issues pertaining to memory in the data center and has a good collection of sessions to discuss such open issues. We need more such gatherings to ensure that memory architecture stays in sync with compute architecture and that maximum performance is achieved. In time, I hope that somewhat ‘out-of-the-box’ ideas will lead to tighter memory-compute integration and, eventually, the best possible accelerator performance.