Media coverage of AI chips and hardware accelerators often focuses on compute and TOPS (tera-operations per second). Memory is mentioned, but it rarely makes the headlines. In reality, memory is just as important a part of overall acceleration and deserves more attention. As compute performance increases, it becomes increasingly important to have the right data available at the right time. When data is not available, the compute sits idle and resources go underutilized.
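As a rough illustration of why data availability gates utilization, the sketch below applies the standard roofline model: delivered throughput is the lesser of peak compute and memory bandwidth times arithmetic intensity. The peak-TOPS and bandwidth figures and the intensity values are assumed round numbers for illustration, not measurements from any specific part.

```python
# Roofline-style sketch: attainable throughput is the lesser of peak
# compute and (memory bandwidth x arithmetic intensity).
# All numbers below are illustrative assumptions, not measured values.

PEAK_TOPS = 4000      # assumed peak compute, INT8 TOPS
BANDWIDTH_TBPS = 2.0  # assumed accelerator RAM bandwidth, TB/s

def attainable_tops(intensity_ops_per_byte: float) -> float:
    """Attainable throughput (TOPS) at a given arithmetic intensity
    (operations performed per byte moved from memory)."""
    memory_bound = BANDWIDTH_TBPS * intensity_ops_per_byte  # TB/s * ops/byte = TOPS
    return min(PEAK_TOPS, memory_bound)

for intensity in [10, 100, 1000, 5000]:
    tops = attainable_tops(intensity)
    print(f"intensity {intensity:>5} ops/byte -> {tops:>7.0f} TOPS "
          f"({100 * tops / PEAK_TOPS:.0f}% of peak)")
```

Below roughly 2,000 ops/byte in this toy setup, the accelerator is memory-bound: the compute units wait on data no matter how many TOPS the chip advertises.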
There are many moving parts when it comes to memory within the data center. First, there is the RAM itself: it can sit within the chip as cache, on the board as accelerator RAM, or within the server as system RAM. Then there is the connectivity between compute and memory; the speed, bus width, and bandwidth determine how fast data moves from memory to compute. Finally, there is the matter of bringing data from system storage (primarily flash for AI applications) to the server via the network interface. Data may take multiple hops before it reaches compute, as the sketch after the table below illustrates.
The hardware aspects of RAM involve its size, location, and connectivity protocol. The software aspect involves managing the overall process of getting data from wherever it resides to the compute unit as smoothly as possible. The table below shows the different types of memory in a typical data-center server environment and how they connect to the compute block.
| Memory | Connectivity to compute | Innovations |
|---|---|---|
| Register files | Custom | None |
| Cache | Custom | Via chiplets |
| On-chip/board RAM | GDDR6, HBM | Chiplets, 2.5D architectures, new protocols |
| System RAM | PCIe | New protocols, better CPU-accelerator connectivity, software |
| Off-system memory | Ethernet, InfiniBand | System architecture and software |
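To make the multi-hop path concrete, here is a back-of-the-envelope sketch in Python. The hop names and bandwidth figures are illustrative assumptions, not vendor specifications, and the model ignores latency, overlap, and protocol overhead; it only shows how each hop adds to the time before data reaches compute.

```python
# Sketch of the multi-hop data path from storage to compute.
# Each hop is (name, bandwidth in GB/s); the figures are rough
# illustrative assumptions, not vendor specifications.

HOPS = [
    ("flash storage -> NIC (network)", 25),   # ~200 Gb/s Ethernet, assumed
    ("NIC -> system RAM (PCIe)",       32),   # PCIe Gen3 x16, assumed
    ("system RAM -> accelerator RAM",  32),   # host-to-device copy, assumed
    ("accelerator RAM -> compute",   2000),   # on-board HBM bandwidth, assumed
]

def transfer_time_s(gigabytes: float) -> float:
    """Total time to push a payload through every hop in sequence,
    deliberately ignoring overlap, latency, and protocol overheads."""
    return sum(gigabytes / bw for _, bw in HOPS)

payload_gb = 80  # e.g. a payload roughly the size of accelerator RAM
for name, bw in HOPS:
    print(f"{name:<35} {payload_gb / bw:6.2f} s at {bw} GB/s")
print(f"total (no overlap): {transfer_time_s(payload_gb):.2f} s")
```

Even in this crude model, the slow early hops dominate: the final hop from on-board RAM to compute is nearly free compared with getting the data onto the board in the first place, which is why software that stages and prefetches data matters as much as the hardware.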
Compare the highest compute available on the H100 (3,958 INT8 TOPS) to the V100 (130 Tensor TFLOPS): the increase over two generations is roughly 30X. Over the same two generations, accelerator RAM has grown from 32 GB to 80 GB (2.5X), memory bandwidth from 1.1 TB/s to 2.0 TB/s (1.8X), and system connectivity from 32 GB/s to 128 GB/s (4X). Clearly, the increases in memory size and connectivity are far smaller than the increase in compute. This is not a perfect apples-to-apples comparison, and memory faces different issues than compute does, but it gives an indication of the relative growth of compute vs. memory parameters. (Note: the data is taken from the V100 and H100 data sheets.)
| Parameter | H100 | V100 | Increase | Comments |
|---|---|---|---|---|
| Max compute | 3,958 TOPS | 130 TFLOPS | ~30X | H100 INT8 TOPS and V100 Tensor TFLOPS considered |
| GPU RAM | 80 GB | 32 GB | 2.5X | From data sheets |
| GPU RAM bandwidth | 2.0 TB/s | 1.1 TB/s | 1.8X | From data sheets |
| System connectivity | 128 GB/s | 32 GB/s | 4X | PCIe speed improvement |
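As a quick consistency check on the ratios above, the short script below recomputes the increase factors from the quoted datasheet figures; it is plain arithmetic over the same numbers as the table.

```python
# Recompute the generation-over-generation increase factors from the
# datasheet figures quoted above (V100 -> H100).

specs = {
    # parameter: (V100 value, H100 value, unit)
    "max compute":         (130,  3958, "TOPS/TFLOPS"),  # V100 Tensor TFLOPS vs H100 INT8 TOPS
    "GPU RAM":             (32,     80, "GB"),
    "GPU RAM bandwidth":   (1.1,   2.0, "TB/s"),
    "system connectivity": (32,    128, "GB/s"),         # PCIe Gen3 vs Gen5 x16
}

for name, (v100, h100, unit) in specs.items():
    print(f"{name:<20} {v100:>6} -> {h100:>6} {unit:<11} ({h100 / v100:.1f}X)")
```

Running this gives roughly 30X for compute against 2.5X, 1.8X, and 4X for the memory parameters, which is the gap the rest of this piece is concerned with.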
The memory architecture, protocols, manufacturing technology, and system architecture may all need revamping to bring memory improvements on par with compute. MemCon, an upcoming conference in Santa Clara, focuses on issues pertaining to memory in the data center and offers a good collection of sessions on these open problems. We need more such gatherings to ensure that memory architecture stays in sync with compute architecture and that maximum performance is achieved. In time, I hope that out-of-the-box ideas will lead to tighter memory-compute integration and, eventually, the best possible accelerator performance.