• April 22, 2024

AWS charges a high fee for its own Arm chips

Amazon Raises Cloud Service Prices

For decades, Moore's Law improvements in server CPU performance and cost have led us all to assume that, one way or another, each new generation of processors would bring a lower cost per unit of performance. But it does not always happen, and it will happen less and less as we head into the late 2020s, with transistor shrinks petering out and clock speed gains long since ended.

The market price of Amazon Web Services' homegrown Graviton 4 processor does not seem to follow this trend, now that the initial R8g instances are generally available. Eventually, more Graviton 4-based instances with varying memory, local storage, and I/O capacities will launch on AWS, but for now the basic R8g instances are available in only four regions.

The Graviton series of Arm-based CPUs, designed by the cloud giant's Annapurna Labs division, has gradually scaled up, and with the launch of Graviton 4 it can take on much larger tasks. The chip has faster cores, better cores, and more cores, and for the first time it supports dual-socket NUMA memory configurations, yielding nodes with 192 cores running at 2.8 GHz backed by 1.5 TB of main memory. Compared to the Graviton 4 now available for rent, the original Graviton 1 chip launched in November 2018 looks like a toy.

AWS launched the Graviton 4 in November last year without revealing many details about the chip. Ali Saidi, Senior Principal Engineer at Annapurna Labs, filled in several of the blanks in our feature list. Saidi explained that the Graviton 4 chip runs at 2.8 GHz, very close to the 2.7 GHz we had guessed. By doubling the L2 cache per core to 2 MB, the AWS team was able to shrink the L3 cache on the processor, leaving more room for a 50 percent increase in the core count per chip, to 96. In fact, the L3 cache per core works out to only 384 KB, which is 5.3 times smaller than the 2 MB L2 cache per core. However, that 36 MB of L3 cache is shared across all 96 cores, giving them a collective pool that is much larger than any single core's 2 MB L2.
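The cache arithmetic above is easy to verify: 36 MB of shared L3 spread across 96 cores works out to 384 KB per core, making each core's private 2 MB L2 more than five times its share of the L3. A quick sanity check:

```python
# Sanity-check the Graviton 4 cache figures quoted above.
CORES = 96
L3_TOTAL_KB = 36 * 1024     # 36 MB of shared L3 across the chip
L2_PER_CORE_KB = 2 * 1024   # 2 MB of private L2 per core

l3_share_kb = L3_TOTAL_KB / CORES
print(l3_share_kb)                    # 384.0 KB of L3 per core
print(L2_PER_CORE_KB / l3_share_kb)   # the L2 is ~5.3x larger than the per-core L3 share
```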

"So each L2 has gotten bigger, 2 MB instead of 1 MB," Saidi told The Next Platform. "The reason is simple. It takes ten cycles to reach the L2 cache, and it still takes ten cycles with the capacity doubled. It takes 80 to 90 cycles to reach the last level cache. We want to put as much memory as close to the cores as possible, and the L2 is roughly eight times closer."

As we previously reported, Graviton 4 is based on Arm Ltd's "Demeter" V2 core, the same core Nvidia uses in its 72-core "Grace" CPU and one that many other chip makers have chosen as well. Among many other features, the V2 core has four 128-bit SVE2 vector engines, which are very useful for many HPC and AI workloads. We still do not know which process node AWS has chosen for Graviton 4, the transistor count of the device, the number of PCI-Express 5.0 lanes it has, or its thermal design point.

AWS has deployed more than 2 million Graviton processors across 33 regions and over 100 availability zones. The chip is an important differentiator for the AWS cloud and a significant resource for parent company Amazon, which owns various media, entertainment, retail, electronics, and cloud businesses. In fact, if the Graviton 4 instances offer price/performance that is 30 to 40 percent better than roughly equivalent X86 instances based on Intel and AMD processors (we suspect it is more like 20 to 25 percent this time, but we would need cross-architecture benchmarks to make a better assessment), then the pricing of the initial memory-optimized R8g instances indicates high demand for Graviton 4 — so high that the customers renting it may be helping Amazon acquire its own Graviton 4 capacity far more cheaply than it otherwise could.

Here are the feeds and speeds of the Graviton 4 instances, along with their on-demand and reserved instance pricing:

The single-socket R8g instances range from 1 to 96 cores and from 8 GB to 768 GB of memory. Network bandwidth scales proportionally, up to 40 Gb/s per instance, and Elastic Block Store (EBS) bandwidth scales up to 30 Gb/s per socket. We think the dual-socket Graviton 4 instances are a special case, because the network bandwidth of the dual-socket machines is only 50 Gb/s and their EBS bandwidth is only 40 Gb/s. Moreover, there are no instance sizes between 96 and 192 cores, which you would expect to see if all of the physical machines Amazon is building were dual-socket boxes.
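The proportional scaling described above can be sketched as a simple linear model: 8 GB of memory per core, with network and EBS bandwidth growing toward the single-socket maximums of 40 Gb/s and 30 Gb/s at 96 cores. The linear model and the chosen sizes are our assumption for illustration, not an official AWS spec sheet.

```python
# Sketch of how single-socket R8g resources scale with core count, per the
# article's figures. This is an illustrative model, not AWS's actual size table.
SOCKET_CORES = 96

def r8g_specs(cores):
    frac = cores / SOCKET_CORES
    return {
        "cores": cores,
        "memory_gb": 8 * cores,              # 768 GB at the 96-core top end
        "network_gbps": round(40 * frac, 2), # 40 Gb/s maximum per instance
        "ebs_gbps": round(30 * frac, 2),     # 30 Gb/s maximum per socket
    }

for cores in (8, 16, 32, 48, 96):
    print(r8g_specs(cores))
```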

Again, this may just be how AWS carves up the machines. For all we know, every Graviton 4 machine is a dual-socket system. It is clear that AWS (and its customers) value NUMA memory sharing across processors, since such a node, with 192 cores and 1.5 TB of memory, can run quite large workloads — such as the SAP HANA in-memory database, which will be certified on the R8g instances.

Rahul Kulkarni, Director of Product Management for the AWS Compute and AI/ML portfolio, said that, broadly speaking, customers can expect at least a 30 percent performance improvement moving from Graviton 3 to Graviton 4, and in many cases 40 percent or more, depending on the nature of the workload and whether the software exercises the integer or vector units.

The premium that AWS charges for Graviton 4 is quite substantial. Let's take a look by comparing the Graviton 4 R8g instances with the previous Graviton 2 and Graviton 3 instances:

Our estimated ECUs (short for EC2 Compute Units, an ancient relative performance metric that AWS used in its early days) give the Graviton 4 line the minimum 30 percent performance bump that Saidi and Kulkarni said you should expect. For the instances shown above, we assume the workload is not memory-bound and apply the same relative performance to each CPU type regardless of memory capacity. In the real world, we realize, more memory sometimes gets you closer to the theoretical performance of the compute engine. If we had more data, we would estimate the performance penalty that smaller memory imposes on some of the smaller instance types. But we don't have more data.

To get a relative price/performance, we estimated the cost of running each instance for one year at AWS's current pricing. We also estimated the cost of the R8gd instances, which will have dedicated local flash storage like other "gd" instances; as always, these estimates are shown in bold red italics.

The results are as follows: comparing the top-end 64-core R7g instance with the top-end 96-core R8g instance, performance rises by 30 percent but cost rises by 65 percent, worsening the cost per unit of performance by 26.9 percent.
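That 26.9 percent figure falls straight out of the two deltas — divide the cost ratio by the performance ratio:

```python
# Reproducing the article's price/performance math for the top-end
# R7g vs. R8g comparison: performance up 30 percent, cost up 65 percent.
perf_ratio = 1.30   # R8g performance relative to R7g
cost_ratio = 1.65   # R8g annual cost relative to R7g

cost_per_perf = cost_ratio / perf_ratio
print(f"cost per unit of performance: {cost_per_perf:.3f}x")  # 1.269x, i.e. 26.9% worse

perf_per_dollar = perf_ratio / cost_ratio
print(f"performance per dollar: {perf_per_dollar:.3f}x")      # 0.788x of the R7g
```

Put another way, a dollar spent on the top-end R8g buys only about 79 percent of the performance that the same dollar bought on the top-end R7g.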

This has happened before with CPU launches: IBM's ES/9000 mainframes in 1990, Sun Microsystems' UltraSparc-III systems in 2001, Intel's "Skylake" Xeon SP v1 processors in 2017. All of them carried a higher cost per unit of performance than their predecessors, and each arrived at a particularly tough moment, just as competition was about to intensify. We suspect that for AWS, this is more a matter of pricing at what the market will bear.

