AMD Epyc Faces Off With Intel Skylake-SP Xeon in Massive Server Battle

Last month, AMD launched its new server architecture, codenamed Epyc, in single-chip configurations of up to 32 cores. We already knew back then that Intel was prepping a massive Xeon refresh, starting with the Core X-Series (working on that, for the record) and following it up with a new lineup of Xeon parts with up to 28 cores, 56 threads, and a new cache structure that quadrupled the amount of L2 while slashing the amount of L3 allocated per core.

On Monday, Intel launched roughly 50 SKUs in total, with top-end 28-core prices reaching $10K to $11K per physical CPU. Intel's new Xeon "Purley" Skylake-SP CPUs support AVX-512, Intel's own mesh topology, and the aforementioned larger L2 cache, so the chips are rather significantly different (with both gains and losses) relative to previous Xeon products.

Over at Anandtech, the indefatigable Johan De Gelas (once of Ace's Hardware, for you longtime tech readers) has joined up with Ian Cutress to provide preliminary data on how AMD's Epyc and these new Skylake-SP Xeons compare with one another, with previous Xeon chips thrown in for good measure.

A few points before we dive in. First, Anandtech acknowledges having had just one week with its AMD testbed and two weeks with the Intel system. Server testing is far more complicated than desktop testing: the benchmarks themselves are often more arcane and trickier to fine-tune, and performance can be heavily dependent on the presence or absence of platform-specific optimizations. Anandtech makes prominent note of the fact that it had only a very limited window in which to discover and apply such optimizations, particularly in AMD's case.

Second, the performance scenarios and relative rankings of Epyc versus Xeon are themselves highly dependent on the tests in question. This is the first time in at least six years that AMD has had a server part that could take the fight to Intel in any context, but testing has shown some distinct strong and weak points to AMD’s architecture. While we’ll provide an overview of the findings, there’s no substitute for close reading of the original article if you want to completely understand the subtleties.

Inherent strengths, weaknesses, and differences

In comparison to Intel’s new chips, AMD’s Epyc uses its own CCX and Infinity Fabric, doesn’t implement AVX-512, and has the same cache structure as Ryzen. This proves critical to understanding how Intel and AMD compare in a number of benchmarks (more on that in a moment).

[Image: AMD Epyc]

AMD has a significant advantage in base price; the top-end Epyc 7601 (180W TDP) is a 32-core chip with a 2.2GHz base / 3.2GHz max clock speed and a $4,200 price tag. Intel's Xeon 8180 is a 28-core chip with a 2.5GHz base / 3.8GHz max clock and a $10,009 price tag (a related SKU with a 165W TDP, a 2.1GHz base clock, and support for 1.5TB of DRAM per socket retails for $11,722). Anandtech tested the Xeon 8176: 28 cores, a 2.1GHz base clock, a maximum of 768GB of RAM per socket, and a price tag of $8,719. Intel's new Platinum/Gold/Silver/Bronze format looks nothing short of nightmarishly complicated, with vastly different specs swept into the same "families" in some cases. Other designations contain a number of exceptions to the rules that are supposed to govern which chips are placed in which brackets.

See? The only difference between 51xx and 61xx is the number of UPI links, AVX-512 FMA units per core, core counts, RAM support, and scalability. They're practically identical!

We noted when news of the rebrand hit that it wasn’t clear how this structure would clarify Intel’s product lines, and a muddle is precisely what’s emerged from these results.

Cache changes

I want to take a moment to talk about the cache architecture differences between the new Skylake-SP Xeons and previous CPUs, as well as between Intel and AMD. Skylake-S processors have a 256KB L2 cache that's 4-way set associative (see our L1 vs. L2 cache explainer for details on what this means) with an 11-cycle latency. Previous Xeons used a large inclusive L3 cache with ~2.5MB of L3 allocated per core, up to 16-way set associativity, and a ~44-cycle latency.

Skylake-SP, on the other hand, has a 1MB L2 cache that's 16-way associative but has higher (13-cycle) latency. Less L3 cache is integrated per core (1.375MB), the cache is 11-way set associative instead of 16-way, it has a 77-cycle latency (up from 44), and it's non-inclusive.

An inclusive cache is a cache that is guaranteed to contain all data found within the higher-level caches. The advantage of inclusive caches is that you can search the last level of cache (L3, in Intel's case) and determine whether data is located anywhere on-chip: if it isn't in L3, it can't be in L1, which means you know you need to load it from main memory. This reduces the miss-latency penalty (searching main memory is still much slower than searching L3). The disadvantage of an inclusive cache is that it offers less real space for storing data, since it must duplicate all the information held in the cache levels above it. Intel's use of very large L3 caches in previous Broadwell and Skylake-S chips mitigated this issue by providing a large absolute amount of cache space.
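The lookup shortcut described above can be sketched in a few lines of Python (a toy model for illustration, not Intel's actual implementation):

```python
# Toy model of the inclusive-cache lookup shortcut. The invariant is that
# every line held in L1 is also present in L3, so a single L3 probe
# answers "is this data anywhere on-chip?" for the whole hierarchy.
l1 = {"A"}                     # lines currently held in L1
l3 = {"A", "B", "C"}           # inclusive L3: always a superset of L1

def on_chip(addr):
    # One L3 lookup suffices; an L3 miss proves an L1 miss too.
    return addr in l3

assert l1 <= l3                # the inclusion invariant
print(on_chip("B"))            # True: resident somewhere in the hierarchy
print(on_chip("D"))            # False: must be fetched from DRAM
```

The price of that shortcut is the duplication the paragraph above describes: every L1 line consumes an L3 slot as well.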

Skylake-SP transforms the L3 cache into what is often called a victim cache: data lines present in L2 aren't copied to L3 until they are evicted. Data can be read back from L3 into L2 while also remaining resident in L3. Anandtech doesn't believe Skylake-SP can prefetch into L3, which means the cache serves mainly as a home for evicted lines. Because it isn't exercised as heavily as the inclusive L3 in Broadwell and earlier Xeons, Intel can relax its latency and performance targets.
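As a rough illustration of the fill-policy difference, here's a toy victim-cache model in Python. The sizes and LRU policy are drastically simplified and are not Skylake-SP's actual design; the point is only that L3 receives lines on L2 eviction, never on a fill from memory:

```python
from collections import OrderedDict

# Toy sketch: a tiny LRU L2 backed by a victim L3. Lines enter L3 only
# when they fall out of L2, never when L2 fills straight from DRAM.
class VictimCacheL3:
    def __init__(self, l2_lines=2, l3_lines=4):
        self.l2 = OrderedDict()
        self.l3 = OrderedDict()
        self.l2_lines, self.l3_lines = l2_lines, l3_lines

    def access(self, addr):
        if addr in self.l2:                  # L2 hit
            self.l2.move_to_end(addr)
            return "L2"
        if addr in self.l3:                  # L3 hit: copy back into L2,
            level = "L3"                     # line also remains in L3
        else:
            level = "memory"                 # miss: fill L2 from DRAM,
        self.l2[addr] = True                 # bypassing L3 entirely
        if len(self.l2) > self.l2_lines:     # L2 eviction feeds the L3
            victim, _ = self.l2.popitem(last=False)
            self.l3[victim] = True
            self.l3.move_to_end(victim)
            if len(self.l3) > self.l3_lines:
                self.l3.popitem(last=False)
        return level
```

Walking a few accesses through this model ("A", "B", "C", then "A" again) shows "A" being evicted to the L3 and then hitting there on its return, just as the paragraph above describes.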

Meanwhile, AMD uses its own distinct CPU Complex (CCX) design, which combines four CPU cores with an 8MB L3 cache. Two CCXes make up one Zeppelin die, and AMD's own Epyc diagrams show up to four dies per CPU package. The L3 is a mostly exclusive victim cache, but AMD's reliance on the CCX architecture for cross-communication between cores carries some tangible penalties. Data movement within the same CCX is quite quick, but there's a significant latency penalty for moving data across CCX complexes. AMD states that a Naples CPU (four Zeppelin dies) has 64MB of L3, but that's not really accurate. What Epyc has is better described as eight 8MB L3s, in much the same way that a pair of GPUs in SLI mode with 4GB of RAM each is better described as two 4GB GPUs rather than one 8GB GPU.
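The cost asymmetry can be sketched as a toy latency model. The cycle counts below are invented for illustration only, not AMD's measured figures; the structure is what matters:

```python
# Toy model: cores 0-3 share CCX 0, cores 4-7 share CCX 1 on one die.
CORES_PER_CCX = 4

def same_ccx(core_a, core_b):
    return core_a // CORES_PER_CCX == core_b // CORES_PER_CCX

def l3_access_cost(requester, data_home, local=40, cross_ccx=140):
    # Hitting your own CCX's 8MB L3 slice is cheap; pulling a line that
    # lives in another CCX's slice pays a fabric round trip.
    return local if same_ccx(requester, data_home) else cross_ccx

print(l3_access_cost(0, 3))   # same CCX: cheap
print(l3_access_cost(0, 5))   # cross-CCX: expensive
```

This is why "64MB of L3" overstates what any single core effectively sees: only its local 8MB slice is fast.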

These cache structure differences account for a substantial part of why Epyc, pre-Skylake-SP Xeons, and the new Purley Xeons perform differently from one another. But they're scarcely the only factor in play. The chart below shows how complex comparisons between Epyc, Broadwell-EP, and Skylake-SP can get in memory bandwidth alone, depending on test conditions.

[Chart: memory hierarchy bandwidth comparison]

There's no "wrong" test result here, and all of these test types are used by shipping software to varying degrees.

AMD's Epyc 7601 has 0.42x of Skylake-SP's bandwidth in some tests, but 2.26x its bandwidth in others, depending on how threads are pinned across the CPUs. Raw bandwidth for Broadwell-EP is higher than Skylake-SP in almost every case except when eight threads are running, which is where Skylake-SP finally pulls ahead. Relative memory latencies also differ between AMD and Intel, with AMD competing extremely well at test sizes of 4MB and below and poorly above that point. Accessing more than 8MB of data is a worst-case scenario for Epyc; its latency there is worse than Intel's DRAM access latency.
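Thread pinning of the sort driving that bandwidth swing is typically done with numactl or the OS scheduler API. On Linux, a minimal Python sketch looks like this (`os.sched_setaffinity` is Linux-only and not available on all platforms):

```python
import os

# Pin the calling process to a chosen set of logical CPUs. On a NUMA
# part like Epyc, whether threads land on one die or are spread across
# all four determines how much memory traffic crosses the fabric,
# which is exactly the variable behind the bandwidth swings above.
def pin_to_cpus(cpus):
    os.sched_setaffinity(0, cpus)      # 0 = the calling process
    return os.sched_getaffinity(0)     # confirm the new affinity mask

print(pin_to_cpus({0}))                # run only on logical CPU 0
```

Benchmark results on these platforms can change dramatically depending on whether this kind of pinning is done at all, which is part of why the limited tuning window matters.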

[Chart: memory latency, Epyc vs. Xeon (TinyMemBench)]

Ouch. But the gap isn't as determinative of overall performance as its size might suggest.

Performance Overview

AT runs through SPEC2006 (single-thread, SMT, multi-core), database and transactional performance, Java, big data number crunching, and floating-point performance. AMD's FPU performance is surprisingly excellent compared with Intel's. There are several reasons for this, but a number of them come down to various aspects of AVX and its impact on turbo clocks. For the last few product cycles, Intel has publicly stated that its Turbo Mode frequency figures depend on whether AVX is active, with AVX clocks being substantially lower than non-AVX clocks. Intel's Xeon 8176 has a non-AVX 28-core maximum turbo frequency of 2.8GHz, an AVX 2.0 28-core maximum turbo of 2.4GHz, and an AVX-512 28-core maximum turbo of just 1.9GHz.

[Chart: NAMD molecular dynamics performance]

Intel talks up its use of 256-bit and 512-bit FMACs compared with AMD's 128-bit implementation of AVX, but AMD may have taken the wiser route here (it wins all the FPU benchmarks AT ran). Intel takes a 20 percent clock penalty when running AVX-512 compared with 256-bit AVX. While the wider units' higher per-clock throughput should theoretically still deliver significant AVX-512 performance improvements, those gains will only materialize with substantial performance tuning. Not all software vendors or buyers can afford that kind of work, but it will be critical if AVX-512 is to be a success.
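A back-of-the-envelope calculation shows why AVX-512 can still win on paper despite the clock penalty. This assumes two FMA units per core (true of the top Platinum parts, per Intel's own SKU breakdown) and counts theoretical peak throughput only, which real code rarely approaches:

```python
# Peak FP64 throughput per core: clock * FMA units * vector lanes * 2
# (a fused multiply-add counts as two floating-point operations).
def peak_fp64_gflops_per_core(clock_ghz, fma_units, vector_bits):
    lanes = vector_bits // 64                  # FP64 lanes per vector
    return clock_ghz * fma_units * lanes * 2

avx2   = peak_fp64_gflops_per_core(2.4, 2, 256)   # AVX2 turbo: 38.4 GFLOPS
avx512 = peak_fp64_gflops_per_core(1.9, 2, 512)   # AVX-512 turbo: 60.8 GFLOPS
print(avx512 / avx2)   # still ahead on paper despite the lower clock
```

Capturing that theoretical advantage requires code that actually keeps the 512-bit units fed, which is precisely the tuning burden described above.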

FPU performance is, surprisingly, AMD's best overall showing. Epyc is a mediocre database server, beats Intel in Java performance (though not by the same margins as in FPU code), and is extremely competitive in Big Data tests given the price and clock differentials. Power consumption varies substantially by workload: the Xeon 8176 has extremely high idle power consumption, but vastly better MySQL perf/watt than Broadwell and modestly better perf/watt in that test than the Epyc 7601. In POV-Ray testing, AMD turns the tables on Intel, with higher performance at a huge power differential (327W for Epyc versus 453W for Skylake-SP).

Conclusions

The bottom line is this: AMD's Epyc isn't the better choice in every situation or environment. But a combination of lower prices, competitive performance, and some solid test wins shows AMD can hang with Intel again, even at the top of the market. For hardware cost-conscious companies, or vendors that can afford to optimize heavily for AMD's architecture (cloud providers like Microsoft, for example), Epyc is a very strong option. But Skylake-SP shows some formidable performance gains of its own, has a better-scaling mesh topology, and offers the stronger overall level of performance. If your TCO is dominated more by software costs than hardware pricing, Intel and its proven track record may still be the better option.

Finally, I'd like to echo some comments Johan makes. After years of watching Intel's only competition be its own previous generation of products, it's really nice to see some genuine performance back-and-forth. One of the grand ironies of reviewing is that people regularly accuse reviewers of using various tricks or indulging biases to deliberately tilt reviews toward AMD or Intel when, in reality, we're probably the people who most want to see exciting performance matches. Articles like this one (or, of course, AT's vastly larger review) don't write themselves; they take considerable time and effort, and it's boring to watch the same company win over and over. Nobody likes a slugfest better than a reviewer, and this review is worth a read.
