Intel's Best x86 Server CPU

The launch of the Nehalem-EX a year ago was pretty spectacular. For the first time in Intel's history, the high-end Xeon did not have any real weakness. Before the Nehalem-EX, the best Xeons trailed behind the best RISC chips in RAS, memory bandwidth, or raw processing power. The Nehalem-EX chip was well received in the market. In 2010, Intel's datacenter group reportedly brought in $8.57 billion, an increase of 35% over 2009.

The RISC server vendors have lost a lot of ground to the x86 world. According to IDC's Server Tracker (Q4 2010), the RISC/mainframe market share has halved since 2002, while Intel x86 chips now command almost 60% of the market. Interestingly, AMD grew from a negligible 0.7% to a decent 5.5%.

 

Only one year later, Intel is upgrading the top Xeon by introducing Westmere-EX. Shrinking Intel's largest Xeon to 32nm allows it to be clocked slightly higher, gain two extra cores, and add 6MB of L3 cache. At the same time the chip is quite a bit smaller, which makes it cheaper to produce. Unfortunately, the customer does not really benefit from that fact, as the top Xeon became more expensive. Anyway, the Nehalem-EX was a popular chip, so it is no surprise that the improved version has persuaded 19 vendors to produce 60 different designs, ranging from two up to 256 sockets.

Then again, even mediocre chips like the Intel Xeon 7100 series got a lot of system vendor support, a result of Intel's dominant position in the server market. With their latest chip, Intel promises up to 40% better performance at slightly lower power consumption. Considering that the Westmere-EX is the most expensive x86 CPU, it needs to deliver on these promises, on top of providing rich RAS features.

We were able to test Intel's newest QSSC-S4R server, with both "normal" and new "low power" Samsung DIMMs.

Some impressive numbers

The new Xeon can boast some impressive numbers. Thanks to its massive 30MB L3 cache it has even more transistors than the Intel "Tukwila" Itanium: 2.6 billion versus 2 billion transistors. Not that such figures really matter without the performance and architecture to back them up, but the numbers ably demonstrate the complexity of these server CPUs.

Processor Size and Technology Comparison

CPU                  Transistor count (million)   Process   Die size (mm²)   Cores
Intel Westmere-EX    2600                         32 nm     513              10
Intel Nehalem-EX     2300                         45 nm     684              8
Intel Dunnington     1900                         45 nm     503              6
Intel Nehalem        731                          45 nm     265              4
IBM Power 7          1200                         45 nm     567              8
AMD Magny-Cours      1808 (2x 904)                45 nm     692 (2x 346)     12
AMD Shanghai         705                          45 nm     263              4
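
To put the table in perspective, the figures can be turned into transistor density and relative die area. The following is only a rough sketch: the numbers are copied from the table above, while the names `CHIPS`, `density`, and `area_reduction` are made up for illustration.

```python
# Values copied from the "Processor Size and Technology Comparison" table.
# name: (transistors in millions, die size in mm^2, cores)
CHIPS = {
    "Intel Westmere-EX": (2600, 513, 10),
    "Intel Nehalem-EX": (2300, 684, 8),
    "Intel Dunnington": (1900, 503, 6),
    "Intel Nehalem": (731, 265, 4),
    "IBM Power 7": (1200, 567, 8),
    "AMD Magny-Cours": (1808, 692, 12),
    "AMD Shanghai": (705, 263, 4),
}


def density(name):
    """Millions of transistors per mm^2 of die area."""
    transistors, area, _cores = CHIPS[name]
    return transistors / area


def area_reduction(new, old):
    """Fraction by which the die of `new` is smaller than the die of `old`."""
    return 1 - CHIPS[new][1] / CHIPS[old][1]


if __name__ == "__main__":
    # List the chips from densest to least dense.
    for name in sorted(CHIPS, key=density, reverse=True):
        print(f"{name:20s} {density(name):5.2f} M transistors/mm^2")
    shrink = area_reduction("Intel Westmere-EX", "Intel Nehalem-EX")
    print(f"Westmere-EX die is {shrink:.0%} smaller than Nehalem-EX")
```

By these figures the Westmere-EX die is a quarter smaller than the Nehalem-EX die while carrying more transistors, which is the 32nm shrink at work.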

 

62 Comments

  • Fallen Kell - Thursday, May 19, 2011 - link

    As the subject says. Would love to see how these deal with something like Linpack or similar.
  • erple2 - Thursday, May 19, 2011 - link

    I'd be more interested in seeing how they perform in slightly more "generic" and non-GPU-optimizable workloads. If I'm running Linpack or other FPU operations, particularly those that parallelize exceptionally well, I'd rather invest time and money into developing algorithms that run on a GPU than a fast CPU. The returns for that work are generally astounding.

    Now, that's not to say that General Purpose problems work well on a GPU (and I understand that). However, I'm not sure that measuring the "speed" of a single processor (or even a massively parallelized load) would tell you much, other than "it's pretty fast, but if you can massively parallelize a computational workload, figure out how to do it on a commodity GPU, and blow through it at orders of magnitude faster than any CPU can do it".

    However, I can't see running any virtualization work on a GPU anytime soon!
  • stephenbrooks - Thursday, May 19, 2011 - link

    Yeah, well, in an ideal world...

    But sometimes (actually, every single time in my experience) the "expensive software" that's been bought to run on these servers lacks a GPU option. I'm thinking of electromagnetic or finite element analysis code.

    Finite element engines are the sort of thing that companies make a lot of money selling. They are complicated. The commercial ones probably have >10 programmer-years of work in them, and even if they weren't fiercely-protected closed source, porting and re-optimising for a GPU would be additional years of work, again requiring high-level programmers with a lot of mathematical expertise.

    (There might be some decent open-source alternatives around, but they lack the front ends and GUI that most engineers are comfortable using.)

    If you think fixing the above issues is "easy", go ahead. You'll make millions.
  • L. - Thursday, May 19, 2011 - link

    lol

    if you code .. i don't want to read your code
  • carnachion - Friday, May 20, 2011 - link

    I agree with you. In my experience GPU computing for scientific applications is still in its infancy, and in some cases the performance gains are not so high.

    There's still a big performance penalty for using double precision in the calculations. In my lab we are porting some programs to GPU; we started with a matrix multiplication library that runs on the GPUs of a GTX 590. Using one of the 590's GPUs it was 2x faster than a Phenom II X6 1100T, and using both GPUs it was 3.5x faster. So not that huge a gain; using a Magny-Cours processor we could reach the performance of a single GPU, but of course at a higher price.

    Usually scientific applications can use hundreds of cores, and they are tuned to scale well. But I don't know how GPU calculations scale with the number of GPUs. From 1 to 2 GPUs we got this 75% boost, but I don't know how it will perform with inter-node communication; even with an InfiniBand connection there may be a bottleneck for real-world applications. So that's why people still invest in computers with thousands of cores; GPUs still need a lot of work to be a real competitor.
  • DanNeely - Saturday, May 21, 2011 - link

    Single vs double precision isn't the only limiting factor for GPU computing. The amount of data you can have in cache per thread is far smaller than on a traditional CPU. If your working set is too big to fit into the tiny amount of cache available, performance is going to nosedive. This is further aggravated by the fact that GPU memory systems are heavily optimized for streaming access and that random IO (like cache misses) suffers in performance.

    The result is that some applications which can be written to fit the GPU model very well will see enormous performance increases vs CPU equivalents. Others will get essentially nothing.

    Einstein @ Home's gravitational wave search app is an example of the latter. The calculations are inherently very random in memory access (to the extent that it benefits by about 10% from triple channel memory on Intel quads; Intel has said that for quads there shouldn't be any real world app benefit from the 3rd channel). A few years ago when they launched CUDA, nVidia worked with several large projects on the BOINC platform to try and port their apps to CUDA. The E@H CUDA app ended up no faster than the CPU app and didn't scale at all with more CUDA cores, since all the extra cores did was increase the number of threads stalled on memory IO.
  • Marburg U - Thursday, May 19, 2011 - link

    Finally something juicy,
  • JarredWalton - Thursday, May 19, 2011 - link

    So, just curious: is this spam (but no links to a separate site), or some commentary that didn't really say anything? All I've got is this, "On the nature of things":

    http://en.wikipedia.org/wiki/De_rerum_natura

    Maybe I missed what he's getting at, or maybe he's just saying "Westmere-EX rocks!"
  • bobbozzo - Monday, May 23, 2011 - link

    Jarred, my guess is that it is spam, and that there was a link or some HTML posted which was filtered out by the comments system.

    Bob
  • lol123 - Thursday, May 19, 2011 - link

    Why is there a 2-socket-only line of E7 (E7-28xx) but, at least as far as I can tell, no 2-socket motherboards or servers? Are those simply not available yet?
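
The GPU numbers quoted in the comments above (2x the CPU baseline with one GPU, 3.5x with both) can be turned into a scaling figure with simple arithmetic. This is only a sketch; the helper `scaling` is a hypothetical name, and the inputs are the commenter's reported speedups, not measurements made for this article.

```python
def scaling(speedup_one, speedup_n, n):
    """Return (fractional boost, parallel efficiency) when going from 1 to n devices.

    speedup_one and speedup_n are speedups over the same CPU baseline.
    """
    gain = speedup_n / speedup_one  # how much faster n devices are than one
    return gain - 1.0, gain / n


if __name__ == "__main__":
    boost, efficiency = scaling(2.0, 3.5, 2)
    print(f"boost from second GPU: {boost:.0%}, efficiency: {efficiency:.1%}")
```

With those inputs the second GPU adds 75%, matching the figure in the comment, which corresponds to 87.5% of ideal two-GPU scaling within a single card. It says nothing about scaling across nodes, which is the open question the commenter raises.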
