HANAlgorithmics – Efficiency by Design with SAP HANA – Part 1


Posted by John Appleby on April 18, 2013

Global Head Of DDM/HANA COEs


I’ve always been fascinated by the performance of computer designs. I put it down to the late, great Roger Needham, from whom I had the pleasure of learning algorithms. I can still remember his hand-scribbled slides. And lately I’ve been wondering why IT professionals often find programming in SAP HANA hard, so I thought that perhaps I could do justice to some of what I learnt.

In this post I’ll explain why you have to think differently when programming in SAP HANA, using some original research and material collated from various SAP HANA resources. In future posts I’ll explain how to change some programming constructs so they perform well on SAP HANA, and you may be amazed by what you can do.

1) Moore’s Law

You are almost certainly familiar with Moore’s Law, named after Gordon E. Moore, co-founder of Intel, who set it out in a 1965 paper. The detail of Moore’s Law is important though – he postulated that the number of components in an integrated circuit would double every two years. It is often taken to mean that computing power doubles every two years, and that is only partially true.

First, here’s the transistor count in Intel’s top-end CPU, year by year:

[Figure: transistor count in Intel’s top-end CPUs by year]

It tracks Moore’s Law perfectly! If you’re interested in what the weird blip between 1999 and 2000 is, it’s when Intel moved from Slot CPUs (what were they thinking…) that had their L2 cache on a separate chip, to Socket CPUs with the L2 cache on-die. Technically there was also a blip in 1997, when Intel put the L2 cache on-package in the Pentium Pro.


Take a look at this next graph. It shows total computing power from 1995 to 2012. I’ve taken the SPECInt Rate benchmark, which measures CPU throughput, and normalized the best result for each year across the three different versions of the benchmark that existed over that period. I only used commodity Intel hardware and avoided expensive RISC servers, because those scale about as big as you can afford to pay.

[Figure: normalized SPECInt Rate throughput, 1995–2012, versus Moore’s Law]

The graph is logarithmic, with the 1995 CPU as a baseline of 1, so Moore’s Law appears as a straight line – and overall CPU throughput has improved a little faster than that. Note the first interesting thing: since 2002, the number of cores in these systems has increased exponentially.

Let’s move on to the next graph: Moore’s Law as it relates to single-threaded performance. If we use the SPECInt benchmark, rather than the Rate benchmark, we get a very different picture:

[Figure: normalized SPECInt single-thread performance, 1995–2012, versus Moore’s Law]

So from 1995 to 2003 Moore’s Law held, and then something happened – CPUs stopped getting exponentially faster. Instead they improved only linearly, getting just 4x faster over the next 10 years. Note that the number of cores has increased on this benchmark too, but not as fast as on the throughput benchmark, because CPUs with fewer cores can be cranked up to higher frequencies. By 2012, single-threaded performance trails what Moore’s Law would predict by nearly a whole order of magnitude – a gap that is made up by having many cores.


In Intel’s upcoming E7 series CPU, Ivy Bridge, this will be even more pronounced: Ivy Bridge will have 16 cores per CPU, each of which will be only incrementally faster (around 25%) than the current generation.
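
To make the consequence concrete, here is a minimal sketch – my own illustration, not anything from HANA itself – of the shape this forces on the programmer: a simple sum split across however many cores the machine reports, using standard C++ threads, instead of one loop on one core.

#include <algorithm>
#include <cstdint>
#include <iostream>
#include <numeric>
#include <thread>
#include <vector>

int main() {
    std::vector<int32_t> data(80000000, 1);          // 80 million values to sum

    // One worker per hardware thread; a single core would leave most of the
    // chip idle, since per-core speed has stopped growing exponentially.
    unsigned workers = std::max(1u, std::thread::hardware_concurrency());
    std::vector<int64_t> partial(workers, 0);
    std::vector<std::thread> pool;
    size_t chunk = data.size() / workers;

    for (unsigned w = 0; w < workers; ++w) {
        size_t begin = w * chunk;
        size_t end = (w + 1 == workers) ? data.size() : begin + chunk;
        pool.emplace_back([&, w, begin, end] {
            partial[w] = std::accumulate(data.begin() + begin,
                                         data.begin() + end, int64_t{0});
        });
    }
    for (auto& t : pool) t.join();

    std::cout << std::accumulate(partial.begin(), partial.end(), int64_t{0})
              << std::endl;
}

The pattern – partition the data, compute in parallel, combine the partial results – is exactly the shape in-memory engines depend on to use a modern CPU at all.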

2) System bandwidth

To continue to understand the changes in technology over the last 20 years, we need to look at how relative bandwidth has changed. This next graph shows how memory capacity has grown compared to memory bandwidth. For the first 10 years the two roughly tracked each other, and the bandwidth from the memory bus into the CPU grew in line. Since then, the amount of RAM that can be purchased has grown far in excess of the ability of the memory bus to feed the CPU.

[Figure: growth of memory capacity versus memory bandwidth]

Again, we are out by an order of magnitude in terms of how much RAM we can read into the CPUs. This will get better with DDR4 memory in 2013/2014, which will be 4x faster but, more importantly, moves away from shared memory channels – where chips have to share bandwidth – towards a point-to-point design that mates chips directly with memory controllers.


Despite this, getting data from RAM into the CPU will remain a bottleneck, and it is the major reason why it is so difficult to keep all the cores in a big system busy.
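
If you want to see the bottleneck for yourself, a rough and entirely illustrative test is to time one pass over a buffer far larger than the CPU caches. The arithmetic is trivial, so the GB/s figure you get back is essentially the speed of your memory subsystem, not of your cores.

#include <chrono>
#include <cstdint>
#include <iostream>
#include <numeric>
#include <vector>

int main() {
    // ~1.6 GB of 64-bit integers: far too big for any CPU cache.
    std::vector<int64_t> data(200000000, 1);

    auto start = std::chrono::steady_clock::now();
    int64_t sum = std::accumulate(data.begin(), data.end(), int64_t{0});
    double seconds = std::chrono::duration<double>(
        std::chrono::steady_clock::now() - start).count();

    double gigabytes = data.size() * sizeof(int64_t) / 1e9;
    std::cout << "sum = " << sum << ", scan rate = "
              << gigabytes / seconds << " GB/s" << std::endl;
}

A single copy of this loop will not saturate the memory bus of a big server; several copies running at once will – and after that point, every additional core is simply waiting for data.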

3) QuickPath Interconnect

To try to help with the problem above, Intel produced a technology called QuickPath Interconnect. With QPI, each CPU has its own memory controller connected directly to its local RAM, and the CPUs are linked to each other by QPI rather than sharing a Front-Side Bus (FSB). It looks rather like this:

[Figure: QPI topology – each CPU socket with its own locally attached RAM, linked to the other sockets via QPI]

The interesting thing about QPI is that access to the local memory attached to a CPU is much faster than access to remote memory attached to another CPU. Take this table as a rough order of magnitude for the Intel E7 platform:

Memory location                            Latency
L1 cache                                   0.5 ns
L2 cache                                   7 ns
L3 cache (line unshared)                   20 ns
L3 cache (line shared in another core)     30 ns
Local DRAM                                 60 ns
Remote DRAM (via QPI link)                 100 ns

The important thing to note here is that we take roughly a 25% hit to memory performance (on average) if the software we write doesn’t consider which RAM it loads into which CPU socket.
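
This is why NUMA-aware placement matters. As a hedged sketch – using the Linux libnuma API, purely to show the idea; HANA manages this internally and doesn’t expose it to you – the trick is to pin a worker to a socket and allocate its data from that socket’s local DRAM, so the hot loop never pays the remote-DRAM hop in the table above.

#include <numa.h>        // Linux libnuma; link with -lnuma
#include <cstddef>
#include <cstdint>
#include <cstdio>

int main() {
    if (numa_available() < 0) {
        std::printf("No NUMA support on this machine\n");
        return 1;
    }

    const int node = 0;                 // illustrative: use socket 0 only
    numa_run_on_node(node);             // keep this thread on node 0's cores

    const size_t n = 100000000;
    // Allocate the buffer from node 0's local DRAM (~60 ns) rather than
    // letting it land on another socket behind a QPI hop (~100 ns).
    int32_t* data = static_cast<int32_t*>(
        numa_alloc_onnode(n * sizeof(int32_t), node));

    int64_t sum = 0;
    for (size_t i = 0; i < n; ++i) {
        data[i] = 1;                    // touch and sum purely local memory
        sum += data[i];
    }
    std::printf("%lld\n", static_cast<long long>(sum));

    numa_free(data, n * sizeof(int32_t));
}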

4) Single Instruction Multiple Data

Early CPUs were very simple – one instruction at a time. As the need for performance increased, CPUs started to play all sorts of clever tricks to keep information moving through the CPU, including longer pipelines, multiple execution paths and so on. By the time we get to Intel’s E7 CPU, we have 10 cores per socket, each with a complex execution pipeline and 2 hardware threads, so that when one thread stalls waiting on a cache miss the other can keep the core busy.

In addition to that, the E7 can take a 128-bit register holding four 32-bit integer or floating-point values and execute 4 additions (or other operations) in parallel, using just one instruction:

[Figure: SIMD – a single instruction adding four packed 32-bit values in parallel]

This is called Single Instruction, Multiple Data, or SIMD, and has actually been available in various forms since the Pentium MMX. In its current form, SSE 4.2, there are a number of extra CPU instructions, like CRC32 and packed string comparisons, that have alternative uses (creating sums of database columns, or VARCHAR comparisons, for example).
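
As a small, hedged illustration of those “alternative uses” (my example, not from any HANA source): the SSE 4.2 CRC32 instruction is exposed as a compiler intrinsic, and a database engine can use it as a very cheap hash for things like hash joins or page checksums.

#include <nmmintrin.h>   // SSE 4.2 intrinsics; compile with -msse4.2
#include <cstddef>
#include <cstdint>

// Hash a key one byte at a time with the hardware CRC32 instruction –
// far cheaper than a table-driven software CRC.
uint32_t crc32_hash(const char* data, size_t len) {
    uint32_t crc = 0xFFFFFFFFu;
    for (size_t i = 0; i < len; ++i)
        crc = _mm_crc32_u8(crc, static_cast<uint8_t>(data[i]));
    return ~crc;
}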


So in an 8-CPU system with 10 cores per CPU, it is possible to run 320 additions simultaneously per clock cycle (80 cores × 4 lanes). At 2.4 GHz, that’s 768 billion additions per second – if you can get the data out of RAM fast enough.
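
Here is what one of those packed additions looks like in practice – a minimal sketch of summing a column of 32-bit integers four lanes at a time with SSE2 intrinsics (it assumes the per-lane partial sums fit in 32 bits; a real engine would widen the accumulators):

#include <emmintrin.h>   // SSE2 intrinsics
#include <cstddef>
#include <cstdint>

// Sum a column of 32-bit integers, four values per instruction.
int64_t column_sum(const int32_t* col, size_t n) {
    __m128i acc = _mm_setzero_si128();
    size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        __m128i v = _mm_loadu_si128(reinterpret_cast<const __m128i*>(col + i));
        acc = _mm_add_epi32(acc, v);           // 4 additions, 1 instruction
    }

    // Combine the four lane totals, then add any leftover values.
    alignas(16) int32_t lanes[4];
    _mm_store_si128(reinterpret_cast<__m128i*>(lanes), acc);
    int64_t sum = int64_t{lanes[0]} + lanes[1] + lanes[2] + lanes[3];
    for (; i < n; ++i) sum += col[i];
    return sum;
}

Spread 80 copies of a loop like this across all the cores in an 8-socket system and you arrive at the 768 billion additions per second above – provided, as always, the memory bus can keep up.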

Conclusions

So all in all, we have 4 dimensions where CPUs have fundamentally changed since I learnt computing algorithms:

1) Modern CPUs are powerful only if we can multi-thread

2) Modern CPUs are starved of memory bandwidth

3) QPI means we get a 25% penalty if we aren’t able to keep our memory access local

4) SSE 4.2 SIMD means we get up to a 4x benefit if, within each thread, we can execute one instruction across multiple values at once

Now let’s add to that the obvious things that people always talk about: RAM is cheap so we can fit all our data in RAM, and disk is 1000x slower than memory. If you add all this up, I hope you can see why:

1) There is a need for in-memory software designed in a very different way

2) To take full advantage of this software, you will need to write algorithms in a very different way

Next in this series: how HANA takes advantage of all of (1)–(4), and what that means for the programmer.
