10 days before SAP TechEd, Steve Lucas called and asked if we could replicate the Wikipedia page views demo that Oracle produced during their annual OpenWorld in early October. For those who haven’t seen it, it is a download of the 250bn rows of Page view statistics for Wikimedia projects, which have been stored in hourly files since 2007. People always ask how we got the same dataset – it’s publicly available at the link above.
There were two real challenges – first, by my math we needed roughly 6TB of DRAM just to store the data, and second, we had to download, provision and load 30TB of flat files, and replicate the app, in just 10 days.
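As a rough sanity check on those two numbers, here is a minimal back-of-envelope sketch. It only uses the figures quoted above; the decimal units (1TB = 10^12 bytes) and the function names are assumptions made for illustration, not part of the original sizing exercise.

```python
# Figures quoted in the post; decimal units (1 TB = 10**12 bytes) are an assumption.
ROWS = 250e9              # page-view rows
DRAM_BYTES = 6e12         # estimated in-memory footprint
FLAT_FILE_BYTES = 30e12   # raw hourly files to download and load
DAYS_AVAILABLE = 10


def bytes_per_row(total_bytes: float, rows: float) -> float:
    """Average number of in-memory bytes implied per row."""
    return total_bytes / rows


def sustained_mbit_per_s(total_bytes: float, days: float) -> float:
    """Average line rate (Mbit/s) needed to move total_bytes within the given days."""
    seconds = days * 24 * 3600
    return total_bytes * 8 / seconds / 1e6


print(f"{bytes_per_row(DRAM_BYTES, ROWS):.0f} bytes per row in DRAM")
print(f"{sustained_mbit_per_s(FLAT_FILE_BYTES, DAYS_AVAILABLE):.0f} Mbit/s sustained")
```

The second figure assumes the full 10 days are spent downloading. Since the data also had to be shipped and loaded in that window, the real requirement was higher.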
To solve the first problem, the folks at SGI offered to build an SGI UV300H scale-up appliance with 32 sockets, 480 cores and 12TB of DRAM. This configuration would normally come with 24TB of DRAM but we didn’t need that much for this demo, so we downsized slightly. Once I knew we had this appliance secured, any concerns about performance evaporated, because I knew this appliance could do 1.4 trillion scans/sec, which is around 5x what Oracle demonstrated on their SuperCluster.
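To put the scan rate in perspective, here is a small sketch of how long one pass over the dataset would take. It assumes that "scans/sec" means row values scanned per second and that a query touches a single column; both are assumptions for illustration, and real queries add aggregation and network overhead on top.

```python
ROWS = 250e9        # rows in the page-view dataset
SCAN_RATE = 1.4e12  # quoted rate; assumed to mean row values scanned per second


def full_scan_seconds(rows: float, scans_per_second: float, columns_touched: int = 1) -> float:
    """Time for one full pass over the given number of columns."""
    return rows * columns_touched / scans_per_second


def implied_rate(rate: float, speedup: float) -> float:
    """Rate of a system that is `speedup` times slower than `rate`."""
    return rate / speedup


print(f"Full scan of one column: {full_scan_seconds(ROWS, SCAN_RATE) * 1000:.0f} ms")
print(f"Rate implied by the 5x comparison: {implied_rate(SCAN_RATE, 5):.2e} scans/sec")
```

Under these assumptions a full single-column scan takes well under a second, which is consistent with the sub-second response times discussed later.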
For the second problem, the folks at Verizon FIOS came out the next day and upgraded equipment so we could get internet fast enough to get the data down onto USB2 hard disks, which we promptly shipped to SGI’s office in Illinois. Thanks Verizon!
This would probably be a great time for you to go ahead and watch the video, so go right ahead to see what we built in 3 days!
Response time at the end-user
As Steve rightly points out, there are some super-smart people at Oracle, but the first thing that got me about their demo was that the response time on the web interface seemed to be quite slow: 4-5 seconds. Despite this they claim sub-second response times, so I assume they are measuring response time at the database and not at the end user.
For the HANA example, we look at performance in Google Chrome Developer Tools because that’s what users experience – the time from button click, to graph. And because HANA is an integrated platform, we see performance that – to my eye – crushes Oracle’s $4m SuperCluster with a system a fraction of the cost and complexity.
In my testing, we regularly saw 300-400ms response times, but we sought to mimic how customers use systems in the real world, so we ran the SGI system in their lab and connected from the laptop in the keynote theatre over the internet – that’s over 1750 miles away. That distance is roughly 20ms round trip at the speed of light in a vacuum, and around 50ms over a real network path, so raw physics has an impact on our demo performance!
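The physics can be checked with a short sketch of the lower bound on round-trip time over 1750 miles. The fiber velocity factor of 0.67 is an assumption (a typical value for optical fiber), and the function name is made up for illustration. Real routes are longer than the straight-line distance and add switching delay, so measured latency will be higher than either figure.

```python
MILES_TO_KM = 1.609344
C_KM_PER_S = 299_792.458        # speed of light in a vacuum
FIBER_VELOCITY_FACTOR = 0.67    # assumption: typical for light in optical fiber


def round_trip_ms(distance_miles: float, velocity_factor: float = 1.0) -> float:
    """Minimum round-trip time in milliseconds over a straight-line path."""
    distance_km = distance_miles * MILES_TO_KM
    return 2 * distance_km / (C_KM_PER_S * velocity_factor) * 1000


print(f"Vacuum: {round_trip_ms(1750):.1f} ms")
print(f"Fiber:  {round_trip_ms(1750, FIBER_VELOCITY_FACTOR):.1f} ms")
```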
Simplicity and Agility
HANA has a number of features that make a demo like this possible in a short period of time, and those features are just as useful to developers in the real world.
First, almost no model optimization is required. The model design was completed in a few minutes. This is very significant – some databases are very sensitive to model design, but on HANA it was only necessary to follow simple best practices.
Second, HANA self-optimizes in several critical ways. For a start it automatically defines a table sort order and sorts the columns according to this. It will also define (and re-define) the best compression algorithm for the exact data in each column. When the database is quiet, you will often see little background jobs that optimize tables – and table sizes will decrease.
Third, HANA information views allow you to graphically define models, predictive algorithms and other sophisticated data access mechanisms. These allow you to control how end-users access the information.
If you contrast this with Oracle 12c in-Memory, Oracle is a real pain in the butt. You have to define compression and in-memory settings for every table and column, and you have to ensure the row store is sorted, because the column store can’t sort itself (column store caches are built on start-up as a copy of the row store). It is a maintenance headache.
HANA as an integrated platform
The most significant benefit that HANA brings for these scenarios is that it collapses all the layers of the app into one in-memory appliance. The database, modeling, integration and web layers all sit within the HANA software stack and are one piece of software that comes pre-configured out of the box. That’s one of the reasons why we can build a demo like this in just a few days, but it’s also the reason why it is so screamingly fast.
This is a pretty big dataset, so we see 400-500ms response times, but for smaller datasets we often get 10-30ms response times for Web Services at the browser, and that provides what I would call an amazing user experience.
HANA’s web server includes a variant of the OpenUI5 SDK and we used this to build the apps. It provides a consumer-grade user experience and cuts the build time of complex apps.
Building a demo like this in 10 days was a logistical feat by any standard, and I don’t think we could have done it on a database other than HANA. The agility, simplicity and performance of HANA are what made it possible. The integrated platform aspects of HANA meant that we could not only show HANA providing a differentiating user experience, but also extend the demo with predictive algorithms in the short time available.
Since we’re passionate about openness, we’ve made it possible for you to reproduce the demo on your own HANA Cloud: see “Build your own Wikipedia Keynote Part 1 – Build and load data”. In addition, we’ll be opening the kimono on the technical details of this demo in coming weeks.