I was asked to comment on a blog posted by Teradata that included a number of inaccurate and misleading statements. I won’t try to address every point here; instead I’ll hit the first few in the first blog and leave it to you to question the accuracy of the rest.
One important note: it is silly to suggest that in-memory databases have no advantage because they have to persist data. For a write, HANA does what all OLTP databases do to achieve performance: it writes a log record and then commits. It is sillier still because Teradata is a data warehouse database, which generally means “write once, read many”. The advantage of in-memory is on the read side, when you run a query.
More positively: we agree with the Teradata author that a gap is growing between processor capabilities and storage capabilities, and that this gap limits our ability to fully utilize processors. Programs have to wait for disk I/O.
But the author suggests that the Teradata shared-nothing MPP approach solves the problem and “saturates” the CPU. This is a mistake. If you run a single query, no matter how complex, you cannot saturate a Teradata cluster, due to the exact issue the author so carefully introduces: the CPU has to wait for I/O. No amount of shared-nothingness solves this problem. MPP does not help.
For Teradata, saturation occurs only when enough queries run simultaneously that the operating system can always swap in a query that is ready to use the CPU while the other queries wait for their I/O to complete. “Ready to use CPU” means that the I/O request has completed and the required data is in memory. In short, the problem rightly raised by Teradata is in no way solved by their architecture or product. HANA solves the problem by keeping all of the data in memory all of the time: HANA is always ready to use the CPU. In fact, a single query can use 100% of the CPU. The problem described by the Teradata author is solved by HANA.
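The point can be made with a toy utilization model (a sketch, not a benchmark; all numbers are hypothetical). A disk-based query alternates between CPU work and I/O waits, so one query leaves the CPU mostly idle on every node; only stacking concurrent queries, or removing the disk wait entirely, saturates the CPU:

```python
# Illustrative model: a query needs cpu_ms of compute for every io_ms
# it spends waiting on disk. While one query waits on I/O, another
# concurrent query can use the CPU.

def cpu_utilization(cpu_ms, io_ms, concurrent_queries=1):
    """Fraction of time the CPU on one node is busy."""
    busy = min(concurrent_queries * cpu_ms, cpu_ms + io_ms)
    return busy / (cpu_ms + io_ms)

# One disk-bound query: the CPU idles 90% of the time -- on every node,
# no matter how many shared-nothing nodes the query is spread across.
print(cpu_utilization(cpu_ms=10, io_ms=90))                          # 0.1
# Ten concurrent queries: the OS always finds one ready to run.
print(cpu_utilization(cpu_ms=10, io_ms=90, concurrent_queries=10))   # 1.0
# Data in memory, no disk wait: a single query saturates the CPU.
print(cpu_utilization(cpu_ms=10, io_ms=0))                           # 1.0
```

The model is deliberately crude, but it shows why MPP parallelism alone cannot fix a per-node I/O wait: the idle fraction is replicated on every node.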
The implications of this solution are important. The difference between the time required to read data from RAM and from disk is roughly two orders of magnitude, 100X (see here for the numbers); and it gets better. The time to read data from the processor cache (all data moves from RAM to cache as it is processed) versus from RAM is another five orders of magnitude, 100,000X. Because of its in-memory design, HANA keeps data flowing into the cache. Rather than go overboard and suggest a 10,000,000X improvement for HANA, let’s modestly say that HANA gets some efficiency from the use of cache, and that the in-memory architecture yields a 1000X advantage over a disk-based system: 100X from avoiding disk reads and 10X from cache efficiencies.
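The arithmetic behind those figures, using the ratios claimed in this post (assumed here, not measured), is simply multiplicative:

```python
# Latency ratios as claimed in the post (assumptions, not measurements):
disk_vs_ram = 100        # disk read vs. RAM read, ~two orders of magnitude
ram_vs_cache = 100_000   # RAM read vs. cache read, per the post's figures

# The "overboard" compound figure the post declines to claim:
print(disk_vs_ram * ram_vs_cache)      # 10000000  (10,000,000X)

# The deliberately modest estimate used instead: only a partial
# cache benefit is credited on top of avoiding disk reads.
cache_efficiency = 10                  # assumed fraction of the cache win
print(disk_vs_ram * cache_efficiency)  # 1000  (1000X)
```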
This is not marketing hype; it is architecture, “physics” as the Teradata author might say. In general, HANA will out-perform Teradata by 1000X on any single query based on this architectural advantage.
We also agree with Teradata that a shared-nothing architecture is the proper approach to scale.
HANA is built on the same shared-nothing approach as Teradata. It scales across nodes. We have published benchmarks on 100TB databases (here) and are building a 1PB cluster now.
Consider the implications of this: Teradata has not solved the CPU-vs.-disk problem they themselves raised; instead they propagate the problem from node to node and try to assemble enough inefficient nodes to mask it. HANA scales a series of efficient systems with the CPU-vs.-disk problem already solved.
I’m running long for a blog, so let me wrap up with a hypothetical that is not about Teradata; it is about the economics of HANA.
Imagine an efficient in-memory DBMS that gets only a 100X performance boost over a comparable disk-based system, based solely on the speed of memory vs. disk. Memory must be paid for, so let’s imagine that each in-memory node costs 10X what a disk-based node with the same CPU power costs (this is a huge overstatement; the premium is probably closer to 2X, but you’ll get the point). Now let’s deploy two systems with equal performance. You can see what happens: each in-memory node costs more, but for the same performance you need only 1/100th the number of nodes, so the total system costs 1/10th as much. This is the result of the hardware economics described in the Quick Five Minute Rule Update.
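The hypothetical works out as follows, using the post’s assumed numbers (a 100X per-node speedup and a deliberately overstated 10X per-node cost premium):

```python
# Normalized hardware economics for the hypothetical above.
node_cost_disk = 1.0     # cost of one disk-based node (normalized)
node_cost_mem = 10.0     # assumed 10X premium for an in-memory node
speedup = 100            # assumed per-node performance advantage in-memory

# Size both systems for equal total performance.
nodes_disk = 100
nodes_mem = nodes_disk / speedup        # only 1 node needed

print(nodes_disk * node_cost_disk)      # 100.0  total cost, disk-based
print(nodes_mem * node_cost_mem)        # 10.0   total cost, in-memory
```

Even with the cost premium overstated by 5X, the in-memory system comes out an order of magnitude cheaper for the same performance.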
Teradata is a fantastic product. But it is a product that was architected for weak single-core nodes with little memory. The first Teradata systems I deployed ran on Intel 286 processors. There are other signs of this aging architecture here. Teradata’s engineering team is fantastic as well. They have found creative ways, for example the virtual AMP (VAMP), to extend their 1980s architecture as processor architectures advanced.
HANA is new, and it is architected for the processor technology available today with an eye on the technologies emerging. The result is efficiencies that cannot easily be replicated by engineering creativity. The efficiencies are architectural.
There is no doubt that someday Teradata will require a technology refresh – it’s been 30+ years. HANA, on the other hand, is a refreshing new technology.
HANA vs. Teradata – Part 1