SAP recently announced a partnership with a very cool Big Data startup called Databricks, based in Berkeley, California. The first fruits of the partnership saw the delivery of an SAP Distribution of Apache Spark. Spark is a popular open-source execution engine that runs on top of Hadoop (and other data persistence stores). If you are deep into the Hadoop world, you likely already know about Spark; but if you have been in the enterprise analytics space, you may not have heard of it. See the Databricks Press Release and a great blog from Arsalan Tavakoli-Shiraji of Databricks.
Databricks joined the SAP Startup Focus program a few months ago because both teams quickly saw benefits for their respective communities in making insights from “Big Data” faster and easier. To provide some context from my perspective: my job, as part of SAP’s Startup Focus Development Accelerator, is to help startups that can benefit from HANA’s extreme speed and compute power adopt the SAP HANA platform quickly and successfully. A key goal of the program is to get more beneficial applications out in the world that leverage HANA to drive innovation in next-generation applications.
Here’s the thing: many innovative startups have invested in all manner of open-source platforms, many with Hadoop-related lineages, and they are interested in SAP HANA to tackle real-time, predictive insights. The power of SAP HANA is very attractive, along with its enterprise-class management and global support, and its growing core of enterprise data from the world’s best companies. So when we stand in front of startups and explain SAP HANA, with its incredible power for working with data in place, in-memory, and its simplified architecture, we get a lot of positive nods, but then we get some questions: “How do we get to SAP HANA from where we are?” Or, “Do we need to start over and build everything in SAP HANA to get the benefit?”
I see Spark as highly complementary to SAP HANA: it provides access to data that is already in place in Hadoop, and to capabilities that exist on Spark today or will be developed by the Spark community in the future. I think we will see many companies use SAP HANA as their core application platform while maintaining a seamless architecture in which some data sources appear as virtual tables in SAP HANA but are actually stored in HDFS. You can do this with SAP HANA today without Spark, but the way Spark exposes Hadoop data aligns very well with HANA, in my opinion. Spark integrates with HDFS independent of the Hadoop distribution and provides a rich execution framework for working on the data in place, streamlining what comes across to SAP HANA. That data can then be joined to data in HANA to enrich any HANA application.

The integration is also bi-directional: the JDBC server in Spark makes all of the HANA views, models, procedures, and services easy to consume on the Spark side as well. Future work between Databricks and SAP will bolster Spark-centric scenarios, letting the Spark development community use the predictive, geospatial, advanced business function, and text analytics engines in SAP HANA. We already see organic interest from customers and developers in the Spark community in leveraging SAP HANA.

Other colleagues have provided some good background on how SAP HANA leverages Spark to access Hadoop (HDFS) in a high-performance manner using SQL. Check out Balaji Krishna’s blog to dive a little deeper into the details, and take a look at John Schitka’s blog to understand that even though both HANA and Spark are “in-memory,” they are neither competitive nor incompatible.
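To make the virtual-table scenario above a little more concrete, here is a minimal sketch of what the HANA side can look like using Smart Data Access over an ODBC connection to a Spark SQL endpoint. The remote source name, DSN, credentials, and table names (weblogs, CUSTOMERS) are illustrative assumptions, not from this post, and the exact adapter name and options depend on your HANA and Spark versions.

```sql
-- Register the Spark SQL endpoint as a remote source in SAP HANA
-- (adapter name, DSN, and credentials below are illustrative)
CREATE REMOTE SOURCE "SPARK_SRC" ADAPTER "hiveodbc"
  CONFIGURATION 'DSN=SparkSQL'
  WITH CREDENTIAL TYPE 'PASSWORD' USING 'user=hive;password=secret';

-- Expose an HDFS-backed table as a virtual table in HANA;
-- the data stays in Hadoop and is fetched on demand
CREATE VIRTUAL TABLE "MYSCHEMA"."V_WEBLOGS"
  AT "SPARK_SRC"."HIVE"."default"."weblogs";

-- Join the Hadoop-resident data with a native HANA table
SELECT c.customer_name, COUNT(*) AS page_views
  FROM "MYSCHEMA"."V_WEBLOGS" w
  JOIN "MYSCHEMA"."CUSTOMERS" c ON c.customer_id = w.customer_id
 GROUP BY c.customer_name;
```

From the application’s point of view, V_WEBLOGS behaves like any other HANA table, which is what makes the “seamless architecture” described above possible.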
Before I forget, here is the link to the SAP Distribution of Spark, which you can use today for SAP HANA development and integration with Hadoop (HDFS). I also recommend you take a look at the SAP presentation at Spark Summit 2014 from July 1st.
I would love to hear about your blended solutions that include SAP HANA and Hadoop.
David Sonnenschein – email@example.com