For many of our customers at Bluefin, we need what I call “True HA and DR”, where the business continuity requirements demand that no data is lost to hardware failure or disaster. SAP HANA has a fantastic set of functionality in this area, which we believe makes it best in class.
Most commonly we implement this for scenarios like the Business Suite on SAP HANA and SAP S/4HANA, because there is a cost that requires justification. In this article I will show you how HANA supports ZERO RPO and a 90-second RTO for HA, and low RPO and low RTO for DR.
What are the business requirements for HA and DR?
There are two scenarios that the business wants to protect against. First, they want to protect against the failure of a specific machine or network. This is what we call High Availability or HA, and with in-memory databases it’s super important, more so than with a traditional disk-based RDBMS. The characteristics of HA are not to lose any data, and to be up and running quickly and automatically after a machine failure.
The second requirement is to protect against the total failure of an entire data center due to power problems, fire, flood or another disaster scenario. Most businesses have a Business Continuity plan for this, in which we include a HANA Disaster Recovery plan, or DR for short. Failing over to DR is almost always a manual process, because you don’t want to trigger it unintentionally. Normally, it’s OK to lose a little data that hasn’t transferred yet, and more often than not the business wants to be up and running in a few hours to a few days. This is because there are many considerations in a DR plan, including relocating employees. Databases aren’t top of the to-do list in DR scenarios.
What are RPO and RTO?
These are two key metrics in HA and DR.
The Recovery Point Objective, or RPO, is the amount of data that is lost in the event of a failure because it had not yet been transferred. The Recovery Time Objective, or RTO, is the time it takes to be up and running again. They are always different for HA and DR.
For HA, we look to ideally have a ZERO RPO in most cases (no data loss), and RTO should be as low as possible. If a server fails, you want your system running super-fast.
For DR, we look to have a low RPO (many businesses set a Service Level or SLA of minutes to hours), and an RTO of a few hours to a few days. There are many considerations in DR scenarios and you have to make a conscious decision to fail over to the DR environment.
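To make the two metrics concrete, here is a minimal Python sketch of how you might measure them after an incident. It expresses RPO as a time window of lost commits rather than a data volume, which is a simplification on my part, and the `Outage` class, its field names and the example timestamps are all hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class Outage:
    """Three timestamps describing one incident (hypothetical model)."""
    last_replicated_commit: datetime  # newest commit that reached the standby
    failure: datetime                 # moment the primary stopped serving
    service_restored: datetime        # moment users could work again


def achieved_rpo(outage: Outage) -> timedelta:
    """Window of committed work that never reached the standby."""
    return max(outage.failure - outage.last_replicated_commit, timedelta(0))


def achieved_rto(outage: Outage) -> timedelta:
    """Time from the failure until the service was usable again."""
    return outage.service_restored - outage.failure


def meets_objectives(outage: Outage, rpo: timedelta, rto: timedelta) -> bool:
    """True when both measured values are within their objectives."""
    return achieved_rpo(outage) <= rpo and achieved_rto(outage) <= rto


failure = datetime(2015, 1, 1, 12, 0, 0)

# HA case: synchronous replication, so nothing is outstanding at failure.
ha = Outage(failure, failure, failure + timedelta(seconds=90))

# DR case: a few seconds of commits were still in flight, manual failover.
dr = Outage(failure - timedelta(seconds=5), failure,
            failure + timedelta(minutes=15))

print(meets_objectives(ha, timedelta(0), timedelta(minutes=2)))
print(meets_objectives(dr, timedelta(minutes=1), timedelta(hours=4)))
```

The objectives passed to `meets_objectives` are example SLA values only; substitute whatever your business has agreed.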
How does SAP HANA support High Availability?
SAP HANA HA is awesome and super-easy to configure. For true HA scenarios it works best with a single-node HANA system because you can have two systems, both with all data loaded into DRAM all the time.
Once you have two identical SAP HANA systems, you can just use SAP HANA Synchronous System Replication and with a single command, it replicates the data between the two nodes and keeps it up to date. There is a dedicated 10 Gigabit network between the HA nodes and transfer rates for initial sync are around 3 gigabytes per second. It doesn’t take long to sync even a big system.
Transactions are committed to both nodes simultaneously, so no committed data is lost if one node fails: you have a ZERO RPO.
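As a rough back-of-the-envelope check on the initial sync time, the arithmetic is just data size divided by transfer rate. The sketch below uses the roughly 3 gigabytes per second quoted above; the 3 TB system size (taken as 3,000 GB for round numbers) is an assumption for illustration, and the function name is my own.

```python
def initial_sync_seconds(data_size_gb: float, rate_gb_per_s: float) -> float:
    """Estimate the initial replication time as size divided by rate."""
    if rate_gb_per_s <= 0:
        raise ValueError("transfer rate must be positive")
    return data_size_gb / rate_gb_per_s


# Assumed figures: a 3 TB system (3,000 GB) at about 3 GB per second.
seconds = initial_sync_seconds(3000, 3.0)
print(f"{seconds / 60:.1f} minutes")  # prints "16.7 minutes"
```

Real transfer rates depend on the network, the storage and the revision of HANA, so treat the result as an order-of-magnitude estimate only.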
Now, you also need to automate the failover, and this is possible with the SUSE Linux HA Extensions, which are an add-on to the SUSE Linux license. They configure the two HANA systems into a cluster with a virtual IP address, which is what users connect to (not to a node directly!). SUSE uses a technology called STONITH (an awesome acronym for Shoot The Other Node In The Head!), which works at the hardware level; the implementation varies by vendor.
We recently implemented this for a 3TB Suite on HANA customer with Lenovo System x hardware, which uses the Integrated Management Module (IMM) for STONITH. The cluster nodes talk to each other via a cluster controller, and they decide which node is active. In the event of a failure, the secondary node STONITHs the primary node and takes over the virtual IP. Very cool.
SUSE HA takes a short while to fail over, because it double checks the node is really down before it fires the STONITH bullet. This provides for a guaranteed RTO of around 90 seconds. Because the data is already loaded into memory on the secondary, we don’t have to load anything and the system is 100% operational.
This is MUCH better than most other RDBMS because they start with an empty cache and can be very slow after takeover for a few minutes.
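The idea of double checking before fencing can be sketched in a few lines of Python. To be clear, this is not how the SUSE HA Extensions implement it; it is only an illustration of the principle that a single missed health check should never trigger STONITH. The function name and the threshold of three failed probes are hypothetical.

```python
from typing import Iterable


def should_fence(probe_results: Iterable[bool],
                 required_failures: int = 3) -> bool:
    """Return True once enough consecutive health probes have failed.

    Each item in probe_results is True for a healthy probe and False for
    a failed one. Any healthy probe resets the failure counter.
    """
    consecutive = 0
    for healthy in probe_results:
        consecutive = 0 if healthy else consecutive + 1
        if consecutive >= required_failures:
            return True
    return False


# A brief network blip recovers, so the primary is left alone.
print(should_fence([True, False, True, False, False]))   # prints False

# Three failed probes in a row: the node is treated as really down.
print(should_fence([True, False, False, False]))         # prints True
```

The trade-off is the one described above: more confirmation probes mean fewer accidental failovers but a longer RTO.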
Note that there are other HA options for HANA, including scale-out systems, but for mission-critical systems I recommend scale-up. You do need two identical HANA systems and they do have to be dedicated.
How does SAP HANA support DR?
The Disaster Recovery system should always be in another data center, and it should have a good Wide Area Network or WAN link (we recommend 1GBit, since this is an in-memory database). It’s possible to use less than this depending on your data volumes, and we have used as low as 100Mbit, but talk to your HANA experts.
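To see why the link size matters, here is a small Python sketch that estimates how much change data piles up when the database writes faster than the WAN can ship. The 50 MB per second burst, the 60 second duration and the 80% link efficiency are all assumed figures for illustration, not measurements from a real system.

```python
def usable_mb_per_s(link_mbit_per_s: float, efficiency: float = 0.8) -> float:
    """Convert a link speed in Mbit/s to usable MB/s (efficiency assumed)."""
    return link_mbit_per_s / 8 * efficiency


def backlog_mb(write_mb_per_s: float, link_mbit_per_s: float,
               burst_seconds: float) -> float:
    """MB still waiting to cross the WAN at the end of a write burst."""
    surplus = write_mb_per_s - usable_mb_per_s(link_mbit_per_s)
    return max(0.0, surplus * burst_seconds)


def drain_seconds(backlog: float, link_mbit_per_s: float) -> float:
    """Time to ship the backlog once the burst is over."""
    return backlog / usable_mb_per_s(link_mbit_per_s)


for link in (1000, 100):  # the 1GBit and 100Mbit links discussed above
    waiting = backlog_mb(write_mb_per_s=50, link_mbit_per_s=link,
                         burst_seconds=60)
    print(f"{link} Mbit/s: backlog {waiting:.0f} MB, "
          f"drains in {drain_seconds(waiting, link):.0f} s")
```

With these assumed numbers the faster link keeps up with the burst, while the slower one falls behind and needs several minutes to catch up, which is exactly the data you would lose if the primary site failed in the meantime.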
This time we use SAP HANA Asynchronous System Replication, which is one more single command to add the third (DR) node into the system. It sends the updates to the DR database over the wire, where they are committed, and there is a buffer to allow for varying network conditions. This allows for a low RPO of a few seconds, depending on the network conditions.
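For reference, the registration command is `hdbnsutil -sr_register`, run on the system that is being added. The Python sketch below only builds the command line and does not execute anything. The flag spellings are as I recall them from SAP's documentation and have changed between HANA revisions, so check the `hdbnsutil` help on your own system; the host name, instance number and site name are hypothetical, and the preparation steps around the command are not shown.

```python
from typing import List

# Replication modes as I recall them from SAP's documentation (assumption).
VALID_MODES = {"sync", "syncmem", "async"}


def sr_register_command(remote_host: str, remote_instance: str,
                        mode: str, site_name: str) -> List[str]:
    """Build (but do not run) an hdbnsutil registration command line."""
    if mode not in VALID_MODES:
        raise ValueError(f"unknown replication mode: {mode}")
    return [
        "hdbnsutil", "-sr_register",
        f"--remoteHost={remote_host}",
        f"--remoteInstance={remote_instance}",
        f"--replicationMode={mode}",
        f"--name={site_name}",
    ]


# Hypothetical values: replicate asynchronously from host "hana-ha",
# instance 00, and call the new site "DR".
print(" ".join(sr_register_command("hana-ha", "00", "async", "DR")))
```

Building the command as a list rather than a string makes it easy to review before handing it to whatever automation you use.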
In the event of a primary site failure, we make a decision to fail over to the DR node manually, and we update the DNS to repoint the users to the DR node. It’s possible to automate this, but we never do, because activating a DR or Business Continuity strategy almost always requires a human decision.
As a result, our RTO depends on our DR process, and it can be as low as 15 minutes for mission-critical environments.
Also note that in many cases we don’t run the DR node entirely in-memory, and commit to disk instead. This way we can use the DR system as a Test (QA) system as well, by adding an additional HANA instance. Need to do DR? Shut down the QA system and reconfigure DR to use all the memory, and it will automatically load tables on demand. This saves you from buying an extra HANA system, which is nice, if you don’t need a low RTO for DR. Note that this does not increase the RPO.
Running SAP Business Suite HA
It’s also worth briefly noting that you can use the regular mechanisms for making the SAP NetWeaver stack HA – using the SAP ASCS technology in a cluster, with multiple additional application servers. Customers have been running this for many years and it works precisely the same with SAP HANA as any other database, so I won’t cover it off in any more detail here.
I hope this introduction to making your SAP HANA database mission critical has been useful. The SAP HANA System Replication and SUSE Linux HA Extension technologies make it extremely easy to set up, configure and manage. In my opinion, the ease of configuration and the ease of getting ZERO RPO and low RTO make HANA class-leading in its HA and DR capabilities.
Thanks to the HANA leaders at SAP, whose conversation prompted me to write this blog, and to Lloyd Palfrey of Bluefin and George Toon at SAP for validating the technical details. Thanks also to Keith Frisby of Lenovo, who is the HANA HA installation expert at my customer.