Over the past couple of years, businesses across every industry have faced unexpected challenges in keeping their enterprise IT systems safe, secure, and available to users. Many have experienced sudden spikes or drops in demand for their products and services and most are now operating in a hybrid work environment. In such changing conditions, with business requirements and expectations constantly evolving, it is a best practice to periodically revisit your IT system service-level objectives (SLOs) and agreements (SLAs) and ensure they are still aligned with your business needs.
Adapting to these new requirements can be especially complex for companies that run their SAP enterprise applications in on-premises environments. These organizations are often already struggling with running business-critical SAP instances as they can be complex and costly to maintain. They know how much their users depend on these systems and how disruptive dealing with unplanned outages can be, so they see the on-premises setup—backed up with major investments in high availability (HA) systems and infrastructure—as the best way to ensure the security and availability of these essential applications. IT organizations charged with running on-premises SAP landscapes, in many cases, must also manage a growing number of other business-critical applications—all while under pressure to do more with less.
For many organizations, this is an unsustainable approach. In fact, according to a SIOS survey looking at trends in HA solutions, companies at the time were already struggling to hold the line with on-premises application availability:
- 95% of the companies surveyed reported at least occasional failures in the HA services that support their applications.
- 98% reported regular or occasional application performance issues, and 71% reported them once or more per month.
- When HA application issues occurred, companies surveyed spent 3–5 hours, on average, to identify and fix the problem.
Things aren’t getting easier for these companies. Today’s IT landscape is dominated by risk, uncertainty, and the prospect of belt-tightening down the road. At the same time, it’s especially important now to keep your SAP applications—the software at the heart of your company—secure, productive, and available for the business.
At Google Cloud, we’ve put a lot of thought into solving the challenges around high availability for SAP environments. We recognize this as a potential make-or-break issue for customers and we prioritize giving them a solution: a reliable, scalable, and cost-effective SAP environment, built on a cloud platform designed to deliver high availability and performance.
When you use Google Cloud, you get many services that are designed to be fault tolerant or highly available. The concepts are similar, but understanding the difference can save you time and effort when designing your architecture.
We consider fault-tolerant components to be fully redundant mechanisms, designed so that any failure of these components is seamless from the perspective of system availability. They include components like storage (Google Cloud Storage, Persistent Disks) and network (Google's network, Cloud DNS, Cloud Load Balancing).
Highly available services, by contrast, provide automated recovery for all the relevant architectural components—the single points of failure—minimizing the recovery time objective (RTO) and recovery point objective (RPO). This usually involves replicating components and automating the failover process between them.
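The arithmetic behind these two concepts is worth making concrete. The sketch below (purely illustrative, not a Google Cloud API; the component availability figures are hypothetical) shows why a chain of non-redundant components multiplies failure risk, while redundant replicas with automated failover multiply it away:

```python
# Illustrative availability arithmetic. Assumes component failures are
# independent and failover is instantaneous -- both simplifications.

def serial_availability(*components: float) -> float:
    """Availability of a chain where ALL components must be up (no redundancy)."""
    result = 1.0
    for a in components:
        result *= a
    return result

def parallel_availability(replica: float, count: int) -> float:
    """Availability of `count` redundant replicas, where any one suffices."""
    return 1.0 - (1.0 - replica) ** count

# A 99.5%-available application node in front of a 99.5% single database:
single_stack = serial_availability(0.995, 0.995)   # ~0.990: worse than either part

# The same database replicated across two zones with automated failover:
ha_db = parallel_availability(0.995, 2)            # ~0.999975
ha_stack = serial_availability(0.995, ha_db)       # ~0.995: database no longer the weak link
```

Serial composition always lowers availability, which is why every single point of failure in the stack needs its own redundancy story.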
Four levels of SAP high availability on Google Cloud
Understanding how to give SAP customers the right availability solution starts with recognizing that each customer has different target availability SLAs, and those targets vary depending on business needs, budgets, SAP application use cases, and other factors. Let's look at the infrastructure, operating system, database, and application server components of the SAP high availability landscape, and what you need to consider for your SAP system's overall availability strategy.
Level 1: Infrastructure
Many customers find that simply moving their SAP system from on-premises to Google Cloud can increase their system's uptime, because they are able to leverage the platform's embedded security, networking, compute, and storage features, which are highly available by default.
For compute services, Google Cloud Compute Engine has three built-in capabilities that are especially important and can reduce or even eliminate disruptions to applications due to hardware failures:
Live Migration: When a customer's VM instances are running on a host system that needs scheduled maintenance, Live Migration moves the VM instance from one host to another without triggering a restart or disrupting the application. This is a built-in feature that every Google Cloud user gets at no additional cost, and it works seamlessly and automatically, no matter how large or complex a user's workloads happen to be. Thanks to Live Migration, Google Cloud conducts hardware maintenance and applies hypervisor security patches and updates globally and seamlessly without ever asking a customer to restart a VM, because our maintenance does not impact your running applications.
Memory Poisoning Recovery (MPR): Even the highest-quality hardware infrastructure can fail at some point, and memory errors are the most common type of hardware malfunction (see Google Cloud's study on memory reliability). Modern CPU architectures have native features like Error Correction Code (ECC), which enable hosts to recover from correctable errors. Uncorrectable errors, however, crash and restart all VMs on the host, resulting in unexpected downtime.
If you run SAP HANA databases, you also have to account for the time it takes to load the data into memory. In that case, a host crash can cause hours of business-critical service downtime, depending on the database size.
Google Cloud developed a solution that integrates native CPU error handling with SAP HANA and Google Cloud capabilities to reduce disruptions and downtime caused by memory errors. With MPR, the uncorrectable memory error is detected and isolated until the VMs can be live migrated off the affected host.
If the uncorrectable error is found on a VM hosting SAP HANA with Fast Restart enabled, Google Cloud MPR sends a signal to SAP HANA to reload only the affected memory from disk, resolving the issue without downtime in most situations. All VMs on the affected host are then live migrated to a healthy host to prevent downtime or disruption to customers' applications running on those VMs.
Automatic Restart: In the rare case when an unplanned shutdown cannot be prevented, this feature swings into action and automatically restarts the VM instance on a different host. When necessary, it calls up a user-defined startup script to ensure that the application running on top of the VM restarts at the same time. The goal is to ensure the fastest possible recovery from an unplanned shutdown, while keeping the process as simple and reliable as possible for users.
These services aim to increase the uptime of a single node, but highly critical workloads need resilience against compute-related failures, including a complete zone outage. To cover this, Google Cloud Compute Engine offers a monthly uptime percentage SLA of 99.99% for instances distributed across multiple zones.
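It helps to translate an uptime percentage into a concrete downtime budget. The back-of-the-envelope calculation below assumes a 30-day month for simplicity; actual SLA accounting follows the provider's published terms:

```python
# Downtime budget implied by an uptime SLA, assuming a 30-day month.

def monthly_downtime_budget_minutes(sla: float, days: int = 30) -> float:
    """Minutes of downtime per month permitted under an uptime SLA."""
    return days * 24 * 60 * (1.0 - sla)

# A 99.99% monthly SLA leaves roughly 4.3 minutes of downtime per month,
# while 99.5% leaves over 3.5 hours -- a useful lens when comparing a
# single-zone deployment against a multi-zone one.
four_nines = monthly_downtime_budget_minutes(0.9999)
two_and_a_half_nines = monthly_downtime_budget_minutes(0.995)
```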
Network File System (NFS) storage
Another important component of highly available SAP infrastructure is Network File System (NFS) storage, which is used for SAP shared files such as the interfaces directory and transport management. Google Cloud offers several file sharing solutions, including its first-party Filestore Enterprise and third-party solutions such as NetApp CVS-Performance, both offering a 99.99% availability SLA. For more information comparing NFS solutions on Google Cloud, see the available documentation.
Level 2: Operating System
A critical part of the failover mechanism is clustering compute components at the operating system level. Clustering allows fast detection of component failures and triggers failover procedures, minimizing application downtime.
Clustering at the OS level on Google Cloud is very similar to the on-premises approach to clustering, with a couple of improvements.
Both SUSE Linux Enterprise Server (SLES) and Red Hat Enterprise Linux (RHEL) implement Pacemaker as the cluster resource manager and provide cluster agents designed for Google Cloud, which allow it to seamlessly manage functions and features like STONITH fencing, VIP routes, and storage actions. When deploying OS clusters on Google Cloud, customers can avail themselves of the HA/DR provider hooks that allow SAP HANA to send out notifications to ensure a successful failover without data loss. For more information, see the detailed documentation for configuring HA clusters on RHEL and on SLES in our SAP high availability deployment guides.
Windows-based workloads use Microsoft failover clustering technology, and Google Cloud provides special features to enable and configure the cluster. Here you can find detailed documentation.
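Whatever the operating system, the cluster resource manager's core job is the same: probe the active node, and if it stops responding, fence it and move the resource (and its virtual IP) to a healthy peer. The sketch below illustrates that cycle in miniature; the node names and health probes are hypothetical stand-ins, not real Pacemaker or Windows failover clustering APIs:

```python
# Minimal sketch of the detect-and-failover cycle an OS-level cluster
# resource manager performs. In a real cluster the failed node would be
# fenced (STONITH) before the resource moves, to avoid split-brain.

from typing import Callable, Dict

def run_failover_cycle(nodes: Dict[str, Callable[[], bool]],
                       active: str) -> str:
    """Probe the active node; if unhealthy, fail over to a healthy peer.

    Returns the node that should hold the resource after this cycle.
    """
    if nodes[active]():            # active node still healthy: no change
        return active
    for name, is_healthy in nodes.items():
        if name != active and is_healthy():
            return name            # promote the first healthy peer
    raise RuntimeError("no healthy node available")

# Example: the primary fails its health probe, so the resource moves.
cluster = {"node-a": lambda: False, "node-b": lambda: True}
new_active = run_failover_cycle(cluster, active="node-a")
```

The value of running this loop at the OS level rather than by hand is precisely the recovery-time minimization described above: detection and failover happen in seconds, without operator intervention.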
Level 3: Database
Every SAP environment depends on a central database system to store and manage business-critical data, so any SAP high availability solution must consider how to maintain the availability and integrity of this database layer. In addition, SAP systems support a variety of database systems, many of which employ different mechanisms to achieve high availability. By supporting and documenting HA architectures for SAP HANA, MaxDB, SAP ASE, IBM Db2, Microsoft SQL Server, and Oracle workloads (with our Bare Metal Solution, you can use HA-certified hardware and even install Oracle RAC), Google Cloud gives customers the freedom to decide how to balance the costs and benefits of HA for their SAP databases.
SAP HANA System Replication (HSR) is one of the most important application-native technologies for ensuring HA for any SAP HANA system. It works by replicating data continuously from a primary system to one or more secondary systems, and that data can be preloaded into memory to allow for a rapid failover if there’s a disaster.
Google Cloud supports and complements HSR by supporting synchronous replication for SAP instances that reside in any zone within the same region. That means users can place their primary and secondary instances in different zones to protect against a single point of failure in either zone.
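The reason synchronous replication matters for the recovery point objective can be shown with a toy model. This is a conceptual sketch of the sync-versus-async trade-off, not HANA's actual replication protocol:

```python
# Toy replication model: with synchronous replication a commit is only
# acknowledged once the secondary holds the record, so a primary failure
# loses nothing (RPO of zero). Asynchronous shipping can lose records
# that were committed on the primary but not yet applied on the secondary.

class ReplicatedLog:
    def __init__(self, synchronous: bool):
        self.synchronous = synchronous
        self.primary = []      # records committed on the primary
        self.secondary = []    # records applied on the secondary

    def commit(self, record: str) -> None:
        self.primary.append(record)
        if self.synchronous:
            # Commit completes only after the secondary has the record.
            self.secondary.append(record)
        # Async: the record ships later and may be lost on a crash.

    def failover_data_loss(self) -> int:
        """Records lost if the primary dies right now (RPO, in records)."""
        return len(self.primary) - len(self.secondary)

sync_log = ReplicatedLog(synchronous=True)
async_log = ReplicatedLog(synchronous=False)
for record in ("t1", "t2", "t3"):
    sync_log.commit(record)
    async_log.commit(record)
```

Synchronous replication is only practical when the replication round trip is short, which is why low inter-zone latency within a region is what makes zone-separated primary and secondary instances viable.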
Other database systems like SAP ASE or IBM Db2 offer similar functionalities, which are also supported to run on Google Cloud infrastructure. The low network latency between zones in the same region coupled with our tools for automated deployments give companies the choice to run a variety of database HA options, tailored to their current business needs. Review our latest documentation for a current list of supported database systems and reference architectures.
Level 4: Application server
SAP’s NetWeaver architecture helps users avoid app-server bottlenecks that can threaten HA uptime requirements. Google Cloud takes that advantage and runs with it by giving customers the high availability compute and networking capabilities they need to protect against the loss of data through synchronization and to get the most reliability and performance from SAP NetWeaver. It uses one OS-level cluster (SLES or RHEL), with the Pacemaker cluster resource manager and STONITH fencing, for the ABAP SAP Central Services (ASCS) and Enqueue Replication Server (ERS), each with its own internal load balancer (ILB) for its virtual IP. Detailed documentation for deploying and configuring HA clusters can be found for both RHEL and SLES in our NetWeaver high availability planning guides.
Distributing application server instances across multiple zones of the same region provides the best protection against zonal failures while still providing great performance to the end user. Through automated deployments your IT team can quickly react to changes in demand and spin up additional instances in moments to keep the SAP system up and running, even during peak situations.
Other ways Google Cloud supports high availability SAP systems
There are many other ways Google Cloud can help maximize SAP application uptime, even in the most challenging circumstances. Consider a few examples, and keep in mind how tough it can be for enterprises, even larger ones, to implement similar capabilities at an affordable cost:
Geographic distribution and redundancy. Google Cloud’s global footprint currently includes 30 regions, divided into 91 zones and over 140 points of presence. By distributing key Google Cloud services across multiple zones in a region, most SAP users can achieve their availability goals without sacrificing performance or affordability.
Powerful and versatile load-balancing capabilities. For many enterprises, load balancing and distribution is another key to maintaining the availability of their SAP applications. Google Cloud meets this need with a range of load-balancing options, including global load balancing that can direct traffic to a healthy region closest to users. Google Cloud Load Balancing reacts instantaneously to changes in users, traffic, network, backend health, and other related conditions. And, as a software-defined service, it avoids the scalability and management issues many enterprises encounter with physical load-balancing infrastructure. Another important load balancer service for highly available SAP systems is the Internal Load Balancer, which allows you to automate the Virtual IP (VIP) implementation between the primary and secondary systems.
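The core pattern a health-checked load balancer automates is simple: traffic only ever reaches backends that are passing their health checks. The sketch below illustrates that pattern; the backend names are hypothetical, and this is not how Cloud Load Balancing is implemented:

```python
# Simplified sketch of health-aware round-robin load balancing: requests
# are distributed only across backends that currently pass health checks.

from itertools import cycle

def healthy_backends(backends: dict) -> list:
    """Return the names of backends currently passing health checks."""
    return [name for name, healthy in backends.items() if healthy]

def route(requests: int, backends: dict) -> list:
    """Round-robin `requests` across healthy backends only."""
    pool = healthy_backends(backends)
    if not pool:
        raise RuntimeError("no healthy backend")
    rr = cycle(pool)
    return [next(rr) for _ in range(requests)]

# The backend in zone b is failing its health check, so it gets no traffic:
assignments = route(4, {"app-zone-a": True,
                        "app-zone-b": False,
                        "app-zone-c": True})
```

For the SAP failover case, the ILB applies the same idea to the cluster's virtual IP: health checks determine which node (primary or secondary) currently receives the traffic, which is what lets the cluster move the VIP without manual route changes.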
Tools that keep developers focused and productive. Google Cloud’s serverless platform includes managed compute and database products that offer built-in redundancy and load balancing. It allows a company’s SAP development teams to deploy side-by-side extensions to the SAP systems without worrying about the underlying infrastructure. By using Apigee API Management, companies can provide a scalable interface to their SAP systems for these extensions, which protects the backend system from traffic peaks and malicious attacks. Google Cloud also supports CI/CD through native tools and integrations with popular open source technologies, giving modern DevOps organizations the tools they need to deliver software faster and more securely. Moreover, Google Cloud’s Cortex Framework provides accelerators and best practices to reduce risk, complexity and costs when innovating alongside SAP and unlocks the best of Google Cloud’s Analytics in a seamless setup that brings more value to the business.
Flexible, full-stack monitoring. Google Cloud Monitoring gives enterprises deep visibility into the performance, uptime, and overall health of their SAP environments. It collects metrics, events, and metadata from Google Cloud, Amazon Web Services, hosted uptime probes, application instrumentation, and even application components such as Cassandra, Nginx, Apache Web Server, Elasticsearch, and many others. With a custom monitoring agent for SAP HANA and the Cloud Operations Ops Agent, Cloud Monitoring uses this data to power flexible dashboards and rich visualization tools that help SAP teams identify and fix emerging issues before they affect the business.
Explore your HA options
We’ve only scratched the surface when it comes to understanding the many ways Google Cloud supports and extends HA for SAP instances. For an even deeper dive, our documentation goes into more technical detail on how you can set up a high availability architecture for SAP landscapes using Google Cloud services.
By: Joe Darlak (Head of SAP Solution Management) and Osmar Vinci (Customer Engineer, SAP Solutions)
Source: Google Cloud Blog