
Composite Availability: Calculating The Overall Availability Of Cloud Infrastructure

  • aster_cloud
  • December 20, 2022
  • 6 minute read

When designing cloud systems, a major consideration is reliability. As infrastructure architects, what do we want to guarantee to our customers, whether they’re internal or end users? How much time can we possibly allow our services to be down? How do we design services for resiliency?

A huge portion of reliability depends on availability (also called “uptime” or “uptime availability”). The Google Cloud blog post “Available… or not?” defines availability as whether or not a system is able to fulfill its intended function at a point in time. In other words, how often is a web page able to serve content to visitors? Will I always be served the web page, regardless of latency, when I visit that link?


Service availability, SLAs, and SLOs

Generally, service availability is calculated as # of successful units / # of total units, for example uptime / (uptime + downtime) or successful requests / (successful requests + failed requests). It is reported as a percentage, like 99.99%: “my web page serves content to visitors 99.99% of the time.” Availability acts both as a reporting tool and as a probability tool for discussing the likelihood that your system will perform as expected in the future.
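As a minimal sketch of the ratio described above, the snippet below computes availability from either time-based or request-based counts. The function name and the sample numbers are illustrative assumptions, not values from a real service.

```python
def availability(successful_units: float, failed_units: float) -> float:
    """Return availability as successful / (successful + failed)."""
    total = successful_units + failed_units
    if total <= 0:
        raise ValueError("at least one unit must be observed")
    return successful_units / total


# Time-based (hypothetical): 43,170 minutes up, 30 minutes down in a 30-day month.
time_based = availability(43_170, 30)

# Request-based (hypothetical): 999,900 successful and 100 failed requests.
request_based = availability(999_900, 100)

print(f"time-based:    {time_based:.4%}")     # 99.9306%
print(f"request-based: {request_based:.4%}")  # 99.9900%
```

The same function covers both definitions because each is a ratio of "good" units to total units; only the unit being counted changes.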

Within Google Cloud, services and products have their own service-level agreements (SLAs) that describe their target availability. For example, if properly configured, a single Compute Engine instance in one zone will offer a Monthly Uptime Percentage of >= 99.5%.
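To make a percentage like 99.5% more tangible, it can help to turn it into a downtime allowance. The sketch below is plain arithmetic that assumes a 30-day month. It is not how any provider measures its Monthly Uptime Percentage, and the helper name is hypothetical.

```python
MINUTES_PER_30_DAY_MONTH = 30 * 24 * 60  # 43,200 minutes; assumes a 30-day month


def allowed_downtime_minutes(
    sla: float, period_minutes: float = MINUTES_PER_30_DAY_MONTH
) -> float:
    """Return the downtime, in minutes, that an availability target leaves room for."""
    if not 0.0 <= sla <= 1.0:
        raise ValueError("sla must be a fraction between 0 and 1")
    return (1.0 - sla) * period_minutes


for sla in (0.995, 0.9995, 0.9999):
    minutes = allowed_downtime_minutes(sla)
    print(f"{sla:.2%} leaves about {minutes:.1f} minutes of downtime per 30-day month")
```

Under this assumption, 99.5% corresponds to roughly 216 minutes of downtime in a month, while 99.99% corresponds to a little over 4 minutes.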

Note, an SLA is different from an SLO (Service-Level Objective). An SLO defines a numerical target for a service’s availability, whereas an SLA defines a promise to the service’s users that an SLO is met over a given time period. For simplicity, we’ll use the term “SLA” for the remainder of the document.

However, it’s rare that customer systems on Google Cloud are made up of a single Compute Engine instance. In reality, applications are much more complicated, with services intertwined and dependent on each other. Can you guarantee your users an SLA of >= 99.5% for an application running on a single Compute Engine instance if that instance is also dependent on a Cloud Storage bucket that is only 99.0% available?

What we need to calculate is the combined availability across all the different services that make up an application — the composite availability of an application. We need to analyze how the relationship between services in an application impacts the resulting availability of the application overall. This approach allows us to better design more resilient systems and therefore offer users a better experience.


Depending on the relationship between the services, the composite availability might be higher or lower than an individual service alone. Let’s take a look at some application design examples. While this calculation only speaks to the architectural-level availability (i.e., procedural and operational risks that are specific to customer’s systems are excluded), it provides meaningful “upper-bounds” of availability to help guide design.

Dependent services

Dependent services are defined as services where one service’s availability is dependent on the other. There are a few common variants of dependent service architecture. The first one to consider is “Serial Services.”

Serial services

Consider an application where successive services are dependent on each other directly:

 

If the frontend is directly reliant on the middleware, the likelihood of the entire system becoming unavailable is compounded by each of the services’ availability. The reliability of the system becomes:
Frontend_SLA * Middleware_SLA = SLA of system
.9995 * .9995 ≈ .999

Or formulaically, for any number of dependent serial services:
(SLA_1) * (SLA_2) * … * (SLA_N) = SLA of system

Here, it is critical to notice that the SLA of the system has gone down from the SLAs of the individual services. That is to say, your architecture choices, namely your architecture’s dependencies, can be more impactful than your provider’s guarantees. Even if you have “3.5” nines (99.95%) for each Cloud Run service, you can only get “3” nines (99.9%) for your system.
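The serial formula can be sketched in a few lines. The function name below is an illustrative assumption, and the second example reuses the Compute Engine instance (99.5%) and Cloud Storage bucket (99.0%) figures from the question posed earlier in this article.

```python
from math import prod
from typing import Iterable


def serial_availability(slas: Iterable[float]) -> float:
    """Composite availability of services that all sit on one dependency chain."""
    values = list(slas)
    if not values:
        raise ValueError("at least one SLA is required")
    return prod(values)


# Frontend and middleware, each at 99.95%.
two_tier = serial_availability([0.9995, 0.9995])
print(f"{two_tier:.6f}")  # 0.999000

# A 99.5% instance that depends on a 99.0% storage bucket.
instance_and_bucket = serial_availability([0.995, 0.99])
print(f"{instance_and_bucket:.5f}")  # 0.98505
```

Because every factor is at most 1, the product can never exceed the lowest SLA in the chain, which is why each added dependency lowers the upper bound.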

Parallel services

Another common dependent service architecture is to place your services in parallel, where your app is the composite of all of them. For example, middleware is now dependent on a backend running on Cloud SQL and a cache running on Memorystore. Let’s assume you need both the cache and the backend up and running for the middleware to function.

 

At first, this design might seem like an improvement over the last one, as the calls to the different services can be made independently. However, the cache or the backend failing carries the same reliability impact as the middleware failing. Though the backend does not rely on the cache, the system does. Because the middleware needs both services, the overall observable availability of the set of services is the probability that ALL of these services are up at a given time:
Frontend_SLA * Middleware_SLA * Backend_SLA * Cache_SLA = SLA of system

Because these systems still depend on each other, we’re stuck at (SLA_1) * (SLA_2) * … * (SLA_N) = SLA of the system. And as this list of dependent services gets longer, we see our system’s SLA go down.
.9995 * .9995 * .9995 * .999 ≈ .9975, which the calculations that follow round down to .997
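A short sketch of the four-service case, using the SLA figures from the calculation above. The dictionary layout and variable names are illustrative assumptions. It also picks out the weakest dependency, which is a natural first place to look when the composite number is too low.

```python
from math import prod

# SLA figures taken from the example above; every service is required.
services = {
    "frontend": 0.9995,
    "middleware": 0.9995,
    "backend": 0.9995,
    "cache": 0.999,
}

# All services must be up at once, so the availabilities multiply.
system_sla = prod(services.values())

# The dependency with the lowest SLA caps what the system can achieve.
weakest = min(services, key=services.get)

print(f"system SLA: {system_sla:.4f}")  # 0.9975
print(f"weakest dependency: {weakest}")  # cache
```

Note that the composite value is lower than even the weakest individual SLA, since the other three services contribute their own failure probability as well.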

Independent services

So… how can we improve the SLA of a system? We need a way to introduce services that can increase the resiliency of an application as a whole. A good example of this is redundancy!

Redundant services

Having independent copies of the service means that as long as one copy is running, your application is running. So what if we duplicated the application — perhaps across multiple regions — and load balanced between them?

 

Now, we have two replicas of our application and have introduced a load balancer. If we ignore the Cloud Load Balancer for now, the application is down only when all replicas fail at the same time, so its availability is one minus that probability:
1 - ( (probability_failure_replica_1) * (probability_failure_replica_2) * … * (probability_failure_replica_N) ) = SLA of the system excluding Cloud LB
1 - ( (1 - (Frontend_SLA * Middleware_SLA * Backend_SLA * Cache_SLA)) * (1 - (Frontend_SLA * Middleware_SLA * Backend_SLA * Cache_SLA)) ) = SLA of the system excluding Cloud LB
1 - ( (1 - .997) * (1 - .997) ) = SLA of the system excluding Cloud LB
1 - ( (.003) * (.003) ) = SLA of the system excluding Cloud LB
.999991 = SLA of the system excluding Cloud LB

An improvement on the system! This calculation maps to our real-life scenarios. In the event of a regional outage, the application is still up and running!
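The redundancy formula can be sketched as follows. The function name is hypothetical, and the calculation assumes that replicas fail independently of one another, which is what multiplying the failure probabilities implies. Correlated failures, such as a shared bad deployment, would make the real figure lower.

```python
from typing import Iterable


def redundant_availability(replica_slas: Iterable[float]) -> float:
    """Availability of independent replicas where any one replica is enough."""
    all_fail = 1.0
    for sla in replica_slas:
        all_fail *= 1.0 - sla  # probability that this replica is down
    return 1.0 - all_fail


# Two replicas of the 99.7% application stack, load balancer ignored.
two_replicas = redundant_availability([0.997, 0.997])
print(f"{two_replicas:.6f}")  # 0.999991
```

Unlike the serial case, each extra replica multiplies the failure probability by a small number, so availability goes up instead of down.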

Load Balancers

With all this in mind, a question arises: why can’t we get infinite 9s in our applications then? Theoretically, we could duplicate our application to our heart’s content. Here’s where the Cloud Load Balancer comes into play. The two replicas can’t simply fail over to each other by themselves. Traffic has to be routed to them by the Cloud Load Balancer, and the Cloud Load Balancer has its own availability. Therefore, the system will be down if either the Cloud Load Balancer fails or both replicas fail:

Cloud_LB_SLA * (1 - ( (probability_failure_1) * (probability_failure_2) * … * (probability_failure_N) )) = SLA of the system
Cloud_LB_SLA * (1 - ( (1 - (Frontend_SLA * Middleware_SLA * Backend_SLA * Cache_SLA)) * (1 - (Frontend_SLA * Middleware_SLA * Backend_SLA * Cache_SLA)) )) = SLA of the system
.9999 * (1 - ( (1 - .997) * (1 - .997) )) = SLA of the system
.9999 * (1 - ( (.003) * (.003) )) = SLA of the system
.9999 * .999991 = SLA of the system
.999891 = SLA of the system

So, the system is bounded by the availability of the Cloud Load Balancer. However, with replication we can still improve the system’s overall SLA compared to the single-region SLA (99.7%). This calculation also maps to our real-life scenarios: in the event of a regional outage, the application is still up and running.
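To see the bound in action, the sketch below combines the load balancer with a varying number of identical replicas, using the .9999 and .997 figures from the calculation above. The function name is an illustrative assumption, and replicas are again assumed to fail independently.

```python
def load_balanced_availability(lb_sla: float, replica_sla: float, replicas: int) -> float:
    """Availability of N identical, independent replicas behind one load balancer."""
    if replicas < 1:
        raise ValueError("at least one replica is required")
    all_replicas_down = (1.0 - replica_sla) ** replicas
    # The load balancer is a serial dependency in front of the redundant group.
    return lb_sla * (1.0 - all_replicas_down)


for n in range(1, 5):
    sla = load_balanced_availability(0.9999, 0.997, n)
    print(f"{n} replica(s): {sla:.6f}")
```

With two replicas this reproduces the .999891 result. Adding a third or fourth replica moves the number closer to .9999 but never past it, which is the sense in which the load balancer bounds the system.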

Real-life considerations

 

It seems like the problem of calculating composite availability has been solved. In reality, things are a bit more complicated. As mentioned, these calculations are just the “upper bounds” of availability. There are considerations of capacity and resource budgets, and of the time and effort associated with configuration. In the end, the bottleneck of reliability often isn’t the infrastructure at all. It’s the network, the application logic, and, most notably, the real-life implications of people managing complex systems.

Reliability is the responsibility of everyone in engineering: development, product management, operations, SRE, and more. Team members are accountable for knowing their project’s reliability target, risk and error budgets, and for prioritizing and escalating work appropriately.

Ultimately, designing for composite availability is important but is only a piece of the puzzle. It’s important to contextualize your infrastructure with your own customer-defined SLOs, your error budgets, and the operational complexity your teams can take on. Finding a balance is the key to your application’s success and your users’ happiness.

Further reading:

 

  • NASA Lessons Learned
  • Available…or not?
  • SRE fundamentals: SLAs vs SLOs vs SLIs
  • Google Cloud Reliability Architecture Guidance
 

By: Cat Chu (Strategic Cloud Engineer, App Modernization) and Gang Chen (Strategic Cloud Engineer, Infrastructure Modernization)
Source: Google Cloud Blog

