Gauge The Effectiveness Of Your DevOps Organization Running In Google Cloud

  • aster.cloud
  • October 7, 2020
  • 7 minute read

Many organizations aspire to become true, high-functioning DevOps shops, but it can be hard to know where you stand. According to DevOps Research and Assessment, or DORA, you can prioritize just four metrics to measure the effectiveness of your DevOps organization—two to measure speed, and two to measure stability:

Speed


1. Lead Time for Changes – Code commit to code in production
2. Deployment Frequency – How often you push code

Stability

3. Change Failure Rate – Rate of deployment failures in production that require immediate remediation (a rollback or manual change)
4. Time to Restore Service (MTTR) – Mean time to recovery

In this post, we present a methodology for collecting these four metrics from software delivery pipelines and applications deployed in Google Cloud. You can then use those metrics to rate your overall practice effectiveness, baseline your organization’s performance against DORA industry benchmarks, and determine whether you’re an Elite, High, Medium, or Low performer.

[Image: Breaking down DevOps performance]

The 2019 Accelerate State of DevOps: Elite performance, productivity, and scaling

DORA and Google Cloud have published the 2019 Accelerate State of DevOps Report.

Let’s take a look at how to do this in practice, with a sample architecture running on Google Cloud.

Services and reference architecture

To get started, we create a CI/CD pipeline with the following cloud services:

  • GitHub code repository
  • Cloud Build (a container-based CI/CD tool)
  • Container Registry
  • Google Kubernetes Engine (GKE)
  • Cloud Load Balancing (used as an ingress controller for GKE)
  • Cloud uptime checks (for synthetic application monitoring)
  • Cloud Monitoring
  • Cloud Functions
  • Pub/Sub (used as a message bus to connect alerts to Cloud Functions)

These are combined into the reference architecture below. Note that all of these Google Cloud services are integrated with Cloud Monitoring. As such, there’s nothing in particular that you need to set up to receive service logs, and many of these services have built-in metrics that we’ll use in this post.

[Image: Google Cloud Platform CI/CD pipeline and application topology]

Measuring Speed

To measure our two speed metrics, deployment frequency and lead time for changes, we instrument Cloud Build, our continuous integration and continuous delivery (CI/CD) tool. As a container-based CI/CD tool, Cloud Build lets you load a series of Google-managed or community-managed cloud builders to manipulate your code or interact with internal and external services during the build and deployment process. When a build trigger fires, Cloud Build pulls our source code from the Git repository, creates a container image artifact that it pushes to Container Registry, and then deploys the container image to a GKE cluster.

You can also import your own cloud builder container into the process and insert it as the final build step, to determine the time from commit to deployment as well as whether the deployment is a rollback. For this example, we’ve created a custom container to be used as the last build step that:

  1. Retrieves the payload binding for the commit timestamp, accessed by the variable $(push.repository.pushed_at), and compares it against the current timestamp to calculate lead time. The payload binding variable is mapped when we create the trigger and is referenced by a custom substitution variable, $_MERGE_TIME, in cloudbuild.yaml.
  2. Reaches into the source repo to get the commit ID of the latest commit on the master branch and compares it to the current commit ID of the build to determine whether the build is a rollback or a match (a minimal sketch of this logic follows the list).
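
To make this concrete, here is a minimal Python sketch of what such a final build step could do. It assumes the trigger’s $_MERGE_TIME substitution and Cloud Build’s built-in COMMIT_SHA are passed to the step as environment variables, and that git is available in the builder image; the custom builder linked below is the actual implementation.

    # lead_time.py - illustrative sketch only, not the linked builder.
    import os
    import subprocess
    import time

    def main():
        # $_MERGE_TIME is mapped to $(push.repository.pushed_at), the Unix
        # timestamp of the push; COMMIT_SHA is a built-in substitution.
        # Both are assumed here to arrive as environment variables.
        merge_time = float(os.environ["_MERGE_TIME"])
        current_commit = os.environ["COMMIT_SHA"]

        # Lead time: seconds from code push to this (final) build step.
        lead_time = time.time() - merge_time

        # If the build's commit is not the tip of master, we are re-deploying
        # an older commit, i.e. performing a rollback.
        head = subprocess.check_output(
            ["git", "ls-remote", "origin", "refs/heads/master"], text=True
        ).split()[0]
        rollback = head != current_commit

        # Emit the log line that the log-based metric regexes below parse.
        print(f"Commit: {current_commit} Rollback: {rollback} "
              f"LeadTime: {lead_time:.2f}")

    if __name__ == "__main__":
        main()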

You can find a reference Cloud Build config yaml here that shows each of the build steps described above. If you use a non-built-in variable like $_MERGE_TIME as a payload binding in your config file, you need to map that variable to the $(push.repository.pushed_at) value when you set up the Cloud Build trigger.

[Image: $_MERGE_TIME variable mapping in the build trigger]

You can find the custom cloud builder container used here. After the build step for this container runs, the following is output to the Cloud Build logs, which are fed automatically into Cloud Monitoring. Notice the Commit ID, Rollback, and LeadTime values, which are written to the logs by our custom cloud builder:

[Image: Cloud Build log output showing Commit, Rollback, and LeadTime values]

Next, we can create a log-based metric in Cloud Logging to capture these custom values. Log-based metrics can be based on filters for specific log entries.

[Image: Creating a log-based metric]

Once we have our filter for the specific log entries, we can use regular expressions to capture specific sections of each log entry as metric values and labels. In the screenshots below, we create labels for the commit name and rollback value that attach to the LeadTime value appearing in the ‘textPayload’ field of our log. We use the following regular expressions, sanity-checked in the snippet after the list:

Rollback:\s([a-zA-Z]+)
Commit:\s([a-zA-Z0-9]+)
LeadTime:\s([0-9]+\.[0-9]{2})
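
As a quick sanity check, here is how those patterns behave in Python against a log line of the shape shown above (the sample values are invented):

    import re

    line = "Commit: 9b41bb6 Rollback: False LeadTime: 1345.72"

    labels = {
        "commit": re.search(r"Commit:\s([a-zA-Z0-9]+)", line).group(1),
        "rollback": re.search(r"Rollback:\s([a-zA-Z]+)", line).group(1),
    }
    value = float(re.search(r"LeadTime:\s([0-9]+\.[0-9]{2})", line).group(1))
    print(labels, value)  # {'commit': '9b41bb6', 'rollback': 'False'} 1345.72
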
[Image: Rollback label regex]

Metric Value:

[Image: LeadTime metric value regex]
Create log-based metric and labels

Lead Time for Changes

Once we have the above metric and labels created from our Cloud Build log, we can access the metric in the Cloud Operations Metrics Explorer via the metric label ‘logging/user/dorametrics’ (‘dorametrics’ was the name we gave our log-based metric). The value of the metric is the LeadTime extracted by the regular expression above, with rollbacks filtered out. We use the median, or 50th percentile.

[Image: Lead Time for Changes chart]

Deployment Frequency

Now that we have the lead time for each commit, we can determine deployment frequency by simply counting the number of lead times we recorded in a window! A sketch of this aggregation follows the chart below.

[Image: Deployment Frequency chart]
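
Outside of Metrics Explorer, the same aggregation is easy to reason about. A short Python sketch over hypothetical recorded data points:

    import statistics

    # Hypothetical (unix_timestamp, lead_time_seconds, rollback) points
    # exported from the log-based metric for a one-week window.
    points = [
        (1601856000, 1345.72, False),
        (1601942400, 980.10, False),
        (1602028800, 2210.45, True),
    ]

    deployments_per_day = len(points) / 7
    # Median lead time, with rollbacks filtered out as described above.
    median_lead_time = statistics.median(p[1] for p in points if not p[2])
    print(f"{deployments_per_day:.2f} deploys/day, "
          f"median lead time {median_lead_time:.0f}s")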

Measuring Stability

Change Failure Count

To determine the number of software rollbacks performed, we can take our deployment frequency metric and filter for ‘Rollback=True’. This gives us a count of the total rollbacks performed. To determine the change failure rate, we divide this count by the deployment frequency collected above for the same window; for example, 3 rollbacks against 40 deployments in a week is a change failure rate of 7.5%.

[Image: Change Failure Count chart]

Mean-Time-To-Resolution (MTTR)

In typical enterprise environments, incident response systems let you determine when an issue was reported and when it was ultimately resolved. Assuming these times can be queried, MTTR is simply the average time between the reported and resolved timestamps of the issues.
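
The arithmetic itself is simple; here is a short Python sketch with hypothetical incident data:

    from datetime import datetime

    # Hypothetical (reported, resolved) pairs from an incident system.
    incidents = [
        ("2020-09-01T10:00:00", "2020-09-01T10:42:00"),
        ("2020-09-14T08:05:00", "2020-09-14T09:20:00"),
    ]

    FMT = "%Y-%m-%dT%H:%M:%S"

    def minutes(start, end):
        delta = datetime.strptime(end, FMT) - datetime.strptime(start, FMT)
        return delta.total_seconds() / 60

    mttr = sum(minutes(s, e) for s, e in incidents) / len(incidents)
    print(f"MTTR: {mttr:.1f} minutes")  # MTTR: 58.5 minutes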

In this blog, we use automation to alert on and graph issues, which allows us to gather more accurate service disruption metrics. Our strategy uses Service Level Objectives (SLOs), which pair Service Level Indicators (SLIs) that we’ve determined represent our customers’ happiness with our application with a target objective. When we violate an SLO, we consider our time to restore service to be the total time it takes to detect, mitigate, and resolve the problem until we are back in compliance with the SLO.

[Image: Mean Time To Resolution]

MTTR and customer satisfaction

For simplicity, we’ve highlighted one metric that we feel represents our customer satisfaction: overall HTTP response-code errors from our website. The ratio of these errors to the total response codes sent over a given time window constitutes our Service Level Indicator (SLI).

For total errors we monitor response codes returned from our front-end load balancer, which is set up as an ingress controller in our GKE cluster.

[Image: Load balancer response codes grouped by response_code]

Metric used: loadbalancing.googleapis.com/https/request_count, grouped by response_code

Using the metric above, we can build our SLI and wrap it into an SLO that represents the customer satisfaction observed over a longer time window. Using the SLO API, we create custom SLOs that represent the level of customer satisfaction we want to monitor, where a violation of the SLO indicates an issue. There’s a great tutorial on how to create custom SLOs and services here.

In this example, we’ve created a custom service to represent our application and an SLO for HTTP load balancer response codes (code). It assumes a quality-of-service level in which 98% of responses from the load balancer in a given day should not be errors, which automatically creates an error budget of 2% over 24 hours. Now, when it comes to monitoring for MTTR, we have a metric (the SLI) attached to a service-level SLO that represents quality of service over a given window of time. A failure of the SLO is simulated in the screenshot below:

[Image: Simulated SLO violation]
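
Since the API request itself isn’t reproduced here, a rough sketch of the SLO resource, written as a Python dict, might look like the following. The field names follow the public SLO API, but the filters and values are assumptions for illustration; the tutorial and the SLO API code linked in this post have the authoritative definitions.

    # Sketch of a request-based SLO body, to be POSTed to:
    # v3/projects/PROJECT_ID/services/SERVICE_ID/serviceLevelObjectives
    slo_body = {
        "displayName": "98% of LB responses are not errors",
        "goal": 0.98,               # implies a 2% error budget
        "rollingPeriod": "86400s",  # evaluated over 24 hours
        "serviceLevelIndicator": {
            "requestBased": {
                "goodTotalRatio": {
                    # Illustrative filters on the LB request-count metric.
                    "goodServiceFilter": (
                        'metric.type="loadbalancing.googleapis.com'
                        '/https/request_count" '
                        "metric.labels.response_code_class=200"
                    ),
                    "totalServiceFilter": (
                        'metric.type="loadbalancing.googleapis.com'
                        '/https/request_count"'
                    ),
                }
            }
        },
    }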

Next, we set up an alert policy that fires when we are in danger of violating this SLO; this also starts the timer used to calculate time to resolution. What we’re measuring here is referred to as ‘burn rate’: how quickly the current SLI is eating up our error budget (2% of responses over 24 hours). The window we measure for our alert is much smaller than the SLO’s full window, so when the SLI moves back within the compliance threshold, another alert fires, indicating the incident has cleared. For more information on setting up alerting policies, please visit this page.
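
To make ‘burn rate’ concrete, here is the back-of-the-envelope arithmetic in Python (the observed error ratio is invented):

    # Burn rate = observed error ratio / allowed error ratio (the budget).
    slo_goal = 0.98
    error_budget = 1 - slo_goal      # 2% of responses over 24 hours
    observed_error_ratio = 0.04      # hypothetical rate over the alert window

    burn_rate = observed_error_ratio / error_budget   # 2.0x
    hours_to_exhaust_budget = 24 / burn_rate          # 12 hours
    print(f"Burning budget at {burn_rate:.1f}x; "
          f"exhausted in {hours_to_exhaust_budget:.0f}h")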


You can also send alerts through a variety of channels, allowing you to integrate with existing ticketing or messaging systems and record MTTR in a way that makes sense for your organization. For our purposes, we integrate with the Pub/Sub message bus channel, sending the alerts to a Cloud Function that performs the necessary charting calculations.

In the message from the clearing alert, we see that the JSON payload has started_at and ended_at timestamps. We use these timestamps in our Cloud Function to calculate the time to resolve the issue and then output it to the logs.

Here is the entire Pub/Sub message sent to Cloud Functions:

[Image: Pub/Sub message payload]

Here is the cloud function connected to the same Pub/Sub topic as the Alert:

[Image: Cloud Function subscribed to the alert Pub/Sub topic]
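
Since the function is only shown as a screenshot, here is a minimal Python sketch of what it could look like. The handle_alert name is hypothetical; the (event, context) signature is the standard Pub/Sub background-function form, and started_at/ended_at are the incident fields called out above.

    import base64
    import json

    def handle_alert(event, context):
        """Pub/Sub-triggered function: log an incident's time to resolve."""
        message = json.loads(base64.b64decode(event["data"]).decode("utf-8"))
        incident = message["incident"]

        # Only the clearing alert carries both timestamps (epoch seconds).
        if incident.get("ended_at"):
            time_to_resolve = int(incident["ended_at"]) - int(incident["started_at"])
            # Shaped so the 'Resolve:\s([0-9]+);' metric below can parse it.
            print(f"Time to Resolve: {time_to_resolve};")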

This results in the following messages in the Cloud Functions logs:

[Image: Cloud Functions log output]

The final step is to create another log-based metric to pick up the ‘Time to Resolve’ value that we print to our Cloud Functions log. We do so with this regular expression:

 Resolve:\s([0-9]+);

[Image: ‘Time to Resolve’ log-based metric]

Now the metric is available in Cloud Operations.

[Image: The metric charted in Cloud Operations]

Conclusion

We’ve shown above how you can create custom cloud builders in Cloud Build to generate metrics for deployment frequency, lead time for changes, and rollbacks that appear in Cloud Operations logs. We’ve also shown how to use SLOs and SLIs to generate and push alerts to your Cloud Functions logs, and how to use log-based metrics to pull those values out of the logs and chart them. These metrics can be used to evaluate the effectiveness of your organization’s software development and delivery pipelines over time, as well as to benchmark your performance against the greater DevOps community. Where does your organization land?

For more inspiration, here is some further reference material to help you measure the effectiveness of your own DevOps organization:

  • Google Cloud Application Modernization Program (blog)
  • Setting SLOs: a step-by-step guide (blog)
  • Setting SLOs: observability using custom metrics (blog)
  • Concepts in Service Monitoring (documentation)
  • Working with the SLO API (documentation)
  • How to create SLOs in the GCP Console (video)
  • How to create SLOs at scale with the SLO API (video)
  • How to create SLOs using custom metrics (video)
  • GitHub SLO API Code used for Blog
  • DORA Quick Check
  • The 4 Keys Project for DORA Metric Ingression into BigQuery
  • 21 new ways we’re improving observability with Cloud Ops (blog)

By Brian Kaufman, Specialist Customer Engineer, Hybrid

Source https://cloud.google.com/blog/products/devops-sre/another-way-to-gauge-your-devops-performance-according-to-dora

