Gauge The Effectiveness Of Your DevOps Organization Running In Google Cloud

  • aster.cloud
  • October 7, 2020
  • 7 minute read

Many organizations aspire to become true, high-functioning DevOps shops, but it can be hard to know where you stand. According to DevOps Research and Assessment, or DORA, you can prioritize just four metrics to measure the effectiveness of your DevOps organization—two to measure speed, and two to measure stability:

Speed


1. Lead Time for Changes – Code commit to code in production
2. Deployment Frequency – How often you push code

Stability

3. Change Failure Rate – Rate of deployment failures in production that require immediate remediation (a rollback or manual change)
4. Time to Restore Service (MTTR) – Mean time to recovery

In this post, we present a methodology for collecting these four metrics from software delivery pipelines and applications deployed in Google Cloud. You can then use those metrics to rate your overall practice effectiveness, baseline your organization’s performance against DORA industry benchmarks, and determine whether you’re an Elite, High, Medium, or Low performer.

[Image: Breaking down DevOps performance]

The 2019 Accelerate State of DevOps: Elite performance, productivity, and scaling

DORA and Google Cloud have published the 2019 Accelerate State of DevOps Report.

Let’s take a look at how to do this in practice, with a sample architecture running on Google Cloud.

Services and reference architecture

To get started, we create a CI/CD pipeline with the following cloud services:

  • GitHub code repository
  • Cloud Build (a container-based CI/CD tool)
  • Container Registry
  • Google Kubernetes Engine (GKE)
  • Cloud Load Balancing (used as an ingress controller for GKE)
  • Cloud uptime checks (for synthetic application monitoring)
  • Cloud Monitoring
  • Cloud Functions
  • Pub/Sub (used as a message bus to connect alerts to Cloud Functions)

These are combined into the reference architecture below. Note that all of these Google Cloud services are integrated with Cloud Monitoring. As such, there’s nothing in particular that you need to set up to receive service logs, and many of these services have built-in metrics that we’ll use in this post.

[Image: Google Cloud Platform CI/CD pipeline and application topology]

Measuring Speed

To measure our two speed metrics, deployment frequency and lead time for changes, we instrument Cloud Build, our continuous integration and continuous delivery (CI/CD) tool. As a container-based CI/CD tool, Cloud Build lets you load a series of Google-managed or community-managed cloud builders to manipulate your code or interact with internal and external services during the build and deployment process. When a build trigger fires, Cloud Build pulls our source code from the Git repository, creates a container image artifact that it pushes to Container Registry, and then deploys the container image to a GKE cluster.

You can also import your own cloud builder container into the process and insert it as the final build step, to determine the time from commit to deployment as well as whether the deployment is a rollback. For this example, we’ve created a custom container to be used as the last build step that:

  1. Retrieves the payload binding for the commit timestamp, accessed by the variable $(push.repository.pushed_at), and compares it against the current timestamp to calculate lead time. The payload binding variable is mapped when we create the trigger and is referenced by a custom substitution variable, $_MERGE_TIME, in cloudbuild.yaml.
  2. Reaches into the source repo to get the commit ID of the latest commit on the master branch and compares it to the current commit ID of the build to determine whether the build is a rollback or a match (a minimal sketch of this logic follows the list).
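
To make this concrete, here is a minimal Python sketch of what such a final build step could do. It assumes the trigger’s $_MERGE_TIME substitution and Cloud Build’s built-in COMMIT_SHA are passed to the step as environment variables, and that git is available in the builder image; the custom builder linked below is the actual implementation.

    # lead_time.py - illustrative sketch only, not the linked builder.
    import os
    import subprocess
    import time

    def main():
        # $_MERGE_TIME is mapped to $(push.repository.pushed_at), the Unix
        # timestamp of the push; COMMIT_SHA is a built-in substitution.
        # Both are assumed here to arrive as environment variables.
        merge_time = float(os.environ["_MERGE_TIME"])
        current_commit = os.environ["COMMIT_SHA"]

        # Lead time: seconds from code push to this (final) build step.
        lead_time = time.time() - merge_time

        # If the build's commit is not the tip of master, we are re-deploying
        # an older commit, i.e. performing a rollback.
        head = subprocess.check_output(
            ["git", "ls-remote", "origin", "refs/heads/master"], text=True
        ).split()[0]
        rollback = head != current_commit

        # Emit the log line that the log-based metric regexes below parse.
        print(f"Commit: {current_commit} Rollback: {rollback} "
              f"LeadTime: {lead_time:.2f}")

    if __name__ == "__main__":
        main()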

You can find a reference Cloud Build config yaml here that shows each of the build steps described above. If you use a non-built-in variable like $_MERGE_TIME as a payload binding in your config file, you need to map that variable to the $(push.repository.pushed_at) value when you set up the Cloud Build trigger.

[Image: $_MERGE_TIME variable mapping in the build trigger]

You can find the custom cloud builder container used here. After the build step for this container runs, the following is output to the Cloud Build logs, which are fed automatically into Cloud Monitoring. Notice the Commit ID, Rollback, and LeadTime values, which are written to the logs by our custom cloud builder:

[Image: Cloud Build log output showing Commit, Rollback, and LeadTime values]

Next, we can create a log-based metric in Cloud Logging to capture these custom values. Log-based metrics can be based on filters for specific log entries.

[Image: Creating a log-based metric]

Once we have our filter for the specific log entries, we can use regular expressions to capture specific sections of each log entry as metric values and labels. In the screenshots below, we create labels for the commit name and rollback value that attach to the LeadTime value appearing in the ‘textPayload’ field of our log. We use the following regular expressions, sanity-checked in the snippet after the list:

Rollback:\s([a-zA-Z]+)
Commit:\s([a-zA-Z0-9]+)
LeadTime:\s([0-9]+\.[0-9]{2})
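
As a quick sanity check, here is how those patterns behave in Python against a log line of the shape shown above (the sample values are invented):

    import re

    line = "Commit: 9b41bb6 Rollback: False LeadTime: 1345.72"

    labels = {
        "commit": re.search(r"Commit:\s([a-zA-Z0-9]+)", line).group(1),
        "rollback": re.search(r"Rollback:\s([a-zA-Z]+)", line).group(1),
    }
    value = float(re.search(r"LeadTime:\s([0-9]+\.[0-9]{2})", line).group(1))
    print(labels, value)  # {'commit': '9b41bb6', 'rollback': 'False'} 1345.72
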
[Image: Rollback label regex]

Metric Value:

[Image: LeadTime metric value regex]
Create log-based metric and labels

Lead Time for Changes

Once we have the above metric and labels created from our Cloud Build log, we can access the metric in the Cloud Operations Metrics Explorer via the metric label ‘logging/user/dorametrics’ (‘dorametrics’ was the name we gave our log-based metric). The value of the metric is the LeadTime extracted by the regular expression above, with rollbacks filtered out. We use the median, or 50th percentile.

[Image: Lead Time for Changes chart]

Deployment Frequency

Now that we have the lead time for each commit, we can determine deployment frequency by simply counting the number of lead times we recorded in a window! A sketch of this aggregation follows the chart below.

[Image: Deployment Frequency chart]
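
Outside of Metrics Explorer, the same aggregation is easy to reason about. A short Python sketch over hypothetical recorded data points:

    import statistics

    # Hypothetical (unix_timestamp, lead_time_seconds, rollback) points
    # exported from the log-based metric for a one-week window.
    points = [
        (1601856000, 1345.72, False),
        (1601942400, 980.10, False),
        (1602028800, 2210.45, True),
    ]

    deployments_per_day = len(points) / 7
    # Median lead time, with rollbacks filtered out as described above.
    median_lead_time = statistics.median(p[1] for p in points if not p[2])
    print(f"{deployments_per_day:.2f} deploys/day, "
          f"median lead time {median_lead_time:.0f}s")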

Measuring Stability

Change Failure Count

To determine the number of software rollbacks performed, we can take our deployment frequency metric and filter for ‘Rollback=True’. This gives us a count of the total rollbacks performed. To determine the change failure rate, we divide this count by the deployment frequency collected above for the same window; for example, 3 rollbacks against 40 deployments in a week is a change failure rate of 7.5%.

[Image: Change Failure Count chart]

Mean-Time-To-Resolution (MTTR)

In typical enterprise environments, incident response systems let you determine when an issue was reported and when it was ultimately resolved. Assuming these times can be queried, MTTR is simply the average time between the reported and resolved timestamps of the issues.
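
The arithmetic itself is simple; here is a short Python sketch with hypothetical incident data:

    from datetime import datetime

    # Hypothetical (reported, resolved) pairs from an incident system.
    incidents = [
        ("2020-09-01T10:00:00", "2020-09-01T10:42:00"),
        ("2020-09-14T08:05:00", "2020-09-14T09:20:00"),
    ]

    FMT = "%Y-%m-%dT%H:%M:%S"

    def minutes(start, end):
        delta = datetime.strptime(end, FMT) - datetime.strptime(start, FMT)
        return delta.total_seconds() / 60

    mttr = sum(minutes(s, e) for s, e in incidents) / len(incidents)
    print(f"MTTR: {mttr:.1f} minutes")  # MTTR: 58.5 minutes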

In this blog, we use automation to alert on and graph issues, which allows us to gather more accurate service disruption metrics. Our strategy uses Service Level Objectives (SLOs), which pair Service Level Indicators (SLIs) that we’ve determined represent our customers’ happiness with our application with a target objective. When we violate an SLO, we consider our time to restore service to be the total time it takes to detect, mitigate, and resolve the problem until we are back in compliance with the SLO.

[Image: Mean Time To Resolution]

MTTR and customer satisfaction

For simplicity, we’ve highlighted one metric that we feel represents our customer satisfaction: overall HTTP response-code errors from our website. The ratio of these errors to the total response codes sent over a given time window constitutes our Service Level Indicator (SLI).

For total errors we monitor response codes returned from our front-end load balancer, which is set up as an ingress controller in our GKE cluster.

[Image: Load balancer response codes grouped by response_code]

Metric used: loadbalancing.googleapis.com/https/request_count, grouped by response_code

Using the metric above, we can build our SLI and wrap it into an SLO that represents the customer satisfaction observed over a longer time window. Using the SLO API, we create custom SLOs that represent the level of customer satisfaction we want to monitor, where a violation of the SLO indicates an issue. There’s a great tutorial on how to create custom SLOs and services here.

In this example, we’ve created a custom service to represent our application and an SLO for HTTP load balancer response codes (code). It assumes a quality-of-service level in which 98% of responses from the load balancer in a given day should not be errors, which automatically creates an error budget of 2% over 24 hours. Now, when it comes to monitoring for MTTR, we have a metric (the SLI) attached to a service-level SLO that represents quality of service over a given window of time. A failure of the SLO is simulated in the screenshot below:

[Image: Simulated SLO violation]
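
Since the API request itself isn’t reproduced here, a rough sketch of the SLO resource, written as a Python dict, might look like the following. The field names follow the public SLO API, but the filters and values are assumptions for illustration; the tutorial and the SLO API code linked in this post have the authoritative definitions.

    # Sketch of a request-based SLO body, to be POSTed to:
    # v3/projects/PROJECT_ID/services/SERVICE_ID/serviceLevelObjectives
    slo_body = {
        "displayName": "98% of LB responses are not errors",
        "goal": 0.98,               # implies a 2% error budget
        "rollingPeriod": "86400s",  # evaluated over 24 hours
        "serviceLevelIndicator": {
            "requestBased": {
                "goodTotalRatio": {
                    # Illustrative filters on the LB request-count metric.
                    "goodServiceFilter": (
                        'metric.type="loadbalancing.googleapis.com'
                        '/https/request_count" '
                        "metric.labels.response_code_class=200"
                    ),
                    "totalServiceFilter": (
                        'metric.type="loadbalancing.googleapis.com'
                        '/https/request_count"'
                    ),
                }
            }
        },
    }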

Next, we set up an alert policy that fires when we are in danger of violating this SLO; this also starts the timer used to calculate time to resolution. What we’re measuring here is referred to as ‘burn rate’: how quickly the current SLI is eating up our error budget (2% of responses over 24 hours). The window we measure for our alert is much smaller than the SLO’s full window, so when the SLI moves back within the compliance threshold, another alert fires, indicating the incident has cleared. For more information on setting up alerting policies, please visit this page.
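
To make ‘burn rate’ concrete, here is the back-of-the-envelope arithmetic in Python (the observed error ratio is invented):

    # Burn rate = observed error ratio / allowed error ratio (the budget).
    slo_goal = 0.98
    error_budget = 1 - slo_goal      # 2% of responses over 24 hours
    observed_error_ratio = 0.04      # hypothetical rate over the alert window

    burn_rate = observed_error_ratio / error_budget   # 2.0x
    hours_to_exhaust_budget = 24 / burn_rate          # 12 hours
    print(f"Burning budget at {burn_rate:.1f}x; "
          f"exhausted in {hours_to_exhaust_budget:.0f}h")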


You can also send alerts through a variety of channels, allowing you to integrate with existing ticketing or messaging systems and record MTTR in a way that makes sense for your organization. For our purposes, we integrate with the Pub/Sub message bus channel, sending the alerts to a Cloud Function that performs the necessary charting calculations.

In the message from the clearing alert, we see that the JSON payload has started_at and ended_at timestamps. We use these timestamps in our Cloud Function to calculate the time to resolve the issue and then output it to the logs.

Here is the entire Pub/Sub message sent to Cloud Functions:

[Image: Pub/Sub message payload]

Here is the cloud function connected to the same Pub/Sub topic as the Alert:

[Image: Cloud Function subscribed to the alert Pub/Sub topic]
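
Since the function is only shown as a screenshot, here is a minimal Python sketch of what it could look like. The handle_alert name is hypothetical; the (event, context) signature is the standard Pub/Sub background-function form, and started_at/ended_at are the incident fields called out above.

    import base64
    import json

    def handle_alert(event, context):
        """Pub/Sub-triggered function: log an incident's time to resolve."""
        message = json.loads(base64.b64decode(event["data"]).decode("utf-8"))
        incident = message["incident"]

        # Only the clearing alert carries both timestamps (epoch seconds).
        if incident.get("ended_at"):
            time_to_resolve = int(incident["ended_at"]) - int(incident["started_at"])
            # Shaped so the 'Resolve:\s([0-9]+);' metric below can parse it.
            print(f"Time to Resolve: {time_to_resolve};")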

This results in the following messages in the Cloud Functions logs:

[Image: Cloud Functions log output]

The final step is to create another log-based metric to pick up the ‘Time to Resolve’ value that we print to our Cloud Functions log. We do so with this regular expression:

 Resolve:\s([0-9]+);

[Image: ‘Time to Resolve’ log-based metric]

Now the metric is available in Cloud Operations.

[Image: The metric charted in Cloud Operations]

Conclusion

We’ve shown above how you can create custom cloud builders in Cloud Build to generate metrics for deployment frequency, lead time for changes, and rollbacks that appear in Cloud Operations logs. We’ve also shown how to use SLOs and SLIs to generate and push alerts to your Cloud Functions logs, and how to use log-based metrics to pull those values out of the logs and chart them. These metrics can be used to evaluate the effectiveness of your organization’s software development and delivery pipelines over time, as well as to benchmark your performance against the greater DevOps community. Where does your organization land?

For more inspiration, here is some further reference material to help you measure the effectiveness of your own DevOps organization:

  • Google Cloud Application Modernization Program (blog)
  • Setting SLOs: a step-by-step guide (blog)
  • Setting SLOs: observability using custom metrics (blog)
  • Concepts in Service Monitoring (documentation)
  • Working with the SLO API (documentation)
  • How to create SLOs in the GCP Console (video)
  • How to create SLOs at scale with the SLO API (video)
  • How to create SLOs using custom metrics (video)
  • GitHub SLO API Code used for Blog
  • DORA Quick Check
  • The 4 Keys Project for DORA Metric Ingression into BigQuery
  • 21 new ways we’re improving observability with Cloud Ops (blog)

By Brian Kaufman, Specialist Customer Engineer, Hybrid

Source https://cloud.google.com/blog/products/devops-sre/another-way-to-gauge-your-devops-performance-according-to-dora

