Which data sources do DevOps teams need in order to achieve observability? At a high level, that’s an easy question to answer. Concepts like the “three pillars of observability”—logs, metrics, and traces—may come to mind. Or, you may think in terms of techniques like the RED Method or Google’s Golden Signals, which are other popular frameworks for defining which types of data teams should collect for monitoring and observability purposes.
These concepts are great for helping to define an observability strategy at a high level. However, the problem with them is that they are not very specific about the precise types of data you need to enable observability, or where to find it. Which specific logs, traces, and metrics should you collect? Which parts of an application do you need to monitor to collect relevant request rate, error rate, and duration data?
To answer questions like these, you need to dive deeper into data sources for observability. This article does so by discussing five specific types of application-level insights that DevOps teams should collect as part of an observability strategy, along with tips on where to find the data and how to collect it from within a typical distributed application.
Total Application Requests
One of the most basic, but also most useful, metrics to track is the total number of requests that your application receives in a given period. Total requests reflect how much demand is placed on the app.
By correlating this data point with other metrics, like the time it takes to complete each request, teams can measure how application demand correlates with application performance. If you see a correlation between request rate and degradations in performance, for instance, it likely means that your application lacks sufficient resources to handle periods of peak demand.
Your method for tracking total application requests will depend on how your application is designed and deployed. In some cases, you could measure request rates through data exposed by your API gateway, if your application uses an API gateway to manage incoming requests. In other cases, application logs will record request rates, and you can use a logging agent to collect data from the logs and forward it to an external log analysis service.
Request Duration for Each Microservice
Request duration measures how long it takes to process a request.
You can measure request duration for an entire application. That would allow you to track how long it takes the application as a whole to handle each user request.
However, duration metrics tend to be more useful when you collect them at the level of microservices. After all, if your app is taking a long time to complete requests, the first question you’re likely to ask is which microservice (or microservices) is creating the bottleneck. In many cases, duration problems can be traced to a specific microservice that is taking longer than it should to do its part in handling a request.
There are two main ways to measure duration on a microservice-by-microservice basis. One is to collect this data from a service mesh, if you have one and it records information about request duration for each microservice. The other is to collect log data from the containers that host each microservice.
Microservice Instances Count
Knowing how many instances of each microservice you have running at a given time is useful for determining whether your application is under- or over-provisioned. It will also help you to determine whether a lack of available instances is responsible for performance problems that you may detect.
If you deploy each microservice in its own container, counting total microservices instances means counting how many containers you have running for each microservice. In most situations, the easiest way to get this data is from container orchestrator logs. Kubernetes audit logs could be configured to track container instances, for example.
Container Liveness and Readiness
Kubernetes uses the concepts of container “liveness” and “readiness” to assess whether containers are functioning properly. Containers that are not “live” or “ready” are typically not able to handle application requests.
Arguably, failure rates for liveness and readiness probes are an infrastructure metric rather than an application-level metric. However, because liveness and readiness problems are most often caused by an issue with the application, tracking these metrics is a very useful means of identifying application coding or configuration problems that are causing a container not to run properly.
(In some instances, liveness or readiness errors could be the result of a configuration issue with Kubernetes, not the app. But if other containers are working properly, chances are that the app is the source of the problem.)
To track readiness and liveness, you must first ensure that your Kubernetes pods are configured for liveness and readiness probes. (See the Kubernetes documentation for details.) Once they are, you can track the results of probes with a command like:
kubectl describe pod liveness-exec
CI/CD Pipeline Metrics
Last but not least on a list of DevOps observability metrics are metrics from the CI/CD pipeline. CI/CD pipeline metrics are metrics that measure the status and performance of CI/CD processes, such as how many new application releases you deploy per week or how many rollbacks you have to perform.
Here again, these aren’t technically application metrics, but they provide observability into the health of the application. A high rate of recurring rollbacks probably means that your application has a deep-seated coding problem that you should investigate, for example. Or, if you notice that application performance degrades as you increase your frequency of deployments, it may be a sign that you are trying to move too fast and are at risk of compromising application quality to maximize release velocity.
It’s easy to overlook the importance of CI/CD pipeline metrics, but they provide crucial context for observability that you can’t obtain from other parts of your stack.
Collecting CI/CD pipeline metrics can be tricky because logging and monitoring tools are not typically designed to track this category of data. However, it should be easy enough to use logs from your CI tools and/or release automation tools to track metrics like build frequency and deployment frequency.
Observability success requires defining the right types of data to collect and knowing where and how to collect it. From request rate for the application as a whole, to request duration for individual microservices, to CI/CD pipeline metrics and beyond, DevOps teams need a host of disparate data sources to achieve the benefits that observability stands to offer.
By Chris Tozzi