Why are synthetic monitoring tools not enough?
When thousands or even millions of users are making multiple requests to your APIs, just relying on synthetic monitoring tools (that rely on sampling or limited API availability information) is not enough for precision diagnoses or useful forensics. At the same time, monitoring every single aspect just increases your overhead and mean time to diagnosis. Apropos, API monitoring has become absolutely critical — and a fusion of art and science — for operations teams to make sure all APIs are running and performing as intended. If you are worried about monitoring blind spots or overheads, let us look at 3 key practices you can follow to stop dreading your sev1 alerts.
#1 Prioritize alerts for critical events requiring immediate investigation
Ask any engineer on-call for a critical service and they will tell you about the overheads created by incorrectly prioritized alerts. For example let us imagine a distributed application with 20 APIs. Even if you set up basic alert monitors across latency, error, and traffic for these APIs, you end up monitoring and maintaining 60 alert definitions – which is a lot. To balance avoiding monitoring blindspots and alert fatigue at the same time, operations teams must develop a clear understanding of all events and prioritize configuring alerts for events supporting critical traffic.
- Consider quality over quantity while defining new alert conditions where each new condition is urgent, actionable, and actively or imminently user-visible. Every alert condition created should also contain intelligence that requires active engagement from a user as opposed to a mere robotic response. Apigee’s API monitoring allows creation of alert conditions based on metrics or logs while providing actionable information (Ex: status code, rate etc.,) and playbooks for diagnosis
- In today’s multilayered systems, one team’s symptom (“what’s broken?”) is another downstream system’s cause (“Why?”). Even if some events are not suitable for actionable alerts, a failure needs to create an informational broadcast to a downstream system to mitigate the impact of the upstream dependency. In such cases, investing in automating alerts, grouping multiple incidents in notification channels, and incident tracking. For example, Apigee lets you integrate and group your alert notifications in channels like Slack, Pagerduty, webhooks etc.,
- Modern production systems are ever-evolving where an alert that’s currently rare might become frequent and automatable. Analogous to ticket backlog grooming, alert policies need to be reviewed periodically to make sure new conditions are identified and existing alerts are refined with new thresholds, prioritization and correlation. Controls like Advanced API Ops leverage AI and ML to detect anomalous traffic differentiated from random fluctuations to help define accurate alert definitions
Check out some examples of alerts here
#2 Isolate problem areas quickly with dashboards
Google’s Site Reliability Engineering book presents the case for efficient diagnosis by building dashboards that answer basic questions about every service, normally including some form of the four golden signals — latency, traffic, errors and saturation. But at the same time, capturing just these golden metrics at different levels of granularity can quickly add up. Like all software systems, monitoring can become an endless pit of complexity, complicated to change and a burden to maintain. In the same book, the most effective direction to create a well-functioning standalone system is to collect and aggregate basic metrics, paired with alerting and dashboards.
If you are running a large scale API program with a dedicated team to monitor your APIs, you can leverage the out-of-the-box monitoring dashboards in your API Management solution (like Apigee’s API monitoring) to gather real-time insights into your API performance, availability, latency, and errors. In other cases, you can use solutions like Cloud Monitoring that can provide visibility across your full application stack where individual metrics, events, and metadata can be visualized in a rich query language for rapid analysis. Leveraging a single system for your application stack provides observability in context and can reduce your time spent navigating between systems (Apigee customers can use Cloud Monitoring by default or integrate with other systems using Cloud Monitoring API)
- Timeline view to diagnose traffic (in 1 min intervals), error rates (4xx and 5xx across traffic) and latencies (50th, 90th, 95th and 99th percentiles)
- Pivot tables of metrics and attributes for all API traffic, to help compare activity across different metrics
- Treemaps of recent API traffic by proxy to get a snapshot of incidents, error rates and latencies
#3 Incorporate distributed tracing into your end to end observability strategy
Modern application development accelerated the adoption of technologies and practices like cloud, containers, APIs, microservice architectures, DevOps, SRE etc. While this increases release velocity, it also introduces complexity and more points of failure in an application stack. For example, a slow response to a customer request spans across multiple micro services owned (and monitored) by various teams who might not observe any individual performance issues. Without an end-to-end contextual view of a request, it is nearly impossible to isolate the point of high latency.
In such cases, distributed tracing is the best way for DevOps, Operations and SREs to get answers to questions such as service health, root cause of defects, or performance bottlenecks in a distributed system. Organizations should invest in instrumenting their distributed applications using open source standards such as OpenCensus and Zipkin. Using tools like Cloud Trace with a broad platform, language and environment support can help easily ingest data from any source — open instrumentation or proprietary agents.
Implement effective API monitoring in Apigee
Apigee’s API monitoring (based on metrics exposed by the internals of the system) capabilities work with your existing monitoring infrastructure to help reduce mean time to diagnosis and increase application resiliency. Specifically, operations teams can leverage
- Monitoring dashboards to gain in-depth insights into API availability and performance metrics.
- Debug to precisely diagnose with deeper insights into an API proxy flow without toggling multiple tools.
- Alerts and notifications to gather contextual Insights, facilitate customization, and grouping with first class integrations to various tools (like Slack, PagerDuty, email and support for webhooks).
- Best of Google technologies such as Data Flow, Pub/Sub, Stackdriver, Bigtable and BigQuery to handle massive volumes and complex metrics at scale
Using Apigee’s API monitoring will help you maintain high application resiliency with comprehensive controls to reduce mean time to diagnosis and resolution. Get started with Apigee today or explore Apigee’s API monitoring for free here. Check out our documentation for additional information on API monitoring.
By: Varun Krovvidi (Product Marketing Manager, Google Cloud) and Rachel Richardson (Head of Product Marketing, Business Application Platform, Google Cloud)
Source: Google Cloud Blog