This example service (with real details redacted) was chosen to illustrate a single general runbook covering a small (but very important!) collection of microservices.
I particularly like the clarity with which their upstreams and downstreams are described (a diagram would enhance this, of course), and the concise advice on responding to the various alerts (both the platform's built-in ones and the custom ones they've created).
I’ve deliberately left much of the alert detail in place to showcase some of the standard ones we’ve selected to provide to all services on our platform, regardless of what they do.
| Deployment  | Endpoint URL                            |
|-------------|-----------------------------------------|
| Stubs       | https://stubs.link/to/my/endpoint       |
| Integration | https://integration.link/to/my/endpoint |
| Prod        | https://prod.link/to/my/endpoint        |
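For a quick sanity check, the endpoints above can be smoke-tested from the command line. A minimal sketch (the URLs are the redacted placeholders from the table, and the expected status code is whatever the service considers healthy):

```bash
# Hit each deployment's endpoint and print the HTTP status it returns.
for url in \
  https://stubs.link/to/my/endpoint \
  https://integration.link/to/my/endpoint \
  https://prod.link/to/my/endpoint; do
  printf '%s -> ' "$url"
  curl -s -o /dev/null -w '%{http_code}\n' "$url"
done
```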
Client 1 is … and depends on this service to … It is business critical.
Client 2 is … and depends on this service to … It can tolerate downtime thanks to a cached copy of the data, but freshness becomes a concern after X minutes.
SERVICE_NAME depends on Content SaaS, a third-party CMS, for data about … Without it, this service cannot function.
SERVICE_NAME also depends on A.N.Other Service Z, which it relies on for any relevant product data for this content page. Without it, content is still returned, just missing this information.
Dashboard of Active Alerts - currently active alerts in Prometheus Alertmanager for this service.
Metrics including CPU, Memory, Response codes and Latency are available in Grafana.
Links to some dashboards here
Links to some custom dashboards here for their dependencies
Links to saved queries in Kibana here
Instructions on how to connect to the cluster and manually kill a faulty pod
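By way of illustration, those instructions typically boil down to something like the following sketch; the context, namespace, and label names here are placeholders rather than the team's real ones:

```bash
# Point kubectl at the right cluster and namespace (names are illustrative).
kubectl config use-context my-cluster
kubectl config set-context --current --namespace=application-1

# Find the faulty pod (look for CrashLoopBackOff, Error, or excessive restarts).
kubectl get pods -l app=ENV_NAME-MICROSERVICE_X

# Delete it; the Deployment's ReplicaSet will schedule a replacement.
kubectl delete pod ENV_NAME-MICROSERVICE_X-<pod-id>
```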
These sub-headings are linked to directly from the alerts that are sent to the team's Slack channel / PagerDuty notifications.
services.application-1.ENV_NAME-MICROSERVICE_X.all-pods-unavailable
kubectl describe deployment ENV_NAME-MICROSERVICE_X
to see why Kubernetes has not been able to bring the pods up. If the symptom persists, contact #team-x to investigate further.

platform.pod-restart-loop
Note: The downstream caching service has been tested running with only one or two pods without significantly impacting the customer experience. Pages returned by the service are cached with a 10-minute TTL. Even when the cache TTL has expired, it will still serve the cached pages for the next 24 hours if it cannot successfully fetch a fresh copy from MICROSERVICE_X.
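Before escalating, a restart loop can usually be narrowed down with a few commands; a minimal sketch, with the label selector and pod name as illustrative placeholders:

```bash
# Check restart counts and current pod state.
kubectl get pods -l app=ENV_NAME-MICROSERVICE_X

# Inspect events for the usual suspects: OOMKilled, failed probes, image pull errors.
kubectl describe pod ENV_NAME-MICROSERVICE_X-<pod-id>

# Logs from the previous (crashed) container instance are usually the most telling.
kubectl logs ENV_NAME-MICROSERVICE_X-<pod-id> --previous
```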
slo-application-1.ENV_NAME-MICROSERVICE_X.5xx-errors
slo-application-1.ENV_NAME-MICROSERVICE_X.zero-traffic
services.application-1.ENV_NAME-MICROSERVICE_X.high-latency
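When any of these three fire, a quick first check is the raw rate behind the SLO. A minimal sketch against the Prometheus HTTP API; the Prometheus address and the metric and label names are illustrative assumptions, not the platform's real ones:

```bash
# 5xx rate over the last 5 minutes (for zero-traffic, drop the code filter
# and check whether the overall request rate is at or near zero).
curl -sG 'http://prometheus.example.internal/api/v1/query' \
  --data-urlencode 'query=sum(rate(http_requests_total{job="ENV_NAME-MICROSERVICE_X",code=~"5.."}[5m]))'

# p99 latency over the last 5 minutes, for the high-latency alert.
curl -sG 'http://prometheus.example.internal/api/v1/query' \
  --data-urlencode 'query=histogram_quantile(0.99, sum(rate(http_request_duration_seconds_bucket{job="ENV_NAME-MICROSERVICE_X"}[5m])) by (le))'
```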
These are custom alerts that this team has created to warn of issues with their dependencies. A strong indicator of good operability!
services.application-1.ENV_NAME-MICROSERVICE_X.caas-5xx-errors
services.application-1.ENV_NAME-MICROSERVICE_X.product-variants-5xx-errors
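For anyone who hasn't written one of these before, here is a purely illustrative sketch of how a dependency 5xx alert along the lines of caas-5xx-errors might be defined and validated; the metric, labels, threshold, and simplified alert name are all assumptions, not this team's actual rule:

```bash
# Write an example Prometheus alerting rule and validate it with promtool.
cat > caas-5xx-errors.rules.yml <<'EOF'
groups:
  - name: microservice-x-dependency-alerts
    rules:
      - alert: CaasHigh5xxErrorRate  # simplified illustrative name
        expr: sum(rate(http_client_requests_total{service="MICROSERVICE_X", dependency="content-saas", code=~"5.."}[5m])) > 0.1
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Content SaaS is returning 5xx errors to MICROSERVICE_X"
EOF
promtool check rules caas-5xx-errors.rules.yml
```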