How to Investigate Live Issues

Most Application-2 issues come down to one of the following root causes:

  1. An upstream dependency is returning errors to us.
  2. The platform or Google Cloud Services are having issues, causing a platform-level outage.
  3. Client-side issues, e.g. errors generated by Monetate client-side tests.

To find the root cause, use the following debugging process:

  • Get visibility
  • Narrow the focus
  • Repeat until you spot the error!

First step - Read the Alert

Answer the following by reading the alert:

  1. What triggered the alert? Is it a client-side error from New Relic monitoring, a server-side error from the platform, or both?
  2. What does the error message say? Read it and understand what it is alerting on, e.g. low traffic or 40x/50x errors.
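
The first question can be sketched as a small routing helper. This is only an illustration: the function name and the source strings are hypothetical, and it assumes that any alert not raised by New Relic comes from the platform.

```python
def investigation_paths(alert_sources):
    """Map the monitoring systems that fired to the sections to follow.

    alert_sources is a list of free-text source names (hypothetical input;
    the real alert format is not described in this document).
    """
    paths = set()
    for source in alert_sources:
        normalised = source.lower().replace(" ", "")
        if normalised == "newrelic":
            # New Relic alerts are Real User Monitoring alerts from the browser.
            paths.add("Detailed Steps - Client Side")
        else:
            # Assumption: everything else is a server-side platform alert.
            paths.add("Detailed Steps - Server Side")
    return sorted(paths)
```

If both kinds of alert are firing, the helper returns both sections, which matches the "are you seeing both" case above.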

Detailed Steps - Client Side

  1. If the alert is triggered by New Relic, it is a Real User Monitoring alert for customers/bots in the browser. Drop into the New Relic alert to understand the errors being thrown client side in the browser.
  2. Investigate the error being thrown and debug the issue. Look at the browser type and headers, and try to replicate the error.
  3. Monetate issues can be spotted when the problem tags in the source HTML are wrapped in Monetate tags.
  4. Disable experiences in live to see whether that removes the error.
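
As a rough aid for steps 2 and 3, the sketch below requests a page with the browser headers seen in the alert and counts how often a marker string appears in the returned HTML. It uses only the Python standard library. The function names and the marker string are hypothetical, and the sketch only inspects the HTML as served, so it will not reproduce errors that occur when scripts run in the browser.

```python
import urllib.error
import urllib.request


def build_request(url, user_agent, extra_headers=None):
    """Build a GET request that mimics the browser reported in the alert."""
    headers = {"User-Agent": user_agent}
    headers.update(extra_headers or {})
    return urllib.request.Request(url, headers=headers)


def fetch_html(request, timeout=10):
    """Send the request and return (status code, body text)."""
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            status = response.status
            body = response.read()
    except urllib.error.HTTPError as error:
        # 4xx/5xx responses raise HTTPError; keep the status and body for inspection.
        status = error.code
        body = error.read()
    return status, body.decode("utf-8", errors="replace")


def count_marker(html, marker):
    """Count case-insensitive occurrences of a marker string in the HTML.

    The marker is an assumption: pass whatever identifies Monetate-wrapped
    markup on the page being investigated.
    """
    return html.lower().count(marker.lower())
```

A typical use would be to build the request with the user agent copied from the New Relic error, call fetch_html on it, and then pass the body to count_marker to see whether the failing markup sits inside Monetate-wrapped content.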

Detailed Steps - Server Side

  1. Start with the Grafana Golden Signals Dashboard to get a general view of server application status:

    • Are we receiving traffic? Is traffic spiking up?
    • What errors is the app encountering?
    • How stressed are the pods - what is happening with CPU/Memory/Pod count?
  2. Then drop into the Kibana DOWNSTREAM_X logs and our own logs to debug the alert.

  3. Identify the source of the errors by filtering the logs on the status field log.wstatus. Is the error message related to a specific dependency, e.g. product API or basket?

  4. If the error is for an upstream API, look up that service on the dependencies page to find its dashboard and svc channel, then check both to see whether the service is having issues.

  5. If the error is only occurring in Application-2, the next action is to look at infrastructure and code health to understand whether code needs to be reverted. See release and rollback for how to roll back code.

  6. Report your findings in the incident channel or the team channel as you learn more, so that everyone working on the incident can see new information.
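
Step 3 can also be done offline on an export of log records. The sketch below counts error responses per dependency so that the noisiest one stands out. It assumes that log.wstatus holds an HTTP status code and that each record carries a dependency field; the dependency field name and the function name are hypothetical.

```python
from collections import Counter


def count_errors_by_dependency(records,
                               status_field="log.wstatus",
                               dependency_field="dependency"):
    """Return (dependency, error count) pairs, most frequent first.

    records is a list of dicts, e.g. parsed from a JSON log export.
    Records without a numeric status are skipped.
    """
    counts = Counter()
    for record in records:
        try:
            status = int(record.get(status_field))
        except (TypeError, ValueError):
            continue  # missing or non-numeric status
        if status >= 400:
            counts[record.get(dependency_field, "unknown")] += 1
    return counts.most_common()
```

If one dependency dominates the result, that points towards step 4. If errors are spread evenly or have no dependency attached, step 5 is the more likely path.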