How to Investigate Live Issues
The majority of Application-2 issues have one of the following root causes:
- An upstream dependency is returning errors to us.
- The platform or Google Cloud Services are having issues and causing a platform level outage.
- Client side issues, e.g. errors generated by Monetate client side tests.
To find the root cause, use the following debugging process:
- Get visibility
- Narrow the focus
- Repeat until you spot the error!
First step - Read the Alert
Answer the following by reading the alert:
- What triggered the alert? Is it a client side error from New Relic monitoring, a server side error from the platform, or both?
- What does the error message say? Read it and understand what it is alerting on, e.g. low traffic or 4xx/5xx errors.
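The first question above decides which set of detailed steps to follow. A minimal sketch of that decision, assuming a hypothetical `Alert` shape and hypothetical source labels `"newrelic"` and `"platform"` (real alert payloads will differ):

```python
from dataclasses import dataclass


@dataclass
class Alert:
    # Hypothetical shape, for illustration only.
    source: str   # assumed labels: "newrelic" (client side) or "platform" (server side)
    message: str


def triage(alerts):
    """Return which detailed steps to follow for the given alerts."""
    sources = {alert.source for alert in alerts}
    client = "newrelic" in sources
    server = "platform" in sources
    if client and server:
        return "both"
    if client:
        return "client side"
    if server:
        return "server side"
    return "unknown"
```

If both sources are firing, work through the client side and server side steps in parallel.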
Detailed Steps - Client Side
- If the alert was triggered by New Relic, it is a Real User Monitoring alert for customers or bots in the browser. Open the New Relic alert to understand the errors being thrown client side in the browser.
- Investigate the error being thrown and debug the issue. Look at the browser type and headers, and try to replicate the error.
- Monetate issues can be spotted when the problem tags in the source HTML are wrapped in Monetate tags.
- Disable experiences in live to see whether that removes the error.
Detailed Steps - Server Side
- Start with the Grafana Golden Signals Dashboard to get a general view of server application status:
  - Are we receiving traffic? Is traffic spiking up?
  - What errors is the app encountering?
  - How stressed are the pods? What is happening with CPU, memory and pod count?
- Then drop into the Kibana DOWNSTREAM_X logs and our own logs to debug the alert.
- Identify the source of the errors by filtering the logs by status (log.wstatus). Is the error message related to a specific dependency, e.g. the product API or basket?
- If the error is for an upstream API, check the dependencies page for details of its dashboard and svc channel, then check both to see whether that service is having issues.
- If the error is only occurring in Application-2, the next action is to look at infrastructure and code health to understand whether code needs to be reverted. See release and rollback for how to roll back code.
- Report your findings to the incident channel or the team channel as you learn more, so that everyone working on the incident can see any new information.
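The log filtering step above can be sketched offline as a count of error records per dependency. This is an illustration only: the field name `log.wstatus` comes from the step above, while the `dependency` field, the sample values, and the function name `errors_by_dependency` are hypothetical.

```python
from collections import Counter

# Hypothetical log records; only the field name "log.wstatus" comes from the
# steps above. The "dependency" field and all values are made up.
records = [
    {"log.wstatus": 200, "dependency": "product-api"},
    {"log.wstatus": 503, "dependency": "basket"},
    {"log.wstatus": 503, "dependency": "basket"},
    {"log.wstatus": 500, "dependency": "product-api"},
    {"log.wstatus": 404, "dependency": None},
]


def errors_by_dependency(records, min_status=500):
    """Count records whose status is at or above min_status, per dependency."""
    counts = Counter()
    for record in records:
        if record["log.wstatus"] >= min_status:
            # Records without a dependency are attributed to the app itself.
            counts[record["dependency"] or "application"] += 1
    return counts


summary = errors_by_dependency(records)
for name, count in summary.most_common():
    print(f"{name}: {count}")
```

In this sample data most 5xx errors come from one dependency, which would point the investigation at that service's dashboard and svc channel. If the errors are mostly attributed to the application itself, move on to infrastructure and code health.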