Closed Bug 1859614 Opened 2 years ago Closed 9 months ago

Suggestions for metric incident investigation playbook

Categories

(Data Platform and Tools :: Glean: SDK, enhancement)

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: travis_, Assigned: travis_)

Details

This bug is aimed mainly at Frank and the work on this doc: https://docs.google.com/document/d/1tmI7PyHR1DGjbyTU8J1hAMq0lwngm6cV9c6E8SppCFw/edit

I had some additional (and some overlapping) suggestions that I have found useful, along with some explanations of what each split might tell us if we find anything:

Countries (e.g. China, Iran?)

  • If the anomaly is concentrated in one country, is there a national holiday or something similar going on?
  • Is this an area known for bots or unusual activity (Malaysia, China, Ireland, etc.)?

ISP (e.g. BrowserStack?)

  • This is typically finer-grained than country, and an anomaly coming from a single ISP is stronger evidence of potential bots or automation.
  • There are a lot of ISPs, so the query might need a HAVING clause to filter out the smaller ones.
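
As a rough illustration of the HAVING filter, here is a minimal sketch using Python's built-in sqlite3 module with an in-memory table so it runs anywhere. The table name `pings`, the columns `isp` and `client_id`, the sample rows, and the `MIN_PINGS` cutoff are all hypothetical and do not reflect the real telemetry schema.

```python
import sqlite3

# Hypothetical sample rows: (isp, client_id). Real data would come from the
# ingestion tables, whose names and columns are not shown here.
ROWS = [
    ("BigTelecom", "a"), ("BigTelecom", "b"), ("BigTelecom", "c"),
    ("BrowserStack", "d"), ("BrowserStack", "d"),
    ("BrowserStack", "d"), ("BrowserStack", "d"),
    ("TinyISP", "e"),
]

MIN_PINGS = 3  # assumed cutoff for "small" ISPs; tune to the data volume


def pings_per_isp(rows, min_pings=MIN_PINGS):
    """Return (isp, ping_count, client_count) for ISPs with enough pings."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute("CREATE TABLE pings (isp TEXT, client_id TEXT)")
        conn.executemany("INSERT INTO pings VALUES (?, ?)", rows)
        query = """
            SELECT isp,
                   COUNT(*) AS ping_count,
                   COUNT(DISTINCT client_id) AS client_count
            FROM pings
            GROUP BY isp
            HAVING COUNT(*) >= ?
            ORDER BY ping_count DESC, isp
        """
        return conn.execute(query, (min_pings,)).fetchall()
    finally:
        conn.close()
```

Comparing `ping_count` against `client_count` per ISP may help here: many pings from very few clients on a single ISP is the sort of pattern that would point toward automation rather than organic usage.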

Product Version/build-id

  • Did this start in a specific product version? If so, what changed in that version (work with the product team to answer this question)?
  • Is this build-id a known Mozilla build-id? If not, it could be a clone/fork or a sideloaded build.
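
One possible way to check for unknown build-ids is to compare the values seen in pings against an allowlist of released builds. The sketch below assumes such an allowlist exists; `KNOWN_BUILD_IDS` and its values are made up for illustration, and where the real list would come from is a question for the product team.

```python
from collections import Counter

# Hypothetical allowlist; in practice this would be loaded from whatever
# source of released build-ids the product team maintains.
KNOWN_BUILD_IDS = {"20231001120000", "20231015093000"}


def unknown_build_counts(ping_build_ids, known=KNOWN_BUILD_IDS):
    """Count pings per build-id that is not in the known set.

    Returns (build_id, count) pairs, most frequent first.
    """
    counts = Counter(b for b in ping_build_ids if b not in known)
    return counts.most_common()
```

If a single unknown build-id accounts for most of the anomalous pings, that would be consistent with a fork or sideloaded build rather than a regression in an official release.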

Glean SDK version

  • Did this start in a new Glean version? If so, what changed in that version (work with the Glean team to answer this question)?

Other library version changes?

  • Check Application Services updates, Gecko updates, etc. to see if this can be tied to a specific version change there. Glean relies on things like viaduct and rkv, and a regression in either can affect data collection.

OS-SDK version (Android SDK, iOS targets, etc.)

  • Something may have changed in the platform SDK that is affecting data collection.
  • This typically shows up either as a change in platform lifecycle event behavior (see 0-duration pings, etc.) or in background task work (uploading pings).

Time difference between start/end_time and submission_timestamp.

  • Do the timestamps we record appear reasonable, both for the ping time window and for the delay from collection/submission to receiving the ping in ingestion?
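
A minimal sketch of this check follows. It assumes the three timestamps have already been parsed into timezone-aware `datetime` values; the flag names and the `MAX_WINDOW` and `MAX_DELAY` thresholds are arbitrary placeholders, not established cutoffs.

```python
from datetime import datetime, timedelta, timezone

MAX_WINDOW = timedelta(days=1)  # assumed upper bound for a ping's time window
MAX_DELAY = timedelta(days=7)   # assumed upper bound for the upload delay


def timestamp_flags(start_time, end_time, submission_timestamp):
    """Return a list of reasons the ping's timestamps look suspicious."""
    flags = []
    window = end_time - start_time
    delay = submission_timestamp - end_time

    # Checks on the ping's own time window.
    if window < timedelta(0):
        flags.append("end_before_start")
    elif window == timedelta(0):
        flags.append("zero_duration")
    elif window > MAX_WINDOW:
        flags.append("long_window")

    # Checks on the gap between the end of the window and ingestion.
    if delay < timedelta(0):
        flags.append("submitted_before_end")
    elif delay > MAX_DELAY:
        flags.append("late_submission")
    return flags
```

For example, `timestamp_flags(t0, t0, t0 + timedelta(days=10))` would return `["zero_duration", "late_submission"]`. Counting pings per flag, split by the other dimensions in this list, may show whether skewed timestamps cluster in one version or platform.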

What Glean errors are there?

  • Are there any networking or other telemetry errors that might be indicative of the issue? This could point to an ingestion issue, etc.

Hardware manufacturer/version/etc.

  • Does this only happen on older/newer hardware?

Component: Glean Platform → Glean: SDK
Assignee: nobody → tlong
Status: NEW → RESOLVED
Closed: 9 months ago
Resolution: --- → FIXED