Closed Bug 1129185 Opened 9 years ago Closed 9 years ago

Reporting to make sure we don't have broken or incomplete session fragment chains (FHR)

Categories

(Cloud Services Graveyard :: Metrics: Pipeline, defect, P1)

x86
macOS
defect

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: kparlante, Assigned: trink)

quoting bcolloran:

I will also want metrics that monitor the data *sent* by clients; in
particular, in https://bugzilla.mozilla.org/show_bug.cgi?id=1120982 we
put a number of data elements in the main ping session fragment
payload that are designed to allow us to reassemble the complete
sequence of session fragments from a client. We'll need monitoring to
make sure that there are no broken and/or incomplete session fragment
chains. This will have to be downstream from the Heka intake pipeline,
but will be absolutely critical-- it is the main way we'll have of
detecting dropped/omitted submissions from clients.
Comments from bug triage:
- Requires by-client API.
- If the client sends the records in order (time, fragment order), that might help with this report.
- Not really a "monitor" (not real time); it may be several days before we can observe the problem.
I will add to this: monitoring duplicate uploads.
I'll also add that this check (which I consider probably the most important data quality check, and which should precede the checks in https://bugzilla.mozilla.org/show_bug.cgi?id=1134661) should look not just for incomplete subsession chains, but also for branching trees of subsessions.

We should do this check for as many clientIds as practical since these are likely to be rare problems (as opposed to the smaller samples needed for the comparisons in bug 1134661). We should count the prevalence of broken chains and branches, and set aside these sets of pings.

Branching should be rare but is expected to exist. If it is common, it may indicate a problem with how sessionId pointers are being stored and submitted.

Missing subsessions should really not exist, perhaps modulo an occasional subsession toward the end of a data series that needs another opportunity to be resubmitted. If this occurs frequently we may need to closely examine the submission/acknowledgement/resubmission model.
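
To make the three conditions above concrete (duplicates, branches, and missing subsessions), here is a minimal sketch of a per-client check. It is an illustration only, not the check implemented for this bug. It assumes each ping has already been reduced to a dict with the keys `subsessionId` and `previousSubsessionId` (None for the first subsession); that shape and the function name `check_subsession_graph` are hypothetical.

```python
from collections import Counter, defaultdict


def check_subsession_graph(pings):
    """Classify one client's pings (hypothetical dict shape, see above)."""
    # A subsessionId seen more than once is a duplicate upload.
    id_counts = Counter(p["subsessionId"] for p in pings)
    duplicates = sorted(i for i, n in id_counts.items() if n > 1)

    # De-duplicate first so a resubmitted ping is not also counted as a branch.
    unique = {p["subsessionId"]: p["previousSubsessionId"] for p in pings}

    # Map each "previous" pointer to the subsessions that claim to follow it.
    children = defaultdict(list)
    for sub_id, prev_id in unique.items():
        if prev_id is not None:
            children[prev_id].append(sub_id)

    # Branch: more than one subsession follows the same predecessor.
    branches = sorted(prev for prev, kids in children.items() if len(kids) > 1)
    # Missing: a predecessor is referenced but was never received.
    missing = sorted(prev for prev in children if prev not in unique)

    return {"duplicates": duplicates, "branches": branches, "missing": missing}
```

Run over pings grouped by clientId, the sizes of the three lists give the prevalence counts, and any client with a non-empty list can be set aside for closer inspection.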
Priority: -- → P2
Assignee: nobody → mtrinkala
Priority: P2 → P3
Fwiw regarding prioritizing v4 sanity checking, I consider this to be really critical. Doesn't have to happen today as P1, but does have to happen before Metrics will give a thumbs up to v4, so probably good to have some maneuvering room between having this check and needing to make that call.
I am not actively working on this, so I am taking my name off of it.
Assignee: mtrinkala → nobody
Depends on: 1157408
Summary: Monitoring to make sure we don't have broken or incomplete session fragment chains (FHR) → Reporting to make sure we don't have broken or incomplete session fragment chains (FHR)
Now that we have substantial evidence that subsession collection/reporting/recording/something is actually a problem [1,2,3], I want to mention again that I believe this should be a relatively high priority for automated and ongoing monitoring. Until the subsession graphs look the way we expect them to (connected chains with no branches) in almost all cases, we will have low confidence in v4 data. Getting this right is foundational to our trust in the system, and if at any time in the future these graphs start behaving differently, we'll definitely want to know about it.

Also because of what has been found in [1], the step I mention in comment 3 is quite important: we will need to count the prevalence of session graph branching as well as breaks in the session graph.

Of course, I'm happy to collaborate on that effort and can certainly help out in terms of defining filters on that data and stuff like that.

[1] https://bugzilla.mozilla.org/show_bug.cgi?id=1157408
[2] https://bugzilla.mozilla.org/show_bug.cgi?id=1157359
[3] https://bugzilla.mozilla.org/show_bug.cgi?id=1154113
I forgot to bring this up at our triage, but re-prioritizing.
Priority: P3 → P2
More issues have been spotted in [1]. I feel that having an updated dashboard that shows the number/frequency of invalid chains, duplicate fragments, etc. should have higher priority than any work that requires interpreting the data.

[1] https://bugzilla.mozilla.org/show_bug.cgi?id=1159297
Assignee: nobody → mtrinkala
Priority: P2 → P1
Depends on: 1164872
This will not run as a report due to 1164872. We could run it in real time, but it is a memory hog (1-2GB to track 128MM sessions). Also, this implementation is not blocked on 1157408: previousSubsessionId is not needed or used (although it may be useful for GC of sessions that did not report completion; we are currently collecting numbers to see if/how often that occurs).
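
For a sense of scale, the memory figure above works out to roughly 8 to 16 bytes of state per tracked session. The back-of-envelope sketch below shows the arithmetic; it assumes binary units (1GB read as 2^30 bytes, "128MM" read as 128 * 2^20 sessions), which is an interpretation, not something stated in this bug. The helper name is hypothetical.

```python
GIB = 2**30            # assumption: "GB" read as 2^30 bytes
SESSIONS = 128 * 2**20  # assumption: "128MM" read as 128 * 2^20 sessions


def bytes_per_session(total_bytes, sessions):
    """Average bytes of tracking state available per session."""
    return total_bytes / sessions


print(bytes_per_session(1 * GIB, SESSIONS))  # 8.0
print(bytes_per_session(2 * GIB, SESSIONS))  # 16.0
```

With decimal units the low end comes out slightly under 8 bytes, so either way the budget is only a few machine words per session.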
r+
Flags: needinfo?(mreid)
Status: NEW → RESOLVED
Closed: 9 years ago
Resolution: --- → FIXED
Is the data available somewhere as a monitor or do we get regular reports for this?
Flags: needinfo?(mtrinkala)
To run it efficiently as a report it is still blocked on 1164872 (although currently we should be able to get away with running a single S3 worker). To run it as a monitor we would need to set up an instance in production (dedicated, most likely). There was also talk of using a curated data stream, like the executive summary stream, to reduce the I/O to the box. This would require the executive summary stream to be modified and deployed to production. New bugs should be filed with the requirements, including the filters; for example, based on recent emails it looks like some nightly-* submissions should be ignored, which may impact the way we normalize channels in the curated stream.
Flags: needinfo?(mtrinkala)
Product: Cloud Services → Cloud Services Graveyard