1129185 - Reporting to make sure we don't have broken or incomplete session fragment chains (FHR)

Reporter

Description

•

10 years ago

Quoting bcolloran: I will also want metrics that monitor the data *sent* by clients. In particular, in https://bugzilla.mozilla.org/show_bug.cgi?id=1120982 we put a number of data elements in the main ping session fragment payload that are designed to allow us to reassemble the complete sequence of session fragments from a client. We'll need monitoring to make sure that there are no broken and/or incomplete session fragment chains. This will have to be downstream from the Heka intake pipeline, but will be absolutely critical: it is the main way we'll have of detecting dropped/omitted submissions from clients.

Katie Parlante

Reporter

Comment 1

•

10 years ago

Comments from bug triage:
- Requires a by-client API.
- If the client sends the records in order (time, fragment order), that might help with this report.
- Not really a "monitor" (not real time): it may be several days before we can observe the problem.

Benjamin Smedberg

Comment 2

•

10 years ago

I will add to this: monitoring duplicate uploads.
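As a rough illustration of the duplicate-upload check, here is a minimal sketch. It assumes each submission carries a unique document identifier under a key named "id"; that key name and the in-memory list of pings are assumptions made for this example, not details taken from this bug.

```python
from collections import Counter


def count_duplicate_uploads(pings):
    """Return {document_id: extra_copies} for ids submitted more than once.

    "id" is an assumed key name for a per-submission document identifier.
    """
    seen = Counter(p["id"] for p in pings)
    return {doc_id: n - 1 for doc_id, n in seen.items() if n > 1}


# Three copies of "a" means two redundant uploads.
pings = [{"id": "a"}, {"id": "b"}, {"id": "a"}, {"id": "a"}]
duplicates = count_duplicate_uploads(pings)
```

At pipeline scale an exact counter like this would likely be too large to hold in memory, so a real check would probably need a sampled or approximate structure.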

brendan c

Comment 3

•

10 years ago

I'll also add that this check (which I consider probably the most important data quality check, and which should precede the checks in https://bugzilla.mozilla.org/show_bug.cgi?id=1134661) should look not just for incomplete subsession chains, but also for branching trees of subsessions. We should do this check for as many clientIds as practical, since these are likely to be rare problems (as opposed to the smaller samples needed for the comparisons in bug 1134661). We should count the prevalence of broken chains and branches, and set aside these sets of pings. Branching should be rare but is expected to exist. If it is common, it may indicate a problem with how sessionId pointers are being stored and submitted. Missing subsessions should really not exist, perhaps modulo an occasional subsession toward the end of a data series that needs another opportunity to be resubmitted. If this occurs frequently, we may need to closely examine the submission/acknowledgement/resubmission model.
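The check described above could be sketched per client roughly as follows. This is only an illustration under stated assumptions: each fragment is represented as a dict with the hypothetical keys "subsessionId" and "previousSubsessionId" (None for the first fragment), and all of one client's fragments fit in memory. It is not the implementation that was eventually reviewed in this bug.

```python
from collections import defaultdict


def classify_client(fragments):
    """Report duplicates, missing predecessors, and branch points for one client.

    Key names are assumptions for this sketch, not confirmed field names.
    """
    by_id = {}
    duplicates = []
    for frag in fragments:
        sid = frag["subsessionId"]
        if sid in by_id:
            duplicates.append(sid)  # same fragment submitted again
        else:
            by_id[sid] = frag

    children = defaultdict(list)
    missing = set()
    for sid, frag in by_id.items():
        prev = frag["previousSubsessionId"]
        if prev is None:
            continue  # start of a chain
        if prev not in by_id:
            missing.add(prev)  # broken chain: predecessor never arrived
        children[prev].append(sid)

    # A branch is a fragment that more than one successor points back to.
    branches = sorted(p for p, kids in children.items() if len(kids) > 1)
    return {
        "duplicates": duplicates,
        "missing": sorted(missing),
        "branches": branches,
    }


fragments = [
    {"subsessionId": "a", "previousSubsessionId": None},
    {"subsessionId": "b", "previousSubsessionId": "a"},
    {"subsessionId": "c", "previousSubsessionId": "a"},  # branch at "a"
    {"subsessionId": "e", "previousSubsessionId": "d"},  # "d" is missing
    {"subsessionId": "b", "previousSubsessionId": "a"},  # duplicate of "b"
]
result = classify_client(fragments)
```

Counting how many clients have a non-empty "missing" or "branches" list would give the prevalence figures, and those clients' pings can be set aside for closer inspection.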

Mark Reid [:mreid]

Updated

•

10 years ago

Priority: -- → P2

Mark Reid [:mreid]

Updated

•

10 years ago

Assignee: nobody → mtrinkala

Mark Reid [:mreid]

Updated

•

10 years ago

Priority: P2 → P3

brendan c

Comment 4

•

10 years ago

Fwiw regarding prioritizing v4 sanity checking, I consider this to be really critical. Doesn't have to happen today as P1, but does have to happen before Metrics will give a thumbs up to v4, so probably good to have some maneuvering room between having this check and needing to make that call.

Mike Trinkala [:trink]

Assignee

Comment 5

•

10 years ago

I am not actively working on this, so I am taking my name off of it.

Assignee: mtrinkala → nobody

brendan c

Updated

•

10 years ago

Depends on: 1157408

Katie Parlante

Reporter

Updated

•

10 years ago

Summary: Monitoring to make sure we don't have broken or incomplete session fragment chains (FHR) → Reporting to make sure we don't have broken or incomplete session fragment chains (FHR)

brendan c

Comment 6

•

10 years ago

Now that we have substantial evidence that subsession collection/reporting/recording/something is actually a problem [1,2,3], I want to mention again that I believe this should be a relatively high priority for automated and ongoing monitoring. Until the subsession graphs look the way we expect them to (connected chains with no branches) in almost all cases, we will have low confidence in v4 data. Getting this right is foundational to our trust in the system, and if at any time in the future these graphs start behaving differently, we'll definitely want to know about it. Also, because of what has been found in [1], the step I mention in comment 3 is quite important: we will need to count the prevalence of session graph branching as well as breaks in the session graph. Of course, I'm happy to collaborate on that effort and can certainly help out in terms of defining filters on that data and similar tasks.
[1] https://bugzilla.mozilla.org/show_bug.cgi?id=1157408
[2] https://bugzilla.mozilla.org/show_bug.cgi?id=1157359
[3] https://bugzilla.mozilla.org/show_bug.cgi?id=1154113

Katie Parlante

Reporter

Comment 7

•

10 years ago

I forgot to bring this up at our triage, but re-prioritizing.

Priority: P3 → P2

Roberto Agostino Vitillo (:rvitillo)

Comment 8

•

10 years ago

More issues have been spotted in [1]. I feel that having an up-to-date dashboard that shows the number/frequency of invalid chains, duplicate fragments, etc. should have higher priority than any work that requires interpreting the data.
[1] https://bugzilla.mozilla.org/show_bug.cgi?id=1159297

Mark Reid [:mreid]

Updated

•

10 years ago

Assignee: nobody → mtrinkala

Priority: P2 → P1

Mike Trinkala [:trink]

Assignee

Updated

•

10 years ago

Depends on: 1164872

Mike Trinkala [:trink]

Assignee

Comment 9

•

10 years ago

This will not run as a report due to 1164872. We could run it in real time, but it is a memory hog (1-2GB to track 128MM sessions). Also, this implementation is not blocked on 1157408: previousSubsessionId is not needed or used (although it may be useful for GC of sessions that did not report completion; we are currently collecting numbers to see if/how often that occurs).
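To put the memory figure in perspective, 1-2GB for 128 million sessions works out to roughly 8-16 bytes of state per session. The sketch below shows that arithmetic and one way a check could work without previousSubsessionId: keeping a per-session bitmask of the subsession counters seen so far. The counter field (assumed to start at 1), the class name, and the bitmask approach are all assumptions for illustration; this is not the code from the pull request in this bug.

```python
def bytes_per_session(total_bytes, sessions):
    """Average tracking state available per session."""
    return total_bytes / sessions


low = bytes_per_session(1 * 10**9, 128_000_000)   # 7.8125
high = bytes_per_session(2 * 10**9, 128_000_000)  # 15.625


class SessionTracker:
    """Hypothetical tracker: one integer bitmask per session id."""

    def __init__(self):
        self._seen = {}  # session id -> bitmask; bit (n - 1) set when counter n seen
        self.duplicates = 0

    def add(self, session_id, counter):
        bit = 1 << (counter - 1)
        mask = self._seen.get(session_id, 0)
        if mask & bit:
            self.duplicates += 1  # this counter was already reported
        self._seen[session_id] = mask | bit

    def has_gap(self, session_id):
        mask = self._seen[session_id]
        # Gap-free means bits 0..k-1 are all set, so mask + 1 is a power of two.
        return mask & (mask + 1) != 0


tracker = SessionTracker()
for session_id, counter in [("s1", 1), ("s1", 2), ("s1", 2), ("s2", 1), ("s2", 3)]:
    tracker.add(session_id, counter)
```

A Python dict uses far more than 8-16 bytes per entry, so this only demonstrates the logic; reaching the quoted footprint would need a compact native representation.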

Mike Trinkala [:trink]

Assignee

Comment 10

•

10 years ago

r? https://github.com/mozilla-services/data-pipeline/pull/71

Flags: needinfo?(mreid)

Mark Reid [:mreid]

Comment 11

•

10 years ago

r+

Flags: needinfo?(mreid)

Mike Trinkala [:trink]

Assignee

Updated

•

10 years ago

Status: NEW → RESOLVED

Closed: 10 years ago

Resolution: --- → FIXED

Georg Fritzsche [:gfritzsche]

Comment 12

•

10 years ago

Is the data available somewhere as a monitor or do we get regular reports for this?

Flags: needinfo?(mtrinkala)

Mike Trinkala [:trink]

Assignee

Comment 13

•

10 years ago

To run it efficiently as a report, it is still blocked on 1164872 (although currently we should be able to get away with running a single S3 worker). To run it as a monitor, we would need to set up an instance in production (most likely a dedicated one). There was also talk of using a curated data stream, like the executive summary stream, to reduce the I/O to the box. This would require the executive summary stream to be modified and deployed to production. New bugs should be filed with the requirements, including the filters. For example, based on recent emails it looks like some nightly-* submissions should be ignored, which may affect the way we normalize channels in the curated stream.

Mike Trinkala [:trink]

Assignee

Updated

•

10 years ago

Flags: needinfo?(mtrinkala)

BMO Automation

Updated

•

6 years ago

Product: Cloud Services → Cloud Services Graveyard

Categories

(Cloud Services Graveyard :: Metrics: Pipeline, defect, P1)

Tracking

(Not tracked)

People

(Reporter: kparlante, Assigned: trink)

References

Details

Crash Data

Security

(public)

User Story
