Closed Bug 1243523 Opened 8 years ago Closed 8 years ago

Define a schema for main pings for validation purposes

Categories

(Cloud Services Graveyard :: Metrics: Pipeline, defect, P2)

defect

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: rvitillo, Assigned: mreid)

References

Details

(Whiteboard: [loasis])

Attachments

(1 file)

Validating incoming telemetry payloads will ease the burden on the analysts, no matter the dataset they are using, and allow us to quantify globally the number of submissions or individual metrics that didn't pass the validation criteria in a certain time frame.

"json-schema" is our validation schema format of choice for this Bug as it is more expressive (e.g. supports regexps for string validation) than others we have evaluated.
Attached file main.schema.json
Here's v1
Blocks: 1243524
Whiteboard: [loasis]
Roberto, could you point me at the Scala code you are using to validate pings so I can see which fields have been giving you trouble? Right now I am relaxing the schemas for common problems such as fields which the documentation says are required that don't appear, but we will want to incorporate what you have learned.

Tested 300 declarations, mostly in environment, against 300K pings. About 1% have absurd values in creationDate or resetDate, so I removed bounds checking for now. Uncovered some documentation bugs.

Messy work-in-progress is here:
 
 https://github.com/SamPenrose/data-pipeline/tree/v4-baseset-schema/schemas/test

specifically {main, application, payload-main, environment}.schema.json . Except for main the schemas are inside a wrapper object with the name of the top-level field. I recommend waiting until next week to work with these files. Feedback welcome regarding how to structure them.
Flags: needinfo?(rvitillo)
(In reply to Sam Penrose from comment #2)
> Roberto, could you point me at the Scala code you are using to validate
> pings so I can see which fields have been giving you trouble? Right now I am
> relaxing the schemas for common problems such as fields which the
> documentation says are required that don't appear, but we will want to
> incorporate what you have learned.

The code [1] that powers the aggregates of t.m.o is probably more interesting for your needs. As you can see it has plenty of checks to ensure histograms are in the right format.

> Tested 300 declarations, mostly in environment, against 300K pings. About 1%
> have absurd values in creationDate or resetDate, so I removed bounds
> checking for now. Uncovered some documentation bugs.
> 
> Messy work-in-progress is here:
>  
>  https://github.com/SamPenrose/data-pipeline/tree/v4-baseset-schema/schemas/
> test
> 
> specifically {main, application, payload-main, environment}.schema.json .
> Except for main the schemas are inside a wrapper object with the name of the
> top-level field. I recommend waiting until next week to work with these
> files. Feedback welcome regarding how to structure them.

This looks very promising! I am going to try to base part of the environment schema of the longitudinal dataset on your schema.

[1] https://github.com/vitillo/python_mozaggregator/blob/master/mozaggregator/aggregator.py
Flags: needinfo?(rvitillo)
This notebook shows how to test a single-file JSON-Schema for main pings:

    https://github.com/SamPenrose/mozilla-pipeline-schemas/blob/master/validation/main-ping-validation-with-jsonschema.ipynb

The schema is at:

    https://github.com/SamPenrose/mozilla-pipeline-schemas/blob/master/telemetry/main.schema.json

You can see notes on how I relaxed the schema to deal with real-world ping variability in the comments for:

    https://bugzilla.mozilla.org/show_bug.cgi?id=1243893
Status: NEW → RESOLVED
Closed: 8 years ago
Resolution: --- → FIXED
I gave the schema a quick look and I don't see how this schema validates the whole ping. For instance, "childPayloads" and "threadHangStats" don't seem to have a proper schema, am I correct?
Flags: needinfo?(spenrose)
(In reply to Roberto Agostino Vitillo (:rvitillo) from comment #5)
> I gave the schema a quick look and I don't see how this schema validates the
> whole ping. For instance, "childPayloads" and "threadHangStats" don't seem
> to have a proper schema, am I correct?

This schema contains what's documented in toolkit/components/telemetry/ , cross-checked against real data. If there is documentation elsewhere for more fields I am happy to extend the schema. We can also make bugs for individual fields and base a schema on data based on the importance of those fields.

The challenge is that JSON-Schema lets you completely lock down a document (too strict for our data) OR it allows unbounded untyped expansion within containers, and the libraries I have been using don't let you say "tell me which parts of the object were not tested by the schema."

The back-and-forth between tightening what the client sends, tracking ping variability at scale, and evolving the schemas (there will be more than ten eventually) will be a large ongoing project.
Flags: needinfo?(spenrose)
threadHangStats is here: https://dxr.mozilla.org/mozilla-central/source/toolkit/components/telemetry/docs/main-ping.rst#179

It's probably a good idea to file individual bugs for individual sections and have one meta bug to track the ongoing work.

Sam, could you file a bug for all missing sections and make them blockers for this bug?

Georg, could you add some more documentation for childPayloads and more generally for any other field that hasn't yet been documented?
Status: RESOLVED → REOPENED
Flags: needinfo?(gfritzsche)
Resolution: FIXED → ---
(In reply to Roberto Agostino Vitillo (:rvitillo) from comment #7)
> threadHangStats is here:
> https://dxr.mozilla.org/mozilla-central/source/toolkit/components/telemetry/
> docs/main-ping.rst#179
> 
> It's probably a good idea to file individual bugs for individual sections
> and have one meta bug to track the ongoing work.
> 
> Sam, could you file a bug for all missing sections and make them blockers
> for this bug?
> 
> Georg, could you add some more documentation for childPayloads and more
> generally for any other field that hasn't yet been documented?

I added and quickly tested threadHangStats and childPayloads; currently in a branch:

https://github.com/SamPenrose/mozilla-pipeline-schemas/tree/threadHangs
Thanks, the missing fields now seem to be:
- addonDetails
- chromeHangs
- fileIOReports
- info
- lateWrites
- log
- simpleMeasurements
- slowSQL
- slowSQLStartup
- UIMeasurements
- webrtc
(In reply to Roberto Agostino Vitillo (:rvitillo) from comment #7)
> Georg, could you add some more documentation for childPayloads and more
> generally for any other field that hasn't yet been documented?

Sorry for the lag.
Documentation was completed in bug 1244182, further issues are being fixed in bug 1243893.
We have bug 1246644 open on more details for the childPayloads.

Does that cover it?
Flags: needinfo?(gfritzsche)
(In reply to Roberto Agostino Vitillo (:rvitillo) from comment #9)
> Thanks, the missing fields now seem to be:
> - addonDetails
> - chromeHangs
> - fileIOReports
> - info
> - lateWrites
> - log
> - simpleMeasurements
> - slowSQL
> - slowSQLStartup
> - UIMeasurements
> - webrtc

I have added some coverage for fileIOReports, info, log, slowSQL/Startup, and webrtc in a PR [1], which also includes details for threadHangs and childPayloads. For the other fields I could use documentation. Georg would you like me to file a new bug or comment in [2]?

[1] https://github.com/mozilla-services/mozilla-pipeline-schemas/compare/master...SamPenrose:threadHangs?expand=1
[2] https://bugzilla.mozilla.org/show_bug.cgi?id=1243893
Flags: needinfo?(gfritzsche)
(In reply to Sam Penrose from comment #11)
> I have added some coverage for fileIOReports, info, log, slowSQL/Startup,
> and webrtc in a PR [1], which also includes details for threadHangs and
> childPayloads. For the other fields I could use documentation. Georg would
> you like me to file a new bug or comment in [2]?

Bug 1243893 is about to land, can you file a new bug and needinfo Alessio there?
Flags: needinfo?(gfritzsche)
Points: --- → 3
Priority: -- → P1
Assignee: spenrose → nobody
Priority: P1 → P2
The only action required on this bug is for follow up bugs to be filed and this one to be closed. Even filing the follow up bugs is not urgent, however, so it's a P2.
Assignee: nobody → mreid
See followup bugs:

Bug 1264022 - addonDetails
Bug 1264024 - childPayloads
Bug 1264025 - chromeHangs
Bug 1264026 - fileIOReports
Bug 1264027 - info
Bug 1264028 - lateWrites
Bug 1264029 - log
Bug 1264030 - simpleMeasurements
Bug 1264031 - slowSQL
Bug 1264033 - slowSQLStartup
Bug 1264038 - threadHangStats
Bug 1264039 - UIMeasurements
Bug 1264040 - webrtc
Status: REOPENED → RESOLVED
Closed: 8 years ago8 years ago
Resolution: --- → FIXED
Product: Cloud Services → Cloud Services Graveyard
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Creator:
Created:
Updated:
Size: