Closed Bug 1349633 · Opened 7 years ago · Closed 2 years ago

Automate JSON schema validation

Categories: Socorro :: General
Type: task
Priority: Not set
Severity: normal

Tracking: Not tracked
Status: RESOLVED WONTFIX

People: Reporter peterbe; Assignee unassigned

Details

At the moment we have validate_and_test.py [0].
The idea is that you run that script from the command line, wherever you are making your intended changes to crash_report.json [1].

Also, you're ideally expected to really "push the envelope" with this script. If you don't supply a URL to a Super Search API query, e.g. `python validate_and_test.py`, it will just test the 100 most recent crashes, most of which are likely Firefox Release crashes.
So if you're adding a new field like `"foo_bar": {"type": "integer"}`, you're expected to pass in a Super Search URL that makes sure crashes containing this field are included.
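
For example, a purely illustrative invocation (the Super Search query parameters here are an assumption, not copied from a real query):

```
python validate_and_test.py "https://crash-stats.mozilla.org/api/SuperSearch/?foo_bar=!__null__"
```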

It would be nice to automate all of this in an efficient manner.



[0] https://github.com/mozilla/socorro/blob/master/socorro/schemas/validate_and_test.py
[1] https://github.com/mozilla/socorro/blob/master/socorro/schemas/crash_report.json
Do we have any idea how often this comes up? Is it a once-a-week sort of thing or once-a-month or something else?
(In reply to Will Kahn-Greene [:willkg] ET needinfo? me from comment #1)
> Do we have any idea how often this comes up? Is it a once-a-week sort of
> thing or once-a-month or something else?

About once a month. But the whole system only just landed in production in December 2016.
That frequency might increase. I suspect it will, but even then it'll probably be (at most) once a week.

Running validate_and_test.py is slow because it has to do a LOT of network I/O, i.e. download 100 raw crashes and 100 processed crashes. (Mind you, validate_and_test.py does exercise the TelemetryBotoS3CrashStorage [0] class nicely.)

We could potentially check the branch's git diff and, if "crash_report.json" shows up among the changed files, execute validate_and_test.py.
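
A rough sketch of that check (the branch name and paths are assumptions; this isn't an existing script):

```
git diff --name-only origin/master...HEAD | grep -q "crash_report.json" \
  && python socorro/schemas/validate_and_test.py
```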

Another idea is to run validate_and_test.py fairly frequently using cron on stage (perhaps against a mix of stage and prod data).
That way, if we land something in crash_report.json too hastily, merge it to master, and break crontabber on stage, we'd be alerted through monitoring.
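
A hypothetical crontab entry for that (the schedule and paths are made up):

```
# validate the schema against fresh crash data every six hours
0 */6 * * * cd /data/socorro && python socorro/schemas/validate_and_test.py
```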

More crazy ideas?


[0] https://github.com/mozilla/socorro/blob/5e1d5585ca03256043ae83832da552cde328db84/socorro/external/boto/crashstorage.py#L294
For what it's worth, at least we have decent documentation about editing crash_report.json AND how to test your changes:
https://github.com/mozilla/socorro/blob/master/socorro/schemas/README.md#testing-schema-changes

A while back, I added schema validation to the unit tests. Any schema that we have gets validated as a valid JSON schema. I don't think that covers this issue, though.
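
Roughly, such a test can look like this (a minimal sketch; the file layout and test name are assumptions, not the actual Socorro test):

```python
import json
import pathlib

import jsonschema


def test_schemas_are_valid_json_schemas():
    for path in pathlib.Path("socorro/schemas").glob("*.json"):
        schema = json.loads(path.read_text())
        # check_schema() raises jsonschema.SchemaError if the schema
        # itself doesn't conform to its metaschema
        jsonschema.validators.validator_for(schema).check_schema(schema)
```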

Recently, I rewrote the telemetry crashstorage code to use the processed crash and reduce it by the telemetry_socorro_crash.json schema. I wrote a new JSON schema reducer which does some light validation of types as it's reducing. Further, it now only includes the bits of the processed crash that are specified in the schema.
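
To illustrate the idea only (a minimal sketch, not the actual Socorro reducer): walk the schema's properties, copy only the keys the schema names, and do light type checking along the way.

```python
# Minimal sketch of schema-driven reduction; not the actual Socorro code.
JSON_TYPES = {
    "string": str,
    "integer": int,
    "number": (int, float),
    "boolean": bool,
    "object": dict,
    "array": list,
    "null": type(None),
}


def reduce_by_schema(schema, document):
    """Return only the parts of document that the schema specifies,
    raising TypeError on obvious type mismatches."""
    reduced = {}
    for key, subschema in schema.get("properties", {}).items():
        if key not in document:
            continue
        value = document[key]
        expected = subschema.get("type")
        if isinstance(expected, str) and not isinstance(value, JSON_TYPES[expected]):
            raise TypeError(f"{key}: expected {expected}, got {type(value).__name__}")
        if expected == "object":
            # recurse so nested objects are reduced the same way
            reduced[key] = reduce_by_schema(subschema, value)
        else:
            reduced[key] = value
    return reduced
```

Anything not named in the schema's `properties` simply never makes it into the reduced document, which is what closes the door on data creep.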

I think doing that eliminates the need to have a validate_telemetry_socorro_crash.py script and run it. Why? Because now we're building a crash report document for telemetry ingestion that's explicitly formed from normalized and validated data and specified by the JSON schema. There isn't room here for data creep anymore.

Given all that, marking this as WONTFIX.

Status: NEW → RESOLVED
Closed: 2 years ago
Resolution: --- → WONTFIX