Test the crash reports in Telemetry

Status

RESOLVED FIXED
Product: Socorro
Component: General
Opened: a year ago
Last modified: 7 months ago

People

(Reporter: peterbe, Unassigned)

Tracking

Firefox Tracking Flags: (Not tracked)
(Reporter)

Description

a year ago
We now have crashes in Telemetry in the socorro_crash "table". 
You can play with it here: https://sql.telemetry.mozilla.org/queries/1194
This is only from ingesting the S3 stage bucket of the public crashes. 

We need to verify that the queryable data is "sane", i.e. that it matches the types (and data) listed in: https://github.com/mozilla/socorro/blob/master/socorro/schemas/crash_report.json
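
One quick way to sanity-check individual records against that schema (a sketch: the schema path is from the repo above, but the sample-crash file name and the use of the stock `jsonschema` Python library are assumptions for illustration):

    import json

    import jsonschema  # pip install jsonschema

    # The schema the Telemetry-side data is supposed to conform to.
    with open("socorro/schemas/crash_report.json") as f:
        schema = json.load(f)

    # sample_crash.json stands in for one record exported from the
    # socorro_crash table (hypothetical file name).
    with open("sample_crash.json") as f:
        crash = json.load(f)

    # Raises jsonschema.ValidationError on the first type/structure
    # mismatch; passes silently if the record is "sane".
    jsonschema.validate(crash, schema)
    print("crash record matches the schema")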
(Reporter)

Updated

a year ago
Depends on: 1296621
(Reporter)

Comment 1

a year ago
Lonnen, as we talked about, it would be nice if you can think of people who would be good to help out with this.
Like I said, because of my poor Presto SQL chops, if I were to test this I'd go down "the happy path" and just check what I already know is working. Ideal would be someone who is interested in the output AND knows how to write non-basic SQL that checks the more complex fields, like the nested structs with arrays.

Comment 2

a year ago
What do you want to see tested? What would you do if it were a more familiar flavor of SQL?
My team is probably a good candidate for this.
Comment 4

a year ago
I think a good way for us to test it is to build our existing reports (topcrashers, crashes per day... ) in redash.
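
For illustration, a "crashes per day" report could start from a query like this (a sketch, wrapped as a Python string for reference; the product and crash_date column names are assumptions to check against the actual socorro_crash table):

    # Hypothetical re:dash/Presto query for crashes per day.
    CRASHES_PER_DAY = """
    SELECT crash_date, COUNT(*) AS crashes
    FROM socorro_crash
    WHERE product = 'Firefox'
    GROUP BY crash_date
    ORDER BY crash_date
    """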
(Reporter)

Comment 5

a year ago
(In reply to Adrian Gaudebert [:adrian] from comment #4)
> I think a good way for us to test it is to build our existing reports
> (topcrashers, crashes per day... ) in redash.

Be my guest :)
(Reporter)

Comment 6

a year ago
Attention David,

As of yesterday we've started sending EVERYTHING from STAGE into Telemetry. We used to send a mutated, much much smaller dump that didn't contain the full value of the key "json_dump".

Basically, this means that you can now do some really deep analysis on correlations across ALL of a crash's threads. E.g. find all crashes for Firefox 51.0a1 where the function `PresShell::UpdateApproximateFrameVisibility()` appears in any of the threads.
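
A query along those lines might look roughly like this (a sketch only, wrapped as a Python string: it assumes a Presto version with lambda support and that json_dump.crashing_thread.frames is exposed as an array of rows with a function field; the real table layout should be checked in re:dash):

    # Hypothetical Presto query: 51.0a1 crashes whose crashing thread
    # has a frame in the given function.
    FRAME_QUERY = """
    SELECT uuid
    FROM socorro_crash
    WHERE version = '51.0a1'
      AND cardinality(
            filter(json_dump.crashing_thread.frames,
                   frame -> frame.function =
                     'PresShell::UpdateApproximateFrameVisibility()')
          ) > 0
    """
    # Covering ALL threads rather than just the crashing one would mean
    # applying the same filter() across the json_dump.threads array.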

One bigger question for Mark is: how do we cope with the fact that the json_dump field has changed so radically?
It used to look like this (prior to October 7th):

    "json_dump": {
        "largest_free_vm_block": "0x70df0000",
        "tiny_block_size": "32088064",
        "write_combine_size": "24666112"
    },

Now it's a much, much larger and different structure, looking like this:

    "json_dump": {
        "crash_info": {
            "address": "0x2c",
            "crashing_thread": 0,
            "type": "EXCEPTION_ACCESS_VIOLATION_READ"
        },
        "crashing_thread": {
            "frames": [
                {
                    "file": "hg:hg.mozilla.org/releases/mozilla-release:layout/generic/nsIFrame.h:2d931a5eaf8a",
                    "frame": 0,
                    "function": "nsIFrame::TrackingVisibility()",
                    "function_offset": "0x0",
                    "line": 1107,
                    "module": "xul.dll",
    ...

If you want, Mark, I think we can just delete all older records where the json_dump value was the abbreviated version.
Comment 7

a year ago
Just to clarify, Peter said Socorro stage is now sending *everything*, but there are two caveats. First, we're only sending publicly-OK data. Second, we're still throttling crashes, so most of them don't get processed and thus don't go to Telemetry.
(Reporter)

Comment 8

a year ago
Mark, still an important question in there for you about the fact that this "json_dump" field (completely independent of its absence/presence in re:dash) has changed. Before we got our s**t together the json_dump *value* used to be a summary object; now it's very different.
Flags: needinfo?(mreid)

Comment 9

a year ago
(In reply to Peter Bengtsson [:peterbe] from comment #8)
> Mark, still an important question in there for you about the fact that this
> "json_dump" field (completely independent of its absence/presence in re:dash)
> has changed. Before we got our s**t together the json_dump *value* used to
> be a summary object; now it's very different.

Has the source crash data you dump as json been updated in S3 also? If so, we could consider backfilling the data to reflect the new format.

The way we normally handle this type of change on the parquet / re:dash side is to version the data, so we'll have data with the old structure in

s3://bucket/path/socorro_crash/v1/crash_date=X/...

and the new structure in 

s3://bucket/path/socorro_crash/v2/crash_date=X/...

Within re:dash, the base table (called "socorro_crash" in this case) always points to the newest version, but you can explicitly query older versions if you need to (using a table name like "socorro_crash_v1").

Ideally we'd do this, moving data after Oct. 7th over to v2 and updating the import code with the new version prefix.

If the older data has been updated with the new structure, we can simply wipe it out on the parquet end and re-process.
Flags: needinfo?(mreid)
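
To make the version convention concrete, a consumer that always wants the newest data could resolve the prefix like this (a hypothetical sketch; the bucket/path values are the examples above, not real configuration):

    import re

    # The versioned prefixes from the layout described above.
    prefixes = [
        "s3://bucket/path/socorro_crash/v1/",
        "s3://bucket/path/socorro_crash/v2/",
    ]

    def newest(prefixes):
        """Return the prefix with the highest vN version number."""
        return max(prefixes,
                   key=lambda p: int(re.search(r"/v(\d+)/", p).group(1)))

    assert newest(prefixes).endswith("/v2/")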
(Reporter)

Comment 10

a year ago
(In reply to Mark Reid [:mreid] from comment #9)
> (In reply to Peter Bengtsson [:peterbe] from comment #8)
> > Mark, still an important question in there for you about the fact that this
> > "json_dump" field (completely independent of its absence/presence in re:dash)
> > has changed. Before we got our s**t together the json_dump *value* used to
> > be a summary object; now it's very different.
> 
> Has the source crash data you dump as json been updated in S3 also? If so,
> we could consider backfilling the data to reflect the new format.
> 

No, and that's not feasible. It's only stage data.

And most importantly, upon closer inspection, the schema actually hasn't changed. It's hard to see this clearly because the value of that key is so huge that it doesn't fit on a screen.
As of that day, Oct 6, we just started filling in the blanks.
We don't actually specify which fields inside the json_dump key are required, so they all fall back to being optional.

In conclusion, there is NO SCHEMA BREAKAGE. Sorry for kicking up an unnecessary fuss. 

> The way we normally handle this type of change on the parquet / re:dash side
> is to version the data, so we'll have data with the old structure in
> 
> s3://bucket/path/socorro_crash/v1/crash_date=X/...
> 
> and the new structure in 
> 
> s3://bucket/path/socorro_crash/v2/crash_date=X/...
> 

OK. That's interesting and important, but also inconvenient. It means we need to remember to change the "key builder" every time the data is going to be different.
https://github.com/mozilla/socorro/blob/5391e4f663ae0a9b8f8eda14c428f062204be522/socorro/external/boto/connection_context.py#L113

One solution to this would be to hash the JSON Schema into a 7-char string and make that the version. However, it then loses the readability of something sensible, like "v3".

We can either try some automation, with documentation that says "If you change crash_report.json in any way, remember to increment the version number here in this file".
Or we can pursue the idea of a version number that is a hash.

What do you think?
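
The hash variant could be as small as this (a sketch; canonicalizing with sorted keys and no whitespace is an assumption, chosen so that formatting-only edits don't change the version, and none of this exists in the Socorro codebase):

    import hashlib
    import json

    with open("socorro/schemas/crash_report.json") as f:
        schema = json.load(f)

    # Canonical serialization: key order and whitespace must not matter.
    canonical = json.dumps(schema, sort_keys=True, separators=(",", ":"))

    # First 7 hex chars of the digest, git-style, e.g. "9f2b0c1".
    version = hashlib.sha1(canonical.encode("utf-8")).hexdigest()[:7]
    print(version)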
Comment 11

a year ago
You don't necessarily need to change it on the generation side; the paths above were for the parquet/re:dash side.

We will need to touch the code there when the schema changes anyway - the Struct stuff needs to match what we're parsing. So if we need to change it in a backwards-incompatible way, we'll need to bump the version there.

We can keep the version on the generation side for versioning the interface between Socorro and the Data Platform tools.
Comment 12

a year ago
Per IRC discussion, another possibility is to add a "_target_version" key at the top level within the crash_report.json schema file itself.

Then the conversion code could read this value to determine the version string to use (as well as using it to dynamically generate the required StructType for parsing the crash json).

In that case, changes to the schema and version all happen in the same place, and backwards-compatible changes don't necessarily need to increment the version.

Sound ok?
Flags: needinfo?(peterbe)
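
Reading the proposed key could look like this (a sketch; "_target_version" is the name proposed in this comment, while per comment 15 the key that actually landed was "$target_version"):

    import json

    with open("socorro/schemas/crash_report.json") as f:
        schema = json.load(f)

    # Fall back to 1, i.e. the current layout, if the key is absent.
    version = schema.get("_target_version", 1)

    # The same value then drives the versioned prefix on the
    # parquet/re:dash side (bucket/path as in the comment 9 examples).
    prefix = "s3://bucket/path/socorro_crash/v{}/".format(version)
    print(prefix)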
(Reporter)

Comment 13

a year ago
(In reply to Mark Reid [:mreid] from comment #12)
> Per IRC discussion, another possibility is to add a "_target_version" key at
> the top level within the crash_report.json schema file itself.
> 
> Then the conversion code could read this value to determine the version
> string to use (as well as using it to dynamically generate the required
> StructType for parsing the crash json).
> 
> In that case, changes to the schema and version all happen in the same
> place, and backwards-compatible changes don't necessarily need to increment
> the version.
> 
> Sound ok?

Just to be absolutely sure...

Your suggestion is that we have a version in the crash_report.json JSON Schema document. Fine.
And if I decide to change something "breaking" in crash_report.json, I increment this value, e.g. from `"_target_version": 1` to `"_target_version": 2`.
But what about the prefix for the S3 upload? Currently we upload to a "directory" called "v1"
E.g. /v1/crash_report/20160822/0025fc5d-f786-467a-ac5a-a71d62160822
**Does that need to change too?** Or is it ok to just have a key inside the JSON Schema document?
Flags: needinfo?(peterbe)
Comment 14

a year ago
(In reply to Peter Bengtsson [:peterbe] from comment #13)
> But what about the prefix for the S3 upload? Currently we upload to a
> "directory" called "v1"
> E.g. /v1/crash_report/20160822/0025fc5d-f786-467a-ac5a-a71d62160822
> **Does that need to change too?** Or is it ok to just have a key inside the
> JSON Schema document?

No, that does not need to change unless our basic mechanism for exchanging data changes. As long as we continue to use day-based dirs full of json files, the path can stay the same on your end.
(Reporter)

Comment 15

7 months ago
We added the `$target_version` key: https://github.com/mozilla/socorro/blob/8b93800d32551882612318cfe463cd729c1037c4/socorro/schemas/crash_report.json#L3
And we documented what people need to do when editing crash_report.json: https://github.com/mozilla/socorro/blob/8b93800d32551882612318cfe463cd729c1037c4/socorro/schemas/README.md#changing-the-schema
Status: NEW → RESOLVED
Last Resolved: 7 months ago
Resolution: --- → FIXED