Closed
Bug 1343018
Opened 4 years ago
Closed 4 years ago
Socorro crashes in Telemetry have empty values since Feb 22 2017
Categories
(Data Platform and Tools :: Pipeline Ingestion, defect)
Tracking
(Not tracked)
RESOLVED
FIXED
People
(Reporter: peterbe, Assigned: amiyaguchi)
Details
If you run `SELECT uptime FROM socorro_crash WHERE crash_date = '20170221' LIMIT 10` you'll get 10 rows of numbers. https://sql.telemetry.mozilla.org/queries/3361/source

HOWEVER, if you run `SELECT uptime FROM socorro_crash WHERE crash_date = '20170222' LIMIT 10` you'll always get empty rows. https://sql.telemetry.mozilla.org/queries/3362/source

I randomly picked a JSON file from the S3 bucket org-mozilla-telemetry-crashes/v1/crash_report/20170221 and downloaded it. It had an uptime value. I also downloaded a random JSON file from org-mozilla-telemetry-crashes/v1/crash_report/20170222 (note, 1 day later). It *also* **has** an uptime value.

In the past we've managed to confuse ourselves with the database because Redash "fails" to display column values. But in this case, it's a simple integer field.
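The spot check described above can be sketched in a few lines of Python. This is illustrative only: the JSON payload below is a stand-in for a real crash report downloaded from the bucket (the actual reports have many more fields).

```python
import json

# Stand-in for a crash report JSON blob fetched from
# org-mozilla-telemetry-crashes/v1/crash_report/20170222/...
# (the real payload has many more fields; this is illustrative).
SAMPLE = '{"uuid": "782739b2-81df-40fb-bb63-464d82170222", "uptime": 1234}'

report = json.loads(SAMPLE)

# The check from the description: the raw JSON carries an uptime value,
# even though the Parquet/Presto side returns it as empty.
print("uptime present:", "uptime" in report)  # uptime present: True
print("uptime value:", report["uptime"])      # uptime value: 1234
```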
Reporter
Comment 1•4 years ago
That's the problem right now, but I will try to inject some clues.

1. When the query does `...WHERE crash_date = '20170222'`, it touches crashes ingested (from S3 to parquet) on the next day, the 23rd. The actual crash happened on Feb 22 and was stored in S3 on Feb 22 (almost immediately after the crash actually happened).

2. Due to under-development infrastructure on our side, we did NOT upgrade the crash_report.json schema file (stored in the root of the S3 bucket) EVEN though the file changed and those changes affected S3 uploads. So on Feb 22 around 21:00 UTC the crash_report.json file WAS changed. https://bugzilla.mozilla.org/show_bug.cgi?id=1340105#c4

It could very well be that we introduced something in crash_report.json that broke ingestion on Feb 23 (the next day).
Assignee
Updated•4 years ago
Assignee: nobody → amiyaguchi
Assignee
Comment 2•4 years ago
Thank you for the detailed report. I will keep you posted on the status of the bug.
Assignee
Comment 3•4 years ago
:mreid and I have reproduced the issue and are looking into the possible causes for this failure [1]. There have been no errors in the socorro import job hosted on Airflow [2], so the problem probably lies with the schema or the data. Changes to the schema in the last week [3] have been ruled out by running `diff` on the JSON schema hosted on S3 and GitHub. The time window since the last change is far too large for it to have specifically caused this particular issue.

The notebook in [1] hints at an issue with the pyspark schema. As in comment 0, a random JSON blob was chosen and converted into a dataframe. The first dataframe uses inference to assign property names and types, while the second uses the JSON schema converted to a pyspark schema. The second dataframe exhibits the behavior in the bug description.

I found an interesting behavior caused by a select over known non-null values in the original JSON blob. While `df.show()` contains null columns, a `df.select(subset).show()` will produce the expected outputs. The physical plan scans the JSON with a fairly simple schema corresponding to the columns of the select statement.

> == Physical Plan ==
> CollectLimit 21
> +- *Project [adapter_device_id#12915, uuid#12992, uptime#12989]
>    +- *Scan json [adapter_device_id#12915,uptime#12989,uuid#12992] Format: JSON, InputPaths: s3://org-mozilla-telemetry-crashes/v1/crash_report/20170222/782739b2-81df-40fb-bb63-464d82170222, PartitionFilters: [], PushedFilters: [], ReadSchema: struct<adapter_device_id:string,uptime:int,uuid:string>

I plan to re-run `20170221` and `20170222` and verify that the isolated dataset behaves the same as the production one. If `20170221` shows a significantly higher rate of nulls than the original, then the problem probably lies with the schema. If it meets expectations, I think it would be more likely that the problem is in the data itself.
[1] https://gist.github.com/acmiyaguchi/1db8a2a920fe01917d6bf91ba27b0d65/7ac4c58951de8db873a14a6724c5bd63eb2bfcb6
[2] https://pageshot.net/qscgVy24zUReddk8/workflow.telemetry.mozilla.org
[3] https://github.com/mozilla/socorro/commits/master/socorro/schemas
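The null-columns symptom described above is consistent with how a reader that enforces an explicit schema handles malformed records: if any field fails to match its declared type, the whole record is treated as corrupt and every column comes back null (roughly Spark's permissive-mode behavior when reading JSON with a user-supplied schema). A pure-Python sketch of that mechanism, with an illustrative schema and record rather than pyspark itself:

```python
# Pure-Python sketch (no pyspark) of permissive-style schema enforcement:
# a single type mismatch nulls out the entire record. The schema and
# records below are illustrative, not the real crash_report schema.
import json

SCHEMA = {"uptime": int, "safe_mode": bool, "uuid": str}

def parse_with_schema(line, schema):
    record = json.loads(line)
    out = {}
    for field, expected_type in schema.items():
        value = record.get(field)
        if value is not None and not isinstance(value, expected_type):
            # Type mismatch: treat the record as corrupt, null all columns.
            return {f: None for f in schema}
        out[field] = value
    return out

# The raw data stores numbers/booleans as strings, but the schema says int/bool.
raw = '{"uptime": "1234", "safe_mode": "false", "uuid": "782739b2"}'
print(parse_with_schema(raw, SCHEMA))
# -> {'uptime': None, 'safe_mode': None, 'uuid': None}

# With matching types, the same record parses normally.
ok = '{"uptime": 1234, "safe_mode": false, "uuid": "782739b2"}'
print(parse_with_schema(ok, SCHEMA))
# -> {'uptime': 1234, 'safe_mode': False, 'uuid': '782739b2'}
```

This also matches the `df.select(subset).show()` observation: a narrower read schema touches fewer fields, so fewer records trip the mismatch check.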
Assignee
Comment 4•4 years ago
I have added an extra cell to notebook [1]. Here, the schema from commit 911cff4 [2] is used, resulting in a dataframe containing the expected values. This illustrates how the situation probably played out:

Feb 8 - updates to the schema for bug 1311647 (543cbc0) and bug 1291173 (5559de7) are committed to mozilla/socorro, but the changes are not propagated to S3
Feb 22 - the crash_report.json file in S3 is replaced with the updated schema, and schema validation fails as a side effect

[1] https://gist.github.com/acmiyaguchi/1db8a2a920fe01917d6bf91ba27b0d65/8579cdd2d768c055a0c6e4def55aa912bf769b0d
[2] https://github.com/mozilla/socorro/commit/911cff4b4429a72a344a10239437591e176bc431
Comment 5•4 years ago
I believe the problem is that the updated JSONSchema is not properly validating against the data. I grabbed a random crash report and tried applying the latest JSONSchema [1] to validate it, and discovered the following.

Fields specified as integer, but the data contains a string:
- available_physical_memory
- total_virtual_memory
- total_physical_memory
- available_virtual_memory

Fields specified as boolean, but the data contains a string:
- dom_ipc_enabled
- safe_mode
- startup_crash

There may be others; these are just the ones that surfaced from looking at a single example record.

[1] https://github.com/mozilla/socorro/blob/master/socorro/schemas/crash_report.json
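The check described in this comment can be sketched with a small stdlib-only helper: walk a flat record, compare each value's Python type against the type declared in a (much simplified) JSONSchema fragment, and report the mismatches. The schema fragment and sample record here are illustrative, not the real crash_report.json.

```python
# Stdlib-only sketch of the type-mismatch check; the schema fragment and
# sample record are illustrative stand-ins for crash_report.json and a
# real crash report.
import json

SCHEMA_TYPES = {
    "available_physical_memory": "integer",
    "total_virtual_memory": "integer",
    "dom_ipc_enabled": "boolean",
    "safe_mode": "boolean",
    "uptime": "integer",
}

PYTHON_TYPES = {"integer": int, "boolean": bool, "string": str}

def find_type_mismatches(record, schema_types):
    mismatches = []
    for field, declared in schema_types.items():
        if field not in record:
            continue
        value = record[field]
        # bool is a subclass of int in Python, so exclude bools from "integer".
        if declared == "integer" and isinstance(value, bool):
            mismatches.append((field, declared, type(value).__name__))
        elif not isinstance(value, PYTHON_TYPES[declared]):
            mismatches.append((field, declared, type(value).__name__))
    return mismatches

sample = json.loads(
    '{"available_physical_memory": "8589934592", "safe_mode": "false", "uptime": 1234}'
)
for field, declared, actual in find_type_mismatches(sample, SCHEMA_TYPES):
    print(f"{field}: schema says {declared}, data has {actual}")
# available_physical_memory: schema says integer, data has str
# safe_mode: schema says boolean, data has str
```

In practice a library like `jsonschema` would do this validation against the full schema; the sketch only shows why string-valued numbers and booleans in the data fail a schema that declares `integer` and `boolean`.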
Reporter
Comment 6•4 years ago
The old crash_report.json file has been restored in https://console.aws.amazon.com/s3/home?region=us-west-2#&bucket=org-mozilla-telemetry-crashes&prefix= In other words, please go ahead and try to reprocess Feb 22 and onwards.
Assignee
Comment 7•4 years ago
Data has been reprocessed from 20170222 onward. The query in comment 0 executes properly now.
Status: NEW → RESOLVED
Closed: 4 years ago
Resolution: --- → FIXED