Closed Bug 1187267 Opened 9 years ago Closed 9 years ago

builds-4hr ingestion should not break if the blobber_files json is invalid ("JSONDecodeError: Unterminated string starting at: line 1 column 4079 (char 4078)")

Categories

(Tree Management :: Treeherder: Data Ingestion, defect, P1)

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: glandium, Assigned: emorley)

Attachments

(3 files)

Ah this is presumably due to:
https://rpm.newrelic.com/accounts/677903/applications/4180461/traced_errors/4e9295-bc880623-31f2-11e5-8f08-b82a72d2466d

File "/data/www/treeherder.mozilla.org/treeherder-service/treeherder/etl/tasks/buildapi_tasks.py", line 40, in fetch_buildapi_build4h
File "/data/www/treeherder.mozilla.org/treeherder-service/treeherder/etl/buildapi.py", line 427, in run
File "/data/www/treeherder.mozilla.org/treeherder-service/treeherder/etl/buildapi.py", line 175, in transform
File "/data/www/treeherder.mozilla.org/venv/lib/python2.7/site-packages/simplejson/__init__.py", line 505, in loads
File "/data/www/treeherder.mozilla.org/venv/lib/python2.7/site-packages/simplejson/decoder.py", line 370, in decode
File "/data/www/treeherder.mozilla.org/venv/lib/python2.7/site-packages/simplejson/decoder.py", line 400, in raw_decode

...looks like builds-4hr had bad data in it.

The job is now showing up, and the last New Relic error was 13 mins ago.

Annoyingly, I presume this means the bad data is no longer present in the file, so I can't see what it was :-/
Maybe the daily archive will still show it?
Severity: normal → critical
Priority: -- → P1
(missed the key line, doh)

 simplejson.scanner:JSONDecodeError: Unterminated string starting at: line 1 column 4079 (char 4078)
So the inability to process builds-4hr then resulted in a massive spike of new jobs, once the bad data disappeared. This is what caused the alert a few mins ago for high message counts on rabbitmq, since there was a backlog of jobs (in case anyone getting those alerts wondered what they were from).

I've also just added the builds-4hr and cycle-data tasks as "key transactions" in New Relic, which (fingers crossed) means we'll get alerts if this happens again. Until now there was just a global error rate threshold, and since the builds-4hr task only runs once every 3 mins, issues there only result in a 0.01% error rate etc.

I'd tried to set this up in the past, but the New Relic key transactions UI only lets you set "web transactions" as key transactions, whereas this is one of the non-web tasks. However I've found a workaround, which is to use the "add as a key transaction" link tucked away when viewing a non-web transaction on the "transactions" New Relic section.
Production is running 11dcdd96b1a3785d9fa09ce38ada5755ade79c19, so the exception occurred here:
https://github.com/mozilla/treeherder/blob/11dcdd96b1a3785d9fa09ce38ada5755ade79c19/treeherder/etl/buildapi.py#L175

One of the "blobber_files" properties cannot be parsed.
Assignee: nobody → emorley
Status: NEW → ASSIGNED
Summary: Finished jobs are not appearing as done on treeherder → Builds-4hr ingestion failing with "simplejson.scanner:JSONDecodeError: Unterminated string starting at: line 1 column 4079 (char 4078)"
Attached file Debug script
The daily archive updates periodically, which has now occurred, so it will hopefully include the bad data that was in builds-4hr previously.

This script will dump the bad jobs out of the 75MB json file (no way I'm going through it by hand).
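
The attached script is not reproduced in this thread. As a rough idea of what such a scan could look like, here is a minimal sketch. It assumes the archive is gzipped JSON with a top-level "builds" list whose entries carry a "properties" dict, and that "blobber_files" is a JSON-encoded string inside it. The function names are made up for illustration, and it uses the stdlib json module rather than simplejson.

```python
import gzip
import json


def find_bad_blobber_jobs(builds):
    """Return (build id, error message) pairs for builds whose
    blobber_files property does not decode as JSON.

    Assumes each build is a dict with an optional "properties" dict
    (an assumption about the builds-4hr layout, not taken from the bug).
    """
    bad = []
    for build in builds:
        props = build.get("properties") or {}
        raw = props.get("blobber_files")
        if raw is None:
            continue  # nothing to check for this build
        try:
            json.loads(raw)
        except ValueError as exc:  # JSONDecodeError subclasses ValueError
            bad.append((build.get("id"), str(exc)))
    return bad


def scan_archive(path):
    """Load a gzipped daily archive and report its undecodable jobs."""
    with gzip.open(path, "rt", encoding="utf-8") as f:
        data = json.load(f)
    return find_bad_blobber_jobs(data["builds"])
```

Running something like scan_archive("builds-2015-07-24.js.gz") against a local copy of the archive would then list only the offending jobs instead of the whole file.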
So using the daily archive here:
https://secure.pub.build.mozilla.org/builddata/buildjson/builds-2015-07-24.js.gz

This was the only bad job found.

Looks like the blobber files blob just got truncated - presume there is a max length on it.
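
For anyone wanting to see the failure mode in isolation, the sketch below cuts a small JSON blob off mid-string and decodes it. The URL and the 30-character cut-off are invented for the example (the real limit is not known from this bug), and it uses stdlib json rather than simplejson. Note that the position in the error message is where the unterminated string starts, not where the data was cut.

```python
import json

# A made-up blobber_files style blob: one name-url pair.
blob = json.dumps({"example.log": "http://blobs.example.invalid/" + "a" * 40})

# Hypothetical length limit, chosen so the cut lands inside the URL string.
truncated = blob[:30]


def decode_error(text):
    """Return the decoder's error message for text, or None if it parses."""
    try:
        json.loads(text)
    except ValueError as exc:  # JSONDecodeError subclasses ValueError
        return str(exc)
    return None


print(decode_error(blob))       # the intact blob parses fine
print(decode_error(truncated))  # the truncated one reports an unterminated string
```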

We should:
(a) Handle the exception more gracefully in Treeherder (i.e. just skip that job's blobber_files field, rather than completely breaking builds-4hr ingestion).
(b) File a bug against releng to increase the max length of that field, or at least not dump truncated json in there.
Blocks: 1113873
Summary: Builds-4hr ingestion failing with "simplejson.scanner:JSONDecodeError: Unterminated string starting at: line 1 column 4079 (char 4078)" → builds-4hr ingestion should not break if the blobber_files json is invalid ("JSONDecodeError: Unterminated string starting at: line 1 column 4079 (char 4078)")
Depends on: 1187284
Attachment #8638538 - Flags: review?(mdoglio) → review+
Commit pushed to master at https://github.com/mozilla/treeherder

https://github.com/mozilla/treeherder/commit/2c57b1d02e63e6948de0ed44c99b3470857e3fec
Bug 1187267 - Handle invalid builds-4hr blobber_files json gracefully

The blobber_files property on jobs in builds-4hr is a json blob,
containing a number of name-url pairs. In some cases this blob is
truncated, so fails to decode. This change makes us handle these
failures more gracefully (by treating it as though the blobber_files
property was not set for that job), rather than causing the entire
builds-4hr task to fail.

Bug 1187284 is filed to try and make mozharness never output truncated
json in the first place.
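
The behaviour described in the commit message can be sketched roughly as below. This is not the code from the commit: the helper name and the logging are hypothetical, "properties" is assumed to be a dict whose "blobber_files" value is a JSON-encoded string, and stdlib json stands in for simplejson.

```python
import json
import logging

logger = logging.getLogger(__name__)


def parse_blobber_files(properties):
    """Decode a job's blobber_files property into a dict of name-url pairs.

    Invalid (e.g. truncated) JSON is treated as though the property was
    not set, so one bad job cannot abort the whole ingestion run.
    """
    raw = properties.get("blobber_files")
    if not raw:
        return {}
    try:
        return json.loads(raw)
    except ValueError:  # covers JSONDecodeError in json and simplejson
        logger.warning("Ignoring invalid blobber_files JSON: %r", raw[:100])
        return {}
```

Catching ValueError rather than a library-specific exception keeps the sketch independent of which JSON module is in use; the trade-off is that the job loses its blobber file links instead of the task failing.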
Status: ASSIGNED → RESOLVED
Closed: 9 years ago
Resolution: --- → FIXED