Closed
Bug 1162682
Opened 10 years ago
Closed 10 years ago
Catch exceptions whilst handling objectstore jobs to prevent losing all jobs in that batch
Categories
(Tree Management :: Treeherder: Data Ingestion, defect, P1)
Tracking
(Not tracked)
RESOLVED
FIXED
People
(Reporter: emorley, Assigned: emorley)
References
Details
Attachments
(1 file)
Description
One of the causes of bug 1125476 is hitting exceptions whilst processing the contents of json_blob in the objectstore (as opposed to the worker being killed for infra reasons).
We already handle exceptions during deserialisation of json_blob, but we do not do so for the _load_ref_and_job_data_structs() call afterwards, which is, if anything, riskier.
As a result, if an exception occurs we fail to store any of the other jobs in that batch, leaving up to 100 jobs stuck in the 'loading' state indefinitely, even if the remaining 99 would have been handled successfully.
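As a rough illustration of the intended handling (a sketch only; apart from _load_ref_and_job_data_structs() and mark_object_error(), which are mentioned in this bug, the loop structure and names below are assumptions rather than Treeherder's actual code):

    # Hypothetical sketch: names other than _load_ref_and_job_data_structs()
    # and mark_object_error() are illustrative.
    import json
    import logging

    logger = logging.getLogger(__name__)


    def process_objectstore_batch(jobs_model, object_rows):
        """Store each row of the batch, isolating failures to the row they occur in."""
        for object_id, json_blob in object_rows:
            try:
                data = json.loads(json_blob)
            except ValueError as e:
                # Deserialisation errors were already handled per-row like this.
                jobs_model.mark_object_error(object_id, str(e))
                continue
            try:
                # Previously unguarded: one exception here aborted the rest of
                # the batch, leaving up to 100 rows stuck in 'loading'.
                jobs_model._load_ref_and_job_data_structs(data)
            except Exception as e:
                jobs_model.mark_object_error(object_id, str(e))
                logger.exception("Failed to load objectstore row %s", object_id)

With per-row try/except, one bad json_blob (or a failure while loading its structs) marks only that row as errored, and the rest of the batch is still inserted.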
Assignee
Comment 1 • 10 years ago
Attachment #8603054 - Flags: review?(cdawson)
Updated • 10 years ago
Attachment #8603054 - Flags: review?(cdawson) → review+
Comment 2 • 10 years ago
Commit pushed to master at https://github.com/mozilla/treeherder
https://github.com/mozilla/treeherder/commit/8c27f6656b9692549a00b988a208e785ab703229
Bug 1162682 - Catch exceptions whilst processing objectstore jobs
Now if an exception occurs during _load_ref_and_job_data_structs(), we
mark that job in the objectstore as errored and continue inserting the
other jobs. Previously the exception would have meant all of the other
jobs were not inserted, causing up to 100 rows in the objectstore to be
stuck in the 'loading' processed_state indefinitely.
The exception string passed to mark_object_error() isn't ideal, but it's
the same as the handling above, so will do for now until we remove the
objectstore.
In addition, this change means that we lose visibility in New Relic for
these exceptions - and someone has to manually check the objectstore for
jobs with error = "Y". However, in the short term this seems preferable
to dropping 100 jobs every time we get an exception, particularly since
this is already the case for deserialisation exceptions. In a followup
bug we could always try using the New Relic Python agent's
record_exception() to maintain reporting without having to re-raise the
exception ourselves.
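For that followup idea, a hedged sketch of what calling the New Relic Python agent's record_exception() from the except block could look like (assumes the newrelic agent is initialised and a transaction is active; the surrounding function and model names are illustrative, not Treeherder's actual code):

    import newrelic.agent


    def load_row(jobs_model, object_id, data):
        try:
            jobs_model._load_ref_and_job_data_structs(data)
        except Exception as e:
            # With no arguments, record_exception() reports the exception
            # currently being handled to New Relic, so visibility is kept
            # without re-raising and aborting the rest of the batch.
            newrelic.agent.record_exception()
            jobs_model.mark_object_error(object_id, str(e))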
Assignee
Updated • 10 years ago
Status: ASSIGNED → RESOLVED
Closed: 10 years ago
Resolution: --- → FIXED