Closed Bug 1248172 Opened 9 years ago Closed 7 years ago

Autophone - treeherder job collections created with invalid job_guid

Categories

(Testing Graveyard :: Autophone, defect)

defect
Not set
normal

Tracking

(firefox47 affected)

RESOLVED WONTFIX
Tracking Status
firefox47 --- affected

People

(Reporter: bc, Unassigned)

Attachments

(5 files)

Attached file autophone 2 log
In bug 1216578 we started queuing treeherder submissions to the treeherder table in the jobs database. Overnight, it appears several entries were created with a null job_guid. This completely stalled the submission of all other pending treeherder jobs. I also did not receive any email notifications that there was a problem. The error in the log was:

Traceback (most recent call last):
  File "/mozilla/autophone/autophone/autophonetreeherder.py", line 91, in post_request
    client.post_collection(project, job_collection)
  File "/mozilla/autophone/venv/local/lib/python2.7/site-packages/thclient/client.py", line 923, in post_collection
    collection_inst.validate()
  File "/mozilla/autophone/venv/local/lib/python2.7/site-packages/thclient/client.py", line 529, in validate
    d.validate()
  File "/mozilla/autophone/venv/local/lib/python2.7/site-packages/thclient/client.py", line 62, in validate
    cb(prop.split('.'), required_properties[prop], prop)
  File "/mozilla/autophone/venv/local/lib/python2.7/site-packages/thclient/client.py", line 117, in validate_existence
    raise TreeherderClientError(msg, [])
TreeherderClientError: TreeherderJob structure validation errors detected for property:job.job_guid Value not defined for job.job_guid

I am not sure of the root cause, but there were apparently network issues at the time that caused downloads to fail partway through.
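One way to keep rows with a null job_guid out of the queue in the first place is to check the value before the insert and to back that up with a NOT NULL constraint. The following is only a sketch under assumptions: the table layout, the column names other than job_guid, and the helpers `make_db` and `queue_job` are hypothetical and are not taken from the Autophone source, and it is written for Python 3 rather than the Python 2.7 shown in the traceback.

```python
import sqlite3


def make_db():
    """Create an in-memory stand-in for the jobs database (hypothetical schema)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE treeherder ("
        "id INTEGER PRIMARY KEY, "
        "job_guid TEXT NOT NULL, "  # the database also rejects NULL guids
        "payload TEXT)"
    )
    return conn


def queue_job(conn, job_guid, payload):
    """Queue a pending submission, refusing rows without a usable job_guid."""
    if not isinstance(job_guid, str) or not job_guid.strip():
        # Fail at queue time, where the caller still knows which job this was,
        # instead of at submission time, where it would stall the queue.
        raise ValueError("refusing to queue treeherder job without job_guid")
    conn.execute(
        "INSERT INTO treeherder (job_guid, payload) VALUES (?, ?)",
        (job_guid, payload),
    )
    conn.commit()
```

With this shape, a failed download that never produced a job_guid surfaces as an error at the point where the job is queued, and the pending table only ever holds rows that can pass the client's existence check for job.job_guid.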
Attached file autophone 3 log
Apart from not creating invalid jobs, when we see fatal structural errors that are not due to transient network or treeherder server issues, we should make sure to send an email notification about them and not block on the bad jobs. We can either delete the structurally bad jobs from the database or at least skip over them so the other jobs are submitted in a timely fashion.
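The skip-and-notify behaviour described above could look roughly like the sketch below. Everything in it is an assumption made for illustration: the two exception classes, `drain_queue`, and the `submit` and `notify` callables are hypothetical names, not part of Autophone or thclient. Real code would have to decide how to map a TreeherderClientError onto the two cases.

```python
class TransientSubmitError(Exception):
    """Hypothetical: network or treeherder server trouble; worth retrying."""


class StructuralSubmitError(Exception):
    """Hypothetical: the job itself is malformed; retrying cannot help."""


def drain_queue(pending, submit, notify):
    """Submit queued jobs in order and return the jobs still left to submit.

    A structurally bad job is reported and dropped so it cannot block the
    jobs queued behind it. A transient failure stops the run and keeps the
    failing job and everything after it for the next attempt.
    """
    remaining = []
    for index, job in enumerate(pending):
        try:
            submit(job)
        except StructuralSubmitError as error:
            notify("discarded malformed job %r: %s" % (job, error))
        except TransientSubmitError as error:
            notify("submission paused, will retry: %s" % error)
            remaining = list(pending[index:])
            break
    return remaining
```

The design choice here is that only the transient case preserves ordering and retries; whether a discarded job should instead be deleted from the database or merely flagged is left open, as in the comment above.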
(In reply to Bob Clary [:bc:] from comment #0)
> I also did not receive any email notifications that there was a problem.

Neither did I. That's odd, since there is code to send email here, and we've seen that type of email recently:

    Phone: nexus-6p-2
    TreeherderClientError: HTTPSConnectionPool(host='treeherder.mozilla.org', port=443): Max retries exceeded with url: /api/project/fx-team/jobs/ (Caused by ConnectTimeoutError(<requests.packages.urllib3.connection.VerifiedHTTPSConnection object at 0x7f1310187d90>, 'Connection to treeherder.mozilla.org timed out. (connect timeout=120)'))
    Last attempt: None
    Response: 2016-02-09T14:49:28.689539

(I note as an aside that the Response and Last attempt are reversed.)
Assignee: nobody → gbrown
Heh. I haven't looked more closely into what happened, but I wonder if the failures in downloading and unzipping the build caused us to hit a deadlock for some reason. I mostly wanted to file this so we wouldn't lose the datum and could think about the potential causes at our leisure. heh. ;-)
I'm not sure what conditions we want to use to decide when to discard a job - let's discuss. In the meantime, I noticed these 3 issues related to email notification.
Attachment #8720027 - Flags: review?(bob)
Comment on attachment 8720027: improve mail notification

Review of attachment 8720027:

lgtm
Attachment #8720027 - Flags: review?(bob) → review+
https://github.com/mozilla/autophone/commit/d6823b825c7cd0db532f24efb61cabff0b1baf7a

There is more to do here: figure out when to retry and when to discard a failed submission.
Assignee: gbrown → nobody
See Also: → 1266990
Autophone is going away. Resolving these to wontfix.
Status: NEW → RESOLVED
Closed: 7 years ago
Resolution: --- → WONTFIX
Product: Testing → Testing Graveyard