Jobs not getting scheduled or not finishing after connection issues (trees closed)

Status: RESOLVED FIXED
Product: Taskcluster
Component: Queue
Reporter: aryx
Assignee: bstack
Attachments: 1
After some connection issues which caused many jobs to fail, jobs are either not getting scheduled at all or don't finish; see e.g. the autoland tree: https://treeherder.mozilla.org/#/jobs?repo=autoland
Re-opened the trees. No idea what or who caused the jobs to re-run (and succeed) after several hours, but jobs are getting scheduled and started again.
Severity: blocker → normal
The issue is back; setting severity back to blocker.
Severity: normal → blocker
Looks like a taskcluster queue issue:

16:20:25     INFO - Taskcluster taskId: FBbE1RDARQ6ipUI-D-1dIw
16:20:25     INFO - Routes: [u'index.gecko.v2.autoland.revision.a2771a2c77abdf7af52cfba08299c27b5a4c143b.firefox.win32-opt', u'index.gecko.v2.autoland.pushdate.2017.04.22.20170422223156.firefox.win32-opt', u'index.gecko.v2.autoland.latest.firefox.win32-opt']
16:20:25     INFO - Starting new HTTPS connection (1): queue.taskcluster.net
c:\builds\moz2_slave\autoland-w32-00000000000000000\build\venv\Lib\site-packages\requests\packages\urllib3\util\ssl_.py:90: InsecurePlatformWarning: A true SSLContext object is not available. This prevents urllib3 from configuring SSL appropriately and may cause certain SSL connections to fail. For more information, see https://urllib3.readthedocs.org/en/latest/security.html#insecureplatformwarning.
  InsecurePlatformWarning
16:20:56  WARNING - Retrying because of: 503 Server Error: Service Unavailable
16:20:56     INFO - Sleeping 0.10 seconds for exponential backoff
16:20:56     INFO - Starting new HTTPS connection (1): queue.taskcluster.net
c:\builds\moz2_slave\autoland-w32-00000000000000000\build\venv\Lib\site-packages\requests\packages\urllib3\util\ssl_.py:90: InsecurePlatformWarning: A true SSLContext object is not available. This prevents urllib3 from configuring SSL appropriately and may cause certain SSL connections to fail. For more information, see https://urllib3.readthedocs.org/en/latest/security.html#insecureplatformwarning.
  InsecurePlatformWarning
16:21:26  WARNING - Retrying because of: 503 Server Error: Service Unavailable
16:21:26     INFO - Sleeping 0.40 seconds for exponential backoff
16:21:26     INFO - Starting new HTTPS connection (1): queue.taskcluster.net
c:\builds\moz2_slave\autoland-w32-00000000000000000\build\venv\Lib\site-packages\requests\packages\urllib3\util\ssl_.py:90: InsecurePlatformWarning: A true SSLContext object is not available. This prevents urllib3 from configuring SSL appropriately and may cause certain SSL connections to fail. For more information, see https://urllib3.readthedocs.org/en/latest/security.html#insecureplatformwarning.
  InsecurePlatformWarning
16:21:57  WARNING - Retrying because of: 503 Server Error: Service Unavailable
16:21:57     INFO - Sleeping 0.90 seconds for exponential backoff
16:21:58     INFO - Starting new HTTPS connection (1): queue.taskcluster.net
c:\builds\moz2_slave\autoland-w32-00000000000000000\build\venv\Lib\site-packages\requests\packages\urllib3\util\ssl_.py:90: InsecurePlatformWarning: A true SSLContext object is not available. This prevents urllib3 from configuring SSL appropriately and may cause certain SSL connections to fail. For more information, see https://urllib3.readthedocs.org/en/latest/security.html#insecureplatformwarning.
  InsecurePlatformWarning
16:22:28  WARNING - Retrying because of: 503 Server Error: Service Unavailable
16:22:28     INFO - Sleeping 1.60 seconds for exponential backoff
16:22:30     INFO - Starting new HTTPS connection (1): queue.taskcluster.net
c:\builds\moz2_slave\autoland-w32-00000000000000000\build\venv\Lib\site-packages\requests\packages\urllib3\util\ssl_.py:90: InsecurePlatformWarning: A true SSLContext object is not available. This prevents urllib3 from configuring SSL appropriately and may cause certain SSL connections to fail. For more information, see https://urllib3.readthedocs.org/en/latest/security.html#insecureplatformwarning.
  InsecurePlatformWarning
16:23:00  WARNING - Retrying because of: 503 Server Error: Service Unavailable
16:23:00     INFO - Sleeping 2.50 seconds for exponential backoff
16:23:03     INFO - Starting new HTTPS connection (1): queue.taskcluster.net
c:\builds\moz2_slave\autoland-w32-00000000000000000\build\venv\Lib\site-packages\requests\packages\urllib3\util\ssl_.py:90: InsecurePlatformWarning: A true SSLContext object is not available. This prevents urllib3 from configuring SSL appropriately and may cause certain SSL connections to fail. For more information, see https://urllib3.readthedocs.org/en/latest/security.html#insecureplatformwarning.
  InsecurePlatformWarning
16:23:33     INFO - [mozharness: 2017-04-22 23:23:33.420000Z] Finished upload-files step (failed)
16:23:33    FATAL - Uncaught exception: Traceback (most recent call last):
16:23:33    FATAL -   File "c:\builds\moz2_slave\autoland-w32-00000000000000000\scripts\mozharness\base\script.py", line 2064, in run
16:23:33    FATAL -     self.run_action(action)
16:23:33    FATAL -   File "c:\builds\moz2_slave\autoland-w32-00000000000000000\scripts\mozharness\base\script.py", line 2003, in run_action
16:23:33    FATAL -     self._possibly_run_method(method_name, error_if_missing=True)
16:23:33    FATAL -   File "c:\builds\moz2_slave\autoland-w32-00000000000000000\scripts\mozharness\base\script.py", line 1943, in _possibly_run_method
16:23:33    FATAL -     return getattr(self, method_name)()
16:23:33    FATAL -   File "c:\builds\moz2_slave\autoland-w32-00000000000000000\scripts\mozharness\mozilla\building\buildbase.py", line 1529, in upload_files
16:23:33    FATAL -     property_conditions=property_conditions)
16:23:33    FATAL -   File "c:\builds\moz2_slave\autoland-w32-00000000000000000\scripts\mozharness\mozilla\building\buildbase.py", line 1421, in _taskcluster_upload
16:23:33    FATAL -     task = tc.create_task(routes)
16:23:33    FATAL -   File "c:\builds\moz2_slave\autoland-w32-00000000000000000\scripts\mozharness\mozilla\taskcluster_helper.py", line 66, in create_task
16:23:33    FATAL -     }, taskId=self.task_id)
16:23:33    FATAL -   File "c:\builds\moz2_slave\autoland-w32-00000000000000000\build\venv\Lib\site-packages\taskcluster\client.py", line 455, in apiCall
16:23:33    FATAL -     return self._makeApiCall(e, *args, **kwargs)
16:23:33    FATAL -   File "c:\builds\moz2_slave\autoland-w32-00000000000000000\build\venv\Lib\site-packages\taskcluster\client.py", line 232, in _makeApiCall
16:23:33    FATAL -     return self._makeHttpRequest(entry['method'], route, payload)
16:23:33    FATAL -   File "c:\builds\moz2_slave\autoland-w32-00000000000000000\build\venv\Lib\site-packages\taskcluster\client.py", line 424, in _makeHttpRequest
16:23:33    FATAL -     superExc=rerr
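
For context, the sleep values in the log above (0.10, 0.40, 0.90, 1.60, 2.50 seconds) follow a quadratic backoff curve of 0.1 * attempt**2 before the script gives up and raises. A minimal sketch of that retry-on-503 pattern, not the actual mozharness code (the function name, retry limit, and URL handling are illustrative):

import time
import requests

QUEUE_TASK_URL = "https://queue.taskcluster.net/v1/task/FBbE1RDARQ6ipUI-D-1dIw"
MAX_ATTEMPTS = 6  # the log shows six connection attempts before the FATAL traceback

def create_task_with_retries(task_definition):
    """PUT a task definition to the queue, retrying on 5xx with quadratic backoff."""
    for attempt in range(1, MAX_ATTEMPTS + 1):
        try:
            response = requests.put(QUEUE_TASK_URL, json=task_definition, timeout=30)
            response.raise_for_status()
            return response.json()
        except requests.HTTPError as exc:
            if attempt == MAX_ATTEMPTS:
                raise  # out of attempts; surface the error like the traceback above
            print("Retrying because of: %s" % exc)
            delay = 0.1 * attempt ** 2  # 0.10, 0.40, 0.90, 1.60, 2.50, ...
            print("Sleeping %.2f seconds for exponential backoff" % delay)
            time.sleep(delay)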
Component: Buildduty → Queue
Product: Release Engineering → Taskcluster
QA Contact: bugspam.Callek
Comment 4 (Assignee), 6 months ago
It seems that since 3:00am PST this Saturday we've had an extremely high error rate from the queue. I'm investigating now.
Assignee: nobody → bstack
Status: NEW → ASSIGNED
Comment 5 (Assignee), 6 months ago
I restarted the queue and the errors ceased for a while, but they seem to have returned. I'm seeing various errors in Sentry, and from the Heroku viewpoint they mostly manifest as timeouts. Most of the issues appear to occur when dealing with Azure resources such as table store and queues. I'm not seeing anything on the Azure status dashboard at the moment.
Comment 6 (Assignee), 6 months ago
Created attachment 8860703: Screen Shot 2017-04-22 at 9.00.13 PM.png

The success rate for Azure table store fell to 85% around the time we saw this outage (from a baseline of 98-99% over the past week or so).
Comment 7 (Assignee), 6 months ago
After some investigation, Jonas and I are of the mind that this stems from an increased rate of 500s from Azure. Within the last hour we've seen further failure rates of about 15% from them, so I don't know if things are good again yet. The last 50 minutes or so have been fine, but I think we ought to wait a bit longer before opening the trees. I'll update this again before too long with news.
Comment 8 (Assignee), 6 months ago
As far as I can tell the error rate from azure has dropped off. It might be safe to open the trees. I'll check again in the morning to see if things are quieter then as well.

Comment 9, 6 months ago
I reopened the trees, but I won't be around for 8+ hours if anything goes wrong. Enjoy. :)
Comment 10 (Assignee), 6 months ago
Error rates from azure still seem ok. There was a 5% drop sometime last night, but we're back to nominal. I was able to submit tasks via the task creator and a try push has gone through. We can check in again first thing Monday.
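
For reference, the kind of manual check mentioned above can be reproduced with the Python taskcluster client along these lines. This is a rough sketch against the client of that era; the worker type, owner, credentials, and payload below are placeholders rather than what was actually used:

import datetime
import taskcluster

# Placeholder credentials; a real check needs a client with task-creation scopes.
queue = taskcluster.Queue({'credentials': {'clientId': '...', 'accessToken': '...'}})
print(queue.ping())  # basic liveness check against queue.taskcluster.net

created = datetime.datetime.utcnow()
task_id = taskcluster.slugId()
queue.createTask(task_id, {
    'provisionerId': 'aws-provisioner-v1',   # placeholder provisioner/worker type
    'workerType': 'tutorial',
    'created': created.isoformat() + 'Z',
    'deadline': (created + datetime.timedelta(hours=1)).isoformat() + 'Z',
    'metadata': {
        'name': 'queue smoke test',
        'description': 'verify the queue accepts tasks again after the Azure 500s',
        'owner': 'nobody@mozilla.com',        # placeholder owner
        'source': 'https://treeherder.mozilla.org/#/jobs?repo=autoland',
    },
    'payload': {},
})
print('submitted task %s' % task_id)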
Thank you for investigating this. Lowering to 'normal'.
Severity: blocker → normal
Updated (Assignee), 6 months ago
Status: ASSIGNED → RESOLVED
Last Resolved: 6 months ago
Resolution: --- → FIXED