Closed
Bug 1250374
Opened 9 years ago
Closed 9 years ago
Trees closed, Mac builds failing when they upload to taskcluster
Categories
(Infrastructure & Operations Graveyard :: CIDuty, task)
Tracking
(Not tracked)
RESOLVED
FIXED
People
(Reporter: philor, Unassigned)
References
Details
Attachments
(1 file)
patch, 1.87 KB
Can't make this stuff up.
https://treeherder.mozilla.org/logviewer.html#?job_id=7469456&repo=fx-team
https://treeherder.mozilla.org/logviewer.html#?job_id=22151725&repo=mozilla-inbound
https://treeherder.mozilla.org/logviewer.html#?job_id=22152066&repo=mozilla-inbound
https://treeherder.mozilla.org/logviewer.html#?job_id=22152123&repo=mozilla-inbound
All of the form
18:38:57 INFO - Starting new HTTPS connection (1): taskcluster-public-artifacts.s3-us-west-2.amazonaws.com
/builds/slave/fx-team-m64-d-0000000000000000/build/venv/lib/python2.7/site-packages/requests/packages/urllib3/util/ssl_.py:90: InsecurePlatformWarning: A true SSLContext object is not available. This prevents urllib3 from configuring SSL appropriately and may cause certain SSL connections to fail. For more information, see https://urllib3.readthedocs.org/en/latest/security.html#insecureplatformwarning.
InsecurePlatformWarning
18:38:58 INFO - Resolving V8nFykokRIGDnBdLiPoHSQ, run 0. Full task:
18:38:58 INFO - {u'status': {u'workerType': u'buildbot', u'taskGroupId': u'V8nFykokRIGDnBdLiPoHSQ', u'runs': [{u'scheduled': u'2016-02-23T01:54:52.874Z', u'reasonCreated': u'scheduled', u'state': u'pending', u'runId': 0}], u'expires': u'2017-02-23T02:54:51.771Z', u'retriesLeft': 5, u'state': u'pending', u'schedulerId': u'-', u'deadline': u'2016-02-23T02:54:51.771Z', u'taskId': u'V8nFykokRIGDnBdLiPoHSQ', u'provisionerId': u'null-provisioner'}}
18:38:58 INFO - Starting new HTTPS connection (1): queue.taskcluster.net
/builds/slave/fx-team-m64-d-0000000000000000/build/venv/lib/python2.7/site-packages/requests/packages/urllib3/util/ssl_.py:90: InsecurePlatformWarning: A true SSLContext object is not available. This prevents urllib3 from configuring SSL appropriately and may cause certain SSL connections to fail. For more information, see https://urllib3.readthedocs.org/en/latest/security.html#insecureplatformwarning.
InsecurePlatformWarning
18:38:59 INFO - Running post-action listener: influxdb_recording_post_action
18:38:59 INFO - Resetting dropped connection: goldiewilson-onepointtwentyone-1.c.influxdb.com
/builds/slave/fx-team-m64-d-0000000000000000/build/venv/lib/python2.7/site-packages/requests/packages/urllib3/util/ssl_.py:90: InsecurePlatformWarning: A true SSLContext object is not available. This prevents urllib3 from configuring SSL appropriately and may cause certain SSL connections to fail. For more information, see https://urllib3.readthedocs.org/en/latest/security.html#insecureplatformwarning.
InsecurePlatformWarning
18:39:00 FATAL - Uncaught exception: Traceback (most recent call last):
18:39:00 FATAL - File "/builds/slave/fx-team-m64-d-0000000000000000/scripts/mozharness/base/script.py", line 1765, in run
18:39:00 FATAL - self.run_action(action)
18:39:00 FATAL - File "/builds/slave/fx-team-m64-d-0000000000000000/scripts/mozharness/base/script.py", line 1707, in run_action
18:39:00 FATAL - self._possibly_run_method(method_name, error_if_missing=True)
18:39:00 FATAL - File "/builds/slave/fx-team-m64-d-0000000000000000/scripts/mozharness/base/script.py", line 1647, in _possibly_run_method
18:39:00 FATAL - return getattr(self, method_name)()
18:39:00 FATAL - File "/builds/slave/fx-team-m64-d-0000000000000000/scripts/mozharness/mozilla/building/buildbase.py", line 1611, in upload_files
18:39:00 FATAL - property_conditions=property_conditions)
18:39:00 FATAL - File "/builds/slave/fx-team-m64-d-0000000000000000/scripts/mozharness/mozilla/building/buildbase.py", line 1539, in _taskcluster_upload
18:39:00 FATAL - tc.report_completed(task)
18:39:00 FATAL - File "/builds/slave/fx-team-m64-d-0000000000000000/scripts/mozharness/mozilla/taskcluster_helper.py", line 135, in report_completed
18:39:00 FATAL - self.taskcluster_queue.reportCompleted(task_id, run_id)
18:39:00 FATAL - File "/builds/slave/fx-team-m64-d-0000000000000000/build/venv/lib/python2.7/site-packages/taskcluster/client.py", line 455, in apiCall
18:39:00 FATAL - return self._makeApiCall(e, *args, **kwargs)
18:39:00 FATAL - File "/builds/slave/fx-team-m64-d-0000000000000000/build/venv/lib/python2.7/site-packages/taskcluster/client.py", line 232, in _makeApiCall
18:39:00 FATAL - return self._makeHttpRequest(entry['method'], route, payload)
18:39:00 FATAL - File "/builds/slave/fx-team-m64-d-0000000000000000/build/venv/lib/python2.7/site-packages/taskcluster/client.py", line 424, in _makeHttpRequest
18:39:00 FATAL - superExc=rerr
18:39:00 FATAL - TaskclusterRestFailure: Run 0 on task V8nFykokRIGDnBdLiPoHSQ is resolved or not running.
18:39:00 FATAL - ----
18:39:00 FATAL - errorCode: RequestConflict
18:39:00 FATAL - statusCode: 409
18:39:00 FATAL - requestInfo:
18:39:00 FATAL - method: reportCompleted
18:39:00 FATAL - params: {"taskId":"V8nFykokRIGDnBdLiPoHSQ","runId":"0"}
18:39:00 FATAL - payload: {}
18:39:00 FATAL - time: 2016-02-23T02:38:59.501Z
18:39:00 FATAL - details:
18:39:00 FATAL - {
18:39:00 FATAL - "taskId": "V8nFykokRIGDnBdLiPoHSQ",
18:39:00 FATAL - "runId": 0
18:39:00 FATAL - }
18:39:00 FATAL - Running post_fatal callback...
18:39:00 ERROR - setting return code to 2 because fatal was called
18:39:00 WARNING - setting return code to 2
18:39:00 FATAL - Exiting -1
18:39:00 INFO - Running post-run listener: _summarize
18:39:00 ERROR - # TBPL FAILURE #
18:39:00 INFO - #####
18:39:00 INFO - ##### FxDesktopBuild summary:
18:39:00 INFO - #####
18:39:00 ERROR - # TBPL FAILURE #
At least so far, Try seems unaffected, so it's open, but everything else is closed.
Comment 1•9 years ago
In theory, this will:
a) retry submitting the taskcluster report_completed call on failure, and
b) only result in an ERROR if it fails max_retries times, rather than a FATAL.
I'm not sure whether this works, or whether the new behavior is the one we want. I do know that since mozharness is in-tree, landing this will only mitigate the problem going forward, not for older unmerged changesets, other branches, etc.
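For reference, a minimal sketch of the intended behavior (this is not the attached patch; the attribute names self.taskcluster_queue, self.warning/self.error and the max_retries default are assumptions based on the traceback above):

import time

def report_completed_with_retries(self, task_id, run_id, max_retries=5, sleep_time=30):
    # Try reportCompleted a few times before giving up.
    for attempt in range(1, max_retries + 1):
        try:
            self.taskcluster_queue.reportCompleted(task_id, run_id)
            return True
        except Exception as e:
            self.warning("reportCompleted attempt %d/%d failed: %s" %
                         (attempt, max_retries, e))
            time.sleep(sleep_time)
    # After exhausting retries, log an ERROR instead of calling fatal(),
    # so the build is marked failed without killing the whole script.
    self.error("Giving up on reportCompleted for task %s run %s" % (task_id, run_id))
    return False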
Comment 2•9 years ago
Also errors in mac opt, eg
https://treeherder.mozilla.org/logviewer.html#?job_id=22151731&repo=mozilla-inbound
19:25:14 INFO - Uploading to S3: filename=/builds/slave/m-in-m64-000000000000000000000/logs/log_raw.log mimetype=text/plain length=17366589
19:25:14 INFO - Starting new HTTPS connection (1): queue.taskcluster.net
/builds/slave/m-in-m64-000000000000000000000/build/venv/lib/python2.7/site-packages/requests/packages/urllib3/util/ssl_.py:90: InsecurePlatformWarning: A true SSLContext object is not available. This prevents urllib3 from configuring SSL appropriately and may cause certain SSL connections to fail. For more information, see https://urllib3.readthedocs.org/en/latest/security.html#insecureplatformwarning.
InsecurePlatformWarning
19:25:15 INFO - Running post-action listener: influxdb_recording_post_action
19:25:15 INFO - Resetting dropped connection: goldiewilson-onepointtwentyone-1.c.influxdb.com
/builds/slave/m-in-m64-000000000000000000000/build/venv/lib/python2.7/site-packages/requests/packages/urllib3/util/ssl_.py:90: InsecurePlatformWarning: A true SSLContext object is not available. This prevents urllib3 from configuring SSL appropriately and may cause certain SSL connections to fail. For more information, see https://urllib3.readthedocs.org/en/latest/security.html#insecureplatformwarning.
InsecurePlatformWarning
19:25:15 FATAL - Uncaught exception: Traceback (most recent call last):
19:25:15 FATAL - File "/builds/slave/m-in-m64-000000000000000000000/scripts/mozharness/base/script.py", line 1765, in run
19:25:15 FATAL - self.run_action(action)
19:25:15 FATAL - File "/builds/slave/m-in-m64-000000000000000000000/scripts/mozharness/base/script.py", line 1707, in run_action
19:25:15 FATAL - self._possibly_run_method(method_name, error_if_missing=True)
19:25:15 FATAL - File "/builds/slave/m-in-m64-000000000000000000000/scripts/mozharness/base/script.py", line 1647, in _possibly_run_method
19:25:15 FATAL - return getattr(self, method_name)()
19:25:15 FATAL - File "/builds/slave/m-in-m64-000000000000000000000/scripts/mozharness/mozilla/building/buildbase.py", line 1611, in upload_files
19:25:15 FATAL - property_conditions=property_conditions)
19:25:15 FATAL - File "/builds/slave/m-in-m64-000000000000000000000/scripts/mozharness/mozilla/building/buildbase.py", line 1522, in _taskcluster_upload
19:25:15 FATAL - tc.create_artifact(task, upload_file)
19:25:15 FATAL - File "/builds/slave/m-in-m64-000000000000000000000/scripts/mozharness/mozilla/taskcluster_helper.py", line 110, in create_artifact
19:25:15 FATAL - "contentType": mime_type,
19:25:15 FATAL - File "/builds/slave/m-in-m64-000000000000000000000/build/venv/lib/python2.7/site-packages/taskcluster/client.py", line 455, in apiCall
19:25:15 FATAL - return self._makeApiCall(e, *args, **kwargs)
19:25:15 FATAL - File "/builds/slave/m-in-m64-000000000000000000000/build/venv/lib/python2.7/site-packages/taskcluster/client.py", line 232, in _makeApiCall
19:25:15 FATAL - return self._makeHttpRequest(entry['method'], route, payload)
19:25:15 FATAL - File "/builds/slave/m-in-m64-000000000000000000000/build/venv/lib/python2.7/site-packages/taskcluster/client.py", line 424, in _makeHttpRequest
19:25:15 FATAL - superExc=rerr
19:25:15 FATAL - TaskclusterRestFailure: Artifacts cannot be created for a task after it is resolved, unless it is resolved 'exception', and even in this case only up to 25 min past resolution.This to ensure that artifacts have been uploaded before a task is 'completed' and output is consumed by a dependent task
19:25:15 FATAL - Running post_fatal callback...
19:25:15 ERROR - setting return code to 2 because fatal was called
19:25:15 WARNING - setting return code to 2
19:25:15 FATAL - Exiting -1
19:25:15 INFO - Running post-run listener: _summarize
19:25:15 ERROR - # TBPL FAILURE #
19:25:15 INFO - #####
19:25:15 INFO - ##### FxDesktopBuild summary:
19:25:15 INFO - #####
19:25:15 ERROR - # TBPL FAILURE #
19:25:15 INFO - Running post-run listener: copy_logs_to_upload_dir
19:25:15 INFO - Copying logs to upload dir...
program finished with exit code 255
Summary: Trees closed, Mac debug builds failing when they try to tell taskcluster that they are done → Trees closed, Mac builds failing when they upload to taskcluster
Comment 3•9 years ago
To the four people hg annotate finds across the two repos (m-c and build/mozharness) for this file (mozharness.mozilla.taskcluster_helper): which of the following behaviors is desired?
* fatal on report_completed failure
* retries + error on report_completed failure
* something else
Flags: needinfo?(rail)
Flags: needinfo?(mshal)
Flags: needinfo?(jlund)
Flags: needinfo?(armenzg)
Comment 4•9 years ago
There's something weird going on here:
* the logs appear to be out of order (influxdb_recording_post_action before the FATAL lines for the error)
* two runs on the taskcluster task - for the job https://treeherder.mozilla.org/#/jobs?repo=mozilla-inbound&selectedJob=22149496 the matching task is https://tools.taskcluster.net/task-inspector/#cLqhnZkrRImmIOuCpm6OFw/
** uploads start at 18:34 and continue until 19:21 (log time)
** tc task run 0 starts at 18:34 but is resolved at 18:54 Pacific; state: exception, reason resolved: claim-expired
** tc task run 1 is scheduled at 18:54, never starts, and is resolved at 19:35 with state: exception and reason resolved: deadline-exceeded
Comment 5•9 years ago
Also, in a mac opt build, uploads that used to take 15 minutes are much slower now - 45 minutes in one log I saw. No packet loss between bm81 in scl3 and taskcluster-public-artifacts.s3-us-west-2.amazonaws.com though.
Comment 6•9 years ago
Another timeout datapoint, which may be related - the two linux builds for 45.0b9 failed to upload symbols. Normally they take about 60s to upload, and now they time out after 2460 seconds. If we're conservative and overestimate the symbol zip to be 300MB in size, then we're getting less than 120KB/s upload speed.
Comment 7•9 years ago
(In reply to Nick Thomas [:nthomas] from comment #6)
> Another timeout datapoint, which may be related - the two linux builds for
> 45.0b9 failed to upload symbols. Normally they take about 60s to upload, and
> now they timeout in 2460 seconds. If we're conservative and overestimate the
> symbol zip to be 300MB in size, then we're getting less than 120KB/s upload
> speed.
tl;dr - this is likely also SCL3 -> AWS us-west-2 traffic, like the mac uploads.
Longer version:
The uploads go to crash-stats.mozilla.com, which has three IPs. The linux builds ran in two regions of AWS (one each in us-west-2 and us-east-1), but the upload is routed back to SCL3 over the IPsec tunnel for two of the IPs (54.68.216.133, 52.35.229.240), and then out over v-1030.core1.releng.scl3.mozilla.net to the internet. The path to the third IP (54.200.18.169) doesn't go via the releng firewall but still goes over xe-1-2-2.border1.pao1.mozilla.net.
Comment 8•9 years ago
pmoore says that a task must be reclaimed every 20 minutes, which lines up nicely with comment #4. The code in question starts at
https://dxr.mozilla.org/mozilla-central/source/testing/mozharness/mozharness/mozilla/building/buildbase.py#1485
We claim the task at line 1502, then loop over the files to upload at line 1517. Each create_artifact() call doesn't reclaim the task, so the claim quickly expires when the uploads are slow.
We are also seeing much slower uploads to the archive.m.o system, but they aren't failing out.
So we have three issues:
* the network needs to be faster; unclear whether this is SCL3 -> AWS being slow or AWS throttling us. Engaging netops.
* the tc upload code needs to be more robust in the face of slow uploads (eg giving up after 18 minutes, reclaiming the task, retrying - see the sketch below)
* routing of the symbol uploads should be AWS -> AWS direct
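A rough sketch of what reclaiming during the upload loop could look like (the tc helper's attribute names and the 15-minute cadence are assumptions, not the actual fix):

import time

RECLAIM_INTERVAL = 15 * 60  # stay well inside the 20-minute claim window

def upload_with_reclaim(tc, task, files):
    # tc is the taskcluster helper used by _taskcluster_upload() in buildbase.py
    task_id = task['status']['taskId']
    run_id = task['status']['runs'][-1]['runId']
    last_claim = time.time()
    for path in files:
        if time.time() - last_claim > RECLAIM_INTERVAL:
            # reclaimTask extends takenUntil so the run isn't resolved as
            # claim-expired while we're still uploading artifacts
            tc.taskcluster_queue.reclaimTask(task_id, run_id)
            last_claim = time.time()
        tc.create_artifact(task, path)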
Comment 9•9 years ago
If I simulate SCL3 -> AWS traffic with a 10MB random file:
[cltbld@bld-lion-r5-057.build.releng.scl3.mozilla.com ~]$ scp -i ~/.ssh/ffxbld_rsa test ffxbld@upload.ffxbld.productdelivery.prod.mozaws.net:/tmp/nthomas.test
then 2MB is transferred quickly but the connection stalls, and eventually transfers at ~50KB/s.
I spoke to XioNoX on IRC and he pointed out that various ipsec tunnels from scl3 to us-west-2 have been flapping from roughly the time this bug was filed, 'so the ipsec tunnels between scl3 and usw2 keep bouncing, so something is clearly not right'.
XioNoX: do you have any more information on where the problem might be? eg our internal network, providers, or AWS.
Flags: needinfo?(arzhel)
Comment 10•9 years ago
(In reply to Nick Thomas [:nthomas] from comment #8)
> * the tc upload code needs to be more robust in the face of slow uploads (eg
> giving up after 18 minutes, reclaiming the task, retrying)
--> bug 1250458
> * routing of the symbol uploads should be AWS -> AWS direct
--> bug 1250462
Comment 11•9 years ago
Seems like one provider (Zayo) between us and AWS is having a small amount of packet loss (~2%) starting about midnight UTC.
I opened ticket 955996 with them and am temporarily routing the USW2 endpoints through another provider:
xionox@border1.mgmt.pao1.mozilla.net# show | compare
[edit routing-options static]
route 63.245.208.0/20 { ... }
+ route 205.251.233.121/32 next-hop 67.92.171.50;
+ route 205.251.233.122/32 next-hop 67.92.171.50;
Flags: needinfo?(arzhel)
Comment 12•9 years ago
Unfortunately upload is still slow when repeating the scp in comment #9:
test 100% 9766KB 51.9KB/s 03:08
Does that routing change only affect some ipsec vpn endpoints, and leave out other traffic which might end up terminating on other IPs? Could we add these ones:
upload.ffxbld.productdelivery.prod.mozaws.net
taskcluster-public-artifacts.s3-us-west-2.amazonaws.com
crash-stats.mozilla.com
upload.tbirdbld.productdelivery.prod.mozaws.net
(for traffic originating in releng, if that makes any difference)
Some of these are multiple IPs (crash-stats), or in a range controlled by amazon (s3-us-west-2.amazonaws.com). https://ip-ranges.amazonaws.com/ip-ranges.json if you want the complete list ;-)
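If it helps, the us-west-2 S3 prefixes can be filtered out of that file with something like this (a rough Python 2 snippet; the field names are the ones Amazon publishes in ip-ranges.json):

import json
import urllib2

data = json.load(urllib2.urlopen("https://ip-ranges.amazonaws.com/ip-ranges.json"))
prefixes = sorted(p["ip_prefix"] for p in data["prefixes"]
                  if p["service"] == "S3" and p["region"] == "us-west-2")
print("\n".join(prefixes))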
Flags: needinfo?(arzhel)
Comment 13•9 years ago
XioNoX, is there a way to speed this up? Almost all trees are closed for this problem.
Comment 14•9 years ago
Deactivated our BGP session to Zayo in PAO1 until Zayo investigates. No more packet loss so far.
Flags: needinfo?(arzhel)
Comment 15•9 years ago
[cltbld@bld-lion-r5-057.build.releng.scl3.mozilla.com ~]$ scp -i ~/.ssh/ffxbld_rsa test ffxbld@upload.ffxbld.productdelivery.prod.mozaws.net:/tmp/xionox.test
test 100% 9766KB 4.8MB/s 00:02
Compared to comment 9, this seems a lot better.
Comment 16•9 years ago
Triggered test builds to check whether they come back green.
Comment 17•9 years ago
(In reply to Arzhel Younsi [:XioNoX] from comment #14)
> Deactivated our BGP session to Zayo in PAO1 until Zayo investigates. No more
> packet loss so far.
That worked - we have green builds. Trees reopened at 6:20 Pacific.
Comment 18•9 years ago
Nothing to add. My changes to that file are unrelated to the upload issue.
Flags: needinfo?(armenzg)
Comment 19•9 years ago
Zayo found a bad interface between us and them. They are dispatching a technician to fix it.
Comment 20•9 years ago
Clearing the needinfos, since this wasn't a code change. Also lowering severity since the trees are reopened and the network vendor is dispatching a technician.
Severity: blocker → major
Flags: needinfo?(rail)
Flags: needinfo?(mshal)
Flags: needinfo?(jlund)
Comment 21•9 years ago
I'm sure NetOps are tracking the circuit fix separately, and this is fixed from a RelEng pov.
Status: NEW → RESOLVED
Closed: 9 years ago
Resolution: --- → FIXED
Updated•7 years ago
Product: Release Engineering → Infrastructure & Operations
Updated•5 years ago
Product: Infrastructure & Operations → Infrastructure & Operations Graveyard