Closed Bug 1250374 Opened 8 years ago Closed 8 years ago

Trees closed, Mac builds failing when they upload to taskcluster

Categories

(Infrastructure & Operations Graveyard :: CIDuty, task)

Hardware: x86_64
OS: macOS
Type: task
Priority: Not set
Severity: major
Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: philor, Unassigned)

References

Details

Attachments

(1 file)

Can't make this stuff up.

https://treeherder.mozilla.org/logviewer.html#?job_id=7469456&repo=fx-team
https://treeherder.mozilla.org/logviewer.html#?job_id=22151725&repo=mozilla-inbound
https://treeherder.mozilla.org/logviewer.html#?job_id=22152066&repo=mozilla-inbound
https://treeherder.mozilla.org/logviewer.html#?job_id=22152123&repo=mozilla-inbound

All of the form

 18:38:57     INFO - Starting new HTTPS connection (1): taskcluster-public-artifacts.s3-us-west-2.amazonaws.com
 /builds/slave/fx-team-m64-d-0000000000000000/build/venv/lib/python2.7/site-packages/requests/packages/urllib3/util/ssl_.py:90: InsecurePlatformWarning: A true SSLContext object is not available. This prevents urllib3 from configuring SSL appropriately and may cause certain SSL connections to fail. For more information, see https://urllib3.readthedocs.org/en/latest/security.html#insecureplatformwarning.
   InsecurePlatformWarning
 18:38:58     INFO - Resolving V8nFykokRIGDnBdLiPoHSQ, run 0. Full task:
 18:38:58     INFO - {u'status': {u'workerType': u'buildbot', u'taskGroupId': u'V8nFykokRIGDnBdLiPoHSQ', u'runs': [{u'scheduled': u'2016-02-23T01:54:52.874Z', u'reasonCreated': u'scheduled', u'state': u'pending', u'runId': 0}], u'expires': u'2017-02-23T02:54:51.771Z', u'retriesLeft': 5, u'state': u'pending', u'schedulerId': u'-', u'deadline': u'2016-02-23T02:54:51.771Z', u'taskId': u'V8nFykokRIGDnBdLiPoHSQ', u'provisionerId': u'null-provisioner'}}
 18:38:58     INFO - Starting new HTTPS connection (1): queue.taskcluster.net
 /builds/slave/fx-team-m64-d-0000000000000000/build/venv/lib/python2.7/site-packages/requests/packages/urllib3/util/ssl_.py:90: InsecurePlatformWarning: A true SSLContext object is not available. This prevents urllib3 from configuring SSL appropriately and may cause certain SSL connections to fail. For more information, see https://urllib3.readthedocs.org/en/latest/security.html#insecureplatformwarning.
   InsecurePlatformWarning
 18:38:59     INFO - Running post-action listener: influxdb_recording_post_action
 18:38:59     INFO - Resetting dropped connection: goldiewilson-onepointtwentyone-1.c.influxdb.com
 /builds/slave/fx-team-m64-d-0000000000000000/build/venv/lib/python2.7/site-packages/requests/packages/urllib3/util/ssl_.py:90: InsecurePlatformWarning: A true SSLContext object is not available. This prevents urllib3 from configuring SSL appropriately and may cause certain SSL connections to fail. For more information, see https://urllib3.readthedocs.org/en/latest/security.html#insecureplatformwarning.
   InsecurePlatformWarning
 18:39:00    FATAL - Uncaught exception: Traceback (most recent call last):
 18:39:00    FATAL -   File "/builds/slave/fx-team-m64-d-0000000000000000/scripts/mozharness/base/script.py", line 1765, in run
 18:39:00    FATAL -     self.run_action(action)
 18:39:00    FATAL -   File "/builds/slave/fx-team-m64-d-0000000000000000/scripts/mozharness/base/script.py", line 1707, in run_action
 18:39:00    FATAL -     self._possibly_run_method(method_name, error_if_missing=True)
 18:39:00    FATAL -   File "/builds/slave/fx-team-m64-d-0000000000000000/scripts/mozharness/base/script.py", line 1647, in _possibly_run_method
 18:39:00    FATAL -     return getattr(self, method_name)()
 18:39:00    FATAL -   File "/builds/slave/fx-team-m64-d-0000000000000000/scripts/mozharness/mozilla/building/buildbase.py", line 1611, in upload_files
 18:39:00    FATAL -     property_conditions=property_conditions)
 18:39:00    FATAL -   File "/builds/slave/fx-team-m64-d-0000000000000000/scripts/mozharness/mozilla/building/buildbase.py", line 1539, in _taskcluster_upload
 18:39:00    FATAL -     tc.report_completed(task)
 18:39:00    FATAL -   File "/builds/slave/fx-team-m64-d-0000000000000000/scripts/mozharness/mozilla/taskcluster_helper.py", line 135, in report_completed
 18:39:00    FATAL -     self.taskcluster_queue.reportCompleted(task_id, run_id)
 18:39:00    FATAL -   File "/builds/slave/fx-team-m64-d-0000000000000000/build/venv/lib/python2.7/site-packages/taskcluster/client.py", line 455, in apiCall
 18:39:00    FATAL -     return self._makeApiCall(e, *args, **kwargs)
 18:39:00    FATAL -   File "/builds/slave/fx-team-m64-d-0000000000000000/build/venv/lib/python2.7/site-packages/taskcluster/client.py", line 232, in _makeApiCall
 18:39:00    FATAL -     return self._makeHttpRequest(entry['method'], route, payload)
 18:39:00    FATAL -   File "/builds/slave/fx-team-m64-d-0000000000000000/build/venv/lib/python2.7/site-packages/taskcluster/client.py", line 424, in _makeHttpRequest
 18:39:00    FATAL -     superExc=rerr
 18:39:00    FATAL - TaskclusterRestFailure: Run 0 on task V8nFykokRIGDnBdLiPoHSQ is resolved or not running.
 18:39:00    FATAL - ----
 18:39:00    FATAL - errorCode:  RequestConflict
 18:39:00    FATAL - statusCode: 409
 18:39:00    FATAL - requestInfo:
 18:39:00    FATAL -   method:   reportCompleted
 18:39:00    FATAL -   params:   {"taskId":"V8nFykokRIGDnBdLiPoHSQ","runId":"0"}
 18:39:00    FATAL -   payload:  {}
 18:39:00    FATAL -   time:     2016-02-23T02:38:59.501Z
 18:39:00    FATAL - details:
 18:39:00    FATAL - {
 18:39:00    FATAL -   "taskId": "V8nFykokRIGDnBdLiPoHSQ",
 18:39:00    FATAL -   "runId": 0
 18:39:00    FATAL - }
 18:39:00    FATAL - Running post_fatal callback...
 18:39:00    ERROR - setting return code to 2 because fatal was called
 18:39:00  WARNING - setting return code to 2
 18:39:00    FATAL - Exiting -1
 18:39:00     INFO - Running post-run listener: _summarize
 18:39:00    ERROR - # TBPL FAILURE #
 18:39:00     INFO - #####
 18:39:00     INFO - ##### FxDesktopBuild summary:
 18:39:00     INFO - #####
 18:39:00    ERROR - # TBPL FAILURE #


At least so far, Try seems unaffected, so it's open, but everything else is closed.
Attached patch 1250374 (Splinter Review)
In theory, this will
a) retry submitting taskcluster report_completed on failure, and
b) only result in ERROR if it fails the max_retries number of times, not a FATAL.

I'm not sure if this works, and I'm not sure if the new behavior is the one we want. I do know that since mozharness is in-tree, landing will only mitigate this problem going forward, not for older unmerged changesets, other branches, etc.
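For context, a minimal sketch of the behavior the patch aims for: retry reportCompleted and downgrade the failure from FATAL to ERROR once retries are exhausted. The retry_report_completed helper, its arguments, and the exception import path are illustrative assumptions, not the attached patch itself (the real code lives in mozharness/mozilla/taskcluster_helper.py).

 import time

 from taskcluster.exceptions import TaskclusterRestFailure  # import path may differ by client version

 def retry_report_completed(queue, task_id, run_id, log, max_retries=5, sleep_time=30):
     # Hypothetical helper: retry queue.reportCompleted() instead of dying on
     # the first TaskclusterRestFailure. 'queue' is a taskcluster Queue client,
     # 'log' is anything with info()/error() methods (e.g. a mozharness script).
     for attempt in range(1, max_retries + 1):
         try:
             queue.reportCompleted(task_id, run_id)
             return True
         except TaskclusterRestFailure as e:
             log.info("reportCompleted attempt %d/%d failed: %s" % (attempt, max_retries, e))
             time.sleep(sleep_time)
     # Once retries are exhausted, log an ERROR rather than calling fatal(),
     # so the build reports failure without aborting mozharness outright.
     log.error("giving up on reportCompleted for %s run %s" % (task_id, run_id))
     return False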
Also errors in mac opt, eg
https://treeherder.mozilla.org/logviewer.html#?job_id=22151731&repo=mozilla-inbound

 19:25:14     INFO - Uploading to S3: filename=/builds/slave/m-in-m64-000000000000000000000/logs/log_raw.log mimetype=text/plain length=17366589
 19:25:14     INFO - Starting new HTTPS connection (1): queue.taskcluster.net
 /builds/slave/m-in-m64-000000000000000000000/build/venv/lib/python2.7/site-packages/requests/packages/urllib3/util/ssl_.py:90: InsecurePlatformWarning: A true SSLContext object is not available. This prevents urllib3 from configuring SSL appropriately and may cause certain SSL connections to fail. For more information, see https://urllib3.readthedocs.org/en/latest/security.html#insecureplatformwarning.
   InsecurePlatformWarning
 19:25:15     INFO - Running post-action listener: influxdb_recording_post_action
 19:25:15     INFO - Resetting dropped connection: goldiewilson-onepointtwentyone-1.c.influxdb.com
 /builds/slave/m-in-m64-000000000000000000000/build/venv/lib/python2.7/site-packages/requests/packages/urllib3/util/ssl_.py:90: InsecurePlatformWarning: A true SSLContext object is not available. This prevents urllib3 from configuring SSL appropriately and may cause certain SSL connections to fail. For more information, see https://urllib3.readthedocs.org/en/latest/security.html#insecureplatformwarning.
   InsecurePlatformWarning
 19:25:15    FATAL - Uncaught exception: Traceback (most recent call last):
 19:25:15    FATAL -   File "/builds/slave/m-in-m64-000000000000000000000/scripts/mozharness/base/script.py", line 1765, in run
 19:25:15    FATAL -     self.run_action(action)
 19:25:15    FATAL -   File "/builds/slave/m-in-m64-000000000000000000000/scripts/mozharness/base/script.py", line 1707, in run_action
 19:25:15    FATAL -     self._possibly_run_method(method_name, error_if_missing=True)
 19:25:15    FATAL -   File "/builds/slave/m-in-m64-000000000000000000000/scripts/mozharness/base/script.py", line 1647, in _possibly_run_method
 19:25:15    FATAL -     return getattr(self, method_name)()
 19:25:15    FATAL -   File "/builds/slave/m-in-m64-000000000000000000000/scripts/mozharness/mozilla/building/buildbase.py", line 1611, in upload_files
 19:25:15    FATAL -     property_conditions=property_conditions)
 19:25:15    FATAL -   File "/builds/slave/m-in-m64-000000000000000000000/scripts/mozharness/mozilla/building/buildbase.py", line 1522, in _taskcluster_upload
 19:25:15    FATAL -     tc.create_artifact(task, upload_file)
 19:25:15    FATAL -   File "/builds/slave/m-in-m64-000000000000000000000/scripts/mozharness/mozilla/taskcluster_helper.py", line 110, in create_artifact
 19:25:15    FATAL -     "contentType": mime_type,
 19:25:15    FATAL -   File "/builds/slave/m-in-m64-000000000000000000000/build/venv/lib/python2.7/site-packages/taskcluster/client.py", line 455, in apiCall
 19:25:15    FATAL -     return self._makeApiCall(e, *args, **kwargs)
 19:25:15    FATAL -   File "/builds/slave/m-in-m64-000000000000000000000/build/venv/lib/python2.7/site-packages/taskcluster/client.py", line 232, in _makeApiCall
 19:25:15    FATAL -     return self._makeHttpRequest(entry['method'], route, payload)
 19:25:15    FATAL -   File "/builds/slave/m-in-m64-000000000000000000000/build/venv/lib/python2.7/site-packages/taskcluster/client.py", line 424, in _makeHttpRequest
 19:25:15    FATAL -     superExc=rerr
 19:25:15    FATAL - TaskclusterRestFailure: Artifacts cannot be created for a task after it is resolved, unless it is resolved 'exception', and even in this case only up to 25 min past resolution.This to ensure that artifacts have been uploaded before a task is 'completed' and output is consumed by a dependent task
 19:25:15    FATAL - Running post_fatal callback...
 19:25:15    ERROR - setting return code to 2 because fatal was called
 19:25:15  WARNING - setting return code to 2
 19:25:15    FATAL - Exiting -1
 19:25:15     INFO - Running post-run listener: _summarize
 19:25:15    ERROR - # TBPL FAILURE #
 19:25:15     INFO - #####
 19:25:15     INFO - ##### FxDesktopBuild summary:
 19:25:15     INFO - #####
 19:25:15    ERROR - # TBPL FAILURE #
 19:25:15     INFO - Running post-run listener: copy_logs_to_upload_dir
 19:25:15     INFO - Copying logs to upload dir...
 program finished with exit code 255
Summary: Trees closed, Mac debug builds failing when they try to tell taskcluster that they are done → Trees closed, Mac builds failing when they upload to taskcluster
To the four people I'm finding in hg annotate across the two repos (m-c and build/mozharness) for this file (mozharness.mozilla.taskcluster_helper): which of the following behaviors is desired?

* fatal on report_completed failure
* retries + error on report_completed failure
* something else
Flags: needinfo?(rail)
Flags: needinfo?(mshal)
Flags: needinfo?(jlund)
Flags: needinfo?(armenzg)
There's something weird going on here:
* the logs appear to be out of order (influxdb_recording_post_action before the FATAL lines for the error)
* two runs on the taskcluster task - for the job https://treeherder.mozilla.org/#/jobs?repo=mozilla-inbound&selectedJob=22149496 the matching task is https://tools.taskcluster.net/task-inspector/#cLqhnZkrRImmIOuCpm6OFw/
** uploads start at 18:34 and continue until 19:21 (log time)
** tc task run0 starts 18:34 but is resolved at 18:54 pacific; state: exception, reasonResolved: claim-expired
** tc task run1 is scheduled at 18:54, never starts, but is resolved 19:35 in state: exception, reasonResolved: deadline-exceeded
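For reference, the same run information can be pulled from the queue API rather than the task-inspector UI; a minimal sketch, assuming the taskcluster Python client's Queue.status() call (field names as seen in the task JSON earlier in this bug):

 import taskcluster

 def show_runs(task_id):
     # Print each run's state and resolution reason for a task, e.g. the
     # cLqhnZkrRImmIOuCpm6OFw task linked above (the status endpoint is
     # public, so no credentials are needed just to read it).
     queue = taskcluster.Queue()
     status = queue.status(task_id)['status']
     for run in status['runs']:
         print("run %s: state=%s reasonResolved=%s" % (
             run['runId'], run['state'], run.get('reasonResolved', 'n/a')))

 show_runs('cLqhnZkrRImmIOuCpm6OFw')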
Also, in a mac opt build, uploads that used to take about 15 minutes are now much slower - 45 minutes in one log I saw. No packet loss between bm81 in scl3 and taskcluster-public-artifacts.s3-us-west-2.amazonaws.com though.
Another timeout datapoint, which may be related - the two linux builds for 45.0b9 failed to upload symbols. Normally they take about 60s to upload, and now they time out after 2460 seconds. If we're conservative and overestimate the symbol zip to be 300MB in size, then we're getting less than 120KB/s upload speed.
(In reply to Nick Thomas [:nthomas] from comment #6)
> Another timeout datapoint, which may be related - the two linux builds for
> 45.0b9 failed to upload symbols. Normally they take about 60s to upload, and
> now they timeout in 2460 seconds. If we're conservative and overestimate the
> symbol zip to be 300MB in size, then we're getting less than 120KB/s upload
> speed.

tl;dr - this is likely also SCL3 -> AWS us-west-2 traffic, like the mac uploads.

Longer version:
The uploads go to crash-stats.mozilla.com, which has three IPs. The linux builds ran in two regions of AWS (one each in us-west-2 and us-east-1), but the upload is routed back to SCL3 over the IPsec tunnel for two of the IPs (54.68.216.133, 52.35.229.240), and then out over v-1030.core1.releng.scl3.mozilla.net to the internet. The path to the third IP (54.200.18.169) doesn't go via the releng firewall but still goes over xe-1-2-2.border1.pao1.mozilla.net.
pmoore says that a task must be reclaimed every 20 minutes, which lines up nicely with comment #4. The code in question starts at
 https://dxr.mozilla.org/mozilla-central/source/testing/mozharness/mozharness/mozilla/building/buildbase.py#1485
We claim the task at line 1502, then loop over the files to upload at line 1517. Each call to create_artifact() doesn't reclaim the task, so the claim quickly expires when the upload is slow.

We are also seeing much slower uploads to the archive.m.o system, but they aren't failing out.

So we have three issues:
* the network needs to be faster; unclear whether this is SCL3 -> AWS or AWS throttling us. Engaging netops
* the tc upload code needs to be more robust in the face of slow uploads (eg giving up after 18 minutes, reclaiming the task, retrying; see the sketch below)
* routing of the symbol uploads should be AWS -> AWS direct
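To make the second point concrete, a rough sketch of what reclaiming between artifact uploads could look like; the RECLAIM_THRESHOLD value, the upload_with_reclaim helper, and its arguments are assumptions for illustration, not the fix that eventually lands (that work is bug 1250458).

 import time

 RECLAIM_THRESHOLD = 15 * 60  # seconds; reclaim well inside the ~20 minute claim window

 def upload_with_reclaim(queue, task_id, run_id, files, create_artifact):
     # Hypothetical loop: before each upload, reclaim the task if we are
     # getting close to claim expiry, so a slow transfer can't let the claim
     # lapse mid-upload the way it does in _taskcluster_upload today.
     last_claim = time.time()
     for path in files:
         if time.time() - last_claim > RECLAIM_THRESHOLD:
             queue.reclaimTask(task_id, run_id)
             last_claim = time.time()
         create_artifact(task_id, run_id, path)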
If I simulate SCL3 -> AWS traffic with a 10MB random file:

[cltbld@bld-lion-r5-057.build.releng.scl3.mozilla.com ~]$ scp -i ~/.ssh/ffxbld_rsa test ffxbld@upload.ffxbld.productdelivery.prod.mozaws.net:/tmp/nthomas.test

then 2MB is transferred quickly but the connection stalls, and eventually transfers at ~50KB/s.

I spoke to XioNoX on IRC and he pointed out that various ipsec tunnels from scl3 to us-west-2 have been flapping from roughly the time this bug was filed, 'so the ipsec tunnels between scl3 and usw2 keep bouncing, so something is clearly not right'.

XioNoX: do you have any more information on where the problem might be? eg our internal network/providers/AWS.
Flags: needinfo?(arzhel)
(In reply to Nick Thomas [:nthomas] from comment #8)
> * the tc upload code needs to be more robust in the face of slow uploads (eg
> giving up after 18 minutes, reclaiming the task, retrying)   

--> bug 1250458

> * routing of the symbol uploads should be AWS -> AWS direct  

--> bug 1250462
See Also: → 1250462, 1250458
Seems like one provider (Zayo) between us and AWS is having a small amount of packet loss (~2%) starting about midnight UTC.

I opened ticket 955996 with them and temporarily routed the USW2 endpoint through another provider.

xionox@border1.mgmt.pao1.mozilla.net# show | compare 
[edit routing-options static]
     route 63.245.208.0/20 { ... }
+    route 205.251.233.121/32 next-hop 67.92.171.50;
+    route 205.251.233.122/32 next-hop 67.92.171.50;
Flags: needinfo?(arzhel)
Unfortunately upload is still slow when repeating the scp in comment #9:
test                          100% 9766KB  51.9KB/s   03:08

Does that routing change only affect some ipsec vpn endpoints, and leave out other traffic which might end up terminating on other IPs? Could we add these ones:
 upload.ffxbld.productdelivery.prod.mozaws.net
 taskcluster-public-artifacts.s3-us-west-2.amazonaws.com
 crash-stats.mozilla.com
 upload.tbirdbld.productdelivery.prod.mozaws.net
(for traffic originating in releng, if that makes any difference)

Some of these are multiple IPs (crash-stats), or in a range controlled by Amazon (s3-us-west-2.amazonaws.com). See https://ip-ranges.amazonaws.com/ip-ranges.json if you want the complete list ;-)
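For what it's worth, a small sketch of pulling the us-west-2 S3 prefixes out of that JSON; the filtering criteria here are just one example of how the relevant ranges could be listed.

 import json
 import urllib2

 # Fetch Amazon's published ranges and keep the S3 prefixes for us-west-2,
 # which is where taskcluster-public-artifacts.s3-us-west-2.amazonaws.com lives.
 data = json.load(urllib2.urlopen('https://ip-ranges.amazonaws.com/ip-ranges.json'))
 for prefix in data['prefixes']:
     if prefix['region'] == 'us-west-2' and prefix['service'] == 'S3':
         print(prefix['ip_prefix'])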
Flags: needinfo?(arzhel)
XioNoX, is there a way to speed this up? Almost all trees are closed for this problem.
Deactivated our BGP session to Zayo in PAO1 until Zayo investigates. No more packet loss so far.
Flags: needinfo?(arzhel)
[cltbld@bld-lion-r5-057.build.releng.scl3.mozilla.com ~]$ scp -i ~/.ssh/ffxbld_rsa test ffxbld@upload.ffxbld.productdelivery.prod.mozaws.net:/tmp/xionox.test
test                                                                            100% 9766KB   4.8MB/s   00:02

Compared to comment 9, this seems a lot better.
Triggered test builds to check if they come back green.
(In reply to Arzhel Younsi [:XioNoX] from comment #14)
> Deactivated our BGP session to Zayo in PAO1 until Zayo investigates. No more
> packet loss so far.

That worked, we have green builds - trees reopen at 6:20 pacific.
Nothing to add. My changes to that file are unrelated to the upload issue.
Flags: needinfo?(armenzg)
Zayo found a bad interface between us and them. They are dispatching a technician to fix it.
Clearing needinfos since this wasn't a code change, and lowering severity since trees are reopened and the network vendor is dispatching a technician.
Severity: blocker → major
Flags: needinfo?(rail)
Flags: needinfo?(mshal)
Flags: needinfo?(jlund)
I'm sure NetOps are tracking the circuit fix separately, and this is fixed from a RelEng pov.
Status: NEW → RESOLVED
Closed: 8 years ago
Resolution: --- → FIXED
Product: Release Engineering → Infrastructure & Operations
Product: Infrastructure & Operations → Infrastructure & Operations Graveyard