Closed Bug 1250374 Opened 8 years ago Closed 8 years ago

Trees closed, Mac builds failing when they upload to taskcluster

Categories

(Infrastructure & Operations Graveyard :: CIDuty, task)

Hardware: x86_64
OS: macOS
Type: task
Priority: Not set
Severity: major
Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: philor, Unassigned)

References

Details

Attachments

(1 file)

Can't make this stuff up.

https://treeherder.mozilla.org/logviewer.html#?job_id=7469456&repo=fx-team
https://treeherder.mozilla.org/logviewer.html#?job_id=22151725&repo=mozilla-inbound
https://treeherder.mozilla.org/logviewer.html#?job_id=22152066&repo=mozilla-inbound
https://treeherder.mozilla.org/logviewer.html#?job_id=22152123&repo=mozilla-inbound

All of the form

 18:38:57     INFO - Starting new HTTPS connection (1): taskcluster-public-artifacts.s3-us-west-2.amazonaws.com
 /builds/slave/fx-team-m64-d-0000000000000000/build/venv/lib/python2.7/site-packages/requests/packages/urllib3/util/ssl_.py:90: InsecurePlatformWarning: A true SSLContext object is not available. This prevents urllib3 from configuring SSL appropriately and may cause certain SSL connections to fail. For more information, see https://urllib3.readthedocs.org/en/latest/security.html#insecureplatformwarning.
   InsecurePlatformWarning
 18:38:58     INFO - Resolving V8nFykokRIGDnBdLiPoHSQ, run 0. Full task:
 18:38:58     INFO - {u'status': {u'workerType': u'buildbot', u'taskGroupId': u'V8nFykokRIGDnBdLiPoHSQ', u'runs': [{u'scheduled': u'2016-02-23T01:54:52.874Z', u'reasonCreated': u'scheduled', u'state': u'pending', u'runId': 0}], u'expires': u'2017-02-23T02:54:51.771Z', u'retriesLeft': 5, u'state': u'pending', u'schedulerId': u'-', u'deadline': u'2016-02-23T02:54:51.771Z', u'taskId': u'V8nFykokRIGDnBdLiPoHSQ', u'provisionerId': u'null-provisioner'}}
 18:38:58     INFO - Starting new HTTPS connection (1): queue.taskcluster.net
 /builds/slave/fx-team-m64-d-0000000000000000/build/venv/lib/python2.7/site-packages/requests/packages/urllib3/util/ssl_.py:90: InsecurePlatformWarning: A true SSLContext object is not available. This prevents urllib3 from configuring SSL appropriately and may cause certain SSL connections to fail. For more information, see https://urllib3.readthedocs.org/en/latest/security.html#insecureplatformwarning.
   InsecurePlatformWarning
 18:38:59     INFO - Running post-action listener: influxdb_recording_post_action
 18:38:59     INFO - Resetting dropped connection: goldiewilson-onepointtwentyone-1.c.influxdb.com
 /builds/slave/fx-team-m64-d-0000000000000000/build/venv/lib/python2.7/site-packages/requests/packages/urllib3/util/ssl_.py:90: InsecurePlatformWarning: A true SSLContext object is not available. This prevents urllib3 from configuring SSL appropriately and may cause certain SSL connections to fail. For more information, see https://urllib3.readthedocs.org/en/latest/security.html#insecureplatformwarning.
   InsecurePlatformWarning
 18:39:00    FATAL - Uncaught exception: Traceback (most recent call last):
 18:39:00    FATAL -   File "/builds/slave/fx-team-m64-d-0000000000000000/scripts/mozharness/base/script.py", line 1765, in run
 18:39:00    FATAL -     self.run_action(action)
 18:39:00    FATAL -   File "/builds/slave/fx-team-m64-d-0000000000000000/scripts/mozharness/base/script.py", line 1707, in run_action
 18:39:00    FATAL -     self._possibly_run_method(method_name, error_if_missing=True)
 18:39:00    FATAL -   File "/builds/slave/fx-team-m64-d-0000000000000000/scripts/mozharness/base/script.py", line 1647, in _possibly_run_method
 18:39:00    FATAL -     return getattr(self, method_name)()
 18:39:00    FATAL -   File "/builds/slave/fx-team-m64-d-0000000000000000/scripts/mozharness/mozilla/building/buildbase.py", line 1611, in upload_files
 18:39:00    FATAL -     property_conditions=property_conditions)
 18:39:00    FATAL -   File "/builds/slave/fx-team-m64-d-0000000000000000/scripts/mozharness/mozilla/building/buildbase.py", line 1539, in _taskcluster_upload
 18:39:00    FATAL -     tc.report_completed(task)
 18:39:00    FATAL -   File "/builds/slave/fx-team-m64-d-0000000000000000/scripts/mozharness/mozilla/taskcluster_helper.py", line 135, in report_completed
 18:39:00    FATAL -     self.taskcluster_queue.reportCompleted(task_id, run_id)
 18:39:00    FATAL -   File "/builds/slave/fx-team-m64-d-0000000000000000/build/venv/lib/python2.7/site-packages/taskcluster/client.py", line 455, in apiCall
 18:39:00    FATAL -     return self._makeApiCall(e, *args, **kwargs)
 18:39:00    FATAL -   File "/builds/slave/fx-team-m64-d-0000000000000000/build/venv/lib/python2.7/site-packages/taskcluster/client.py", line 232, in _makeApiCall
 18:39:00    FATAL -     return self._makeHttpRequest(entry['method'], route, payload)
 18:39:00    FATAL -   File "/builds/slave/fx-team-m64-d-0000000000000000/build/venv/lib/python2.7/site-packages/taskcluster/client.py", line 424, in _makeHttpRequest
 18:39:00    FATAL -     superExc=rerr
 18:39:00    FATAL - TaskclusterRestFailure: Run 0 on task V8nFykokRIGDnBdLiPoHSQ is resolved or not running.
 18:39:00    FATAL - ----
 18:39:00    FATAL - errorCode:  RequestConflict
 18:39:00    FATAL - statusCode: 409
 18:39:00    FATAL - requestInfo:
 18:39:00    FATAL -   method:   reportCompleted
 18:39:00    FATAL -   params:   {"taskId":"V8nFykokRIGDnBdLiPoHSQ","runId":"0"}
 18:39:00    FATAL -   payload:  {}
 18:39:00    FATAL -   time:     2016-02-23T02:38:59.501Z
 18:39:00    FATAL - details:
 18:39:00    FATAL - {
 18:39:00    FATAL -   "taskId": "V8nFykokRIGDnBdLiPoHSQ",
 18:39:00    FATAL -   "runId": 0
 18:39:00    FATAL - }
 18:39:00    FATAL - Running post_fatal callback...
 18:39:00    ERROR - setting return code to 2 because fatal was called
 18:39:00  WARNING - setting return code to 2
 18:39:00    FATAL - Exiting -1
 18:39:00     INFO - Running post-run listener: _summarize
 18:39:00    ERROR - # TBPL FAILURE #
 18:39:00     INFO - #####
 18:39:00     INFO - ##### FxDesktopBuild summary:
 18:39:00     INFO - #####
 18:39:00    ERROR - # TBPL FAILURE #


At least so far, Try seems unaffected, so it's open, but everything else is closed.
Attached patch 1250374 (Splinter Review)
In theory, this will
a) retry submitting taskcluster report_completed on failure, and
b) only result in ERROR if it fails the max_retries number of times, not a FATAL.

I'm not sure if this works, and I'm not sure if the new behavior is the one we want. I do know that since mozharness is in-tree, landing will only mitigate this problem going forward, not for older unmerged changesets, other branches, etc.
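For context, a minimal sketch of the behavior the patch aims for: retry reportCompleted and downgrade the failure from FATAL to ERROR once retries are exhausted. The retry_report_completed helper, its arguments, and the exception import path are illustrative assumptions, not the attached patch itself (the real code lives in mozharness/mozilla/taskcluster_helper.py).

 import time

 from taskcluster.exceptions import TaskclusterRestFailure  # import path may differ by client version

 def retry_report_completed(queue, task_id, run_id, log, max_retries=5, sleep_time=30):
     # Hypothetical helper: retry queue.reportCompleted() instead of dying on
     # the first TaskclusterRestFailure. 'queue' is a taskcluster Queue client,
     # 'log' is anything with info()/error() methods (e.g. a mozharness script).
     for attempt in range(1, max_retries + 1):
         try:
             queue.reportCompleted(task_id, run_id)
             return True
         except TaskclusterRestFailure as e:
             log.info("reportCompleted attempt %d/%d failed: %s" % (attempt, max_retries, e))
             time.sleep(sleep_time)
     # Once retries are exhausted, log an ERROR rather than calling fatal(),
     # so the build reports failure without aborting mozharness outright.
     log.error("giving up on reportCompleted for %s run %s" % (task_id, run_id))
     return False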
Also errors in mac opt, eg
https://treeherder.mozilla.org/logviewer.html#?job_id=22151731&repo=mozilla-inbound

 19:25:14     INFO - Uploading to S3: filename=/builds/slave/m-in-m64-000000000000000000000/logs/log_raw.log mimetype=text/plain length=17366589
 19:25:14     INFO - Starting new HTTPS connection (1): queue.taskcluster.net
 /builds/slave/m-in-m64-000000000000000000000/build/venv/lib/python2.7/site-packages/requests/packages/urllib3/util/ssl_.py:90: InsecurePlatformWarning: A true SSLContext object is not available. This prevents urllib3 from configuring SSL appropriately and may cause certain SSL connections to fail. For more information, see https://urllib3.readthedocs.org/en/latest/security.html#insecureplatformwarning.
   InsecurePlatformWarning
 19:25:15     INFO - Running post-action listener: influxdb_recording_post_action
 19:25:15     INFO - Resetting dropped connection: goldiewilson-onepointtwentyone-1.c.influxdb.com
 /builds/slave/m-in-m64-000000000000000000000/build/venv/lib/python2.7/site-packages/requests/packages/urllib3/util/ssl_.py:90: InsecurePlatformWarning: A true SSLContext object is not available. This prevents urllib3 from configuring SSL appropriately and may cause certain SSL connections to fail. For more information, see https://urllib3.readthedocs.org/en/latest/security.html#insecureplatformwarning.
   InsecurePlatformWarning
 19:25:15    FATAL - Uncaught exception: Traceback (most recent call last):
 19:25:15    FATAL -   File "/builds/slave/m-in-m64-000000000000000000000/scripts/mozharness/base/script.py", line 1765, in run
 19:25:15    FATAL -     self.run_action(action)
 19:25:15    FATAL -   File "/builds/slave/m-in-m64-000000000000000000000/scripts/mozharness/base/script.py", line 1707, in run_action
 19:25:15    FATAL -     self._possibly_run_method(method_name, error_if_missing=True)
 19:25:15    FATAL -   File "/builds/slave/m-in-m64-000000000000000000000/scripts/mozharness/base/script.py", line 1647, in _possibly_run_method
 19:25:15    FATAL -     return getattr(self, method_name)()
 19:25:15    FATAL -   File "/builds/slave/m-in-m64-000000000000000000000/scripts/mozharness/mozilla/building/buildbase.py", line 1611, in upload_files
 19:25:15    FATAL -     property_conditions=property_conditions)
 19:25:15    FATAL -   File "/builds/slave/m-in-m64-000000000000000000000/scripts/mozharness/mozilla/building/buildbase.py", line 1522, in _taskcluster_upload
 19:25:15    FATAL -     tc.create_artifact(task, upload_file)
 19:25:15    FATAL -   File "/builds/slave/m-in-m64-000000000000000000000/scripts/mozharness/mozilla/taskcluster_helper.py", line 110, in create_artifact
 19:25:15    FATAL -     "contentType": mime_type,
 19:25:15    FATAL -   File "/builds/slave/m-in-m64-000000000000000000000/build/venv/lib/python2.7/site-packages/taskcluster/client.py", line 455, in apiCall
 19:25:15    FATAL -     return self._makeApiCall(e, *args, **kwargs)
 19:25:15    FATAL -   File "/builds/slave/m-in-m64-000000000000000000000/build/venv/lib/python2.7/site-packages/taskcluster/client.py", line 232, in _makeApiCall
 19:25:15    FATAL -     return self._makeHttpRequest(entry['method'], route, payload)
 19:25:15    FATAL -   File "/builds/slave/m-in-m64-000000000000000000000/build/venv/lib/python2.7/site-packages/taskcluster/client.py", line 424, in _makeHttpRequest
 19:25:15    FATAL -     superExc=rerr
 19:25:15    FATAL - TaskclusterRestFailure: Artifacts cannot be created for a task after it is resolved, unless it is resolved 'exception', and even in this case only up to 25 min past resolution.This to ensure that artifacts have been uploaded before a task is 'completed' and output is consumed by a dependent task
 19:25:15    FATAL - Running post_fatal callback...
 19:25:15    ERROR - setting return code to 2 because fatal was called
 19:25:15  WARNING - setting return code to 2
 19:25:15    FATAL - Exiting -1
 19:25:15     INFO - Running post-run listener: _summarize
 19:25:15    ERROR - # TBPL FAILURE #
 19:25:15     INFO - #####
 19:25:15     INFO - ##### FxDesktopBuild summary:
 19:25:15     INFO - #####
 19:25:15    ERROR - # TBPL FAILURE #
 19:25:15     INFO - Running post-run listener: copy_logs_to_upload_dir
 19:25:15     INFO - Copying logs to upload dir...
 program finished with exit code 255
Summary: Trees closed, Mac debug builds failing when they try to tell taskcluster that they are done → Trees closed, Mac builds failing when they upload to taskcluster
To the four people I'm finding in hg annotate across the two repos (m-c and build/mozharness) for this file (mozharness.mozilla.taskcluster_helper): which of the following behaviors is desired?

* fatal on report_completed failure
* retries + error on report_completed failure
* something else
Flags: needinfo?(rail)
Flags: needinfo?(mshal)
Flags: needinfo?(jlund)
Flags: needinfo?(armenzg)
There's something weird going on here:
* the logs appear to be out of order (influxdb_recording_post_action before the FATAL lines for the error)
* two runs on the taskcluster task - for the job https://treeherder.mozilla.org/#/jobs?repo=mozilla-inbound&selectedJob=22149496 the matching task is https://tools.taskcluster.net/task-inspector/#cLqhnZkrRImmIOuCpm6OFw/
** uploads start at 18:34 and continue until 19:21 (log time)
** tc task run0 starts 18:34 but is resolved at 18:54 pacific; state: exception, reasonResolved: claim-expired
** tc task run1 is scheduled at 18:54, never starts, but is resolved 19:35 in state: exception, reasonResolved: deadline-exceeded
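For reference, the same run information can be pulled from the queue API rather than the task-inspector UI; a minimal sketch, assuming the taskcluster Python client's Queue.status() call (field names as seen in the task JSON earlier in this bug):

 import taskcluster

 def show_runs(task_id):
     # Print each run's state and resolution reason for a task, e.g. the
     # cLqhnZkrRImmIOuCpm6OFw task linked above (the status endpoint is
     # public, so no credentials are needed just to read it).
     queue = taskcluster.Queue()
     status = queue.status(task_id)['status']
     for run in status['runs']:
         print("run %s: state=%s reasonResolved=%s" % (
             run['runId'], run['state'], run.get('reasonResolved', 'n/a')))

 show_runs('cLqhnZkrRImmIOuCpm6OFw')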
Also, in a mac opt build, uploads that used to take about 15 minutes are now much slower - 45 minutes in one log I saw. No packet loss between bm81 in scl3 and taskcluster-public-artifacts.s3-us-west-2.amazonaws.com though.
Another timeout datapoint, which may be related - the two linux builds for 45.0b9 failed to upload symbols. Normally they take about 60s to upload, and now they time out after 2460 seconds. If we're conservative and overestimate the symbol zip to be 300MB in size, then we're getting less than 120KB/s upload speed.
(In reply to Nick Thomas [:nthomas] from comment #6)
> Another timeout datapoint, which may be related - the two linux builds for
> 45.0b9 failed to upload symbols. Normally they take about 60s to upload, and
> now they timeout in 2460 seconds. If we're conservative and overestimate the
> symbol zip to be 300MB in size, then we're getting less than 120KB/s upload
> speed.

tl;dr - this is likely also SCL3 -> AWS us-west-2 traffic, like the mac uploads.

Longer version:
The uploads go to crash-stats.mozilla.com, which has three IPs. The linux builds ran in two regions of AWS (one each in us-west-2 and us-east-1), but the upload is routed back to SCL3 over the IPsec tunnel for two of the IPs (54.68.216.133, 52.35.229.240), and then out over v-1030.core1.releng.scl3.mozilla.net to the internet. The path to the third IP (54.200.18.169) doesn't go via the releng firewall but still goes over xe-1-2-2.border1.pao1.mozilla.net.
pmoore says that a task must be reclaimed every 20 minutes, which lines up nicely with comment #4. The code in question starts at
 https://dxr.mozilla.org/mozilla-central/source/testing/mozharness/mozharness/mozilla/building/buildbase.py#1485
We claim the task at line 1502, then loop over the files to upload at line 1517. Each call to create_artifact() doesn't reclaim the task, so the claim quickly expires when the upload is slow.

We are also seeing much slower uploads to the archive.m.o system, but they aren't failing out.

So we have three issues:
* the network needs to be faster; unclear whether this is SCL3 -> AWS or AWS throttling us. Engaging netops
* the tc upload code needs to be more robust in the face of slow uploads (eg giving up after 18 minutes, reclaiming the task, retrying; see the sketch below)
* routing of the symbol uploads should be AWS -> AWS direct
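To make the second point concrete, a rough sketch of what reclaiming between artifact uploads could look like; the RECLAIM_THRESHOLD value, the upload_with_reclaim helper, and its arguments are assumptions for illustration, not the fix that eventually lands (that work is bug 1250458).

 import time

 RECLAIM_THRESHOLD = 15 * 60  # seconds; reclaim well inside the ~20 minute claim window

 def upload_with_reclaim(queue, task_id, run_id, files, create_artifact):
     # Hypothetical loop: before each upload, reclaim the task if we are
     # getting close to claim expiry, so a slow transfer can't let the claim
     # lapse mid-upload the way it does in _taskcluster_upload today.
     last_claim = time.time()
     for path in files:
         if time.time() - last_claim > RECLAIM_THRESHOLD:
             queue.reclaimTask(task_id, run_id)
             last_claim = time.time()
         create_artifact(task_id, run_id, path)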
If I simulate SCL3 -> AWS traffic with a 10MB random file:

[cltbld@bld-lion-r5-057.build.releng.scl3.mozilla.com ~]$ scp -i ~/.ssh/ffxbld_rsa test ffxbld@upload.ffxbld.productdelivery.prod.mozaws.net:/tmp/nthomas.test

then 2MB is transferred quickly but the connection stalls, and eventually transfers at ~50KB/s.

I spoke to XioNoX on IRC and he pointed out that various ipsec tunnels from scl3 to us-west-2 have been flapping from roughly the time this bug was filed, 'so the ipsec tunnels between scl3 and usw2 keep bouncing, so something is clearly not right'.

XioNoX: do you have any more information on where the problem might be? eg our internal network/providers/AWS.
Flags: needinfo?(arzhel)
(In reply to Nick Thomas [:nthomas] from comment #8)
> * the tc upload code needs to be more robust in the face of slow uploads (eg
> giving up after 18 minutes, reclaiming the task, retrying)   

--> bug 1250458

> * routing of the symbol uploads should be AWS -> AWS direct  

--> bug 1250462
See Also: → 1250462, 1250458
Seems like one provider (Zayo) between us and AWS is having a small amount of packet loss (~2%) starting about midnight UTC.

I opened ticket 955996 with them and temporarily routed the USW2 endpoint through another provider.

xionox@border1.mgmt.pao1.mozilla.net# show | compare 
[edit routing-options static]
     route 63.245.208.0/20 { ... }
+    route 205.251.233.121/32 next-hop 67.92.171.50;
+    route 205.251.233.122/32 next-hop 67.92.171.50;
Flags: needinfo?(arzhel)
Unfortunately upload is still slow when repeating the scp in comment #9:
test                          100% 9766KB  51.9KB/s   03:08

Does that routing change only affect some ipsec vpn endpoints, and leave out other traffic which might end up terminating on other IPs? Could we add these ones:
 upload.ffxbld.productdelivery.prod.mozaws.net
 taskcluster-public-artifacts.s3-us-west-2.amazonaws.com
 crash-stats.mozilla.com
 upload.tbirdbld.productdelivery.prod.mozaws.net
(for traffic originating in releng, if that makes any difference)

Some of these are multiple IPs (crash-stats), or in a range controlled by Amazon (s3-us-west-2.amazonaws.com). See https://ip-ranges.amazonaws.com/ip-ranges.json if you want the complete list ;-)
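For what it's worth, a small sketch of pulling the us-west-2 S3 prefixes out of that JSON; the filtering criteria here are just one example of how the relevant ranges could be listed.

 import json
 import urllib2

 # Fetch Amazon's published ranges and keep the S3 prefixes for us-west-2,
 # which is where taskcluster-public-artifacts.s3-us-west-2.amazonaws.com lives.
 data = json.load(urllib2.urlopen('https://ip-ranges.amazonaws.com/ip-ranges.json'))
 for prefix in data['prefixes']:
     if prefix['region'] == 'us-west-2' and prefix['service'] == 'S3':
         print(prefix['ip_prefix'])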
Flags: needinfo?(arzhel)
XioNoX, is there a way to speed this up? Almost all trees are closed for this problem.
Deactivated our BGP session to Zayo in PAO1 until Zayo investigates. No more packet loss so far.
Flags: needinfo?(arzhel)
[cltbld@bld-lion-r5-057.build.releng.scl3.mozilla.com ~]$ scp -i ~/.ssh/ffxbld_rsa test ffxbld@upload.ffxbld.productdelivery.prod.mozaws.net:/tmp/xionox.test
test                                                                            100% 9766KB   4.8MB/s   00:02

Compared to comment 9, this seems a lot better.
Triggered test builds to check if they come back green.
(In reply to Arzhel Younsi [:XioNoX] from comment #14)
> Deactivated our BGP session to Zayo in PAO1 until Zayo investigates. No more
> packet loss so far.

That worked, we have green builds - trees reopen at 6:20 pacific.
Nothing to add. My changes to that file are unrelated to the upload issue.
Flags: needinfo?(armenzg)
Zayo found a bad interface between us and them. They are dispatching a technician to fix it.
Clearing needinfos since this wasn't a code change, and lowering severity since trees are reopened and the network vendor is dispatching a technician.
Severity: blocker → major
Flags: needinfo?(rail)
Flags: needinfo?(mshal)
Flags: needinfo?(jlund)
I'm sure NetOps are tracking the circuit fix separately, and this is fixed from a RelEng pov.
Status: NEW → RESOLVED
Closed: 8 years ago
Resolution: --- → FIXED
Product: Release Engineering → Infrastructure & Operations
Product: Infrastructure & Operations → Infrastructure & Operations Graveyard