Frequent Windows build failures due to "TaskclusterRestFailure: The given run is not running" errors

RESOLVED FIXED

Status

Infrastructure & Operations
RelOps
--
blocker
RESOLVED FIXED
3 years ago
3 years ago

People

(Reporter: RyanVM, Assigned: Q)

Tracking

Details

Attachments

(2 attachments)

(Reporter)

Description

3 years ago
These are pretty widespread at the moment. The trees have been closed since 11am PT as a result.

https://treeherder.mozilla.org/logviewer.html#?job_id=9974002&repo=mozilla-inbound

11:27:31 INFO - Running post-action listener: influxdb_recording_post_action
11:27:31 INFO - Starting new HTTP connection (1): goldiewilson-onepointtwentyone-1.c.influxdb.com
11:27:31 INFO - Running post-action listener: record_mach_stats
11:27:31 INFO - No build_resources.json found, not logging stats
11:27:31 FATAL - Uncaught exception: Traceback (most recent call last):
11:27:31 FATAL - File "c:\builds\moz2_slave\m-in-w64-000000000000000000000\scripts\mozharness\base\script.py", line 1288, in run
11:27:31 FATAL - self.run_action(action)
11:27:31 FATAL - File "c:\builds\moz2_slave\m-in-w64-000000000000000000000\scripts\mozharness\base\script.py", line 1231, in run_action
11:27:31 FATAL - self._possibly_run_method("postflight_%s" % method_name)
11:27:31 FATAL - File "c:\builds\moz2_slave\m-in-w64-000000000000000000000\scripts\mozharness\base\script.py", line 1171, in _possibly_run_method
11:27:31 FATAL - return getattr(self, method_name)()
11:27:31 FATAL - File "c:\builds\moz2_slave\m-in-w64-000000000000000000000\scripts\mozharness\mozilla\building\buildbase.py", line 1626, in postflight_build
11:27:31 FATAL - self.upload_files()
11:27:31 FATAL - File "c:\builds\moz2_slave\m-in-w64-000000000000000000000\scripts\mozharness\mozilla\building\buildbase.py", line 1422, in upload_files
11:27:31 FATAL - tc.create_artifact(task, upload_file)
11:27:31 FATAL - File "c:\builds\moz2_slave\m-in-w64-000000000000000000000\scripts\mozharness\mozilla\taskcluster_helper.py", line 103, in create_artifact
11:27:31 FATAL - "contentType": mime_type,
11:27:31 FATAL - File "c:\builds\moz2_slave\m-in-w64-000000000000000000000\build\venv\Lib\site-packages\taskcluster\client.py", line 455, in apiCall
11:27:31 FATAL - return self._makeApiCall(e, *args, **kwargs)
11:27:31 FATAL - File "c:\builds\moz2_slave\m-in-w64-000000000000000000000\build\venv\Lib\site-packages\taskcluster\client.py", line 232, in _makeApiCall
11:27:31 FATAL - return self._makeHttpRequest(entry['method'], route, payload)
11:27:31 FATAL - File "c:\builds\moz2_slave\m-in-w64-000000000000000000000\build\venv\Lib\site-packages\taskcluster\client.py", line 424, in _makeHttpRequest
11:27:31 FATAL - superExc=rerr
11:27:31 FATAL - TaskclusterRestFailure: The given run is not running
11:27:31 FATAL - Running post_fatal callback...
11:27:31 ERROR - setting return code to 2 because fatal was called
You need a little more copy-paste (well, really a lot more, more than anyone would read) to capture the failure: the job started uploading artifacts to S3 at 11:07, and by 11:27 Taskcluster had given up on it managing to finish, so the problem is probably the network between SCL3 and AWS.
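The failure mode above (an upload outliving the window in which the run is considered alive) can be sketched with a toy model. This is NOT the real taskcluster client API, and the 20-minute claim window is an assumption for illustration; the point is only that a slow-enough upload means the queue no longer considers the run running by the time `create_artifact` is called:

```python
class RunNotRunning(Exception):
    """Stands in for 'TaskclusterRestFailure: The given run is not running'."""

class FakeQueue:
    """Toy model (not the real taskcluster queue): a claimed run expires
    after a fixed claim window unless the worker finishes in time."""
    CLAIM_MINUTES = 20  # assumed claim window, illustrative only

    def __init__(self):
        self.now = 0                      # simulated clock, in minutes
        self.taken_until = self.CLAIM_MINUTES

    def create_artifact(self, name):
        if self.now > self.taken_until:
            raise RunNotRunning("The given run is not running")
        return name

def upload(minutes_taken, name="firefox.zip"):
    """Simulate one artifact upload that takes `minutes_taken` minutes."""
    queue = FakeQueue()
    queue.now += minutes_taken            # time spent pushing bytes to S3
    return queue.create_artifact(name)

# Pre-fix (~200 kbps): the upload takes longer than the claim window.
try:
    upload(minutes_taken=25)
except RunNotRunning as e:
    print("FATAL -", e)                   # FATAL - The given run is not running

# Post-fix (~40 MB/s): same artifact, well inside the window.
print(upload(minutes_taken=2))            # firefox.zip
```

This is why fixing upload throughput (rather than the Taskcluster code path) resolved the error.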
Q is working on a fix.
Assignee: nobody → q
Component: Buildduty → RelOps
Product: Release Engineering → Infrastructure & Operations
QA Contact: bugspam.Callek → arich
Version: unspecified → other
We've diagnosed these as the same Windows socket/send-window issues we were seeing in AWS. Thankfully Q cracked that over the weekend, and is now working on a fix to deploy via GPO for the datacenter.
Flags: needinfo?(q)
(Assignee)

Comment 4

3 years ago
Created attachment 8607816 [details]
tcp_param.reg

reg fix for upload issue to s3
Flags: needinfo?(q)
(Assignee)

Comment 5

3 years ago
Pushed out the contents of the attached reg file and rebooted all reachable builders. The reg file is a scaled-down version of our AWS fix, corrected for the datacenter hardware.
In a nutshell it:
* Turns on WSCALE 
* Sets the congestion control to ctcp
* Ups the winsock buffer
* Enables hardware offload 
* Sets autotune scaling to "normal" 
* Disables per connection type heuristics
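For reference, a registry fragment covering the window-scaling and winsock-buffer tweaks might look like the sketch below. The key names are real Windows TCP/AFD parameters, but the values are illustrative and not necessarily what the attached tcp_param.reg contains; on Server 2008 the CTCP, autotuning, offload, and heuristics settings are typically applied via `netsh int tcp set global ...` rather than direct registry edits.

```
Windows Registry Editor Version 5.00

; Illustrative values only; not the contents of the actual attachment.
; Enable TCP window scaling (RFC 1323 options).
[HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services\Tcpip\Parameters]
"Tcp1323Opts"=dword:00000001

; Raise the default winsock send/receive buffers (0x40000 = 256 KB).
[HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services\AFD\Parameters]
"DefaultSendWindow"=dword:00040000
"DefaultReceiveWindow"=dword:00040000
```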

After application, upload speeds to the usw2 and use1 S3 buckets average 40 MB/s, up from 200 kbps before the fix. This can be tuned even further; however, we will be operational for now.
I retriggered Windows builds on various integration branches a while ago and it looks like they're passing, so I reopened the trees a few minutes ago.
There was another one of these after the forced reboots: https://treeherder.mozilla.org/logviewer.html#?job_id=9981597&repo=mozilla-inbound

So far we've seen 8 greens and this one failure, which seems to be better than we were before, at least.
Comment hidden (Treeherder Robot)
Comment hidden (Treeherder Robot)
Comment hidden (Treeherder Robot)
Comment 10 may not be interesting, since it started at 14:32; or it may be interesting as a sign that we didn't actually manage to reboot everything.
Comment hidden (Treeherder Robot)
root@B-2008-IX-0081 ~
$ net statistics server
Server Statistics for \\B-2008-IX-0081

:markco The machines that are running puppet are going to burn jobs until we patch them accordingly, so I'm disabling those for now (I searched for puppet in slavealloc).

b-2008-ix-0078
b-2008-ix-0175
b-2008-ix-0176
b-2008-ix-0177
b-2008-ix-0178
b-2008-ix-0179
b-2008-ix-0132
b-2008-ix-0133
b-2008-ix-0168
b-2008-ix-0001
Flags: needinfo?(mcornmesser)
Of course it managed not to paste the pertinent line for 0081:

Statistics since 5/19/2015 7:05:20 PM

So that should have been after the reboot.
We're in a two-system world now, and we can't keep disabling puppet-based machines when they fall behind. So every rollout to Windows needs to cover both puppet and GPO before it's considered complete.
Created attachment 8608500 [details] [diff] [review]
BUG1165314.patch
Flags: needinfo?(mcornmesser)
Attachment #8608500 - Flags: review?(dustin)
Comment on attachment 8608500 [details] [diff] [review]
BUG1165314.patch

This is the initial patch to get the performance of datacenter machines up to par. I am planning to build the 2008-in-AWS tweaks onto this patch, as well as future network tweaks for other Windows platforms.
Attachment #8608500 - Flags: feedback?(q)
Attachment #8608500 - Flags: feedback?(rthijssen)
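A hypothetical sketch of what modules/tweaks/manifests/windows_network_optimization.pp might contain, assuming the puppetlabs-registry module; the real patch is attachment 8608500 and its values may differ:

```puppet
# Hypothetical sketch only; see attachment 8608500 for the actual patch.
class tweaks::windows_network_optimization {
    # For 2008, reference bugs 1165314 & 1166415.
    registry::value { 'Tcp1323Opts':
        key  => 'HKLM\SYSTEM\CurrentControlSet\Services\Tcpip\Parameters',
        type => 'dword',
        data => '1',       # enable TCP window scaling (illustrative value)
    }
    registry::value { 'DefaultSendWindow':
        key  => 'HKLM\SYSTEM\CurrentControlSet\Services\AFD\Parameters',
        type => 'dword',
        data => '262144',  # larger winsock send buffer (illustrative value)
    }
}
```

Managing the same keys in both puppet and GPO keeps the two deployment systems in sync, per comment 15.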
Comment on attachment 8608500 [details] [diff] [review]
BUG1165314.patch

Review of attachment 8608500 [details] [diff] [review]:
-----------------------------------------------------------------

A+++++ would deploy again and again

::: modules/tweaks/manifests/windows_network_optimization.pp
@@ +2,5 @@
> +# License, v. 2.0. If a copy of the MPL was not distributed with this
> +# file, You can obtain one at http://mozilla.org/MPL/2.0/.
> +
> +class tweaks::windows_network_optimization {
> +    # For 2008 refrence Bugs 1165314 & 1166415 

:like:
Attachment #8608500 - Flags: review?(dustin) → review+
Comment on attachment 8608500 [details] [diff] [review]
BUG1165314.patch

https://hg.mozilla.org/build/puppet/rev/be855005b0ae
Attachment #8608500 - Flags: checked-in+
The patch has been landed. I am going to enable the machines from comments 13 and 14.
Attachment #8608500 - Flags: feedback?(rthijssen)
Attachment #8608500 - Flags: feedback?(q)
(Assignee)

Comment 21

3 years ago
I believe this is live in our current test bed. Per conversation with arr, closing this bug.
Status: NEW → RESOLVED
Last Resolved: 3 years ago
Resolution: --- → FIXED