Closed Bug 1166415 Opened 9 years ago Closed 9 years ago

Frequent Windows build failures due to "TaskclusterRestFailure: The given run is not running" errors

Categories

(Infrastructure & Operations :: RelOps: General, task)

Version: Unspecified
Platform: Windows
Type: task
Priority: Not set
Severity: blocker

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: RyanVM, Assigned: q)

Details

Attachments

(2 files)

These are pretty widespread at the moment. The trees have been closed since 11am PT as a result.

https://treeherder.mozilla.org/logviewer.html#?job_id=9974002&repo=mozilla-inbound

11:27:31 INFO - Running post-action listener: influxdb_recording_post_action
11:27:31 INFO - Starting new HTTP connection (1): goldiewilson-onepointtwentyone-1.c.influxdb.com
11:27:31 INFO - Running post-action listener: record_mach_stats
11:27:31 INFO - No build_resources.json found, not logging stats
11:27:31 FATAL - Uncaught exception: Traceback (most recent call last):
11:27:31 FATAL - File "c:\builds\moz2_slave\m-in-w64-000000000000000000000\scripts\mozharness\base\script.py", line 1288, in run
11:27:31 FATAL - self.run_action(action)
11:27:31 FATAL - File "c:\builds\moz2_slave\m-in-w64-000000000000000000000\scripts\mozharness\base\script.py", line 1231, in run_action
11:27:31 FATAL - self._possibly_run_method("postflight_%s" % method_name)
11:27:31 FATAL - File "c:\builds\moz2_slave\m-in-w64-000000000000000000000\scripts\mozharness\base\script.py", line 1171, in _possibly_run_method
11:27:31 FATAL - return getattr(self, method_name)()
11:27:31 FATAL - File "c:\builds\moz2_slave\m-in-w64-000000000000000000000\scripts\mozharness\mozilla\building\buildbase.py", line 1626, in postflight_build
11:27:31 FATAL - self.upload_files()
11:27:31 FATAL - File "c:\builds\moz2_slave\m-in-w64-000000000000000000000\scripts\mozharness\mozilla\building\buildbase.py", line 1422, in upload_files
11:27:31 FATAL - tc.create_artifact(task, upload_file)
11:27:31 FATAL - File "c:\builds\moz2_slave\m-in-w64-000000000000000000000\scripts\mozharness\mozilla\taskcluster_helper.py", line 103, in create_artifact
11:27:31 FATAL - "contentType": mime_type,
11:27:31 FATAL - File "c:\builds\moz2_slave\m-in-w64-000000000000000000000\build\venv\Lib\site-packages\taskcluster\client.py", line 455, in apiCall
11:27:31 FATAL - return self._makeApiCall(e, *args, **kwargs)
11:27:31 FATAL - File "c:\builds\moz2_slave\m-in-w64-000000000000000000000\build\venv\Lib\site-packages\taskcluster\client.py", line 232, in _makeApiCall
11:27:31 FATAL - return self._makeHttpRequest(entry['method'], route, payload)
11:27:31 FATAL - File "c:\builds\moz2_slave\m-in-w64-000000000000000000000\build\venv\Lib\site-packages\taskcluster\client.py", line 424, in _makeHttpRequest
11:27:31 FATAL - superExc=rerr
11:27:31 FATAL - TaskclusterRestFailure: The given run is not running
11:27:31 FATAL - Running post_fatal callback...
11:27:31 ERROR - setting return code to 2 because fatal was called
You need more of the log than is worth pasting here (more than anyone would read) to capture the failure: the build started uploading artifacts to S3 at 11:07, and by 11:27 Taskcluster had given up on the run finishing. So the problem is probably the network between SCL3 and AWS.
Q is working on a fix.
Assignee: nobody → q
Component: Buildduty → RelOps
Product: Release Engineering → Infrastructure & Operations
QA Contact: bugspam.Callek → arich
Version: unspecified → other
We've diagnosed these as the same Windows socket/send-window issues we were seeing in AWS. Thankfully Q cracked that over the weekend, and is now working on a fix to deploy via GPO for the datacenter.
Flags: needinfo?(q)
Attached file tcp_param.reg
Reg fix for the upload issue to S3.
Flags: needinfo?(q)
Pushed out the contents of the attached reg file and rebooted all reachable builders. The reg file is a scaled-down version of our AWS fix, corrected for the datacenter hardware.
In a nutshell it:
* Turns on window scaling (WSCALE)
* Sets the congestion control provider to CTCP
* Ups the winsock buffer
* Enables hardware offload
* Sets autotuning to "normal"
* Disables per-connection-type heuristics
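The attached tcp_param.reg isn't reproduced in this bug, but on Server 2008 the bullets above roughly map to a couple of registry values plus netsh TCP globals. The sketch below is illustrative only; the value names are real Windows settings, but the specific numbers (e.g. the AFD send window) are assumptions, not the deployed configuration:

```
:: Illustrative sketch only -- NOT the attached tcp_param.reg.

:: Window scaling (WSCALE): bit 0 of Tcp1323Opts enables RFC 1323 window scaling.
reg add "HKLM\SYSTEM\CurrentControlSet\Services\Tcpip\Parameters" /v Tcp1323Opts /t REG_DWORD /d 1 /f

:: Winsock buffer: raise the AFD default send window (value here is a guess).
reg add "HKLM\SYSTEM\CurrentControlSet\Services\AFD\Parameters" /v DefaultSendWindow /t REG_DWORD /d 0x40000 /f

:: Congestion control: Compound TCP.
netsh interface tcp set global congestionprovider=ctcp

:: Hardware offload (TCP chimney).
netsh interface tcp set global chimney=enabled

:: Receive-window autotuning at "normal".
netsh interface tcp set global autotuninglevel=normal

:: Disable per-connection-type heuristics.
netsh interface tcp set heuristics disabled
```

The registry values only take effect after a reboot (hence rebooting all reachable builders); `netsh interface tcp show global` can confirm the netsh-settable ones.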

After application, upload speeds to the us-west-2 and us-east-1 S3 buckets average 40 MB/s, up from 200 Kbps before the fix. This can be tuned further, but we're operational for now.
I retriggered Windows builds on various integration branches a while ago and things look like they're passing, so I reopened the trees a few minutes ago.
There was another one of these after the forced reboots: https://treeherder.mozilla.org/logviewer.html#?job_id=9981597&repo=mozilla-inbound

So far we've seen 8 greens and this one failure, which seems to be better than we were before, at least.
Comment 10 may not be interesting, since that job started at 14:32; or it may be interesting as a sign that we didn't actually successfully reboot everything.
root@B-2008-IX-0081 ~
$ net statistics server
Server Statistics for \\B-2008-IX-0081

:markco The machines that are running puppet are going to burn jobs until we patch them accordingly, so I'm disabling those for now (I searched for puppet in slavealloc).

b-2008-ix-0078
b-2008-ix-0175
b-2008-ix-0176
b-2008-ix-0177
b-2008-ix-0178
b-2008-ix-0179
b-2008-ix-0132
b-2008-ix-0133
b-2008-ix-0168
b-2008-ix-0001
Flags: needinfo?(mcornmesser)
Of course it managed not to paste the pertinent line for 0081:

Statistics since 5/19/2015 7:05:20 PM

So that should have been after the reboot.
We're in a two-system world now, and we can't keep disabling puppet-based machines when they fall behind. So every rollout to Windows needs to hit both puppet and GPO before it's considered complete.
Attached patch BUG1165314.patch (Splinter Review)
Flags: needinfo?(mcornmesser)
Attachment #8608500 - Flags: review?(dustin)
Comment on attachment 8608500 [details] [diff] [review]
BUG1165314.patch

This is the initial patch to get the performance of datacenter machines up to par. I plan to build tweaks for 2008 in AWS onto this patch, as well as future network tweaks for other Windows platforms.
Attachment #8608500 - Flags: feedback?(q)
Attachment #8608500 - Flags: feedback?(rthijssen)
Comment on attachment 8608500 [details] [diff] [review]
BUG1165314.patch

Review of attachment 8608500 [details] [diff] [review]:
-----------------------------------------------------------------

A+++++ would deploy again and again

::: modules/tweaks/manifests/windows_network_optimization.pp
@@ +2,5 @@
> +# License, v. 2.0. If a copy of the MPL was not distributed with this
> +# file, You can obtain one at http://mozilla.org/MPL/2.0/.
> +
> +class tweaks::windows_network_optimization {
> +    # For 2008 refrence Bugs 1165314 & 1166415 

:like:
Attachment #8608500 - Flags: review?(dustin) → review+
The patch has been landed. I am going to enable the machines from comments 13 and 14.
Attachment #8608500 - Flags: feedback?(rthijssen)
I believe this is live in our current test bed. Per conversation with arr, closing the bug.
Status: NEW → RESOLVED
Closed: 9 years ago
Resolution: --- → FIXED