Closed Bug 1166415 Opened 9 years ago Closed 9 years ago

Frequent Windows build failures due to "TaskclusterRestFailure: The given run is not running" errors

Categories

(Infrastructure & Operations :: RelOps: General, task)

Version: Unspecified
Platform: Windows
Type: task
Priority: Not set
Severity: blocker

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: RyanVM, Assigned: q)

Details

Attachments

(2 files)

These are pretty widespread at the moment. The trees have been closed since 11am PT as a result.

https://treeherder.mozilla.org/logviewer.html#?job_id=9974002&repo=mozilla-inbound

11:27:31 INFO - Running post-action listener: influxdb_recording_post_action
11:27:31 INFO - Starting new HTTP connection (1): goldiewilson-onepointtwentyone-1.c.influxdb.com
11:27:31 INFO - Running post-action listener: record_mach_stats
11:27:31 INFO - No build_resources.json found, not logging stats
11:27:31 FATAL - Uncaught exception: Traceback (most recent call last):
11:27:31 FATAL - File "c:\builds\moz2_slave\m-in-w64-000000000000000000000\scripts\mozharness\base\script.py", line 1288, in run
11:27:31 FATAL - self.run_action(action)
11:27:31 FATAL - File "c:\builds\moz2_slave\m-in-w64-000000000000000000000\scripts\mozharness\base\script.py", line 1231, in run_action
11:27:31 FATAL - self._possibly_run_method("postflight_%s" % method_name)
11:27:31 FATAL - File "c:\builds\moz2_slave\m-in-w64-000000000000000000000\scripts\mozharness\base\script.py", line 1171, in _possibly_run_method
11:27:31 FATAL - return getattr(self, method_name)()
11:27:31 FATAL - File "c:\builds\moz2_slave\m-in-w64-000000000000000000000\scripts\mozharness\mozilla\building\buildbase.py", line 1626, in postflight_build
11:27:31 FATAL - self.upload_files()
11:27:31 FATAL - File "c:\builds\moz2_slave\m-in-w64-000000000000000000000\scripts\mozharness\mozilla\building\buildbase.py", line 1422, in upload_files
11:27:31 FATAL - tc.create_artifact(task, upload_file)
11:27:31 FATAL - File "c:\builds\moz2_slave\m-in-w64-000000000000000000000\scripts\mozharness\mozilla\taskcluster_helper.py", line 103, in create_artifact
11:27:31 FATAL - "contentType": mime_type,
11:27:31 FATAL - File "c:\builds\moz2_slave\m-in-w64-000000000000000000000\build\venv\Lib\site-packages\taskcluster\client.py", line 455, in apiCall
11:27:31 FATAL - return self._makeApiCall(e, *args, **kwargs)
11:27:31 FATAL - File "c:\builds\moz2_slave\m-in-w64-000000000000000000000\build\venv\Lib\site-packages\taskcluster\client.py", line 232, in _makeApiCall
11:27:31 FATAL - return self._makeHttpRequest(entry['method'], route, payload)
11:27:31 FATAL - File "c:\builds\moz2_slave\m-in-w64-000000000000000000000\build\venv\Lib\site-packages\taskcluster\client.py", line 424, in _makeHttpRequest
11:27:31 FATAL - superExc=rerr
11:27:31 FATAL - TaskclusterRestFailure: The given run is not running
11:27:31 FATAL - Running post_fatal callback...
11:27:31 ERROR - setting return code to 2 because fatal was called
You need more of the log than is worth pasting here (more than anyone would read) to capture the failure: the build started uploading artifacts to S3 at 11:07, and by 11:27 Taskcluster had given up on the run finishing. So the problem is probably the network between SCL3 and AWS.
Q is working on a fix.
Assignee: nobody → q
Component: Buildduty → RelOps
Product: Release Engineering → Infrastructure & Operations
QA Contact: bugspam.Callek → arich
Version: unspecified → other
We've diagnosed these as the same Windows socket/send-window issues we were seeing in AWS. Thankfully Q cracked that over the weekend, and is now working on a fix to deploy via GPO for the datacenter.
Flags: needinfo?(q)
Attached file tcp_param.reg
Reg fix for the upload issue to S3.
Flags: needinfo?(q)
Pushed out the contents of the attached reg file and rebooted all reachable builders. The reg file is a scaled-down version of our AWS fix, corrected for the datacenter hardware.
In a nutshell it:
* Turns on window scaling (WSCALE)
* Sets the congestion control provider to CTCP
* Ups the winsock buffer
* Enables hardware offload
* Sets autotuning to "normal"
* Disables per-connection-type heuristics
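The attached tcp_param.reg isn't reproduced in this bug, but on Server 2008 the bullets above roughly map to a couple of registry values plus netsh TCP globals. The sketch below is illustrative only; the value names are real Windows settings, but the specific numbers (e.g. the AFD send window) are assumptions, not the deployed configuration:

```
:: Illustrative sketch only -- NOT the attached tcp_param.reg.

:: Window scaling (WSCALE): bit 0 of Tcp1323Opts enables RFC 1323 window scaling.
reg add "HKLM\SYSTEM\CurrentControlSet\Services\Tcpip\Parameters" /v Tcp1323Opts /t REG_DWORD /d 1 /f

:: Winsock buffer: raise the AFD default send window (value here is a guess).
reg add "HKLM\SYSTEM\CurrentControlSet\Services\AFD\Parameters" /v DefaultSendWindow /t REG_DWORD /d 0x40000 /f

:: Congestion control: Compound TCP.
netsh interface tcp set global congestionprovider=ctcp

:: Hardware offload (TCP chimney).
netsh interface tcp set global chimney=enabled

:: Receive-window autotuning at "normal".
netsh interface tcp set global autotuninglevel=normal

:: Disable per-connection-type heuristics.
netsh interface tcp set heuristics disabled
```

The registry values only take effect after a reboot (hence rebooting all reachable builders); `netsh interface tcp show global` can confirm the netsh-settable ones.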

After application, upload speeds to the us-west-2 and us-east-1 S3 buckets average 40 MB/s, up from 200 Kbps before the fix. This can be tuned further, but we're operational for now.
I retriggered Windows builds on various integration branches a while ago and things look like they're passing, so I reopened the trees a few minutes ago.
There was another one of these after the forced reboots: https://treeherder.mozilla.org/logviewer.html#?job_id=9981597&repo=mozilla-inbound

So far we've seen 8 greens and this one failure, which seems to be better than we were before, at least.
Comment 10 may not be interesting, since that job started at 14:32; or it may be interesting as a sign that we didn't actually successfully reboot everything.
root@B-2008-IX-0081 ~
$ net statistics server
Server Statistics for \\B-2008-IX-0081

:markco The machines that are running puppet are going to burn jobs until we patch them accordingly, so I'm disabling those for now (I searched for puppet in slavealloc).

b-2008-ix-0078
b-2008-ix-0175
b-2008-ix-0176
b-2008-ix-0177
b-2008-ix-0178
b-2008-ix-0179
b-2008-ix-0132
b-2008-ix-0133
b-2008-ix-0168
b-2008-ix-0001
Flags: needinfo?(mcornmesser)
Of course it managed not to paste the pertinent line for 0081:

Statistics since 5/19/2015 7:05:20 PM

So that should have been after the reboot.
We're in a two-system world now, and we can't keep disabling puppet-based machines when they fall behind. So every rollout to Windows needs to hit both puppet and GPO before it's considered complete.
Attached patch BUG1165314.patch (Splinter Review)
Flags: needinfo?(mcornmesser)
Attachment #8608500 - Flags: review?(dustin)
Comment on attachment 8608500 [details] [diff] [review]
BUG1165314.patch

This is the initial patch to get the performance of datacenter machines up to par. I plan to build tweaks for 2008 in AWS onto this patch, as well as future network tweaks for other Windows platforms.
Attachment #8608500 - Flags: feedback?(q)
Attachment #8608500 - Flags: feedback?(rthijssen)
Comment on attachment 8608500 [details] [diff] [review]
BUG1165314.patch

Review of attachment 8608500 [details] [diff] [review]:
-----------------------------------------------------------------

A+++++ would deploy again and again

::: modules/tweaks/manifests/windows_network_optimization.pp
@@ +2,5 @@
> +# License, v. 2.0. If a copy of the MPL was not distributed with this
> +# file, You can obtain one at http://mozilla.org/MPL/2.0/.
> +
> +class tweaks::windows_network_optimization {
> +    # For 2008 refrence Bugs 1165314 & 1166415 

:like:
Attachment #8608500 - Flags: review?(dustin) → review+
The patch has been landed. I am going to enable the machines from comments 13 and 14.
Attachment #8608500 - Flags: feedback?(rthijssen)
I believe this is live in our current test bed. Per conversation with arr, closing the bug.
Status: NEW → RESOLVED
Closed: 9 years ago
Resolution: --- → FIXED