Closed
Bug 1166415
Opened 9 years ago
Closed 9 years ago
Frequent Windows build failures due to "TaskclusterRestFailure: The given run is not running" errors
Categories
(Infrastructure & Operations :: RelOps: General, task)
Tracking
(Not tracked)
RESOLVED
FIXED
People
(Reporter: RyanVM, Assigned: q)
Details
Attachments
(2 files)
2.18 KB, text/x-ms-regedit — Details
4.74 KB, patch — dustin: review+, markco: checked-in+ — Details | Diff | Splinter Review
These are pretty widespread at the moment. The trees have been closed since 11am PT as a result.
https://treeherder.mozilla.org/logviewer.html#?job_id=9974002&repo=mozilla-inbound
11:27:31 INFO - Running post-action listener: influxdb_recording_post_action
11:27:31 INFO - Starting new HTTP connection (1): goldiewilson-onepointtwentyone-1.c.influxdb.com
11:27:31 INFO - Running post-action listener: record_mach_stats
11:27:31 INFO - No build_resources.json found, not logging stats
11:27:31 FATAL - Uncaught exception: Traceback (most recent call last):
11:27:31 FATAL - File "c:\builds\moz2_slave\m-in-w64-000000000000000000000\scripts\mozharness\base\script.py", line 1288, in run
11:27:31 FATAL - self.run_action(action)
11:27:31 FATAL - File "c:\builds\moz2_slave\m-in-w64-000000000000000000000\scripts\mozharness\base\script.py", line 1231, in run_action
11:27:31 FATAL - self._possibly_run_method("postflight_%s" % method_name)
11:27:31 FATAL - File "c:\builds\moz2_slave\m-in-w64-000000000000000000000\scripts\mozharness\base\script.py", line 1171, in _possibly_run_method
11:27:31 FATAL - return getattr(self, method_name)()
11:27:31 FATAL - File "c:\builds\moz2_slave\m-in-w64-000000000000000000000\scripts\mozharness\mozilla\building\buildbase.py", line 1626, in postflight_build
11:27:31 FATAL - self.upload_files()
11:27:31 FATAL - File "c:\builds\moz2_slave\m-in-w64-000000000000000000000\scripts\mozharness\mozilla\building\buildbase.py", line 1422, in upload_files
11:27:31 FATAL - tc.create_artifact(task, upload_file)
11:27:31 FATAL - File "c:\builds\moz2_slave\m-in-w64-000000000000000000000\scripts\mozharness\mozilla\taskcluster_helper.py", line 103, in create_artifact
11:27:31 FATAL - "contentType": mime_type,
11:27:31 FATAL - File "c:\builds\moz2_slave\m-in-w64-000000000000000000000\build\venv\Lib\site-packages\taskcluster\client.py", line 455, in apiCall
11:27:31 FATAL - return self._makeApiCall(e, *args, **kwargs)
11:27:31 FATAL - File "c:\builds\moz2_slave\m-in-w64-000000000000000000000\build\venv\Lib\site-packages\taskcluster\client.py", line 232, in _makeApiCall
11:27:31 FATAL - return self._makeHttpRequest(entry['method'], route, payload)
11:27:31 FATAL - File "c:\builds\moz2_slave\m-in-w64-000000000000000000000\build\venv\Lib\site-packages\taskcluster\client.py", line 424, in _makeHttpRequest
11:27:31 FATAL - superExc=rerr
11:27:31 FATAL - TaskclusterRestFailure: The given run is not running
11:27:31 FATAL - Running post_fatal callback...
11:27:31 ERROR - setting return code to 2 because fatal was called
Comment 1•9 years ago
You need a little more copy-paste (well, really a lot more, more than you would paste or anyone would read) to capture the failure, which is that it started uploading artifacts to S3 at 11:07, and by 11:27 taskcluster had given up on it managing to finish, so the problem is probably network between SCL3 and AWS.
Comment 2•9 years ago
Q is working on a fix.
Updated•9 years ago
Assignee: nobody → q
Updated•9 years ago
Component: Buildduty → RelOps
Product: Release Engineering → Infrastructure & Operations
QA Contact: bugspam.Callek → arich
Version: unspecified → other
Comment 3•9 years ago
We've diagnosed these as the same windows socket/send window issues we were seeing in AWS. Thankfully Q cracked that over the weekend, and is working on a fix to deploy via GPO for the datacenter now.
Updated•9 years ago
Flags: needinfo?(q)
Pushed out the contents of the attached reg file and rebooted all reachable builders. The reg file is a scaled-down version of our AWS fix, corrected for this hardware. In a nutshell it:
* Turns on WSCALE (TCP window scaling)
* Sets the congestion control provider to CTCP
* Ups the winsock buffer
* Enables hardware offload
* Sets the autotuning level to "normal"
* Disables per-connection-type heuristics
After application, upload speeds to the usw2 and use1 S3 buckets average 40 MB/s, up from 200 kbps before the fix. This can be tuned further, but we are operational for now.
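The attached reg file itself isn't quoted in the bug, so the following is only an illustrative sketch of the settings listed above, expressed as the standard Server 2008 registry/netsh knobs; the buffer values here are assumptions, not the ones actually deployed:

```bat
rem Illustrative sketch only -- not the attached reg file.

rem Enable TCP window scaling (Tcp1323Opts bit 0 = WSCALE)
reg add HKLM\SYSTEM\CurrentControlSet\Services\Tcpip\Parameters ^
    /v Tcp1323Opts /t REG_DWORD /d 1 /f

rem Raise the default winsock send/receive buffers (values illustrative)
reg add HKLM\SYSTEM\CurrentControlSet\Services\AFD\Parameters ^
    /v DefaultSendWindow /t REG_DWORD /d 2097152 /f
reg add HKLM\SYSTEM\CurrentControlSet\Services\AFD\Parameters ^
    /v DefaultReceiveWindow /t REG_DWORD /d 2097152 /f

rem Congestion control, hardware offload, autotuning, heuristics
netsh int tcp set global congestionprovider=ctcp
netsh int tcp set global chimney=enabled
netsh int tcp set global autotuninglevel=normal
netsh int tcp set heuristics disabled
```

Note that the netsh globals persist across reboots on their own, which is why a reg file plus GPO is a convenient way to push the whole set to many builders at once.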
I retriggered Windows builds on various integration branches a while ago and things look like they're passing, so I reopened the trees a few minutes ago.
Comment 7•9 years ago
There was another one of these after the forced reboots: https://treeherder.mozilla.org/logviewer.html#?job_id=9981597&repo=mozilla-inbound
So far we've seen 8 greens and this one failure, which seems to be better than where we were before, at least.
Comment hidden (Legacy TBPL/Treeherder Robot)
Comment hidden (Legacy TBPL/Treeherder Robot)
Comment hidden (Legacy TBPL/Treeherder Robot)
Comment 11•9 years ago
Comment 10 may not be interesting, since it started at 14:32; or it may be interesting as a sign that we didn't actually successfully reboot everything.
Comment hidden (Legacy TBPL/Treeherder Robot)
Comment 13•9 years ago
root@B-2008-IX-0081 ~ $ net statistics server
Server Statistics for \\B-2008-IX-0081

:markco The machines which are running puppet are going to burn jobs until we patch them accordingly, so disabling those for now (I searched for puppet in slavealloc):
b-2008-ix-0078
b-2008-ix-0175
b-2008-ix-0176
b-2008-ix-0177
b-2008-ix-0178
b-2008-ix-0179
b-2008-ix-0132
b-2008-ix-0133
b-2008-ix-0168
b-2008-ix-0001
Flags: needinfo?(mcornmesser)
Comment 14•9 years ago
Of course it managed not to paste the pertinent line for 0081: "Statistics since 5/19/2015 7:05:20 PM". So that should have been after the reboot.
Comment 15•9 years ago
We're in a two-system world now, and we can't keep disabling puppet-based machines when they fall behind. So every rollout to Windows needs to be both puppet and GPO before it's considered complete.
Comment 16•9 years ago
Flags: needinfo?(mcornmesser)
Attachment #8608500 - Flags: review?(dustin)
Comment 17•9 years ago
Comment on attachment 8608500 [details] [diff] [review]
BUG1165314.patch

This is the initial patch to get the performance of datacenter machines up to par. I am planning on building tweaks for 2008 in AWS onto this patch, as well as future network tweaks for other Windows platforms.
Attachment #8608500 - Flags: feedback?(q)
Updated•9 years ago
Attachment #8608500 - Flags: feedback?(rthijssen)
Comment 18•9 years ago
Comment on attachment 8608500 [details] [diff] [review]
BUG1165314.patch

Review of attachment 8608500 [details] [diff] [review]:
-----------------------------------------------------------------

A+++++ would deploy again and again

::: modules/tweaks/manifests/windows_network_optimization.pp
@@ +2,5 @@
> +# License, v. 2.0. If a copy of the MPL was not distributed with this
> +# file, You can obtain one at http://mozilla.org/MPL/2.0/.
> +
> +class tweaks::windows_network_optimization {
> +  # For 2008 refrence Bugs 1165314 & 1166415

:like:
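Only the header of the manifest is quoted in the review above; the following is a hypothetical sketch of how such a class might apply one of these settings with puppet's registry handling (the class name and file path come from the review, everything else is an assumption):

```puppet
# Hypothetical sketch of modules/tweaks/manifests/windows_network_optimization.pp;
# resource bodies are illustrative, not the landed patch.
class tweaks::windows_network_optimization {
    # Enable TCP window scaling (WSCALE) via the registry
    registry::value { 'Tcp1323Opts':
        key  => 'HKLM\SYSTEM\CurrentControlSet\Services\Tcpip\Parameters',
        type => dword,
        data => '1',
    }

    # Settings with no simple registry equivalent can shell out to netsh
    exec { 'tcp_congestion_ctcp':
        command => 'netsh int tcp set global congestionprovider=ctcp',
        path    => 'C:\Windows\System32',
    }
}
```

Having the same settings in both a GPO-pushed reg file and this manifest is what satisfies the "both puppet and GPO" rollout rule from comment 15.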
Attachment #8608500 - Flags: review?(dustin) → review+
Comment 19•9 years ago
Comment on attachment 8608500 [details] [diff] [review]
BUG1165314.patch

https://hg.mozilla.org/build/puppet/rev/be855005b0ae
Attachment #8608500 - Flags: checked-in+
Comment 20•9 years ago
The patch has landed. I am going to re-enable the machines from comments 13 and 14.
Updated•9 years ago
Attachment #8608500 - Flags: feedback?(rthijssen)
Updated•9 years ago
Attachment #8608500 - Flags: feedback?(q)
Assignee
Comment 21•9 years ago
I believe this is live in our current test bed. Per conversation with arr, closing bug.
Status: NEW → RESOLVED
Closed: 9 years ago
Resolution: --- → FIXED