Closed
Bug 1337132
Opened 7 years ago
Closed 7 years ago
64-bit Windows static builds are very frequently timing out since the worker change from bug 1336948 happened
Categories
(Taskcluster :: Workers, defect)
Taskcluster
Workers
Tracking
(Not tracked)
RESOLVED
FIXED
People
(Reporter: KWierso, Assigned: pmoore)
References
Details
Attachments
(2 files)
Starting this morning on pushes from around 8:40am Pacific time, Win64 static builds have been timing out like https://treeherder.mozilla.org/logviewer.html#?job_id=74880275&repo=autoland&lineNumber=28772 across multiple trunk trees.
Reporter | ||
Comment 1•7 years ago
|
||
Win64 static builds from inbound: https://treeherder.mozilla.org/#/jobs?repo=mozilla-inbound&fromchange=4f8613cea566ec266cbf666a85155a547c5476ed&bugfiler&group_state=expanded&noautoclassify&filter-searchStr=windows%20static%20x64 Autoland: https://treeherder.mozilla.org/#/jobs?repo=autoland&bugfiler&noautoclassify&group_state=expanded&fromchange=c5621cb6f9075d731ccb28dcf20dd29deae2bddd&filter-searchStr=windows%20x64%20static&selectedJob=74880275 And on mozilla-central: https://treeherder.mozilla.org/#/jobs?repo=mozilla-central&fromchange=c5621cb6f9075d731ccb28dcf20dd29deae2bddd&noautoclassify&bugfiler&group_state=expanded&selectedJob=74834912&filter-searchStr=win%2064%20static
Comment 2•7 years ago
|
||
pmoore: i have no idea why the g-w 8.0.0 deploy would affect these but not other builds, however the timings of the failed builds correspond exactly to the rollout. it likely merits investigation.
Flags: needinfo?(pmoore)
Reporter | ||
Comment 3•7 years ago
|
||
Hrm, maybe it fixed itself? More recent Win64 static builds on autoland and inbound have been coming back green. Maybe some cache needed to get repopulated after the new worker rolled out?
Reporter | ||
Comment 4•7 years ago
|
||
And later pushes had it failing again...
Assignee | ||
Comment 5•7 years ago
|
||
Hi guys, I think this is related to job group handling, which was anticipated in https://github.com/taskcluster/runlib/pull/2#pullrequestreview-17247538 I'll look into it... Pete
Flags: needinfo?(pmoore)
Assignee | ||
Comment 6•7 years ago
|
||
I'll roll back to the previous g-w version on the gecko builders, and work with Ted this afternoon to see if we can find a good solution for this in-tree.
Flags: needinfo?(ted)
Comment 7•7 years ago
|
||
closed inbound and autoland now for this issues
Assignee | ||
Comment 8•7 years ago
|
||
Comment 9•7 years ago
|
||
merged: https://github.com/mozilla-releng/OpenCloudConfig/commit/08fef17ed4c919a16788f2b155ef3ce8361365a9
Updated•7 years ago
|
Attachment #8834305 -
Flags: review?(rthijssen) → review+
Assignee | ||
Comment 10•7 years ago
|
||
Progress can be monitored here: https://tools.taskcluster.net/task-group-inspector/#/ah-JKzrXThSQwk5iBgyO2g?_k=pvjvs4 When the following AMI creation tasks have completed, the new worker type should be available. gecko-1-b-win2012 gecko-2-b-win2012 gecko-3-b-win2012 After that, it can take up to half an hour for the old workers to detect that they have been superseded, and to terminate themselves. However, during this half an hour, we could manually kill them.
Assignee | ||
Comment 11•7 years ago
|
||
This is from the task in comment 0 (under generic-worker 8.0.0): https://papertrailapp.com/systems/685891392/events?highlight=765335101042249764&focus=765335101042249764 2017/02/06 18:55:28 Making system call CreateProcessWithLogonW with args: [C0422A7060 C0424847AC C042483400 1 0 C042248360 404 C042130000 C0422A2360 C0421F6070 C0422A63C0] This shows that process creation flags used under 8.0.0 are 0x404 (see https://msdn.microsoft.com/en-us/library/windows/desktop/ms682431(v=vs.85).aspx). This means the following creation flags were set: CREATE_UNICODE_ENVIRONMENT 0x00000400 CREATE_SUSPENDED 0x00000004 Here is a recent job running since the rollback to 7.1.3: https://public-artifacts.taskcluster.net/b2TuEaD9T2Gx-98ORhL_jw/0/public/logs/live_backing.log This shows identical process creation flags: https://papertrailapp.com/systems/687604972/events?highlight=765612809483636744&focus=765612809483636744 2017/02/07 13:18:59 Making system call CreateProcessWithLogonW with args: [C0425A50A0 C04245649C C0422E7380 1 0 C0421A2720 404 C04237D000 C0425B0CF0 C0424B80E0 C0425A4440] Again the process creation flags are 0x404 - so a change to the creation process flags is *not* the cause of this bug.
Assignee | ||
Comment 12•7 years ago
|
||
This is consistent with the change in https://github.com/taskcluster/runlib/pull/2 which only affected Windows 10 and Windows Server 2012 R2 when "runTasksAsCurrentUser" is set to true in the worker config (i.e. do not create task users, use current admin user that runs the generic worker). This property is not set in gecko-{level}-b-win2012 worker types, so it was not expected that these flags should change. So the question remains, what did change?
Assignee | ||
Comment 13•7 years ago
|
||
In the 7.1.3 log I see: https://public-artifacts.taskcluster.net/b2TuEaD9T2Gx-98ORhL_jw/0/public/logs/live_backing.log 13:21:00 INFO - Calling ['C:\\mozilla-build\\msys\\bin\\bash.exe', 'Z:\\task_1486473398\\build\\src\\mach', '--log-no-times', 'build', '-v'] with output_timeout 4800 ProcessManager UNABLE to use job objects to manage child processes In the 8.0.0 log I see the same: https://public-artifacts.taskcluster.net/L41WMoCLST2tHD-P-b85QQ/0/public/logs/live_backing.log 18:56:36 INFO - Calling ['C:\\mozilla-build\\msys\\bin\\bash.exe', 'Z:\\task_1486406213\\build\\src\\mach', '--log-no-times', 'build', '-v'] with output_timeout 4800 ProcessManager UNABLE to use job objects to manage child processes So this also suggests this is not related to whether mozprocess tries to manage child processes with job objects or not. The plot thickens.
Assignee | ||
Comment 14•7 years ago
|
||
Ahhhhhhhhhhhh Looks like I introduced a change where we call SHSetKnownFolderPath at runtime to redirect the APPDATA folder (see bug 1319487): https://github.com/taskcluster/generic-worker/blob/a4a93352d9164f1aa974adc6ea982c4c2cf067b7/plat_windows.go#L113-L119 I bet that is it. It didn't exist in generic worker version 7.1.3.
Assignee | ||
Comment 15•7 years ago
|
||
https://tools.taskcluster.net/task-inspector/#b2TuEaD9T2Gx-98ORhL_jw/ (generic-worker 7.1.3) has: 13:19:03 INFO - 'APPDATA': 'C:\\Users\\task_1486473398\\AppData\\Roaming', 13:19:03 INFO - 'LOCALAPPDATA': 'C:\\Users\\task_1486473398\\AppData\\Local', 13:19:03 INFO - 'TEMP': 'C:\\Users\\task_1486473398\\AppData\\Local\\Temp', 13:19:03 INFO - 'TMP': 'C:\\Users\\task_1486473398\\AppData\\Local\\Temp', https://tools.taskcluster.net/task-inspector/L41WMoCLST2tHD-P-b85QQ/ (generic-worker 8.0.0) has: 18:55:32 INFO - 'APPDATA': 'Z:\\AppData\\Roaming', 18:55:32 INFO - 'LOCALAPPDATA': 'Z:\\AppData\\Local', 18:55:32 INFO - 'TEMP': 'C:\\Users\\task_1486406213\\AppData\\Local\\Temp', 18:55:32 INFO - 'TMP': 'C:\\Users\\task_1486406213\\AppData\\Local\\Temp',
Assignee | ||
Comment 16•7 years ago
|
||
So it looks like the SHSetKnownFolderPath calls are working, but I suspect that either: 1) they are setting to the wrong values 2) something in-tree has a hard-coded assumption that these folders are in the default location
Assignee | ||
Comment 17•7 years ago
|
||
(note: TMP/TEMP are not affected by SHSetKnownFolderPath as explained in bug 1319487)
Assignee | ||
Comment 18•7 years ago
|
||
I wonder if the problem is that we have err = RedirectAppData(loginInfo.HUser, filepath.Join(config.TasksDir, "AppData")) rather than: err = RedirectAppData(loginInfo.HUser, filepath.Join(taskContext.TaskDir, "AppData")) If we make this change, we would adjust the paths as follows: APPDATA: Z:\AppData\Roaming => Z:\task_XXXXXXX\AppData\Raoming LOCALAPPDATA: Z:\AppData\Local => Z:\task_XXXXXXX\AppData\Local One issue I see with the current setup, is that the appdata folders are getting reused between tasks. I'll make this task, and roll it out to gecko-1-b-win2012-beta worker type, then rerun task L41WMoCLST2tHD-P-b85QQ on it. If this fixes the problem, I'll also see if I can write a test to cover this. The test would probably be to write a file to %APPDATA% and %LOCALAPPDATA% locations, in two sequential tasks, and check that both writes succeed. Here we see that KF_FLAG_CREATE is set to create the folder: https://github.com/taskcluster/runlib/blob/d4f1c59aef8fb5c2312203b1c5d7316b5fd5b406/win32/extra_windows.go#L426 And the docs https://msdn.microsoft.com/en-us/library/windows/desktop/dd378447(v=vs.85).aspx state "Forces the creation of the specified folder if that folder does not already exist." - thus if that folder exists, nothing is done to apply ACL(s) to make the folder writable to the user represented by the access token passed in the SHGetKnownFolderPath call. Thus, the call will succeed, even if the folder exists but is not writable by the given user, so we wouldn't have seen a failure, hence why this issue might have gone unnoticed until now.
Assignee | ||
Comment 19•7 years ago
|
||
s/I'll make this task/I'll make this change/
Comment 20•7 years ago
|
||
+1 on comment 18. We definitely want AppData folders to be task specific. - and thanks for getting the SHSetKnownFolderPath change into g-w. That is going to fix many, many things.
Comment 21•7 years ago
|
||
Commit pushed to master at https://github.com/taskcluster/generic-worker https://github.com/taskcluster/generic-worker/commit/fac3f31e56dc1e4908e8bb619d5b6097560e24b3 Bug1337132 - Don't reuse APPDATA folder between tasks (#39)
Assignee | ||
Comment 23•7 years ago
|
||
Released 8.0.1 and testing now on gecko-3-b-win2012-beta: https://tools.taskcluster.net/task-inspector/#OXmmU79URq6ytz7Bj9facg/ https://tools.taskcluster.net/task-inspector/#M7GagixTQZmVWVoVP4HUNQ/ https://tools.taskcluster.net/task-inspector/#QoNOeHHqRqGJOU9G5fKUkA/ https://tools.taskcluster.net/task-inspector/#JpGnuSeQTbGx0TFaWFNqPA/ https://tools.taskcluster.net/task-inspector/#BtK4Ip4jQlCMbjxSw929tw/ These 5 tasks should all run on the same worker, since maxCapacity is 1 on the worker type. If all five pass, there is a good chance the underlying problem has been fixed. It will take some time to run all of these tasks serially, so I'll check back in later today to see how they are getting on. These five new tasks are copies of the former failed task running under 8.0.0: https://tools.taskcluster.net/task-inspector/#L41WMoCLST2tHD-P-b85QQ/ This task was first mentioned in comment 15, unfortunately the '#' got stripped from the url - this is the corrected url.
Assignee | ||
Comment 24•7 years ago
|
||
Since these gecko-3-b-win2012-beta tasks are dragging their feet a bit (i.e. the worker is still initialising after an hour), I'll recreate a failed try job against gecko-1-b-win2012-beta... Original try push failure: https://tools.taskcluster.net/task-inspector/#VDrTYmaqQ1egk_hfa876Mg/0 Reruns: https://tools.taskcluster.net/task-inspector/#AUPWA0mGTBy5aQNnzpKm8g/ https://tools.taskcluster.net/task-inspector/#cCM-J-daRC6n0m_gmFt81w/ https://tools.taskcluster.net/task-inspector/#KRdltGXUTy2KGmItmCY_-g/ https://tools.taskcluster.net/task-inspector/#H5ZvaWy5TYyiIjrJxyrzEg/ https://tools.taskcluster.net/task-inspector/#IqUbbkLlSeOfghR7kaPpvQ/ So the difference between these tasks, and the tasks listed in comment 23, is that these are for gecko-1-b-win2012-beta whereas the tasks in comment 23 are for gecko-3-b-win2012-beta.
Assignee | ||
Comment 25•7 years ago
|
||
I've just spotted that the deadline was set to 1 hour after creation, in all the tasks in comment 24, so I'm recreating (again!) to set the deadline in a day from now... https://tools.taskcluster.net/task-inspector/#aLEn4Pb1RSGyBdqQlR_Uag/ https://tools.taskcluster.net/task-inspector/#QmJVZKK7TGK8azMky6SCAQ/ https://tools.taskcluster.net/task-inspector/#ViCRAYObR7CrQTl2PFXqLw/ https://tools.taskcluster.net/task-inspector/#SKvgrHagSl2pamFrTcEYtA/ https://tools.taskcluster.net/task-inspector/#EgTtTwkEQWWyJBNM2G9ejg/
Assignee | ||
Comment 26•7 years ago
|
||
(In reply to Pete Moore [:pmoore][:pete] from comment #25) > I've just spotted that the deadline was set to 1 hour after creation, in all > the tasks in comment 24, so I'm recreating (again!) to set the deadline in a > day from now... > > https://tools.taskcluster.net/task-inspector/#aLEn4Pb1RSGyBdqQlR_Uag/ > https://tools.taskcluster.net/task-inspector/#QmJVZKK7TGK8azMky6SCAQ/ > https://tools.taskcluster.net/task-inspector/#ViCRAYObR7CrQTl2PFXqLw/ > https://tools.taskcluster.net/task-inspector/#SKvgrHagSl2pamFrTcEYtA/ > https://tools.taskcluster.net/task-inspector/#EgTtTwkEQWWyJBNM2G9ejg/ All successful, no failures, and all ran on same worker, one after another, so let's upgrade!
Assignee | ||
Comment 27•7 years ago
|
||
Given the successful beta testing in comment 25, this should upgrade everything to the fixed version, where static analysis jobs have been shown to consistently pass.
Attachment #8835592 -
Flags: review?(rthijssen)
Comment 28•7 years ago
|
||
Comment on attachment 8835592 [details] [review] OpenCloudConfig - Upgrade to generic-worker v8.0.1 https://github.com/mozilla-releng/OpenCloudConfig/commit/2b4c9a753bc5d1f2d94bee736a81bb9cca2b9424
Attachment #8835592 -
Flags: review?(rthijssen) → review+
Assignee | ||
Comment 29•7 years ago
|
||
I've also re-enabled papertrail alerts for go crashes (to be sent to team alerting email address).
Status: ASSIGNED → RESOLVED
Closed: 7 years ago
Resolution: --- → FIXED
Comment hidden (Intermittent Failures Robot) |
Assignee | ||
Comment 31•6 years ago
|
||
Released in https://github.com/taskcluster/generic-worker/releases/tag/v8.0.1
QA Contact: pmoore
Updated•5 years ago
|
Component: Generic-Worker → Workers
You need to log in
before you can comment on or make changes to this bug.
Description
•