Closed Bug 1337132 Opened 7 years ago Closed 7 years ago

64-bit Windows static builds are very frequently timing out since the worker change from bug 1336948 happened

Categories

(Taskcluster :: Workers, defect)

defect
Not set
critical

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: KWierso, Assigned: pmoore)

References

Details

Attachments

(2 files)

Starting this morning on pushes from around 8:40am Pacific time, Win64 static builds have been timing out like https://treeherder.mozilla.org/logviewer.html#?job_id=74880275&repo=autoland&lineNumber=28772 across multiple trunk trees.
pmoore: i have no idea why the g-w 8.0.0 deploy would affect these but not other builds, however the timings of the failed builds correspond exactly to the rollout. it likely merits investigation.
Flags: needinfo?(pmoore)
Hrm, maybe it fixed itself? More recent Win64 static builds on autoland and inbound have been coming back green. Maybe some cache needed to get repopulated after the new worker rolled out?
And later pushes had it failing again...
Hi guys,

I think this is related to job group handling, which was anticipated in https://github.com/taskcluster/runlib/pull/2#pullrequestreview-17247538

I'll look into it...

Pete
Flags: needinfo?(pmoore)
I'll roll back to the previous g-w version on the gecko builders, and work with Ted this afternoon to see if we can find a good solution for this in-tree.
Flags: needinfo?(ted)
closed inbound and autoland now for this issues
Assignee: nobody → pmoore
Status: NEW → ASSIGNED
Attachment #8834305 - Flags: review?(rthijssen)
Attachment #8834305 - Flags: review?(rthijssen) → review+
Progress can be monitored here: https://tools.taskcluster.net/task-group-inspector/#/ah-JKzrXThSQwk5iBgyO2g?_k=pvjvs4

When the following AMI creation tasks have completed, the new worker type should be available.

gecko-1-b-win2012
gecko-2-b-win2012
gecko-3-b-win2012

After that, it can take up to half an hour for the old workers to detect that they have been superseded, and to terminate themselves. However, during this half an hour, we could manually kill them.
This is from the task in comment 0 (under generic-worker 8.0.0):

https://papertrailapp.com/systems/685891392/events?highlight=765335101042249764&focus=765335101042249764

2017/02/06 18:55:28 Making system call CreateProcessWithLogonW with args: [C0422A7060 C0424847AC C042483400 1 0 C042248360 404 C042130000 C0422A2360 C0421F6070 C0422A63C0]

This shows that process creation flags used under 8.0.0 are 0x404 (see https://msdn.microsoft.com/en-us/library/windows/desktop/ms682431(v=vs.85).aspx).

This means the following creation flags were set:

CREATE_UNICODE_ENVIRONMENT
0x00000400

CREATE_SUSPENDED
0x00000004


Here is a recent job running since the rollback to 7.1.3:

https://public-artifacts.taskcluster.net/b2TuEaD9T2Gx-98ORhL_jw/0/public/logs/live_backing.log

This shows identical process creation flags: https://papertrailapp.com/systems/687604972/events?highlight=765612809483636744&focus=765612809483636744

2017/02/07 13:18:59 Making system call CreateProcessWithLogonW with args: [C0425A50A0 C04245649C C0422E7380 1 0 C0421A2720 404 C04237D000 C0425B0CF0 C0424B80E0 C0425A4440]

Again the process creation flags are 0x404 - so a change to the creation process flags is *not* the cause of this bug.
This is consistent with the change in https://github.com/taskcluster/runlib/pull/2 which only affected Windows 10 and Windows Server 2012 R2 when "runTasksAsCurrentUser" is set to true in the worker config (i.e. do not create task users, use current admin user that runs the generic worker). This property is not set in gecko-{level}-b-win2012 worker types, so it was not expected that these flags should change.

So the question remains, what did change?
In the 7.1.3 log I see:

https://public-artifacts.taskcluster.net/b2TuEaD9T2Gx-98ORhL_jw/0/public/logs/live_backing.log

13:21:00     INFO - Calling ['C:\\mozilla-build\\msys\\bin\\bash.exe', 'Z:\\task_1486473398\\build\\src\\mach', '--log-no-times', 'build', '-v'] with output_timeout 4800
ProcessManager UNABLE to use job objects to manage child processes


In the 8.0.0 log I see the same:

https://public-artifacts.taskcluster.net/L41WMoCLST2tHD-P-b85QQ/0/public/logs/live_backing.log

18:56:36     INFO - Calling ['C:\\mozilla-build\\msys\\bin\\bash.exe', 'Z:\\task_1486406213\\build\\src\\mach', '--log-no-times', 'build', '-v'] with output_timeout 4800
ProcessManager UNABLE to use job objects to manage child processes


So this also suggests this is not related to whether mozprocess tries to manage child processes with job objects or not.

The plot thickens.
Ahhhhhhhhhhhh

Looks like I introduced a change where we call SHSetKnownFolderPath at runtime to redirect the APPDATA folder (see bug 1319487):

https://github.com/taskcluster/generic-worker/blob/a4a93352d9164f1aa974adc6ea982c4c2cf067b7/plat_windows.go#L113-L119

I bet that is it. It didn't exist in generic worker version 7.1.3.
https://tools.taskcluster.net/task-inspector/#b2TuEaD9T2Gx-98ORhL_jw/ (generic-worker 7.1.3) has:

13:19:03     INFO -  'APPDATA': 'C:\\Users\\task_1486473398\\AppData\\Roaming',
13:19:03     INFO -  'LOCALAPPDATA': 'C:\\Users\\task_1486473398\\AppData\\Local',
13:19:03     INFO -  'TEMP': 'C:\\Users\\task_1486473398\\AppData\\Local\\Temp',
13:19:03     INFO -  'TMP': 'C:\\Users\\task_1486473398\\AppData\\Local\\Temp',


https://tools.taskcluster.net/task-inspector/L41WMoCLST2tHD-P-b85QQ/ (generic-worker 8.0.0) has:

18:55:32     INFO -  'APPDATA': 'Z:\\AppData\\Roaming',
18:55:32     INFO -  'LOCALAPPDATA': 'Z:\\AppData\\Local',
18:55:32     INFO -  'TEMP': 'C:\\Users\\task_1486406213\\AppData\\Local\\Temp',
18:55:32     INFO -  'TMP': 'C:\\Users\\task_1486406213\\AppData\\Local\\Temp',
So it looks like the SHSetKnownFolderPath calls are working, but I suspect that either:

1) they are setting to the wrong values
2) something in-tree has a hard-coded assumption that these folders are in the default location
(note: TMP/TEMP are not affected by SHSetKnownFolderPath as explained in bug 1319487)
I wonder if the problem is that we have

err = RedirectAppData(loginInfo.HUser, filepath.Join(config.TasksDir, "AppData"))

rather than:

err = RedirectAppData(loginInfo.HUser, filepath.Join(taskContext.TaskDir, "AppData"))



If we make this change, we would adjust the paths as follows:

APPDATA:        Z:\AppData\Roaming   => Z:\task_XXXXXXX\AppData\Raoming
LOCALAPPDATA:   Z:\AppData\Local     => Z:\task_XXXXXXX\AppData\Local


One issue I see with the current setup, is that the appdata folders are getting reused between tasks.

I'll make this task, and roll it out to gecko-1-b-win2012-beta worker type, then rerun task L41WMoCLST2tHD-P-b85QQ on it.

If this fixes the problem, I'll also see if I can write a test to cover this. The test would probably be to write a file to %APPDATA% and %LOCALAPPDATA% locations, in two sequential tasks, and check that both writes succeed.

Here we see that KF_FLAG_CREATE is set to create the folder:
https://github.com/taskcluster/runlib/blob/d4f1c59aef8fb5c2312203b1c5d7316b5fd5b406/win32/extra_windows.go#L426

And the docs https://msdn.microsoft.com/en-us/library/windows/desktop/dd378447(v=vs.85).aspx state "Forces the creation of the specified folder if that folder does not already exist." - thus if that folder exists, nothing is done to apply ACL(s) to make the folder writable to the user represented by the access token passed in the SHGetKnownFolderPath call.

Thus, the call will succeed, even if the folder exists but is not writable by the given user, so we wouldn't have seen a failure, hence why this issue might have gone unnoticed until now.
s/I'll make this task/I'll make this change/
+1 on comment 18. We definitely want AppData folders to be task specific. - and thanks for getting the SHSetKnownFolderPath change into g-w. That is going to fix many, many things.
Sounds like you've got this sorted without me. :)
Flags: needinfo?(ted)
Released 8.0.1 and testing now on gecko-3-b-win2012-beta:

https://tools.taskcluster.net/task-inspector/#OXmmU79URq6ytz7Bj9facg/
https://tools.taskcluster.net/task-inspector/#M7GagixTQZmVWVoVP4HUNQ/
https://tools.taskcluster.net/task-inspector/#QoNOeHHqRqGJOU9G5fKUkA/
https://tools.taskcluster.net/task-inspector/#JpGnuSeQTbGx0TFaWFNqPA/
https://tools.taskcluster.net/task-inspector/#BtK4Ip4jQlCMbjxSw929tw/


These 5 tasks should all run on the same worker, since maxCapacity is 1 on the worker type. If all five pass, there is a good chance the underlying problem has been fixed. It will take some time to run all of these tasks serially, so I'll check back in later today to see how they are getting on.

These five new tasks are copies of the former failed task running under 8.0.0:

https://tools.taskcluster.net/task-inspector/#L41WMoCLST2tHD-P-b85QQ/

This task was first mentioned in comment 15, unfortunately the '#' got stripped from the url - this is the corrected url.
Since these gecko-3-b-win2012-beta tasks are dragging their feet a bit (i.e. the worker is still initialising after an hour), I'll recreate a failed try job against gecko-1-b-win2012-beta...

Original try push failure:

https://tools.taskcluster.net/task-inspector/#VDrTYmaqQ1egk_hfa876Mg/0

Reruns:

https://tools.taskcluster.net/task-inspector/#AUPWA0mGTBy5aQNnzpKm8g/
https://tools.taskcluster.net/task-inspector/#cCM-J-daRC6n0m_gmFt81w/
https://tools.taskcluster.net/task-inspector/#KRdltGXUTy2KGmItmCY_-g/
https://tools.taskcluster.net/task-inspector/#H5ZvaWy5TYyiIjrJxyrzEg/
https://tools.taskcluster.net/task-inspector/#IqUbbkLlSeOfghR7kaPpvQ/

So the difference between these tasks, and the tasks listed in comment 23, is that these are for gecko-1-b-win2012-beta whereas the tasks in comment 23 are for gecko-3-b-win2012-beta.
(In reply to Pete Moore [:pmoore][:pete] from comment #25)
> I've just spotted that the deadline was set to 1 hour after creation, in all
> the tasks in comment 24, so I'm recreating (again!) to set the deadline in a
> day from now...
> 
> https://tools.taskcluster.net/task-inspector/#aLEn4Pb1RSGyBdqQlR_Uag/
> https://tools.taskcluster.net/task-inspector/#QmJVZKK7TGK8azMky6SCAQ/
> https://tools.taskcluster.net/task-inspector/#ViCRAYObR7CrQTl2PFXqLw/
> https://tools.taskcluster.net/task-inspector/#SKvgrHagSl2pamFrTcEYtA/
> https://tools.taskcluster.net/task-inspector/#EgTtTwkEQWWyJBNM2G9ejg/

All successful, no failures, and all ran on same worker, one after another, so let's upgrade!
Given the successful beta testing in comment 25, this should upgrade everything to the fixed version, where static analysis jobs have been shown to consistently pass.
Attachment #8835592 - Flags: review?(rthijssen)
I've also re-enabled papertrail alerts for go crashes (to be sent to team alerting email address).
Status: ASSIGNED → RESOLVED
Closed: 7 years ago
Resolution: --- → FIXED
Component: Generic-Worker → Workers
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: