Frequent "blue jobs" on treeherder for generic-worker 7.x windows tasks

Status: RESOLVED FIXED
Product: Taskcluster
Component: Generic-Worker
Reported: a year ago
Last modified: 10 months ago

People

(Reporter: pmoore, Assigned: pmoore)

Attachments

(2 attachments)

(Assignee)

Description

a year ago
At first glance, this looks like it might be that after a while, the worker is unable to create a new desktop for a new task, and this causes the worker to shut down.

This could be the case, if for some reason it is not able to release a previous desktop. Perhaps there are open resources that prevent a previous task user's desktop from being closed, and eventually the desktop heap is too small to allow for the creation of a new desktop.

I'll do some debugging to see if this really is the case.

Options might be:
1) More aggressively releasing resources
2) Introducing reboots between tasks
3) Go back to using a shared desktop

I think 3) would be sad. 1) would be the nice/preferred way. 2) is a very simple fix that is almost guaranteed to work, but it adds a time delay and sidesteps a fundamental problem which we should at least try to understand and resolve properly.

First though, I have to be sure this really is the cause! So I'll attend to that first.
(Assignee)

Comment 1

a year ago
So after tracing all of the system calls, I've managed to determine that the runlib library isn't closing desktop handles.

This should be pretty straightforward to fix; I can implement it as a cleanup activity when we drop the user.

https://msdn.microsoft.com/en-us/library/windows/desktop/ms682024(v=vs.85).aspx (CloseDesktop) is the call that needs to be made, and it only requires the handle to the desktop as input, which we are already tracking.
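The cleanup idea above can be sketched as follows. This is a minimal, portable Go sketch, not the actual runlib code: the `handle` type and `desktopTracker` are hypothetical names, and the injected `closeFunc` stands in for the real `user32!CloseDesktop` syscall so the sketch runs anywhere.

```go
package main

import "fmt"

// handle stands in for a Windows HDESK; in the real worker this would be a
// syscall.Handle, and closing it would call user32!CloseDesktop.
type handle uintptr

// desktopTracker sketches the cleanup activity: track every desktop handle
// the worker opens, and release them all when the task user is dropped.
// The injected closeFunc keeps the sketch portable.
type desktopTracker struct {
	open      []handle
	closeFunc func(handle) error
}

func (t *desktopTracker) track(h handle) {
	t.open = append(t.open, h)
}

// releaseAll attempts to close every tracked handle, returning the first
// error but not stopping early ("more aggressively releasing resources").
func (t *desktopTracker) releaseAll() error {
	var first error
	for _, h := range t.open {
		if err := t.closeFunc(h); err != nil && first == nil {
			first = err
		}
	}
	t.open = nil
	return first
}

func main() {
	var closed []handle
	t := &desktopTracker{closeFunc: func(h handle) error {
		closed = append(closed, h)
		return nil
	}}
	t.track(42)
	t.track(43)
	if err := t.releaseAll(); err != nil {
		fmt.Println("cleanup error:", err)
	}
	fmt.Println("closed:", closed)
}
```

The key point is that cleanup is unconditional and attempted for every tracked handle, so a single failed close doesn't leave the rest leaked.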
Comment 2

a year ago

Per pete, this is now much better, but still not fixed (one in eleven jobs are blue, vs. one in two)
(Assignee)

Comment 3

a year ago
(In reply to Dustin J. Mitchell [:dustin] from comment #2)
> Per pete, this is now much better, but still not fixed (one in eleven jobs
> are blue, vs. one in two)

This improvement was due to closing desktop handles - it looks like this wasn't everything that was needed though as we still have occasional failures. Something else must be eating up the heap/stack... :/
(Assignee)

Comment 4

10 months ago
We currently work around this issue on win7 by rebooting between tasks. I haven't had the chance to revisit this, and the OS X work at the moment is higher priority, so realistically I'm not going to get a chance to look at this again in the near future.

The good news is - when there is an intermittent, it immediately fails - so in real terms the delay due to the intermittent is just seconds, rather than minutes or hours. Also, we can work around this with reboots, if required. So I think this is probably still quite low priority, but if anyone wants to look into it, I'm happy to support with mentoring etc.
(Assignee)

Comment 5

10 months ago
Funnily enough, this has today moved up my priority list, since I added taskcluster tasks to the generic-worker CI via taskcluster-github. Previously, tests did not require a privileged user that could create task users; they all ran as the current user. The limitation was that this didn't exercise the user creation process, or process creation as a different user.

Today I enabled this test behaviour via an environment variable setting which, when set, makes the test suite create a new task user for each task it executes. Since the test suite has quite a few tasks, we hit the point at which no more task users can be created, due to the resource leak described in this bug, so the generic-worker CI tests now consistently fail because of it.

So fixing this bug would also fix the updated generic-worker CI.
Comment 6

10 months ago

I did some brief googling and found some resources that might be useful. Windows apparently has a finite resource called "desktop heap", and if that gets exhausted, things will start failing:
https://blogs.msdn.microsoft.com/ntdebugging/2007/01/04/desktop-heap-overview/

That blog post mentions a commandline tool called dheapmon that you can download from Microsoft:
https://www.microsoft.com/en-us/download/confirmation.aspx?id=17782
https://blogs.msdn.microsoft.com/alejacma/2008/07/29/how-to-use-dheapmon-exe-to-troubleshoot-desktop-heap-issues/

It might be useful to try that out. We might have to bake it into the AMI if we want to run it early in every task so we can monitor the output.
(Assignee)

Comment 7

10 months ago
Thanks Ted! I think I downloaded and tried it out, but at the time I had problems running it on Windows Server 2012 R2 (I think it is targeted at earlier versions of Windows). That said, I can try again - I've forgotten exactly how far I got, and exactly what the problems were... :/
(Assignee)

Comment 8

10 months ago
So far I've not been successful in getting dheapmon to work on Windows 10.

I've seen it might be possible to use the debugger that ships with the Windows SDK:
https://msdn.microsoft.com/en-us/library/windows/hardware/ff562412(v=vs.85).aspx

In parallel, I've unearthed some failures to close handles; see e.g. https://public-artifacts.taskcluster.net/Jp4icuMDQmurzUmCb9ozcQ/0/public/logs/live_backing.log where the CloseHandle calls are now explicitly logged. They return 0 when trying to close the handle to a user after unloading the user profile (the UnloadUserProfile syscall). The UnloadUserProfile call itself succeeds (non-zero return code), but the subsequent CloseHandle call returns 0.

UnloadUserProfile: https://msdn.microsoft.com/en-us/library/windows/desktop/bb762282(v=vs.85).aspx
CloseHandle: https://msdn.microsoft.com/en-us/library/windows/desktop/ms724211(v=vs.85).aspx

I've added some extra logging: https://github.com/taskcluster/runlib/commit/d4f1c59aef8fb5c2312203b1c5d7316b5fd5b406#diff-3d8ce073db4fafb9a7ac19582196b8d4R395

We'll hopefully see the output from this extra logging in the CI jobs: https://tools.taskcluster.net/task-group-inspector/#/BUXbfA_rSTKzWLZENs7CAw?_k=ol7kqz
Comment 9

10 months ago

FYI https://msdn.microsoft.com/en-us/library/windows/desktop/bb762281(v=vs.85).aspx says:
"Do not close the hProfile handle. Instead, pass it to the UnloadUserProfile function. This function closes the handle."
(Assignee)

Comment 10

10 months ago
Thanks Ted. Yes, this turned out to be a non-issue: we were already calling UnloadUserProfile, and the CloseHandle failure was ignored, causing no harm. So that wasn't it.

One strange thing is that I can consistently call CloseDesktop without failure (i.e. we get a non-zero return code), which suggests there are no open handles or anything else preventing the desktop from being closed. In that case, I can't account for why there might be a shortage of desktop heap, since all previously opened desktops are closed, and not attached to open resources.

I think I'm going to have to park this issue for now. If anyone is interested in diving deeper, I'm guessing playing with the debugger that comes with the SDK should provide some insight. In particular:

https://msdn.microsoft.com/en-us/library/windows/hardware/ff562412(v=vs.85).aspx

Or if anyone manages to get dheapmon installed and working on Windows 10, I'd be keen to hear how they managed it (I found several guides online, but none of them worked for me).
(Assignee)

Comment 11

10 months ago
So I finally got to the bottom of it!!

It turns out that although my CloseDesktop system call returned success, it wasn't actually closing the desktop, because the desktop was being actively displayed. I actually woke up in the middle of the night when I realised this, and then couldn't sleep for the next four hours thinking about it, as I realised this must be the case.

Today I made some patches to track what the current desktop is before creating a new one for the task. After the task has completed and we go to close the new desktop, we first SwitchDesktop back to the original thread desktop, and then close the new desktop.

In the CI I now have green builds, and it is able to execute all the tests without ever failing to create a desktop, so I think we can consider this issue fixed! Of course there is a small possibility we have just reduced its frequency, but I'm hoping we've actually put it to bed.

I won't close this bug yet, until we've rolled out fixes to the existing worker types.

CI results are here:
https://tools.taskcluster.net/task-group-inspector/#/GXNwrcKcQG2e1omQK-vK2A?_k=8vzr1b
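The fixed flow can be sketched with a tiny model of the failure mode. The `winsta` type and its methods are hypothetical, portable stand-ins for `user32!GetThreadDesktop`, `user32!SwitchDesktop` and `user32!CloseDesktop`; the model encodes the observed behaviour that CloseDesktop "succeeds" yet leaves a displayed desktop alive.

```go
package main

import "fmt"

type desktop string

// winsta models just enough of a Windows window station to demonstrate the
// pitfall: CloseDesktop reports success even when the desktop stays alive
// because it is currently being displayed.
type winsta struct {
	active desktop
	alive  map[desktop]bool
}

func (w *winsta) switchDesktop(d desktop) { w.active = d }

// closeDesktop mimics the observed behaviour: it reports success either
// way, but only actually destroys the desktop when it is not displayed.
func (w *winsta) closeDesktop(d desktop) bool {
	if d != w.active {
		delete(w.alive, d)
	}
	return true // "success" in both cases, as seen in the worker logs
}

// runTask is the fixed flow: remember the original desktop, switch to a
// fresh one for the task, then switch BACK before closing, so the close
// genuinely releases the desktop instead of silently leaking it.
func (w *winsta) runTask(name desktop, task func()) {
	original := w.active
	w.alive[name] = true
	w.switchDesktop(name)
	task()
	w.switchDesktop(original) // the fix: switch back first
	w.closeDesktop(name)      // now this really frees the desktop
}

func main() {
	w := &winsta{active: "default", alive: map[desktop]bool{"default": true}}
	w.runTask("task-desktop", func() { fmt.Println("task ran on", w.active) })
	fmt.Println("task desktop leaked:", w.alive["task-desktop"])
}
```

Without the `switchDesktop(original)` line, every task would leave its desktop in `alive` despite the "successful" close, which is exactly how the desktop heap was being exhausted one task at a time.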
(Assignee)

Comment 12

10 months ago
Created attachment 8833640 [details] [review]
Github Pull Request for OpenCloudConfig
Assignee: nobody → pmoore
Status: NEW → ASSIGNED
Attachment #8833640 - Flags: review?(rthijssen)
Comment 13

10 months ago

Comment on attachment 8833640 [details] [review]
Github Pull Request for OpenCloudConfig

merged: https://github.com/mozilla-releng/OpenCloudConfig/commit/85e8a2b7caa6818e2cc7791e26ccbc9bc922543f
Attachment #8833640 - Flags: review?(rthijssen) → review+
(Assignee)

Comment 14

10 months ago
I'm going to mark this as resolved, as it is now rolled out, and try pushes for NSS have been positive.

If it turns out there are any problems, we can open separate bugs (and link them to this one).
Status: ASSIGNED → RESOLVED
Last Resolved: 10 months ago
Resolution: --- → FIXED
(Assignee)

Comment 15

10 months ago
Oh - erm - I guess it is only rolled out on beta worker types - maybe better for us to wait until it is rolled out globally!

I'll make a PR now, in preparation, assuming beta testing goes OK.
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
(Assignee)

Comment 16

10 months ago
Created attachment 8833693 [details] [review]
Github Pull Request for upgrading all remaining worker types

Filing this now in advance - but we should only merge once we know all is ok. When we see that jobs run fine on beta worker types, we should be able to land this patch, which will then upgrade all the other worker types to 7.3.0.
Attachment #8833693 - Flags: review?(rthijssen)
Comment 17

10 months ago

Comment on attachment 8833693 [details] [review]
Github Pull Request for upgrading all remaining worker types

this pr was superseded by 40 & 41
Attachment #8833693 - Flags: review?(rthijssen) → review-
(Assignee)

Updated

10 months ago
Status: REOPENED → RESOLVED
Last Resolved: 10 months ago
Resolution: --- → FIXED