Closed Bug 1433854 Opened 6 years ago Closed 4 years ago

generic-worker: remove files and directories more forcefully on windows

Categories

(Taskcluster :: Workers, defect, P2)

defect

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: grenade, Assigned: pmoore, Mentored)

References

Details

(Keywords: good-first-bug)

Attachments

(3 files)

if windows test instances are not rebooted between task runs, generic-worker leaves traces of the previous task behind:

- parts of the previous task directory are left intact (z:\task_xxxx)
- parts of the user profile associated with the previous task are left intact (c:\Users\GenericWorker\AppData). note that this occurs even if the instance is rebooted.
- processes started by the previous task are left running (often firefox.exe)

the last item almost always causes subsequent tests to fail since some of the ports required for test runs are still in use by the previous task.

in python code in-tree (clobber is a good example) when we want to delete large directory trees, we use a retry loop with a random or arbitrary sleep timer between retries to delete files that still have open file handles or where the delete fails for another reason. i think this is missing in gw where the task directory delete mechanism ignores delete failures whih results in a portion of the task directory tree surviving as long as the instance.
if cleanup of task files fails and there is still data left behind from the previous task, generic worker should terminate the instance rather than take on new tasks.
In generic-worker 10, which we are rolling out in bug 1399401 there are reboots between all tasks, so we shouldn't have to worry about running processes.

I would say let's park this until g-w 10 is out on all the workers, and then see if we continue to have a problem.

Are you seeing this on builders (which are already on g-w 10.2.3) ?
Blocks: 1399401
No longer blocks: 1399401
Depends on: 1399401
(In reply to Pete Moore [:pmoore][:pete] from comment #2)
> Are you seeing this on builders (which are already on g-w 10.2.3) ?

yes. i think the issue i was hoping we would resolve is the one where the task user directory deletes triggered by gw fail. i think relying on the reboot mechanism is fine as a workaround but it doesn't solve the issue. if gw were fixed properly, the task cleanup should succeed without relying on the reboot. rebooting costs time and money which is an inefficiency i was hoping we wouldn't treat as an acceptable fix.

introducing a delete retry loop similar to the ones used by clobber in-tree should fix the issues gw is currently ignoring when deleting task user directories.
QA Contact: pmoore
(In reply to Rob Thijssen (:grenade UTC+2) from comment #3)
> (In reply to Pete Moore [:pmoore][:pete] from comment #2)
> > Are you seeing this on builders (which are already on g-w 10.2.3) ?
> 
> yes. i think the issue i was hoping we would resolve is the one where the
> task user directory deletes triggered by gw fail. i think relying on the
> reboot mechanism is fine as a workaround but it doesn't solve the issue. if
> gw were fixed properly, the task cleanup should succeed without relying on
> the reboot. rebooting costs time and money which is an inefficiency i was
> hoping we wouldn't treat as an acceptable fix.
> 
> introducing a delete retry loop similar to the ones used by clobber in-tree
> should fix the issues gw is currently ignoring when deleting task user
> directories.

This is a misunderstanding - I believe the task cleanup is working, and generic-worker 10 requires reboots for a different reason - it performs a winlogon to the interactive desktop session of the host as the generated task user. It isn't possible to programmatically log out of userspace and back in as a different user, without performing a reboot. Therefore the reboot is a necessary step to ensure session isolation for the task users.

The reason I suggest to park this until we've upgraded all our workers is because I suspect the new workers will not have a problem, because the worker is cleaning up reliably. Alternatively, we can close, and reopen if we decide it isn't - but until now I have no evidence of generic-worker 10 running under task user isolation not performing cleanup correctly. I suspect the issues you have seen are from earlier versions of generic-worker, or running the worker under a different configuration (such as running tasks as current user, rather than having task user isolation).
Hey Rob,

Is there any more cleanup for us to do here? I'm wondering if this issue is solved now with the upgrade, or if there is still anything getting left behind.

Thanks!
Flags: needinfo?(rthijssen)
Assignee: nobody → pmoore
Status: NEW → ASSIGNED
I found the bug! PR created (comment 7) with review requested from :jonasfj.
Commits pushed to master at https://github.com/taskcluster/generic-worker

https://github.com/taskcluster/generic-worker/commit/f310b8e671db552497652969f956e22554237eb4
Bug 1433854 - delete old task user profile directories (not just task directories)

https://github.com/taskcluster/generic-worker/commit/01c891b81c258d4bc7e737cd31389e894ebca15d
Merge pull request #94 from taskcluster/bug1433854

Bug 1433854 - delete old task user profile directories
Depends on: 1465113
Status: ASSIGNED → RESOLVED
Closed: 6 years ago
Resolution: --- → FIXED
this appears to have regressed on generic worker 10.8.5 hardware
see:
https://papertrailapp.com/groups/1958653/events?q=program%3Ageneric-worker%20%22could%20not%20delete%20directory%22
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
My suspicion is the reboot isn't working between tasks and/or the tasks are running as current user.

I tried to submit some tasks to test the theory but I have no means to submit them to the workers running 10.8.5. Every task I ran was executed by a worker running 10.0.5:

https://tools.taskcluster.net/groups/B1-aOxfuQwuPn0lJR_NxBA

Can we move the workers running 10.8.5 to a different worker type name?

Also, can you check the up time on a worker that exhibits this issue, and see if it aligns with the tasks that have been running on it?

I can see from the logs that a reboot is issued, but what I can't be sure of is whether it was successful or not, only that it was issued. I wonder if something prevents the reboot (maybe there is a dialogue box open asking for user intervention) and then the reboot never takes place.

The reason this is my suspicion is because if there is a reboot, the generic-worker is running as Local System, it should have no problem cleaning out directories when there can be no open handles on files in those directories.
Status: REOPENED → RESOLVED
Closed: 6 years ago6 years ago
Resolution: --- → INVALID
(In reply to Pete Moore [:pmoore][:pete] from comment #11)
> My suspicion is the reboot isn't working between tasks and/or the tasks are
> running as current user.

papertrail logs show that the machines are rebooting:
https://papertrailapp.com/systems/1731044491/events?q=program%3Auser32

the same logs show that runTasksAsCurrentUser is set to false:
https://papertrailapp.com/systems/1731044491/events?q=runTasksAsCurrentUser

> I can see from the logs that a reboot is issued, but what I can't be sure of
> is whether it was successful or not, only that it was issued. I wonder if
> something prevents the reboot (maybe there is a dialogue box open asking for
> user intervention) and then the reboot never takes place.

looking at logs subsequent to the reboot logs shows system state is consistent with a reboot having taken place. eg: occ runs and the sequence of logs is consistent with what we would expect from a rebooted machine.

> The reason this is my suspicion is because if there is a reboot, the
> generic-worker is running as Local System, it should have no problem
> cleaning out directories when there can be no open handles on files in those
> directories.

agree. however the logs suggest that a reboot is taking place.

reopening because the issue of generic worker failing to delete task directories needs to be fixed and the explanation of failure to reboot is not consistent with what is shown in the logs.
Status: RESOLVED → REOPENED
Resolution: INVALID → ---
i neglected to respond to a couple of points.

(In reply to Pete Moore [:pmoore][:pete] from comment #11)
> Can we move the workers running 10.8.5 to a different worker type name?

as i understand it (:markco will know for sure), these workers use a unique worker type name of gecko-t-win10-64-hw-GW10

> Also, can you check the up time on a worker that exhibits this issue, and
> see if it aligns with the tasks that have been running on it?

one way to check uptime since a given point (like a failure to delete a task directory) in the logs is to compare the time of the log entry in question with the preceeding log entry that has a program attribute of "user32" since the only user32 entries that are forwarded to papertrail are those which contain shutdown or restart commands. there is of course some arbitrary time between those entries and the system restart and subsequent log entries.
(In reply to [github robot] from comment #9)
> Commits pushed to master at https://github.com/taskcluster/generic-worker
> 
> https://github.com/taskcluster/generic-worker/commit/
> f310b8e671db552497652969f956e22554237eb4
> Bug 1433854 - delete old task user profile directories (not just task
> directories)
> 
> https://github.com/taskcluster/generic-worker/commit/
> 01c891b81c258d4bc7e737cd31389e894ebca15d
> Merge pull request #94 from taskcluster/bug1433854
> 
> Bug 1433854 - delete old task user profile directories

Released in https://github.com/taskcluster/generic-worker/releases/tag/v10.8.4


... I will respond to the open points separately.
for debugging, here's some links to the delete failure logs.

- on hardware instances:
  https://papertrailapp.com/groups/1958653/events?q=program%3Ageneric-worker%20%22WARNING%3A%20could%20not%20delete%20directory%22
- on ec2 instances:
  https://papertrailapp.com/groups/2488493/events?q=program%3Ageneric-worker%20%22WARNING%3A%20could%20not%20delete%20directory%22

the failures occur on all windows worker types (hardware, ec2, testers, builders)
Thanks Rob.

I have a hunch the issue might be that we need to apply ACLs to allow file deletion before actually deleting - see bug 1443589 comment 84.

I'll see if I can reproduce and make a fix and add a CI test for it...
See Also: → 1443589
Attached file GitHub Pull Request
this pr in its current form is intended as a temporary workaround.

it will hopefully help to remove some complexity from the generic worker wrapper script by removing the need to perform disk cleanup tasks there.

when this bug is fixed in gw, it should not be necessary to clean up task folders with the MaintainSystem scheduled task, but it may be a good place for other system maintenance which needs to occur with regularity.
Attachment #9009918 - Flags: review?(mcornmesser)
Attachment #9009918 - Flags: feedback?(pmoore)
Attachment #9009918 - Flags: review?(mcornmesser) → review+
Attachment #9009918 - Flags: feedback?(pmoore) → feedback+
Component: Generic-Worker → Workers
Blocks: 1528198

I think comment 16 of this bug is the path forward.

I'm going on PTO for a week at the end of today, so I'll look at this when I'm back in week starting 25 Feb 2019...

No longer blocks: 1528198
Whiteboard: good-first-bug

This still needs working on, it is a genuine issue, but priority on other things is higher at the moment as we have a workaround in place in OCC, so unassigning myself...

Assignee: pmoore → nobody
Summary: clean up after tasks on windows test workers → generic-worker: remove files and directories more forcefully on windows
Mentor: pmoore
Whiteboard: good-first-bug
Blocks: 1553280
Priority: -- → P2
Assignee: nobody → pmoore
Status: REOPENED → ASSIGNED

Released in taskcluster 25.4.0.

Status: ASSIGNED → RESOLVED
Closed: 6 years ago4 years ago
Resolution: --- → FIXED
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: