Closed
Bug 1433854
Opened 6 years ago
Closed 4 years ago
generic-worker: remove files and directories more forcefully on windows
Categories
(Taskcluster :: Workers, defect, P2)
Taskcluster
Workers
Tracking
(Not tracked)
RESOLVED
FIXED
People
(Reporter: grenade, Assigned: pmoore, Mentored)
References
Details
(Keywords: good-first-bug)
Attachments
(3 files)
if windows test instances are not rebooted between task runs, generic-worker leaves traces of the previous task behind: - parts of the previous task directory are left intact (z:\task_xxxx) - parts of the user profile associated with the previous task are left intact (c:\Users\GenericWorker\AppData). note that this occurs even if the instance is rebooted. - processes started by the previous task are left running (often firefox.exe) the last item almost always causes subsequent tests to fail since some of the ports required for test runs are still in use by the previous task. in python code in-tree (clobber is a good example) when we want to delete large directory trees, we use a retry loop with a random or arbitrary sleep timer between retries to delete files that still have open file handles or where the delete fails for another reason. i think this is missing in gw where the task directory delete mechanism ignores delete failures whih results in a portion of the task directory tree surviving as long as the instance.
Reporter | ||
Comment 1•6 years ago
|
||
if cleanup of task files fails and there is still data left behind from the previous task, generic worker should terminate the instance rather than take on new tasks.
Assignee | ||
Comment 2•6 years ago
|
||
In generic-worker 10, which we are rolling out in bug 1399401 there are reboots between all tasks, so we shouldn't have to worry about running processes. I would say let's park this until g-w 10 is out on all the workers, and then see if we continue to have a problem. Are you seeing this on builders (which are already on g-w 10.2.3) ?
Assignee | ||
Updated•6 years ago
|
Reporter | ||
Comment 3•6 years ago
|
||
(In reply to Pete Moore [:pmoore][:pete] from comment #2) > Are you seeing this on builders (which are already on g-w 10.2.3) ? yes. i think the issue i was hoping we would resolve is the one where the task user directory deletes triggered by gw fail. i think relying on the reboot mechanism is fine as a workaround but it doesn't solve the issue. if gw were fixed properly, the task cleanup should succeed without relying on the reboot. rebooting costs time and money which is an inefficiency i was hoping we wouldn't treat as an acceptable fix. introducing a delete retry loop similar to the ones used by clobber in-tree should fix the issues gw is currently ignoring when deleting task user directories.
Assignee | ||
Updated•6 years ago
|
QA Contact: pmoore
Assignee | ||
Comment 4•6 years ago
|
||
(In reply to Rob Thijssen (:grenade UTC+2) from comment #3) > (In reply to Pete Moore [:pmoore][:pete] from comment #2) > > Are you seeing this on builders (which are already on g-w 10.2.3) ? > > yes. i think the issue i was hoping we would resolve is the one where the > task user directory deletes triggered by gw fail. i think relying on the > reboot mechanism is fine as a workaround but it doesn't solve the issue. if > gw were fixed properly, the task cleanup should succeed without relying on > the reboot. rebooting costs time and money which is an inefficiency i was > hoping we wouldn't treat as an acceptable fix. > > introducing a delete retry loop similar to the ones used by clobber in-tree > should fix the issues gw is currently ignoring when deleting task user > directories. This is a misunderstanding - I believe the task cleanup is working, and generic-worker 10 requires reboots for a different reason - it performs a winlogon to the interactive desktop session of the host as the generated task user. It isn't possible to programmatically log out of userspace and back in as a different user, without performing a reboot. Therefore the reboot is a necessary step to ensure session isolation for the task users. The reason I suggest to park this until we've upgraded all our workers is because I suspect the new workers will not have a problem, because the worker is cleaning up reliably. Alternatively, we can close, and reopen if we decide it isn't - but until now I have no evidence of generic-worker 10 running under task user isolation not performing cleanup correctly. I suspect the issues you have seen are from earlier versions of generic-worker, or running the worker under a different configuration (such as running tasks as current user, rather than having task user isolation).
Assignee | ||
Comment 5•6 years ago
|
||
Hey Rob, Is there any more cleanup for us to do here? I'm wondering if this issue is solved now with the upgrade, or if there is still anything getting left behind. Thanks!
Flags: needinfo?(rthijssen)
Reporter | ||
Comment 6•6 years ago
|
||
It doesn't look solved: https://tools.taskcluster.net/groups/ZjHavZ71QN25zgaaWiwi6A/tasks/ZjHavZ71QN25zgaaWiwi6A/runs/0/logs/public%2Flogs%2Flive.log
Flags: needinfo?(rthijssen)
Assignee | ||
Comment 7•6 years ago
|
||
Assignee: nobody → pmoore
Status: NEW → ASSIGNED
Assignee | ||
Comment 8•6 years ago
|
||
I found the bug! PR created (comment 7) with review requested from :jonasfj.
Comment 9•6 years ago
|
||
Commits pushed to master at https://github.com/taskcluster/generic-worker https://github.com/taskcluster/generic-worker/commit/f310b8e671db552497652969f956e22554237eb4 Bug 1433854 - delete old task user profile directories (not just task directories) https://github.com/taskcluster/generic-worker/commit/01c891b81c258d4bc7e737cd31389e894ebca15d Merge pull request #94 from taskcluster/bug1433854 Bug 1433854 - delete old task user profile directories
Assignee | ||
Updated•6 years ago
|
Status: ASSIGNED → RESOLVED
Closed: 6 years ago
Resolution: --- → FIXED
Reporter | ||
Comment 10•6 years ago
|
||
this appears to have regressed on generic worker 10.8.5 hardware see: https://papertrailapp.com/groups/1958653/events?q=program%3Ageneric-worker%20%22could%20not%20delete%20directory%22
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
Assignee | ||
Comment 11•6 years ago
|
||
My suspicion is the reboot isn't working between tasks and/or the tasks are running as current user. I tried to submit some tasks to test the theory but I have no means to submit them to the workers running 10.8.5. Every task I ran was executed by a worker running 10.0.5: https://tools.taskcluster.net/groups/B1-aOxfuQwuPn0lJR_NxBA Can we move the workers running 10.8.5 to a different worker type name? Also, can you check the up time on a worker that exhibits this issue, and see if it aligns with the tasks that have been running on it? I can see from the logs that a reboot is issued, but what I can't be sure of is whether it was successful or not, only that it was issued. I wonder if something prevents the reboot (maybe there is a dialogue box open asking for user intervention) and then the reboot never takes place. The reason this is my suspicion is because if there is a reboot, the generic-worker is running as Local System, it should have no problem cleaning out directories when there can be no open handles on files in those directories.
Status: REOPENED → RESOLVED
Closed: 6 years ago → 6 years ago
Resolution: --- → INVALID
Reporter | ||
Comment 12•6 years ago
|
||
(In reply to Pete Moore [:pmoore][:pete] from comment #11) > My suspicion is the reboot isn't working between tasks and/or the tasks are > running as current user. papertrail logs show that the machines are rebooting: https://papertrailapp.com/systems/1731044491/events?q=program%3Auser32 the same logs show that runTasksAsCurrentUser is set to false: https://papertrailapp.com/systems/1731044491/events?q=runTasksAsCurrentUser > I can see from the logs that a reboot is issued, but what I can't be sure of > is whether it was successful or not, only that it was issued. I wonder if > something prevents the reboot (maybe there is a dialogue box open asking for > user intervention) and then the reboot never takes place. looking at logs subsequent to the reboot logs shows system state is consistent with a reboot having taken place. eg: occ runs and the sequence of logs is consistent with what we would expect from a rebooted machine. > The reason this is my suspicion is because if there is a reboot, the > generic-worker is running as Local System, it should have no problem > cleaning out directories when there can be no open handles on files in those > directories. agree. however the logs suggest that a reboot is taking place. reopening because the issue of generic worker failing to delete task directories needs to be fixed and the explanation of failure to reboot is not consistent with what is shown in the logs.
Status: RESOLVED → REOPENED
Resolution: INVALID → ---
Reporter | ||
Comment 13•6 years ago
|
||
i neglected to respond to a couple of points. (In reply to Pete Moore [:pmoore][:pete] from comment #11) > Can we move the workers running 10.8.5 to a different worker type name? as i understand it (:markco will know for sure), these workers use a unique worker type name of gecko-t-win10-64-hw-GW10 > Also, can you check the up time on a worker that exhibits this issue, and > see if it aligns with the tasks that have been running on it? one way to check uptime since a given point (like a failure to delete a task directory) in the logs is to compare the time of the log entry in question with the preceeding log entry that has a program attribute of "user32" since the only user32 entries that are forwarded to papertrail are those which contain shutdown or restart commands. there is of course some arbitrary time between those entries and the system restart and subsequent log entries.
Assignee | ||
Comment 14•6 years ago
|
||
(In reply to [github robot] from comment #9) > Commits pushed to master at https://github.com/taskcluster/generic-worker > > https://github.com/taskcluster/generic-worker/commit/ > f310b8e671db552497652969f956e22554237eb4 > Bug 1433854 - delete old task user profile directories (not just task > directories) > > https://github.com/taskcluster/generic-worker/commit/ > 01c891b81c258d4bc7e737cd31389e894ebca15d > Merge pull request #94 from taskcluster/bug1433854 > > Bug 1433854 - delete old task user profile directories Released in https://github.com/taskcluster/generic-worker/releases/tag/v10.8.4 ... I will respond to the open points separately.
Reporter | ||
Comment 15•6 years ago
|
||
for debugging, here's some links to the delete failure logs. - on hardware instances: https://papertrailapp.com/groups/1958653/events?q=program%3Ageneric-worker%20%22WARNING%3A%20could%20not%20delete%20directory%22 - on ec2 instances: https://papertrailapp.com/groups/2488493/events?q=program%3Ageneric-worker%20%22WARNING%3A%20could%20not%20delete%20directory%22 the failures occur on all windows worker types (hardware, ec2, testers, builders)
Assignee | ||
Comment 16•6 years ago
|
||
Thanks Rob. I have a hunch the issue might be that we need to apply ACLs to allow file deletion before actually deleting - see bug 1443589 comment 84. I'll see if I can reproduce and make a fix and add a CI test for it...
Reporter | ||
Comment 17•6 years ago
|
||
this pr in its current form is intended as a temporary workaround. it will hopefully help to remove some complexity from the generic worker wrapper script by removing the need to perform disk cleanup tasks there. when this bug is fixed in gw, it should not be necessary to clean up task folders with the MaintainSystem scheduled task, but it may be a good place for other system maintenance which needs to occur with regularity.
Attachment #9009918 -
Flags: review?(mcornmesser)
Attachment #9009918 -
Flags: feedback?(pmoore)
Updated•6 years ago
|
Attachment #9009918 -
Flags: review?(mcornmesser) → review+
Reporter | ||
Comment 18•6 years ago
|
||
the workaround patch is now merged and live. you can see it in action here: - ec2 systems: https://papertrailapp.com/groups/2488493/events?q=program%3AMaintainSystem - hardware: https://papertrailapp.com/groups/1958653/events?q=program%3AMaintainSystem
Assignee | ||
Updated•6 years ago
|
Attachment #9009918 -
Flags: feedback?(pmoore) → feedback+
Updated•5 years ago
|
Component: Generic-Worker → Workers
Assignee | ||
Comment 19•5 years ago
|
||
I think comment 16 of this bug is the path forward.
I'm going on PTO for a week at the end of today, so I'll look at this when I'm back in week starting 25 Feb 2019...
Assignee | ||
Updated•5 years ago
|
Whiteboard: good-first-bug
Assignee | ||
Comment 20•5 years ago
|
||
This still needs working on, it is a genuine issue, but priority on other things is higher at the moment as we have a workaround in place in OCC, so unassigning myself...
Assignee: pmoore → nobody
Assignee | ||
Updated•5 years ago
|
Summary: clean up after tasks on windows test workers → generic-worker: remove files and directories more forcefully on windows
Assignee | ||
Updated•5 years ago
|
Mentor: pmoore
Updated•5 years ago
|
Keywords: good-first-bug
Assignee | ||
Updated•5 years ago
|
Whiteboard: good-first-bug
Assignee | ||
Updated•4 years ago
|
Priority: -- → P2
Assignee | ||
Comment 21•4 years ago
|
||
Assignee: nobody → pmoore
Status: REOPENED → ASSIGNED
Assignee | ||
Comment 22•4 years ago
|
||
Released in taskcluster 25.4.0.
Status: ASSIGNED → RESOLVED
Closed: 6 years ago → 4 years ago
Resolution: --- → FIXED
You need to log in
before you can comment on or make changes to this bug.
Description
•