Closed Bug 862355 Opened 6 years ago Closed 6 years ago

Clean up the tmpdir on Windows build slaves on reboot

Categories

(Release Engineering :: General, defect, P2, major)

x86
Windows 7

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: benjamin, Assigned: coop)

References

Details

(Keywords: sheriffing-P1)

Attachments

(1 file)

Windows machines don't automatically clean out tmpdir on reboot, so we've discovered some machines with very large tmpdirs causing interesting test failures (in bug 852429). Please arrange the windows build slaves so that they delete everything in tmpdir on reboot.
found in triage.
Component: Release Engineering → Release Engineering: Machine Management
QA Contact: armenzg
Component: Release Engineering: Machine Management → Release Engineering: Automation (General)
QA Contact: armenzg → catlee
Whiteboard: [mozharness]
Keywords: sheriffing-P1
Is anyone working on this?
Blocks: 861176
Though it's not obvious from the terrible reporting we have for make check, since a green run passes 54802 tests and one that hits this problem passes 50607 tests, we're apparently failing to even run around 8% of make check most of the time on Windows.

If nobody is ever going to do anything about this, please say so, so we can work on some other approach to dealing with it.
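
For reference, the "around 8%" figure follows from the two test counts quoted above. A minimal sketch of the arithmetic (the variable names are only illustrative):

```python
# Test counts quoted in the comment above.
green_run = 54802      # tests passed on a green make check run
affected_run = 50607   # tests passed on a run that hits this problem

missing = green_run - affected_run
fraction = missing / green_run

print(f"{missing} tests not run ({fraction:.1%} of a green run)")
```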
Severity: normal → major
Armen, what options do we have here? AFAICT the issue is the w64 compile slaves, so can we use an on-boot scheduled task? Or perhaps it's better in runslave.py, or count_and_reboot.py?
Flags: needinfo?(armenzg)
Also, why is this tagged [mozharness]?
Summary: Clean up the tmpdir on Windows machines on reboot → Clean up the tmpdir on Windows build slaves on reboot
Whiteboard: [mozharness]
The good thing about the Win64 machines is that we can access them through SSH as Administrator.
This means that we can add pretty much anything to them.
An on-boot task should do the trick. Modifying runslave.py is also an option.

What specific directories are we referring to?
C:\Windows\Temp?
Flags: needinfo?(armenzg)
Doesn't look like it. It's the XPCOM TmpD, which gets set from the Windows function ::GetTempPathW, which sets it from the env var TMP, or the env var TEMP, or the env var USERPROFILE, or (sweet!) the Windows dir.

Since the make check buildstep's env dump says that both TMP and TEMP are C:/Users/cltbld/AppData/Local/Temp, my expectation is that if you looked at that on an affected buildslave, like w64-ix-slave08 or any of the others listed in bug 861176, you would find that it contains whatever the maximum number of files/subdirectories Windows allows actually is (and while you are there, could you clear it out?).
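
The lookup order described above (TMP, then TEMP, then USERPROFILE, then the Windows directory) can be sketched roughly as follows. This only mirrors the order given in this comment and does not call the Windows API; the function name and the default Windows directory are assumptions made for illustration.

```python
import os


def resolve_temp_path(env, windows_dir=r"C:\Windows"):
    """Return the first candidate that is set, in the order described above.

    `env` is any mapping of environment variables; `windows_dir` is a
    hypothetical stand-in for the Windows directory fallback.
    """
    for name in ("TMP", "TEMP", "USERPROFILE"):
        value = env.get(name)
        if value:
            return value
    return windows_dir


if __name__ == "__main__":
    # On a build slave where TMP and TEMP are both set, TMP wins.
    print(resolve_temp_path(os.environ))
```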
On w64-ix-slave08, Windows reports a total of 109684 folders and 61899 files in C:/Users/cltbld/AppData/Local/Temp. Breaking that down a bit at the top-level:
 10000 cpp-unit-profd or cpp-unit-profd-nnnn  (up to 9999, suspicious!)
  7292 tmp-<6randchars>
  3930 ssh-<10randchars>
Some of those date back to 2011; I have removed them now.
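
A breakdown like the one above could be produced with something along these lines. It is a minimal sketch: the prefix list is taken from the names seen in this listing, and the function name is hypothetical.

```python
from collections import Counter

# Prefixes observed in the listing above; anything else is grouped as "other".
PREFIXES = ("cpp-unit-profd", "tmp-", "ssh-")


def count_by_prefix(names):
    """Count top-level entry names by their known prefix."""
    counts = Counter()
    for name in names:
        key = next((p for p in PREFIXES if name.startswith(p)), "other")
        counts[key] += 1
    return counts
```

On a slave this would be called with the directory listing, for example `count_by_prefix(os.listdir("C:/Users/cltbld/AppData/Local/Temp")).most_common()` after `import os`.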

philor, do you know of any similar issues on test slaves? It'd affect where we put the fix.
I don't know of any active tmpdir problems on test slaves, but then I've never seen any part of them other than the crap we dump on the desktop. TmpD is XPCOM, so you can most certainly get it and drop stuff in it just as easily from any other sort of test as from a cppunittest, so even if we haven't already set ourselves up for a fail like only being able to run tests 10,000 times (or possibly 10,000 divided by the number of tests that create a profile times), we probably will in the future.

If it doesn't overcomplicate the fix, I'd say fixing it for either all Windows slaves, or all slaves (are we not seeing this problem on not-Windows because the cppunittest harness tries to delete the profile but, just like all deleting of things on Windows, it fails every time, or are we not seeing it because we already clean up the tmpdir on not-Windows?) would be better.
And see also bug 870638 about wanting to have crash reports cleaned up on test slaves, which is a dupe of another one about wanting them cleaned up because... Firefox Health Reporter, maybe, was getting baffled by the way its tests had to deal with the surprise of seeing tens of thousands of crashes having happened.
Armen tells me the right place to do this is in this file:

http://hg.mozilla.org/build/puppet-manifests/file/22b8f942937e/modules/buildslave/files/buildbot-win64.bat

So we want to clean out
C:/Users/cltbld/AppData/Local/Temp
C:/Users/cltbld/Desktop

anything else?
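
As a rough illustration of what "clean out" means here, the following sketch empties a directory while keeping the directory itself. It is written in Python only for illustration and is not the change made to buildbot-win64.bat; the function name is hypothetical, and entries that cannot be deleted (for example files still in use) are skipped and reported rather than treated as fatal.

```python
import os
import shutil


def empty_directory(path):
    """Delete everything inside `path` but keep `path` itself.

    Returns the names of entries that could not be removed.
    """
    failed = []
    with os.scandir(path) as it:
        entries = list(it)  # snapshot first, then delete
    for entry in entries:
        try:
            if entry.is_dir(follow_symlinks=False):
                shutil.rmtree(entry.path)
            else:
                os.unlink(entry.path)
        except OSError:
            failed.append(entry.name)
    return failed
```

Each directory listed above would be passed in turn, for example `empty_directory("C:/Users/cltbld/AppData/Local/Temp")`.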
Alas, we didn't mean only Windows build slaves, we just managed to fill them up faster - bld-lion-r5-042 apparently has 10000 cpp-unit-profd-NNNN directories, since it is now failing make check too.
Ah, apparently we did mean only Windows, it's just that when randomfoothing happens and we can't create a profile directory on a Mac, it has the same symptom as not being able to create cpp-unit-profd-1234 because there's already a 1234.
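
The "already a 1234" symptom can be sketched like this, assuming the profile directory is chosen as the first free name among cpp-unit-profd, cpp-unit-profd-1, ... cpp-unit-profd-9999. That scheme is inferred from the directory listing earlier in this bug, not taken from the harness source; the function name and the `max_suffix` parameter are hypothetical.

```python
def first_free_name(base, existing, max_suffix=9999):
    """Return the first unused name, or None once every candidate is taken."""
    if base not in existing:
        return base
    for n in range(1, max_suffix + 1):
        candidate = f"{base}-{n}"
        if candidate not in existing:
            return candidate
    return None  # all 1 + max_suffix names exist: creation fails


# A tmpdir that was never cleaned: the bare name plus suffixes 1..9999.
leftover = {"cpp-unit-profd"} | {f"cpp-unit-profd-{n}" for n in range(1, 10000)}
print(len(leftover), first_free_name("cpp-unit-profd", leftover))
```

Under that assumption, a slave that never cleans its tmpdir runs out of names once 10000 profile directories have been left behind.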
Assignee: nobody → coop
Status: NEW → ASSIGNED
Priority: -- → P2
This seemed to do the trick on the staging slave I tried this on.

I'll start cobbling together an install script.
Attachment #750006 - Flags: review?(armenzg)
Comment on attachment 750006 [details] [diff] [review]
Remove temp files on Windows

Review of attachment 750006 [details] [diff] [review]:
-----------------------------------------------------------------

Why do we clobber the desktop?

FYI this would not work on test machines since on the Desktop we have startTalos.bat
Attachment #750006 - Flags: review?(armenzg) → review+
(In reply to Armen Zambrano G. [:armenzg] (Release Enginerring) from comment #15) 
> FYI this would not work on test machines since on the Desktop we have
> startTalos.bat

Right, but this bug is specifically about the build slaves.

For hygiene reasons, I think we should also deploy a similar change to the test slaves, but that can be a follow-up.
This has been deployed to all build slaves now, modulo w64-ix-slave23 that needs a re-image in bug 873140.
Appears to have done the trick, thanks!
Status: ASSIGNED → RESOLVED
Closed: 6 years ago
Resolution: --- → FIXED
Product: mozilla.org → Release Engineering
Component: General Automation → General