Closed Bug 880003 Opened 11 years ago Closed 9 years ago

Cleanup temporary files on boot

Categories

(Infrastructure & Operations Graveyard :: CIDuty, task, P3)

x86
macOS

Tracking

(Not tracked)

RESOLVED WORKSFORME

People

(Reporter: armenzg, Unassigned)

Attachments

(1 file)

Summary: Determine why the Mac builders are loosing their ability to create in TMPDIR → Determine why the Mac builders are losing their ability to create in TMPDIR
Hm, it's not clear -- should this be blocking the above builders' bugs, or should we set up those builders to see if it still happens?
My feeling is that we should:

* disable the affected builders (the Try ones in particular are all still enabled and failing job after job)

* pick one to give to the hypothetical investigator of this bug

* fix the rest
So, $TMPDIR is /var/folders/30/yq_p3wk15yb9wdsv6sm_m0v00000gn/T/ on bld-lion-r5-031, from the last build.  Part of that is dynamically generated, but a find on /var/folders doesn't show any huge number of files.

A find for cpp-unit-profd shows, along with a lot of "permission denied" errors:
/Users/cltbld/.Trash/Recovered files/cpp-unit-profd
/Users/cltbld/.Trash/Recovered files #1/cpp-unit-profd
/Users/cltbld/Library/Caches/TemporaryItems/cpp-unit-profd

and under that TemporaryItems directory, I also see ./cpp-unit-profd-{1..9999}.  So philor nailed it in bug 852429 comment 627:

> The something else being that the unique naming only planned on needing four
> digits of unique, and that wasn't enough. Hard to believe we've crashed
> 10,000 times on all of these broken Windows slaves, so I suspect (as is the
> case most of the time when we try to delete files on Windows) that the
> harness doesn't actually succeed at deleting the profile directory at all on
> Windows.
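
To illustrate the failure mode described in that quote, here is a minimal sketch of a unique-name picker that only allows a bounded numeric suffix. The function name make_unique_dir and its arguments are hypothetical, and this is not the actual TestHarness.h logic (which is not shown in this bug); it only shows why leftover cpp-unit-profd-{1..9999} directories leave no name to hand out.

```sh
#!/bin/sh
# Hypothetical helper (not the real harness code): create the first unused
# "<prefix>-N" directory under a parent, giving up once N exceeds a limit.
# A four-digit suffix corresponds to a limit of 9999.
make_unique_dir() {
  parent=$1
  prefix=$2
  limit=${3:-9999}
  n=1
  while [ "$n" -le "$limit" ]; do
    candidate="$parent/$prefix-$n"
    # mkdir without -p fails when the name already exists, so a successful
    # mkdir means this name was free and is now ours.
    if mkdir "$candidate" 2>/dev/null; then
      printf '%s\n' "$candidate"
      return 0
    fi
    n=$((n + 1))
  done
  echo "no free name left for $prefix under $parent" >&2
  return 1
}
```

If nothing ever removes the old directories, every call uses up one more suffix, and the call after the limit is reached fails, which would be consistent with the "Couldn't get the profile directory" failures.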

So the fix is to clean that up on startup - either in the builds, in puppet, or in the preflight scripts catlee's working on.

I'm going to reimage all of the dependent machines for bug 760093 anyway.
Summary: Determine why the Mac builders are losing their ability to create in TMPDIR → Cleanup temporary files on boot
Removing the dependencies here. I think it's pretty safe to say that *all* lion slaves are at risk of hitting this bug until it's resolved.
Component: Release Engineering: Machine Management → Release Engineering: Platform Support
QA Contact: armenzg → coop
Product: mozilla.org → Release Engineering
Blocks: 874642
Please can we bump the priority on this? I've unfortunately just had to disable 9 lion try build pool slaves, leaving us slightly short :-s
It seems that ted might be fixing this with the bug he's going to be filing.

10:46 ted: edmorley: TestHarness.h is old :-/
10:47 edmorley: ted: could we disable whatever is making these directories and not cleaning them up, until we fix this (or move cpp unit tests out of make check, if that will also fix it)
10:47 ted: edmorley: we can probably just fix TestHarness.h to use the cwd for the profile dir now
10:47 ted: since runcppunittests.py runs them in a temp dir
10:48 edmorley: ted: ah
10:48 ted: http://mxr.mozilla.org/mozilla-central/source/testing/runcppunittests.py#41
10:48 ted: should be a really small fix
10:50 edmorley: ted: could I leave you to file a bug blocking bug 874642 with a quick explanation? (I've got to start getting organised for my doctor's appointment shortly)
10:50 bugbot: Bug https://bugzilla.mozilla.org/show_bug.cgi?id=874642 normal, --, ---, nobody, NEW , Intermittent TEST-UNEXPECTED-FAIL | TestSTSParser | test failed with return code 65280 | Couldn't get the profile directory.
10:50 armenzg_brb is now known as armenzg_buildduty
10:50 ted: edmorley: yeah
10:50 ted: can probably patch it too
10:50 edmorley: ty
Assignee: nobody → ted
I'm planning on fixing one of the causes of this, but I don't think it'll fix it in general. This is probably still a useful thing to do.
Assignee: ted → nobody
Is this all that is needed to fix the machines? (Using -* complained of too many to expand, I think.)

rm -rf /Users/cltbld/Library/Caches/TemporaryItems/cpp-unit-profd-1*
rm -rf /Users/cltbld/Library/Caches/TemporaryItems/cpp-unit-profd-2*
rm -rf /Users/cltbld/Library/Caches/TemporaryItems/cpp-unit-profd-3*
rm -rf /Users/cltbld/Library/Caches/TemporaryItems/cpp-unit-profd-4*
rm -rf /Users/cltbld/Library/Caches/TemporaryItems/cpp-unit-profd-5*
rm -rf /Users/cltbld/Library/Caches/TemporaryItems/cpp-unit-profd-6*
rm -rf /Users/cltbld/Library/Caches/TemporaryItems/cpp-unit-profd-7*
rm -rf /Users/cltbld/Library/Caches/TemporaryItems/cpp-unit-profd-8*
rm -rf /Users/cltbld/Library/Caches/TemporaryItems/cpp-unit-profd-9*
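
A possible single-command alternative to the nine rm lines above, sketched under the assumption that the problem with -* was the shell expanding the glob into too many arguments. Letting find match the names avoids that expansion, and "-exec ... {} +" batches the matches into as many rm invocations as needed. The function name cleanup_profile_dirs is hypothetical; -mindepth and -maxdepth are extensions available in both BSD and GNU find.

```sh
#!/bin/sh
# Hypothetical helper: remove cpp-unit-profd-<digit>... entries directly under
# the given directory (default: the TemporaryItems path from this bug).
# The pattern is quoted so that find, not the shell, does the matching.
cleanup_profile_dirs() {
  dir=${1:-/Users/cltbld/Library/Caches/TemporaryItems}
  # Nothing to do if the directory does not exist on this machine.
  [ -d "$dir" ] || return 0
  find "$dir" -mindepth 1 -maxdepth 1 -name 'cpp-unit-profd-[1-9]*' \
    -exec rm -rf {} +
}
```

The pattern cpp-unit-profd-[1-9]* covers the same names as the nine commands; the bare cpp-unit-profd entry is left alone, as it was by those commands.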
Assignee: nobody → armenzg
Depends on: 906706
It seems that our current list of hosts to fix is:

bld-lion-r5-023
bld-lion-r5-024
bld-lion-r5-029
bld-lion-r5-030
bld-lion-r5-031
bld-lion-r5-033
bld-lion-r5-036
bld-lion-r5-037
bld-lion-r5-038
bld-lion-r5-069
The two machines that I tried this on took jobs appropriately.

I cleaned up and rebooted all remaining ones.
Status: NEW → RESOLVED
Closed: 11 years ago
Resolution: --- → FIXED
Thank you for sorting out those machines.

Reopening this bug, since it's for the automated ongoing cleanup (e.g. via puppet/...; comment 3) vs the one-off cleanup needed to get those particular machines back in service :-)
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
edmorley, you want me to keep this one open until bug 906706 is completed?

I thought that having bug 906706 was good enough to close this one.
No longer blocks: bld-lion-r5-069
I've cleaned up *all* hosts.

It seems that Puppet doesn't make the cut.

We'll have to deal with it with pre-flight tasks.
Assignee: armenzg → nobody
Depends on: 712206
Assignee: nobody → sbruno
Priority: -- → P3
No longer blocks: bld-lion-r5-031
See Also: → bld-lion-r5-031
This is a temporary fix to stop the bleeding on Mac builders until catlee has runner ready for pre-flight tasks.
Attachment #8420139 - Flags: review?(bugspam.Callek)
Comment on attachment 8420139 [details] [diff] [review]
Cleanup temp files on Darwin before starting buildbot

Review of attachment 8420139 [details] [diff] [review]:
-----------------------------------------------------------------

::: modules/buildslave/files/darwin-run-buildslave.sh
@@ +7,5 @@
>  rm -rf /var/tmp/semaphore/run-buildbot
>  
> +# Pre-flight cleanup
> +if [ "${TMPDIR}" != "" ]; then
> +  rm -rf ${TMPDIR}/*

I'm pretty sure this might kill off any ssh-agent type stuff, and possibly some plist startup things, I forget.

Can we do this with a find ... -delete say with an age > 2 hours?
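
One way the age-based variant suggested here could look, as a sketch only: the exact find invocation was left elided in the comment above, so the function name cleanup_old_tmp, the top-level-only scope, and the use of rm -rf instead of -delete are all assumptions. It removes entries directly under the directory whose own modification time is more than N minutes old (120 by default). -mmin is a BSD/GNU find extension, not POSIX.

```sh
#!/bin/sh
# Hypothetical helper: remove top-level entries of a directory that were last
# modified more than the given number of minutes ago (default 120).
cleanup_old_tmp() {
  dir=$1
  minutes=${2:-120}
  # Refuse to run with an empty or missing directory, so an unset TMPDIR
  # can never turn this into a cleanup of some other location.
  [ -n "$dir" ] || return 0
  [ -d "$dir" ] || return 0
  # Only the entry's own mtime is checked. A directory's mtime changes when
  # entries are added or removed directly inside it, not when files deeper
  # down are rewritten.
  find "$dir" -mindepth 1 -maxdepth 1 -mmin +"$minutes" -exec rm -rf {} +
}
```

It would be called as cleanup_old_tmp "${TMPDIR}" before starting buildbot. Quoting the variable also keeps a TMPDIR containing spaces from being split into several paths.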
Attachment #8420139 - Flags: review?(bugspam.Callek) → review+
(In reply to Justin Wood (:Callek) from comment #15)
> Comment on attachment 8420139 [details] [diff] [review]
> Cleanup temp files on Darwin before starting buildbot
> 
> Review of attachment 8420139 [details] [diff] [review]:
> -----------------------------------------------------------------
> 
> ::: modules/buildslave/files/darwin-run-buildslave.sh
> @@ +7,5 @@
> >  rm -rf /var/tmp/semaphore/run-buildbot
> >  
> > +# Pre-flight cleanup
> > +if [ "${TMPDIR}" != "" ]; then
> > +  rm -rf ${TMPDIR}/*
> 
> I'm pretty sure this might kill off any ssh-agent type stuff, and possibly
> some plist startup things, I forget.
> 
> Can we do this with a find ... -delete say with an age > 2 hours?

None of that stuff lives in $TMPDIR, but rather in /private/tmp.
Comment on attachment 8420139 [details] [diff] [review]
Cleanup temp files on Darwin before starting buildbot

https://hg.mozilla.org/build/puppet/rev/c60a425448a7
Attachment #8420139 - Flags: checked-in+
Assignee: sbruno → nobody
No longer blocks: bld-lion-r5-030
No longer blocks: bld-lion-r5-033
No longer blocks: bld-lion-r5-023
Status: REOPENED → RESOLVED
Closed: 11 years ago → 9 years ago
Resolution: --- → WORKSFORME
Component: Platform Support → Buildduty
Product: Release Engineering → Infrastructure & Operations
Product: Infrastructure & Operations → Infrastructure & Operations Graveyard