Status

defect
P2
normal
RESOLVED FIXED
9 years ago
6 years ago

People

(Reporter: nthomas, Assigned: coop)

Tracking

Firefox Tracking Flags

(Not tracked)

Details

(Whiteboard: [buildslaves][buildduty])

Attachments

(1 attachment, 1 obsolete attachment)

(Reporter)

Description

9 years ago
On windows we fill up 
 C:\Documents and Settings\cltbld\Local Settings\Temp
with files from tests and builds. Other platforms have similar problems, but linux at least culls objects older than 30 days.

We should periodically clean up $TEMP; and possibly provide a per-job directory under build/ that we set TEMP to, and which we empty at build start time.
(Reporter)

Updated

9 years ago
Summary: Clean up TMP automatically → Clean up TEMP/TMP automatically
(Reporter)

Updated

9 years ago
Duplicate of this bug: 561587
(Reporter)

Comment 2

9 years ago
I think we have to do this sooner, at least for win32 boxes. I've cleaned up several machines in the last few days as a result of nagios alerts like
moz2-win32-slave28.build:disk - C is WARNING: WARNING: C:: Total: 9.75G - Used: 9.44G (96%) - Free: 316M (4%) warning
Priority: P4 → P3
This issue was the cause of some Windows Disk-Space issues on the SeaMonkey
machines. I think a bit of back and forth with nthomas is what helped me
discover this a few weeks ago [just didn't get around to filling a releng bug
on it, sorry]
Whiteboard: [buildslaves] → [buildslaves][buildduty]
(Reporter)

Comment 4

9 years ago
I've been continuing to clean up win32-slaveNN.b.m.o boxes as nagios reports them and decided to just sweep through that whole pool to empty the temp dir. Presumably this will affect try-win32-slaveNN eventually, but haven't seen any reports of that yet. 

I still think we should fix the general problem here - not cleaning up the tmpdir on all platforms. This would happen automatically if we pointed TEMP to the %(builddir)s/temp when running tests.
Blocks: 617920
(In reply to comment #4)
> This would happen automatically if we pointed TEMP to
> the %(builddir)s/temp when running tests.

Right, though tests should better clean up after themselves.
The usual bug/fix is for them to use the (temporary) profile directory (now/when there is one) instead of the temp directory.

Yet, I agree that using a temporary temp dir should help a lot, from a system point of view.
(Assignee)

Updated

9 years ago
Assignee: nobody → dustin
(Assignee)

Comment 6

9 years ago
I have a patch in bug 603238 that takes care of this on Linux/Mac.

Comment 7

9 years ago
Unfortunately, that doesn't help the most common case me and Callek are doing manually right now in a double manner - it doesn't help win32, and it doesn't help SeaMonkey, where no puppet infrastructure is available right now. :(

But still, it's good to hear that some way of progress is being made on this. The win32 problems are being left on Firefox as well though, I guess. I wonder if there's any reason we can't just care to always do a rm -rf on the temp dir at the beginning of any build cycle, and do so from buildbot (right before/after we ensure free space on the builds partition).
(Assignee)

Comment 8

9 years ago
(In reply to comment #7)
> Unfortunately, that doesn't help the most common case me and Callek are doing
> manually right now in a double manner - it doesn't help win32, and it doesn't
> help SeaMonkey, where no puppet infrastructure is available right now. :(

Note that I haven't closed that other bug. There's still work to do for win32. I'm working on an OPSI patch now.
 
> But still, it's good to hear that some way of progress is being made on this.
> The win32 problems are being left on Firefox as well though, I guess. I wonder
> if there's any reason we can't just care to always do a rm -rf on the temp dir
> at the beginning of any build cycle, and do so from buildbot (right
> before/after we ensure free space on the builds partition).

Nothing's stopping you from doing so. Serge asserts in comment #5 that it's probably better to explicitly cleanup anyway rather than waiting for a reboot.
(Assignee)

Comment 9

8 years ago
(In reply to comment #6)
> I have a patch in bug 603238 that takes care of this on Linux/Mac.

I had to comment the /tmp cleanup part of this patch out. The tidy command doesn't have any way to ignore files of a given type (e.g. sockets) so it was hitting the ssh socket leftovers and "failing," causing it to loop forever.

We can probably work around this with a File/Exec block that simply does an 'rm.'
(Assignee)

Comment 10

8 years ago
Since I promised and then failed-to-deliver in comment #6, here's an updated patch that works in staging *AND* allows buildbot to startup afterwards. I hear that's important.

Tested on both Mac and Linux slaves in staging.
Attachment #504559 - Flags: review?(dustin)
(Assignee)

Comment 11

8 years ago
I'm doing this work over in bug 603238.
Assignee: dustin → coop
Status: NEW → ASSIGNED
Priority: P3 → P2
Comment on attachment 504559 [details] [diff] [review]
Do a 'rm -rf' to remove files in /tmp

The first hunk looks fine.  Were the other two hunks collateral damage from another patch?  r+ to the first hunk..
Attachment #504559 - Flags: review?(dustin) → review+
Duplicate of this bug: 624154
(Assignee)

Comment 14

8 years ago
Comment on attachment 504559 [details] [diff] [review]
Do a 'rm -rf' to remove files in /tmp

(In reply to comment #12)
> The first hunk looks fine.  Were the other two hunks collateral damage from
> another patch?  r+ to the first hunk..

Gah. Cruft removed.

https://hg.mozilla.org/build/puppet-manifests/rev/37772e58dcb2

Puppet masters have been restarted with this change.
Attachment #504559 - Flags: checked-in+
Rolled back in 37772e58dcb2.  Restarting the puppet masters now.  Details to follow.
Details as promised:

-- technical --
puppet runs *simultaneously* with X startup, so the rm -rf killed important parts of the X desktop, perhaps one of

drwx------  2 cltbld cltbld 4096 2011-01-20 12:21 /tmp/keyring-MlOkJt
drwx------  2 cltbld cltbld 4096 2011-01-20 12:24 /tmp/orbit-cltbld
drwx------  2 cltbld cltbld 4096 2011-01-20 12:21 /tmp/pulse-h6ThzQSI7UwJ
drwx------  2 cltbld cltbld 4096 2011-01-20 12:21 /tmp/virtual-cltbld.kIpBkk
drwx------  2 cltbld cltbld 4096 2011-01-20 12:06 /tmp/virtual-cltbld.vHzmGF

which caused the desktop environment to misbehave, leading to oranges.

-- timeline --

The change was landed by 9:26.  rev e080aa16fae3 (pushed at 8:15) was just starting up its tests at that point, and began burning.  ehsan pinged me at 9:58:

09:58 < ehsan> dies anybody know what's up with the linux64 machines?
09:58 < ehsan> dustin|buildduty: ^

and we began looking for the problem from there.  Ehsan closed the tree at 10:10.

We re-built 26710d7a4b45 at 11:35 to see if the build itself was poisoned, or if the tests were failing for other reasons.  Most of the tests from this build (those that ran before the fix was rolled out) failed, as well.

We rolled back the puppet change at 11:59, and immediately began retriggering the failing tests for e080aa16fae3, 7812a04bcaa9, c7d11beadb75, and 26a942ad805b.  As of this writing the few that have completed are green.

Dbaron re-pushed to start some new tests, too.

The tree re-opened at 12:42.
In the particular, the oranges were:

A fatal cairo assertion on shutdown of Linux (32 and 64 bit) debug builds for mochitest/reftest:
firefox-bin: cairo-hash.c:199: _cairo_hash_table_destroy: Assertion `hash_table->live_entries == 0' failed.
I think this is related to font stuff.

A failure of an xpcshell test testing the GConf service and GNOME shell service on all Linux builds (debug and opt, 32 and 64 bit):
TEST-UNEXPECTED-FAIL | /home/cltbld/talos-slave/test/build/xpcshell/tests/browser/components/shell/test/unit/test_421977.js | test failed (with xpcshell return code: 0), see following log:
TEST-UNEXPECTED-FAIL | (xpcshell/head.js) | [Exception... "Component returned failure code: 0x80004005 (NS_ERROR_FAILURE) [nsIGConfService.getString]" nsresult: "0x80004005 (NS_ERROR_FAILURE)" location: "JS frame :: /home/cltbld/talos-slave/test/build/xpcshell/tests/browser/components/shell/test/unit/test_421977.js :: run_test :: line 78" data: no]

A failure of an xpcshell test testing external protocol handlers on all Linux builds:
TEST-UNEXPECTED-FAIL | /home/cltbld/talos-slave/test/build/xpcshell/testsTEST-UNEXPECTED-FAIL | /home/cltbld/talos-slave/test/build/xpcshell/tests/uriloader/exthandler/tests/unit/test_handlerService.js | test failed (with xpcshell return code: 0), see following log:
TEST-UNEXPECTED-FAIL | /home/cltbld/talos-slave/test/build/xpcshell/tests/uriloader/exthandler/tests/unit/test_handlerService.js | true == false - See following stack:

Both failed xpcshell tests probably involve some GConf or GNOME interaction.
(Assignee)

Comment 18

8 years ago
We can set the threshold to whatever we want, but 15 minutes seems about right to me to make sure we avoid files created on start-up but remove the rest.

Tested on Mac and Linux slaves in staging.
Attachment #504559 - Attachment is obsolete: true
Attachment #508574 - Flags: review?(dustin)
Comment on attachment 508574 [details] [diff] [review]
Delete files from /tmp, but only those older than 15 minutes.

Why tmp/* on Darwin and tmp/ on linux?  Assuming that's intentional due to platform differences, this looks fine.
Attachment #508574 - Flags: review?(dustin) → review+
(Assignee)

Comment 20

8 years ago
(In reply to comment #19)
> Why tmp/* on Darwin and tmp/ on linux?  Assuming that's intentional due to
> platform differences, this looks fine.

Yes, this was intentional. The find command on Darwin includes the actual /tmp dir in the returned list unless you add the trailing *.
(In reply to comment #20)
> (In reply to comment #19)
> > Why tmp/* on Darwin and tmp/ on linux?  Assuming that's intentional due to
> > platform differences, this looks fine.
> 
> Yes, this was intentional. The find command on Darwin includes the actual /tmp
> dir in the returned list unless you add the trailing *.

Wouldn't it make more sense then to have both systems use tmp/* or does that return different results on linux itself?
(Assignee)

Comment 22

8 years ago
(In reply to comment #21)
> Wouldn't it make more sense then to have both systems use tmp/* or does that
> return different results on linux itself?

No, the results are the same. I'll land with that change, thanks.
(Assignee)

Comment 23

8 years ago
Comment on attachment 508574 [details] [diff] [review]
Delete files from /tmp, but only those older than 15 minutes.

https://hg.mozilla.org/build/puppet-manifests/rev/794d10a1ce88
Attachment #508574 - Flags: checked-in+
(Assignee)

Comment 24

8 years ago
Puppet masters have all been restarted with this fix.
Status: ASSIGNED → RESOLVED
Last Resolved: 8 years ago
Resolution: --- → FIXED
Product: mozilla.org → Release Engineering
You need to log in before you can comment on or make changes to this bug.