Bug 558595 (Closed): Opened 14 years ago, Closed 14 years ago

tracemonkey 64-bit linux tinderboxes out of disk space

Categories: Release Engineering :: General
Type: defect
Hardware: x86_64 Linux
Importance: Priority not set, severity normal
Tracking: Not tracked
Status: RESOLVED WORKSFORME
People: Reporter: gal; Assignee: Unassigned
Whiteboard: [linux64]

It's a bummer that Andreas had to play meat Nagios here. Shouldn't disk provisioning work automatically? And, in the event that it fails, shouldn't it be reported automatically?
(In reply to comment #1)
> Shouldn't disk provisioning work automatically? 

What do you mean by provisioning (or automatically)?

> And, in the event that it fails, shouldn't it
> be reported automatically?

I think you're talking about Nagios?  If Nagios is monitoring disk usage, yes, it'll alert RelEng.
(In reply to comment #2)
> (In reply to comment #1)
> > Shouldn't disk provisioning work automatically? 
> 
> What do you mean by provisioning (or automatically)?

I don't see why any of these machines should ever run out of disk space... but they all seem to run at like 90% capacity for some reason, so of course they fail occasionally.

> 
> > And, in the event that it fails, shouldn't it
> > be reported automatically?
> 
> I think you're talking about Nagios?  If Nagios is monitoring disk usage, yes,
> it'll alert RelEng.

Why wouldn't Nagios monitor disk usage?
 
> I don't see why any of these machines should ever run out of disk space... but
> they all seem to run at like 90% capacity for some reason, so of course they
> fail occasionally.

I don't know the process - that's for RelEng.  Guessing they don't clean up. 

It has three drives: 9GB split between / & swap, 20GB split between /builds & /var, and a 30GB disk that I don't see mounted.

(RelEng - why aren't you using that 30GB disk?)
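
(Not RelEng tooling, just an aside: a rough standalone Python sketch of how one could spot an attached-but-unmounted disk from the slave itself, by comparing /proc/partitions against /proc/mounts. The disk-vs-partition heuristic is crude and purely illustrative.)

    # List block devices that have no mounted filesystem. Illustrative only.

    def mounted_devices():
        """Device paths currently backing a mounted filesystem."""
        with open("/proc/mounts") as f:
            return set(line.split()[0] for line in f if line.startswith("/dev/"))

    def whole_disks():
        """Whole-disk device names from /proc/partitions (e.g. sda, sdb)."""
        disks = []
        with open("/proc/partitions") as f:
            for line in list(f)[2:]:          # skip the header lines
                fields = line.split()
                if not fields:
                    continue
                name = fields[-1]
                if name[-1].isalpha():        # crude: 'sda' is a disk, 'sda1' a partition
                    disks.append(name)
        return disks

    if __name__ == "__main__":
        mounted = mounted_devices()
        for disk in whole_disks():
            used = any(dev.startswith("/dev/" + disk) for dev in mounted)
            print("%s: %s" % (disk, "mounted" if used else "no mounted partitions"))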

> > I think you're talking about Nagios?  If Nagios is monitoring disk usage, yes,
> > it'll alert RelEng.
> 
> Why wouldn't Nagios monitor disk usage?

The only reason it wouldn't is if it wasn't configured to do so.  In this case it is configured to do so, but oncall doesn't get paged or notified on these.  I believe RelEng does.
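
(For context, the check Nagios runs here amounts to comparing usage on each monitored mount point against warning/critical thresholds. Below is a minimal standalone Python sketch of that idea; the mount points and thresholds are made-up examples, not the actual monitoring configuration.)

    # Minimal sketch of a Nagios-style disk check: compare usage on a few
    # mount points against warn/critical thresholds and exit with the usual
    # status codes (0=OK, 1=WARNING, 2=CRITICAL). Paths and thresholds are
    # illustrative, not the real config.
    import os
    import sys

    MOUNTS = ["/", "/builds", "/var"]   # hypothetical mount points to watch
    WARN, CRIT = 80, 90                 # percent used

    def percent_used(path):
        st = os.statvfs(path)
        total = st.f_blocks * st.f_frsize
        free = st.f_bavail * st.f_frsize
        return 100.0 * (total - free) / total

    if __name__ == "__main__":
        status = 0
        for mount in MOUNTS:
            used = percent_used(mount)
            if used >= CRIT:
                status = max(status, 2)
            elif used >= WARN:
                status = max(status, 1)
            print("%s: %.0f%% used" % (mount, used))
        sys.exit(status)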

Anyways, all of these issues are RelEng issues so I'm punting this bug over to them.
Assignee: server-ops → nobody
Component: Server Operations: Tinderbox Maintenance → Release Engineering
QA Contact: mrz → release
(In reply to comment #4)
> Has three drives, 9GB split between / & swap, 20GB split between /builds & /var
> and 30GB disk that I don't see mounted.
> 
> (RelEng - why aren't you using that 30GB disk?)

Hm, where are you seeing this disk?
For moz2-linux64-slave12, in VI Edit Settings, I only see two disks.
I'd love to have a third disk lying there though -- then I could expand /builds without feeling guilty about taking even more space on the SAN :)

(In reply to comment #3)
> I don't see why any of these machines should ever run out of disk space... but
> they all seem to run at like 90% capacity for some reason, so of course they
> fail occasionally.

We keep builds around for:

a) debugging
b) faster depend builds

The way buildbot currently lays it out, there is a separate directory (separate checkout, separate objdir) per build type per branch.  So if there's a debug and opt build for 8 project/release branches, that's 16 directory trees.  If you include l10n and unit tests, there are even more directory trees.
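
To make the multiplication concrete, here's a small Python sketch of that kind of layout; the branch names and /builds paths are hypothetical examples, not the real buildbot configuration.

    # Sketch of how per-branch, per-build-type directories multiply on a slave.
    # Branch names and the /builds root are made-up examples.
    import itertools

    BRANCHES = ["mozilla-central", "tracemonkey", "branch-3", "branch-4",
                "branch-5", "branch-6", "branch-7", "branch-8"]    # 8 branches
    BUILD_TYPES = ["opt", "debug"]

    dirs = ["/builds/slave/%s-linux64-%s" % (branch, btype)
            for branch, btype in itertools.product(BRANCHES, BUILD_TYPES)]

    for d in dirs:
        print(d)
    print("%d separate checkout+objdir trees" % len(dirs))   # 8 branches * 2 types = 16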

We are working on smarter layouts as a longer-term fix... Recent versions of buildbot now allow build directories to be shared between builders, for example.  And moving unit tests off to the talos boxes should reduce the number of test binaries lying around.
Maybe we should buy some hard drives.


from bug 531675

(In reply to comment #8)
>
> Full disks took out Tinderbox right in the middle of my Try Server run, but
> what results I got look good to me.
> 
> I took the bug summary literally and just made eval ignore the 2nd argument
> entirely. Hope that's the right thing.
(In reply to comment #6)
> from bug 531675
> > Full disks took out Tinderbox right in the middle of my Try Server run, but
> > what results I got look good to me.

Unrelated. That was a disk space issue on the tinderbox server.

> maybe we should buy some hard drives.

The newer hardware slaves have larger drives, but the VMs (like this one, moz2-linux64-slave12) have smaller drives because the VMs are all sharing space on the network storage device.

(In reply to comment #4)
> > they all seem to run at like 90% capacity for some reason, so of course they
> > fail occasionally.
> 
> I don't know the process - that's for RelEng.  Guessing they don't clean up. 

We try not to clean up too aggressively *on purpose*. The alternative is sucking down a brand new hg clone every time, and we're already straining existing bandwidth.
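
As a rough illustration of that tradeoff: a cleanup pass that frees space without forcing a re-clone would delete only the rebuildable object directories and leave the hg checkouts alone. A sketch under assumed paths and naming (not actual RelEng tooling):

    # Delete object directories but keep source checkouts, so nothing has to be
    # re-cloned over the network. The /builds layout and "obj-*" naming are
    # assumptions for illustration.
    import glob
    import os
    import shutil

    BUILDS_ROOT = "/builds/slave"   # hypothetical slave build root

    def cleanup(dry_run=True):
        for objdir in glob.glob(os.path.join(BUILDS_ROOT, "*", "obj-*")):
            print("%s %s" % ("would remove" if dry_run else "removing", objdir))
            if not dry_run:
                shutil.rmtree(objdir, ignore_errors=True)

    if __name__ == "__main__":
        cleanup(dry_run=True)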
OS: Mac OS X → Linux
Hardware: x86 → x86_64
This doesn't seem to have recurred, but I'm marking it resolved and linking it to the linux64 tracking bug so it will be easy to find if it does.
Blocks: support-L64
Status: NEW → RESOLVED
Closed: 14 years ago
Resolution: --- → WORKSFORME
Whiteboard: [linux64]
Product: mozilla.org → Release Engineering