Closed Bug 633277 Opened 13 years ago Closed 12 years ago

Don't reboot build machines

Categories

(Release Engineering :: General, enhancement, P5)

x86
All
enhancement

Tracking

(Not tracked)

RESOLVED WONTFIX

People

(Reporter: jrmuizel, Unassigned)

Details

Apparently the 32 bit build machines were having memory allocation problems. This shouldn't be an issue on the 64 bit machines.
I've apparently assumed context that not everyone had.

As far as I know there are two reasons for rebooting the machines:
1. unit tests not cleaning up after themselves (this should be less of a problem now that we only run make check)
2. the linker not being able to allocate memory after running for some time. The assumed reason for this was address space fragmentation. This shouldn't be a problem on 64-bit.
Rebooting after every run is a big part of our configuration management and monitoring.  At this point, rather than the lack of a reason to reboot, I think we need strong reasons *not* to reboot.
catlee told me that not rebooting means we keep the file cache the OS has built up, and therefore compile performance would be better.

The size of that hit depends on how many times we build the same branch and compile type in a row, and the effect of ccache. It would also be useful to check if linking libxul uses so much memory that the file cache is effectively emptied.
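
A rough way to check that last point would be to watch the kernel's page cache around the link step. A sketch only (assumes a Linux slave; the link command below is a placeholder, not the real build step), using the "Cached" figure from /proc/meminfo as a crude proxy:

#!/usr/bin/env python
# Sketch: compare the kernel's "Cached" figure before and after the link
# step to see whether linking evicts most of the file cache.
import subprocess

def cached_kb():
    with open("/proc/meminfo") as meminfo:
        for line in meminfo:
            if line.startswith("Cached:"):
                return int(line.split()[1])  # reported in kB

before = cached_kb()
subprocess.check_call(["make", "-C", "objdir/toolkit/library"])  # placeholder for the libxul link
after = cached_kb()
print("page cache before link: %d kB, after: %d kB" % (before, after))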
If this is truly worth investigating, then let's go for it, but realize that it's a 180 degree turn from a bunch of work I've got in progress toward rebooting *all* slaves periodically.

Several months' additional redesign and reimplementation (of monitoring, configuration management, and the slave allocator) is a big price to pay for potential disk cache improvement.  IMHO the burden of proof is heavily on the advocates for fewer reboots.
(In reply to comment #4)
> If this is truly worth investigating, then let's go for it, but realize that
> it's a 180 degree turn from a bunch of work I've got in progress toward
> rebooting *all* slaves periodically.
> 
> Several months' additional redesign and reimplementation (of monitoring,
> configuration management, and the slave allocator) is a big price to pay for
> potential disk cache improvement.  IMHO the burden of proof is heavily on the
> advocates for fewer reboots.

I don't think these have to conflict with each other.  If we turn off the 'always reboot' step after a build, and instead rely on 'reboot after N hours', then aren't we still ok?
A very busy set of slaves (like our current linux64 VMs) could end up with slaves that build and build and build and never get an idle 6 hours to reboot. Granted we want to fix that but it's a difference from rebooting after every build.
(In reply to comment #2)
> Rebooting after every run is a big part of our configuration management and
> monitoring.  At this point, rather than the lack of a reason to reboot, I think
> we need strong reasons *not* to reboot.

I expect that build time is a pretty strong reason.

(In reply to comment #3)
> The size of that hit depends on how many times we build the same branch and
> compile type in a row, and the effect of ccache. It would also be useful to
> check if linking libxul uses so much memory that the file cache is effectively
> emptied.

If this is true, you should probably get more memory. On Linux machines, builds should not be doing very much I/O, as the working set should easily fit in RAM.
(In reply to comment #6)
> A very busy set of slaves (like our current linux64 VMs) could end up with
> slaves that build and build and build and never get an idle 6 hours to reboot.
> Granted we want to fix that but it's a difference from rebooting after every
> build.

We don't need to wait for 6 hours of idle time. We could trigger a graceful shutdown after 6 hours, wait for whatever job is running to finish, and then reboot.
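
Something like this, as a sketch (assumes a Linux slave; the build-in-progress marker file is hypothetical and stands in for asking buildbot whether a job is running):

#!/usr/bin/env python
# Sketch of "reboot after N hours, but only between jobs".
import os
import subprocess

MAX_UPTIME_HOURS = 6
BUSY_MARKER = "/builds/slave/build-in-progress"  # hypothetical marker file

def uptime_hours():
    with open("/proc/uptime") as f:
        return float(f.read().split()[0]) / 3600.0

if uptime_hours() >= MAX_UPTIME_HOURS and not os.path.exists(BUSY_MARKER):
    subprocess.check_call(["sudo", "reboot"])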
Catlee's description in comment 8 is how it's planned right now.  So this bug would just mean removing the count_and_reboot.py invocation from the buildsteps on linux64.
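
Not the actual buildbotcustom code, but roughly the shape of that change in a factory definition (the platform variable and the script path are assumptions here):

# Sketch only: gate the per-build reboot step on platform so linux64 skips it.
from buildbot.process.factory import BuildFactory
from buildbot.steps.shell import ShellCommand

platform = 'linux64'  # assumed to come from the surrounding config
factory = BuildFactory()
# ... compile/package/upload steps would be added here ...
if platform != 'linux64':
    factory.addStep(ShellCommand(
        name='maybe_rebooting',
        description=['maybe', 'rebooting'],
        command=['python', 'tools/buildfarm/maintenance/count_and_reboot.py'],  # path assumed
    ))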
There are a few overlapping but orthogonal topics here:

(In reply to comment #2)
> Rebooting after every run is a big part of our configuration management and
> monitoring.  At this point, rather than the lack of a reason to reboot, I think
> we need strong reasons *not* to reboot.

Agreed. We reboot after every build for all OS, and we have machines check with puppet for toolchain updates on reboot. This is important for keeping all production machines identical. This is also tied into how we monitor for sick/missing slaves. Given all this, I'd move to WONTFIX.


(In reply to comment #3)
> catlee told me that not rebooting means we keep the file cache the OS has
> built up, and therefore compile performance would be better.
> 
> The size of that hit depends on how many times we build the same branch and
> compile type in a row, and the effect of ccache. It would also be useful to
> check if linking libxul uses so much memory that the file cache is effectively
> emptied.
I think that "Preserving file cache across reboots for faster compiles" sounds like what jrmuizel is actually asking for. If so, we can morph this bug, and investigate this reasonable request.


(In reply to comment #6)
> A very busy set of slaves (like our current linux64 VMs) could end up with
> slaves that build and build and build and never get an idle 6 hours to reboot.
Yes, today we have a very limited number of linux64 build slaves. This is true until bug#588957 is fixed, and we get a linux64 refimage installed on the racks of IX machines we have installed and waiting... At that point, linux64 will have a full complement of IX build machines.
(In reply to comment #10)
> There are a few overlapping but orthogonal topics here:
> 
> (In reply to comment #2)
> > Rebooting after every run is a big part of our configuration management and
> > monitoring.  At this point, rather than the lack of a reason to reboot, I think
> > we need strong reasons *not* to reboot.
> 
> Agreed. We reboot after every build for all OS, and we have machines check with
> puppet for toolchain updates on reboot. This is important for keeping all
> production machines identical. This is also tied into how we monitor for
> sick/missing slaves. Given all this, I'd move to WONTFIX.

We have better checks on build machines. And, as I mentioned, we can still reboot; we just don't *have* to after each run.

> (In reply to comment #3)
> > catlee told me that not rebooting means we keep the file cache the OS has
> > built up, and therefore compile performance would be better.
> > 
> > The size of that hit depends on how many times we build the same branch and
> > compile type in a row, and the effect of ccache. It would also be useful to
> > check if linking libxul uses so much memory that the file cache is effectively
> > emptied.
> I think that "Preserving file cache across reboots for faster compiles" sounds
> like what jrmuizel is actually asking for. If so, we can morph this bug, and
> investigate this reasonable request.

The point is that if you don't reboot, you stand a good chance of holding most of the important files in memory already, so you don't have to read from disk. This is *not* about ccache.

> (In reply to comment #6)
> > A very busy set of slaves (like our current linux64 VMs) could end up with
> > slaves that build and build and build and never get an idle 6 hours to reboot.
> Yes, today we have a very limited number of linux64 build slaves. This is true
> until bug#588957 is fixed, and we get a linux64 refimage installed on the racks
> of IX machines we have installed and waiting... At that point, linux64 will
> have a full complement of IX build machines.

Let's get some cold/warm build timings on the IX build machines and then we can make a better decision.
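
A crude way to collect those timings, assuming root access to drop the page cache; the build and clobber commands are placeholders for whatever the slave actually runs:

#!/usr/bin/env python
# Sketch: time a "cold" build (page cache dropped, like a freshly rebooted
# slave) against a "warm" rebuild on the same machine. Run as root.
import subprocess
import time

BUILD_CMD = ["make", "-f", "client.mk", "build"]    # placeholder build command
CLEAN_CMD = ["make", "-f", "client.mk", "clobber"]  # placeholder clobber step

def drop_page_cache():
    subprocess.check_call(["sync"])
    with open("/proc/sys/vm/drop_caches", "w") as f:
        f.write("3\n")

def timed_build(label):
    start = time.time()
    subprocess.check_call(BUILD_CMD)
    print("%s build: %.1f s" % (label, time.time() - start))

subprocess.check_call(CLEAN_CMD)
drop_page_cache()
timed_build("cold")   # nothing cached: like a slave that just rebooted
subprocess.check_call(CLEAN_CMD)
timed_build("warm")   # sources and toolchain still in the page cache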
Severity: normal → enhancement
OS: Mac OS X → All
Priority: -- → P5
Summary: Don't reboot 64 bit build machines → Don't reboot build machines
The new buildbot monitoring *does* assume that slaves reboot regularly, so changing that is not a good idea.  I move to wontfix this.
Better build times at the cost of package management and build master load management seems like a net loss to me.

Agreeing with the above recommendations for WONTFIX, doing so.
Status: NEW → RESOLVED
Closed: 12 years ago
Resolution: --- → WONTFIX
Product: mozilla.org → Release Engineering