Closed Bug 633277 Opened 13 years ago Closed 12 years ago

Don't reboot build machines

Categories

(Release Engineering :: General, enhancement, P5)

x86
All
enhancement

Tracking

(Not tracked)

RESOLVED WONTFIX

People

(Reporter: jrmuizel, Unassigned)

Details

Apparently the 32 bit build machines were having memory allocation problems. This shouldn't be an issue on the 64 bit machines.
I've apparently assumed context that not everyone had.

As far as I know there are two reasons for rebooting the machines:
1. unit tests not cleaning up after themselves (this should be less of a problem now that we only run make check)
2. the linker not being able to allocate memory after running for some time. The assumed reason for this was address space fragmentation. This shouldn't be a problem on 64-bit.
Rebooting after every run is a big part of our configuration management and monitoring.  At this point, rather than the lack of a reason to reboot, I think we need strong reasons *not* to reboot.
catlee told me that not rebooting means we keep the file cache the OS has built up, and therefore compile performance would be better.

The size of that hit depends on how many times we build the same branch and compile type in a row, and the effect of ccache. It would also be useful to check if linking libxul uses so much memory that the file cache is effectively emptied.
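
A rough way to check that last point would be to watch the kernel's page cache around the link step. A sketch only (assumes a Linux slave; the link command below is a placeholder, not the real build step), using the "Cached" figure from /proc/meminfo as a crude proxy:

#!/usr/bin/env python
# Sketch: compare the kernel's "Cached" figure before and after the link
# step to see whether linking evicts most of the file cache.
import subprocess

def cached_kb():
    with open("/proc/meminfo") as meminfo:
        for line in meminfo:
            if line.startswith("Cached:"):
                return int(line.split()[1])  # reported in kB

before = cached_kb()
subprocess.check_call(["make", "-C", "objdir/toolkit/library"])  # placeholder for the libxul link
after = cached_kb()
print("page cache before link: %d kB, after: %d kB" % (before, after))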
If this is truly worth investigating, then let's go for it, but realize that it's a 180 degree turn from a bunch of work I've got in progress toward rebooting *all* slaves periodically.

Several months' additional redesign and reimplementation (of monitoring, configuration management, and the slave allocator) is a big price to pay for potential disk cache improvement.  IMHO the burden of proof is heavily on the advocates for fewer reboots.
(In reply to comment #4)
> If this is truly worth investigating, then let's go for it, but realize that
> it's a 180 degree turn from a bunch of work I've got in progress toward
> rebooting *all* slaves periodically.
> 
> Several months' additional redesign and reimplementation (of monitoring,
> configuration management, and the slave allocator) is a big price to pay for
> potential disk cache improvement.  IMHO the burden of proof is heavily on the
> advocates for fewer reboots.

I don't think these have to conflict with each other.  If we turn off the 'always reboot' step after a build, and instead rely on 'reboot after N hours', then aren't we still ok?
A very busy set of slaves (like our current linux64 VMs) could end up with slaves that build and build and build and never get an idle 6 hours to reboot. Granted we want to fix that but it's a difference from rebooting after every build.
(In reply to comment #2)
> Rebooting after every run is a big part of our configuration management and
> monitoring.  At this point, rather than the lack of a reason to reboot, I think
> we need strong reasons *not* to reboot.

I expect that build time is a pretty strong reason.

(In reply to comment #3)
> The size of that hit depends on how many times we build the same branch and
> compile type in a row, and the effect of ccache. It would also be useful to
> check if linking libxul uses so much memory that the file cache is effectively
> emptied.

If this is true, you should probably get more memory. On Linux machines, builds should not be doing very much I/O, as the working set should easily fit in RAM.
(In reply to comment #6)
> A very busy set of slaves (like our current linux64 VMs) could end up with
> slaves that build and build and build and never get an idle 6 hours to reboot.
> Granted we want to fix that but it's a difference from rebooting after every
> build.

We don't need to wait for 6 hours of idle time. We could trigger a graceful shutdown after 6 hours, wait for whatever job is running to finish, and then reboot.
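
Something like this, as a sketch (assumes a Linux slave; the build-in-progress marker file is hypothetical and stands in for asking buildbot whether a job is running):

#!/usr/bin/env python
# Sketch of "reboot after N hours, but only between jobs".
import os
import subprocess

MAX_UPTIME_HOURS = 6
BUSY_MARKER = "/builds/slave/build-in-progress"  # hypothetical marker file

def uptime_hours():
    with open("/proc/uptime") as f:
        return float(f.read().split()[0]) / 3600.0

if uptime_hours() >= MAX_UPTIME_HOURS and not os.path.exists(BUSY_MARKER):
    subprocess.check_call(["sudo", "reboot"])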
Catlee's description in comment 8 is how it's planned right now.  So this bug would just mean removing the count_and_reboot.py invocation from the buildsteps on linux64.
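
Not the actual buildbotcustom code, but roughly the shape of that change in a factory definition (the platform variable and the script path are assumptions here):

# Sketch only: gate the per-build reboot step on platform so linux64 skips it.
from buildbot.process.factory import BuildFactory
from buildbot.steps.shell import ShellCommand

platform = 'linux64'  # assumed to come from the surrounding config
factory = BuildFactory()
# ... compile/package/upload steps would be added here ...
if platform != 'linux64':
    factory.addStep(ShellCommand(
        name='maybe_rebooting',
        description=['maybe', 'rebooting'],
        command=['python', 'tools/buildfarm/maintenance/count_and_reboot.py'],  # path assumed
    ))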
There are a few overlapping but orthogonal topics here:

(In reply to comment #2)
> Rebooting after every run is a big part of our configuration management and
> monitoring.  At this point, rather than the lack of a reason to reboot, I think
> we need strong reasons *not* to reboot.

Agreed. We reboot after every build for all OS, and we have machines check with puppet for toolchain updates on reboot. This is important for keeping all production machines identical. This is also tied into how we monitor for sick/missing slaves. Given all this, I'd move to WONTFIX.


(In reply to comment #3)
> catlee told me that not rebooting means we keep the file cache the OS has
> built up, and therefore compile performance would be better.
> 
> The size of that hit depends on how many times we build the same branch and
> compile type in a row, and the effect of ccache. It would also be useful to
> check if linking libxul uses so much memory that the file cache is effectively
> emptied.
I think that "Preserving file cache across reboots for faster compiles" sounds like what jrmuizel is actually asking for. If so, we can morph this bug, and investigate this reasonable request.


(In reply to comment #6)
> A very busy set of slaves (like our current linux64 VMs) could end up with
> slaves that build and build and build and never get an idle 6 hours to reboot.
Yes, today we have a very limited number of linux64 build slaves. This is true until bug#588957 is fixed, and we get a linux64 refimage installed on the racks of IX machines we have installed and waiting... At that point, linux64 will have a full complement of IX build machines.
(In reply to comment #10)
> There are a few overlapping but orthogonal topics here:
> 
> (In reply to comment #2)
> > Rebooting after every run is a big part of our configuration management and
> > monitoring.  At this point, rather than the lack of a reason to reboot, I think
> > we need strong reasons *not* to reboot.
> 
> Agreed. We reboot after every build for all OS, and we have machines check with
> puppet for toolchain updates on reboot. This is important for keeping all
> production machines identical. This is also tied into how we monitor for
> sick/missing slaves. Given all this, I'd move to WONTFIX.

We have better checks on build machines. And, as I mentioned, we can still reboot; we just don't *have* to after each run.

> (In reply to comment #3)
> > catlee told me that not rebooting means we keep the file cache the OS has
> > built up, and therefore compile performance would be better.
> > 
> > The size of that hit depends on how many times we build the same branch and
> > compile type in a row, and the effect of ccache. It would also be useful to
> > check if linking libxul uses so much memory that the file cache is effectively
> > emptied.
> I think that "Preserving file cache across reboots for faster compiles" sounds
> like what jrmuizel is actually asking for. If so, we can morph this bug, and
> investigate this reasonable request.

The point is that if you don't reboot, you stand a good chance of holding most of the important files in memory already, so you don't have to read from disk. This is *not* about ccache.

> (In reply to comment #6)
> > A very busy set of slaves (like our current linux64 VMs) could end up with
> > slaves that build and build and build and never get an idle 6 hours to reboot.
> Yes, today we have a very limited number of linux64 build slaves. This is true
> until bug#588957 is fixed, and we get a linux64 refimage installed on the racks
> of IX machines we have installed and waiting... At that point, linux64 will
> have a full complement of IX build machines.

Let's get some cold/warm build timings on the IX build machines and then we can make a better decision.
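
A crude way to collect those timings, assuming root access to drop the page cache; the build and clobber commands are placeholders for whatever the slave actually runs:

#!/usr/bin/env python
# Sketch: time a "cold" build (page cache dropped, like a freshly rebooted
# slave) against a "warm" rebuild on the same machine. Run as root.
import subprocess
import time

BUILD_CMD = ["make", "-f", "client.mk", "build"]    # placeholder build command
CLEAN_CMD = ["make", "-f", "client.mk", "clobber"]  # placeholder clobber step

def drop_page_cache():
    subprocess.check_call(["sync"])
    with open("/proc/sys/vm/drop_caches", "w") as f:
        f.write("3\n")

def timed_build(label):
    start = time.time()
    subprocess.check_call(BUILD_CMD)
    print("%s build: %.1f s" % (label, time.time() - start))

subprocess.check_call(CLEAN_CMD)
drop_page_cache()
timed_build("cold")   # nothing cached: like a slave that just rebooted
subprocess.check_call(CLEAN_CMD)
timed_build("warm")   # sources and toolchain still in the page cache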
Severity: normal → enhancement
OS: Mac OS X → All
Priority: -- → P5
Summary: Don't reboot 64 bit build machines → Don't reboot build machines
The new buildbot monitoring *does* assume that slaves reboot regularly, so changing that is not a good idea.  I move to wontfix this.
Better build times at the cost of package management and build master load management seems like a net loss to me.

Agreeing with the above recommendations for WONTFIX, doing so.
Status: NEW → RESOLVED
Closed: 12 years ago
Resolution: --- → WONTFIX
Product: mozilla.org → Release Engineering