Closed Bug 393411 Opened 17 years ago Closed 15 years ago

reboot, restart build+unittest machines automatically

Tracking

(Not tracked)

Status:

RESOLVED FIXED

People

(Reporter: rcampbell, Unassigned)

References

Details

(Whiteboard: [try] need to do try builds)

Attachments

(5 files, 1 obsolete file)

set all builds_before_reboot to 5 in config.py, 15 years ago, Aki Sasaki (not active), 18.05 KB, patch, bhearsum : review+, mozilla : checked-in+
use DisconnectStep like talos for maybe_reboot, 15 years ago, Aki Sasaki (not active), 1.30 KB, patch, nthomas : review+, mozilla : checked-in+
reboot after every build (excluding l10n), 15 years ago, bhearsum@mozilla.com (:bhearsum), 26.12 KB, patch, mozilla : review+
reboot after every build (excluding l10n and staging), 15 years ago, bhearsum@mozilla.com (:bhearsum), 13.07 KB, patch, catlee : review+, bhearsum : checked-in+
reboot mobile builds - configs, 15 years ago, Aki Sasaki (not active), 30.00 KB, patch, catlee : review+, mozilla : checked-in+
reboot mobile builds - custom, 15 years ago, Aki Sasaki (not active), 3.14 KB, patch, catlee : review+, mozilla : checked-in+

Rob Campbell [:rc] (:robcee)

Reporter

Description

•

17 years ago

Build slaves occasionally need to be rebooted. This is mostly a problem for the Windows slaves, specifically on win2k3 under MSYS, which can be left with stuck processes after failed builds. We need a mechanism to reboot a slave on demand (invoking shutdown through buildbot?) or on a schedule, likely once per day.
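As a rough illustration of the "invoking shutdown" idea above, the sketch below picks a reboot command line per platform. It is not taken from any patch on this bug: the function name `reboot_command` is hypothetical, and the Unix branch assumes the slave account has passwordless sudo. The example only prints the command and never runs it.

```python
import sys


def reboot_command(platform=sys.platform):
    """Return an argv list that would reboot a machine of the given platform.

    Hypothetical helper for illustration; not part of buildbot.
    """
    if platform.startswith("win"):
        # Dash-style switches (-r restart, -f force-close apps, -t 0 no delay)
        # are used so that an MSYS shell does not rewrite "/r" as a path.
        return ["shutdown", "-r", "-f", "-t", "0"]
    # Linux and macOS: assumes the slave account has passwordless sudo.
    return ["sudo", "shutdown", "-r", "now"]


if __name__ == "__main__":
    # Print only; handing this list to subprocess.call() would really reboot.
    print(" ".join(reboot_command()))
```

A scheduled reboot could hand the same command to cron or the Windows task scheduler, while an on-demand reboot would need some way for the master to ask the slave to run it.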

Rob Campbell [:rc] (:robcee)

Reporter

Updated

•

17 years ago

Blocks: 393418

Rob Campbell [:rc] (:robcee)

Reporter

Updated

•

17 years ago

Blocks: 393419

alice nodelman [:alice] [:anode]

Updated

•

16 years ago

Component: Testing → Release Engineering: Future

Product: Core → mozilla.org

QA Contact: testing → release

Version: unspecified → other

bhearsum@mozilla.com (:bhearsum)

Comment 1

•

16 years ago

We're already doing this for Talos, and probably don't need to for other machines at this point.

Status: NEW → RESOLVED

Closed: 16 years ago

Resolution: --- → WONTFIX

John O'Duinn [:joduinn] (please use "needinfo?" flag)

Comment 2

•

16 years ago

Doing periodic reboots of build/unittest slaves still feels like a good idea, so reopening after offline discussions with bhearsum.

Auto-rebooting of Talos seems to have helped reduce intermittent problems there. Hopefully, auto-rebooting the build/unittest machines will help avoid problems like "machine went weird after 'n' days continuously building" and the msys problem reported in comment#0. It is unclear at this stage whether we would need to reboot after every job, like we do for Talos, or whether rebooting less frequently will be good enough.

(Summary tweaked as we don't use "reboot-on-demand" for talos, and to clarify which machines we're talking about handling in this bug.)

Status: RESOLVED → REOPENED

Depends on: 472517

Resolution: WONTFIX → ---

Summary: reboot, restart on-demand and automatically → reboot, restart build+unittest machines automatically
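The open question in comment 2, reboot after every job or less often, comes down to a single threshold. The sketch below is a minimal illustration of that idea and not the code from the attached patches: the class `RebootPolicy` and its method are hypothetical, and only the name `builds_before_reboot` is borrowed from the attachment titles, whose exact semantics in config.py are not shown on this page.

```python
class RebootPolicy:
    """Decide when a slave should reboot, counting finished builds.

    Hypothetical illustration; not part of buildbot or the attached patches.
    """

    def __init__(self, builds_before_reboot=1):
        if builds_before_reboot < 1:
            raise ValueError("builds_before_reboot must be at least 1")
        self.builds_before_reboot = builds_before_reboot
        self.builds_since_reboot = 0

    def build_finished(self):
        """Record one finished build; return True if a reboot is now due."""
        self.builds_since_reboot += 1
        if self.builds_since_reboot >= self.builds_before_reboot:
            # The counter restarts because the machine is about to reboot.
            self.builds_since_reboot = 0
            return True
        return False


every_job = RebootPolicy(builds_before_reboot=1)    # Talos-style
every_fifth = RebootPolicy(builds_before_reboot=5)  # less frequent
print([every_job.build_finished() for _ in range(3)])
print([every_fifth.build_finished() for _ in range(6)])
```

A threshold of 1 gives the reboot-after-every-job behaviour described for Talos. Larger values trade fewer reboots against a longer window in which a machine can drift into a bad state.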

bhearsum@mozilla.com (:bhearsum)

Comment 3

•

16 years ago

Do we know for a fact that unittest machines go wonky over time? I don't recall any of them needing a reboot for something like that in ages.

Rob's initial comment here isn't an issue anymore because we run a patched twisted which properly kills processes.

I don't think this is something we should bother with unless we have concrete benefits.

John O'Duinn [:joduinn] (please use "needinfo?" flag)

Updated

•

16 years ago

No longer blocks: 393112

John O'Duinn [:joduinn] (please use "needinfo?" flag)

Comment 4

•

16 years ago

(In reply to comment #3)
> Do we know for a fact that unittest machines go wonky over time? I don't recall any of them needing a reboot for something like that in ages.

bug#483199 is one recent example.

> Rob's initial comment here isn't an issue anymore because we run a patched twisted which properly kills processes.
>
> I don't think this is something we should bother with unless we have concrete benefits.

Agreed. Let's leave this open in Future, and see how often this trips us again.

John O'Duinn [:joduinn] (please use "needinfo?" flag)

Comment 5

•

16 years ago

(In reply to comment #4)
> (In reply to comment #3)
> > Do we know for a fact that unittest machines go wonky over time? I don't recall any of them needing a reboot for something like that in ages.
> bug#483199 is one recent example.

Bug#483199 was originally filed for problems with try-linux-slave01, and closed after rebooting fixed it. Bug#483199 was recently reopened for similar problems with try-linux-slave03, and closed after rebooting fixed it.

John O'Duinn [:joduinn] (please use "needinfo?" flag)

Comment 6

•

16 years ago

bug#490850 tracked a problem with moz2-win32-slave24 that was fixed by rebooting.

bhearsum@mozilla.com (:bhearsum)

Comment 7

•

16 years ago

(In reply to comment #6)
> bug#490850 tracked a problem with moz2-win32-slave24 that was fixed by rebooting.

To be fair, that one was sort of different as it was caused by the buildbot process not getting restarted properly after updating the code.

John O'Duinn [:joduinn] (please use "needinfo?" flag)

Comment 8

•

16 years ago

(In reply to comment #7)
> (In reply to comment #6)
> > bug#490850 tracked a problem with moz2-win32-slave24 that was fixed by rebooting.
>
> To be fair, that one was sort of different as it was caused by the buildbot process not getting restarted properly after updating the code.

Totally true that the root cause (a human missed a step during the upgrade in the downtime) is different. However, the solution of fix-by-rebooting is the same as the others tracked here - if I understand Nick in that bug, the only thing he did to fix the problem was reboot.

imho, if these slaves auto-rebooted after each job, this bug and the other bugs tracked here so far would never have required human intervention. Hence I believe it's correct to track this event here.

Chris AtLee [:catlee]

Comment 9

•

16 years ago

I ran into a problem yesterday on one of the staging machines where the slave was hitting errors running packaged unittests. After the slave was rebooted as part of the downtime, it is now able to run the tests.

John O'Duinn [:joduinn] (please use "needinfo?" flag)

Comment 10

•

15 years ago

Aki had to reboot slave in 483199.

John O'Duinn [:joduinn] (please use "needinfo?" flag)

Comment 12

•

15 years ago

(In reply to comment #10)
> Aki had to reboot slave in 483199.

Actually, that bug tracks a few reboots. Aki first rebooted try-linux-slave15 and try-linux-slave19, and then had to come back and reboot try-linux-slave19 again 2 days later.

John O'Duinn [:joduinn] (please use "needinfo?" flag)

Comment 13

•

15 years ago

There are several errors noted below that were fixed by simply rebooting.

I'd advocate we have build machines reboot themselves after each build/unittest job completes, just like the talos machines already do... unless there are any remaining objections?

bhearsum@mozilla.com (:bhearsum)

Comment 14

•

15 years ago

(In reply to comment #13)
> There are several errors noted below that were fixed by simply rebooting.
>
> I'd advocate we have build machines reboot themselves after each build/unittest job completes, just like the talos machines already do... unless there are any remaining objections?

You'll be happy to know that you'll get no more objections to this from me.