Closed Bug 393411 Opened 13 years ago Closed 11 years ago

reboot, restart build+unittest machines automatically

Categories

(Release Engineering :: General, defect)

defect
Not set
normal

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: rcampbell, Unassigned)

References

Details

(Whiteboard: [try] need to do try builds)

Attachments

(5 files, 1 obsolete file)

Build slaves occasionally need to be rebooted. This is mostly a problem for the Windows slaves, specifically on win2k3 under MSYS which can have stuck processes as a result of failed builds. We need a mechanism to reboot a slave on demand (invoking shutdown through buildbot?) or on schedule, likely once per day.
Blocks: 393418
Blocks: 393419
Component: Testing → Release Engineering: Future
Product: Core → mozilla.org
QA Contact: testing → release
Version: unspecified → other
We're already doing this for Talos, and probably don't need to for other machines at this point.
Status: NEW → RESOLVED
Closed: 12 years ago
Resolution: --- → WONTFIX
Doing periodic reboots of build/unittest slaves still feels like a good idea, so reopening after offline discussions with bhearsum. 

Auto-rebooting of Talos seems to have helped reduce intermittent problems there. Hopefully, doing auto-rebooting for the build/unittest machines should help avoid problems like "machine went weird after 'n' days continuously building" and the msys problem reported in comment#0. Unclear at this stage if we would need to reboot after every job, like we do for Talos, or if less frequently will be good enough.

(Summary tweaked as we don't use "reboot-on-demand" for talos, and to clarify which machines we're talking about handling in this bug.)
Status: RESOLVED → REOPENED
Depends on: 472517
Resolution: WONTFIX → ---
Summary: reboot, restart on-demand and automatically → reboot, restart build+unittest machines automatically
Do know for a fact that unittest machines go wonky over time? I don't recall any of them needing a reboot for something like that in ages.

Rob's initial comment here isn't an issue anymore because we run a patched twisted which properly kills processes.

I don't think this is something we should bother with unless we have concrete benefits.
(In reply to comment #3)
> Do know for a fact that unittest machines go wonky over time? I don't recall
> any of them needing a reboot for something like that in ages.
bug#483199 is one recent example.


> Rob's initial comment here isn't an issue anymore because we run a patched
> twisted which properly kills processes.
> 
> I don't think this is something we should bother with unless we have concrete
> benefits.
Agreed. Lets leave this open in Future, and see how often this trips us again.
(In reply to comment #4)
> (In reply to comment #3)
> > Do know for a fact that unittest machines go wonky over time? I don't recall
> > any of them needing a reboot for something like that in ages.
> bug#483199 is one recent example.
> 
Bug#483199 was originally filed for problems with try-linux-slave01, and closed after rebooting fixed it.

Bug#483199 was recently reopened for similar problems with try-linux-slave03, and closed after rebooting fixed it.
bug#490850 tracked a problem with moz2-win32-slave24 that was fixed by rebooting.
(In reply to comment #6)
> bug#490850 tracked a problem with moz2-win32-slave24 that was fixed by
> rebooting.

To be fair, that one was sort of different as it was caused by the buildbot process not getting restarted properly after updating the code.
(In reply to comment #7)
> (In reply to comment #6)
> > bug#490850 tracked a problem with moz2-win32-slave24 that was fixed by
> > rebooting.
> 
> To be fair, that one was sort of different as it was caused by the buildbot
> process not getting restarted properly after updating the code.

Totally true that the root cause (human missed step during upgrade during downtime) is different. However, the solution of fix-by-rebooting is the same as the others tracked here - if I understand Nick in that bug, the only thing he did to fix the problem was reboot.

imho, if these slaves auto-rebooted after each job, this bug and the other bugs tracked here so far would have never required human intervention. Hence I believe its correct to track this event here.
I ran into a problem yesterday on one of the staging machines where the slave was hitting errors running packaged unittests.

After the slave was rebooted as part of the downtime, it is now able to run the tests.
Aki had to reboot slave in 483199.
(In reply to comment #10)
> Aki had to reboot slave in 483199.

Actually, that bug tracks a few reboots. Aki first rebooted try-linux-slave15, and try-linux-slave19, and then had to come back and reboot try-linux-slave19 again 2 days later.
There are several errors noted below that were fixed by simply rebooting.  

I'd advocate we have build machines reboot themselves after each build/unittest job completes, just like the talos machines already do... unless there are any remaining objections?
(In reply to comment #13)
> There are several errors noted below that were fixed by simply rebooting.  
> 
> I'd advocate we have build machines reboot themselves after each build/unittest
> job completes, just like the talos machines already do... unless there are any
> remaining objections?

You'll be happy to know that you'll get no more objections to this from me.
Un-futuring, taking.

Looks like we have staging set to reboot every 5 builds already; any objections to setting production to do the same?
Assignee: nobody → aki
Component: Release Engineering: Future → Release Engineering
I *think* we can also add buildsBeforeReboot to TryBuildFactory instances, since it inherits MercurialBuildFactory, but I'd prefer to verify.
Attachment #399374 - Flags: review?(bhearsum)
Comment on attachment 399374 [details] [diff] [review]
set all builds_before_reboot to 5 in config.py

I know staging has been running with a reboot every 5 builds but I think we may as well go with a reboot per build - what do you think?
(In reply to comment #17)
> (From update of attachment 399374 [details] [diff] [review])
> I know staging has been running with a reboot every 5 builds but I think we may
> as well go with a reboot per build - what do you think?

A reboot per build on staging or in production?  I would find it very annoying to have a reboot per build in staging, since I'm often working with only one or two slaves.
(In reply to comment #18)
> (In reply to comment #17)
> > (From update of attachment 399374 [details] [diff] [review] [details])
> > I know staging has been running with a reboot every 5 builds but I think we may
> > as well go with a reboot per build - what do you think?
> 
> A reboot per build on staging or in production?  I would find it very annoying
> to have a reboot per build in staging, since I'm often working with only one or
> two slaves.

I was speaking about both. I'd be fine to leave it at 5 in staging though
Comment on attachment 399374 [details] [diff] [review]
set all builds_before_reboot to 5 in config.py

A couple more things on this:
we need to change the nagios PINGs to be like the Talos ones - only fail after many repeated attempts, otherwise it's going to super spammy
We might want to wait until the next downtime to land this, the failure case is pretty bad (we lose all our slaves)
(In reply to comment #17)
> (From update of attachment 399374 [details] [diff] [review])
> I know staging has been running with a reboot every 5 builds but I think we may
> as well go with a reboot per build - what do you think?

Sure. I'd want to try it in staging for a bit first though.
I'll need to revisit for reboots per, nagios settings, and try... testing will take a while.  Do we want to land the every-5 patch before all that's ready, or hold off?
(In reply to comment #21)
> (In reply to comment #17)
> > (From update of attachment 399374 [details] [diff] [review] [details])
> > I know staging has been running with a reboot every 5 builds but I think we may
> > as well go with a reboot per build - what do you think?
> 
> Sure. I'd want to try it in staging for a bit first though.
> I'll need to revisit for reboots per, nagios settings, and try... testing will
> take a while.  Do we want to land the every-5 patch before all that's ready, or
> hold off?

yeah, that's probably a good idea. Can we do that in the downtime tomorrow?
Attachment #399374 - Flags: review?(bhearsum) → review+
Comment on attachment 399374 [details] [diff] [review]
set all builds_before_reboot to 5 in config.py

this is fine, to start with
Comment on attachment 399374 [details] [diff] [review]
set all builds_before_reboot to 5 in config.py

revision 1488:4919ef2ebb65
Attachment #399374 - Flags: checked-in+
This should hopefully eat the twistd exceptions on disconnect, much like the Talos version.
Attachment #399929 - Flags: review?(nthomas)
Attachment #399929 - Flags: review?(nthomas) → review+
Comment on attachment 399929 [details] [diff] [review]
use DisconnectStep like talos for maybe_reboot

revision 405:b094fdd16946
Attachment #399929 - Flags: checked-in+
Depends on: 515795
Now that the ping checks are changed, we can do this without making Nagios go nuts.
Attachment #402874 - Flags: review?(aki)
Attachment #402874 - Flags: review?(aki) → review+
Comment on attachment 402874 [details] [diff] [review]
reboot after every build (excluding l10n)

we're also missing try, but this patch is needed for the production side. r=me
I'm putting together a patch for the Try Server right now, too.
Attachment #402874 - Attachment is obsolete: true
Attachment #404612 - Flags: review?(catlee)
Attachment #404612 - Flags: review?(catlee) → review+
Comment on attachment 404612 [details] [diff] [review]
reboot after every build (excluding l10n and staging)

too bad this wasn't a global setting per platform...
Comment on attachment 404612 [details] [diff] [review]
reboot after every build (excluding l10n and staging)

changeset:   1578:0075aeed9bcc
Attachment #404612 - Flags: checked-in+
Still to do here: reboots on try, and reboots after mobile builds. I have patches in progress, probably don't have time to finish them today. Maybe tomorrow.
Most likely not getting to this this quarter.
Assignee: aki → nobody
We should get this for free on try when bug 520227 is done.
Whiteboard: [mobile][try] need to do try, mobile builds
Try is hopefully free.
I had updated bhearsum's original patch to work in mobile land; tested a maemo build and it worked.
Patch 1/2
Attachment #421191 - Flags: review?(catlee)
Adds the appropriate steps to the mobile builds -- desktop, maemo, winmo.

The Maemo build factory has 2 locations where it's added -- in the init() and addMultiLocaleSteps (since we drop all build steps before the multi-locale repacks during multi-locale builds).
Attachment #421192 - Flags: review?(catlee)
Attachment #421191 - Flags: review?(catlee) → review+
Comment on attachment 421192 [details] [diff] [review]
reboot mobile builds - custom

>@@ -4425,16 +4427,19 @@ class MaemoBuildFactory(MobileBuildFacto
>             self.addUploadSteps(platform='linux')
>         
>         if self.triggerBuilds:
>             self.addTriggeredBuildsSteps()
> 
>         if self.multiLocale:
>             self.nonMultiLocaleStepsLength = len(self.steps)
> 
>+        if self.buildsBeforeReboot and self.buildsBeforeReboot > 0:
>+            self.addPeriodicRebootSteps()
>+
>     def newBuild(self, requests):
>         if self.multiLocale:
>             self.addMultiLocaleSteps(requests, 'locales')
>         return BuildFactory.newBuild(self, requests)
>         
>     def addPreCleanSteps(self):
>         self.addStep(ShellCommand,
>             name='rm_logfile',
>@@ -4587,16 +4592,18 @@ class MaemoBuildFactory(MobileBuildFacto
>             self.checkOutLocale(locale)
>             self.compareLocales(locale)
>             self.addChromeLocale(locale)
>         # Let's package the multi-locale build and upload it
>         self.addPackageSteps(multiLocale=True, packageXulrunner=True)
>         self.packageGlob="mobile/dist/fennec*.tar.bz2 mobile/mobile/fennec*.deb " + \
>                          "xulrunner/dist/*.tar.bz2 xulrunner/xulrunner/xulrunner*.deb"
>         self.uploadMulti()
>+        if self.buildsBeforeReboot and self.buildsBeforeReboot > 0:
>+            self.addPeriodicRebootSteps()
> 
>     def compareLocalesSetup(self):
>         self.addStep(ShellCommand,
>          name='rm_compare_locales',
>          command=['rm', '-rf', 'compare-locales'],
>          description=['remove', 'compare-locales'],
>          workdir=self.baseWorkDir,
>          haltOnFailure=True


I think you're adding the periodic reboot steps in here twice...is that on purpose?  One gets called via newBuild -> addMultiLocateSteps, and one gets called via __init__.
(In reply to comment #37)
> I think you're adding the periodic reboot steps in here twice...is that on
> purpose?  One gets called via newBuild -> addMultiLocateSteps, and one gets
> called via __init__.

That's correct, see comment 36 =)
Attachment #421192 - Flags: review?(catlee) → review+
Seems like aki is working on this
Assignee: nobody → aki
Attachment #421192 - Flags: checked-in+
Whiteboard: [mobile][try] need to do try, mobile builds → [try] need to do try builds
Not working on the try portion.
Assignee: aki → nobody
Try is going to get fixed as part of bug 520227, which is the only part of this bug left.
Status: REOPENED → RESOLVED
Closed: 12 years ago11 years ago
Resolution: --- → FIXED
Product: mozilla.org → Release Engineering
You need to log in before you can comment on or make changes to this bug.