Closed
Bug 393411
Opened 17 years ago
Closed 15 years ago
reboot, restart build+unittest machines automatically
Categories
(Release Engineering :: General, defect)
Release Engineering
General
Tracking
(Not tracked)
RESOLVED
FIXED
People
(Reporter: rcampbell, Unassigned)
References
Details
(Whiteboard: [try] need to do try builds)
Attachments
(5 files, 1 obsolete file)
18.05 KB,
patch
|
bhearsum
:
review+
mozilla
:
checked-in+
|
Details | Diff | Splinter Review |
1.30 KB,
patch
|
nthomas
:
review+
mozilla
:
checked-in+
|
Details | Diff | Splinter Review |
13.07 KB,
patch
|
catlee
:
review+
bhearsum
:
checked-in+
|
Details | Diff | Splinter Review |
30.00 KB,
patch
|
catlee
:
review+
mozilla
:
checked-in+
|
Details | Diff | Splinter Review |
3.14 KB,
patch
|
catlee
:
review+
mozilla
:
checked-in+
|
Details | Diff | Splinter Review |
Build slaves occasionally need to be rebooted. This is mostly a problem for the Windows slaves, specifically on win2k3 under MSYS which can have stuck processes as a result of failed builds. We need a mechanism to reboot a slave on demand (invoking shutdown through buildbot?) or on schedule, likely once per day.
Updated•16 years ago
|
Component: Testing → Release Engineering: Future
Product: Core → mozilla.org
QA Contact: testing → release
Version: unspecified → other
Comment 1•16 years ago
|
||
We're already doing this for Talos, and probably don't need to for other machines at this point.
Status: NEW → RESOLVED
Closed: 16 years ago
Resolution: --- → WONTFIX
Comment 2•16 years ago
|
||
Doing periodic reboots of build/unittest slaves still feels like a good idea, so reopening after offline discussions with bhearsum.
Auto-rebooting of Talos seems to have helped reduce intermittent problems there. Hopefully, doing auto-rebooting for the build/unittest machines should help avoid problems like "machine went weird after 'n' days continuously building" and the msys problem reported in comment#0. Unclear at this stage if we would need to reboot after every job, like we do for Talos, or if less frequently will be good enough.
(Summary tweaked as we don't use "reboot-on-demand" for talos, and to clarify which machines we're talking about handling in this bug.)
Status: RESOLVED → REOPENED
Depends on: 472517
Resolution: WONTFIX → ---
Summary: reboot, restart on-demand and automatically → reboot, restart build+unittest machines automatically
Comment 3•16 years ago
|
||
Do know for a fact that unittest machines go wonky over time? I don't recall any of them needing a reboot for something like that in ages.
Rob's initial comment here isn't an issue anymore because we run a patched twisted which properly kills processes.
I don't think this is something we should bother with unless we have concrete benefits.
Comment 4•16 years ago
|
||
(In reply to comment #3)
> Do know for a fact that unittest machines go wonky over time? I don't recall
> any of them needing a reboot for something like that in ages.
bug#483199 is one recent example.
> Rob's initial comment here isn't an issue anymore because we run a patched
> twisted which properly kills processes.
>
> I don't think this is something we should bother with unless we have concrete
> benefits.
Agreed. Lets leave this open in Future, and see how often this trips us again.
Comment 5•16 years ago
|
||
(In reply to comment #4)
> (In reply to comment #3)
> > Do know for a fact that unittest machines go wonky over time? I don't recall
> > any of them needing a reboot for something like that in ages.
> bug#483199 is one recent example.
>
Bug#483199 was originally filed for problems with try-linux-slave01, and closed after rebooting fixed it.
Bug#483199 was recently reopened for similar problems with try-linux-slave03, and closed after rebooting fixed it.
Comment 6•16 years ago
|
||
bug#490850 tracked a problem with moz2-win32-slave24 that was fixed by rebooting.
Comment 7•16 years ago
|
||
(In reply to comment #6)
> bug#490850 tracked a problem with moz2-win32-slave24 that was fixed by
> rebooting.
To be fair, that one was sort of different as it was caused by the buildbot process not getting restarted properly after updating the code.
Comment 8•16 years ago
|
||
(In reply to comment #7)
> (In reply to comment #6)
> > bug#490850 tracked a problem with moz2-win32-slave24 that was fixed by
> > rebooting.
>
> To be fair, that one was sort of different as it was caused by the buildbot
> process not getting restarted properly after updating the code.
Totally true that the root cause (human missed step during upgrade during downtime) is different. However, the solution of fix-by-rebooting is the same as the others tracked here - if I understand Nick in that bug, the only thing he did to fix the problem was reboot.
imho, if these slaves auto-rebooted after each job, this bug and the other bugs tracked here so far would have never required human intervention. Hence I believe its correct to track this event here.
Comment 9•16 years ago
|
||
I ran into a problem yesterday on one of the staging machines where the slave was hitting errors running packaged unittests.
After the slave was rebooted as part of the downtime, it is now able to run the tests.
Comment 10•15 years ago
|
||
Aki had to reboot slave in 483199.
Comment 12•15 years ago
|
||
(In reply to comment #10)
> Aki had to reboot slave in 483199.
Actually, that bug tracks a few reboots. Aki first rebooted try-linux-slave15, and try-linux-slave19, and then had to come back and reboot try-linux-slave19 again 2 days later.
Comment 13•15 years ago
|
||
There are several errors noted below that were fixed by simply rebooting.
I'd advocate we have build machines reboot themselves after each build/unittest job completes, just like the talos machines already do... unless there are any remaining objections?
Comment 14•15 years ago
|
||
(In reply to comment #13)
> There are several errors noted below that were fixed by simply rebooting.
>
> I'd advocate we have build machines reboot themselves after each build/unittest
> job completes, just like the talos machines already do... unless there are any
> remaining objections?
You'll be happy to know that you'll get no more objections to this from me.
Comment 15•15 years ago
|
||
Un-futuring, taking.
Looks like we have staging set to reboot every 5 builds already; any objections to setting production to do the same?
Assignee: nobody → aki
Component: Release Engineering: Future → Release Engineering
Comment 16•15 years ago
|
||
I *think* we can also add buildsBeforeReboot to TryBuildFactory instances, since it inherits MercurialBuildFactory, but I'd prefer to verify.
Attachment #399374 -
Flags: review?(bhearsum)
Comment 17•15 years ago
|
||
Comment on attachment 399374 [details] [diff] [review]
set all builds_before_reboot to 5 in config.py
I know staging has been running with a reboot every 5 builds but I think we may as well go with a reboot per build - what do you think?
Comment 18•15 years ago
|
||
(In reply to comment #17)
> (From update of attachment 399374 [details] [diff] [review])
> I know staging has been running with a reboot every 5 builds but I think we may
> as well go with a reboot per build - what do you think?
A reboot per build on staging or in production? I would find it very annoying to have a reboot per build in staging, since I'm often working with only one or two slaves.
Comment 19•15 years ago
|
||
(In reply to comment #18)
> (In reply to comment #17)
> > (From update of attachment 399374 [details] [diff] [review] [details])
> > I know staging has been running with a reboot every 5 builds but I think we may
> > as well go with a reboot per build - what do you think?
>
> A reboot per build on staging or in production? I would find it very annoying
> to have a reboot per build in staging, since I'm often working with only one or
> two slaves.
I was speaking about both. I'd be fine to leave it at 5 in staging though
Comment 20•15 years ago
|
||
Comment on attachment 399374 [details] [diff] [review]
set all builds_before_reboot to 5 in config.py
A couple more things on this:
we need to change the nagios PINGs to be like the Talos ones - only fail after many repeated attempts, otherwise it's going to super spammy
We might want to wait until the next downtime to land this, the failure case is pretty bad (we lose all our slaves)
Comment 21•15 years ago
|
||
(In reply to comment #17)
> (From update of attachment 399374 [details] [diff] [review])
> I know staging has been running with a reboot every 5 builds but I think we may
> as well go with a reboot per build - what do you think?
Sure. I'd want to try it in staging for a bit first though.
I'll need to revisit for reboots per, nagios settings, and try... testing will take a while. Do we want to land the every-5 patch before all that's ready, or hold off?
Comment 22•15 years ago
|
||
(In reply to comment #21)
> (In reply to comment #17)
> > (From update of attachment 399374 [details] [diff] [review] [details])
> > I know staging has been running with a reboot every 5 builds but I think we may
> > as well go with a reboot per build - what do you think?
>
> Sure. I'd want to try it in staging for a bit first though.
> I'll need to revisit for reboots per, nagios settings, and try... testing will
> take a while. Do we want to land the every-5 patch before all that's ready, or
> hold off?
yeah, that's probably a good idea. Can we do that in the downtime tomorrow?
Updated•15 years ago
|
Attachment #399374 -
Flags: review?(bhearsum) → review+
Comment 23•15 years ago
|
||
Comment on attachment 399374 [details] [diff] [review]
set all builds_before_reboot to 5 in config.py
this is fine, to start with
Comment 24•15 years ago
|
||
Comment on attachment 399374 [details] [diff] [review]
set all builds_before_reboot to 5 in config.py
revision 1488:4919ef2ebb65
Attachment #399374 -
Flags: checked-in+
Comment 25•15 years ago
|
||
This should hopefully eat the twistd exceptions on disconnect, much like the Talos version.
Attachment #399929 -
Flags: review?(nthomas)
Updated•15 years ago
|
Attachment #399929 -
Flags: review?(nthomas) → review+
Comment 26•15 years ago
|
||
Comment on attachment 399929 [details] [diff] [review]
use DisconnectStep like talos for maybe_reboot
revision 405:b094fdd16946
Attachment #399929 -
Flags: checked-in+
Comment 27•15 years ago
|
||
Now that the ping checks are changed, we can do this without making Nagios go nuts.
Attachment #402874 -
Flags: review?(aki)
Updated•15 years ago
|
Attachment #402874 -
Flags: review?(aki) → review+
Comment 28•15 years ago
|
||
Comment on attachment 402874 [details] [diff] [review]
reboot after every build (excluding l10n)
we're also missing try, but this patch is needed for the production side. r=me
Comment 29•15 years ago
|
||
I'm putting together a patch for the Try Server right now, too.
Attachment #402874 -
Attachment is obsolete: true
Attachment #404612 -
Flags: review?(catlee)
Updated•15 years ago
|
Attachment #404612 -
Flags: review?(catlee) → review+
Comment 30•15 years ago
|
||
Comment on attachment 404612 [details] [diff] [review]
reboot after every build (excluding l10n and staging)
too bad this wasn't a global setting per platform...
Comment 31•15 years ago
|
||
Comment on attachment 404612 [details] [diff] [review]
reboot after every build (excluding l10n and staging)
changeset: 1578:0075aeed9bcc
Attachment #404612 -
Flags: checked-in+
Comment 32•15 years ago
|
||
Still to do here: reboots on try, and reboots after mobile builds. I have patches in progress, probably don't have time to finish them today. Maybe tomorrow.
Comment 34•15 years ago
|
||
We should get this for free on try when bug 520227 is done.
Whiteboard: [mobile][try] need to do try, mobile builds
Comment 35•15 years ago
|
||
Try is hopefully free.
I had updated bhearsum's original patch to work in mobile land; tested a maemo build and it worked.
Patch 1/2
Attachment #421191 -
Flags: review?(catlee)
Comment 36•15 years ago
|
||
Adds the appropriate steps to the mobile builds -- desktop, maemo, winmo.
The Maemo build factory has 2 locations where it's added -- in the init() and addMultiLocaleSteps (since we drop all build steps before the multi-locale repacks during multi-locale builds).
Attachment #421192 -
Flags: review?(catlee)
Updated•15 years ago
|
Attachment #421191 -
Flags: review?(catlee) → review+
Comment 37•15 years ago
|
||
Comment on attachment 421192 [details] [diff] [review]
reboot mobile builds - custom
>@@ -4425,16 +4427,19 @@ class MaemoBuildFactory(MobileBuildFacto
> self.addUploadSteps(platform='linux')
>
> if self.triggerBuilds:
> self.addTriggeredBuildsSteps()
>
> if self.multiLocale:
> self.nonMultiLocaleStepsLength = len(self.steps)
>
>+ if self.buildsBeforeReboot and self.buildsBeforeReboot > 0:
>+ self.addPeriodicRebootSteps()
>+
> def newBuild(self, requests):
> if self.multiLocale:
> self.addMultiLocaleSteps(requests, 'locales')
> return BuildFactory.newBuild(self, requests)
>
> def addPreCleanSteps(self):
> self.addStep(ShellCommand,
> name='rm_logfile',
>@@ -4587,16 +4592,18 @@ class MaemoBuildFactory(MobileBuildFacto
> self.checkOutLocale(locale)
> self.compareLocales(locale)
> self.addChromeLocale(locale)
> # Let's package the multi-locale build and upload it
> self.addPackageSteps(multiLocale=True, packageXulrunner=True)
> self.packageGlob="mobile/dist/fennec*.tar.bz2 mobile/mobile/fennec*.deb " + \
> "xulrunner/dist/*.tar.bz2 xulrunner/xulrunner/xulrunner*.deb"
> self.uploadMulti()
>+ if self.buildsBeforeReboot and self.buildsBeforeReboot > 0:
>+ self.addPeriodicRebootSteps()
>
> def compareLocalesSetup(self):
> self.addStep(ShellCommand,
> name='rm_compare_locales',
> command=['rm', '-rf', 'compare-locales'],
> description=['remove', 'compare-locales'],
> workdir=self.baseWorkDir,
> haltOnFailure=True
I think you're adding the periodic reboot steps in here twice...is that on purpose? One gets called via newBuild -> addMultiLocateSteps, and one gets called via __init__.
Comment 38•15 years ago
|
||
(In reply to comment #37)
> I think you're adding the periodic reboot steps in here twice...is that on
> purpose? One gets called via newBuild -> addMultiLocateSteps, and one gets
> called via __init__.
That's correct, see comment 36 =)
Updated•15 years ago
|
Attachment #421192 -
Flags: review?(catlee) → review+
Updated•15 years ago
|
Attachment #421192 -
Flags: checked-in+
Comment 40•15 years ago
|
||
Comment on attachment 421192 [details] [diff] [review]
reboot mobile builds - custom
http://hg.mozilla.org/build/buildbotcustom/rev/5fb543a6f438
Comment 41•15 years ago
|
||
Comment on attachment 421191 [details] [diff] [review]
reboot mobile builds - configs
http://hg.mozilla.org/build/buildbot-configs/rev/b360aa505a84
Attachment #421191 -
Flags: checked-in+
Updated•15 years ago
|
Whiteboard: [mobile][try] need to do try, mobile builds → [try] need to do try builds
Comment 43•15 years ago
|
||
Try is going to get fixed as part of bug 520227, which is the only part of this bug left.
Status: REOPENED → RESOLVED
Closed: 16 years ago → 15 years ago
Resolution: --- → FIXED
Assignee | ||
Updated•11 years ago
|
Product: mozilla.org → Release Engineering
You need to log in
before you can comment on or make changes to this bug.
Description
•