Closed Bug 417887 Opened 12 years ago Closed 11 years ago

linux buildbot masters/slaves should reboot ready for use

Categories

(Release Engineering :: General, defect, P2)

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: rhelmer, Assigned: bhearsum)

Attachments

(4 files, 3 obsolete files)

All of the buildbot masters and slaves should not require any extra prodding after a reboot. Masters should always start up everything needed on boot.

For slaves, buildbot should be able to launch apps that talk to the GUI. This is done differently depending on the OS.
During 3.0beta3 and again now in 3.0beta4, AliveTest failed because the X server was refusing connections. Running "xhost +" fixed it, but this should be done automatically on reboot.
...
----------- Output from Profile Creation ------------- 
  Xlib: connection to ":0.0" refused by server
...
Note: for 3.0beta4, this happened on both Mac and Linux, so it needs to be fixed on both.
Component: Build & Release → Release Engineering: Projects
QA Contact: build → release
Splitting out win32 specific details in bug#428123.
Splitting out mac specific details in bug#428124.
Summary: all buildbot masters/slaves should reboot ready for use → linux buildbot masters/slaves should reboot ready for use
Blocks: 472517
We chatted a bunch about this today and decided that part of this will be doing scheduled, periodic reboots of staging machines both to iron out kinks in the rebooting and to look for potential performance gains.

I plan to start work on this next week.
Status: NEW → ASSIGNED
Component: Release Engineering: Future → Release Engineering
Assignee: nobody → bhearsum
I did some initial work on this today. Since catlee has already set up a cronjob to make sure Xvnc and metacity are running, it seems the only thing left to do here is start Buildbot on boot. This can be done by dropping the two files I'm about to attach into /etc/default/buildbot and /etc/init.d/buildbot respectively, and then doing the following:
1. Ensure /builds/slave is the buildslave directory, or is symlinked to it
2. Run 'chkconfig --add buildbot' as root.
3. Reboot to test it.

Before going ahead and deploying it I want to get periodic reboots of the staging Linux slaves going to make sure everything comes up okay.
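
For readers without access to the attachments, here is a rough sketch of what an init script of this kind could look like. It is not the attached file: the variable names, the runlevel/priority numbers in the chkconfig header, and the DRY_RUN switch are all assumptions made for illustration. It only assumes that the slave lives in /builds/slave, runs as cltbld, and is controlled with 'buildbot start|stop|restart <basedir>'.

```shell
#!/bin/sh
# chkconfig: 2345 99 01
# description: Starts the Buildbot slave on boot (illustrative sketch only)

# Defaults; a real script would likely source overrides from
# /etc/default/buildbot. These variable names are hypothetical.
BUILDBOT_USER="${BUILDBOT_USER:-cltbld}"
BUILDBOT_BASEDIR="${BUILDBOT_BASEDIR:-/builds/slave}"

# Print the buildbot command line for the given action.
buildbot_command() {
    echo "buildbot $1 $BUILDBOT_BASEDIR"
}

# Run the action as the build user, or only print it when DRY_RUN=1.
buildbot_ctl() {
    cmd=$(buildbot_command "$1")
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "su - $BUILDBOT_USER -c '$cmd'"
    else
        su - "$BUILDBOT_USER" -c "$cmd"
    fi
}

# Dispatch on the action passed by init (start, stop or restart).
init_main() {
    case "$1" in
        start|stop|restart) buildbot_ctl "$1" ;;
        *) echo "Usage: $0 {start|stop|restart}" >&2; return 1 ;;
    esac
}

# A real init script would end with:  init_main "$1"
```

With DRY_RUN=1 set, 'init_main start' only prints the su command it would run, which makes it possible to check the script on a machine that has no cltbld user. The "# chkconfig:" header line is what 'chkconfig --add' reads to decide runlevels and start/stop ordering.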
As per the meeting yesterday I'm going to investigate the difficulty and time consumption of running fsck on every reboot to help avoid cases where we burn builds after an ESX/storage problem.
Priority: P3 → P2
(In reply to comment #8)
> As per the meeting yesterday I'm going to investigate the difficulty and time
> consumption of running fsck on every reboot to help avoid cases where we burn
> builds after an ESX/storage problem.

As it turns out, it's pretty easy to do this, but it doesn't solve the problem we're trying to solve. When an ESX host or storage array goes down while a build is linking, we end up with a corrupt object file. The hope was that fscking would fix this, but it does not. Given that, there's no point in forcing fscks on every boot.

We can solve the burning-after-problems issue by using the clobberer webpage to force clobbers before we start slaves back up, however.

So, two things left to do here:
1) Set-up periodic reboots of Linux machines in staging to help ensure they always come up clean.
2) Deploy to production / the rest of staging.
Priority: P2 → P3
Priority: P3 → P2
Copy and paste instructions:
cd /etc/default
wget -Obuildbot https://bugzilla.mozilla.org/attachment.cgi?id=359078
cd /etc/init.d
wget -Obuildbot https://bugzilla.mozilla.org/attachment.cgi?id=359080
chmod +x buildbot
chkconfig --add buildbot
/etc/init.d/buildbot start
I just tested a periodic reboot patch on staging-master. I had to add the following line to sudoers to make it work, but after that it worked great:
cltbld    ALL=NOPASSWD:   /usr/bin/reboot

Patches to come.
This is basically a copy of the Talos code. The Build/Unittest factories end up calling the addPeriodicRebootSteps function since it has to be run at the end of a build.
Attachment #360107 - Flags: review?(catlee)
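
The actual patch implements this as Buildbot steps on the master side and is not reproduced here. Purely to illustrate the "reboot after N builds" idea, a slave-side shell sketch might look like the following. The counter file path, the BUILDS_BEFORE_REBOOT variable, and the DRY_RUN switch are hypothetical; only 'sudo reboot' as cltbld comes from this bug.

```shell
# Hypothetical counter file and threshold; neither name comes from the patch.
COUNT_FILE="${COUNT_FILE:-/builds/slave/.build_count}"
BUILDS_BEFORE_REBOOT="${BUILDS_BEFORE_REBOOT:-5}"

# Increment the stored build count and print the new value.
bump_build_count() {
    count=0
    if [ -f "$COUNT_FILE" ]; then
        count=$(cat "$COUNT_FILE")
    fi
    count=$(( ${count:-0} + 1 ))
    echo "$count" > "$COUNT_FILE"
    echo "$count"
}

# Call at the end of every build: reboot once the threshold is reached.
maybe_reboot() {
    count=$(bump_build_count)
    if [ "$count" -ge "$BUILDS_BEFORE_REBOOT" ]; then
        # Reset the counter first so the slave starts fresh after boot.
        echo 0 > "$COUNT_FILE"
        if [ "${DRY_RUN:-0}" = "1" ]; then
            echo "would reboot"
        else
            sudo reboot
        fi
    fi
}
```

The check has to run as the last step of a build, whether or not the build succeeded, otherwise failed builds would never count toward the threshold.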
This patch enables reboots on Linux staging slaves. I've set up moz2-linux-slave03, 04, and 17 with the proper sudoers file and tested them all - they should all come up OK.

I also wanted to get the code in for production so it's just a flip-of-the-switch if we ever care to do it. To be clear: I'm not planning to enable periodic reboots in production as part of this bug.
Attachment #360108 - Flags: review?(catlee)
Attachment #360107 - Attachment is obsolete: true
Attachment #360132 - Flags: review?(catlee)
Attachment #360107 - Flags: review?(catlee)
Attached patch: master-side patch (obsolete)
Attachment #360108 - Attachment is obsolete: true
Attachment #360134 - Flags: review?(catlee)
Attachment #360108 - Flags: review?(catlee)
Here are the complete instructions for rolling this out, when we do it:
cd /etc/default
wget -Obuildbot https://bugzilla.mozilla.org/attachment.cgi?id=359078
cd /etc/init.d
wget -Obuildbot https://bugzilla.mozilla.org/attachment.cgi?id=359080
chmod +x buildbot
chkconfig --add buildbot
/etc/init.d/buildbot start
sudo -e /etc/sudoers
# Add 'cltbld ALL=NOPASSWD: /usr/bin/reboot' as the last line, save and exit

After that's all done it's a good idea to make sure cltbld can run 'reboot' and reboot the slave to make sure it comes up clean. Take care to watch for it to connect to the Buildbot master:
su - cltbld
sudo reboot
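
If the sudoers step above has to be repeated across many slaves, it could be scripted so that running it twice does not add the line twice. This is a sketch only: the ensure_rule helper is hypothetical, and it should be tried on a copy of the file rather than on /etc/sudoers directly.

```shell
# The rule from the instructions above.
RULE='cltbld ALL=NOPASSWD: /usr/bin/reboot'

# Append the rule to the given file only if that exact line is not
# already present (-x: whole-line match, -F: fixed string, -q: quiet).
ensure_rule() {
    file="$1"
    if ! grep -qxF "$RULE" "$file" 2>/dev/null; then
        echo "$RULE" >> "$file"
    fi
}
```

For example, 'ensure_rule /tmp/sudoers.new' on a copy, followed by 'visudo -c -f /tmp/sudoers.new' to check the syntax before putting the file in place, avoids locking yourself out of sudo with a malformed entry.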
Attachment #360132 - Flags: review?(catlee) → review+
Comment on attachment 360134
master-side patch

Looks good.  Did you intend to enable periodic rebooting for linux on production?
Attachment #360134 - Flags: review?(catlee) → review+
(In reply to comment #17)
> (From update of attachment 360134)
> Looks good.  Did you intend to enable periodic rebooting for linux on
> production?

Whoops, no, not at all.
We're going to deploy this on Monday.
Comment on attachment 360319
master side patch, disable reboots in production

changeset:   933:37cbc1167a03
Attachment #360319 - Flags: checked‑in+
Comment on attachment 360132
periodic reboots, buildsBeforeReboot/doPeriodicReboots combined; use alwaysRun

changeset:   189:f716fe07a806
Attachment #360132 - Flags: checked‑in+
Alright, this got landed this morning. Only thing left to do here is to deploy these changes to the ref platform.
Ref platform updated. So, this has been deployed on the following slaves:
try-linux-slave01 - 06
moz2-linux-slave01 - 19
CentOS-5.0-ref-tools-vm
Status: ASSIGNED → RESOLVED
Closed: 11 years ago
Resolution: --- → FIXED
I'm re-opening this because I neglected to deploy this on our 1.8 and 1.9 machines - which is not as critical but still worthwhile to do.
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
Alright, the following VMs' Buildbot slaves/masters should start on boot now:
fx-linux-1.9-slave1, 2, 03, 04, 07, 08, 09
production-crazyhorse
production-prometheus-vm
production-prometheus-vm02
staging-1.9-master
staging-prometheus-vm
production-1.8-master
production-1.9-master
production-master
staging-master
sm-try-master
sm-staging-try-master

The following VMs were off, and almost never used anymore:
staging-crazyhorse
staging-prometheus-vm02

I did not start them up to add the init script.
Status: REOPENED → RESOLVED
Closed: 11 years ago
Resolution: --- → FIXED
Product: mozilla.org → Release Engineering