Closed Bug 417887 Opened 12 years ago Closed 11 years ago
linux buildbot masters/slaves should reboot ready for use
740 bytes, text/plain
3.98 KB, text/plain
3.66 KB, patch
13.66 KB, patch
No buildbot master or slave should require any extra prodding after a reboot. Masters should always start up everything needed on boot. For slaves, buildbot should be able to launch apps that talk to the GUI. This is done differently depending on the OS.
12 years ago
Priority: -- → P3
During 3.0beta3 and again now in 3.0beta4, AliveTest failed because the X server was refusing connections. This was fixed by running "xhost +", but that should be done automatically on reboot.
...
----------- Output from Profile Creation -------------
Xlib: connection to ":0.0" refused by server
...
Note: for 3.0beta4, this happened on both Mac and Linux, so it needs to be fixed on both.
12 years ago
Component: Build & Release → Release Engineering: Projects
QA Contact: build → release
Splitting out win32 specific details in bug#428123. Splitting out mac specific details in bug#428124.
Summary: all buildbot masters/slaves should reboot ready for use → linux buildbot masters/slaves should reboot ready for use
We chatted a bunch about this today and decided that part of this will be doing scheduled, periodic reboots of staging machines both to iron out kinks in the rebooting and to look for potential performance gains. I plan to start work on this next week.
Status: NEW → ASSIGNED
Component: Release Engineering: Future → Release Engineering
I did some initial work on this today. Thanks to catlee already setting up a cronjob to make sure Xvnc and metacity are running, it seems the only thing left to do here is start Buildbot on boot. This can be done by dropping the two files I'm about to attach into /etc/default/buildbot and /etc/init.d/buildbot respectively, and then doing the following:
1. Ensure /builds/slave is the buildslave directory, or is symlinked to it.
2. Run 'chkconfig --add buildbot' as root.
3. Reboot to test it.
Before going ahead and deploying it, I want to get periodic reboots of the staging Linux slaves going to make sure everything comes up okay.
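For readers without access to the attachments, here is a rough sketch of the general shape such an init script could take. It is not the script attached to this bug: the chkconfig priorities, the BUILDBOT_USER and BUILDBOT_DIR variable names, and both helper functions are assumptions made for illustration. Only the cltbld account, the /builds/slave directory, and the two file locations come from this bug.

```shell
#!/bin/sh
# chkconfig: 2345 99 01
# description: Starts the Buildbot slave on boot (illustrative sketch only).

# Assumed defaults; /etc/default/buildbot may override them if it exists.
BUILDBOT_USER=${BUILDBOT_USER:-cltbld}
BUILDBOT_DIR=${BUILDBOT_DIR:-/builds/slave}
if [ -r /etc/default/buildbot ]; then
    . /etc/default/buildbot
fi

# Print the buildbot invocation for an action (start, stop, or restart).
buildbot_command() {
    printf 'buildbot %s %s\n' "$1" "$BUILDBOT_DIR"
}

# Run the requested action as the unprivileged build user.
run_action() {
    case "$1" in
        start|stop|restart)
            su - "$BUILDBOT_USER" -c "$(buildbot_command "$1")"
            ;;
        *)
            echo "Usage: $0 {start|stop|restart}" >&2
            return 1
            ;;
    esac
}

# Act only when an action was passed, so the file can also be sourced
# for inspection without starting or stopping anything.
if [ "$#" -gt 0 ]; then
    run_action "$1"
fi
```

Running the slave through su rather than as root is a design choice in this sketch, so that files created by builds stay owned by the build account.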
As per the meeting yesterday I'm going to investigate the difficulty and time consumption of running fsck on every reboot to help avoid cases where we burn builds after an ESX/storage problem.
(In reply to comment #8)
> As per the meeting yesterday I'm going to investigate the difficulty and time
> consumption of running fsck on every reboot to help avoid cases where we burn
> builds after an ESX/storage problem.

As it turns out, it's pretty easy to do this, but it doesn't solve the problem we're trying to solve. When an ESX host or storage array goes down while a build is linking, we end up with a corrupt object file. The hope was that fscking would fix this, but it does not. Given that, there's no point in forcing fscks on every boot. We can, however, solve the burning-after-problems issue by using the clobberer webpage to force clobbers before we start slaves back up.

So, two things left to do here:
1) Set up periodic reboots of Linux machines in staging to help ensure they always come up clean.
2) Deploy to production / the rest of staging.
Priority: P2 → P3
Copy and paste instructions:
cd /etc/default
wget -Obuildbot 'https://bugzilla.mozilla.org/attachment.cgi?id=359078'
cd /etc/init.d
wget -Obuildbot 'https://bugzilla.mozilla.org/attachment.cgi?id=359080'
chmod +x buildbot
chkconfig --add buildbot
/etc/init.d/buildbot start
I just tested a periodic reboot patch on staging-master. I had to add the following line to sudoers to make it work, but after that it worked great:
cltbld ALL=NOPASSWD: /usr/bin/reboot
Patches to come.
This is basically a copy of the Talos code. The Build/Unittest factories end up calling the addPeriodicRebootSteps function since it has to be run at the end of a build.
This patch enables reboots on Linux staging slaves. I've set up moz2-linux-slave03, 04, and 17 with the proper sudoers file and tested them all - they should all come up OK. I also wanted to get the code in for production so it's just a flip-of-the-switch if we ever care to do it. To be clear: I'm not planning to enable periodic reboots in production as part of this bug.
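The idea of rebooting a slave after a fixed number of builds can be sketched roughly as follows. This is not the code from the attached patches: the should_reboot function, the counter file, and the threshold argument are hypothetical names chosen for illustration, and the sketch only decides whether a reboot is due rather than performing one.

```shell
#!/bin/sh
# Hypothetical helper: bump a per-slave build counter and report whether
# a reboot is due. Returns 0 (reboot now) or 1 (keep going).
#   $1 = path to the counter file (assumed location, created on demand)
#   $2 = number of builds to run between reboots
should_reboot() {
    count_file=$1
    builds_before_reboot=$2

    count=0
    if [ -f "$count_file" ]; then
        count=$(cat "$count_file")
    fi
    count=$(( ${count:-0} + 1 ))

    if [ "$count" -ge "$builds_before_reboot" ]; then
        # Reset the counter so the cycle starts over after the reboot.
        echo 0 > "$count_file"
        return 0
    fi

    echo "$count" > "$count_file"
    return 1
}
```

A final build step could then run something like `should_reboot /builds/slave/reboot_count.txt 5 && sudo reboot` (the file path and the threshold of 5 are made up here), which depends on the cltbld sudoers entry for /usr/bin/reboot described in this bug.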
Here are the complete instructions for rolling this out, when we do it:
cd /etc/default
wget -Obuildbot 'https://bugzilla.mozilla.org/attachment.cgi?id=359078'
cd /etc/init.d
wget -Obuildbot 'https://bugzilla.mozilla.org/attachment.cgi?id=359080'
chmod +x buildbot
chkconfig --add buildbot
/etc/init.d/buildbot start
sudo -e /etc/sudoers
# Add 'cltbld ALL=NOPASSWD: /usr/bin/reboot' as the last line, save and exit

After that's all done, it's a good idea to make sure cltbld can run 'reboot', and to reboot the slave to make sure it comes up clean. Take care to watch for it to connect to the Buildbot master:
su - cltbld
sudo reboot
Comment on attachment 360134
master-side patch

Looks good. Did you intend to enable periodic rebooting for linux on production?
Attachment #360134 - Flags: review?(catlee) → review+
(In reply to comment #17)
> (From update of attachment 360134)
> Looks good. Did you intend to enable periodic rebooting for linux on
> production?

Whoops, no, not at all.
We're going to deploy this on Monday.
Comment on attachment 360319
master side patch, disable reboots in production

changeset: 933:37cbc1167a03
Attachment #360319 - Flags: checked‑in+
Comment on attachment 360132
periodic reboots, buildsBeforeReboot/doPeriodicReboots combined; use alwaysRun

changeset: 189:f716fe07a806
Attachment #360132 - Flags: checked‑in+
Alright, this got landed this morning. Only thing left to do here is to deploy these changes to the ref platform.
Ref platform updated. So, this has been deployed on the following slaves:
try-linux-slave01 - 06
moz2-linux-slave01 - 19
CentOS-5.0-ref-tools-vm
Status: ASSIGNED → RESOLVED
Closed: 11 years ago
Resolution: --- → FIXED
I'm re-opening this because I neglected to deploy this on our 1.8 and 1.9 machines, which is not as critical but still worthwhile to do.
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
Alright, the following VMs' Buildbot slaves/masters should start on boot now:
fx-linux-1.9-slave1, 2, 03, 04, 07, 08, 09
production-crazyhorse
production-prometheus-vm
production-prometheus-vm02
staging-1.9-master
staging-prometheus-vm
production-1.8-master
production-1.9-master
production-master
staging-master
sm-try-master
sm-staging-try-master

The following VMs were off and are almost never used anymore:
staging-crazyhorse
staging-prometheus-vm02
I did not start them up to add the init script.
Status: REOPENED → RESOLVED
Closed: 11 years ago → 11 years ago
Resolution: --- → FIXED
Product: mozilla.org → Release Engineering