Closed Bug 417887 Opened 16 years ago Closed 15 years ago

linux buildbot masters/slaves should reboot ready for use

Tracking

(Not tracked)

Status:

RESOLVED FIXED

People

(Reporter: rhelmer, Assigned: bhearsum)

Attachments

(4 files, 3 obsolete files)

init script configuration file - /etc/default/buildbot | 16 years ago | bhearsum@mozilla.com (:bhearsum) | 740 bytes, text/plain
buildbot init script - /etc/init.d/buildbot | 16 years ago | bhearsum@mozilla.com (:bhearsum) | 3.98 KB, text/plain
reboot code for MozillaBuildFactory, off by default | 15 years ago | bhearsum@mozilla.com (:bhearsum) | 3.62 KB, patch
enable periodic reboots for linux staging, add code for production | 15 years ago | bhearsum@mozilla.com (:bhearsum) | 13.30 KB, patch
periodic reboots, buildsBeforeReboot/doPeriodicReboots combined; use alwaysRun | 15 years ago | bhearsum@mozilla.com (:bhearsum) | 3.66 KB, patch | catlee : review+, bhearsum : checked-in+
master-side patch | 15 years ago | bhearsum@mozilla.com (:bhearsum) | 13.65 KB, patch | catlee : review+
master side patch, disable reboots in production | 15 years ago | bhearsum@mozilla.com (:bhearsum) | 13.66 KB, patch | bhearsum : checked-in+

Robert Helmer [:rhelmer]

Reporter

Description

•

16 years ago

All of the buildbot masters and slaves should not require any extra prodding after a reboot. Masters should always start up everything needed on boot.

For slaves, buildbot should be able to launch apps that talk to the GUI. This is done differently depending on the OS.

John O'Duinn [:joduinn] (please use "needinfo?" flag)

Updated

•

16 years ago

Priority: -- → P3

John O'Duinn [:joduinn] (please use "needinfo?" flag)

Comment 1

•

16 years ago

During 3.0beta3, and again now in 3.0beta4, AliveTest failed because the X server was refusing connections. This was fixed by running "xhost +", but it should be done automatically on reboot.
...
----------- Output from Profile Creation ------------- 
  Xlib: connection to ":0.0" refused by server
...

John O'Duinn [:joduinn] (please use "needinfo?" flag)

Comment 2

•

16 years ago

note: for 3.0beta4, this happened on both mac and linux, so needs to be fixed on both.

John O'Duinn [:joduinn] (please use "needinfo?" flag)

Updated

•

16 years ago

Component: Build & Release → Release Engineering: Projects

QA Contact: build → release

John O'Duinn [:joduinn] (please use "needinfo?" flag)

Comment 3

•

16 years ago

Splitting out win32 specific details in bug#428123.
Splitting out mac specific details in bug#428124.

Summary: all buildbot masters/slaves should reboot ready for use → linux buildbot masters/slaves should reboot ready for use

bhearsum@mozilla.com (:bhearsum)

Assignee

Updated

•

16 years ago

Blocks: 472517

bhearsum@mozilla.com (:bhearsum)

Assignee

Comment 4

•

16 years ago

We chatted a bunch about this today and decided that part of this will be doing scheduled, periodic reboots of staging machines both to iron out kinks in the rebooting and to look for potential performance gains.

I plan to start work on this next week.

Status: NEW → ASSIGNED

Component: Release Engineering: Future → Release Engineering

bhearsum@mozilla.com (:bhearsum)

Assignee

Updated

•

16 years ago

Assignee: nobody → bhearsum

bhearsum@mozilla.com (:bhearsum)

Assignee

Comment 5

•

16 years ago

I did some initial work on this today. Since catlee has already set up a cronjob to make sure Xvnc and metacity are running, it seems the only thing left to do here is start Buildbot on boot. This can be done by dropping the two files I'm about to attach into /etc/default/buildbot and /etc/init.d/buildbot respectively, and then doing the following:
1. Ensure /builds/slave is the buildslave directory, or is symlinked to it
2. Run 'chkconfig --add buildbot' as root.
3. Reboot to test it.
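
The path requirements in the steps above can be sanity-checked before rebooting. What follows is a minimal Python sketch and not part of the attached files: the function name `check_slave_boot_setup` and its `root` parameter (there so the check can be pointed at a scratch directory) are assumptions made for illustration. It only looks at the three paths named in this comment and does not verify the chkconfig registration.

```python
import os


def check_slave_boot_setup(root="/"):
    """Return a list of problems found; an empty list means the paths look OK.

    Hypothetical helper: it checks only the three paths mentioned in this
    comment, resolved relative to `root`.
    """
    problems = []

    slave_dir = os.path.join(root, "builds/slave")
    defaults = os.path.join(root, "etc/default/buildbot")
    init_script = os.path.join(root, "etc/init.d/buildbot")

    # os.path.isdir follows symlinks, so a symlink to the real
    # buildslave directory passes this check too.
    if not os.path.isdir(slave_dir):
        problems.append("%s is not a directory or a symlink to one" % slave_dir)

    if not os.path.isfile(defaults):
        problems.append("%s is missing" % defaults)

    if not os.path.isfile(init_script):
        problems.append("%s is missing" % init_script)
    elif not os.access(init_script, os.X_OK):
        problems.append("%s is not executable" % init_script)

    return problems
```

Calling `check_slave_boot_setup()` with no arguments inspects the real filesystem; anything it returns should be fixed before step 3.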

Before going ahead and deploying it I want to get periodic reboots of the staging Linux slaves going to make sure everything comes up okay.

bhearsum@mozilla.com (:bhearsum)

Assignee

Comment 6

•

16 years ago

Attached file init script configuration file - /etc/default/buildbot — Details

bhearsum@mozilla.com (:bhearsum)

Assignee

Comment 7

•

16 years ago

Attached file buildbot init script - /etc/init.d/buildbot — Details

bhearsum@mozilla.com (:bhearsum)

Assignee

Comment 8

•

16 years ago

As per the meeting yesterday I'm going to investigate the difficulty and time consumption of running fsck on every reboot to help avoid cases where we burn builds after an ESX/storage problem.

bhearsum@mozilla.com (:bhearsum)

Assignee

Updated

•

16 years ago

Priority: P3 → P2

bhearsum@mozilla.com (:bhearsum)

Assignee

Comment 9

•

15 years ago

(In reply to comment #8)
> As per the meeting yesterday I'm going to investigate the difficulty and time
> consumption of running fsck on every reboot to help avoid cases where we burn
> builds after an ESX/storage problem.

As it turns out, it's pretty easy to do this, but it doesn't solve the problem we're trying to solve. When an ESX host or storage array goes down while a build is linking, we end up with a corrupt object file. The hope was that running fsck would fix this, but it does not. Given that, there's no point in forcing fscks on every boot.

We can solve the burning-after-problems issue by using the clobberer webpage to force clobbers before we start slaves back up, however.

So, two things left to do here:
1) Set up periodic reboots of Linux machines in staging to help ensure they always come up clean.
2) Deploy to production / the rest of staging.

Priority: P2 → P3

bhearsum@mozilla.com (:bhearsum)

Assignee

Updated

•

15 years ago

Priority: P3 → P2

bhearsum@mozilla.com (:bhearsum)

Assignee

Comment 10

•

15 years ago

Copy and paste instructions:
cd /etc/default
wget -Obuildbot https://bugzilla.mozilla.org/attachment.cgi?id=359078
cd /etc/init.d
wget -Obuildbot https://bugzilla.mozilla.org/attachment.cgi?id=359080
chmod +x buildbot
chkconfig --add buildbot
/etc/init.d/buildbot start

bhearsum@mozilla.com (:bhearsum)

Assignee

Comment 11

•

15 years ago

I just tested a periodic reboot patch on staging-master. I had to add the following line to sudoers to make it work, but after that it worked great:
cltbld    ALL=NOPASSWD:   /usr/bin/reboot

Patches to come.
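
When rolling this out to many slaves, it can help to confirm that the sudoers rule is actually present. The sketch below is a deliberately simplified Python check, not a sudoers parser: the function name `has_nopasswd_rule` is made up for illustration, and it only recognises single-line rules shaped like the one quoted above (no aliases, groups, includes, or line continuations). `visudo -c` remains the tool to use for checking sudoers syntax.

```python
import re


def has_nopasswd_rule(sudoers_text, user, command):
    """Return True if some line grants `user` passwordless use of `command`.

    Simplified on purpose: only single-line rules of the form
    'user HOST = NOPASSWD: cmd[, cmd...]' are recognised.
    """
    pattern = re.compile(r"^(\S+)\s+\S+\s*=\s*NOPASSWD:\s*(.+)$")
    for raw_line in sudoers_text.splitlines():
        # Treat everything after '#' as a comment (an assumption that
        # ignores '#include' and '#uid' forms).
        line = raw_line.split("#", 1)[0].strip()
        match = pattern.match(line)
        if not match or match.group(1) != user:
            continue
        commands = [c.strip() for c in match.group(2).split(",")]
        if command in commands:
            return True
    return False
```

For example, passing the contents of /etc/sudoers (read as root) with `user="cltbld"` and `command="/usr/bin/reboot"` should return True once the line above has been added.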

bhearsum@mozilla.com (:bhearsum)

Assignee

Comment 12

•

15 years ago

Attached patch reboot code for MozillaBuildFactory, off by default (obsolete) — Details — Splinter Review

This is basically a copy of the Talos code. The Build/Unittest factories end up calling the addPeriodicRebootSteps function since it has to be run at the end of a build.
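
The patch itself is not reproduced in this thread, so the following is only a sketch of the general idea behind a periodic reboot at the end of a build: keep a count of finished builds and reboot once a threshold is reached. It is plain Python rather than Buildbot code, and every name in it (`count_and_maybe_reboot`, `count_file`, `builds_before_reboot`, the injectable `reboot` callable) is an assumption for illustration. The default action runs `sudo reboot`, which relies on the sudoers rule from comment 11.

```python
import subprocess


def count_and_maybe_reboot(count_file, builds_before_reboot,
                           reboot=lambda: subprocess.call(["sudo", "reboot"])):
    """Increment the build count in `count_file`; reboot when the limit is hit.

    Returns True if a reboot was requested. All names here are illustrative.
    """
    try:
        with open(count_file) as f:
            count = int(f.read().strip() or 0)
    except (IOError, ValueError):
        count = 0  # missing or corrupt file: start counting again

    count += 1
    due = count >= builds_before_reboot

    # Reset the counter before rebooting so the machine does not come
    # back up and immediately reboot again.
    with open(count_file, "w") as f:
        f.write("%d\n" % (0 if due else count))

    if due:
        reboot()
    return due
```

Passing a different `reboot` callable makes it possible to exercise the counting logic without restarting anything.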

Attachment #360107 - Flags: review?(catlee)

bhearsum@mozilla.com (:bhearsum)

Assignee

Comment 13

•

15 years ago

Attached patch enable periodic reboots for linux staging, add code for production (obsolete) — Details — Splinter Review

This patch enables reboots on Linux staging slaves. I've set up moz2-linux-slave03, 04, and 17 with the proper sudoers file and tested them all - they should all come up OK.

I also wanted to get the code in for production so it's just a flip-of-the-switch if we ever care to do it. To be clear: I'm not planning to enable periodic reboots in production as part of this bug.

Attachment #360108 - Flags: review?(catlee)

bhearsum@mozilla.com (:bhearsum)

Assignee

Comment 14

•

15 years ago

Attached patch periodic reboots, buildsBeforeReboot/doPeriodicReboots combined; use alwaysRun — Details — Splinter Review
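
The patch title points at two design choices: a single buildsBeforeReboot setting in place of a separate doPeriodicReboots switch, and marking the reboot step alwaysRun so it still runs when an earlier step has failed (in Buildbot, an alwaysRun step runs even after a failed step has halted the build). The toy runner below illustrates both without using Buildbot at all. Treating `builds_before_reboot=None` as "reboots disabled" is an assumption about how the combined setting behaves, and the toy halts on every failure, which is stricter than Buildbot's per-step haltOnFailure.

```python
def run_steps(steps):
    """Run (name, func, always_run) steps in order.

    After the first failing step, only steps flagged always_run still run.
    Returns the names of the steps that were executed.
    """
    executed = []
    failed = False
    for name, func, always_run in steps:
        if failed and not always_run:
            continue
        executed.append(name)
        if not func():
            failed = True
    return executed


def build_steps(compile_ok, builds_before_reboot=None):
    """Assemble a toy build; builds_before_reboot=None means no reboot step."""
    steps = [
        ("compile", lambda: compile_ok, False),
        ("upload", lambda: True, False),
    ]
    if builds_before_reboot is not None:
        # The reboot check is the last step and is flagged always_run.
        steps.append(("maybe_reboot", lambda: True, True))
    return steps
```

With a failing compile, `run_steps(build_steps(False, 5))` skips the upload but still reaches the reboot check, which is the point of alwaysRun: a broken build does not stop the machine from being recycled.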

Attachment #360107 - Attachment is obsolete: true

Attachment #360132 - Flags: review?(catlee)

Attachment #360107 - Flags: review?(catlee)

bhearsum@mozilla.com (:bhearsum)

Assignee

Comment 15

•

15 years ago

Attached patch master-side patch (obsolete) — Details — Splinter Review

Attachment #360108 - Attachment is obsolete: true

Attachment #360134 - Flags: review?(catlee)

Attachment #360108 - Flags: review?(catlee)

bhearsum@mozilla.com (:bhearsum)

Assignee

Comment 16

•

15 years ago

Here are complete instructions for rolling this out, when we do it:
cd /etc/default
wget -Obuildbot https://bugzilla.mozilla.org/attachment.cgi?id=359078
cd /etc/init.d
wget -Obuildbot https://bugzilla.mozilla.org/attachment.cgi?id=359080
chmod +x buildbot
chkconfig --add buildbot
/etc/init.d/buildbot start
sudo -e /etc/sudoers
# Add 'cltbld ALL=NOPASSWD: /usr/bin/reboot' as the last line, save and exit

After that's all done, it's a good idea to confirm that cltbld can run 'reboot', and to reboot the slave to make sure it comes up clean. Take care to watch for it to reconnect to the Buildbot master:
su - cltbld
sudo reboot
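
One way to tell from another machine that the slave has come back is to poll a TCP port on it until it accepts connections. The Python sketch below does only that; the function name `wait_for_port` and its default timings are assumptions, and a successful connection (to SSH on port 22, say) shows that the host is up, not that the slave has attached to the Buildbot master, which still needs checking on the master.

```python
import socket
import time


def wait_for_port(host, port, timeout=300.0, interval=5.0):
    """Poll host:port until a TCP connection succeeds or `timeout` elapses."""
    deadline = time.monotonic() + timeout
    while True:
        try:
            conn = socket.create_connection((host, port), timeout=interval)
        except OSError:
            pass  # refused, unreachable, or timed out: try again
        else:
            conn.close()
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval)
```

For example, `wait_for_port("moz2-linux-slave03", 22)` would return True once that slave accepts SSH connections again, or False after five minutes.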

Chris AtLee [:catlee]

Updated

•

15 years ago

Attachment #360132 - Flags: review?(catlee) → review+

Chris AtLee [:catlee]

Comment 17

•

15 years ago

Comment on attachment 360134 [details] [diff] [review]
master-side patch

Looks good.  Did you intend to enable periodic rebooting for linux on production?

Attachment #360134 - Flags: review?(catlee) → review+

bhearsum@mozilla.com (:bhearsum)

Assignee

Comment 18

•

15 years ago

(In reply to comment #17)
> (From update of attachment 360134 [details] [diff] [review])
> Looks good.  Did you intend to enable periodic rebooting for linux on
> production?

Whoops, no, not at all.

bhearsum@mozilla.com (:bhearsum)

Assignee

Comment 19

•

15 years ago

Attached patch master side patch, disable reboots in production — Details — Splinter Review

Attachment #360134 - Attachment is obsolete: true

bhearsum@mozilla.com (:bhearsum)

Assignee

Comment 20

•

15 years ago

We're going to deploy this on Monday.

bhearsum@mozilla.com (:bhearsum)

Assignee

Comment 21

•

15 years ago

Comment on attachment 360319 [details] [diff] [review]
master side patch, disable reboots in production

changeset:   933:37cbc1167a03

Attachment #360319 - Flags: checked‑in+

bhearsum@mozilla.com (:bhearsum)

Assignee

Comment 22

•

15 years ago

Comment on attachment 360132 [details] [diff] [review]
periodic reboots, buildsBeforeReboot/doPeriodicReboots combined; use alwaysRun

changeset:   189:f716fe07a806

Attachment #360132 - Flags: checked‑in+

bhearsum@mozilla.com (:bhearsum)

Assignee

Comment 23

•

15 years ago

Alright, this landed this morning. The only thing left to do here is to deploy these changes to the ref platform.

bhearsum@mozilla.com (:bhearsum)

Assignee

Comment 24

•

15 years ago

Ref platform updated. So, this has been deployed on the following slaves:
try-linux-slave01 - 06
moz2-linux-slave01 - 19
CentOS-5.0-ref-tools-vm

Status: ASSIGNED → RESOLVED

Closed: 15 years ago

Resolution: --- → FIXED

bhearsum@mozilla.com (:bhearsum)

Assignee

Comment 25

•

15 years ago

I'm re-opening this because I neglected to deploy this on our 1.8 and 1.9 machines - which is not as critical but still worthwhile to do.

Status: RESOLVED → REOPENED

Resolution: FIXED → ---

bhearsum@mozilla.com (:bhearsum)

Assignee

Comment 26

•

15 years ago

Alright, the following VMs' Buildbot slaves/masters should start on boot now:
fx-linux-1.9-slave1, 2, 03, 04, 07, 08, 09
production-crazyhorse
production-prometheus-vm
production-prometheus-vm02
staging-1.9-master
staging-prometheus-vm
production-1.8-master
production-1.9-master
production-master
staging-master
sm-try-master
sm-staging-try-master

The following VMs were off and are almost never used anymore:
staging-crazyhorse
staging-prometheus-vm02

I did not start them up to add the init script.

Status: REOPENED → RESOLVED

Closed: 15 years ago → 15 years ago

Resolution: --- → FIXED

Nobody; OK to take it and work on it

Updated

•

11 years ago

Product: mozilla.org → Release Engineering
