Closed Bug 465868 Opened 12 years ago Closed 11 years ago

[Tracking bug] have one Buildbot master instance and pool of slaves produce all builds and unittests for moz2.

Categories

(Release Engineering :: General, defect, P2)

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: joduinn, Assigned: catlee)

Attachments

(7 files, 3 obsolete files)

20.55 KB, patch (bhearsum: review+, lsblakk: review+, bhearsum: checked-in+)
1.79 KB, patch (bhearsum: review+, lsblakk: review+, bhearsum: checked-in+)
9.82 KB, patch (catlee: review+, bhearsum: checked-in+)
3.94 KB, patch (catlee: review+, bhearsum: checked-in+)
8.41 KB, patch (catlee: review+, catlee: checked-in+)
2.65 KB, patch (catlee: review+, catlee: checked-in+)
3.50 KB, patch (bhearsum: review+, catlee: checked-in+)

Before:
- buildbot master instance for builds; one pool of slaves for builds, all on Build network
- buildbot master instance for unittests; a separate pool of slaves for unittests, all on QA network.


After:
- one buildbot master instance for both builds *and* unittests; one pool of slaves for both builds *and* unittests, all on the Build network.


Note: I specify "moz2" here because the different project branches there all use the same tool chain, so work on mozilla-central, tracemonkey, mozilla-1.9.1 and places can be shared across the one pool. Other active code lines which use *different* tool chains will require their own pool of slaves.

(It's been talked about since early summer 2008, and prepwork is covered in lots of bugs, and is even a Q4 goal, but I can't find a specific bug on it?! Hence, filing this - if you know of another preexisting bug, please close this as DUP)
Lots of work already done in Q3 to change machines to use same accounts, be on same network, etc. We already have one consolidated pool of slaves in staging m-c. Question is: What's still to do before we enable this for production m-c?

At the Toronto gathering last month, we pulled together these remaining steps, and tweaked the list a little more today:

- [+] create 4 new linux slaves (catlee)
- [+] windows slave: accept pskill license (lblakk)
- [+] talk with tracemonkey developers about running all unittests, not
      just "make check" (joduinn). They are all ok with that. Per test
      run last week, all is ok now.
- [+] update win32 refplatforms - moz_no_reset_path, set screen
      resolution (joduinn) bug#460535
- [+] stop including slavename in tinderbox column in staging (bhearsum)
- [+] consolidate master.cfg (bhearsum)
- [+] update existing win32 build slaves - screen resolution (lblakk)
- [+] change tracemonkey unittest HW (bm-win2k3-unittest-02-hw) to be
      VM (lblakk)
- [+] create 4 new win32 slaves, bug#460729 (joduinn)
- [+] put new slaves into staging, then production (catlee, joduinn)
- [ ] mac slave: audit /tools, xcode (lblakk)
- [ ] update linux refplatform - scratchbox,xvnc (joduinn)
- [ ] update existing linux build slaves - xvnc (lblakk)
- [ ] delete and recreate old unittest slaves (4 linux, 4 win32)
      (lblakk)
- [ ] move bm-xserve21 -> mozilla-central (lblakk)
- [ ] change tracemonkey unittest HW (bm-win2k3-unittest-02-hw) to be
      VM (lsblakk)
- [ ] upgrade bm-unittest01 to 10.5 (lblakk)
- [ ] stop including slavename in tinderbox column in production
      (bhearsum)
- [ ] bug#464079 - fix exception when doing reconfig in master (catlee)
- [ ] reimage the staging xserve, use in unittest production as interim
      step (lsblakk)
- [ ] moz2-win32-slave15->18 need utils added (lsblakk)
- [ ] verify new linux slaves have firefox profile initialised (lsblakk)
- [ ] consolidate slave setup doc for build/unittest slaves. Update ref
      images if needed. (lsblakk,
      joduinn)
- [ ] write production consolidation patches (lsblakk)
- [ ] delete staging buildbot master for unittests


Does this look right, or did I miss anything?
The macs are all new images and are up and running successfully
> - [+] mac slave: audit /tools, xcode (lblakk)

The existing linux slaves have all been updated by having firefox opened once, to create a default profile, and by having the slaves started with DISPLAY=:2
> - [+] update existing linux build slaves - xvnc (lblakk)

> - [ ] delete and recreate old unittest slaves (4 linux, 4 win32)
>       (lblakk)
> - [ ] move bm-xserve21 -> mozilla-central (lblakk)

The tracemonkey HW box is closed off and the win32 tracemonkey tests are now being run on a VM
> - [+] change tracemonkey unittest HW (bm-win2k3-unittest-02-hw) to be
>       VM (lsblakk)

There is a bug filed to re-image bm-xserve-unittest01 (and to rename it to bm-xserve22)
> - [X] upgrade bm-unittest01 to 10.5 (lblakk) -- Let's rename this
  - [ ] re-image bm-xserve-unittest01 and rename to bm-xserve22 so it can move to production
  ^^ is bug 465766

so that means we can get rid of this:
> - [ ] reimage the staging xserve, use in unittest production as interim
>       step (lsblakk)


These are done:
> - [+] moz2-win32-slave15->18 need utils added (lsblakk)
> - [+] verify new linux slaves have firefox profile initialised (lsblakk)
Revised list after talking it through with lukas on irc.

ToDo
[ ] fix linux mochichrome and mochitest failures on build/unittest slaves. Mac green, win32 green.
[ ] update linux refplatform - scratchbox,xvnc (joduinn)
[ ] stop including slavename in tinderbox column in production
      (bhearsum)
[ ] bug#464079 - fix exception when doing reconfig in master (catlee)
[ ] consolidate slave setup doc for build/unittest slaves. Update ref
      images if needed. (lsblakk, joduinn)
[ ] update support doc to show how to start slave with Xvfb and DISPLAY settings
[ ] write production consolidation patches (lsblakk)
[ ] delete staging buildbot master for unittests
[ ] after consolidation is in production, delete and recreate old unittest slaves (4 linux, 4 win32) (lblakk)
[ ] after consolidation, investigate and possibly move bm-xserve21 -> mozilla-central (lblakk)


Done:
[+] create 4 new linux slaves (catlee)
[+] windows slave: accept pskill license (lblakk)
[+] update existing linux build slaves - xvfb (lblakk)
[+] talk with tracemonkey developers about running all unittests, not
    just "make check" (joduinn). They are all ok with that. Per test
    run last week, all is ok now.
[+] update win32 refplatforms - moz_no_reset_path, set screen
    resolution (joduinn) bug#460535
[+] stop including slavename in tinderbox column in staging (bhearsum)
[+] consolidate master.cfg (bhearsum)
[+] update existing win32 build slaves - screen resolution (lblakk)
[+] change tracemonkey unittest HW (bm-win2k3-unittest-02-hw) to be
      VM (lblakk)
[+] create 4 new win32 slaves, bug#460729 (joduinn)
[+] put new slaves into staging, then production (catlee, joduinn)
[+] change tracemonkey unittest HW (bm-win2k3-unittest-02-hw) to be
      VM (lsblakk)
[+] upgrade bm-xserve-unittest01 to 10.5 (lblakk)
[+] re-image bm-xserve-unittest01, rename to bm-xserve22 and move
    it to production mozilla-1.9.1. Details in bug 465766
[+] moz2-win32-slave15->18 need utils added (lsblakk)
[+] verify new linux slaves have firefox profile initialised (lsblakk)

Dropped:
[+] mac slave: audit /tools, xcode (lblakk). instead replaced with new machines.
[+] reimage the staging xserve, use in unittest production as interim
      step (lsblakk)
Assignee: nobody → catlee
looks like mochitest is timing out on staging
(In reply to comment #3)
> ToDo
> [ ] fix linux mochichrome and mochitest failures on build/unittest slaves. Mac
> green, win32 green.

On Linux this is caused by a combination of buildbot not being started with DISPLAY=:2, metacity not running, or Xvfb not running.  See bug 468823.
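
The three conditions above can be checked mechanically. This is only a rough sketch, assuming a Linux slave: the helper names display_problems and running_processes are made up for illustration and are not part of buildbot or of the fix in bug 468823.

```python
import subprocess


def display_problems(env, running):
    """Return a list of problems that would break GUI tests on a slave.

    env     -- mapping of the environment variables buildbot was started with
    running -- set of process names currently running on the machine
    """
    problems = []
    if env.get("DISPLAY") != ":2":
        problems.append("buildbot not started with DISPLAY=:2")
    for name in ("Xvfb", "metacity"):
        if name not in running:
            problems.append("%s is not running" % name)
    return problems


def running_processes():
    """Collect process names from `ps -A -o comm=` (assumes a POSIX ps)."""
    out = subprocess.check_output(["ps", "-A", "-o", "comm="])
    return set(line.strip() for line in out.decode().splitlines())
```

On a slave one would call display_problems(os.environ, running_processes()) from the same environment buildbot is started in; an empty list means none of the three causes applies.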

> [ ] update linux refplatform - scratchbox,xvnc (joduinn)

ref platform and existing slaves should be updated with cronjobs from 468823.

> [ ] update support doc to show how to start slave with Xvfb and DISPLAY
> settings

this won't be necessary once the above is done.
So our current ToDos for getting this moved into production are:
[ ] set DISPLAY=:2 on linux (from bug 468823)
[ ] update linux slaves with Xvfb, metacity cronjobs from bug 468823
[ ] write production consolidation patches
[ ] stop including slavename in tinderbox column in production
      (bhearsum) ?

These can happen at any time:
[ ] update linux refplatform - Xvfb,metacity
[ ] consolidate slave setup doc for build/unittest slaves. Update ref
      images if needed. (lsblakk, joduinn)
[ ] delete staging buildbot master for unittests
[ ] after consolidation is in production, delete and recreate old unittest
slaves (4 linux, 4 win32) (lblakk)
[ ] after consolidation, investigate and possibly move bm-xserve21 ->
mozilla-central (lblakk)
(In reply to comment #6)
> So our current ToDos for getting this moved into production are:
> [ ] set DISPLAY=:2 on linux (from bug 468823)
> [ ] update linux slaves with Xvfb, metacity cronjobs from bug 468823
> [ ] write production consolidation patches
> [ ] stop including slavename in tinderbox column in production
>       (bhearsum) ?
> 
> These can happen at any time:
> [ ] update linux refplatform - Xvfb,metacity
> [ ] consolidate slave setup doc for build/unittest slaves. Update ref
>       images if needed. (lsblakk, joduinn)


> [X] delete staging buildbot master for unittests 
We already deleted the m-c staging buildbot when we set up the 1.9.1 and m-c standalone production unittest.

> [ ] after consolidation is in production, delete and recreate old unittest
> slaves (4 linux, 4 win32) (lblakk)

I don't remember why we might want to move bm-xserve21, it's doing production 1.9.0 right now.  Do we not want to have 2 Mac builds on that production waterfall?
> [ ] after consolidation, investigate and possibly move bm-xserve21 ->
> mozilla-central (lblakk)
We were still having some problems because modules were being reloaded multiple times per 'buildbot reconfig', resulting in problems when constructing unittest steps. By reloading modules in one place, we can make sure they're only reloaded once per 'buildbot reconfig'.
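
To illustrate the "reload in one place" idea, here is a minimal sketch, not the code from the attached patches. The helper reload_once is hypothetical, and it uses the modern importlib API rather than the reload() builtin that a 2008-era master.cfg would have used.

```python
import importlib
import sys


def reload_once(module_names, already_reloaded):
    """Reload each named module at most once per reconfig pass.

    module_names     -- names to (re)load, possibly with duplicates
    already_reloaded -- set of names handled so far in this pass; updated
                        in place and returned
    """
    for name in module_names:
        if name in already_reloaded:
            continue  # a second request in the same pass is ignored
        module = sys.modules.get(name)
        if module is None:
            importlib.import_module(name)  # first load, nothing to reload
        else:
            importlib.reload(module)
        already_reloaded.add(name)
    return already_reloaded
```

The config would create one empty set per reconfig and route every reload request through this single function, so a module pulled in from several config files is still reloaded only once.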
Attachment #353107 - Flags: review?(bhearsum)
Attachment #353109 - Flags: review?(bhearsum)
Attachment #353107 - Flags: review?(lukasblakk)
(In reply to comment #9)
> Created an attachment (id=353109) [details]
> Add repoPath to UnittestBuildFactory, and fix module reloading

Ignore the changes to env.py in here, those are already covered in another bug.
Attachment #353107 - Flags: review?(lukasblakk) → review+
Comment on attachment 353107 [details] [diff] [review]
Bring unittest factory logic into master.cfg, and fix module reloading

I'm not a big fan of moving the factory.py reloads over here. Is there any way to keep them confined to their buildbotcustom modules?
Depends on: 468823
Attachment #353107 - Attachment is obsolete: true
Attachment #353224 - Flags: review?(bhearsum)
Attachment #353107 - Flags: review?(bhearsum)
Attachment #353109 - Attachment is obsolete: true
Attachment #353225 - Flags: review?(bhearsum)
Attachment #353109 - Flags: review?(bhearsum)
Attachment #353224 - Flags: review?(lukasblakk)
Attachment #353225 - Flags: review?(lukasblakk)
Attachment #353225 - Flags: review?(lukasblakk) → review+
Attachment #353224 - Flags: review?(lukasblakk) → review+
So here's the patch for production implementation - it's the same as the staging one, so if anything needs tweaking from our staging reloads and whathaveyou this will need to change too.
Attachment #353281 - Flags: review?(catlee)
Attachment #353281 - Flags: review?(catlee) → review+
[x] update linux slaves with Xvfb, metacity cronjobs from bug 468823

linux slaves on production-master and staging-master now have this in cltbld's crontab:
# Make sure Xvfb is running on :2
@reboot     ps -C Xvfb | grep -q Xvfb || exec Xvfb :2 -screen 0 1280x1024x24 &
*/5 * * * * ps -C Xvfb | grep -q Xvfb || exec Xvfb :2 -screen 0 1280x1024x24 &

# Make sure metacity is running on :2
@reboot     ps -C metacity -f | grep -q :2 || exec metacity --display :2 --replace &
*/5 * * * * ps -C metacity -f | grep -q :2 || exec metacity --display :2 --replace &
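
For anyone reading the cron entries later, the decision they make ("start it only if it is missing") can be written out as a small Python sketch. The function services_to_start and the (name, args) process representation are hypothetical; the two command lines are copied from the crontab above. One deliberate difference: this looks for ":2" in metacity's argument list instead of grepping the whole ps line, which is a slightly stricter check than the crontab's.

```python
# Command lines taken from the crontab entries above.
XVFB_CMD = ["Xvfb", ":2", "-screen", "0", "1280x1024x24"]
METACITY_CMD = ["metacity", "--display", ":2", "--replace"]


def services_to_start(processes):
    """Return the commands that still need to be started.

    processes -- list of (command_name, argument_list) pairs describing
                 what is currently running
    """
    to_start = []
    # Any running Xvfb counts, as in `ps -C Xvfb | grep -q Xvfb`.
    if not any(name == "Xvfb" for name, _ in processes):
        to_start.append(XVFB_CMD)
    # metacity only counts if it was started against display :2.
    if not any(name == "metacity" and ":2" in args for name, args in processes):
        to_start.append(METACITY_CMD)
    return to_start
```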
Depends on: 464692
(In reply to comment #7)
> (In reply to comment #6)
> > [ ] stop including slavename in tinderbox column in production
> >       (bhearsum) ?
Done by catlee.
Priority: -- → P2
Comment on attachment 353224 [details] [diff] [review]
Bring unittest factory logic into master.cfg, and fix module reloading

Looks fine to me.
Attachment #353224 - Flags: review?(bhearsum) → review+
Attachment #353225 - Flags: review?(bhearsum) → review+
Comment on attachment 353225 [details] [diff] [review]
Add repoPath to UnittestBuildFactory

Checking in process/factory.py;
/cvsroot/mozilla/tools/buildbotcustom/process/factory.py,v  <--  factory.py
new revision: 1.61; previous revision: 1.60
done
Attachment #353225 - Flags: checked‑in+
Comment on attachment 353224 [details] [diff] [review]
Bring unittest factory logic into master.cfg, and fix module reloading

changeset:   601:b641aa91d1bf
Attachment #353224 - Flags: checked‑in+
Attachment #353281 - Flags: checked‑in+
Comment on attachment 353281 [details] [diff] [review]
Production consolidation patch

changeset:   601:b641aa91d1bf
This rolled out live in production yesterday (18dec2008), so we are now producing builds and unittests from the same pool-of-identical slaves. 

We're still running dedicated old unittest systems as usual, so we can watch the two sets of unittests results in parallel for a while.
Comment on attachment 353224 [details] [diff] [review]
Bring unittest factory logic into master.cfg, and fix module reloading

>--- a/mozilla2-staging/unittest_master.py	Mon Dec 15 15:48:08 2008 +0100
>-    errorparser="unittest"

Losing that means that the brief log is now a useless spew of every passed (or known-fail) test which happens to include the string "error" in the message. Can we have the errorparser that understands unit test error messages back, pretty please?
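
A small sketch of the difference being described, under stated assumptions: the "TEST-UNEXPECTED-" marker and the filtering functions below are illustrative only and are not the actual tinderbox errorparser code.

```python
# Assumed marker for a genuine unit test failure; illustrative only.
UNEXPECTED = "TEST-UNEXPECTED-"


def naive_brief_log(lines):
    """Keep every line that mentions 'error' anywhere, passes included."""
    return [line for line in lines if "error" in line.lower()]


def unittest_brief_log(lines):
    """Keep only lines that report an unexpected test result."""
    return [line for line in lines if UNEXPECTED in line]
```

With a generic parser, a passing test named test_onerror.html lands in the brief log just because of its name; the unittest-aware filter keeps only the unexpected results.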
Attachment #354188 - Flags: review?(catlee) → review+
(In reply to comment #22)
> (From update of attachment 353224 [details] [diff] [review])
> >--- a/mozilla2-staging/unittest_master.py	Mon Dec 15 15:48:08 2008 +0100
> >-    errorparser="unittest"
> 
> Losing that means that the brief log is now a useless spew of every passed (or
> known-fail) test which happens to include the string "error" in the message.
> Can we have the errorparser that understands unit test error messages back,
> pretty please?

Yup, this is bug 470757.  Should be all good now.
Attachment #354206 - Flags: review? → review?(catlee)
The following slaves have been clobbered and re-issued to staging-master (consolidated):

moz2-linux-slave07
moz2-linux-slave08
moz2-linux-slave10
moz2-linux-slave13

moz2-win32-slave07
moz2-win32-slave08
moz2-win32-slave09
moz2-win32-slave10

moz2-darwin9-slave05
bm-xserve22

When bug 470788 is resolved then 2 more mac slaves will be added to staging.

Once all these slaves have been successfully running on staging, they can be switched over to production (consolidated).  The patch is ready for that - see comment 25
I have turned off the standalone production unittest (moz2-unittest in /builds/buildbot) and updated production-master nagios
Comment on attachment 354188 [details] [diff] [review]
Adds new mac slaves to staging

changeset:   620:64e879af7c94
Attachment #354188 - Flags: checked‑in+
2 mac slaves added to staging-pool:

moz2-darwin9-slave06
moz2-darwin9-slave07
Please update the inventory & IT support docs before resolving this bug fixed.
Still left:

[ ] update linux refplatform - Xvfb,metacity
[ ] consolidate slave setup doc for build/unittest slaves. Update ref images if needed. (lsblakk, joduinn)
[ ] update inventory
You should add [ ] update support docs to that list too, as Nick pointed out.
Attachment #354206 - Flags: review?(catlee) → review+
Comment on attachment 354206 [details] [diff] [review]
Add 4 new slaves of each platform to Production

changeset:   653:6797175d6f78
Attachment #354206 - Flags: checked‑in+
crontab was adjusted on:
moz2-linux-slave07
moz2-linux-slave08
moz2-linux-slave10
moz2-linux-slave13
The following have been moved into production:
moz2-linux-slave07
moz2-linux-slave08
moz2-linux-slave10
moz2-linux-slave13

moz2-win32-slave09
moz2-win32-slave10

moz2-darwin9-slave05
moz2-darwin9-slave07
bm-xserve22
I needed to clobber these, as the first build (on production anyway) died:
 moz2-linux-slave08 - mozilla-central-linux/build/obj-firefox
 moz2-linux-slave10 - mozilla-central-linux-unittest/build/objdir
The following have been moved into production:
moz2-darwin9-slave06

moz2-win32-slave07
(In reply to comment #37)
> The following have been moved into production:
> moz2-darwin9-slave06
> 
> moz2-win32-slave07

and moz2-win32-slave08 too.
This slave passed all tests on staging-master; it now needs to be enabled on production-master.
Attachment #356276 - Flags: review?(catlee)
Attachment #356276 - Flags: review?(catlee) → review+
Comment on attachment 356276 [details] [diff] [review]
add moz2-win32-slave14 to production-master

changeset:   668:b42dbf33aca4
Attachment #356276 - Flags: checked‑in+
[x] update linux refplatform - Xvfb,metacity (nthomas)
[x] consolidate slave setup doc for build/unittest slaves. Update ref images if
 needed (catlee)
[x] update inventory (catlee)
[x] update support docs (catlee)

Are we done here?
In bug 472779 I noticed that moz2-linux-slave07/08/09/10 are not up to date for scratchbox, so I'm guessing they are an older clone of the reference platform. That's from scratchbox not being in /builds/ and symlinked back to / (there's an update too). We need to reclone, or see if we can do the steps in the ref doc to bring them up to speed. slave14 seems to be fine, based on the existence of /builds/scratchbox.
Depends on: 473514
I had to modify the umask in buildbot.tac to 022 on bm-xserve22, moz2-darwin9-slave06/07. They were spitting out snippets which the nightly update system didn't have perms to read.
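
For reference, the effect of the umask on newly created files can be shown with a short sketch, assuming a POSIX system. The helper names are made up, and 0o077 is just a hypothetical restrictive value for contrast; this comment does not say what the slaves' umask was before the change.

```python
import os
import stat
import tempfile


def create_with_umask(path, mask):
    """Create `path` under the given umask and return its permission bits."""
    previous = os.umask(mask)
    try:
        # Request rw for everyone; the umask removes bits from this request.
        fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_EXCL, 0o666)
        os.close(fd)
    finally:
        os.umask(previous)  # always restore the process-wide umask
    return stat.S_IMODE(os.stat(path).st_mode)


def demo():
    """Return the modes produced by umask 022 and by umask 077."""
    with tempfile.TemporaryDirectory() as workdir:
        readable = create_with_umask(os.path.join(workdir, "readable"), 0o022)
        private = create_with_umask(os.path.join(workdir, "private"), 0o077)
    return readable, private
```

With umask 022 the file comes out as rw-r--r--, so another account (such as whatever the update system runs as) can read it; with a mask like 077 only the owner can.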
Depends on: 473772
Attachment #357400 - Flags: review?(bhearsum)
Attachment #357400 - Flags: review?(bhearsum) → review-
Comment on attachment 357400 [details] [diff] [review]
add moz2-linux-slave09 to production-master

Please add slave09 to mobile_master.py, too, and re-enable whichever other ones are fixed there.
Attachment #357433 - Flags: review?(bhearsum) → review+
Attachment #357433 - Flags: checked‑in+
Comment on attachment 357433 [details] [diff] [review]
add moz2-linux-slave09 to production-master, and renable slave7,8,10 for mobile builds

changeset:   683:8907fea7ecd2
moz2-linux-slave07,08,09,10 were moved onto production yesterday and have been running fine.
Putting this one to rest.
Status: NEW → RESOLVED
Closed: 11 years ago
Resolution: --- → FIXED
Product: mozilla.org → Release Engineering