Closed Bug 848885 Opened 11 years ago Closed 11 years ago

Move staging test minis to production

Categories

(Infrastructure & Operations Graveyard :: CIDuty, task, P3)

x86
All

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: armenzg, Assigned: armenzg)

References

Details

Attachments

(5 files, 1 obsolete file)

Adding a few machines back to the pool will help slightly with the wait times.

Let's take all of our preprod-test machines and have them take production jobs.
Whenever we need one of these machines we can pull it back from production.
If we need to modify them or loan them out, we will need to go through our usual re-imaging process.

For the puppet slaves, I assume we will keep them synced with production and only move them to staging when we need to test a new package. Testing new packages will be a little harder, but I believe that is better than finding discrepancies once they are taking production jobs.

We will need to remove network rules that we added when we wanted to determine if we could run tests without reaching external networks. AFAIK that project is halted.

Once we have all the patches ready we should re-image all of these machines just to be sure that we start from a clean state.

talos-mtnlion-r5-001
talos-mtnlion-r5-002
talos-mtnlion-r5-003
talos-mtnlion-r5-010
talos-r3-fed-001
talos-r3-fed-002
talos-r3-fed-010
talos-r3-fed64-001
talos-r3-fed64-002
talos-r3-fed64-010
talos-r3-w7-001
talos-r3-w7-002
talos-r3-w7-003
talos-r3-w7-010
talos-r3-xp-001
talos-r3-xp-002
talos-r3-xp-003
talos-r3-xp-010
talos-r4-lion-001
talos-r4-lion-002
talos-r4-lion-003
talos-r4-lion-010
talos-r4-snow-001
talos-r4-snow-002
talos-r4-snow-003
talos-r4-snow-046 - struck out: it misbehaves
(In reply to Armen Zambrano G. [:armenzg] from comment #0)
> talos-r4-snow-046 - struck out: it misbehaves

Yes, we'll need to make sure the slavealloc comment is preserved for this slave (and any others in the same state), and that it doesn't get enabled by accident.
poolids determined through this query:
mysql> select * from pools where name like 'tests-scl1%';
+--------+--------------------+
| poolid | name               |
+--------+--------------------+
|     22 | tests-scl1-linux   |
|     11 | tests-scl1-macosx  |
|     29 | tests-scl1-panda   |
|     19 | tests-scl1-windows |
+--------+--------------------+
4 rows in set (0.01 sec)

This is the last patch to land. Everything else has to happen first.
Attachment #722429 - Flags: review?(nthomas)
Graphs seems to have those machines in production.
Armens-MacBook-Air puppet-manifests hg:[default!] $ for i in `cat list_machines | grep -v "w7|xp|mtn"`; do grep "$i" scl-production.pp; done | wc -l
      13
Armens-MacBook-Air puppet-manifests hg:[default!] $ for i in `cat list_machines`; do grep "$i" scl-production.pp; done | wc -l
      13
Armens-MacBook-Air puppet-manifests hg:[default!] $ for i in `cat list_machines`; do grep "$i" scl-production.pp; done
node "talos-r3-fed-001" inherits "fedora12-i686-test" {
node "talos-r3-fed-002" inherits "fedora12-i686-test" {
node "talos-r3-fed-010" inherits "fedora12-i686-test" {
node "talos-r3-fed64-001" inherits "fedora12-x86_64-test" {
node "talos-r3-fed64-002" inherits "fedora12-x86_64-test" {
node "talos-r3-fed64-010" inherits "fedora12-x86_64-test" {
node "talos-r4-lion-001" inherits "darwin11-x86_64-test" {
node "talos-r4-lion-002" inherits "darwin11-x86_64-test" {
node "talos-r4-lion-003" inherits "darwin11-x86_64-test" {
node "talos-r4-lion-010" inherits "darwin11-x86_64-test" {
node "talos-r4-snow-001" inherits "darwin10-i386-test" {
node "talos-r4-snow-002" inherits "darwin10-i386-test" {
node "talos-r4-snow-003" inherits "darwin10-i386-test" {
Armens-MacBook-Air puppet-manifests hg:[default!] $ for i in `cat list_machines`; do grep "$i" staging.pp; done | wc -l
       0
Armens-MacBook-Air puppet-manifests hg:[default!] $ grep "talos-r4-snow-046" staging.pp 
node "talos-r4-snow-046" inherits "darwin10-i386-test" {
Armens-MacBook-Air puppet-manifests hg:[default!] $ grep "talos-r4-snow-046" scl-production.pp
Attachment #722445 - Flags: review?(nthomas)
Do I need to make any changes for the mtnlion slaves?
Attachment #722449 - Flags: review?(nthomas)
Depends on: 848944
Comment on attachment 722429 [details] [diff] [review]
move staging machines to production (slavealloc)

You should use poolid=28 for the talos-mtnlion slaves, as they talk to scl3 masters. Otherwise OK.
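Note: a minimal sketch (not the actual slavealloc patch) of the slave-to-poolid mapping this implies, combining the tests-scl1 query above with the poolid=28 note here; the name of pool 28 is not shown in this bug, so treat that part as an assumption.

# Hypothetical helper, not the landed patch: map a test slave name to the
# slavealloc poolid it should get. 11/19/22 come from the tests-scl1-* query
# above; 28 is the scl3 pool used for the mtnlion slaves per this review.
POOL_FOR_PREFIX = [
    ('talos-mtnlion-', 28),  # scl3 masters
    ('talos-r4-',      11),  # tests-scl1-macosx (lion, snow)
    ('talos-r3-fed',   22),  # tests-scl1-linux (fed, fed64)
    ('talos-r3-w7',    19),  # tests-scl1-windows
    ('talos-r3-xp',    19),  # tests-scl1-windows
]

def poolid_for(slave):
    for prefix, poolid in POOL_FOR_PREFIX:
        if slave.startswith(prefix):
            return poolid
    raise ValueError("no pool mapping for %s" % slave)

assert poolid_for('talos-mtnlion-r5-002') == 28
assert poolid_for('talos-r3-fed64-010') == 22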
Attachment #722429 - Flags: review?(nthomas) → review-
Comment on attachment 722440 [details] [diff] [review]
buildbot-configs - move staging test slaves to production

>diff --git a/mozilla-tests/production_config.py b/mozilla-tests/production_config.py
>-    'fedora64' : dict([("talos-r3-fed64-%03i" % x, {}) for x in range (3,10) + range(11,35) + range(36,72)]),
>+    'fedora64' : dict([("talos-r3-fed64-%03i" % x, {}) for x in range (1,72)]),

talos-r3-fed64-035 got decommissioned but you're adding it back here.

>-    'win7': dict([("talos-r3-w7-%03i" % x, {}) for x in range(4,10) + range(11,17) + range(18,105)]),
>+    'win7': dict([("talos-r3-w7-%03i" % x, {}) for x in range(1,105)]),

talos-r3-w7-018 also got decommissioned

>-    'snowleopard': dict([("talos-r4-snow-%03i" % x, {}) for x in range(4,46) + range(47,81) + [82,84]]),
>+    'snowleopard': dict([("talos-r4-snow-%03i" % x, {}) for x in range(1,84) \
>+        if x not in [46]]), # bug 824754 - This machine is not suitable for production

We don't have a talos-r4-snow-081, apparently, so let's not add it to the config.

r+ if you fix that up.
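For illustration, a hedged sketch (not the landed patch) of the snowleopard entry with both of the comments above applied, in the same Python 2 range-concatenation style as production_config.py (the surrounding SLAVES dict name is assumed):

# Hypothetical reconstruction only: keep talos-r4-snow-046 out (bug 824754),
# stop before the nonexistent talos-r4-snow-081, and keep 082 and 084 as before.
SLAVES = {
    'snowleopard': dict([("talos-r4-snow-%03i" % x, {})
                         for x in range(1, 81) + [82, 84]
                         if x not in [46]]),
}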
Attachment #722440 - Flags: review?(nthomas) → review+
Attachment #722449 - Flags: review?(nthomas) → review+
Attachment #722445 - Flags: review?(nthomas) → review+
Comment on attachment 722440 [details] [diff] [review]
buildbot-configs - move staging test slaves to production

Does this still pass buildbot-configs/mozilla/test/test_slave_allocation.py ?
OS: Mac OS X → All
Summary: Make staging tests machines to take production jobs → Move staging test minis to production
(In reply to Nick Thomas [:nthomas] from comment #10)
> Comment on attachment 722440 [details] [diff] [review]
> buildbot-configs - move staging test slaves to production
> 
> Does this still pass buildbot-configs/mozilla/test/test_slave_allocation.py ?

Yes, it still does.
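For anyone re-running that check later, a hedged sketch using the stdlib test loader from a buildbot-configs checkout (the usual runner for these tests may differ, e.g. nose; paths assumed relative to the repo root):

# Illustrative only: discover and run test_slave_allocation.py.
import sys
import unittest

suite = unittest.defaultTestLoader.discover('mozilla/test',
                                            pattern='test_slave_allocation.py')
result = unittest.TextTestRunner(verbosity=2).run(suite)
sys.exit(0 if result.wasSuccessful() else 1)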
Attachment #722440 - Flags: checked-in+
Attachment #722445 - Flags: checked-in+
Attachment #722449 - Flags: checked-in+
These machines are waiting for a reconfig, after which we can run the slavealloc patch:
> talos-mtnlion-r5-002 - no puppet changes are needed
> talos-mtnlion-r5-003 - no puppet changes are needed
> talos-r3-w7-001      - hostname changed
> talos-r3-w7-002      - hostname changed
> talos-r3-w7-003      - hostname changed
> talos-r3-w7-010      - hostname changed
> talos-r3-xp-001      - hostname changed and added to OPSI production
> talos-r3-xp-010      - hostname changed and added to OPSI production
> talos-r4-lion-001    - booked
> talos-r4-lion-002    - IT working on it
> talos-r4-lion-003    - puppetized
> talos-r4-lion-010    - puppetized
> talos-r4-snow-001    - puppetized
> talos-r4-snow-002    - puppetized
> talos-r4-snow-003    - puppetized

These are waiting for various reasons:
> talos-mtnlion-r5-001 - booked
> talos-mtnlion-r5-010 - booked
> talos-r3-fed-001     - booked
> talos-r3-fed-002     - IT working on it
> talos-r3-fed-010     - IT working on it
> talos-r3-fed64-001   - IT working on it
> talos-r3-fed64-002   - IT working on it
> talos-r3-fed64-010   - booked
> talos-r3-xp-002      - IT working on it
> talos-r3-xp-003      - IT working on it
These machines are still being used by other relengers and could cause them trouble if they are not kept in sync with staging.

I have added them back. This patch is for when we have moved them to production.
Attachment #722860 - Flags: review?(nthomas)
Attachment #722429 - Attachment is obsolete: true
Attachment #722861 - Flags: review?(nthomas)
Priority: -- → P2
Attachment #722860 - Attachment description: add back few staging machines → remove last few staging machines
Attachment #722860 - Flags: review?(nthomas) → review+
Attachment #722861 - Flags: review?(nthomas) → review+
Merged and reconfiguration completed.
I've put these slaves into production:
https://build.mozilla.org/buildapi/recent/talos-mtnlion-r5-002
https://build.mozilla.org/buildapi/recent/talos-mtnlion-r5-003
https://build.mozilla.org/buildapi/recent/talos-r3-w7-001
https://build.mozilla.org/buildapi/recent/talos-r3-w7-002
https://build.mozilla.org/buildapi/recent/talos-r3-w7-003
https://build.mozilla.org/buildapi/recent/talos-r3-w7-010
https://build.mozilla.org/buildapi/recent/talos-r3-xp-001
https://build.mozilla.org/buildapi/recent/talos-r3-xp-010
https://build.mozilla.org/buildapi/recent/talos-r4-lion-003
https://build.mozilla.org/buildapi/recent/talos-r4-lion-010
https://build.mozilla.org/buildapi/recent/talos-r4-snow-001
https://build.mozilla.org/buildapi/recent/talos-r4-snow-002
https://build.mozilla.org/buildapi/recent/talos-r4-snow-003

Ready to be put in production:
* talos-r3-xp-003

> These are waiting for various reasons:
> > talos-mtnlion-r5-001 - booked
> > talos-mtnlion-r5-010 - booked
> > talos-r3-fed-001     - booked
> > talos-r3-fed-002     - IT working on it
> > talos-r3-fed-010     - IT working on it
> > talos-r3-fed64-001   - IT working on it
> > talos-r3-fed64-002   - IT working on it
> > talos-r3-fed64-010   - booked
> > talos-r3-xp-002      - IT working on it
> > talos-r3-xp-003      - IT working on it
> > talos-r4-lion-001    - booked
> > talos-r4-lion-002    - IT working on it
(In reply to Nick Thomas [:nthomas] from comment #9)
> talos-r3-w7-018 also got decommissioned

Are you sure this machine got decommissioned?
I see it on DNS and it had been taking jobs happily:
https://secure.pub.build.mozilla.org/buildapi/recent/talos-r3-w7-018


FTR, I removed one more snow machine than I needed to.
I've added it back to default:
http://hg.mozilla.org/build/buildbot-configs/rev/5b79bdab398a
* talos-r3-xp-003
https://secure.pub.build.mozilla.org/buildapi/recent/talos-r3-xp-003

I need to figure out these two:
https://build.mozilla.org/buildapi/recent/talos-mtnlion-r5-002
https://build.mozilla.org/buildapi/recent/talos-mtnlion-r5-003
and these two:
> Reimaged talos-mtnlion-r5-010 and talos-mtnlion-r5-001 (talked to kmoir).

The following now need dcops intervention:
talos-r4-lion-002
talos-r3-xp-002
talos-r3-fed-002
talos-r3-fed-010
talos-r3-fed64-001
talos-r3-fed64-002
talos-r3-fed64-010

This is still booked:
* talos-r3-fed-001
nthomas, where did you get the info that talos-r3-w7-018 was to be decommissioned? I can't find any reference to it.

The mtnlion slaves are now taking jobs.

talos-r3-xp-002     - taking jobs
talos-r3-fed-002    - taking jobs

These are connected to production masters but I'm still waiting on them:
https://secure.pub.build.mozilla.org/buildapi/recent/talos-r3-fed-010
https://secure.pub.build.mozilla.org/buildapi/recent/talos-r3-fed64-001
https://secure.pub.build.mozilla.org/buildapi/recent/talos-r3-fed64-002
https://secure.pub.build.mozilla.org/buildapi/recent/talos-r3-fed64-010

The following still need dcops intervention:
talos-r4-lion-002   - IT still working on it

This is still booked:
* talos-r3-fed-001
Flags: needinfo?(nthomas)
(In reply to Armen Zambrano G. [:armenzg] from comment #19)
> nthomas, where did you get the info that talos-r3-w7-018 was to be
> decommissioned? I can't find any reference to it.

Sorry, should have said talos-r3-w7-017 (bug 747734).
Flags: needinfo?(nthomas)
(In reply to Nick Thomas [:nthomas] from comment #20)
> (In reply to Armen Zambrano G. [:armenzg] from comment #19)
> > nthomas, where did you get the info that talos-r3-w7-018 was to be
> > decommissioned? I can't find any reference to it.
> 
> Sorry, should have said talos-r3-w7-017 (bug 747734).

I landed a fix for it and I will put it back after our next reconfig.
Status
######
* talos-r4-lion-002  - IT still working on it
* talos-r3-fed-001   - booked
* talos-r3-w7-018    - put back in production after reconfig

All other slaves have taken jobs on production.
Depends on: 850531
This is in production.
Attachment #722860 - Flags: checked-in+
Attachment #722861 - Flags: checked-in+
Waiting on:
* talos-r4-lion-002
* talos-r3-fed-001

armenzg has to follow up:
* talos-r3-w7-001
* talos-r3-w7-002
* talos-r3-w7-003
* talos-r3-w7-010
Whiteboard: status on comment 23
No longer depends on: 850531
talos-r4-lion-002 is running jobs.
Waiting on talos-r3-fed-001
We will deal with the win7 slaves on bug 850531.
Whiteboard: status on comment 23 → status on comment 25
Priority: P2 → P3
Whiteboard: status on comment 25 → waiting on talos-r3-fed-001
I got fed-001 done.
Status: NEW → RESOLVED
Closed: 11 years ago
Resolution: --- → FIXED
Whiteboard: waiting on talos-r3-fed-001
Product: mozilla.org → Release Engineering
Product: Release Engineering → Infrastructure & Operations
Product: Infrastructure & Operations → Infrastructure & Operations Graveyard