Closed
Bug 1138672
Opened 10 years ago
Closed 10 years ago
vlan request - move bld-lion-r5-[007-015] from build pool and servo-lion-r5-[001,002] from servo pool both to try pool
Categories
(Infrastructure & Operations :: RelOps: General, task)
Tracking
(Not tracked)
RESOLVED
FIXED
People
(Reporter: jlund, Assigned: dividehex)
I want to get this filed but we can't action it until we disable these machines first.
We should try to time this so we have 10 machines disabled for the least amount of time possible. I'll start looking into what else needs to be done for bug 1137047.
These hosts are currently $HOST.build.releng.scl3.mozilla.com, but we will need them to be $HOST.try.releng.scl3.mozilla.com and set up the same way bld-lion-r5-[016-036] are.
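For illustration, the rename described above can be sketched in a few lines of Python. This is only a sketch: the helper names `expand_range` and `fqdn` are hypothetical, and it assumes the bracket notation always takes the dashed, zero-padded form used in the summary (the comma form `[001,002]` is not handled).

```python
import re

DOMAIN = "releng.scl3.mozilla.com"  # domain suffix taken from the description


def expand_range(pattern):
    """Expand 'bld-lion-r5-[007-015]' into individual hostnames.

    Only the dashed form '[start-end]' is handled; anything else is
    returned unchanged as a single-item list.
    """
    match = re.fullmatch(r"(.*)\[(\d+)-(\d+)\](.*)", pattern)
    if match is None:
        return [pattern]
    prefix, start, end, suffix = match.groups()
    width = len(start)  # keep the zero padding of the original range
    return [
        f"{prefix}{n:0{width}d}{suffix}"
        for n in range(int(start), int(end) + 1)
    ]


def fqdn(host, pool):
    """Build $HOST.<pool>.releng.scl3.mozilla.com for pool 'build' or 'try'."""
    return f"{host}.{pool}.{DOMAIN}"


hosts = expand_range("bld-lion-r5-[007-015]")
renames = {fqdn(h, "build"): fqdn(h, "try") for h in hosts}
```

Each key in `renames` is a current name and each value is the name the host should have after the move.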
Comment 1•10 years ago
This will require:
* a vlan change from netops/dcops (if this is really time critical, we should try to set a specific time with them to make sure Van will be onsite)
* hostname changes in inventory
* SREG and CNAME modifications in inventory
* dhcp_scope changes in inventory
* nagios changes
* removal/move in deploy studio
* reimage
* any additional releng modifications (buildbot configs, slavealloc configs, etc) after the machines are back up.
Assignee: relops → jwatkins
Reporter
Comment 2•10 years ago
> * any additional releng modifications (buildbot configs, slavealloc configs,
> etc) after the machines are back up.
buildbot configs patch is here: https://bugzilla.mozilla.org/show_bug.cgi?id=1137047#c4
I can do the slavealloc bits: disable the slaves before we start and update fqdn's when this bug is finished
van: hi :) what's your availability between now and the near short term to help with this? This is not time critical but doing this all at the same time seems to make sense rather than disabling now.
Flags: needinfo?(vle)
Comment 3•10 years ago
I have two other minis in bug 1100386 that we might as well tackle in this batch.
servo-lion-r5-001 -> bld-lion-r5-095 (build)
servo-lion-r5-002 -> bld-lion-r5-096 (try)
Can I ask you to take care of these machines at the same time, please?
Assignee
Comment 4•10 years ago
(In reply to Amy Rich [:arich] [:arr] from comment #1)
> * removal/move in deploy studio
> * reimage
Since we have issues with using deploystudio across vlans, we may need the deploystudio server to follow them to the try vlan temporarily
Comment 5•10 years ago
(In reply to Jake Watkins [:dividehex] from comment #4)
This shouldn't be an issue. The only problems we had imaging were on the srv network.
Comment 6•10 years ago
> van: hi :) what's your availability between now and the near short term to help with this? This is not time critical but doing this all at the same time seems to make sense rather than disabling now.
:jlund, i can work on these hosts tomorrow 3/4/15. i'll be traveling to our pops in the bay area today to install some new routers. i'll ping you or someone in #releng before i start.
Reporter
Comment 7•10 years ago
(In reply to Van Le [:van] from comment #6)
> > van: hi :) what's your availability between now and the near short term to help with this? This is not time critical but doing this all at the same time seems to make sense rather than disabling now.
>
> :jlund, i can work on these hosts tomorrow 3/4/15. i'll be traveling to our
> pops in the bay area today to install some new routers. i'll ping you or
> someone in #releng before i start.
great, thanks! I'll be around during normal PT hours. I'll need about 1.5 hours' notice so I can disable the slaves before we start and let them finish their current builds.
Assignee
Comment 8•10 years ago
I'll also need some time to prep and test inventory and nagios. Here is the spreadsheet for the changes:
https://docs.google.com/a/mozilla.com/spreadsheets/d/1FxyiAoZWzV3UICEy2S0aLgWlgXXEN7CQaaMxCNPD2yU/edit?usp=sharing
Reporter
Comment 9•10 years ago
(In reply to Jake Watkins [:dividehex] from comment #8)
> I'll also need some time to prep and test inventory and nagios. Here is the
> spreadsheet for the changes:
>
> https://docs.google.com/a/mozilla.com/spreadsheets/d/1FxyiAoZWzV3UICEy2S0aLgWlgXXEN7CQaaMxCNPD2yU/edit?usp=sharing
okay. I am going to disable the slaves at 1200 PT and then Van is going to ping me at 1400 PT to start his work.
from irc: 10:21:00 <van> just so we're on the same page, all i need to do is change vlans, then reimage right?
van: I think so, but you will want to confirm with dividehex as there may be some steps in between switching vlans and reimaging.
dividehex: how much time do you need before and after van starts? Do the machines need to be disabled (not running build jobs) for all of your work?
Flags: needinfo?(jwatkins)
Assignee
Comment 10•10 years ago
(In reply to Jordan Lund (:jlund) from comment #9)
> (In reply to Jake Watkins [:dividehex] from comment #8)
> dividehex: how much time do you need before and after van starts? Do the
> machines need to be disabled (not running build jobs) for all of your work?
YES. Jobs need to be completed before nagios or inventory is updated.
I have patches prepped for nagios and inventory. So I won't need much time once the build jobs are completed. I'll need you to ping me when they are done.
So the process is something like this:
jobs complete ->
patch nagios to remove hosts ->
update inventory ->
delete old hosts from DS and enable default ds group ->
switch vlans/ports ->
reimage (no more than 5 at a time) ->
move default ds group back ->
patch nagios with new hostnames ->
enable new try slaves to take builds
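As a rough illustration of the "no more than 5 at a time" constraint on the reimage step, the host list can be split into batches like this. This is a sketch only: the `batches` helper is hypothetical, and the host list is assumed from the bug summary rather than from any inventory export.

```python
def batches(hosts, size=5):
    """Split hosts into consecutive groups of at most `size` machines."""
    if size < 1:
        raise ValueError("size must be at least 1")
    return [hosts[i:i + size] for i in range(0, len(hosts), size)]


# Host list assumed from the bug summary: nine build minis plus two servo minis.
to_reimage = [f"bld-lion-r5-{n:03d}" for n in range(7, 16)]
to_reimage += ["servo-lion-r5-001", "servo-lion-r5-002"]

reimage_batches = batches(to_reimage)
for number, group in enumerate(reimage_batches, start=1):
    print(f"batch {number}: {', '.join(group)}")
```

With eleven machines this gives three batches, so the reimage step runs in three rounds before the default ds group is moved back.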
Flags: needinfo?(jwatkins)
Reporter
Comment 11•10 years ago
(In reply to Jake Watkins [:dividehex] from comment #10)
> (In reply to Jordan Lund (:jlund) from comment #9)
> > (In reply to Jake Watkins [:dividehex] from comment #8)
>
> > dividehex: how much time do you need before and after van starts? Do the
> > machines need to be disabled (not running build jobs) for all of your work?
>
> YES. Jobs need to be completed before nagios or inventory is updated.
>
> I have patches prepped for nagios and inventory. So I won't need much time
> once the build jobs are completed. I'll need you to ping me when they are
> done.
>
> So the process is something like this:
> jobs complete ->
> patch nagios to remove hosts ->
> update inventory ->
> delete old hosts from DS and enable default ds group ->
> switch vlans/ports ->
> reimage (no more than 5 at a time) ->
> move default ds group back ->
> patch nagios with new hostnames ->
> enable new try slaves to take builds
sounds good. slaves have started disabling; I will ping once they are done.
dividehex, van: re coop's request above, can we do this at the same time:
servo-lion-r5-001 -> bld-lion-r5-095 (build)
servo-lion-r5-002 -> bld-lion-r5-096 (try)
Reporter
Updated•10 years ago
Summary: vlan request - move bld-lion-r5-[006-015] machines from prod build pool to try build pool → vlan request - move bld-lion-r5-[007-015] from build pool and servo-lion-r5-[001,002] from servo pool both to try pool
Reporter
Comment 12•10 years ago
we are now planning to move 9 build pool machines and 2 servo machines to try pool:
bld-lion-r5-007
bld-lion-r5-008
bld-lion-r5-009
bld-lion-r5-010
bld-lion-r5-011
bld-lion-r5-012
bld-lion-r5-013
bld-lion-r5-014
bld-lion-r5-015
servo-lion-r5-001
servo-lion-r5-002
their new homes have been reflected in the google spreadsheet: https://docs.google.com/a/mozilla.com/spreadsheets/d/1FxyiAoZWzV3UICEy2S0aLgWlgXXEN7CQaaMxCNPD2yU/edit?usp=sharing
Assignee
Comment 13•10 years ago
Nagios and inventory have been updated. It will take about 10~15 mins for dhcp and dns to propagate.
Comment 14•10 years ago
move + reimage completed. please let me know of any issues.
vans-MacBook-Pro:~ vle$ fping < tester
bld-lion-r5-007.try.releng.scl3.mozilla.com is alive
bld-lion-r5-008.try.releng.scl3.mozilla.com is alive
bld-lion-r5-009.try.releng.scl3.mozilla.com is alive
bld-lion-r5-010.try.releng.scl3.mozilla.com is alive
bld-lion-r5-011.try.releng.scl3.mozilla.com is alive
bld-lion-r5-012.try.releng.scl3.mozilla.com is alive
bld-lion-r5-013.try.releng.scl3.mozilla.com is alive
bld-lion-r5-014.try.releng.scl3.mozilla.com is alive
bld-lion-r5-015.try.releng.scl3.mozilla.com is alive
bld-lion-r5-095.try.releng.scl3.mozilla.com is alive
bld-lion-r5-096.try.releng.scl3.mozilla.com is alive
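A check like the one above can also be compared against the expected host list automatically. The following is a minimal sketch, assuming fping's default one-line-per-host output ("<host> is alive" or "<host> is unreachable"); the `alive_hosts` helper and the sample text are illustrative, not part of the actual run.

```python
def alive_hosts(fping_output):
    """Return the set of hostnames that fping reported as alive."""
    alive = set()
    for line in fping_output.splitlines():
        parts = line.split()
        # Expect exactly: "<host> is alive"
        if len(parts) == 3 and parts[1:] == ["is", "alive"]:
            alive.add(parts[0])
    return alive


DOMAIN = "try.releng.scl3.mozilla.com"
expected = {f"bld-lion-r5-{n:03d}.{DOMAIN}" for n in list(range(7, 16)) + [95, 96]}

# Illustrative sample, not the real output: one host is marked unreachable.
sample = "\n".join(
    f"{host} is unreachable" if host.startswith("bld-lion-r5-010") else f"{host} is alive"
    for host in sorted(expected)
)

missing = sorted(expected - alive_hosts(sample))
print("missing:", missing)
```

An empty `missing` list would mean every expected try host answered.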
Status: NEW → RESOLVED
Closed: 10 years ago
Flags: needinfo?(vle)
Resolution: --- → FIXED
Updated•10 years ago
colo-trip: --- → scl3
Comment 15•10 years ago
whoops, i thought this was a dcops bug. i'll reopen and let jake close when he confirms.
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
Reporter
Comment 16•10 years ago
(In reply to Van Le [:van] from comment #15)
> whoops, i thought this was a dcops bug. i'll reopen and let jake close when
> he confirms.
great! once we get confirmation from Jake, we can enable these slaves again
coop: in case we require these in the morning before I start Pacific time, to enable we need to land and reconfig:
https://bugzilla.mozilla.org/attachment.cgi?id=8573011&action=edit
https://bugzilla.mozilla.org/attachment.cgi?id=8573015&action=edit
otherwise I will do this first thing when I come online, assuming this bug is resolved by then.
Flags: needinfo?(coop)
Assignee
Comment 17•10 years ago
(In reply to Jordan Lund (:jlund) from comment #16)
> great! once we get confirmation from Jake, we can enable these
> slaves again
Sorry :jlund, didn't realize you were waiting on me. There is no other validation I need to do once they are reimaged and puppetized. And I see puppet certs were generated for them successfully. Nagios checks have also been enabled for them.
You are clear to enable them to take builds.
Reporter
Comment 18•10 years ago
(In reply to Jake Watkins [:dividehex] from comment #17)
> (In reply to Jordan Lund (:jlund) from comment #16)
>
> > great! once we get confirmation from Jake, we can enable these
> > slaves again
>
> Sorry :jlund, didn't realize you were waiting on me. There is no other
> validation I need to do once they are reimaged and puppetized. And I see
> puppet certs were generated for them successfully. Nagios checks have also
> been enabled for them.
>
> You are clear to enable them to take builds.
awesome. thanks!
Status: REOPENED → RESOLVED
Closed: 10 years ago → 10 years ago
Flags: needinfo?(coop)
Resolution: --- → FIXED