Closed Bug 710233 Opened 9 years ago Closed 8 years ago

Get & deploy dongles for Windows 7 slaves

Categories

(Infrastructure & Operations :: RelOps: General, task, P1)

x86
Windows 7

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: armenzg, Assigned: dividehex)

References

Details

(Whiteboard: all w7 slaves have a dongle attached)

In bug 702504 we discovered that we need dongles for the Windows 7 slaves.
We currently have 80 Windows 7 slaves. 3 of them already have a dongle (ref, 036 & 053).

I am not expecting this to be deployed before the 2nd week of January as everyone's hands are full and I will be out for 2 weeks but I would like us to purchase the dongles ahead of time.

We will need a downtime (or a low-capacity session) for this, as I expect a performance hit and it needs to be done on all of them in one shot.

Meanwhile I am verifying which unit tests will fail with the dongle.

Thanks in advance.
We have 52 dongles from the old snow machines, so we'll need about 30 more.  I'll take care of the purchasing.  Jake, where's the best place to ship these for your soldering pleasure?
Assignee: server-ops-releng → dustin
I placed an order with digi-key for 40 solder cups, shipped to mtv1 via FedEx ground: 31759872
The adapter PID on monoprice is 4850.  I'm waiting to hear back about our net-30 terms with them before I order.
Monoprice order is placed.  It will be here Thursday.  Couriers rock :)
I believe all of the required equipment is in place now.  Jake, can you verify, and get the soldering taken care of?
Assignee: dustin → jwatkins
I have installed the 100ohm dongles on talos-r3-w7-001 thru 050.  Later today I'll pick up the solder cups ( I have the adapters already ) so I can start building them tonight at home.
So in light of comment 0, this shouldn't have happened -- my fault for not catching that before Jake knocked this out.

Per Bear, it seems not to have affected screen resolutions, so we'll leave them as they are until Monday, when Armen is back.

Erica has the 100Ω resistors from when she was working with Dao.  She is, AFAIK, in Phoenix right now, and will be back by Monday, so hopefully we can lay hands on those shortly.  If that seems unlikely, we'll just buy some more.
This is impacting tests; bug 662154 is permaorange in debug builds on Win7 slaves across all trees.
Blocks: 662154
(In reply to Dustin J. Mitchell [:dustin] from comment #7)
> So in light of comment 0, this shouldn't have happened -- my fault for not
> catching that before Jake knocked this out.
> 
> Per Bear, it seems not to have affected screen resolutions, so we'll leave
> them as they are until Monday, when Armen is back.
> 
> Erica has the 100Ω resistors from when she was working with Dao.  She is,
> AFAIK, in Phoenix right now, and will be back by Monday, so hopefully we can
> lay hands on those shortly.  If that seems unlikely, we'll just buy some
> more.

(In reply to Matt Brubeck (:mbrubeck) from comment #8)
> This is impacting tests; bug 662154 is permaorange in debug builds on Win7
> slaves across all trees.

mbrubeck: The dongles were installed today by accident. I don't have the exact time, but comment#^ makes me suspect sometime this morning. I'm trying to gauge how certain you are that this orange is caused by the new dongles. I know it's hard to judge, but do you have any idea whether the permaorange test you see could in any way be related to any code landings that happened today, coincidentally around the same time as the dongle work?

dustin: From comment #8, it looks like waiting until Monday may not be an option. Worst case, can we just revert: pull the dongles and get those tests back to green, until we are ready for it and know that staging machines with the new dongles pass with green tests?
See, e.g., https://tbpl.mozilla.org/?tree=Mozilla-Beta&rev=01ef9195f79b for certainty: it landed on Tuesday, well before this started, but my retriggers that should have proved it was still green instead passed on 052, 058 and 062, and failed on 014, 018 and 043; the difference fits rather nicely with "above or below 50."
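That "above or below 50" split can be sanity-checked mechanically. The helper below is hypothetical (not from this bug), but it mirrors the triage: the dongled slaves were 001 through 050, so failures should cluster at or below that cutoff:

```python
def slave_number(name):
    """Extract the numeric suffix from a slave name, e.g. "talos-r3-w7-014" -> 14."""
    return int(name.rsplit("-", 1)[1])

def split_matches_cutoff(passed, failed, cutoff=50):
    """True when every passing slave is above the cutoff and every failing one
    is at or below it, i.e. the green/orange split lines up with the dongled range."""
    return (all(slave_number(n) > cutoff for n in passed) and
            all(slave_number(n) <= cutoff for n in failed))

# The retriggers mentioned above:
passed = ["talos-r3-w7-052", "talos-r3-w7-058", "talos-r3-w7-062"]
failed = ["talos-r3-w7-014", "talos-r3-w7-018", "talos-r3-w7-043"]
print(split_matches_cutoff(passed, failed))  # -> True
```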
Because Jake rocks way way above and beyond the call of duty, the exact time when they were uninstalled was 9:47 PM.
According to the meeting I just had with coop et al., this is definitely causing permaorange on all talos boxes below 50, so the dongles should be removed ASAP.
Severity: normal → critical
Obviously I've taken too many cold meds to be able to read clearly.  As philor states in comment 11, the dongles were removed last night.
Severity: critical → normal
Depends on: 712630
I'm very sorry this happened.
I made comments, before I left, on the blocking bug (rather than this one) and added a dependency to it for bug 710233 which asks the gfx team to fix such perma-oranges.
I've asked again for people from gfx to take on the bug.
12 new dongles have been soldered.  Plus the ones I got from Erica. This brings us to 80 dongles w/adapters that are in scl1 and ready to be deployed.  I will of course *wait* until this is unblocked and I get the green light to deploy. :-P

(In reply to Armen Zambrano G. [:armenzg] - Release Engineer from comment #14)
> I'm very sorry this happened.

But no one is sorrier than me.
gfx seems to have landed a fix.
I will be testing in the next day (as the release permits) that we're good to go.
Whiteboard: waiting on a test run on preproduction
I have put talos-r3-w7-036 and talos-r3-w7-053 on my development masters taking jobs.
We should have results in the morning and determine if there are no more perma-oranges.
After that we can schedule a downtime whenever it is possible.
I re-opened bug 712630 to track down 2 new found failures.

Re-triggering the jobs again to confirm that they are permanent failures.
Whiteboard: waiting on a test run on preproduction → waiting on 2 new perma oranges
Duplicate of this bug: 719190
Priority: -- → P1
Poked developers in bug 712630.
I am doing a last dry run.
We should be ready as soon as I verify the results.

What date and time could we schedule this?
We will need to ask for a downtime.
This might be done next week. To be discussed and finalized in today's relops meeting.
It will also need to be discussed with jhford, who will be on buildduty next week.
I can help from EDT to gracefully shut down the Windows test masters 45-60 mins ahead of the work.
dividehex says that it should not take longer than an hour to add all the dongles.
We should ask for a 2 hour window downtime and open earlier if we need to.
We will just have to re-trigger a lot of jobs as soon as the dongles are attached.

On the day prior to the downtime, I will trigger jobs in staging to verify that no new perma-oranges got introduced.

Does that make sense?
Whiteboard: waiting on 2 new perma oranges → probably: downtime to be scheduled next week with other IT work at that colo
I have to verify it once more as one of the two slaves that have the dongle did not really have a higher screen resolution.

Sorry for the false news :(
Whiteboard: probably: downtime to be scheduled next week with other IT work at that colo → waiting on dependent bug
Armen: any update here? Before the MozCamp, I recall you saying you might have thought of a way to fix or even avoid this.
The idea was to enforce the screen resolution. This would mean that even after the dongles are attached we would still be running in the same state, and we could then decide, at our own pace, whether to change the screen resolution and even choose which branches to run at which resolution.
I need more time to test this. I am spread very thin right now. We can talk tomorrow about helping me focus on the right items.
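As a rough illustration only (not the actual patch), enforcing a resolution on a Windows slave could look something like the sketch below. The target resolution, the function names, and the fail-the-run exit-code convention are all assumptions:

```python
import ctypes
import sys

# Hypothetical target: whatever resolution the slaves are pinned to.
TARGET = (1280, 1024)

def resolution_matches(current, target=TARGET):
    """Return True when the detected resolution equals the enforced one."""
    return tuple(current) == tuple(target)

def current_resolution():
    """Query the primary display size via the Win32 API (Windows only)."""
    user32 = ctypes.windll.user32
    # SM_CXSCREEN = 0, SM_CYSCREEN = 1
    return (user32.GetSystemMetrics(0), user32.GetSystemMetrics(1))

if __name__ == "__main__" and sys.platform == "win32":
    cur = current_resolution()
    if not resolution_matches(cur):
        # Fail loudly instead of running tests at a drifted resolution.
        print("resolution drifted to %sx%s; failing the run" % cur)
        sys.exit(1)
```

The point of the check is that attaching or removing a dongle then becomes a no-op from the tests' perspective: the run fails fast if the resolution ever differs from the enforced one.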
I started working on this again on Friday (see dependent bug). It seems that my idea works (preliminary testing) and today I will be working on reviewing results and polishing patches.
At this moment no downtime seems necessary and we could soon be deploying the dongles.
I would prefer to deploy 10 dongles first to prove that my changes are really a no-op, and then deploy the rest either that day or at another time.
dividehex and I spoke on IRC, and I noted the plan on the dependent bug.
We have planned to deploy 5 dongles on Tuesday.
Whiteboard: waiting on dependent bug → [5 dongles to be deployed on Tuesday 6/12/12]
Depends on: 763031
The code change was deployed successfully and we can now deploy 5 dongles in production.

I am going to change the dependencies:
1) deploy 5 dongles in production
2) check that there are no regressions
3) deploy remaining dongles in production
4) check that there are no regressions
5) (bug 712630) ask developers to take care of using the try server to fix the orange
Blocks: 712630, 763031
No longer depends on: 712630, 763031
dividehex, I have disabled these slaves to get a dongle onto:
* talos-r3-w7-001 (staging)
* talos-r3-w7-002 (staging)
* talos-r3-w7-003 (staging)
* talos-r3-w7-004
* talos-r3-w7-005
* talos-r3-w7-006
* talos-r3-w7-007
* talos-r3-w7-008
* talos-r3-w7-010 (staging)

Would you also want to do #9 so we have a complete range? (I ask since we had only agreed yesterday to staging slaves and 5 production slaves).

Thanks!
(In reply to Armen Zambrano G. [:armenzg] - Release Engineer from comment #29)
> dividehex I have disabled these slaves to get a dongle unto:
> * talos-r3-w7-001 (staging)
> * talos-r3-w7-002 (staging)
> * talos-r3-w7-003 (staging)
> * talos-r3-w7-004
> * talos-r3-w7-005
> * talos-r3-w7-006
> * talos-r3-w7-007
> * talos-r3-w7-008
> * talos-r3-w7-010 (staging)
> 
> Would you also want to do #9 so we have a complete range? (I ask since we
> had only agreed yesterday to staging slaves and 5 production slaves).
> 
> Thanks!

That will not be a problem.  I will be at SCL1 in the afternoon.
Dongles have been deployed to:
> * talos-r3-w7-001 (staging)
> * talos-r3-w7-002 (staging)

> * talos-r3-w7-004
> * talos-r3-w7-005
> * talos-r3-w7-006
> * talos-r3-w7-007
> * talos-r3-w7-008
> * talos-r3-w7-009
> * talos-r3-w7-010 (staging)

A dongle will be deployed to talos-r3-w7-003 (staging) when it is finished being imaged.
talos-r3-w7-003 has been reimaged and the dongle is attached
talos-r3-w7-0[01-10] & 36 & 56 have dongles

I have not been able to spot any perma-oranges on the production slaves with dongles.

dividehex, who can deploy all the remaining dongles, and when? Next week is fine (if possible). How long do you estimate it will take?

I believe we are talking about 67 Windows 7 32-bit production slaves.

We can schedule a small downtime if you think that will help, even though it is not necessary.
Whiteboard: [5 dongles to be deployed on Tuesday 6/12/12] → talos-r3-w7-0[01-10] & 36 & 56 have dongles
No longer blocks: 702504
No longer blocks: 763031
Depends on: 705854, 708361, 710214, 763031
Armen: I can do this today or on 06/26. All of RelOps, myself included, will be in Berlin next week. If I do it today and something goes wrong, you will need to file a bug with DCOps to have them remove the dongles.

Should only take about 5 mins to deploy to all talos-r3-w7's in SCL1.  We should NOT schedule downtime for this if it's not needed.

Let me know if you want me to move forward today or if you want to wait until RelOps is back from Berlin.
We decided to wait until Jake is back.
Can we deploy the dongles this week?
Which day/time can we do this?

I am on duty so I can prepare the slaves in advance.
Armenzg: does Wed 6/27 in the afternoon work for you?
Sounds good! (updating white board)
Whiteboard: talos-r3-w7-0[01-10] & 36 & 56 have dongles → deploying on Wed 6/27 PDT afternoon - talos-r3-w7-0[01-10] & 36 & 56 have dongles
I have set aside a first batch:
talos-r3-w7-0[11-50]

They should be ready in 50 minutes from this comment.

These minis will have a dongle:
talos-r3-w7-0{36,56}

These minis *might* have a dongle:
talos-r3-w7-024
talos-r3-w7-044

We will do the remaining batch after this first one.
dividehex did those machines.
I rebooted them and will check on them before we go for the 2nd set.

We're now disabling talos-r3-w7-051 to talos-r3-w7-079.
Those machines will be ready to get a dongle attached in 50 minutes from now.
dividehex: deployed except 55,67,70-79
talos-r3-w7-0{55,70,71,72,73,75,76,77,78} got done as well.

Just waiting on 67, 74 & 79.
All done.

I have been checking all day and nothing obviously broken has turned up.

Tomorrow we should have more data points and be sure that nothing went wrong.

Thanks Jake.
Whiteboard: deploying on Wed 6/27 PDT afternoon - talos-r3-w7-0[01-10] & 36 & 56 have dongles → all w7 slaves have a dongle attached
Status: NEW → RESOLVED
Closed: 8 years ago
Resolution: --- → FIXED
Component: Server Operations: RelEng → RelOps
Product: mozilla.org → Infrastructure & Operations