Closed Bug 384966 Opened 17 years ago Closed 17 years ago

Migrate from bl-bldxp01, bl-bldlnx01 to qm-pxp01-05, qm-plinux01-05

Categories

(Release Engineering :: General, defect, P3)

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: joduinn, Assigned: joduinn)

Details

Currently, the machines bl-bldxp01, bl-bldlnx01 are Tier1 supported machines (24x7). 

There are two things hurting IT here. 

1) These machines are in the office, not at MPT. When they die at night, it requires someone to physically drive to the office. This is unfair.

2) We believe the machines bl-bldxp01, bl-bldlnx01 have non-standard/non-defined images on them. It would help if these machines were configured like the other machines IT maintains.


Instead of reconfiguring and moving bl-bldxp01, bl-bldlnx01, Justin has set up new replacement machines qm-pxp01-05, qm-plinux01-05 in MPT specifically so we can do this transition.

Can we migrate:
bl-bldxp01 -> qm-pxp01-05
bl-bldlnx01 -> qm-plinux01-05
Because the new performance farm is set to replace these machines, and it will not have 24x7 support, I don't think there's any reason to do this. 

1) I'm not sure how much it will help to have these in the colo instead of the office. Has anyone ever driven to the office at night to restart them? If that is the requirement for Tier 1 machines, I don't think these should be Tier 1.

2) Actually, it's documented here:
http://wiki.mozilla.org/ReferencePlatforms/Test/WinXP

These machines are temporary; if they are causing significant problems, I think it'd be better to explicitly say that they have business-hours support, instead of trying to support both these and the new performance infrastructure.
Let me put it this way, we can either:

1) wait for talos to be finished and deployed
2) re-allocate at least two of these machines to run the tinderbox test harness

There are a few good ways to do #2: either run it standalone as described in this bug (like on bl-bld*), or run it from buildbot, which I believe is already installed on these machines.

CCing robcee and alice, since if we do #2 they should have some say in how we do it (they are currently using these machines).
From offline discussions, another option came up:

3) Set up buildbot (already installed) to run test-only-tinderbox on the new machines in MPT, and shut down the old machines in the office. This would be a transitional step until we can move to Talos (which also uses buildbot) running on the new machines. While this is an extra step, it would allow us to quickly move from hardware in the office to hardware in MPT (good for IT) and also let us start using the recently completed buildbot part of Talos.
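For illustration only, a minimal sketch of what option 3 could look like in a buildbot master.cfg. The builder/slave names and the wrapper script below are assumptions (not the actual configuration on these machines), and the exact step API varies between buildbot versions:

# Hypothetical master.cfg fragment: run the tinderbox test harness as a
# buildbot step on one of the new MPT machines. Names are placeholders.
# 'c' is buildbot's BuildmasterConfig dict defined earlier in master.cfg.
from buildbot.process.factory import BuildFactory
from buildbot.steps.shell import ShellCommand

perf = BuildFactory()
# 'run-test-only-tinderbox.sh' is a hypothetical wrapper around the
# existing standalone test-only tinderbox harness.
perf.addStep(ShellCommand(command=["bash", "run-test-only-tinderbox.sh"],
                          description="test-only tinderbox",
                          timeout=3600))

c['builders'] = [{
    'name': 'qm-pxp01-perf',   # assumed builder name
    'slavename': 'qm-pxp01',   # assumed slave name
    'builddir': 'perf',
    'factory': perf,
}]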
(In reply to comment #0)
> Currently, the machines bl-bldxp01, bl-bldlnx01 are Tier1 supported machines
> (24x7). 
> 
> There are two things hurting IT here. 
> 
> 1) These machines are in the office, not at MPT. When they die at night, it
> requires someone to physically drive to the office. This is unfair.

Is this happening? These machines are remotely accessible; they don't have hands-on support, but they can be reached remotely. Is there a list of outage reports (or bugs) where the solution involved someone coming into the office?

I think we would've heard (quite loudly) about this, but it's never been mentioned.

> 2) We believe the machines bl-bldxp01, bl-bldlnx01 have
> non-standard/non-defined images on them. It would help if these machines were
> configured like the other machines IT maintains.

The problem with this is that these machines were specifically purchased and installed with hardware/software combinations that consumers are likely to have. So, they don't have dual power supplies and they're not running Win 2003 Server or RHEL; they're running Windows XP and Fedora Core, I think.

We made these decisions on purpose, and rhelmer pointed to the (painstakingly) documented configuration.

So, we have to be careful with what we replace them with; the replacement may have to be something more consumer-oriented, which runs directly counter to what IT would normally put on a machine.

What I'd like to understand is:

1. When did this become a pain point for IT? It seemingly hasn't been an issue for the last N (N >= 4, at least) months, so if these machines are crashing all the time, that's something we agreed would be escalated (and it's something we should know about anyway).

2. Is it possible to move the hardware, as is, into the colo? This would solve the remote access problem. If not, why not?

3. We went to great pains to basically make these consumer-class machines, with consumer-class operating systems; if we're going to move to completely different hardware and software configurations, then this becomes a larger conversation that would involve the platform team, the testing team, and others. We'd need to inform them of the change well in advance, and get their input to ensure the systems these machines are being moved to meet the requirements.

As I understand it, it would likely involve a round of baseline tests that took... I think 1-2 weeks last time we did it. rhelmer would know for sure.

As I see it, none of the options presented are feasible or desirable, except option 1 (wait for talos), which is what I thought we were all doing/agreed to.
From an IT perspective, I won't have any machines in the office on 24x7 support.  We just don't have the infrastructure to support it there.  If you are OK with business hours support, then let's leave the machines in the office.

As for moving the hardware, we can move it to the colo if you'd prefer, but I thought (I could be wrong) that we bought these perf machines (from the same vendor and similar specs to the current office perf machines) to replace these boxes.
(In reply to comment #5)
> From an IT perspective, I won't have any machines in the office on 24x7
> support.  We just don't have the infrastructure to support it there.  If you
> are OK with business hours support, then let's leave the machines in the
> office.

I'm confused. These machines have been on 24/7 support for the last 8 months. They were specifically *the only* machines that we exempted from the office restriction (and you remember we threw a bunch of machines at the office into Tier 2 and 3 to make life easier, but not these two).

I understand that having these machines at the office is problematic, but we're actively working on talos, and that's at the colo, I believe. Moving these tests to another set of machines would likely require a new round of baselines, and the people that have done that work in the past are either working on talos or on other things.

Can you explain why this is suddenly a problem?

Please link to either outage reports or bugs.

> As for moving the hardware, we can move it to the colo if you'd prefer, but I
> thought (I could be wrong) that we bought these perf machines (from the same
> vendor and similar specs to the current office perf machines) to replace these
> boxes.

That I don't know; I'd defer to rhelmer and robcee/alice, but given how brittle these tests are, I think moving them to new installations on new hardware without performance baselines right now would make people trying to get 1.9 alphas shipped... not very happy.

I'm pretty sure this should be WONTFIXed, or at the very least, the summary should be changed to reflect that we're either moving the hardware or waiting for talos.
So this has always been an issue, and the fact that these machines going down closes the tree is the problem.  We haven't had a hardware issue yet, but when we do, I am not going to call people into the office.  

The history as I remember it is that rhelmer set them up in the office as test machines, and all of a sudden they became production machines.

Anyway, this has always been an issue, and it was clearly communicated that office machines won't be on 24x7 status.  I am pretty sure the plan has always been to move these to the colo, but I'll check with rhelmer.  Either way, you can WONTFIX if you want, but they'll be on 8x5 support for hardware issues.

I am not clear why you object to just moving the current hardware.  If that's the solution, I can have them moved today.
(In reply to comment #7)
> I am not clear why you object to just moving the current hardware.  If that's
> the solution, I can have them moved today.


I think moving the hardware is the best option here.
1) For background, apparently reed was here Saturday night when a blocker ticket was filed on the machines... so he walked downstairs and fixed it. However, we were just lucky that he happened to be in the office that late at night on a weekend. 

2) In last week's perf meeting, we agreed that the new replacement perf hardware would be Tier2 (9x5 office hours support only). If these should be Tier1 support, we need to revisit this in the next perf meeting.

3) It's therefore unclear to me why these existing perf machines are not also Tier2. Also, if these are changed to Tier2, should we close the tree if they fail?

Make sense?
Reed fixed ticket: https://bugzilla.mozilla.org/show_bug.cgi?id=384751


This bug is really about expectations. 

Are these machines critical Tier1 or not-as-critical Tier2? Once everyone agrees on that starting point, we can make changes accordingly. However, continuing to have machines configured as Tier2 machines, while expecting Tier1 support, is not good. 
(In reply to comment #9)
> 2) In last week's perf meeting, we agreed that the new replacement perf
> hardware would be Tier2 (9x5 office hours support only). If these should be
> Tier1 support, we need to revisit this in the next perf meeting.

Not the case - the idea is that we bought enough spares that we could keep these as 9x5.  I don't plan on changing that, given the crap hardware it's on (by design).  The issue with the current office infra is that there is only one machine, and if that one machine goes down, we close the tree (bad).
(In reply to comment #9)
> 1) For background, apparently reed was here Saturday night when a blocker
> ticket was filed on the machines... so he walked downstairs and fixed it.
> However, we were just lucky that he happened to be here in the office at that
> late night on a weekend. 

According to the ticket, he connected via RDP/VNC, so it was just a coincidence that he was at the office. Whoever was on call wouldn't have had to be at the office. He just happened to be.

> 2) In last week's perf meeting, we agreed that the new replacement perf
> hardware would be Tier2 (9x5 office hours support only). If these should be
> Tier1 support, we need to revisit this in the next perf meeting.

I don't agree with that (and I'm willing to bet most engineers won't either), but I wasn't at the meeting, so I'll defer to the people that were there.

All I know is that traditionally, performance machines going down have closed the tree, because when we don't have perf coverage, we frustrate people like bz and dbaron, who are typically tasked with chasing down perf regressions. Were they asked?

> 3) Its therefore unclear to me why these existing perf machines are not also
> Tier2. Also, if these are changed to Tier2, should we close the tree if they
> fail?

If these perf machines crash/are unavailable, then someone will close the tree (whether or not it's us).

So they are expected to be Tier 1, and we've communicated that they're Tier 1. If that's changing, then we need to discuss this at the project meeting. But we can't say "Oops, they're not Tier 1 now."

Justin: As for moving the hardware to the colo, I think that's mostly fine (there'll have to be an outage window), but I don't know that it solves the problem you have.

Is it just the power/hands-on issues?
> have to be an outage window

And a bigger one than it seems at first glance, since moving them to the colo means switching pageload servers, which, as we've now seen, isn't just a matter of running a cycle to see if it reports some number, any number.
We probably want a half-day outage window at least, more likely a full day: close the tree, get a good baseline pre-move, do the move, get a good baseline post-move, open the tree.
If it's switching to new machines, why not run both at once for a bit as the transition rather than having an outage window?
Priority: -- → P3
Alright, Justin and I talked about this and came to the following agreement:

These machines have 24/7 on-call support for everything EXCEPT hardware support. This means if the power goes out at 3 am, no one will reboot them. Similarly, it means if a hard drive goes down at 3 am, no one will start rebuilding the machine until business hours.

This is a somewhat annoying problem because moving these machines to the colo solves the reliable-power/hands-on issue (someone's there to reboot, etc.), but it doesn't solve the hardware reliability problem. Talos addresses this by having multiple machines available.

Anyway, the upshot is:

1. bl-bld* machines have 24/7 on-call support for everything except hardware, which has business-hours support.
2. These machines present a risk, because they are not RAIDed, don't have dual power supplies, etc. This was done by design originally.
3. It's understood that these machines being down closes the tree.

I will be posting this to the newsgroups/my blog to remind people of this.

There's an agenda item at the next performance meeting to discuss either moving these machines into the colo (which only solves a subset of the problems), or moving the current infrastructure to the Talos machines (for now), to get redundancy, etc. (That was just a suggestion, and I'll let rhelmer, joduinn, robcee, and alice figure it out. ;-)

I'm going to re-assign to joduinn for now for an update on the resolution from that meeting; look for the post after the 4th.
Assignee: build → joduinn
Summary of discussion at today's performance meeting:

1) online support (meaning IT can connect and reset machines remotely) is available 24x7.

2) hardware support (meaning disk failure, etc., because the hardware is not redundant) requires a drive to the office, so is on a 9-5 business-hours basis only. 

3) downtime with these machines closes the tree. 

4) for catastrophic failure, we will make backup disk images of bl-bldlnx02, bl-bldlnx01, bl-bldxp01 (bug#387702) and will restore the disk image within a few hours (see the sketch after this list). It is the same spec machine, so there is no need to re-baseline numbers. Note: bl-bldlnx02 is currently being used for JProf, but was deemed ok to halt in this scenario.

5) once talos is online, these are no longer needed and can be shut down.

6) assuming that talos comes online in the next month or so, we feel the risk of hardware failure is liveable. If talos significantly overruns, we should reopen and revisit.
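For reference, a minimal sketch of the kind of backup/restore step item 4 describes, assuming the image is taken from a live Linux environment and written to an attached backup volume; the device name, image path, and the script itself are illustrative assumptions, not the actual procedure from bug#387702:

#!/usr/bin/env python
# Hypothetical disk-image helper; device and image path are assumptions.
import subprocess, sys

DISK = "/dev/sda"                    # assumed boot disk of the perf machine
IMAGE = "/backup/bl-bldxp01.img.gz"  # assumed location on a backup volume

def backup():
    # Stream the raw disk through gzip onto the backup volume.
    with open(IMAGE, "wb") as out:
        dd = subprocess.Popen(["dd", "if=" + DISK, "bs=4M"], stdout=subprocess.PIPE)
        subprocess.check_call(["gzip", "-c"], stdin=dd.stdout, stdout=out)
        dd.stdout.close()
        if dd.wait() != 0:
            sys.exit("dd failed")

def restore():
    # Write the saved image back onto the (replacement) disk.
    with open(IMAGE, "rb") as src:
        gz = subprocess.Popen(["gunzip", "-c"], stdin=src, stdout=subprocess.PIPE)
        subprocess.check_call(["dd", "of=" + DISK, "bs=4M"], stdin=gz.stdout)
        gz.stdout.close()
        if gz.wait() != 0:
            sys.exit("gunzip failed")

if __name__ == "__main__":
    backup() if sys.argv[1:] == ["backup"] else restore()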
Status: NEW → RESOLVED
Closed: 17 years ago
Resolution: --- → FIXED
Product: mozilla.org → Release Engineering