Closed Bug 691856 Opened 12 years ago Closed 12 years ago

Select graphic card for HP DL120 infra for Linux/Windows testing (post rev3 era)

Categories

(Infrastructure & Operations Graveyard :: CIDuty, task, P4)

x86
All

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: armenzg, Unassigned)

References


Details

IT has new hardware (HP ProLiant DL120 G7 Server) which we would like to use for testing purposes besides build jobs. This would be to replace the Linux and Windows rev3 minis.

To determine whether we can use this hardware, we need to verify that we can run unit tests and Talos [1].

To do so we need graphics processing power, which requires installing a graphics card in these machines. Joe has volunteered to determine what the appropriate graphics card is.

As I mentioned to Joe, we have to choose a graphics card that works for the following OSes: Linux (32/64-bit), Windows 7 (32/64-bit), and Windows XP. Having different graphics cards from OS to OS is a very expensive maintenance burden for IT/releng and should be avoided at all costs. The reason is that if we can't simply reimage a Linux testing machine as a Windows testing machine without a trip to the colocation facility to replace the graphics card, pool re-distribution/re-purposing becomes very expensive. I hope this makes sense.

The DL120 G7s have two expansion slots:
1 PCIe x16 Gen2 (x16 speed) (full-length, full-height) 
1 PCIe x8 Gen2 (x4 speed) (half-length low-profile)

See below for the datasheet [2].

The operating systems that we want to support will need to have compatible drivers for the card.

Once we have sign-off from Joe, IT, and someone from QA, we can proceed with installation on one or a few of those machines and start the setup process (adjust if other people should be signing off).

Please correct me if I made any incorrect assumptions and/or misunderstood anything and add anyone that you think should be involved.

[1] If anyone knows that we should definitely run talos on slow machines speak up and we can drive that discussion offline.
[2] http://h20195.www2.hp.com/V2/GetPDF.aspx/4AA3-3691ENW.pdf
Assignee: nobody → joe
Here is a list of the things I'd prefer, in order:

1. Multiple GPU vendors per OS. That is, some of the (for example) Windows 7 machines run using AMD GPUs, some NVIDIA, some Intel. Repeat for all OSes.
2. Single GPU vendor per OS, different GPU vendor cross-OS. This still gets us multiple GPU vendor testing, but is obviously not optimal.
3. Single GPU vendor for all machines.

Now, I know that due to performance testing, #1 is a non-starter. However, if we could differentiate performance results per GPU, that'd be awesome; it would in fact make it so pool re-distribution would be possible too.

#2 might be best for our testing-on-different-GPU vs performance testing buck, but as Armen points out, it makes pool re-distribution impossible. Just how much of a hard stop is that, anyways?

#3 is easiest for everyone involved, but has the least testing efficacy.
Let's aim for #3 if there are no blocking issues since it is the one that would make things easier.

Any suggestions?
(In reply to Armen Zambrano G. [:armenzg] - Release Engineer from comment #2)
> Let's aim for #3 if there are no blocking issues since it is the one that
> would make things easier.

In fact, IT/releng will likely insist on #3. Any other option and we can't move machines between pools for capacity as required.

Now, that said, if the graphics team is interested in getting some dedicated hardware setup to run different graphics hardware, I'm sure we could spin off a side project to do that. It just wouldn't be at the same scale as the rest of the build pool. I would also want to make sure we had a better rollover plan for any special hardware we set up, so we don't end up with another geriatric testing setup. 

If there is interest in this, please file a bug.
After talking with Joe (who is on holiday) and looking around for a while, I have chosen NVIDIA's GeForce GT 430, for several reasons.

1) It is Nvidia (Joe said any Nvidia should do)
2) It seems to be mid to low range (between GTS and normal GeForce) ~$90
3) I believe it is good for the x16 slot
4) It is not extremely new
** if we need a new model we can choose 440 or 520 if they meet requirements
5) It has drivers for all 5 OSes we care about

I really don't know if I chose it with good criteria but this is what I have. Suggestions are welcome. Otherwise, let's give it a shot.

IT, does this card work for you?
Does anyone have any objections? Shall we give it a shot?

Would you like me to take this to dev.platform or dev.planning? Or Yammer?
I believe we have the right people in this bug.

[1] http://www.nvidia.com/object/product-geforce-gt-430-us.html
[2] http://store.steampowered.com/hwsurvey (Gamers' survey from Sept. 2011)
Assignee: joe → armenzg
Status: NEW → ASSIGNED
Shall we try getting a few of the proposed cards and trying them out?

IMHO we won't know until we try.
Priority: -- → P2
Sure, but bear in mind we don't have a working tester image for the DL120G7's yet, either, so testing may be a bit difficult.

Will you be leading the new-linux-testers effort?  If so, let's schedule a time to talk.
From IRC, this is running ahead of the new-refimage project, and hoping to validate that the graphics card is adequate.

We can spare a few systems in the relabs cluster to test these.  There are currently three that I used for rabbitmq testing, and two allocated to jhford.  I'll order three cards.

We're confident that these hosts will work as builders, using more up-to-date operating systems, but HP does not list any of the testing OSes as officially supported, because they're all desktop/end-user OSes.  The latest Fedora and Windows 7 are likely to work, but XP may be a stretch.  For the former two we'll try 32- and 64-bit versions.  This will be a chance to get an early idea of the suitability of these machines as testers.

Note that this hardware is remotely manageable, so the plan is that once the cards are installed, relops will remotely install the relevant operating system and hand over to armen for testing.
In looking more deeply, I see that 
  http://www.nvidia.com/object/product-geforce-gt-430-us.html
lists this as dual-slot width, but
  http://h18004.www1.hp.com/products/quickspecs/13504_na/13504_na.html
shows that the PCI-e x16 slot is only single-width.

I think you'll need to find a card that is single-width.  So, back to the search :(
I have a single-slot GT 430
Can you post a link to it?
Newegg:
http://www.newegg.com/Product/Product.aspx?Item=N82E16814130579

Evga.com (same card):
http://www.evga.com/products/moreInfo.asp?pn=01G-P3-1335-KR&family=GeForce%20400%20Series%20Family&sw=


is what I have, I think.  I know for sure that it's a GT 430 from EVGA and is single-slot.  I am not 100% sure if it is passive.  Taking it out of my home theater setup would be a pain, so I'd rather leave it in there unless this is an emergency.
armen, look good to you?

Definitely no need to take it out - we'll buy some :)
(In reply to Dustin J. Mitchell [:dustin] from comment #12)
> armen, look good to you?
> 
> Definitely no need to take it out - we'll buy some :)

This looks good. Feel free to use this bug to track it or file another one.

Thank you guys!
(In reply to John Ford [:jhford] from comment #11)
> Newegg:
> http://www.newegg.com/Product/Product.aspx?Item=N82E16814130579
> 
> Evga.com (same card):
> http://www.evga.com/products/moreInfo.asp?pn=01G-P3-1335-
> KR&family=GeForce%20400%20Series%20Family&sw=

These aren't quite the same card - the first is
  01G-P3-1335-KR - http://www.evga.com/products/pdf/01G-P3-1335.pdf
while the second is
  01G-P3-1430-LR - http://www.evga.com/products/pdf/01G-P3-1430.pdf

From the PDFs, the differences are (1335 vs. 1430):
 Memory Clock: 1200 vs. 1400 MHz
 Memory Bit Width: 64 vs. 128 bits
 Memory Bandwidth: 9.6 vs. 22.4 GB/s
 Dual-Link DVI Capable: no vs. yes
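
For what it's worth, those bandwidth figures are self-consistent: bandwidth scales with bus width times transfer rate. A quick sketch, assuming the quoted memory clocks are effective transfer rates in MT/s (the `bw` helper name is just for illustration):

```shell
# GB/s = (bus width in bits / 8 bits-per-byte) * (transfer rate in MT/s) / 1000
bw() { awk -v bits="$1" -v mts="$2" 'BEGIN { printf "%.1f\n", bits / 8 * mts / 1000 }'; }

bw 64 1200    # 1335: prints 9.6
bw 128 1400   # 1430: prints 22.4
```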

These also aren't passive - the images on the sites you linked to all show a fan.  They are single-width, though.

The 1335 looks to be discontinued:
  http://www.newegg.com/Product/Product.aspx?Item=N82E16814130656
so I suppose that means we should go with the 1430!  I just need to check up on the suitability of a passive card before we get these ordered.
I meant suitability of an *active* card :)

It sounds like this should work fine, so I'll order up three 1430's for delivery to mtv1.
Fine folks of desktop: can you order three 01G-P3-1430-LR's from the vendor of your choice, for delivery to mtv1 and eventually 3.MDF?

The newegg link - http://www.newegg.com/Product/Product.aspx?Item=N82E16814130579 - is one possible vendor, but whatever source is best for you is fine.

If you get tracking info, please add it here and I'll keep an eye on it.
Assignee: armenzg → desktop-support
Component: Release Engineering → Server Operations: Desktop Issues
QA Contact: release → tfairfield
Assignee: desktop-support → aignacio
As a note to self, once these arrive we'll need to get lmsensors installed to monitor GPU temperature, or use something like

nvidia-settings -q [gpu:0]/GPUCoreTemp | grep "Attribute" | sed -e "s/.*: //g" -e "s/\.//g"
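
For reference, here's a sketch of that parsing step run against a canned line. The sample text below is an assumed approximation of nvidia-settings' usual query output, not captured from these machines:

```shell
# parse_temp extracts the numeric temperature from nvidia-settings' query
# output: keep the "Attribute" line, strip everything up to the last ": ",
# and drop the trailing period (same sed as the one-liner above).
parse_temp() { grep "Attribute" | sed -e "s/.*: //g" -e "s/\.//g"; }

# Assumed sample of what `nvidia-settings -q [gpu:0]/GPUCoreTemp` prints:
sample="  Attribute 'GPUCoreTemp' (relabs01:0[gpu:0]): 52."
echo "$sample" | parse_temp    # prints 52
```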
Whiteboard: On order
Status: ASSIGNED → RESOLVED
Closed: 12 years ago
Resolution: --- → FIXED
Whiteboard: On order → Received
Sweet, thanks!  I'll get hands laid on these shortly, and get them installed.
Assignee: aignacio → server-ops-releng
Component: Server Operations: Desktop Issues → Server Operations: RelEng
QA Contact: tfairfield → zandr
Hi, Ann, can you please drop these at Matt Larrain's desk today (or let us know where you are, and we can come get them from you if you're in MTV).  Thanks!
Assignee: server-ops-releng → aignacio
Status: RESOLVED → REOPENED
Component: Server Operations: RelEng → Server Operations: Desktop Issues
QA Contact: zandr → tfairfield
Resolution: FIXED → ---
Ann, Can you put these on Matt Larrain's desk (where I'm sitting) ASAP?  I need to get these installed today, as I'm only around for today :)
Assignee: aignacio → dustin
Component: Server Operations: Desktop Issues → Server Operations: RelEng
QA Contact: tfairfield → zandr
..or tell me where they are :)
Tabitha said she hasn't seen these.
Assignee: dustin → desktop-support
Component: Server Operations: RelEng → Server Operations: Desktop Issues
QA Contact: zandr → tfairfield
Assignee: desktop-support → tromero
I'll be in tomorrow (Friday) from 9-noon.  This isn't an emergency, but it will be much easier for me to install these myself than to direct others to do so, and it seems silly not to because of a simple miscommunication over where the cards are.
Did it happen?
Nope.  Tabitha, we're almost at a week of "Received" but not actually received.  What can we do to find these?  Or can we just order new ones?
The eVGA GeForce GT 430 graphics card was reordered.  ETA: 11/18
Whiteboard: Received → ETA 11/18
Tabitha - thanks for tracking that down.

Ann - great, and where are they shipping to?  Do you have tracking information?  Tabitha mentions they are shipping, or did ship, to Armen in Toronto.  If that's the case for the previous batch, and the new batch is going to Mountain View, great.  Otherwise, can you get a head start on shipping arrangements to send those from Toronto to Mountain View, since that's where they're needed (per comment 16).

Also, you used the singular in comment 26, but comment 16 specifies three cards.  Were three ordered?
3 graphics cards will be shipped to Mountain View.  ETA: 11/18
Armen has located the original three graphics cards in Toronto, and will bring them to Mountain View next time he's in town.  We don't need six (yet), so there's no rush on that.
I'll see if Matt or Jake can track these down tomorrow (11/18) afternoon.
Assignee: tromero → dustin
Component: Server Operations: Desktop Issues → Server Operations: RelEng
QA Contact: tfairfield → zandr
Sorry for the inconvenience.  The vendor is saying that the graphics cards are on backorder.  Ship Date 11/25
Status: REOPENED → ASSIGNED
Whiteboard: ETA 11/18 → Backordered-Ship Date 11/25
FTR I only got one card in Toronto. I should have been more specific in my email.
The two GT430 video cards came in.  Who am I supposed to deploy them to?


-Vinh
Hm, there should be three 01G-P3-1430-LR cards.  Please give them to Matt Larrain.
Only two came in.  The third one is on order.  I will place these two on Matt's desk.
Whiteboard: Backordered-Ship Date 11/25 → Backordered-Ship Date 11/25 - 2/3 deployed. Waiting for the third video card to arrive.
Please install these cards in relabs01 and relabs02.  I will kickstart those hosts with CentOS 6.0, and hand them to Armen.
Assignee: dustin → server-ops-releng
colo-trip: --- → mtv1
Third graphics card deployed.
OK, please install them in relabs01, relabs02, and relabs06 :)
Assignee: server-ops-releng → jwatkins
Cards have been installed in relabs01, relabs02, and relabs06 :)
Great, thanks!

Hearkening back to comment 7 and yesterday's IRC conversation, it's still not clear what operating system should be installed here.

Armen, how would you like to proceed?  I can do a CentOS 6.0 kickstart install for you, or I can add a username/password to the iLO for you, and you can install whatever you'd like.
Assignee: jwatkins → armenzg
Whiteboard: Backordered-Ship Date 11/25 - 2/3 deployed. Waiting for the third video card to arrive.
I think iLO works best for me, even if it requires extra work on my part.

Thanks a lot for making this happen.

FTR: I won't be able to jump on this right away.
OK, I've set this up.  You can log in to
  https://relabs01-mgmt.build.mtv1.mozilla.com
  https://relabs02-mgmt.build.mtv1.mozilla.com
  https://relabs06-mgmt.build.mtv1.mozilla.com
with username 'armenzg' and the releng root pw.  You should have access to the virtual console (use the Java one, not .NET), virtual media, and control of the server power.  Things are fairly self-explanatory, but we can help out in #ops with any questions.

Please report your progress here - this will be a nice "first peek" at how well desktop operating systems run on this hardware.
Component: Server Operations: RelEng → Release Engineering
QA Contact: zandr → release
Priority: P2 → P3
There's been discussion lately of AMD-only graphics bugs on R-D. I realize that these are nVidia cards, but this work would let us create a pool with both nVidia and AMD cards.

On the Mac side, r5 minis have AMD GPUs.
I'm confused why this is a P3.  We finally have the cards, we finally have machines, let's finish this out and get it done.  The old Rev3 minis are getting more and more out of date and when we "need" to stand up new windows/linux testers you know that everyone will want it "last week".  I think this is our chance to get ahead of the curve before things get urgent.
It's mostly a resource allocation issue - we're looking at a datacenter move, wrapping up the new rev4 tester hardware, building rev5 mac builders, and creating a new linux builder image first.  We could potentially swap testers' priority (and hardware?) with the latter item, but the others already have firm (and close!) deadlines behind them.

And, not to nitpick, but we *don't* have the machines - Armen is using three machines from the relops lab to test the hardware, which he'll need to give back.  In fact, at the moment we don't have anywhere to install new machines even if we had them (that will have to be scl3).

So, bottom line, aside from experimenting with graphics hardware and linux images on this platform, there's not much we can do, and given the balls currently in the air I think it's premature to have much conversation about priority.
(In reply to Dustin J. Mitchell [:dustin] from comment #45)
> So, bottom line, aside from experimenting with graphics hardware and linux
> images on this platform, there's not much we can do, and given the balls
> currently in the air I think it's premature to have much conversation about
> priority.

Cool, thanks for the update, Dustin.  That's exactly the context I was missing.  I appreciate you filling me in.
Priority: P3 → P4
(In reply to Clint Talbert ( :ctalbert ) from https://bugzilla.mozilla.org/show_bug.cgi?id=737282#c3) 
> I would love for us to move away from mac mini's wholesale for windows and
> linux testers. But that's a larger issue that's been going on for over two
> years now. Current work on that appears stalled in bug 691856.

Nobody likes the minis. Both IT and releng are resource-constrained due to the in-progress colo move, and are currently focusing on replacing the aging 10.5 rev2 Mac mini builder platform with a new 10.7 rev5 Mac mini builder platform. Per IT, those old minis can't live in the new colo, so this *is* the critical path.

Once the colo move is done, how fast *would* we be able to pivot on this? I see a  bunch of unknowns:

* once stage 1 of scl3 is done (May), how much space will we have in our various colos to deploy HPs as testing boxes, assuming a 1-for-1 replacement of current test machines with HP machines? Keep in mind we will have space reqs for pandaboards also.
* I know we have more space coming in scl3 in Q3 as part of the releng BU build-out, and we probably don't *want* to put these machines elsewhere. Given that, do/can we wait until Q3 for this? Is there anyone that could even work on it (releng or IT) in the interim?
* because it's taken so long, is there different hardware (or a different HW rev) that we should be considering?
* how quickly can we order/deliver/rack more of these machines & graphics cards?
The only place we'll be putting new hardware is scl3.  Right now we have 9 racks we allocated to releng.

* 1.5 of them are currently being used by minis (a rack holds 64 minis).  Racks configured to hold minis will not hold other types of hardware based on the PDU setup. So we may want to allocate more racks to minis, based on how many more you think you might need for builders or capacity for 10.8.
* Part of 1 rack is being used for server equipment (3 HP DL 360s at the moment)
* We haven't been told how many pandaboards we're going to need since we can't even get them working yet.  I would guesstimate somewhere between 10 and 15 pandaboards per 4U.  We'd likely want to allocate at least 2 racks to them, maybe 3?

We will have more space (19 racks) once the expansion is complete, but we'll have a total power budget of 100kW.  We must plan around that for *all* releng hardware since scl1 will be closing in a bit over a year and everything will be consolidated in scl3.  Knowing how many and what type of servers we're going to need for growth (for every platform) is critical.

Please remember to take into consideration any hardware you're going to want for w8, mountain lion, etc.  Also please keep in mind that we cannot use modern hardware to run tests on ancient OSes (e.g. XP, CentOS 5.0) since they will not be supported.

Timing for ordering hardware and getting it racked will depend on the hardware and when we want to get this done.  There's probably more time constraint now before the sjc1 move than there will be afterwards for racking and cabling because folks are very busy trying to evac sjc1.

There are additional prerequisites for getting this working other than just buying and installing the hardware (buildbot masters, configuration management, how servers are imaged, etc).

We would need to sit down and come up with a comprehensive plan for each platform to give an accurate time estimate, but testing on centos6 is probably the lowest hanging fruit at this point.
(In reply to Dustin J. Mitchell [:dustin] from comment #42)
> OK, I've set this up.  You can login to
>   https://relabs01-mgmt.build.mtv1.mozilla.com
>   https://relabs02-mgmt.build.mtv1.mozilla.com
>   https://relabs06-mgmt.build.mtv1.mozilla.com

Armen, are these still being used?  I'd like to reclaim the systems for labs use otherwise.
(In reply to Dustin J. Mitchell [:dustin] from comment #49)
> (In reply to Dustin J. Mitchell [:dustin] from comment #42)
> > OK, I've set this up.  You can login to
> >   https://relabs01-mgmt.build.mtv1.mozilla.com
> >   https://relabs02-mgmt.build.mtv1.mozilla.com
> >   https://relabs06-mgmt.build.mtv1.mozilla.com
> 
> Armen, are these still being used?  I'd like to reclaim the systems for labs
> use otherwise.

All yours.
Which is to say, Armen's not working on this project, but it's quite a bit more critical now, so we should leave these cards in place, and hopefully another relenger will be assigned to evaluate them.
Assignee: armenzg → nobody
I think this belongs in Platform Support.
Component: Release Engineering → Release Engineering: Platform Support
QA Contact: release → coop
The graphics card selected (eVGA GeForce GT 430, comment 26) will work for the HP machines. We should open another bug to actually buy them in bulk once they're needed (soon). We're already installing the existing cards in bug 755772.

However, Amy tells me that if we're going to try to squeeze as much testing density out of new testing hardware as possible, we'll be going with different hardware, possibly dual iX systems (half-rack). These will require low-profile graphics cards, so we should also file a bug to determine whether a comparable low-profile card exists.
Status: ASSIGNED → RESOLVED
Closed: 12 years ago12 years ago
Resolution: --- → FIXED
Blocks: 760440
(In reply to Dustin J. Mitchell [:dustin] from comment #27)
> Ann - great, and where are they shipping to?  Do you have tracking
> information?  Tabitha mentions they are shipping, or did ship, to Armen in
> Toronto.

I cleared my desk and found this card in my drawers.
I have asked hilzy to ship it back to MaRu at the MV office.
(In reply to Chris Cooper [:coop] from comment #53)
> We should open another bug to actually buy them in bulk
> once they're needed (soon). 

Once bugs are opened for that / the buying of the machines, please can they be marked as dependencies of bug 764713, just to give something to point at :-)
(In reply to Ed Morley [:edmorley] from comment #55) 
> Once bugs are opened for that / the buying of the machines, please can they
> be marked as dependants of bug 764713, just to give something to point at :-)

We'll be using the same cards, but they'll be going into the 4-node iX machines now instead of these HP machines. Yes, we'll file bugs to get the graphics cards ordered when we place our first order for the iX nodes.
Product: mozilla.org → Release Engineering
Component: Platform Support → Buildduty
Product: Release Engineering → Infrastructure & Operations
Product: Infrastructure & Operations → Infrastructure & Operations Graveyard