Closed Bug 834396 Opened 13 years ago Closed 11 years ago

pre-order sufficient graphics cards for releng testing infrastructure

Categories

(Infrastructure & Operations :: RelOps: General, task)

x86
All
task
Not set
normal

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: arich, Assigned: arich)

References

Details

Once we have the new w8 systems in production and verify that everything works fine with the graphics cards, we need to pre-order more so that we have enough for planned capacity plus some cushion. Current forecast is that we'll need 500 more cards just for planned capacity. The specific card is: NVIDIA GPU GeForce GT 610
It's been almost a year since this card was released. I'm getting concerned that we still don't have an answer on how many to pre-order. We should make this decision in the next couple of weeks.
Flags: needinfo?(hwine)
Redirecting to the decider. John - note that IT will be batching the "pre-order for future systems" with the "repair stock" order for existing systems.
Flags: needinfo?(hwine) → needinfo?(joduinn)
:arr - can we get a "last time buy date" for quantity 2K (ballpark) on these cards?
Flags: needinfo?(arich)
This is a low-end consumer product. AFAIK, nvidia does not give EOL projections on their hardware. That said, I've asked Matt Finney if he knows anyone at Nvidia who might have some idea. As I recall, joduinn is the one with the in there, though, so he's probably in the best position to get this information.
Flags: needinfo?(arich)
As I expected, Nvidia doesn't give out official word on this. What Matt said was that the rumor was that they should probably be available till near the end of the year. He's going to ask his partners to notify them if it is going EOL, but I don't know how much warning that gives us.
(In reply to Hal Wine [:hwine] from comment #2)
> Redirecting to the decider.
>
> John - note that IT will be batching the "pre-order for future systems" with
> the "repair stock" order for existing systems.

1) Unclear why these questions need to be co-tangled. Please order sufficient "repair stock" cards immediately, so that our current machines can be used even with the future predicted graphics card failures.

2) Estimates of how many other 4node ix machines we need to buy in order to handle future load can really only happen as we have the current orders of machines delivered, brought online, and handling test load in a measurable way. At that point, we can evaluate possibly ordering more 4node ix machines, and more nvidia cards.
Flags: needinfo?(joduinn)
(In reply to Amy Rich [:arich] [:arr] from comment #4)
> This is a low end consumer product. AFAIK, nvidia does not give EOL
> projections on their hardware. That said, I've asked Matt Finney if he
> knows anyone at Nvidia who might have some idea.

Please do keep us posted here.

> As I recall, joduinn is
> the one with the in there, though, so he's probably in the best position to
> get this information.

My only contact at nvidia was around tegras, and that was a long time ago. I can share contact info with you/mfinney if you have no other sources, but I suspect these people are in very different portions of nvidia, so it is doubtful they could help.
Amy, per comment 6, please order sufficient replacement stock to service the current set of orders.
Summary: pre-order sufficient graphics cards for releng testing infrastructure → order sufficient graphics cards as replacement stock for recent ix node orders
Please don't morph this bug. This bug is to pre-order cards for systems we don't yet own.
Summary: order sufficient graphics cards as replacement stock for recent ix node orders → pre-order sufficient graphics cards for releng testing infrastructure
Depends on: 857274
Hi, I would like to help drive this to completion.

For maintaining current capacity (500 nodes): Is 25 graphics cards per pool a good number? (25% of each pool) Could we assume that for the next 2 years our current pools would have 125 failing graphics cards?

For the Windows Firefox 64-bit project we might need to buy another 130 Win8 64-bit nodes plus their 155 graphics cards (+25 for maintenance). I'm still trying to determine what the plan will be.

The Ubuntu HW that I'm asking to re-purpose as WinXP, Win7 and Win8 won't need to be refilled as we are running most jobs on Ubuntu VMs. There are a few jobs to move from Fedora to Ubuntu but so far the plan is that all of them will move to the Ubuntu VMs.

If we look at joduinn's load graphs [1], we can get this data:
* From May 2012-2013 we are having between 5,250-6,500 pushes/month
** From May to March is a 25% load increase
** Growth over the last year has been slower but steady
* From March 2011-2012 we were having between 1,750-4,500 pushes/month
** In a year the load increased by 250%

I expect that next year we will have to increase each pool's capacity by 15-30%, which should look like this (rounded numbers; current -> projected - increase):
* Win8 - 130 -> 150-170 - 20-40
* Win7 - 130 -> 150-170 - 20-40
* WinXP - 130 -> 150-170 - 20-40
* Lin32 - 55 -> 65-75 - 10-20
* Lin64 - 55 -> 65-75 - 10-20

This means that we should be getting 80 to 160 more nodes.
This means that we should have 100 to 200 graphics cards for maintenance.
I expect the growth in 2015 to be smaller than 15-30%.

The numbers are:
* current maintenance - 125
* possible win8 increase - 155
* 2014's growth - 80-160
* 2014's maintenance - 20-40

Without thinking about 2015 we could be purchasing 380-420 graphics cards and be comfortable, as I assume that 25% graphics card breakage seems a little high. These numbers are not perfect but they give a rough and perhaps conservative idea.

Worst comes to worst, at the end of 2014 we could move one of the pools to a new, different graphics card. That would mean one of the OSes could not be considered for pool resizing, but that would be OK. We would just have to re-base talos numbers.

Does this help? Should we plug these numbers and assumptions into a spreadsheet to play with the values?

Another note: growth in pushes is not necessarily a good measure, but it helps to give a rough idea.

[1] http://oduinn.com/images/2013/blog_2013_03_trend.png
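For anyone who wants to play with the per-pool growth figures from the comment above, here is a minimal sketch of the arithmetic. The pool sizes and the 15-30% range come from that comment; the function names and the rule of rounding up to the next multiple of 5 are assumptions made for illustration (the comment only says "rounded numbers").

```python
# Pool sizes (test nodes) as quoted in the comment above.
POOLS = {"Win8": 130, "Win7": 130, "WinXP": 130, "Lin32": 55, "Lin64": 55}


def grown_size(size, growth_pct, step=5):
    """Pool size after growth, rounded up to a multiple of `step`.

    Rounding up to the next multiple of 5 is an assumption; the
    comment does not say how its numbers were rounded.
    """
    scaled = size * (100 + growth_pct)        # size * (1 + pct/100), times 100
    return -(-scaled // (100 * step)) * step  # integer ceiling division


def extra_nodes(growth_pct):
    """Total nodes added across all pools at the given growth percentage."""
    return sum(grown_size(n, growth_pct) - n for n in POOLS.values())


if __name__ == "__main__":
    for name, size in POOLS.items():
        low, high = grown_size(size, 15), grown_size(size, 30)
        print(f"{name}: {size} -> {low}-{high} (+{low - size} to +{high - size})")
    print("extra nodes:", extra_nodes(15), "to", extra_nodes(30))
```

Integer arithmetic is used on purpose so that borderline values such as 130 * 1.15 do not depend on floating-point rounding.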
arr says that, according to a study, the target failure rate for graphics cards is 3%. I will create a spreadsheet using 5% to be on the safe side, since we never know whether our cards could come out on the bad side.
I've recalculated this with a 5% overage for failure [1].
I've considered growths for 2014 & 2015 of A) 15/10% and B) 30/20%.

Assuming no extra 130 nodes for Windows Firefox 64-bit:
A) 165 graphics cards
B) 295 graphics cards

Assuming extra 130 nodes:
A) 209 graphics cards
B) 370 graphics cards

Do the numbers make more sense this way? What would you need to determine what could be an approximate good number of cards?

[1] https://docs.google.com/a/mozilla.com/spreadsheet/ccc?key=0AnE5wl4QTSYKdHkwa1U0WkNXblRWRlA2RHc3N3ZiV3c#gid=0
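As a rough illustration of how a failure-rate overage feeds into a card count, here is a small sketch. It does not attempt to reproduce the spreadsheet's figures, whose formulas are not shown in this bug. The function names are hypothetical, and applying the failure rate to existing plus new nodes is an assumption.

```python
def spare_cards(nodes, failure_pct):
    """Spare cards to stock for `nodes` machines, rounded up to a whole card."""
    return -(-nodes * failure_pct // 100)  # integer ceiling division


def cards_to_order(existing_nodes, new_nodes, failure_pct):
    """One card per new node, plus spares for the whole fleet.

    Assumption: spares cover existing and new nodes alike. The
    spreadsheet linked in the comment may count them differently.
    """
    return new_nodes + spare_cards(existing_nodes + new_nodes, failure_pct)


if __name__ == "__main__":
    # 500 current nodes, at the 3% and 5% rates mentioned in the comments.
    print("spares at 3%:", spare_cards(500, 3))
    print("spares at 5%:", spare_cards(500, 5))
    # 130 hypothetical extra nodes for a 64-bit pool, at 5%.
    print("order with 130 new nodes:", cards_to_order(500, 130, 5))
```

Changing `failure_pct` shows how sensitive the order size is to the assumed breakage rate, which is the question the 3% versus 5% discussion is about.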
To be clear, the A) and B) choices in comment 12 are assuming different projected growth percentages (which is evident if you look at the spreadsheet). What this does not take into consideration is bringing up any new platforms other than (possibly) testing a 64 bit build of firefox on windows 8 64 (the second set of numbers above). Is the unstated assumption either: 1) we will not bring any other new windows or linux testing platforms online in the next three years, or 2) those new platforms will use a different graphics card? I find #1 unlikely, but if we're making a choice for #2, your analysis seems reasonable.
I'm fine with future OSes having different graphics cards. I can run the plan and numbers through the rest of the release group. I don't think that we will have a Windows 9 coming out before 2016 unless Windows 8 is such a horrible OS that they will have to speed up Windows 9 (think of Vista). BTW, since we took the big budget hit this year, do you think we could try to increase our capacity an extra 10-15% to reduce how much coalescing we do? Is that something that we could be willing to do? We have initiatives to reduce the load but having extra capacity would help too.
We're going to assume that in the next 3 years we will need to support Windows 64-bit regardless of whether the decision comes in tomorrow or 12 months from now. I have sent an email to Melissa and other major stakeholders to see if they have any complaints about this plan. I would like to propose ordering 100 (or more) graphics cards for 64-bit builds. If we run the jobs on Win8 64-bit, we would need to increase the pool by 100. If we have to set up Win7 64-bit, we will need a separate pool, which would need 130 nodes. I will re-calculate our numbers this week. Sounds reasonable?
We should never need graphics cards for builders, correct? I just want to make sure that we're looking at test machines only for these figures.
(In reply to Amy Rich [:arich] [:arr] from comment #16)
> We should never need graphics cards for builders, correct? I just want to
> make sure that we're looking at test machines only for these figures.

Correct. I am talking about test machines.

A) win32 build --> test on Win8 x64 machine
B) win64 build --> test on Win8 x64 machine OR test on Win7 x64 machines

Adding 64-bit builds causes us to have double the builds going on, which should cause double the load on win8 x64 machines.
I'm going to start working on a report that would allow us to look at the past and try to determine what type of growth we could expect in the next 2 years. I hope I can get this report before I'm off for 3 weeks after June 13th. Would you like me to file a separate bug to request the 130 nodes for the 64-bit project?
Depends on: 875832
Bug 875832 indicates that we need a minimum of 139 cards to support 64-bit firefox builds. If machines with cards do not get ordered before this order is placed, then we will order those 139 cards as part of this order. If an order gets placed for more machines before this order is placed, I will comment with that order bug number here.
per mtg w/mrz today:

1) mrz has ordered 140 graphics cards to be delivered to ix. Once the IX machine order makes it through the paperwork process, ix will install those graphics cards into the machines before delivering to Mozilla.

This allows us to *know* that we will have the identical graphics cards in all the ix 4-node machines. This addresses a big concern RelEng has about the risk of video cards EOL-ing before the next order paperwork is processed.

2) RelEng will start the paperwork for the 4node ix machines in the coming days, once we finalize best-guessing how many machines we will need. If it turns out to be >140, then Armen will let mrz know so he can increase the order as/if needed.

(cc-ing some others to cover, while I'm afk for next two weeks.)
OS: Mac OS X → All
Great, I'll close out this bug. Please open a new one if more cards are required.
Status: NEW → RESOLVED
Closed: 12 years ago
Resolution: --- → FIXED
(In reply to John O'Duinn [:joduinn] from comment #21)
> per mtg w/mrz today:
>
> 1) mrz has ordered 140 graphics cards to be delivered to ix. Once the IX
> machine order makes it through the paperwork process, ix will install those
> graphics cards into the machines before delivering to Mozilla.
>
> This allows us to *know* that we will have the identical graphics cards in
> all the ix 4-node machines. This addresses a big concern RelEng has about
> the risk of video cards EOL-ing before the next order paperwork is
> processed.
>
> 2) RelEng will start the paperwork for the 4node ix machines in the coming
> days, once we finalize best-guessing how many machines we will need. If it
> turns out to be >140, then Armen will let mrz know so he can increase the
> order as/if needed.
>
> (cc-ing some others to cover, while I'm afk for next two weeks.)

(In reply to Amy Rich [:arich] [:arr] from comment #22)
> Great, I'll close out this bug. Please open a new one if more cards are
> required.

Reopening. We need this bug open until:
* we have confirmed delivery of the above batch of graphics cards,
* and (in next few days) resolved ongoing business question about need for supporting 64-bit-builds-on-win7, which would require additional cards asap to fulfill Windows 64bit support.

Additionally, based on comment#0, comment#6, comment#10-#13, I thought this bug was also to track:
* buying replacements for failing cards (with numbers ranging from 3%-5%? if I read correctly?)
* buying additional cards for increased load / growth of current OS
...so should be kept open for that reason also.

If the replacement/future growth purchases and context about number-of-cards are being tracked elsewhere, please link me to that bug. Either way, we need to buy those replacements/future growth cards asap before they EOL.
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
I presume mrz is tracking the order of the cards through service now, not this bug. This bug is not to track ordering spares. You had Hal open a separate bug for that, and it's already been done.
I thought we were keeping this one for:
* how many nodes do we expect to buy over the next 2-3 years

I've filed bug 877172 on our side to work on getting a report to answer that.
Cards have been ordered. They'll only be "delivered" to Mozilla. They'll stay with iX and eventually be integrated into the systems they build. This particular bug, based on the summary, would appear to be resolved. I have "pre-ordered sufficient graphics cards". Up to someone else to close or keep open.
Component: Server Operations: RelEng → RelOps
Product: mozilla.org → Infrastructure & Operations
iX asked us about the cards they have in stock for us, since it's been close to a year now with no movement. Should we resolve this bug and bring those cards in-house to use as spares for the existing infra?
I don't know. With all the various comments, I've lost track of what these cards were for and how many.
This was supposed to be the bug to pre-order graphics cards for any future test nodes we were going to purchase, because the graphics cards do not stay in circulation long before they are replaced with a new model. TBH, I'm not even sure we COULD get more of these if we wanted at this point. Comment 21 says we ordered 140 of them. If we're moving more and more testing to virtualized instances, this will probably be enough to last us a while?
Yes, I hope so. We're also going to look into smarter scheduling (run tests on every other checkin and backfill when a regression is found). The API for backfilling is semi-officially open for business (zeller's work).
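To make the "test every other checkin, backfill on regression" idea concrete, here is a toy sketch. It is not the actual scheduler or the backfill API mentioned above; `schedule_with_backfill`, `run_tests`, and the push names are all hypothetical.

```python
def schedule_with_backfill(pushes, run_tests, step=2):
    """Test every `step`-th push; on a failure, backfill the skipped ones.

    `run_tests(push)` returns True when the tests pass on that push.
    Returns (culprit, runs): the first push found to fail (or None) and
    the list of pushes tested, in the order the tests were run.
    """
    runs = []
    last_good = -1  # index of the most recent tested push that passed
    for i in range(0, len(pushes), step):
        runs.append(pushes[i])
        if run_tests(pushes[i]):
            last_good = i
            continue
        # The regression landed somewhere after last_good, up to and
        # including push i: test the skipped pushes in order.
        for j in range(last_good + 1, i):
            runs.append(pushes[j])
            if not run_tests(pushes[j]):
                return pushes[j], runs
        return pushes[i], runs
    # Note: a trailing skipped push stays untested until a later push arrives.
    return None, runs


if __name__ == "__main__":
    pushes = ["a", "b", "c", "d", "e", "f"]
    culprit, runs = schedule_with_backfill(pushes, lambda p: p < "d")
    print("culprit:", culprit, "after", len(runs), "test runs")
```

The capacity saving comes from the common case where nothing regresses: roughly half the pushes are tested, and the extra backfill runs are only paid when a failure shows up.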
Okay, I'm going to R/F this, then.
Status: REOPENED → RESOLVED
Closed: 12 years ago → 11 years ago
Resolution: --- → FIXED