Closed Bug 950226 Opened 11 years ago Closed 10 years ago

Determine number of iX machines to request for 2014

Categories

(Infrastructure & Operations Graveyard :: CIDuty, task)

x86_64
Linux
task
Not set
normal

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: armenzg, Unassigned)

Details

Attachments

(2 files)

At the beginning of 2013 we had 100 machines per Windows pool, but I had to re-purpose 90 Ubuntu machines as Windows machines since we were below capacity.

At the end of 2013, we're under capacity again, and we need to determine how many machines it will take to keep up with our load growth in 2014.

This bug is to track gathering the data to make an informed request.
The back of the envelope assumptions include:
 - success in moving windows builds to AWS which frees up additional testers
 - ability to get some tests of dubious value deleted
Please chat with catlee wrt the first point.
Windows testing on AWS was not looking good the last time I asked about it. I don't know where the bugs are.
Flags: needinfo?(hwine)
Assignee: nobody → armenzg
(In reply to Hal Wine [:hwine] (use needinfo) from comment #1)
>  - success in moving windows builds to AWS which frees up additional testers

(In reply to Armen Zambrano [:armenzg] (Release Engineering) (EDT/UTC-4) (gone Thu. 12/19/2013 until 1/2/2014) from comment #2)
> Please chat with catlee wrt to the first point.
> Windows testing on AWS was not looking good the last time I asked about it.

Agreed - I meant moving windows builds to AWS, which frees up in-house hardware to be reused as windows testers. That is, we grow the test pool, wholly or in part, by freeing up builders.
Flags: needinfo?(hwine)
This bug was meant to grow our iX test pool and not our build pool.
The iX build pool cannot be used for the iX test pool since the two have completely different hardware specs.

I'm specifically looking into how many more t-w864-ix, t-w732-ix and t-xp32-ix machines we will need for 2014.

I doubt we need more talos-linux32-ix or talos-linux64-ix machines unless any new platform is moved to be run on them (e.g. b2g reftests OR any new Android emulator flavour).

Let me know if we can meet to clarify any of this.
FYI, we needed these numbers at the end of last year to budget for the current year (the company is moving towards having all budgeting for the upcoming fiscal year finished before the year starts).  Have you come to any determination?
I will look into this today. I hope to have an answer this week.
I have spent 2-3 hours thinking and loading reports to help with this and we don't have a good report to help us make this decision.

I am thinking that perhaps we should have an easy rule of thumb: if we have 5 days in a month with wait times below 95%, we start the order for the next batch, and each order is a 20% increase. We should try it this year and see how it goes. What do you think?
I would not expect this to happen more than twice a year.

Could we get budget approval for a 50% increase in 2014 and be OK if we don't use it all?

Without thinking twice, I would order another 30 machines per OS (~23% increase, 90 machines in total) right away to meet our current load. We're currently not meeting our SLA on a lot of days. Perhaps I should build a spreadsheet that applies my rule of thumb to the last few months.
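
For illustration, a minimal sketch of how that rule of thumb could be checked against historical data. It assumes we can get, for each day, the fraction of jobs that started within the wait-time target. The function names and the input format are hypothetical and not taken from any existing report.

```python
from collections import defaultdict
from datetime import date

SLA_THRESHOLD = 0.95    # a day counts as "bad" when its wait-time ratio is below this
BAD_DAYS_TRIGGER = 5    # bad days within one calendar month that trigger an order
ORDER_INCREASE = 0.20   # each order grows the pool by this fraction


def months_needing_order(daily_wait_ratio):
    """Return the (year, month) pairs that hit the trigger.

    daily_wait_ratio: dict mapping datetime.date -> fraction of jobs
    that started within the wait-time target on that day (hypothetical input).
    """
    bad_days = defaultdict(int)
    for day, ratio in daily_wait_ratio.items():
        if ratio < SLA_THRESHOLD:
            bad_days[(day.year, day.month)] += 1
    return sorted(m for m, count in bad_days.items() if count >= BAD_DAYS_TRIGGER)


def order_size(pool_size):
    """Number of machines in one 20% order for a pool of the given size."""
    return round(pool_size * ORDER_INCREASE)
```

Running months_needing_order over the last few months of wait-time data would show how often the rule would have fired, which is what the spreadsheet mentioned above would tell us.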
(In reply to Armen Zambrano [:armenzg] (Release Engineering) (EDT/UTC-4) from comment #7)
> I have spent 2-3 hours thinking and loading reports to help with this and we
> don't have a good report to help us make this decision.

Armen: I appreciate the effort here, but we're easily talking about spending $150K for just the 30 machines/OS you suggest. I think it behooves us to show our work here. If we need a hardware report, let's create a report so we don't have to forecast by hand next time. This is also something IT has been asking for for a while.
(In reply to Chris Cooper [:coop] from comment #8)
> (In reply to Armen Zambrano [:armenzg] (Release Engineering) (EDT/UTC-4)
> from comment #7)
> > I have spent 2-3 hours thinking and loading reports to help with this and we
> > don't have a good report to help us make this decision.
> 
> Armen: I appreciate the effort here, but we're easily talking about spending
> $150K for just the 30 machines/OS you suggest. I think it behooves us to
> show our work here. If we need a hardware report, let's create a report so
> we don't have to forecast by hand next time. This is also something IT has
> asking for for a while.

I will try my best, but I have too many things up in the air.

FTR, I did spend a decent amount of time reading about capacity planning, and much of that material is spent trying to prove to readers why the approach we're using is incorrect (this is from the people behind Flickr, IIRC). Using previous data to try to determine future load is wrong. I can only try to determine current needs with some degree of accuracy; trying to determine future needs is not going to work.

The proposed approach is to settle on a rule of thumb that kicks off the process of procuring more capacity. Whole-year budgeting has failed us many, many times. Facebook uses a similar approach AFAIK.
Attached image May to September
I added the jobs run on each OS for each week from May to September. I excluded Saturdays and Sundays.
Using Win7 as the baseline, from May to September we saw growth from 13,500 jobs to more than 18,000. That is 33% growth in jobs in 4 months.

We have to take into account that during that time we went from 100 machines to 130 machines. The more machines, the less coalescing, so part of that job growth comes from the larger pool rather than from more load.
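
For reference, the arithmetic behind those numbers as a small sketch. The job counts are the rounded figures quoted above (18,000 is a lower bound), and treating 100 and 130 as the Win7 pool size at the start and end of the period is an assumption.

```python
def percent_change(before, after):
    """Relative change from before to after, in percent."""
    return (after - before) / before * 100


# Rounded Win7 job counts for the first and last week, as quoted above.
jobs_growth = percent_change(13_500, 18_000)

# Pool size over the same period (assumed to apply to the Win7 pool).
pool_growth = percent_change(100, 130)

# Jobs per machine at each end of the period.
jobs_per_machine_may = 13_500 / 100
jobs_per_machine_sep = 18_000 / 130

print(f"jobs: +{jobs_growth:.1f}%  pool: +{pool_growth:.1f}%")
print(f"jobs per machine: {jobs_per_machine_may:.1f} -> {jobs_per_machine_sep:.1f}")
```

Jobs and machines grew at nearly the same rate, so the job count on its own does not tell us how much demand went unmet.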

Is this a good direction to go with?

Unfortunately, I took buildapi down while grabbing this data. I should probably set it up locally.
I managed to set up buildapi locally and load the reports.
I need a bigger dataset than what I was using.
I don't think I'm going in the right direction to determine this. I'm glad I left buildapi behind, though, so I can iterate faster.

I will be looking at determining what percentage of jobs get delayed due to the lack of Win7 machines.

mysql> select year(builds.starttime) as Year,
    ->        month(builds.starttime) as Month,
    ->        round(sum(unix_timestamp(builds.endtime)-unix_timestamp(builds.starttime))) as "Total CPU sum",
    ->        count(*) as Jobs,
    ->        round(sum(unix_timestamp(builds.endtime)-unix_timestamp(builds.starttime))) DIV count(*) as Ratio
    ->   from builds
    ->   left join (slaves, builders)
    ->     on (builds.slave_id=slaves.id and builds.builder_id=builders.id)
    ->  where builds.starttime > '2013-01-01'
    ->    and builds.endtime < '2014-01-31'
    ->    and slaves.name like "t-w732-%"
    ->  group by Year, Month
    ->  order by Year, Month;
+------+-------+---------------+--------+-------+
| Year | Month | Total CPU sum | Jobs   | Ratio |
+------+-------+---------------+--------+-------+
| 2013 |     5 |      60280903 |  46433 |  1298 |
| 2013 |     6 |     117102569 |  87328 |  1340 |
| 2013 |     7 |     132674478 |  96855 |  1369 |
| 2013 |     8 |     137688731 | 101549 |  1355 |
| 2013 |     9 |     134586619 | 102307 |  1315 |
| 2013 |    10 |     137652233 | 108120 |  1273 |
| 2013 |    11 |     137673322 | 105512 |  1304 |
| 2013 |    12 |     121862974 |  95733 |  1272 |
| 2014 |     1 |     153610571 | 118155 |  1300 |
+------+-------+---------------+--------+-------+
9 rows in set (7.63 sec)
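
One way to read the table above, as a rough sketch. "Total CPU sum" is in seconds and "Ratio" is the average job duration in seconds. Dividing the monthly total by the seconds in the month gives the number of machines that would have been busy around the clock. That "busy machines" figure is my own illustrative metric, not an existing report. The January row stops at 2014-01-31 because of the query's cutoff, and the May row looks like a partial month (an assumption based on its job count), so both are understated.

```python
import calendar

# (year, month, total_cpu_seconds, jobs) copied from the query output above.
ROWS = [
    (2013, 5, 60280903, 46433),
    (2013, 6, 117102569, 87328),
    (2013, 7, 132674478, 96855),
    (2013, 8, 137688731, 101549),
    (2013, 9, 134586619, 102307),
    (2013, 10, 137652233, 108120),
    (2013, 11, 137673322, 105512),
    (2013, 12, 121862974, 95733),
    (2014, 1, 153610571, 118155),
]


def summarize(rows):
    """Per month: average job duration and full-time machine equivalents."""
    summary = []
    for year, month, cpu_seconds, jobs in rows:
        days_in_month = calendar.monthrange(year, month)[1]
        summary.append({
            "month": f"{year}-{month:02d}",
            # Same integer division as the Ratio column (MySQL DIV).
            "avg_job_seconds": cpu_seconds // jobs,
            # Machines that would be needed if each ran jobs 24 hours a day.
            "busy_machines": cpu_seconds / (days_in_month * 86400),
        })
    return summary


for row in summarize(ROWS):
    print(f'{row["month"]}  avg job {row["avg_job_seconds"]}s  '
          f'~{row["busy_machines"]:.1f} machines busy full-time')
```

The average job duration stays close to 1300 seconds in every month, so the growth in total CPU time comes almost entirely from the job count. This says nothing about peak-hour demand or coalescing, which is where the wait times actually come from.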
I don't have time for this.
Assignee: armenzg → nobody
We got bogged down here trying to come up with a perfect estimate.

After discussing this with Amy earlier in the week, we decided that ordering 100 new iX machines would be better than nothing. Our target split is 20/20/20/40 for w7/w8/xp/linux64.

Not perfect, but we can't solve this problem with more hardware forever.

Amy: should this bug morph into the purchasing bug, or is there one already on file?
Flags: needinfo?(arich)
There is not a purchasing bug on file already that I know of. The project kickoff form will spawn those bugs, I think; I haven't done that before. Writing the justification was typically taken care of by John, and I'm not sure whether it was he or Melissa who handled the actual purchasing process previously. Laura, are we still asking IT to handle that part of the process, or are we hoping to move that bit in house? I think at this point we're also waiting on an okay from Bob, right, Laura?
Flags: needinfo?(arich) → needinfo?(laura)
We'll try to get 20 systems (4 machines/system == 80 machines). 

Amy is putting together the justification and pricing details.
Flags: needinfo?(laura)
Status: NEW → RESOLVED
Closed: 10 years ago
Resolution: --- → FIXED
Component: Platform Support → Buildduty
Product: Release Engineering → Infrastructure & Operations
Product: Infrastructure & Operations → Infrastructure & Operations Graveyard