Closed Bug 975466 Opened 8 years ago Closed 3 months ago

[meta] Spidermonkey builder resource load

Categories

(Release Engineering :: General, defect)

Priority: Not set
Severity: normal

Tracking

(Not tracked)

RESOLVED INACTIVE

People

(Reporter: sfink, Assigned: sfink)

References

(Depends on 1 open bug)

Details

(Whiteboard: [kanban:engops:https://mozilla.kanbanize.com/ctrl_board/6/2199] )

Attachments

(1 file)

I thought I'd open up a meta bug for spidermonkey builders, to track changes that influence resource utilization.

I'm also going to take a stab at measuring the current load, and will record the findings here.
Depends on: 965447
After massaging the data a bit (throwing out everything older than 90 days, categorizing builders, etc.), I found that there were a number of jobs that took a very long time (weeks, months). So I ignored anything that took longer than 2 hours. (That discards only 0.17% of the spidermonkey builds, but 15% of the other build jobs. If I leave them in, spidermonkey takes a much smaller percentage of the time, but I'm suspicious of that.)

select b.spidermonkey, sum(r.run_time) from requests r join buildrequests br on r.id = br.id join builders b on br.buildername=b.buildername where r.run_time < 7200 and r.type=1 group by b.spidermonkey;
+--------------+-----------------+
| spidermonkey | sum(r.run_time) |
+--------------+-----------------+
|            0 |       965328186 |
|            1 |        18495002 |
+--------------+-----------------+

r.type=1 restricts this to compare to only other build jobs. So spidermonkey is 1.88% of the time spent on build jobs that take less than 2 hours. If you don't like the artificial 2 hour threshold, removing that means spidermonkey is 1.27% of the build load, 0.33% of the overall load.
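The 1.88% figure can be reproduced from the table with a couple of lines of Python (assuming run_time is in seconds, which the machine-years conversion below implies):

```python
# Sanity check on the spidermonkey share quoted above.
# Sums are run_time totals (seconds) from the query output.
spidermonkey = 18495002
other_builds = 965328186

share = spidermonkey / (spidermonkey + other_builds) * 100
print(f"{share:.2f}%")  # -> 1.88%
```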
Unrelated, but here's a dump of the run time of the various job types:

 select r.type, sum(r.run_time) from requests r join buildrequests br on r.id = br.id join builders b on br.buildername=b.buildername where r.run_time < 7200 group by r.type;

+----------+-----------------+
| type     | sum(r.run_time) |
+----------+-----------------+
| build    |       983823188 |
| test     |      3579278383 |
| talos    |       360559291 |
| valgrind |        14565336 |
| misc     |         3884796 |
| fuzzer   |        86438800 |
+----------+-----------------+

or in machine-years (this is still for the last 90 days):

+------+------------------------------+
| type | sum(r.run_time)/60/60/24/365 |
+------+------------------------------+
|    1 |          31.2                |
|    2 |         113.5                |
|    3 |          11.4                |
|    4 |           0.5                |
|    5 |           0.1                |
|    6 |           2.7                |
+------+------------------------------+

(types are in the same order as the previous table).
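The machine-years column is just the seconds sums from the previous table run through the divisor in the header; a quick sketch reproducing it:

```python
# Convert the 90-day run_time sums (seconds) from the previous table
# to machine-years: seconds / 60 / 60 / 24 / 365.
sums = {
    "build": 983823188,
    "test": 3579278383,
    "talos": 360559291,
    "valgrind": 14565336,
    "misc": 3884796,
    "fuzzer": 86438800,
}
for job_type, seconds in sums.items():
    print(f"{job_type:8s} {seconds / 60 / 60 / 24 / 365:6.1f}")
```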
I recorded most of what I did, if you'd like to spot where I screwed up.
Oh, right. Note that I am relying on buildsets.submitted_at rather than chasing through to the changes.when_timestamp. Hopefully that doesn't introduce too much inaccuracy? I had trouble getting that query to ever finish. I probably should have modified get_build_times_for_builders.py instead of doing everything in raw sql.
Steve, thank you for doing this. The data you have here is probably enough - it's nice to know that Spidermonkey is less than 2% of our load.

(In reply to Steve Fink [:sfink] from comment #1)
> After massaging the data a bit (throwing out everything older than 90 days,
> categorizing builders, etc.), I found that there are a number of jobs that
> took a very long time (weeks, months). So I ignored anything that took
> longer than 2 hours.

You probably want to throw out things that take longer than 6 hours. We have plenty of jobs that take more than 2, and I know of some that take over 4. 6 hours should get you all normal jobs, and still throw away the crazy ones (which are almost always not real).

(In reply to Steve Fink [:sfink] from comment #4)
> Oh, right. Note that I am relying on buildsets.submitted_at rather than
> chasing through to the changes.when_timestamp. Hopefully that doesn't
> introduce too much inaccuracy? I had trouble getting that query to ever
> finish. I probably should have modified get_build_times_for_builders.py
> instead of doing everything in raw sql.

I actually just realized that you probably don't want either of these. They're useful for measuring turnaround times, as they represent push time, but if you just care about job run times you should use builds.finish_time-builds.start_time. It's tricky though, because at some point you have to account for coalesced builds due to peak load. My script (https://hg.mozilla.org/build/braindump/file/c10104f5f52c/buildbot-related/get_build_times_for_builders.py) knows how to do that.
(In reply to Ben Hearsum [:bhearsum] from comment #5)
> (In reply to Steve Fink [:sfink] from comment #4)
> > Oh, right. Note that I am relying on buildsets.submitted_at rather than
> > chasing through to the changes.when_timestamp. Hopefully that doesn't
> > introduce too much inaccuracy? I had trouble getting that query to ever
> > finish. I probably should have modified get_build_times_for_builders.py
> > instead of doing everything in raw sql.
> 
> I actually just realized that you probably don't want either of these.
> They're useful for measuring turnaround times, as they represent push time,
> but if you just care about job run times you should use
> builds.finish_time-builds.start_time. It's tricky though, because at some
> point you have to account for coalesced builds due to peak load. My script
> (https://hg.mozilla.org/build/braindump/file/c10104f5f52c/buildbot-related/
> get_build_times_for_builders.py) knows how to do that.

Whoops, sorry. Although I do use buildsets.submitted_at, that's only for computing wait_time, and I'm not using that for the above data. So those results are already using what you suggest, finish_time-start_time. But without handling coalescing.

Here are the results for build-only jobs less than 6 hours:

+--------------+-----------------+
| spidermonkey | sum(r.run_time) |
+--------------+-----------------+
|            0 |      1425141881 |
|            1 |        18596579 |
+--------------+-----------------+

which comes to 1.29%.
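Same sanity check as before, now with the 6-hour cutoff sums from the table above:

```python
# Spidermonkey share of build jobs under the 6-hour cutoff (seconds).
spidermonkey = 18596579
other_builds = 1425141881

share = spidermonkey / (spidermonkey + other_builds) * 100
print(f"{share:.2f}%")  # -> 1.29%
```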

(...and in related news, it looks like I've accidentally added several spidermonkey try jobs recently that I need to kill off...)
Depends on: 1009763
Whiteboard: [kanban:engops:https://mozilla.kanbanize.com/ctrl_board/6/2184]
Whiteboard: [kanban:engops:https://mozilla.kanbanize.com/ctrl_board/6/2184] → [kanban:engops:https://mozilla.kanbanize.com/ctrl_board/6/2195]
Whiteboard: [kanban:engops:https://mozilla.kanbanize.com/ctrl_board/6/2195] → [kanban:engops:https://mozilla.kanbanize.com/ctrl_board/6/2199]
Blocks: 1107807
Depends on: 1247630
Depends on: 1250709
Depends on: 1273639
Depends on: 1273917
Depends on: 1274060
Depends on: 1275775
Steve, is there anything left to do here?
Flags: needinfo?(sphink)
(In reply to Chris AtLee [:catlee] from comment #7)
> Steve, is there anything left to do here?

No. This is a meta bug. I've been using it to gather bugs that add more spidermonkey builds (or change their cost substantially), just from a vague sense of wanting to keep a handle on things and to notify you people when I'm running up your AWS bill. But the actual content of this bug comes from buildbot-only days, and hopefully there are now better ways to estimate the load imposed by a set of jobs (garndt has something for that, I believe?).

So if this is being tracked in some other way now, then there is indeed no reason for this bug to exist.
Flags: needinfo?(sphink)
Depends on: 1413721
Assignee: nobody → sphink
Component: General Automation → General
Depends on: 1509992
Depends on: 1508738

This bug isn't really being used anymore.

Status: NEW → RESOLVED
Closed: 3 months ago
Resolution: --- → INACTIVE