Closed Bug 1431291 Opened 7 years ago Closed 6 years ago

Hyperchunking of reftests on instances is inefficient, wastes a lot of money on GPU instances

Categories

(Firefox Build System :: Task Configuration, task)

Type: task
Priority: Not set
Severity: normal

Tracking

(Not tracked)

Status: RESOLVED FIXED

People

(Reporter: gps, Unassigned)

References

(Blocks 1 open bug)

Details

Bug 1373578 and bug 1396260 both increased the number of chunks being executed for various reftest configurations to 32 because it made some intermittent failures go away.

The price we paid for working around the problem instead of fixing the underlying problem is that we are now spending a lot more money on GPU-enabled AWS instances to run these tasks. And that's because the efficiency of reftest tasks on workers is... pretty bad. Looking at logs, we spend most of the task in setup overhead! In https://public-artifacts.taskcluster.net/HBB4iBUCRviEJ19_bw5DZA/0/public/logs/live_backing.log, we start at 00:20:12, run Firefox at 00:24:24, and finish at 00:25:15. So of the ~303s the task ran, 252s were spent getting the task ready to run. That's ~83%! This in and of itself is a major problem and someone should look into it.
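To make that arithmetic explicit, here is a minimal sketch (plain Python, timestamps hard-coded from the log excerpt above) of how the overhead percentage falls out:

from datetime import datetime

# Timestamps copied from the live_backing.log excerpt linked above.
fmt = "%H:%M:%S"
task_start = datetime.strptime("00:20:12", fmt)
firefox_start = datetime.strptime("00:24:24", fmt)
task_end = datetime.strptime("00:25:15", fmt)

setup = (firefox_start - task_start).total_seconds()  # 252s of setup
total = (task_end - task_start).total_seconds()        # 303s of total wall time

print(f"setup: {setup:.0f}s of {total:.0f}s ({setup / total:.0%})")
# -> setup: 252s of 303s (83%)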

But the issue I want to get on people's radar is the cost we're incurring for this overhead and specifically how much "hyperchunking" the reftests made it worse.

Here is the daily spend on g2.2xlarge instances in TaskCluster:

Day          Usage (hours)    Cost (USD)
----------  ---------------  ----------
2017-09-01            12256    3300
2017-09-02            12406    3199
2017-09-03             6775    1495
2017-09-04             6623    1402
2017-09-05            10809    2622
2017-09-06            13401    3063
2017-09-07            13063    3137
2017-09-08            10266    2610
2017-09-09            14407    3595
2017-09-10            15657    3281
2017-09-11             4772    1077
2017-09-12             8987    2526
2017-09-13            11612    3212
2017-09-14            12016    3613
2017-09-15            12347    3118
2017-09-16             9071    2846
2017-09-17             6832    1408
2017-09-18             4930     906
2017-09-19            10415    2697
2017-09-20            12220    2747
2017-09-21            14803    3258
2017-09-22            14464    2790
2017-09-23             9097    1674
2017-09-24             6188     935
2017-09-25             5156     907
2017-09-26            10391    2034
2017-09-27            10614    2104
2017-09-28            12301    2815
2017-09-29            17054    4275
2017-09-30            16617    3370
2017-10-01             5132     759
2017-10-02             7832    1279
2017-10-03             7368    1320
2017-10-04             6272    1303
2017-10-05             8277    1818
2017-10-06            10323    2541
2017-10-07             8765    1765
2017-10-08             7378    1352
2017-10-09             6208    1024
2017-10-10             9177    1777
2017-10-11            10465    2249
2017-10-12            11516    2563
2017-10-13            14838    3210
2017-10-14            16784    3340
2017-10-15            12828    2275
2017-10-16            10961    1743
2017-10-17            12431    2254
2017-10-18            16432    4216
2017-10-19            16282    3472
2017-10-20            16787    3642
2017-10-21            14572    3641
2017-10-22            11023    2211
2017-10-23             5608     982
2017-10-24            14330    3575
2017-10-25            16589    5164
2017-10-26            17270    5036
2017-10-27            18361    4699
2017-10-28            16364    4434
2017-10-29            14899    3599
2017-10-30            10520    2594
2017-10-31            14609    4651
2017-11-01            13811    4970
2017-11-02            14584    4911
2017-11-03            16756    4214
2017-11-04            15333    4325
2017-11-05            13663    2812
2017-11-06             9664    1741
2017-11-07            14327    3836
2017-11-08            16126    3431
2017-11-09            16142    3852
2017-11-10            16360    3828
2017-11-11            16588    3874
2017-11-12            11578    2058
2017-11-13             8300    1743
2017-11-14            14164    3948
2017-11-15            14399    3732
2017-11-16            16288    4443
2017-11-17            14440    3768
2017-11-18            15586    4067
2017-11-19             9768    2642
2017-11-20             9701    2181
2017-11-21            13579    3126
2017-11-22            14647    4526
2017-11-23            16110    4127
2017-11-24            14830    3488
2017-11-25            14824    3258
2017-11-26             8791    1803
2017-11-27             6071    1189
2017-11-28            12990    3494
2017-11-29            15488    3655
2017-11-30            12644    2789
2017-12-01            17380    3881

If you plot this, you see an uptick in usage and cost in the middle of October. It's even clearer when you look at the 14-day moving average. That's when https://hg.mozilla.org/mozilla-central/rev/f68eef8bbd21 landed.

The 14-day moving average cost increased from ~$2,500/day to ~$3,300/day. That's an extra ~$800/day, or ~$290,000/year. In other words, we could justify >1 FTE to work on this problem and only this problem for a full year.
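For anyone who wants to reproduce that trend line, here is a minimal sketch of the 14-day moving average, assuming the daily table above has been saved to a whitespace-separated file named g2_daily.txt (a hypothetical filename) with the header rows stripped:

from collections import deque

# Hypothetical input: the daily table above as "day usage cost" columns.
rows = []
with open("g2_daily.txt") as fh:
    for line in fh:
        day, usage, cost = line.split()
        rows.append((day, int(usage), int(cost)))

window = deque(maxlen=14)  # sliding 14-day window over the cost column
for day, usage, cost in rows:
    window.append(cost)
    avg = sum(window) / len(window)
    print(f"{day}  cost={cost:>5}  14d-avg={avg:8.1f}")

# The annualized figure quoted above: ($3,300 - $2,500) per day * 365 days.
print("annualized increase:", (3300 - 2500) * 365)  # ~$292,000/year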

You can see this in the monthly spend numbers (keep in mind that the all-hands, holidays, and shutdown skew the numbers for December):

Month         Usage (hours)    Cost (USD)
----------  ---------------  ----------
2017-02-01              629     408
2017-03-01             7648    1426
2017-04-01            19987    4124
2017-05-01           188373   59886
2017-06-01           157841   44803
2017-07-01           174782   49773
2017-08-01           253260   68712
2017-09-01           268130   63487
2017-10-01           318426   73488
2017-11-01           378893   88710
2017-12-01           411135  100758
2018-01-01           310800   93546
2018-01-17           150465   43313

What the big jump in July/August was for, I don't know. But someone should probably look into that too!

Normally test machines are cheap. But the GPU-enabled ones aren't. So we need to be more careful with their utilization or we can spend a lot of money very quickly.

The changes to increase the number of chunks for these GPU-enabled tests exacerbated some inefficiencies in CI. Given the amount of money involved, we should either undo the hyperchunking or improve the efficiency of these tasks so we're not spending 4+ minutes getting the task ready to run.

needinfo coop so he can triage this.
Flags: needinfo?(coop)
FWIW, about half the startup overhead is extracting the zip files containing the test files:

00:20:14     INFO - Downloading and extracting to Z:\task_1516234807\build\tests these dirs bin/*, certs/*, config/*, mach, marionette/*, modules/*, mozbase/*, tools/*, reftest/*, jsreftest/*, mozinfo.json from https://queue.taskcluster.net/v1/task/WA4O6l0vQdeRtu1eSdzNUw/artifacts/public/build/target.common.tests.zip
00:20:14     INFO - retry: Calling fetch_url_into_memory with args: (), kwargs: {'url': u'https://queue.taskcluster.net/v1/task/WA4O6l0vQdeRtu1eSdzNUw/artifacts/public/build/target.common.tests.zip'}, attempt #1
00:20:14     INFO - Fetch https://queue.taskcluster.net/v1/task/WA4O6l0vQdeRtu1eSdzNUw/artifacts/public/build/target.common.tests.zip into memory
00:20:15     INFO - Content-Length response header: 38590587
00:20:15     INFO - Bytes received: 38590587
00:21:42     INFO - Downloading and extracting to Z:\task_1516234807\build\tests these dirs bin/*, certs/*, config/*, mach, marionette/*, modules/*, mozbase/*, tools/*, reftest/*, jsreftest/*, mozinfo.json from https://queue.taskcluster.net/v1/task/WA4O6l0vQdeRtu1eSdzNUw/artifacts/public/build/target.reftest.tests.zip
00:21:42     INFO - retry: Calling fetch_url_into_memory with args: (), kwargs: {'url': u'https://queue.taskcluster.net/v1/task/WA4O6l0vQdeRtu1eSdzNUw/artifacts/public/build/target.reftest.tests.zip'}, attempt #1
00:21:42     INFO - Fetch https://queue.taskcluster.net/v1/task/WA4O6l0vQdeRtu1eSdzNUw/artifacts/public/build/target.reftest.tests.zip into memory
00:21:44     INFO - Content-Length response header: 60572713
00:21:44     INFO - Bytes received: 60572713
00:22:54     INFO - proxxy config: {}

The long-term fix for that is "run tests from source checkouts." The closest bug we have is bug 1286900. The naive solution is to do a normal clone + checkout, but perf on Windows is abysmal, so we need infrastructure changes to the Mercurial server to make the perf not suck. That's tracked in bug 1428470 and should land in Q2 or Q3.
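For reference, the pattern the log shows (fetch the whole archive into memory, then extract only the requested directories) looks roughly like the following sketch. This is an illustration in plain Python, not the actual mozharness code; the URL and patterns are copied from the log excerpt, and that particular artifact has almost certainly expired by now.

import fnmatch
import io
import urllib.request
import zipfile

# URL and directory patterns copied from the log excerpt above.
URL = ("https://queue.taskcluster.net/v1/task/WA4O6l0vQdeRtu1eSdzNUw/"
       "artifacts/public/build/target.common.tests.zip")
PATTERNS = ["bin/*", "certs/*", "config/*", "mach", "marionette/*",
            "modules/*", "mozbase/*", "tools/*", "reftest/*",
            "jsreftest/*", "mozinfo.json"]

def fetch_and_extract(url, patterns, dest):
    # Fetch the whole archive into memory, mirroring fetch_url_into_memory.
    with urllib.request.urlopen(url) as resp:
        payload = io.BytesIO(resp.read())

    # Extract only the members matching the requested patterns.
    with zipfile.ZipFile(payload) as zf:
        members = [m for m in zf.namelist()
                   if any(fnmatch.fnmatch(m, p) for p in patterns)]
        zf.extractall(dest, members=members)

fetch_and_extract(URL, PATTERNS, "build/tests")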
Also, we knew the zip files were super slow on test instances. That's what initially led us down the "run tests from source checkouts" path a while back. But our assumption - and a reason that work fell off the priority list - was that testers are cheap, so the slowdown, while annoying, didn't seem to have much of an impact. I recall a discussion with lmandel and/or jgriffin where we thought we could overcome the inefficiencies by using more chunks. If tests only cost pennies an hour, throwing money at the problem as a stop-gap is viable. With expensive GPU-enabled test workers, the calculus changes and the strategy of throwing money at the problem backfires in a big way :/
I should also add that running tests from source checkouts is the key that unlocks a lot of other wins. For example, these tasks spend a few dozen seconds installing Python packages. I'm almost certain this involves some network activity. We're definitely invoking setup.py, which involves new process overhead (expensive on Windows). And we're likely copying a number of files around. If we run from a source checkout, we can leverage sys.path hacks to reference in-repo Python packages and most of the Python overhead will go away. We do this in `mach` and the Firefox build system to cut down on overhead, for example.
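As a rough illustration of the sys.path approach (a minimal sketch with a hypothetical in-repo package layout, not the actual mach bootstrap code):

import os
import sys

# Hypothetical layout: Python packages vendored in the source checkout,
# e.g. under third_party/python/ and testing/mozbase/.
TOPSRCDIR = os.path.dirname(os.path.abspath(__file__))
IN_REPO_PACKAGES = [
    "third_party/python/requests",
    "testing/mozbase/mozinfo",
    "testing/mozbase/mozlog",
]

# Prepend the in-repo package directories so imports resolve from the
# checkout itself: no pip/setup.py invocation, no network, no file copies.
for relpath in IN_REPO_PACKAGES:
    sys.path.insert(0, os.path.join(TOPSRCDIR, relpath))

import mozinfo  # now imported straight from the source checkout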
Undoing the increase in chunking without doing anything else would lead to even more savings: if these jobs go back to the way they were, sheriffs will demote them to tier 3, since they didn't even vaguely qualify to be visible in the default Treeherder view back then. I'm told that doing so would mean we would need to close all trees, since we cannot have open trees without Windows reftests. Major savings!
I have considered going to 64 chunks instead of 32 - we have so many intermittents on Windows 7 reftests, specifically for 2 main reasons:
* unable to keep up and draw at the speed we run reftests (no development resources to fix Firefox in low-memory environments)
* the OS theme is inconsistent (in 1.5% of jobs we fail to set the theme properly, and the chance of intermittent failure is then high)

This topic of money has never been a concern; if it is, we should be focusing on a lot of other things in addition to this.
Depends on: 1431467
(In reply to Joel Maher ( :jmaher) (UTC-5) from comment #5)
> I have considered going to 64 chunks instead of 32-

Hyperchunking has been a long-term goal as a key pillar of significantly reducing end-to-end times (bug 1262834). So I support efforts to move in this direction.

> we have so many
> intermittents on windows 7 reftests specifically for 2 main reasons:
> * unable to keep up and draw at the speed we run reftests (no development
> resources to fix firefox in low memory environments)
> * the OS theme is inconsistent (1.5% of jobs we fail to set the theme
> properly and the chance of intermittent failure is high)

However, hyperchunking as a workaround (to avoid intermittent failures or reduce failure rate)... isn't great.

Correct me if I'm wrong, but the rough framework for our "make end-to-end times as fast as possible" project was to have each test chunk complete in <5 minutes. Our target was to get the builds down to 15-20 minutes (ideally less). If tests completed in ~5 minutes, end-to-end times would be <30 minutes and we could focus energy on optimizing builds to further reduce end-to-end times.

I bring this up because the GPU tasks are already completing in the ~5 minute range. And ~4 minutes of that is task startup overhead. Using more chunks will further decrease efficiency and blow up costs. And it yields little to no end-user wins from an end-to-end time perspective (it does help with intermittents, though).

> This topic of money has never been a concern, if it is we should be focusing
> on a lot of other things in addition to this.

It's not really been a concern for test machines because test machines historically didn't cost a lot. Many of our test instances cost <$0.02/hour. We can run >1,000 test workers for what it costs to employ 1 person. It's easy to justify that cost.

What changed in 2017 is that the g2.2xlarge instances entered the scene and changed the calculus for the cost of test workers. They went from ~$0 to $50-60k/month practically overnight. Now we're flirting with $100k/month. In contrast, we run ~4x more m3.large and m3.xlarge instance-hours for ~50% of the cost of the g2.2xlarge. That makes the g2's ~8x more expensive per instance-hour. We're now spending about as much on them as we are on build workers.
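A quick sanity check of that ratio, using the approximate relative figures quoted above:

# Approximate relative figures from above: the m3 fleet runs ~4x the
# instance-hours of the g2 fleet for ~50% of the g2 fleet's cost.
g2_hours, g2_cost = 1.0, 1.0  # normalized
m3_hours, m3_cost = 4.0, 0.5  # relative to the g2 fleet

g2_price = g2_cost / g2_hours
m3_price = m3_cost / m3_hours
print(g2_price / m3_price)  # -> 8.0, i.e. ~8x more expensive per hour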

We have historically been concerned with the cost of operating the build workers. Those are substantially more expensive instances. We've been careful not to over-provision those instances and be wasteful, because they can easily contribute to runaway costs (like the g2's have). FWIW, bug 1430878 tracks provisioning >30 vCPU count instances. Look for those to make an entrance soon...

I don't have the full context of the cause of the intermittent reftest failures. But it seems to me that identifying and fixing the root cause is a sound investment. Even if we keep splitting up test chunks, the issue will still be there. From my perspective, chunking the tests feels like a very expensive way of sweeping dirt under the rug.
As I mentioned in the Developer Workflow mtg today, I chatted about this with jmaher in the TC migration mtg yesterday. Developer resources are required to fix the underlying tests. I'm happy to drive that request up the management chain to try to make it happen.

jmaher: do you have a short-list of the dev teams we need to target based on the tests that are failing?
Flags: needinfo?(coop) → needinfo?(jmaher)
This is a Windows 7 reftest issue - I would start with :jet and :milan.
Flags: needinfo?(jmaher)
Reftest run-by-manifest is close to landing, which should hopefully let us reduce the chunks again.
Depends on: 1353461
Product: TaskCluster → Firefox Build System
We run similar chunk counts on all configs now, as of bug 1449587.
Status: NEW → RESOLVED
Closed: 6 years ago
Resolution: --- → FIXED