Bug 1431291
Opened 7 years ago
Closed 7 years ago
Hyperchunking of reftests on instances is inefficient, wastes a lot of money on GPU instances
Categories
(Firefox Build System :: Task Configuration, task)
Tracking
(Not tracked)
RESOLVED
FIXED
People
(Reporter: gps, Unassigned)
References
(Blocks 1 open bug)
Details
Bug 1373578 and bug 1396260 both increased the number of chunks being executed for various reftest configurations to 32 because doing so made some intermittent failures go away.
The price we paid for working around the problem instead of fixing the underlying cause is that we are now spending a lot more money on GPU-enabled AWS instances to run these tasks. That's because the efficiency of reftest tasks on workers is... pretty bad. Looking at logs, we spend most of the task in setup overhead! In https://public-artifacts.taskcluster.net/HBB4iBUCRviEJ19_bw5DZA/0/public/logs/live_backing.log, we start at 00:20:12, run Firefox at 00:24:24, and finish at 00:25:15. So of the ~303s the task ran, 252s was spent getting the task ready to run. That's ~83%! This in and of itself is a major problem and someone should look into it.
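For anyone who wants to repeat the overhead calculation on another log, here is a minimal sketch. It uses only the three timestamps quoted above; the helper name `seconds_between` is made up for illustration, and it assumes all timestamps fall within the same day.

```python
from datetime import datetime

def seconds_between(start: str, end: str) -> int:
    """Return the number of seconds between two HH:MM:SS log timestamps."""
    fmt = "%H:%M:%S"
    delta = datetime.strptime(end, fmt) - datetime.strptime(start, fmt)
    return int(delta.total_seconds())

# Timestamps quoted from the task log above.
task_start = "00:20:12"
firefox_start = "00:24:24"
task_end = "00:25:15"

setup_s = seconds_between(task_start, firefox_start)  # time before Firefox runs
total_s = seconds_between(task_start, task_end)       # whole task duration
overhead_pct = 100 * setup_s / total_s

print(f"setup={setup_s}s total={total_s}s overhead={overhead_pct:.0f}%")
# prints: setup=252s total=303s overhead=83%
```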
But the issue I want to get on people's radar is the cost we're incurring for this overhead and specifically how much "hyperchunking" the reftests made it worse.
Here is the daily spend on g2.2xlarge instances in TaskCluster:
Day Usage (hours) Cost (USD)
---------- --------------- ------
2017-09-01 12256 3300
2017-09-02 12406 3199
2017-09-03 6775 1495
2017-09-04 6623 1402
2017-09-05 10809 2622
2017-09-06 13401 3063
2017-09-07 13063 3137
2017-09-08 10266 2610
2017-09-09 14407 3595
2017-09-10 15657 3281
2017-09-11 4772 1077
2017-09-12 8987 2526
2017-09-13 11612 3212
2017-09-14 12016 3613
2017-09-15 12347 3118
2017-09-16 9071 2846
2017-09-17 6832 1408
2017-09-18 4930 906
2017-09-19 10415 2697
2017-09-20 12220 2747
2017-09-21 14803 3258
2017-09-22 14464 2790
2017-09-23 9097 1674
2017-09-24 6188 935
2017-09-25 5156 907
2017-09-26 10391 2034
2017-09-27 10614 2104
2017-09-28 12301 2815
2017-09-29 17054 4275
2017-09-30 16617 3370
2017-10-01 5132 759
2017-10-02 7832 1279
2017-10-03 7368 1320
2017-10-04 6272 1303
2017-10-05 8277 1818
2017-10-06 10323 2541
2017-10-07 8765 1765
2017-10-08 7378 1352
2017-10-09 6208 1024
2017-10-10 9177 1777
2017-10-11 10465 2249
2017-10-12 11516 2563
2017-10-13 14838 3210
2017-10-14 16784 3340
2017-10-15 12828 2275
2017-10-16 10961 1743
2017-10-17 12431 2254
2017-10-18 16432 4216
2017-10-19 16282 3472
2017-10-20 16787 3642
2017-10-21 14572 3641
2017-10-22 11023 2211
2017-10-23 5608 982
2017-10-24 14330 3575
2017-10-25 16589 5164
2017-10-26 17270 5036
2017-10-27 18361 4699
2017-10-28 16364 4434
2017-10-29 14899 3599
2017-10-30 10520 2594
2017-10-31 14609 4651
2017-11-01 13811 4970
2017-11-02 14584 4911
2017-11-03 16756 4214
2017-11-04 15333 4325
2017-11-05 13663 2812
2017-11-06 9664 1741
2017-11-07 14327 3836
2017-11-08 16126 3431
2017-11-09 16142 3852
2017-11-10 16360 3828
2017-11-11 16588 3874
2017-11-12 11578 2058
2017-11-13 8300 1743
2017-11-14 14164 3948
2017-11-15 14399 3732
2017-11-16 16288 4443
2017-11-17 14440 3768
2017-11-18 15586 4067
2017-11-19 9768 2642
2017-11-20 9701 2181
2017-11-21 13579 3126
2017-11-22 14647 4526
2017-11-23 16110 4127
2017-11-24 14830 3488
2017-11-25 14824 3258
2017-11-26 8791 1803
2017-11-27 6071 1189
2017-11-28 12990 3494
2017-11-29 15488 3655
2017-11-30 12644 2789
2017-12-01 17380 3881
If you plot this, you see an uptick in usage and cost in the middle of October. It's even clearer when you look at the 14-day moving average. That's when https://hg.mozilla.org/mozilla-central/rev/f68eef8bbd21 landed.
The 14-day moving average cost increased from ~$2,500/day to ~$3,300/day. That's ~$800/day more, or ~$290,000/year. In other words, we could justify >1 FTE to work on this problem and only this problem for a full year.
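A rough sketch of the two calculations mentioned above: a trailing 14-day moving average and the annualized cost difference. The function names are invented for illustration, and `sample_costs` holds only the first 14 cost values from the daily table (2017-09-01 through 2017-09-14), so it yields a single averaged point rather than the full trend.

```python
def trailing_moving_average(values, window=14):
    """Return trailing averages; each entry covers the last `window` values."""
    if window <= 0:
        raise ValueError("window must be positive")
    averages = []
    running = 0.0
    for i, value in enumerate(values):
        running += value
        if i >= window:
            # Drop the value that just fell out of the window.
            running -= values[i - window]
        if i >= window - 1:
            averages.append(running / window)
    return averages

def annualized_delta(before_per_day, after_per_day, days=365):
    """Extra yearly spend implied by a change in average daily cost."""
    return (after_per_day - before_per_day) * days

# Cost column for 2017-09-01 .. 2017-09-14, copied from the daily table.
sample_costs = [3300, 3199, 1495, 1402, 2622, 3063, 3137,
                2610, 3595, 3281, 1077, 2526, 3212, 3613]

print(trailing_moving_average(sample_costs))
print(annualized_delta(2500, 3300))  # prints: 292000
```

Feeding the full cost column into `trailing_moving_average` gives the smoothed series referred to above.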
You can see this in the monthly spend numbers (keep in mind all hands, holidays, and shutdown skewing numbers for December):
Month Usage (hours) Cost (USD)
---------- --------------- ------
2017-02-01 629 408
2017-03-01 7648 1426
2017-04-01 19987 4124
2017-05-01 188373 59886
2017-06-01 157841 44803
2017-07-01 174782 49773
2017-08-01 253260 68712
2017-09-01 268130 63487
2017-10-01 318426 73488
2017-11-01 378893 88710
2017-12-01 411135 100758
2018-01-01 310800 93546
2018-01-17 150465 43313
What the big jump in July/August was for, I don't know. But someone should probably look into that too!
Normally test machines are cheap. But the GPU-enabled ones aren't. So we need to be more careful with their utilization or we can spend a lot of money very quickly.
The changes to increase the number of chunks for these GPU-enabled tests exacerbated some inefficiencies in CI. Given the amount of money involved, we should either undo the hyperchunking or improve the efficiency of these tasks so we're not spending 4+ minutes getting the task ready to run.
needinfo coop so he can triage this.
Flags: needinfo?(coop)
Comment 1 • 7 years ago (Reporter)
FWIW, about half the startup overhead is extracting the zip files with test files:
00:20:14 INFO - Downloading and extracting to Z:\task_1516234807\build\tests these dirs bin/*, certs/*, config/*, mach, marionette/*, modules/*, mozbase/*, tools/*, reftest/*, jsreftest/*, mozinfo.json from https://queue.taskcluster.net/v1/task/WA4O6l0vQdeRtu1eSdzNUw/artifacts/public/build/target.common.tests.zip
00:20:14 INFO - retry: Calling fetch_url_into_memory with args: (), kwargs: {'url': u'https://queue.taskcluster.net/v1/task/WA4O6l0vQdeRtu1eSdzNUw/artifacts/public/build/target.common.tests.zip'}, attempt #1
00:20:14 INFO - Fetch https://queue.taskcluster.net/v1/task/WA4O6l0vQdeRtu1eSdzNUw/artifacts/public/build/target.common.tests.zip into memory
00:20:15 INFO - Content-Length response header: 38590587
00:20:15 INFO - Bytes received: 38590587
00:21:42 INFO - Downloading and extracting to Z:\task_1516234807\build\tests these dirs bin/*, certs/*, config/*, mach, marionette/*, modules/*, mozbase/*, tools/*, reftest/*, jsreftest/*, mozinfo.json from https://queue.taskcluster.net/v1/task/WA4O6l0vQdeRtu1eSdzNUw/artifacts/public/build/target.reftest.tests.zip
00:21:42 INFO - retry: Calling fetch_url_into_memory with args: (), kwargs: {'url': u'https://queue.taskcluster.net/v1/task/WA4O6l0vQdeRtu1eSdzNUw/artifacts/public/build/target.reftest.tests.zip'}, attempt #1
00:21:42 INFO - Fetch https://queue.taskcluster.net/v1/task/WA4O6l0vQdeRtu1eSdzNUw/artifacts/public/build/target.reftest.tests.zip into memory
00:21:44 INFO - Content-Length response header: 60572713
00:21:44 INFO - Bytes received: 60572713
00:22:54 INFO - proxxy config: {}
The long-term fix for that is "run tests from source checkouts." The closest bug we have is bug 1286900. The naive solution is to do a normal clone + checkout. But perf on Windows is abysmal. So we need infrastructure changes to the Mercurial server to support making the perf not suck. That's tracked in bug 1428470 and should land in Q2 or Q3.
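A quick sketch for estimating the extraction time from the log excerpt above. It assumes that the gap between each "Bytes received" line and the next log line is spent extracting the archive; the log does not say so explicitly, so treat the result as an estimate. The event labels are paraphrased and the helper name is made up.

```python
from datetime import datetime

def seconds_between(start: str, end: str) -> int:
    """Return the number of seconds between two HH:MM:SS log timestamps."""
    fmt = "%H:%M:%S"
    delta = datetime.strptime(end, fmt) - datetime.strptime(start, fmt)
    return int(delta.total_seconds())

# (archive, "Bytes received" timestamp, timestamp of the next log line)
# Timestamps are copied from the log excerpt; the pairing is an assumption.
archives = [
    ("target.common.tests.zip", "00:20:15", "00:21:42"),
    ("target.reftest.tests.zip", "00:21:44", "00:22:54"),
]

gaps = {name: seconds_between(received, next_line)
        for name, received, next_line in archives}
total_gap = sum(gaps.values())

for name, gap in gaps.items():
    print(f"{name}: ~{gap}s")
print(f"total: ~{total_gap}s")
```

Under that assumption the two gaps are 87s and 70s, for 157s in total.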
Comment 2 • 7 years ago (Reporter)
Also, we knew the zip files were super slow on test instances. That's what initially led us down the "run tests from source checkouts" path a while back. But our assumption - and a reason that work fell off the priority list - was testers are cheap, so the slowdown - while annoying - didn't seem to have much of an impact. I recall discussion with lmandel and/or jgriffin where we thought we could overcome the inefficiencies by using more chunks. If tests only cost pennies an hour, throwing money at the problem as a stop-gap is viable. With expensive GPU-enabled test workers, the calculus changes and the strategy of throwing money at the problem backfires in a big way :/
Comment 3 • 7 years ago (Reporter)
I should also add that running tests from source checkouts is the key that unlocks a lot of other wins. For example, these tasks spend a few dozen seconds installing Python packages. I'm almost certain this involves some network activity. We're definitely invoking setup.py, which involves new process overhead (expensive on Windows). And we're likely copying a number of files around. If we run from a source checkout, we can leverage sys.path hacks to reference in-repo Python packages and most of the Python overhead will go away. We do this in `mach` and the Firefox build system to cut down on overhead, for example.
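To illustrate the kind of sys.path hack meant here, a minimal sketch follows. It is not the code `mach` actually uses; the function name and the directory names in the usage example are hypothetical. The idea is that packages living in the checkout are made importable directly, with no `setup.py` invocation, no network access, and no file copying.

```python
import sys
from pathlib import Path

def add_in_repo_packages(repo_root, package_dirs):
    """Prepend existing in-repo package directories to sys.path.

    Returns the list of paths that were actually added. Directories that
    do not exist or are already on sys.path are skipped.
    """
    added = []
    for rel in package_dirs:
        path = str((Path(repo_root) / rel).resolve())
        if Path(path).is_dir() and path not in sys.path:
            sys.path.insert(0, path)
            added.append(path)
    return added

# Hypothetical usage: the checkout location and package directories below
# are placeholders, not real paths from the Firefox tree.
added = add_in_repo_packages("/path/to/checkout",
                             ["hypothetical/pkg_a", "hypothetical/pkg_b"])
print(f"added {len(added)} in-repo package directories to sys.path")
```

After this runs, a plain `import` resolves against the checkout, which is what removes the per-task install step.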
Comment 4 • 7 years ago
Undoing the increase in chunking without doing anything else would lead to even more savings, because if they go back to the way they were, sheriffs will demote them to tier-3 since they didn't even vaguely qualify to be visible in the default treeherder view the way they were. I'm told that doing so would then mean that we would need to close all trees, that we cannot have open trees without Windows reftests. Major savings!
Comment 5 • 7 years ago
I have considered going to 64 chunks instead of 32. We have so many intermittents on Windows 7 reftests specifically, for 2 main reasons:
* unable to keep up and draw at the speed we run reftests (no development resources to fix Firefox in low memory environments)
* the OS theme is inconsistent (in 1.5% of jobs we fail to set the theme properly, and the chance of intermittent failure is high)
This topic of money has never been a concern; if it is, we should be focusing on a lot of other things in addition to this.
Comment 6 • 7 years ago (Reporter)
(In reply to Joel Maher ( :jmaher) (UTC-5) from comment #5)
> I have considered going to 64 chunks instead of 32-
Hyperchunking has been a long-term goal as a key pillar of significantly reducing end-to-end times (bug 1262834). So I support efforts to move in this direction.
> we have so many
> intermittents on windows 7 reftests specifically for 2 main reasons:
> * unable to keep up and draw at the speed we run reftests (no development
> resources to fix firefox in low memory environments)
> * the OS theme is inconsistent (1.5% of jobs we fail to set the theme
> properly and the chance of intermittent failure is high)
However, hyperchunking as a workaround (to avoid intermittent failures or reduce failure rate)... isn't great.
Correct me if I'm wrong, but the rough framework for our "make end-to-end times as fast as possible" project was to have each test chunk complete in <5 minutes. Our target was to get the builds down to 15-20 minutes (ideally less). If tests completed in ~5 minutes, end-to-end times would be <30 minutes and we could focus energy on optimizing builds to further reduce end-to-end times.
I bring this up because the GPU tasks are already completing in the ~5 minute range. And ~4 minutes of that is task startup overhead. Using more chunks will further decrease efficiency and blow up costs. And it yields little to no end-user wins from an end-to-end time perspective (it does help with intermittents though).
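To make the efficiency argument concrete, here is a back-of-the-envelope model. It is only a sketch: it assumes every chunk pays the same fixed 252s of startup overhead and that the actual test work (taken as 51s per chunk at 32 chunks, from the single log cited in the description) divides evenly across chunks. Real suites will not be this uniform, and the model says nothing about intermittent failure rates.

```python
def chunk_cost_model(total_test_seconds, overhead_seconds, chunks):
    """Return (wall seconds per chunk, total machine seconds, efficiency).

    Efficiency is the fraction of each chunk's runtime spent running tests.
    """
    per_chunk_test = total_test_seconds / chunks
    wall = overhead_seconds + per_chunk_test
    machine = chunks * wall
    efficiency = per_chunk_test / wall
    return wall, machine, efficiency

OVERHEAD_S = 252               # setup time from the cited log
TEST_S_PER_CHUNK_AT_32 = 51    # 303s total minus 252s setup
TOTAL_TEST_S = 32 * TEST_S_PER_CHUNK_AT_32  # assumed total test work

for chunks in (16, 32, 64):
    wall, machine, efficiency = chunk_cost_model(TOTAL_TEST_S, OVERHEAD_S, chunks)
    print(f"{chunks:>2} chunks: {wall:6.1f}s per chunk, "
          f"{machine:7.0f} machine-seconds, {efficiency:.0%} efficient")
```

In this model, going from 32 to 64 chunks shortens each chunk by only about 25s while total machine time goes from 9,696 to 17,760 seconds, since the fixed overhead is paid twice as often.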
> This topic of money has never been a concern, if it is we should be focusing
> on a lot of other things in addition to this.
It's not really been a concern for test machines because test machines historically didn't cost a lot. Many of our test instances cost <$0.02/hour. We can run >1,000 test workers for what it costs to employ 1 person. It's easy to justify that cost.
What changed in 2017 is that the g2.2xlarge instances entered the scene and changed the calculus for cost of test workers. They went from ~$0 to $50-60k/month practically overnight. Now we're flirting with $100k/month. In contrast, we run ~4x more m3.large and m3.xlarge instance-hours for ~50% the cost of the g2.2xlarge. The g2's are ~8x more expensive. We're now spending about as much on them as we are on build workers.
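The ~8x figure follows from the two ratios in the previous paragraph, and the monthly table gives a rough effective hourly rate. A small sketch of both calculations, assuming the Cost column is in USD and the monthly table covers the g2.2xlarge instances; the function names are made up for illustration.

```python
def cost_per_hour(hours, cost):
    """Effective cost per instance-hour."""
    return cost / hours

def relative_hourly_cost(hours_ratio, cost_ratio):
    """Per-hour cost of fleet B relative to fleet A.

    Fleet A runs `hours_ratio` times the instance-hours of fleet B
    for `cost_ratio` times the total cost of fleet B.
    """
    return hours_ratio / cost_ratio

# (usage hours, cost) rows copied from the monthly table above.
monthly = {
    "2017-09": (268130, 63487),
    "2017-10": (318426, 73488),
    "2017-11": (378893, 88710),
}

for month, (hours, cost) in monthly.items():
    print(f"{month}: ~${cost_per_hour(hours, cost):.2f}/hour")

# m3 instances: ~4x the instance-hours for ~50% of the g2.2xlarge cost.
print(relative_hourly_cost(4, 0.5))  # prints: 8.0
```

The effective rate comes out at roughly $0.23 to $0.24 per hour for these months, compared with the <$0.02/hour quoted above for many other test instances.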
We have historically been concerned with cost of operating the build workers. Those are substantially more expensive instances. We've been careful to not be wasteful with over-provisioning those instances because they can easily contribute to runaway costs (like the g2's have). FWIW bug 1430878 tracks provisioning >30 vCPU count instances. Look for those to make an entrance soon...
I don't have the full context of the cause of the intermittent reftest failures. But it seems to me that identifying and fixing the root cause is a sound investment. Even if we keep splitting up test chunks, the issue will still be there. From my perspective, chunking the tests feels like a very expensive way of sweeping dirt under the rug.
Comment 7 • 7 years ago
As I mentioned in the Developer Workflow mtg today, I chatted about this with jmaher in the TC migration mtg yesterday. Developer resources are required to fix the underlying tests. I'm happy to drive that request up the management chain to try to make it happen.
jmaher: do you have a short-list of the dev teams we need to target based on the tests that are failing?
Flags: needinfo?(coop) → needinfo?(jmaher)
Comment 8 • 7 years ago
This is a Windows 7 reftest issue. I would start with :jet and :milan.
Flags: needinfo?(jmaher)
Comment 9 • 7 years ago
Reftest run-by-manifest is close to landing, which should hopefully let us reduce the chunks again.
Depends on: 1353461
Updated • 7 years ago
Product: TaskCluster → Firefox Build System
Comment 10 • 7 years ago
We run similar chunks on all configs now, as of bug 1449587.
Status: NEW → RESOLVED
Closed: 7 years ago
Resolution: --- → FIXED