Some test jobs intermittently run very slowly

RESOLVED FIXED in mozilla54

Status

defect
P3
normal
RESOLVED FIXED
3 years ago
9 months ago

People

(Reporter: gbrown, Assigned: gbrown)

Tracking

(Blocks 1 bug)

Firefox Tracking Flags

(Not tracked)

Details

Attachments

(2 attachments)

I see a few recent examples of test jobs timing out where it looks like the tests just ran more than 2x more slowly than they normally would. I have Android reftest examples, but I have a feeling I have seen the same symptoms on other platforms.
https://public-artifacts.taskcluster.net/SAY7n5lsRJe3e5XGhswlXQ/0/public/logs/live_backing.log - bug 1321170

[task 2016-11-30T04:10:16.710093Z] 04:10:16     INFO -  REFTEST TEST-LOAD | http://10.0.2.2:8854/tests/layout/reftests/counters/counter-ua-limits-02.html | 253 / 281 (90%)
[task 2016-11-30T04:10:22.813428Z] 04:10:22     INFO -  org.mozilla.fennec still alive after SIGABRT: waiting...
[task 2016-11-30T04:10:28.294595Z] 04:10:28  WARNING -  TEST-UNEXPECTED-FAIL | http://10.0.2.2:8854/tests/layout/reftests/counters/counter-ua-limits-02.html == http://10.0.2.2:8854/tests/layout/reftests/counters/counter-ua-limits-02-ref.html | application ran for longer than allowed maximum time
[task 2016-11-30T04:10:28.296520Z] 04:10:28     INFO -  INFO | automation.py | Application ran for: 2:16:34.225782

This is reftest-22 (of 48, so Android Debug), which normally takes 45 minutes to complete.

Even harness startup was slow:

[task 2016-11-30T01:58:37.044757Z] 01:58:37     INFO -  REFTEST INFO | {"isDebugBuild":true,"xulRuntime":{"widgetToolkit":"android","OS":"Android","XPCOMABI":"arm-eabi-gcc3"},"smallScreen":false,"d2d":false,"dwrite":false,"gpuProcess":false,"azureCairo":false,"azureQuartz":false,"azureSkia":true,"skiaContent":true,"azureSkiaGL":1,"contentSameGfxBackendAsCanvas":true,"layersGPUAccelerated":true,"d3d11":false,"d3d9":false,"layersOpenGL":true,"layersOMTC":true,"B2G":false,"Android":true,"cocoaWidget":false,"gtkWidget":false,"qtWidget":false,"winWidget":false,"transparentScrollbars":false,"AndroidVersion":18,"AddressSanitizer":false,"webrtc":true,"http":{"userAgent":"Mozilla/5.0 (Android 4.3.1; Mobile; rv:53.0) Gecko/53.0 Firefox/53.0","appName":"Mozilla","appVersion":"5.0","platform":"Android 4.3.1","oscpu":"Linux armv7l","misc":"rv:53.0"},"haveTestPlugin":false,"windowsDefaultTheme":false,"nativeThemePref":false,"prefs":{},"browserIsRemote":false,"Mulet":false,"asyncPan":true}
[task 2016-11-30T02:08:11.695513Z] 02:08:11     INFO -  REFTEST SUITE-START | Running 13453 tests
[task 2016-11-30T02:08:11.695990Z] 02:08:11     INFO -  REFTEST INFO | Running chunk 22 out of 48 chunks.  tests 6388-6668/281

10 minutes vs normally around 4 minutes for Android Debug reftest-22.
Blocks: 1321170
https://public-artifacts.taskcluster.net/EPnUtCv6RY-hdNNRLM2kHw/0/public/logs/live_backing.log - bug 1321152, Android Debug reftest-16:

[task 2016-11-30T02:26:43.503304Z] 02:26:43  WARNING -  TEST-UNEXPECTED-FAIL | http://10.0.2.2:8854/tests/layout/reftests/bugs/586400-1.html == http://10.0.2.2:8854/tests/layout/reftests/bugs/586400-1-ref.html | application ran for longer than allowed maximum time
[task 2016-11-30T02:26:43.503485Z] 02:26:43     INFO -  INFO | automation.py | Application ran for: 2:16:38.770905

...usually R16 runs in less than 1 hour.
Blocks: 1321152
https://archive.mozilla.org/pub/mobile/tinderbox-builds/autoland-android-api-15/1480371682/autoland_ubuntu64_vm_armv7_large_test-plain-reftest-16-bm123-tests1-linux64-build2.txt.gz - bug 1320852, Android opt reftest-16, buildbot:

17:01:39     INFO -  REFTEST INFO | SET PREFERENCE pref(layout.accessiblecaret.enabled_on_touch,false)
17:01:39     INFO -  REFTEST TEST-LOAD | http://10.0.2.2:8854/tests/editor/reftests/824080-2-ref.html | 795 / 945 (84%)
17:01:46     INFO -  org.mozilla.fennec still alive after SIGABRT: waiting...
17:01:51  WARNING -  TEST-UNEXPECTED-FAIL | http://10.0.2.2:8854/tests/editor/reftests/824080-2.html == http://10.0.2.2:8854/tests/editor/reftests/824080-2-ref.html | application ran for longer than allowed maximum time
17:01:51     INFO -  INFO | automation.py | Application ran for: 2:05:20.996975

...usually R16 runs in less than 1 hour for buildbot Android opt.
This is really odd; it seems similar to the Windows 7 issues on taskcluster, but those are permanent.

A few thoughts:
1) Is there a pattern to where these failures occur, e.g. us-west-1c?
2) Could the base OS image or the hardware be causing conflicts? We don't have as much insight into that, or into whether it differs between instances.
3) These instances are quite isolated according to the 3 linked bugs: 2 occurred at similar times on the 29th, and one the day before.
4) Did the docker image or emulator image change recently?
Depends on: 1321926
Blocks: 1321926
No longer depends on: 1321926
The android emulator image has not changed since August.

4 of the 5 instances are taskcluster / 1 is buildbot; all 4 taskcluster jobs are us-west-1c and m3.xlarge types. That seems like an odd coincidence. 

It looks like gecko-t-linux-xlarge worker types can be either c3.xlarge or m3.xlarge. Comparing Android Debug reftest-1 running on c3.xlarge on one revision and m3.xlarge on another, I see no significant difference in run time.
More examples, from bug 1204281:

Android Debug jsreftest-5 (normally runs in less than 45 minutes): https://public-artifacts.taskcluster.net/S-8etlXSRFG3t60LJ-KAmg/0/public/logs/live_backing.log, us-west-1c, m3.xlarge, 70% complete after 2 hours.

Android Debug jsreftest-11 (normally runs in less than 50 minutes): https://public-artifacts.taskcluster.net/Q0UStQ7WQIq6bdEr1tGk0Q/0/public/logs/live_backing.log, us-west-1c, m3.xlarge, 83% complete after 2 hours.

Android Opt mochitest-13 (normally runs in less than 30 minutes): https://public-artifacts.taskcluster.net/SfYWLCrOSBiP_XxUP6a-GQ/0/public/logs/live_backing.log, us-west-1c, m3.xlarge, incomplete after 90 minutes.
:dustin - Do you have any suggestions for how to investigate this, or thoughts on possible causes? In brief, Android test jobs sometimes (infrequently, but at least a few times a week) run about 3x more slowly than usual. All but one known instance is taskcluster-initiated. All taskcluster instances are us-west-1c and m3.xlarge. The same jobs running on c3.xlarge, or on m3.xlarge outside us-west-1c, seem to run normally.
Flags: needinfo?(dustin)
It may be that, at the time, those were the cheapest instances per capacity unit.  For example, right now the vast majority of our instances are c3.xlarge in us-east-1{a,c,d} so it wouldn't be surprising if four of four randomly selected tasks occurred in those AZs.  At any rate, Amazon won't give us any answers about hardware differences between AZs or anything like that.  It's also well known that some small portion of EC2 instances are just slow (Google has the same issue, actually -- their MapReduce will reschedule jobs running on laggy nodes to faster nodes when possible).

Just picking something from the logs:

[task 2016-12-07T13:52:50.232125Z] 13:52:50     INFO -  REFTEST TEST-START | http://10.0.2.2:8854/jsreftest/tests/jsreftest.html?test=ecma_6/Class/superPropStatics.js
[task 2016-12-07T13:52:50.233366Z] 13:52:50     INFO -  REFTEST TEST-LOAD | http://10.0.2.2:8854/jsreftest/tests/jsreftest.html?test=ecma_6/Class/superPropStatics.js | 180 / 330 (54%)
[task 2016-12-07T13:53:18.252018Z] 13:53:18     INFO -  REFTEST TEST-PASS | http://10.0.2.2:8854/jsreftest/tests/jsreftest.html?test=ecma_6/Class/superPropStatics.js | OK  item 1
[task 2016-12-07T13:53:18.252568Z] 13:53:18     INFO -  REFTEST TEST-END | http://10.0.2.2:8854/jsreftest/tests/jsreftest.html?test=ecma_6/Class/superPropStatics.js

That test is just checking some behavior around ES6 classes:

https://dxr.mozilla.org/mozilla-central/source/js/src/tests/ecma_6/Class/superPropStatics.js?q=superPropStatics.js&redirect_type=direct 

and definitely shouldn't take 28 seconds (13:52:50 to 13:53:18 in the log above), and it doesn't seem to use any OS features that might be slow.

That said, these tests are, AIUI, emulated.  Depending on how that emulation works, it may use some processor features that the EC2 hypervisor emulates in software, and fall off a performance cliff when there is a noisy neighbor on the same host hardware as our instance.

Advice-wise, maybe it's possible to enable additional logging in cases where the initial startup runs exceptionally slowly?  And the existing maxRunTime may be the right fix for this issue, where the tasks run on irredeemably laggy hosts just get automatically re-run.  We could accelerate that re-running by writing some mozharness code to detect lagginess and return the retry-me exit status early.
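That early-retry idea can be sketched as follows. This is a hypothetical illustration, not the real mozharness API: the exit-status constant, threshold, and function name are all assumptions (mozharness itself uses EXIT_STATUS_DICT[TBPL_RETRY]).

```python
# Hypothetical values: the real harness would use
# EXIT_STATUS_DICT[TBPL_RETRY]; these numbers are illustrative only.
RETRY_EXIT_STATUS = 4
EXPECTED_STARTUP_SECONDS = 240  # ~4 minutes is typical for this chunk
LAG_FACTOR = 2.0                # treat >2x the norm as a laggy host

def startup_lag_status(startup_seconds,
                       expected=EXPECTED_STARTUP_SECONDS,
                       factor=LAG_FACTOR):
    """Return the retry exit status if harness startup was abnormally
    slow (suggesting a laggy EC2 host), or None to carry on."""
    if startup_seconds > expected * factor:
        return RETRY_EXIT_STATUS
    return None
```

With the ~10-minute startup from comment 0 (vs. the usual ~4 minutes), a check like this would bail out and request a retry instead of burning the full 2-hour maximum run time.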
Flags: needinfo?(dustin)
(In reply to Dustin J. Mitchell [:dustin] from comment #8)

Thanks much Dustin. I have suspected but never knew for sure that:

> It's also well known that some small portion of EC2 instances are
> just slow (Google has the same issue, actually -- their MapReduce will
> reschedule jobs running on laggy nodes to faster nodes when possible).

I guess that explains it.

Good point about the possibility of a performance cliff for emulation too.

> Advice-wise, maybe it's possible to enable additional logging in cases where
> the initial startup runs exceptionally slowly?  And the existing maxRunTime
> may be the right fix for this issue, where the tasks run on irredeemably
> laggy hosts just get automatically re-run.  We could accelerate that
> re-running by writing some mozharness code to detect lagginess and return
> the retry-me exit status early.

That's an interesting idea. I think I'll wait and see how much trouble this is causing for now, and consider implementing something like that in the future.
No longer blocks: 1320852
Duplicate of this bug: 1320852
No longer blocks: 1321152
Duplicate of this bug: 1321152
No longer blocks: 1321170
Duplicate of this bug: 1321170
No longer blocks: 1321926
Duplicate of this bug: 1321926
No longer blocks: 1322697
Duplicate of this bug: 1322697
Duplicate of this bug: 1323622
Duplicate of this bug: 1324588
Duplicate of this bug: 1324798
Duplicate of this bug: 1328520
Duplicate of this bug: 1328492
Duplicate of this bug: 1326364
Duplicate of this bug: 1331315
Duplicate of this bug: 1334698
It may be that these slow hosts have some obvious difference from "normal" ones. If that difference can be identified, the mozharness script could abandon the job quickly with a specific failure message, or perhaps retry.

Here's an attempt at collecting some performance-related info that might point the way to that difference. The mozharness android test script collects cpuinfo/meminfo/ps output for the host and the emulator and dumps it all to a new artifact "android-performance.log".

I might be able to find a slow host with lots of jobs on try, but I'd prefer to land this, wait for a few failures to come in, then examine the logs and adjust as needed.

Suggestions for additional info/commands welcome!

https://treeherder.mozilla.org/#/jobs?repo=try&revision=02026679d71fec460bdb231b8849b3309c1f58ac
Attachment #8832954 - Flags: review?(jmaher)
Comment on attachment 8832954 [details] [diff] [review]
collect performance information

Review of attachment 8832954 [details] [diff] [review]:
-----------------------------------------------------------------

this looks great, I like how this is host and emulator!
Attachment #8832954 - Flags: review?(jmaher) → review+
Pushed by gbrown@mozilla.com:
https://hg.mozilla.org/integration/mozilla-inbound/rev/76f6cda8a080
Collect system performance data during Android tests; r=jmaher
https://hg.mozilla.org/mozilla-central/rev/76f6cda8a080
Status: NEW → RESOLVED
Closed: 3 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla54
Meant to leave-open...just diagnostics here so far.
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
Duplicate of this bug: 1335648
Duplicate of this bug: 1335264
Duplicate of this bug: 1334755
Duplicate of this bug: 1334737
Duplicate of this bug: 1334736
Duplicate of this bug: 1333969
Duplicate of this bug: 1333332
Duplicate of this bug: 1332849
Duplicate of this bug: 1332840
Duplicate of this bug: 1331996
Duplicate of this bug: 1336870
With the possible exception of bug 1336870, I haven't found any more examples of this since adding diagnostics (comment 26). There have been some job timeouts with other causes, but for the most part, I'm not seeing slow Android jobs these days. Will leave this open for another month or so, just in case...
Tired of waiting...I guess this was a temporary issue.
Status: REOPENED → RESOLVED
Closed: 3 years ago → 2 years ago
Resolution: --- → WORKSFORME
Duplicate of this bug: 1360425
Status: RESOLVED → REOPENED
Resolution: WORKSFORME → ---
Duplicate of this bug: 1361506
Duplicate of this bug: 1361921
Duplicate of this bug: 1362253
Duplicate of this bug: 1397017
Duplicate of this bug: 1397143
Duplicate of this bug: 1397999
Duplicate of this bug: 1397997
Duplicate of this bug: 1401010
Duplicate of this bug: 1398014
Duplicate of this bug: 1397567
Duplicate of this bug: 1395055
Duplicate of this bug: 1395050
Duplicate of this bug: 1402639
Duplicate of this bug: 1413964
Duplicate of this bug: 1424058
Duplicate of this bug: 1424142
Duplicate of this bug: 1424140
Duplicate of this bug: 1424007
Duplicate of this bug: 1420095
Duplicate of this bug: 1420458
Duplicate of this bug: 1421872
Duplicate of this bug: 1421284
Duplicate of this bug: 1422878
Duplicate of this bug: 1416438
Duplicate of this bug: 1415711
Looking over recent dups, I often see low emulator bogomips; retry for bogomips < 250 might resolve 80% of the issues...but how many unnecessary retries would that add?
Duplicate of this bug: 1426601
Duplicate of this bug: 1426288
Duplicate of this bug: 1426287
Duplicate of this bug: 1426039
Duplicate of this bug: 1426030
Duplicate of this bug: 1425915
Duplicate of this bug: 1425408
Duplicate of this bug: 1424665
Many of these failures (duplicate bugs) have low emulator BogoMIPS (say, less than 250, while many successful test runs have BogoMIPS > 400). It's not consistent: some test runs are okay with BogoMIPS < 250, and some failed test runs have BogoMIPS > 400...but there is some correlation between low emulator BogoMIPS and test failure due to slowness.

I thought I could exploit that correlation by quickly failing/retrying jobs with low BogoMIPS; here's a first attempt:

https://treeherder.mozilla.org/#/jobs?repo=try&revision=4cc63de62674998fb7a2200c47c2bc4de721b2c5

I am concerned about the number of failed jobs there -- that seems more frequent than warranted.

Maybe we should avoid the check when not running on .xlarge instances? (gpu jobs, maybe some others??) Or maybe we should run everything on xlarge.

Also, perhaps 250 is the wrong cut-off point.

More experiments coming in 2018!
Depends on: 1427750
I was pleased with the results in comment 78: Most of the retries are "normal" (emulator became unresponsive during tests) with just a few from the new bogomips check -- about what I would expect.

Comment 78 included a change to the instance size of a few tasks -- now handled in bug 1427750.
This is far from fool-proof, but I hope it will avoid at least 50% of the current problems, and might point the way to a better long-term solution.
Attachment #8939575 - Flags: review?(jmaher)
Comment on attachment 8939575 [details] [diff] [review]
retry test tasks with emulator bogomips < 250

Review of attachment 8939575 [details] [diff] [review]:
-----------------------------------------------------------------

::: testing/mozharness/scripts/android_emulator_unittest.py
@@ +664,5 @@
> +                bogomips = int(m.group(1))
> +                if bogomips < 250:
> +                    self.fatal('INFRA-ERROR: insufficient Android bogomips (%d < 250)' % bogomips,
> +                               EXIT_STATUS_DICT[TBPL_RETRY])
> +                break

if bogomips isn't there, should we fail?
Attachment #8939575 - Flags: review?(jmaher) → review+
(In reply to Joel Maher ( :jmaher) (UTC-5) from comment #81)
> if bogomips isn't there, should we fail?

I would prefer not: I think it might vary depending on Android implementation (like on newer versions) and it's no big deal if we carry on silently. I will add a comment addressing this.
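A self-contained sketch of that logic, including the carry-on-silently behavior for missing BogoMIPS. The regex and threshold mirror the patch fragment above; the surrounding harness plumbing (reading the emulator's /proc/cpuinfo, raising the fatal retry error) is omitted, and the function name is illustrative.

```python
import re

BOGOMIPS_THRESHOLD = 250  # cut-off chosen in this bug; may need tuning

def bogomips_verdict(cpuinfo_text, threshold=BOGOMIPS_THRESHOLD):
    """Decide whether to retry based on emulator /proc/cpuinfo output.

    Returns 'retry' when BogoMIPS is present and below the threshold;
    returns 'continue' otherwise, including when the field is absent
    (e.g. on Android versions that do not report it).
    """
    m = re.search(r'BogoMIPS\s*:\s*(\d+)', cpuinfo_text, re.IGNORECASE)
    if m and int(m.group(1)) < threshold:
        return 'retry'
    return 'continue'
```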
Pushed by gbrown@mozilla.com:
https://hg.mozilla.org/integration/mozilla-inbound/rev/8db1ca4e7039
Retry android emulator test tasks when the emulator has insufficient bogomips; r=jmaher
I'm frustrated by ongoing Android task timeouts in bug 1411358, which give no indication of bogomips. Let's fix that: https://treeherder.mozilla.org/#/jobs?repo=try&revision=cbd55f544754775eba0d0019d9fa0ac1a9a92965
Pushed by gbrown@mozilla.com:
https://hg.mozilla.org/integration/mozilla-inbound/rev/6c21b6d9183e
Follow-up to dump Android bogomips in main test log; r=me, a=test-only
Priority: -- → P3
See Also: → 1433163
Some tasks still run slowly, but the bogomips check avoids the worst cases, and max-time/timeout management helps alleviate the rest. We're in a much better position than we were, and I have no plans for further work.
Status: REOPENED → RESOLVED
Closed: 2 years ago → 9 months ago
Resolution: --- → FIXED