Closed
Bug 1321605
Opened 8 years ago
Closed 6 years ago
Some test jobs intermittently run very slowly
Categories
(Testing :: General, defect, P3)
Tracking
(Not tracked)
RESOLVED
FIXED
mozilla54
People
(Reporter: gbrown, Assigned: gbrown)
References
(Blocks 1 open bug)
Details
Attachments
(2 files)
3.59 KB, patch | jmaher: review+ | Details | Diff | Splinter Review
1.87 KB, patch | jmaher: review+ | Details | Diff | Splinter Review
I see a few recent examples of test jobs timing out where it looks like the tests simply ran more than twice as slowly as they normally would. I have Android reftest examples, but I have a feeling I have seen the same symptoms on other platforms.
Assignee
Comment 1•8 years ago
https://public-artifacts.taskcluster.net/SAY7n5lsRJe3e5XGhswlXQ/0/public/logs/live_backing.log - bug 1321170

[task 2016-11-30T04:10:16.710093Z] 04:10:16 INFO - REFTEST TEST-LOAD | http://10.0.2.2:8854/tests/layout/reftests/counters/counter-ua-limits-02.html | 253 / 281 (90%)
[task 2016-11-30T04:10:22.813428Z] 04:10:22 INFO - org.mozilla.fennec still alive after SIGABRT: waiting...
[task 2016-11-30T04:10:28.294595Z] 04:10:28 WARNING - TEST-UNEXPECTED-FAIL | http://10.0.2.2:8854/tests/layout/reftests/counters/counter-ua-limits-02.html == http://10.0.2.2:8854/tests/layout/reftests/counters/counter-ua-limits-02-ref.html | application ran for longer than allowed maximum time
[task 2016-11-30T04:10:28.296520Z] 04:10:28 INFO - INFO | automation.py | Application ran for: 2:16:34.225782

This is reftest-22 (of 48, so Android Debug), which normally takes 45 minutes to complete. Even harness startup was slow:

[task 2016-11-30T01:58:37.044757Z] 01:58:37 INFO - REFTEST INFO | {"isDebugBuild":true,"xulRuntime":{"widgetToolkit":"android","OS":"Android","XPCOMABI":"arm-eabi-gcc3"},"smallScreen":false,"d2d":false,"dwrite":false,"gpuProcess":false,"azureCairo":false,"azureQuartz":false,"azureSkia":true,"skiaContent":true,"azureSkiaGL":1,"contentSameGfxBackendAsCanvas":true,"layersGPUAccelerated":true,"d3d11":false,"d3d9":false,"layersOpenGL":true,"layersOMTC":true,"B2G":false,"Android":true,"cocoaWidget":false,"gtkWidget":false,"qtWidget":false,"winWidget":false,"transparentScrollbars":false,"AndroidVersion":18,"AddressSanitizer":false,"webrtc":true,"http":{"userAgent":"Mozilla/5.0 (Android 4.3.1; Mobile; rv:53.0) Gecko/53.0 Firefox/53.0","appName":"Mozilla","appVersion":"5.0","platform":"Android 4.3.1","oscpu":"Linux armv7l","misc":"rv:53.0"},"haveTestPlugin":false,"windowsDefaultTheme":false,"nativeThemePref":false,"prefs":{},"browserIsRemote":false,"Mulet":false,"asyncPan":true}
[task 2016-11-30T02:08:11.695513Z] 02:08:11 INFO - REFTEST SUITE-START | Running 13453 tests
[task 2016-11-30T02:08:11.695990Z] 02:08:11 INFO - REFTEST INFO | Running chunk 22 out of 48 chunks. tests 6388-6668/281

10 minutes vs normally around 4 minutes for Android Debug reftest-22.
Blocks: 1321170
Assignee
Comment 2•8 years ago
https://public-artifacts.taskcluster.net/EPnUtCv6RY-hdNNRLM2kHw/0/public/logs/live_backing.log - bug 1321152, Android Debug reftest-16:

[task 2016-11-30T02:26:43.503304Z] 02:26:43 WARNING - TEST-UNEXPECTED-FAIL | http://10.0.2.2:8854/tests/layout/reftests/bugs/586400-1.html == http://10.0.2.2:8854/tests/layout/reftests/bugs/586400-1-ref.html | application ran for longer than allowed maximum time
[task 2016-11-30T02:26:43.503485Z] 02:26:43 INFO - INFO | automation.py | Application ran for: 2:16:38.770905

...usually R16 runs in less than 1 hour.
Blocks: 1321152
Assignee
Comment 3•8 years ago
https://archive.mozilla.org/pub/mobile/tinderbox-builds/autoland-android-api-15/1480371682/autoland_ubuntu64_vm_armv7_large_test-plain-reftest-16-bm123-tests1-linux64-build2.txt.gz - bug 1320852, Android opt reftest-16, buildbot:

17:01:39 INFO - REFTEST INFO | SET PREFERENCE pref(layout.accessiblecaret.enabled_on_touch,false)
17:01:39 INFO - REFTEST TEST-LOAD | http://10.0.2.2:8854/tests/editor/reftests/824080-2-ref.html | 795 / 945 (84%)
17:01:46 INFO - org.mozilla.fennec still alive after SIGABRT: waiting...
17:01:51 WARNING - TEST-UNEXPECTED-FAIL | http://10.0.2.2:8854/tests/editor/reftests/824080-2.html == http://10.0.2.2:8854/tests/editor/reftests/824080-2-ref.html | application ran for longer than allowed maximum time
17:01:51 INFO - INFO | automation.py | Application ran for: 2:05:20.996975

...usually R16 runs in less than 1 hour for buildbot Android opt.
Comment 4•8 years ago
This is really odd - it seems oddly similar to the Windows 7 issues on taskcluster, but those are permanent. A few thoughts:
1) Is there a pattern as to where these failures occur - us-west-1c?
2) Is there a possibility that the base OS image or the hardware is causing conflicts? We don't have as much insight into that, or into whether it differs between instances.
3) These instances are very isolated according to the 3 linked bugs - 2 are at similar times on the 29th, and one is the day before.
4) Did the docker image or emulator image change recently?
Assignee
Updated•8 years ago
Assignee
Comment 5•8 years ago
The Android emulator image has not changed since August. 4 of the 5 instances are taskcluster / 1 is buildbot; all 4 taskcluster jobs are us-west-1c and m3.xlarge. That seems like an odd coincidence. It looks like gecko-t-linux-xlarge worker types can be either c3.xlarge or m3.xlarge. Comparing Android Debug reftest-1 running on c3.xlarge on one revision and m3.xlarge on another, I see no significant difference in run time.
Assignee
Comment 6•8 years ago
More examples, from bug 1204281:

Android Debug jsreftest-5 (normally runs in less than 45 minutes): https://public-artifacts.taskcluster.net/S-8etlXSRFG3t60LJ-KAmg/0/public/logs/live_backing.log, us-west-1c, m3.xlarge, 70% complete after 2 hours.

Android Debug jsreftest-11 (normally runs in less than 50 minutes): https://public-artifacts.taskcluster.net/Q0UStQ7WQIq6bdEr1tGk0Q/0/public/logs/live_backing.log, us-west-1c, m3.xlarge, 83% complete after 2 hours.

Android Opt mochitest-13 (normally runs in less than 30 minutes): https://public-artifacts.taskcluster.net/SfYWLCrOSBiP_XxUP6a-GQ/0/public/logs/live_backing.log, us-west-1c, m3.xlarge, incomplete after 90 minutes.
Assignee
Comment 7•8 years ago
:dustin - Do you have any suggestions for how to investigate this, or thoughts on possible causes? In brief, Android test jobs sometimes (infrequently, but at least a few times a week) run about 3x more slowly than usual. All but one of the known instances are taskcluster-initiated. All of the taskcluster instances are us-west-1c and m3.xlarge. The same jobs running on c3.xlarge, or on m3.xlarge in availability zones other than us-west-1c, seem to run at normal speed.
Flags: needinfo?(dustin)
Comment 8•8 years ago
It may be that, at the time, those were the cheapest instances per capacity unit. For example, right now the vast majority of our instances are c3.xlarge in us-east-1{a,c,d}, so it wouldn't be surprising if four of four randomly selected tasks occurred in those AZs. At any rate, Amazon won't give us any answers about hardware differences between AZs or anything like that. It's also well known that some small portion of EC2 instances are just slow (Google has the same issue, actually -- their MapReduce will reschedule jobs running on laggy nodes to faster nodes when possible).

Just picking something from the logs:

[task 2016-12-07T13:52:50.232125Z] 13:52:50 INFO - REFTEST TEST-START | http://10.0.2.2:8854/jsreftest/tests/jsreftest.html?test=ecma_6/Class/superPropStatics.js
[task 2016-12-07T13:52:50.233366Z] 13:52:50 INFO - REFTEST TEST-LOAD | http://10.0.2.2:8854/jsreftest/tests/jsreftest.html?test=ecma_6/Class/superPropStatics.js | 180 / 330 (54%)
[task 2016-12-07T13:53:18.252018Z] 13:53:18 INFO - REFTEST TEST-PASS | http://10.0.2.2:8854/jsreftest/tests/jsreftest.html?test=ecma_6/Class/superPropStatics.js | OK item 1
[task 2016-12-07T13:53:18.252568Z] 13:53:18 INFO - REFTEST TEST-END | http://10.0.2.2:8854/jsreftest/tests/jsreftest.html?test=ecma_6/Class/superPropStatics.js

That test is just checking some behavior around ES6 classes (https://dxr.mozilla.org/mozilla-central/source/js/src/tests/ecma_6/Class/superPropStatics.js?q=superPropStatics.js&redirect_type=direct) and definitely shouldn't take 38 seconds, and it doesn't seem to use any OS features that might be slow. That said, these tests are, AIUI, emulated. Depending on how that emulation works, it may use some processor features that the EC2 hypervisor emulates in software, and fall off a performance cliff when there is a noisy neighbor on the same host hardware as our instance.

Advice-wise, maybe it's possible to enable additional logging in cases where the initial startup runs exceptionally slowly? And the existing maxRunTime may be the right fix for this issue, where the tasks run on irredeemably laggy hosts just get automatically re-run. We could accelerate that re-running by writing some mozharness code to detect lagginess and return the retry-me exit status early.
Flags: needinfo?(dustin)
Assignee
Comment 9•8 years ago
(In reply to Dustin J. Mitchell [:dustin] from comment #8)

Thanks much Dustin. I have suspected but never knew for sure that:

> It's also well known that some small portion of EC2 instances are
> just slow (Google has the same issue, actually -- their MapReduce will
> reschedule jobs running on laggy nodes to faster nodes when possible).

I guess that explains it. Good point about the possibility of a performance cliff for emulation too.

> Advice-wise, maybe it's possible to enable additional logging in cases where
> the initial startup runs exceptionally slowly? And the existing maxRunTime
> may be the right fix for this issue, where the tasks run on irredeemably
> laggy hosts just get automatically re-run. We could accelerate that
> re-running by writing some mozharness code to detect lagginess and return
> the retry-me exit status early.

That's an interesting idea. I think I'll wait and see how much trouble this is causing for now, and consider implementing something like that in the future.
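To make that idea concrete, here is a rough, purely hypothetical sketch of such an early lagginess check. The probe command, expected duration, threshold factor, and exit code are all assumptions, not landed code; in a real mozharness script the retry would go through self.fatal(..., EXIT_STATUS_DICT[TBPL_RETRY]) as in the patch quoted later in comment 81.

# Hypothetical sketch only -- not the landed patch.
import subprocess
import sys
import time

# Placeholder for the scheduler's "retry this task" exit status; mozharness
# gets the real value from EXIT_STATUS_DICT[TBPL_RETRY].
RETRY_EXIT_CODE = 4

# Assumed typical duration of the probe on a healthy instance, and how much
# slower we tolerate before giving up on the host.
EXPECTED_SECONDS = 5.0
TIMEOUT_FACTOR = 3.0

def probe_duration():
    """Time a cheap, deterministic operation against the emulator."""
    start = time.time()
    subprocess.check_output(["adb", "shell", "ls", "/system"])
    return time.time() - start

def check_startup_lag():
    elapsed = probe_duration()
    if elapsed > EXPECTED_SECONDS * TIMEOUT_FACTOR:
        # Abandon the task early so it can be re-run on (hopefully) a faster
        # host, instead of burning the full max run time.
        print("INFRA-ERROR: host appears laggy (probe took %.1fs)" % elapsed,
              file=sys.stderr)
        sys.exit(RETRY_EXIT_CODE)

if __name__ == "__main__":
    check_startup_lag()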
Assignee
Comment 23•8 years ago
It may be that these slow hosts have some obvious difference from "normal" ones. If that difference can be identified, the mozharness script could abandon the job quickly with a specific failure message, or perhaps retry.

Here's an attempt at collecting some performance-related info that might point the way to that difference. The mozharness android test script collects cpuinfo/meminfo/ps output for the host and the emulator and dumps it all to a new artifact, "android-performance.log".

I might be able to find a slow host with lots of jobs on try, but I'd prefer to land this, wait for a few failures to come in, then examine the logs and adjust as needed. Suggestions for additional info/commands welcome!

https://treeherder.mozilla.org/#/jobs?repo=try&revision=02026679d71fec460bdb231b8849b3309c1f58ac
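For illustration, a minimal sketch of this kind of collection (the command list, labels, error handling, and hard-coded artifact path are illustrative; the actual patch in the mozharness android test script differs in detail):

# Rough sketch: dump host and emulator performance info into an artifact.
import subprocess

HOST_COMMANDS = [
    ["cat", "/proc/cpuinfo"],
    ["cat", "/proc/meminfo"],
    ["ps", "aux"],
]
EMULATOR_COMMANDS = [
    ["adb", "shell", "cat", "/proc/cpuinfo"],
    ["adb", "shell", "cat", "/proc/meminfo"],
    ["adb", "shell", "ps"],
]

def run(cmd):
    # Capture command output; keep going even if one command fails.
    try:
        return subprocess.check_output(cmd, stderr=subprocess.STDOUT).decode("utf-8", "replace")
    except Exception as e:
        return "failed to run %s: %s\n" % (cmd, e)

def write_performance_log(path="android-performance.log"):
    with open(path, "w") as f:
        for label, commands in (("host", HOST_COMMANDS), ("emulator", EMULATOR_COMMANDS)):
            for cmd in commands:
                f.write("=== %s: %s ===\n" % (label, " ".join(cmd)))
                f.write(run(cmd))
                f.write("\n")

if __name__ == "__main__":
    write_performance_log()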
Attachment #8832954 - Flags: review?(jmaher)
Comment 24•8 years ago
Comment on attachment 8832954 [details] [diff] [review]
collect performance information

Review of attachment 8832954 [details] [diff] [review]:
-----------------------------------------------------------------

this looks great, I like how this is host and emulator!
Attachment #8832954 - Flags: review?(jmaher) → review+
Comment 25•8 years ago
Pushed by gbrown@mozilla.com:
https://hg.mozilla.org/integration/mozilla-inbound/rev/76f6cda8a080
Collect system performance data during Android tests; r=jmaher
Comment 26•8 years ago
bugherder
https://hg.mozilla.org/mozilla-central/rev/76f6cda8a080
Status: NEW → RESOLVED
Closed: 8 years ago
status-firefox54:
--- → fixed
Resolution: --- → FIXED
Target Milestone: --- → mozilla54
Assignee
Comment 27•8 years ago
Meant to leave-open...just diagnostics here so far.
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
Assignee
Updated•8 years ago
status-firefox54:
fixed → ---
Assignee
Comment 39•8 years ago
With the possible exception of bug 1336870, I haven't found any more examples of this since adding diagnostics (comment 26). There have been some job timeouts with other causes, but for the most part, I'm not seeing slow Android jobs these days. Will leave this open for another month or so, just in case...
Assignee
Comment 40•8 years ago
Tired of waiting...I guess this was a temporary issue.
Status: REOPENED → RESOLVED
Closed: 8 years ago → 8 years ago
Resolution: --- → WORKSFORME
Assignee
Updated•7 years ago
Status: RESOLVED → REOPENED
Resolution: WORKSFORME → ---
Assignee
Comment 67•7 years ago
Looking over recent dups, I often see low emulator bogomips; retry for bogomips < 250 might resolve 80% of the issues...but how many unnecessary retries would that add?
Assignee
Comment 77•7 years ago
Many of these failures (duplicate bugs) have low emulator BogoMIPS (say, less than 250, while many successful test runs have BogoMIPS > 400). It's not consistent: some test runs are okay with BogoMIPS < 250, and some failed test runs have BogoMIPS > 400...but there is some correlation between low emulator BogoMIPS and test failure due to slowness.

I thought I could exploit that correlation by quickly failing/retrying jobs with low BogoMIPS; here's a first attempt:

https://treeherder.mozilla.org/#/jobs?repo=try&revision=4cc63de62674998fb7a2200c47c2bc4de721b2c5

I am concerned about the number of failed jobs there -- that seems more frequent than warranted. Maybe we should avoid the check when not running on .xlarge instances? (gpu jobs, maybe some others??) Or maybe we should run everything on xlarge. Also, perhaps 250 is the wrong cut-off point.

More experiments coming in 2018!
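For reference, a simplified, self-contained sketch of the kind of check being tried here (the adb invocation, regex, and standalone structure are illustrative; the real change lives in the mozharness android test script, where the failure is reported via self.fatal with the TBPL retry status, as quoted in comment 81):

# Sketch of a bogomips-based early retry; 250 is the cut-off under discussion.
import re
import subprocess

MIN_BOGOMIPS = 250

def emulator_bogomips():
    """Read BogoMIPS from the emulator's /proc/cpuinfo, or None if not reported."""
    cpuinfo = subprocess.check_output(
        ["adb", "shell", "cat", "/proc/cpuinfo"]).decode("utf-8", "replace")
    m = re.search(r"BogoMIPS\s*:\s*(\d+)", cpuinfo, re.IGNORECASE)
    return int(m.group(1)) if m else None

def check_bogomips():
    bogomips = emulator_bogomips()
    if bogomips is None:
        # Some Android versions may not report BogoMIPS; carry on silently
        # rather than failing (see comment 82).
        return
    if bogomips < MIN_BOGOMIPS:
        # In the mozharness script this would request a task retry
        # (self.fatal(..., EXIT_STATUS_DICT[TBPL_RETRY])).
        raise SystemExit("INFRA-ERROR: insufficient Android bogomips (%d < %d)"
                         % (bogomips, MIN_BOGOMIPS))

if __name__ == "__main__":
    check_bogomips()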
Assignee
Comment 78•7 years ago
https://treeherder.mozilla.org/#/jobs?repo=try&revision=a5f153a94b68e2de662def68539365965b3ee932
Assignee
Comment 79•7 years ago
I was pleased with the results in comment 78: Most of the retries are "normal" (emulator became unresponsive during tests) with just a few from the new bogomips check -- about what I would expect. Comment 78 included a change to the instance size of a few tasks -- now handled in bug 1427750.
Assignee
Updated•7 years ago
Keywords: leave-open
Assignee
Comment 80•7 years ago
This is far from fool-proof, but I hope it will avoid at least 50% of the current problems, and might point the way to a better long-term solution.
Attachment #8939575 - Flags: review?(jmaher)
Comment 81•7 years ago
Comment on attachment 8939575 [details] [diff] [review]
retry test tasks with emulator bogomips < 250

Review of attachment 8939575 [details] [diff] [review]:
-----------------------------------------------------------------

::: testing/mozharness/scripts/android_emulator_unittest.py
@@ +664,5 @@
> + bogomips = int(m.group(1))
> + if bogomips < 250:
> + self.fatal('INFRA-ERROR: insufficient Android bogomips (%d < 250)' % bogomips,
> + EXIT_STATUS_DICT[TBPL_RETRY])
> + break

if bogomips isn't there, should we fail?
Attachment #8939575 - Flags: review?(jmaher) → review+
Assignee
Comment 82•7 years ago
(In reply to Joel Maher ( :jmaher) (UTC-5) from comment #81)
> if bogomips isn't there, should we fail?

I would prefer not: I think it might vary depending on Android implementation (like on newer versions) and it's no big deal if we carry on silently. I will add a comment addressing this.
Comment 83•7 years ago
Pushed by gbrown@mozilla.com:
https://hg.mozilla.org/integration/mozilla-inbound/rev/8db1ca4e7039
Retry android emulator test tasks when the emulator has insufficient bogomips; r=jmaher
Assignee
Comment 85•7 years ago
I'm frustrated by ongoing Android task timeouts in bug 1411358, which give no indication of bogomips. Let's fix that:

https://treeherder.mozilla.org/#/jobs?repo=try&revision=cbd55f544754775eba0d0019d9fa0ac1a9a92965
Comment 86•7 years ago
Pushed by gbrown@mozilla.com:
https://hg.mozilla.org/integration/mozilla-inbound/rev/6c21b6d9183e
Follow-up to dump Android bogomips in main test log; r=me, a=test-only
Assignee
Updated•7 years ago
Priority: -- → P3
Assignee
Comment 105•6 years ago
Some tasks run slowly, but the bogomips check avoids the worst cases, and max-time/timeout management helps alleviate the rest. We're in a much better position than we were, and I have no plans for further work.
Status: REOPENED → RESOLVED
Closed: 8 years ago → 6 years ago
Resolution: --- → FIXED
Updated•6 years ago
Keywords: leave-open