web-platform tests can timeout in test-verify because of delays from MOZ_CHAOSMODE=0xfb
Categories
(Testing :: General, defect, P2)
Tracking
(firefox107 fixed)
Tracking | Status | |
---|---|---|
firefox107 | --- | fixed |
People
(Reporter: jdescottes, Assigned: jmaher)
References
(Depends on 1 open bug, Blocks 1 open bug)
Details
Attachments
(2 files)
Initially spotted in https://bugzilla.mozilla.org/show_bug.cgi?id=1733717. The test failing in test-verify here takes too long in chaosmode and times out. A normal run of the test takes between 10 and 20 seconds. But in chaosmode, it takes more than 45 seconds. I get a similar ratio comparing locally with or without CHAOSMODE.
In Bug 1390884, MOZ_CHAOSMODE was initially set to MOZ_CHAOSMODE=3
for test-verify
and 1 year ago it was updated to MOZ_CHAOSMODE=0xfb
. With 3
the test performance was very close to what we have in normal mode, but with 0xfb
, adding tabs for example is much slower.
For instance, a simple test adding a 100 tabs takes 25 seconds with MOZ_CHAOSMODE=0xfb
and 6 seconds in normal mode (or with MOZ_CHAOSMODE=3
).
Looking at the list of features of ChaosMode, a few are especially about delaying things:
enum ChaosFeature {
None = 0x0,
// Altering thread scheduling.
ThreadScheduling = 0x1,
// Altering network request scheduling.
NetworkScheduling = 0x2,
// Altering timer scheduling.
TimerScheduling = 0x4,
// Read and write less-than-requested amounts.
IOAmounts = 0x8,
// Iterate over hash tables in random order.
HashTableIteration = 0x10,
// Randomly refuse to use cached version of image (when allowed by spec).
ImageCache = 0x20,
// Delay dispatching threads to encourage dispatched tasks to run.
TaskDispatching = 0x40,
// Delay task running to encourage sending threads to run.
TaskRunning = 0x80,
Any = 0xffffffff,
};
So I imagine it's not too surprising that tests can timeout more easily in chaos mode. Maybe we should increase the default timeout for tests running in chaos mode then? Note, with MOZ_CHAOSMODE=0xfb
all chaos features except NetworkScheduling
are enabled.
Overall this leads to false positives from test-verify, would be great to avoid them so that we can trust the reports from test-verify.
Reporter | ||
Comment 1•3 years ago
|
||
Comment 2•3 years ago
|
||
Thanks Julian. I was aware of performance differences with chaos mode, but I hadn't seen such big differences / didn't know it was problematic. I agree with the idea of increasing the timeout, perhaps like
Updated•3 years ago
|
Comment 3•3 years ago
|
||
The severity field is not set for this bug.
:ahal, could you have a look please?
For more information, please visit auto_nag documentation.
Updated•3 years ago
|
Comment 4•3 years ago
|
||
Note: if we're only hitting this for a small handful of tests, we can work around this by opting those tests into getting a longer timeout with e.g. <meta name="timeout" content="long">
, as discussed at the end of bug 1765426 comment 3 (and the next comment, bug 1765426 comment 4).
Comment 5•3 years ago
|
||
(In reply to Daniel Holbert [:dholbert] from comment #4)
Note: if we're only hitting this for a small handful of tests, we can work around this by opting those tests into getting a longer timeout with e.g.
<meta name="timeout" content="long">
, as discussed at the end of bug 1765426 comment 3 (and the next comment, bug 1765426 comment 4).
Note that this might hide other issues when tests are run in normal mode. We have seen situations like these for webdriver tests in the past which were related to regressions in startup/shutdown times of the browser. So we use this meta tag very carefully and that only for these tests that really need it.
Comment 6•2 years ago
•
|
||
Yeah, after looking at one more instance of this in bug 1787569, I don't think <meta name="timeout" content="long">
spot-treatment would be worthwhile here. On top of what Henrik noted, I think would have to deploy it too broadly.
In bug 1787569, the "slow" test (outline-valid-mandatory.html
) literally just calls test_valid_value
20 times, and that's it. There are no special timeouts or looping/async fanciness that might be a spot where you could imagine chaos mode surfacing some real potential for edge-casey slowness.
Under the "regular-configuration" (non-chaos-mode) part of that bug's test-verify run, the test completes in 70-90ms in nearly all of the runs (the exception being a single 133ms run). Whereas with chaos mode, the slowest testrun was about 100x slower, with duration 9294ms
, which is close enough to the 10s timeout-threshold that the test-verify run was considered a failure (hence the bug having been filed).
Comment 7•2 years ago
•
|
||
Reading the notes in comment 0 here... I wonder if the "Altering thread scheduling", "Delay dispatching threads", and "Delay task running" items there might cause particular slowness for style-system tests, since our style system is multi-threaded.
That would explain why outline-valid-mandatory.html
's seemingly-benign calls to test_valid_value()
(which resolves some CSS) would end up being 50-100x slower with MOZ_CHAOSMODE=0xfb
.
Comment hidden (Intermittent Failures Robot) |
Comment 17•2 years ago
|
||
(In reply to Daniel Holbert [:dholbert] from comment #7)
Reading the notes in comment 0 here... I wonder if the "Altering thread scheduling", "Delay dispatching threads", and "Delay task running" items there might cause particular slowness for style-system tests, since our style system is multi-threaded.
That would explain why
outline-valid-mandatory.html
's seemingly-benign calls totest_valid_value()
(which resolves some CSS) would end up being 50-100x slower withMOZ_CHAOSMODE=0xfb
.
Emilio, would you have some feedback regarding this possible idea?
Comment 20•2 years ago
•
|
||
(In reply to Henrik Skupin (:whimboo) [⌚️UTC+2] from comment #17)
(In reply to Daniel Holbert [:dholbert] from comment #7)
Reading the notes in comment 0 here... I wonder if the "Altering thread scheduling", "Delay dispatching threads", and "Delay task running" items there might cause particular slowness for style-system tests, since our style system is multi-threaded.
That would explain why
outline-valid-mandatory.html
's seemingly-benign calls totest_valid_value()
(which resolves some CSS) would end up being 50-100x slower withMOZ_CHAOSMODE=0xfb
.Emilio, would you have some feedback regarding this possible idea?
I spoke to him & he agrees. Essentially all this means is that this bug will be particularly-bad for anything that uses multiple threads. I see three options to address it:
(1) turn off Stylo's multi-threading just for this test-verify task (run with STYLO_THREADS=1
)
(I'm including this option for completeness, but really this would be silly; it would defeat the purpose of the task in the first place. Also, it wouldn't help other components like e.g. WebRTC where folks are hitting this a lot too, per e.g. karlt's note in bug 1786445 comment 2.)
(2) use less-aggressive chaos mode that doesn't cause as much of a perf impact. Comment 1 here suggests that MOZ_CHAOSMODE=3
had less overhead here. But of course it wouldn't exercise edge cases as well & wouldn't catch as many real bugs/failures.
(3) Use a longer timeout when we're running under chaos mode as part of this task (higher than 10 seconds). Anecdotally, I see tests take 50-100x as long when run under this configuration (see bug 1787569 comment 6), so even increasing the timeout by 10x here would be conservative.
Option 3 seems like the best outcome here (maintaining our strictness while avoiding false positive timeouts). I think it'd be a test harness change in our test-verify setup somewhere.
whimboo, do you know if this is possible & who we might point at that?
Updated•2 years ago
|
Comment 27•2 years ago
•
|
||
This bug was responsible for essentially all current test-verify
bugs that I could find, FWIW (I just did an audit of open bugs about test-verify failures, and I duped all the ones that I could confirm as being this issue, and pretty much all of them were).
Updated•2 years ago
|
Updated•2 years ago
|
Comment 28•2 years ago
|
||
(In reply to Daniel Holbert [:dholbert] from comment #20)
(3) Use a longer timeout when we're running under chaos mode as part of this task (higher than 10 seconds). Anecdotally, I see tests take 50-100x as long when run under this configuration (see bug 1787569 comment 6), so even increasing the timeout by 10x here would be conservative.
Option 3 seems like the best outcome here (maintaining our strictness while avoiding false positive timeouts). I think it'd be a test harness change in our test-verify setup somewhere.
whimboo, do you know if this is possible & who we might point at that?
Geoff or Andrew, could one of you please check this suggestion from Daniel? It's causing us a lot of trouble and maybe we could finally make some progress and get a better stability for verify jobs. Thanks.
Assignee | ||
Comment 29•2 years ago
|
||
I am happy to take this- Geoff and Andrew both work fulltime in Releng team, not in CI/Automation team- I guess we are all welcome to work on many things, just want to respect their time as this is work related to a long ago job :)
Comment 30•2 years ago
|
||
I agree option 3 would be best. I'm not sure about where/how to implement that.
I'm also not opposed to option 2, but of course would like the fewest number of chaos modes disabled to achieve stability (would require testing of the various combinations).
Assignee | ||
Comment 31•2 years ago
|
||
interesting- all the duplicates of this bug are in TVw (web-platform-tests).
Assignee | ||
Comment 32•2 years ago
|
||
Updated•2 years ago
|
Comment 33•2 years ago
•
|
||
(In reply to Joel Maher ( :jmaher ) (UTC -0800) from comment #31)
interesting- all the duplicates of this bug are in TVw (web-platform-tests).
That's probably because our WPT harness has the strictest "test timed out" threshold (10 seconds), as far as I'm aware at least. IIRC, all our other testsuites' "test timed out" thresholds are on the order of minutes.
Comment hidden (Intermittent Failures Robot) |
Comment 35•2 years ago
|
||
The severity field for this bug is relatively low, S4. However, the bug has 16 duplicates.
:jmaher, could you consider increasing the bug severity?
For more information, please visit auto_nag documentation.
Comment hidden (Intermittent Failures Robot) |
Assignee | ||
Updated•2 years ago
|
Comment hidden (Intermittent Failures Robot) |
Updated•2 years ago
|
Comment 38•2 years ago
|
||
Comment 40•2 years ago
|
||
(In reply to Web Platform Test Sync Bot (Matrix: #interop:mozilla.org) from comment #39)
Failed to create upstream wpt PR due to merge conflicts. This requires fixup
from a wpt sync admin.
Note that this failed because my backout from bug 1792021 didn't successfully merge.
Comment 41•2 years ago
|
||
bugherder |
Comment 42•2 years ago
|
||
(In reply to Henrik Skupin (:whimboo) [⌚️UTC+2] from comment #40)
(In reply to Web Platform Test Sync Bot (Matrix: #interop:mozilla.org) from comment #39)
Failed to create upstream wpt PR due to merge conflicts. This requires fixup
from a wpt sync admin.Note that this failed because my backout from bug 1792021 didn't successfully merge.
James, I think this needs an admin to proceed given that there is no upstream PR created?
Updated•2 years ago
|
Comment 45•2 years ago
|
||
Comment 46•2 years ago
|
||
bugherder |
Comment 48•2 years ago
|
||
The patch that landed here (in comment 46) was for the WPT test harness. We're still hitting versions of this for mochitests, though. Since this bug is closed, I'll adjust this bug summary to make it more clearly WPT-specific and I'll file a version of this bug for the mochitest harness.
Updated•2 years ago
|
Updated•2 years ago
|
Comment hidden (Intermittent Failures Robot) |
Description
•