Closed Bug 1734020 (tv-chaosmode-timeout-wpt) Opened 3 years ago Closed 2 years ago

web-platform tests can timeout in test-verify because of delays from MOZ_CHAOSMODE=0xfb

Categories

(Testing :: General, defect, P2)

Default
defect

Tracking

(firefox107 fixed)

RESOLVED FIXED
107 Branch
Tracking Status
firefox107 --- fixed

People

(Reporter: jdescottes, Assigned: jmaher)

References

(Depends on 1 open bug, Blocks 1 open bug)

Details

Attachments

(2 files)

Initially spotted in https://bugzilla.mozilla.org/show_bug.cgi?id=1733717. The test failing in test-verify here takes too long in chaosmode and times out. A normal run of the test takes between 10 and 20 seconds. But in chaosmode, it takes more than 45 seconds. I get a similar ratio comparing locally with or without CHAOSMODE.

In Bug 1390884, MOZ_CHAOSMODE was initially set to MOZ_CHAOSMODE=3 for test-verify and 1 year ago it was updated to MOZ_CHAOSMODE=0xfb. With 3 the test performance was very close to what we have in normal mode, but with 0xfb, adding tabs for example is much slower.

For instance, a simple test adding a 100 tabs takes 25 seconds with MOZ_CHAOSMODE=0xfb and 6 seconds in normal mode (or with MOZ_CHAOSMODE=3).

Looking at the list of features of ChaosMode, a few are especially about delaying things:

enum ChaosFeature {
  None = 0x0,
  // Altering thread scheduling.
  ThreadScheduling = 0x1,
  // Altering network request scheduling.
  NetworkScheduling = 0x2,
  // Altering timer scheduling.
  TimerScheduling = 0x4,
  // Read and write less-than-requested amounts.
  IOAmounts = 0x8,
  // Iterate over hash tables in random order.
  HashTableIteration = 0x10,
  // Randomly refuse to use cached version of image (when allowed by spec).
  ImageCache = 0x20,
  // Delay dispatching threads to encourage dispatched tasks to run.
  TaskDispatching = 0x40,
  // Delay task running to encourage sending threads to run.
  TaskRunning = 0x80,
  Any = 0xffffffff,
};

So I imagine it's not too surprising that tests can timeout more easily in chaos mode. Maybe we should increase the default timeout for tests running in chaos mode then? Note, with MOZ_CHAOSMODE=0xfb all chaos features except NetworkScheduling are enabled.

Overall this leads to false positives from test-verify, would be great to avoid them so that we can trust the reports from test-verify.

Thanks Julian. I was aware of performance differences with chaos mode, but I hadn't seen such big differences / didn't know it was problematic. I agree with the idea of increasing the timeout, perhaps like

https://searchfox.org/mozilla-central/rev/9bc5dcea99c59dc18eae0de7064131aa20cfbb66/testing/mochitest/runtests.py#2133

The severity field is not set for this bug.
:ahal, could you have a look please?

For more information, please visit auto_nag documentation.

Flags: needinfo?(ahal)
Severity: -- → S4
Flags: needinfo?(ahal)

Note: if we're only hitting this for a small handful of tests, we can work around this by opting those tests into getting a longer timeout with e.g. <meta name="timeout" content="long">, as discussed at the end of bug 1765426 comment 3 (and the next comment, bug 1765426 comment 4).

(In reply to Daniel Holbert [:dholbert] from comment #4)

Note: if we're only hitting this for a small handful of tests, we can work around this by opting those tests into getting a longer timeout with e.g. <meta name="timeout" content="long">, as discussed at the end of bug 1765426 comment 3 (and the next comment, bug 1765426 comment 4).

Note that this might hide other issues when tests are run in normal mode. We have seen situations like these for webdriver tests in the past which were related to regressions in startup/shutdown times of the browser. So we use this meta tag very carefully and that only for these tests that really need it.

Yeah, after looking at one more instance of this in bug 1787569, I don't think <meta name="timeout" content="long"> spot-treatment would be worthwhile here. On top of what Henrik noted, I think would have to deploy it too broadly.

In bug 1787569, the "slow" test (outline-valid-mandatory.html) literally just calls test_valid_value 20 times, and that's it. There are no special timeouts or looping/async fanciness that might be a spot where you could imagine chaos mode surfacing some real potential for edge-casey slowness.

Under the "regular-configuration" (non-chaos-mode) part of that bug's test-verify run, the test completes in 70-90ms in nearly all of the runs (the exception being a single 133ms run). Whereas with chaos mode, the slowest testrun was about 100x slower, with duration 9294ms, which is close enough to the 10s timeout-threshold that the test-verify run was considered a failure (hence the bug having been filed).

Reading the notes in comment 0 here... I wonder if the "Altering thread scheduling", "Delay dispatching threads", and "Delay task running" items there might cause particular slowness for style-system tests, since our style system is multi-threaded.

That would explain why outline-valid-mandatory.html's seemingly-benign calls to test_valid_value() (which resolves some CSS) would end up being 50-100x slower with MOZ_CHAOSMODE=0xfb.

No longer blocks: 1787569
No longer blocks: 1765426

(In reply to Daniel Holbert [:dholbert] from comment #7)

Reading the notes in comment 0 here... I wonder if the "Altering thread scheduling", "Delay dispatching threads", and "Delay task running" items there might cause particular slowness for style-system tests, since our style system is multi-threaded.

That would explain why outline-valid-mandatory.html's seemingly-benign calls to test_valid_value() (which resolves some CSS) would end up being 50-100x slower with MOZ_CHAOSMODE=0xfb.

Emilio, would you have some feedback regarding this possible idea?

Flags: needinfo?(emilio)

(In reply to Henrik Skupin (:whimboo) [⌚️UTC+2] from comment #17)

(In reply to Daniel Holbert [:dholbert] from comment #7)

Reading the notes in comment 0 here... I wonder if the "Altering thread scheduling", "Delay dispatching threads", and "Delay task running" items there might cause particular slowness for style-system tests, since our style system is multi-threaded.

That would explain why outline-valid-mandatory.html's seemingly-benign calls to test_valid_value() (which resolves some CSS) would end up being 50-100x slower with MOZ_CHAOSMODE=0xfb.

Emilio, would you have some feedback regarding this possible idea?

I spoke to him & he agrees. Essentially all this means is that this bug will be particularly-bad for anything that uses multiple threads. I see three options to address it:

(1) turn off Stylo's multi-threading just for this test-verify task (run with STYLO_THREADS=1)
(I'm including this option for completeness, but really this would be silly; it would defeat the purpose of the task in the first place. Also, it wouldn't help other components like e.g. WebRTC where folks are hitting this a lot too, per e.g. karlt's note in bug 1786445 comment 2.)

(2) use less-aggressive chaos mode that doesn't cause as much of a perf impact. Comment 1 here suggests that MOZ_CHAOSMODE=3 had less overhead here. But of course it wouldn't exercise edge cases as well & wouldn't catch as many real bugs/failures.

(3) Use a longer timeout when we're running under chaos mode as part of this task (higher than 10 seconds). Anecdotally, I see tests take 50-100x as long when run under this configuration (see bug 1787569 comment 6), so even increasing the timeout by 10x here would be conservative.

Option 3 seems like the best outcome here (maintaining our strictness while avoiding false positive timeouts). I think it'd be a test harness change in our test-verify setup somewhere.

whimboo, do you know if this is possible & who we might point at that?

Flags: needinfo?(emilio) → needinfo?(hskupin)
Attachment #9244267 - Attachment description: Chaos mode performance difference → Chaos mode performance difference (not a fix -- just a patch to add a dummy test that demonstrates the issue)
No longer blocks: 1790178

This bug was responsible for essentially all current test-verify bugs that I could find, FWIW (I just did an audit of open bugs about test-verify failures, and I duped all the ones that I could confirm as being this issue, and pretty much all of them were).

Alias: test-verify-chaosmode-timeout
Alias: test-verify-chaosmode-timeout → tv-chaosmode-timeout

(In reply to Daniel Holbert [:dholbert] from comment #20)

(3) Use a longer timeout when we're running under chaos mode as part of this task (higher than 10 seconds). Anecdotally, I see tests take 50-100x as long when run under this configuration (see bug 1787569 comment 6), so even increasing the timeout by 10x here would be conservative.

Option 3 seems like the best outcome here (maintaining our strictness while avoiding false positive timeouts). I think it'd be a test harness change in our test-verify setup somewhere.

whimboo, do you know if this is possible & who we might point at that?

Geoff or Andrew, could one of you please check this suggestion from Daniel? It's causing us a lot of trouble and maybe we could finally make some progress and get a better stability for verify jobs. Thanks.

Flags: needinfo?(gbrown)
Flags: needinfo?(ahal)

I am happy to take this- Geoff and Andrew both work fulltime in Releng team, not in CI/Automation team- I guess we are all welcome to work on many things, just want to respect their time as this is work related to a long ago job :)

Flags: needinfo?(hskupin)
Flags: needinfo?(gbrown)
Flags: needinfo?(ahal)

I agree option 3 would be best. I'm not sure about where/how to implement that.

I'm also not opposed to option 2, but of course would like the fewest number of chaos modes disabled to achieve stability (would require testing of the various combinations).

Blocks: 1791871

interesting- all the duplicates of this bug are in TVw (web-platform-tests).

Assignee: nobody → jmaher
Status: NEW → ASSIGNED

(In reply to Joel Maher ( :jmaher ) (UTC -0800) from comment #31)

interesting- all the duplicates of this bug are in TVw (web-platform-tests).

That's probably because our WPT harness has the strictest "test timed out" threshold (10 seconds), as far as I'm aware at least. IIRC, all our other testsuites' "test timed out" thresholds are on the order of minutes.

The severity field for this bug is relatively low, S4. However, the bug has 16 duplicates.
:jmaher, could you consider increasing the bug severity?

For more information, please visit auto_nag documentation.

Flags: needinfo?(jmaher)
Flags: needinfo?(jmaher)
Priority: -- → P2
Attachment #9244267 - Attachment description: Chaos mode performance difference (not a fix -- just a patch to add a dummy test that demonstrates the issue) → Chaos mode performance difference
Pushed by jmaher@mozilla.com: https://hg.mozilla.org/integration/autoland/rev/68560b2b3448 extend timeout for test-verify chaosmode mode. r=gbrown,jgraham
Failed to create upstream wpt PR due to merge conflicts. This requires fixup from a wpt sync admin.

(In reply to Web Platform Test Sync Bot (Matrix: #interop:mozilla.org) from comment #39)

Failed to create upstream wpt PR due to merge conflicts. This requires fixup
from a wpt sync admin.

Note that this failed because my backout from bug 1792021 didn't successfully merge.

Status: ASSIGNED → RESOLVED
Closed: 2 years ago
Resolution: --- → FIXED
Target Milestone: --- → 107 Branch

(In reply to Henrik Skupin (:whimboo) [⌚️UTC+2] from comment #40)

(In reply to Web Platform Test Sync Bot (Matrix: #interop:mozilla.org) from comment #39)

Failed to create upstream wpt PR due to merge conflicts. This requires fixup
from a wpt sync admin.

Note that this failed because my backout from bug 1792021 didn't successfully merge.

James, I think this needs an admin to proceed given that there is no upstream PR created?

Flags: needinfo?(james)
Created web-platform-tests PR https://github.com/web-platform-tests/wpt/pull/36527 for changes under testing/web-platform/tests
Flags: needinfo?(james)
Upstream PR merged by moz-wptsync-bot
Pushed by wptsync@mozilla.com: https://hg.mozilla.org/integration/autoland/rev/64d14b98edbb [wpt PR 36527] - [Gecko Bug 1734020] extend timeout for test-verify chaosmode mode., a=testonly
Duplicate of this bug: 1812220
No longer duplicate of this bug: 1812220

The patch that landed here (in comment 46) was for the WPT test harness. We're still hitting versions of this for mochitests, though. Since this bug is closed, I'll adjust this bug summary to make it more clearly WPT-specific and I'll file a version of this bug for the mochitest harness.

Alias: tv-chaosmode-timeout → tv-chaosmode-timeout-wpt
Summary: Tests can timeout in test-verify because of delays from MOZ_CHAOSMODE=0xfb → web-platform tests can timeout in test-verify because of delays from MOZ_CHAOSMODE=0xfb
See Also: → 1899777
See Also: → 1899781
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: