Closed Bug 1878346 Opened 9 months ago Closed 9 months ago

Attempt to prevent intermittent failure in test_bug514732-2.xhtml

Categories

(Core :: Layout, task)

task

Tracking

()

RESOLVED FIXED
124 Branch
Tracking Status
firefox-esr115 --- fixed
firefox123 --- fixed
firefox124 --- fixed

People

(Reporter: dholbert, Assigned: dholbert)

References

Details

Attachments

(3 files)

I was able to reproduce the intermittent failures in bug 1765787 locally a few times yesterday (though it stopped reproducing for me; it's very fiddly).

In the instances of the failure that I caught, it seems like the issue was this:

  1. the test has a <browser> that starts out seeked to about:blank
  2. the test immediately (in its load handler) calls doPageNavigation to navigate that <browser>, and the test listens for a MozScrolledAreaChanged event as a signal that it can move on to the next part of the test.
  3. BUT occasionally, in that first doPageNavigation call, the <browser>'s initial about:blank document still has a pending reflow which posts a MozScrolledAreaChanged when it completes (because its viewport size does change size -- maybe the outer test element resized or something as part of its load).
  4. So the test's first doPageNavigation call occasionally receives that MozScrolledAreaChanged and misinterprets it as a sign that the first navigation completed, even though it's barely started.
  5. ...and the test moves onto the next part of the test, despite that initial page not having been loaded in the <browser> and hence not being in the bfcache.
  6. This results in us failing to navigate back (because there's nowhere to navigate back to) later on in the test, per bug 1765787 comment 38, and that explains this failure-mode:
TEST-UNEXPECTED-FAIL | layout/generic/test/test_bug514732-2.xhtml | Test timed out. 

...because the event that we're waiting for (from our back-navigation) never arrives, because we didn't actually navigate back.

Also: I think in some cases, our first doPageNavigation call does get a chance to complete and so we end up getting two MozScrolledAreaChanged events (the unexpected one from about:blank and the expected one from the actual page that we land on). This explains the other failure-mode, which is:

layout/generic/test/test_bug514732-2.xhtml | Unexpected event (MozScrolledAreaChanged) occurred 

I've got a patch that I think will fix things by doing an initial load to a dummy page, where we simply watch for the load event instead of MozScrolledAreaChanged. This should establish a more reliable initial state since I think the about:blank load event should not still be pending at that point and won't misfire in the same way that its MozScrolledAreaChanged event was.

Actually, I think a simpler & more-robust fix is just to explicitly flush layout in the outer and inner doc before we do the navigation. That ensures that any pending reflow-induced MozScrolledAreaChanged events from about:blank will have already been fired before we make our first doPageNavigation call.

This patch doesn't change behavior; it's just making this test more robust.

Without this patch, we can occasionall get an unexpected MozScrolledAreaChanged
from the initial about:blank test-document, which throws off the test logic
and causes test failures and timeouts. See this bug comment for more details
on how that happens:
https://bugzilla.mozilla.org/show_bug.cgi?id=1878346#c0

Without my patch, in an --enable-optimize --disable-debug build, I can pretty reliably reproduce the Unexpected event (MozScrolledAreaChanged) occurred failure-mode, with the following command:

./mach mochitest --repeat 200 --headless layout/generic/test/test_bug514732-2.xhtml

I ran that twice in a row, and I got 8 failures both times, like so:

===============

mochitest-chrome
~~~~~~~~~~~~~~~~
Ran 1013 checks (812 subtests, 201 tests)
Expected results: 1005
Unexpected results: 8
  subtest: 8 (8 fail)

Unexpected Results
------------------
layout/generic/test/test_bug514732-2.xhtml
  FAIL Unexpected event (MozScrolledAreaChanged) occurred
SimpleTest.ok@chrome://mochikit/content/tests/SimpleTest/SimpleTest.js:426:16
pageEventListener@chrome://mochikit/content/tests/SimpleTest/docshell_helpers.js:490:7
EventListener.handleEvent*doPageNavigation@chrome://mochikit/content/tests/SimpleTest/docshell_helpers.js:245:31
testIterator@chrome://mochitests/content/chrome/layout/generic/test/file_bug514732_window.xhtml:43:25
nextTest@chrome://mochitests/content/chrome/layout/generic/test/file_bug514732_window.xhtml:28:13
setTimeout handler*onload@chrome://mochitests/content/chrome/layout/generic/test/file_bug514732_window.xhtml:1:11
[...7 more copies of the above test-failure]

With this patch, that same command completes successfully with no failures. I ran it again with 4x as many repeat counts (--repeat 800) and still got no failures. So that's a good sign that this does indeed prevent/mitigate the situation that was leading to the test failures.

(In reply to Daniel Holbert [:dholbert] from comment #3)

Without my patch [...] I can pretty reliably reproduce the Unexpected event (MozScrolledAreaChanged) occurred failure-mode
[...]
With this patch, that same command completes successfully with no failures.

(To be clear, I think the patch also addresses the other (test-timeout) failure-mode from bug 1765787, too, as described in comment 0 here. Both failure-modes are symptoms of the same underlying thing having gone wrong.)

Pushed by dholbert@mozilla.com: https://hg.mozilla.org/integration/autoland/rev/0238b7b5d676 Avoid intermittent test failures in test_bug514732-2.xhtml by flushing layout before setting up MozScrolledAreaChanged event-handler. r=layout-reviewers,emilio

Good news: macos opt test-verify builds seem to confirm that this fixed the issue. They reproduce the failures quite reliably in a build without this patch, and they're solid-green with the patch.

Without the patch: 5/5 opt TV runs are orange:
https://treeherder.mozilla.org/jobs?repo=try&selectedTaskRun=O5kVqM5eSu6XEGDM3-jR0g.0&resultStatus=testfailed%2Cbusted%2Cexception%2Csuccess%2Crunning%2Cpending%2Crunnable&revision=8214788adbf84b869520d84eb3aa4f804017494f&searchStr=test-macosx1015-64-qr%2Fopt-test-verify

With the patch: 14/14 opt TV runs are green:
https://treeherder.mozilla.org/jobs?repo=try&resultStatus=testfailed%2Cbusted%2Cexception%2Csuccess%2Crunning%2Cpending%2Crunnable&revision=cfebe4a742ae39f7be3ff14a25ead13369567162&searchStr=test-macosx1015-64-qr%2Fopt-test-verify

Let's get this uplifted to beta/esr115 to stop the intermittent failures there as well.

This patch doesn't change behavior; it's just making this test more robust.

Without this patch, we can occasionally get an unexpected MozScrolledAreaChanged
from the initial about:blank test-document, which throws off the test logic
and causes test failures and timeouts. See this bug comment for more details
on how that happens:
https://bugzilla.mozilla.org/show_bug.cgi?id=1878346#c0

Original Revision: https://phabricator.services.mozilla.com/D200492

Attachment #9378089 - Flags: approval-mozilla-beta?

Uplift Approval Request

  • User impact if declined: No user impact either way; this is a test-only change
  • Code covered by automated testing: yes
  • Risk associated with taking this patch: Zero
  • Fix verified in Nightly: yes
  • Steps to reproduce for manual QE testing: N/A
  • Is Android affected?: no
  • String changes made/needed: None
  • Explanation of risk level: This is a test-only change
  • Needs manual QE test: no

This patch doesn't change behavior; it's just making this test more robust.

Without this patch, we can occasionally get an unexpected MozScrolledAreaChanged
from the initial about:blank test-document, which throws off the test logic
and causes test failures and timeouts. See this bug comment for more details
on how that happens:
https://bugzilla.mozilla.org/show_bug.cgi?id=1878346#c0

Original Revision: https://phabricator.services.mozilla.com/D200492

Attachment #9378090 - Flags: approval-mozilla-esr115?

Uplift Approval Request

  • Code covered by automated testing: yes
  • User impact if declined: No user impact either way; this is a test-only change
  • Fix verified in Nightly: yes
  • Risk associated with taking this patch: Zero
  • String changes made/needed: None
  • Is Android affected?: no
  • Steps to reproduce for manual QE testing: N/A
  • Needs manual QE test: no
  • Explanation of risk level: This is a test-only change
Status: NEW → RESOLVED
Closed: 9 months ago
Resolution: --- → FIXED
Target Milestone: --- → 124 Branch
Attachment #9378089 - Flags: approval-mozilla-beta? → approval-mozilla-beta+
Attachment #9378090 - Flags: approval-mozilla-esr115? → approval-mozilla-esr115+

This test-only-change was uplifted to esr115 (comment 13), but that patch appears to have been inadvertently reverted by an unrelated patch that landed in bug 1874627, specifically this part:
https://hg.mozilla.org/releases/mozilla-esr115/rev/025824aab59e3ee1778a8041eff04b9d653b8468#l4.2

I'm guessing that patch in bug 1874627 was generated by-hand using some sort of diff/revert command, and it inadvertently picked up the changes here as something-to-be-reverted.

We need to get this re-landed on ESR115, or else we'll keep getting intermittent failure reports on bug 1765787 for esr115 indefinitely. What's the best course of action at this point? Should I generate a new ESR115 uplift request? Or can we just automatically re-land the patch from comment 13?

Flags: needinfo?(pascalc)

(Aryx, maybe you know the best course-of-action here. I'm guessing that the patch in question (on bug 1874627) was one that you generated & directly-pushed to work around some bitrot that lando may have been hitting; maybe you can just revert your changes to layout/generic/test/file_bug514732_window.xhtml as a followup commit to that bug or this bug?)

Flags: needinfo?(aryx.bugmail)

Sorry for the unintended reversal. Comment 16 fixed this.

Flags: needinfo?(pascalc)
Flags: needinfo?(aryx.bugmail)

Thanks!

You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: