Closed Bug 1915025 Opened 11 months ago Closed 10 months ago

Perma Android async-scrolling/position-fixed-async-zoom-1.html == async-scrolling/position-fixed-async-zoom-1-ref.html | image comparison, max difference: 255, number of differing pixels: 15000

Categories

(Core :: Panning and Zooming, defect)

defect

Tracking

()

RESOLVED FIXED
131 Branch
Tracking Status
firefox131 --- fixed

People

(Reporter: intermittent-bug-filer, Assigned: dholbert)

References

Details

(Keywords: intermittent-failure, intermittent-testcase)

Attachments

(2 files)

This started to permafail once this push landed on Autoland. From backfills and retriggers it seems that this goes way back, falling even where it was before green.

Severity: S4 → --
Priority: P5 → --
Component: Layout → Panning and Zooming

Maybe something changed or got deployed/enabled (outside of the autoland/mozilla-central repos) related to Android + nofission test configurations in the past day?

(I recall owlish and whimboo greening up tests with Android+fission recently; not sure if either of them are aware of any such change? CC'ing them both in case they have ideas.)

Though also: this seems like it turned out not to be a perma-fail, right? There are some pushes where it looks perma-faily, but other pushes where we've got some green and some orange, and others (older and newer than the fail-prone ones) that look fully green.

(I guess specifically for the Android "R-nofis-bk" configuration (note the "-bk" suffix), this does currently seem like a perma-fail. On other platforms, it looks more intermittent.)

[EDIT: since I didn't know & now have found out: -bk here means "backfill" tests, i.e. a task from a failing run, backfilled to run on previous builds, or something along those lines And the intermittent-ness is potentially not really intermittent; see comment 8.]

Nothing I worked on should have affected that. However, when I compare screenshots of the failures across different tests, I notice that the test image has a zoom value set to twice that of the reference image. This suggests there might be an issue with async zooming?

This is similar to bug 1913376 where Android 7.0 reftest failures also had bee observed for a combination of reftest manifests. This dynamic scheduling got turned on in bug 1893181 recently.

Is there anything shared between the failing tests in the two bugs?

most likely this fails when run as a standalone manifest vs in a big bunch of manifests. usually this is the result of a previous test changing the state of the browser. In other cases it could be that the browser naturally does something (i.e. garbage collection) before this test runs and now it is testing without a full browser where it depends on a service being initialized fully.

(In reply to Daniel Holbert [:dholbert] from comment #3)

Though also: this seems like it turned out not to be a perma-fail, right? There are some pushes where it looks perma-faily, but other pushes where we've got some green and some orange, and others (older and newer than the fail-prone ones) that look fully green.

Ah, it looks like maybe the green logs are all just logs that ran hardly any tests at all (and not these tests in particular).

e.g. compare these two runs (a33f8b42c8f35579dd1e9f74036641b094bd3f4d where this is green vs. its parent c0f325a58cd9f31aafbe33c36e8d5963cfc50511 where this looks perma-orange, on "Android 7.0 x86-64 Lite WebRender opt R-nofis and R-nofis-ship R6):
https://treeherder.mozilla.org/jobs?repo=autoland&group_state=expanded&resultStatus=pending%2Crunning%2Csuccess%2Ctestfailed%2Cbusted%2Cexception&searchStr=android%2C7.0%2Copt%2Creftests%2Ctest-android-em-7.0-x86_64-lite-qr%2Fopt-geckoview-reftest-nofis%2Cr6&tochange=a33f8b42c8f35579dd1e9f74036641b094bd3f4d&fromchange=c0f325a58cd9f31aafbe33c36e8d5963cfc50511

The green logs only run 47 tests, and they're 4 minutes & 6 minutes long (respectively) -- here's the link and a quote from the first one:
https://firefoxci.taskcluster-artifacts.net/IRK2uhq-SqSpEgm275e4FQ/0/public/logs/live_backing.log

[task 2024-08-26T17:50:46.701Z] 17:50:46     INFO -  REFTEST SUITE-START | Running 47 tests
[task 2024-08-26T17:50:46.701Z] 17:50:46     INFO -  REFTEST TEST-START | gfx/tests/reftest/468496-1.html == gfx/tests/reftest/468496-1-ref.html
[...]
[task 2024-08-26T17:51:16.966Z] 17:51:16     INFO -  REFTEST TEST-END | gfx/tests/reftest/1912431-emoji-globalAlpha.html == gfx/tests/reftest/1912431-emoji-globalAlpha-ref.html
[task 2024-08-26T17:51:16.966Z] 17:51:16     INFO -  REFTEST INFO | Result summary:
[task 2024-08-26T17:51:16.966Z] 17:51:16     INFO -  REFTEST INFO | Successful: 40 (40 pass, 0 load only)
[task 2024-08-26T17:51:16.966Z] 17:51:16     INFO -  REFTEST INFO | Unexpected: 0 (0 unexpected fail, 0 unexpected pass, 0 unexpected asserts, 0 failed load, 0 exception)
[task 2024-08-26T17:51:16.966Z] 17:51:16     INFO -  REFTEST INFO | Known problems: 7 (0 known fail, 0 known asserts, 1 random, 6 skipped, 0 slow)
[task 2024-08-26T17:51:16.966Z] 17:51:16     INFO -  REFTEST SUITE-END | Shutdown

vs. the orange tests look like this and run 3163 tests (including the ones in question here), taking 25-40min:
https://firefoxci.taskcluster-artifacts.net/BDpzYP1bS8KsLAexwIIxOQ/0/public/logs/live_backing.log

[task 2024-08-26T17:10:03.297Z] 17:10:03     INFO -  REFTEST SUITE-START | Running 3163 tests
[task 2024-08-26T17:10:03.298Z] 17:10:03     INFO -  REFTEST TEST-START | layout/reftests/bugs/105-1.html == layout/reftests/bugs/105-1-ref.html
[...]
[task 2024-08-26T17:31:22.723Z] 17:31:22     INFO -  REFTEST TEST-END | layout/reftests/position-dynamic-changes/horizontal/fromauto-leftA-widthA-rightN.html?border_parent == layout/reftests/position-dynamic-changes/horizontal/leftA-widthA-rightN-ref.html?border_parent
[task 2024-08-26T17:31:22.723Z] 17:31:22     INFO -  REFTEST INFO | Result summary:
[task 2024-08-26T17:31:22.723Z] 17:31:22     INFO -  REFTEST INFO | Successful: 2993 (2993 pass, 0 load only)
[task 2024-08-26T17:31:22.723Z] 17:31:22  WARNING -  REFTEST INFO | Unexpected: 6 (6 unexpected fail, 0 unexpected pass, 0 unexpected asserts, 0 failed load, 0 exception)
[task 2024-08-26T17:31:22.723Z] 17:31:22  WARNING -  One or more unittests failed.
[task 2024-08-26T17:31:22.723Z] 17:31:22     INFO -  REFTEST INFO | Known problems: 164 (47 known fail, 0 known asserts, 23 random, 94 skipped, 0 slow)
[task 2024-08-26T17:31:22.723Z] 17:31:22     INFO -  REFTEST SUITE-END | Shutdown

jmaher, is there any reason why we'd be picking such a vastly different set of tests for the same configuration from one run to the next? Note that the only commit that differs between these two runs is https://hg.mozilla.org/integration/autoland/rev/a33f8b42c8f35579dd1e9f74036641b094bd3f4d which didn't add or remove any tests & hence doesn't seem like it should have changed rebucketing?

Flags: needinfo?(jmaher)

Are we doing something tricky like e.g. analyzing the commit itself to figure out which tests should get included in the R-nofis(R6) bucket?)

It seems pretty suspicious that both of Robert Longson's pushes yesterday (for bug 1914345) happened to be the ones that got the broader bucket of R-nofis(R6) and R-nofis-ship(R6) tests, whereas their direct child commits (which touched zero reftests) mysteriously got a much smaller (and hence greener) bucket of R-nofis(R6) tests:
https://treeherder.mozilla.org/jobs?repo=autoland&group_state=expanded&resultStatus=pending%2Crunning%2Csuccess%2Ctestfailed%2Cbusted%2Cexception&searchStr=android%2C7.0%2Copt%2Creftests%2Ctest-android-em-7.0-x86_64-lite-qr%2Fopt-geckoview-reftest-nofis%2Cr6&tochange=abeda35559be4439d219295722d3d364be36850e&fromchange=92acbb3b404d41b2fc9a3f5307603d42b70f31fa

Whatever's responsible for the variability here, we might want to rethink it, because it leads to confusion about whether a test is a perma-failure or not, and whether a patch introduced a test-failure or not (e.g. longsonr's bug 1914345 got repeatedly backed out for supposedly causing this test failure, when really it seems to have been a preexisting test-failure, in a test that only gets run as under some semi-mysterious conditions)

See Also: → 1914345

we determine the tests based on the code, finding the tests most relevant. Every 20th push (or every 2 hours) we run ALL the tests to ensure we are not missing coverage for too long of a gap.

Mochitest + xpcshell + wpt have been running like this for 4.5 years, now reftests can run like this, reducing load on autoland and try (if using ./mach try auto) a bit.

Ideally all tests should be able to run independent of any other test and in any order, we know that isn't always the truth. I am on PTO this week, otherwise I would do a series of try pushes to narrow down which test is running before in order to make this pass.

Flags: needinfo?(jmaher)

Thanks for the reply when on PTO! That info helps.

otherwise I would do a series of try pushes to narrow down which test is running before in order to make this pass.

I might poke at doing that myself a bit. It's not yet clear to me that this test is passing at all in any recent test run. In the green test runs that I've looked at so far, the test is simply not-running-at-all. (Maybe it is passing in other test runs on other pushes, but I'm not sure how/where to look to find those test runs where it might hypothetically be getting run + passing.)

Update: the test is indeed usually passing, and it's only failing when it gets run in coordination with just-the-right set of prior reftest directories.

In this case the relevant prior-reftest directory is layout/reftests/meta-viewport. So this issue is reproducible if you run e.g.

./mach try fuzzy layout/reftests/meta-viewport/ layout/reftests/async-scrolling -q "android lite reftest"

I just tested a patch which reduced the reftest.list files drastically, and it still reproduced the issue:
https://hg.mozilla.org/try/rev/c4205fe6d189c75565a90af87944043a999775eb

Here's the try run showing the issue reproducing with that patch:
https://treeherder.mozilla.org/jobs?repo=try&revision=8efc2e09141d04a12494b1a2152458431f3d581a

I reduced the meta-viewport reftest.list file to just this defaults line plus a single test:

 defaults pref(dom.meta-viewport.enabled,true) pref(apz.allow_zooming,true)

So I suspect those default prefs are involved here -- they're perhaps not getting cleared properly when we reach the end of that directory, or something along those lines?

I'm pretty sure this is the culprit:
https://searchfox.org/mozilla-central/rev/45d6f8bf028e049f812aa26dced565d50068af5d/layout/reftests/meta-viewport/desktop-mode.html#24-26

let win = SpecialPowers.wrap(window);
SpecialPowers.spawnChrome([win.browsingContext.id], id => {
  BrowsingContext.get(id).forceDesktopViewport = true;

This forceDesktopViewport setting persists and potentially impacts the tests that follow this one.

I'm working on adding a new dummy reftest to run right after this one, which simply restores that setting to its default value. I'm pretty sure that'll fix this.

Assignee: nobody → dholbert
Status: NEW → ASSIGNED

This patch doesn't change the test's behavior at all; it's just syntactic sugar.

This makes things cleaner in the next patch in the series (allowing us to
ergonomically return a value from the chrome task to the test logic).

The reftest desktop-mode.html modifies forceDesktopViewport, and it needs
that modification to still be in-effect when its reftest snapshot is taken.
However, we also need to undo that modification, to avoid having an impact on
subsequent reftests. This patch adds a "dummy" reftest to do that cleanup
work, which will now be run immediately after desktop-mode.html.

This patch also makes both reftests validate their expectations about what
value the forceDesktopViewport setting should have when the test gets run.
If forceDesktopViewport doesn't have the expected value, then the tests
change their background to red, to trigger a test-failure.

Try run without patch stack (expected to be orange):
https://treeherder.mozilla.org/jobs?repo=try&revision=17d7be9ade018c5cfdfb035368b66632ff081ef2

Try run with patch stack (expected to be green):
https://treeherder.mozilla.org/jobs?repo=try&revision=cb2a909fae877033ed0b0812f5288b59962a27da

(Both try runs are running reftests in layout/reftests/meta-viewport and layout/reftests/async-scrolling.)

Here's a try run including test-verify tasks:
[EDIT: updated link to include a try run that actually has my patches, so that test-verify hopefully actually runs the tests]
https://treeherder.mozilla.org/jobs?repo=try&revision=ee0d5d9e28d98e17443b39fa2d8c69c80a97ca21

I anticipate that test-verify will fail for desktop-mode-cleanup.html since that test has built-in strong expectation that desktop-mode.html will have run prior to it and made the modification that the -cleanup.html test is there to undo. (And on test-verify, that expectation doesn't hold up, because the test gets run in isolation.)

I don't have strong feelings whether to keep that expectation in the test; it's not mandatory, but it's kinda nice for strictness (in case the tests get reshuffled apart somehow in the future), but it's unfortunate that it makes the test fail in test-verify mode. Though I'm also not super-concerned about test-verify in this case, since this test will hopefully never need to be modified (and hence won't get TV tasks ever/often once it's added); and I've added a comment in the test itself mentioning that it's known and sort-of-intentional that the test will fail when run on its own. So hopefully that avoids confusion/trouble in a future-world where this test gets occasional TV tasks scheduled which end up failing.

I guess the preexisting test desktop-mode.html itself will also now fail verification, specifically for the "run test 10 times in a row" etc. The new expectation that I added to that test (that forceDesktopViewport is initially off) won't hold up if the test is run in a loop (without its -cleanup.html buddy). As above, we could relax this expectation in the test, but it's kind of nice as validation that the test pair as-a-whole are sound (i.e. that the value that we're setting in the cleanup test is in fact the original/default value).

In a perfect world, we wouldn't need the separate -cleanup.html test file, and we'd be able to make this reftest standalone and atomic by scheduling the "cleanup" work (restoring forceDesktopViewport) to happen after the reftest snapshot. But I'm not sure we can arrange for that to happen; once the reftest snapshot has been taken, the test harness navigates away from the test and I don't think there's a way for a test to reliably schedule JS to run at that point.

Observations from try run in comment 17:

  • desktop-mode-cleanup.html does indeed fail in test-verify mode for reasons described in comment 17.
  • desktop-mode.html fails in test-verify mode in the "Run each test 10 times in one browser" section (and its "in chaos mode" variant), for reasons described in comment 18.

Seeing the actual test failures in the Try run, I'm feeling like that's probably not-great (particularly for the "real" reftest desktop-mode.html), so I'm inclined to remove the precondition so that the tests pass verification.

One other observation: both of these tests apparently get skipped in most test-verify tasks -- they're only run in test-verify-gpu verification (which is linux-only). On other test-verify tasks, the log shows them being skipped/discarded like so:

INFO - Per-test run (non-gpu) discarded gpu test layout\reftests\meta-viewport\desktop-mode-cleanup.html (gpu)
INFO - Per-test run (non-gpu) discarded gpu test layout\reftests\meta-viewport\desktop-mode.html (gpu)
WARNING - No tests to verify.
Pushed by dholbert@mozilla.com: https://hg.mozilla.org/integration/autoland/rev/a8224d80f9d1 part 1: Rewrite reftest desktop-mode.html to use async/await. r=hiro https://hg.mozilla.org/integration/autoland/rev/ad62a16baf7a part 2: Add reftest 'desktop-mode-cleanup.html' to restore a setting that gets modified in reftest 'desktop-mode.html'. r=hiro
Status: ASSIGNED → RESOLVED
Closed: 10 months ago
Resolution: --- → FIXED
Target Milestone: --- → 131 Branch
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: