Open Bug 1857246 Opened 1 year ago Updated 1 year ago

Intermittent layout/generic/test/test_scroll_position_restore.html failures on new Ubuntu 22.04 Wayland debug pool

Categories

(Core :: Layout, defect)

defect

Tracking

()

People

(Reporter: ahal, Unassigned)

References

(Depends on 1 open bug)

Details

Attachments

(1 file)

This is happening on a new Ubuntu 22.04 Wayland enabled pool we're in the process of standing up.

Example failure log:
https://treeherder.mozilla.org/logviewer?job_id=431356063&repo=try&lineNumber=6870

[task 2023-10-04T20:59:03.544Z] 20:59:03     INFO - TEST-START | layout/generic/test/test_scroll_position_restore.html
[task 2023-10-04T20:59:58.467Z] 20:59:58     INFO - GECKO(6317) | [Child 6500, Main Thread] WARNING: flushing shaped-word cache: file /builds/worker/checkouts/gecko/gfx/thebes/gfxFont.cpp:3194
[task 2023-10-04T21:02:03.140Z] 21:02:03     INFO - GECKO(6317) | [Child 6500, Main Thread] WARNING: flushing shaped-word cache: file /builds/worker/checkouts/gecko/gfx/thebes/gfxFont.cpp:3194
[task 2023-10-04T21:03:41.515Z] 21:03:41     INFO - GECKO(6317) | [Child 6500, Main Thread] WARNING: flushing shaped-word cache: file /builds/worker/checkouts/gecko/gfx/thebes/gfxFont.cpp:3194
[task 2023-10-04T21:04:14.727Z] 21:04:14     INFO - TEST-INFO | started process screentopng
[task 2023-10-04T21:04:18.830Z] 21:04:18     INFO - TEST-INFO | screentopng: exit 0
[task 2023-10-04T21:04:18.830Z] 21:04:18     INFO - TEST-UNEXPECTED-FAIL | layout/generic/test/test_scroll_position_restore.html | Test timed out. - 
[task 2023-10-04T21:04:18.830Z] 21:04:18     INFO - TEST-PASS | layout/generic/test/test_scroll_position_restore.html | undefined assertion name 
[task 2023-10-04T21:04:18.830Z] 21:04:18     INFO - GECKO(6317) | MEMORY STAT | vsize 2682MB | residentFast 373MB | heapAllocated 202MB
[task 2023-10-04T21:04:18.832Z] 21:04:18    ERROR - TEST-UNEXPECTED-FAIL | SimpleTest | this test already called finish!
[task 2023-10-04T21:04:18.833Z] 21:04:18     INFO - TEST-UNEXPECTED-ERROR | layout/generic/test/test_scroll_position_restore.html | called finish() multiple times
[task 2023-10-04T21:04:18.834Z] 21:04:18     INFO - TEST-INFO took 312969ms

I believe this is just platform that pushes this test to the edge of the timeout window.

This test does take a little while (in part because it loads an extremely-tall
2.6MB HTML helper-file). We're thinking that this is making it overshoot the
default test-timeout value on the new Ubuntu 22.04 Wayland test pool, so let's
just use a longer timeout threshold and see if that's enough to make it
reliably pass.

Assignee: nobody → dholbert
Status: NEW → ASSIGNED

Hmm, it looks like the task itself is close to overstepping the max run time for the task-as-a-whole, and reenabling this test can make that happen as in this orange log from Try:
https://firefoxci.taskcluster-artifacts.net/SAoPT-M1TMe_R18YYfR2jg/0/public/logs/live_backing.log

So we need to further subdivide into additional "buckets" of tests before we can reenable this one, I think.

ahal, do you know how to do that?

Severity: -- → S3
Flags: needinfo?(ahal)

Alternately, maybe we could increase the timeout threshold for the task as a whole?

One thing I'm noticing when looking at that log from comment 3: the tests don't actually get a chance to start running for 13min, so that time is occupied by various testrunner-environment-setup tasks (from timestamp 7:35 - 7:48):

[taskcluster 2023-10-14T07:35:37.397Z] Worker Type (gecko-t/t-linux-vm-2204-wayland) settings:
[...]
[task 2023-10-14T07:48:24.941Z] 07:48:24     INFO - SUITE-START | Running 664 tests
[task 2023-10-14T07:48:24.942Z] 07:48:24     INFO - TEST-START | dom/base/test/test_bug444546.html
[...]

And then the task hits its max run time at timestamp 8:34, about an hour after starting:

[taskcluster:error] Aborting task...
[taskcluster:warn 2023-10-14T08:34:09.055Z] no such process
[taskcluster 2023-10-14T08:34:09.116Z] === Task Finished ===
[taskcluster 2023-10-14T08:34:09.116Z] Task Duration: 58m30.094373522s
[...]
[taskcluster:error] Task aborted - max run time exceeded

i.e the first quarter of the max run time is just environment-setup-overhead.

Maybe we could/should just increase the task "max run time" for this task? (instead of splitting it up)

Hmm, that 13min env-setup-time might have been an anomaly; in the log from comment 0, SUITE-START is only 4min after the start of the log, which is much nicer. Might be something to look into there.)

(In reply to Daniel Holbert [:dholbert] from comment #3)

Hmm, it looks like the task itself is close to overstepping the max run time for the task-as-a-whole,

Another issue: it looks like the requestLongerTimeout(2) in the attached patch isn't actually sufficient, too. The test itself still times out occasionally (after ~10 minutes), as i noticed after ~30 retriggers of this task on Try with the attached patch:
https://treeherder.mozilla.org/logviewer?job_id=432915850&repo=try&lineNumber=12358
https://treeherder.mozilla.org/logviewer?job_id=433268911&repo=try&lineNumber=12538

At this point it's probably best to leave the test disabled on this platform/configuration, if it really takes that long. Though it's also a bit interesting/concerning that it's this timeout-prone on this particular configuration but not on others.

(I guess we do have bug 1786425 filed on this test intermittently timing out on debug builds very occasionally, but this new platform seems to be timing out with a higher frequency (even with 2x the timeout threshold) as compared to the frequency of intermittent timeouts in bug 1786425.)

Assignee: dholbert → nobody
Status: ASSIGNED → NEW
See Also: → 1786425

Sorry for the late reply.

The 16 chunks is coming from here:
https://searchfox.org/mozilla-central/rev/e4afef5d3ff67781dc1377912344694f3cf3a226/taskcluster/ci/test/mochitest.yml#141

We could add another line below that like:

linux.*-wayland/debug: 20

But yes, I agree it's strange that doubling the timeout still isn't always enough. Definitely feels like something else is going on.

Flags: needinfo?(ahal)
Depends on: 1860872

I think my patch in bug 1860872 will fix this more properly, by making the test reliably complete faster.

You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: