Open Bug 1979297 Opened 2 months ago Updated 1 month ago

Several Windows opt failures after the latest wpt-sync

Categories

(Testing :: web-platform-tests, defect)

defect

Tracking

(Not tracked)

REOPENED

People

(Reporter: SerbanS, Assigned: Sasha)


Attachments

(1 file)

Hi Alexandra! Could you please take a look at these wpt failures? They seem to be fallout from the latest wpt-sync. They fail on every run, though with several different failure lines, as can be seen here. This is happening only on Autoland. A separate instance is also happening on main, but on Windows 11 24H2 Shippable, as can be seen here.

Thank you!

Flags: needinfo?(aborovova)

I can see that the /event-timing/click.html failures come from the wptsync landing, but I'm not sure about everything else. I've created a patch for /event-timing/click.html.

Flags: needinfo?(aborovova)
See Also: → 1957864
Status: NEW → RESOLVED
Closed: 2 months ago
Resolution: --- → FIXED
Target Milestone: --- → 143 Branch
Assignee: nobody → aborovova

The failures in /event-timing/click.html do seem to have been resolved, but the other failures are still present. After some more digging, Bug 1937025 appears to be what started this, as can be seen in these ranges of retriggers and backfills: r1, r2, r3.

Hi Bob! Could you please take a look at this?

Thank you!

Status: RESOLVED → REOPENED
Resolution: FIXED → ---
Target Milestone: 143 Branch → ---
Flags: needinfo?(bobowencode)

I've run those jobs 5 times each on my final try push and didn't get any failures, so I'm a bit confused as to why these changes could have caused those failures.
https://treeherder.mozilla.org/jobs?repo=try&duplicate_jobs=visible&revision=a7b3fcfa81b63b10747c27fdb7e77b8a08371c66

Flags: needinfo?(bobowencode)

Pushed central, and central with the main patch from bug 1937025 backed out, to try:
https://treeherder.mozilla.org/jobs?repo=try&revision=bbe636c27daa704736b76524dd0c809d4e86e89f
https://treeherder.mozilla.org/jobs?repo=try&revision=c554116ff2854fd88e3c7b1f7c4b2e5b2b20d213

There do appear to be fewer failures with the back-out.
One of the issues is that we don't seem to get crash dumps a lot of the time.
When we do, the failures seem to be bug 1979494 and bug 1977225.
Crashing here for 1979494 because oldCurrentEntry is null.
Crashing here for 1977225.

Looking at the dumps, the root cause in both cases seems to be GetCurrentEntry returning null because mCurrentEntryIndex is not set.
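To illustrate the failure mode described in the dumps, here is a minimal C++ sketch. Only the names GetCurrentEntry and mCurrentEntryIndex come from this comment; the classes, the std::optional index and the bounds check are hypothetical stand-ins, not Firefox's Navigation API code. It shows how an index that was never set makes the accessor return null, which any caller that assumes a current entry exists would then dereference.

```cpp
#include <cstddef>
#include <optional>
#include <vector>

// Hypothetical stand-in for a session-history entry (not a Gecko type).
struct Entry {
  int id;
};

// Hypothetical model of a navigation object whose current index may be unset.
class NavigationModel {
 public:
  void AddEntry(Entry aEntry) { mEntries.push_back(aEntry); }

  void SetCurrentIndex(std::size_t aIndex) { mCurrentEntryIndex = aIndex; }

  // Returns nullptr when no current index has been recorded,
  // or when the recorded index is out of range.
  const Entry* GetCurrentEntry() const {
    if (!mCurrentEntryIndex || *mCurrentEntryIndex >= mEntries.size()) {
      return nullptr;
    }
    return &mEntries[*mCurrentEntryIndex];
  }

 private:
  std::vector<Entry> mEntries;
  // Assumption for illustration: stays empty until something sets it.
  std::optional<std::size_t> mCurrentEntryIndex;
};
```

With entries present but no index set, GetCurrentEntry() still returns nullptr, so a caller in the position of oldCurrentEntry would crash. The sketch only shows the symptom; it does not suggest that a null check is the right fix.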

It's hard to see why the chromium sandbox update would actually cause these failures, so perhaps it is some sort of timing issue.
smaug - what do you think?
Can these be fixed by null checks or is there a deeper issue?

Flags: needinfo?(smaug)

Given that there didn't appear to be an issue on my final try push, I've bisected on try to find what else it is interacting with to cause the failures.
It ends up on Bug 1977364 - Remove support for ducktyped errors.
Try push with my changes on top of first patch for that bug:
https://treeherder.mozilla.org/jobs?repo=try&revision=1523186cd8154e066d337bcca2adc919c8a756f4
Try push on top of second patch:
https://treeherder.mozilla.org/jobs?repo=try&revision=463208c564e47e1102ac8d629a3c98c08d1bd873

So, I've pushed two more try runs with current central and the pref set to true and both patches backed out:
https://treeherder.mozilla.org/jobs?repo=try&revision=a101b1c8f4720fd2b015ec670ea01753ac39904b
https://treeherder.mozilla.org/jobs?repo=try&revision=1774847716012e2757588acbede05b26670348f9

Looks like both of these resolve the crashes.
tschuster: any idea how these might interact?
In theory there shouldn't be any sandboxing changes here, but there could of course be timing changes.

The chromium update also produced some memory improvements: bug 1937025 comment 42.

Flags: needinfo?(tschuster)

FWIW, this seems to be resolved now. The last push where these failures appeared was this one. The pushes after that were clear of these failures.

(In reply to Serban Stanca [:SerbanS] from comment #12)

> FWIW, this seems to be resolved now. The last push where these failures appeared was this one. What came after that were clear from these failures.

Seems it's still failing on some builds:
https://treeherder.mozilla.org/jobs?repo=autoland&revision=ba74f4e0d9bfcdbb4c2e3ec2267632a65dd8a2d5
https://treeherder.mozilla.org/jobs?repo=autoland&revision=d970c2612f53e692c05a34890ea5896f30ed5593

There are clean ones on either side of these.

Sorry, I have no idea why this would crash. Is it possible that we changed the reported error message somewhere and the code can't deal with it? Is this even a "real" crash?

INFO - NoSuchWindowException on command, setting status to CRASH
INFO - TEST-UNEXPECTED-CRASH | /html/document-isolation-policy/credentialless-shared-worker.https.tentative.window.html?request_origin=same_origin&worker_dip=none&window_dip=none | expected OK
INFO - TEST-INFO took 805ms
INFO - PID 8384 | JavaScript error: chrome://remote/content/shared/webdriver/Certificates.sys.mjs, line 79: NS_ERROR_NOT_AVAILABLE: Component returned failure code: 0x80040111 (NS_ERROR_NOT_AVAILABLE) [nsICertOverrideService.setDisableAllSecurityChecksAndLetAttackersInterceptMyData]
INFO - Browser exited with return code 572
Flags: needinfo?(tschuster)

So it does still happen. I don't really know how to proceed here.

farre, see comment 8.
Need to get some Navigation API crashes fixed ;)

Flags: needinfo?(smaug) → needinfo?(afarre)

(In reply to Olli Pettay [:smaug][bugs@pettay.fi] from comment #17)

> farre, see comment 8.
> Need to get some Navigation API crashes fixed ;)

Yeah, looking at it now. It's not a null check that's needed, it's a deeper issue.

Flags: needinfo?(afarre)

I had another look at this. The following failures in Bob's Try push seem interesting:

TEST-UNEXPECTED-FAIL | /workers/interfaces/WorkerUtils/importScripts/report-error-cross-origin.sub.any.worker.html | WorkerGlobalScope error event: lineno - assert_equals: expected 8 but got 0
TEST-UNEXPECTED-FAIL | /workers/interfaces/WorkerUtils/importScripts/report-error-cross-origin.sub.any.sharedworker.html | WorkerGlobalScope error event: filename - assert_equals: expected "http://web-platform.test:8000/workers/interfaces/WorkerUtils/importScripts/report-error-helper.js" but got "" 

These exact symptoms occur when JS::ErrorReportBuilder::maybeCreateReportFromDOMException returns nullptr. Of course, it's not clear why this would happen non-deterministically. The ini file for this test does show that, even before my changes, it seemed to be highly unreliable.

Flags: needinfo?(tschuster)
