Closed Bug 1274741 Opened 9 years ago Closed 8 years ago

75% permafail on WinXP test_outerHTML.xhtml,test_picture_mutations.html,test_picture_pref.html,test_pointerPreserves3D.html,test_pointerPreserves3DClip.html,test_resource_timing.html, | application timed out after 330 seconds with no output

Categories

(Core :: DOM: Core & HTML, defect)

49 Branch
defect
Not set
normal

Tracking

()

RESOLVED FIXED
mozilla50
Tracking Status
firefox48 --- unaffected
firefox49 --- fixed
firefox50 --- fixed

People

(Reporter: aryx, Assigned: RyanVM)

References

Details

(Keywords: intermittent-failure)

Summary: Intermittent test_picture_pref.html | application timed out after 330 seconds with no output → Intermittent test_picture_pref.html or test_pointerPreserves3D.html or test_resource_timing.html | application timed out after 330 seconds with no output, nearly-permaorange on Windows XP pgo
Seems like there's a decent chance this is somehow related to bug 1274450.
(In reply to David Baron :dbaron: ⌚️UTC-7 (review requests must explain patch) from comment #2)
> Seems like there's a decent chance this is somehow related to bug 1274450.

Although when both were present, which failure happened on which push seemed to be unrelated, though they did both happen on the same push once.
I realize this might just be a new form of bug 1273758.
Nope, the bisection (see link in comment 5) indicates that the regression was in this window: https://hg.mozilla.org/integration/mozilla-inbound/pushloghtml?fromchange=9b44a0d216be66c4ac1a05d344a838a5056692e9&tochange=00218374a90cfbb6b66a9a1bf8e5483efcb18661 which I'm pretty sure means one of the changesets from bug 1273070.
Blocks: 1273070
Flags: needinfo?(bkelly)
Summary: Intermittent test_picture_pref.html or test_pointerPreserves3D.html or test_resource_timing.html | application timed out after 330 seconds with no output, nearly-permaorange on Windows XP pgo → 75% permafail test_picture_pref.html or test_pointerPreserves3D.html or test_resource_timing.html | application timed out after 330 seconds with no output, nearly-permaorange on Windows XP pgo
Summary: 75% permafail test_picture_pref.html or test_pointerPreserves3D.html or test_resource_timing.html | application timed out after 330 seconds with no output, nearly-permaorange on Windows XP pgo → 75% permafail on WinXP PGO test_picture_pref.html or test_pointerPreserves3D.html or test_resource_timing.html | application timed out after 330 seconds with no output
Summary: 75% permafail on WinXP PGO test_picture_pref.html or test_pointerPreserves3D.html or test_resource_timing.html | application timed out after 330 seconds with no output → 75% permafail on WinXP PGO test_picture_pref.html,test_pointerPreserves3D.html,test_pointerPreserves3DClip.html,test_resource_timing.html | application timed out after 330 seconds with no output
Feel free to back out bug 1273070 to be safe, but I don't think that these can be related: 1) All the test code I added in dom/tests/mochitest/fetch runs in a separate browser instance from dom/tests/mochitest/general. 2) None of the tests in dom/tests/mochitest/general execute any fetch code. I added asserts and ran the tests locally to verify this.
Flags: needinfo?(bkelly)
Did mozilla-build change around this time? Looking at the build for the previous commit:

http://archive.mozilla.org/pub/firefox/tinderbox-builds/mozilla-inbound-win32-pgo/1463866923/mozilla-inbound-win32-pgo-bm91-build1-build177.txt.gz

I see this output in the log:

14:42:59 INFO - Executing: ['c:\\mozilla-build\\python27\\python.exe', 'C:/mozilla-build/tooltool.py', '--authentication-file', 'c:\\builds\\relengapi.tok', '-c', 'c:/builds/tooltool_cache', '--url', 'https://api.pub.build.mozilla.org/tooltool/', '--overwrite', '-m', 'c:\\builds\\moz2_slave\\m-in-w32-pgo-00000000000000000\\build\\src\\browser/config/tooltool-manifests/win32/releng.manifest', 'fetch']
14:43:03 INFO - INFO - rm tree: rustc
14:43:04 INFO - INFO - untarring "rustc.tar.bz2"
14:43:12 INFO - INFO - rm tree: sccache
14:43:13 INFO - INFO - untarring "sccache.tar.bz2"
14:43:13 INFO - INFO - rm tree: vs2015u2
14:43:20 INFO - INFO - unzipping "vs2015u2.zip"
14:43:35 INFO - Return code: 0

On my commit where the failures started I see:

11:40:10 INFO - Executing: ['c:\\mozilla-build\\python27\\python.exe', 'C:/mozilla-build/tooltool.py', '--authentication-file', 'c:\\builds\\relengapi.tok', '-c', 'c:/builds/tooltool_cache', '--url', 'https://api.pub.build.mozilla.org/tooltool/', '--overwrite', '-m', 'c:\\builds\\moz2_slave\\m-in-w32-pgo-00000000000000000\\build\\src\\browser/config/tooltool-manifests/win32/releng.manifest', 'fetch']
11:40:10 INFO - INFO - File mozmake.exe retrieved from local cache c:/builds/tooltool_cache
11:40:10 INFO - INFO - File rustc.tar.bz2 not present in local cache folder c:/builds/tooltool_cache
11:40:10 INFO - INFO - Attempting to fetch from 'https://api.pub.build.mozilla.org/tooltool/'...
11:40:15 INFO - INFO - File rustc.tar.bz2 fetched from https://api.pub.build.mozilla.org/tooltool/ as c:\builds\moz2_slave\m-in-w32-pgo-00000000000000000\build\src\tmpxs8cbr
11:40:19 INFO - INFO - File sccache.tar.bz2 retrieved from local cache c:/builds/tooltool_cache
11:40:36 INFO - INFO - File vs2015u2.zip retrieved from local cache c:/builds/tooltool_cache
11:40:41 INFO - INFO - File integrity verified, renaming tmpxs8cbr to rustc.tar.bz2
11:40:41 INFO - INFO - Updating local cache c:/builds/tooltool_cache...
11:40:41 INFO - INFO - Local cache c:/builds/tooltool_cache updated with rustc.tar.bz2
11:40:41 INFO - INFO - untarring "sccache.tar.bz2"
11:40:55 INFO - INFO - unzipping "vs2015u2.zip"
11:43:22 INFO - INFO - untarring "rustc.tar.bz2"
11:43:53 INFO - Return code: 0

I'm not saying this output is exactly the cause, but might suggest something else out-of-band changed here. Ryan, do you know what is going on with mozilla-build here?
Flags: needinfo?(ryanvm)
The MozillaBuild package I maintain has very little to do with what we do in CI at the moment. Not sure what might have changed in the RelEng world last week, maybe catlee has an idea.
Flags: needinfo?(ryanvm) → needinfo?(catlee)
Pretty sure nothing has changed on XP in ages.
Flags: needinfo?(catlee)
Is this hidden on treeherder or something? Brasstacks shows it dropping back down close to zero.
It's back on both inbound and fx-team. Windows XP opt and pgo M(3) often fail in one of these tests, which are scheduled to run one after another:

test_picture_mutations.html https://treeherder.mozilla.org/logviewer.html#?job_id=28915923&repo=mozilla-inbound
test_performance_timeline.html https://treeherder.mozilla.org/logviewer.html#?job_id=28965078&repo=mozilla-inbound
test_performance_now.html https://treeherder.mozilla.org/logviewer.html#?job_id=28965079&repo=mozilla-inbound
test_outerHTML.xhtml https://treeherder.mozilla.org/logviewer.html#?job_id=28964759&repo=mozilla-inbound

(In reply to David Baron :dbaron: ⌚️UTC-7 (review requests must explain patch) from comment #6)
> I realize this might just be a new form of bug 1273758.

test_paste_selection.html runs before most of these tests (but after outerHTML.xhtml) and uses the clipboard.
Summary: 75% permafail on WinXP PGO test_picture_pref.html,test_pointerPreserves3D.html,test_pointerPreserves3DClip.html,test_resource_timing.html | application timed out after 330 seconds with no output → 75% permafail on WinXP test_outerHTML.xhtml,test_picture_mutations.html,test_picture_pref.html,test_pointerPreserves3D.html,test_pointerPreserves3DClip.html,test_resource_timing.html, | application timed out after 330 seconds with no output
I recently bisected bug 1273070 as the cause for extremely frequent WinXP e10s DOM mochitest timeouts on Ash as well. I won't file a new bug for it since it looks like the same basic problem as this bug. Hits at least half of the time. https://treeherder.mozilla.org/logviewer.html#?job_id=22371957&repo=try#L9098 Ben, the hits on Ash are with regular Windows opt builds. You can also run XP mochitest-e10s-3 on Try now without having to do anything special (try: -b o -p win32 -u mochitest-e10s-3[Windows XP]), in case it helps in debugging without having to run PGO.
Flags: needinfo?(bkelly)
Ryan, can you try just backing out the test changes in P2 from bug 1273070 in a try push? I'd like to try to isolate if this is a problem from adding the tests vs the DOM code changes. I would NI, but you have those turned off. :-)
Flags: needinfo?(bkelly)
Also, doing some try runs with full timestamps would be great. The buffer-and-dump logs hide the timing here, which unfortunately seems relevant.
(In reply to Ben Kelly [:bkelly] from comment #25)
> Also doing some try runs with full timestamps would be great. The
> buffer-and-dump logs hides the timing here which unfortunately seems
> relevant.

I don't think there's a way to do that, unfortunately. I'll run some Try pushes to at least isolate which of the two patches from bug 1273070 was at fault, though.
Flags: needinfo?(ryanvm)
BTW, if it ends up being Part 1 that's at fault, it looks like that's not going to back out cleanly at this point. There's been some significant-looking work that's landed on Fetch.cpp since then. https://hg.mozilla.org/mozilla-central/log/default/dom/fetch/Fetch.cpp
Flags: needinfo?(ryanvm)
Looks like it was indeed the test changes that are causing this (reverting to rev 5733b66fdedf results in no timeouts), at least for WinXP mochitest-e10s-3. I'll try disabling the test on m-c tip next to hopefully verify. This of course still raises the question of why a test from one directory is affecting tests in another one, given that we're supposed to have a clean Firefox instance between each one. I guess service workers leave things running in the background or something? Are we properly shutting everything down at the end of the fetch tests?
Well, I don't want to back it out completely. Can we do something like this to only disable on Windows instead?

  .then(function() {
    // XXX This makes other, unrelated test suites fail. Follow up bug 123.
    let isWin = navigator.platform.indexOf("Win") == 0;
    return isWin ? undefined : nestedWorkerTest();
  })

Because otherwise we have zero test coverage for this particular code.
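A self-contained sketch of the platform gate proposed in the comment above, for readers who want to try the logic outside the test harness. The helper name `skipOnWindows` and the stand-in `nestedWorkerTest` below are hypothetical, and the platform string is passed in as an argument (rather than read from `navigator.platform`) only so that the snippet runs outside a browser; this is not the patch that landed.

```javascript
// Hypothetical helper: wrap a test step so that it is skipped on Windows.
// `platform` is the string that navigator.platform would return in a browser.
function skipOnWindows(platform, testFn) {
  const isWin = platform.indexOf("Win") === 0;
  return function () {
    // Return undefined on Windows, mirroring the snippet in the comment;
    // otherwise run the wrapped step and pass its result through.
    return isWin ? undefined : testFn();
  };
}

// Stand-in for the real nestedWorkerTest(); it only records that it ran.
const ran = [];
function nestedWorkerTest() {
  ran.push("nestedWorkerTest");
  return "done";
}

// Example: the step is skipped for "Win32" and executed for other platforms.
const onWindows = skipOnWindows("Win32", nestedWorkerTest)();
const onLinux = skipOnWindows("Linux x86_64", nestedWorkerTest)();
console.log(onWindows, onLinux, ran.length);
```

In a promise chain like the one in the comment, the wrapped step could be passed directly, for example `.then(skipOnWindows(navigator.platform, nestedWorkerTest))`, which keeps coverage on non-Windows platforms while the follow-up bug is investigated.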
Flags: needinfo?(bkelly) → needinfo?(ryanvm)
Blocks: 1281212
Pushed by ryanvm@gmail.com: https://hg.mozilla.org/integration/mozilla-inbound/rev/0edc88aff987 Skip the Fetch nestedWorkerTest on Windows for causing frequent WinXP timeouts in other DOM mochitests. r=bkelly
Please keep an eye on WinXP PGO M(3) over the next few days and confirm that this is indeed resolved by the push above. Try says it works for M-e10s(3) anyway, but I haven't tried PGO. Landed with r=bkelly per IRL discussion in London last week.
Flags: needinfo?(wkocher)
Flags: needinfo?(cbook)
Flags: needinfo?(aryx.bugmail)
Keywords: leave-open
Will do, thanks for the heads-up.
Flags: needinfo?(cbook)
Thanks, this looks fixed on inbound and central.
Status: NEW → RESOLVED
Closed: 8 years ago
Flags: needinfo?(aryx.bugmail)
Resolution: --- → FIXED
Thanks for the confirmation. I'll get this uplifted to Aurora soonish.
Assignee: nobody → ryanvm
Keywords: leave-open
Target Milestone: --- → mozilla50
Component: DOM → DOM: Core & HTML