Open Bug 1544147 Opened 1 year ago Updated 6 days ago

Intermittent dom/ipc/tests/test_process_error.xhtml,test_process_error.xul | Test timed out.

Categories

(Core :: DOM: Content Processes, defect, P5)

defect

Tracking

()

ASSIGNED

People

(Reporter: intermittent-bug-filer, Assigned: gsvelto)

References

(Depends on 1 open bug, Regression)

Details

(Keywords: intermittent-failure, leave-open, regression, Whiteboard: [stockwell disabled])

Attachments

(2 files)

#[markdown(off)]
Filed by: ncsoregi [at] mozilla.com

https://treeherder.mozilla.org/logviewer.html#?job_id=240046060&repo=autoland

https://queue.taskcluster.net/v1/task/bgIPX1ohTwi50endNqMYiQ/runs/0/artifacts/public/logs/live_backing.log

[task 2019-04-12T22:19:07.164Z] 22:19:07 INFO - Buffered messages logged at 22:13:39
[task 2019-04-12T22:19:07.165Z] 22:19:07 INFO - TEST-PASS | dom/ipc/tests/test_process_error.xul | Expected the right browsing context id on the oop-browser-crashed event.
[task 2019-04-12T22:19:07.165Z] 22:19:07 INFO - TEST-PASS | dom/ipc/tests/test_process_error.xul | Received correct observer topic.
[task 2019-04-12T22:19:07.166Z] 22:19:07 INFO - TEST-PASS | dom/ipc/tests/test_process_error.xul | Subject implements nsIPropertyBag2.
[task 2019-04-12T22:19:07.167Z] 22:19:07 INFO - Buffered messages finished
[task 2019-04-12T22:19:07.168Z] 22:19:07 INFO - TEST-UNEXPECTED-FAIL | dom/ipc/tests/test_process_error.xul | Test timed out.
[task 2019-04-12T22:19:07.168Z] 22:19:07 INFO - SimpleTest.ok@chrome://mochikit/content/tests/SimpleTest/SimpleTest.js:275:18
[task 2019-04-12T22:19:07.169Z] 22:19:07 INFO - reportError@chrome://mochikit/content/tests/SimpleTest/TestRunner.js:121:22
[task 2019-04-12T22:19:07.170Z] 22:19:07 INFO - TestRunner._checkForHangs@chrome://mochikit/content/tests/SimpleTest/TestRunner.js:142:7
[task 2019-04-12T22:19:07.171Z] 22:19:07 INFO - setTimeout handler*TestRunner._checkForHangs@chrome://mochikit/content/tests/SimpleTest/TestRunner.js:163:5
[task 2019-04-12T22:19:07.171Z] 22:19:07 INFO - setTimeout handler*TestRunner._checkForHangs@chrome://mochikit/content/tests/SimpleTest/TestRunner.js:163:5
[task 2019-04-12T22:19:07.172Z] 22:19:07 INFO - setTimeout handler*TestRunner._checkForHangs@chrome://mochikit/content/tests/SimpleTest/TestRunner.js:163:5
[task 2019-04-12T22:19:07.172Z] 22:19:07 INFO - setTimeout handler*TestRunner._checkForHangs@chrome://mochikit/content/tests/SimpleTest/TestRunner.js:163:5
[task 2019-04-12T22:19:07.172Z] 22:19:07 INFO - setTimeout handler*TestRunner._checkForHangs@chrome://mochikit/content/tests/SimpleTest/TestRunner.js:163:5
[task 2019-04-12T22:19:07.173Z] 22:19:07 INFO - setTimeout handler*TestRunner._checkForHangs@chrome://mochikit/content/tests/SimpleTest/TestRunner.js:163:5
[task 2019-04-12T22:19:07.173Z] 22:19:07 INFO - setTimeout handler*TestRunner._checkForHangs@chrome://mochikit/content/tests/SimpleTest/TestRunner.js:163:5
[task 2019-04-12T22:19:07.173Z] 22:19:07 INFO - setTimeout handler*TestRunner._checkForHangs@chrome://mochikit/content/tests/SimpleTest/TestRunner.js:163:5
[task 2019-04-12T22:19:07.173Z] 22:19:07 INFO - setTimeout handler*TestRunner._checkForHangs@chrome://mochikit/content/tests/SimpleTest/TestRunner.js:163:5
[task 2019-04-12T22:19:07.173Z] 22:19:07 INFO - setTimeout handler*TestRunner._checkForHangs@chrome://mochikit/content/tests/SimpleTest/TestRunner.js:163:5
[task 2019-04-12T22:19:07.173Z] 22:19:07 INFO - setTimeout handler*TestRunner._checkForHangs@chrome://mochikit/content/tests/SimpleTest/TestRunner.js:163:5
[task 2019-04-12T22:19:07.173Z] 22:19:07 INFO - TestRunner.runTests/<@chrome://mochikit/content/tests/SimpleTest/TestRunner.js:381:20
[task 2019-04-12T22:19:07.175Z] 22:19:07 INFO - promise callback*TestRunner.runTests@chrome://mochikit/content/tests/SimpleTest/TestRunner.js:368:50
[task 2019-04-12T22:19:07.176Z] 22:19:07 INFO - RunSet.runtests@chrome://mochikit/content/tests/SimpleTest/setup.js:201:14
[task 2019-04-12T22:19:07.176Z] 22:19:07 INFO - RunSet.runall@chrome://mochikit/content/tests/SimpleTest/setup.js:180:12
[task 2019-04-12T22:19:07.177Z] 22:19:07 INFO - hookupTests@chrome://mochikit/content/tests/SimpleTest/setup.js:273:12
[task 2019-04-12T22:19:07.177Z] 22:19:07 INFO - parseTestManifest@chrome://mochikit/content/manifestLibrary.js:38:5
[task 2019-04-12T22:19:07.178Z] 22:19:07 INFO - getTestManifest/req.onload@chrome://mochikit/content/manifestLibrary.js:49:11
[task 2019-04-12T22:19:07.179Z] 22:19:07 INFO - EventHandlerNonNull*getTestManifest@chrome://mochikit/content/manifestLibrary.js:45:3
[task 2019-04-12T22:19:07.180Z] 22:19:07 INFO - hookup@chrome://mochikit/content/tests/SimpleTest/setup.js:253:5
[task 2019-04-12T22:19:07.180Z] 22:19:07 INFO - linkAndHookup@chrome://mochikit/content/harness.xul:54:3
[task 2019-04-12T22:19:07.180Z] 22:19:07 INFO - parseTestManifest@chrome://mochikit/content/manifestLibrary.js:38:5
[task 2019-04-12T22:19:07.181Z] 22:19:07 INFO - getTestManifest/req.onload@chrome://mochikit/content/manifestLibrary.js:49:11
[task 2019-04-12T22:19:07.181Z] 22:19:07 INFO - EventHandlerNonNull*getTestManifest@chrome://mochikit/content/manifestLibrary.js:45:3
[task 2019-04-12T22:19:07.182Z] 22:19:07 INFO - getTestList@chrome://mochikit/content/chrome-harness.js:251:3
[task 2019-04-12T22:19:07.182Z] 22:19:07 INFO - loadTests@chrome://mochikit/content/harness.xul:33:3
[task 2019-04-12T22:19:07.183Z] 22:19:07 INFO - EventListener.handleEvent*@chrome://mochikit/content/harness.xul:57:12
[task 2019-04-12T22:19:07.872Z] 22:19:07 INFO - GECKO(2485) | MEMORY STAT | vsize 2919MB | residentFast 341MB | heapAllocated 88MB
[task 2019-04-12T22:19:07.890Z] 22:19:07 INFO - TEST-UNEXPECTED-ERROR | dom/ipc/tests/test_process_error.xul | expected-crash-dump-missing - This test did not leave any crash dumps behind, but we were expecting some!
[task 2019-04-12T22:19:07.892Z] 22:19:07 INFO - TEST-UNEXPECTED-ERROR | dom/ipc/tests/test_process_error.xul | unexpected-crash-dump-found - This test left crash dumps behind, but we weren't expecting it to!
[task 2019-04-12T22:19:07.893Z] 22:19:07 INFO - Found unexpected crash dump file /tmp/tmpRa3SxC.mozrunner/minidumps/52c54ecf-93c1-45b2-c0d4-674efdf6416c.dmp.
[task 2019-04-12T22:19:07.895Z] 22:19:07 INFO - Found unexpected crash dump file /tmp/tmpRa3SxC.mozrunner/minidumps/52c54ecf-93c1-45b2-c0d4-674efdf6416c.extra.
[task 2019-04-12T22:19:07.900Z] 22:19:07 INFO - TEST-UNEXPECTED-CRASH | dom/ipc/tests/test_process_error.xul | Finished in 329739ms

Hi Geoff, this is a tier1 failure with 56 total failures in the last 7 days and 110 total failures in the last 30.

Recent spikes are from the above mentioned bug, do we backout or disable?

Flags: needinfo?(gbrown)
Attachment #9094196 - Attachment description: Bug 1544147 - disable est_process_error.xul for frequent failures on linux64-qr. r=gbrown → Bug 1544147 - disable test_process_error.xul for frequent failures on linux64-qr. r=gbrown

(In reply to Geoff Brown [:gbrown] from comment #25)

I don't think 1538042 is responsible for the regression. For instance, there were significant failures in

I agree that Bug 1538042 is very unlikely to cause a regression, since that bug landed only a unit test and mochitest. There shouldn't be any behavior change.

Flags: needinfo?(bwerth)
Keywords: leave-open
Whiteboard: [stockwell needswork:owner] → [stockwell disabled]
Pushed by apavel@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/07eb2cc7e1c3
disable test_process_error.xul for frequent failures on linux64-qr. r=gbrown
Summary: Intermittent dom/ipc/tests/test_process_error.xul | Test timed out. → Intermittent dom/ipc/tests/test_process_error.xhtml,test_process_error.xul | Test timed out.

I did a quick check of this test with the patch for bug 1420363 applied and it looks good:

https://treeherder.mozilla.org/#/jobs?repo=try&revision=ee2bdcf58b952db3de5147a921b3f05c115d1ca0

Maybe we'll be able to re-enable it after that lands.

It seems like bug 1420363 did not fix the issue, nor did the fix for bug 1280561 and the one for bug 1598258. So there's yet another race-condition lurking in this code. Ugh.

Looking at the log it seems that we're sending the ipc:content-shutdown notification at a time when mCrashReporter is not yet set so the dumpID field in the property bag ends up being null. Maybe there's a simple way to fix that...

I've got a WIP patch to fix this (and other similar intermittent failures). It's going through try right now.

Yay, it's green. Time for review.

Assignee: nobody → gsvelto
Status: NEW → ASSIGNED

I was too optimistic, apparently the affected test was moved to a different chunk so I'll have to re-run it.

When a content or plug-in process crashes too early we haven't initialized the
CrashReporterHost for that process. This will cause the crash to be orphaned,
i.e. to miss most of its crash annotations. We added code to finalize those
crashes in bug 1282776 so that we wouldn't miss them entirely. This ensured
that crash reports would have both their .dmp and .extra files but the patch
failed to modify the code that notified various listeners about the crash
report's presence.

This changes always send the crash ID alongside the crash notifications, even
for orphaned crashes, so that listeners such as the content crash handler or
the test harnesses can always find the minidump and .extra file.

This also re-enables dom/ipc/tests/process_error.xul which failed frequently
because of this bug.

After some more fiddling with the code I managed to make the orphaned minidump go through the test assertions correctly but then the test fails because it times out.

This is caused by the harness itself. When we have an orphaned minidump not only we don't have an .extra file but we never end up informing the CrashManager of the crash. The test harness will always wait for crashes to have been recorded in the CrashManager before analyzing them and in this case this will never happen. Now I need a way to fix this in a way that's robust enough to work with all the test harnesses.

Alright, I've got a working patch that makes sure that orphaned minidumps are recorded by the CrashManager and counted in our telemetry accumulators. That's cool because it means that this patch will fix a lot more than just a few intermittent failures. What's not cool is that while running the test over and over I ran into other races that cause the test harness to fail (but not the test).

See this log for example. The test is executing fine for a hundred iterations then something odd happens:

[task 2019-12-19T13:01:34.988Z] Failed to retrieve MOZ_UPLOAD_DIR env var

That's an environment variable that the test harness sets up to run the tests; the fact that it's undefined at that point is very odd. Either there's a race in the harness so we try to read it before it's set or something somewhere is unsetting it when it's still needed.

But that's not all! Looking at the same log you'll find multiple entries of this line:

[task 2019-12-19T12:54:50.919Z] GECKO(1089) | Crash cleaned up

The problem is that there's not enough of them! There should be one per iteration but there's only a handful instead. Since that's coming from the test cleanup we can only assume that we're not cleaning up correctly most of the time because of another race, ouch.

And then there's this:

[task 2019-12-19T12:56:30.676Z] GECKO(1089) | No .extra file for dumpID: 7437404b-3812-f130-c189-bb1ed029917a

Double-ouch, something is deleting the .extra file before we're done with it. This might be related to the previous problem.

Anyhow, I'll send the fix for this up for review and then file bugs for these issues too.

Attachment #9116715 - Attachment description: Bug 1544147 - Always include a minidump id in crash notifications → Bug 1544147 - Ensure orphaned crashes are properly notified to the rest of the system
Pushed by gsvelto@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/155706ccc7e4
Ensure orphaned crashes are properly notified to the rest of the system r=froydnj
Regressions: 1608932

Once bug 1605100 is fixed this should go away.

Depends on: 1605100
You need to log in before you can comment on or make changes to this bug.