Closed Bug 1322138 Opened 8 years ago Closed 7 years ago

Intermittent test_crash.py TestCrash.test_crash_chrome_process | AssertionError: "Process crashed" does not match "Process killed because the connection to Marionette server is lost. Check gecko.log for errors (Reason: Connection timed out after 10s)"

Categories

(Testing :: Marionette Client and Harness, defect)

Version: 3
Hardware: All
OS: macOS
Type: defect
Priority: Not set
Severity: normal

Tracking

(Not tracked)

RESOLVED DUPLICATE of bug 1376773

People

(Reporter: intermittent-bug-filer, Unassigned)

References

Details

(Keywords: intermittent-failure, Whiteboard: [stockwell unknown])

The test itself sets the socket timeout to 10s because the crashing code should crash Firefox immediately. It looks like the crash took longer for this PGO build, so Marionette killed the process because the socket connection was lost.
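To make the race described above concrete, here is a minimal sketch in plain Python. It does not use the Marionette client at all: `observed_error` and `matches_expectation` are hypothetical helpers, and the only things taken from this bug are the two error messages and the 10s timeout. The model assumes the harness reports "Process crashed" only if the process exits before the socket timeout expires.

```python
import re

# The two messages quoted in the bug summary.
CRASH_MSG = "Process crashed"
TIMEOUT_MSG = (
    "Process killed because the connection to Marionette server is lost. "
    "Check gecko.log for errors (Reason: Connection timed out after {}s)"
)


def observed_error(crash_latency_s, socket_timeout_s=10):
    """Hypothetical model of which error the client ends up raising.

    Assumption: if the process exits before the socket timeout expires,
    the harness sees a crash; otherwise the socket times out first and
    the process gets killed.
    """
    if crash_latency_s < socket_timeout_s:
        return CRASH_MSG
    return TIMEOUT_MSG.format(socket_timeout_s)


def matches_expectation(message):
    # The test asserts that the error message matches "Process crashed".
    return re.search(CRASH_MSG, message) is not None


# A fast crash satisfies the assertion; a crash slower than the
# timeout produces the message seen in this intermittent failure.
print(matches_expectation(observed_error(0.5)))
print(matches_expectation(observed_error(12.0)))
```

Under this model, the failure does not require anything to be wrong with the crash itself; it only requires the crash to take longer than the socket timeout on a slow or loaded machine.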

It really reminds me of the remaining problem on bug 1299216 for Windows 8/10 64-bit machines. Let's wait and see whether more reports like this one come in.
OS: Unspecified → Windows 8
Hardware: Unspecified → x86_64
Closing, as this intermittent has not been seen in the last 45 days.
Status: NEW → RESOLVED
Closed: 8 years ago
Resolution: --- → WORKSFORME
This happened again today:
https://treeherder.mozilla.org/logviewer.html#?job_id=95212819&repo=autoland
Status: RESOLVED → REOPENED
Resolution: WORKSFORME → ---
So lately this is only happening on OS X 10.10 for opt (e10s) builds:
https://brasstacks.mozilla.com/orangefactor/?display=Bug&bugid=1322138&startday=2017-05-15&endday=2017-06-11&tree=all

Here is the common output from the gecko log:

http://mozilla-releng-blobs.s3.amazonaws.com/blobs/mozilla-inbound/sha512/e34db8e3ab27d214703bdba9dc5f5c2db3817890d36a453bdcd8f43054ea3a171d7a8331e3183a20abd2bb65b23db8ab7e5f949f17da36ddbb9f7201e61dfa7c

1497023126005	Marionette	TRACE	173 -> [0,11,"executeScript",{"scriptTimeout":null,"newSandbox":true,"args":[],"filename":"test_crash.py","script":"\n              // Copied from crash me simple\n              Components.utils.import(\"resource://gre/modules/ctypes.jsm\");\n\n              // ctypes checks for NULL pointer derefs, so just go near-NULL.\n              var zero = new ctypes.intptr_t(8);\n              var badptr = ctypes.cast(zero, ctypes.PointerType(ctypes.int32_t));\n              var crash = badptr.contents;\n            ","sandbox":null,"line":84}]
[GFX1-]: Receive IPC close with reason=AbnormalShutdown
[Child 1722] WARNING: pipe error: Broken pipe: file /builds/slave/m-in-m64-000000000000000000000/build/src/ipc/chromium/src/chrome/common/ipc_channel_posix.cc, line 709
** Unknown exception behavior: -2147483647
2017-06-09 08:46:35.020 plugin-container[1725:13582] *** CFMessagePort: bootstrap_register(): failed 1100 (0x44c) 'Permission denied', port = 0x973f, name = 'com.apple.tsm.portname'
See /usr/include/servers/bootstrap_defs.h for the error codes.
2017-06-09 08:46:35.026 plugin-container[1725:13582] *** CFMessagePort: bootstrap_register(): failed 1100 (0x44c) 'Permission denied', port = 0x4923, name = 'com.apple.CFPasteboardClient'
See /usr/include/servers/bootstrap_defs.h for the error codes.
2017-06-09 08:46:35.026 plugin-container[1725:13582] Failed to allocate communication port for com.apple.CFPasteboardClient; this is likely due to sandbox restrictions

I wonder if the GFX process related behavior here is causing the crash during shutdown instead of normally exiting.

Milan, do you know someone who could help with that?
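For anyone reading the TRACE line in the log above, the following sketch shows one way to pull the command and the injected script out of it with plain Python. `parse_trace` is a hypothetical helper, not part of any Marionette tooling. It assumes the payload after `->` is a JSON array of the form `[type, message id, command, parameters]` and that the number before `->` (173 here) is a connection identifier. The script body in the sample line is abbreviated to a single statement from the original.

```python
import json

# Shortened copy of the TRACE line from the gecko log. The script body is
# abbreviated; the raw string keeps the JSON "\n" escapes intact.
PAYLOAD = (
    r'[0,11,"executeScript",{"scriptTimeout":null,"newSandbox":true,'
    r'"args":[],"filename":"test_crash.py",'
    r'"script":"\n  var zero = new ctypes.intptr_t(8);\n",'
    r'"sandbox":null,"line":84}]'
)
LINE = "1497023126005\tMarionette\tTRACE\t173 -> " + PAYLOAD


def parse_trace(line):
    """Split a trace line into its log prefix and its JSON packet."""
    prefix, payload = line.split(" -> ", 1)
    # Prefix fields are separated by whitespace (tabs and a space).
    timestamp_ms, logger, level, conn_id = prefix.split()
    # Assumed packet layout: [type, message id, command, parameters].
    msg_type, msg_id, command, params = json.loads(payload)
    return {
        "timestamp_ms": int(timestamp_ms),
        "logger": logger,
        "level": level,
        "conn_id": int(conn_id),  # assumption: connection identifier
        "msg_type": msg_type,
        "msg_id": msg_id,
        "command": command,
        "params": params,
    }


packet = parse_trace(LINE)
print(packet["command"], "from", packet["params"]["filename"])
print(packet["params"]["script"])
```

Decoding the packet this way makes it easier to see that the last command sent before the connection was lost is the deliberate near-NULL pointer dereference from test_crash.py.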
Flags: needinfo?(milan)
OS: Windows 8 → Mac OS X
Hardware: x86_64 → All
(In reply to Henrik Skupin (:whimboo) from comment #11)
> So lately this is only happening on OS X 10.10 for opt (e10s) builds
> ...
> I wonder if the GFX process related behavior here is causing the crash
> during shutdown instead of normally exiting.

Shouldn't be related to GFX process - we don't actually create those on OS X.

Nical, CompositorBridgeChild::ActorDestroy() is getting called with an AbnormalShutdown for the ActorDestroyReason. Is this a side effect of something bad happening earlier?
Flags: needinfo?(milan) → needinfo?(nical.bugzilla)
AbnormalShutdown usually means the other process crashed (or the connection was lost for whatever other unexpected reason).
Flags: needinfo?(nical.bugzilla)
I wonder if the issue seen here lately could be related to bug 1371207 which is about a crash of the main thread, and which started to happen recently.
(In reply to Nicolas Silva [:nical] from comment #14)
> AbnormalShutdown usually means the other process crashed (or the connection
> was lost for whatever other unexpected reason).

So I assume this assertion is not something critical? I'm asking because I always see this in our content crash unit test for Marionette.
Flags: needinfo?(nical.bugzilla)
It means something went wrong in the other process (which most likely crashed), but it doesn't tell you how critical that is. You can expect to see this whenever a process crashes that has some gfx-related IPC. Gfx stuff is tricky to shut down properly when IPC goes nuts, so it is a good indicator for us when something else fails catastrophically in gfx-land right after, but it doesn't mean the root cause is actually gfx-related.
Flags: needinfo?(nical.bugzilla)
I see. Thank you for this explanation. So I doubt that it is important for us here given that this is a forced crash by the harness for testing purposes.

It actually should no longer occur with my upcoming changes for the unit test on bug 1223277.
Depends on: 1223277
Glad this is understood and there are patches in the works for bug 1223277! This has a lot of failures, but it looks like the real fix will get in sooner rather than later, so there is no need to consider backing out.
Whiteboard: [stockwell needswork]
It should basically be blocked on bug 1376795.
Depends on: 1376795
Whiteboard: [stockwell needswork] → [stockwell unknown]
We missed uplifting the patch on bug 1381403 to beta, so this is only fixed for 56. The last failures as reported by OF are expected.

I will leave the bug open until we are clear about the real underlying issue.
This only happens on esr-52 and release. Nothing we can do about that.
Status: REOPENED → RESOLVED
Closed: 8 years ago → 7 years ago
Resolution: --- → FIXED
No, this is disabled and should actually be a dupe of bug 1376773.
Resolution: FIXED → DUPLICATE
Removing leave-open keyword from resolved bugs, per :sylvestre.
Keywords: leave-open
Product: Testing → Remote Protocol
Moving bug to Testing::Marionette Client and Harness component per bug 1815831.
Component: Marionette → Marionette Client and Harness
Product: Remote Protocol → Testing