Closed Bug 1294540 Opened 3 years ago Closed 3 years ago

[e10s] Marionette randomly hangs after "Error: Channel error: cannot send/recv"

Categories

(Testing :: Marionette, defect)

defect
Not set

Tracking

(e10s+, firefox51 fixed)

VERIFIED FIXED
mozilla51
Tracking Status
e10s + ---
firefox51 --- fixed

People

(Reporter: maja_zf, Unassigned)

References

(Blocks 1 open bug)

Details

(Keywords: regression, Whiteboard: [fixed by bug 1051567])

Attachments

(1 file)

When running ./mach marionette-test locally (in my case, on OS X), the test runner hangs right after opening Firefox and before any tests get run. After the socket timeout is reached, the Firefox instance is stopped (and the runner reports this as a crash, which looks like incorrect behaviour but that's another issue).

See log pasted at [1].

This failure occurs almost every time I call ./mach marionette-test; say, 2 out of 3 times I see this hang, and I can reproduce it reliably.

[1] Log output for ./mach marionette-test testing/marionette/harness/marionette/tests/unit/test_click.py --gecko-log=- -vv

 0:00.00 LOG: MainThread INFO Using workspace for temporary data: "/Users/mozilla/dev/firefox"
1470929473394 Marionette  DEBUG Marionette enabled via build flag and pref
1470929473574 Marionette  INFO  Listening on port 2828
1470929473736 Marionette  DEBUG Marionette enabled via command-line flag
1470929473797 Marionette  DEBUG Accepted connection conn0 from 127.0.0.1:50559
1470929473833 Marionette  DEBUG Closed connection conn0
1470929473834 Marionette  DEBUG Accepted connection conn1 from 127.0.0.1:50560
1470929473838 Marionette  DEBUG Closed connection conn1
1470929473840 Marionette  DEBUG Accepted connection conn2 from 127.0.0.1:50561
1470929473845 Marionette  TRACE conn2 -> [0,1,"newSession",{"sessionId":null,"capabilities":null}]
1470929473847 Marionette  CONFIG  Changing capabilities: {"browserName":"firefox","browserVersion":"51.0a1","platformName":"darwin","platformVersion":"15.6.0","specificationLevel":0,"raisesAccessibilityExceptions":false,"rotatable":false,"acceptSslCerts":false,"takesElementScreenshot":true,"takesScreenshot":true,"proxy":{},"platform":"DARWIN","XULappId":"{ec8030f7-c20a-464f-9b0e-13a3a9e97384}","appBuildId":"20160809071821","processId":31329,"version":"51.0a1"}
[Child 31330] WARNING: Message needs unreceived descriptors channel:1174fa000 message-type:4849673 header()->num_fds:1 num_fds:0 fds_i:0: file /builds/slave/ash-m64-0000000000000000000000/build/src/ipc/chromium/src/chrome/common/ipc_channel_posix.cc, line 482
[Child 31330] ###!!! ABORT: Aborting on channel error.: file /builds/slave/ash-m64-0000000000000000000000/build/src/ipc/glue/MessageChannel.cpp, line 2057
[Child 31330] ###!!! ABORT: Aborting on channel error.: file /builds/slave/ash-m64-0000000000000000000000/build/src/ipc/glue/MessageChannel.cpp, line 2057

###!!! [Parent][MessageChannel] Error: (msgtype=0x4A0001,name=PContent::Msg_PBrowserConstructor) Channel error: cannot send/recv


###!!! [Parent][MessageChannel] Error: (msgtype=0x4A0001,name=PContent::Msg_PBrowserConstructor) Channel error: cannot send/recv


###!!! [Parent][OnMaybeDequeueOne] Error: Channel error: cannot send/recv

[Parent 31329] WARNING: FileDescriptorSet destroyed with unconsumed descriptors: file /builds/slave/ash-m64-0000000000000000000000/build/src/ipc/chromium/src/chrome/common/file_descriptor_set_posix.cc, line 22

<<<<< hangs here for duration of socket_timeout, e.g. 360s >>>>

1470929833857 Marionette  DEBUG Closed connection conn2
 8:02.31 CRASH: MainThread pid:31262. Test:runner.py. Minidump anaylsed:False. Signature:[None]
Crash dump filename: /var/folders/5k/xmn_fndx0qs2jcpcwhzl86wm0000gn/T/tmpzWN2bD.mozrunner/minidumps/EC5ABC38-EDD1-48B2-B6A5-EFD481645451.dmp
MINIDUMP_STACKWALK not set, can't process dump.

Error running mach:

    ['marionette-test', 'testing/marionette/harness/marionette/tests/unit/test_click.py', '--gecko-log=-', '-vv']

The error occurred in code that was called by the mach command. This is either
a bug in the called code itself or in the way that mach is calling it.

You should consider filing a bug for this issue.

If filing a bug, please include the full output of mach, including this error
message.

The details of the failure are as follows:

IOError: Process killed because the connection to Marionette server is lost. Check gecko.log for errors (Reason: Connection timed out after 360s)

  File "/Users/mozilla/dev/firefox/testing/marionette/mach_commands.py", line 158, in run_marionette_test
    return run_marionette(tests, topsrcdir=self.topsrcdir, **kwargs)
  File "/Users/mozilla/dev/firefox/testing/marionette/mach_commands.py", line 58, in run_marionette
    failed = MarionetteHarness(MarionetteTestRunner, args=vars(args)).run()
  File "/Users/mozilla/dev/firefox/testing/marionette/harness/marionette/runtests.py", line 68, in run
    runner.run_tests(tests)
  File "/Users/mozilla/dev/firefox/testing/marionette/harness/marionette/runner/base.py", line 857, in run_tests
    need_external_ip = self._start_marionette()
  File "/Users/mozilla/dev/firefox/testing/marionette/harness/marionette/runner/base.py", line 811, in _start_marionette
    if self.appName != 'Fennec':
  File "/Users/mozilla/dev/firefox/testing/marionette/harness/marionette/runner/base.py", line 656, in appName
    self._appName = self.capabilities.get('browserName')
  File "/Users/mozilla/dev/firefox/testing/marionette/harness/marionette/runner/base.py", line 629, in capabilities
    self.marionette.start_session()
  File "/Users/mozilla/dev/firefox/testing/marionette/client/marionette_driver/decorators.py", line 42, in _
    return func(*args, **kwargs)
  File "/Users/mozilla/dev/firefox/testing/marionette/client/marionette_driver/marionette.py", line 1130, in start_session
    resp = self._send_message("newSession", body)
  File "/Users/mozilla/dev/firefox/testing/marionette/client/marionette_driver/decorators.py", line 48, in _
    m.force_shutdown()
  File "/Users/mozilla/dev/firefox/testing/marionette/client/marionette_driver/decorators.py", line 42, in _
    return func(*args, **kwargs)
  File "/Users/mozilla/dev/firefox/testing/marionette/client/marionette_driver/marionette.py", line 682, in _send_message
    msg = self.client.request(name, params)
  File "/Users/mozilla/dev/firefox/testing/marionette/client/marionette_driver/transport.py", line 274, in request
    return self.receive()
  File "/Users/mozilla/dev/firefox/testing/marionette/client/marionette_driver/transport.py", line 204, in receive
    raise socket.timeout("Connection timed out after %ds" % self.socket_timeout)
Bisection yields the following so I'm setting this to block Bug 1287827 + n-i for :Standard8

```
changeset:   307229:1eca699bca28
user:        Mark Banner <standard8@mozilla.com>
date:        Tue Jul 26 20:32:59 2016 +0100
summary:     Bug 1287827 - Part 1. Remove Loop code from Firefox. rs=dmose

changeset:   307230:1a8114aa64c1
user:        Mark Banner <standard8@mozilla.com>
date:        Wed Jul 27 15:22:27 2016 +0100
summary:     Bug 1287827 - Part 2. Build changes for removing Loop. r=glandium
```
Flags: needinfo?(standard8)
Maja, it looks like you even hit a crash. Would you mind setting up the minidump-stackwalk binary to get more details? Maybe it is one of the crashes which we also see on Treeherder. Mind telling us which kind of build you have? I build Firefox myself yesterday and I do not see this problem on OS X.

http://hg.mozilla.org/mozilla-central/rev/1eca699bca28
http://hg.mozilla.org/mozilla-central/rev/1a8114aa64c1

The first commit is all about removing the loop extension from the code base. I cannot see how this affects us. The second only removes the reference to the loop tests. 

I would assume that something else causes this bustage for you. Did you do a full rebuilt or an artifact build from latest code on mozilla-central?
I agree that it's puzzling why these commits would cause any issue. 

This is with artifact builds. 

I will try with full build, etc. when I get the chance.
Maybe do a clobber and try an artifact again in case you do not have autoclobber set.
This could look like an e10s related issue.  Usually when artifact builds are the problem we see a YSOD.
The issue seems restricted to the combination of e10s on an artifact build:

On latest revision from m-c (233ab21b64b5):
* Full build
** e10s off: ok
** e10s on: ok
* Artifact build
** e10s off: ok
** e10s on: hang as described in Comment 0, with crash. See attached log.
tracking-e10s: --- → ?
The only vague thing I can think of, is that Loop wasn't e10s compatible, so it would load the multiprocess compatibility shims. A big speculation is that there could have been a side-effect to marionette somehow, or just that removing the loop tests changes the order of something.

However, the fact it doesn't happen with a full build, but only an artifact one, implies there's something wrong with the artifact builds. Maybe worth asking some e10s & build people to take a look.
Flags: needinfo?(standard8)
CC'ing Chris given that he is/was actively working on the artifact builds and may know more.
Summary: Marionette session does not start: Channel error: cannot send/recv → Marionette session does not start when using artifact builds, failure: "Channel error: cannot send/recv"
I can't reproduce this with the rev mentioned in comment 6... Maja, is this still happening with your build?

There are cases where an artifact build will fetch the most recent binaries, but those will be old enough to be incompatible with the working revision's chrome and cause crashes (I'm working on bug 1240134 to improve this situation).
(In reply to Chris Manchester (:chmanchester) from comment #10)
> I can't reproduce this with the rev mentioned in comment 6... Maja, is this
> still happening with your build?
> 

Yes, still happening at that revision. I also reproduced it on a revision from today: fe895421dfbe. It's always been intermittent (happens 1 of 4 runs, say) so if you ran the test only once maybe you happened not to encounter the failure.
This is actually a dupe of bug 1294456. The hang can happen at any time, and you see it mostly during startup. We have dozen of tests failing because of it right now. Bug 1294456 has a testcase which let you see this hang immediately.
Status: NEW → RESOLVED
Closed: 3 years ago
Resolution: --- → DUPLICATE
Duplicate of bug: 1294456
(In reply to Mike Conley (:mconley) - (Needinfo me!) from bug 1294456 comment 15)
> (Referring to the log of
> https://bugzilla.mozilla.org/show_bug.cgi?id=1294540#c0)
> 
> Looks like the child has a hard time interpreting something from the parent,
> and shuts itself down.
> 
> This error looks suspicious:
> 
> [Child 31330] WARNING: Message needs unreceived descriptors
> channel:1174fa000 message-type:4849673 header()->num_fds:1 num_fds:0
> fds_i:0: file
> /builds/slave/ash-m64-0000000000000000000000/build/src/ipc/chromium/src/
> chrome/common/ipc_channel_posix.cc, line 482
> 
> That refers to this:
> https://dxr.mozilla.org/mozilla-central/rev/
> fe895421dfbe1f1f8f1fc6a39bb20774423a6d74/ipc/chromium/src/chrome/common/
> ipc_channel_posix.cc#472
> 
> Hey billm, any idea why we might not have gotten the right number of fd's in
> a message?
Status: RESOLVED → REOPENED
Flags: needinfo?(wmccloskey)
Resolution: DUPLICATE → ---
Summary: Marionette session does not start when using artifact builds, failure: "Channel error: cannot send/recv" → [e10s] Marionette randomly hangs after "Error: Channel error: cannot send/recv"
This bug might actually have the same underlying issue as bug 1051567 but it doesn't crash Firefox.
The underlying issue definitely is bug 1051567. But it looks like that we crash more often with artifact builds.
Depends on: 1051567
Flags: needinfo?(wmccloskey)
This crash should be fixed now with the patch on bug 1051567 landed. 

Maja, can you please verify? Thanks.
Status: REOPENED → RESOLVED
Closed: 3 years ago3 years ago
Flags: needinfo?(mjzffr)
Resolution: --- → FIXED
Whiteboard: [fixed by bug 1051567]
Target Milestone: --- → mozilla51
No longer blocks: 1291047
Status: RESOLVED → VERIFIED
Flags: needinfo?(mjzffr)
You need to log in before you can comment on or make changes to this bug.