Closed Bug 1294456 Opened 8 years ago Closed 8 years ago

[e10s] After opening a tab/window Marionette hangs if any command is send which executes in the listener.js frame script

Categories

(Remote Protocol :: Marionette, defect)

Version 3
defect
Not set
major

Tracking

(e10s+, firefox50 fixed, firefox51 fixed)

RESOLVED FIXED
mozilla51
Tracking Status
e10s + ---
firefox50 --- fixed
firefox51 --- fixed

People

(Reporter: whimboo, Assigned: mconley)

References

(Blocks 1 open bug)

Details

(Keywords: hang, regression)

Attachments

(3 files, 5 obsolete files)

I can get Marionette reliably to hang on the server side with the attached testcase. This started to happen not that long ago. I will check for a regression range.
Henrik, if this turns out to be non-test code breaking the test, please re-nom for tracking-e10s, thanks.
Ok, so this is not only happening in case of syntax failures but also if in this case the condition times out. Interestingly I only see this with my self-made opt and debug build (latest code from mozilla-central) but not with the Nightly build from August 10th. Even when I checkout 6cf0089510fa and rebuilt I'm not able to reproduce. Maybe it is related to artifact builds? My current build flags are:


mk_add_options AUTOCLOBBER=1
mk_add_options MOZ_OBJDIR=@TOPSRCDIR@/obj/opt/
mk_add_options MOZ_MAKE_FLAGS="-j2"

ac_add_options --enable-artifact-builds
ac_add_options --disable-debug
# ac_add_options --enable-dmd

# Set environment variables
export BUILD_SYSTEM_TELEMETRY=1
export ENABLE_MARIONETTE=1
export MOZ_PACKAGE_JSSHELL=1
Summary: [e10s] Marionette hangs in case of a syntax failure within a test after opening a new tab → [e10s] Marionette hangs in case of a test failure after opening a new tab
(In reply to Henrik Skupin (:whimboo) from comment #0)
> I can get Marionette reliably to hang on the server side with the attached
> testcase. This started to happen not that long ago. I will check for a
> regression range.

I don't see an attached testcase -- did you forget to add it?
Flags: needinfo?(hskupin)
This may be related to Bug 1294540, which is also seen on artifact builds only, with e10s. When I use a Nightly build instead, or a full self-made build (instaed of artifact), all is fine.
Attached file Testcase (obsolete) —
Interesting. I'm sure that I attached it when I filed the bug. Anyway, here is the testcase. Maja, it would be great if you could try to repro with an artifact and Nightly build.
Flags: needinfo?(hskupin) → needinfo?(mjzffr)
I can reproduce the hang on latest m-c with artifact build. Attaching some logs.
Flags: needinfo?(mjzffr)
Same result as above for full build with e10s.

With e10s off on full build, it fails faster and I get a different error:
0:33.16 TEST_END: MainThread ERROR, expected PASS
Traceback (most recent call last):
  File "/Users/mozilla/dev/firefox/testing/marionette/harness/marionette/marionette_test.py", line 344, in run
    testMethod()
  File "/Users/mozilla/dev/firefox/test_temp_henrik.py", line 13, in test_hang
    lambda mn: len(mn.window_handles) == 3)
  File "/Users/mozilla/dev/firefox/testing/marionette/harness/marionette/marionette_test.py", line 685, in wait_for_condition
    raise TimeoutException("wait_for_condition timed out")
TimeoutException: wait_for_condition timed out
(In reply to Maja Frydrychowicz (:maja_zf) from comment #9)
> With e10s off on full build, it fails faster and I get a different error:

Is that the intended outcome of the test, or can we actually expect 3 window handles at the point? 

Anyway, there you have it.
Thanks Maja for testing this. The failure you see is expected and actually needed to reproduce this hang. I think that we all are seeing the same and some e10s related changes broke our IPC communitcation.

I will investigate this further and maybe make this bug more general.
Updating the summary to reflect the underlying issue. This hang can happen at any time.
Summary: [e10s] Marionette hangs in case of a test failure after opening a new tab → [e10s] Marionette randomly hangs after "Error: Channel error: cannot send/recv"
Attachment #8782078 - Attachment mime type: text/x-log → text/plain
Attachment #8782031 - Attachment mime type: text/x-log → text/plain
(Referring to the log of https://bugzilla.mozilla.org/show_bug.cgi?id=1294540#c0)

Looks like the child has a hard time interpreting something from the parent, and shuts itself down.

This error looks suspicious:

[Child 31330] WARNING: Message needs unreceived descriptors channel:1174fa000 message-type:4849673 header()->num_fds:1 num_fds:0 fds_i:0: file /builds/slave/ash-m64-0000000000000000000000/build/src/ipc/chromium/src/chrome/common/ipc_channel_posix.cc, line 482

That refers to this: https://dxr.mozilla.org/mozilla-central/rev/fe895421dfbe1f1f8f1fc6a39bb20774423a6d74/ipc/chromium/src/chrome/common/ipc_channel_posix.cc#472

Hey billm, any idea why we might not have gotten the right number of fd's in a message?
Flags: needinfo?(wmccloskey)
Blocks: 1259055
Sorry, I think that I wrongly duped the other bug. I tried with my testcase again and I do not see the IPC error but just a hang without a reply. I will correct that and restore the former states.
Flags: needinfo?(wmccloskey)
Summary: [e10s] Marionette randomly hangs after "Error: Channel error: cannot send/recv" → [e10s] Marionette hangs in case of a test failure after opening a new tab
Blocks: 1295889
Blocks: 1289091
Blocks: 1295709
No longer blocks: 1295709
I'm currently working on a regression range via my own debug builds. This might take a while. :(
Ok, interestingly i can now reproduce the hang in a Nightly build. This will make the search faster.
Affected branches are clearly mozilla-central, and mozilla-aurora.
Sadly I cannot use mozregression because marionette exists with 0 when we have a socket.timeout during teardown. I will investigate that first because it means under those conditions we do not have the expected result.
Summary: [e10s] Marionette hangs in case of a test failure after opening a new tab → [e10s] Marionette hangs in teardown if any opened tab or window has not been closed at the end of the test
The hang in teardown is just the result of my testcase. But this could happen at any time.

The regression test with mozregression has shown:

14:04.57 INFO: Last good revision: 84b5a1027550faf65534d67dc02372ba09508550
14:04.57 INFO: First bad revision: 80ab1089a326e4a35d32cd6f52ed0194eba89ecf
14:04.57 INFO: Pushlog:
https://hg.mozilla.org/integration/autoland/pushloghtml?fromchange=84b5a1027550faf65534d67dc02372ba09508550&tochange=80ab1089a326e4a35d32cd6f52ed0194eba89ecf

14:05.63 INFO: Looks like the following bug has the changes which introduced the regression:
https://bugzilla.mozilla.org/show_bug.cgi?id=1261842

Mike, can you please have a look at this bug given that it is a regression from one of your patches?
Blocks: 1261842
Flags: needinfo?(mconley)
Summary: [e10s] Marionette hangs in teardown if any opened tab or window has not been closed at the end of the test → [e10s] Marionette hangs if any opened tab or window has not been closed at the end of the test
I would assume the problem lies in the following changeset?

https://hg.mozilla.org/integration/autoland/rev/88b8a2529323
Attached file testcase (hang during teardown) (obsolete) —
Attachment #8781025 - Attachment is obsolete: true
Attachment #8782029 - Attachment is obsolete: true
Attachment #8782031 - Attachment is obsolete: true
Attachment #8782078 - Attachment is obsolete: true
The above assertion is also visible in non-e10s mode. So it's not relevant. Looks like I cannot do more here, and I would ask Mike to continue the investigation.
Bumping severity to major given that this bug causes a lot of intermittent oranges for us in Marionette and Firefox ui tests.
Severity: normal → major
Blocks: 1293090
Andreas, maybe you can help too? Looks like this is really happening when we try work with the listener.js frame script for a newly opened tab. Maybe we don't install it correctly? I will update the minimized testcase shortly.
Flags: needinfo?(ato)
Summary: [e10s] Marionette hangs if any opened tab or window has not been closed at the end of the test → [e10s] After opening a tab/window Marionette hangs if any command is send which executes in the listener.js frame script
Attached file testcase
Attachment #8782493 - Attachment is obsolete: true
Keep in mind that the testcase does not switch away from the original window handle! Maybe we somehow kill the listener frame script after opening a new tab?
I just noticed that whenever we call switch_to_window() after a tab has been opened, the test will pass even for the original tab. So I assume our references to listener.js are getting mixed-up somehow.
Here is some more debugging output:

When using switch_to_window() before accessing the page_source property:
1471960691095	Marionette	TRACE	conn5 -> [0,16,"getPageSource",null]
1471960691096	Marionette	INFO	******** Enter call to get page source *******
1471960691097	Marionette	INFO	* executeWhenReady - remoteness change: false
1471960691102	Marionette	INFO	******** Exit call to get page source *******

When the switch_to_window() call is left-out because we already have the correct window selected:
1471960823850	Marionette	INFO	******** Enter call to get page source *******
1471960823851	Marionette	INFO	* executeWhenReady - remoteness change: true

The difference here is about the remoteness change. It means that executeWhenReady() will push the new command to the pendingCommands list, but those commands are never popped out of the list again. It feels to me that we miss a call to flushPendingCommands() somewhere. I will have to check if that is the right thing to do, and if yes where it has to be placed.
Interestingly it looks like we load the listener twice when opening a new tab:

1471961553666	Marionette	DEBUG	loaded listener.js
1471961553833	Marionette	TRACE	conn5 <- [1,8,null,{}]
++DOCSHELL 0x7f5f37079800 == 7 [pid = 18098] [id = 7]
++DOMWINDOW == 14 (0x7f5f370c7800) [pid = 18098] [serial = 14] [outer = (nil)]
++DOMWINDOW == 15 (0x7f5f370c9800) [pid = 18098] [serial = 15] [outer = 0x7f5f370c7800]
1471961553865	Marionette	DEBUG	loaded listener.js
1471961553866	Marionette	INFO	***** listeners attached
++DOMWINDOW == 16 (0x7f5f37c0e800) [pid = 18098] [serial = 16] [outer = 0x7f5f3745d400]
1471961553956	Marionette	INFO	***** listeners attached

So this is somewhat similar to bug 1239552 which got a follow-up bug 1296628. I would speculate it might fix this problem. I will mark as a blocker for now and re-evaluate this bug once bug 1296628 got fixed.
Depends on: 1296628
Actually listeners are attached twice also in the switch_to_window() case. So it should not completely be related to bug 1296628.
Giving this a look today.
Flags: needinfo?(mconley)
Blocks: 1298005
Some interesting findings.

1) Inside Marionette's browser.js, the reference to this.tab does not refer to the currently selected browser tab, but the browser tab that Marionette last switched to.
2) Bug 1227252 landed about 9 months ago, and accidentally broke that assumption. Some state that browser.js holds on to is supposed to refer to this.tab, but it uses this.browser.selectedTab instead. With your test case, that causes Marionette to think that the browser is switching remoteness, and therefore starts queueing commands for a remoteness change that'll never happen.

I think we need to correct the work that was done in bug 1227252 to fix this. If I, for example, back that patch out, the testcase in this bug passes just fine.

[1]: Since the initial browser is non-remote (this.tab), but the opened one (this.browser.selectedTab) is not.
Comment on attachment 8784944 [details]
Bug 1294456 - Stop Marionette from being confused about which tab it currently cares about in each window.

https://reviewboard.mozilla.org/r/74266/#review72124

Just to add, it would be great if my testcase could be added to the unit tests, eg. in https://dxr.mozilla.org/mozilla-central/source/testing/marionette/harness/marionette/tests/unit/test_window_handles.py.
Attachment #8784944 - Flags: review?(cmanchester) → review?(dburns)
Comment on attachment 8784984 [details]
Bug 1294456 - Regression test.

https://reviewboard.mozilla.org/r/74308/#review72164

::: testing/marionette/harness/marionette/tests/unit/test_window_handles.py:50
(Diff revision 1)
> +        self.wait_for_condition(lambda mn: len(mn.window_handles) == 2)
> +
> +        self.assertEqual(self.marionette.get_url(), "about:blank")
> +
> +        self.marionette.switch_to_window(self.marionette.window_handles[1])
> +        self.marionette.close()

I think if you don't switch back to the original window handle it can cause trouble.
Comment on attachment 8784944 [details]
Bug 1294456 - Stop Marionette from being confused about which tab it currently cares about in each window.

https://reviewboard.mozilla.org/r/74266/#review72500
Attachment #8784944 - Flags: review?(dburns) → review+
Comment on attachment 8784984 [details]
Bug 1294456 - Regression test.

https://reviewboard.mozilla.org/r/74308/#review72502

Please can you change the commit message to something more meaningful and related to what is currently happening.
Attachment #8784984 - Flags: review?(dburns) → review+
Pushed by mconley@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/5d34ee345e18
Stop Marionette from being confused about which tab it currently cares about in each window. r=automatedtester
https://hg.mozilla.org/integration/autoland/rev/e17911d5430d
Regression test. r=automatedtester
https://hg.mozilla.org/mozilla-central/rev/5d34ee345e18
https://hg.mozilla.org/mozilla-central/rev/e17911d5430d
Status: NEW → RESOLVED
Closed: 8 years ago
Flags: in-testsuite+
Resolution: --- → FIXED
Target Milestone: --- → mozilla51
Finally all that works wonderful again. Thank you a lot!

Lets get this also landed on aurora to get the regression fixed.
Assignee: nobody → mconley
Flags: needinfo?(ato)
Whiteboard: [checkin-needed-aurora]
Whiteboard: [checkin-needed-aurora]
Blocks: 1285031
No longer depends on: 1296628
Product: Testing → Remote Protocol
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: