Closed Bug 1504807 Opened 6 years ago Closed 6 years ago

flushPendingCommands doesn't support multiple process changes for the current command

Categories

(Remote Protocol :: Marionette, defect, P1)

defect

Tracking

(firefox64 fixed, firefox65 fixed)

RESOLVED FIXED
mozilla65
Tracking Status
firefox64 --- fixed
firefox65 --- fixed

People

(Reporter: whimboo, Assigned: whimboo)

References

Details

(Keywords: hang)

Attachments

(2 files)

While investigating the high frequent intermittent failure on bug 1501130 I noticed that we have hangs in commands like `WebDriver:Navigate`, which end with a disconnect of the socket connection after 360s. Here an example:

https://treeherder.mozilla.org/logviewer.html#?job_id=209849362&repo=autoland&lineNumber=26770-26783

> 1541434342790    Marionette  TRACE   89 -> [0,20,"WebDriver:Navigate",{"url":"http://127.0.0.1:45870/windowHandles.html"}]
> 1541434342810    Marionette  DEBUG   [8589934599] Frame script loaded
> 1541434342822    Marionette  DEBUG   [8589934599] Frame script registered
> 1541434342987    Marionette  DEBUG   [8589934602] Frame script loaded
> 1541434342988    Marionette  DEBUG   [8589934602] Frame script registered
> 1541434343012    Marionette  DEBUG   [8589934599] Received DOM event beforeunload for about:newtab
> 1541434343020    Marionette  DEBUG   [8589934599] Check readyState interactive for about:newtab
> 1541434343135    Marionette  DEBUG   [12884901895] Frame script loaded
> 1541434343139    Marionette  DEBUG   [12884901895] Frame script registered
> 1541434343141    Marionette  DEBUG   [8589934599] Received DOM event pageshow for about:newtab
> 1541434343261    Marionette  DEBUG   [8589934599] Received DOM event pagehide for about:newtab
> 1541434343272    Marionette  DEBUG   [8589934599] Received DOM event unload for about:newtab
> 1541434343297    Marionette  DEBUG   [8589934599] Received observer notification outer-window-destroyed
> 1541434702858    Marionette  DEBUG   Closed connection 89

When you closely observe this output you will notice that the current framescript is moved between different processes multiple times for the duration of the navigation command: `8589934599` -> `8589934602` -> `8589934599` -> `12884901895` -> `8589934599`. 

While originally this was working fine for a single content process, this started to fail for multiple content processes. When we moved to 4 processes we rarely have seen it. But now with 8 processes by default the intermittent test failures have been increased, and it is way more likely to hit such a situation.

When I tried to reproduce this situation locally it was easy to do with `test_window_close_content.py` (bug 1501130). There is a hang nearly all the time when running the Marionette tests with `--run-until-failure`.

Observing the code what happens when we log `Frame script registered`, I noticed that in driver.js the method `flushPendingCommands()` in browser.js is getting called:

https://searchfox.org/mozilla-central/rev/7c848ac7630df5baf1314b0c03e015683599efb9/testing/marionette/browser.js#435-442 

And this method is buggy! The reason is that it is only able to handle a single process change. Each time when a framescript gets moved to a different process it will flush all the pending commands, and immediately reset all pending commands. That means that after the second process change no more commands are in the list, and eg. a formerly started navigate command isn't registering its page load listeners anymore. As such the command never receives the `DOMContentLoaded`, and `pageshow` events, and also never returns. The connection finally times out.

To fix that we should not immediately clear the full list of pending commands, to still have all of them available for a possible unexpected process change. Once a command has been finished its entry should be cleared automatically.
Assignee: nobody → hskupin
Status: NEW → ASSIGNED
Priority: -- → P1
When I don't reset the list immediately, I can see this behavior which I would expect:

> 1541438931544	Marionette	TRACE	243 -> [0,20,"WebDriver:Navigate",{"url":"http://127.0.0.1:57002/windowHandles.html"}]
> 1541438931631	Marionette	DEBUG	[8589934892] Frame script loaded
> 1541438931633	Marionette	DEBUG	[8589934892] Frame script registered
> 1541438931646	Marionette	DEBUG	[8589934892] Received DOM event beforeunload for about:newtab
> 1541438931722	Marionette	DEBUG	[8589934896] Frame script loaded
> 1541438931722	Marionette	DEBUG	[8589934896] Frame script registered
> 1541438931861	Marionette	DEBUG	[8589934892] Check readyState interactive for about:newtab
> 1541438931895	Marionette	DEBUG	[68719476743] Frame script loaded
> 1541438931921	Marionette	DEBUG	[68719476743] Frame script registered
> 1541438931957	Marionette	DEBUG	[8589934892] Received DOM event pageshow for about:newtab
> 1541438932007	Marionette	DEBUG	[68719476743] Check readyState uninitialized for about:blank
> 1541438932070	Marionette	DEBUG	[8589934892] Received DOM event pagehide for about:newtab
> 1541438932086	Marionette	DEBUG	[8589934892] Received DOM event unload for about:newtab
> 1541438932148	Marionette	DEBUG	[8589934892] Received observer notification outer-window-destroyed
> 1541438932182	Marionette	DEBUG	[68719476743] Received DOM event DOMContentLoaded for http://127.0.0.1:57002/windowHandles.html
> 1541438932233	Marionette	DEBUG	[68719476743] Received DOM event pageshow for http://127.0.0.1:57002/windowHandles.html
> 1541438932253	Marionette	TRACE	243 <- [1,20,null,{"value":null}]
Mike, any idea why we are seeing so many process changes for a single page navigation?
Flags: needinfo?(mconley)
I pushed a try build yesterday which looks pretty fine:

https://treeherder.mozilla.org/#/jobs?repo=try&revision=dc2196140b8402c4fe3c6c9ef2008b39e21e7e3d

But now I still have a problem with the `responseCompleted()` method in driver.js:

https://searchfox.org/mozilla-central/rev/7c848ac7630df5baf1314b0c03e015683599efb9/testing/marionette/driver.js#3428-3432

It basically means that whichever command is currently executed, it will delete all the pending commands once completed, and not only it's own entry. Note that for now we already did that in `flushPendingCommands`, so this action is only moved to a different place while allowing Marionette to still complete the currently active command.

What makes me wonder is why we have to keep a list of pending commands when we only allow the very first one to be executed again, but all others to be removed once the first command completed execution? Why do we need a queue at all?

Andreas, can you remember why this has been implemented that way? It was clearly before I started to work on Marionette. So maybe I miss something here. 

I for myself would like to leave this logic as is, and we could revisit on a different bug.
Flags: needinfo?(ato)
Blocks: 1473743
With the use of multiple content processes in Firefox a navigation
command can cause the active framescript to be moved to a different
process. This interrupts the currently executed command, and as such
needs to be executed again after the framescript has been finished
initializing.

Currently flushing the pending commands doesn't take into account
that the framescript can even be moved multiple times to a different
process during a single page navigation. As such all pending commands
are getting removed after the first process move. For navigation
commands this means that no page load listeners are attached for
subsequent process changes, and navigation commands could never
return, and cause a hang of Marionette.

To solve the problem the pending commands need to be flushed each
time the process changes. They will remove themselves from the list
once they have finished processing.

Depends on D10998
(In reply to Henrik Skupin (:whimboo) from comment #3)

> I pushed a try build yesterday which looks pretty fine:
> 
> https://treeherder.mozilla.org/#/
> jobs?repo=try&revision=dc2196140b8402c4fe3c6c9ef2008b39e21e7e3d
> 
> But now I still have a problem with the `responseCompleted()`
> method in driver.js:
> 
> https://searchfox.org/mozilla-central/rev/7c848ac7630df5baf1314b0c03e015683599efb9/testing/marionette/driver.js#3428- 3432
> 
> It basically means that whichever command is currently executed,
> it will delete all the pending commands once completed, and not
> only it's own entry.  Note that for now we already did that
> in `flushPendingCommands`, so this action is only moved to a
> different place while allowing Marionette to still complete the
> currently active command.

I see, that puts your review in a different light.  I wish this
information had been part of your commit message.  I’m going to go
back and accept the patch.

> What makes me wonder is why we have to keep a list of pending
> commands when we only allow the very first one to be executed
> again, but all others to be removed once the first command
> completed execution? Why do we need a queue at all?
> 
> Andreas, can you remember why this has been implemented that
> way? It was clearly before I started to work on Marionette. So
> maybe I miss something here.

It’s currently assumed that all pending commands belong to the same
request to the frame script, and an array seemed like a better
data structure than a single variable that could get overridden if
driver.js had a need for multiple intermediary requests to the frame
script.

Also in theory we want to support concurrent calls to the frame
script from different origins, but this is obviously not possible at
the moment because there simply are no origin check…

The data flow here is pretty bad and I think the solution might be
to move to child actors (implemented by kmag) or to use one of the
fancy message manager wrappers we have for Firefox.  I don’t know
if they specifically cater for _multiple process changes_, but it
doesn’t seem impossible that they can be made to work that way.
Sharing more code with toolkit and Firefox would be of huge advantage
for Marionette.
Flags: needinfo?(ato)
(In reply to Andreas Tolfsen ❲:ato❳ from comment #6)
> I see, that puts your review in a different light.  I wish this
> information had been part of your commit message.  I’m going to go
> back and accept the patch.

Sorry, I thought that you would read the needinfo first. But it's one example of disconnect between Bugzilla and Phabricator, which we have now. Maybe I should really start putting everything into the commit message which makes the patch self-explanatory. 

> It’s currently assumed that all pending commands belong to the same
> request to the frame script, and an array seemed like a better
> data structure than a single variable that could get overridden if
> driver.js had a need for multiple intermediary requests to the frame
> script.

I see. Also note that we had a problem before when `Marionette:newSession` was still used. While a navigation request was still queued up, a process move caused a re-registration of the framescript, and as such a call to `Marionette:newSession` was emitted. But that was never executed because the navigation command had precedence. So I'm glad that this listener is gone and we don't have to rely on it anymore.

> Also in theory we want to support concurrent calls to the frame
> script from different origins, but this is obviously not possible at
> the moment because there simply are no origin check…

Yes, it's good that we don't need that at the moment!

> The data flow here is pretty bad and I think the solution might be
> to move to child actors (implemented by kmag) or to use one of the

Lets see once we have to convert, or someone is doing it for us.
Attachment #9022840 - Attachment description: Bug 1504807 - [marionette] Remove call to not existent "Marionette:newSession" message listener. r?ato → Bug 1504807 - [marionette] Remove call to non-existent "Marionette:newSession" message listener. r?ato
Pushed by hskupin@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/8e36cdc6bbc5
[marionette] Remove call to non-existent "Marionette:newSession" message listener. r=ato
https://hg.mozilla.org/integration/autoland/rev/dac9d7c5dce8
[marionette] Don't remove pending command unless it has been completed. r=ato
https://hg.mozilla.org/mozilla-central/rev/8e36cdc6bbc5
https://hg.mozilla.org/mozilla-central/rev/dac9d7c5dce8
Status: ASSIGNED → RESOLVED
Closed: 6 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla65
Blocks: 1503274
(In reply to Henrik Skupin (:whimboo) from comment #2)
> Mike, any idea why we are seeing so many process changes for a single page
> navigation?

I think I might have identified the problem on bug 1503274 comment 1. If you have the time and could cross-check that would be great!
To prevent hangs in Marionette after navigation it would be great to get this test harness only patch uplifted to beta.
Keywords: hang
Whiteboard: [checkin-needed-beta]
(In reply to Henrik Skupin (:whimboo) from comment #2)
> Mike, any idea why we are seeing so many process changes for a single page
> navigation?

Quite likely this is because of the new Activity Stream content process (bug 1472212). Any new tabs pointed at about:newtab will load in that content process, but as soon as it browses away, it'll do a process switch to a normal content process.
Flags: needinfo?(mconley)
That would indeed explain it in combination with bug 1503274. Thanks a lot for this reference!
Product: Testing → Remote Protocol
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: