Closed
Bug 1504807
Opened 6 years ago
Closed 6 years ago
flushPendingCommands doesn't support multiple process changes for the current command
Categories
(Remote Protocol :: Marionette, defect, P1)
Remote Protocol
Marionette
Tracking
(firefox64 fixed, firefox65 fixed)
RESOLVED
FIXED
mozilla65
People
(Reporter: whimboo, Assigned: whimboo)
References
Details
(Keywords: hang)
Attachments
(2 files)
While investigating the high frequent intermittent failure on bug 1501130 I noticed that we have hangs in commands like `WebDriver:Navigate`, which end with a disconnect of the socket connection after 360s. Here an example: https://treeherder.mozilla.org/logviewer.html#?job_id=209849362&repo=autoland&lineNumber=26770-26783 > 1541434342790 Marionette TRACE 89 -> [0,20,"WebDriver:Navigate",{"url":"http://127.0.0.1:45870/windowHandles.html"}] > 1541434342810 Marionette DEBUG [8589934599] Frame script loaded > 1541434342822 Marionette DEBUG [8589934599] Frame script registered > 1541434342987 Marionette DEBUG [8589934602] Frame script loaded > 1541434342988 Marionette DEBUG [8589934602] Frame script registered > 1541434343012 Marionette DEBUG [8589934599] Received DOM event beforeunload for about:newtab > 1541434343020 Marionette DEBUG [8589934599] Check readyState interactive for about:newtab > 1541434343135 Marionette DEBUG [12884901895] Frame script loaded > 1541434343139 Marionette DEBUG [12884901895] Frame script registered > 1541434343141 Marionette DEBUG [8589934599] Received DOM event pageshow for about:newtab > 1541434343261 Marionette DEBUG [8589934599] Received DOM event pagehide for about:newtab > 1541434343272 Marionette DEBUG [8589934599] Received DOM event unload for about:newtab > 1541434343297 Marionette DEBUG [8589934599] Received observer notification outer-window-destroyed > 1541434702858 Marionette DEBUG Closed connection 89 When you closely observe this output you will notice that the current framescript is moved between different processes multiple times for the duration of the navigation command: `8589934599` -> `8589934602` -> `8589934599` -> `12884901895` -> `8589934599`. While originally this was working fine for a single content process, this started to fail for multiple content processes. When we moved to 4 processes we rarely have seen it. But now with 8 processes by default the intermittent test failures have been increased, and it is way more likely to hit such a situation. When I tried to reproduce this situation locally it was easy to do with `test_window_close_content.py` (bug 1501130). There is a hang nearly all the time when running the Marionette tests with `--run-until-failure`. Observing the code what happens when we log `Frame script registered`, I noticed that in driver.js the method `flushPendingCommands()` in browser.js is getting called: https://searchfox.org/mozilla-central/rev/7c848ac7630df5baf1314b0c03e015683599efb9/testing/marionette/browser.js#435-442 And this method is buggy! The reason is that it is only able to handle a single process change. Each time when a framescript gets moved to a different process it will flush all the pending commands, and immediately reset all pending commands. That means that after the second process change no more commands are in the list, and eg. a formerly started navigate command isn't registering its page load listeners anymore. As such the command never receives the `DOMContentLoaded`, and `pageshow` events, and also never returns. The connection finally times out. To fix that we should not immediately clear the full list of pending commands, to still have all of them available for a possible unexpected process change. Once a command has been finished its entry should be cleared automatically.
Assignee | ||
Updated•6 years ago
|
Assignee: nobody → hskupin
Status: NEW → ASSIGNED
Priority: -- → P1
Assignee | ||
Comment 1•6 years ago
|
||
When I don't reset the list immediately, I can see this behavior which I would expect:
> 1541438931544 Marionette TRACE 243 -> [0,20,"WebDriver:Navigate",{"url":"http://127.0.0.1:57002/windowHandles.html"}]
> 1541438931631 Marionette DEBUG [8589934892] Frame script loaded
> 1541438931633 Marionette DEBUG [8589934892] Frame script registered
> 1541438931646 Marionette DEBUG [8589934892] Received DOM event beforeunload for about:newtab
> 1541438931722 Marionette DEBUG [8589934896] Frame script loaded
> 1541438931722 Marionette DEBUG [8589934896] Frame script registered
> 1541438931861 Marionette DEBUG [8589934892] Check readyState interactive for about:newtab
> 1541438931895 Marionette DEBUG [68719476743] Frame script loaded
> 1541438931921 Marionette DEBUG [68719476743] Frame script registered
> 1541438931957 Marionette DEBUG [8589934892] Received DOM event pageshow for about:newtab
> 1541438932007 Marionette DEBUG [68719476743] Check readyState uninitialized for about:blank
> 1541438932070 Marionette DEBUG [8589934892] Received DOM event pagehide for about:newtab
> 1541438932086 Marionette DEBUG [8589934892] Received DOM event unload for about:newtab
> 1541438932148 Marionette DEBUG [8589934892] Received observer notification outer-window-destroyed
> 1541438932182 Marionette DEBUG [68719476743] Received DOM event DOMContentLoaded for http://127.0.0.1:57002/windowHandles.html
> 1541438932233 Marionette DEBUG [68719476743] Received DOM event pageshow for http://127.0.0.1:57002/windowHandles.html
> 1541438932253 Marionette TRACE 243 <- [1,20,null,{"value":null}]
Assignee | ||
Comment 2•6 years ago
|
||
Mike, any idea why we are seeing so many process changes for a single page navigation?
Flags: needinfo?(mconley)
Assignee | ||
Comment 3•6 years ago
|
||
I pushed a try build yesterday which looks pretty fine: https://treeherder.mozilla.org/#/jobs?repo=try&revision=dc2196140b8402c4fe3c6c9ef2008b39e21e7e3d But now I still have a problem with the `responseCompleted()` method in driver.js: https://searchfox.org/mozilla-central/rev/7c848ac7630df5baf1314b0c03e015683599efb9/testing/marionette/driver.js#3428-3432 It basically means that whichever command is currently executed, it will delete all the pending commands once completed, and not only it's own entry. Note that for now we already did that in `flushPendingCommands`, so this action is only moved to a different place while allowing Marionette to still complete the currently active command. What makes me wonder is why we have to keep a list of pending commands when we only allow the very first one to be executed again, but all others to be removed once the first command completed execution? Why do we need a queue at all? Andreas, can you remember why this has been implemented that way? It was clearly before I started to work on Marionette. So maybe I miss something here. I for myself would like to leave this logic as is, and we could revisit on a different bug.
Flags: needinfo?(ato)
Assignee | ||
Comment 4•6 years ago
|
||
Assignee | ||
Comment 5•6 years ago
|
||
With the use of multiple content processes in Firefox a navigation command can cause the active framescript to be moved to a different process. This interrupts the currently executed command, and as such needs to be executed again after the framescript has been finished initializing. Currently flushing the pending commands doesn't take into account that the framescript can even be moved multiple times to a different process during a single page navigation. As such all pending commands are getting removed after the first process move. For navigation commands this means that no page load listeners are attached for subsequent process changes, and navigation commands could never return, and cause a hang of Marionette. To solve the problem the pending commands need to be flushed each time the process changes. They will remove themselves from the list once they have finished processing. Depends on D10998
Comment 6•6 years ago
|
||
(In reply to Henrik Skupin (:whimboo) from comment #3) > I pushed a try build yesterday which looks pretty fine: > > https://treeherder.mozilla.org/#/ > jobs?repo=try&revision=dc2196140b8402c4fe3c6c9ef2008b39e21e7e3d > > But now I still have a problem with the `responseCompleted()` > method in driver.js: > > https://searchfox.org/mozilla-central/rev/7c848ac7630df5baf1314b0c03e015683599efb9/testing/marionette/driver.js#3428- 3432 > > It basically means that whichever command is currently executed, > it will delete all the pending commands once completed, and not > only it's own entry. Note that for now we already did that > in `flushPendingCommands`, so this action is only moved to a > different place while allowing Marionette to still complete the > currently active command. I see, that puts your review in a different light. I wish this information had been part of your commit message. I’m going to go back and accept the patch. > What makes me wonder is why we have to keep a list of pending > commands when we only allow the very first one to be executed > again, but all others to be removed once the first command > completed execution? Why do we need a queue at all? > > Andreas, can you remember why this has been implemented that > way? It was clearly before I started to work on Marionette. So > maybe I miss something here. It’s currently assumed that all pending commands belong to the same request to the frame script, and an array seemed like a better data structure than a single variable that could get overridden if driver.js had a need for multiple intermediary requests to the frame script. Also in theory we want to support concurrent calls to the frame script from different origins, but this is obviously not possible at the moment because there simply are no origin check… The data flow here is pretty bad and I think the solution might be to move to child actors (implemented by kmag) or to use one of the fancy message manager wrappers we have for Firefox. I don’t know if they specifically cater for _multiple process changes_, but it doesn’t seem impossible that they can be made to work that way. Sharing more code with toolkit and Firefox would be of huge advantage for Marionette.
Flags: needinfo?(ato)
Assignee | ||
Comment 7•6 years ago
|
||
(In reply to Andreas Tolfsen ❲:ato❳ from comment #6) > I see, that puts your review in a different light. I wish this > information had been part of your commit message. I’m going to go > back and accept the patch. Sorry, I thought that you would read the needinfo first. But it's one example of disconnect between Bugzilla and Phabricator, which we have now. Maybe I should really start putting everything into the commit message which makes the patch self-explanatory. > It’s currently assumed that all pending commands belong to the same > request to the frame script, and an array seemed like a better > data structure than a single variable that could get overridden if > driver.js had a need for multiple intermediary requests to the frame > script. I see. Also note that we had a problem before when `Marionette:newSession` was still used. While a navigation request was still queued up, a process move caused a re-registration of the framescript, and as such a call to `Marionette:newSession` was emitted. But that was never executed because the navigation command had precedence. So I'm glad that this listener is gone and we don't have to rely on it anymore. > Also in theory we want to support concurrent calls to the frame > script from different origins, but this is obviously not possible at > the moment because there simply are no origin check… Yes, it's good that we don't need that at the moment! > The data flow here is pretty bad and I think the solution might be > to move to child actors (implemented by kmag) or to use one of the Lets see once we have to convert, or someone is doing it for us.
Updated•6 years ago
|
Attachment #9022840 -
Attachment description: Bug 1504807 - [marionette] Remove call to not existent "Marionette:newSession" message listener. r?ato → Bug 1504807 - [marionette] Remove call to non-existent "Marionette:newSession" message listener. r?ato
Pushed by hskupin@mozilla.com: https://hg.mozilla.org/integration/autoland/rev/8e36cdc6bbc5 [marionette] Remove call to non-existent "Marionette:newSession" message listener. r=ato https://hg.mozilla.org/integration/autoland/rev/dac9d7c5dce8 [marionette] Don't remove pending command unless it has been completed. r=ato
Comment 9•6 years ago
|
||
bugherder |
https://hg.mozilla.org/mozilla-central/rev/8e36cdc6bbc5 https://hg.mozilla.org/mozilla-central/rev/dac9d7c5dce8
Status: ASSIGNED → RESOLVED
Closed: 6 years ago
status-firefox65:
--- → fixed
Resolution: --- → FIXED
Target Milestone: --- → mozilla65
Assignee | ||
Comment 10•6 years ago
|
||
(In reply to Henrik Skupin (:whimboo) from comment #2) > Mike, any idea why we are seeing so many process changes for a single page > navigation? I think I might have identified the problem on bug 1503274 comment 1. If you have the time and could cross-check that would be great!
Assignee | ||
Comment 11•6 years ago
|
||
To prevent hangs in Marionette after navigation it would be great to get this test harness only patch uplifted to beta.
Comment 12•6 years ago
|
||
bugherder uplift |
https://hg.mozilla.org/releases/mozilla-beta/rev/2dba15f946e3 https://hg.mozilla.org/releases/mozilla-beta/rev/6900e33d7308
Whiteboard: [checkin-needed-beta]
Comment 13•6 years ago
|
||
(In reply to Henrik Skupin (:whimboo) from comment #2) > Mike, any idea why we are seeing so many process changes for a single page > navigation? Quite likely this is because of the new Activity Stream content process (bug 1472212). Any new tabs pointed at about:newtab will load in that content process, but as soon as it browses away, it'll do a process switch to a normal content process.
Flags: needinfo?(mconley)
Assignee | ||
Comment 14•6 years ago
|
||
That would indeed explain it in combination with bug 1503274. Thanks a lot for this reference!
Updated•1 year ago
|
Product: Testing → Remote Protocol
You need to log in
before you can comment on or make changes to this bug.
Description
•