Closed Bug 1335778 Opened 3 years ago Closed 3 years ago

Make element#click command check for page load and wait

Categories

(Testing :: Marionette, defect, P1, major)

Version 3
defect

Tracking

(firefox54 fixed, firefox55 fixed)

RESOLVED FIXED
mozilla55
Tracking Status
firefox54 --- fixed
firefox55 --- fixed

People

(Reporter: automatedtester, Assigned: whimboo)

References

(Depends on 2 open bugs, Blocks 1 open bug)

Details

(Keywords: pi-marionette-server, pi-marionette-spec, Whiteboard: [spec][17/04])

Attachments

(6 files)

Severity: normal → major
Priority: -- → P1
I feel this should wait for a solution on bug 1288336, so we know better what could be done here. Andreas, David, what do you think?
No longer blocks: 1332998
Well, the bug number I wanted to use should have been bug 1333458.
(In reply to Henrik Skupin (:whimboo) from comment #1)
> I feel this should wait for a solution on bug 1333458, so we know better
> what could be done here. Andreas, David, what do you think?

I really don't mind what order we do things in
(In reply to Henrik Skupin (:whimboo) from comment #2)
> Well, the bug number I wanted to use should have been bug 1333458.

I don’t necessarily think we need to write patches sequentially in dependency order.  Either one can be rebased onto the other.

The crucial part of this bug is to disconnect the wait-for-page-load algorithm from the navigation command so it can also be used for interaction commands such as Element Click.
(In reply to Andreas Tolfsen ‹:ato› from comment #4)
> The crucial part of this bug is to disconnect the wait-for-page-load
> algorithm from the navigation command so it can also be used for interaction
> commands such as Element Click.

I think that after our discussion yesterday it should be more clear what will come up, and that starting on this bug now would be waste of time.
We should actually also wait for bug 1333014 being landed.
Depends on: 1333014
Before I get started with this bug some notes from our meeting today:

1. It's tricky to detect if a page load is triggered by a click. On one hand we want to make sure to wait for it, but on the other hand we also do not want to delay the reply that long. So a sane timeout has to be found. Maybe this can be a value between 100ms and 200ms for opt builds. Debug builds and Valgrind builds might need a factor so they are waiting longer due to their slowness.

2. How do we detect if a page load is triggered? We will have to wait for an event. This could be `pagehide`, but this might be triggered too late. Another option could be `beforeunload` which should be available way earlier. To get a sense of which events are getting fired overall the following code could be used which I got from David:

var log=function(a){console.log(a)},events=[];for(var i in document.body)i.startsWith('on')&&events.push(i.substr(2));events.forEach(function(a){document.body.addEventListener(a,log)});
can we get an update for this bug?
Flags: needinfo?(hskupin)
Sure. We depend on the upcoming changes for bug 1291320, which should hopefully land soon. Then we can figure out what to do here. Maybe we don't need the patch for bug 1333014. But lets see.
Depends on: 1291320
Flags: needinfo?(hskupin)
[mass] Setting priority
Assignee: nobody → hskupin
Status: NEW → ASSIGNED
Ok, so I had a look into this and I got it working locally as a POC. Generally we can use the `beforeunload` event to detect if a page load event is going to happen. Using a 200ms timeout with a local optimized debug build works fine. But we certainly need a multiplier here in the case of a non-optimized build, or we should not run those unit tests for debug/valgrind builds.

Anyway, the following items are left to do:

* Check if we can still use the new dispatching technique for elementClick() or if we have to go back to the old one, similar to `get()`, `back()`, and `forward()`. I suspect it has to be the case for now, given that we have no solution for remoteness changes yet.

* Make the unload check mandatory so it will apply to all commands. Commands which really need an unload (get, back, forward, refresh) should be able to opt-out.

* Figuring out a good timeout for the beforeunload event. Maybe by using a multiplier.
(In reply to Henrik Skupin (:whimboo) from comment #11)

> 
> * Make the unload check mandatory so it will apply to all commands. Commands
> which really need an unload (get, back, forward, refresh) should be able to
> opt-out.


I would rather this be opt in for commands. 

If someone were to do something with #sendKeys it could add an unwarranted delay and if people are using #executeScript they should know if the page is going to transition or not.
(In reply to David Burns :automatedtester from comment #12)
> I would rather this be opt in for commands. 
> 
> If someone were to do something with #sendKeys it could add an unwarranted
> delay and if people are using #executeScript they should know if the page is
> going to transition or not.

Well, using `sendKeys()` can also trigger a page load, so we would have to guard for that - unless the spec would only restrict that behavior to click(). Is that the case?

Also I don't think that all that adds too much of an overhead. Here two examples with timings from a local optimized debug build:

A click which is not causing a page load (timeout of 200ms):

*** Caught event: mousedown
*** Caught event: mouseup
*** Caught event: click
*** set unload timer: 1491596165563
*** unload timeout: 1491596165770

A click which is causing a page load (timeout of 200ms):

*** Caught event: mousedown
*** Caught event: mouseup
*** Caught event: click
*** set unload timer: 1491596500131
*** event seen: 1491596500166 - pagehide
*** stopped unload timer: 1491596500166
*** event seen: 1491596500196 - DOMContentLoaded
*** event seen: 1491596500225 - pageshow


In the second example you can see that the pagehide event kicked in 35ms after the click was issued. I don't have general numbers yet in how different builds and different pages behave. It's something I would have to profile. But if it's really that fast, we might be even able to use a 100ms timeout.
Flags: needinfo?(dburns)
(In reply to Henrik Skupin (:whimboo) from comment #13)
> (In reply to David Burns :automatedtester from comment #12)
> > I would rather this be opt in for commands. 
> > 
> > If someone were to do something with #sendKeys it could add an unwarranted
> > delay and if people are using #executeScript they should know if the page is
> > going to transition or not.
> 
> Well, using `sendKeys()` can also trigger a page load, so we would have to
> guard for that - unless the spec would only restrict that behavior to
> click(). Is that the case?

We are only going to care about click in this case. I do appreciate that sendKeys can do that but people have never expected it on that command so let's not change that.

> 
> Also I don't think that all that adds too much of an overhead. Here two
> examples with timings from a local optimized debug build:
> 
> A click which is not causing a page load (timeout of 200ms):
> 
> *** Caught event: mousedown
> *** Caught event: mouseup
> *** Caught event: click
> *** set unload timer: 1491596165563
> *** unload timeout: 1491596165770
> 
> A click which is causing a page load (timeout of 200ms):
> 
> *** Caught event: mousedown
> *** Caught event: mouseup
> *** Caught event: click
> *** set unload timer: 1491596500131
> *** event seen: 1491596500166 - pagehide
> *** stopped unload timer: 1491596500166
> *** event seen: 1491596500196 - DOMContentLoaded
> *** event seen: 1491596500225 - pageshow
> 
> 
> In the second example you can see that the pagehide event kicked in 35ms
> after the click was issued. I don't have general numbers yet in how
> different builds and different pages behave. It's something I would have to
> profile. But if it's really that fast, we might be even able to use a 100ms
> timeout.

Let's start conservatively at 200ms and leave it for a while (even if its just in nightly) and then I will go speak to some WebDriver users and see how this affects their tests.
Flags: needinfo?(dburns)
Depends on: 1356000
It's not really blocked on bug 1356000, which can be fixed beside with the same technique. So removing the dependency.

As of today I have made a lot of changes to my local patch and triggered a couple of try builds. Each time when I fixed something, another failure appeared. So it was a bit of iterating and adding/fixing some more tests. Finally as it shows my last try build seem to be greenish:

https://treeherder.mozilla.org/#/jobs?repo=try&revision=624a4f998e216116204921f3c4e229105738d9f0

With that result I will keep the debugging commit still in the commit series, and will upload the patch for review, and to trigger another complete try build.

What I'm surprised with is that the unload timeout of 200ms works perfectly fine even for debug builds. So compared to the slowness in loading a page, the unload gets triggered with the same speed as in opt builds. That's actually great to see.
No longer depends on: 1356000
While I already setup r? I might have to check a failing wpt8 job:
https://treeherder.mozilla.org/#/jobs?repo=try&revision=a7627d75229b&selectedJob=91363747

Also I will have to check some of our Firefox UI tests where we can remove the Wait().until() calls now.
Comment on attachment 8857984 [details]
Bug 1335778 - Synchronize navigate() for trigger methods using generators.

https://reviewboard.mozilla.org/r/130014/#review132648

::: testing/marionette/listener.js:289
(Diff revision 1)
> +      // TODO check instance or error and convert to WebDriverError if necessary
> +      sendError(err, command_id);

I think this is done for you in http://searchfox.org/mozilla-central/source/testing/marionette/proxy.js#174, which gets called from http://searchfox.org/mozilla-central/source/testing/marionette/listener.js#616.
Attachment #8857984 - Flags: review?(ato) → review+
Comment on attachment 8857985 [details]
Bug 1335778 - Harden test_page_timeout_fail against intermittent failures.

https://reviewboard.mozilla.org/r/130016/#review132650

::: commit-message-00aba:3
(Diff revision 1)
> +A unit test which tests a page timeout has to use a page which gets slowly
> +loaded. Otherwise there is a risk that the timeout error is not thrown
> +because the page was loaded that fast.

Please fix the language here.
Attachment #8857985 - Flags: review?(ato) → review+
Comment on attachment 8857986 [details]
Bug 1335778 - Make element click command check for page load and wait.

https://reviewboard.mozilla.org/r/130018/#review132652

::: commit-message-593c1:4
(Diff revision 2)
> +Bug 1335778 - Make element click command check for page load and wait.
> +
> +If the click command is triggering a page load, it should not return before
> +the page hasn't been fully loaded. With this patch we allow an opt-in for

Double negation, fix by s/hasn't//.

::: testing/marionette/driver.js:1904
(Diff revision 2)
> +      // If a remoteness update interrupts our page load, this will never return
> +      // We need to re-issue this request to correctly poll for readyState and
> +      // send errors.

Suggestion for rephrasing:

> When a remoteness update occurs during navigation caused
> by the click, the listener will never respond.  Re-issue
> the request to correctly poll for readyState and potential
> errors.

::: testing/marionette/harness/marionette_harness/tests/unit/test_click.py:257
(Diff revision 2)
>          with self.assertRaises(errors.ElementClickInterceptedException):
>              obscured.click()
>          self.assertFalse(self.marionette.execute_script("return window.clicked", sandbox=None))
> +
> +
> +class TestClickAndNavigate(MarionetteTestCase):

I would call this TestClickNavigation.  The conjunction makes it sound like you first click and then navigate, but it is a click-induced navigation.

::: testing/marionette/listener.js:123
(Diff revision 2)
>  var loadListener = {
>    command_id: null,
> +  seenUnload: null,
>    timeout: null,
> -  timer: null,
> +  timerPageLoad: null,
> +  timerPageUnLoad: null,

s/timerPageUnLoad/timerPageUnload/g

::: testing/marionette/listener.js:245
(Diff revision 2)
>     */
>    notify: function (timer) {
> +    switch (timer) {
> +      // If the page unload timer is raised, ensure to properly stop the load
> +      // listener, and return from the currently active command.
> +      case this.timerPageUnLoad:

I don’t think I’ve ever seen instances used in cases like this before.  Rather clever.

::: testing/marionette/listener.js:254
(Diff revision 2)
> +        }
> +        break;
> +
> +    case this.timerPageLoad:
> -    this.stop();
> +      this.stop();
> -    sendError(new TimeoutError("Timeout loading page after " + this.timeout + "ms"),
> +      sendError(new TimeoutError("Timeout loading page after " + this.timeout + "ms"),

Use an ES6 string template here.

::: testing/marionette/listener.js:255
(Diff revision 2)
> +        break;
> +
> +    case this.timerPageLoad:
> -    this.stop();
> +      this.stop();
> -    sendError(new TimeoutError("Timeout loading page after " + this.timeout + "ms"),
> +      sendError(new TimeoutError("Timeout loading page after " + this.timeout + "ms"),
> -              this.command_id);
> +                this.command_id);

Indent following line by four spaces.
Attachment #8857986 - Flags: review?(ato) → review+
Comment on attachment 8857984 [details]
Bug 1335778 - Synchronize navigate() for trigger methods using generators.

https://reviewboard.mozilla.org/r/130014/#review132648

> I think this is done for you in http://searchfox.org/mozilla-central/source/testing/marionette/proxy.js#174, which gets called from http://searchfox.org/mozilla-central/source/testing/marionette/listener.js#616.

Yeah. That was a left-over. Thanks for spotting it.
Comment on attachment 8857985 [details]
Bug 1335778 - Harden test_page_timeout_fail against intermittent failures.

https://reviewboard.mozilla.org/r/130016/#review132650

> Please fix the language here.

I suggest:

> Tests for page timeout durations use an HTTPD handler that delays or slows down document load.  Otherwise there a risk that the timeout error is not returned before the document finishes loading.
Comment on attachment 8857986 [details]
Bug 1335778 - Make element click command check for page load and wait.

https://reviewboard.mozilla.org/r/130018/#review132652

> Suggestion for rephrasing:
> 
> > When a remoteness update occurs during navigation caused
> > by the click, the listener will never respond.  Re-issue
> > the request to correctly poll for readyState and potential
> > errors.

This would need a change to all existing commands which use the old dispatching technique, because they all share the same comment. Given that it wasn't a problem before I will leave out this change, and we should check how to get rid of this technique at all.
Comment on attachment 8857987 [details]
Bug 1335778 - Make switchToFrame synchronous for in-process frames.

https://reviewboard.mozilla.org/r/130020/#review133764
Attachment #8857987 - Flags: review?(ato) → review+
We're sorry, Autoland could not rebase your commits for you automatically. Please manually rebase your commits and try again.

hg error in cmd: hg rebase -s 77d46ee2bced -d 00ab2938b91c: rebasing 390418:77d46ee2bced "Bug 1335778 - Synchronize navigate() for trigger methods using generators. r=ato"
merging testing/marionette/listener.js
rebasing 390419:dc51e67d50dd "Bug 1335778 - Harden test_page_timeout_fail against intermittent failures. r=ato"
rebasing 390420:e7f6c0d857b3 "Bug 1335778 - Make element click command check for page load and wait. r=ato"
merging testing/marionette/listener.js
warning: conflicts while merging testing/marionette/listener.js! (edit, then use 'hg resolve --mark')
unresolved conflicts (see hg resolve, then hg rebase --continue)
As it looks like my patch on bug 1356000 caused this merge conflict. Given that I cannot rebase against autoland with a unified repository I have to wait once the other patch got merged to mozilla-central.
Pushed by hskupin@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/b5def807ceda
Synchronize navigate() for trigger methods using generators. r=ato
https://hg.mozilla.org/integration/autoland/rev/2cfb401403e3
Harden test_page_timeout_fail against intermittent failures. r=ato
https://hg.mozilla.org/integration/autoland/rev/938d57f025ea
Make element click command check for page load and wait. r=ato
https://hg.mozilla.org/integration/autoland/rev/7d8f800a7229
Add debug logging output for page load events. r=ato
Depends on: 1357634
Whiteboard: [spec][17/04]
Depends on: 1353220
Blocks: 937659
(In reply to Sebastian Hengst [:aryx][:archaeopteryx] (needinfo on intermittent or backout) from comment #46)
> Backed out for triggering bug 1357650 as permafailure on Windows 8:

This is really annoying to see failing on autoland / central for Windows 8. Especially that I run try builds with the all parameter for platforms, but no jobs were actually run for Windows 8.

I know that we have limited resources but landing a patch, and getting it backed out will actually cause more machine hours as simply pushing a try build. Joel, are there plans to improve Windows 8 coverage on try?

To investigate this problem I will do a separate try push for Windows 8 in a bit which will have more debug logging enabled. Hopefully this helps in figuring out what's wrong.
Flags: needinfo?(hskupin) → needinfo?(jmaher)
Btw what I assume here is that we somehow have a problem in removing the minidump folder. Maybe it is still in use by another process, and as such we cannot remove it. Then when Marionette checks for possible crashes it would still find this report and bail out with a real crash, and not a content crash.

Still not sure, why this is caused by the patches on this bug.
There is no short term plan to make this better- Our ideal state is to run as much as we can in windows 10 in AWS- that has a couple months of hacking, and will probably be realistically online closer to July.  That doesn't mean some tests will not start moving to windows 10 sooner.  We also have new hardware ordered (new refresh of hardware, not more ofthe same old stuff), but that depends on a new data center being available which is a September or later.

The only shortcut here is to turn off tests on windows 8 (for example non-e10s)
Flags: needinfo?(jmaher)
(In reply to Henrik Skupin (:whimboo) from comment #48)
> Btw what I assume here is that we somehow have a problem in removing the
> minidump folder. Maybe it is still in use by another process, and as such we
> cannot remove it. Then when Marionette checks for possible crashes it would
> still find this report and bail out with a real crash, and not a content
> crash.
> 
> Still not sure, why this is caused by the patches on this bug.

A kinda unfortunate situation. :/ But it's an interesting point that those machines seem to be slow, which might then cause those problems. A question is how slow exactly, and if it is worth spending dozen of hours to get failures fixed which only occur on this platform. Also please note that submitting job requests for Windows 8 to try take forever to be run.

Not sure if we should turn off Mn jobs for Windows 8 given that all other harnesses make use of Marionette anyway. 

Andreas, which platforms were affected for you for crashes lately? Was it also only Windows 8?
Flags: needinfo?(ato)
(In reply to Henrik Skupin (:whimboo) from comment #50)
> Andreas, which platforms were affected for you for crashes lately? Was it
> also only Windows 8?

The crashes I’ve been seeing were on Windows 7 VMs.

FWIW I don’t think disabling Mn on Windows 8 would severely impact our test coverage.  This is for the same reason that I think it’s also a waste running Mn on Android debug.
Flags: needinfo?(ato)
(In reply to Andreas Tolfsen ‹:ato› from comment #51)
> The crashes I’ve been seeing were on Windows 7 VMs.

I should modify this statement a bit: My patches haven’t landed on inbound yet and I haven’t explicitly triggered Windows 8 jobs on try for them.
Depends on: 1357696
Ok, thanks. So looks like we are not only Windows 8 affected here, and I will continue to investigate this issue on that platform. It might still take a while. :/

Beside that we got two other regressions in our unit tests which are bug 1357696 and bug 1357634. After lots of re-runs I can locally see that they are failing because the unload timer is hit after the 200ms. I would assume that the fast execution of our tests with the additional page unloads and loads, maybe cause Firefox to trigger the GC which delays a page unload.

There are two options from what I can see now:

1) Increase the timeout value, but that would mean an additional delay for clicks which do not trigger a page load.

2) Adding/removing our own onbeforeunload handler for click commands, which will fire before the page gets actually unloaded.

I will first check options 2)
Registering the beforeunload handler can indeed drastically improve our situation here as it can be seen by this log:

1492700342857: setup beforeunload listener
[..]
1492700342875: Handled DOM event "beforeunload" for "http://127.0.0.1:56906/clicks.html"
1492700343280: Handled DOM event "pagehide" for "http://127.0.0.1:56906/clicks.html"

So between registering it and when it fires are only ~20ms, while for page load there are ~420ms.

Interestingly I have already added the beforeunload case in handleEvent, but it looks like we have to explicitly add it. If we don't do it, it doesn't fire.

Also I seem to have to add it via `curContainer.frame.addEventListener()` and not via `addEventListener()` only as what I did for the other listeners. I check now why there is such a difference, and if I should better do it this way for all listeners.
And in the case of failure the dump output is kinda suspicious:

1492700350840	Marionette	TRACE	90 -> [0,4,"clickElement",{"id":"31ecc0f2-1304-4844-af35-a28ed6ff9b89"}]
[Child 18307] WARNING: 'NS_FAILED(rv)', file /builds/slave/m-in-m64-d-0000000000000000000/build/src/caps/BasePrincipal.cpp, line 380
1492700351408Handled DOM event "pagehide" for "http://127.0.0.1:56906/clicks.html"
++DOMWINDOW == 105 (0x1423ac000) [pid = 18307] [serial = 105] [outer = 0x108738800]
[Child 18307] WARNING: 'aRv.Failed()', file /builds/slave/m-in-m64-d-0000000000000000000/build/src/dom/ipc/StructuredCloneData.cpp, line 123
1492700351435	Marionette	TRACE	90 <- [1,4,null,{}]

The unload listener is not registered at all, which means the .then() block after the trigger is not executed.
The problem as mentioned in the last comment happens because we run into the escape logic of interaction.flushEventLoop():

https://dxr.mozilla.org/mozilla-central/rev/20dff607fb88ee69135a280bbb7f32df75a86237/testing/marionette/interaction.js#296

Right now it uses an "unload" listener to escape the Promise. The problem is that if the animation frame is not happening before the unload, our click + page load logic won't work anymore.

I will have to check, if we can remove it given that we wait anyway for it in listener.js.
I’ve had to fix a similar issue in my ever-growing changeset to use the WebDriver Element Click algorithm in Marionette.  In my case the window disappears as the result of a click, which means I need to check if it has been discarded/closed.

See my patch here:

    https://reviewboard.mozilla.org/r/131234/diff/3#index_header
And hereby the problem is that `flushEventLoop` is rejecting the promise by assigning `reject()` as callback for the registered listener. As result we run into an issue with sendError() by passing in an event and not an error instance. This actually does not raise any failure on the driver side, and is silently ignored.
(In reply to Andreas Tolfsen ‹:ato› from comment #57)
> I’ve had to fix a similar issue in my ever-growing changeset to use the
> WebDriver Element Click algorithm in Marionette.  In my case the window
> disappears as the result of a click, which means I need to check if it has
> been discarded/closed.

That's not what affects my patch here, as you can also see in the last comment. I still think that I have to refactor or maybe even remove this method. It's something I will continue tomorrow.

At least commenting out the call to flushEventLoop() and running all click tests 100 times, I have seen no single failure. So I get the impression this method is not necessary.
When you say that by requesting an animation frame, the “click + page load logic won’t work anymore”, what do you mean by that exactly?

As I hinted in comment #57, I suspect the issue is that when a click causes navigation, the window is discarded and that calling requestAnimationFrame never returns.  interaction.flushEventLoop uses the unload event to escape if the document unloads, but this can be changed to beforeunload if that helps.

> That's not what affects my patch here, as you can also see in the last comment.

Actually I can’t see that (-:  In fact, I don’t understand comment #58 at all: it sounds like your code needs to take into account that flushEventLoop might reject instead of passing on the rejection.

> At least commenting out the call to flushEventLoop() and running all click tests 100 times, I have seen no single failure. So I get the impression this method is not necessary.

We probably don’t have a good test for ensuring the event loop has spun enough times to process DOM events generated by the click before returning.
(In reply to Andreas Tolfsen ‹:ato› from comment #60)
> When you say that by requesting an animation frame, the “click + page load
> logic won’t work anymore”, what do you mean by that exactly?

Please see comment 58. A reject is silently ignored in `sendError()` if it's of type `Event`. Maybe a refactoring of `flushEventLoop` might help.

> As I hinted in comment #57, I suspect the issue is that when a click causes
> navigation, the window is discarded and that calling requestAnimationFrame
> never returns.  interaction.flushEventLoop uses the unload event to escape
> if the document unloads, but this can be changed to beforeunload if that
> helps.

Right. So if a content window is closed, the message manager is dead and with that no code in listener.js or any other loaded file is executed anymore. So escaping `flushEventLoop` will not help. A fix is needed in proxy.js, as stated in bug 1223277. On the other side a chrome window can be closed when `clickElement` is called in driver.js. In such a case Marionette should not hang, but the next issued command except `switchTo*()` has to raise a NoSuchWindowError.

> Actually I can’t see that (-:  In fact, I don’t understand comment #58 at
> all: it sounds like your code needs to take into account that flushEventLoop
> might reject instead of passing on the rejection.

At least it rejects wrongly. 'reject()' should get an error as parameter but not `Event`. Or something is broken in `sendError()` as mentioned above. 

Anyway, for the examples in `test_click.py` we should never see a call of reject()!

> We probably don’t have a good test for ensuring the event loop has spun
> enough times to process DOM events generated by the click before returning.

Which exact event are you waiting for? Maybe a `click` event? Why not registering this instead? It would make way more sense given that a `requestAnimationFrame` event could still sneak into between the click, and the actually received click event.
(In reply to Henrik Skupin (:whimboo) from comment #61)
> (In reply to Andreas Tolfsen ‹:ato› from comment #60)
> > When you say that by requesting an animation frame, the “click + page load
> > logic won’t work anymore”, what do you mean by that exactly?
> 
> Please see comment 58. A reject is silently ignored in `sendError()` if it's
> of type `Event`. Maybe a refactoring of `flushEventLoop` might help.

But what does this _mean_?  flushEventLoop is rejecting because it doesn’t end up flushing the event loop.  The caller should handle that.

> > As I hinted in comment #57, I suspect the issue is that when a click causes
> > navigation, the window is discarded and that calling requestAnimationFrame
> > never returns.  interaction.flushEventLoop uses the unload event to escape
> > if the document unloads, but this can be changed to beforeunload if that
> > helps.
> 
> Right. So if a content window is closed, the message manager is dead and
> with that no code in listener.js or any other loaded file is executed
> anymore. So escaping `flushEventLoop` will not help. A fix is needed in
> proxy.js, as stated in bug 1223277. On the other side a chrome window can be
> closed when `clickElement` is called in driver.js. In such a case Marionette
> should not hang, but the next issued command except `switchTo*()` has to
> raise a NoSuchWindowError.

I meant when the window has been discarded, in the HTML meaning of the term; not that the physical OS window was closed.  When calling window.requestAnimationFrame immediately following a click-induced navigation, that request might never be fulfilled because the navigation interrupts the request, meaning the callback given to it will never be called.  If it is the callback that is responsible for resolving the promise, then that will leave flushEventLoop hanging.

The intention is to wait until the event loop has spun enough times to process the DOM events generated by the previous steps, e.g. DOM events fired as a direct consequence of interaction or as indirect consequences due to the state the document is in.  When a click causes navigation, there is no need to do that because at that point we don’t care about the event loop from the point of view of WebDriver anymore because navigation implies the event loop will be cleared for us.

> > Actually I can’t see that (-:  In fact, I don’t understand comment #58 at
> > all: it sounds like your code needs to take into account that flushEventLoop
> > might reject instead of passing on the rejection.
> 
> At least it rejects wrongly. 'reject()' should get an error as parameter but
> not `Event`. Or something is broken in `sendError()` as mentioned above. 

Again, I don’t understand what sendError has to do with this.  Why is the promise returned by flushEventLoop being passed to sendError?

> > We probably don’t have a good test for ensuring the event loop has spun
> > enough times to process DOM events generated by the click before returning.
> 
> Which exact event are you waiting for? Maybe a `click` event? Why not
> registering this instead? It would make way more sense given that a
> `requestAnimationFrame` event could still sneak into between the click, and
> the actually received click event.

Interaction can trigger arbitrary events to occur.  We’re not primarily concerned with the click event, but with the causation that may incur from it.

Indeed, there is no guarantee that synthesising a click at a point will generate a click at all.
(In reply to Andreas Tolfsen ‹:ato› from comment #62)
> But what does this _mean_?  flushEventLoop is rejecting because it doesn’t
> end up flushing the event loop.  The caller should handle that.

The caller here is clickElement(), so my question is now why do I have to expect clickElement failing if the unload event is coming through first? That's what I do not understand. 

> I meant when the window has been discarded, in the HTML meaning of the term;
> not that the physical OS window was closed.  When calling
> window.requestAnimationFrame immediately following a click-induced
> navigation, that request might never be fulfilled because the navigation

This is not what you are doing here. You click first, and then request the animation frame! That's what I already mentioned when this code was under review.

> interrupts the request, meaning the callback given to it will never be
> called.  If it is the callback that is responsible for resolving the
> promise, then that will leave flushEventLoop hanging.

I think we are mixing up different issues here. On this bug I actually do not care about the closing window case. Maybe we should better move this discussion to bug 1223277.

> Again, I don’t understand what sendError has to do with this.  Why is the
> promise returned by flushEventLoop being passed to sendError?

It's not the promise which gets passed but the argument reject is called with is the problem. Given that you directly assign the listener of the DOM event to the reject() callback. At least we should have used a wrapper here.
Andreas and I discussed this and I'm going to do the two things:

1) Get rid of the race condition when the unload event is triggered before the requested animation frame
2) Check if we can move the logic from flushEventLoop into listener.js because we have duplicated code regarding unload
So while further investigating this issue I pushed another try with latest changes:

https://treeherder.mozilla.org/#/jobs?repo=try&revision=40d4d632f35e50b722328384c95dc6e7498d03b8&selectedJob=93793589

As it can be seen there is a problem in test_should_be_able_to_carry_on_working_if_the_frame_is_deleted_from_under_us now. First investigations of this test have shown that there is a race condition in switchToFrame(), which I will get fixed with an upcoming commit. 

But then we fail now in detecting that the currently selected frame is getting closed. As result we hang and do not return. Once I have a solution for that I will update the bug.
Checking if the currently selected frame is getting closed, cannot be done via an event listener. Instead an observer notification has to be used. Given that our content script runs under privileged scope, we can listen for `outer-window-destroyed` notifications. If the outer window ids are the same, the page load listener can early escape the load.

I submitted another try build which is hopefully green now:
https://treeherder.mozilla.org/#/jobs?repo=try&revision=c4773609463232a1a72a563b9c96462a32818dd9
Comment on attachment 8857987 [details]
Bug 1335778 - Make switchToFrame synchronous for in-process frames.

Strangely Mozreview didn't detect this commit as new. So it hasn't been reviewed yet.
Attachment #8857987 - Flags: review+ → review?(ato)
Comment on attachment 8857986 [details]
Bug 1335778 - Make element click command check for page load and wait.

Please review this patch again for the changes which have been made until you reviewed it the last time. Thanks.
Attachment #8857986 - Flags: review+ → review?(ato)
Comment on attachment 8857987 [details]
Bug 1335778 - Make switchToFrame synchronous for in-process frames.

https://reviewboard.mozilla.org/r/130020/#review136512

::: commit-message-e91ed:3
(Diff revision 6)
> +In the case of switching to a non-oop frame, the checkLoad timer is
> +setup which waits for the frame's content being finished loading.

So by “non-oop frame” you actually mean “in-process frame” right?  I think avoiding the double negation would be helpful.

::: commit-message-e91ed:4
(Diff revision 6)
> +Bug 1335778 - Make switchToFrame synchronous for non-oop frames.
> +
> +In the case of switching to a non-oop frame, the checkLoad timer is
> +setup which waits for the frame's content being finished loading.

Suggested improvement:

> … which waits for the frame’s content to finish loading.
Attachment #8857987 - Flags: review?(ato) → review+
Comment on attachment 8857984 [details]
Bug 1335778 - Synchronize navigate() for trigger methods using generators.

https://reviewboard.mozilla.org/r/130014/#review136514
Comment on attachment 8857985 [details]
Bug 1335778 - Harden test_page_timeout_fail against intermittent failures.

https://reviewboard.mozilla.org/r/130016/#review136516
Comment on attachment 8857986 [details]
Bug 1335778 - Make element click command check for page load and wait.

https://reviewboard.mozilla.org/r/130018/#review136520

::: testing/marionette/listener.js:266
(Diff revision 6)
> +    switch (timer) {
> +      // If the page unload timer is raised, ensure to properly stop the load
> +      // listener, and return from the currently active command.
> +      case this.timerPageUnload:
> +        if (!this.seenUnload) {
> +          logger.debug("Canceled page load listener because no navigation has been detected");

This line is a bit long, you should consider breaking it at ~72 characters.

::: testing/marionette/listener.js:324
(Diff revision 6)
>     * @param {number} command_id
>     *     ID of the currently handled message between the driver and listener.
>     * @param {number} pageTimeout
>     *     Timeout in milliseconds the method has to wait for the page being finished loading.
> +   * @param {boolean=} loadEventExpected
> +   *     TODO

Finish this before landing.

::: testing/marionette/listener.js:346
(Diff revision 6)
> +        return;
> +      }
> +
> +      // If requested setup a timer to detect a possible page load
> +      if (useUnloadTimer) {
> +        this.timerPageUnload.initWithCallback(this, "200", Ci.nsITimer.TYPE_ONE_SHOT);

"200" as a string?

::: testing/marionette/listener.js:1174
(Diff revision 6)
> +      } catch (e) {
> +        sendError(new InvalidArgumentError("Malformed URL: " + e.message), command_id);
> +        return;

I guess the inner try…catch takes precedence here?
Attachment #8857986 - Flags: review?(ato) → review+
Comment on attachment 8857987 [details]
Bug 1335778 - Make switchToFrame synchronous for in-process frames.

https://reviewboard.mozilla.org/r/130020/#review136512

> So by “non-oop frame” you actually mean “in-process frame” right?  I think avoiding the double negation would be helpful.

It was hard to read through the code because this `switchToFrame` method is so complex. But yes, as what I got this is for in-process frames. I can definitely switch it.
Comment on attachment 8857986 [details]
Bug 1335778 - Make element click command check for page load and wait.

https://reviewboard.mozilla.org/r/130018/#review136520

> Finish this before landing.

Ups, good catch.

> "200" as a string?

Seemed to work. :D I have updated it.

> I guess the inner try…catch takes precedence here?

Sure, given that it is a catch all.
Pushed by hskupin@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/4a9cc95b8a9f
Make switchToFrame synchronous for in-process frames. r=ato
https://hg.mozilla.org/integration/autoland/rev/5589cde521f8
Synchronize navigate() for trigger methods using generators. r=ato
https://hg.mozilla.org/integration/autoland/rev/0e6dee0ccecf
Harden test_page_timeout_fail against intermittent failures. r=ato
https://hg.mozilla.org/integration/autoland/rev/e084deb550c2
Make element click command check for page load and wait. r=ato
The problem here is that we assume a page load when there is none:

> 1493211403645	Marionette	TRACE	6 -> [0,63,"clickElement",{"id":"5a4d2441-608d-3f44-8acb-f5198ed99757"}]
> Received DOM event "beforeunload" for "http://127.0.0.1:56948/addons/extensions/"
> 1493211403992	addons.xpi	WARN	Add-on restartless-eula@mozqa.com is not correctly signed.
> 1493211403995	addons.xpi	WARN	Download of http://127.0.0.1:56948/addons/extensions/restartless_addon_unsigned.xpi failed: signature is required but missing

So we detect a "beforeunload" event but the page actually never gets unloaded. The click on the XPI (similar to a link to a file:// URL) is not raising `unload` nor `DOMContentLoaded`/`pageshow` events.

Also I think using `beforeunload` was a bit too zealous because if a page has setup a `beforeunload` handler too, it could stop the navigation from happening. As such we would also hang.

So to circumvent all this we should use the `unload` event to actually mark a page load to be started.
Flags: needinfo?(hskupin)
Comment on attachment 8857986 [details]
Bug 1335778 - Make element click command check for page load and wait.

Andreas, please review again. Now the remaining issue with firefox-ui-functional tests is fixed. Thanks.
Attachment #8857986 - Flags: review?(ato)
Comment on attachment 8857986 [details]
Bug 1335778 - Make element click command check for page load and wait.

https://reviewboard.mozilla.org/r/130018/#review136906

::: testing/marionette/harness/marionette_harness/tests/unit/test_click.py:281
(Diff revisions 8 - 9)
> +    def close_notification(self):
> +        try:
> +            with self.marionette.using_context("chrome"):
> +                popup = self.marionette.find_element(
> +                    By.CSS_SELECTOR, '#notification-popup popupnotification')
> +                popup.find_element(By.ANON_ATTRIBUTE,
> +                                   {'anonid': 'closebutton'}).click()
> +        except errors.NoSuchElementException:
> +            pass

Please use double quotes consistently.
(In reply to Andreas Tolfsen ‹:ato› from comment #92)
> Please use double quotes consistently.

So this stops you from giving a r+? It's a minor thing which can be fixed right before pushing the patch.
(In reply to Henrik Skupin (:whimboo) from comment #93)
> So this stops you from giving a r+? It's a minor thing which can be fixed
> right before pushing the patch.

Andreas actually send me his r+ over IRC and might have missed the additional r? flag directly set in Bugzilla. So we are good for a re-push now.
(In reply to Henrik Skupin (:whimboo) from comment #93)
> (In reply to Andreas Tolfsen ‹:ato› from comment #92)
> > Please use double quotes consistently.
> 
> So this stops you from giving a r+? It's a minor thing which can be fixed
> right before pushing the patch.

Oh no, sorry.  I was in a rush and meant to give an r+, but didn’t check that it just got reset on Bugzilla.  It read like an r+ on mozreview.
Autoland is still closed, so I hope this patch will land sometimes during the night.
hg error in cmd: hg push -r tip ssh://hg.mozilla.org/integration/autoland: pushing to ssh://hg.mozilla.org/integration/autoland
searching for changes
remote: adding changesets
remote: adding manifests
remote: adding file changes
remote: added 4 changesets with 10 changes to 8 files
remote: 
remote: 
remote: ************************** ERROR ****************************
remote: Pushing to an APPROVAL REQUIRED tree requires your top changeset comment to include: a=... (or, more accurately, a\S*=...)
remote: *************************************************************
remote: 
remote: 
remote: transaction abort!
remote: rollback completed
remote: pretxnchangegroup.a_treeclosure hook failed
abort: push failed on remote
Pushed by hskupin@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/86011af8aa9d
Make switchToFrame synchronous for in-process frames. r=ato
https://hg.mozilla.org/integration/autoland/rev/9d9edfb9481b
Synchronize navigate() for trigger methods using generators. r=ato
https://hg.mozilla.org/integration/autoland/rev/0a723f1cd5db
Harden test_page_timeout_fail against intermittent failures. r=ato
https://hg.mozilla.org/integration/autoland/rev/fb69df36fb08
Make element click command check for page load and wait. r=ato
Depends on: 1360446
Depends on: 1360466
Dave, we are considering to uplift this patch to current beta for Firefox 54.0. If you find the time could you please check all the web tests you have, which would rely on this feature? If there is no issue, it would make it easier for us to make the decision. Thanks.
Flags: needinfo?(dave.hunt)
it appears this set of changes has a large improvement on AWSY memory:
== Change summary for alert #6237 (as of April 27 2017 05:53 UTC) ==

Improvements:

 15%  Images summary windows10-64-vm opt      7207385.2 -> 6105040.88
 14%  Images summary linux32 opt              5725535.78 -> 4905760
 14%  Images summary linux64 opt              6510398.51 -> 5617985.5
 13%  Images summary windows7-32-vm opt       5684485.21 -> 4939489.66
 11%  Explicit Memory summary windows10-64-vm opt 166018069.38 -> 147756229.91
 10%  JS summary windows10-64-vm opt          193323432.41 -> 173147333.32
  9%  Explicit Memory summary linux64 opt     167417805.67 -> 152958584.62
  8%  Explicit Memory summary linux32 opt     125706775.39 -> 115362313.84
  8%  Explicit Memory summary windows7-32-vm opt 123943118.53 -> 113997647.21
  8%  Resident Memory summary windows10-64-vm opt 660490656.18 -> 607881555.33
  8%  JS summary linux64 opt                  183537032.14 -> 169440154.72
  7%  JS summary windows7-32-vm opt           133651662.89 -> 123869764.86
  6%  JS summary linux32 opt                  127120270.31 -> 118879361.5
  5%  Heap Unclassified summary linux32 opt   41847843.77 -> 39573058.08
  5%  Heap Unclassified summary linux64 opt   54397207.73 -> 51542439.02
  4%  Heap Unclassified summary windows10-64-vm opt 42885385.8 -> 40988705.83
  4%  Resident Memory summary windows7-32-vm opt 424590392.22 -> 406794395.51

For up to date results, see: https://treeherder.mozilla.org/perf.html#/alerts?id=6237

I am a bit concerned how this is the case give that AWSY changed so much for test tooling- this is no big deal, but possibly something to focus on with our test tools as they impact the product we are testing more than we realize.
(In reply to Joel Maher ( :jmaher) from comment #102)
> I am a bit concerned how this is the case give that AWSY changed so much for
> test tooling- this is no big deal, but possibly something to focus on with
> our test tools as they impact the product we are testing more than we
> realize.

I'm inclined to believe this broke AWSY somehow (possibly loading fewer pages?), I'll see if I can repro.
Depends on: 1361302
As mentioned, the first suite I ran this with showed no issues, but it turns out this wasn't an ideal place to start as the same suite does not suffer from this issue due to fairly robust explicit waits. I've since tried running this against the tests for crash stats and have seen that the wait is not happening when selecting an item from a dropdown. I'll attach a reduced script that replicates the issue (requires Python client for Selenium), and the resulting trace level geckodriver log file.
Flags: needinfo?(dave.hunt)
Dave, thank you for testing. As it looks like we cancel the pageunload timer too early. This is because we currently do not wait for the `beforeunload` event, and use this event for detecting a possible page load. With my patch on bug 1357634 we should hopefully have this landed tomorrow.
Depends on: 1361983
Dave, all known dependencies have been fixed now. I would be great if you could give it another try with the upcoming Nightly from today. Thanks.
Flags: needinfo?(dave.hunt)
Sorry for the delay. Other than bug 1363418 I saw no issues with this.
Flags: needinfo?(dave.hunt)
Fantastic to hear! Thank you for the testing.

We will decide later today if we want to uplift it for Firefox 54.0.
We decided to uplift this feature addition to Firefox 54.0. I will have a look now if the patches we need seamlessly apply. If not, and major backport changes are necessary, we will drop the uplift idea.
Here the depenency list:

* The patch for bug 1359053 should be uplifted first
* All patches on this bug except fb69df36fb08 apply (simple backport fix needed due to missing changes from 0280c5b6d617 on mozilla-beta)
* Both patches from bug 1360466 apply cleanly
* Patch from bug 1361302 applies cleanly
* Both patches (incl. skip patch) from bug 1357634 apply cleanly
* Patch on bug 1361983 applies cleanly

Here a try push with all the patches applied:

https://treeherder.mozilla.org/#/jobs?repo=try&revision=0094bed058983ab6e59bd414f4b3f0d1f38a1288
The try run ended-up in bustage because the patch on bug 1355216 doesn't live on mozilla-beta. The fix is easy and a one-line change, so I patched the affected commit fb69df36fb08 and triggered another try build:

https://treeherder.mozilla.org/#/jobs?repo=try&revision=880139443bad030272e046cf9401b65902e11b4d
Latest try build looks fine, I had to push another fix due to a bad merge caused by conflicts:

https://treeherder.mozilla.org/#/jobs?repo=try&revision=99397cd27b4100bd5c446c58d978399a9a4c1f2a

Ryan, is it enough for you to have those commits for an uplift, or would you need backport patches attached to the bug? I would have to create those then.
Flags: needinfo?(ryanvm)
If that Try push is complete, that'll work fine.
Flags: needinfo?(ryanvm)
Great! In this case lets do the uplift of the test-only patch this way. Please note to uplift the patch on bug 1359053 first. Thanks.
Whiteboard: [spec][17/04] → [spec][17/04][checkin-needed-beta][uplift bug 1359053 first]
No longer blocks: 1223277
Depends on: 1223277
Duplicate of this bug: 1371939
Depends on: 1416216
Depends on: 1450876
You need to log in before you can comment on or make changes to this bug.