Closed Bug 1357634 Opened 2 years ago Closed 2 years ago

[e10s] Intermittent test_click.py TestClick.test_click_number_link | NoSuchElementException: Unable to locate element: username (because beforeunload is not taken into account)

Categories

(Testing :: Marionette, defect)

Version 3
defect
Not set

Tracking

(firefox-esr52 unaffected, firefox53 unaffected, firefox54 fixed, firefox55 fixed)

RESOLVED FIXED
mozilla55
Tracking Status
firefox-esr52 --- unaffected
firefox53 --- unaffected
firefox54 --- fixed
firefox55 --- fixed

People

(Reporter: intermittent-bug-filer, Assigned: whimboo)

References

Details

(Keywords: intermittent-failure, Whiteboard: [spec][17/04][stockwell fixed])

Attachments

(1 file, 1 obsolete file)

The steps look fine:

1492570419436	Marionette	TRACE	374 -> [0,3,"findElement",{"using":"link text","value":"333333"}]
1492570419450	Marionette	TRACE	374 <- [1,3,null,{"value":{"element-6066-11e4-a52e-4f735466cecf":"fc8cf1dc-6f3e-5d40-a642-95b16574241d","ELEMENT":"fc8cf1dc-6f3e-5d40-a642-95b16574241d"}}]
1492570419452	Marionette	TRACE	374 -> [0,4,"clickElement",{"id":"fc8cf1dc-6f3e-5d40-a642-95b16574241d"}]
1492570420145	Marionette	TRACE	374 <- [1,4,null,{}]
++DOMWINDOW == 20 (0x1106cb000) [pid = 2675] [serial = 32] [outer = 0x112891000]
1492570420160	Marionette	TRACE	374 -> [0,5,"findElement",{"using":"id","value":"username"}]

But sadly due to bug 1357407 we don't see the page load events. Usually `clickElement` should have waited for the page being loaded.

It's a regression in a unit test caused by my changes from bug 1335778. The volume is kinda low for now. So I would like to wait for some more examples of this failure before continuing to investigate.
Blocks: 1335778
Keywords: regression
:whimboo, this occurred 20 times yesterday, I think the volume is high enough to work on this :)
Whiteboard: [stockwell needswork]
The patch on bug 1335778 was backed out yesterday. So this should have stopped this failure from happening.
Status: NEW → RESOLVED
Closed: 2 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla55
Whiteboard: [stockwell needswork] → [stockwell fixed]
Looks like that it is still around but with a low frequency now:

https://treeherder.mozilla.org/#/jobs?repo=autoland&bugfiler&selectedJob=94750326
Status: RESOLVED → REOPENED
Keywords: regression
Resolution: FIXED → ---
Whiteboard: [stockwell fixed] → [stockwell needswork]
It's very strange that the click doesn't cause a navigation:

1493299007208	Marionette	TRACE	374 -> [0,4,"clickElement",{"id":"8f744058-5aa7-4c32-86cd-43d0f09893f1"}]
1493299008420	Marionette	DEBUG	Canceled page load listener because no navigation has been detected
1493299008638	Marionette	TRACE	374 <- [1,4,null,{}]

Which means after 1.4s there is still no sight of a navigation visible. Looks like this is only happening for debug builds.
Assignee: nobody → hskupin
Status: REOPENED → ASSIGNED
Summary: Intermittent test_click.py TestClick.test_click_number_link | NoSuchElementException: Unable to locate element: username → [e10s] Intermittent test_click.py TestClick.test_click_number_link | NoSuchElementException: Unable to locate element: username
See Also: → 1357696
See Also: → 1360206
Comment on attachment 8862735 [details]
Bug 1357634 - Temporarily skip test_click_number_link and test_clicking_on_a_multiline_link for intermittent failures.

https://reviewboard.mozilla.org/r/134604/#review137590
Attachment #8862735 - Flags: review+
We're sorry, Autoland could not rebase your commits for you automatically. Please manually rebase your commits and try again.

hg error in cmd: hg rebase -s 66cbe191d7e6 -d 42e64e0c808f: rebasing 392321:66cbe191d7e6 "Bug 1357634 - Temporarily skip test_click_number_link and test_clicking_on_a_multiline_link for intermittent failures. r=Tomcat" (tip)
merging testing/marionette/harness/marionette_harness/tests/unit/test_click.py
warning: conflicts while merging testing/marionette/harness/marionette_harness/tests/unit/test_click.py! (edit, then use 'hg resolve --mark')
unresolved conflicts (see hg resolve, then hg rebase --continue)
We're sorry, Autoland could not rebase your commits for you automatically. Please manually rebase your commits and try again.

hg error in cmd: hg rebase -s 66cbe191d7e6 -d 42e64e0c808f: rebasing 392321:66cbe191d7e6 "Bug 1357634 - Temporarily skip test_click_number_link and test_clicking_on_a_multiline_link for intermittent failures. r=Tomcat" (tip)
merging testing/marionette/harness/marionette_harness/tests/unit/test_click.py
warning: conflicts while merging testing/marionette/harness/marionette_harness/tests/unit/test_click.py! (edit, then use 'hg resolve --mark')
unresolved conflicts (see hg resolve, then hg rebase --continue)
Pushed by hskupin@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/f9e4d7e4253e
Temporarily skip test_click_number_link and test_clicking_on_a_multiline_link for intermittent failures. r=Tomcat
I got this failure triggered locally now. To investigate the amount of time it needs for a pagehide/unload to happen, I commented out the call to loadListener.stop() and sendOk(). The result is the following:

1493368900377	Marionette	TRACE	40 -> [0,4,"clickElement",{"id":"3cc7fe3d-2b72-394e-bcb6-fe607dd2346b"}]
1493368900573	Marionette	DEBUG	Flushed event loop.
1493368900594	Marionette	DEBUG	Flushed event loop.
1493368900777	Marionette	DEBUG	Canceled page load listener because no navigation has been detected
1493368901026	Marionette	DEBUG	Received DOM event "pagehide" for "http://127.0.0.1:50065/clicks.html"
1493368901027	Marionette	DEBUG	Received DOM event "unload" for "http://127.0.0.1:50065/clicks.html"

It means that after returning from the `flushEventLoop` call, it takes up to 450ms until the first event is fired.
I think we have to bring back the `beforeunload` listener:

1493369406323	Marionette	TRACE	110 -> [0,4,"clickElement",{"id":"93df523e-749d-0d4d-a2da-9695029c07fc"}]
1493369406532	Marionette	DEBUG	Flushed event loop.
1493369407348	Marionette	DEBUG	Received DOM event "beforeunload" for "http://127.0.0.1:50349/clicks.html"
1493369407348	Marionette	DEBUG	Flushed event loop.
1493369407356	Marionette	DEBUG	Canceled page load listener because no navigation has been detected
1493369407362	Marionette	DEBUG	Received DOM event "pagehide" for "http://127.0.0.1:50349/clicks.html"
1493369407362	Marionette	DEBUG	Received DOM event "unload" for "http://127.0.0.1:50349/clicks.html"

But we should not take is as measure that the page load is actually happening. Instead we might want to increase the timeout for the unload timer, so it can wait a bit longer for the `unload` event.

With such a change we wouldn't cause any delay in test execution for normal clicks which do not cause a page load, but only delay those cases when a `beforeunload` handler is setup, and which cancels the page load.
And that attempt works:

1493370050924	Marionette	TRACE	110 -> [0,4,"clickElement",{"id":"52559715-f64e-fe4c-8d00-b5162edc2741"}]
1493370051152	Marionette	DEBUG	Flushed event loop.
1493370051194	Marionette	DEBUG	Received DOM event "beforeunload" for "http://127.0.0.1:50650/clicks.html"
1493370051194	Marionette	DEBUG	Flushed event loop.
1493370051507	Marionette	DEBUG	Received DOM event "pagehide" for "http://127.0.0.1:50650/clicks.html"
1493370051508	Marionette	DEBUG	Received DOM event "unload" for "http://127.0.0.1:50650/clicks.html"

`beforeunload` got fired 42ms after `flushEventLoop` returned. It extended the timeout for PageUnload by another 5000ms, which I have chosen because its the default value for the implicit wait duration. Then `pagehide` got fired 355ms after `flushEventLoop` returned. 

I think that we should leave the timeout value of 200ms for waiting of the `beforeunload` event. Lower values like 100ms could be a risk. Also by having changed `flushEventLoop` from `unload` to `beforeunload`, we basically return early from this call in every case. So overall we should even have reduced the time to send the final response to the driver.
Comment on attachment 8862735 [details]
Bug 1357634 - Temporarily skip test_click_number_link and test_clicking_on_a_multiline_link for intermittent failures.

https://reviewboard.mozilla.org/r/134604/#review137632

::: testing/marionette/harness/marionette_harness/tests/unit/test_click.py:10
(Diff revision 3)
>  import urllib
>  
>  from marionette_driver.by import By
>  from marionette_driver import errors
>  
> -from marionette_harness import MarionetteTestCase, run_if_e10s
> +from marionette_harness import MarionetteTestCase, run_if_e10s, skip

This commit has to be removed from this series before landing.
Attachment #8862792 - Flags: review?(ato)
Whiteboard: [stockwell needswork] → [spec][17/04][stockwell needswork]
Comment on attachment 8862792 [details]
Bug 1357634 - Use 'beforeunload' to detect a possible page load.

https://reviewboard.mozilla.org/r/134684/#review137654

::: commit-message-9a2c8:3
(Diff revision 1)
> +Using only the 'unload' event to detect a page load is unreliable because
> +of a possible slowness of the application. By also using 'beforeunload' a

Why is it unreliable?

::: commit-message-9a2c8:8
(Diff revision 1)
> +There is an expected additional delay for those cases when the unload gets
> +canceled due to a registered 'beforeunload' listener on the page itself.

What are you implying?  That the document navigation gets cancelled in beforeunload, or that beforeunload takes a long time?

In the first case, we would expect never to see an unload at all because navigation would not happen.

In the latter, it’s not obvious to me what we should do.

::: commit-message-9a2c8:10
(Diff revision 1)
> +Given that this will happen very rarely, reliability should be considered
> +more important, and so a delay of 5s should make that happen.

Why will it only happen rarely?  Generally, I’m lacking “the story” in this commit message.  By reading it, I don’t really understand what the issue is.

::: testing/marionette/listener.js:214
(Diff revision 1)
> +      case "beforeunload":
> +        // An upcoming page load has been detected. Give the 'unload' event a bit
> +        // more time due to a possible slowness of the application. In the case
> +        // when the page load gets aborted due to a registered 'beforeunload'
> +        // handler, the load listener will wait until the unload timer times out.
> +        this.timerPageUnload.cancel();
> +        this.timerPageUnload.initWithCallback(this, 5000, Ci.nsITimer.TYPE_ONE_SHOT)
> +        break;

The way I read this, the notify function gets called again after five seconds.

What if the timerPageUnload that has already been registered fires before this?  Does that cancel this?

Also, what if the document’s beforeunload cancels the navigation?  Does your implementation handle that?
Attachment #8862792 - Flags: review?(ato) → review+
Comment on attachment 8862792 [details]
Bug 1357634 - Use 'beforeunload' to detect a possible page load.

https://reviewboard.mozilla.org/r/134684/#review137654

> Why is it unreliable?

Because we run into timeouts for the page unload timer. The unload events are getting fired way too late. beforeunload is fired right away.

> What are you implying?  That the document navigation gets cancelled in beforeunload, or that beforeunload takes a long time?
> 
> In the first case, we would expect never to see an unload at all because navigation would not happen.
> 
> In the latter, it’s not obvious to me what we should do.

This paragraph is for the first case, means a registered handler for beforeunload is called and rejects the unload. In such a case the unload timer will fire after 5s, and we assume there is no page load.

If beforeunload takes too long we simply have no chance in detecting a page load. But that should really not be possible as all the test results so far have been shown. Even in the worst case with the unload event taking more than 1s, beforeunload is fired immediately after we return from flushEventLoop.

Does this answer your question?

> Why will it only happen rarely?  Generally, I’m lacking “the story” in this commit message.  By reading it, I don’t really understand what the issue is.

So what I mean here is that tests which specifically test the path to abort a page load with a registered beforeunload handler should be rare. 

If it's confusing I can remove this paragraph.

> The way I read this, the notify function gets called again after five seconds.
> 
> What if the timerPageUnload that has already been registered fires before this?  Does that cancel this?
> 
> Also, what if the document’s beforeunload cancels the navigation?  Does your implementation handle that?

> What if the timerPageUnload that has already been registered fires before this?  Does that cancel this?

Please see line 219 which cancels the currently running page unload timer.

> Also, what if the document’s beforeunload cancels the navigation?  Does your implementation handle that?

As said earlier (which was unclear to you) the page unload timer will fire again after 5s, and if no unload event has been seen by that time, we assume there is no page load, and we call sendOk().
Comment on attachment 8862792 [details]
Bug 1357634 - Use 'beforeunload' to detect a possible page load.

https://reviewboard.mozilla.org/r/134684/#review137654

> > What if the timerPageUnload that has already been registered fires before this?  Does that cancel this?
> 
> Please see line 219 which cancels the currently running page unload timer.
> 
> > Also, what if the document’s beforeunload cancels the navigation?  Does your implementation handle that?
> 
> As said earlier (which was unclear to you) the page unload timer will fire again after 5s, and if no unload event has been seen by that time, we assume there is no page load, and we call sendOk().

Something which I just noticed is that we do not check if the unload timer should be used or not. It's still optional for navigation requests. So it means that with the above change we would also fire a unload timer with 5s for get/backward/forward/refresh. It may not cause issues, but I should fix that.
Blocks: 1357696, 1360206
See Also: 1357696, 1360206
Comment on attachment 8862792 [details]
Bug 1357634 - Use 'beforeunload' to detect a possible page load.

https://reviewboard.mozilla.org/r/134684/#review137654

> Something which I just noticed is that we do not check if the unload timer should be used or not. It's still optional for navigation requests. So it means that with the above change we would also fire a unload timer with 5s for get/backward/forward/refresh. It may not cause issues, but I should fix that.

Thinking more about my last comment showed me that the current behavior is correct in ALL the cases, and that it also fixes a hang in the page load code for the other navigation commands like `get` itself. It will happen in those cases when `beforeunload` is raised, but the actual page navigation gets canceled. I think that I will add additional testcases for this.
Comment on attachment 8862792 [details]
Bug 1357634 - Use 'beforeunload' to detect a possible page load.

https://reviewboard.mozilla.org/r/134684/#review137654

> Thinking more about my last comment showed me that the current behavior is correct in ALL the cases, and that it also fixes a hang in the page load code for the other navigation commands like `get` itself. It will happen in those cases when `beforeunload` is raised, but the actual page navigation gets canceled. I think that I will add additional testcases for this.

Let me revoke my last comment. It's not part of the spec to return early for eg. a `get` request which cancels the page load via a registered `beforeunload` handler. I will restrict this code to only get run when the command opted-in.
Comment on attachment 8862792 [details]
Bug 1357634 - Use 'beforeunload' to detect a possible page load.

https://reviewboard.mozilla.org/r/134684/#review137654

> Let me revoke my last comment. It's not part of the spec to return early for eg. a `get` request which cancels the page load via a registered `beforeunload` handler. I will restrict this code to only get run when the command opted-in.

So what I just noticed while creating a test for `onbeforeunload` any command is returning immediately, so that the opened alert can be handled. As such the comment in the code needs an update, and I will still add two more tests for `get` and `click`.
Attachment #8862735 - Attachment is obsolete: true
Andreas, please check the latest changes, and remaining issues. I think that we can close them all. Thanks.
Flags: needinfo?(ato)
Comment on attachment 8862792 [details]
Bug 1357634 - Use 'beforeunload' to detect a possible page load.

https://reviewboard.mozilla.org/r/134684/#review138836
Comment on attachment 8862792 [details]
Bug 1357634 - Use 'beforeunload' to detect a possible page load.

https://reviewboard.mozilla.org/r/134684/#review138838

Please clarify commit message.  Otherwise, everything is good.
Comment on attachment 8862792 [details]
Bug 1357634 - Use 'beforeunload' to detect a possible page load.

https://reviewboard.mozilla.org/r/134684/#review137654

> This paragraph is for the first case, means a registered handler for beforeunload is called and rejects the unload. In such a case the unload timer will fire after 5s, and we assume there is no page load.
> 
> If beforeunload takes too long we simply have no chance in detecting a page load. But that should really not be possible as all the test results so far have been shown. Even in the worst case with the unload event taking more than 1s, beforeunload is fired immediately after we return from flushEventLoop.
> 
> Does this answer your question?

Thanks for clarifying.  I would love for the commit message to be clearer on this.  Especially “for those cases when the unload gets canceled due to a registered 'beforeunload' listener on the page itself” is unclear.  An document does not get cancelled _because_ of the beforeunload listener, but because the beforeunload listener _aborts the navigation_, as you said in your clarification.

> So what I mean here is that tests which specifically test the path to abort a page load with a registered beforeunload handler should be rare. 
> 
> If it's confusing I can remove this paragraph.

I get what you’re saying now.  Maybe it would make sense to rephrase this:

> In the case when a navigation attempt is aborted in the document’s
> beforeunload, the timeout waiting for the document to start navigating
> is increased by 5000 ms to ensure a possible load event is not missed.
> In the common case such an event should occur pretty soon after
> interaction, and we optimise for this.

> So what I just noticed while creating a test for `onbeforeunload` any command is returning immediately, so that the opened alert can be handled. As such the comment in the code needs an update, and I will still add two more tests for `get` and `click`.

This all sounds good.
Flags: needinfo?(ato)
Comment on attachment 8862792 [details]
Bug 1357634 - Use 'beforeunload' to detect a possible page load.

https://reviewboard.mozilla.org/r/134684/#review137654

> Thanks for clarifying.  I would love for the commit message to be clearer on this.  Especially “for those cases when the unload gets canceled due to a registered 'beforeunload' listener on the page itself” is unclear.  An document does not get cancelled _because_ of the beforeunload listener, but because the beforeunload listener _aborts the navigation_, as you said in your clarification.

Yeah, I missed to update this part of the commit message. I will rephrase it, so it's better understandable.

> I get what you’re saying now.  Maybe it would make sense to rephrase this:
> 
> > In the case when a navigation attempt is aborted in the document’s
> > beforeunload, the timeout waiting for the document to start navigating
> > is increased by 5000 ms to ensure a possible load event is not missed.
> > In the common case such an event should occur pretty soon after
> > interaction, and we optimise for this.

Your message is not completely correct. We will only extend the unload timer when the page load is NOT aborted. Otherwise it looks fine, and I will update. My proposal is:

          // In the case when a document has a beforeunload handler registered,
          // the currently active command will return immediately due to the
          // modal dialog observer in proxy.js.
          // Otherwise the timeout waiting for the document to start navigating
          // is increased by 5000 ms to ensure a possible load event is not missed.
          // In the common case such an event should occur pretty soon after
          // beforeunload, and we optimise for this.
Pushed by hskupin@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/29d676d23989
Use 'beforeunload' to detect a possible page load. r=ato
Status: ASSIGNED → RESOLVED
Closed: 2 years ago2 years ago
Keywords: leave-open
Resolution: --- → FIXED
As it looks like the issue is still not completely fixed:

1493878218361	Marionette	TRACE	374 -> [0,4,"clickElement",{"id":"e31a38ae-c0ef-4939-bf22-3ca7b52c247b"}]
1493878219108	Marionette	DEBUG	Received DOM event "beforeunload" for "http://127.0.0.1:38163/clicks.html"
1493878219318	Marionette	DEBUG	Canceled page load listener because no navigation has been detected
1493878219396	Marionette	TRACE	374 <- [1,4,null,{}]

Somehow we still cancel the page unload timer. I will file a follow-up bug for it.
Summary: [e10s] Intermittent test_click.py TestClick.test_click_number_link | NoSuchElementException: Unable to locate element: username → [e10s] Intermittent test_click.py TestClick.test_click_number_link | NoSuchElementException: Unable to locate element: username (because beforeunload is not taken into account)
No longer blocks: 1362737
Whiteboard: [spec][17/04][stockwell needswork] → [spec][17/04][stockwell fixed]
Depends on: 1411393
You need to log in before you can comment on or make changes to this bug.