Closed Bug 1141519 Opened 5 years ago Closed 4 years ago

Marionette hangs if exceptions are happening in content scope

Categories

(Testing :: Marionette, defect, critical)

defect
Not set
critical

Tracking

(firefox40 wontfix, firefox41 wontfix, firefox42 wontfix, firefox43 wontfix, firefox44 fixed, firefox-esr38 wontfix)

RESOLVED FIXED
mozilla44
Tracking Status
firefox40 --- wontfix
firefox41 --- wontfix
firefox42 --- wontfix
firefox43 --- wontfix
firefox44 --- fixed
firefox-esr38 --- wontfix

People

(Reporter: whimboo, Assigned: automatedtester)

References

Details

(Keywords: hang, pi-marionette-runner, pi-marionette-server, Whiteboard: [qa-automation-blocked])

Attachments

(5 files, 1 obsolete file)

If you have a test which is working with multiple chrome windows, Marionette will completely hang until the defined global timeout if an exception occurs in another chrome window than the default.

Here a simple example with Firefox:

    def test_hang_until_timeout(self):
        with self.marionette.using_context('chrome'):
            menu = self.marionette.find_element(By.ID, 'aboutName')
            menu.click()

            self.marionette.switch_to_window(13)
            self.marionette.find_element(By.ID, 'dek')

Running this code will put Marionette in a hang state.

This bug is blocking us from running majorly any kind of Firefox Desktop test which is using a different window. This applies especially to the update tests.
Whiteboard: [qa-automation-blocked]
To be clear here the exception is raised by looking for the element with the ID 'dek' which not exist in the new window. So an AssertionError is thrown.
The following test case works and doesnt require manual intervention

def test_hang_until_timeout(self):
        with self.marionette.using_context('chrome'):
            current_handle = self.marionette.current_chrome_window_handle
            menu = self.marionette.find_element(By.ID, 'aboutName')
            menu.click()
            handles = self.marionette.chrome_window_handles
            handles.remove(current_handle)
            self.marionette.switch_to_window(handles[0])
            self.marionette.find_element(By.ID, 'dek')
Looking at this, it appears to be something we are doing during the tear down in the base runner. Removing the screenshot code and the page source code makes the ElementNotFoundException visible
The issue appears to be https://dxr.mozilla.org/mozilla-central/source/testing/marionette/client/marionette/marionette_test.py#640 when we call that it errors because, I think,  about: doesnt have a content context available.
This is not related to an about: page, but its a separate chrome window, which does not have any associated tabs. And because of that no window handle.

I tested it by opening another browser window and causing an exception. That works. So this is indeed only happening for non browser chrome windows.
Summary: Marionette hangs until global timeout if exceptions are happening in other chrome windows than the default one → Marionette hangs until global timeout if exceptions are happening in chrome windows other than browser
I have tracked the offending code down to https://dxr.mozilla.org/mozilla-central/source/testing/marionette/marionette-server.js#291... now I need to figure out why we always try cleanup in the content scope and figuring what is the best way to reply to those calls if we can't do that
When I look at the fix, I wonder if we should do the same on startup. There can be situations when addons open their own windows on startup of the application, so we might also fail in those cases similar to that one on shutdown.
/r/5333 - Bug 1141519: Don't switch to content scope for logging the test end; r=jgriffin
/r/5335 - Bug 1141519: added test that puts marionette into a position that can cause hangs when in content scope
/r/5337 - Bug 1141519: error if we are going to try send a message to a frame that we know doesnt exist; r=jgriffin
/r/5339 - Bug 1141519: if we try call marionette calls in the teardown and they don't work we should just carry on; r=jgriffin

Pull down these commits:

hg pull review -r edc485e853c257b9d5fc2d023e34a3b61eab6276
Attachment #8577483 - Flags: review?(jgriffin)
Comment on attachment 8577483 [details]
MozReview Request: bz://1141519/AutomatedTester

https://reviewboard.mozilla.org/r/5331/#review4521

::: testing/marionette/client/marionette/marionette_test.py
(Diff revision 1)
> +            except:

We should avoid naked except clauses, as a general best practice.

::: testing/marionette/marionette-server.js
(Diff revision 1)
> +          this.sendError("Unfortunately can not send call to listener as it does not exist", 500, null, commandId);

Probably we can omit the "Unfortunately"; errors are always unfortunate! :)
Attachment #8577483 - Flags: review?(jgriffin)
Assignee: nobody → dburns
David, would it be possible for you to finish off this patch today?
Comment on attachment 8577483 [details]
MozReview Request: bz://1141519/AutomatedTester

/r/5333 - Bug 1141519: Don't switch to content scope for logging the test end; r=jgriffin
/r/5335 - Bug 1141519: added test that puts marionette into a position that can cause hangs when in content scope
/r/5337 - Bug 1141519: error if we are going to try send a message to a frame that we know doesnt exist; r=jgriffin
/r/5339 - Bug 1141519: if we try call marionette calls in the teardown and they don't work we should just carry on; r=jgriffin

Pull down these commits:

hg pull review -r 9a1e704152d1c4ad64d25e26c363935085747d11
Attachment #8577483 - Flags: review?(jgriffin)
I fixed this yesterday but forgot to press the publish button on mozreview :/
Comment on attachment 8577483 [details]
MozReview Request: bz://1141519/AutomatedTester

https://reviewboard.mozilla.org/r/5331/#review4689

Thanks, lgtm!
Attachment #8577483 - Flags: review?(jgriffin) → review+
David, we should get this released ASAP. Otherwise I won't be able to release new versions of marionette client and driver, which we need to get the Marionette tests running in our CI.
Flags: needinfo?(dburns)
I thought I landed this last week but it appears it didnt. Landed now and will uplift later after the next set of merges
Flags: needinfo?(dburns)
Flags: needinfo?(dburns)
Any updates on this?
It's blocking me from running Marionette update tests for releases (bug 1152460 and bug 1148546).
Blocks: 1152460
For the record, I'm not currently hanging with the following code.
The exception is raised immediately and the script ends.

This is what I'm currently getting:
MacAir driver hg:[default!] $ python ~/moz/temp/test_marionette_bug.py
Traceback (most recent call last):
  File "/Users/armenzg/moz/temp/test_marionette_bug.py", line 16, in <module>
    test_hang_until_timeout()
  File "/Users/armenzg/moz/temp/test_marionette_bug.py", line 14, in test_hang_until_timeout
    client.find_element(By.ID, 'dek')
  File "/Users/armenzg/repos/branches/mozilla-inbound/testing/marionette/driver/marionette_driver/marionette.py", line 1467, in find_element
    response = self._send_message('findElement', 'value', **kwargs)
  File "/Users/armenzg/repos/branches/mozilla-inbound/testing/marionette/driver/marionette_driver/decorators.py", line 36, in _
    return func(*args, **kwargs)
  File "/Users/armenzg/repos/branches/mozilla-inbound/testing/marionette/driver/marionette_driver/marionette.py", line 715, in _send_message
    self._handle_error(response)
  File "/Users/armenzg/repos/branches/mozilla-inbound/testing/marionette/driver/marionette_driver/marionette.py", line 751, in _handle_error
    raise errors.lookup(status)(message, stacktrace=stacktrace)
marionette_driver.errors.NoSuchElementException: NoSuchElementException: Unable to locate element: dek

code:
from marionette import Marionette
from marionette_driver import By
client = Marionette('localhost', port=2828)
client.start_session()

def test_hang_until_timeout():
        with client.using_context('chrome'):
            current_handle = client.current_chrome_window_handle
            menu = client.find_element(By.ID, 'aboutName')
            menu.click()
            handles = client.chrome_window_handles
            handles.remove(current_handle)
            client.switch_to_window(handles[0])
            client.find_element(By.ID, 'dek')

test_hang_until_timeout()
(In reply to Armen Zambrano G. (:armenzg - Toronto) from comment #23)
> For the record, I'm not currently hanging with the following code.
> The exception is raised immediately and the script ends.
> 
> This is what I'm currently getting:
> MacAir driver hg:[default!] $ python ~/moz/temp/test_marionette_bug.py
> Traceback (most recent call last):
>   File "/Users/armenzg/moz/temp/test_marionette_bug.py", line 16, in <module>
>     test_hang_until_timeout()
>   File "/Users/armenzg/moz/temp/test_marionette_bug.py", line 14, in
> test_hang_until_timeout
>     client.find_element(By.ID, 'dek')
>   File
> "/Users/armenzg/repos/branches/mozilla-inbound/testing/marionette/driver/
> marionette_driver/marionette.py", line 1467, in find_element
>     response = self._send_message('findElement', 'value', **kwargs)
>   File
> "/Users/armenzg/repos/branches/mozilla-inbound/testing/marionette/driver/
> marionette_driver/decorators.py", line 36, in _
>     return func(*args, **kwargs)
>   File
> "/Users/armenzg/repos/branches/mozilla-inbound/testing/marionette/driver/
> marionette_driver/marionette.py", line 715, in _send_message
>     self._handle_error(response)
>   File
> "/Users/armenzg/repos/branches/mozilla-inbound/testing/marionette/driver/
> marionette_driver/marionette.py", line 751, in _handle_error
>     raise errors.lookup(status)(message, stacktrace=stacktrace)
> marionette_driver.errors.NoSuchElementException: NoSuchElementException:
> Unable to locate element: dek
> 
> code:
> from marionette import Marionette
> from marionette_driver import By
> client = Marionette('localhost', port=2828)
> client.start_session()
> 
> def test_hang_until_timeout():
>         with client.using_context('chrome'):
>             current_handle = client.current_chrome_window_handle
>             menu = client.find_element(By.ID, 'aboutName')
>             menu.click()
>             handles = client.chrome_window_handles
>             handles.remove(current_handle)
>             client.switch_to_window(handles[0])
>             client.find_element(By.ID, 'dek')
> 
> test_hang_until_timeout()

The code that's troublesome for windows without a content context is called by the marionette test harness, which isn't invoked here.
What do I need to do to reproduce it locally? Run the marionette job from TH?
Add the code above into a test?
(In reply to Armen Zambrano G. (:armenzg - Toronto) from comment #25)
> What do I need to do to reproduce it locally? Run the marionette job from TH?
> Add the code above into a test?

It should be reproducible if the code is in a marionette test.
(In reply to Chris Manchester [:chmanchester] from comment #26)
> (In reply to Armen Zambrano G. (:armenzg - Toronto) from comment #25)
> > What do I need to do to reproduce it locally? Run the marionette job from TH?
> > Add the code above into a test?
> 
> It should be reproducible if the code is in a marionette test.

Could I add it as a firefox-ui-updates test? Thanks for the pointers.
(In reply to Armen Zambrano G. (:armenzg - Toronto) from comment #27)
> (In reply to Chris Manchester [:chmanchester] from comment #26)
> > (In reply to Armen Zambrano G. (:armenzg - Toronto) from comment #25)
> > > What do I need to do to reproduce it locally? Run the marionette job from TH?
> > > Add the code above into a test?
> > 
> > It should be reproducible if the code is in a marionette test.
> 
> Could I add it as a firefox-ui-updates test? Thanks for the pointers.

That should work, yes.
The update in bug 1107706 might have fixed this issue since the "server" code will always returns. The issue is that there was an unhandled exception because we tried to call something that it could never hit. The runner was trying to get info that it couldnt access because the context didnt exist
Flags: needinfo?(dburns)
Blocks: 1148546
Attached file run.txt
At the end of this attachment you can see the code changes that I made.
I hit the timeout and hang issue with this.
This is tested against firefox 38 beta 3.

The run did not complete until I killed both windows.

What changes sdo we need to prevent this from happening on gecko 38 builds? (since they will be esr38 and stay around for long enough).

For context, I'm trying to run update verification on releng's release automation and I want to reduce paper cuts before this runs in production.
(In reply to Armen Zambrano G. (:armenzg - Toronto) from comment #31)
> Created attachment 8592470 [details]
> run.txt
> 
> At the end of this attachment you can see the code changes that I made.
> I hit the timeout and hang issue with this.
> This is tested against firefox 38 beta 3.
> 
> The run did not complete until I killed both windows.
> 
> What changes sdo we need to prevent this from happening on gecko 38 builds?
> (since they will be esr38 and stay around for long enough).
> 
> For context, I'm trying to run update verification on releng's release
> automation and I want to reduce paper cuts before this runs in production.



My patch that "fixes" the issues in the runner is causing a test to run a little quicker and that needs fixing and then we need to see what needs backporting
We will try to work around this in bug 1156475 while this gets figured out.
No longer blocks: 1148546
David, do you see any chance to get this patch finished up and landed? We are waiting for a while now. Sadly we missed the Firefox 38 train here. :(
Status: NEW → ASSIGNED
Flags: needinfo?(dburns)
This has not been a priority for me as there are more pressing things that need doing. Since the landing of bug 1107706 we don't hang but we don't get a decent message back. I need to fix the tests because we now appear to be hitting a little bit of flakiness.
Flags: needinfo?(dburns)
Attachment #8577483 - Attachment is obsolete: true
Attachment #8619714 - Flags: review+
Attachment #8619715 - Flags: review+
Attachment #8619716 - Flags: review+
Attachment #8619717 - Flags: review+
It turns out this is a nasty hang for our firefox-ui-tests especially in Jenkins. It causes Marionette to hang and not exit so that it gets killed after 60min by Jenkins! 

David, I would kinda appreciate if we can get this fix landed in Marionette so we do no longer have the trouble anymore with content vs. chrome.
Flags: needinfo?(dburns)
Summary: Marionette hangs until global timeout if exceptions are happening in chrome windows other than browser → Marionette hangs if exceptions are happening in content scope
(In reply to Henrik Skupin (:whimboo) from comment #43)
> It turns out this is a nasty hang for our firefox-ui-tests especially in
> Jenkins. It causes Marionette to hang and not exit so that it gets killed
> after 60min by Jenkins!

It might be that this patch only addresses some of those issues.  It’s certainly true that the way testing/marionette/listener.js is designed JS errors, in general, are not caught.  Currently the situation is that each command must have a try…catch block defined.

The longer term plan is to use the new dispatch technique we got from bug 1107706 also for the listener.  I’m slowing adapting this as part of bug 1197130, which is a precursor to generalising error catching in content space.

If an uncaught error is thrown in content space it might be better to mark bug 1197130 as a blocker.
I wonder if we could at least fix takeScreenshot easily which might be a central piece of this malfunction, right?
(In reply to Henrik Skupin (:whimboo) from comment #45)
> I wonder if we could at least fix takeScreenshot easily which might be a
> central piece of this malfunction, right?

I just filed bug 1202663 about this.
Attachment #8619716 - Attachment description: MozReview Request: Bug 1141519: Don't switch to content scope for logging the test end; r=jgriffin → MozReview Request: Bug 1141519: Don't switch to content scope for logging the test end; r?jgriffin
Attachment #8619716 - Flags: review+ → review?(jgriffin)
Comment on attachment 8619716 [details]
MozReview Request: Bug 1141519: Don't switch to content scope for logging the test end; r?jgriffin

Bug 1141519: Don't switch to content scope for logging the test end; r?jgriffin
Comment on attachment 8619717 [details]
MozReview Request: Bug 1141519: added test that puts marionette into a position that can cause hangs when in content scope; r?jgriffin

Bug 1141519: added test that puts marionette into a position that can cause hangs when in content scope; r?jgriffin
Attachment #8619717 - Attachment description: MozReview Request: Bug 1141519: added test that puts marionette into a position that can cause hangs when in content scope → MozReview Request: Bug 1141519: added test that puts marionette into a position that can cause hangs when in content scope; r?jgriffin
Attachment #8619717 - Flags: review+ → review?(jgriffin)
Comment on attachment 8619714 [details]
MozReview Request: Bug 1141519: error if we are going to try send a message to a frame that we know doesnt exist; r?jgriffin

Bug 1141519: error if we are going to try send a message to a frame that we know doesnt exist; r?jgriffin
Attachment #8619714 - Attachment description: MozReview Request: Bug 1141519: error if we are going to try send a message to a frame that we know doesnt exist; r=jgriffin → MozReview Request: Bug 1141519: error if we are going to try send a message to a frame that we know doesnt exist; r?jgriffin
Attachment #8619714 - Flags: review+ → review?(jgriffin)
Attachment #8619715 - Attachment description: MozReview Request: Bug 1141519: if we try call marionette calls in the teardown and they don't work we should just carry on; r=jgriffin → MozReview Request: Bug 1141519: if we try call marionette calls in the teardown and they don't work we should just carry on; r?jgriffin
Attachment #8619715 - Flags: review+ → review?(jgriffin)
Comment on attachment 8619715 [details]
MozReview Request: Bug 1141519: if we try call marionette calls in the teardown and they don't work we should just carry on; r?jgriffin

Bug 1141519: if we try call marionette calls in the teardown and they don't work we should just carry on; r?jgriffin
I have done a try push that is visible on MozReview. Let's see what breaks but locally everything is good.
Flags: needinfo?(dburns)
Comment on attachment 8619714 [details]
MozReview Request: Bug 1141519: error if we are going to try send a message to a frame that we know doesnt exist; r?jgriffin

https://reviewboard.mozilla.org/r/5337/#review17463
Attachment #8619714 - Flags: review?(jgriffin) → review+
https://reviewboard.mozilla.org/r/5331/#review17467

::: testing/marionette/client/marionette/tests/unit/test_chrome.py:24
(Diff revision 3)
> +        with self.marionette.using_context('chrome'):

Does this have the same effect as:

self.marionette.set_context('chrome')
# do something that causes an exception

I'm wondering if the using_context context manager is switching the context back to content implicitly before tearDown is caused, and thus not really testing the changes to marionette_test.py completely.
(In reply to Jonathan Griffin (:jgriffin) from comment #53)
> https://reviewboard.mozilla.org/r/5331/#review17467
> 
> ::: testing/marionette/client/marionette/tests/unit/test_chrome.py:24
> (Diff revision 3)
> > +        with self.marionette.using_context('chrome'):
> 
> Does this have the same effect as:
> 
> self.marionette.set_context('chrome')
> # do something that causes an exception

It’s not the same thing.  It defines a chrome-space scoped block that switches back to content afterwards.  It would be the equivalent of this:

    original_context = marionette.get_context()
    marionette.set_context("chrome")
    # do something
    marionette.set_context(original_context)
Right, and I'm assuming that if an exception is hit during the 'something', that we still switch the context back, which means the test here may not be testing what we want it to test.
(In reply to Jonathan Griffin (:jgriffin) from comment #53)
> https://reviewboard.mozilla.org/r/5331/#review17467
> 
> ::: testing/marionette/client/marionette/tests/unit/test_chrome.py:24
> (Diff revision 3)
> > +        with self.marionette.using_context('chrome'):
> 
> Does this have the same effect as:
> 
> self.marionette.set_context('chrome')
> # do something that causes an exception
> 
> I'm wondering if the using_context context manager is switching the context
> back to content implicitly before tearDown is caused, and thus not really
> testing the changes to marionette_test.py completely.

For this test case if we implicitly switched because of an exception any other calls would then fail. In this test we can't have because that would make the hang appear still.

The context will not be switched if there is an exception part way through.
https://reviewboard.mozilla.org/r/5331/#review17467

> Does this have the same effect as:
> 
> self.marionette.set_context('chrome')
> # do something that causes an exception
> 
> I'm wondering if the using_context context manager is switching the context back to content implicitly before tearDown is caused, and thus not really testing the changes to marionette_test.py completely.

Commented in the bug
Comment on attachment 8619716 [details]
MozReview Request: Bug 1141519: Don't switch to content scope for logging the test end; r?jgriffin

https://reviewboard.mozilla.org/r/5333/#review17711
Attachment #8619716 - Flags: review?(jgriffin) → review+
Comment on attachment 8619717 [details]
MozReview Request: Bug 1141519: added test that puts marionette into a position that can cause hangs when in content scope; r?jgriffin

https://reviewboard.mozilla.org/r/5335/#review17713
Attachment #8619717 - Flags: review?(jgriffin) → review+
Comment on attachment 8619715 [details]
MozReview Request: Bug 1141519: if we try call marionette calls in the teardown and they don't work we should just carry on; r?jgriffin

https://reviewboard.mozilla.org/r/5339/#review17715
Attachment #8619715 - Flags: review?(jgriffin) → review+
https://hg.mozilla.org/integration/mozilla-inbound/rev/5e3cbd7c84fa4eae1343c46e44e8f7649cb55440
Bug 1141519: Don't switch to content scope for logging the test end; r=jgriffin

https://hg.mozilla.org/integration/mozilla-inbound/rev/2101909374c09e8b06af50cedfc59ccccf078d0e
Bug 1141519: added test that puts marionette into a position that can cause hangs when in content scope; r=jgriffin

https://hg.mozilla.org/integration/mozilla-inbound/rev/2c319ee6609b45f254194005991218a8e6f1b1ce
Bug 1141519: error if we are going to try send a message to a frame that we know doesnt exist; r=jgriffin

https://hg.mozilla.org/integration/mozilla-inbound/rev/4f0cd3e94c0a34500c60951e3e80bcd312167fe5
Bug 1141519: if we try call marionette calls in the teardown and they don't work we should just carry on; r=jgriffin
looks like I was leaving a window behind and I think that was causing the failures
Flags: needinfo?(dburns)
https://hg.mozilla.org/integration/mozilla-inbound/rev/98b43c681c1d77c3aaee352b721349895f550db5
Bug 1141519: Don't switch to content scope for logging the test end; r=jgriffin

https://hg.mozilla.org/integration/mozilla-inbound/rev/b2c829ea4540ee6d6b7474325f1fb487995ef3fb
Bug 1141519: added test that puts marionette into a position that can cause hangs when in content scope; r=jgriffin

https://hg.mozilla.org/integration/mozilla-inbound/rev/6d92fc2898e9b9232d1642526a4f7eb25718fbae
Bug 1141519: error if we are going to try send a message to a frame that we know doesnt exist; r=jgriffin

https://hg.mozilla.org/integration/mozilla-inbound/rev/88386ed1886c6e811b58690cf0adf3f4429d7857
Bug 1141519: if we try call marionette calls in the teardown and they don't work we should just carry on; r=jgriffin
No longer blocks: 1143565
It looks like that even with this fix landed Marionette still hangs with an element not been found in content scope. The testcase is via bug 1202576. I will further debug it and file an appropriate Marionette bug.
Looks like I was trapped by the dependency tree. Bug 1202576 is actually not dependent on this bug but bug 1202663 as stated earlier on this bug by Andreas. So we will have to wait for that too to get the other half fixed.
No longer blocks: 1202576
Our oldest supported Firefox release will be 45.0 soon. So I don't see why we would need this fix in older releases.
You need to log in before you can comment on or make changes to this bug.