Closed Bug 1207125 Opened 9 years ago Closed 6 years ago

Marionette._send_message() mixes up response values of different calls

Categories

(Remote Protocol :: Marionette, defect, P3)

defect

Tracking

(Not tracked)

RESOLVED INCOMPLETE

People

(Reporter: whimboo, Unassigned)

References

Details

(Keywords: pi-marionette-client)

File "/mozilla/code/gecko/testing/marionette/client/marionette/runner/base.py", line 530, in gather_debug
    rv['screenshot'] = marionette.screenshot()

  File "/usr/lib/python2.7/contextlib.py", line 24, in __exit__
    self.gen.next()

  File "/mozilla/code/gecko/testing/marionette/driver/marionette_driver/marionette.py", line 1241, in using_context
    self.set_context(scope)

  File "/mozilla/code/gecko/testing/marionette/driver/marionette_driver/marionette.py", line 1216, in set_context
    assert(context == self.CONTEXT_CHROME or context == self.CONTEXT_CONTENT)

AssertionError

[..]

Traceback (most recent call last):
  File "/mozilla/code/gecko/testing/marionette/client/marionette/marionette_test.py", line 296, in run
    testMethod()
  File "/mozilla/code/firefox-ui-tests/firefox_ui_tests/update/direct/test_direct_update.py", line 26, in test_update
    self._test_update()
  File "/mozilla/code/firefox-ui-tests/firefox_ui_tests/update/direct/test_direct_update.py", line 20, in _test_update
    self.download_and_apply_available_update(force_fallback=False)
  File "/mozilla/code/firefox-ui-tests/firefox_ui_harness/testcases/update.py", line 158, in download_and_apply_available_update
    self.updates[self.current_update_index]['patch'] = about_window.patch_info
  File "/mozilla/code/firefox-ui-tests/firefox_puppeteer/ui/about_window/window.py", line 51, in patch_info
    patch = self._software_update.patch_info
  File "/mozilla/code/firefox-ui-tests/firefox_puppeteer/api/software_update.py", line 295, in patch_info
    info = {'channel': self.update_channel.channel}
  File "/mozilla/code/firefox-ui-tests/firefox_puppeteer/api/software_update.py", line 401, in channel
    return self.prefs.get_pref('app.update.channel', True)
  File "/mozilla/code/firefox-ui-tests/firefox_puppeteer/api/prefs.py", line 70, in get_pref
    """, script_args=[pref_name, default_branch, interface])
  File "/usr/lib/python2.7/contextlib.py", line 24, in __exit__
    self.gen.next()
  File "/mozilla/code/gecko/testing/marionette/driver/marionette_driver/marionette.py", line 1241, in using_context
    self.set_context(scope)
  File "/mozilla/code/gecko/testing/marionette/driver/marionette_driver/marionette.py", line 1216, in set_context
    assert(context == self.CONTEXT_CHROME or context == self.CONTEXT_CONTENT)
AssertionError


The extra assert line in the code doesn't give that much information about the actual scope to be set. If you remove it and let the next line throw a ValueError you will see that the scope to be set is "nightly"!
Adding some debug code to using_context shows that very strange output:

current scope: /tmp/tmpAkdPyG.firefox-ui-update/binary.backup/defaults/pref/channel-prefs.js

or 

current scope: nightly

None of those scopes get set somewhere. So I wonder if the following method can return bogus values:

self._send_message("getContext", key="value")
Here some Marionette debug output:

************************** using context *******
** wanted: chrome
1442926688558	Marionette	DEBUG	conn1 -> {"name":"executeScript","parameters":{"scriptTimeout":null,"newSandbox":true,"args":["app.update.channel",true,null],"filename":"prefs.py","script":"\n              Components.utils.import(\"resource://gre/modules/Services.jsm\");\n\n              let pref_name = arguments[0];\n              let default_branch = arguments[1];\n              let interface = arguments[2];\n\n              let prefBranch;\n              if (default_branch) {\n                prefBranch = Services.prefs.getDefaultBranch(\"\");\n              }\n              else {\n                prefBranch = Services.prefs;\n              }\n\n              // If an interface has been set, handle it differently\n              if (interface !== null) {\n                return prefBranch.getComplexValue(pref_name,\n                                                  Components.interfaces[interface]).data;\n              }\n\n              let type = prefBranch.getPrefType(pref_name);\n\n              switch (type) {\n                case prefBranch.PREF_STRING:\n                  return prefBranch.getCharPref(pref_name);\n                case prefBranch.PREF_BOOL:\n                  return prefBranch.getBoolPref(pref_name);\n                case prefBranch.PREF_INT:\n                  return prefBranch.getIntPref(pref_name);\n                case prefBranch.PREF_INVALID:\n                  return null;\n              }\n            ","sandbox":"default","line":70}}
1442926688560	Marionette	DEBUG	conn1 client <- {"value":"nightly"}
**current: nightly

The code shows that we call using_context('chrome') to get a preference value from Firefox via execute_script. In this case the return value is "nightly". Somehow the using_context() method gets this value via the call to getContext and stores it internally for the current scope. Then once execute_script has been finished and the original state has to be restored it will accidentally restore the scope to "nightly" but not "content" or "chrome".
Just to add that all the above can only be seen when you hit Ctrl+C while the tests are running.

Another failure which falls into this bucket is:

Traceback (most recent call last):

  File "/mozilla/code/gecko/testing/marionette/client/marionette/runner/base.py", line 529, in gather_debug
    with marionette.using_context(marionette.CONTEXT_CHROME):

  File "/usr/lib/python2.7/contextlib.py", line 17, in __enter__
    return self.gen.next()

  File "/mozilla/code/gecko/testing/marionette/driver/marionette_driver/marionette.py", line 1234, in using_context
    scope = self._send_message("getContext", key="value")

  File "/mozilla/code/gecko/testing/marionette/driver/marionette_driver/decorators.py", line 36, in _
    return func(*args, **kwargs)

  File "/mozilla/code/gecko/testing/marionette/driver/marionette_driver/marionette.py", line 722, in _send_message
    return self._unwrap_response(resp.get(key))

AttributeError: 'list' object has no attribute 'get'

Adding debugging code to _send_message gives me the following output:

> **** response for getContext
> [u'3', u'14']

That means for getContext we return the list of open chrome window handles.

So I feel that the _send_message method is busted in those scenarios and returns a value which was set by a different invocation? 

Here the steps to reproduce:

1. Clone the firefox-ui-tests repo (https://github.com/mozilla/firefox-ui-tests)
2. Run python setup.py develop
3. Execute firefox-ui-update --binary %path% --update-direct-only
4. After Firefox restarted the first time and the about window opens, quickly switch to the console and hit Ctrl+C

This reliably fails for me.
Summary: Firefox Update tests can cause Marionette to fail with: "Unknown context: nightly" → Marionette._send_message() mixes up response values of different calls
Here some debugging output for responses we get:

**** response for executeScript
**** key: value
**** reponse: {u'value': True}
^C**** response for getContext
**** key: value
**** reponse: {u'value': u'14'}
**** response for setContext
**** key: None
**** reponse: {u'value': u'chrome'}
**** response for executeScript
**** key: value
**** reponse: {}
**** response for getChromeWindowHandles
**** key: None
**** reponse: {u'value': u'nightly'}
**** response for getContext
**** key: value
**** reponse: [u'3', u'14']

As you can see all is fine until the last call to executeScript. Here an empty dict is returned first but the real return value of it ('nightly') comes back for getChromeWindowHandles. And the handles actually for getContext.
Here some debug output with more context:

**** command: {"name": "getChromeWindowHandles"}
**** key for getChromeWindowHandles: None
**** reponse for getChromeWindowHandles: [u'3', u'14']
**** command: {"name": "executeScript", "parameters": {"scriptTimeout": null, "newSandbox": true, "args": [], "filename": "software_update.py", "script": "\n          Components.utils.import('resource://gre/modules/Services.jsm');\n\n          let file = Services.dirsvc.get('PrfDef', Components.interfaces.nsIFile);\n          file.append('channel-prefs.js');\n\n          return file.path;\n        ", "sandbox": "default", "line": 389}}
**** key for executeScript: value
**** reponse for executeScript: {u'value': u'/tmp/tmpoFtz1S.firefox-ui-update/binary.backup/defaults/pref/channel-prefs.js'}
**** command: {"name": "executeScript", "parameters": {"scriptTimeout": null, "newSandbox": true, "args": [], "filename": "software_update.py", "script": "\n          Components.utils.import('resource://gre/modules/Services.jsm');\n\n          let file = Services.dirsvc.get('GreD', Components.interfaces.nsIFile);\n          file.append('update-settings.ini');\n\n          return file.path;\n        ", "sandbox": "default", "line": 95}}
^C**** command: {"name": "getContext"}
**** key for getContext: value
**** reponse for getContext: {u'value': u'/tmp/tmpoFtz1S.firefox-ui-update/binary.backup/update-settings.ini'}
**** command: {"name": "setContext", "parameters": {"value": "chrome"}}
**** key for setContext: None
**** reponse for setContext: {u'value': u'chrome'}


See ^C at the beginning of the line which immediately sends a getContext command while the old executeScript command has not been returned yet. As result the response we get for getContext is actually the one for executeScript. :(
Thanks for that last debug round with more information, that’s incredibly useful!

I’m guessing there’s an interrupt handler in the Marionette runner that on SIGTERM aborts the current operation and sends a getContext message and blocks on receiving a response, but when the response comes back it’s from the previous executeScript command.

This stems from a complete lack of synchronous enforcement in Marionette: The server gladly executes any command you request it to, even if there an existing command is in progress.  There is an expectation that the local end blocks until it receives a response back before issuing a new command.
Summary: Marionette._send_message() mixes up response values of different calls → Implement a synchronous, blocking command queue for Marionette (Marionette._send_message() mixes up response values of different calls)
(In reply to Andreas Tolfsen (:ato) from comment #6)
> I’m guessing there’s an interrupt handler in the Marionette runner that on
> SIGTERM aborts the current operation and sends a getContext message and
> blocks on receiving a response, but when the response comes back it’s from
> the previous executeScript command.

So far the runner only listens for SIGINT (bug 1130220) but not SIGTERM (bug 1201871). The getContext might come from a using_context inside our specialized Firefox UI tests harness. But not sure out of the box.

> This stems from a complete lack of synchronous enforcement in Marionette:
> The server gladly executes any command you request it to, even if there an
> existing command is in progress.  There is an expectation that the local end
> blocks until it receives a response back before issuing a new command.

Or the _send_message should handle all the exceptions + interrupts correctly and cancel the processing of the command?
Blocks: 1207042
I looked closer at this and the issue seems to be that when a unittest is interrupted, it forcefully shuts down the thread that the test runs on, sometimes causing the socket recv() call from a request to not empty the pending data in the socket buffer.

When the next command is called, i.e. getContext from :whimboo’s log above that gets called in gather_debug by moztest, that actually reads the response from the previous request and thereby causes a race condition where response packets are delayed by a delta of +1.  Theoretically this will eventually lead to a buffer overflow if it happens repeatedly.
Summary: Implement a synchronous, blocking command queue for Marionette (Marionette._send_message() mixes up response values of different calls) → Marionette._send_message() mixes up response values of different calls
One solution is to poll with a timeout (select()) on the socket for read, and iterate byte by byte until the there is no more data.

The issue can be reproduced, if transport’s send is modified to no implicitly receive data, with this test case:

    import json
    from marionette import MarionetteTestCase

    get_current_url = json.dumps({"name": "getCurrentUrl"})

    class TestBlocking(MarionetteTestCase):
        def test_flush_buffer(self):
            self.marionette.client.send(get_current_url)
            resp = self.marionette.execute_script("return 'foo'")
            self.assertEqual("foo", resp)
Assignee: nobody → ato
Status: NEW → ASSIGNED
I assume you accidentally reverted the summary. Fixing it.
Summary: Marionette._send_message() mixes up response values of different calls → Implement a synchronous, blocking command queue for Marionette (Marionette._send_message() mixes up response values of different calls)
Experimental patch: https://treeherder.mozilla.org/#/jobs?repo=try&revision=7e7d06753663

(In reply to Henrik Skupin (:whimboo) from comment #10)
> I assume you accidentally reverted the summary. Fixing it.

I think we can address this by emptying the buffer before sending, which will fix this particular issue.  

But you’re right that there’s an overall bigger issue here that we need to make the server blocking.  I’ve filed bug 1207564 about fixing this.
Summary: Implement a synchronous, blocking command queue for Marionette (Marionette._send_message() mixes up response values of different calls) → Marionette._send_message() mixes up response values of different calls
So I tested this patch and it doesn't work for me on Linux with my steps. I see a huge perf decrease and when hitting ctrl+c it hangs forever in:

 File "/mozilla/code/gecko/testing/marionette/driver/marionette_driver/marionette.py", line 1535, in execute_script
    rv = self._send_message("executeScript", body, key="value")
  File "/mozilla/code/gecko/testing/marionette/driver/marionette_driver/decorators.py", line 36, in _
    return func(*args, **kwargs)
  File "/mozilla/code/gecko/testing/marionette/driver/marionette_driver/marionette.py", line 686, in _send_message
    resp = self.client.request(packet)
  File "/mozilla/code/gecko/testing/marionette/transport/marionette_transport/transport.py", line 126, in request
    self.send(data)
  File "/mozilla/code/gecko/testing/marionette/transport/marionette_transport/transport.py", line 109, in send
    self._empty()
  File "/mozilla/code/gecko/testing/marionette/transport/marionette_transport/transport.py", line 99, in _empty
    read, _, _ = select.select([self.sock], [], [], 0.1)
KeyboardInterrupt
Depends on: 1211489
Sorry, the last comment was not meant for this bug.
Flags: needinfo?(ato)
Depends on: 1230151
Unassigning myself as this hasn’t been so much of an issue lately.  (Possibly fixed?)
Status: ASSIGNED → NEW
Assignee: ato → nobody
Have you seen this recently?
Flags: needinfo?(hskupin)
Priority: -- → P3
I tried several methods in reproducing it but wasn't able to. I'm sure I have seen something like that still maybe in wpt tests, but unless I have an example lets close this bug as incomplete.
Status: NEW → RESOLVED
Closed: 6 years ago
Flags: needinfo?(hskupin)
Resolution: --- → INCOMPLETE
Product: Testing → Remote Protocol
You need to log in before you can comment on or make changes to this bug.