Closed Bug 779011 Opened 7 years ago Closed 7 years ago

Marionette can get out of sync

Categories

(Testing :: Marionette, defect)

defect
Not set

Tracking

(Not tracked)

RESOLVED FIXED
mozilla17

People

(Reporter: jgriffin, Assigned: jgriffin)

References

Details

Attachments

(3 files, 2 obsolete files)

Sometimes when a test times out, Marionette can get requests and responses out-of-sync, which eventually leads to this error:

Traceback (most recent call last):
  File "runtests.py", line 607, in <module>
    runner.run_tests(tests, testtype=options.type)
  File "runtests.py", line 294, in run_tests
    self.run_test(test, testtype)
  File "runtests.py", line 375, in run_test
    self.run_test(i["path"], testtype)
  File "runtests.py", line 395, in run_test
    results = MarionetteTextTestRunner(verbosity=3).run(suite)
  File "runtests.py", line 106, in run
    result.printLogs(test)
  File "runtests.py", line 69, in printLogs
    for line in testcase.loglines:
TypeError: 'bool' object is not iterable

We have a 'command_id' attached to execute requests which is supposed to prevent this synchronization problem, but it apparently is not working in all cases.

To reproduce:

1. update test_execute_async_script:test_no_timeout so that the test will actually timeout
2. run the tests

Results:

The test times out, as expected, but all tests after that fail:

======================================================================
ERROR: test_no_timeout (test_execute_async_script.TestExecuteAsyncContent)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/data/jenkins/workspace/webapi-marionette-test/testing/marionette/client/marionette/tests/unit/test_execute_async_script.py", line 27, in test_no_timeout
    """))
  File "/data/jenkins/workspace/webapi-marionette-test/testing/marionette/client/marionette/marionette.py", line 370, in execute_async_script
    specialPowers=special_powers)
  File "/data/jenkins/workspace/webapi-marionette-test/testing/marionette/client/marionette/marionette.py", line 154, in _send_message
    self._handle_error(response)
  File "/data/jenkins/workspace/webapi-marionette-test/testing/marionette/client/marionette/marionette.py", line 203, in _handle_error
    raise ScriptTimeoutException(message=message, status=status, stacktrace=stacktrace)
ScriptTimeoutException: timed out

======================================================================
ERROR: test_no_timeout (test_execute_async_script.TestExecuteAsyncContent)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/data/jenkins/workspace/webapi-marionette-test/testing/marionette/client/marionette/marionette_test.py", line 99, in tearDown
    self.marionette.delete_session()
  File "/data/jenkins/workspace/webapi-marionette-test/testing/marionette/client/marionette/marionette.py", line 227, in delete_session
    response = self._send_message('deleteSession', 'ok')
  File "/data/jenkins/workspace/webapi-marionette-test/testing/marionette/client/marionette/marionette.py", line 154, in _send_message
    self._handle_error(response)
  File "/data/jenkins/workspace/webapi-marionette-test/testing/marionette/client/marionette/marionette.py", line 212, in _handle_error
    raise MarionetteException(message=response, status=500)
MarionetteException: {u'from': u'conn53.marionette1', u'value': {}}

======================================================================
ERROR: test_same_context (test_execute_async_script.TestExecuteAsyncContent)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/data/jenkins/workspace/webapi-marionette-test/testing/marionette/client/marionette/tests/unit/test_execute_async_script.py", line 45, in test_same_context
    """ % var1), var1)
  File "/data/jenkins/workspace/webapi-marionette-test/testing/marionette/client/marionette/marionette.py", line 358, in execute_script
    specialPowers=special_powers)
  File "/data/jenkins/workspace/webapi-marionette-test/testing/marionette/client/marionette/marionette.py", line 154, in _send_message
    self._handle_error(response)
  File "/data/jenkins/workspace/webapi-marionette-test/testing/marionette/client/marionette/marionette.py", line 212, in _handle_error
    raise MarionetteException(message=response, status=500)
MarionetteException: {u'message': u"error occurred while processing 'setScriptTimeout' request: TypeError: this.curBrowser is null", u'from': u'conn53.marionette1', u'error': u'unknownError'}

======================================================================
ERROR: test_same_context (test_execute_async_script.TestExecuteAsyncContent)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/data/jenkins/workspace/webapi-marionette-test/testing/marionette/client/marionette/marionette_test.py", line 97, in tearDown
    self.loglines = self.marionette.get_logs()
  File "/data/jenkins/workspace/webapi-marionette-test/testing/marionette/client/marionette/marionette.py", line 396, in get_logs
    return self._send_message('getLogs', 'value')
  File "/data/jenkins/workspace/webapi-marionette-test/testing/marionette/client/marionette/marionette.py", line 154, in _send_message
    self._handle_error(response)
  File "/data/jenkins/workspace/webapi-marionette-test/testing/marionette/client/marionette/marionette.py", line 212, in _handle_error
    raise MarionetteException(message=response, status=500)
MarionetteException: {u'message': u"error occurred while processing 'executeScript' request: TypeError: this.curBrowser is null", u'from': u'conn53.marionette1', u'error': u'unknownError'}

======================================================================
ERROR: test_sandbox_reuse (test_execute_async_script.TestExecuteAsyncContent)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/data/jenkins/workspace/webapi-marionette-test/testing/marionette/client/marionette/tests/unit/test_execute_async_script.py", line 11, in setUp
    self.marionette.set_script_timeout(1000)
  File "/data/jenkins/workspace/webapi-marionette-test/testing/marionette/client/marionette/marionette.py", line 238, in set_script_timeout
    response = self._send_message('setScriptTimeout', 'ok', value=timeout)
  File "/data/jenkins/workspace/webapi-marionette-test/testing/marionette/client/marionette/marionette.py", line 154, in _send_message
    self._handle_error(response)
  File "/data/jenkins/workspace/webapi-marionette-test/testing/marionette/client/marionette/marionette.py", line 212, in _handle_error
    raise MarionetteException(message=response, status=500)
MarionetteException: {u'from': u'conn53.marionette1', u'value': []}

======================================================================
ERROR: test_script_finished (test_execute_async_script.TestExecuteAsyncContent)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/data/jenkins/workspace/webapi-marionette-test/testing/marionette/client/marionette/tests/unit/test_execute_async_script.py", line 11, in setUp
    self.marionette.set_script_timeout(1000)
  File "/data/jenkins/workspace/webapi-marionette-test/testing/marionette/client/marionette/marionette.py", line 238, in set_script_timeout
    response = self._send_message('setScriptTimeout', 'ok', value=timeout)
  File "/data/jenkins/workspace/webapi-marionette-test/testing/marionette/client/marionette/marionette.py", line 154, in _send_message
    self._handle_error(response)
  File "/data/jenkins/workspace/webapi-marionette-test/testing/marionette/client/marionette/marionette.py", line 212, in _handle_error
    raise MarionetteException(message=response, status=500)
MarionetteException: {u'message': u"error occurred while processing 'setScriptTimeout' request: TypeError: this.curBrowser is null", u'from': u'conn53.marionette1', u'error': u'unknownError'}

Traceback (most recent call last):
  File "runtests.py", line 607, in <module>
    runner.run_tests(tests, testtype=options.type)
  File "runtests.py", line 294, in run_tests
    self.run_test(test, testtype)
  File "runtests.py", line 375, in run_test
    self.run_test(i["path"], testtype)
  File "runtests.py", line 395, in run_test
    results = MarionetteTextTestRunner(verbosity=3).run(suite)
  File "runtests.py", line 106, in run
    result.printLogs(test)
  File "runtests.py", line 69, in printLogs
    for line in testcase.loglines:
TypeError: 'bool' object is not iterable
I thought the cause of this problem was an issue where a test times out, and the finish() callback for the timed out test gets triggered while a subsequent test is running, and Marionette somehow gets out of sync.

That may not be the case.  I saw this problem again on the CI, after I had increased the script timeout for test_execute_async_script:test_no_timeout to 10s.  The test timed out after much less than 10s.

The previous test in the CI seems to be benign (test_execute_permission (test_execute_async_script.TestExecuteAsyncContent).  Also note that this error has only been observed with the content version of this test, not chrome.

My guess as to what's happening:  setScriptTimeout sends a message to the content process and then immediately returns; it doesn't wait for the content process to process the message.  Perhaps it is possible that the next message sent to content (in this case an execute_async message) is getting processed before the timeout message has fully been processed, so it's operating with the earlier timeout.

The likely fix:  for setScriptTimeout in the content context, don't send the response from the chrome context, but from the content context after the message is processed.
Attached patch patch v0.1Splinter Review
This moves the acknowledgement for setScriptTimeout to the content script.
Attachment #647632 - Flags: review?(mdas)
Comment on attachment 647632 [details] [diff] [review]
patch v0.1

Review of attachment 647632 [details] [diff] [review]:
-----------------------------------------------------------------

Oh wow, that was silly! Good catch!
Attachment #647632 - Flags: review?(mdas) → review+
http://hg.mozilla.org/mozilla-central/rev/4198ceecf0ee
Assignee: nobody → jgriffin
Status: NEW → RESOLVED
Closed: 7 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla17
Sadly, this error cropped up again on the CI today.  This makes me think a setTimeout from a previous test is triggering the timeout for test_no_timeout, since the test is definitely not taking 10s to run.  I'll have to try to add some instrumentation to figure out what's going wrong.
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
It turns out that bug 786463 made this bug much easier to reproduce, so I was finally able to figure out what was going on.

What's happening is that Marionette is not cleaning up after itself properly after JS exceptions occur during execute_async_script calls.  In particular, this means we are returning without canceling the timeout timer, which might end up getting triggered later, unless some other test was executed successfully (and thus triggered the cleanup) before the timeout interval elapsed.

When an orphaned timeout timer gets called for a test which is already finished, much mayhem can ensue.  This generally causes the wrong response to be sent to whatever the current command is, which causes a cascade of failures.

The proximate fix is to patch handling of JS errors so that Marionette cleans up after itself. I have a patch for this which I'll land sans review to see if it clears up the errors on the CI (it fixes them for me locally).

A better and more fundamental fix is to expand the concept of command_id's, which we've applied only to execute commands, to cover all commands.  This should prevent this kind of thing from happening whether we clean up after ourselves or not.
Landed without review as https://hg.mozilla.org/mozilla-central/rev/e3e7f8f7796d, so we can see the effects on the CI over this weekend.  I'm leaving this bug open since it won't be 100% fixed until we apply command_id logic to all Marionette messages, which will be a more involved patch.
Blocks: 801898
Summary: TypeError: 'bool' object is not iterable when test failures cause Marionette to get out of sync → Marionette can get out of sync
Try run for a0a6e8f74f87 is complete.
Detailed breakdown of the results available here:
    https://tbpl.mozilla.org/?tree=Try&rev=a0a6e8f74f87
Results (out of 11 total builds):
    success: 11
Builds (or logs if builds failed) available at:
http://ftp.mozilla.org/pub/mozilla.org/firefox/try-builds/jgriffin@mozilla.com-a0a6e8f74f87
Try seems to like this, so I cleaned it up a bit and am posting for review.
Attachment #685435 - Flags: review?(mdas)
Attachment #684214 - Attachment is obsolete: true
Comment on attachment 685435 [details] [diff] [review]
Attach id's to all requests to prevent Marionette from getting out-of-sync

Review of attachment 685435 [details] [diff] [review]:
-----------------------------------------------------------------

lgtm, but I'd keep the logRequest stuff out if we can.

::: testing/marionette/marionette-actors.js
@@ +241,5 @@
>  
> +  logRequest: function MDA_logRequest(type, data) {
> +    logger.info("Got request: " + type + ", data: " + JSON.stringify(data) + ", id: " + this.command_id);
> +  },
> +

Do we want to logRequests in the production code? It seems really verbose

@@ +463,4 @@
>     *
>     */
>    newSession: function MDA_newSession() {
> +    let command_id = this.command_id = this.getCommandId();

Why do we use command_id when we pass it to sendAsync, instead of using this.command_id? A new one can't be assigned during this function execution, can it?

::: testing/marionette/marionette-listener.js
@@ +734,3 @@
>    }
>    catch (e) {
> +    sendError(e.message, e.code, e.stack, comand_id);

typo here ('comand_id')
Attachment #685435 - Flags: review?(mdas) → review+
(In reply to Malini Das [:mdas] from comment #12)
> Comment on attachment 685435 [details] [diff] [review]
> Attach id's to all requests to prevent Marionette from getting out-of-sync
> 
> Review of attachment 685435 [details] [diff] [review]:
> -----------------------------------------------------------------
> 
> lgtm, but I'd keep the logRequest stuff out if we can.
> 
> ::: testing/marionette/marionette-actors.js
> @@ +241,5 @@
> >  
> > +  logRequest: function MDA_logRequest(type, data) {
> > +    logger.info("Got request: " + type + ", data: " + JSON.stringify(data) + ", id: " + this.command_id);
> > +  },
> > +
> 
> Do we want to logRequests in the production code? It seems really verbose

I think I can use logger.debug(), and set the default log level to info, so that these won't show up in the log unless you change the log level yourself.

> 
> @@ +463,4 @@
> >     *
> >     */
> >    newSession: function MDA_newSession() {
> > +    let command_id = this.command_id = this.getCommandId();
> 
> Why do we use command_id when we pass it to sendAsync, instead of using
> this.command_id? A new one can't be assigned during this function execution,
> can it?

No it can't; you're right it isn't really necessary in this case.

> 
> ::: testing/marionette/marionette-listener.js
> @@ +734,3 @@
> >    }
> >    catch (e) {
> > +    sendError(e.message, e.code, e.stack, comand_id);
> 
> typo here ('comand_id')

Thanks.
Try run for a98c4e1e5c38 is complete.
Detailed breakdown of the results available here:
    https://tbpl.mozilla.org/?tree=Try&rev=a98c4e1e5c38
Results (out of 4 total builds):
    success: 4
Builds (or logs if builds failed) available at:
http://ftp.mozilla.org/pub/mozilla.org/firefox/try-builds/jgriffin@mozilla.com-a98c4e1e5c38
The WebAPI run on try failed due to bug 802877, I believe.  I retriggered to confirm.
Try run for db35f04ce4e8 is complete.
Detailed breakdown of the results available here:
    https://tbpl.mozilla.org/?tree=Try&rev=db35f04ce4e8
Results (out of 12 total builds):
    success: 11
    warnings: 1
Builds (or logs if builds failed) available at:
http://ftp.mozilla.org/pub/mozilla.org/firefox/try-builds/jgriffin@mozilla.com-db35f04ce4e8
https://hg.mozilla.org/integration/mozilla-inbound/rev/721a4d74f09e
Whiteboard: [automation-needed-in-aurora][automation-needed-in-beta]
http://hg.mozilla.org/mozilla-central/rev/721a4d74f09e
Status: REOPENED → RESOLVED
Closed: 7 years ago7 years ago
Resolution: --- → FIXED
https://hg.mozilla.org/releases/mozilla-beta/rev/c4a61d1bc82b
Whiteboard: [automation-needed-in-aurora][automation-needed-in-beta] → [automation-needed-in-aurora]
https://hg.mozilla.org/releases/mozilla-aurora/rev/92fdcff7bde4
Whiteboard: [automation-needed-in-aurora]
Try run for a0a6e8f74f87 is complete.
Detailed breakdown of the results available here:
    https://tbpl.mozilla.org/?tree=Try&rev=a0a6e8f74f87
Results (out of 12 total builds):
    success: 11
    failure: 1
Builds (or logs if builds failed) available at:
http://ftp.mozilla.org/pub/mozilla.org/firefox/try-builds/jgriffin@mozilla.com-a0a6e8f74f87
You need to log in before you can comment on or make changes to this bug.