Random tests failing with JavascriptException: TypeError: this.element is null in app://system.gaiamobile.org/gaia_build_defer_index.js line: 2337 or .../app_window.js line:451

RESOLVED FIXED

Status

defect
RESOLVED FIXED
4 years ago
4 years ago

People

(Reporter: gmealer, Assigned: gmealer)

Tracking

Firefox Tracking Flags

(Not tracked)

Details

Attachments

(6 attachments)

Posted file 7530-logcat.txt
This is bringing down our Gaia UI Tests at around ~15% failing for any given run. 

We're getting it during what seems like random calls to execute_async_script() in Marionette. It's usually our settings reset in cleanup_gaia() but I've seen it during frame switches, app launches, etc. 

The way Marionette works, any JS exception that happens during an execute_async_script(), no matter which call stack it comes from (i.e. can be concurrent code) will look like an exception in that async call.

About the only valid conclusion is that the exception started happening at some frequency. It may or may not have anything to do with the actual execute_async_script() call other than those are the time periods we're open to getting hit by the error.

Bracketing test runs:

These can be navigated to from within MoCo network or VPN.

http://jenkins1.qa.scl3.mozilla.com/job/flame-kk-319.b2g-inbound.tinderbox.ui.functional.sanity/7529/

tinderbox-builds/b2g-inbound-flame-kk-eng/20150319180839

Gaia date 	20 Mar 2015 00:47:44
Gaia revision 	6e24aff11fbb
Gecko build 	20150319180839
Gecko revision 	069d3d6722ad
Gecko version 	39.0a1


http://jenkins1.qa.scl3.mozilla.com/job/flame-kk-319.b2g-inbound.tinderbox.ui.functional.sanity/7530/

tinderbox-builds/b2g-inbound-flame-kk-eng/20150319193241

Gaia date 	20 Mar 2015 00:47:44
Gaia revision 	6e24aff11fbb
Gecko build 	20150319193241
Gecko revision 	ed6fdb6e238e
Gecko version 	39.0a1

Sample error:

Traceback (most recent call last):
File "/var/jenkins/2/workspace/flame-kk-319.b2g-inbound.tinderbox.ui.functional.sanity/.env/local/lib/python2.7/site-packages/marionette_client-0.9-py2.7.egg/marionette/marionette_test.py", line 290, in run
testMethod()
File "/var/jenkins/2/workspace/flame-kk-319.b2g-inbound.tinderbox.ui.functional.sanity/tests/python/gaia-ui-tests/gaiatest/tests/functional/dialer/test_dialer.py", line 20, in test_dialer_make_call
phone.launch()
File "/var/jenkins/2/workspace/flame-kk-319.b2g-inbound.tinderbox.ui.functional.sanity/tests/python/gaia-ui-tests/gaiatest/apps/phone/app.py", line 27, in launch
Base.launch(self)
File "/var/jenkins/2/workspace/flame-kk-319.b2g-inbound.tinderbox.ui.functional.sanity/tests/python/gaia-ui-tests/gaiatest/apps/base.py", line 26, in launch
self.app = self.apps.launch(self.name, self.manifest_url, self.entry_point, launch_timeout=launch_timeout)
File "/var/jenkins/2/workspace/flame-kk-319.b2g-inbound.tinderbox.ui.functional.sanity/tests/python/gaia-ui-tests/gaiatest/gaia_test.py", line 64, in launch
result = self.marionette.execute_async_script("GaiaApps.launchWithName('%s')" % name, script_timeout=launch_timeout)
File "/var/jenkins/2/workspace/flame-kk-319.b2g-inbound.tinderbox.ui.functional.sanity/.env/local/lib/python2.7/site-packages/marionette_driver-0.2-py2.7.egg/marionette_driver/marionette.py", line 1420, in execute_async_script
debug_script=debug_script)
File "/var/jenkins/2/workspace/flame-kk-319.b2g-inbound.tinderbox.ui.functional.sanity/.env/local/lib/python2.7/site-packages/marionette_driver-0.2-py2.7.egg/marionette_driver/decorators.py", line 36, in _
return func(*args, **kwargs)
File "/var/jenkins/2/workspace/flame-kk-319.b2g-inbound.tinderbox.ui.functional.sanity/.env/local/lib/python2.7/site-packages/marionette_driver-0.2-py2.7.egg/marionette_driver/marionette.py", line 670, in _send_message
self._handle_error(response)
File "/var/jenkins/2/workspace/flame-kk-319.b2g-inbound.tinderbox.ui.functional.sanity/.env/local/lib/python2.7/site-packages/marionette_driver-0.2-py2.7.egg/marionette_driver/marionette.py", line 725, in _handle_error
raise errors.JavascriptException(message=message, status=status, stacktrace=stacktrace)
JavascriptException: JavascriptException: TypeError: this.element is null
stacktrace:
@app://system.gaiamobile.org/gaia_build_defer_index.js, line 2337

Associated logcat block/datestamps (see attached):

03-19 21:46:05.161  3042  3042 W GeckoConsole: [JavaScript Error: "TypeError: this.element is null" {file: "app://system.gaiamobile.org/gaia_build_defer_index.js" line: 2337}]
03-19 21:46:05.411  3042  3042 I Gecko   : 1426826765413	Marionette	INFO	sendToClient: {"from":"0","error":{"message":"TypeError: this.element is null","status":17,"stacktrace":"@app://system.gaiamobile.org/gaia_build_defer_index.js, line 2337"}}, {4fde0b10-caaa-4ba5-853b-361b5967b65e}, {4fde0b10-caaa-4ba5-853b-361b5967b65e}

Please see attached logcat.txt for more context.
QA Whiteboard: [fxosqa-auto-backlog+]
No longer depends on: 1107139
See Also: → 1107139
:geo, would it be possible to attach the gaia_build_defer_index.js, that line number doesnt look right when I look at my own build, but it does seem to be somewhere in apps/system/js/app_window.js. That and the getScreenshot log line shortly near the exception in the logcat make it plausible that my patch is the cause. Hard to troubleshoot until I can reproduce it, and I also don't see anything going on like this on https://treeherder.mozilla.org/#/jobs?repo=gaia-try
Flags: needinfo?(gmealer)
Comment on attachment 8580993 [details] [review]
[gaia] sfoster:app-window-guard-this-element-bug-1145875 > mozilla-b2g:master

This is a speculative patch, tightening up a couple places where theoretically we might get called when this.element was falsey, and which were modified in bug 1107139
(In reply to Sam Foster [:sfoster] from comment #1)
> :geo, would it be possible to attach the gaia_build_defer_index.js, that
> line number doesnt look right when I look at my own build, but it does seem
> to be somewhere in apps/system/js/app_window.js. That and the getScreenshot
> log line shortly near the exception in the logcat make it plausible that my
> patch is the cause. Hard to troubleshoot until I can reproduce it, and I
> also don't see anything going on like this on
> https://treeherder.mozilla.org/#/jobs?repo=gaia-try

Yeah, I'll dig one out of the phones today (and update the line number if need be, since it's a generated file). It was line 2336 in some builds, but that file moves quick since it's an assemblage.

Re: gaia-try, it doesn't seem to reproduce on b2g desktop. I hadn't had a chance to specifically verify that, but the fact that it's not failing on try server or the mainstream treeherder jobs suggests that.
Flags: needinfo?(gmealer)
(In reply to Geo Mealer [:geo] from comment #4)
> Re: gaia-try, it doesn't seem to reproduce on b2g desktop. I hadn't had a
> chance to specifically verify that, but the fact that it's not failing on
> try server or the mainstream treeherder jobs suggests that.

Asynchronous scripts generally complete a lot quicker on desktop, so it's less likely to collide with a JavaScript exception than when running on device. We used to hit this a lot when we had an async script for the entire B2G boot.
QA Whiteboard: [fxosqa-auto-backlog+] → [fxosqa-auto-s12+]
QA Whiteboard: [fxosqa-auto-s12+] → [fxosqa-auto-s12+, s13]
Assignee: nobody → gmealer
TL;DR: It's a race condition. There's a recap at the end. I've included enough info that logs can be reviewed and the theory can be validated.

OK, I used a local build and after a number of tries, got an error out of it without the generated file.

This happened 3 or so seconds into a 1 minute hard sleep from the test, which strongly suggests it's independent of exactly what the test is doing at the time:

03-23 17:26:44.054 14889 14889 I GeckoConsole: Content JS LOG: ====> STARTING SLEEP ====<
03-23 17:26:47.484 14889 14889 W GeckoConsole: [JavaScript Error: "TypeError: this.element is null" {file: "app://system.gaiamobile.org/js/app_window.js" line: 451}]

That corresponds to:

AppWindow.prototype.destroyBrowser = function() {
  ...
  this.element.classList.add('suspended');

Etienne added that, so adding him to NI.

I did notice that right before we get that error, we typically get something like

03-23 17:45:34.594 14889 14914 I Gecko   : [Parent 14889] WARNING: pipe error (145): Connection reset by peer: file ../../../gecko/ipc/chromium/src/chrome/common/ipc_channel_posix.cc, line 456

Looked further, and that happens when we execute kill_all() from GaiaApps. This corresponds with a JS atom I can't easily link to because we don't store it in Gaia repo anywhere obvious, but ultimately comes down to:

      console.log('terminating app with origin \'' + aOrigin + '\'');
      let manager = window.wrappedJSObject.appWindowManager ||
                    new window.wrappedJSObject.AppWindowManager();
      manager.kill(aOrigin);

We do this on all open apps though to my knowledge that's only FTU right after startup. We get the list from:

window.wrappedJSObject.StackManager.snapshot();

So, my theory was that we were killing FTU as it was initializing sometimes, causing its destroyBrowser() method to fire before it had assigned an element.

To test this, I set up a while 1: a.kill_all() function in Python REPL, and spammed taps on the launcher at the same time, trying to create a race between me launching and it killing. I was eventually able to reproduce using Messages with a similar error:

03-23 17:45:34.614 14889 14889 W GeckoConsole: [JavaScript Error: "TypeError: this.element is null" {file: "app://system.gaiamobile.org/js/app_window.js" line: 451}]

I'm attaching local-logcat.txt for an actual test run, and killall-loop.txt for the semiautomatic experiment. Hopefully you'll forgive me for not trimming them--that's why I included timestamps on c/p.

So, to recap, my theory:

* This is a race condition between the system launching FTU and us killing it.

* It started taking down tests because either a) it did just start happening or b) the timing changed enough to move it into the window we're doing async scripts and are vulnerable to exceptions. I suspect b).

* This is unlikely to hit us much in real life, unless it'd be something like a low-memory kill right as something launched.

* However, I'd rather work around this on product side anyway. It would be awkward to filter this exception and only this exception, as it could happen on any async script we fire. Also, most this.element accesses on AppWindow are guarded, and this one probably should be too.

NI'ing Sam in case his recent change might actually have exposed this somehow (and to close the loop), and Etienne because it was his line of code. Please feel free to reassign the bug, cc, move it to another component, whatever.
Flags: needinfo?(sfoster)
Flags: needinfo?(etienne)
Summary: Random tests failing with JavascriptException: TypeError: this.element is null in app://system.gaiamobile.org/gaia_build_defer_index.js line: 2337 → Random tests failing with JavascriptException: TypeError: this.element is null in app://system.gaiamobile.org/gaia_build_defer_index.js line: 2337 or .../app_window.js line:451
Posted file local-logcat.txt
Local test run with the error in the original file.
Posted file killall-loop.txt
Logcat from spamming launches while looping a killall. See comment above for timestamps to search for.
Posted file gaia_apps.js
JS atom with the implementation of killAll() in it.
Added a speculative PR with the this.element guard in it.
> Looked further, and that happens when we execute kill_all() from GaiaApps.
> This corresponds with a JS atom I can't easily link to because we don't
> store it in Gaia repo anywhere obvious, but ultimately comes down to:
> 
>       console.log('terminating app with origin \'' + aOrigin + '\'');
>       let manager = window.wrappedJSObject.appWindowManager ||
>                     new window.wrappedJSObject.AppWindowManager();
>       manager.kill(aOrigin);
> 
> We do this on all open apps though to my knowledge that's only FTU right
> after startup. We get the list from:
> 
> window.wrappedJSObject.StackManager.snapshot();
> 
> So, my theory was that we were killing FTU as it was initializing sometimes,
> causing its destroyBrowser() method to fire before it had assigned an
> element.

That sounds perfectly plausible. Thanks for your extensive investigation here. Like you say, its unlikely to be a problem with a human-directed interaction, but for the sake of a guard around this.element - which is typical best practice anyhow - I think this is a legitimate bug that needs fixing. 

The patch looks like its fixing the right things in the right way, Etienne is a system peer is would be a good reviewer.
Flags: needinfo?(sfoster)
I did go back through older logs and don't see the error. That makes it more likely that something changed and did just start causing this. 

That said, I checked all the assumptions on our side, including that the list of applications we're killing doesn't contain anything we don't want it to, and it all looks good. 

My guess is some fundamental bit of window setup/teardown logic changed on the Gecko side, and the rest still comes down to a race and killing something in-progress. The guard should still be appropriate here, and I haven't seen any other misbehavior suggesting it's a bug so much as just an internal change.
Comment on attachment 8582013 [details] [review]
[gaia] geoelectric:fix-1145875 > mozilla-b2g:master

Clearing the NI on Etienne in favor of a review request. 

If there are no objections, I really need to get this landed so we can start getting clean UI test results back again.
Flags: needinfo?(etienne)
Attachment #8582013 - Flags: review?(etienne)
Comment on attachment 8582013 [details] [review]
[gaia] geoelectric:fix-1145875 > mozilla-b2g:master

sure!
Attachment #8582013 - Flags: review?(etienne) → review+
Let's see if autolander lands this, given the Gip task problems right now. Failing that, I'll figure out the best manual way.
Status: NEW → RESOLVED
Closed: 4 years ago
Keywords: checkin-needed
Resolution: --- → FIXED
You need to log in before you can comment on or make changes to this bug.