Closed Bug 921194 Opened 7 years ago Closed 7 years ago

TypeError: stack is undefined from trusted_ui.js, line 303

Categories

(Firefox OS Graveyard :: Gaia, defect)

All
Gonk (Firefox OS)
defect
Not set
blocker

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: mdas, Assigned: alive)

References

Details

(Whiteboard: [qa-automation-blocked])

Attachments

(1 file)

In our automation, we call WindowManager's kill command on all applications (except homescreen), and this has worked up until recently, when we started to see:

3:53:14 TEST-UNEXPECTED-FAIL | test_settings_device_info.py TestSettingsDeviceInfo.test_that_device_info_is_populated | JavascriptException: TypeError: stack is undefined
13:53:14 	stacktrace:
13:53:14 	@app://system.gaiamobile.org/js/trusted_ui.js, line 303
13:53:14 ------------------------------------------------------------------ 

This is happening frequently, as you can see in some of the gaia-ui-test automation: http://qa-selenium.mv.mozilla.com:8080/view/B2G Hamachi/job/b2g.hamachi.mozilla-central.ui/83/console

There is no reduced test case we can think of right now, we only see this with central hamachi, running the unit test suite (where you'd need to set up your environment and testvars file according to https://github.com/mozilla-b2g/gaia/tree/master/tests/python/gaia-ui-tests):

gaiatest --testvars=/home/webqa/webqa-credentials/b2g/b2g-2.json --address=localhost:2828  --type=b2g-xfail-unit --restart gaiatest/tests/manifest.ini

My guess is that it's from this change: https://github.com/mozilla-b2g/gaia/commit/2e417d84777e3cfc974da1fa4745a2fbf9ad32aa
Or you know, maybe not that change, that was from 11 months ago.
Alive has done a bunch of changes on this recently, so he probably has some input as to why this is happening.
This does happen before my change at random rate, and that's why I filed https://bugzilla.mozilla.org/show_bug.cgi?id=911880 because trustedUI has some problem by maintain the app stack.
Could you tell me how to run your automation?

(In reply to Malini Das [:mdas] from comment #0)
> In our automation, we call WindowManager's kill command on all applications
> (except homescreen), and this has worked up until recently, when we started
> to see:
Assignee: nobody → alive
Note:
See https://bugzilla.mozilla.org/show_bug.cgi?id=863117#c0

(In reply to Alive Kuo [:alive] from comment #4)
> Could you tell me how to run your automation?
> 
> (In reply to Malini Das [:mdas] from comment #0)
> > In our automation, we call WindowManager's kill command on all applications
> > (except homescreen), and this has worked up until recently, when we started
> > to see:
(In reply to Alive Kuo [:alive] from comment #4)
> Could you tell me how to run your automation?

https://github.com/mozilla-b2g/gaia/blob/master/tests/python/gaia-ui-tests/README.md could provide information on how to run the tests.
The b2gperf tests are also affected by this, causing all following tests to be aborted.
Severity: normal → blocker
Whiteboard: [qa-automation-blocked]
Bug 905116 was mentioned by :lightsofapollo as a /possible/ suspect, but nobody's yet had a chance to try an on-device backout and re-test.
For what it's worth, in the performance startup tests this appears to almost always fail when starting the clock test. I haven't been able to narrow this down yet, but it might be related to launching the browser app 30 times between restarts, which immediately precedes the clock launches. I've added some extra resilience to b2gperf to allow it to continue after this, but it means we're losing measurements for clock at this time.
Possibly a coincidence, but the b2gperf test after bug 920944 was fixed passed. Perhaps that had further implications that just the ones seen there.
(In reply to Dave Hunt (:davehunt) from comment #11)
> Possibly a coincidence, but the b2gperf test after bug 920944 was fixed
> passed. Perhaps that had further implications that just the ones seen there.

Or not... The Hamachi build with the same revision failed. I have replicated the failure locally but haven't had time to see if I can reduce it down. It currently takes a couple of hours before the failure occurs, however I should be able to test if backing out bug 905116 locally helps.
This does happen occasionally before bug 905116 landed. I don't think back-out that works.
Thanks!
Can we just fix it for good instead of a backout? This is really affecting a lot of our automation, sometimes greater than 10% of tests.

The failure seems to rely on perfect alignment of this Exception and marionette running execute_aync_script which is why it comes as so intermittent.
No longer blocks: 801898
Blocks: 801898
I've reproduced locally and noticed this in the logcat just after the trusted UI error:

E/GeckoConsole(30228): Content JS LOG at dummy file:971 in GaiaApps.kill: terminating app with origi
n 'app://calendar.gaiamobile.org'
E/GeckoConsole(30228): Content JS LOG at dummy file:961 in GaiaApps.kill/gt_onAppTerminated/<: app with origin 'app://calendar.gaiamobile.org' has terminated

This doesn't appear when we are successful.  Doesn't fully explain whats going on, but perhaps a clue.
I hacked apps/calendar/js/controllers/alarm.js to always use `duration=1` in _resetSyncAlarm.  This forces the calendar sync process to launch every minute.

With this in place I can reliably trigger the trusted UI failure by running b2gperf.

So it definitely seems related to trying to kill the Calendar app launched from a mozAlarm.
Ok, I see what is happening here.

For the normal app life cycle trusted_ui.js sets the app origin to `this._lastDisplayedApp` when it receives the `appwillopen` event.  This then automatically creates an entry for that origin in the `this._displayedStacks` hash when you access `this.currentStack`.  This works out for normal apps because the `appterminated` event will have the same matching origin and will be found in `this._displayedStacks`.

In the case of non-normal app launches, for example from mozAlarm, then you may never get `appwillopen`.  In this case `this._lastDisplayedApp` is never set which results in a stack being created at `this._displayedStacks["null"]`.  Since there is no entry for the origin from the `appterminated` event, no stack is found and the type error occurs.

I'm not sure what the correct fix is here.  Perhaps something as simple as:

  if (origin && this._displayedApps[origin]) {
    stack = this._displayedApps[origin];
  }

In any case, I really think we should add a test case handling the non-normal app path here.  So a test that sets a very quick mozAlarm and then tries to kill it.  Since this is such an outlier case that will only happen infrequently it will be easy to regress on this.

Alive, what do you think is the best path forward?
Flags: needinfo?(alive)
As an aside, should the b2gperf and gaia-ui-tests clear all mozAlarms?  On the one hand, it would produce more consistent results, but on the other hand we need the system function well with alarms running.  Thoughts?
(In reply to Ben Kelly [:bkelly] from comment #18)
> Ok, I see what is happening here.
> 
> For the normal app life cycle trusted_ui.js sets the app origin to
> `this._lastDisplayedApp` when it receives the `appwillopen` event.  This
> then automatically creates an entry for that origin in the
> `this._displayedStacks` hash when you access `this.currentStack`.  This
> works out for normal apps because the `appterminated` event will have the
> same matching origin and will be found in `this._displayedStacks`.
> 
> In the case of non-normal app launches, for example from mozAlarm, then you
> may never get `appwillopen`.  In this case `this._lastDisplayedApp` is never
> set which results in a stack being created at
> `this._displayedStacks["null"]`.  Since there is no entry for the origin
> from the `appterminated` event, no stack is found and the type error occurs.
> 
> I'm not sure what the correct fix is here.  Perhaps something as simple as:
> 
>   if (origin && this._displayedApps[origin]) {
>     stack = this._displayedApps[origin];
>   }
> 
> In any case, I really think we should add a test case handling the
> non-normal app path here.  So a test that sets a very quick mozAlarm and
> then tries to kill it.  Since this is such an outlier case that will only
> happen infrequently it will be easy to regress on this.
> 
> Alive, what do you think is the best path forward?

Much thanks for spending time on this.
The proposal is fine.

The said mozAlarm should be invoked by system message, and window manager put the invoked frame in background.

In current stage, I would like to have 'appcreated' event in window manager when a app frame is created. (we have similar event for homescreen app already, see https://github.com/mozilla-b2g/gaia/blob/master/apps/system/js/homescreen_window.js#L5).
|created| would be paired with |terminated| event doubtless.

Ben, do you like to steal this bug or want me to complete it?

Thank you again for investigating!
Flags: needinfo?(alive)
Alive, can you complete?  I'm not that familiar with this part of the code and I still have a number of 1.2 launch/fps bugs to deal with.  Thanks!
(In reply to Ben Kelly [:bkelly] from comment #21)
> Alive, can you complete?  I'm not that familiar with this part of the code
> and I still have a number of 1.2 launch/fps bugs to deal with.  Thanks!

Sure, thank you for fast reply!
Fix async app life state in trustedUI
Attachment #816325 - Flags: review?(timdream)
Attachment #816325 - Flags: review?(timdream) → review+
(In reply to Ben Kelly [:bkelly] from comment #19)
> As an aside, should the b2gperf and gaia-ui-tests clear all mozAlarms?  On
> the one hand, it would produce more consistent results, but on the other
> hand we need the system function well with alarms running.  Thoughts?

We're seeing this in CI after flashing the device, so there shouldn't be any mozAlarms present (should there?). Also, the gaia-ui-tests run with --restart, which removes all persistent storage between tests, so there should be no mozAlarms there either...
Ben and Alive, thanks a lot for your work on this bug!
(In reply to Dave Hunt (:davehunt) from comment #24)
> We're seeing this in CI after flashing the device, so there shouldn't be any
> mozAlarms present (should there?). Also, the gaia-ui-tests run with
> --restart, which removes all persistent storage between tests, so there
> should be no mozAlarms there either...

Dave, are you sure we wipe the mozAlarm database between runs?  I was seeing the calendar alarms persist across the b2gperf executed restart locally.

I suspect that the alarm was being added after the flash when the Calendar set of tests run.  So the failure probably occurs around 15 minutes after those Calendar tests.
Thanks for the quick fix!

Since we think this might address the automation breakage I took the liberty of merging to get it into the build queue faster:

  https://github.com/mozilla-b2g/gaia/commit/19cbc0a11fa3f17908027569fdced8421d03ee0f
Status: NEW → RESOLVED
Closed: 7 years ago
Resolution: --- → FIXED
(In reply to Ben Kelly [:bkelly] from comment #26)
> (In reply to Dave Hunt (:davehunt) from comment #24)
> > We're seeing this in CI after flashing the device, so there shouldn't be any
> > mozAlarms present (should there?). Also, the gaia-ui-tests run with
> > --restart, which removes all persistent storage between tests, so there
> > should be no mozAlarms there either...
> 
> Dave, are you sure we wipe the mozAlarm database between runs?  I was seeing
> the calendar alarms persist across the b2gperf executed restart locally.
> 
> I suspect that the alarm was being added after the flash when the Calendar
> set of tests run.  So the failure probably occurs around 15 minutes after
> those Calendar tests.

We flash between runs, so everything should be wiped. We don't clear between apps within the b2gperf run though, and the failure is after the Calendar launch. I don't understand why just opening the Calendar app would add alarms though...
I've raised bug 926454 to clear up between app launches in b2gperf.
Sorry I forgot to mention that I did check this earlier. I ran some gaia-ui-tests clock tests on a device on a loop and they definitely weren't being kept between test runs. I think the tests would have failed long ago if that were the case anyway.
Zac: Apparently just launching Calendar creates an alarm for the syncing service ~15 minutes later. This explains the b2gperf issues but not necessarily the Gaia UI tests issues. I wonder if there's a different app other than Calendar that could be involved, or if we're somehow not clearing that particular mozAlarm...
oh pardon me, I didn't understand the STR then!
You need to log in before you can comment on or make changes to this bug.