Closed Bug 1031164 Opened 6 years ago Closed 6 years ago

Race condition for operatorvariant startup

Categories

(Firefox OS Graveyard :: Gaia::System, defect, blocker)

ARM
Gonk (Firefox OS)
defect
Not set
blocker

Tracking

(blocking-b2g:2.0+, b2g-v2.0 verified, b2g-v2.1 verified, b2g-v2.2 verified)

VERIFIED FIXED
2.0 S6 (18july)
blocking-b2g 2.0+
Tracking Status
b2g-v2.0 --- verified
b2g-v2.1 --- verified
b2g-v2.2 --- verified

People

(Reporter: gerard-majax, Assigned: gerard-majax)

References

Details

(Keywords: regression, Whiteboard: [systemsfe])

Attachments

(1 file)

This has been hugely documented in bug 1026381.

Basically, running the operatorvariant application depends on AppWindowFactory to have registered "open-app" event handler. This is done on receiving the "load" event from Gecko, by bootstrap.js.

The race happens when the open-app event handler gets registered after the Gecko OperatorApps.jsm modules detected a first-run-with-sim and triggered the operatorvariant app to get opened. Since the event handler has not yet been installed, there is no way that the operatorvariant app can be started.

This results in the customization not being applied at all.

Reproductibility in my case (master, Flame) was 4/5 at least.
Nominating since I think it will be very bad for 2.0.
blocking-b2g: --- → 2.0?
Target Milestone: --- → 2.0 S5 (4july)
QA, could we have this tested on something other than Flame?  This issue doesn't really matter on the Flame but certainly may affect other shipping devices.
Keywords: qawanted
QA Whiteboard: [VH-FL-blocking-][VH-FC-blocking?]
Just putting down notes from discussing this with Vivien:
 - we need to move loading OperatorApps after getting the load event
 - there may also be a race between DOMApplicationRegistry and OperatorApps
 - it would  be safer to:
   - get a notification of OperatorApps job's finished
   - move the loading of OperatorsApps into the registryStarted callback of DOMApplicationRegistry
   - make sending the webapp-registry-ready depending on this new previously added event

This would make sure that
 (1) we do not do OperatorApps stuff (that may install apps) while the DOMApplicationRegistry is not yet ready
 (2) make sure gaia triggers applicationsready event after OperatorApps is finished

By having (2), we get sure that the operatorvariant application will get started properly, and the open-app event that has been sent by OperatorApps has been queued.
(In reply to Peter Dolanjski [:pdol] from comment #2)
> QA, could we have this tested on something other than Flame?  This issue
> doesn't really matter on the Flame but certainly may affect other shipping
> devices.

Peter, according to my discussion this afternoon on this issue with Vivien, this race condition may have other bad impact on any devices.
This doesn't have anything to do with vertical homescreen.
QA Whiteboard: [VH-FL-blocking-][VH-FC-blocking?]
Carmen, it is my understanding that the leaves of the call graph of OperatorApps.jsm is the DOMApplicationRegistry.confirmInstall() call.

We should wait for this to complete before sending webapps-registry-ready.
Flags: needinfo?(cjc)
I don't know if I'm following here... 
OperatorApps doesn't really need to wait for the Webapps registry to finish registering the message handlers, because SystemMessageInternal already buffers the messages until that's done ([1] and [2]). And we don't really want to wait on OperatorApps finishing it's job, since it might take some time, and if you boot without a SIM it won't do anything, and if you boot with a PIN-locked SIM it won't even start until you unlock it on the FTU or at a later point.

I think the only thing needed here is to move the OperatorApps loading from [3] to [4]


[1] https://mxr.mozilla.org/mozilla-central/source/dom/messages/SystemMessageInternal.js#225
[2] https://mxr.mozilla.org/mozilla-central/source/dom/messages/SystemMessageInternal.js#515
[3] https://mxr.mozilla.org/mozilla-central/source/b2g/chrome/content/shell.js#636
[4] https://mxr.mozilla.org/mozilla-central/source/b2g/chrome/content/shell.js#641
Flags: needinfo?(cjc)
(In reply to Carmen Jimenez Cabezas from comment #7)
> I don't know if I'm following here... 
> OperatorApps doesn't really need to wait for the Webapps registry to finish
> registering the message handlers, because SystemMessageInternal already
> buffers the messages until that's done ([1] and [2]). And we don't really
> want to wait on OperatorApps finishing it's job, since it might take some
> time, and if you boot without a SIM it won't do anything, and if you boot
> with a PIN-locked SIM it won't even start until you unlock it on the FTU or
> at a later point.

OperatorApps can do some DOMApplicationRegistry.confirmInstall() calls. How can we do those without being sure that DOMApplicationRegistry has been fully and properly initialized?

> 
> I think the only thing needed here is to move the OperatorApps loading from
> [3] to [4]
> 

Which I already experimented back in las friday, and proved it was NOT enough to fix the race condition.

> 
> [1]
> https://mxr.mozilla.org/mozilla-central/source/dom/messages/
> SystemMessageInternal.js#225
> [2]
> https://mxr.mozilla.org/mozilla-central/source/dom/messages/
> SystemMessageInternal.js#515
> [3]
> https://mxr.mozilla.org/mozilla-central/source/b2g/chrome/content/shell.
> js#636
> [4]
> https://mxr.mozilla.org/mozilla-central/source/b2g/chrome/content/shell.
> js#641
(In reply to Alexandre LISSY :gerard-majax from comment #8)
> 
> OperatorApps can do some DOMApplicationRegistry.confirmInstall() calls. How
> can we do those without being sure that DOMApplicationRegistry has been
> fully and properly initialized?
> 

Because the initialization do a lot of things that can be done in parallel with installing other apps. Things like registering activities, registering message handlers, loading permissions... are not really blocking with installing new apps while they're running.

The only thing needed to install an app is for the registry to actually exist. And that's taken care of by the constructor/init of Webapps, which is invoked before OperatorApps start
Still discussing if this is a bug or not but we want to make sure our operator variant code is rock-solid.
blocking-b2g: 2.0? → 2.0+
So, below is my understanding of the events. I tried to make this clean and clear.

=====> applying operator customizations depends on:
       (1) start of "operatorvariant" app with "first-run-with-sim" system message
       cf. apps/operatorvariant/js/operator_variant.js, init()

=> (1) starting the operatorvariant app depends on:
       (2) reception of the "open-app" event
       cf. apps/system/js/app_window_factory.js, "open-app" case in handleEvent()
 
=> (2) receiving "open-app" depends on:
       (3) having the event listener setup
       cf. apps/system/js/app_window_factory.js, start()
       (4) gecko sending "open-app" request
       cf. b2g/components/SystemMessageGlue.js, openApp()

=> (3) adding "open-app" event depends on:
       (5) applications.ready is true
       cf. apps/system/js/app_window_factory.js, start()

=> (4) gecko sending "open-app" request depends on:
       (10) detecting first run with SIM and sending system message
       cf. dom/apps/src/OperatorApps.jsm, init()

=> (5) applications.ready is true depends on:
       (6) mozApps.mgmt.getAll() completes
       cf. apps/system/js/applications.js, getAllApps()

=> (6) mozApps.mgmt.getAll() completes depends on:
       (7) receiving "webapps-registry-ready" mozChromeEvent
       cf. apps/system/js/applications.js, start()

=>(13) AppWindowFactory.start() depends on:
       (5) applications.ready is true
       cf. apps/system/js/bootstrap.js, registerGlobalEntries

=>(12) Applications.start() depends on:
       (8) loading system app, "load" event

=> (7) mozChromeEvent "webapps-registry-ready" depends on:
       (9) DOMApplicationRegistry is ready
       cf. b2g/chrome/content/shell.js, DOMApplicationRegistry.registryReady.then()

=>(10) detecting first run with SIM and sending system message depends on:
       (11) loading OperatorApps.jsm, running init(), triggered by "mozbrowserloadstart" event
       cf. b2g/chrome/content/shell.js, notifyContentStart()

The key of the race condition is that we need (13) to be done before (4). Also, (13) depends on (6) (via (5)). Getting (6) means DOMApplicationRegistry is ready AND we traversed the whole list of applications.

If [ (7) + (6) ] > (10) then we will send the system message and thus the "open-app" event before the handler got a chance to be ready in (13).
Flags: needinfo?(fabrice)
Flags: needinfo?(cjc)
The more I think about this, the more I believe this is a Gaia bug and not a Gecko bug. Gecko is already queueing the events until the moment Gaia tells it's ready to process them. Only Gaia is telling it's ready (somehow, on the onload event) way before it's actually ready.

The handler for open-app isn't set until applications.ready is true (or the applicationready event is fired) on [1]. That happens when getAll finishes. And what I think it's happening is that Gecko is firing the open-app message before the getAll finishes

[1] http://mxr.mozilla.org/gaia/source/apps/system/js/applications.js#62

And the fix would be:
1. Set the open-app handler as soon as possible (that is, don't wait for applicationready).
2. If a open-app event arrives before application is ready, cache it.
3. When applicationready is received, process any cached open-app events

Anyway, this race condition isn't specific for OperatorApps. It just happens that OperatorApps is one of (the one?) that fires a System Message earlier.
Flags: needinfo?(cjc)
Ok, after reading your list, I think we agree on the event list. 
As I said on my previous comment, I think the solution for the race is to ensure than when gecko sends the open-app (which will actually be when the system app onload fires, independently on when OperatorApps queue it) the handler is set. And we can ensure that by actually setting the handler independently of the applications.ready state.
(In reply to Peter Dolanjski [:pdol] from comment #2)
> QA, could we have this tested on something other than Flame?  This issue
> doesn't really matter on the Flame but certainly may affect other shipping
> devices.

Hi Peter - could I get some STRs. This thread is a little dev-jargon heavy and it is not clear to the QA-Wanted testers what exactly the bug is and how to repro it to check it on other devices (if you still need that).

Thanks!
Flags: needinfo?(pdolanjski)
(In reply to Joshua Mitchell from comment #14)
> (In reply to Peter Dolanjski [:pdol] from comment #2)
> > QA, could we have this tested on something other than Flame?  This issue
> > doesn't really matter on the Flame but certainly may affect other shipping
> > devices.
> 
> Hi Peter - could I get some STRs. This thread is a little dev-jargon heavy
> and it is not clear to the QA-Wanted testers what exactly the bug is and how
> to repro it to check it on other devices (if you still need that).
> 
> Thanks!

Alexandre would be a better person to ask here.
Flags: needinfo?(pdolanjski) → needinfo?(lissyx+mozillians)
(In reply to Carmen Jimenez Cabezas from comment #12)

> The handler for open-app isn't set until applications.ready is true (or the
> applicationready event is fired) on [1]. That happens when getAll finishes.
> And what I think it's happening is that Gecko is firing the open-app message
> before the getAll finishes

Why do we wait for getAll there? Don't we have all what we need in the open-app message to launch the app?
Flags: needinfo?(fabrice)
(In reply to Joshua Mitchell from comment #14)
> (In reply to Peter Dolanjski [:pdol] from comment #2)
> > QA, could we have this tested on something other than Flame?  This issue
> > doesn't really matter on the Flame but certainly may affect other shipping
> > devices.
> 
> Hi Peter - could I get some STRs. This thread is a little dev-jargon heavy
> and it is not clear to the QA-Wanted testers what exactly the bug is and how
> to repro it to check it on other devices (if you still need that).
> 
> Thanks!

I'm reproducing this by:
 - flash an uptodate gecko master on my flame
 - modify gaia's customization/variant.json to add my SIM MCC/MNC. Please note it should be a 3-3 string, like for my case "208-001" (Orange F SIM card)
 - apply customization by doing a reset-gaia

Expected:
 Customization are applied

Actual:
 Customization are not applied. Specifically, the default variant.json makes the background blue when operator variant is applied, while the default wallpaper setting is a star sky. So if you get a star sky, operator variant was not applied, if you get a blue wallpaper, it has been applied.

Please note that from my testing, it seems the very first operator variant customization /may/ work, but further reset-gaia will mostly constantly fail.

The reset-gaia I'm performing is:
> make MOZILLA_OFFICIAL=1 PRODUCTION=1 REMOTE_DEBUGGER=1 NOFTU=0 GAIA_DEV_PIXELS_PER_PX=1.5 GAIA_OPTIMIZE=1 GAIA_DISTRIBUTION_DIR=$PWD/customization/ reset-gaia
Flags: needinfo?(lissyx+mozillians) → needinfo?(jmitchell)
Josh - For QA Wanted, you just need to flash a customization using https://wiki.mozilla.org/B2G/QA/Customizations#Build_Setup & verify that the custom wallpaper appears based on your SIM:

* Red if it's AT&T US
* Blue if it's T-Mobile US
(In reply to Jason Smith [:jsmith] from comment #18)
> * Red if it's AT&T US
> * Blue if it's T-Mobile US

I got a solid red wallpaper on Homescreen with an AT&T SIM.

Tested on Buri with today's 2.0 Gecko, and Gaia is from following this guide https://wiki.mozilla.org/B2G/QA/Customizations#Build_Setup

Comment 2 said to test on something other than Flame so I tested on Buri. And since this was marked blocking 2.0, I tested on 2.0 branch.

(In reply to Alexandre LISSY :gerard-majax from comment #17)
> Please note that from my testing, it seems the very first operator variant
> customization /may/ work, but further reset-gaia will mostly constantly fail.

I tried twice doing step 5 of https://wiki.mozilla.org/B2G/QA/Customizations#Build_Setup and both times showed a red wallpaper.

Also tried resetting phone by going to Settings > Device Information > More Information > Reset Phone, and after the FTU I saw a red wallpaper.

I did see a sky wallpaper on all three attempts for a second BEFORE the device successfully booted up and entered FTU. I'm not sure if this is the issue you're seeing.
QA Whiteboard: [QAnalyst-Triage?]
Keywords: qawanted
QA Whiteboard: [QAnalyst-Triage?] → [QAnalyst-Triage+]
Flags: needinfo?(jmitchell)
(In reply to Pi Wei Cheng (piwei) from comment #19)
> (In reply to Jason Smith [:jsmith] from comment #18)
> > * Red if it's AT&T US
> > * Blue if it's T-Mobile US
> 
> I got a solid red wallpaper on Homescreen with an AT&T SIM.
> 
> Tested on Buri with today's 2.0 Gecko, and Gaia is from following this guide
> https://wiki.mozilla.org/B2G/QA/Customizations#Build_Setup
> 
> Comment 2 said to test on something other than Flame so I tested on Buri.
> And since this was marked blocking 2.0, I tested on 2.0 branch.
> 
> (In reply to Alexandre LISSY :gerard-majax from comment #17)
> > Please note that from my testing, it seems the very first operator variant
> > customization /may/ work, but further reset-gaia will mostly constantly fail.
> 
> I tried twice doing step 5 of
> https://wiki.mozilla.org/B2G/QA/Customizations#Build_Setup and both times
> showed a red wallpaper.
> 
> Also tried resetting phone by going to Settings > Device Information > More
> Information > Reset Phone, and after the FTU I saw a red wallpaper.
> 
> I did see a sky wallpaper on all three attempts for a second BEFORE the
> device successfully booted up and entered FTU. I'm not sure if this is the
> issue you're seeing.

No, I'm referring to the status after finishing FTU.

Anyway, I'm not sure it's very conclusive.
This is a race condition, Buri is a single core device, Flame is a dual core. This may have an impact. Also, you only did it twice while my comments documented that sometimes it was working. So I don't think that testing customization only twice would be enough to make sure you are reproducing or not.
Also... if you have a PIN set then you won't run into this problem ever (since you have to enter the PIN before the process is launched, and to do that system has to be loaded completely and have to had launched the FTU, so the race condition won't happen).
QA Whiteboard: [QAnalyst-Triage+] → [QAnalyst-Triage+][lead-review+]
Target Milestone: 2.0 S5 (4july) → 2.0 S6 (18july)
(In reply to Carmen Jimenez Cabezas from comment #13)
> Ok, after reading your list, I think we agree on the event list. 
> As I said on my previous comment, I think the solution for the race is to
> ensure than when gecko sends the open-app (which will actually be when the
> system app onload fires, independently on when OperatorApps queue it) the
> handler is set. And we can ensure that by actually setting the handler
> independently of the applications.ready state.

Okay, I'm going to send a patch based on this idea:
 - register open-app as early as possible,
 - wait until applications.ready is true,
 - deliver pending open-app messages,

Meanwhile, could you please reply to fabrice's comment 16 ?
Flags: needinfo?(carmen.jimenezcabezas)
(In reply to Alexandre LISSY :gerard-majax from comment #22)
> (In reply to Carmen Jimenez Cabezas from comment #13)
> > Ok, after reading your list, I think we agree on the event list. 
> > As I said on my previous comment, I think the solution for the race is to
> > ensure than when gecko sends the open-app (which will actually be when the
> > system app onload fires, independently on when OperatorApps queue it) the
> > handler is set. And we can ensure that by actually setting the handler
> > independently of the applications.ready state.
> 
> Okay, I'm going to send a patch based on this idea:
>  - register open-app as early as possible,
>  - wait until applications.ready is true,
>  - deliver pending open-app messages,
> 
> Meanwhile, could you please reply to fabrice's comment 16 ?

I wasn't ignoring Fabrice :) It's just that I don't know why we're waiting for getAll there, I didn't wrote that code. Redirecting the ni to :garychen (who wrote the code) and/or :alive (who reviewed it)
Flags: needinfo?(gchen)
Flags: needinfo?(carmen.jimenezcabezas)
Flags: needinfo?(alive)
It's not gary wrote. It coming from the old window manager.

The possible cause might be, we need to launch FTU.
But it's not coming from open-app request, hence we don't know the url and manifest context.

FTU is coming from ftu.manifestURL, hence we need to query app mgmt to know the detail of FTU app.
Flags: needinfo?(gchen)
Flags: needinfo?(alive)
So here I proposed to
* Change browserConfigHelper to use more arguments.
** If there's no manifest info in the arguments, query window.applications by getByManifestURL.
** If there's manifest/origin information provided by open-app, skip querying.

Thus we don't need AppWindowFactory to depend to Application.ready(),
but FtuLauncher/HomescreenLauncher both need because they are not launched via gecko open-app message.

Let me know if you need my help to fix it.
(In reply to Fabrice Desré [:fabrice] from comment #16)
> (In reply to Carmen Jimenez Cabezas from comment #12)
> 
> > The handler for open-app isn't set until applications.ready is true (or the
> > applicationready event is fired) on [1]. That happens when getAll finishes.
> > And what I think it's happening is that Gecko is firing the open-app message
> > before the getAll finishes
> 
> Why do we wait for getAll there? Don't we have all what we need in the
> open-app message to launch the app?

Sadly, I don't think so http://dxr.mozilla.org/mozilla-central/source/dom/messages/SystemMessageInternal.js#597

So we won't be able to implement Alive's proposal :(
Implementation of the solution suggested by Carmen: early registration of open-app and queuing of pending events until AppWindowFactory is ready. We also have to change homescreenLauncher start because of a slight dependency (from AppWindowFactory.launch).

Still lacks tests, though.
I did a bunch (10+) of retries, and after applying my changes there's no way I boot without operator customizations applied :)
Whiteboard: [systemsfe] → [systemsfe][ETA=7/15]
Pushed updated PR: adding tests that ensures event app-window-factory-ready is sent
Depends on: 1038609
Pushed updated PR: all tests should now be in place. All green on my side :)
Comment on attachment 8455261 [details] [review]
Link to Github https://github.com/mozilla-b2g/gaia/pull/21689

Alive, could you please review this? Thanks!
Attachment #8455261 - Flags: review?(alive)
Comment on attachment 8455261 [details] [review]
Link to Github https://github.com/mozilla-b2g/gaia/pull/21689

I want to keep bootstrap clean.
Let's try start the appWindowFactory just after window.load and application is instantiated. Queue the events inside awf.
Attachment #8455261 - Flags: review?(alive)
Comment on attachment 8455261 [details] [review]
Link to Github https://github.com/mozilla-b2g/gaia/pull/21689

Moved everything to AppWindowFactory, keeping changes to bootstrap.js to the bare minimum.
Attachment #8455261 - Flags: review?(alive)
Updated PR, refactoring as suggested by alive offline: we now queue all events.
Comment on attachment 8455261 [details] [review]
Link to Github https://github.com/mozilla-b2g/gaia/pull/21689

r+ with nits, thanks for your patience.
Attachment #8455261 - Flags: review?(alive) → review+
Whiteboard: [systemsfe][ETA=7/15] → [systemsfe][ETA=7/16]
(In reply to Alive Kuo [:alive][NEEDINFO!][Berlin 7/14-7/18] from comment #35)
> Comment on attachment 8455261 [details] [review]
> Link to Github https://github.com/mozilla-b2g/gaia/pull/21689
> 
> r+ with nits, thanks for your patience.

All nits addressed, thanks!
https://github.com/mozilla-b2g/gaia/commit/7fa1e2157e4e0fe1f1b6796a2a67ac59dd515053
Status: NEW → RESOLVED
Closed: 6 years ago
Resolution: --- → FIXED
\o/  Thanks, Alexandre!  This had blocked me from testing on flame and now it's working!
Issue verified fixed on Flame 2.0, Flame 2.1, and Flame 2.2

Actual Results: With operator variant set the background is set to the correct blue mobizilla background. Checked again, adding other backgrounds to customization folder and changing paths in variant.json. In each instance customization applied correctly.

Device: Flame 2.0 (319mb)(Kitkat Base)(Shallow Flash)
BuildID: 20141111000205
Gaia: ab83632c92f9fc571b11d8468b6901cc4ed905c0
Gecko: fcc5d31f84d7
Version: 32.0 (2.0)
Firmware: V188-1
User Agent: Mozilla/5.0 (Mobile; rv:32.0) Gecko/32.0 Firefox/32.0

Device: Flame 2.1 (319mb)(Kitkat Base)(Shallow Flash)
BuildID: 20141111001201
Gaia: 4c159e75a1568afbbf0c83c1235ec56facfbe87d
Gecko: 452db1a0c9a0
Version: 34.0 (2.1)
Firmware: V188-1
User Agent: Mozilla/5.0 (Mobile; rv:34.0) Gecko/34.0 Firefox/34.0

Device: Flame 2.2 Master (319mb)(Kitkat Base)(Shallow Flash)
BuildID: 20141111040202
Gaia: 5ae28ff11b982e2bd7d1aa097cda131536952bdc
Gecko: 76b85b90a8cb
Version: 36.0a1 (2.2 Master)
Firmware: V188-1
User Agent: Mozilla/5.0 (Mobile; rv:36.0) Gecko/36.0 Firefox/36.0
Status: RESOLVED → VERIFIED
QA Whiteboard: [QAnalyst-Triage+][lead-review+] → [QAnalyst-Triage?][lead-review+]
Flags: needinfo?(ktucker)
QA Whiteboard: [QAnalyst-Triage?][lead-review+] → [QAnalyst-Triage+][lead-review+]
Flags: needinfo?(ktucker)
You need to log in before you can comment on or make changes to this bug.