Closed Bug 1160923 Opened 5 years ago Closed 4 years ago

[System] System app may lose system messages at startup due to initialization races; may result in overdue alarm clock alarms being lost until the clock app is manually started

Categories

(Firefox OS Graveyard :: Runtime, defect, P3)

ARM
Gonk (Firefox OS)
defect

Tracking

(blocking-b2g:2.5+, b2g-v2.1 affected, b2g-v2.1S affected, b2g-v2.2 affected, b2g-master fixed)

RESOLVED FIXED
blocking-b2g 2.5+
Tracking Status
b2g-v2.1 --- affected
b2g-v2.1S --- affected
b2g-v2.2 --- affected
b2g-master --- fixed

People

(Reporter: ffos_st, Assigned: timdream, NeedInfo)

References

Details

(Whiteboard: sprd431227)

Attachments

(9 files, 6 obsolete files)

77.03 KB, text/plain
Details
74.58 KB, text/plain
Details
61.80 KB, text/plain
Details
62.01 KB, text/plain
Details
265.45 KB, application/x-gzip
Details
218.85 KB, application/x-gzip
Details
1.67 KB, text/plain
Details
223.16 KB, application/x-gzip
Details
22.57 KB, patch
timdream
: review+
Details | Diff | Splinter Review
User Agent: Mozilla/5.0 (X11; Linux i686 on x86_64; rv:37.0) Gecko/20100101 Firefox/37.0
Build ID: 20141217030202

Steps to reproduce:

Description: Alarm does not ring by itself when we set the alarm , power off the phone and after some time power on the phone again. We have to open the clock app to make the alarm ring

Device: SPRD 7715ea
Build Identifier: 20150504131236

Steps to Reproduce:
1) don't  insert SIM card (or insert SIM card, but not select Set automatically in Date&Time settings)

2) Goto clock, set an alarm which is two minutes later than current time of the mobile.
   e.g. Current time is 5:02pm, Set an alarm at 5:04pm.

3) power off the phone, wait for a few minutes more than two minutes. power on the phone.

Actual Result: The alarm does not ring by itself, we have to open the clock application to make the alarm ring

Expected Result: Alarm should always ring by itself when we power on the phone again.

Repro frequency:10/10,100%
Whiteboard: sprd431227
Dear Vance : 
   Can you arrange for an engineer to look at this question ? thanks very much.
Status: UNCONFIRMED → NEW
Ever confirmed: true
Flags: needinfo?(vchen)
Hum not sure if this is an issue or not. Hi Jinchao could you help to check if Flame can reproduce this issue as well?

Thanks
Flags: needinfo?(vchen) → needinfo?(jinchao.wang)
(In reply to Vance Chen [:vchen][vchen@mozilla.com] from comment #2)
> Hum not sure if this is an issue or not. Hi Jinchao could you help to check
> if Flame can reproduce this issue as well?
> 
> Thanks

Dear Vance:
   As comment in sprd431227, our engineer can reproduce this issue on flame too , thanks.
Flags: needinfo?(jinchao.wang)
Hi Naoki, do you know if this is indeed a bug or a per-design behavior?
Flags: needinfo?(nhirata)
(In reply to jinchao.wang from comment #3)
>    As comment in sprd431227, our engineer can reproduce this issue on flame
> too , thanks.

Is this a Flame configured to only have 319M or does it have more memory?  At startup time the mozAlarms API implementation runs through all of the alarms that should already fired and delivers them.  It seems quite possible that at startup and/or if there are other configured alarms, the clock app could be launched but get OOM-killed before it can trigger its alarm.  There might also be indications of this in any logcat results you might be able to provide.
Flags: needinfo?(jinchao.wang)
Hi Jinchao, per Comment#5, could you help to verify this issue again with Flame 512 MB and Flame 1G RAM? Also please help to get a logcat log of this issue as well

Thanks
(In reply to Vance Chen [:vchen][vchen@mozilla.com] from comment #6)
> Hi Jinchao, per Comment#5, could you help to verify this issue again with
> Flame 512 MB and Flame 1G RAM? Also please help to get a logcat log of this
> issue as well
> 
> Thanks

Hi, Vance:

I don't know how big is my flame RAM, but I use adb to check the meminfo, RAM is 416124 kB,
root@flame:/proc # cat meminfo                                                 
MemTotal:         416124 kB

and we don't have a Flame that RAM is 512MB or 1G, we just own one flame. sorry to cann't help you.
Attached file flame.txt
this is the logcat log of this issue
Hi ffos_st -

Actually every Flame device comes with 1G RAM. Please follow the steps here to re-config it to 1G or 512MB

1. type "adb reboot bootloader" to enter fastboot mode
2. type "fastboot oem mem 1024" to change the RAM to 1 G
3. type "fastboot reboot" to reboot the device.

And from your information, your current RAM configuration is already 512 MB. So please change the RAM back to 512(by changing the number from 1024 to 512 in step 2) after you done verification with 1G RAM

Thanks
Flags: needinfo?(jinchao.wang) → needinfo?(ffos_st)
Also BTW, ffos_st could you also help to get the dmesg log of this issue? (by "adb shell dmesg")

Thanks!
Attached file flame-1G.txt
Hi, Vance:

I can repreduce when flame RAM is 1G,  this is the log
Flags: needinfo?(ffos_st)
Attached file dmsg-512M.txt
this is the dmesg log when RAM is 512M
Attached file dmsg-1G
this is the dmesg log when RAM is 1G
The logcats themselves look somewhat boring, although the segfault in dmesg looks concerning (although probably unrelated?).

Probably the right thing to do from a Mozilla QA and/or devices team perspective is:
- Attempt to reproduce on Flame
- If it reproduces, modify the "const DEBUG = false;" in https://dxr.mozilla.org/mozilla-central/source/dom/alarm/AlarmsManager.js and https://dxr.mozilla.org/mozilla-central/source/dom/alarm/AlarmDB.jsm and https://dxr.mozilla.org/mozilla-central/source/dom/alarm/AlarmService.jsm to be true.
- Run again with that debugging, capturing a logcat.

This will make it pretty clear what mozAlarms is doing at startup.  If ffos_st wants to make these changes locally and run with them, that's also great.
Hi Andrew, I think the logs ffos_st provided in Comment#12 and Comment#13 are Flame log

Hi ffos_st, is it possible you can modify your local flame codebase as Andrew mentioned in Comment#14(turn on debug option in AlarmsManager.js, AlarmDB.jsm, and AlarmService.jsm), then generate a new build to test and get log again?

Thanks for your help
Flags: needinfo?(ffos_st)
Hi, Vance:

if I insert SIM card, but not select Set automatically in Date&Time settings, then I will reproduce this bug, 
but if I select Set automatically to checked in Date&Time settings, I cann't reproduce this bug.

Can you analyze root causes? thank you very much.
Flags: needinfo?(ffos_st)
(In reply to Andrew Sutherland [:asuth] from comment #14)
> The logcats themselves look somewhat boring, although the segfault in dmesg
> looks concerning (although probably unrelated?).
> 
> Probably the right thing to do from a Mozilla QA and/or devices team
> perspective is:
> - Attempt to reproduce on Flame
> - If it reproduces, modify the "const DEBUG = false;" in
> https://dxr.mozilla.org/mozilla-central/source/dom/alarm/AlarmsManager.js
> and https://dxr.mozilla.org/mozilla-central/source/dom/alarm/AlarmDB.jsm and
> https://dxr.mozilla.org/mozilla-central/source/dom/alarm/AlarmService.jsm to
> be true.
> - Run again with that debugging, capturing a logcat.
> 
> This will make it pretty clear what mozAlarms is doing at startup.  If
> ffos_st wants to make these changes locally and run with them, that's also
> great.

Preset conditions:
1)insert sim card
2)settings > Date&Time > Set automatically > disable
3) settings > Screen Lock > Lock screen > disable (Convenient to test)

Steps:
1)if now is 10:00 am, set alarm to 10:01am  and power off
2)  after 2 minutes , now is 10:02 am , then we power on the phone
3)  place the phone on the desk , don't touch

Result:
Wait 5 minutes, the alarm won't ring any more . But, after 5 minutes,
If now we click clock icon to open clock app , alarm will ring.
If we click settings > Date & Time > Set automatically > enable  , alarm will ring too.

Analysis:
We think it maybe relative to moztimechange event.


Dear Vance &  Andrew:
   I will open debug switch on spreadtrum dophin platform , and provide you reproduce log, thanks.
Hi All:
  If alarm to ring , we found b2g process will firstly fire open-app(app_window_factory.js)event to start clock app, then clock app/process will do something , then b2g process will fire mozbrowseropenwindow event to open clock attantionscreen to ring alarm.

  So we found a workround way to avoid this issue. we start clock app every time when the phone is power on:
  app_window_factory.js
        case 'ftuskip':
          if (!System.locked) {
            this.display();
          }

         + var self = this;
         + setTimeout(function(){
         +   var clockConfig = {
         +     manifestURL: "app://clock.gaiamobile.org/manifest.webapp",
         +     name: "Clock",
         +     oop:true,
         +     origin:"app://clock.gaiamobile.org",
         +     url: "app://clock.gaiamobile.org/index.html",
         +     stayBackground: true
         +   };
         +
         +   var clockApp = applications.getByManifestURL(clockConfig.manifestURL);
         +   clockConfig.manifest = clockApp.manifest;
         +
         +   if (!self.isRunning(clockConfig)) {
         +     new AppWindow(clockConfig);
         +   }
         + },2000);

          break;

    This is just a work round , but we think this is not a good idea. we should solve this problem by root cause.
Attached file alarm.tar.gz
reply to comment 14 

Dear Vance &  Andrew:
   I opened debug switch on spreadtrum dophin platform , and provide you reproduce log, thanks.
Flags: needinfo?(vchen)
Thank you very much for providing the log!

We can see that the AlarmService is correctly finding the alarm at startup and trying to notify the clock app:
===
05 17:29:26.110   122   122 I Gecko   : AlarmDB: Request successful. Record count: 1
05-05 17:29:26.150   122   122 I Gecko   : AlarmService: Callback after getting alarms from database: [{"date":"2015-05-05T09:28:00.000Z","ignoreTi
mezone":true,"data":{"id":1,"type":"normal"},"pageURL":"app://clock.gaiamobile.org/index.html","manifestURL":"app://clock.gaiamobile.org/manifest.w
ebapp","timezoneOffset":-480,"id":1}]
05-05 17:29:26.150   122   122 I Gecko   : AlarmService: _removeAlarmFromDb()
05-05 17:29:26.150   122   122 I Gecko   : AlarmDB: remove()
05-05 17:29:26.150   122   122 I Gecko   : AlarmService: _notifyAlarmObserver()
05-05 17:29:26.150   122   122 I Gecko   : AlarmService: Fire system message: {"date":"2015-05-05T09:28:00.000Z","ignoreTimezone":true,"data":{"id"
:1,"type":"normal"},"pageURL":"app://clock.gaiamobile.org/index.html","manifestURL":"app://clock.gaiamobile.org/manifest.webapp","timezoneOffset":-
480,"id":1}
===

But we never see the clock app launch until triggered by the home screen:
===
05-05 17:32:22.470   580   580 E GeckoConsole: Content JS LOG at app://verticalhome.gaiamobile.org/gaia_build_defer_index.js:395 in GridView.prototype.clickIcon: Bug426981, launch app is app://clock.gaiamobile.org
===

At that point (thanks to custom debugging someone added with a "wjc" prefix, it looks like, because I'm not seeing the other debug statements from SystemMessageInternal.js), we can see that the system message was in fact waiting there because we see:
===
05-05 17:32:24.330   744   744 I Gecko   : wjc receiveMessage SystemMessageManager:GetPendingMessages:Return for [alarm] with manifest URL = app://clock.gaiamobile.org/manifest.webapp and page URL = app://clock.gaiamobile.org/index.html
===
which is then followed by an indication that an attentionwindow is generated:
===
05-05 17:32:24.680   122   122 E GeckoConsole: Content JS LOG at app://system.gaiamobile.org
/js/app_window.js:1165 in aw_debug: [AttentionWindow][app://clock.gaiamobile.org][AttentionWindow_3][173.001] publishing external event: willrender
===

So this seems like it's either a problem with SystemMessageInternal or the system app processing the should-happen openApp/open-app thing.  Since I didn't see any patches in SystemMessageInternal that seem like v2.1s could be missing, I'm going to say this is a system app thing and move the bug there.  Unfortunately, there's another lack of enabled/useful logging at this point.

jinchao.wang, if you can enable the debug in https://dxr.mozilla.org/mozilla-central/source/dom/messages/SystemMessageInternal.js#52 and provide another log, we can try and further determine exactly where the problem is, although hopefully the system team can help identify what's going on.

System functional team/app team, I'm officially handing this off to you at this point.  In recognition of that, I'm needinfoing :alive who is listed as the first owner of the app.
Component: Gaia::Clock → Gaia::System
Flags: needinfo?(alive)
Summary: [FFOS7715 v2.1s] [Clock] Alarm does not ring by itself when we set the alarm , power off the phone and after some time power on the phone again. We have to open the clock app to make the alarm ring → [FFOS7715 v2.1s] [Clock] overdue "alarm" system message sent at startup is not triggering launch of clock app, but system remains in GetPendingMessages until app is manually launched UNLESS "set automatically" is enabled for the clock
Flags: needinfo?(vchen)
Flags: needinfo?(nhirata)
Hand over to chens

What we could do is make sure open-app event is caught in AppWindowFactory or not to clarity this is gaia or gecko bug.
Flags: needinfo?(alive) → needinfo?(chens)
Attached file add log to awf
Jinchao, would you try this patch, follow same steps in comment 17 and share the log with us? thanks.
Flags: needinfo?(chens) → needinfo?(jinchao.wang)
(In reply to Sherman Chen [:chens] from comment #23)
> Created attachment 8602040 [details]
> add log to awf
> 
> Jinchao, would you try this patch, follow same steps in comment 17 and share
> the log with us? thanks.

Hi Sherman :
   I has upload the log according to your patch, please check, thanks !
Flags: needinfo?(jinchao.wang)
This bug can also be reproduced(10/10, 100%) on flame following the same steps in comment 17 .
From the log in attachment 8602104 [details], we can observe AppWindowFactory is trying to handle pending events, but there's no open-app events. And the problem should be why gecko is not sending open-app event to system app.
Ah, so I think I identified the problem.  There's a pretty bad initialization ordering race in js/bootstrap.js:

- AppWindowFactory.start() is called by registerGlobalEntries() inside window.addEventListener('load', function startup() {}), but only once applications.ready is true.  This is dependent upon navigator.mozApps.mgmt.getAll() having returned.

- shell.js' notifyContentStart method (triggered by mozbrowserloadstart) calls "SystemAppProxy.setIsReady();" (which triggers SystemAppProxy to deliver all _pendingEvents) inside a content.addEventListener('load', function() {}) itself.

Which is two fundamental issues:
1) registerGlobalEntries() is potentially called WAY after "load" fires because of the data dependency on mozApps.mgmt.getAll().
2) SystemAppProxy and bootstrap's "load" event listeners are potentially racing each other.

There are a number of ways this could be addressed, but pretty much AppWindowFactory.start() should be called synchronously during the execution of one of the JS files.  This ensures that it is initialized prior to the "load" event firing, thereby avoiding all potential races.

(Note: my code analysis was done against gaia master.)
er, and for context for anyone reading along, AppWindowFactory.start() is where window.addEventListener('open-app', ...) gets called.  So the event is just falling on the floor because of this race.

(The comment 21 log clearly contained:
  05-06 21:40:38.520   123   123 I Gecko   : -- SystemMessageInternal 1430919638529 : Asking to open pageURL: app://clock.gaiamobile.org/index.html, manifestURL: app://clock.gaiamobile.org/manifest.webapp, type: alarm, target: undefined, showApp: false, onlyShowApp: false, extra: undefined
which means that the problem was in SystemAppProxy or downstream of that.)
Summary: [FFOS7715 v2.1s] [Clock] overdue "alarm" system message sent at startup is not triggering launch of clock app, but system remains in GetPendingMessages until app is manually launched UNLESS "set automatically" is enabled for the clock → [FFOS7715 v2.1s] System app may lose system messages at startup due to initialization races; may result in overdue alarm clock alarms being lost until the clock app is manually started
(In reply to Andrew Sutherland [:asuth] from comment #28)
> Ah, so I think I identified the problem.  There's a pretty bad
> initialization ordering race in js/bootstrap.js:
> 
> - AppWindowFactory.start() is called by registerGlobalEntries() inside
> window.addEventListener('load', function startup() {}), but only once
> applications.ready is true.  This is dependent upon
> navigator.mozApps.mgmt.getAll() having returned.
> 
> - shell.js' notifyContentStart method (triggered by mozbrowserloadstart)
> calls "SystemAppProxy.setIsReady();" (which triggers SystemAppProxy to
> deliver all _pendingEvents) inside a content.addEventListener('load',
> function() {}) itself.
> 
> Which is two fundamental issues:
> 1) registerGlobalEntries() is potentially called WAY after "load" fires
> because of the data dependency on mozApps.mgmt.getAll().
> 2) SystemAppProxy and bootstrap's "load" event listeners are potentially
> racing each other.
> 
> There are a number of ways this could be addressed, but pretty much
> AppWindowFactory.start() should be called synchronously during the execution
> of one of the JS files.  This ensures that it is initialized prior to the
> "load" event firing, thereby avoiding all potential races.
> 
> (Note: my code analysis was done against gaia master.)

Hmmm...interesting investigation. But I am having one question: I don't see the STR includes shutdown the phone in comment 17. What you described might happen but it shouldn't be the case in this bug. What do I miss?
(In reply to Alive Kuo [:alive][NEEDINFO!] from comment #30)
> (In reply to Andrew Sutherland [:asuth] from comment #28)
> > Ah, so I think I identified the problem.  There's a pretty bad
> > initialization ordering race in js/bootstrap.js:
> > 
> > - AppWindowFactory.start() is called by registerGlobalEntries() inside
> > window.addEventListener('load', function startup() {}), but only once
> > applications.ready is true.  This is dependent upon
> > navigator.mozApps.mgmt.getAll() having returned.
> > 
> > - shell.js' notifyContentStart method (triggered by mozbrowserloadstart)
> > calls "SystemAppProxy.setIsReady();" (which triggers SystemAppProxy to
> > deliver all _pendingEvents) inside a content.addEventListener('load',
> > function() {}) itself.
> > 
> > Which is two fundamental issues:
> > 1) registerGlobalEntries() is potentially called WAY after "load" fires
> > because of the data dependency on mozApps.mgmt.getAll().
> > 2) SystemAppProxy and bootstrap's "load" event listeners are potentially
> > racing each other.
> > 
> > There are a number of ways this could be addressed, but pretty much
> > AppWindowFactory.start() should be called synchronously during the execution
> > of one of the JS files.  This ensures that it is initialized prior to the
> > "load" event firing, thereby avoiding all potential races.
> > 
> > (Note: my code analysis was done against gaia master.)
> 
> Hmmm...interesting investigation. But I am having one question: I don't see
> the STR includes shutdown the phone in comment 17. What you described might
> happen but it shouldn't be the case in this bug. What do I miss?

Uh, I see the keyword "poweroff + start" in comment 0. Hmm...good investigation.

I wonder if start AppWindowFactory in the window.onload fix this issue.. if load event handler in shell.js and window.onload has different timing to be triggered; hence I think the proper fix would be telling shell.js we are NOW ready for the system message event by sending a mozContentEvent in AppWindowFactory.

Chens are you able to handle this?

Also, great thanks Andrew for your hard investigation!
BTW, what's the desired behavior of an alarm which was fired during the device is off...? Should we show alarm when device is booted even when the time of the alarm is expired...?
(In reply to Alive Kuo [:alive][NEEDINFO!] from comment #31)
> hence I think the proper fix would be telling shell.js we are NOW
> ready for the system message event by sending a mozContentEvent in
> AppWindowFactory.

Yeah, having shell.js not call SystemAppProxy.setIsReady(); until it receives a mozContentEvent would work too.

(In reply to Alive Kuo [:alive][NEEDINFO!] from comment #32)
> BTW, what's the desired behavior of an alarm which was fired during the
> device is off...? Should we show alarm when device is booted even when the
> time of the alarm is expired...?

Short answer: yes, right now at least.  I'm sure the UX could be smarter if it notices the alarm is waaay in the past, but for a simple scenario like the device rebooted itself to perform a FOTA just as a wake-up alarm was supposed to ring, the user definitely is going to prefer a 2-minute late wake-up alarm over no wake-up alarm.  :mcav is the clock owner though, and I see now I forgot to cc him when I moved the bug out of the clock component... I'll leave it to him to answer any deep questions about that :)
Flags: needinfo?(nhirata.bugzilla)
bug 1118272 ( 2015-04-06 ) might have changed the behavior from what it used to be to what it is now in this bug...

Question : does this bug happen in master as well?
Flags: needinfo?(ffos_st)
Flags: needinfo?(nhirata.bugzilla)
Flags: needinfo?(m)
Flags: needinfo?(jdorlus)
Mike, is there someone well-suited to work with AppWindowFactory / shell.js that should be pinged about this, per the analysis by :asuth above? I'm concerned that this may underpin some of the Clock alarm problems. I can investigate later but I'd only like to do so if there's no one more familiar with that code.
Flags: needinfo?(m) → needinfo?(mhenretty)
See Also: → 1163325
[Blocking Requested - why for this release]:

Yeah, this sounds pretty bad. Let's find someone to take a look for 2.5.
blocking-b2g: --- → 2.5?
Flags: needinfo?(mhenretty)
Whiteboard: sprd431227 → sprd431227, [systemsfe]
blocking-b2g: 2.5? → 2.5+
Tim, could you take a look to the bug please? thank you.
Flags: needinfo?(timdream)
This should not happen because system message is only sent when a system-message-listener-ready mozContentEvent is dispatched to shell.js.

The code currently lives in core.js after bug 1094759.

If we could assert the timing of that event while reproducing this bug, it would be very helpful.
Flags: needinfo?(timdream)
(In reply to Tim Guan-tin Chien [:timdream] (slow response; please ni? to queue) from comment #38)
> This should not happen because system message is only sent when a
> system-message-listener-ready mozContentEvent is dispatched to shell.js.
> 
> The code currently lives in core.js after bug 1094759.

I'm having trouble parsing your exact meaning.  Are you saying that you think the race cited in comment 28 is fixed by bug 1094759?  Or are you saying you think the race in comment 28 couldn't happen then or now?

If it's the former, we should probably put qawanted on this so someone can re-test.
Flags: needinfo?(timdream)
(In reply to Andrew Sutherland [:asuth] from comment #39)
> (In reply to Tim Guan-tin Chien [:timdream] (slow response; please ni? to
> queue) from comment #38)
> > This should not happen because system message is only sent when a
> > system-message-listener-ready mozContentEvent is dispatched to shell.js.
> > 
> > The code currently lives in core.js after bug 1094759.
> 
> I'm having trouble parsing your exact meaning.  Are you saying that you
> think the race cited in comment 28 is fixed by bug 1094759?  Or are you
> saying you think the race in comment 28 couldn't happen then or now?
> 

The race shouldn't happen then, or now, because <what I said> and that piece of code sits in core.js since bug 1094759.

> If it's the former, we should probably put qawanted on this so someone can
> re-test.
Flags: needinfo?(timdream)
(In reply to Tim Guan-tin Chien [:timdream] (slow response; please ni? to queue) from comment #38)
> This should not happen because system message is only sent when a
> system-message-listener-ready mozContentEvent is dispatched to shell.js.

It doesn't look like the system-message-listener-ready event is hooked up to that.  From my understanding of what I've written previously on the bug (which has faded somewhat, I admit :), I agree that if things worked that way, this bug would not exist.

When shell.js receives the system-message-listener-ready event at https://dxr.mozilla.org/mozilla-central/source/b2g/chrome/content/shell.js#763 it notifies observers using the same string but doesn't do much else.

The list of observers of this notification are:
https://dxr.mozilla.org/mozilla-central/source/b2g/chrome/content/shell.js#763

The only non-test user is the RIL AFAICT:
https://dxr.mozilla.org/mozilla-central/source/dom/system/gonk/RadioInterfaceLayer.js#331


The logic I referred to in comment 28 still seems to operate in the same way.  Specifically, the following things happen in shell.js:
- mozbrowserloadstart happens, synchronously calling notifyContentStart: https://dxr.mozilla.org/mozilla-central/source/b2g/chrome/content/shell.js#470
- notifyContentStart adds a "load" listener: https://dxr.mozilla.org/mozilla-central/source/b2g/chrome/content/shell.js#675
- When the "load" event fires, SystemAppProxy.setIsReady() is invoked https://dxr.mozilla.org/mozilla-central/source/b2g/chrome/content/shell.js#689
- SystemAppProxy.setIsReady dispatches all the pending events: https://dxr.mozilla.org/mozilla-central/source/b2g/components/SystemAppProxy.jsm#45


If you could clarify where my understanding of system-message-listener-ready's processing by shell.js is incorrect in more detail, I'd appreciate it.  I do understand you're very busy, so if you're not the right person to address this, if you could needinfo someone else who could, that would be very helpful.

(Note that I have not re-examined the dynamic loading behaviour of the system app itself since it seems like you're saying that does not matter because of the system-message-listener-ready event.)
Flags: needinfo?(timdream)
(In reply to Andrew Sutherland [:asuth] from comment #41)
> The list of observers of this notification are:
> https://dxr.mozilla.org/mozilla-central/source/b2g/chrome/content/shell.
> js#763

Copy and paste fail.  Correct link:
https://dxr.mozilla.org/mozilla-central/search?q=system-message-listener-ready&case=true&redirect=true
(In reply to Andrew Sutherland [:asuth] from comment #41)
> (In reply to Tim Guan-tin Chien [:timdream] (slow response; please ni? to
> queue) from comment #38)
> > This should not happen because system message is only sent when a
> > system-message-listener-ready mozContentEvent is dispatched to shell.js.
> 
> It doesn't look like the system-message-listener-ready event is hooked up to
> that.  From my understanding of what I've written previously on the bug
> (which has faded somewhat, I admit :), I agree that if things worked that
> way, this bug would not exist.
> 
> When shell.js receives the system-message-listener-ready event at
> https://dxr.mozilla.org/mozilla-central/source/b2g/chrome/content/shell.
> js#763 it notifies observers using the same string but doesn't do much else.
> 
> The list of observers of this notification are:
> https://dxr.mozilla.org/mozilla-central/source/b2g/chrome/content/shell.
> js#763
> 
> The only non-test user is the RIL AFAICT:
> https://dxr.mozilla.org/mozilla-central/source/dom/system/gonk/
> RadioInterfaceLayer.js#331
> 
> 
> The logic I referred to in comment 28 still seems to operate in the same
> way.  Specifically, the following things happen in shell.js:
> - mozbrowserloadstart happens, synchronously calling notifyContentStart:
> https://dxr.mozilla.org/mozilla-central/source/b2g/chrome/content/shell.
> js#470
> - notifyContentStart adds a "load" listener:
> https://dxr.mozilla.org/mozilla-central/source/b2g/chrome/content/shell.
> js#675
> - When the "load" event fires, SystemAppProxy.setIsReady() is invoked
> https://dxr.mozilla.org/mozilla-central/source/b2g/chrome/content/shell.
> js#689
> - SystemAppProxy.setIsReady dispatches all the pending events:
> https://dxr.mozilla.org/mozilla-central/source/b2g/components/SystemAppProxy.
> jsm#45
> 

If this is what you saw, then this is how Gecko work nowadays. What I said was how it is supposed to work at the time. Thank you for pointing out that.

> If you could clarify where my understanding of
> system-message-listener-ready's processing by shell.js is incorrect in more
> detail, I'd appreciate it.  I do understand you're very busy, so if you're
> not the right person to address this, if you could needinfo someone else who
> could, that would be very helpful.
> 

No one if owning that part of B2G, so there is no one *right person*, and that makes everyone the right person...

unless you want to point the bug to module owners of B2G Runtime.
Flags: needinfo?(timdream)
Duplicate of this bug: 1130863
Summary: [FFOS7715 v2.1s] System app may lose system messages at startup due to initialization races; may result in overdue alarm clock alarms being lost until the clock app is manually started → [System] System app may lose system messages at startup due to initialization races; may result in overdue alarm clock alarms being lost until the clock app is manually started
Still been testing this bug and it still persists. Clearing needinfo.
Flags: needinfo?(jdorlus)
Duplicate of this bug: 1190416
I got a patch according to finding at comment 41.
Assignee: nobody → timdream
Component: Gaia::System → Runtime
Whiteboard: sprd431227, [systemsfe] → sprd431227
Attached patch bug1160923.patch (obsolete) — Splinter Review
This patch removes the window.onload listener of System app in shell.js and move all the code to a notifyContentReady method and invoke it when we receive system-message-listener-ready mozContentEvent.

I am reluctant to patch the code here because shell.js itself is not testable, and we do not know if Gaia is actually waiting for a mozChromeEvent before sending out the message-listener-ready mozContentEvent (we will fall into a both-are-waiting-for-other-to-send-the-first-event case if so).

window.onload is never a safe timing unless we explicitly expects System app have all the mozChromeEvent event listeners scripts included in <script> or <script defer> in HTML. These scripts are the only scripts blocks the onload event.

I could not figure out the right regression range here. One could put the blame to bug 1094759 but it's definitely not truth according to regression range.

Anyway, I hope this is the last time I fix mozChromeEvent related races. We really don't have time to blindly tweaking the bootstrap steps.

https://treeherder.mozilla.org/#/jobs?repo=try&revision=7161d8cc3fdc

Feel free to cancel the review if try report back test bustage.
Attachment #8661082 - Flags: review?(fabrice)
Comment on attachment 8661082 [details] [diff] [review]
bug1160923.patch

Not surprisingly, a few test setup requires mozChromeEvent being sent despite there isn't a System app to send the mozContentEvent...
Attachment #8661082 - Flags: review?(fabrice)
Attached patch bug1160923.patch (obsolete) — Splinter Review
There are two arrays trying to queue the events (in shell.js and in SystemAppProxy), and webapp-launch event (send from shell.sendCustomEvent()) is not queued.

This patch removes the queue in shell.js and always use the queue in SystemAppProxy. We should be able to fix the Gu tests this way I think.

https://treeherder.mozilla.org/#/jobs?repo=try&revision=3d8b0e2dc383

For the M-1 failure I don't think it actually loads Gaia System app, so we need some other way to fix the setup.
Attached patch bug1160923.patch (obsolete) — Splinter Review
After talking to SC I realized we should allow test scripts to send observer message in place of the mozContentEvent from chrome tests.

I still have not yet figure out how Gu breaks though -- according to gaia/tests/python/gaia-unit-tests/gaia_unit_test/main.py and log it loads the same System app and use |--runapp| to start Test Agent; as long as the mozChromeEvent that triggers Test Agent to launch got caught we should be fine there.

Anyway, the previous and the below Try run can verify that. 

https://treeherder.mozilla.org/#/jobs?repo=try&revision=b0f61bf91998
Attachment #8661082 - Attachment is obsolete: true
Attachment #8661132 - Attachment is obsolete: true
... and yes, this patch will make system-message-listener-ready mozContentEvent the last mozContentEvent -- it can never be removed unless we never ever have to send a custom event.
Overdue alarm, making it a P3.
Priority: -- → P3
Attached patch bug1160923.patch (obsolete) — Splinter Review
This is the version I think it would work.

This patch queues all mozChromeEvents until system-message-listener-ready mozContentEvent (except two, see below).

Other pending arrays in shell.js are also removed so that only the event pending array in SystemAppProxy is used. The timing of SystemAppProxy.setIsReady() is changed accordingly.

Another queue is created to queue noPending events before window.onload. There are two events related: The infamous webapps-registry-ready (to be removed in bug 1161489) actually sits on the System app bootstrap path, so System won't boot (and send us system-message-listener-ready) unless webapps-registry-ready reaches it first.

This event and the other event (system-first-paint, which is timing sensitive but does not sit on the launch path) must be allow to send right after window.onload -- but it must be queued still so that we don't dispatch the event to |about:blank|.

The patch establish the contract between shell.js and System app: (A) it must have the event listeners for handling the two event set in <script> and/or <script defer> so that it could catch the event at window.onload, and (B) it must send as a system-message-listener-ready mozContentEvent for shell.js to stop queuing the other pending mozChromeEvents, when all the event listeners are in place.

I've manually verify this patch on a phone and B2G Desktop to verify it works. Gu and Gij also works according to previous try runs.

The caveats here are the places where B2G {Desktop|Emulator} are used but Gaia System app are not being run. I've modified the test runner so that it would send an observer message so we could stop queue the mozChromeEvents.

(It turned out we could test mozChromeEvents in a non-obvious way on the slow emulator, i.e. SpecialPowers.loadChromeScript() -- contrary to what I stated before. I still don't think we should assert mozChromeEvent with this approach though)

I've updated systemapp_helper.js tests too. I have not verify the test locally but I think this will work. Will submit another patch if it turn out not to be.

This Try run will build everything to ensure I have covered all the test setups.

https://treeherder.mozilla.org/#/jobs?repo=try&revision=df454f98f5ef
Attachment #8661143 - Attachment is obsolete: true
Attachment #8662172 - Flags: review?(poirot.alex)
Attachment #8662172 - Flags: review?(fabrice)
Attached patch bug1160923.patch (obsolete) — Splinter Review
Patch rebased against bug 1202902 and systemapp_helper.js test assertion fixed.

https://treeherder.mozilla.org/#/jobs?repo=try&revision=51d5cb9ffae4
Attachment #8662172 - Attachment is obsolete: true
Attachment #8662172 - Flags: review?(poirot.alex)
Attachment #8662172 - Flags: review?(fabrice)
Attachment #8662252 - Flags: review?(poirot.alex)
Attachment #8662252 - Flags: review?(fabrice)
Comment on attachment 8662252 [details] [diff] [review]
bug1160923.patch

Review of attachment 8662252 [details] [diff] [review]:
-----------------------------------------------------------------

I'm not really aware of the system-message-listener-ready story, but look good to me.
Also I'm not sure I'm an official reviewer for this code.

::: b2g/chrome/content/shell.js
@@ -84,5 @@
>  
> -function getContentWindow() {
> -  return shell.contentBrowser.contentWindow;
> -}
> -

Please do not remove that or update this callsite:
  http://mxr.mozilla.org/mozilla-central/source/b2g/chrome/content/devtools/hud.js#205

@@ +725,5 @@
> +#ifdef MOZ_WIDGET_GONK
> +    libcutils.property_set('sys.boot_completed', '1');
> +#endif
> +
> +    Services.obs.notifyObservers(null, "browser-ui-startup-complete", "");

Is it important/necessary to also delay OrientationChangeHandler/sys.boot_completed/browser-ui-startup-complete
until system-message-listener-ready?

::: b2g/components/SystemAppProxy.jsm
@@ +56,5 @@
>    setIsReady: function () {
>      if (this._isReady) {
>        Cu.reportError('SystemApp has already been declared as being ready.');
>      }
>      this._isReady = true;

We may add a check for this._isLoaded and/or call this.setIsLoaded()
to ensure there is no other race that would prevent sending pending loaded events.

@@ +99,5 @@
> +        this._pendingReadyEvents.push([type, details]);
> +      }
> +
> +      return null;
> +    }

The `!content` is redundant with the one in the next if.
I think you can just remove the second one.
Attachment #8662252 - Flags: review?(poirot.alex) → feedback+
Duplicate of this bug: 1206548
Attached patch bug1160923.patch (obsolete) — Splinter Review
Thanks for your review :ochameau, your comments are addressed. 

hud.js no longer need to keep a _frame and a frame() getter to redirect the reference from the frame to window since SystemAppProxy will do it with this patch, so I've remove it. I've also added a new comment in the early return block in shell.sendEvent() for this.

My original intention was to remove the window.onload listener; it makes no sense for us to creating this much timing on the top stack that could affect the timing of the lower (all the way to gonk property bit :-/) stack, but now the that I have failed in removing onload event listener I agree with you I should move the these back there (inside a shell.notifyContentWindowLoaded()) method.

I am not sure who the official reviewer either. I think you should unless there are objections, since you were the author of SystemAppProxy.jsm. I am flagging :vingtetun too since he is the co-owner of the Runtime component too.

I have inspected the previous try run and it's safe to assume this patch should not cause any known problems. Here is the new try run:

https://treeherder.mozilla.org/#/jobs?repo=try&revision=4316e6a4b6e1
Attachment #8662252 - Attachment is obsolete: true
Attachment #8662252 - Flags: review?(fabrice)
Attachment #8663538 - Flags: review?(poirot.alex)
Attachment #8663538 - Flags: review?(fabrice)
Attachment #8663538 - Flags: review?(21)
Comment on attachment 8663538 [details] [diff] [review]
bug1160923.patch

Review of attachment 8663538 [details] [diff] [review]:
-----------------------------------------------------------------

Thanks for addressing my comments. Looks good to me.
But I'm not involved enough into fxos to pretend r+ such patch,
I would prefer fabrice or vivien to rubbertstamp it.

::: b2g/chrome/content/shell.js
@@ +618,5 @@
>    sendEvent: function shell_sendEvent(target, type, details) {
> +    if (target === this.contentBrowser) {
> +      // We must ask SystemAppProxy to send the event in this case so
> +      // that event would be send to frame.contentWindow instead of just
> +      // the frame.

We could also just say that SystemAppProxy handles it better.
Or is the thing to use when it comes to communicate with the system app (frame).
With this frame.contentWindow story as well as the various message queues if not ready/loaded...
Attachment #8663538 - Flags: review?(poirot.alex) → feedback+
Attachment #8663538 - Flags: review?(vnicolas)
Comment on attachment 8663538 [details] [diff] [review]
bug1160923.patch

Review of attachment 8663538 [details] [diff] [review]:
-----------------------------------------------------------------

Looks good to me if try is green.

::: b2g/chrome/content/shell.js
@@ -84,5 @@
>  
> -function getContentWindow() {
> -  return shell.contentBrowser.contentWindow;
> -}
> -

I think I have added a long time ago because sometimes some Gecko from some JSM modules code was relying on that. Can't remember what.

This can probably be removed if try is green.

@@ +711,5 @@
>  #endif
>    },
> +
> +  // This gets called when window.onload fires on the System app content window,
> +  // which means things in <html> are drawn and statically referenced <script>s

AFAIK it does not really means things are drawn, but css has been loaded and parsed as well as the html content. Which may be a bit different but are likely OK in our case.
Attachment #8663538 - Flags: review?(21) → review+
Attached patch Patch to commitSplinter Review
Try run on the latest m-c in case there is something wrong with the new base commit...

https://treeherder.mozilla.org/#/jobs?repo=try&revision=e741167017e9
Attachment #8663538 - Attachment is obsolete: true
Attachment #8663538 - Flags: review?(vnicolas)
Attachment #8663538 - Flags: review?(fabrice)
Attachment #8666734 - Flags: review+
Comment on attachment 8666734 [details] [diff] [review]
Patch to commit

Review of attachment 8666734 [details] [diff] [review]:
-----------------------------------------------------------------

::: b2g/chrome/content/shell.js
@@ +635,1 @@
>    },

Can you file a follow up to move all the users of shell.sendCustomEvent to use SystemAppProxy instead and remove this function? Bonus points for renaming _sendCustomEvent() in SystemAppProxy to sendCustomEvent() !
Hi, 
This seems to have broken our shell.sendEvent that we are calling from hud.js.  Is there something we need to do to workaround?  Our custom event never reaches to it's destination so telemetry is never sent out.

I'm hitting this message:
https://hg.mozilla.org/mozilla-central/file/9169f652fe5e/b2g/components/SystemAppProxy.jsm#l128

Thanks,

-tamara
Flags: needinfo?(timdream)
That's just a (obsolete) message. Your event should be dispatched properly...

However it seems like we will do the cloneInto() twice (once in hud.js, once in SystemAppProxy). There's definitely something weird here.
Fabrice,

You're right.  This was a red herring.  Apologies for this.  removing the NI.

-tamara

(In reply to [:fabrice] Fabrice Desré from comment #65)
> That's just a (obsolete) message. Your event should be dispatched properly...
> 
> However it seems like we will do the cloneInto() twice (once in hud.js, once
> in SystemAppProxy). There's definitely something weird here.
Flags: needinfo?(timdream)
(In reply to [:fabrice] Fabrice Desré from comment #62)
> Can you file a follow up to move all the users of shell.sendCustomEvent to
> use SystemAppProxy instead and remove this function? Bonus points for
> renaming _sendCustomEvent() in SystemAppProxy to sendCustomEvent() !

bug 1210271.
fixed in https://hg.mozilla.org/mozilla-central/rev/6656685be04f
Status: NEW → RESOLVED
Closed: 4 years ago
Resolution: --- → FIXED
You need to log in before you can comment on or make changes to this bug.