Closed Bug 862322 Opened 11 years ago Closed 11 years ago

Allow multiple Gecko listeners for RTC timeouts on B2G.

Categories

(Firefox OS Graveyard :: General, defect)

x86_64
Linux
defect
Not set
normal

Tracking

(firefox21 wontfix, firefox22 wontfix, firefox23 fixed, b2g18+ fixed, b2g18-v1.0.0 wontfix, b2g18-v1.0.1 wontfix)

RESOLVED FIXED
Tracking Status
firefox21 --- wontfix
firefox22 --- wontfix
firefox23 --- fixed
b2g18 + fixed
b2g18-v1.0.0 --- wontfix
b2g18-v1.0.1 --- wontfix

People

(Reporter: nsm, Assigned: nsm)

References

Details

Attachments

(2 files, 4 obsolete files)

The mozAlarms API is driven by AlarmService uses the RTC clock to schedule alarms. The RTC clock is the only clock that can wakeup the phone from a sleep state (nsITimer does not). The HAL layer currently allows only one observer for RTC clock events (Hal::RegisterTheOneAlarmObserver). This observer is the alarm service (via singleton AlarmHalService). This rules out any other Gecko code from using the low level alarm API. In addition the AlarmService requires alarms added to have a pageURL and a manifestURL, and uses System Messages to deliver alarm notifications. Gecko code which is not an app cannot use either of the 3 things. But it should still be able to use alarms.

Use case:
This is required for push notifications on B2G devices.
For Push Notifications over WebSockets, a TCP connection is maintained open. The phone eventually enters low power/sleep mode.
Loss of network on the phone, or proxy servers dropping idle TCP connections will result in the websocket being disconnected without app or push server being notified soon enough (TCP timeout window can be 13-30 minutes, while push notifications should be delivered within seconds in the best case and within a few minutes at the worst).
To detect connection drops faster the Push service needs to wake up periodically
a) To send a ping as a keep alive for the WebSocket if the connection is still up, or
b) To detect that the socket is disconnected and re-establish the connection.
It will need to use the RTC clock to schedule these regular wakeups.
Modified title to separate 'alarm API' which is a consumer of the RTC from what Gecko code needs.

Would it be better to convert nsITimer to use RTC, or should nsITimer and this interface be kept separate?

I have a WIP patch that modifies AlarmService.jsm to have refactored add/remove methods that Gecko JS code can use by directly accessing AlarmService. This is not the appropriate fix since it serializes every alarm to IndexedDB and prevents use from C++ code. Depending on Push deadlines this patch may have to be landed.

A good fix would pull down alarm queue ordering into C++. It seems AlarmHalService is a good choice for this. AlarmService.jsm would then implement only persistence logic as required by Alarms API. Gecko code can just set new alarms on startup, or use custom persistence in the rare case that it is required. The only reason the Alarms API requires persistence is that apps don't run at startup but should still be woken up when an alarm goes off, which the AlarmService does by setting alarms from IDB at startup.
No longer blocks: alarm-api
Summary: Alarm API should be accessible to Gecko code → Allow multiple Gecko listeners for RTC timeouts on B2G.
Whoops, mid-air collision lost flags.

@lsblakk: Sorry about that, can you mark tracking-b2g18+?
@genelian: I'm restoring your blocking, but I'm not sure why this bug 'blocks' the alarm API. The alarm API already works and is landed. Depends might be better?
Attached patch Refactor add and remove methods (obsolete) — Splinter Review
add and remove can be called by importing AlarmService.jsm and calling AlarmService.add/remove()

Because of this requirement, only code in the parent Gecko process can use this, as AlarmService will throw an exception if init() is called from a sandboxed process.
Attachment #739598 - Flags: review?(gene.lian)
> Because of this requirement, only code in the parent Gecko process can use this, as AlarmService 
> will throw an exception if init() is called from a sandboxed process.

I don't see this documented in the patch, but maybe I'm missing something.  I expected to see it on the add/remove methods, or maybe at the top of the file.  I also don't see an explicit assertion in init(), although I believe you that it throws an exception.

> but I'm not sure why this bug 'blocks' the alarm API. The alarm API already works and is landed. 
> Depends might be better?

I know it's weird, but this is the usual convention.  In a sense, this bug "blocks" completion of the alarm API.  Or, the alarm API depends on this bug in order to be complete.

Like I said in bug 863732, I think using the observer service here makes for a bad API.  It means that every consumer who registers an alarm gets woken up every time any alarm goes off, even if it's not their own alarm.

Can you pass in a callback function instead?
Comment on attachment 739598 [details] [diff] [review]
Refactor add and remove methods

Review of attachment 739598 [details] [diff] [review]:
-----------------------------------------------------------------

Hi Nikhil, thanks for the nice patch. I agree on the comment #4 addressed by Justin:

1. I didn't see the sanity checks to prevent it from calling .add()/.remove() by child processes.
2. Registering callback instead of using observers sounds definitely better. Binding the callback to the alarm stayed in the queue should be working. Note that you don't need to save the callback into DB, just in the queue instead (well, it's not possible to persist a callback anyway). We can do something like:

   add: function(aNewAlarm, aAddSuccessCb, aAddErrorCb, aAlarmFiredCb) {...}

3. A question: so currently you don't need to call the .add()/.remove() from C++ codes. Right?

I'd like to take another review after all the comments are handled. Thanks!

::: dom/alarm/AlarmService.jsm
@@ +114,5 @@
>        case "AlarmsManager:Add":
>          // prepare a record for the new alarm to be added
>          let newAlarm = {
>            date: json.date, 
>            ignoreTimezone: json.ignoreTimezone, 

I know this isn't your fault but since you're here could you please remove the trailing spaces (TS) in the above 2 lines? Thanks!

@@ +120,5 @@
>            pageURL: json.pageURL,
>            manifestURL: json.manifestURL
>          };
>  
> +        this.add(newAlarm, function(newId) {

NIT: let's add a name for the anonymous function like:

function addSuccessCb(...) {...}

Well, don't ask me why. My previous reviewer said it's a convention, though.

@@ +121,5 @@
>            manifestURL: json.manifestURL
>          };
>  
> +        this.add(newAlarm, function(newId) {
> +          this._sendAsyncMessage(mm, "Add", true, json.requestId, aNewId);

s/newId/aNewId

@@ +122,5 @@
>          };
>  
> +        this.add(newAlarm, function(newId) {
> +          this._sendAsyncMessage(mm, "Add", true, json.requestId, aNewId);
> +        }, function(error) {

The same: function addErrorCb(...) {...}

@@ +123,5 @@
>  
> +        this.add(newAlarm, function(newId) {
> +          this._sendAsyncMessage(mm, "Add", true, json.requestId, aNewId);
> +        }, function(error) {
> +          this._sendAsyncMessage(mm, "Add", false, json.requestId, error);

Also, s/error/aErrorMsg.

@@ +142,5 @@
>      debug("_sendAsyncMessage()");
>  
>      if (!aMessageManager) {
>        debug("Invalid message manager: null");
> +      //throw Components.results.NS_ERROR_FAILURE;

Why not keeping this line? For the Gecko call, it shouldn't have chance to come here. Right? Please correct me know if I'm wrong.

@@ +150,5 @@
>      let json = null;
>      switch (aMessageName)
>      {
>        case "Add":
>          json = aSuccess ? 

Ditto: remove TS. Well, that would be a bonus if you could please remove all of TS within this file, since you're going to modify most of the logic here.

@@ +151,5 @@
>      switch (aMessageName)
>      {
>        case "Add":
>          json = aSuccess ? 
>            { requestId: aRequestId, id: aData } : 

Ditto: remove TS.

@@ +202,3 @@
>                    "date":            aAlarm.date,
>                    "respectTimezone": aAlarm.ignoreTimezone ?
>                                         "ignoreTimezone" : "honorTimezone", 

Ditto: remove TS.

@@ +206,5 @@
>  
>      messenger.sendMessage("alarm", alarm, pageURI, manifestURI);
>    },
>  
> +  _notifyAlarmObservers: function(alarm) {

s/alarm/aAlarm (also for the following codes) to make the coding style consistent within this file.

@@ +209,5 @@
>  
> +  _notifyAlarmObservers: function(alarm) {
> +    if (!alarm.manifestURL) {
> +        debug(JSON.stringify(alarm));
> +        Services.obs.notifyObservers(null, "alarm-fired", JSON.stringify(alarm));

We only need 2 spaces for indention.

@@ +211,5 @@
> +    if (!alarm.manifestURL) {
> +        debug(JSON.stringify(alarm));
> +        Services.obs.notifyObservers(null, "alarm-fired", JSON.stringify(alarm));
> +    } else {
> +        this._fireSystemMessage(alarm);

Ditto: correct indention.

@@ +230,5 @@
>      while (alarmQueue.length > 0) {
>        let nextAlarm = alarmQueue.shift();
>        let nextAlarmTime = this._getAlarmTime(nextAlarm);
>  
>        // If the next alarm has been expired, directly 

Ditto: remove TS.

@@ +274,1 @@
>              this._removeAlarmFromDb(aAlarm.id, null);

Why don't you move the this._notifyAlarmObservers(...) after this._removeAlarmFromDb(...) just like what you've done for other parts?

@@ +332,5 @@
> +   * }
> +   *
> +   * onSuccess should be a function that receives an alarm ID (integer).
> +   * onError should be a function that receives a string error message.
> +   */

Good to have function descriptions. :) However, I'd prefer using a more formal format and rewriting some of the descriptions as below if you don't have strong opinions:

/**
 *
 * Add a new alarm. This will set the RTC to fire at the selected date and
 * notify the caller through System Message or Gecko internal callbacks [2].
 *
 * @param object aNewAlarm
 *        Should contain the following literal properties:
 *          - |date| date: when the alarm should timeout.
 *          - |ignoreTimezone| boolean: see [1] for the details.
 *          - |data| object [optional]: data that can be stored in DB.
 * @param function aSuccessCb
 *        Callback function to receive an alarm ID (number).
 * @param function aErrorCb
 *        Callback function to receive an error message (string).
 * @returns void
 * 
 * Notes:
 * [1] https://wiki.mozilla.org/WebAPI/AlarmAPI#Proposed_API
 * [2] (since using observers isn't a good idea, we'd have new description here)
 */

@@ +333,5 @@
> +   *
> +   * onSuccess should be a function that receives an alarm ID (integer).
> +   * onError should be a function that receives a string error message.
> +   */
> +  add: function(newAlarm, onSuccess, onError) {

s/newAlarm/aNewAlarm
s/onSuccess/aSuccessCb
s/onError/aErrorCb

for here and below, which has a more consistent coding style within this file.

@@ +336,5 @@
> +   */
> +  add: function(newAlarm, onSuccess, onError) {
> +    debug("add(" + newAlarm.date + ")");
> +    if (!newAlarm)
> +      return;

Please add parentheses even for one-line code. Also, make sure we always call the proper callback before return.

if (!aNewAlarm) {
  if (aErrorCb) {
    aErrorCb("...");
  }
  return;
}

@@ +341,5 @@
> +
> +    newAlarm['timezoneOffset'] = this._currentTimezoneOffset;
> +
> +    var onSuccess = onSuccess || function() {};
> +    var onError = onError || function() {};

I assume |aSuccessCb| and |aErrorCb| can be null from the internal Gecko calls. I'd prefer removing the above 2 lines because I don't really see many codes written in this way. Please see below.

Btw, please use |let| instead of |var| for platform codes.

@@ +346,5 @@
> +    let newAlarmTime = this._getAlarmTime(newAlarm);
> +    if (newAlarmTime <= Date.now()) {
> +      debug("Adding a alarm that has past time.");
> +      this._debugCurrentAlarm();
> +      onError("InvalidStateError");

if (aErrorCb) {
  aErrorCb("InvalidStateError");
}

which looks more clear to me and you don't really need to run a dummy function in this way when it's not specified.

@@ +357,5 @@
> +        debug("Callback after adding alarm in database.");
> +
> +        newAlarm['id'] = aNewId;
> +
> +        // if there is no alarm being set in system, set the new alarm

I know this is not your fault. Since you're here, could you please rewrite the comment to be:

// "I"f... alarm"."

All the comments should be in the format of:

//{one-space}{upper-case-letter}...{period}

@@ +361,5 @@
> +        // if there is no alarm being set in system, set the new alarm
> +        if (this._currentAlarm == null) {
> +          this._currentAlarm = newAlarm;
> +          this._debugCurrentAlarm();
> +          onSuccess(aNewId);

if (aSuccessCb) {
  aSuccessCb(aNewId);
}

@@ +366,5 @@
> +          return;
> +        }
> +
> +        // if the new alarm is earlier than the current alarm
> +        // swap them and push the previous alarm back to queue

Ditto: correct comment format.

@@ +373,5 @@
> +        if (newAlarmTime < currentAlarmTime) {
> +          alarmQueue.unshift(this._currentAlarm);
> +          this._currentAlarm = newAlarm;
> +          this._debugCurrentAlarm();
> +          onSuccess(aNewId);

if (aSuccessCb) {
  aSuccessCb(aNewId);
}

@@ +377,5 @@
> +          onSuccess(aNewId);
> +          return;
> +        }
> +
> +        //push the new alarm in the queue

Ditto: correct comment format:

// "P"ush... queue"."

@@ +381,5 @@
> +        //push the new alarm in the queue
> +        alarmQueue.push(newAlarm);
> +        alarmQueue.sort(this._sortAlarmByTimeStamps.bind(this));
> +        this._debugCurrentAlarm();
> +        onSuccess(aNewId);

if (aSuccessCb) {
  aSuccessCb(aNewId);
}

@@ +384,5 @@
> +        this._debugCurrentAlarm();
> +        onSuccess(aNewId);
> +      }.bind(this),
> +      function addErrorCb(aErrorMsg) {
> +        onError(aErrorMsg);

if (aErrorCb) {
  aErrorCb(aErrorMsg);
}

@@ +390,5 @@
> +    );
> +  },
> +
> +  /*
> +   * Remove the alarm associated with aAlarmId.

* Remove the alarm associated with an ID.

@@ +392,5 @@
> +
> +  /*
> +   * Remove the alarm associated with aAlarmId.
> +   *
> +   * aManifestURL is optional.

* @param number aAlarmId
         The ID of alarm to be removed.
* @param string aManifestURL [optional]
         The manifest URL of the app that alarm belongs to.
* @returns void

@@ +402,5 @@
> +      aManifestURL,
> +      function removeSuccessCb() {
> +        debug("Callback after removing alarm from database.");
> +
> +        // if there is no alarm being set

Ditto: correct comment format:

// If there is no alarm being set, nothing to do.

@@ +409,5 @@
> +          return;
> +        }
> +
> +        // check if the alarm to be removed is in the queue
> +        // by ID and whether it belongs to the requesting app

Ditto: correct comment format:

// "C"...
// ..."."

@@ +411,5 @@
> +
> +        // check if the alarm to be removed is in the queue
> +        // by ID and whether it belongs to the requesting app
> +        let alarmQueue = this._alarmQueue;
> +        if (this._currentAlarm.id != aAlarmId || 

Ditto: remove TS.

@@ +415,5 @@
> +        if (this._currentAlarm.id != aAlarmId || 
> +            this._currentAlarm.manifestURL != aManifestURL) {
> +
> +          for (let i = 0; i < alarmQueue.length; i++) {
> +            if (alarmQueue[i].id == aAlarmId && 

Ditto: remove TS.

@@ +427,5 @@
> +          return;
> +        }
> +
> +        // the alarm to be removed is the current alarm
> +        // reset the next alarm from queue if any

Ditto: correct comment format.

@@ +434,5 @@
> +          this._debugCurrentAlarm();
> +          return;
> +        }
> +
> +        // no alarm waiting to be set in the queue

Ditto: correct comment format.
Attachment #739598 - Flags: review?(gene.lian) → review-
(In reply to Justin Lebar [:jlebar] from comment #4)
> > Because of this requirement, only code in the parent Gecko process can use this, as AlarmService 
> > will throw an exception if init() is called from a sandboxed process.
> 
> I don't see this documented in the patch, but maybe I'm missing something. 
> I expected to see it on the add/remove methods, or maybe at the top of the
> file.  I also don't see an explicit assertion in init(), although I believe
> you that it throws an exception.
> 

http://mxr.mozilla.org/mozilla-central/source/hal/sandbox/SandboxHal.cpp#321
Since AlarmHalService tries to set up alarms in init(), it goes down :)

> > but I'm not sure why this bug 'blocks' the alarm API. The alarm API already works and is landed. 
> > Depends might be better?
> 
> I know it's weird, but this is the usual convention.  In a sense, this bug
> "blocks" completion of the alarm API.  Or, the alarm API depends on this bug
> in order to be complete.
Hmm... ok, looks like I've been marking things wrong sometimes.
> 
> Like I said in bug 863732, I think using the observer service here makes for
> a bad API.  It means that every consumer who registers an alarm gets woken
> up every time any alarm goes off, even if it's not their own alarm.

Yes I sort of knew from the beginning that this was a bad idea, and something discussed with gene, but it was the Friday of the work week and I wanted to have a patch that worked :). I'll fix this to use callbacks.
Forgot to say that the SandboxHal stuff is not a deal breaker since all Gecko code is in the parent process.
Attached patch Refactor add and remove methods (obsolete) — Splinter Review
Switch to callbacks, cleanup code, fix nits from review.
Attachment #739598 - Attachment is obsolete: true
Attachment #741142 - Flags: review?(gene.lian)
Comment on attachment 741142 [details] [diff] [review]
Refactor add and remove methods

Review of attachment 741142 [details] [diff] [review]:
-----------------------------------------------------------------

Hi Nikhil, thanks for all the code clean-up (which I didn't do very well when I was still a newbie). Well done! :D I'd like to take a final review on your new patch and confirm a CPU wake lock issue with Justin later. It's close to be done!

::: dom/alarm/AlarmService.jsm
@@ +51,5 @@
>  
>      this._currentTimezoneOffset = (new Date()).getTimezoneOffset();
>  
> +    let alarmHalService =
> +        this._alarmHalService = Cc["@mozilla.org/alarmHalService;1"]

NIT: 2-space indention is enough.

@@ +139,5 @@
>  
> +        this.add(newAlarm, null, function addSuccessCb(aNewId) {
> +          this._sendAsyncMessage(mm, "Add", true, json.requestId, aNewId);
> +        }, function addErrorCb(aError) {
> +          this._sendAsyncMessage(mm, "Add", false, json.requestId, aError);

s/aError/aErrorMsg to be consistent with other codes.

@@ +224,5 @@
>  
>      messenger.sendMessage("alarm", alarm, pageURI, manifestURI);
>    },
>  
> +  _notifyAlarmObservers: function(aAlarm) {

s/_notifyAlarmObservers/_notifyAlarmObserver since we only need to notify *single* observer now (either through system message or callback).

Also, please add the same name for the anonymous function:

  _notifyAlarmObserver: function _notifyAlarmObserver(aAlarm)

@@ +225,5 @@
>      messenger.sendMessage("alarm", alarm, pageURI, manifestURI);
>    },
>  
> +  _notifyAlarmObservers: function(aAlarm) {
> +    debug("_notifyAlarmObservers()");

s/_notifyAlarmObservers/_notifyAlarmObserver

@@ +229,5 @@
> +    debug("_notifyAlarmObservers()");
> +    if (aAlarm.manifestURL) {
> +      this._fireSystemMessage(aAlarm);
> +    } else if (typeof aAlarm.onAlarmFiredCb === "function") {
> +      aAlarm.onAlarmFiredCb(aAlarm);

The |aAlarm| now has lots of additional properties that are only maintained and used within the AlarmService. I think we don't really need to expose all of them. Please refer to ._fireSystemMessage(). I think we can refactor the common part into a utility function (please rename the function if you think there's a better one):

  _makeFiredAlarm: function _makeFiredAlarm(aAlarm) {
    let alarm = { "id":              aAlarm.id,
                  "date":            aAlarm.date,
                  "respectTimezone": aAlarm.ignoreTimezone ?
                                     "ignoreTimezone" : "honorTimezone", 
                  "data":            aAlarm.data };
    return alarm;
  }

and call the following line in this function:

  aAlarm.onAlarmFiredCb(this._makeFiredAlarm(aAlarm));

and also call the following line in the _fireSystemMessage():

  messenger.sendMessage("alarm", this._makeFiredAlarm(aAlarm),
                        pageURI, manifestURI);

Also, I'd prefer s/onAlarmFiredCb/alarmFiredCb. Please see the comment below.

@@ +248,5 @@
>      while (alarmQueue.length > 0) {
>        let nextAlarm = alarmQueue.shift();
>        let nextAlarmTime = this._getAlarmTime(nextAlarm);
>  
> +      // If the next alarm has been expired, directly fire system message for

s/fire system message/notify the observer

@@ +283,5 @@
>          alarmQueue.length = 0;
>          this._currentAlarm = null;
>  
> +        // Only restore the alarm that's not yet expired; otherwise, remove it
> +        // from database and notify observers.

s/observers/the observer

@@ +357,5 @@
> +   * @returns void
> +   *
> +   * Notes:
> +   * [1] https://wiki.mozilla.org/WebAPI/AlarmAPI#Proposed_API
> +   * [2] (since using observers isn't a good idea, we'd have new description here)

We don't need the above line now. ;)

@@ +364,5 @@
> +  add: function(aNewAlarm, aAlarmFiredCb, aSuccessCb, aErrorCb) {
> +    debug("add(" + aNewAlarm.date + ")");
> +
> +    let aSuccessCb = aSuccessCb || function() {};
> +    let aErrorCb = aErrorCb || function() {};

Alright, after some thought I'm convinced to use this kind of trick. ;) Just wondering why not:

aSuccessCb = aSuccessCb || function() {};
aErrorCb = aErrorCb || function() {};

Any special reasons for declaring new variables (using |let|)?

@@ +389,5 @@
> +        aNewAlarm['id'] = aNewId;
> +
> +        // Now that the alarm has been added to the database, we can tack on
> +        // the non-serializable callback to the in-memory object.
> +        aNewAlarm['onAlarmFiredCb'] = aAlarmFiredCb;

In general, we don't use |onAlarmFiredCb| because both of the following terms denote the same meaning for a callback function:

onAlarmFired == alarmFiredCb

Please choose either one you like (I'd prefer the latter). Also, please call the new one in ._notifyAlarmObservers().

@@ +440,5 @@
> +      aManifestURL,
> +      function removeSuccessCb() {
> +        debug("Callback after removing alarm from database.");
> +
> +        // If there are no alarms set.

Add more words to make a complete sentence.

// If there are no alarms set, nothing to do.
Attachment #741142 - Flags: review?(gene.lian)
Hi Justin,

Since we're going to make the AlarmService more generic for internal Gecko use, I'm worrying about the CPU wake lock issue again. I think we should add some codes to lock the CPU to make sure the alarm can wake up the system from sleeping before the callers is aware of the alarm firing:

  gPowerManagerService.newWakeLock("cpu");

Do you think it's AlarmService's or callback's responsibility? Note that the earlier we lock the CPU (closer to the RTC wake lock), the lower possibility we'd fall into the sleeping gaps. May I have your suggestions? Thanks!
Flags: needinfo?(justin.lebar+bug)
I think that would probably be fine, but I still don't feel like I have a good idea of what the CPU wake lock actually does.  I mean, we're running code in the alarm service, so we're obviously not asleep at that point in time.

I think Kan-Ru explained this to cjones and me once.
Flags: needinfo?(justin.lebar+bug) → needinfo?(kchen)
(In reply to Justin Lebar [:jlebar] from comment #11)
> I think that would probably be fine, but I still don't feel like I have a
> good idea of what the CPU wake lock actually does.  I mean, we're running
> code in the alarm service, so we're obviously not asleep at that point in
> time.
> 
> I think Kan-Ru explained this to cjones and me once.

Yes, use a CPU lock until you know that a event has been consumed is a good idea.

One possible problem here is that the CPU lock is handled entirely in gaia currently. If we want to make absolutely sure that CPU aren't suspend during event handling, we might want to set cpuSuspendAllowed = false in gecko.
Flags: needinfo?(kchen)
> One possible problem here is that the CPU lock is handled entirely in gaia currently.

Does moving the lock into Gecko actually solve the problem, if we can be put to sleep at any point in time?
Btw, just one clarification. Whether need to add a CPU wake lock for this issue is still a hypothesis. Since the use case is under the scope of parent process, maybe we don't really need to add an extra CPU wake lock at all.

Nikhil, have you ever tested your patch under the following conditions to see if the alarm callback can run exactly on time?

  1. Turn off the screen to let the mobile sleep.
  2. Unplug the USB cable to make sure the mobile is not in battery charged.

The above conditions can simulate a scenario that the mobile is fully sleeping.
Flags: needinfo?(nsm.nikhil)
Hi Justin,

Nikhil said the callback would get a few seconds (or even a few minutes) delayed before the callback get running. It's highly possible due to the wake lock issue again. That is, RTC wake up the system on time but system falls into sleep in a very short time later.

I'd prefer adding a 3 or 5 second CPU wake lock in AlarmService whenever an alarm is fired (it will automatically be released when timeout), so that the callback will have enough time to be triggered. Bug 830425 used to do the same trick for making the incoming call rings on time.

I'm also wondering how do you handle this kind of wake lock issue at Bug 836654, because you used to remove the CPU wake lock mechanism from AlarmService. I hope to add it back but this time I don't want to rely on the "SystemMessageManager:HandleMessageDone" to release the wake lock, simply use a timeout instead. I think it's fine because the alarms don't fire very often (just like ringing the incoming call), which means a 3-second CPU wake lock should be acceptable.

What do you think?
Flags: needinfo?(justin.lebar+bug)
(In reply to Gene Lian [:gene] from comment #15)
> Hi Justin,
> 
> Nikhil said the callback would get a few seconds (or even a few minutes)
> delayed before the callback get running. It's highly possible due to the
> wake lock issue again. That is, RTC wake up the system on time but system
> falls into sleep in a very short time later.
> 
> I'd prefer adding a 3 or 5 second CPU wake lock in AlarmService whenever an
> alarm is fired (it will automatically be released when timeout), so that the
> callback will have enough time to be triggered. Bug 830425 used to do the
> same trick for making the incoming call rings on time.
I'd prefer not using a timeout as the escape mechanism. My suggestion is:
1) When the RTC alarm is fired, AlarmService acquires a wakelock
2) AlarmService notifies all observers whose alarms should fire. These are either:
  a) Gecko code observers using callbacks
  b) Apps using system messages
3) AlarmService releases wake lock.
See justification below

> 
> I'm also wondering how do you handle this kind of wake lock issue at Bug
> 836654, because you used to remove the CPU wake lock mechanism from
> AlarmService. I hope to add it back but this time I don't want to rely on
> the "SystemMessageManager:HandleMessageDone" to release the wake lock,
> simply use a timeout instead. I think it's fine because the alarms don't
> fire very often (just like ringing the incoming call), which means a
> 3-second CPU wake lock should be acceptable.

The problem with the approach i've detailed above is that code that runs async in the case of 2a may get delayed because the device enters sleep. The solution is for Gecko code in the callback to acquire its own wakelock if it performs any async operation. Since all this code is trusted, we can try and ensure that all code that may use this releases those locks properly in all its async branches.

For 2b, the alarm API documentation can clearly state that apps doing XHR or similar should acquire a wakelock. Since the system message handler in the child processes runs separately from SystemMessageInternal in the parent process, step 3 of above is not affected by a bad app which writes an infinite loop in the system message handler callback.

I think elevating an app to FOREGROUND or higher on pending system message is a great idea, under the assumption that such a priority automatically prevents the CPU from sleeping by acquiring a lock somewhere (I've not read the code). This takes care of the interval of time from step 3 ending to the app process getting CPU.

Having a timeout is arbitrary because in the case of Push an app will *likely* do XHR with the app server in the system message handler. If this XHR is over a slow network, the phone goes to sleep before the reply arrives.

To summarize, we need a wakelock to prevent sleep at the following boundaries

1) From RTC firing to AlarmService's onAlarmFired being called (Me and Gene don't know if this codepath can run without CPU sleeping or not)
2) AlarmService callback invoked and Observers of alarms being called. This is fixed by steps 1 to 3 above.
3) Time between queuing up a system message and app processing it. This is handled by foreground priority (Right?)

For the other points where CPU can sleep, we offload responsibility to observers and apps.
Flags: needinfo?(nsm.nikhil)
Flags: needinfo?(justin.lebar+bug)
> I think elevating an app to FOREGROUND or higher on pending system message is a great idea, under 
> the assumption that such a priority automatically prevents the CPU from sleeping by acquiring a lock 
> somewhere (I've not read the code).

It doesn't.

We elevate apps to bg_perceivable on a pending system message with bug 860799.  Going higher than that is tricky, but anyway process priorities don't effect wake locks; only the other way around.

What I still do not understand is: If we're running with the screen off and without the CPU wake lock held, under what circumstances can our code stop running?  Can we only stop running at Gecko event boundaries, or can we stop running anywhere the kernel would like?

> 3) Time between queuing up a system message and app processing it. This is handled by foreground 
> priority (Right?)

No, but it is handled.  We simply hold the CPU lock until the app handles the message.  We do have a timeout in there (because an app might spin in an infinite loop and never handle the message), but it's a long timeout (30s?), and should leave plenty of time for you to acquire a wake lock after being notified of the sys msg.

> 1) From RTC firing to AlarmService's onAlarmFired being called (Me and Gene don't know if this 
> codepath can run without CPU sleeping or not)

Does RTC wake up the main thread, or some other thread?  If it's the main thread, you can't release a wake lock immediately after calling a function; you can only do that either immediately before  calling the function, or after the function finishes.

That's obvious, but it's an important distinction.  If you hold the wake lock up until you call onAlarmFired, you have to hold the wake lock for the entire duration of onAlarmFired.

> 2) AlarmService callback invoked and Observers of alarms being called. This is fixed by steps 1 to 
> 3 above.

So now, does onAlarmFired synchronously run its observers?  I'd guess it does.  Then perhaps we don't need a wake lock in there.

We should try to figure this out, because it seems like none of us actually understands where you have to hold a wake lock and where you don't.  If we can only be put to sleep at Gecko event boundaries, and if everything from RTC wake-up to firing alarm observers happens synchronously, perhaps we don't need to acquire any new wake locks here.
(In reply to Justin Lebar [:jlebar] from comment #17)
> What I still do not understand is: If we're running with the screen off and
> without the CPU wake lock held, under what circumstances can our code stop
> running?  Can we only stop running at Gecko event boundaries, or can we stop
> running anywhere the kernel would like?

We could stop running anywhere the kernel likes.
> We could stop running anywhere the kernel likes.

So then in order to do this correctly, we must acquire the CPU wake lock immediately after the realtime clock wakes us up.  Even that isn't totally correct (because the kernel might put us to sleep before we run that code), but it's the best we can do.

Is that right?
(In reply to Justin Lebar [:jlebar] from comment #19)
> > We could stop running anywhere the kernel likes.
> 
> So then in order to do this correctly, we must acquire the CPU wake lock
> immediately after the realtime clock wakes us up.  Even that isn't totally
> correct (because the kernel might put us to sleep before we run that code),
> but it's the best we can do.
> 
> Is that right?

Yes. Usually after rtc wakes us up we have some time to acquire the wake lock. If the timeout is too short then we might fail to acquire the lock in time.
(In reply to Justin Lebar [:jlebar] from comment #17)
> So now, does onAlarmFired synchronously run its observers?  I'd guess it
> does.  Then perhaps we don't need a wake lock in there.

We cannot expect the observers always run synchronously. The observers' callback could probably have asynchronous codes.

No matter it runs synchronously or asynchronously. Kernel would stop running at any where and any time (even if they synchronous codes). Therefore, the AlarmService must at least acquire a wake lock to ensure either 1) the system message handler or 2) the Gecko callback can be executed for sure before system falls asleep again. After that, it's system message handler or Gecko callback's responsibility to grab additional wake locks in its codes to prevent itself from sleeping.
Hi Justin,

Please see comment #21 for why we need to acquire a CPU wake lock before notifying observers. Note that I acquire the CPU wake lock at AlarmHalService.cpp instead of GonkHal.cpp for some reasons:

1. Based on my previous experience at Bug 796255, the code path from RTC wake up (GonkHal.cpp) to mAlarmFiredCb->OnAlarmFired() (AlarmHalService.cpp) is too short to have chances to fall into sleep.

2. The best place to acquire wake lock should be in GonkHal.cpp (immediately after the RTC wake up). However, in general we are not allowed to use XPCOM service in the HAL layer. If we really want to do that, we may utilize ctypes tricks to call .acquire_wake_lock() and .release_wake_lock() in libhardware. However, that would mess up the wake lock mechanism, because Kanru used to highly suggest every wake lock should be created and maintained through power manager service.


Hi Nikhil,

Could you please include this part to verify if your callbacks can be triggered on time? Thanks!
Attachment #742967 - Flags: feedback?(nsm.nikhil)
Attachment #742967 - Flags: feedback?(justin.lebar+bug)
(In reply to Justin Lebar [:jlebar] from comment #17)
> That's obvious, but it's an important distinction.  If you hold the wake
> lock up until you call onAlarmFired, you have to hold the wake lock for the
> entire duration of onAlarmFired.

I assume you're talking about AlarmService._onAlarmFired() and you're worrying about we would lock a very long CPU wake lock before it returns. Right? In that function, we're having 2 different ways to notify observers:

  1) Fire system message: obviously, the handler is asynchronously running in another process, so I believe ._fireSystemMessage() doesn't take long time to return.

  2) Run Gecko callback (on going in this bug): this part must be written by internal engineers, which means it can be controlled to avoid it. That is, the callback is aimed for catching the alarm instead of synchronously running everything they desire.

So I think it should be fine. Or maybe we can add an extra timeout to unlock the wake lock earlier for safety? which means we will release the wake lock when 1) timeout or 2) mAlarmFiredCb->OnAlarmFired() returns.
Gene,

The wakelock does ensure that all gecko observers are called precisely (sometimes there is a deviation of a second, but i think thats because of the microsecond to second conversion in Date()). I can acquire a wakelock in PushService if I need to.
Attachment #741142 - Attachment is obsolete: true
Attachment #743028 - Flags: review?(gene.lian)
(In reply to Kan-Ru Chen [:kanru] from comment #18)
> (In reply to Justin Lebar [:jlebar] from comment #17)
> > What I still do not understand is: If we're running with the screen off and
> > without the CPU wake lock held, under what circumstances can our code stop
> > running?  Can we only stop running at Gecko event boundaries, or can we stop
> > running anywhere the kernel would like?
> 
> We could stop running anywhere the kernel likes.

Does this include stopping even while a thread/process is running instructions and is CPU bound?
> 2. The best place to acquire wake lock should be in GonkHal.cpp (immediately after the RTC wake up). 
> However, in general we are not allowed to use XPCOM service in the HAL layer.

I wasn't aware of that rule, but in cjones's absence I'm the new owner of hal,
and I hereby give you permission to call into an XPCOM service if you feel it's
really necessary.

But I don't think using XPCOM here in Gonk would solve the problem. Acquiring
the CPU wake lock from the power manager service doesn't actually lock the CPU.
Instead, it sends a message to Gaia, and then Gaia flips the bit that disallows
the CPU from turning off.

So this code goes like:

 1) Woken up by alarm service on alarm thread
 2) Send message to main thread
 3) Main thread event runs, alarm service acquires CPU lock
 4) Power manager calls into Gaia to inform it of wake-lock changed
 5) Gaia flips the cpuMaySleep bit
 6) Flipping the cpuMaySleep bit calls into the power manager
 7) The power manager calls into hal, disallowing CPU sleep.

In particular, no matter what we do, if we want to have Gaia flip the
cpuMaySleep bit, we have to post an alarm-thread --> main-thread event.

I'm particularly worried that, if the system goes to sleep when there are many
events queued in the main thread's event loop, we'll spend a lot of time
between steps 2 and 3 above.  I believe that you didn't see this in your
testing, but from the discussion here it sounds like we should be cautious with
our assumptions about how long we stay awake after an RTC wakeup and how long
it takes to run observers.

I think we'd all like to prevent the CPU from sleeping at step (1), right?
That's by far the safest thing to do, as I understand the conversation here.  I
think we can totally do that.

The easiest thing to do would be to modify GonkHal so that we can call a new
method, say InternalLockCpu(), off the main thread.  We'd modify
SetCpuSleepAllowed and write InternalLockCpu as follows:

  void SetCpuSleepAllowed(bool aAllowed) {
    sCpuSleepAllowed = aAllowed;
    UpdateCpuSleepState();
  }

  static void InternalLockCpu() {
    MOZ_ASSERT(!sInternalCpuSleepAllowed);
    sInternalCpuSleepAllowed = true;
    UpdateCpuSleepState();
  }

  static void InternalUnlockCpu() { ... }

  static void UpdateCpuSleepState()
  {
    bool allowed = sCpuSleepAllowed || sInternalCpuSleepAllowed;
    WriteToFile(allowed ? ...);
  }

except you'd need to put locks around these methods, since they would need to
be threadsafe.

Also, it may be better to use a counter for sInternalCpuSleepAllowed, since as
is, you can only call InternalLockCpu and InternalUnlockCpu on one thread, and
we call it on the alarm thread, so you can never call it off that thread.

Do you think this would work?  It's not much harder than what we're doing now,
and it feels much safer to me, based on what I've heard here.

> I assume you're talking about AlarmService._onAlarmFired() and you're
> worrying about we would lock a very long CPU wake lock before it returns.
> Right?

No; I think the code in the patch is doing the right thing, for what it's
trying to do.  I was worried about you /unlocking/ the lock before notifying
the observers.  I agree what you have is fine.
Attachment #742967 - Flags: feedback?(justin.lebar+bug)
Hi Justin,

Thanks for your detailed solution. Really appreciate! However, I admit I didn't follow your suggested counter very well. This is a quick WIP patch. Could you please correct me if I'm wrong? Thanks!

Btw, I did some minor code clean-up: folding long lines.
Attachment #742967 - Attachment is obsolete: true
Attachment #742967 - Flags: feedback?(nsm.nikhil)
Attachment #743497 - Flags: feedback?(justin.lebar+bug)
Comment on attachment 743028 [details] [diff] [review]
Refactor add and remove methods

Good job! Thanks!
Attachment #743028 - Flags: review?(gene.lian) → review+
Nikhil's patch already got review+. I'm OK with Nikhil's suggestion to land that patch first and address the wake lock issue in a separate bug, which needs more time to be done.
We should use NS_DISPATCH_SYNC to call .NS_DispatchToMainThread() so that the AlarmFiredEvent.Run() can be executed under the CPU wake lock. This is important.

Please also see comment #28. We still need to clarify the purpose of counter thing.
Attachment #743497 - Attachment is obsolete: true
Attachment #743497 - Flags: feedback?(justin.lebar+bug)
Attachment #743509 - Flags: feedback?(justin.lebar+bug)
Recently, we had lots of critical issues that the alarm doesn't ring on time until we turn on the screen (bug 866366, bug 865570, bug 860884). This kind of symptom smells exactly like the CPU wake lock issue.
> I'm OK with Nikhil's suggestion to land that patch first and address the wake lock issue in a 
> separate bug, which needs more time to be done.

It's fine if you want to move the wake lock issue to a separate bug.

This patch looks like what I had in mind, although note that you need to lock around SetCpuSleepAllowed as well, because it's modifying a variable that might be read on other threads (and because we should only run UpdateCpuSleepState from one thread at a time.

Also, is it a problem if AlarmFiredEvent blocks the alarm thread for a long time?  Like, if we have two separate alarms set close in time, do we expect to fire two separate alarm notifications, or is it OK if they get merged into one notification?

It might be safer if you unlocked the lock from the AlarmFiredEvent's run method and didn't dispatch sync.
Attachment #743509 - Flags: feedback?(justin.lebar+bug) → feedback+
Hi Nikhil, please go ahead to land your patch so that we won't block bug 863732. I'll fire another bug with my patch to solve the wake lock issue. Tomorrow is Taiwanese holiday. Will probably have delayed response. Thanks!
There used to be an announcement in the b2g mailing list [1] that we should keep landing b2g-specific patches to https://hg.mozilla.org/projects/birch/ even after the Madrid work week [2]. Anyway, the inbound patch will finally be sync'ed to birch and central so no need to back it out.

[1] E-mail title: [b2g] Reminder - Land B2G Gecko Patches on Birch!
[2] https://etherpad.mozilla.org/B2G-workweek-merge-schedule
https://hg.mozilla.org/mozilla-central/rev/241e3e556766
Status: NEW → RESOLVED
Closed: 11 years ago
Resolution: --- → FIXED
Comment on attachment 743028 [details] [diff] [review]
Refactor add and remove methods

[Approval Request Comment]
Bug caused by (feature/regressing bug #): 749551
User impact if declined: 
Push notifications blocks on this.
Testing completed: 
Local only. No comprehensive unit tests for mozAlarms.
Risk to taking this patch (and alternatives if risky): 
Does not break existing code. Different code paths.
String or UUID changes made by this patch:
None
Attachment #743028 - Flags: approval-mozilla-b2g18?
Attachment #743028 - Flags: approval-mozilla-b2g18? → approval-mozilla-b2g18+
Blocks: 858005
You need to log in before you can comment on or make changes to this bug.