requestsync assumes all in-memory mozAlarms will never be purged (and these alarms get persisted anyways), but they are purged on timezone/clock changes

RESOLVED FIXED in Firefox 45

Status

()

RESOLVED FIXED
3 years ago
3 years ago

People

(Reporter: asuth, Assigned: baku)

Tracking

Trunk
mozilla45
Points:
---
Dependency tree / graph

Firefox Tracking Flags

(blocking-b2g:2.5+, firefox44 affected, firefox45 fixed)

Details

(Whiteboard: [platform])

Attachments

(1 attachment, 8 obsolete attachments)

(Reporter)

Description

3 years ago
In bug 1182358 requestsync was changed to use mozAlarms.  Unfortunately, mozAlarms does not deal with Chrome consumers well because they lack a manifestURL.

The alarms generated by requestsync look like this, per the log:
{"date":"2015-10-09T02:26:25.329Z","ignoreTimezone":false,"timezoneOffset":240,"id":15}

We will note that the index definition looks like this:
objStore.createIndex("manifestURL",    "manifestURL",    { unique: false });

And the getall request looks like this:
  let index = aStore.index("manifestURL");
  index.mozGetAll(aManifestURL)

And as we all know from reading the IndexedDB spec[1], per 4.5 "If value is Undefined, return failure." of 3.5.1 "Steps to extract a key from a value using a key path" and given 6.2 "If index key is an exception, or invalid, or failure, take no further actions for this index, and continue these substeps for the next index." of 3.4.1 "Object Store Storage Operation", there will be no reference to the record.  So these records become invisible.

AlarmService uses this getall functionality in its implementation of _restoreAlarmsFromDb.  It likes to call this method and forget about its existing in-memory alarms whenever the timezone changes or the system clock changes.  The system clock seems to like to change when the device transitions between cellular data and wifi, or at least when you take it out of your microwave Faraday cage.

I made the following change to AlarmDB.jsm:
===
-      let index = aStore.index("manifestURL");
-      index.mozGetAll(aManifestURL).onsuccess = function setTxnResult(aEvent) {
+      let req;
+
+      if (aManifestURL) {
+        let index = aStore.index("manifestURL");
+        debug("using index on:", aManifestURL);
+        req = index.mozGetAll(aManifestURL);
+      } else {
+        debug("using raw store");
+        req = aStore.mozGetAll();
+      }
+
+      req.onsuccess = function setTxnResult(aEvent) {
===

And the result was that at the process reboot from "./flash.sh gecko", all of my disappeared alarms came back and I did a totally sweet fist-pump.

It's likely this problem is not even more pronounced because requestsync seems to like to reinitialize its alarms at startup.

1: This is an attempt at humor.  It seems almost impossible that I'm actually reading the spec correctly given the choose-your-own-adventure jumping around that is required (but important for a spec).  But it seems reasonable to surmise that there is no entry in the index for the value.
(Reporter)

Comment 1

3 years ago
[Blocking Requested - why for this release]:
The email app, most important among all apps, uses the requestsync API.  Currently, if the device ever updates its clock, say due to a transition between cellular data and wi-fi, email's periodic sync mechanism will be broken until the device reboots.  Also, any lesser apps that use the requestsync API.
Blocks: 1212793
blocking-b2g: --- → 2.5?
(Reporter)

Comment 2

3 years ago
Whoops, I just realized that there's deeper problems than the IDB query.  Specifically, requestsync is using mozAlarms only for in-memory purposes.  It specifies an aAlarmFiredCb to AlarmService.add, which is the only way it will ever get any notifications, since there is no other mapping from the alarm's data to the requestsync service.  (I was assuming there was an observer service channel that the requestsync sniffed to know when its alarms fired, not that that's a great solution.)

It seems like if requestsync is going to be using mozAlarms for in-memory-only purposes:
1) We really should not be persisting the alarms issued by requestsync to the database at all.
2) _restoreAlarmsFromDb and its friends need to not trash the in-memory-only alarms.
Summary: mozAlarms' IDB reads do not return chrome-issued alarms (or any alarms without a manifestURL), causing requestsync to break on timezone or clock changes → requestsync assumes all in-memory mozAlarms will never be purged (and these alarms get persisted anyways), but they are purged on timezone/clock changes
blocking-b2g: 2.5? → 2.5+
(Assignee)

Comment 3

3 years ago
Andrew, are you planning to work on this issue? Otherwise I can take it.
Flags: needinfo?(bugmail)
(Reporter)

Comment 4

3 years ago
It's all yours! ;)
Assignee: nobody → amarchesini
Status: NEW → ASSIGNED
Flags: needinfo?(bugmail)

Updated

3 years ago
Whiteboard: [platform]
(Reporter)

Updated

3 years ago
Duplicate of this bug: 1212793
(Assignee)

Comment 6

3 years ago
Created attachment 8674803 [details] [diff] [review]
alarm.patch

Feedback request because I didn't test it properly yet.
Attachment #8674803 - Flags: feedback?(bugmail)
(Reporter)

Comment 7

3 years ago
Comment on attachment 8674803 [details] [diff] [review]
alarm.patch

I think this might be an out-of-date patch or there's some confusion arising from my initial confusion in comment 0.

requestsync breaks at runtime whenever _onTimezoneChanged or _onSystemClockChanged is invoked because they call _restoreAlarmsFromDb which zeroes out the _alarmQueue and repopulates it using the database state.

This patch causes the AlarmService to no longer attempt to persist the chrome alarms, but they will still be lost whenever _restoreAlarmsFromDb is invoked.

I assume the existing code's idiom of invoking _restoreAlarmsFromDb was simple code reuse.  Unless there was intent to leverage IndexedDB's transaction model for ordering (and there are no comments to that effect), it seems like the "come up to speed" guts of _restoreAlarmsFromDb could be refactored out like you've done with processNewAlarm.  If that's your intent (or you've already done that and didn't refresh your patch! ;) that seems good to me.  It would be great to have a comment documenting the plan for API-use races against timezone/clock change events.

I do think it is probably also desirable to have some fix or plan in place to deal with the already persisted chrome alarms that requestync has registered.  Anyone who has been using the email app with requestsync (or if there are any other consumers), will have one garbage alarm entry for every time b2g restarted (via powerup or simplying stopping/starting b2g) and every time they changed their requestsync settings from within the email app.  Making the change I mention in comment 0 with a defensive discard-filtering of records lacking a manifestUrl would accomplish the goal, although bumping the schema version and running one-off code would also work.
Attachment #8674803 - Flags: feedback?(bugmail)
(Assignee)

Comment 8

3 years ago
Created attachment 8679534 [details] [diff] [review]
alarm.patch

What I would like to have with this patch is that, for alarms without manifest, we never go to the DB. This patch is similar to the previous one, but when the timezone changes, the restoreFromDB method, ignores the existing alarms if they are coming from chrome (or if they don't have a manifest).
Attachment #8674803 - Attachment is obsolete: true
Attachment #8679534 - Flags: review?(bugmail)

Updated

3 years ago
Duplicate of this bug: 1210626
(Reporter)

Comment 10

3 years ago
Comment on attachment 8679534 [details] [diff] [review]
alarm.patch

The _restoreAlarmsFromDb implementation logic here is failing to put the _currentAlarm back into the _alarmQueue, resulting in the current alarm gets dropped if it was chrome-issued.

It's a minor fix to correct this so I've made that change locally and am testing it under various permutations.  I'll attach the revised patch and kick off a try-run once I finish my testing; should be less than an hour.
Attachment #8679534 - Flags: review?(bugmail) → review-
(Reporter)

Comment 12

3 years ago
During testing I also identified an ordering problem with _currentAlarm and re-entrancy in the chrome alarm case.  Currently (without a fix) _currentAlarm invokes _notifyAlarmObserver and then sets _currentAlarm to null.  In the case of a chrome alarm, _notifyAlarmObserver synchronously invokes the callback.  If the chrome callback adds an alarm before returning, it will be clobbered out of existence upon return.

This scenario occurs in RequestSyncService.timeout() when hasPendingMessages() returns true, causing it to immediately invoke scheduleTimer().  Otherwise (and the apparent normal case from logging), a more asynchronous flow occurs where requestsync sends the 'request-sync' message and waits for it to complete before rescheduling the timer.

In the specific case where hasPendingMessages ends up returning true in my test, it appears that what happened is that the call to systemMessenger.sendMessage('request-sync', ...) returned a rejection.  Why this happened is unclear; nothing obvious jumps out at me in SystemMessageInternal.js and I hadn't thought to uncomment the logging idiom in use throughout these APIs in that file
  function debug(aMsg) {
    // dump("-- SystemMessageInternal " + Date.now() + " : " + aMsg + "\n");
  }

I will however, do that now and attempt to figure out what's up with that, since this also seems like a case that would permanently wedge the email app's periodic sync mechanism as well, and may even impact mozAlarms.

In the meantime, I think we're going to consider moving email back to mozAlarms.  After my quick foray into the requestsync code for this extra debugging, I'm thinking the efforts of bug 1182358 to deal with power-saving sleep were not sufficiently extensive.  Specifically, I don't see any consideration in the requestsync codebase about holding wakelocks despite the use of promises with asynchronous flows, and the nsITimer failsafe likely also needs to be a mozAlarm, etc.

I'll make another update on this bug with my findings about the message sending failure and an updated patch that contends with the re-entrancy issue.
(Reporter)

Comment 13

3 years ago
2 inline log traces.

Log for when the system message sending fails.  The triggering cause per the log is a child process shutdown announcement.  The email app likes to be a good memory citizen and so it invokes window.close() on itself if it was started in the background and not foregrounded by the user by the time it completes its periodic sync.  I suspect there is a power-saving-sleep race related to this.  Specifically, the email app was PID 1811, and we can see that PID 1811 logs something during this log trace.  So the problem is that the preceding b2g instance for the email app only completes its shutdown when we are woken up by the alarm to spawn a new b2g instance for the email app.  It's possible the email app wants to avoid explicitly releasing its cpu wakelock when invoking window.close() instead depending on the platform to automatically clean it up, although we may still end up in some type of race no matter what.  I'll give that a quick try now since this failure mode is reproducing quite reliably for me.
=== 
I/Gecko   (  976): DEBUG RequestSyncService: Sending message.
I/Gecko   (  976): -- SystemMessageInternal 1446144760849 : Sending request-sync {"task":"interval200000","lastSync":"2015-10-29T18:49:20.053Z","oneShot":false,"minInterval":200,"wakeUpPage":"app://email.gaiamobile.org/index.html","wifiOnly":false,"data":{"accountIds":["0"],"interval":200000}} for app://email.gaiamobile.org/index.html @ app://email.gaiamobile.org/manifest.webapp; extra: undefined
I/Gecko   (  976): SystemMessagePermissionsChecker.jsm: isSystemMessagePermittedToSend(): aSysMsgName: request-sync, aPageURL: app://email.gaiamobile.org/index.html, aManifestURL: app://email.gaiamobile.org/manifest.webapp
I/Gecko   (  976): SystemMessagePermissionsChecker.jsm: getSystemMessagePermissions(): aSysMsgName: request-sync
I/Gecko   (  976): -- SystemMessageInternal 1446144760851 : Acquiring a CPU wake lock for page key = k9yKZLKL7M1Q/X6fxQ6KpJsPuEk=
I/Gecko   (  976): -- SystemMessageInternal 1446144760854 : _getMessageConfigurator for type: request-sync
I/Gecko   (  976): -- SystemMessageInternal 1446144760854 : Returned status of sending message: 0
I/Gecko   (  976): AlarmService: Current alarm: null
I/Gecko   (  976): AlarmService: Alarm queue: []
I/Gecko   (  976): -- SystemMessageInternal 1446144760864 : Got Unregister from [object ChromeMessageSender] innerWinID 15032385539
I/Gecko   (  976): -- SystemMessageInternal 1446144760865 : remove the listener for app://email.gaiamobile.org/manifest.webapp
I/Gecko   (  976): -- SystemMessageInternal 1446144760865 : rejecting because of unregister for manifestURL: app://email.gaiamobile.org/manifest.webapp
I/Gecko   (  976): -- SystemMessageInternal 1446144760865 : rejecting pending promises by manifest URL: app://email.gaiamobile.org/manifest.webapp
I/Gecko   (  976): DEBUG RequestSyncService: promise rejected
I/Gecko   (  976): DEBUG RequestSyncService: promise or timeout for task calls taskCompleted
I/Gecko   (  976): DEBUG RequestSyncService: operationCompleted
I/Gecko   (  976): DEBUG RequestSyncService: updateObjectInDB
I/Gecko   (  976): DEBUG RequestSyncService: dbTxn
I/Gecko   (  976): DEBUG RequestSyncService: pendingOperationStarted
I/Gecko   ( 1811): -- SystemMessageCache 1446144760868 : received RefreshCache
I/Gecko   (  976): -- SystemMessageInternal 1446144760879 : Got child-process-shutdown from [object ChromeMessageSender]
I/Gecko   (  976): -- SystemMessageInternal 1446144760879 : rejecting because of shutdown for manifest URL: app://system.gaiamobile.org/manifest.webapp
I/Gecko   (  976): -- SystemMessageInternal 1446144760879 : rejecting pending promises by manifest URL: app://system.gaiamobile.org/manifest.webapp
I/Gecko   (  976): -- SystemMessageInternal 1446144760881 : rejecting because of shutdown for manifest URL: app://callscreen.gaiamobile.org/manifest.webapp
I/Gecko   (  976): -- SystemMessageInternal 1446144760881 : rejecting pending promises by manifest URL: app://callscreen.gaiamobile.org/manifest.webapp
I/Gecko   (  976): -- SystemMessageInternal 1446144760882 : rejecting because of shutdown for manifest URL: app://findmydevice.gaiamobile.org/manifest.webapp
I/Gecko   (  976): -- SystemMessageInternal 1446144760882 : rejecting pending promises by manifest URL: app://findmydevice.gaiamobile.org/manifest.webapp
I/Gecko   (  976): DEBUG RequestSyncService: pendingOperationDone
I/Gecko   (  976): DEBUG RequestSyncService: scheduleTimer
I/Gecko   (  976): AlarmService: add(Thu Oct 29 2015 14:56:00 GMT-0400 (EDT))
I/Gecko   (  976): AlarmService: Current alarm: {"date":"2015-10-29T18:56:00.960Z","ignoreTimezone":false,"timezoneOffset":240,"id":-4}
I/Gecko   (  976): AlarmService: Alarm queue: []
I/Gecko   (  976): DEBUG RequestSyncService: processNextTask
I/Gecko   (  976): DEBUG RequestSyncService: UpdateObjectInDB completed
===

Log for when the rejected message is apparently still tracked as a pending message.  Note that the email app does indeed get the request-sync message when I manually invoke it significantly later on.
===
I/Gecko   (  976): AlarmService: _notifyAlarmObserver()
I/Gecko   (  976): DEBUG RequestSyncService: timeout
I/Gecko   (  976): -- SystemMessageInternal 1446144960969 : received SystemMessageManager:HasPendingMessages request-sync for app://email.gaiamobile.org/index.html @ app://email.gaiamobile.org/manifest.webapp
I/Gecko   (  976): DEBUG RequestSyncService: Pending messages: true
I/Gecko   (  976): DEBUG RequestSyncService: scheduleTimer
===
(Reporter)

Comment 14

3 years ago
Under a small sample size (n=4) having email "leak" its wake-locks when it knows it will call window.close() does seem to result in the "unregister" event happening before the device goes to sleep.  The "child-process-shutdown" events (which seem pretty sketchy, but I'll look at that when filing the SystemMessagesInternal bug), less reliably happen before shutdown.  One of the runs found them only unregistering after the subsequent wakeup.

I've commented in bug 1219853 where we're now planning to move the email app back to mozAlarms about changing our wakelocks behaviour as part of that fix.
(Reporter)

Comment 15

3 years ago
Created attachment 8681033 [details] [diff] [review]
alarm.patch with _currentAlarm and re-entrancy fixes

This is your patch with the _currentAlarm fix and correcting the re-entrancy _currentAlarm clobbering.  r=asuth on your fix with these changes, but I think you need to review my changes too, minor though they be.  I'll push this to try too.

In terms of the blocking status of this bug, it may be appropriate to revisit this.  The email app is planning to move back to mozAlarms for v2.5 as part of bug 1219853, so we no longer provide blocking justification.  However, it may be appropriate to keep this blocking v2.5 on behalf of other users of the API.  The API appears to still be marked certified only, so I would expect the consumers to be limited to those in Gaia, which seems to mean that the "Firefox Sync" mechanism in the system app is the other consumer: https://github.com/mozilla-b2g/gaia/blob/master/apps/system/js/sync_manager.js

Assuming that code lives in the main process with the rest of the system app, that consumer at least should not have to worry about the hasPendingMessages breakage scenario, although there still may be other scenarios where problems arise.
Attachment #8679534 - Attachment is obsolete: true
Attachment #8681033 - Flags: review?(amarchesini)
(Reporter)

Comment 17

3 years ago
I filed bug 1220032 on the SystemMessageInternal issues raised in comment 13.
(Assignee)

Comment 18

3 years ago
Comment on attachment 8681033 [details] [diff] [review]
alarm.patch with _currentAlarm and re-entrancy fixes

Review of attachment 8681033 [details] [diff] [review]:
-----------------------------------------------------------------

can you land it if fully green on treeherder?
Attachment #8681033 - Flags: review?(amarchesini) → review+
(Reporter)

Comment 20

3 years ago
Comment on attachment 8681033 [details] [diff] [review]
alarm.patch with _currentAlarm and re-entrancy fixes

The try runs are not green, https://dxr.mozilla.org/mozilla-central/source/hal/tests/browser_alarms.js is unhappy and times out.  It seems that trigger_all_alarms only sees 9 of the 10 alarms firing.

It appears this is a different reentrancy issue in _onAlarmFired in the alarmQueue.length > 0 case.  It is assuming that the earlier _notifyAlarmObserver call could not have possibly mutated _currentAlarm.  But it did.  So it's clobbering _currentAlarm.

This is a newfound problem with this patch since the logic (now) in processNewAlarm previously would always have been deferred to a future turn of the event loop because it always occurred after the IndexedDB database write occurred.

For simplicity, it might make sense to also defer processNewAlarm to a future turn of the micro-task queue to avoid these re-entrancy problems without complicating the app logic too much.  For example, by doing a nextTick() type trick using Promise.resolve().then(...).  The other alternative would be to defer/batch the callback invocations until after all state management is concluded.  This would also moot the re-entrancy problem I noted in comment 12.  (Note that the "current" patch also includes the still necessary fix discussed in comment 10, so it should not be abandoned.)

Note that correctness in terms of wake-locks may be impacted for chrome alarms.  Specifically, for content code, they get their wake-lock correctness from the AlarmsManager holding the wake-lock.  The chrome consumers of the AlarmService like requestsync do not get that.  (But as noted elsewhere, I think requestsync is largely living on luck right now anyways.)

Because of the wake-lock issue I'm declaring this beyond me.  :baku, back to you for preparing the next rev of the patch that makes the broken test happy, etc.
Attachment #8681033 - Flags: review+
Since we moved away from requestSync for 2.5 (at least for email) [1], should we avoid continuing to block 2.5 on this?

[1]
https://groups.google.com/d/msg/mozilla.dev.fxos/3tW7demWV1k/xfgvDJLgAQAJ
Flags: needinfo?(doliver)
Flags: needinfo?(bugmail)
(Reporter)

Comment 22

3 years ago
I sorta addressed this in comment 15; email no longer needs requestsync and so does not merit blocking justification for requestsync, but there's "Firefox Sync" code in the tree that may.  I have no idea if that code is actually used.
Flags: needinfo?(bugmail)
Fernando, see comment #15 -- does this bug affect sync manager? If not, we'll unblock this for 2.5.
Flags: needinfo?(doliver) → needinfo?(ferjmoreno)
The sync manager is going to be shipping on 2.5 TVs, so yes, I guess this needs to block.

(In reply to Andrew Sutherland [:asuth] from comment #15)
> Assuming that code lives in the main process with the rest of the system
> app, that consumer at least should not have to worry about the
> hasPendingMessages breakage scenario, although there still may be other
> scenarios where problems arise.

Can you describe which are these scenarios?
Flags: needinfo?(ferjmoreno) → needinfo?(bugmail)
(Assignee)

Comment 25

3 years ago
Created attachment 8684350 [details] [diff] [review]
patch

The notification must be sent via a timeout otherwise the callbacks are running into the same function and that creates a mess with the alarm queue.
Attachment #8681033 - Attachment is obsolete: true
Attachment #8684350 - Flags: review?(bugmail)
(Reporter)

Comment 26

3 years ago
Comment on attachment 8684350 [details] [diff] [review]
patch

Deferring the chrome alarm callback to a future turn of the event loop will definitely fix the reentrancy problems for chrome API consumers.  *But* deferring the system message scheduling to a future turn of the event loop loses the wakelock coverage that ensures that the device doesn't go back to sleep immediately.  If the chrome consumer is operating on a device where deep sleep occurs, then the chrome consumer may also end up very sad too.

Specifically, SystemMessageInternal.sendMessage() will acquire a wakelock on behalf of the page receiving the "alarm" system message before returning control to its caller.  Unfortunately, the AlarmHalService code lacks comments explaining what guarantees are made about when sleep will next occur, but I'm assuming we're safe as long as we acquire wake-locks before returning control to the AlarmHalService from _onAlarmFired which it invokes.

If we are going to defer alarm invocation, then I believe that _onAlarmFired will need to acquire a wakelock itself and hold it until all consumer callbacks have run.  I believe this is why in comment 20 I proposed adding the deferral on the alarm-adding side rather than on the notification side.  (Although admittedly, as I note in comment 20, there is an increased deep sleep risk to chrome consumers versus earlier drafts of your patch, but chrome consumers were already deeply exposed to that risk prior to this patch anyways.  requestsync really just wants to be holding a wakelock when it invokes the alarm adding API.)
Flags: needinfo?(bugmail)
Attachment #8684350 - Flags: review?(bugmail) → review-
(Reporter)

Comment 27

3 years ago
(In reply to Fernando Jiménez Moreno [:ferjm] from comment #24)
> The sync manager is going to be shipping on 2.5 TVs, so yes, I guess this
> needs to block.
> 
> (In reply to Andrew Sutherland [:asuth] from comment #15)
> > Assuming that code lives in the main process with the rest of the system
> > app, that consumer at least should not have to worry about the
> > hasPendingMessages breakage scenario, although there still may be other
> > scenarios where problems arise.
> 
> Can you describe which are these scenarios?

If both of the following are true:
A) The TV ever goes into deep sleep where a mozAlarm is needed to wake it up.
B) The sync manager logic needs to be triggered while the TV is in deep sleep and needs to reliably run.

Then the sync manager is in trouble because 'B' isn't going to happen.

However, doing another skim of the sync manager logic, it seems like one of A or B must already be false since I don't see the sync manager acquiring wake locks, so there's no way B could be true given A already.

If it's a goal that 'A' and 'B' be true for the v2.5 TV's, then someone will need to overhaul the sync manager to use wakelocks as well as requestsync.  (Or switch to mozAlarms and still overhaul the sync manager to use wakelocks).
(Assignee)

Comment 28

3 years ago
Created attachment 8684872 [details] [diff] [review]
patch

Here a wakelock. RequestSync doens't need it because it's kept alive by the promise generated by the systemMessage.
Attachment #8684350 - Attachment is obsolete: true
Attachment #8684872 - Flags: review?(bugmail)
(Reporter)

Comment 29

3 years ago
Comment on attachment 8684872 [details] [diff] [review]
patch

Yes, this wakelock should protect the mozAlarms callback and make this change keep mozAlarms working.  Thanks!  r=asuth with the understanding that this moves requestsync forward in terms of wakelock-safety without breaking mozAlarms, but not does make requestsync wakelock-safe.

(In reply to Andrea Marchesini (:baku) from comment #28)
> RequestSync doens't need it because it's kept alive by the
> promise generated by the systemMessage.

I don't think this is the case at all.  The system message wake-lock appears to be held exclusively for the benefit of the recipient page of the system message.  The wake-lock is held until the earlier of the following two things happens:
1) until mozSetMessageHandler for all woken pages has synchronously invoked and returned.  (Giving the page the opportunity to acquire its own wakelock inside its handler.)
2) until the 30 second timeout in _acquireCpuWakeLock times out.

requestsync's wakelock usage seems potentially unsound at least on these two fronts:
- SystemMessageInternal.js resolves the promise, then releases the wakelock.  But by definition of how promises work, that notification must come after the wakelock has been released.
- Requestsync's timeout if the promise doesn't fire is 2 minutes, which is obviously well after the 30 second timeout implemented by SystemMessageInternal.

Because I've now written disclaimers that I'm not 100% on how the wake-lock stuff happens under the hood too many times, I've dug into the platform code and also refreshed my understanding of the gaia logic.  "cpu" wake-lock releases appear to be a synchronous cascade[1] of notifications throughout Hal layers (GECKO/hal) and power manager layers (GECKO/dom/power) until they end up in the system app's (GAIA/apps/system) cpu_manager.js which synchronously ends up back in hal::SetCpuSleepAllowed(), which on gonk will bottom out in a write to "/sys/power/wake_unlock". And looking at the kernel implementation for "autosleep", it looks extremely eager to hibernate() the device, consistent with the docs for "/sys/power/autosleep" at https://www.kernel.org/doc/Documentation/ABI/testing/sysfs-power.  In particular, I see no indication of logic that wants to let the Gecko event-loop or Promise micro-task queues drain first.  (Noting that because of the notification indirection going on, there could be some secret logic happening, but I didn't see any sign of it, and arguably that logic would be dangerous.)

1: At least, if we're in the parent process, which is the case for the RequestSyncService.  It appears that in the content processes hal::ModifyWakeLock will be asynchronously relayed to the parent process, which means that content wake-lock unlocks may take an arbitrary amount of time to take effect, making content logic doing sketchy things more likely to be race-prone and get away with things some of the time[2].

2: API calls that involve talking to the parent process that are ordered on the same queue as whatever services PHal (looks main-thread, not PBackground) should presumably result in sleep occurring before the call returns.  PBackground-serviced things like IndexedDB may however still get to do stuff?  Things get analytically wonky here and proper wake-lock usage should be safe anyways.  (As long as PHal and SystemMessageInternal continued to be serviced in-order on the same thread, that is.)
Attachment #8684872 - Flags: review?(bugmail) → review+
(Assignee)

Comment 30

3 years ago
Created attachment 8685073 [details] [diff] [review]
patch
Attachment #8684872 - Attachment is obsolete: true
Attachment #8685073 - Flags: review?(bugmail)
(Assignee)

Comment 31

3 years ago
Created attachment 8685408 [details] [diff] [review]
patch

I hope this covers all the possible scenario. I still have to push this patch to try..
Attachment #8685073 - Attachment is obsolete: true
Attachment #8685073 - Flags: review?(bugmail)
Attachment #8685408 - Flags: review?(bugmail)
(Reporter)

Comment 32

3 years ago
Comment on attachment 8685408 [details] [diff] [review]
patch

I haven't looked at the changed code yet for correctness.  Some comments are needed in the source to describe the (intended) invariants surrounding the use of wake-locks in requestsync before I do.  Although you need not go as far annotating each function (briefly) as to what wake-lock covers it as I suggested as ideal on IRC, something is required.  In particular, since timeout() may create its own wake-lock or receive one, I think it would be helpful to describe the contract for it being passed a wake-lock, and under whose implicit wake-lock it is assumed to be operating when it has to create one.  (I believe the answer is the mozAlarms chrome callback.) 

I'm happy to re-review with the comments, or if you want to find an alternate reviewer with less love of comments than myself, that's also fine.
Attachment #8685408 - Flags: review?(bugmail)
(Assignee)

Comment 33

3 years ago
Created attachment 8686085 [details] [diff] [review]
patch
Attachment #8685408 - Attachment is obsolete: true
Attachment #8686085 - Flags: review?(bugmail)
(Reporter)

Comment 34

3 years ago
Comment on attachment 8686085 [details] [diff] [review]
patch

Review of attachment 8686085 [details] [diff] [review]:
-----------------------------------------------------------------

Thank you for the comment.  I think another round of fixes and the steady-state should be safe from hibernation.

::: dom/requestsync/RequestSyncService.jsm
@@ +702,5 @@
>      if (!app) {
>        dump("ERROR!! RequestSyncService - Failed to retrieve app data from a principal.\n");
>        aObj.active = false;
>        this.updateObjectInDB(aObj);
> +      this.maybeReleaseWakeLock(aWakeLock);

You might consider hanging this off the updateObjectInDB callback for idiom hygiene reasons.  For this csae it doesn't really matter; the main risk is that the device hibernates before the transaction goes through and it gets rolled back on startup and this logic has to run again.

@@ +712,5 @@
>  
>      // Maybe need to be rescheduled?
>      if (this.hasPendingMessages('request-sync', manifestURL, pageURL)) {
>        this.scheduleTimer(aObj);
> +      this.maybeReleaseWakeLock(aWakeLock);

For hygiene reasons, this maybeReleaseWakeLock call should happen in a scheduleTimer callback.  Strictly speaking, this is safe right now because AlarmService.add() is synchronous in this chrome-case, but that's not the contract and there's no tests or comments that prevent this from regressing.

@@ +723,5 @@
>      if (!manifestURL || !pageURL) {
>        dump("ERROR!! RequestSyncService - Failed to create URI for the page or the manifest\n");
>        aObj.active = false;
>        this.updateObjectInDB(aObj);
> +      this.maybeReleaseWakeLock(aWakeLock);

Ditto the above updateObjectInDB note, but there's also a requestsync-broken-until-reboot issue if this logic ever fires.  Note that _activeTask is set to aObj above, but is not cleared in this branch.  This strops all timeouts from firing until the device restarts.

I would suggest moving this failsafe check up above the assignment to this._activeTask but below removeTimer.

@@ +751,5 @@
>        if (!done) {
>          done = true;
> +        self.operationCompleted(wakeLock);
> +      } else {
> +        wakeLock.unlock();

The else branch is weird and seems wrong.  We end up in the else branch if both the timer and the promise have resolved/rejected.  The first time through, we hand the wakelock off to operationCompleted.  The second time through, we potentially are canceling the wakelock out from under the operationCompleted logic, very possibly breaking the self-rescheduling logic.

It seems like you can just remove the else branch and things should be fine.
Attachment #8686085 - Flags: review?(bugmail)
(Assignee)

Comment 35

3 years ago
Created attachment 8687669 [details] [diff] [review]
requestsync-with-fixes.diff
Attachment #8686085 - Attachment is obsolete: true
Attachment #8687669 - Flags: review?(bugmail)
(Reporter)

Comment 36

3 years ago
Comment on attachment 8687669 [details] [diff] [review]
requestsync-with-fixes.diff

Woo!
Attachment #8687669 - Flags: review?(bugmail) → review+

Updated

3 years ago
Duplicate of this bug: 1222964

Comment 39

3 years ago
bugherder
https://hg.mozilla.org/mozilla-central/rev/1310c38d6d26
Status: ASSIGNED → RESOLVED
Last Resolved: 3 years ago
status-firefox45: --- → fixed
Resolution: --- → FIXED
Target Milestone: --- → mozilla45

Updated

3 years ago
Duplicate of this bug: 1226599
You need to log in before you can comment on or make changes to this bug.