Closed Bug 1181489 Opened 4 years ago Closed 4 years ago

mozAlarm mixes up 2 alarms: one for a regular CalDav sync, the other to wake you up

Categories

(Core :: DOM: Core & HTML, defect, blocker)

38 Branch
defect
Not set
blocker

Tracking

()

RESOLVED FIXED
mozilla44
blocking-b2g 2.5+
Tracking Status
firefox44 --- fixed
b2g-master --- affected

People

(Reporter: jlorenzo, Assigned: timhuang)

References

Details

(Keywords: foxfood)

Attachments

(3 files, 1 obsolete file)

Attached file Logcat
Pre-requisites
On Firefox OS: 
* Have a CalDav set up in Calendar
* Have an alarm in Clock

Steps to reproduce
1. Don't do anything[1]: don't change the time, don't change the time zone, don't wait for daylight saving switch to happen, don't create a new alarm, don't deactivate and reactivate it.
2. When the alarm rings in the middle of the night, capture the logs.

Results
We have a race condition in MozAlarm. In the logs attached, you can see that the Calendar creates a new alarm for itself to sync up.
> 07-08 02:58:21.334  4776  4776 I Calendar: Content JS LOG: controllers/periodic_sync [calendar]  "Will save alarm:" {"alarmId":3853,"start":"2015-07-08T00:58:21.241Z","end":"2015-07-08T01:28:21.241Z"} 

Then, at 01:28:22 UTC (3:28 in my timezone), an alarm is indeed emitted, but it's not the right one:
> 07-08 03:28:22.104  4905  4905 I Clock   : Content JS LOG: [Clock] ### ALARM FIRED! ### Details: {"id":3579,"date":"2015-07-08T05:05:00.000Z","respectTimezone":"ignoreTimezone","data":{"id":1,"type":"normal","date":"2015-07-08T05:05:00.000Z"}} 


Repro frequency:
As said above, that's a race condition. However, I've had 2 alarms that rang in the middle of the night for 3 weeks

Potential dupes: bug 1031573, bug 1180257


Build ID               20150707071412
Gaia Revision          e6506d68e01489b57616f8b74e8773f938ce62b3
Gaia Date              2015-07-06 18:14:41
Gecko Revision         https://hg.mozilla.org/mozilla-central/rev/e7e69cc8c07b
Gecko Version          42.0a1
Device Name            aries
Firmware(Release)      4.4.2
Firmware(Incremental)  eng.worker.20150619.224059
Firmware Date          Fri Jun 19 22:41:08 UTC 2015
Bootloader             s1

[1] These are the actions I've not done since the last weekend. The alarm worked perfectly on Monday and Tuesday.
Gene Lian was apparently the main contributor for MozAlarm, but he left the company. From the last reviews that happened on /dom/alarms and regarding Fernando's feedback, it seems like you could help investigating this bug Fabrice. Is that right? If not, please redirect.
Severity: normal → blocker
Component: DOM: Device Interfaces → DOM
Flags: needinfo?(fabrice)
QA Whiteboard: [foxfood-triage]
Attached file dev-log-main.log
This night my firefox OS 2.2 started an alarm at 4h34 AM.
But I only have an alarm set at 5h23 UTC.
In the log, we can see that OpenWapp is setting an alarm every hour (first is at 1h34). Do not hesitate to ask for more test or log.
Reading [1], I wonder if we should check whether the currentAlarm belongs to the requesting app? I see the previous condition checks it.

It would help to set the DEBUG const to true in that file and test similar cases.

[1] https://dxr.mozilla.org/mozilla-central/source/dom/alarm/AlarmService.jsm#489-491


Reading the file, I think the dance around "this._currentAlarm" is really complicated and error-prone. There is code like [2] that comes from this and could just be removed if we'd use only the queue (and the "current" alarm would be this._alarmQueue[0]).

[2] https://dxr.mozilla.org/mozilla-central/source/dom/alarm/AlarmService.jsm#430-436
hey Fabrice, what do you think about adding some more debug here? Do you think we should enable it in AlertsManager as well?
Attachment #8638912 - Flags: review?(fabrice)
(In reply to Julien Wajsberg [:julienw] from comment #3)

> Reading the file, I think the dance around "this._currentAlarm" is really
> complicated and error-prone. There is code like [2] that comes from this and
> could just be removed if we'd use only the queue (and the "current" alarm
> would be this._alarmQueue[0]).

I see that setting _currentAlarm actually set the underlying alarm. I don't really like this because this is quite implicit and I prefer explicit code, but now I understand better why this is used.
Comment on attachment 8638912 [details] [diff] [review]
0001-Bug-1181489-enable-debug-for-AlarmService.jsm.patch

Review of attachment 8638912 [details] [diff] [review]:
-----------------------------------------------------------------

How much debug spew do we get from that? I would be fine doing that on non-release builds only.
I think the debug lines would be quite small: only logs when the alarms are set or triggered. We have already such logs in the clock app (and OpenWapp too from the previous logs). But logs in the API would help knowing if the alarms are triggered from the HAL backend or directly from the queue (because some code would find out that they're expired).
We need more logs in the wild and we can disable them once we find the issue.
Attachment #8638912 - Flags: review?(fabrice) → review+
Johan, 

Can you reproduce this again? Based on that we can 2.5+ it. 

Thanks
Flags: needinfo?(jlorenzo)
Not consistently. The last time was last week at 3 in the morning. I didn't have the log patch at that time. I updated my phone this morning so I'll get some more logs next time it happens.
Flags: needinfo?(jlorenzo)
[Triage] Blocked by this.
blocking-b2g: 2.5? → 2.5+
I also got numerous reports of similar issues, but nobody gave me any log yet. (and I don't reproduce myself, but likely because I don't use Whatsapp/Telegram/Calendar that set other alarms).
Johan, Any luck with the logs on this one? Or, should we request QAWanted to try reproducing it?

This is a 2.5 blocker and is waiting on logs. 

Thanks
Flags: needinfo?(jlorenzo)
On the dogfood POV, I didn't get waken by a nightly alarm. 

I tried to force the race condition by creating 20 alarms, plus 5 calendars, 1 timer and all. I couldn't get it. We could try a qawanted, but I'm afraid we need a more mechanical way to make this race condition to appear. For example, create 10,000 mozAlarms and see if the id logged is always the correct one.

At the moment, I don't think the QA team has the bandwidth to ramp up on the mozAlarm API, in order to create that many alarms. Can somebody from the dev team give it a try?
Flags: needinfo?(jlorenzo) → needinfo?(mpotharaju)
Ken, Can you please find an assignee for this one too?

Thanks!
Flags: needinfo?(mpotharaju) → needinfo?(kchang)
Assignee: nobody → tihuang
Thanks, Tim.
Flags: needinfo?(kchang)
I believe that the root cause of this bug is triggered by async operations between the alarms service and the child thread which listens the RTC timer when it involves the “remove alarm” operation. I, however, cannot fully confirm that this is the root cause due to the lack of debug messages. The reason that I think the “remove” involves is based on the second log. In this log, there were always 5 alarms set for OpenWapp, and 5 awake signals had been fired. At the very last round of set-and-trigger before the alarm of clock app fired, there were 5 alarms set, but only 4 awake signals. Thus, I believe there was one alarm removed in one of these awake signals.

In a certain situation, the remove operation may cause the next alarm to be fired. Before I illustrate the process which triggers this bug, there is a very important premise need to describe first. That is, even the child thread fires an alarm on time, the actual time that this alarm been processed by the alarms service may be delayed due to some reasons, like the system is busy or some events need to be process first.

The process triggers this bug which involves three alarms, two of these alarms will expire close to each other, say they separate in timing for 100ms, and the first alarm will remove the second alarm in its handle function. The third alarm will expire later than these two alarms. Following is the process in a chronological order and I named these alarms with numbers according to the order of their expired timing.

1. Three alarms been set.
    Current alarm = Alarm1, Queued alarm = Alarm2, 3, underling alarm = Alarm1
2. Alarm 1 expired, the child thread fires the alarm 1.
    Current alarm = Alarm1, Queued alarm = Alarm2, 3, underling alarm = NULL
3. The alarms service catches the fired event from child thread for alarm 1, send alarm 2 to app, and set the second alarm.
    Current alarm = Alarm2, Queued alarm = Alarm3, underling alarm = Alarm2
4. The remove request has been sent from App, but the alarms service is not going to set current alarm at this moment, it will set the alarm after it finishes the database removal.
    Current alarm = Alarm2, Queued alarm = Alarm3, underling alarm = Alarm2
5. Alarm 2 expired, the child thread fires the alarm 2.
    Current alarm = Alarm2, Queued alarm = Alarm3, underling alarm = NULL
6. The database removal returns, the alarms service set the current alarm here.
    Current alarm = Alarm3, Queued alarm = NULL, underling alarm = alarm 3
7. The alarms service catches the fired even from child thread for alarm 2, sent alarm 3 to app. (The bug happens here)
    Current alarm = NULL, Queued alarm = NULL, underling alarm = alarm 3

Right now, I can reproduce the similar phenomena by manually adding a short delay when alarms manager handling the fired alarm. And this bug could be solved by adding a sanity check when alarm been fired. If the alarm does not expire at that time, I reset this alarm to prevent this issue.
I also found another bug may cause that an alarm would be fired earlier. The actual timing of setting the underlying alarm would be delayed when the system is extremely busy. So it will set an alarm which already expired. This may trigger some problems of the alarm service. This could be solved by adding a sanity check when the alarm writing to underlying alarm. If this alarm already expired at that moment, it directly notify this alarm instead of setting it.
Attached patch bug1181489.patchSplinter Review
Attachment #8664614 - Flags: review?(fabrice)
Attachment #8664614 - Flags: review?(fabrice) → review+
Attachment #8638912 - Attachment is obsolete: true
https://hg.mozilla.org/mozilla-central/rev/a91c821d85b7
Status: NEW → RESOLVED
Closed: 4 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla44
Duplicate of this bug: 1031573
See Also: → 1187527
Duplicate of this bug: 1180257
Blocks: 1232256
Flags: needinfo?(fabrice)
Component: DOM → DOM: Core & HTML
You need to log in before you can comment on or make changes to this bug.