Closed Bug 997949 Opened 6 years ago Closed 6 years ago

[Messages] The notification never disappears in some situations

Categories

(Core :: DOM: Core & HTML, defect, critical)

ARM
Gonk (Firefox OS)
defect
Not set
critical

Tracking

()

RESOLVED FIXED
2.0 S1 (9may)
blocking-b2g 1.4+
Tracking Status
firefox30 --- wontfix
firefox31 --- wontfix
firefox32 --- fixed
b2g-v1.4 --- fixed
b2g-v2.0 --- fixed

People

(Reporter: julienw, Assigned: gerard-majax)

References

Details

(Keywords: regression, Whiteboard: [systemsfe] [p=1])

Attachments

(1 file, 4 obsolete files)

Seen on v1.4 on Buri, when bug 988726 happens, the notification does not disappear when we click it.

We have no 100% STR.

I tried to debug a phone when that was happening, I saw that `Notification.get` was returning a promise that was never resolved nor rejected, and as a result we never tried to close the notification. Nothing relevant in the logcat.

Requesting 1.4? because this is happening in v1.4 and the phone has lots of weird behaviors when that happens.
See Also: → 988726
Note that simply entering the thread does not make the notification disappear either. Same technical reason, but we still don't know the root reason.
Summary: sms message thread is all wrong - wrong messages, wrong contacts, wrong header → [Messages] The notification never disappears in some situations
Julien, have you seen this happening on master?
Assignee: nobody → mhenretty
Anthony, Dietrich, did you see this on your phone when bug 988726 is occurring?
Flags: needinfo?(dietrich)
Flags: needinfo?(anthony)
Note - We're going to need better STR for this to block.
I think it happens when the phone is used for real, like being up for some hours, with calls, messages, etc, and somehow a resource is leaking.

That's why it will be very difficult to have a STR but yet this is IMO a blocker.
as an end-user, I can tell it is very disturbing. I never reboot my device but this issue appears and disappears randomly (it seems).
I had the issue again yesterday with alcatel one touch fire, 1.4.0.0-prerelease
BI : 20140411000202
Git commit info : 2014-04_11 03:40:46 6c50349f
Keywords: steps-wanted
blocking-b2g: 1.4? → 1.4+
It's a bit difficult to remember but I think I've never seen notification not disappear on my dogfooding phone, even when bug 988726 happens. But then, these might be unrelated bugs.
Flags: needinfo?(anthony)
I don't remember seeing this ever.
Flags: needinfo?(dietrich)
(In reply to Michael Henretty [:mhenretty] from comment #2)
> Julien, have you seen this happening on master?

Hey Mic, 

Do you have enough information to investigate this issue or ideas that may help progress here ? Do not want to endlessly block on an issue which cannot be investigated without STR or more information.
Flags: needinfo?(mhenretty)
(In reply to bhavana bajaj [:bajaj] from comment #9)
> Hey Mic, 
> 
> Do you have enough information to investigate this issue or ideas that may
> help progress here ? Do not want to endlessly block on an issue which cannot
> be investigated without STR or more information.

At this point it will be really hard for me to investigate. I have seen similar things happening in the past. However, there have been some recent fixes in master to the underlying Notification implementation (or rather components that notifications rely on) that I thought fixed this issue in master. That is why I asked Julien if he has also seen this on master.

In any case, without good STR (even if it is intermittent), this will be very hard to investigate and even harder to fix.
Flags: needinfo?(mhenretty)
In triage, we discussed that we would only block on this if it was actionable. comment 10 seems to imply that isn't true, so this is being moved to backlog.
blocking-b2g: 1.4+ → backlog
I still disagree with this rationale. This is a vicious circle: it's not a blocker so nobody works on it so there is no valid STR so it's not a blocker. A vicious circle needs to be broken. How to break the vicious circle? 

2 solutions:
* By blocking on important bugs for users.
* Somebody who cares (today it's been Alexandre Lissy, Natalia and myself) tries to find a STR on a non-blocker bug.

The process you use these days really doesn't work for me.
Since we haven't seen this on master, maybe it's a case of an uplift to 1.4 that caused this. We may need to uplift more fixes, like those mentioned in comment 10. Michael, could you list the recent improvements for Notifications?

(And I agree with Julien about not removing blockers when we have a dogfooder seeing the issue)
blocking-b2g: backlog → 1.4?
Flags: needinfo?(mhenretty)
(In reply to Julien Wajsberg [:julienw] from comment #12)
> I still disagree with this rationale. This is a vicious circle: it's not a
> blocker so nobody works on it so there is no valid STR so it's not a
> blocker. A vicious circle needs to be broken. How to break the vicious
> circle? 

Well, find the STR to reproduce the issue consistently. Or give enough information to the developer to allow them to fix the bug. But we can't block on something without an ability to fix the problem. We'll never ship software that way.

> 
> 2 solutions:
> * By blocking on important bugs for users.

They need to be actionable. Non-actionable bugs fall under criteria for not blocking.

(In reply to Anthony Ricaud (:rik) from comment #13)
> Since we haven't seen this on master, maybe it's a case of an uplift to 1.4
> that caused this. We may need to uplift more fixes, like those mentioned in
> comment 10. Michael, could you list the recent improvements for
> Notifications?
> 
> (And I agree with Julien about not removing blockers when we have a
> dogfooder seeing the issue)

If a few dogfooders hit the issue, then this best action is to get that information in the bug & provide the information needed to allow the bug to be actionable. Also, work with QA to figure out the way to reproduce the bug based on that information. However, if we've got nothing to go off to fix the problem, then there's nothing we can do here. We need to actually have an ability to fix the problem to be able to continue to block on a bug.
Lets get a logcat the next time this shows up.
(In reply to Julien Wajsberg [:julienw] from comment #0)
> I tried to debug a phone when that was happening, I saw that
> `Notification.get` was returning a promise that was never resolved nor
> rejected, and as a result we never tried to close the notification. Nothing
> relevant in the logcat.

I had seen this before, but believed this was fixed by bug 961317. However, that patch was uplifted to aurora before the report of this bug (Apr 10, '14), so this must be a different issue.

Thinking about this more, in bug 967475 Alexandre found a case where calling "close" twice on the same notification can caused Notification.get fail (see bug 967475 comment 68). This could be the same problem as here.

I'm going to attach his fix for that here, and try to get feedback from you guys to see if it still reproduces.
Flags: needinfo?(mhenretty)
Attached patch notification.patch (obsolete) — Splinter Review
Can one of you guys build 1.4 with this patch and try to reproduce this issue?
Attachment #8412394 - Flags: feedback?(nwinter)
Attachment #8412394 - Flags: feedback?(lissyx+mozillians)
Attachment #8412394 - Flags: feedback?(felash)
Attachment #8412394 - Flags: feedback?(anthony)
(In reply to Anthony Ricaud (:rik) from comment #13)
> Since we haven't seen this on master,

as far as I remember, _you_'ve seen this on master?

But it was at least 2 weeks ago.

(In reply to Michael Henretty [:mhenretty] from comment #16)
> 
> Thinking about this more, in bug 967475 Alexandre found a case where calling
> "close" twice on the same notification can caused Notification.get fail (see
> bug 967475 comment 68). This could be the same problem as here.
> 
> I'm going to attach his fix for that here, and try to get feedback from you
> guys to see if it still reproduces.

Don't worry, we work quite closely with Alexandre about this; I think we'll try to flash Natalia (who reproduces this bug more than anyone else) a gecko with Alexandre's fix and see how this goes.
Just want to say I reproduced this today on a master+central build from 17th April (so quite old). Note that I don't see bug 988726. I don't have a good STR either. The new thing I did is that the phone slept during some hours with this notification in. And I had also other notifications (I took screenshots).

Devtools don't work (I can't debug in the application) which is strange but is a bug we have sometimes these days so maybe not related.

I can flash a new version of the SMS app and keeping the bug happening, so I could add some logs. I saw (once again, like in comment comment 0) that the promise returned by Notification.get is never resolved nor rejected.
I can't find more things here so I'll try to reproduce using a similar STR with the same build and see if I can find any reliable STR.
See Also: → 979462
So, I reproduced it once as soon as I rebooted the phone (same build from central+master last 17h April), and then I couldn't anymore (even after rebooting). So this is probably a race condition...

This time there was only one notification, and I didn't leave the phone sleep.
(In reply to Julien Wajsberg [:julienw] from comment #21)
> So, I reproduced it once as soon as I rebooted the phone (same build from
> central+master last 17h April), and then I couldn't anymore (even after
> rebooting). So this is probably a race condition...
> 
> This time there was only one notification, and I didn't leave the phone
> sleep.

Set debug to true in dom/src/notification/NotificationDB.jsm and dom/src/notification/NotificationStorage.js, then try to reproduce and have a look at logcat.
Flags: needinfo?(felash)
In the code for the Notification API, I think there could be race conditions with the cache handling in NotificationStorage:

* we cache only when we get
* on delete and put, we remove/add notifications to the cache, even if the cache is not populated yet
* also, we don't protect against 2 simultaneous gets

I don't think this is the issue we see here though. The "put" and double "get" cases seems to be fine (because we don't clear the cache when we populate it), and we rarely do a delete and a get in the same time (we either do a delete only, or a get _then_ a delete). So these race conditions don't look serious currently.


The issue with the promise that would be never resolved nor rejected must come from the fact that the task for this operation is never run in NotificationDB, and this could come from the issue fixed in attachment 8412394 [details] [diff] [review], so we should definitely uplift it in any case.

Other possibilities:
* "NotificationDB.load" could be called several times, if we call ensureLoad while the current load is still happening. I don't know if this can have any effect
* I don't know why you use Promises inside the functions only, but still use callbacks... Seems very error-prone to me :/ The result is that if the code inside the promise's then handlers fails for some reason, the callbacks would never be called, which is wrong given that all the following and new promise-related tasks would never run. As a quick fix for this, you can add .catch(callback) at the end of the promise chains (which would be easier is you returned promises from your functions instead of taking callbacks).
  I think we really miss a .finally() methods in promises, that would run for both resolved and rejected promises without changing its status.

I really think the current code is not well resistant to errors at all.

Now, the main problem to diagnostic the bug we have here, is that it's mainly a _consequence_ of another bug. The notification never disappears because the notification subsystem failed _before_, which makes it difficult to get logs because the relevant logcat was likely already cleared.

I'll enable DEBUG for the Notification subsystem and disable DEBUG for other subsystems I have, and try to reproduce this again with the phone plugged in on my computer with a running logcat. Hopefully this will bring something useful.
Maybe we are hitting bug 951629 ?
Flags: needinfo?(felash)
Attached patch notification-promises-poc.patch (obsolete) — Splinter Review
This is a POC to rewrite the notificationDB using promises instead of callbacks.

I didn't try this code at all, this is more to dump what was in my mind and start some discussion.

The main good thing is that all failures are propagated until the "catch" in runNextTask, and so we never stop at one task.
Attachment #8413202 - Flags: feedback?(mhenretty)
(In reply to Alexandre LISSY :gerard-majax from comment #24)
> Maybe we are hitting bug 951629 ?

yeah, I've seen these nullptr as well but here it's the edge of my knowledge :) I assumed that this happens before we start any asynchronous stuff and as a result Notification.get would return "null" or throw. Was I wrong?

Seems that there are so many cases where the promise would not be resolved nor rejected, we need to fix all of these cases... We should never silently fail. Programming error sadly happens.
Duplicate of this bug: 978097
Duplicate of this bug: 979462
With the enabled logs, I now see that there are queued tasks that do not finish, but without any effect in the current environment, I think that's because the cache is used so there is no visible effect in a running system. But this would have an effect once the notifications are persisted to be kept after a reboot.
Ok, the case I have right now is the case "No notification found with id: ". So the attachment 8412394 [details] [diff] [review] would fix it.

So, I think a visible effect for this bug could be triggered if we can call "delete" on a notification that is not found, before a first get is done for the app, so that we don't get a cache.

I have a fairly good idea on how to reproduce this reliably. More info in a minute.
I don't know why (yet) but I see 2 delete tasks queued for one "close" function called at the step 3) below. So this makes the following STR which is 100% reproducible for me:

1. ensure the SMS app is not launched (not sure it's necessary)
2. receive a SMS
3. click the notification
=> the notification is correctly removed
4. press home
5. receive another SMS
6. kill the SMS app
7. click the notification
=> the notification is not removed

The issue can be seen in other ways too but this one is 100% reproducible on my Buri on master. I'm confident the fix in attachment 8412394 [details] [diff] [review] will fix this because this fix one of the issues that produces it, but I think we should fix all of them:

* 2 delete tasks queued for one close function (don't know what the issue is yet)
* "notification is not found" returns without calling the callback (fixed by attachment 8412394 [details] [diff] [review])
* if a task fails, no other tasks will run (fixed by my POC)
(In reply to Julien Wajsberg [:julienw] from comment #31)
> I don't know why (yet) but I see 2 delete tasks queued for one "close"
> function called at the step 3) below. So this makes the following STR which
> is 100% reproducible for me:
> 
> 1. ensure the SMS app is not launched (not sure it's necessary)
> 2. receive a SMS
> 3. click the notification
> => the notification is correctly removed
> 4. press home
> 5. receive another SMS
> 6. kill the SMS app
> 7. click the notification
> => the notification is not removed
> 
> The issue can be seen in other ways too but this one is 100% reproducible on
> my Buri on master. I'm confident the fix in attachment 8412394 [details] [diff] [review]
> [diff] [review] will fix this because this fix one of the issues that
> produces it, but I think we should fix all of them:
> 
> * 2 delete tasks queued for one close function (don't know what the issue is
> yet)
> * "notification is not found" returns without calling the callback (fixed by
> attachment 8412394 [details] [diff] [review])
> * if a task fails, no other tasks will run (fixed by my POC)

No surprise then, the STR you provide are the exact one I talked about last friday :).
I'm taking this one, I'll provide a patch on monday so that we can fix it asap on 1.3, since it's likely to be broken there too.

Thanks julien for marking the dupe and confirming the issue.
Assignee: mhenretty → lissyx+mozillians
blocking-b2g: 1.4? → 1.3?
Severity: normal → critical
Status: NEW → ASSIGNED
Keywords: steps-wanted
Whiteboard: [systemsfe] [p=1]
Target Milestone: --- → 2.0 S1 (9may)
Blocks: 963234, 967475
Blocks: 1000337
The same STR does not reproduce the bug on my 1.4 phone. Yet I saw the symptom on Natalia's 1.4 phone. Therefore maybe we need to trigger the issue differently on 1.4...
Comment on attachment 8412394 [details] [diff] [review]
notification.patch

Since this was the fix I already had, I'm obviously f+ :). I'm preparing the same fix with unit tests to make sure we cannot starve NotificationDB.
Attachment #8412394 - Flags: feedback?(lissyx+mozillians) → feedback+
Alex, do you know why we have 2 Delete operations?
(In reply to Julien Wajsberg [:julienw] from comment #36)
> Alex, do you know why we have 2 Delete operations?

No, but I have ideas about this. I'll investigate on monday
Also you didn't comment anything about what I said about promises... I really think you should remove all callback-based code and fully use the promises instead, like I'm proposing in attachment 8413202 [details] [diff] [review]. Doing it only on master is good enough for me but I think we should also try to do it on 1.4.
(In reply to Julien Wajsberg [:julienw] from comment #38)
> Also you didn't comment anything about what I said about promises... I
> really think you should remove all callback-based code and fully use the
> promises instead, like I'm proposing in attachment 8413202 [details] [diff] [review]
> [review]. Doing it only on master is good enough for me but I think we
> should also try to do it on 1.4.

I did not comment it because I had no time to have a close look at it. I do agree that we need to harden the code as much as possible, that's why I'm concentrating on preparing a fix we can easily put on v1.3 as soon as possible with tests to ensure we don't regress this in the future. I have a lot of pending and somewhat conflicting patches regarding code that touches notifications areas, so it's definitively a good time to fix it as well.
A first patch with unit tests. Try pending: https://tbpl.mozilla.org/?tree=Try&rev=c464d14ff8bd

When we remove the callback() fix from the NotificationDB.jsm, this set of unit tests will expose the bug as it will timeout.

There are probably a couple of cases we want to add on this before asking review and merging, feel free to submit any you have in mind :)
Attachment #8412394 - Attachment is obsolete: true
Attachment #8412394 - Flags: feedback?(nwinter)
Attachment #8412394 - Flags: feedback?(felash)
Attachment #8412394 - Flags: feedback?(anthony)
Covering a couple more cases.
Attachment #8413362 - Attachment is obsolete: true
Attachment #8413388 - Flags: review?(mhenretty)
Comment on attachment 8413388 [details] [diff] [review]
0001-Bug-997949-Make-sure-NotificationDB-never-starves.patch

Review of attachment 8413388 [details] [diff] [review]:
-----------------------------------------------------------------

I still think we need to be more proactive.

::: dom/src/notification/NotificationDB.jsm
@@ +278,5 @@
>      // Make sure we can find the notification to delete.
>      var oldNotification = this.notifications[origin][id];
>      if (!oldNotification) {
>        if (DEBUG) { debug("No notification found with id: " + id); }
> +      callback();

you still need to return !
Attachment #8413388 - Attachment is obsolete: true
Attachment #8413388 - Flags: review?(mhenretty)
Attachment #8413645 - Flags: review?(mhenretty)
100% STR on 1.4:
1. go to http://everlong.org/mozilla/testcase-notification-delete/ from the browser
2. allow the content to send notifications
3. ensure the SMS app is killed
4. receive a SMS on the phone
5. click the notification
=> the SMS app opens and the correct thread is displayed, but the notification is not removed

This shows than any content on 1.4 can break the notification subsystem very easily.


100% STR on 1.3:
1. go to http://everlong.org/mozilla/testcase-notification-delete/ from the browser
2. allow the content to send notifications
3. configure the email app to check email every 5 minutes and send notification
4. ensure the email app is killed
5. send a mail to the configured mailbox (from another computer)
6. wait that you receive the notification for the new mail
7. click the notification
=> the email app opens and the correct mailbox is displayed, but the notification is not removed
=> same conclusion, and that's why it should be a 1.3 blocker

Note: on 1.3 only voicemail and the email app, and possibly some third-party apps and content, use the new Notification API


These issues are solved by the patch from Alexandre. I still fear there are other issues, but IMO switching to promises is risky so I'd rather see it happening on central only.
Is this a cert blocker for 1.3?

Lets fix it in 1.4
blocking-b2g: 1.3? → 1.4+
Flags: needinfo?(faramarz)
(In reply to Gregor Wagner [:gwagner] from comment #46)
> Is this a cert blocker for 1.3?
> 
> Lets fix it in 1.4

Note - we'll want to flag Vance here, as he works directly with TCL.
Flags: needinfo?(faramarz) → needinfo?(vchen)
Attachment #8413645 - Flags: review?(mhenretty) → review+
Comment on attachment 8413202 [details] [diff] [review]
notification-promises-poc.patch

Review of attachment 8413202 [details] [diff] [review]:
-----------------------------------------------------------------

I agree, we should rely completely on promises here. I think we should land Alex's fix for master/1.4 and possible 1.3 in this bug. And then land your promises fix in a followup. What do you think?

As for for the outstanding issues mentioned in comment #31:

(In reply to Julien Wajsberg [:julienw] from comment #31)
> * 2 delete tasks queued for one close function (don't know what the issue is
> yet)

Agreed we should investigate why this is happening, but Alexander's fix should prevent the DB from hanging when we encounter this.

> * "notification is not found" returns without calling the callback (fixed by
> attachment 8412394 [details] [diff] [review])

Right, same as above.

> * if a task fails, no other tasks will run (fixed by my POC)

Let's land your Promises revamp in a followup.

::: dom/src/notification/NotificationDB.jsm
@@ +54,4 @@
>        function onSuccess(data) {
> +      // JSON parsing failure will get handled by a later catch in the promise
> +      // chain
> +      this.notifications = JSON.parse(gDecoder.decode(data));

If we were unable to parse the contents of the notification storage file, rather than bail to the .catch() handler, we should just continue on. We need to mark the database as loaded somewhere, or else if we have a corrupt notification storage file we would get caught in a loop trying to load the file for each task.

@@ +181,4 @@
>        this.runNextTask();
>      }
> +
> +    return promise;

Do we need this? I can't see anybody who uses the queueTask promise.
Attachment #8413202 - Flags: feedback?(mhenretty) → feedback+
(In reply to Michael Henretty [:mhenretty] from comment #48)
> Comment on attachment 8413202 [details] [diff] [review]
> notification-promises-poc.patch
> 
> Review of attachment 8413202 [details] [diff] [review]:
> -----------------------------------------------------------------
> 
> I agree, we should rely completely on promises here. I think we should land
> Alex's fix for master/1.4 and possible 1.3 in this bug. And then land your
> promises fix in a followup. What do you think?
> 
> As for for the outstanding issues mentioned in comment #31:
> 
> (In reply to Julien Wajsberg [:julienw] from comment #31)
> > * 2 delete tasks queued for one close function (don't know what the issue is
> > yet)
> 
> Agreed we should investigate why this is happening, but Alexander's fix
> should prevent the DB from hanging when we encounter this.
> 
> > * "notification is not found" returns without calling the callback (fixed by
> > attachment 8412394 [details] [diff] [review])
> 
> Right, same as above.
> 
> > * if a task fails, no other tasks will run (fixed by my POC)
> 
> Let's land your Promises revamp in a followup.
> 
> ::: dom/src/notification/NotificationDB.jsm
> @@ +54,4 @@
> >        function onSuccess(data) {
> > +      // JSON parsing failure will get handled by a later catch in the promise
> > +      // chain
> > +      this.notifications = JSON.parse(gDecoder.decode(data));
> 
> If we were unable to parse the contents of the notification storage file,
> rather than bail to the .catch() handler, we should just continue on. We
> need to mark the database as loaded somewhere, or else if we have a corrupt
> notification storage file we would get caught in a loop trying to load the
> file for each task.

Maybe that's the reason why the patch didn't work for me ;)

> 
> @@ +181,4 @@
> >        this.runNextTask();
> >      }
> > +
> > +    return promise;
> 
> Do we need this? I can't see anybody who uses the queueTask promise.

It doesn't hurt anybody either ;) but I don't mind.
(In reply to Michael Henretty [:mhenretty] from comment #48)

> > * 2 delete tasks queued for one close function (don't know what the issue is
> > yet)
> 
> Agreed we should investigate why this is happening, but Alexander's fix
> should prevent the DB from hanging when we encounter this.

This comes from http://dxr.mozilla.org/mozilla-central/source/b2g/components/AlertsService.js#137 likely.

After discussing with Alexandre, now I don't think it's an issue if we ensure the delete call is idempotent.
Keywords: checkin-needed
Component: Gaia::System → DOM
Product: Firefox OS → Core
Attachment #8413202 - Attachment is obsolete: true
https://hg.mozilla.org/mozilla-central/rev/566be5533bc5
Status: ASSIGNED → RESOLVED
Closed: 6 years ago
Resolution: --- → FIXED
Component: DOM → DOM: Core & HTML
You need to log in before you can comment on or make changes to this bug.