Closed Bug 1011158 Opened 10 years ago Closed 10 years ago

Intermittent test_notification_resend.html | Test timed out.

Categories

(Core :: General, defect)

29 Branch
ARM
Gonk (Firefox OS)
defect
Not set
normal

Tracking

()

RESOLVED FIXED
mozilla33
Tracking Status
firefox31 --- unaffected
firefox32 --- fixed
firefox33 --- fixed
firefox-esr24 --- unaffected
b2g-v1.4 --- wontfix
b2g-v2.0 --- fixed
b2g-v2.1 --- fixed

People

(Reporter: RyanVM, Assigned: Yoric)

References

Details

(Keywords: intermittent-failure, Whiteboard: [systemsfe])

Attachments

(1 file, 1 obsolete file)

This started after the Gaia commit for bug 967475 AFAICT. It happening frequently enough that we're going to want to backout or disable ASAP if there isn't a fix readily available.

https://tbpl.mozilla.org/php/getParsedLog.php?id=39692488&tree=Mozilla-Central

b2g_emulator_vm mozilla-central debug test mochitest-debug-11 on 2014-05-14 16:39:48 PDT for push 8c41134879dd
slave: tst-linux64-spot-549

17:09:36     INFO -  457 INFO TEST-START | /tests/dom/tests/mochitest/notification/test_notification_resend.html
17:09:36     INFO -  [Child 728] WARNING: Transparent content with displayports can be expensive.: file ../../../gecko/layout/base/nsDisplayList.cpp, line 1348
17:09:37     INFO -  [Child 728] WARNING: Transparent content with displayports can be expensive.: file ../../../gecko/layout/base/nsDisplayList.cpp, line 1348
17:09:40     INFO -  [Child 728] WARNING: Transparent content with displayports can be expensive.: file ../../../gecko/layout/base/nsDisplayList.cpp, line 1348
17:09:40     INFO -  [Child 728] WARNING: Transparent content with displayports can be expensive.: file ../../../gecko/layout/base/nsDisplayList.cpp, line 1348
17:09:55     INFO -  [Child 728] WARNING: NS_ENSURE_TRUE(mDocShell) failed: file ../../../../gecko/embedding/browser/webBrowser/nsWebBrowser.cpp, line 365
17:09:55     INFO -  [Child 728] WARNING: NS_ENSURE_TRUE(mDocShell) failed: file ../../../../gecko/embedding/browser/webBrowser/nsWebBrowser.cpp, line 365
17:09:55     INFO -  [Child 728] WARNING: NS_ENSURE_TRUE(domWindow) failed: file ../../../../gecko/embedding/browser/webBrowser/nsDocShellTreeOwner.cpp, line 86
17:09:55     INFO -  [Child 728] WARNING: NS_ENSURE_TRUE(inBrowser) failed: file ../../../../gecko/embedding/browser/webBrowser/nsDocShellTreeOwner.cpp, line 82
17:09:55     INFO -  [Child 728] WARNING: NS_ENSURE_TRUE(mDocShell) failed: file ../../../../gecko/embedding/browser/webBrowser/nsWebBrowser.cpp, line 365
17:09:55     INFO -  [Child 728] WARNING: NS_ENSURE_TRUE(mDocShell) failed: file ../../../../gecko/embedding/browser/webBrowser/nsWebBrowser.cpp, line 365
17:09:55     INFO -  [Child 728] WARNING: NS_ENSURE_TRUE(domWindow) failed: file ../../../../gecko/embedding/browser/webBrowser/nsDocShellTreeOwner.cpp, line 86
17:09:55     INFO -  [Child 728] WARNING: NS_ENSURE_TRUE(inBrowser) failed: file ../../../../gecko/embedding/browser/webBrowser/nsDocShellTreeOwner.cpp, line 82
17:14:47     INFO -  458 INFO TEST-INFO | dumping last 6 message(s)
17:14:47     INFO -  459 INFO TEST-INFO | if you need more context, please use SimpleTest.requestCompleteLog() in your test
17:14:47     INFO -  460 INFO TEST-INFO | /tests/dom/tests/mochitest/notification/test_notification_resend.html | ::Notification Tests:: - Set manifestURL
17:14:47     INFO -  461 INFO TEST-INFO | /tests/dom/tests/mochitest/notification/test_notification_resend.html | ::Notification Tests:: - Value of manifestURL is: http://mochi.test:8888/manifest.webapp
17:14:47     INFO -  462 INFO TEST-INFO | /tests/dom/tests/mochitest/notification/test_notification_resend.html | ::Notification Tests:: - Test that we have mozChromeNotifications API
17:14:47     INFO -  463 INFO TEST-PASS | /tests/dom/tests/mochitest/notification/test_notification_resend.html | should have mozChromeNotifications API
17:14:47     INFO -  464 INFO TEST-PASS | /tests/dom/tests/mochitest/notification/test_notification_resend.html | should have mozResendAllNotifications()
17:14:47     INFO -  465 INFO TEST-INFO | /tests/dom/tests/mochitest/notification/test_notification_resend.html | ::Notification Tests:: - Making sure we have no previous notification pending
17:14:47     INFO -  466 INFO TEST-UNEXPECTED-FAIL | /tests/dom/tests/mochitest/notification/test_notification_resend.html | Test timed out.
17:14:49     INFO -  467 INFO TEST-INFO | MEMORY STAT vsize after test: 127160320
17:14:49     INFO -  468 INFO TEST-INFO | MEMORY STAT residentFast after test: 58687488
17:14:49     INFO -  469 INFO TEST-INFO | MEMORY STAT heapAllocated after test: 17135340
17:14:50     INFO -  470 INFO TEST-END | /tests/dom/tests/mochitest/notification/test_notification_resend.html | finished in 313874ms
Flags: needinfo?(lissyx+mozillians)
That's strange. Especially since this code has been running for some time. Why would a gaia change have an impact ?

The test that is failing is http://dxr.mozilla.org/mozilla-central/source/dom/tests/mochitest/notification/test_notification_resend.html#55 and it's just doing a Notification.get(). The fact that it times out means we never get the promise :(.

Or maybe we got an error in the promise ?

Michael, can you give me some help?
Flags: needinfo?(lissyx+mozillians) → needinfo?(mhenretty)
I'm sorry Alexandre, I don't have the time to help with this right now. I'm going out of town tomorrow, and will be out all next week. This particular test is certainly unrelated to your changeset, so I would be ok for disabling it while investigating deeper.
Flags: needinfo?(mhenretty)
(In reply to Michael Henretty [:mhenretty] from comment #5)
> I'm sorry Alexandre, I don't have the time to help with this right now. I'm
> going out of town tomorrow, and will be out all next week. This particular
> test is certainly unrelated to your changeset, so I would be ok for
> disabling it while investigating deeper.

Makes sense. We have the Gaia integration tests also so we would probably be able to disable this test without any loss of coverage overall.
Depends on: 1003050
I agree, there is no reason the Gaia commit is the issue here.

Ryan, can you easily find which oldest inbound push is the issue happening in? Is it https://tbpl.mozilla.org/?rev=69ef0bb7aeb8&tree=Mozilla-Inbound ?
Alexandre, could it be because the gaia commit actually turned on a setting "notifications.resend" ?

See https://github.com/mozilla-b2g/gaia/commit/ad2b0cb926d073566fe7f3a5f2361001aa172f90#diff-8

Is the pref changing something on the Gecko side?
FYI I retriggered the mochitests in [1], this is where the gecko part for bug 967475 landed in inbound.

[1] https://tbpl.mozilla.org/?rev=11291a7123f4&tree=Mozilla-Inbound
(In reply to Julien Wajsberg [:julienw] from comment #8)
> Alexandre, could it be because the gaia commit actually turned on a setting
> "notifications.resend" ?
> 
> See
> https://github.com/mozilla-b2g/gaia/commit/
> ad2b0cb926d073566fe7f3a5f2361001aa172f90#diff-8
> 
> Is the pref changing something on the Gecko side?

No.

The mochitest failing is about Notification.get(), totally unrelated to the settings which enables the use of the resending code. This resending code should have no impact whatsoever on the Notification API's storage/DB which is what Notification.get() is about.
I see a segfault in gralloc for each of the reported intermittents.
Flags: needinfo?(nical.bugzilla)
(In reply to Julien Wajsberg [:julienw] from comment #7)
> Ryan, can you easily find which oldest inbound push is the issue happening
> in? Is it https://tbpl.mozilla.org/?rev=69ef0bb7aeb8&tree=Mozilla-Inbound ?

hm looks like its more https://tbpl.mozilla.org/?tree=Mozilla-Inbound&jobname=b2g_emulator_vm%20mozilla-inbound%20debug%20test%20mochitest-debug-11&rev=8c234572141a
(In reply to Carsten Book [:Tomcat] from comment #12)
> (In reply to Julien Wajsberg [:julienw] from comment #7)
> > Ryan, can you easily find which oldest inbound push is the issue happening
> > in? Is it https://tbpl.mozilla.org/?rev=69ef0bb7aeb8&tree=Mozilla-Inbound ?
> 
> hm looks like its more
> https://tbpl.mozilla.org/?tree=Mozilla-
> Inbound&jobname=b2g_emulator_vm%20mozilla-inbound%20debug%20test%20mochitest-
> debug-11&rev=8c234572141a

Terrence, could it be your patch at fault or exposing a bug in the Notification API ?
Flags: needinfo?(terrence)
Carsten, just want to make sure how you found comment 12; have you tried to trigger it on earlier changesets too?
First off, the gralloc segfaults are common for "test timed out and was killed by the harness" - assume them to be noise until proven otherwise :)

Also FTR, I generally recommend bisecting on m-c first to narrow it down to a specific merge. Bisecting on inbound can get very confusing when it crosses merges (due to non-linear history when merges are involved) and I would try to avoid it if I were you.

When I filed this yesterday, the first occurrence I could find was on m-c on the fx-team merge immediately after b-i was merged there (which contained the Gaia change for bug 967475). Seeing bug 967475 in that pushlog immediately made me suspicious.

That said, I retriggered more M11 runs on the merges around that time and the regression does appear to be older. I'll keep retriggering to try to get a higher confidence in the regression range.
https://tbpl.mozilla.org/?jobname=b2g_emulator_vm%20mozilla-central%20debug%20test%20mochitest-debug-11&fromchange=5754fc307237&tochange=8c41134879dd
Flags: needinfo?(terrence)
Flags: needinfo?(nical.bugzilla)
Also, I find it interesting that there haven't been any other instances starred in this bug since it was filed yesterday (the TBPLbot comments where on other retriggers from yesterday as are the ones that I'm about to star :)...).
K, I give up trying to track this down. I just made it happen on a push from 13-May, but it took 300 runs to hit it once.
Whiteboard: [systemsfe]
Ryan, just to keep you posted.

This try request https://tbpl.mozilla.org/?tree=Try&rev=fbfda42afe32 has been retriggered with 50 jobs for mochitest 11 with the latest pending changes for bug 1003050.
Flags: needinfo?(ryanvm)
(In reply to Alexandre LISSY :gerard-majax from comment #37)
> Ryan, just to keep you posted.
> 
> This try request https://tbpl.mozilla.org/?tree=Try&rev=fbfda42afe32 has
> been retriggered with 50 jobs for mochitest 11 with the latest pending
> changes for bug 1003050.

This try has triggered the intermittent three times.

This is one of those: https://tbpl.mozilla.org/php/getParsedLog.php?id=41190149&tree=Try&full=1

We can see that the NotificationDB task is getting blocked just at to the first command (save) sent.

I'll retrigger other builds, with much debug. Later in this log we get an error from osfile_async_front.jsm, it may be another symptom.
Flags: needinfo?(mhenretty)
Flags: needinfo?(dteller)
According to the logs:
- we send three messages to OS.File;
- the last message sent is a writeAtomic and it never receives a reply, so this is likely what blocks your save;
- apparently, the worker has been launched at least once;
- apparently, there is no worker available at the time of shutdown (which may be normal);
- apparently, we have never requested shutdown of the worker, which is surprising.

Is it possible that your code is executed during shutdown and that we have a race condition?

It would be useful if you could launch your tests with preference toolkit.osfile.log set to true.

Side-note: I just checked `messagesSent`, which should be 3, but this is an error in logging.
Flags: needinfo?(dteller)
Also, can you make sure that you are in a branch in which bug 995162 has been landed? Because that's a bug that had the same symptoms.
(In reply to David Rajchenbach Teller [:Yoric] from comment #40)
> Also, can you make sure that you are in a branch in which bug 995162 has
> been landed? Because that's a bug that had the same symptoms.

This is on master from yesterday, so bug 995162 should be there :). I'll add more logs and switch the pref. I'll also check your assumptions regarding race.
What info were you needing from me?
Flags: needinfo?(ryanvm)
(In reply to Alexandre LISSY :gerard-majax from comment #42)
> Try with just OS File debug:
> https://tbpl.mozilla.org/?tree=Try&rev=5d7c0b36f565
> Try with the improvements suggested in bug 1003050:
> https://tbpl.mozilla.org/?tree=Try&rev=dd5189731612

Frequency seems lower now with the changes from bug 1003050 :)
Flags: needinfo?(dteller)
Reproducing with OS.File logging: https://tbpl.mozilla.org/php/getParsedLog.php?id=41279693&tree=Try&full=1
Flags: needinfo?(dteller)
Thanks. The following lines in https://tbpl.mozilla.org/php/getParsedLog.php?id=41279693&tree=Try&full=1 are suspicious:
«
06:26:55     INFO -  OS Controller Posting message {"fun":"Meta_shutdown","args":[true],"id":2}
06:26:55     INFO -  OS Controller Message posted
06:26:55     INFO -  OS Controller Posting message {"fun":"Meta_shutdown","args":[true],"id":3}
06:26:55     INFO -  OS Controller Message posted
»

Apparently, we send two requests for shutdown near-simultaneously, which points to a race condition. This is the kind of situation that I believed I had fixed with bug 995162. I will see if I find something that could explain it.
Ok, I believe that I found the section that it still subject to race conditions. Working on it.
(In reply to David Rajchenbach Teller [:Yoric] from comment #50)
> Created attachment 8436551 [details] [diff] [review]
> Strenghtening Scheduler.kill against race conditions with itself

Thanks!

I've pushed a Try request with this one: https://tbpl.mozilla.org/?tree=Try&rev=8d0ed6fcc9d1
Good news, it seems that after much more retriggers I cannot reproduce the issue anymore with your patch applied: https://tbpl.mozilla.org/?tree=Try&rev=8d0ed6fcc9d1
Flags: needinfo?(dteller)
Added a test.
Assignee: lissyx+mozillians → dteller
Attachment #8436551 - Attachment is obsolete: true
Attachment #8436551 - Flags: review?(nfroyd)
Attachment #8436919 - Flags: review?(nfroyd)
Attachment #8436919 - Flags: review?(nfroyd) → review+
https://hg.mozilla.org/mozilla-central/rev/1be1b85f1ab1
Status: NEW → RESOLVED
Closed: 10 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla33
Flags: needinfo?(mhenretty)
Please request Aurora approval for this patch when you get a chance.
Comment on attachment 8436919 [details] [diff] [review]
Strenghtening Scheduler.kill against race conditions with itself, v2

[Approval Request Comment]
Bug caused by (feature/regressing bug #): bug 927560
User impact if declined: Some processes can't write to disk using OS.File, nor quit
Testing completed: Try + manual testing.
Risk to taking this patch (and alternatives if risky): Can't think of any.
String or UUID changes made by this patch: None.

Fabrice, given that the issue has appeared in bug 927560, I am tempted to request a serious uplift. What's your word on this?
Attachment #8436919 - Flags: approval-mozilla-aurora?
Flags: needinfo?(dteller) → needinfo?(fabrice.desre)
Comment on attachment 8436919 [details] [diff] [review]
Strenghtening Scheduler.kill against race conditions with itself, v2

Good fix Yoric! Aurora approval granted.

What do you mean by a serious uplift? beta? 1.4? 1.3?
Attachment #8436919 - Flags: approval-mozilla-aurora? → approval-mozilla-aurora+
(In reply to Lawrence Mandel [:lmandel] from comment #64)
> Comment on attachment 8436919 [details] [diff] [review]
> Strenghtening Scheduler.kill against race conditions with itself, v2
> 
> Good fix Yoric! Aurora approval granted.
> 
> What do you mean by a serious uplift? beta? 1.4? 1.3?

The partial fix of bug 961317 was uplifted to 1.4, so I believe that it would make sense to uplift the full fix that far, too, if possible.
(In reply to David Rajchenbach Teller [:Yoric] from comment #62)

> Fabrice, given that the issue has appeared in bug 927560, I am tempted to
> request a serious uplift. What's your word on this?

No objection on my side.
Flags: needinfo?(fabrice.desre)
How do I request uplift to 1.4 or even to 2.0?
It's already fixed for 2.0 by virtue of the Aurora uplift. For 1.4 consideration, you need to request approval-mozilla-b2g30 on the patch. Note that the bar for 1.4 approval is getting pretty high, so you'll need a strong reason for doing so.
Comment on attachment 8436919 [details] [diff] [review]
Strenghtening Scheduler.kill against race conditions with itself, v2

NOTE: This flag is now for security issues only. Please see https://wiki.mozilla.org/Release_Management/B2G_Landing to better understand the B2G approval process and landings.

[Approval Request Comment]
Bug caused by (feature/regressing bug #): bug 927560
User impact if declined: This fixes mysterious freezes in applications. I do not know whether B2G 1.4 is affected.
Testing completed: Has been in Firefox 32 and B2G 2.0 for a few weeks.
Risk to taking this patch (and alternatives if risky): Should be safe.
String or UUID changes made by this patch: None.
Attachment #8436919 - Flags: approval-mozilla-b2g30?
Hey Bhavana,

any chance you can look at this approval request for v1.4?

The code looks pretty well tested so I'm not afraid of regressions. I'm actually more afraid of not fixing this issue ;)
Flags: needinfo?(bbajaj)
(In reply to Julien Wajsberg [:julienw] (away June 21 to 30) from comment #73)
> The code looks pretty well tested so I'm not afraid of regressions. I'm
> actually more afraid of not fixing this issue ;)

We're only taking partner driven changes on 1.4 at this point. I have set ni on Ivan and Kevin for their opinions on this one. I think we need a stronger case than Yoric's comment "I do not know whether B2G 1.4 is affected." :) Jilien's comment (above) is a good base for discussion.
Flags: needinfo?(khu)
Flags: needinfo?(itsay)
Let me mark this with 1.4? to raise Ivan and Wayne's attention. They're driving the 1.4 triage.
blocking-b2g: --- → 1.4?
Flags: needinfo?(khu)
I'll add some more information.

What frihgtens me with this bug is that it's a race: so we'll have great difficulty to have a reliable STR to trigger it (we could trigger it only in unit tests). Still, when it happens, it will break all notification subsystem until the phone is restarted: notifications will be added but won't be removed.

Yoric should be able to add more information about the likeliness of the issue.
Flags: needinfo?(dteller)
(In reply to Julien Wajsberg [:julienw] (away June 21 to 30) from comment #76)
> I'll add some more information.
> 
> What frihgtens me with this bug is that it's a race: so we'll have great
> difficulty to have a reliable STR to trigger it (we could trigger it only in
> unit tests). Still, when it happens, it will break all notification
> subsystem until the phone is restarted: notifications will be added but
> won't be removed.

Given this is a race, we are reliably able to reproduce this in our unit tests and the user impact here if we hit this condition, plus keeping in mind the bake time we've got and the risk evaluation here I am leaning on taking this on 1.4, but lets wait for Yoric's further comment on likeliness and risk here.
> 
> Yoric should be able to add more information about the likeliness of the
> issue.
Flags: needinfo?(bbajaj)
So far I am not aware of the same issue reported by our partner. Also, it looks to me that further clarification is needed whether or not v1.4 is affected (at least the status flag is 1.4 unaffected so far). I'd like to wait for the comment from Yoric before making decision for uplift.
Flags: needinfo?(itsay)
I talked with Yoric and Alexandre IRL. It's still not clear whether this issue can reproduce easily for end-users. Still, we know we had numerous issues with notifications and that now we don't have any (that we know of). This patch is part of several other patches so it's hard to know which patches are more useful than others.

So, up to you, but I'd take it :)
Flags: needinfo?(dteller) → needinfo?(bbajaj)
Let's re-visit this bug for 1.4 landing if the similar issue is reported from Dolphin.
blocking-b2g: 1.4? → ---
Comment on attachment 8436919 [details] [diff] [review]
Strenghtening Scheduler.kill against race conditions with itself, v2

b2g30- based on Ivan's call in comment 80.
Attachment #8436919 - Flags: approval-mozilla-b2g30? → approval-mozilla-b2g30-
Flags: needinfo?(bbajaj)
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: