Closed
Bug 1211266
Opened 9 years ago
Closed 9 years ago
Occasional missing icons after flashing homescreen
Categories
(Core :: IPC, defect, P1)
Tracking
()
People
(Reporter: kgrandon, Assigned: baku)
References
Details
Attachments
(19 files, 14 obsolete files)
3.96 KB,
text/plain
|
Details | |
46 bytes,
text/x-github-pull-request
|
cwiiis
:
review+
|
Details | Review |
162.49 KB,
text/x-log
|
Details | |
3.71 KB,
text/plain
|
Details | |
4.34 KB,
text/plain
|
Details | |
180.01 KB,
text/x-log
|
Details | |
2.61 KB,
text/plain
|
Details | |
1011 bytes,
text/x-sh
|
Details | |
261.77 KB,
text/x-log
|
Details | |
8.45 KB,
text/plain
|
Details | |
10.47 KB,
patch
|
Details | Diff | Splinter Review | |
294.65 KB,
text/x-log
|
Details | |
344.75 KB,
text/x-log
|
Details | |
24.49 KB,
text/plain
|
Details | |
15.82 KB,
text/plain
|
Details | |
1.56 KB,
patch
|
Details | Diff | Splinter Review | |
16.56 KB,
patch
|
bent.mozilla
:
review+
gerard-majax
:
feedback+
|
Details | Diff | Splinter Review |
57 bytes,
text/x-github-pull-request
|
gmarty
:
review+
|
Details | Review |
46 bytes,
text/x-github-pull-request
|
gmarty
:
review+
cwiiis
:
approval‑mozilla‑b2g44?
|
Details | Review |
This happens with the homescreen app and not verticalhome. I think this is a race condition with using the homescreen-webapps-manage permission and privileged icon APIs.
Noticed this on a flame.
This is possibly the same issue as: https://bugzilla.mozilla.org/show_bug.cgi?id=1042807#c22
Comment 1•9 years ago
|
||
I get the same behavior on b2gdroid. These errors in logcat seem related:
0-03 17:02:51.893 I/Gecko ( 5452): ###!!! [Parent][DispatchAsyncMessage] Error: (msgtype=0x18001D,name=PBackground::Msg_PBlobConstructor) Value error: message was deserialized, but contained an illegal value
10-03 17:02:51.893 I/Gecko ( 5452):
10-03 17:02:51.893 I/Gecko ( 5452):
10-03 17:02:51.893 I/Gecko ( 5452): ###!!! [Parent][DispatchAsyncMessage] Error: (msgtype=0x1E0003,name=???) Route error: message sent to unknown actor ID
10-03 17:02:51.893 I/Gecko ( 5452):
10-03 17:02:51.893 I/Gecko ( 5452):
10-03 17:02:51.893 I/Gecko ( 5452): ###!!! [Parent][DispatchAsyncMessage] Error: (msgtype=0x18001D,name=PBackground::Msg_PBlobConstructor) Value error: message was deserialized, but contained an illegal value
10-03 17:02:51.893 I/Gecko ( 5452):
10-03 17:02:51.893 I/Gecko ( 5452):
10-03 17:02:51.893 I/Gecko ( 5452): ###!!! [Parent][DispatchAsyncMessage] Error: (msgtype=0x1E0003,name=???) Route error: message sent to unknown actor ID
10-03 17:02:51.893 I/Gecko ( 5452):
10-03 17:02:51.906 I/Gecko ( 5452):
10-03 17:02:51.906 I/Gecko ( 5452): ###!!! [Parent][DispatchAsyncMessage] Error: (msgtype=0x18001D,name=PBackground::Msg_PBlobConstructor) Value error: message was deserialized, but contained an illegal value
10-03 17:02:51.906 I/Gecko ( 5452):
10-03 17:02:51.906 I/Gecko ( 5452):
10-03 17:02:51.906 I/Gecko ( 5452): ###!!! [Parent][DispatchAsyncMessage] Error: (msgtype=0x1E0003,name=???) Route error: message sent to unknown actor ID
10-03 17:02:51.906 I/Gecko ( 5452):
Updated•9 years ago
|
Whiteboard: [systemsfe]
Comment 2•9 years ago
|
||
Ted, can you help out here?
blocking-b2g: --- → 2.5+
Flags: needinfo?(tclancy)
Comment 3•9 years ago
|
||
I'll try to play with that. It's been something we have been spotting a couple of times already ... on several devices
Assignee: nobody → lissyx+mozillians
Flags: needinfo?(tclancy)
Reporter | ||
Comment 5•9 years ago
|
||
(In reply to Alexandre LISSY :gerard-majax from comment #4)
> Kevin, can you provide STRs ?
The STR is:
1) Flash device with make reset-gaia
2) Scroll down the entire app list until you see an empty space with no icon (the app name will be there).
3) If all icons are present, repeat from 1.
There's ~50 icons installed in an engineering build, and I saw it happen to a single icon in the grid about ~50% of the time. The reproduction of this is only 1% or so per icon.
Flags: needinfo?(kevingrandon)
Comment 6•9 years ago
|
||
Ok, I did:
0. make NOFTU=1 reset-gaia
1. Settings, Homescreens, select "New Home Screen"
This changed my homescreen and I have three missing icons: Camera, Twitter and FM Radio. Logcat confirms what Fabrice said:
10-09 13:59:10.959 3595 3595 E GeckoConsole: [JavaScript Error: "ConstraintError" {file: "jar:file:///system/b2g/omni.ja!/components/DataStoreImpl.js" line: 233}]
10-09 13:59:11.029 212 212 V WLAN_PSA: NL MSG, len[048], NL type[0x11] WNI type[0x5050] len[028]
10-09 13:59:15.229 3595 3632 I Gecko :
10-09 13:59:15.229 3595 3632 I Gecko : ###!!! [Parent][DispatchAsyncMessage] Error: (msgtype=0x18001D,name=PBackground::Msg_PBlobConstructor) Value error: message was deserialized, but contained an illegal value
10-09 13:59:15.229 3595 3632 I Gecko :
10-09 13:59:15.229 3595 3632 I Gecko :
10-09 13:59:15.229 3595 3632 I Gecko : ###!!! [Parent][DispatchAsyncMessage] Error: (msgtype=0x1E0003,name=???) Route error: message sent to unknown actor ID
10-09 13:59:15.229 3595 3632 I Gecko :
10-09 13:59:15.759 3595 3632 I Gecko :
10-09 13:59:15.759 3595 3632 I Gecko : ###!!! [Parent][DispatchAsyncMessage] Error: (msgtype=0x18001D,name=PBackground::Msg_PBlobConstructor) Value error: message was deserialized, but contained an illegal value
10-09 13:59:15.759 3595 3632 I Gecko :
10-09 13:59:15.759 3595 3632 I Gecko :
10-09 13:59:15.759 3595 3632 I Gecko : ###!!! [Parent][DispatchAsyncMessage] Error: (msgtype=0x1E0003,name=???) Route error: message sent to unknown actor ID
10-09 13:59:15.759 3595 3632 I Gecko :
10-09 13:59:15.919 3595 3632 I Gecko :
10-09 13:59:15.919 3595 3632 I Gecko : ###!!! [Parent][DispatchAsyncMessage] Error: (msgtype=0x18001D,name=PBackground::Msg_PBlobConstructor) Value error: message was deserialized, but contained an illegal value
10-09 13:59:15.919 3595 3632 I Gecko :
10-09 13:59:15.919 3595 3632 I Gecko :
10-09 13:59:15.919 3595 3632 I Gecko : ###!!! [Parent][DispatchAsyncMessage] Error: (msgtype=0x1E0003,name=???) Route error: message sent to unknown actor ID
10-09 13:59:15.919 3595 3632 I Gecko :
10-09 13:59:16.039 212 212 V WLAN_PSA: NL MSG, len[048], NL type[0x11] WNI type[0x5050] len[028]
10-09 13:59:16.859 4289 4289 E New Home Screen: Content JS ERROR: Failed to retrieve icon [object DOMError]
10-09 13:59:16.859 4289 4289 E New Home Screen: at proto.refresh/< (app://homescreen.gaiamobile.org/bower_components/gaia-app-icon/script.js:552:13)
10-09 13:59:17.659 4477 4520 I (Preallocated app): PAC file installed from data: URI
10-09 13:59:17.829 3595 3632 I Gecko :
10-09 13:59:17.829 3595 3632 I Gecko : ###!!! [Parent][DispatchAsyncMessage] Error: (msgtype=0x18001D,name=PBackground::Msg_PBlobConstructor) Value error: message was deserialized, but contained an illegal value
10-09 13:59:17.829 3595 3632 I Gecko :
10-09 13:59:17.829 3595 3632 I Gecko :
10-09 13:59:17.829 3595 3632 I Gecko : ###!!! [Parent][DispatchAsyncMessage] Error: (msgtype=0x1E0003,name=???) Route error: message sent to unknown actor ID
Comment 7•9 years ago
|
||
DOMError is unrelated, it is always reproduced and concerns mochi.test app.
Updated•9 years ago
|
Priority: -- → P1
Comment 8•9 years ago
|
||
cc-ing Sam as he's been looking at a similar blob ipc issue when using device storage.
Comment 9•9 years ago
|
||
Unable to reproduce at all with a debug Gecko build on a Flame :(
Reporter | ||
Comment 10•9 years ago
|
||
(In reply to Alexandre LISSY :gerard-majax from comment #9)
> Unable to reproduce at all with a debug Gecko build on a Flame :(
Are you able to reproduce with a normal build on a flame?
Comment 11•9 years ago
|
||
(In reply to Kevin Grandon :kgrandon from comment #10)
> (In reply to Alexandre LISSY :gerard-majax from comment #9)
> > Unable to reproduce at all with a debug Gecko build on a Flame :(
>
> Are you able to reproduce with a normal build on a flame?
Yes. And on a Z3c non debug with lots of applications. No need to reset-gaia, on that profile, just switching homescreen is enough. And even after switching once, I still got occurrences of PBlob errors ...
Comment 12•9 years ago
|
||
And after a reboot there is no trace of the issue. Unable to reproduce also when installing Gecko as an update.
Comment 13•9 years ago
|
||
So I have been able to reproduce that on my first try with Z3c, Gecko built in debug and following the STRs of reset-gaia then switch to new homescreen.
Flags: needinfo?(Jan.Varga)
Comment 14•9 years ago
|
||
> (gdb) print idTableEntry
> $1 = {mRawPtr = 0xb6f65050 <_pr_envLock>}
Comment 15•9 years ago
|
||
That assert comes from bug 1076975.
Comment 16•9 years ago
|
||
logcat while reproducing
Updated•9 years ago
|
Component: Gaia::Homescreen → DOM: IndexedDB
Product: Firefox OS → Core
Updated•9 years ago
|
Component: DOM: IndexedDB → IPC
Comment 17•9 years ago
|
||
Duplicates this (https://bugzilla.mozilla.org/show_bug.cgi?id=1209545) but this one is probably better for tracking it I think.
Comment 19•9 years ago
|
||
So it would look like we are calling |IDTableEntry::GetOrCreateInternal()| https://dxr.mozilla.org/mozilla-central/rev/11ff0ccb7d59311df4c190d331c8b58c6e35a0c8/dom/ipc/Blob.cpp#4505 from https://dxr.mozilla.org/mozilla-central/rev/11ff0ccb7d59311df4c190d331c8b58c6e35a0c8/dom/ipc/Blob.cpp#1412
Since the MOZ_ASSERT() is https://dxr.mozilla.org/mozilla-central/rev/11ff0ccb7d59311df4c190d331c8b58c6e35a0c8/dom/ipc/Blob.cpp#3991 it means what we get from |IDTableEntry::Create| is wrong.
With gdb I could check:
> (gdb) print id
> $3 = {m0 = 0, m1 = 61615, m2 = 45716, m3 = "\000\000\000\000\000\000\000"}
I don't know if that value is a correct one though.
There are four ways to return null from that code:
(1) https://dxr.mozilla.org/mozilla-central/rev/11ff0ccb7d59311df4c190d331c8b58c6e35a0c8/dom/ipc/Blob.cpp#4523
(2) https://dxr.mozilla.org/mozilla-central/rev/11ff0ccb7d59311df4c190d331c8b58c6e35a0c8/dom/ipc/Blob.cpp#4535
(3) https://dxr.mozilla.org/mozilla-central/rev/11ff0ccb7d59311df4c190d331c8b58c6e35a0c8/dom/ipc/Blob.cpp#4539
(4) https://dxr.mozilla.org/mozilla-central/rev/11ff0ccb7d59311df4c190d331c8b58c6e35a0c8/dom/ipc/Blob.cpp#4543
We know we pass:
> /* aMayCreate */ true,
> /* aMayGet */ false,
> /* aIgnoreProcessID */ false
The logcat shows:
> 10-13 12:17:02.091 1525 1556 I Gecko : [Parent 1525] WARNING: '!aMayCreate', file ../../../gecko/dom/ipc/Blob.cpp, line 4542
> 10-13 12:17:02.091 1525 1556 I Gecko : [Parent 1525] WARNING: '!idTableEntry', file ../../../gecko/dom/ipc/Blob.cpp, line 3965
> 10-13 12:17:02.092 1525 1556 F MOZ_Assert: Assertion failure: false (), at ../../../gecko/dom/ipc/Blob.cpp:3966
So we are reaching condition (4) and issue the warning. And thus we fail at https://dxr.mozilla.org/mozilla-central/rev/11ff0ccb7d59311df4c190d331c8b58c6e35a0c8/dom/ipc/Blob.cpp#3965 |NS_WARN_IF(!idTableEntry)|.
All those are in the |AnyBlobConstructorParams::TKnownBlobConstructorParams| case. Yet, we get a stack that is about assertion failure at https://dxr.mozilla.org/mozilla-central/rev/11ff0ccb7d59311df4c190d331c8b58c6e35a0c8/dom/ipc/Blob.cpp#3991 in |AnyBlobConstructorParams::TSameProcessBlobConstructorParams|
Comment 20•9 years ago
|
||
Fun fact: it looks like delaying the publication of "displayed" event in Gaia when switching homescreen makes the bug disappearing ...
Comment 21•9 years ago
|
||
(In reply to Alexandre LISSY :gerard-majax from comment #20)
> Fun fact: it looks like delaying the publication of "displayed" event in
> Gaia when switching homescreen makes the bug disappearing ...
Hey Alex, since this bug is assigned to you, are you working on the fix for this? Or does this bug need to be addressed by the platform team? :)
Thanks!
Comment 22•9 years ago
|
||
(In reply to Alexandre LISSY :gerard-majax from comment #20)
> Fun fact: it looks like delaying the publication of "displayed" event in
> Gaia when switching homescreen makes the bug disappearing ...
It's not "displayed" but "changed".
Comment 23•9 years ago
|
||
(In reply to Candice Serran (:cserran) from comment #21)
> (In reply to Alexandre LISSY :gerard-majax from comment #20)
> > Fun fact: it looks like delaying the publication of "displayed" event in
> > Gaia when switching homescreen makes the bug disappearing ...
>
> Hey Alex, since this bug is assigned to you, are you working on the fix for
> this? Or does this bug need to be addressed by the platform team? :)
>
> Thanks!
I fear that is something platform will have to work on. So far everything I have been able to hack would point to some kind of race condition around Blob passing within process. That is triggered when we are changing the homescreen because at that moment we will kill the previous homescreen, instantiate the new one and then make it visible.
We do notify to start displaying it by publishing that "changed" event. By doing so, the homescreen code will start to fetch apps icons. Then it seems we have something bad happening at that time but as exposed in comment 19 it's unclear exactly where it is failing.
Comment 26•9 years ago
|
||
Checking the IDTableEntry Get() and Put() calls. One thing that is interesting in that is the latest ID we manipulate before crashing is "61de7b13-54c2-4a44-8eba-4e59f6d30e95".
Now, just before we crash, we see:
> 10-16 16:40:12.562 18167 18198 I Gecko : AllocPBlobParent:209 => AllocPBlobParent
> 10-16 16:40:12.562 18167 18198 I Gecko : CreateFromParams:3960 => TKnownBlobConstructorParams: IN
> 10-16 16:40:12.562 18167 18198 I Gecko : IDTableEntry::GetOrCreateInternal: aID={61de7b13-54c2-4a44-8eba-4e59f6d30e95} ->Get()
> 10-16 16:40:12.562 18167 18198 I Gecko : IDTableEntry::GetOrCreateInternal: aID={61de7b13-54c2-4a44-8eba-4e59f6d30e95} ->Get() => entry NOK
> 10-16 16:40:12.562 18167 18198 I Gecko : [Parent 18167] WARNING: '!aMayCreate', file ../../../gecko/dom/ipc/Blob.cpp, line 4551
> 10-16 16:40:12.562 18167 18198 I Gecko : [Parent 18167] WARNING: '!idTableEntry', file ../../../gecko/dom/ipc/Blob.cpp, line 3966
> 10-16 16:40:12.562 18167 18198 F MOZ_Assert: Assertion failure: false (), at ../../../gecko/dom/ipc/Blob.cpp:3967
So we do not find an entry in IDTableEntry::GetOrCreateInternal, hence we are in https://dxr.mozilla.org/mozilla-central/rev/2387ada864282880d3a498d643abe3d8b887ee59/dom/ipc/Blob.cpp#4541
Yet, if you look earlier in that log for the same ID value, we get something different:
> 10-16 16:40:08.641 18167 18167 I Gecko : IDTableEntry::GetOrCreateInternal: aID={61de7b13-54c2-4a44-8eba-4e59f6d30e95} ->Get()
> 10-16 16:40:08.641 18167 18167 I Gecko : IDTableEntry::GetOrCreateInternal: aID={61de7b13-54c2-4a44-8eba-4e59f6d30e95} ->Get() => entry NOK
> 10-16 16:40:08.641 18167 18167 I Gecko : IDTableEntry::GetOrCreateInternal: aID={61de7b13-54c2-4a44-8eba-4e59f6d30e95} ->Put()
On that call, we do have no entry, but we are able to reach the end of the method and to call Put() to populate the table.
Comment 27•9 years ago
|
||
Comment 28•9 years ago
|
||
Comment 29•9 years ago
|
||
Comment on attachment 8674985 [details] [review]
[gaia] Cwiiis:bug1211266-work-around-get-icon > mozilla-b2g:master
Got r=gmarty earlier, was leaving until the last minute in case the platform bug got fixed before now.
Attachment #8674985 -
Flags: review+
Comment 30•9 years ago
|
||
Comment 31•9 years ago
|
||
Can you have a look at comment 26 and attachment 8674919 [details] and attachment 8674998 [details] ? It seems that the sIDTable behavior is the a good root to the problem but I still suspect that this is just a symptom. How can we debug further?
Flags: needinfo?(amarchesini)
Flags: needinfo?(Jan.Varga)
Comment 32•9 years ago
|
||
I've merged the work-around: https://github.com/mozilla-b2g/gaia/commit/4d592d4d385c85296511216e7c702c71bf2c9540
Not sure if we want to track the platform bug in this bug or not - I'll leave that up to others.
Comment 33•9 years ago
|
||
Crash with Blob 5ee51033-aac4-4576-a673-d33884bfb6e1
Comment 34•9 years ago
|
||
Debugging patch
Comment 35•9 years ago
|
||
This is the stack we get if we break when we are in the ::Get() call in the unexpected state: that matches the NS_WARN_IF(!aMayCreate).
Comment 36•9 years ago
|
||
Log matching the attachment 8675309 [details]: gdb stacktrace with MOZ_ASSERT(false) added
Comment 37•9 years ago
|
||
(In reply to Alexandre LISSY :gerard-majax from comment #36)
> Created attachment 8675310 [details]
> homescreen_get_put_assert.log
>
> Log matching the attachment 8675309 [details]: gdb stacktrace with
> MOZ_ASSERT(false) added
From that we know that ID "efa6019f-ccab-4b86-966c-26bbe2819e08" did propery got inserted with the Put() call and a subsequent Get() call did succeed. Now when we hit the failure, we do see that |entry| is a null:
> (gdb) f
> #0 0xb3794da8 in mozilla::dom::BlobParent::IDTableEntry::GetOrCreateInternal (aID=..., aProcessID=-1619102152, aBlobImpl=aBlobImpl@entry=0x0, aMayCreate=aMayCreate@entry=false, aMayGet=aMayGet@entry=true,
aIgnoreProcessID=aIgnoreProcessID@entry=false) at ../../../gecko/dom/ipc/Blob.cpp:4558
> 4558 MOZ_ASSERT(false);
> (gdb) print entry
> $1 = {mRawPtr = 0x0}
Comment 38•9 years ago
|
||
Comment 39•9 years ago
|
||
That log exposes the crash. And we can see the Blob ID "842b74dd-7a7a-431c-87aa-2e927f66af3f". That blob gets removed from the hashmap just before we try to do a Get(), because of IDTableEntry destructor being called.
Comment 40•9 years ago
|
||
Comment 41•9 years ago
|
||
That is the stack addr2line result of attachment 8675643 [details]
Comment 42•9 years ago
|
||
(In reply to Alexandre LISSY :gerard-majax from comment #41)
> Created attachment 8675644 [details]
> addr2line
>
> That is the stack addr2line result of attachment 8675643 [details]
So we remove the entry after receiving an IPC "PBlob::Msg___delete__" message.
Comment 43•9 years ago
|
||
More debugging, to try to identify if the IPC |Send::__delete__| call could come from |NoteDyingRemoteBlobImpl| but so far it's not very completely useful.
Attachment #8672256 -
Attachment is obsolete: true
Attachment #8674900 -
Attachment is obsolete: true
Attachment #8674919 -
Attachment is obsolete: true
Attachment #8674998 -
Attachment is obsolete: true
Attachment #8675276 -
Attachment is obsolete: true
Attachment #8675309 -
Attachment is obsolete: true
Attachment #8675310 -
Attachment is obsolete: true
Attachment #8675318 -
Attachment is obsolete: true
Comment 44•9 years ago
|
||
Attachment #8675277 -
Attachment is obsolete: true
Comment 45•9 years ago
|
||
Hacked https://dxr.mozilla.org/mozilla-central/rev/d1a89632277fbaaf470c90a35573776048988f2d/dom/ipc/Blob.cpp#4111 and commented the |PBlobParent::Send__delete__()| statement, I am still hitting the issue. That answers comment 43: we are not receiving the bogus delete call from that call site but from somewhere else.
Comment 46•9 years ago
|
||
I have troubles explaining that set of calls that are part of attachment 8675644 [details]:
> mozilla::dom::PContentParent::RemoveManagee(int, mozilla::ipc::IProtocol*)
> /home/alex/codaz/Mozilla/b2g/devices/XperiaZ3c/B2G.L/objdir-gecko-dbg/ipc/ipdl/PContentParent.cpp:2579
> mozilla::dom::PBlobParent::OnMessageReceived(IPC::Message const&)
> /home/alex/codaz/Mozilla/b2g/devices/XperiaZ3c/B2G.L/objdir-gecko-dbg/ipc/ipdl/PBlobParent.cpp:252 (discriminator 1)
> mozilla::dom::PContentParent::OnMessageReceived(IPC::Message const&)
> /home/alex/codaz/Mozilla/b2g/devices/XperiaZ3c/B2G.L/objdir-gecko-dbg/ipc/ipdl/PContentParent.cpp:3366
The statement at |PContentParent.cpp:2579| is |DeallocPPresentationParent(actor);|, being part of |PPresentationMsgStart| case in |PContentParent::RemoveManagee|.
The statement at |PBlobParent.cpp:252| is |return MsgProcessed;|, being part of |PBlob::Msg___delete____ID| case.
The statement at |PContentParent.cpp:3366| is |(routed__)->OnMessageReceived(msg__);|
So it looks like we reach the expected |PContentParent::RemoveManagee| method, but I cannot explain why we enter the |PPresentationMsgStart| case while we are manipulating blobs.
Comment 47•9 years ago
|
||
Here we can see that blob d756c688-b7a8-402c-91a1-d07be96798eb gets created by thread 13711 of process 13711. The same thread issues a Remove() operation. The subsequent CreateFromParams() and Get() calls that fails are in issued by thread 13742 from process 13711.
Comment 48•9 years ago
|
||
Comment 49•9 years ago
|
||
Comment 50•9 years ago
|
||
Log shows the correct PBlobMsgStart protocol id being used, but the stacktrace stil matches the RemoveManagee() call site of PPresentationMsgStart, so I'm a little bit lost. I would suspect that there's something wrong when collecting the stack itself and that we do indeed really trigger the correct path of PBlobMsgStart and its RemoveManagee() call. Said otherwise, we receive a legit delalloc request for the blob. Still have to track where this is coming from.
Comment 51•9 years ago
|
||
Attachment #8675667 -
Attachment is obsolete: true
Comment 52•9 years ago
|
||
This blocks replaceable homescreens, but not new homescreen (which is shipping a work-around currently).
Comment 53•9 years ago
|
||
More logging, we see PBlob delete issued:
> 10-22 11:29:35.039 29908 29908 I GeckoIPC: [time:1445506175038996][29908->29441][PBlobChild] Sending Msg___delete__([TODO])
> 10-22 11:29:35.040 29908 29908 I GeckoIPC: [time:1445506175040093][29908->29441][PBlobChild] Sending Msg_PBlobStreamConstructor([TODO])
> 10-22 11:29:35.040 29441 29472 I GeckoIPC: [time:1445506175040311][29441<-29908][PBlobParent] Received Msg_PBlobStreamConstructor([TODO])
> 10-22 11:29:35.042 29908 29908 I GeckoIPC: [time:1445506175042367][29908->29441][PBlobChild] Sending Msg___delete__([TODO])
> 10-22 11:29:35.043 29908 29908 I GeckoIPC: [time:1445506175043300][29908->29441][PBlobChild] Sending Msg_PBlobStreamConstructor([TODO])
> 10-22 11:29:35.045 29441 29441 I GeckoIPC: [time:1445506175045524][29441<-29908][PBlobParent] Received Msg___delete__([TODO])
> 10-22 11:29:35.045 29441 29441 I Gecko : RemoveManagee: aProtocolId:30
> 10-22 11:29:35.045 29441 29441 I Gecko : RemoveManagee: aProtocolId:30 PBlobMsgStart
> 10-22 11:29:35.045 29441 29441 I GeckoIPC: [time:1445506175045645][29441<-29908][PBlobParent] Received Msg___delete__([TODO])
> 10-22 11:29:35.045 29441 29441 I Gecko : RemoveManagee: aProtocolId:30
> 10-22 11:29:35.045 29441 29441 I Gecko : RemoveManagee: aProtocolId:30 PBlobMsgStart
Comment 54•9 years ago
|
||
That contains stack from PBlobChild::Send__delete__ call sites.
Comment 55•9 years ago
|
||
Excerpt of the end of attachment 8677364 [details] before the crash
Comment 56•9 years ago
|
||
New stack after hacking StructuredCloneData() for Move constructor
Comment 57•9 years ago
|
||
Comment 58•9 years ago
|
||
Good news/bad news, it's all so reproduced on B2G Desktop/Mulet build (debug or not)
Comment 59•9 years ago
|
||
(In reply to Alexandre LISSY :gerard-majax from comment #58)
> Good news/bad news, it's all so reproduced on B2G Desktop/Mulet build (debug
> or not)
I'm going to try that too.
Flags: needinfo?(Jan.Varga)
Comment 60•9 years ago
|
||
I'm still struggling. Tried collecting nsHttp logs yesterday with Mulet build, in case the problems lies with the use of XHR (which produces the blob we send), but each time I enabled NSPR_LOG_MOULES this was not reproduced.
I still have some unanswered questions regarding attachment 8678120 [details]: it looks like we are releasing RemoteBlobImpl because of the StructuredCloneData destructor being called, and that comes from Release() on nsRefPtr<nsFrameMessageManager>. So:
- does that fact holds and matches the expected code path?
If so, then I guess it means the lack of reference is either against:
- BlobChild::RemoteBlobImpl
- StructuredCloneData
I'm out of ideas of where to check those two ideas :/
Comment 61•9 years ago
|
||
(In reply to Jan Varga [:janv] from comment #59)
> (In reply to Alexandre LISSY :gerard-majax from comment #58)
> > Good news/bad news, it's all so reproduced on B2G Desktop/Mulet build (debug
> > or not)
>
> I'm going to try that too.
Jan, any update here?
Flags: needinfo?(Jan.Varga)
Assignee | ||
Comment 62•9 years ago
|
||
> Jan, any update here?
Yes. It's a regression introduced by StructuredCloneHolder. The problem is that when we retrieve data from a cloned buffer (child process), we create a new PBackground actor for any cloned BlobImpl. This is something that it was not done previously (wrongly), but doing this, when this PBackground Actor is released, it deletes the BlobParent on the parent process, and from here, the crash.
Finally we know where the bug is. Now we have to fix it :)
Flags: needinfo?(amarchesini)
Flags: needinfo?(Jan.Varga)
Comment 63•9 years ago
|
||
I guess it's time to switch assignee :)
Assignee: lissyx+mozillians → amarchesini
Updated•9 years ago
|
Whiteboard: [systemsfe]
Assignee | ||
Comment 64•9 years ago
|
||
What I suspect it's happening here is that, racy, the constructor of the PBackground BlobParent is faster than then PContent one. This case it's the first one that has to keep the second one alive. khuey can you review this patch?
BTW, I also would like to have the move ctor patch landed.
Attachment #8680669 -
Flags: review?(khuey)
Attachment #8680669 -
Flags: feedback?(lissyx+mozillians)
Assignee | ||
Comment 65•9 years ago
|
||
Comment on attachment 8680669 [details] [diff] [review]
patch
Review of attachment 8680669 [details] [diff] [review]:
-----------------------------------------------------------------
::: b2g/app/b2g.js
@@ +299,5 @@
> // Layers Acceleration. We can only have nice things on gonk, because
> // they're not maintained anywhere else.
> pref("layers.offmainthreadcomposition.enabled", true);
> #ifndef MOZ_WIDGET_GONK
> +pref("dom.ipc.tabs.disabled", false);
Ignore this. This is just required to reproduce the issue.
Comment 66•9 years ago
|
||
Comment on attachment 8680669 [details] [diff] [review]
patch
More than 15 consecutive runs successfull, even with the workaround removed from homescreen.
As soon as I remove that patch, issue gets reproduced very easily again. Looks like you found the fix!
Attachment #8680669 -
Flags: feedback?(lissyx+mozillians) → feedback+
Comment 67•9 years ago
|
||
Great - here's a PR for removing the work-around from gaia-site-icon: https://github.com/gaia-components/gaia-site-icon/pull/17
And n?me to test/merge that and then open a PR for Gaia once this is fixed.
Flags: needinfo?(chrislord.net)
Assignee | ||
Comment 68•9 years ago
|
||
Attachment #8680669 -
Attachment is obsolete: true
Attachment #8680669 -
Flags: review?(khuey)
Attachment #8681199 -
Flags: review?(khuey)
Comment on attachment 8681199 [details] [diff] [review]
crash.patch
Review of attachment 8681199 [details] [diff] [review]:
-----------------------------------------------------------------
::: b2g/app/b2g.js
@@ +299,5 @@
> // Layers Acceleration. We can only have nice things on gonk, because
> // they're not maintained anywhere else.
> pref("layers.offmainthreadcomposition.enabled", true);
> #ifndef MOZ_WIDGET_GONK
> +pref("dom.ipc.tabs.disabled", false);
uh, what is this about?
::: dom/base/StructuredCloneHolder.cpp
@@ +562,5 @@
> +
> + // This is needed just for workers thread.
> + if (NS_IsMainThread()) {
> + return blobImpl.forget();
> + }
Why? Isn't this called when somebody takes a PContent blob and puts it in an IndexedDB?
Assignee | ||
Comment 70•9 years ago
|
||
Bent, if you have time, can you tell me something more about EnsureBlobForBackgroundManager?
In your code, this was used only in workers when postMessage() was called. In my unification of the StructureClone use, I use this everywhere, and in general it works fine.
The issue we have here is a race condition created by the 2 BlobChild actors (1 from PContent and 1 from PBackground).
One of these 2 is released, and doing that, it calls __delete__. The BlobParent deletes the IDTableEntry.
But when the second BlobChild, finally, creates the BlobParent for the same RemoteBlob, this parent doesn't have a valid IDTableEntry anymore.
With my patch I don't called EnsureBlobForBackgroundManager when we are not in a worker thread. This works, but it's an hack :)
Flags: needinfo?(bent.mozilla)
Hm. I don't remember this code too well but it sounds like there's a refcount problem on the parent side? If you have two BlobChild actors that are referencing the same blob then when the first is released it shouldn't trigger the deletion of the IDTableEntry right?
Flags: needinfo?(bent.mozilla)
Assignee | ||
Comment 72•9 years ago
|
||
Ok, this is the _real_ fix for this bug. And here a quick description of what the bug is about:
1. When we transfer a Blob from parent to process, at some point, we create a BlobChild.
2. in StructuredCloneHolder we find out that this is a nsIRemoteBlob and we create a BlobChild for PBackground as well.
3. a SendPBlobConstructor is sent to the parent for the creation of a BlobParent in PBackground.
3. nsIRemoteBlob of the first BlobChild (created from nsIContent*) gets released because nothing is keeping it alive.
4. race condition: the Send__delete_ of BlobChild (nsIContent) is received before the CTOR for the BlobParent in PBackground.
5. we remove the entry of IDTableEntry.
6. When the BlobParent for PBackground is created, it finds the ID of the other BlobParent, but that has been deleted in point 4. IDTableEntry doesn't exist any more and we crash.
What this patch does is to keep alive the remoteBlob when we create the PBackground BlobParent from another BlobChild (nsIContent).
Attachment #8681199 -
Attachment is obsolete: true
Attachment #8681199 -
Flags: review?(khuey)
Attachment #8683264 -
Flags: review?(khuey)
Assignee | ||
Comment 73•9 years ago
|
||
Maybe bent, if you have time, you can give me a feedback for this patch.
Flags: needinfo?(bent.mozilla)
(In reply to Andrea Marchesini (:baku) from comment #73)
> Maybe bent, if you have time, you can give me a feedback for this patch.
Sure, I'll look over the weekend.
Comment on attachment 8683264 [details] [diff] [review]
patch
Review of attachment 8683264 [details] [diff] [review]:
-----------------------------------------------------------------
I don't think this is what you really want to do. This holds the BlobImpl alive on the child side forever, right? The whole idea underlying this blob stuff is that when the BlobImpl refcount drops to 0 we destroy the actor. With this patch in some cases that will never happen.
I think what you probably want to do is AddRef the BlobImpl while your PBackground blob is being constructed. Then when it is you can drop the ref. This is basically the same problem we solved with WaitForSliceCreation (http://mxr.mozilla.org/mozilla-central/source/dom/ipc/Blob.cpp#3373) only for the main thread. We do that synchronously, but you could make an async version of that for the main thread:
protocol PBlob {
// ...
child:
Created();
};
Then just send that message once the actor is created on PBackground in the parent process, and you can drop the extra ref there.
Attachment #8683264 -
Flags: feedback-
Assignee | ||
Comment 76•9 years ago
|
||
bent, I hope you can find time for this new patch!
This patch is exactly the same of before, but in addition it adds a "|child: CreatedFromKnownBlob()|". When the PBlobParent is created, we send a notification to the child in case that blob is built from a known blob.
When the child receives that call, it frees the mDifferentProcessRemoteBlob.
Attachment #8683264 -
Attachment is obsolete: true
Attachment #8683264 -
Flags: review?(khuey)
Attachment #8688561 -
Flags: review?(bent.mozilla)
Assignee | ||
Comment 77•9 years ago
|
||
Here a new version. I changed 1 like to improve the check of the kind of blob constructor param we have.
Attachment #8688561 -
Attachment is obsolete: true
Attachment #8688561 -
Flags: review?(bent.mozilla)
Attachment #8688962 -
Flags: review?(bent.mozilla)
Assignee | ||
Comment 78•9 years ago
|
||
Andrew, how do we want to proceed with this bug? I'm sure bent will review this patch soon or late, but how urgent is this bug?
Flags: needinfo?(overholt)
Comment 80•9 years ago
|
||
(In reply to Andrew Overholt [:overholt] from comment #79)
> Gregor, wdyt?
It's important that we fix it but we are not in a huge hurry. It should be fine if we can fix it in the next 1-2 weeks.
Flags: needinfo?(anygregor)
Assignee | ||
Updated•9 years ago
|
Attachment #8688962 -
Flags: feedback?(lissyx+mozillians)
Comment 81•9 years ago
|
||
I'm rebuilding uptodate without the workaround from Cwiiis and I will be retesting the latest patch.
Comment 82•9 years ago
|
||
Comment on attachment 8688962 [details] [diff] [review]
crash.patch
So I have performed some testing using:
> for i in $(seq 60); do (echo "Making gaia ... "; make PRODUCTION=0 MOZILLA_OFFICIAL=0 NO_LOCK_SCREEN=1 NOFTU=1 reset-gaia 2>/dev/null 1>/dev/null; echo "Gaia booting #${i}, waiting 60 secs now ..."; sleep 60) | ts "%b %d %H:%M:%S |>"; done;
Over 42 runs with the fix, there was no IPC blob error in the logs.
Over 16 runs without the fix, there was 4 IPC blob errors in the logs. I am continuing the testing to assert the frequency, but so far it would confirm the fix is good.
Comment 83•9 years ago
|
||
After 42 retries without the fix, I got 10 IPC blob errors ...
Comment 84•9 years ago
|
||
Comment on attachment 8688962 [details] [diff] [review]
crash.patch
I could not reproduce the issue with that. Without, I get a IPC error in a reliable and reproductible way.
Attachment #8688962 -
Flags: feedback?(lissyx+mozillians) → feedback+
Comment on attachment 8688962 [details] [diff] [review]
crash.patch
Review of attachment 8688962 [details] [diff] [review]:
-----------------------------------------------------------------
I think this is probably fine. Just make sure you test with artificial slowdowns on PBackground to be sure (e.g. play around with http://mxr.mozilla.org/mozilla-central/source/dom/indexedDB/ActorsParent.cpp#245).
::: dom/base/StructuredCloneHolder.cpp
@@ +562,5 @@
>
> if (!aManager) {
> aManager = BackgroundChild::GetForCurrentThread();
> + if (!aManager) {
> + return blobImpl.forget();
It seems really weird to call a function named 'EnsureBlobForBackgroundManager' and expect it to do nothing if there's no PBackgroundChild available... Why make this change?
@@ +670,5 @@
>
> RefPtr<BlobImpl> blobImpl = EnsureBlobForBackgroundManager(aBlob->Impl());
> MOZ_ASSERT(blobImpl);
>
> + if (NS_WARN_IF(NS_FAILED(blobImpl->SetMutable(false)))) {
This can be MOZ_ALWAYS_TRUE() right?
::: dom/ipc/Blob.cpp
@@ +1752,5 @@
>
> BlobChild* mActor;
> nsCOMPtr<nsIEventTarget> mActorTarget;
>
> + RefPtr<BlobImpl> mDifferentProcessBlobImpl;
This needs a big comment explaining why this exists, how long it lives, etc., because this kinda goes against the whole design.
Attachment #8688962 -
Flags: review?(bent.mozilla) → review+
Comment 86•9 years ago
|
||
Flags: needinfo?(bent.mozilla)
(In reply to Pulsebot from comment #86)
> https://hg.mozilla.org/integration/mozilla-inbound/rev/385c7069dca8
This was a patch with the wrong bug number and has since been backed out. Please disregard. :)
Assignee | ||
Comment 88•9 years ago
|
||
Wow, thanks bent.
> It seems really weird to call a function named
> 'EnsureBlobForBackgroundManager' and expect it to do nothing if there's no
> PBackgroundChild available... Why make this change?
This EnsureBlobForbackgroundManager can be called on the main-thread as well, but nothing guaranties that PBackground is set up there yet. Soon PBackground will be initialized by default on main-thread and this check will go away.
Comment 89•9 years ago
|
||
Comment 90•9 years ago
|
||
(In reply to Pulsebot from comment #89)
> https://hg.mozilla.org/integration/mozilla-inbound/rev/cbc2c8a24438
Backed out in https://hg.mozilla.org/integration/mozilla-inbound/rev/71ba07014e12 for asan m(1) failures
https://treeherder.mozilla.org/logviewer.html#?job_id=17775493&repo=mozilla-inbound
Flags: needinfo?(amarchesini)
Assignee | ||
Comment 94•9 years ago
|
||
The bad news is that the patch has been backed out. But the good news is that this patch triggers bug 1226477 more frequently.
Flags: needinfo?(amarchesini)
Comment 95•9 years ago
|
||
Comment 96•9 years ago
|
||
Comment 97•9 years ago
|
||
Comment 98•9 years ago
|
||
Backed out for static analysis bustage on Linux and OS X (Windows has infrastructure issues at the moment):
Backout: https://hg.mozilla.org/integration/mozilla-inbound/rev/3cec4367c10b
Bustage: https://treeherder.mozilla.org/logviewer.html#?job_id=17833382&repo=mozilla-inbound
/builds/slave/m-in-m64-st-an-d-0000000000000/build/src/dom/ipc/Blob.cpp:3044:3: error: 'AddRef' cannot be called on the return value of 'operator->'
make[5]: *** [Blob.o] Error 1
make[5]: *** Waiting for unfinished jobs....
make[4]: *** [dom/ipc/target] Error 2
make[4]: *** Waiting for unfinished jobs....
make[3]: *** [compile] Error 2
make[2]: *** [default] Error 2
make[1]: *** [realbuild] Error 2
make: *** [build] Error 2
Return code: 2
Flags: needinfo?(amarchesini)
Comment 99•9 years ago
|
||
Comment 100•9 years ago
|
||
bugherder |
Status: NEW → RESOLVED
Closed: 9 years ago
status-firefox45:
--- → fixed
Resolution: --- → FIXED
Target Milestone: --- → mozilla45
Comment 101•9 years ago
|
||
All icons were visible when switching between homescreens on Flame 2.6 and Aries 2.6.
Environmental Variables:
Device: Aries 2.6 [Full Flash]
BuildID: 20151201122021
Gaia: 59c8605876736b22acaaed25be00008e452149cb
Gecko: 89ae375e24d7560f613b75a585e9ed4238604494
Gonk: a19052e4389c3ae2d8fc3e7a74a475401baacc56
Version: 45.0a1 (2.6)
Firmware Version: D5803_23.1.A.1.28_NCB.ftf
User Agent: Mozilla/5.0 (Mobile; rv:45.0) Gecko/45.0 Firefox/45.0
Device: FlameKK 2.6 [Full Flash][512mb]
BuildID: 20151201030222
Gaia: 59c8605876736b22acaaed25be00008e452149cb
Gecko: 66a6d7ec9534b9d7847b665142fef0dd87623768
Gonk: 205ac4204bbbb2098a8046444acba551ba5dc75a
Version: 45.0a1 (2.6)
Firmware Version: v18D
User Agent: Mozilla/5.0 (Mobile; rv:45.0) Gecko/45.0 Firefox/45.0
Result:
All icons are visible.
Updated•9 years ago
|
QA Whiteboard: [QAnalyst-Triage+]
Flags: needinfo?(jmercado)
Comment 102•9 years ago
|
||
This removes the work-around from the component, I'll add a PR for gaia merge/uplift once this is merged.
Flags: needinfo?(chrislord.net)
Attachment #8694847 -
Flags: review?(gmarty)
Comment 103•9 years ago
|
||
Comment on attachment 8694847 [details] [review]
Remove work-around
r+. Glad we finally removed that hack.
Attachment #8694847 -
Flags: review?(gmarty) → review+
Assignee | ||
Updated•9 years ago
|
Flags: needinfo?(amarchesini)
Comment 104•9 years ago
|
||
Comment 105•9 years ago
|
||
Comment on attachment 8703593 [details] [review]
[gaia] Cwiiis:bug1211266-remove-workaround > mozilla-b2g:master
Update gaia-site-icon to remove bug work-around.
Attachment #8703593 -
Flags: review?(gmarty)
Comment 106•9 years ago
|
||
Comment on attachment 8703593 [details] [review]
[gaia] Cwiiis:bug1211266-remove-workaround > mozilla-b2g:master
Looks good!
Attachment #8703593 -
Flags: review?(gmarty) → review+
Comment 107•9 years ago
|
||
Quick, thanks! Merged: https://github.com/mozilla-b2g/gaia/commit/81511c721b1140f1ec261a27e44bea04c43e5b16
Assignee | ||
Comment 109•9 years ago
|
||
Comment on attachment 8688962 [details] [diff] [review]
crash.patch
NOTE: Please see https://wiki.mozilla.org/Release_Management/B2G_Landing to better understand the B2G approval process and landings.
[Approval Request Comment]
Bug caused by (feature/regressing bug #): Cloning blobs between threads
User impact if declined: a crash
Testing completed: yes, but there is not a mochitest because this is a race condition
Risk to taking this patch (and alternatives if risky): no big risk. This patch has been tested for quite a while.
String or UUID changes made by this patch: none
Flags: needinfo?(amarchesini)
Attachment #8688962 -
Flags: approval‑mozilla‑b2g44?
Comment 110•9 years ago
|
||
Comment on attachment 8703593 [details] [review]
[gaia] Cwiiis:bug1211266-remove-workaround > mozilla-b2g:master
This is actually a b2g-2.5 uplift request, but you can't do that on a Gecko bug. Assuming that the platform bug is uplifted (as it should, given this has long been nominated a 2.5 blocker), this should also be uplifted to Gaia.
NOTE: Please see https://wiki.mozilla.org/Release_Management/B2G_Landing to better understand the B2G approval process and landings.
[Approval Request Comment]
Bug caused by (feature/regressing bug #): The work-around for this bug can itself, rarely, cause missing icons, due to a small logic error.
User impact if declined: Occasionally missing bookmark icons.
Testing completed: On master, manually tested, this just removes a work-around in Gaia that is no longer necessary when the platform bug is fixed.
Risk to taking this patch (and alternatives if risky): Low
String or UUID changes made by this patch: None
Attachment #8703593 -
Flags: approval‑mozilla‑b2g44?
Assignee | ||
Updated•7 years ago
|
Attachment #8688962 -
Flags: approval‑mozilla‑b2g44?
You need to log in
before you can comment on or make changes to this bug.
Description
•