Crash in [@ mozilla::BufferList<T>::IterImpl::Data]
Categories
(Core :: IPC, defect, P2)
Tracking
()
People
(Reporter: fluffyemily, Unassigned)
References
Details
(Keywords: crash, Whiteboard: [geckoview])
Crash Data
Attachments
(2 files)
This bug is for crash report bp-eb91c2e4-2caf-4846-b8bc-467530200506.
Top 10 frames of crashing thread:
0 libxul.so mozilla::BufferList<InfallibleAllocPolicy>::IterImpl::Data const mfbt/BufferList.h:198
1 libxul.so IPC::Channel::ChannelImpl::ProcessOutgoingMessages ipc/chromium/src/chrome/common/ipc_channel_posix.cc:626
2 libxul.so IPC::Channel::ChannelImpl::OnFileCanWriteWithoutBlocking ipc/chromium/src/chrome/common/ipc_channel_posix.cc:858
3 libxul.so base::MessagePumpLibevent::OnLibeventNotification ipc/chromium/src/base/message_pump_libevent.cc:241
4 libxul.so event_process_active_single_queue ipc/chromium/src/third_party/libevent/event.c:1646
5 libxul.so event_base_loop ipc/chromium/src/third_party/libevent/event.c:1961
6 libxul.so base::MessagePumpLibevent::Run ipc/chromium/src/base/message_pump_libevent.cc
7 libxul.so MessageLoop::Run ipc/chromium/src/base/message_loop.cc:290
8 libxul.so base::Thread::ThreadMain ipc/chromium/src/base/thread.cc:192
9 libxul.so ThreadFunc ipc/chromium/src/base/platform_thread_posix.cc:40
Comment 1•5 years ago
|
||
These are all MOZ_RELEASE_ASSERT(!Done())
.
Comment 2•5 years ago
|
||
Nathan, maybe you remember enough of this data structure to figure out what this assert means?
![]() |
||
Comment 3•5 years ago
|
||
We're here in the caller:
so, "are we at the very start of the buffer that we want to send?" is the first question we're trying to answer. Have to check where the iterator's data is pointing, so we're calling into Data()
:
and Data()
says "hey, we should check that this iterator still has actual data to hand out", which is the release assert that we're hitting. So we must have initialized the iterator here:
(I think we don't have to worry about the partial_write_iter_
somehow being None
because of our release asserts in Maybe
.) I guess there are three cases to consider:
- We are sending an empty
Message
, which seems bad. - We are sending an
Message
with a buffer that is empty, which seems bad. - Somehow we're getting to this point in the function when we've already consumed the
partial_write_iter_
-- I think that has to happen on some previous call to the function -- but we had some error that prevented us from resettingpartial_write_iter_
. The reset would have happened here:
I guess we might be in this case:
since I see libevent
code in the stack. But I don't know how we would have consumed all of the partial_write_iter_
data but not written all of it. When we setup our IO buffers, we're working with a copy of partial_write_iter_
:
So that bit of code wouldn't advance it (unless copying the iterator is just busted?), and if we don't write everything we thought we would, we update the internal state of partial_write_iter_
:
I think the result of the above is that we must not be in case 3, and therefore we must be sending some kind of empty message? (Or else I'm missing some other path through that function such that case 3 is a real possibility?) I don't know who would be sending such a thing though; maybe we should stick a release assert whereabouts messages get written into the queue and see if that fires?
Comment 4•5 years ago
|
||
Bug 1595453 was a similar problem with a slightly different signature. The idea of asserting that the message is non-empty when it's added to the queue (it should never be empty because of the header) came up then as well.
Updated•5 years ago
|
Reporter | ||
Comment 6•5 years ago
|
||
This issue has been raised by the Fenix QA team as a release blocker with an STR under their GitHub Repo: https://github.com/mozilla-mobile/fenix/issues/10770.
Comment 7•5 years ago
•
|
||
Looks like ~150 crashes / day, for a population of 300k, which is quite high for us. Might also be causing our user reports from https://github.com/mozilla-mobile/fenix/issues/10517 as well.
Reporter | ||
Comment 8•5 years ago
•
|
||
This bug has been marked as a dupe of bug 1638016 but not closed. As we are seeing this a lot in Fenix, and this is fixed by the closed bug, can we get an uplift to 77 for bug 1638016?
![]() |
||
Comment 9•5 years ago
|
||
(In reply to Emily Toop (:fluffyemily) from comment #8)
This bug has been marked as a dupe of bug 1638016 but not closed. As we are seeing this a lot in Fenix, and this is fixed by the closed bug, can we get an uplift to 77 for bug 1638016?
Bug 1638016 does not look like a patch landed for it, so this issue isn't fixed?
![]() |
||
Comment 10•5 years ago
|
||
Just noticed that every crash for this signature is from an arm64 device, so it's possible we're hitting some sort of weird miscompilation?
And of course as to the question of what the iterator looks like, the assembly for Data()
looks like:
7f71dfef74: fe 0f 1f f8 str x30, [sp, #0xfffffffffffffff0]!
7f71dfef78: e8 03 00 aa mov x8, x0
# load the data pointers and compare
7f71dfef7c: 00 04 40 f9 ldr x0, [x0, #8]
7f71dfef80: 08 09 40 f9 ldr x8, [x8, #0x10]
7f71dfef84: 1f 00 08 eb cmp x0, x8
# error, they're not equal, branch to error path
7f71dfef88: 60 00 00 54 b.eq #0x7f71dfef94
7f71dfef8c: fe 07 41 f8 ldr x30, [sp], #0x10
7f71dfef90: c0 03 5f d6 ret
and we are almost able to see what the data pointers would have been in the dump...except that the error path clobbers x8
. :(
![]() |
||
Comment 11•5 years ago
|
||
I thought I could do a try push for geckoview (well, I thought I could do a try push for fenix, but apparently that's not a thing) and incorporate that into a fenix build, but apparently there are no directions on how one might accomplish such a thing. I'm going to play around with some alternatives and see if I can get fenix build + geckoview try push to work.
Comment 12•5 years ago
•
|
||
This is the #4 overall Focus topcrash as well at the moment.
![]() |
||
Comment 13•5 years ago
|
||
OK, I went back to look at the assembly for a crashing libxul and cross-reference that with the information in the minidump.
There's some good news and some bad news. The good news is that the iterator data structure is stored on the stack, which means we can test the theories in comment 3, because we can trawl through the stack to see what the iterator looks like at the point of the crash. And we have one of the pointers in the minidump in the crashing stack frame's registers at x0
, which means we have a much better idea of where the iterator might live in the several pages of stack in the minidump.
The pointers say that Iter::mData
and Iter::mDataEnd
are the same pointer, so we've reached the end of some segment.
The bad news is that -- assuming that I'm reading this right -- we have 12 (!) segments in the message buffer, which means that we're hitting the case comment 3 thought was impossible -- we consumed all the data and therefore should have popped the message off the queue, but we didn't, and therefore we are back to try and write zero data.
The good news is that we could -- presumably -- paper over the bug by just checking for Done()
earlier in this function and popping the current message off the queue. I do not feel great about doing that, but it's not a completely unreasonable thing to do.
I'd really like to understand how we get into this function having already sent off the entire message but still have said message on the front of the queue. I'd especially like to understand why we're only seeing this on aarch64 and nowhere else. I have a 1:1, but I'll go back to staring at ProcessOutgoingMessages
after that.
Comment 14•5 years ago
|
||
Even if you do a longer term fix for 78, I'd like to get even just a papered-over fix for GV 77 so we can have a more stable Fenix 5.1 release. Will keep checking back here too for updates.
![]() |
||
Comment 15•5 years ago
|
||
https://treeherder.mozilla.org/#/jobs?repo=try&revision=55f3bdac65df17d90efb14846e36c0f4626b4373&selectedTaskRun=XlBazXk5R66ZslLnM6nmRg-0 is a link to the Geckoview APK built with some patches with asserts that ought to fire in the cases we are seeing with this crash. They appear to have passed Linux/Android tests, so they at least shouldn't break ordinary operation. We can land them if we reeeeally need to, I guess.
I don't know how to assemble that into a Fenix binary; is that something that someone on the mobile team can look at and then giving that to QA? Happy to debug any problems that seem related to those asserts.
(In reply to Nathan Froyd [:froydnj] from comment #15)
https://treeherder.mozilla.org/#/jobs?repo=try&revision=55f3bdac65df17d90efb14846e36c0f4626b4373&selectedTaskRun=XlBazXk5R66ZslLnM6nmRg-0 is a link to the Geckoview APK built with some patches with asserts that ought to fire in the cases we are seeing with this crash. They appear to have passed Linux/Android tests, so they at least shouldn't break ordinary operation. We can land them if we reeeeally need to, I guess.
I don't know how to assemble that into a Fenix binary; is that something that someone on the mobile team can look at and then giving that to QA? Happy to debug any problems that seem related to those asserts.
These patches shouldn't make anything actually worse, right? I think the easiest way to get testing on this will be to just put it into Nightly and watch crash-stats.
![]() |
||
Comment 17•5 years ago
|
||
(In reply to James Willcox (:snorp) (jwillcox@mozilla.com) (he/him) from comment #16)
(In reply to Nathan Froyd [:froydnj] from comment #15)
https://treeherder.mozilla.org/#/jobs?repo=try&revision=55f3bdac65df17d90efb14846e36c0f4626b4373&selectedTaskRun=XlBazXk5R66ZslLnM6nmRg-0 is a link to the Geckoview APK built with some patches with asserts that ought to fire in the cases we are seeing with this crash. They appear to have passed Linux/Android tests, so they at least shouldn't break ordinary operation. We can land them if we reeeeally need to, I guess.
I don't know how to assemble that into a Fenix binary; is that something that someone on the mobile team can look at and then giving that to QA? Happy to debug any problems that seem related to those asserts.
These patches shouldn't make anything actually worse, right? I think the easiest way to get testing on this will be to just put it into Nightly and watch crash-stats.
Theoretically they should not. I don't know that I'd want to bet on them not triggering someplace else, though (maybe that would be valuable information for the related bug). As written, I don't think they'll trigger on Fenix betas if we roll another one because a) they wouldn't be on beta and b) diagnostic assertions get turned off in later betas.
(In reply to Nathan Froyd [:froydnj] from comment #17)
Theoretically they should not. I don't know that I'd want to bet on them not triggering someplace else, though (maybe that would be valuable information for the related bug). As written, I don't think they'll trigger on Fenix betas if we roll another one because a) they wouldn't be on beta and b) diagnostic assertions get turned off in later betas.
Yeah, but we could ship this in GV nightly which would get used in the Fenix nightlies. Lower population, but folks there are definitely hitting this crash.
![]() |
||
Comment 19•5 years ago
|
||
add diagnostic asserts for outgoing IPC messages; r=jld,nika,mccr8
We are seeing crashes on aarch64 Fenix devices that appear to be related
to zero-sized messages. But we're seeing the crashes when we're trying
to send the messages on the IO thread, and not where we're dispatching
them from. Add some asserts so we get errors closer to the source, and
add some asserts for other things that we believe to be true and would
be useful to know aren't actually true.
Updated•5 years ago
|
![]() |
||
Comment 20•5 years ago
|
||
(In reply to James Willcox (:snorp) (jwillcox@mozilla.com) (he/him) from comment #18)
(In reply to Nathan Froyd [:froydnj] from comment #17)
Theoretically they should not. I don't know that I'd want to bet on them not triggering someplace else, though (maybe that would be valuable information for the related bug). As written, I don't think they'll trigger on Fenix betas if we roll another one because a) they wouldn't be on beta and b) diagnostic assertions get turned off in later betas.
Yeah, but we could ship this in GV nightly which would get used in the Fenix nightlies. Lower population, but folks there are definitely hitting this crash.
A lot lower: like two people if I am reading https://crash-stats.mozilla.org/signature/?signature=%3Cname%20omitted%3E%20%7C%20IPC%3A%3AChannel%3A%3AChannelImpl%3A%3AOnFileCanWriteWithoutBlocking right (the other signature for this bug, https://crash-stats.mozilla.org/signature/?signature=mozilla%3A%3ABufferList%3CT%3E%3A%3AIterImpl%3A%3AData , does not appear to have any Fenix nightly reports)
Comment 21•5 years ago
|
||
Just a comment as a user seeing this issue. I had to manually submit crash reports via about:crashes it didn't seem like the crash popup was ever submitting crash reports (that I could find in any case).
Happy to try out any apk with a fix if a guinea pig is needed
![]() |
||
Updated•5 years ago
|
Comment 22•5 years ago
|
||
Comment 23•5 years ago
|
||
Here's my understanding of what you could do, but I'll also flag Sebastian from the AC team to help clarify if needed.
Overall, there are 3 code bases chained together to build a Fenix apk: GeckoView, Android Components (which specifies a GeckoView version [1]), Fenix (which specifies an Android Components version)
A few options IIUC, based on where you're getting your GV version from:
a) If the Gecko build is something that's already been published (to Nightly, or Beta), you should be able to locally build an apk by setting the GV version you want to fetch in this AC components Gecko config file [1], and building a Fenix Beta flavor* with that local AC version. The GV versions would be published here, so you need to choose the correct version: https://maven.mozilla.org/?prefix=maven2/org/mozilla/geckoview/
b) Building a native GV version, you should be able to use the build and bundling instructions on the Fenix README (however, I can't recall if that was something that worked for you or not?). Then, to make a Fenix Beta build, you would have to specify the version in
- To build a Fenix Beta flavor, you can either choose a beta variant either through Android Studio or from the command line
./gradlew clean app:assembleGeckoNightlyFenixBeta
(for example, to use the GeckoNightly). Variants are described here: https://github.com/mozilla-mobile/fenix/#guide-to-build-variants . However, I'm not 100% sure what happens if you're not pulling GeckoNightly, but building GV locally.
[1] GeckoView version specified in Androidhttps://github.com/mozilla-mobile/android-components/blob/master/buildSrc/src/main/java/Gecko.kt
Comment 24•5 years ago
|
||
bugherder |
![]() |
||
Comment 25•5 years ago
|
||
(In reply to Natalia Csoregi [:nataliaCs] from comment #24)
This is not showing any difference in Nightly yet, which is not super-surprising, as the number of devices recording crashes in Nightly was pretty small.
We can make a paper-over fix that just recognizes when we're going to be sending zero-length packets and considers the Message
done at that point, something akin to:
--- a/ipc/chromium/src/chrome/common/ipc_channel_posix.cc
+++ b/ipc/chromium/src/chrome/common/ipc_channel_posix.cc
@@ -623,6 +623,18 @@ bool Channel::ChannelImpl::ProcessOutgoingMessages() {
partial_write_iter_.emplace(iter);
}
+ // This situation is really weird; if we don't have anything left to send,
+ // it means the message itself was empty (which shouldn't happen, since
+ // all messages should at least have a header) or it means that we sent
+ // all the data of the message in an earlier iteration, but somehow
+ // decided that we weren't done sending it and therefore took another turn
+ // through the event loop. Just pop the message and move on.
+ if (partial_write_iter_.ref().Done()) {
+ partial_write_iter_.reset();
+ OutputQueuePop();
+ delete msg;
+ continue;
+ }
+
if (partial_write_iter_.value().Data() == msg->Buffers().Start() &&
!msg->file_descriptor_set()->empty()) {
// This is the first chunk of a message which has descriptors to send
That's small enough that I think I'd be comfortable sending it to Beta with a fairly short bake time. But ending up in that case in the first place makes me nervous: did we get memory corrupted somehow? Are we going to be sending messages out of order after that point? I don't see obvious evidence of UAF bugs (no freed memory canaries on the stack--which includes the iterator, so we can see a little bit of what the Message
structure itself contained--or in registers at the point of the crash), but it's still a peculiar situation to end up in.
Nika, I think every other IPC peer is out for US Memorial Day today. WDYT about doing something like the above, and uplifting that patch to maybe fix the Fenix crashes?
Comment 26•5 years ago
|
||
I took another look at this again, and the only way I could imagine for this to happen would be if partial_write_iter_
somehow became out-of-sync with message_queue_
. This is possible after Close()
is called, as message_queue_
is drained, but partial_write_iter_
isn't cleared. I couldn't figure out how we'd end up back in ProcessOutgoingMessages
with a non-empty message_queue_
after Close()
was called, though.
In addition, the buffer list code is littered with release assertions which should've probably fired earlier in that situation.
Given that that change should theoretically stem the bleeding, I'll throw up a quick patch like the one you've suggested, and perhaps add a couple of extra assertions.
Comment 27•5 years ago
|
||
Comment 28•5 years ago
|
||
Comment 29•5 years ago
|
||
Since we're not seeing many of these crashes on Nightly at all, and they all seem to be on beta, is this be something due to beta/nightly flagging? (relatedly, I'm wondering how much baking on nightly will help, if that's the case.)
Comment 30•5 years ago
|
||
Per discussion with Chenxia and Nathan, we're going to push this to Beta now while it's in an "in-between" state - waiting for next week's merge of 78 to it while 77 has already ridden to mozilla-release. Doing that, we can manually spin a new GV77 Beta build without putting any Gecko 77 releases at risk. If we do end up wanting to push this to Release77, we should go through the regular approval process for doing so.
https://hg.mozilla.org/releases/mozilla-beta/rev/bcdfef56d99868fda850afc107bc3291e2809f1a
Comment 31•5 years ago
|
||
bugherder |
Comment 32•5 years ago
|
||
Hi, the following issues related to this matter were marked as fixed on Beta 5.1.0-beta.2 from 5/27
using a OnePlus A3 (Android 6.0.1):
[Bug]Native code crash on Beta 5.1.0-beta.1 after quickly opening one by one the featured Top Sites
[Bug]Native code crash after adding and opening a PWA
[Bug]Native code crash when trying to sign in on Twitter
[Bug]Native code crash on Beta 5.1.0-beta.1 after accepting the cookie policy on Theverge.com
Comment 33•5 years ago
|
||
(In reply to Ryan VanderMeulen [:RyanVM] from comment #30)
Per discussion with Chenxia and Nathan, we're going to push this to Beta now while it's in an "in-between" state - waiting for next week's merge of 78 to it while 77 has already ridden to mozilla-release. Doing that, we can manually spin a new GV77 Beta build without putting any Gecko 77 releases at risk. If we do end up wanting to push this to Release77, we should go through the regular approval process for doing so.
https://hg.mozilla.org/releases/mozilla-beta/rev/bcdfef56d99868fda850afc107bc3291e2809f1a
Per discussion with Stefan, we decided to create a GV77 relbranch to cherry-pick this fix to in order to avoid any possible risk to Desktop Firefox should a future 77.0.1 dot release be needed. Given that, I don't think we need to uplift this fix any further.
GECKOVIEW_77_RELBRANCH: https://hg.mozilla.org/releases/mozilla-release/rev/8929f2a9ac6af2dc49643c7e67a11a8cf45844ae
Updated•5 years ago
|
Comment 34•5 years ago
|
||
Removing the [fenix:p1] tag as the crashes have significantly reduced.
Comment 35•4 years ago
|
||
The leave-open keyword is there and there is no activity for 6 months.
:jld, maybe it's time to close this bug?
Comment 36•4 years ago
|
||
The leave-open keyword is there and there is no activity for 6 months.
:jld, maybe it's time to close this bug?
Comment hidden (offtopic) |
Comment 38•4 years ago
|
||
(Disregard previous comment; I meant to post it on another bug.)
Comment 39•4 years ago
|
||
This hasn't happened since October, and is extremely low-volume. Closing for now.
Updated•4 years ago
|
Description
•