Open Bug 1635720 Opened 1 year ago Updated 5 months ago

Crash in [@ mozilla::BufferList<T>::IterImpl::Data]

Categories

(Core :: IPC, defect, P2)

Unspecified
Android
defect

Tracking

Tracking Status
firefox76 --- wontfix
firefox77 --- wontfix
firefox78 --- affected

People

(Reporter: fluffyemily, Unassigned, NeedInfo)

References

Details

(Keywords: crash, leave-open, Whiteboard: [geckoview])

Crash Data

Attachments

(2 files)

This bug is for crash report bp-eb91c2e4-2caf-4846-b8bc-467530200506.

Top 10 frames of crashing thread:

0 libxul.so mozilla::BufferList<InfallibleAllocPolicy>::IterImpl::Data const mfbt/BufferList.h:198
1 libxul.so IPC::Channel::ChannelImpl::ProcessOutgoingMessages ipc/chromium/src/chrome/common/ipc_channel_posix.cc:626
2 libxul.so IPC::Channel::ChannelImpl::OnFileCanWriteWithoutBlocking ipc/chromium/src/chrome/common/ipc_channel_posix.cc:858
3 libxul.so base::MessagePumpLibevent::OnLibeventNotification ipc/chromium/src/base/message_pump_libevent.cc:241
4 libxul.so event_process_active_single_queue ipc/chromium/src/third_party/libevent/event.c:1646
5 libxul.so event_base_loop ipc/chromium/src/third_party/libevent/event.c:1961
6 libxul.so base::MessagePumpLibevent::Run ipc/chromium/src/base/message_pump_libevent.cc
7 libxul.so MessageLoop::Run ipc/chromium/src/base/message_loop.cc:290
8 libxul.so base::Thread::ThreadMain ipc/chromium/src/base/thread.cc:192
9 libxul.so ThreadFunc ipc/chromium/src/base/platform_thread_posix.cc:40

Component: General → IPC
Product: GeckoView → Core
Whiteboard: [geckoview]

These are all MOZ_RELEASE_ASSERT(!Done()).
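For reference, here is a minimal sketch of the invariant that assert guards, with the segment bookkeeping stripped out (the real IterImpl in mfbt/BufferList.h also tracks the list of segments):

```cpp
#include <cassert>
#include <cstddef>

// Simplified model of BufferList<T>::IterImpl from mfbt/BufferList.h.
// Done() means the iterator has no bytes left in the current segment,
// and Data() refuses to hand out a pointer once that is the case.
struct IterImpl {
  const char* mData = nullptr;     // current read position
  const char* mDataEnd = nullptr;  // one past the last valid byte

  bool Done() const { return mData == mDataEnd; }

  const char* Data() const {
    // Stands in for MOZ_RELEASE_ASSERT(!Done()), the check that is
    // firing in these crash reports.
    assert(!Done());
    return mData;
  }
};
```

Calling Data() on an exhausted iterator trips the assert, which is exactly the signature above.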

Nathan, maybe you remember enough of this data structure to figure out what this assert means?

Flags: needinfo?(nfroyd)

We're here in the caller:

https://hg.mozilla.org/releases/mozilla-beta/annotate/b4ef8571957b03316302e9dd21979a0d2612f648/ipc/chromium/src/chrome/common/ipc_channel_posix.cc#l626

so, "are we at the very start of the buffer that we want to send?" is the first question we're trying to answer. Have to check where the iterator's data is pointing, so we're calling into Data():

https://hg.mozilla.org/releases/mozilla-beta/annotate/b4ef8571957b03316302e9dd21979a0d2612f648/mfbt/BufferList.h#l198

and Data() says "hey, we should check that this iterator still has actual data to hand out", which is the release assert that we're hitting. So we must have initialized the iterator here:

https://hg.mozilla.org/releases/mozilla-beta/annotate/b4ef8571957b03316302e9dd21979a0d2612f648/ipc/chromium/src/chrome/common/ipc_channel_posix.cc#l621

(I think we don't have to worry about the partial_write_iter_ somehow being None because of our release asserts in Maybe.) I guess there are three cases to consider:

  1. We are sending an empty Message, which seems bad.
  2. We are sending a Message with a buffer that is empty, which seems bad.
  3. Somehow we're getting to this point in the function when we've already consumed the partial_write_iter_ -- I think that has to happen on some previous call to the function -- but we had some error that prevented us from resetting partial_write_iter_. The reset would have happened here:

https://hg.mozilla.org/releases/mozilla-beta/annotate/b4ef8571957b03316302e9dd21979a0d2612f648/ipc/chromium/src/chrome/common/ipc_channel_posix.cc#l748

I guess we might be in this case:

https://hg.mozilla.org/releases/mozilla-beta/annotate/b4ef8571957b03316302e9dd21979a0d2612f648/ipc/chromium/src/chrome/common/ipc_channel_posix.cc#l740

since I see libevent code in the stack. But I don't know how we would have consumed all of the partial_write_iter_ data but not written all of it. When we set up our IO buffers, we're working with a copy of partial_write_iter_:

https://hg.mozilla.org/releases/mozilla-beta/annotate/b4ef8571957b03316302e9dd21979a0d2612f648/ipc/chromium/src/chrome/common/ipc_channel_posix.cc#l658

So that bit of code wouldn't advance it (unless copying the iterator is just busted?), and if we don't write everything we thought we would, we update the internal state of partial_write_iter_:

https://hg.mozilla.org/releases/mozilla-beta/annotate/b4ef8571957b03316302e9dd21979a0d2612f648/ipc/chromium/src/chrome/common/ipc_channel_posix.cc#l734

I think the result of the above is that we must not be in case 3, and therefore we must be sending some kind of empty message? (Or else I'm missing some other path through that function such that case 3 is a real possibility?) I don't know who would be sending such a thing though; maybe we should stick a release assert whereabouts messages get written into the queue and see if that fires?
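The copy argument above can be sketched concretely. This is a toy model -- Advance() here is a hypothetical stand-in for the real iterator advancement -- showing that walking a by-value copy cannot move the member iterator:

```cpp
#include <cassert>
#include <cstddef>

// Toy model of the value-copy behavior comment 3 relies on: the send path
// copies partial_write_iter_ into a local iterator before filling the IO
// buffers, so advancing the copy leaves the member iterator untouched.
struct Iter {
  const char* mData;
  const char* mDataEnd;

  bool Done() const { return mData == mDataEnd; }
  void Advance(size_t aBytes) { mData += aBytes; }  // hypothetical stand-in
};
```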

Flags: needinfo?(nfroyd)

Bug 1595453 was a similar problem with a slightly different signature. The idea of asserting that the message is non-empty when it's added to the queue (it should never be empty because of the header) came up then as well.

Priority: -- → P2
See Also: → 1595453
Duplicate of this bug: 1638016
Crash Signature: [@ mozilla::BufferList<T>::IterImpl::Data] → [@ mozilla::BufferList<T>::IterImpl::Data] [@ <name omitted> | IPC::Channel::ChannelImpl::OnFileCanWriteWithoutBlocking]

This issue has been raised by the Fenix QA team as a release blocker with an STR under their GitHub Repo: https://github.com/mozilla-mobile/fenix/issues/10770.

Whiteboard: [geckoview] → [geckoview][fenix:p1]

This bug has been marked as a dupe of bug 1638016 but not closed. As we are seeing this a lot in Fenix, and this is fixed by the closed bug, can we get an uplift to 77 for bug 1638016?

Flags: needinfo?(jld)

(In reply to Emily Toop (:fluffyemily) from comment #8)

This bug has been marked as a dupe of bug 1638016 but not closed. As we are seeing this a lot in Fenix, and this is fixed by the closed bug, can we get an uplift to 77 for bug 1638016?

It does not look like a patch landed for bug 1638016, so this issue isn't fixed?

Just noticed that every crash for this signature is from an arm64 device, so it's possible we're hitting some sort of weird miscompilation?

And of course as to the question of what the iterator looks like, the assembly for Data() looks like:

   7f71dfef74:	fe 0f 1f f8          	str x30, [sp, #0xfffffffffffffff0]!
   7f71dfef78:	e8 03 00 aa          	mov x8, x0
# load the data pointers and compare
   7f71dfef7c:	00 04 40 f9          	ldr x0, [x0, #8]
   7f71dfef80:	08 09 40 f9          	ldr x8, [x8, #0x10]
   7f71dfef84:	1f 00 08 eb          	cmp x0, x8

# if they're equal, the iterator is Done(); branch to the error path
   7f71dfef88:	60 00 00 54          	b.eq #0x7f71dfef94
   7f71dfef8c:	fe 07 41 f8          	ldr x30, [sp], #0x10
   7f71dfef90:	c0 03 5f d6          	ret 

and we are almost able to see what the data pointers would have been in the dump...except that the error path clobbers x8. :(

I thought I could do a try push for geckoview (well, I thought I could do a try push for fenix, but apparently that's not a thing) and incorporate that into a fenix build, but apparently there are no directions on how one might accomplish such a thing. I'm going to play around with some alternatives and see if I can get fenix build + geckoview try push to work.

This is the #4 overall Focus topcrash as well at the moment.

OK, I went back to look at the assembly for a crashing libxul and cross-reference that with the information in the minidump.

There's some good news and some bad news. The good news is that the iterator data structure is stored on the stack, which means we can test the theories in comment 3, because we can trawl through the stack to see what the iterator looks like at the point of the crash. And we have one of the pointers in the minidump in the crashing stack frame's registers at x0, which means we have a much better idea of where the iterator might live in the several pages of stack in the minidump.

The pointers say that Iter::mData and Iter::mDataEnd are the same pointer, so we've reached the end of some segment.

The bad news is that -- assuming that I'm reading this right -- we have 12 (!) segments in the message buffer, which means that we're hitting the case comment 3 thought was impossible -- we consumed all the data and therefore should have popped the message off the queue, but we didn't, and therefore we are back trying to write zero data.

The good news is that we could -- presumably -- paper over the bug by just checking for Done() earlier in this function and popping the current message off the queue. I do not feel great about doing that, but it's not a completely unreasonable thing to do.

I'd really like to understand how we get into this function having already sent off the entire message but still have said message on the front of the queue. I'd especially like to understand why we're only seeing this on aarch64 and nowhere else. I have a 1:1, but I'll go back to staring at ProcessOutgoingMessages after that.

Blocks: 1639964

Even if you do a longer term fix for 78, I'd like to get even just a papered-over fix for GV 77 so we can have a more stable Fenix 5.1 release. Will keep checking back here too for updates.

https://treeherder.mozilla.org/#/jobs?repo=try&revision=55f3bdac65df17d90efb14846e36c0f4626b4373&selectedTaskRun=XlBazXk5R66ZslLnM6nmRg-0 is a link to the Geckoview APK built with some patches with asserts that ought to fire in the cases we are seeing with this crash. They appear to have passed Linux/Android tests, so they at least shouldn't break ordinary operation. We can land them if we reeeeally need to, I guess.

I don't know how to assemble that into a Fenix binary; is that something that someone on the mobile team can look at and then give to QA? Happy to debug any problems that seem related to those asserts.

Flags: needinfo?(liuche)

(In reply to Nathan Froyd [:froydnj] from comment #15)

https://treeherder.mozilla.org/#/jobs?repo=try&revision=55f3bdac65df17d90efb14846e36c0f4626b4373&selectedTaskRun=XlBazXk5R66ZslLnM6nmRg-0 is a link to the Geckoview APK built with some patches with asserts that ought to fire in the cases we are seeing with this crash. They appear to have passed Linux/Android tests, so they at least shouldn't break ordinary operation. We can land them if we reeeeally need to, I guess.

I don't know how to assemble that into a Fenix binary; is that something that someone on the mobile team can look at and then give to QA? Happy to debug any problems that seem related to those asserts.

These patches shouldn't make anything actually worse, right? I think the easiest way to get testing on this will be to just put it into Nightly and watch crash-stats.

(In reply to James Willcox (:snorp) (jwillcox@mozilla.com) (he/him) from comment #16)

(In reply to Nathan Froyd [:froydnj] from comment #15)

https://treeherder.mozilla.org/#/jobs?repo=try&revision=55f3bdac65df17d90efb14846e36c0f4626b4373&selectedTaskRun=XlBazXk5R66ZslLnM6nmRg-0 is a link to the Geckoview APK built with some patches with asserts that ought to fire in the cases we are seeing with this crash. They appear to have passed Linux/Android tests, so they at least shouldn't break ordinary operation. We can land them if we reeeeally need to, I guess.

I don't know how to assemble that into a Fenix binary; is that something that someone on the mobile team can look at and then give to QA? Happy to debug any problems that seem related to those asserts.

These patches shouldn't make anything actually worse, right? I think the easiest way to get testing on this will be to just put it into Nightly and watch crash-stats.

Theoretically they should not. I don't know that I'd want to bet on them not triggering someplace else, though (maybe that would be valuable information for the related bug). As written, I don't think they'll trigger on Fenix betas if we roll another one because a) they wouldn't be on beta and b) diagnostic assertions get turned off in later betas.

(In reply to Nathan Froyd [:froydnj] from comment #17)

Theoretically they should not. I don't know that I'd want to bet on them not triggering someplace else, though (maybe that would be valuable information for the related bug). As written, I don't think they'll trigger on Fenix betas if we roll another one because a) they wouldn't be on beta and b) diagnostic assertions get turned off in later betas.

Yeah, but we could ship this in GV nightly which would get used in the Fenix nightlies. Lower population, but folks there are definitely hitting this crash.

add diagnostic asserts for outgoing IPC messages; r=jld,nika,mccr8

We are seeing crashes on aarch64 Fenix devices that appear to be related
to zero-sized messages. But we're seeing the crashes when we're trying
to send the messages on the IO thread, and not where we're dispatching
them from. Add some asserts so we get errors closer to the source, and
add some asserts for other things that we believe to be true and that
would be useful to know about if they turn out not to be.

Assignee: nobody → nfroyd
Attachment #9151083 - Attachment description: Bug 1635720 - # This is a combination of 2 commits. # The first commit's message is: → Bug 1635720 - add diagnostic asserts for outgoing IPC messages; r=jld,nika,mccr8
Status: NEW → ASSIGNED

(In reply to James Willcox (:snorp) (jwillcox@mozilla.com) (he/him) from comment #18)

(In reply to Nathan Froyd [:froydnj] from comment #17)

Theoretically they should not. I don't know that I'd want to bet on them not triggering someplace else, though (maybe that would be valuable information for the related bug). As written, I don't think they'll trigger on Fenix betas if we roll another one because a) they wouldn't be on beta and b) diagnostic assertions get turned off in later betas.

Yeah, but we could ship this in GV nightly which would get used in the Fenix nightlies. Lower population, but folks there are definitely hitting this crash.

A lot lower: like two people if I am reading https://crash-stats.mozilla.org/signature/?signature=%3Cname%20omitted%3E%20%7C%20IPC%3A%3AChannel%3A%3AChannelImpl%3A%3AOnFileCanWriteWithoutBlocking right (the other signature for this bug, https://crash-stats.mozilla.org/signature/?signature=mozilla%3A%3ABufferList%3CT%3E%3A%3AIterImpl%3A%3AData , does not appear to have any Fenix nightly reports)

Just a comment as a user seeing this issue. I had to manually submit crash reports via about:crashes; it didn't seem like the crash popup was ever submitting crash reports (none that I could find, in any case).

Happy to try out any apk with a fix if a guinea pig is needed

Pushed by nfroyd@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/1a717f48e1a2
add diagnostic asserts for outgoing IPC messages; r=jld,nika

Here's my understanding of what you could do, but I'll also flag Sebastian from the AC team to help clarify if needed.

Overall, there are 3 code bases chained together to build a Fenix apk: GeckoView, Android Components (which specifies a GeckoView version [1]), Fenix (which specifies an Android Components version)

A few options IIUC, based on where you're getting your GV version from:

a) If the Gecko build is something that's already been published (to Nightly, or Beta), you should be able to locally build an apk by setting the GV version you want to fetch in this AC components Gecko config file [1], and building a Fenix Beta flavor* with that local AC version. The GV versions would be published here, so you need to choose the correct version: https://maven.mozilla.org/?prefix=maven2/org/mozilla/geckoview/

b) Building a native GV version, you should be able to use the build and bundling instructions on the Fenix README (however, I can't recall if that was something that worked for you or not?). Then, to make a Fenix Beta build, you would have to specify the version in

  • To build a Fenix Beta flavor, you can choose a beta variant either through Android Studio or from the command line: ./gradlew clean app:assembleGeckoNightlyFenixBeta (for example, to use GeckoNightly). Variants are described here: https://github.com/mozilla-mobile/fenix/#guide-to-build-variants . However, I'm not 100% sure what happens if you're not pulling GeckoNightly, but building GV locally.

[1] GeckoView version specified in Android Components: https://github.com/mozilla-mobile/android-components/blob/master/buildSrc/src/main/java/Gecko.kt

Flags: needinfo?(liuche) → needinfo?(s.kaspari)

(In reply to Natalia Csoregi [:nataliaCs] from comment #24)

https://hg.mozilla.org/mozilla-central/rev/1a717f48e1a2

This is not showing any difference in Nightly yet, which is not super-surprising, as the number of devices recording crashes in Nightly was pretty small.

We can make a paper-over fix that just recognizes when we're going to be sending zero-length packets and considers the Message done at that point, something akin to:

--- a/ipc/chromium/src/chrome/common/ipc_channel_posix.cc
+++ b/ipc/chromium/src/chrome/common/ipc_channel_posix.cc
@@ -623,6 +623,18 @@ bool Channel::ChannelImpl::ProcessOutgoingMessages() {
       partial_write_iter_.emplace(iter);
     }
 
+    // This situation is really weird; if we don't have anything left to send,
+    // it means the message itself was empty (which shouldn't happen, since
+    // all messages should at least have a header) or it means that we sent
+    // all the data of the message in an earlier iteration, but somehow
+    // decided that we weren't done sending it and therefore took another turn
+    // through the event loop.  Just pop the message and move on.
+    if (partial_write_iter_.ref().Done()) {
+      partial_write_iter_.reset();
+      OutputQueuePop();
+      delete msg;
+      continue;
+    }
+
     if (partial_write_iter_.value().Data() == msg->Buffers().Start() &&
         !msg->file_descriptor_set()->empty()) {
       // This is the first chunk of a message which has descriptors to send

That's small enough that I think I'd be comfortable sending it to Beta with a fairly short bake time. But ending up in that case in the first place makes me nervous: did we get memory corrupted somehow? Are we going to be sending messages out of order after that point? I don't see obvious evidence of UAF bugs (no freed memory canaries on the stack--which includes the iterator, so we can see a little bit of what the Message structure itself contained--or in registers at the point of the crash), but it's still a peculiar situation to end up in.

Nika, I think every other IPC peer is out for US Memorial Day today. WDYT about doing something like the above, and uplifting that patch to maybe fix the Fenix crashes?

Flags: needinfo?(nika)

I took another look at this again, and the only way I could imagine for this to happen would be if partial_write_iter_ somehow became out-of-sync with message_queue_. This is possible after Close() is called, as message_queue_ is drained, but partial_write_iter_ isn't cleared. I couldn't figure out how we'd end up back in ProcessOutgoingMessages with a non-empty message_queue_ after Close() was called, though.

In addition, the buffer list code is littered with release assertions which should've probably fired earlier in that situation.

Given that that change should theoretically stem the bleeding, I'll throw up a quick patch like the one you've suggested, and perhaps add a couple of extra assertions.
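That suspected out-of-sync state can be modeled in a few lines. The names below are simplified, hypothetical stand-ins for message_queue_ and partial_write_iter_ (the real code keeps a BufferList iterator in a Maybe, not an offset):

```cpp
#include <cassert>
#include <cstddef>
#include <deque>
#include <optional>
#include <string>

// Hypothetical sketch: Close() drains the outgoing queue but does not
// reset the partial-write state, so a later trip through the send path
// would find a stale iterator that no longer matches the front message.
struct ToyChannel {
  std::deque<std::string> message_queue_;
  std::optional<size_t> partial_write_offset_;  // stand-in for partial_write_iter_

  void Close() {
    message_queue_.clear();
    // Note: partial_write_offset_ is deliberately NOT reset here,
    // mirroring the real Close() path described above.
  }
};
```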

Flags: needinfo?(nika)
Pushed by nlayzell@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/74b7ca036371
Abort send attempt if malformed partial_write_iter_ is found, r=froydnj

Since we're not seeing many of these crashes on Nightly at all, and they all seem to be on beta, is this something due to beta/nightly flagging? (Relatedly, I'm wondering how much baking on Nightly will help, if that's the case.)

Flags: needinfo?(s.kaspari)

Per discussion with Chenxia and Nathan, we're going to push this to Beta now while it's in an "in-between" state - waiting for next week's merge of 78 to it while 77 has already ridden to mozilla-release. Doing that, we can manually spin a new GV77 Beta build without putting any Gecko 77 releases at risk. If we do end up wanting to push this to Release77, we should go through the regular approval process for doing so.

https://hg.mozilla.org/releases/mozilla-beta/rev/bcdfef56d99868fda850afc107bc3291e2809f1a

(In reply to Ryan VanderMeulen [:RyanVM] from comment #30)

Per discussion with Chenxia and Nathan, we're going to push this to Beta now while it's in an "in-between" state - waiting for next week's merge of 78 to it while 77 has already ridden to mozilla-release. Doing that, we can manually spin a new GV77 Beta build without putting any Gecko 77 releases at risk. If we do end up wanting to push this to Release77, we should go through the regular approval process for doing so.

https://hg.mozilla.org/releases/mozilla-beta/rev/bcdfef56d99868fda850afc107bc3291e2809f1a

Per discussion with Stefan, we decided to create a GV77 relbranch to cherry-pick this fix to in order to avoid any possible risk to Desktop Firefox should a future 77.0.1 dot release be needed. Given that, I don't think we need to uplift this fix any further.
GECKOVIEW_77_RELBRANCH: https://hg.mozilla.org/releases/mozilla-release/rev/8929f2a9ac6af2dc49643c7e67a11a8cf45844ae

Flags: needinfo?(jld)

Removing the [fenix:p1] tag as the crashes have significantly reduced.

Assignee: froydnj+bz → nobody
Status: ASSIGNED → NEW
Whiteboard: [geckoview][fenix:p1] → [geckoview]

The leave-open keyword is there and there is no activity for 6 months.
:jld, maybe it's time to close this bug?

Flags: needinfo?(jld)