Closed Bug 1660826 Opened 3 months ago Closed 3 months ago

Crash in [@ IPC::Channel::ChannelImpl::ProcessOutgoingMessages]

Categories

(Core :: IPC, defect)

Unspecified
Android
defect

Tracking

()

RESOLVED FIXED
82 Branch
Tracking Status
firefox-esr68 --- unaffected
firefox-esr78 --- unaffected
firefox80 --- wontfix
firefox81 --- wontfix
firefox82 --- fixed

People

(Reporter: agi, Assigned: jld)

Details

(Keywords: crash)

Crash Data

Attachments

(1 file)

We have a user that's seeing this a lot, see https://github.com/mozilla-mobile/fenix/issues/14023

Crash report: https://crash-stats.mozilla.org/report/index/6676cfdb-2b29-4ba4-bc43-1548c0200824

Top 7 frames of crashing thread:

0 libxul.so IPC::Channel::ChannelImpl::ProcessOutgoingMessages ipc/chromium/src/chrome/common/ipc_channel_posix.cc:707
1 libxul.so IPC::Channel::Send ipc/chromium/src/chrome/common/ipc_channel_posix.cc:888
2 libxul.so base::MessagePumpLibevent::Run ipc/chromium/src/base/message_pump_libevent.cc:321
3 libxul.so MessageLoop::Run ipc/chromium/src/base/message_loop.cc:309
4 libxul.so base::Thread::ThreadMain ipc/chromium/src/base/thread.cc:192
5 libxul.so ThreadFunc ipc/chromium/src/base/platform_thread_posix.cc:40
6 libc.so libc.so@0x67784 
Component: General → IPC
Product: GeckoView → Core

MOZ_DIAGNOSTIC_ASSERT(static_cast<size_t>(bytes_written) < amt_to_write)

In theory that can't fail: the total length passed to sendmsg is less than or equal to amt_to_write (strictly less if and only if we exceed kMaxIOVecSize segments), and this is inside a static_cast<size_t>(bytes_written) != amt_to_write conditional, and the 128MiB message size limit should prevent arithmetic overflow.

The other assumption here is that sendmsg doesn't write more data than requested, which seems obvious, but kernel bugs aren't impossible.

The compiler unhelpfully merged the end of all the assertion failures in (or inlined into) the function, but fortunately they're all marked as unlikely so they're all at the end of the function, which meant it was a simple matter of going through and copypasting the addresses for the crash messages back together until:

(gdb) x/s 0x5647390
0x5647390:      "MOZ_DIAGNOSTIC_ASSERT(static_cast<size_t>(bytes_written) < amt_to_write)"

So here's how we end up at the crash (x10 is the crash message, w11 is the line number, and x8 is intentionally a null pointer):

   0x397c1b0:   cmp     x28, x27
   0x397c1b4:   b.hs    0x397c500
   0x397c500:   adrp    x9, 0x79a9000
   0x397c504:   ldr     x9, [x9, #3648]
   0x397c508:   adrp    x10, 0x5647000
   0x397c50c:   mov     x8, xzr
   0x397c510:   add     x10, x10, #0x390
   0x397c514:   mov     w11, #0x29d                     // #669
   0x397c518:   b       0x397c44c
   0x397c44c:   str     x10, [x9]
   0x397c450:   str     w11, [x8]

It looks like x28 is bytes_written and x27 is amt_to_write, and those registers aren't reused before the crash, so:

                    "x27": "0x00000000000007bc",
                    "x28": "0x00000000ffffffff",

And there's the problem. That's probably supposed to be -1, but… it's not.

There was a bug in Android libc that would do this, but it was fixed in 2014 and — at least in Google's sources — the fix is in 5.0 Lollipop and 6.0 Marshmallow. The reporter on the GitHub issue was using 6.0.1, which should have the fix, but maybe the device manufacturer's fork of Android somehow didn't? Or maybe that's not the real cause.

We could work around this if we needed to — messages will never be anywhere near 4GiB, so we could (after asserting that assumption) check specifically for 0xffffffff on ARM64 Android.

But if it is that libc bug, then it affects more than just our IPC code: if I understand correctly, it applies to every syscall with a 64-bit result — anything like read and write, or for that matter mmap. This is likely to cause a crash whenever one of those syscalls would return an error. So, working around it in this particular case might not help much, but also I'm skeptical that that really is what's going on here.

Analyzing the crash reports, this looks specific to 23 (Marshmallow) but not specific to a vendor.

They're all Qualcomm SoC's, so maybe a bug in their kernel/libc fork?

A reason why this probably isn't that libc bug: every time we read from an IPC channel, we iterate recvmsg until it fails with EAGAIN (a.k.a. EWOULDBLOCK); if there were a general bug like that, we'd overrun the buffer and crash the entire browser during startup. So it's probably something specific to sendmsg. The libc side of it is a very small amount of code, either common to all syscalls or autogenerated, so it makes sense to suspect the kernel at this point.

The severity field is not set for this bug.
:jld, could you have a look please?

For more information, please visit auto_nag documentation.

Flags: needinfo?(jld)

Something else going on here: this a MOZ_DIAGNOSTIC_ASSERT, meaning that it's not checked on release builds. In that case we'll call AdvanceAcrossSegments with a large value, it will consume the entire message and return false, but the return value isn't used.

If this is happening when we should get EAGAIN, then the protocol would likely lose sync and we'd crash later somewhere else (from trying to deserialize a truncated message with part of the next message appended to it), so that would be a problem and it wouldn't show up on this report, but it might be expected to cause enough crashes in IPC that I would have known about it.

If this is for EPIPE or similar, then we were going to drop the message anyway, but I don't know if that can happen during normal channel shutdown (i.e., when the other process hasn't crashed; if the parent process crashes then we can't get reports for child process crashes afterwards, and these crashes are all content processes).

In any case, that explains why we're seeing this only on Nightly.

Severity: -- → S2
Flags: needinfo?(jld)

Some Android ARM64 devices appear to have a bug where sendmsg sometimes
returns 0xFFFFFFFF, which we're assuming is a -1 that was incorrectly
truncated to 32-bit and then zero-extended. This patch detects that
value (which should never legitimately be returned, because it's 16x
the maximum message size) and replaces it with -1, with some additional
assertions.

The workaround is also enabled on x86_64 Android on debug builds only,
so that the code has CI coverage.

Assignee: nobody → jld
Status: NEW → ASSIGNED
Pushed by jedavis@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/f39f39711089
Work around apparent bug with sendmsg() in some 64-bit Android devices. r=nika
Status: ASSIGNED → RESOLVED
Closed: 3 months ago
Resolution: --- → FIXED
Target Milestone: --- → 82 Branch
You need to log in before you can comment on or make changes to this bug.