Crash in [@ IPC::Channel::ChannelImpl::ProcessOutgoingMessages]
Categories
(Core :: IPC, defect)
Tracking
()
Tracking | Status | |
---|---|---|
firefox-esr68 | --- | unaffected |
firefox-esr78 | --- | unaffected |
firefox80 | --- | wontfix |
firefox81 | --- | wontfix |
firefox82 | --- | fixed |
People
(Reporter: agi, Assigned: jld)
Details
(Keywords: crash)
Crash Data
Attachments
(1 file)
We have a user that's seeing this a lot, see https://github.com/mozilla-mobile/fenix/issues/14023
Crash report: https://crash-stats.mozilla.org/report/index/6676cfdb-2b29-4ba4-bc43-1548c0200824
Top 7 frames of crashing thread:
0 libxul.so IPC::Channel::ChannelImpl::ProcessOutgoingMessages ipc/chromium/src/chrome/common/ipc_channel_posix.cc:707
1 libxul.so IPC::Channel::Send ipc/chromium/src/chrome/common/ipc_channel_posix.cc:888
2 libxul.so base::MessagePumpLibevent::Run ipc/chromium/src/base/message_pump_libevent.cc:321
3 libxul.so MessageLoop::Run ipc/chromium/src/base/message_loop.cc:309
4 libxul.so base::Thread::ThreadMain ipc/chromium/src/base/thread.cc:192
5 libxul.so ThreadFunc ipc/chromium/src/base/platform_thread_posix.cc:40
6 libc.so libc.so@0x67784
Reporter | ||
Updated•4 years ago
|
Assignee | ||
Comment 1•4 years ago
|
||
MOZ_DIAGNOSTIC_ASSERT(static_cast<size_t>(bytes_written) < amt_to_write)
In theory that can't fail: the total length passed to sendmsg
is less than or equal to amt_to_write
(strictly less if and only if we exceed kMaxIOVecSize
segments), and this is inside a static_cast<size_t>(bytes_written) != amt_to_write
conditional, and the 128MiB message size limit should prevent arithmetic overflow.
The other assumption here is that sendmsg
doesn't write more data than requested, which seems obvious, but kernel bugs aren't impossible.
Assignee | ||
Comment 2•4 years ago
|
||
The compiler unhelpfully merged the end of all the assertion failures in (or inlined into) the function, but fortunately they're all marked as unlikely so they're all at the end of the function, which meant it was a simple matter of going through and copypasting the addresses for the crash messages back together until:
(gdb) x/s 0x5647390
0x5647390: "MOZ_DIAGNOSTIC_ASSERT(static_cast<size_t>(bytes_written) < amt_to_write)"
So here's how we end up at the crash (x10
is the crash message, w11
is the line number, and x8
is intentionally a null pointer):
0x397c1b0: cmp x28, x27
0x397c1b4: b.hs 0x397c500
0x397c500: adrp x9, 0x79a9000
0x397c504: ldr x9, [x9, #3648]
0x397c508: adrp x10, 0x5647000
0x397c50c: mov x8, xzr
0x397c510: add x10, x10, #0x390
0x397c514: mov w11, #0x29d // #669
0x397c518: b 0x397c44c
0x397c44c: str x10, [x9]
0x397c450: str w11, [x8]
It looks like x28
is bytes_written
and x27
is amt_to_write
, and those registers aren't reused before the crash, so:
"x27": "0x00000000000007bc",
"x28": "0x00000000ffffffff",
And there's the problem. That's probably supposed to be -1, but… it's not.
Assignee | ||
Comment 3•4 years ago
|
||
There was a bug in Android libc that would do this, but it was fixed in 2014 and — at least in Google's sources — the fix is in 5.0 Lollipop and 6.0 Marshmallow. The reporter on the GitHub issue was using 6.0.1, which should have the fix, but maybe the device manufacturer's fork of Android somehow didn't? Or maybe that's not the real cause.
We could work around this if we needed to — messages will never be anywhere near 4GiB, so we could (after asserting that assumption) check specifically for 0xffffffff
on ARM64 Android.
But if it is that libc bug, then it affects more than just our IPC code: if I understand correctly, it applies to every syscall with a 64-bit result — anything like read
and write
, or for that matter mmap
. This is likely to cause a crash whenever one of those syscalls would return an error. So, working around it in this particular case might not help much, but also I'm skeptical that that really is what's going on here.
Comment 4•4 years ago
•
|
||
Analyzing the crash reports, this looks specific to 23 (Marshmallow) but not specific to a vendor.
Comment 5•4 years ago
|
||
They're all Qualcomm SoC's, so maybe a bug in their kernel/libc fork?
Assignee | ||
Comment 6•4 years ago
|
||
A reason why this probably isn't that libc bug: every time we read from an IPC channel, we iterate recvmsg
until it fails with EAGAIN
(a.k.a. EWOULDBLOCK
); if there were a general bug like that, we'd overrun the buffer and crash the entire browser during startup. So it's probably something specific to sendmsg
. The libc side of it is a very small amount of code, either common to all syscalls or autogenerated, so it makes sense to suspect the kernel at this point.
Comment 7•4 years ago
|
||
The severity field is not set for this bug.
:jld, could you have a look please?
For more information, please visit auto_nag documentation.
Assignee | ||
Comment 8•4 years ago
|
||
Something else going on here: this a MOZ_DIAGNOSTIC_ASSERT
, meaning that it's not checked on release builds. In that case we'll call AdvanceAcrossSegments
with a large value, it will consume the entire message and return false, but the return value isn't used.
If this is happening when we should get EAGAIN
, then the protocol would likely lose sync and we'd crash later somewhere else (from trying to deserialize a truncated message with part of the next message appended to it), so that would be a problem and it wouldn't show up on this report, but it might be expected to cause enough crashes in IPC that I would have known about it.
If this is for EPIPE
or similar, then we were going to drop the message anyway, but I don't know if that can happen during normal channel shutdown (i.e., when the other process hasn't crashed; if the parent process crashes then we can't get reports for child process crashes afterwards, and these crashes are all content processes).
In any case, that explains why we're seeing this only on Nightly.
Assignee | ||
Comment 9•4 years ago
|
||
Some Android ARM64 devices appear to have a bug where sendmsg sometimes
returns 0xFFFFFFFF, which we're assuming is a -1 that was incorrectly
truncated to 32-bit and then zero-extended. This patch detects that
value (which should never legitimately be returned, because it's 16x
the maximum message size) and replaces it with -1, with some additional
assertions.
The workaround is also enabled on x86_64 Android on debug builds only,
so that the code has CI coverage.
Updated•4 years ago
|
Comment 10•4 years ago
|
||
Pushed by jedavis@mozilla.com: https://hg.mozilla.org/integration/autoland/rev/f39f39711089 Work around apparent bug with sendmsg() in some 64-bit Android devices. r=nika
Comment 11•4 years ago
|
||
bugherder |
Updated•4 years ago
|
Description
•