Closed Bug 1660826 Opened 5 years ago Closed 5 years ago

Crash in [@ IPC::Channel::ChannelImpl::ProcessOutgoingMessages]

Tracking

()

Status:

RESOLVED FIXED

Milestone:

82 Branch

Tracking Flags:

Tracking

Status

firefox-esr68

---

unaffected

firefox-esr78

---

unaffected

firefox80

---

wontfix

firefox81

---

wontfix

firefox82

---

fixed

People

(Reporter: agi, Assigned: jld)

Details

(Keywords: crash)

Crash Data

Attachments

(1 file)

Bug 1660826 - Work around apparent bug with sendmsg() in some 64-bit Android devices. 5 years ago Jed Davis [:jld] ⟨⏰\|UTC-7⟩ ⟦he/him⟧ 47 bytes, text/x-phabricator-request		Details \| Review

[ex-Mozilla] Agi Sferro | :agi

Reporter

Description

•

5 years ago

We have a user that's seeing this a lot, see https://github.com/mozilla-mobile/fenix/issues/14023

Crash report: https://crash-stats.mozilla.org/report/index/6676cfdb-2b29-4ba4-bc43-1548c0200824

Top 7 frames of crashing thread:

0 libxul.so IPC::Channel::ChannelImpl::ProcessOutgoingMessages ipc/chromium/src/chrome/common/ipc_channel_posix.cc:707
1 libxul.so IPC::Channel::Send ipc/chromium/src/chrome/common/ipc_channel_posix.cc:888
2 libxul.so base::MessagePumpLibevent::Run ipc/chromium/src/base/message_pump_libevent.cc:321
3 libxul.so MessageLoop::Run ipc/chromium/src/base/message_loop.cc:309
4 libxul.so base::Thread::ThreadMain ipc/chromium/src/base/thread.cc:192
5 libxul.so ThreadFunc ipc/chromium/src/base/platform_thread_posix.cc:40
6 libc.so libc.so@0x67784

[ex-Mozilla] Agi Sferro | :agi

Reporter

Updated

•

5 years ago

Component: General → IPC

Product: GeckoView → Core

Jed Davis [:jld] ⟨⏰|UTC-7⟩ ⟦he/him⟧

Assignee

Comment 1

•

5 years ago

MOZ_DIAGNOSTIC_ASSERT(static_cast<size_t>(bytes_written) < amt_to_write)

In theory that can't fail: the total length passed to sendmsg is less than or equal to amt_to_write (strictly less if and only if we exceed kMaxIOVecSize segments), and this is inside a static_cast<size_t>(bytes_written) != amt_to_write conditional, and the 128MiB message size limit should prevent arithmetic overflow.

The other assumption here is that sendmsg doesn't write more data than requested, which seems obvious, but kernel bugs aren't impossible.

Jed Davis [:jld] ⟨⏰|UTC-7⟩ ⟦he/him⟧

Assignee

Comment 2

•

5 years ago

The compiler unhelpfully merged the end of all the assertion failures in (or inlined into) the function, but fortunately they're all marked as unlikely so they're all at the end of the function, which meant it was a simple matter of going through and copypasting the addresses for the crash messages back together until:

(gdb) x/s 0x5647390
0x5647390:      "MOZ_DIAGNOSTIC_ASSERT(static_cast<size_t>(bytes_written) < amt_to_write)"

So here's how we end up at the crash (x10 is the crash message, w11 is the line number, and x8 is intentionally a null pointer):

   0x397c1b0:   cmp     x28, x27
   0x397c1b4:   b.hs    0x397c500
   0x397c500:   adrp    x9, 0x79a9000
   0x397c504:   ldr     x9, [x9, #3648]
   0x397c508:   adrp    x10, 0x5647000
   0x397c50c:   mov     x8, xzr
   0x397c510:   add     x10, x10, #0x390
   0x397c514:   mov     w11, #0x29d                     // #669
   0x397c518:   b       0x397c44c
   0x397c44c:   str     x10, [x9]
   0x397c450:   str     w11, [x8]

It looks like x28 is bytes_written and x27 is amt_to_write, and those registers aren't reused before the crash, so:

                    "x27": "0x00000000000007bc",
                    "x28": "0x00000000ffffffff",

And there's the problem. That's probably supposed to be -1, but… it's not.

Jed Davis [:jld] ⟨⏰|UTC-7⟩ ⟦he/him⟧

Assignee

Comment 3

•

5 years ago

There was a bug in Android libc that would do this, but it was fixed in 2014 and — at least in Google's sources — the fix is in 5.0 Lollipop and 6.0 Marshmallow. The reporter on the GitHub issue was using 6.0.1, which should have the fix, but maybe the device manufacturer's fork of Android somehow didn't? Or maybe that's not the real cause.

We could work around this if we needed to — messages will never be anywhere near 4GiB, so we could (after asserting that assumption) check specifically for 0xffffffff on ARM64 Android.

But if it is that libc bug, then it affects more than just our IPC code: if I understand correctly, it applies to every syscall with a 64-bit result — anything like read and write, or for that matter mmap. This is likely to cause a crash whenever one of those syscalls would return an error. So, working around it in this particular case might not help much, but also I'm skeptical that that really is what's going on here.

Gian-Carlo Pascutto [:gcp]

Comment 4

•

5 years ago

•

Edited

Analyzing the crash reports, this looks specific to 23 (Marshmallow) but not specific to a vendor.

Gian-Carlo Pascutto [:gcp]

Comment 5

•

5 years ago

They're all Qualcomm SoC's, so maybe a bug in their kernel/libc fork?

Jed Davis [:jld] ⟨⏰|UTC-7⟩ ⟦he/him⟧

Assignee

Comment 6

•

5 years ago

A reason why this probably isn't that libc bug: every time we read from an IPC channel, we iterate recvmsg until it fails with EAGAIN (a.k.a. EWOULDBLOCK); if there were a general bug like that, we'd overrun the buffer and crash the entire browser during startup. So it's probably something specific to sendmsg. The libc side of it is a very small amount of code, either common to all syscalls or autogenerated, so it makes sense to suspect the kernel at this point.

BugBot [:suhaib / :marco/ :calixte]

Comment 7

•

5 years ago

The severity field is not set for this bug.
:jld, could you have a look please?

For more information, please visit auto_nag documentation.

Flags: needinfo?(jld)

Jed Davis [:jld] ⟨⏰|UTC-7⟩ ⟦he/him⟧

Assignee

Comment 8

•

5 years ago

Something else going on here: this a MOZ_DIAGNOSTIC_ASSERT, meaning that it's not checked on release builds. In that case we'll call AdvanceAcrossSegments with a large value, it will consume the entire message and return false, but the return value isn't used.

If this is happening when we should get EAGAIN, then the protocol would likely lose sync and we'd crash later somewhere else (from trying to deserialize a truncated message with part of the next message appended to it), so that would be a problem and it wouldn't show up on this report, but it might be expected to cause enough crashes in IPC that I would have known about it.

If this is for EPIPE or similar, then we were going to drop the message anyway, but I don't know if that can happen during normal channel shutdown (i.e., when the other process hasn't crashed; if the parent process crashes then we can't get reports for child process crashes afterwards, and these crashes are all content processes).

In any case, that explains why we're seeing this only on Nightly.

Severity: -- → S2

Flags: needinfo?(jld)

Jed Davis [:jld] ⟨⏰|UTC-7⟩ ⟦he/him⟧

Assignee

Comment 9

•

5 years ago

Attached file Bug 1660826 - Work around apparent bug with sendmsg() in some 64-bit Android devices. — Details

Some Android ARM64 devices appear to have a bug where sendmsg sometimes
returns 0xFFFFFFFF, which we're assuming is a -1 that was incorrectly
truncated to 32-bit and then zero-extended. This patch detects that
value (which should never legitimately be returned, because it's 16x
the maximum message size) and replaces it with -1, with some additional
assertions.

The workaround is also enabled on x86_64 Android on debug builds only,
so that the code has CI coverage.

Phabricator Automation

Updated

•

5 years ago

Assignee: nobody → jld

Status: NEW → ASSIGNED

Pulsebot

Comment 10

•

5 years ago

Pushed by jedavis@mozilla.com: https://hg.mozilla.org/integration/autoland/rev/f39f39711089 Work around apparent bug with sendmsg() in some 64-bit Android devices. r=nika

Cristina Coroiu [:ccoroiu]

Comment 11

•

5 years ago

bugherder

https://hg.mozilla.org/mozilla-central/rev/f39f39711089

Status: ASSIGNED → RESOLVED

Closed: 5 years ago

status-firefox82: --- → fixed

Resolution: --- → FIXED

Target Milestone: --- → 82 Branch

Ryan VanderMeulen (PTO, back 6-April)

Updated

•

5 years ago

status-firefox80: --- → wontfix

status-firefox81: --- → wontfix

status-firefox-esr68: --- → unaffected

status-firefox-esr78: --- → unaffected

You need to log in before you can comment on or make changes to this bug.

Bugzilla

Crash in [@ IPC::Channel::ChannelImpl::ProcessOutgoingMessages]

Categories

(Core :: IPC, defect)

Tracking

()

People

(Reporter: agi, Assigned: jld)

References

Details

(Keywords: crash)

Crash Data

Security

(public)

User Story

Attachments

(1 file)

Description

Updated

Comment 1

Comment 2

Comment 3

Comment 4

Comment 5

Comment 6

Comment 7

Comment 8

Comment 9

Updated

Comment 10

Comment 11

Updated

Attachment

General

Description

File Name

Content Type