Closed Bug 1518470 Opened 7 years ago Closed 7 years ago

Uncaught top crash in Focus 64.0.x

Categories

(GeckoView :: General, defect, P1)

64 Branch
x86_64
Linux
defect

Tracking

(Not tracked)

RESOLVED DUPLICATE of bug 1354200

People

(Reporter: snorp, Unassigned)

References

Details

(Keywords: regression, topcrash, Whiteboard: [memory corruption?])

We're seeing a crash in libxul.so not being caught by the reporter making it into the play store crashes on Focus 8.0.x. The lines reported to the play store are:

*** *** *** *** *** *** *** *** *** *** *** *** *** *** *** ***
pid: 0, tid: 0 >>> org.mozilla.focus <<<

backtrace:
  #00  pc 00000000002e8202  /data/data/org.mozilla.focus/cache/libxul.so
  #01  pc 00000000002e81bd  /data/data/org.mozilla.focus/cache/libxul.so

With :glandium's help, we found that these resolved to non-virtual thunk to mozilla::ipc::DoWorkRunnable::Cancel and mozilla::ipc::DoWorkRunnable::Run. :glandium also found that the crashing instruction is:

  625ef6:       b580            push    {r7, lr}

Which, if correct, indicates that there may be a stack overflow.

Priority: -- → P1
Version: unspecified → 64 Branch

I should add, we also know the crashing signal is signal 11 (SIGSEGV), code 1 (SEGV_MAPERR), which I believe is still consistent with a stack overflow.

Is it possible for DoWorkRunnable::Run to call DoWorkRunnable::Cancel? That's not clear to me at all.

Jim we realize there isn't a lot to go on with just a couple of frames but given the priority (blocking GV roll out) here do you know an ipc code expert who could take a look here? Discussion suggested in #gv slack. Maybe there is an infinite (mutual) recursion situation or something... does our ipc code guard against that?

Flags: needinfo?(jmathies)

The comment here[0] indicates that DoWorkRunnable::Cancel is mostly used by Workers. Andrew, can we get some help with this? It's an urgent issue affecting the rollout of GV in Firefox Focus.

[0] https://searchfox.org/mozilla-central/source/ipc/glue/MessagePump.cpp#218

Flags: needinfo?(overholt)

baku or asuth can probably help with this urgent issue.

Flags: needinfo?(overholt)
Flags: needinfo?(bugmail)
Flags: needinfo?(amarchesini)

Ok, the top crashes that we do get reports for involve AndroidUiThread, so I think maybe the problems could be linked.

https://crash-stats.mozilla.com/topcrashers/?product=Focus&version=64.0.1

Makoto, you've been pretty good at debugging some of these things in the past, so any ideas you have would certainly be welcome.

Flags: needinfo?(m_kato)

(In reply to James Willcox (:snorp) (jwillcox@mozilla.com) from comment #6)

Ok, the top crashes that we do get reports for involve AndroidUiThread, so I think maybe the problems could be linked.

https://crash-stats.mozilla.com/topcrashers/?product=Focus&version=64.0.1

Can you link to a specific report(s) or signature that you're talking about? And/or if you know any tricks to easily figure out which crash reports we actually have system library symbols for, like filtering on Pixel devices or something. Thanks!

Flags: needinfo?(snorp)

(In reply to Andrew Sutherland [:asuth] from comment #8)

(In reply to James Willcox (:snorp) (jwillcox@mozilla.com) from comment #6)

Ok, the top crashes that we do get reports for involve AndroidUiThread, so I think maybe the problems could be linked.

https://crash-stats.mozilla.com/topcrashers/?product=Focus&version=64.0.1

Can you link to a specific report(s) or signature that you're talking about?

This one seems plausibly related as it also relates to runnable stuff: https://crash-stats.mozilla.com/report/index/b3f1b5df-565e-4f17-acea-de4b60190108

And/or if you know any tricks to easily figure out which crash reports we actually have system library symbols for, like filtering on Pixel devices or something. Thanks!

I don't believe we have system symbols for any recent devices (or possibly any devices at all).

Flags: needinfo?(snorp)

I'm trying to understand why DoWorkRunnable still exists and what it does. It goes back to the beginning of IPC and it looks like the purpose was to allow an XPCOM thread to spin a Chromium event loop and dispatch Chromium Task objects; see also bug 939182.

However, the DequeueTask used to tell an actor thread (i.e., not the IPC I/O thread) to run callbacks for incoming messages was changed from a Chromium Task to an XPCOM Runnable in bug 1266595 (or, rather, Task was merged into Runnable), and later replaced with MessageTask in bug 1312960 (which was part of being able to have per-actor message routing to different XPCOM event targets).

Meanwhile, bug 1269056 allows a Chromium MessageLoop to actually be a wrapper around an XPCOM nsIEventTarget; I'd expect that to be the case on threads other than the IPC I/O thread, and from a quick look at the source that seems to be the case (at least it is for mozilla::ipc::MessagePump, not to be confused with the superclass base::MessagePump).

So naively I'd expect that none of those should be necessary, or at worst the event loop spinning would be a no-op, but evidence seems to suggest that this is not the case.

(Possibly also related: bug 1263311.)

(In reply to James Willcox (:snorp) (jwillcox@mozilla.com) from comment #9)

I don't believe we have system symbols for any recent devices (or possibly any devices at all).

The crash is in our code (according to comment #0) so in theory we shouldn't need system symbols to get at least a semi-useful signature, I think?

This one seems plausibly related as it also relates to runnable stuff: https://crash-stats.mozilla.com/report/index/b3f1b5df-565e-4f17-acea-de4b60190108

That's a SIGBUS and the fault address is nowhere near the stack pointer, so it's not the same crash as comment #0. But it is a little worrying that there's a frequent crash at what looks like a vtable call with a lot of different fault addresses.

(In reply to James Willcox (:snorp) (jwillcox@mozilla.com) from comment #0)

backtrace:
#00 pc 00000000002e8202 /data/data/org.mozilla.focus/cache/libxul.so
#01 pc 00000000002e81bd /data/data/org.mozilla.focus/cache/libxul.so


With `:glandium`'s help, we found that these resolved to `non-virtual thunk to mozilla::ipc::DoWorkRunnable::Cancel` and `mozilla::ipc::DoWorkRunnable::Run`.

Just to be clear, 0x2e8202 (frame #0) is Run and 0x2e81bd (frame #1) is Cancel? I can see that call stack easily, but I can't see the reverse that comment 2 suggests, where Run() is calling Cancel().

Flags: needinfo?(snorp)

(In reply to Nathan Froyd [:froydnj] from comment #12)

(In reply to James Willcox (:snorp) (jwillcox@mozilla.com) from comment #0)

backtrace:
#00 pc 00000000002e8202 /data/data/org.mozilla.focus/cache/libxul.so
#01 pc 00000000002e81bd /data/data/org.mozilla.focus/cache/libxul.so


With `:glandium`'s help, we found that these resolved to `non-virtual thunk to mozilla::ipc::DoWorkRunnable::Cancel` and `mozilla::ipc::DoWorkRunnable::Run`.

Just to be clear, 0x2e8202 (frame #0) is Run and 0x2e81bd (frame #1) is Cancel? I can see that call stack easily, but I can't see the reverse that comment 2 suggests, where Run() is calling Cancel().

From glandium:

My guess is the offsets are relative to the PT_LOAD corresponding to
.text, which would be:

  Type           Offset   VirtAddr   PhysAddr   FileSiz MemSiz  Flg Align
  LOAD           0x128cf4 0x0033dcf4 0x0033dcf4 0x27e9c9c 0x27e9c9c R E 0x1000

So 0x2e8202 would be 0x625ef6 and 0x2e81bd would be 0x625eb1.

The former resolves to `non-virtual thunk to mozilla::ipc::DoWorkRunnable::Cancel`
and the latter to `mozilla::ipc::DoWorkRunnable::Run`.

So frame 0 is Cancel(), and frame 1 is Run(). I think I had it backwards above. Hmm.

Flags: needinfo?(snorp)

Saying the stack overflow is in Cancel() which is called by Run()...I can't make sense of that call chain. There'd have to be...massive inlining through virtual calls and whatnot.

What if the offsets are relative to the Offset for the PT_LOAD, which would make 0x2e8202 be 0x410ef6 and 0x2e81bd be 0x410eb1? That puts the stack frames somewhere in nsStandardURL::SetFilePath...which doesn't make any more sense, so scrap that theory.

Specifically, assuming glandium's intepretation of the addresses is right, frame 1 is this call: https://hg.mozilla.org/releases/mozilla-release/file/448b0b631dd99e6e27559598c1cc191d4cbf10f7/ipc/glue/MessagePump.cpp#l225

And the address is 1 less than the end of a call (blx) instruction, which is expected for non-leaf frames.

As for frame 0:

  625ef6:       b580            push    {r7, lr}
  625ef8:       f7ff ffcc       bl      625e94
  625efc:       2000            movs    r0, #0
  625efe:       bd80            pop     {r7, pc}

Which calls DoWorkRunnable::Run and then returns 0, which is exactly what DoWorkRunnable::Cancel would do in a non-debug build (and apparently not a wrapper to deal with multiple inheritance, which is what I thought "non-virtual thunk" usually means).

As previously discussed, the address we have is the start of the function, which makes sense for it actually being the leaf frame, so… somehow MessageLoop::DoWork managed to tail-call it? Even though there's nothing in the C++ code (or the disassembly, as far as I can tell) that would do that. In which case: memory corruption such that the DoWork call went through the wrong vtable, maybe?

This is the call I'm assuming happened, to the word at +8 from the vtable pointer:

  625eaa:       6820            ldr     r0, [r4, #0]
  625eac:       6881            ldr     r1, [r0, #8]
  625eae:       4620            mov     r0, r4
  625eb0:       4788            blx     r1

Earlier, r4 was set by this call:

  625e94:       b5b0            push    {r4, r5, r7, lr}
  625e96:       f7d2 f88d       bl      5f7fb4   @ this is MessageLoop::current
  625e9a:       4604            mov     r4, r0

Now, here's the ::MessageLoop vtable:

(gdb) x/6xw 0x02b56ecc
0x2b56ecc:      0x00000000      0x00000000      0x005f82e5      0x003c81eb
0x2b56edc:      0x005f8799      0x005f887d

0x005f8799 is MessageLoop::DoWork, so I guess the vtable pointer in the object (what the standard calls the address point) points to offset 8 within the vtable data (after the offset-to-top and RTTI words).

Here, in comparison, is the DoWorkRunnable vtable:

(gdb) x/24xw 0x02b58534
0x2b58534:      0x00000000      0x00000000      0x00625e5d      0x0038c979
0x2b58544:      0x003a7beb      0x00625e95      0x00367287      0x00356f51
0x2b58554:      0x00625ef7      0x00625ebf      0xfffffff8      0x00000000
0x2b58564:      0x00625e89      0x0038c9f5      0x003a7bef      0x00625ef7
0x2b58574:      0x00367287      0x0037dbf5      0xfffffff4      0x00000000
0x2b58584:      0x00625e8f      0x003a7be5      0x003a7bf5      0x00625eeb

Every class here inherits from nsISupports, so the first three methods are always (QI, AddRef, Release), and [r0, #8] would always call the Release method. The address of DoWorkRunnable::Cancel would be at offsets 24 and 52 from the vtable address point of a DoWorkRunnable*, or offset 12 if it's cast to nsICancelableRunnable*.

So it's not quite as simple as the DoWorkRunnable getting substituted for the MessageLoop somehow. But if there was some way for the call to get there, it would immediately blow the stack and explain everything else we're seeing.

Any info on when this regressed, builds, whatnot? If it's memory corruption it'd be great if we could find a regression range to look at.

Flags: needinfo?(snorp)
Flags: needinfo?(m_kato)
Flags: needinfo?(jmathies)
Flags: needinfo?(bugmail)
Flags: needinfo?(amarchesini)
Whiteboard: [memory corruption?]

(In reply to Jim Mathies [:jimm] from comment #17)

Any info on when this regressed, builds, whatnot? If it's memory corruption it'd be great if we could find a regression range to look at.

We don't appear to have this crash at all in Fennec, so that seems significant. That means e10s may be in play, or perhaps it's specific to some GV API.

This is the first time we've really tried to deploy GV in Focus, so there isn't anything historical there either.

Flags: needinfo?(snorp)

You can get the specific build from here: https://github.com/mozilla-mobile/focus-android/releases/tag/v8.0-RC10

You'll want the 'Focus-arm.apk'.

It may be necessary to go to focus:test and switch the "use new renderer" toggle if it isn't already set.

Summary: Uncaught top crash in Focus 68.0.x → Uncaught top crash in Focus 64.0.x

This has been reproduced and we have a tombstone file. It's a null pointer crash and the faulting PC is c3a1d202, which strongly suggests that glandium's guess was almost right: that the offset is 0x33d000 (not 0x33dcf4) and the faulting instruction is the deliberate null pointer access in the MOZ_CRASH at https://hg.mozilla.org/releases/mozilla-release/file/9c744599570f/ipc/glue/MessageChannel.cpp#l2669

Following this assumption, “frame 1” isn't: it points to line 2664 of that file, to the call to AwaitingSyncReply, which actually makes sense because it's just taken from lr, and that would likely have been the last function call before the crash and thus the last thing to set the link register.

So this seems to be a content process doing a MOZ_CRASH because the parent process exited without going through normal shutdown of IPC channels, which would also explain not having a breakpad crash: assuming GeckoView works the same way as the other platforms I've dealt with, the crashing child messages the parent to report the crash.

(In reply to Jed Davis [:jld] ⟨⏰|UTC-7⟩ ⟦he/him⟧ from comment #22)

So this seems to be a content process doing a MOZ_CRASH because the parent process exited without going through normal shutdown of IPC channels, which would also explain not having a breakpad crash: assuming GeckoView works the same way as the other platforms I've dealt with, the crashing child messages the parent to report the crash.

Yes, the parent process generates all the minidumps so if it's gone there's no way we can get a content process dump.

I put a patch on bug 1354200, so I'm going to dup this one to that.

Status: NEW → RESOLVED
Closed: 7 years ago
Resolution: --- → DUPLICATE

Removing regressionwindow-wanted keyword because this bug has been resolved.

Removing regressionwindow-wanted keyword because this bug has been resolved.

Removing regressionwindow-wanted keyword because this bug has been resolved.

You need to log in before you can comment on or make changes to this bug.