Closed Bug 1644021 Opened 4 years ago Closed 4 years ago

Crash in [@ GeckoViewStreamingTelemetry::SendBatchRunnable::Run]

Categories

(Toolkit :: Telemetry, defect, P1)

Unspecified
Android
defect

Tracking

()

RESOLVED FIXED
mozilla79
Tracking Status
firefox-esr68 --- unaffected
firefox-esr78 --- unaffected
firefox77 --- wontfix
firefox78 --- wontfix
firefox79 --- fixed

People

(Reporter: yoasif, Assigned: chutten)

Details

(Keywords: crash)

Crash Data

Attachments

(1 file)

This bug is for crash report bp-2c5238c7-34d8-4f08-bde6-a77540200607.

Top 10 frames of crashing thread:

0 libxul.so GeckoViewStreamingTelemetry::SendBatchRunnable::Run toolkit/components/telemetry/geckoview/streaming/GeckoViewStreamingTelemetry.cpp:123
1 libxul.so nsThread::ProcessNextEvent xpcom/threads/nsThread.cpp:1236
2 libxul.so mozilla::ipc::MessagePump::Run ipc/glue/MessagePump.cpp:87
3 libxul.so MessageLoop::Run ipc/chromium/src/base/message_loop.cc:290
4 libxul.so nsBaseAppShell::Run widget/nsBaseAppShell.cpp:137
5 libxul.so nsAppStartup::Run toolkit/components/startup/nsAppStartup.cpp:271
6 libxul.so XREMain::XRE_mainRun toolkit/xre/nsAppRunner.cpp:4664
7 libxul.so XREMain::XRE_main toolkit/xre/nsAppRunner.cpp:4811
8 libxul.so XRE_main toolkit/xre/nsAppRunner.cpp:4865
9 libxul.so GeckoStart toolkit/xre/nsAndroidStartup.cpp:46

Mentioned https://www.reddit.com/r/firefox/comments/gy1w7v/please_help_with_fenix/ for Fenix.

Component: General → Telemetry
Flags: needinfo?(chutten)
Product: GeckoView → Toolkit

Crashing line

Crashing address is 0xe5e5e5e5e5e5f5 which looks like an offset to a marker value to me. Suggests that mDelegate has been freed by jemalloc maybe.

Reports on this signature are consistent with a solitary user having a really hard time with Fenix every month or two for a week at a stretch.

mDelegate should not possibly be null. Access to gDelegate is protected by mutex, and we were holding it from when we last checked it was non-null to when we stuffed it in the SendBatchRunnable which will keep it alive inside its RefPtr. We haven't changed the GV Streaming code in quite some time, so it seems likely that this is either uncovering something incredibly unlikely that hasn't been seen before, or perhaps some change outside of GV Streaming has complicated matters (maybe the delegate now frees itself even if it has a refcount? But that code, too, hasn't changed since last August)

Maybe a check for !gDelegate doesn't work if it hasn't been explicitly set to nullptr and so the fact that it's set to 0xe5e5e5e5e5e5e5 allows it to pass the check? Is there a different way I should be checking if a RefPtr is pointing to something that's been deallocated? (aklotz, do you know?)

Flags: needinfo?(chutten) → needinfo?(aklotz)

Having taken a cursory look over this code, I don't see any smoking guns here. :-(

Flags: needinfo?(aklotz)

I looked briefly at this; the assembly leading up to the crash is:

# load `mDelegate`
   7f817bcad4:	60 12 40 f9          	ldr x0, [x19, #0x20]
   7f817bcad8:	e9 17 40 f9          	ldr x9, [sp, #0x28]
# pointer to `histogramName`
   7f817bcadc:	e1 43 00 91          	add x1, sp, #0x10
# load `mDelegate`'s vtable
   7f817bcae0:	08 00 40 f9          	ldr x8, [x0]
# load a pointer to `samples` (I think)
   7f817bcae4:	22 41 00 91          	add x2, x9, #0x10
# load the address of the virtual method `ReceiveHistogramSamples`
# CRASH
   7f817bcae8:	08 09 40 f9          	ldr x8, [x8, #0x10]
# call the method
   7f817bcaec:	00 01 3f d6          	blr x8

How in the world is the vtable holding a pointer to jemalloc poision? Are there terrible things going on in the Java interface bridge, or did we manage to royally scribble over static memory?

Come to think of it, I think I got needinfo'd recently in bug 1635400 where it sounded like perhaps there was some JNI shenanigans going on. I wonder if these are related in any way...

:esawin, you wrote the JNI interface layer back in August. Do you have any idea what's up?

Flags: needinfo?(esawin)

I see that StreamingTelemetryDelegate is (in theory, at least?) used from different threads, but does not use a thread-safe ref counting mechanism. I think this may explain things. :chutten I think you should try changing StreamingTelemetryDelegate to derive from AtomicRefCounted and see if that helps.

Apparently the right thing to do is to use NS_INLINE_DECL_THREADSAFE_REFCOUNTING(StreamingTelemetryDelegate)

Ah. If the refcnt isn't atomic, I can imagine that interesting things could indeed happen when the delegate is dropped after a runnable is dispatched. (or if the delegate is changed). It'd be weird as all get out for this to be run across, but I guess that matches the frequency of the crash.

Let's make the refcounting threadsafe and see what we see.

Assignee: nobody → chutten
Severity: -- → S4
Status: NEW → ASSIGNED
Flags: needinfo?(esawin)
Priority: -- → P1
Pushed by chutten@mozilla.com: https://hg.mozilla.org/integration/autoland/rev/9a3a87c7c7bf StreamingTelemetryDelegates need threadsafe ref counting r=snorp
Status: ASSIGNED → RESOLVED
Closed: 4 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla79
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: