Crash in DoNativeBacktrace

VERIFIED FIXED in Firefox 54

Status

()

Core
Gecko Profiler
--
critical
VERIFIED FIXED
3 months ago
2 months ago

People

(Reporter: julienw, Assigned: njn)

Tracking

(4 keywords)

unspecified
mozilla55
Unspecified
Linux
crash, regression, regressionwindow-wanted, sec-other
Points:
---

Firefox Tracking Flags

(firefox52 unaffected, firefox53 unaffected, firefox54 verified, firefox55 verified, firefox-esr45 unaffected, firefox-esr52 unaffected)

Details

(crash signature)

Attachments

(3 attachments)

(Reporter)

Description

3 months ago
This happens to me with latest Aurora (Build ID 20170319084208) and the Gecko Profiler addon being enabled. Firefox crashes at start-up.

Crash links:
https://crash-stats.mozilla.com/report/index/358fad70-cd67-4179-89c5-4b0342170320
https://crash-stats.mozilla.com/report/index/429b4b9f-e64a-4507-ab5f-f40a32170320
https://crash-stats.mozilla.com/report/index/cef75467-87d3-46d1-a8b9-e05e52170320
https://crash-stats.mozilla.com/report/index/fe3b6558-9928-4f0f-91d7-aaa052170320

I've been told this could be a security issue that's why I've checked the "security" checkbox.
Socorro said: "Exploitability:	interesting"
(Reporter)

Comment 2

3 months ago
Other bits of information:
* This doesn't happen if I launch Firefox with MOZ_PROFILER_STARTUP=1.
* This happens on a clean profile: just go to perf-html.io, install the addon, and restart Firefox.
Julien, can you try to find a regression window?
Flags: needinfo?(felash)
Keywords: regression, regressionwindow-wanted
(Assignee)

Comment 4

3 months ago
The crash address is consistently 0xa0 (160), which suggests perhaps that |mc| is null. Which is odd. Having said that, when I poked around online to find a defintion of mcontext_t, I got 0x80 (128) for the offset of |gregs[REG_RSP]|.

The only recent significant change in how this stuff works was bug 1342102. jseward, any ideas?
Flags: needinfo?(jseward)
I can't imagine what has happened here.  I suppose it's just about feasible that
the kernel gave us nullptr as the mcontext in the samplee's signal handler.  But
if that's the case, then it should have faulted at line 673, not at 674 as the
backtrace claims:

  673  startRegs.xip = lul::TaggedUWord(mc->gregs[REG_RIP]);
  674  startRegs.xsp = lul::TaggedUWord(mc->gregs[REG_RSP]);

Offset 160 is probably in the saved floating point state, if you wade through
the swamp that is /usr/include/sys/ucontext.h.  So that doesn't make much sense
given that we don't look at FP state at all.

I'll download or build Aurora in the morning, and see if I can reproduce this.
Flags: needinfo?(jseward)
(Reporter)

Comment 6

3 months ago
here is the pushlog:

https://hg.mozilla.org/releases/mozilla-aurora/pushloghtml?fromchange=69efeba94b307f236a50a9e5fdd2bbd2e946657b&tochange=027664f159bbec461390ee13aef826cd412aa791

doesn't look so useful :/
Flags: needinfo?(felash)
(Reporter)

Comment 7

3 months ago
This was the merge day, so likely what happened is a pref or compile change between aurora and nightly broke this. Because nightly works fine.
(Reporter)

Comment 8

3 months ago
I tried running in gdb and here is the backtrace in gdb:

#0  0x00007fffebd156f1 in js::frontend::BytecodeEmitter::maybeSetSourceMap() () from /home/julien/firefox-aurora-64b/libxul.so
#1  0x00007fffebcf4373 in js::frontend::BytecodeEmitter::emitScript(js::frontend::ParseNode*) () from /home/julien/firefox-aurora-64b/libxul.so
#2  0x00007fffebcf0eb1 in BytecodeCompiler::compileScript(JS::Handle<JSObject*>, js::frontend::SharedContext*) () from /home/julien/firefox-aurora-64b/libxul.so
#3  0x00007fffebcf0d96 in js::frontend::CompileGlobalScript(JSContext*, js::LifoAlloc&, js::ScopeKind, JS::ReadOnlyCompileOptions const&, JS::SourceBufferHolder&, js::SourceCompressionTask*, js::ScriptSourceObject**) () from /home/julien/firefox-aurora-64b/libxul.so
#4  0x00007fffebc7428b in Evaluate(JSContext*, js::ScopeKind, JS::Handle<JSObject*>, JS::ReadOnlyCompileOptions const&, JS::SourceBufferHolder&, JS::MutableHandle<JS::Value>) ()
   from /home/julien/firefox-aurora-64b/libxul.so
#5  0x00007fffec254033 in JS::Evaluate(JSContext*, JS::ReadOnlyCompileOptions const&, char16_t const*, unsigned long, JS::MutableHandle<JS::Value>) () from /home/julien/firefox-aurora-64b/libxul.so
#6  0x00007fffebf65f7f in xpc::EvalInSandbox(JSContext*, JS::Handle<JSObject*>, nsAString_internal const&, nsACString_internal const&, int, JSVersion, JS::MutableHandle<JS::Value>) ()
   from /home/julien/firefox-aurora-64b/libxul.so
#7  0x00007fffebf5ba8b in nsXPCComponents_Utils::EvalInSandbox(nsAString_internal const&, JS::Handle<JS::Value>, JS::Handle<JS::Value>, nsACString_internal const&, int, JSContext*, unsigned char, JS::MutableHandle<JS::Value>) () from /home/julien/firefox-aurora-64b/libxul.so
#8  0x00007fffebe6bada in NS_InvokeByIndex () from /home/julien/firefox-aurora-64b/libxul.so
#9  0x00007fffeb6fa159 in XPCWrappedNative::CallMethod(XPCCallContext&, XPCWrappedNative::CallMode) () from /home/julien/firefox-aurora-64b/libxul.so
#10 0x00007fffeb6fbc0c in XPC_WN_CallMethod(JSContext*, unsigned int, JS::Value*) () from /home/julien/firefox-aurora-64b/libxul.so
#11 0x00000b14f2e150d4 in ?? ()
#12 0x00007fffffff7eaf in ?? ()
#13 0x00007fffffff80e8 in ?? ()
#14 0x00007fffffff81e8 in ?? ()
#15 0x0000000000000000 in ?? ()


I'm surprised because the top frame is not the same than what is in crash-stats.
Keywords: sec-other
Some notes:

* I can't repro it with a from-source aurora, using my "normal" mozconfig

* But the aurora from a.m.o crashes exactly as described above, with similarish
  backtrace (not sure)

* It also fails on valgrind, differently.  The stack actually sounds
  plausible.

==27626== Invalid write of size 4
==27626==    at 0x13ACD35F: JSRuntime::beginSingleThreadedExecution(JSContext*) (in /home/sewardj/MOZ/TMP_AURORA/RMME/firefox/libxul.so)
==27626==    by 0x13AB47E6: js::GeckoProfiler::enable(bool) (in /home/sewardj/MOZ/TMP_AURORA/RMME/firefox/libxul.so)
==27626==    by 0x13AB4ABA: js::EnableContextProfilingStack(JSContext*, bool) (in /home/sewardj/MOZ/TMP_AURORA/RMME/firefox/libxul.so)
==27626==    by 0x136F0294: PseudoStack::enableJSSampling() (in /home/sewardj/MOZ/TMP_AURORA/RMME/firefox/libxul.so)
==27626==    by 0x136EC6B3: profiler_start(int, double, char const**, unsigned int, char const**, unsigned int) (in /home/sewardj/MOZ/TMP_AURORA/RMME/firefox/libxul.so)
==27626==    by 0x136F54D5: nsProfiler::StartProfiler(unsigned int, double, char const**, unsigned int, char const**, unsigned int) (in /home/sewardj/MOZ/TMP_AURORA/RMME/firefox/libxul.so)
==27626==    by 0x14409AD9: ??? (in /home/sewardj/MOZ/TMP_AURORA/RMME/firefox/libxul.so)
==27626==    by 0x13C97EB1: XPCWrappedNative::CallMethod(XPCCallContext&, XPCWrappedNative::CallMode) (in /home/sewardj/MOZ/TMP_AURORA/RMME/firefox/libxul.so)
==27626==    by 0x13C99C0B: XPC_WN_CallMethod(JSContext*, unsigned int, JS::Value*) (in /home/sewardj/MOZ/TMP_AURORA/RMME/firefox/libxul.so)
==27626==    by 0x140B2A6D: InternalCall(JSContext*, js::AnyInvokeArgs const&) (in /home/sewardj/MOZ/TMP_AURORA/RMME/firefox/libxul.so)
==27626==    by 0x140A130D: Interpret(JSContext*, js::RunState&) (in /home/sewardj/MOZ/TMP_AURORA/RMME/firefox/libxul.so)
==27626==    by 0x140A0578: js::RunScript(JSContext*, js::RunState&) (in /home/sewardj/MOZ/TMP_AURORA/RMME/firefox/libxul.so)
==27626==  Address 0x0 is not stack'd, malloc'd or (recently) free'd
That is indeed a plausible stack.

Nightly and Aurora differ in the fact that Nightly gets --enable-profiling by default.

http://searchfox.org/mozilla-central/source/js/moz.configure#133
Ah, so we might be able to repro it on trunk by building with --disable-profile?
I'd expect it to be called --disable-profiling but I don't know if we have the magic to make it do anything. Worth a try though.
ted confirmed on IRC that --disable-profiling is a thing that exists.
(Reporter)

Comment 14

3 months ago
NI myself to check if this works and bisect on nightly if this does.
Flags: needinfo?(felash)
This reproduces trivially on trunk when built with --disable-profiling.

I assume that there is some "boundary" of functions that we should not call into
in this case, and this is why it faults.  But I'm unclear where that is.  Nick,
do you know?  For sure we shouldn't be getting as far as DoNativeBacktrace if
the profiler is disabled.

The m-c crash stack looks like this:

#0  0x00007fffeb0cf59e in DoNativeBacktrace() at tools/profiler/core/platform.cpp:850
#1  0x00007fffeb0cf59e in Tick() at tools/profiler/core/platform.cpp:974
#2  0x00007fffeb0cb1e5 in profiler_get_backtrace() at tools/profiler/core/platform.cpp:3030
#3  0x00007fffea344aba in AddLayoutFlushObserver() at ff-O1-linux64-clang-noprof/dist/include/nsRefreshDriver.h:181
#4  0x00007fffea32e52d in ScheduleReflow() at layout/base/PresShell.cpp:9094
#5  0x00007fffea32e52d in MaybeScheduleReflow() at layout/base/PresShell.cpp:9082
#6  0x00007fffea32b218 in FrameNeedsReflow() at layout/base/PresShell.cpp:2798
#7  0x00007fffea32a96b in Initialize() at layout/base/PresShell.cpp:1827
#8  0x00007fffe8ea85bc in StartLayout() at dom/base/nsContentSink.cpp:1239
#9  0x00007fffe8ea847d in StyleSheetLoaded() at dom/base/nsContentSink.cpp:251
Flags: needinfo?(n.nethercote)
(Reporter)

Comment 16

3 months ago
After bisect, I found the bad commit is Bug 1339695 part 7.
Flags: needinfo?(felash)
(Assignee)

Comment 17

3 months ago
(In reply to Julian Seward [:jseward] from comment #15)
> This reproduces trivially on trunk when built with --disable-profiling.
> 
> I assume that there is some "boundary" of functions that we should not call
> into
> in this case, and this is why it faults.  But I'm unclear where that is. 
> Nick,
> do you know?  For sure we shouldn't be getting as far as DoNativeBacktrace if
> the profiler is disabled.
> 
> The m-c crash stack looks like this:
> 
> #0  0x00007fffeb0cf59e in DoNativeBacktrace() at
> tools/profiler/core/platform.cpp:850
> #1  0x00007fffeb0cf59e in Tick() at tools/profiler/core/platform.cpp:974
> #2  0x00007fffeb0cb1e5 in profiler_get_backtrace() at
> tools/profiler/core/platform.cpp:3030
> #3  0x00007fffea344aba in AddLayoutFlushObserver() at
> ff-O1-linux64-clang-noprof/dist/include/nsRefreshDriver.h:181

The main way that samples are taken is via SamplerThread::Run(), where we do it periodically when the profiler is active.

The other way is via ad hoc backtrace requests, which are done via profiler_get_backtrace(), as in this case. It does seem surprising that it gets as far as Tick()...

(I'd really love to get rid of profiler_get_backtrace(), because it adds a great deal of complexity. But it seems like a pretty important feature, alas.)


(In reply to Julien Wajsberg [:julienw] from comment #16)
> After bisect, I found the bad commit is Bug 1339695 part 7.

Hmm, that's surprising, and at first glance doesn't seems to make sense w.r.t. comment 15.

Anyway, I'll take a look on Monday. Thanks for the info!
Assignee: nobody → n.nethercote
Flags: needinfo?(n.nethercote)
(Reporter)

Comment 18

3 months ago
Yeah I agree. I'll double check Monday too.
(Assignee)

Comment 19

3 months ago
I bisected the regression down to Bug 1339695 part 8, not part 7. Part 8 was a much bigger change, so it's more plausible.
(Assignee)

Comment 20

3 months ago
Ah! --disable-profiling is a bit of a confusing name. It doesn't mean "don't run the Gecko profiler". It means "disable some stuff that is required for native stackwalking to occur". So when it's specified, we should still be able to run the Gecko Profiler, but we'll get pseudo-stacks instead of native stacks. (Though it depends on the platform.)

Anyway, I've worked out that the problem here relates to a couple of uses of HAVE_NATIVE_UNWIND.
(Assignee)

Comment 21

3 months ago
I don't think this needs to be s-s, because it's a null deref -- we fail to set TickSample::context(), but its constructor defaults it to nullptr, and we deref it. (I'm not sure why the crashes are showing up a few lines later in practice. That's a bit surprising. Maybe an artifact of compiler optimizations?)
Marginally OT, but could we (not in this bug) just get rid of this configure option?
It strikes me as extra verification complexity for no clear purpose.
(Assignee)

Comment 23

3 months ago
Created attachment 8851473 [details] [diff] [review]
(part 1) - Fix a crash in the profiler when MOZ_PROFILING is undefined

Bug 1339695 part 8 unintentionally changed behaviour in profiler_init() when
MOZ_PROFILING is undefined. This patch undoes that change.
Attachment #8851473 - Flags: review?(mstange)
(Assignee)

Comment 24

3 months ago
Created attachment 8851478 [details] [diff] [review]
(part 2) - Fix up HAVE_NATIVE_UNWIND

This patch simplifies and increases the consistency of how HAVE_NATIVE_UNWIND
is used.

- Its definition is moved from platform.h to platform.cpp, because the latter
  is the only file that uses it.

- It's now defined in the same place as USE_{NS,EHABI,LUL}_STACKWALK, and used
  in preference to those, where possible. Also, it's now defined on Linux and
  Android even if MOZ_PROFILING is not.

- HAVE_NATIVE_UNWIND is now used consistently and by itself for all relevant
  conditions, including when defining the presence and use of the "stackwalk"
  feature.

- The patch inlines and removes is_native_unwinding_avail().

Note that MOZ_PROFILING must be defined for HAVE_NATIVE_UNWIND to be true on
Windows and Mac, but not on Linux and Android.
Attachment #8851478 - Flags: review?(mstange)
Blocks: 1339695
status-firefox52: --- → unaffected
status-firefox53: --- → unaffected
status-firefox54: --- → affected
status-firefox55: --- → affected
status-firefox-esr45: --- → unaffected
status-firefox-esr52: --- → unaffected
Attachment #8851473 - Flags: review?(mstange) → review+
Comment on attachment 8851478 [details] [diff] [review]
(part 2) - Fix up HAVE_NATIVE_UNWIND

Review of attachment 8851478 [details] [diff] [review]:
-----------------------------------------------------------------

Good!

In bug 1340721, win 32 will join the list of platforms that can HAVE_NATIVE_UNWIND without --enable-profiling.
Attachment #8851478 - Flags: review?(mstange) → review+
(Assignee)

Comment 26

3 months ago
Obvious question: why didn't testing identify this? Do we not have any tests that exercise profiler_get_backtrace()?
(Assignee)

Comment 27

3 months ago
https://hg.mozilla.org/integration/mozilla-inbound/rev/45a9aa72f7a3d5cb967ae1c486dc223bfb8735ae
Bug 1348776 (part 1) - Fix a crash in the profiler when MOZ_PROFILING is undefined. r=mstange.

https://hg.mozilla.org/integration/mozilla-inbound/rev/eaa16f49952507c145cf5303130fcdd1d91d0e76
Bug 1348776 (part 2) - Fix up HAVE_NATIVE_UNWIND. r=mstange.
(Assignee)

Comment 28

3 months ago
Created attachment 8851835 [details] [diff] [review]
(part 1) - Fix a crash in the profiler when MOZ_PROFILING is undefined

Here's a version for backporting to Aurora. The difference with the patch that
landed on m-c is trivial.
(Assignee)

Comment 29

3 months ago
Comment on attachment 8851835 [details] [diff] [review]
(part 1) - Fix a crash in the profiler when MOZ_PROFILING is undefined

Review of attachment 8851835 [details] [diff] [review]:
-----------------------------------------------------------------

Carrying over r+ from the m-c version.
Attachment #8851835 - Flags: review+
(Assignee)

Comment 30

3 months ago
Comment on attachment 8851835 [details] [diff] [review]
(part 1) - Fix a crash in the profiler when MOZ_PROFILING is undefined

Approval Request Comment

[Feature/Bug causing the regression]: Bug 1339695 part 8

[User impact if declined]: Startup crashes when running the profiler on Linux on Aurora and later channels.

[Is this code covered by automated tests?]: No, sadly enough. I just filed bug 1351136 to improve test coverage of the profiler.

[Has the fix been verified in Nightly?]: I verified it in a local build.

[Needs manual test from QE? If yes, steps to reproduce]: Yes. On Linux, get a build with --disable-profiling (either from a local build, or by using Aurora).
Install the profiler add-on from https://perf-html.io/ in a new profile, then restart the browser. It should crash immediately at startup.

[List of other uplifts needed for the feature/fix]: None.

[Is the change risky?]: No.

[Why is the change risky/not risky?]: Single-line change.

[String changes made/needed]: None.
Attachment #8851835 - Flags: approval-mozilla-aurora?
https://hg.mozilla.org/mozilla-central/rev/45a9aa72f7a3
https://hg.mozilla.org/mozilla-central/rev/eaa16f499525
status-firefox55: affected → fixed
Target Milestone: --- → mozilla55
Group: core-security → core-security-release
Marking fixed because this landed on central in comment 8.
Status: NEW → RESOLVED
Last Resolved: 3 months ago
Resolution: --- → FIXED
Comment on attachment 8851835 [details] [diff] [review]
(part 1) - Fix a crash in the profiler when MOZ_PROFILING is undefined

Fix a crash when running the profiler on Linux on Aurora. Aurora54+.
Attachment #8851835 - Flags: approval-mozilla-aurora? → approval-mozilla-aurora+

Updated

3 months ago
Flags: qe-verify+

Comment 34

3 months ago
uplift
https://hg.mozilla.org/releases/mozilla-aurora/rev/b95fcbceb5b3
status-firefox54: affected → fixed
Group: core-security-release
Reproduced the crash on Aurora 54.0a2 using STR from comment 30.
- https://crash-stats.mozilla.com/report/index/18d05fff-c0d5-45c8-80d1-ab9da0170424.

I can no longer reproduce the it on Ubuntu 16.04 x64 LTS with latest Nightly 55.0a1 (2017-04-24) and 54 beta 1 (20170420081804).
Status: RESOLVED → VERIFIED
status-firefox54: fixed → verified
status-firefox55: fixed → verified
Flags: qe-verify+
You need to log in before you can comment on or make changes to this bug.