Closed Bug 1348776 Opened 7 years ago Closed 7 years ago

Crash in DoNativeBacktrace

Categories

(Core :: Gecko Profiler, defect)

Unspecified
Linux
defect
Not set
critical

Tracking

()

VERIFIED FIXED
mozilla55
Tracking Status
firefox-esr45 --- unaffected
firefox52 --- unaffected
firefox-esr52 --- unaffected
firefox53 --- unaffected
firefox54 --- verified
firefox55 --- verified

People

(Reporter: julienw, Assigned: n.nethercote)

References

Details

(4 keywords)

Crash Data

Attachments

(3 files)

This happens to me with latest Aurora (Build ID 20170319084208) and the Gecko Profiler addon being enabled. Firefox crashes at start-up.

Crash links:
https://crash-stats.mozilla.com/report/index/358fad70-cd67-4179-89c5-4b0342170320
https://crash-stats.mozilla.com/report/index/429b4b9f-e64a-4507-ab5f-f40a32170320
https://crash-stats.mozilla.com/report/index/cef75467-87d3-46d1-a8b9-e05e52170320
https://crash-stats.mozilla.com/report/index/fe3b6558-9928-4f0f-91d7-aaa052170320

I've been told this could be a security issue that's why I've checked the "security" checkbox.
Socorro said: "Exploitability:	interesting"
Other bits of information:
* This doesn't happen if I launch Firefox with MOZ_PROFILER_STARTUP=1.
* This happens on a clean profile: just go to perf-html.io, install the addon, and restart Firefox.
Julien, can you try to find a regression window?
Flags: needinfo?(felash)
The crash address is consistently 0xa0 (160), which suggests perhaps that |mc| is null. Which is odd. Having said that, when I poked around online to find a defintion of mcontext_t, I got 0x80 (128) for the offset of |gregs[REG_RSP]|.

The only recent significant change in how this stuff works was bug 1342102. jseward, any ideas?
Flags: needinfo?(jseward)
I can't imagine what has happened here.  I suppose it's just about feasible that
the kernel gave us nullptr as the mcontext in the samplee's signal handler.  But
if that's the case, then it should have faulted at line 673, not at 674 as the
backtrace claims:

  673  startRegs.xip = lul::TaggedUWord(mc->gregs[REG_RIP]);
  674  startRegs.xsp = lul::TaggedUWord(mc->gregs[REG_RSP]);

Offset 160 is probably in the saved floating point state, if you wade through
the swamp that is /usr/include/sys/ucontext.h.  So that doesn't make much sense
given that we don't look at FP state at all.

I'll download or build Aurora in the morning, and see if I can reproduce this.
Flags: needinfo?(jseward)
This was the merge day, so likely what happened is a pref or compile change between aurora and nightly broke this. Because nightly works fine.
I tried running in gdb and here is the backtrace in gdb:

#0  0x00007fffebd156f1 in js::frontend::BytecodeEmitter::maybeSetSourceMap() () from /home/julien/firefox-aurora-64b/libxul.so
#1  0x00007fffebcf4373 in js::frontend::BytecodeEmitter::emitScript(js::frontend::ParseNode*) () from /home/julien/firefox-aurora-64b/libxul.so
#2  0x00007fffebcf0eb1 in BytecodeCompiler::compileScript(JS::Handle<JSObject*>, js::frontend::SharedContext*) () from /home/julien/firefox-aurora-64b/libxul.so
#3  0x00007fffebcf0d96 in js::frontend::CompileGlobalScript(JSContext*, js::LifoAlloc&, js::ScopeKind, JS::ReadOnlyCompileOptions const&, JS::SourceBufferHolder&, js::SourceCompressionTask*, js::ScriptSourceObject**) () from /home/julien/firefox-aurora-64b/libxul.so
#4  0x00007fffebc7428b in Evaluate(JSContext*, js::ScopeKind, JS::Handle<JSObject*>, JS::ReadOnlyCompileOptions const&, JS::SourceBufferHolder&, JS::MutableHandle<JS::Value>) ()
   from /home/julien/firefox-aurora-64b/libxul.so
#5  0x00007fffec254033 in JS::Evaluate(JSContext*, JS::ReadOnlyCompileOptions const&, char16_t const*, unsigned long, JS::MutableHandle<JS::Value>) () from /home/julien/firefox-aurora-64b/libxul.so
#6  0x00007fffebf65f7f in xpc::EvalInSandbox(JSContext*, JS::Handle<JSObject*>, nsAString_internal const&, nsACString_internal const&, int, JSVersion, JS::MutableHandle<JS::Value>) ()
   from /home/julien/firefox-aurora-64b/libxul.so
#7  0x00007fffebf5ba8b in nsXPCComponents_Utils::EvalInSandbox(nsAString_internal const&, JS::Handle<JS::Value>, JS::Handle<JS::Value>, nsACString_internal const&, int, JSContext*, unsigned char, JS::MutableHandle<JS::Value>) () from /home/julien/firefox-aurora-64b/libxul.so
#8  0x00007fffebe6bada in NS_InvokeByIndex () from /home/julien/firefox-aurora-64b/libxul.so
#9  0x00007fffeb6fa159 in XPCWrappedNative::CallMethod(XPCCallContext&, XPCWrappedNative::CallMode) () from /home/julien/firefox-aurora-64b/libxul.so
#10 0x00007fffeb6fbc0c in XPC_WN_CallMethod(JSContext*, unsigned int, JS::Value*) () from /home/julien/firefox-aurora-64b/libxul.so
#11 0x00000b14f2e150d4 in ?? ()
#12 0x00007fffffff7eaf in ?? ()
#13 0x00007fffffff80e8 in ?? ()
#14 0x00007fffffff81e8 in ?? ()
#15 0x0000000000000000 in ?? ()


I'm surprised because the top frame is not the same than what is in crash-stats.
Keywords: sec-other
Some notes:

* I can't repro it with a from-source aurora, using my "normal" mozconfig

* But the aurora from a.m.o crashes exactly as described above, with similarish
  backtrace (not sure)

* It also fails on valgrind, differently.  The stack actually sounds
  plausible.

==27626== Invalid write of size 4
==27626==    at 0x13ACD35F: JSRuntime::beginSingleThreadedExecution(JSContext*) (in /home/sewardj/MOZ/TMP_AURORA/RMME/firefox/libxul.so)
==27626==    by 0x13AB47E6: js::GeckoProfiler::enable(bool) (in /home/sewardj/MOZ/TMP_AURORA/RMME/firefox/libxul.so)
==27626==    by 0x13AB4ABA: js::EnableContextProfilingStack(JSContext*, bool) (in /home/sewardj/MOZ/TMP_AURORA/RMME/firefox/libxul.so)
==27626==    by 0x136F0294: PseudoStack::enableJSSampling() (in /home/sewardj/MOZ/TMP_AURORA/RMME/firefox/libxul.so)
==27626==    by 0x136EC6B3: profiler_start(int, double, char const**, unsigned int, char const**, unsigned int) (in /home/sewardj/MOZ/TMP_AURORA/RMME/firefox/libxul.so)
==27626==    by 0x136F54D5: nsProfiler::StartProfiler(unsigned int, double, char const**, unsigned int, char const**, unsigned int) (in /home/sewardj/MOZ/TMP_AURORA/RMME/firefox/libxul.so)
==27626==    by 0x14409AD9: ??? (in /home/sewardj/MOZ/TMP_AURORA/RMME/firefox/libxul.so)
==27626==    by 0x13C97EB1: XPCWrappedNative::CallMethod(XPCCallContext&, XPCWrappedNative::CallMode) (in /home/sewardj/MOZ/TMP_AURORA/RMME/firefox/libxul.so)
==27626==    by 0x13C99C0B: XPC_WN_CallMethod(JSContext*, unsigned int, JS::Value*) (in /home/sewardj/MOZ/TMP_AURORA/RMME/firefox/libxul.so)
==27626==    by 0x140B2A6D: InternalCall(JSContext*, js::AnyInvokeArgs const&) (in /home/sewardj/MOZ/TMP_AURORA/RMME/firefox/libxul.so)
==27626==    by 0x140A130D: Interpret(JSContext*, js::RunState&) (in /home/sewardj/MOZ/TMP_AURORA/RMME/firefox/libxul.so)
==27626==    by 0x140A0578: js::RunScript(JSContext*, js::RunState&) (in /home/sewardj/MOZ/TMP_AURORA/RMME/firefox/libxul.so)
==27626==  Address 0x0 is not stack'd, malloc'd or (recently) free'd
That is indeed a plausible stack.

Nightly and Aurora differ in the fact that Nightly gets --enable-profiling by default.

http://searchfox.org/mozilla-central/source/js/moz.configure#133
Ah, so we might be able to repro it on trunk by building with --disable-profile?
I'd expect it to be called --disable-profiling but I don't know if we have the magic to make it do anything. Worth a try though.
ted confirmed on IRC that --disable-profiling is a thing that exists.
NI myself to check if this works and bisect on nightly if this does.
Flags: needinfo?(felash)
This reproduces trivially on trunk when built with --disable-profiling.

I assume that there is some "boundary" of functions that we should not call into
in this case, and this is why it faults.  But I'm unclear where that is.  Nick,
do you know?  For sure we shouldn't be getting as far as DoNativeBacktrace if
the profiler is disabled.

The m-c crash stack looks like this:

#0  0x00007fffeb0cf59e in DoNativeBacktrace() at tools/profiler/core/platform.cpp:850
#1  0x00007fffeb0cf59e in Tick() at tools/profiler/core/platform.cpp:974
#2  0x00007fffeb0cb1e5 in profiler_get_backtrace() at tools/profiler/core/platform.cpp:3030
#3  0x00007fffea344aba in AddLayoutFlushObserver() at ff-O1-linux64-clang-noprof/dist/include/nsRefreshDriver.h:181
#4  0x00007fffea32e52d in ScheduleReflow() at layout/base/PresShell.cpp:9094
#5  0x00007fffea32e52d in MaybeScheduleReflow() at layout/base/PresShell.cpp:9082
#6  0x00007fffea32b218 in FrameNeedsReflow() at layout/base/PresShell.cpp:2798
#7  0x00007fffea32a96b in Initialize() at layout/base/PresShell.cpp:1827
#8  0x00007fffe8ea85bc in StartLayout() at dom/base/nsContentSink.cpp:1239
#9  0x00007fffe8ea847d in StyleSheetLoaded() at dom/base/nsContentSink.cpp:251
Flags: needinfo?(n.nethercote)
After bisect, I found the bad commit is Bug 1339695 part 7.
Flags: needinfo?(felash)
(In reply to Julian Seward [:jseward] from comment #15)
> This reproduces trivially on trunk when built with --disable-profiling.
> 
> I assume that there is some "boundary" of functions that we should not call
> into
> in this case, and this is why it faults.  But I'm unclear where that is. 
> Nick,
> do you know?  For sure we shouldn't be getting as far as DoNativeBacktrace if
> the profiler is disabled.
> 
> The m-c crash stack looks like this:
> 
> #0  0x00007fffeb0cf59e in DoNativeBacktrace() at
> tools/profiler/core/platform.cpp:850
> #1  0x00007fffeb0cf59e in Tick() at tools/profiler/core/platform.cpp:974
> #2  0x00007fffeb0cb1e5 in profiler_get_backtrace() at
> tools/profiler/core/platform.cpp:3030
> #3  0x00007fffea344aba in AddLayoutFlushObserver() at
> ff-O1-linux64-clang-noprof/dist/include/nsRefreshDriver.h:181

The main way that samples are taken is via SamplerThread::Run(), where we do it periodically when the profiler is active.

The other way is via ad hoc backtrace requests, which are done via profiler_get_backtrace(), as in this case. It does seem surprising that it gets as far as Tick()...

(I'd really love to get rid of profiler_get_backtrace(), because it adds a great deal of complexity. But it seems like a pretty important feature, alas.)


(In reply to Julien Wajsberg [:julienw] from comment #16)
> After bisect, I found the bad commit is Bug 1339695 part 7.

Hmm, that's surprising, and at first glance doesn't seems to make sense w.r.t. comment 15.

Anyway, I'll take a look on Monday. Thanks for the info!
Assignee: nobody → n.nethercote
Flags: needinfo?(n.nethercote)
Yeah I agree. I'll double check Monday too.
I bisected the regression down to Bug 1339695 part 8, not part 7. Part 8 was a much bigger change, so it's more plausible.
Ah! --disable-profiling is a bit of a confusing name. It doesn't mean "don't run the Gecko profiler". It means "disable some stuff that is required for native stackwalking to occur". So when it's specified, we should still be able to run the Gecko Profiler, but we'll get pseudo-stacks instead of native stacks. (Though it depends on the platform.)

Anyway, I've worked out that the problem here relates to a couple of uses of HAVE_NATIVE_UNWIND.
I don't think this needs to be s-s, because it's a null deref -- we fail to set TickSample::context(), but its constructor defaults it to nullptr, and we deref it. (I'm not sure why the crashes are showing up a few lines later in practice. That's a bit surprising. Maybe an artifact of compiler optimizations?)
Marginally OT, but could we (not in this bug) just get rid of this configure option?
It strikes me as extra verification complexity for no clear purpose.
Bug 1339695 part 8 unintentionally changed behaviour in profiler_init() when
MOZ_PROFILING is undefined. This patch undoes that change.
Attachment #8851473 - Flags: review?(mstange)
This patch simplifies and increases the consistency of how HAVE_NATIVE_UNWIND
is used.

- Its definition is moved from platform.h to platform.cpp, because the latter
  is the only file that uses it.

- It's now defined in the same place as USE_{NS,EHABI,LUL}_STACKWALK, and used
  in preference to those, where possible. Also, it's now defined on Linux and
  Android even if MOZ_PROFILING is not.

- HAVE_NATIVE_UNWIND is now used consistently and by itself for all relevant
  conditions, including when defining the presence and use of the "stackwalk"
  feature.

- The patch inlines and removes is_native_unwinding_avail().

Note that MOZ_PROFILING must be defined for HAVE_NATIVE_UNWIND to be true on
Windows and Mac, but not on Linux and Android.
Attachment #8851478 - Flags: review?(mstange)
Attachment #8851473 - Flags: review?(mstange) → review+
Comment on attachment 8851478 [details] [diff] [review]
(part 2) - Fix up HAVE_NATIVE_UNWIND

Review of attachment 8851478 [details] [diff] [review]:
-----------------------------------------------------------------

Good!

In bug 1340721, win 32 will join the list of platforms that can HAVE_NATIVE_UNWIND without --enable-profiling.
Attachment #8851478 - Flags: review?(mstange) → review+
Obvious question: why didn't testing identify this? Do we not have any tests that exercise profiler_get_backtrace()?
Here's a version for backporting to Aurora. The difference with the patch that
landed on m-c is trivial.
Comment on attachment 8851835 [details] [diff] [review]
(part 1) - Fix a crash in the profiler when MOZ_PROFILING is undefined

Review of attachment 8851835 [details] [diff] [review]:
-----------------------------------------------------------------

Carrying over r+ from the m-c version.
Attachment #8851835 - Flags: review+
Comment on attachment 8851835 [details] [diff] [review]
(part 1) - Fix a crash in the profiler when MOZ_PROFILING is undefined

Approval Request Comment

[Feature/Bug causing the regression]: Bug 1339695 part 8

[User impact if declined]: Startup crashes when running the profiler on Linux on Aurora and later channels.

[Is this code covered by automated tests?]: No, sadly enough. I just filed bug 1351136 to improve test coverage of the profiler.

[Has the fix been verified in Nightly?]: I verified it in a local build.

[Needs manual test from QE? If yes, steps to reproduce]: Yes. On Linux, get a build with --disable-profiling (either from a local build, or by using Aurora).
Install the profiler add-on from https://perf-html.io/ in a new profile, then restart the browser. It should crash immediately at startup.

[List of other uplifts needed for the feature/fix]: None.

[Is the change risky?]: No.

[Why is the change risky/not risky?]: Single-line change.

[String changes made/needed]: None.
Attachment #8851835 - Flags: approval-mozilla-aurora?
Group: core-security → core-security-release
Marking fixed because this landed on central in comment 8.
Status: NEW → RESOLVED
Closed: 7 years ago
Resolution: --- → FIXED
Comment on attachment 8851835 [details] [diff] [review]
(part 1) - Fix a crash in the profiler when MOZ_PROFILING is undefined

Fix a crash when running the profiler on Linux on Aurora. Aurora54+.
Attachment #8851835 - Flags: approval-mozilla-aurora? → approval-mozilla-aurora+
Flags: qe-verify+
Group: core-security-release
Reproduced the crash on Aurora 54.0a2 using STR from comment 30.
- https://crash-stats.mozilla.com/report/index/18d05fff-c0d5-45c8-80d1-ab9da0170424.

I can no longer reproduce the it on Ubuntu 16.04 x64 LTS with latest Nightly 55.0a1 (2017-04-24) and 54 beta 1 (20170420081804).
Status: RESOLVED → VERIFIED
Flags: qe-verify+
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: