Closed Bug 1566540 Opened 4 months ago Closed 4 months ago

[10.15] Crash in [@ CrashReporter::TerminateHandler]

Categories

(Toolkit :: Crash Reporting, defect, P1, critical)

Unspecified
macOS
defect

Tracking

()

RESOLVED FIXED
mozilla70
Tracking Status
firefox-esr60 --- unaffected
firefox-esr68 --- fixed
firefox68 --- wontfix
firefox69 --- fixed
firefox70 --- fixed

People

(Reporter: marcia, Assigned: haik)

References

(Blocks 1 open bug)

Details

(Keywords: crash, regression, topcrash)

Crash Data

Attachments

(3 files)

This bug is for crash report bp-81c9333d-4860-4800-a8cb-6f58a0190716. All of the crashes are from 10.15 users running 10.15.0 19A501i.

Seen while looking at nightly crash stats: https://bit.ly/2GgJOpY. Crashes started in 20190715214335.

Not sure if this is something we regressed or whether this was around the time the third beta came out.

Possible regression range based on Build ID: https://hg.mozilla.org/mozilla-central/pushloghtml?fromchange=30b8d57cb72a2f955532a0e670c599881a110f17&tochange=c0bcda96a954fe7a3700466bda256aea58189ac9

Top 9 frames of crashing thread:

0 XUL CrashReporter::TerminateHandler toolkit/crashreporter/nsExceptionHandler.cpp:1380
1  @0x7fff69db6da6 
2  @0x7fff69db6b54 
3  @0x7fff69da834e 
4  @0x7fff68ad42af 
5  @0x7fff68adb6de 
6  @0x7fff68adcb35 
7  @0x7fff6cdd3da9 
8  @0x7fff6cdd06ae 

Blocks: catalina

https://bit.ly/2JQ5Gtt has similar stuff in the stack, may be the same issue since they are both 10.15 crashes.

"No proper signature could be created because no good data for the crashing thread" shows up in the reports. All of them show content process rdd. 503 crashes/51 installs so far. Andrew - Any ideas what might have triggered this?

Flags: needinfo?(continuation)

Looking at the regression range in comment 0, bug 1560368 involves RDD. Bug 1546299 talks about the Mac, but I don't know how signing might cause these stacks to be so bad.

Flags: needinfo?(continuation)

Hello Michael and Aki - Adding you both per Comment 3 with the hope of finding out what might have caused this macOS 10.15 spike.

Flags: needinfo?(mfroman)
Flags: needinfo?(aki)

Bug 1546299 is only for the geckodriver binary, which appears to be for internal testing, e.g. for marionette tests. This happens in a task separate from the actual build artifact signing. I would be surprised if that's the cause of the crashreporter spike.

Flags: needinfo?(aki)

(In reply to Marcia Knous [:marcia - needinfo? me] from comment #0)

Not sure if this is something we regressed or whether this was around the time the third beta came out.

Do we know what channel it's from? The third beta would be the first Firefox beta aiui (the first two are devedition-only), making it the first Firefox Beta that can run on Catalina at all. If that's it, this may be the baseline of Catalina crashes on the beta channel.

Could this be related to bug 1556846? That is supposed to fix an RDD crash, but it wasn't uplifted to beta until the 17th.

Sorry I confused everyone by using beta terminology. I meant the macOS 10.15 developer betas. One (19A512f) was just pushed the other day. The crash reports show both that version and the previous version (10.15.0 19A512f) crashing.

The change in bug 1560368 added Opus decoding on RDD, but this is not the first decoder to run on RDD. However, it would be a simple change to pref-off Opus RDD decoding[1] and see if it moves the crash report needle.

[1] https://searchfox.org/mozilla-central/source/modules/libpref/init/StaticPrefList.h#5918

Flags: needinfo?(mfroman)
See Also: → 1556846
Regressions: 1560368

I guess we can try what Michael suggests in Comment 9. We have 745 crashes/105 installations so far, all on 10.15 (the latest seed, 10.15.0 19A512f).

The bug in Comment 7 landed in Nightly 70 on 7-10, which doesn't exactly map to the regression range since these crashes started in the 7-15 build. Adding Haik in case he has any insight here.

Flags: needinfo?(haftandilian)

Here's the crashing stack. I'll attach a listing of all the thread stacks.

Process 82598 stopped
* thread #12, stop reason = EXC_BAD_ACCESS (code=1, address=0x0)
    frame #0: 0x000000010fa95b85 XUL`CrashReporter::TerminateHandler() (.llvm.5017237185227756273) + 21
XUL`CrashReporter::TerminateHandler() (.llvm.5017237185227756273):
->  0x10fa95b85 <+21>: movl   $0x564, 0x0               ; imm = 0x564 
    0x10fa95b90 <+32>: callq  0x1105ea1c8               ; symbol stub for: abort
    0x10fa95b95 <+37>: nopw   %cs:(%rax,%rax)
    0x10fa95b9f <+47>: nop    
Target 0: (plugin-container) stopped.
(lldb) bt
* thread #12, stop reason = EXC_BAD_ACCESS (code=1, address=0x0)
  * frame #0: 0x000000010fa95b85 XUL`CrashReporter::TerminateHandler() (.llvm.5017237185227756273) + 21
    frame #1: 0x00007fff63b29da7 libc++abi.dylib`std::__terminate(void (*)()) + 8
    frame #2: 0x00007fff63b29b55 libc++abi.dylib`__cxxabiv1::failed_throw(__cxxabiv1::__cxa_exception*) + 27
    frame #3: 0x00007fff63b1b34f libc++abi.dylib`__cxa_throw + 113
    frame #4: 0x00007fff628492b0 caulk`check_posix_error(char const*, int) + 168
    frame #5: 0x00007fff628506df caulk`caulk::thread::attributes::apply_to_this_thread() + 35
    frame #6: 0x00007fff62851b36 caulk`void* caulk::thread_proxy<std::__1::tuple<caulk::thread::attributes, void (caulk::concurrent::details::worker_thread::*)(), std::__1::tuple<caulk::concurrent::details::worker_thread*> > >(void*) + 15
    frame #7: 0x00007fff66b40cce libsystem_pthread.dylib`_pthread_start + 125
    frame #8: 0x00007fff66b3d72b libsystem_pthread.dylib`thread_start + 15
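
The backtrace above shows libc++abi's std::__terminate calling into CrashReporter::TerminateHandler, which is why every report carries that signature regardless of what threw. A minimal sketch of that mechanism, assuming a handler registered with std::set_terminate; SketchTerminateHandler and InstallSketchHandler are hypothetical names for illustration, not the code in nsExceptionHandler.cpp:

```cpp
#include <cassert>
#include <cstdio>
#include <cstdlib>
#include <exception>

// Hypothetical stand-in for the role a crash reporter's terminate handler
// plays: it runs when a C++ exception goes uncaught, then ends the process.
[[noreturn]] void SketchTerminateHandler() {
  std::fputs("terminate handler reached: uncaught C++ exception\n", stderr);
  std::abort();
}

// Registers the handler process-wide and returns the previously installed
// one so a caller can restore it.
std::terminate_handler InstallSketchHandler() {
  std::terminate_handler previous = std::set_terminate(SketchTerminateHandler);
  // Postcondition: the runtime now reports our handler as current.
  assert(std::get_terminate() == SketchTerminateHandler);
  return previous;
}
```

Once such a handler is installed, an exception that escapes any thread's entry function ends up in it, including exceptions thrown by system library threads the application never created. The disassembly above shows the real handler storing to address 0x0 and then calling abort, so the process crashes inside the handler itself.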

Full stack list. I noticed this thread stack in cubeb_init called from mozilla::OpusDataDecoder::Init().

  thread #14
    frame #0: 0x00007fff2f41fcaf CoreFoundation`__CFSearchStringROM + 44
    frame #1: 0x00007fff2f41f76b CoreFoundation`__CFStringCreateImmutableFunnel3 + 1988
    frame #2: 0x00007fff2f425824 CoreFoundation`CFStringCreateWithBytes + 27
    frame #3: 0x00007fff2f4706df CoreFoundation`_createUniqueStringWithUTF8Bytes + 165
    frame #4: 0x00007fff2f432dcc CoreFoundation`parseStringTag + 1544
    frame #5: 0x00007fff2f430b30 CoreFoundation`parseXMLElement + 822
    frame #6: 0x00007fff2f4312ff CoreFoundation`parseXMLElement + 2821
    frame #7: 0x00007fff2f430c2d CoreFoundation`parseXMLElement + 1075
    frame #8: 0x00007fff2f430269 CoreFoundation`_CFPropertyListCreateFromUTF8Data + 1884
    frame #9: 0x00007fff2f515c96 CoreFoundation`_CFPropertyListCreateWithData + 600
    frame #10: 0x00007fff2f42f58a CoreFoundation`CFPropertyListCreateWithData + 51
    frame #11: 0x00007fff2f441b4e CoreFoundation`_CFBundleCopyInfoDictionaryInDirectoryWithVersion + 814
    frame #12: 0x00007fff2f5c36dd CoreFoundation`_CFBundleRefreshInfoDictionaryAlreadyLocked + 111
    frame #13: 0x00007fff2f44180c CoreFoundation`CFBundleGetInfoDictionary + 33
    frame #14: 0x00007fff2f515403 CoreFoundation`_CFBundleCreate + 715
    frame #15: 0x00007fff2f47b6f5 CoreFoundation`_CFBundleEnsureBundleExistsForImagePath + 55
    frame #16: 0x00007fff2f47b59d CoreFoundation`CFBundleGetBundleWithIdentifier + 221
    frame #17: 0x00007fff2e9bf927 CoreAudio`HALSystem::InitializeDevices() + 329
    frame #18: 0x00007fff2e9be891 CoreAudio`HALSystem::CheckOutInstance() + 161
    frame #19: 0x00007fff2e9c3c84 CoreAudio`AudioObjectSetPropertyData + 184
    frame #20: 0x000000010ef05bdd XUL`audiounit_init + 125
    frame #21: 0x000000010ef05181 XUL`cubeb_init + 177
    frame #22: 0x000000010d9f53e7 XUL`mozilla::CubebUtils::GetCubebContextUnlocked() + 855
    frame #23: 0x000000010dc91299 XUL`mozilla::OpusDataDecoder::Init() + 2089
    frame #24: 0x000000010dbf4ae8 XUL`mozilla::RemoteDecoderParent::RecvInit() + 56
    frame #25: 0x000000010bb68722 XUL`mozilla::PRemoteDecoderParent::OnMessageReceived(IPC::Message const&) + 178
    frame #26: 0x000000010bb652ee XUL`mozilla::PRemoteDecoderManagerParent::OnMessageReceived(IPC::Message const&) + 1006
    frame #27: 0x000000010b817ba3 XUL`mozilla::ipc::MessageChannel::DispatchMessage(IPC::Message&&) + 467
    frame #28: 0x000000010b819278 XUL`mozilla::ipc::MessageChannel::MessageTask::Run() + 440
    frame #29: 0x000000010b0d8c23 XUL`nsThread::ProcessNextEvent(bool, bool*) + 3411
    frame #30: 0x000000010b0db629 XUL`NS_ProcessNextEvent(nsIThread*, bool) + 73
    frame #31: 0x000000010b81d02a XUL`mozilla::ipc::MessagePumpForNonMainThreads::Run(base::MessagePump::Delegate*) + 282
    frame #32: 0x000000010b0d5fd2 XUL`nsThread::ThreadFunc(void*) + 498
    frame #33: 0x00000001066dae85 libnss3.dylib`_pt_root + 357
    frame #34: 0x00007fff66b40cce libsystem_pthread.dylib`_pthread_start + 125
    frame #35: 0x00007fff66b3d72b libsystem_pthread.dylib`thread_start + 15

On Nightly, setting media.rdd-opus.enabled to false and restarting avoided the crash for me on macOS 10.15 Beta 4.

Flags: needinfo?(haftandilian)

This might be another sandboxing issue.

Assignee: nobody → haftandilian
Priority: -- → P1

Adding the topcrash keyword since this is #2 overall in 70 nightly.

Keywords: topcrash

This is a sandboxing issue on macOS 10.15. I'm testing a fix and should have it out for review today or tomorrow. Details below.

The cause of the crash is new code in macOS 10.15 that throws an exception when the pthread function pthread_setname_np fails on some macOS internal library threads. The exception is not caught, so the process terminates. The function fails in the RDD process because of sandboxing restrictions: the setcontrol variant of the proc_info syscall is not allowed. We allow this in content, but not in RDD, only because it was not known to be needed. The fix for bug 1560368 exposed this problem.

With a debug build, this is the call to pthread_setname_np that fails and causes the crash.

frame #0: 0x00007fff70e8598c libsystem_pthread.dylib`pthread_setname_np
frame #1: 0x00007fff6cb901f9 caulk`caulk::mach::this_thread::set_name(char const*) + 9
frame #2: 0x00007fff6cb976df caulk`caulk::thread::attributes::apply_to_this_thread() + 35
frame #3: 0x00007fff6cb98b36 caulk`void* caulk::thread_proxy<std::__1::tuple<caulk::thread::attributes, void (caulk::concurrent::details::worker_thread::*)(), std::__1::tuple<caulk::concurrent::details::worker_thread*> > >(void*) + 15
frame #4: 0x00007fff70e87cce libsystem_pthread.dylib`_pthread_start + 125
frame #5: 0x00007fff70e8472b libsystem_pthread.dylib`thread_start + 15

But this is the actual crash stack after the exception handling.

libc++abi.dylib: terminating with uncaught exception of type std::__1::system_error: pthread_setname_np failed: Operation not permitted

frame #0: 0x00007fff70dca6ce libsystem_kernel.dylib`__pthread_kill + 10
frame #1: 0x00007fff70e87691 libsystem_pthread.dylib`pthread_kill + 258
frame #2: 0x00007fff70d52a5c libsystem_c.dylib`abort + 120
frame #3: 0x00007fff6de63bc8 libc++abi.dylib`abort_message + 231
frame #4: 0x00007fff6de63d64 libc++abi.dylib`demangling_terminate_handler() + 238
frame #5: 0x00007fff6f94ad52 libobjc.A.dylib`_objc_terminate() + 104
frame #6: 0x00007fff6de70da7 libc++abi.dylib`std::__terminate(void (*)()) + 8
frame #7: 0x00007fff6de70b55 libc++abi.dylib`__cxxabiv1::failed_throw(__cxxabiv1::__cxa_exception*) + 27
frame #8: 0x00007fff6de6234f libc++abi.dylib`__cxa_throw + 113
frame #9: 0x00007fff6cb902b0 caulk`check_posix_error(char const*, int) + 168
frame #10: 0x00007fff6cb976df caulk`caulk::thread::attributes::apply_to_this_thread() + 35
frame #11: 0x00007fff6cb98b36 caulk`void* caulk::thread_proxy<std::__1::tuple<caulk::thread::attributes, void (caulk::concurrent::details::worker_thread::*)(), std::__1::tuple<caulk::concurrent::details::worker_thread*> > >(void*) + 15
frame #12: 0x00007fff70e87cce libsystem_pthread.dylib`_pthread_start + 125
frame #13: 0x00007fff70e8472b libsystem_pthread.dylib`thread_start + 15
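
The two stacks above suggest a simple pattern: a helper turns a non-zero POSIX return code into a std::system_error, and nothing on the worker thread catches it. A minimal sketch of that pattern, assuming the helper behaves as its name implies; the body of check_posix_error here is a guess for illustration (not Apple's caulk code), and fake_pthread_setname_np is a hypothetical stub that simulates the sandbox denial:

```cpp
#include <cassert>
#include <cerrno>
#include <string>
#include <system_error>

// Hypothetical reimplementation: throw if a POSIX-style call reported failure.
void check_posix_error(const char* what, int rc) {
  if (rc != 0) {
    throw std::system_error(rc, std::generic_category(), what);
  }
}

// Stub simulating pthread_setname_np under a sandbox that denies the
// proc_info setcontrol call: it always reports EPERM.
int fake_pthread_setname_np(const char* /*name*/) { return EPERM; }

// Returns the exception text if naming the thread fails, or an empty string
// on success. The library thread in the report has no such catch block, so
// there the exception reaches std::terminate instead.
std::string try_set_name(const char* name) {
  assert(name != nullptr);
  try {
    check_posix_error("pthread_setname_np failed",
                      fake_pthread_setname_np(name));
  } catch (const std::system_error& e) {
    return e.what();
  }
  return "";
}
```

Calling try_set_name("AMCP Logging Spool") returns text that includes "pthread_setname_np failed", matching the shape of the abort message above. Catching the exception is only possible in this sketch; in the real case the throw happens inside a system library thread, which is why the fix has to make the call succeed instead.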

The reason the fix for bug 1560368 exposes the crash appears to be that it causes these macOS library threads to be spawned which use the caulk code. One thread is named "AMCP Logging Spool".

thread #11, name = 'AMCP Logging Spool'
frame #0: 0x00007fff70dc43d2 libsystem_kernel.dylib`semaphore_wait_trap + 10
frame #1: 0x00007fff6cb99eb6 caulk`caulk::mach::semaphore::wait() + 16
frame #2: 0x00007fff6cb95452 caulk`caulk::semaphore::timed_wait(double) + 106
frame #3: 0x00007fff6cb98a04 caulk`caulk::concurrent::details::worker_thread::run() + 30
frame #4: 0x00007fff6cb98b54 caulk`void* caulk::thread_proxy<std::__1::tuple<caulk::thread::attributes, void (caulk::concurrent::details::worker_thread::*)(), std::__1::tuple<caulk::concurrent::details::worker_thread*> > >(void*) + 45
frame #5: 0x00007fff70e87cce libsystem_pthread.dylib`_pthread_start + 125
frame #6: 0x00007fff70e8472b libsystem_pthread.dylib`thread_start + 15

thread #12
frame #0: 0x00007fff70dc43d2 libsystem_kernel.dylib`semaphore_wait_trap + 10
frame #1: 0x00007fff6cb99eb6 caulk`caulk::mach::semaphore::wait() + 16
frame #2: 0x00007fff6cb95452 caulk`caulk::semaphore::timed_wait(double) + 106
frame #3: 0x00007fff6cb98a04 caulk`caulk::concurrent::details::worker_thread::run() + 30
frame #4: 0x00007fff6cb98b54 caulk`void* caulk::thread_proxy<std::__1::tuple<caulk::thread::attributes, void (caulk::concurrent::details::worker_thread::*)(), std::__1::tuple<caulk::concurrent::details::worker_thread*> > >(void*) + 45
frame #5: 0x00007fff70e87cce libsystem_pthread.dylib`_pthread_start + 125
frame #6: 0x00007fff70e8472b libsystem_pthread.dylib`thread_start + 15

For the fix, we must add access to process-info-setcontrol (target self) in the utility sandbox (used by the RDD process). The utility sandbox has a (deny process-info*) rule which blocks all proc_info syscalls unless they are explicitly allowed. We should also add this to the GMP process to avoid this problem happening with GMP in the future. The web content and Flash plugin sandboxes already allow process-info-setcontrol.
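
As an illustration of the rule described above, here is a minimal sketch that keeps a policy fragment in a C++ raw string and checks that the allow rule is present. The fragment contains only the two rules named in this comment; the exact rule syntax, the constant kSketchUtilityPolicy, and the helper PolicyAllowsSetControl are assumptions for illustration, not the contents of the real utility policy:

```cpp
#include <cassert>
#include <string>

// Hypothetical policy fragment. The deny rule blocks every proc_info variant;
// the allow rule re-enables only setcontrol, and only for the process itself.
static const char kSketchUtilityPolicy[] = R"SANDBOX(
  (deny process-info*)
  (allow process-info-setcontrol (target self))
)SANDBOX";

// Returns true if the policy text contains the setcontrol allow rule.
// This is a plain substring check, not a parser for the policy language.
bool PolicyAllowsSetControl(const std::string& policy) {
  assert(!policy.empty());
  return policy.find("(allow process-info-setcontrol (target self))") !=
         std::string::npos;
}
```

The allow rule is scoped to the calling process, so it widens the sandbox only enough for a thread to rename itself.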

To avoid crashing in macOS 10.15, allow access to the proc_info PROC_INFO_CALL_SETCONTROL syscall variant in the GMP and RDD sandboxes.

The implementation of the proc_info syscall for 10.14.1 (which is the latest macOS release for which it is available at this time) can be found here: https://opensource.apple.com/source/xnu/xnu-4903.221.2/bsd/kern/proc_info.c.auto.html See proc_setcontrol.

Pushed by haftandilian@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/295ef3d15d11
[10.15] Crash in [@ CrashReporter::TerminateHandler] r=spohl
Status: NEW → RESOLVED
Closed: 4 months ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla70

Needinfo to myself to file an uplift request after I do some testing on Beta.

Flags: needinfo?(haftandilian)
Blocks: 1560368

Comment on attachment 9080126 [details]
Bug 1566540 - [10.15] Crash in [@ CrashReporter::TerminateHandler] r?spohl

ESR Uplift Approval Request

  • If this is not a sec:{high,crit} bug, please state case for ESR consideration: Not having the patch might expose us to Widevine (e.g. Netflix) or AV1 crashes on macOS 10.15 which is currently in Beta and expected to release in the September timeframe.

This is not needed in ESR 60 because the GMP sandboxing code was less restrictive at that time and the RDD process (for AV1 decoding) was not enabled.

  • User impact if declined: On macOS 10.15, users might experience crashes during Widevine decoding (such as Netflix playback). The crash is triggered by macOS library code which is not well understood. If we don't include the patch, another Firefox fix or a change to macOS might trigger the crashing code.
  • Fix Landed on Version: 70
  • Risk to taking this patch: Low
  • Why is the change risky/not risky? (and alternatives if risky): The change is limited to macOS sandboxing code and only adds an allow rule which is unlikely to cause regressions.
  • String or UUID changes made by this patch:

Beta/Release Uplift Approval Request

  • User impact if declined: On macOS 10.15, users might experience crashes during Widevine decoding (such as Netflix playback) or AV1 decoding. The crash is only happening with AV1 decoding with the fix for bug 1560368 which is only on 70. However, the crash is triggered by macOS library code which is not well understood. If we don't include the patch, another Firefox fix or a change to macOS might trigger the crashing code.

The code is covered by automated tests, but the tests are not run on macOS 10.15 where this problem occurs.

  • Is this code covered by automated tests?: Yes
  • Has the fix been verified in Nightly?: Yes
  • Needs manual test from QE?: Yes
  • If yes, steps to reproduce: The crashes are not reproducible on releases earlier than 70.
  • List of other uplifts needed: Bug 1558924
  • Risk to taking this patch: Low
  • Why is the change risky/not risky? (and alternatives if risky): The change is limited to macOS sandboxing code and only adds an allow rule which is unlikely to cause regressions.
  • String changes made/needed:
Flags: needinfo?(haftandilian)
Attachment #9080126 - Flags: approval-mozilla-release?
Attachment #9080126 - Flags: approval-mozilla-esr68?
Attachment #9080126 - Flags: approval-mozilla-beta?
Flags: qe-verify+
QA Whiteboard: [qa-triaged]

I've tested on macOS 10.15 Beta 3 and Beta 4, but I wasn't able to reproduce the crash. I've tried to reproduce the crash on netflix.com.
I have the following results:

  • on macOS 10.15 Beta 3 (19A501i) using Nightly 70.0a1 (2019-07-15) -> Netflix doesn't work, an error is displayed, but Firefox doesn't crash.
  • on macOS 10.15 Beta 4 (19A512f) using Nightly 70.0a1 (2019-07-15), Nightly 70.0a1 (2019-07-17), Nightly 70.0a1 (2019-07-18) -> Netflix doesn't work, an error is displayed, but Firefox doesn't crash.

This is the Netflix error displayed:

Whoops, something went wrong...
Playback Error
There appears to be a problem with Firefox that is preventing Netflix from starting playback.
Please ensure that you are on the latest version of Firefox.
Error Code: F7702-1290

Note: Widevine version 4.10.1440.19

Any thoughts here? Should I try something else to be able to reproduce the crash?

Comment on attachment 9080126 [details]
Bug 1566540 - [10.15] Crash in [@ CrashReporter::TerminateHandler] r?spohl

Fix for a crash during video playback on macOS 10.15. Approved for 69.0b9.

Attachment #9080126 - Flags: approval-mozilla-beta? → approval-mozilla-beta+

(In reply to Camelia Badau [:cbadau], Release Desktop QA from comment #23)

Any thoughts here? Should I try something else to be able to reproduce the crash?

The Widevine crashes hit on Nightly are fixed in newer versions. Specifically, since earlier versions of Nightly were tested (2019-07-15 through 2019-07-18), the fix for bug 1566523 was not included, and that fix is needed for Widevine playback. But the crash being fixed here isn't reproducible with Widevine right now. More details below.

On Nightly, the crash should be reproducible only for AV1 content (such as https://demo.bitmovin.com/public/firefox/av1/), but it's not reproducible for Widevine. For AV1 content, it's the RDD process that crashes and the user only sees a playback error and not a full browser crash.

On Beta, the crashes aren't reproducible right now. The motivation for uplifting the patches is that a macOS change or an unrelated Firefox change could trigger the crash on 10.15 without the patch. For example, bug 1560368 triggers this crash because it indirectly causes macOS to create some threads in the RDD process which abort without the fix. See comment 16 for more information.

Sorry for not being more clear in the uplift request.

Comment on attachment 9080126 [details]
Bug 1566540 - [10.15] Crash in [@ CrashReporter::TerminateHandler] r?spohl

Approved for 68.1esr as well. Same as the other macOS 10.15 bugs, not approving for an Fx68 dot release, however. Let's aim to have these fixes ride with Fx69/68.1esr in September.

Attachment #9080126 - Flags: approval-mozilla-release?
Attachment #9080126 - Flags: approval-mozilla-release-
Attachment #9080126 - Flags: approval-mozilla-esr68?
Attachment #9080126 - Flags: approval-mozilla-esr68+
Duplicate of this bug: 1567278
Attached image error.png

I've tested with a demo from https://demo.bitmovin.com/demos/av1 (as you mentioned in comment 26) on macOS 10.15 Beta 3 (19A501i) using an old version of Nightly (2019-07-19) and the latest Nightly 70.0a1 (2019-08-05). I received a playback error on both Nightly builds. Is that ok? You can see the error in the "error.png" attachment.

Also, can someone who initially reproduced the problem check that it is now fixed and there is no crash anymore?

(In reply to Camelia Badau [:cbadau], Release Desktop QA from comment #30)

Created attachment 9083343 [details]
error.png

I've tested with a demo from https://demo.bitmovin.com/demos/av1 (as you mentioned in comment 26) on macOS 10.15 Beta 3 (19A501i) using an old version of Nightly (2019-07-19) and the latest Nightly 70.0a1 (2019-08-05). I received a playback error on both Nightly builds. Is that ok?

No, that error indicates we have a problem which is probably that the RDD process is crashing. We need to determine why we're getting that error on the latest Nightly. Could you file a new bug for this issue?

Due to bug 1570451, we can't test/debug on the latest Catalina build.

I've retested today on macOS 10.15 Beta 4 (19A512f) using latest Nightly 70.0a1 (2019-08-08) and the playback error mentioned in comment 30 isn't displayed anymore: the demo correctly plays and works. It seems that the error appears only on macOS 10.15 Beta 3, but it's fixed on macOS 10.15 Beta 4. In this case, I don't think it is necessary to log a new bug.

Flags: qe-verify+