crash near null in [@ mozilla::CrossProcessMutex::Lock]
Categories
(Testing :: Code Coverage, defect, P3)
Tracking
(Not tracked)
People
(Reporter: tsmith, Unassigned)
References
(Blocks 1 open bug)
Details
(Keywords: crash)
Attachments
(1 file)
1.06 MB, application/octet-stream
We are seeing this on fuzzing ccov builds. Found with m-c 20210509-486d7a9aaa0f. This seems to be hit fairly frequently.
==22747==ERROR: UndefinedBehaviorSanitizer: SEGV on unknown address 0x000000000008 (pc 0x7f6775da06f4 bp 0x7ffe72447fe0 sp 0x7ffe72447fe0 T22747)
==22747==The signal is caused by a READ memory access.
==22747==Hint: address points to the zero page.
#0 0x7f6775da06f4 in mozilla::CrossProcessMutex::Lock() /builds/worker/checkouts/gecko/ipc/glue/CrossProcessMutex_posix.cpp:118:22
#1 0x7f677cb78248 in BaseAutoLock /builds/worker/workspace/obj-build/dist/include/mozilla/Mutex.h:158:57
#2 0x7f677cb78248 in mozilla::CodeCoverageHandler::FlushCounters(bool) /builds/worker/checkouts/gecko/tools/code-coverage/CodeCoverageHandler.cpp:50:29
#3 0x7f679831f3bf (/lib/x86_64-linux-gnu/libpthread.so.0+0x153bf)
#4 0x7f6798314cd6 in __pthread_clockjoin_ex /build/glibc-eX1tMB/glibc-2.31/nptl/pthread_join_common.c:145:6
#5 0x7f677d4f0b7c in js::Thread::join() /builds/worker/checkouts/gecko/js/src/threading/posix/PosixThread.cpp:70:11
#6 0x7f677d68ff60 in join /builds/worker/checkouts/gecko/js/src/vm/HelperThreads.cpp:2390:36
#7 0x7f677d68ff60 in js::GlobalHelperThreadState::finishThreads() /builds/worker/checkouts/gecko/js/src/vm/HelperThreads.cpp:1424:13
#8 0x7f677d687524 in js::GlobalHelperThreadState::finish() /builds/worker/checkouts/gecko/js/src/vm/HelperThreads.cpp:1381:3
#9 0x7f677d6999f6 in DestroyHelperThreadsState /builds/worker/checkouts/gecko/js/src/vm/HelperThreads.cpp:97:23
#10 0x7f677d6999f6 in JS_ShutDown() /builds/worker/checkouts/gecko/js/src/vm/Initialization.cpp:240:3
#11 0x7f6775268000 in mozilla::ShutdownXPCOM(nsIServiceManager*) /builds/worker/checkouts/gecko/xpcom/build/XPCOMInit.cpp:730:5
#12 0x7f677d133b0e in XRE_TermEmbedding() /builds/worker/checkouts/gecko/toolkit/xre/nsEmbedFunctions.cpp:215:3
#13 0x7f6775e28dae in mozilla::ipc::ScopedXREEmbed::Stop() /builds/worker/checkouts/gecko/ipc/glue/ScopedXREEmbed.cpp:90:5
#14 0x7f677d1345cb in XRE_InitChildProcess(int, char**, XREChildData const*) /builds/worker/checkouts/gecko/toolkit/xre/nsEmbedFunctions.cpp:747:16
#15 0x5583c4255f53 in content_process_main /builds/worker/checkouts/gecko/browser/app/../../ipc/contentproc/plugin-container.cpp:57:28
#16 0x5583c4255f53 in main /builds/worker/checkouts/gecko/browser/app/nsBrowserApp.cpp:313:18
#17 0x7f6797de10b2 in __libc_start_main /build/glibc-eX1tMB/glibc-2.31/csu/../csu/libc-start.c:308:16
#18 0x5583c4232d77 in _start (/home/worker/builds/m-c-20210509213623-ccov-fuzzing-opt/firefox-bin+0x37d77)
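For reference, here is a minimal stand-alone sketch of why the faulting address is a small offset from the zero page. The class and struct layout below are assumptions for illustration only, not the real CrossProcessMutex; the point is that a Lock() that dereferences a null shared-memory pointer ends up reading at nullptr plus a small member offset, e.g. 0x8.

// Illustration only -- invented layout, not the actual CrossProcessMutex.
#include <pthread.h>

struct SharedMutexData {
  long mGeneration;        // hypothetical first field (offset 0x0)
  pthread_mutex_t mMutex;  // hypothetical second field (offset 0x8 on x86-64)
};

class ToyCrossProcessMutex {
 public:
  explicit ToyCrossProcessMutex(SharedMutexData* aShared) : mShared(aShared) {}

  void Lock() {
    // If mShared is null (for example, the shared mapping was torn down or was
    // never attached in this process), this reads at nullptr plus the offset
    // of mMutex, i.e. a tiny address such as 0x8 -- matching the SEGV above.
    pthread_mutex_lock(&mShared->mMutex);
  }

 private:
  SharedMutexData* mShared;
};

int main() {
  ToyCrossProcessMutex mutex(nullptr);  // simulate the bad state
  mutex.Lock();                         // crashes with a near-null read
  return 0;
}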
Reporter
Comment 1•4 years ago
Calixte, we are seeing this issue very often while trying to collect fuzzing coverage. Are you able to have a look?
Comment 2•4 years ago
Is this happening at shutdown?
Comment 3•4 years ago
Sorry, I've had a ton of work, so I forgot to look into this bug.
Anyway, do you have an easy way to reproduce it?
If not, could you give me the steps to set up the required environment to reproduce?
Updated•4 years ago
Reporter
Comment 4•4 years ago
Here is a log from one of our instances for reference. It looks like the issue happens when SIGUSR1 is sent.
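For illustration, the mechanism implied by the backtrace in the description and by this log (a coverage flush triggered by SIGUSR1) is roughly the following. This is a simplified sketch with placeholder names, not the actual CodeCoverageHandler code; it also shows why doing this work from a signal handler is risky.

// Simplified sketch of a flush-on-SIGUSR1 handler; placeholder code, not the
// real CodeCoverageHandler. Build with --coverage so the gcov/profile runtime
// that provides __gcov_dump()/__gcov_reset() gets linked in.
#include <csignal>
#include <unistd.h>

extern "C" void __gcov_dump(void);
extern "C" void __gcov_reset(void);

static void FlushCountersOnSignal(int /*aSignal*/) {
  // Everything here runs in signal-handler context. The gcov runtime uses
  // stdio and malloc (see the backtraces in this bug), which are not
  // async-signal-safe; the real handler additionally takes a CrossProcessMutex
  // first, per the trace in the description.
  __gcov_dump();
  __gcov_reset();
}

static void InstallCoverageSignalHandler() {
  struct sigaction sa = {};
  sa.sa_handler = FlushCountersOnSignal;
  sigemptyset(&sa.sa_mask);
  sa.sa_flags = SA_RESTART;
  sigaction(SIGUSR1, &sa, nullptr);
}

int main() {
  InstallCoverageSignalHandler();
  pause();  // an external `kill -USR1 <pid>` triggers a flush
  return 0;
}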
Comment 5•4 years ago
If I understand the log correctly, you have crashes at __lll_lock_wait_private too, is that correct?
If so, could you get a backtrace for this signature, please?
Reporter
Comment 6•4 years ago
(In reply to Calixte Denizet (:calixte) from comment #3)
> Sorry, I've had a ton of work, so I forgot to look into this bug.
No problem :)
> Anyway, do you have an easy way to reproduce it?
Yes, running a few instances of Grizzly in parallel with --coverage seems to trigger the issue within a few minutes.
> If not, could you give me the steps to set up the required environment to reproduce?
You can run a few instances of Grizzly running the no-op adapter (four triggers the issue quickly).
Using a build from TC, it looks something like this:
GCOV_PREFIX_STRIP=4 GCOV_PREFIX=/home/user/workspace/browsers/m-c-20210518160751-ccov-fuzzing-opt/ python3 -m grizzly ~/workspace/browsers/m-c-20210518160751-ccov-fuzzing-opt/firefox no-op --coverage --xvfb
Reporter
Comment 7•4 years ago
(In reply to Calixte Denizet (:calixte) from comment #5)
> If I understand the log correctly, you have crashes at __lll_lock_wait_private too, is that correct?
Yes, I think it's the same as bug 1702620, though I'm not 100% sure.
> If so, could you get a backtrace for this signature, please?
==11212==ERROR: UndefinedBehaviorSanitizer: ABRT on unknown address 0x03e80000047d (pc 0x7f5d3165f74b bp 0x000000001000 sp 0x7ffec5999408 T11212)
#0 0x7f5d3165f74b in __lll_lock_wait_private /build/glibc-eX1tMB/glibc-2.31/nptl/./lowlevellock.c:35:7
#1 0x7f5d316654aa in malloc /build/glibc-eX1tMB/glibc-2.31/malloc/malloc.c:3064:3
#2 0x7f5d3164ce83 in _IO_file_doallocate /build/glibc-eX1tMB/glibc-2.31/libio/filedoalloc.c:101:7
#3 0x7f5d3165d04f in _IO_doallocbuf /build/glibc-eX1tMB/glibc-2.31/libio/genops.c:347:9
#4 0x7f5d31659b14 in _IO_file_seekoff /build/glibc-eX1tMB/glibc-2.31/libio/fileops.c:938:7
#5 0x7f5d316564fc in fseek /build/glibc-eX1tMB/glibc-2.31/libio/fseek.c:36:12
#6 0x55f117004c9b in llvm_gcda_start_file (/home/worker/builds/m-c-20210328213901-ccov-fuzzing-opt/firefox-bin+0x160c9b)
#7 0x55f116efffda in __llvm_gcov_writeout (/home/worker/builds/m-c-20210328213901-ccov-fuzzing-opt/firefox-bin+0x5bfda)
#8 0x7f5d1dc510d4 in __gcov_dump (/home/worker/builds/m-c-20210328213901-ccov-fuzzing-opt/libxul.so+0x1727d0d4)
#9 0x7f5d14ebb16f in mozilla::CodeCoverageHandler::FlushCounters() /builds/worker/checkouts/gecko/tools/code-coverage/CodeCoverageHandler.cpp:46:3
#10 0x7f5d3160e20f (/lib/x86_64-linux-gnu/libc.so.6+0x4620f)
#11 0x7f5d31662d04 in _int_malloc /build/glibc-eX1tMB/glibc-2.31/malloc/malloc.c:3671:17
#12 0x7f5d31665418 in malloc /build/glibc-eX1tMB/glibc-2.31/malloc/malloc.c:3066:12
#13 0x55f116f04c78 in moz_xmalloc /builds/worker/checkouts/gecko/memory/mozalloc/mozalloc.cpp:52:15
#14 0x7f5d110a90a7 in operator new /builds/worker/workspace/obj-build/dist/include/mozilla/cxxalloc.h:33:10
#15 0x7f5d110a90a7 in mozilla::dom::EventTarget_Binding::removeEventListener(JSContext*, JS::Handle<JSObject*>, void*, JSJitMethodCallArgs const&) /builds/worker/workspace/obj-build/dom/bindings/EventTargetBinding.cpp:762:14
#16 0x7f5d113b1c2b in bool mozilla::dom::binding_detail::GenericMethod<mozilla::dom::binding_detail::MaybeCrossOriginObjectThisPolicy, mozilla::dom::binding_detail::ThrowExceptions>(JSContext*, unsigned int, JS::Value*) /builds/worker/checkouts/gecko/dom/bindings/BindingUtils.cpp:3242:13
#17 0x2d9b1f500026 (<unknown module>)
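Reading this trace bottom-up: the thread was inside malloc when the flush signal arrived (frames #11-#12), the handler then re-entered the allocator via FlushCounters -> __gcov_dump -> fseek -> malloc (frames #1-#9), and blocked on the lock the interrupted malloc call already holds (frame #0) until the process was aborted. A stand-alone sketch of that self-deadlock pattern (illustration only, not Gecko code):

// Minimal reproduction of the "malloc interrupted by a handler that mallocs"
// self-deadlock. Send SIGUSR1 to this process; with glibc the handler is very
// likely to end up blocked in __lll_lock_wait_private, as in the trace above.
#include <csignal>
#include <cstdio>
#include <cstdlib>

static void UnsafeHandler(int /*aSignal*/) {
  // fopen/fseek allocate stdio buffers with malloc; none of this is
  // async-signal-safe. If the interrupted thread already holds the malloc
  // arena lock, this blocks forever.
  if (FILE* f = std::fopen("/dev/null", "r")) {
    std::fseek(f, 0, SEEK_END);
    std::fclose(f);
  }
}

int main() {
  std::signal(SIGUSR1, UnsafeHandler);
  // Keep the allocator lock busy so an incoming SIGUSR1 is likely to land
  // while the lock is held.
  for (;;) {
    void* p = std::malloc(4096);
    std::free(p);
  }
}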
Reporter
Comment 8•4 years ago
If I use GCOV_CHILD_PREFIX, I am able to avoid this crash, but I still get the hangs.
Comment 9•4 years ago
Tyson, by running different instances of Grizzly in parallel, I guess you're also running multiple instances of Firefox in parallel?
If so, conflicts between the Firefox instances could explain this. Can you try setting a different GCOV_PREFIX and GCOV_CHILD_PREFIX for each Firefox instance?
Reporter
Comment 10•4 years ago
(In reply to Marco Castelluccio [:marco] from comment #9)
> Tyson, by running different instances of Grizzly in parallel, I guess you're also running multiple instances of Firefox in parallel?
Yes, that is correct.
> If so, conflicts between the Firefox instances could explain this. Can you try setting a different GCOV_PREFIX and GCOV_CHILD_PREFIX for each Firefox instance?
We now limit each machine to a single Grizzly/Firefox instance, and the issue is happening much less frequently. With GCOV_PREFIX and GCOV_CHILD_PREFIX set, we are still seeing hangs, and I can't completely rule out this crash at the moment.
Reporter
Comment 11•4 years ago
I should also mention that if we start using GCOV_CHILD_PREFIX in automation, we will have to deal with the extra output. The browser restarts frequently, and each restart creates more directories. We then need to report or merge and track that data, and also clean up the old directories. This gets complicated quickly, so we'd really like to avoid it if possible.
Updated•4 years ago
Comment 12•3 years ago
We're still seeing this crash fairly frequently with every fuzzing coverage collection. Looking at bug 1533918 (especially comments 9 and 17), it looks like CrossProcessMutex is not a reliable API, especially with regard to deadlocking.
Is there an alternative locking mechanism available?
Reporter
Updated•3 years ago
Comment 13•3 years ago
I'm not sure if there's anything like CrossProcessMutex in our codebase that we could use instead. :gcp, do you know?
Comment 14•3 years ago
No, and as bug 1533918 describes, if you require a CrossProcessMutex you've probably made a design mistake, as it is not a portable construct in general, and there are issues with crash resilience.
It should be noted that this code path (CrossProcessMutex on Linux) should never be hit in production; see the same bug, which forces an instant crash when it is detected.
What's happening here is that the coverage code is twice naughty:
1. It disables sandboxing.
2. It uses a CrossProcessMutex.
And it turns out that (1) disables the forced crash in (2).
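To make that interaction concrete, here is a schematic sketch with invented names; it is not the actual Gecko sandbox or mutex code, it just models the logic described above, where the forced crash only exists while the sandbox is active.

#include <cstdlib>

// Stand-in for the real sandbox query. Coverage fuzzing builds disable the
// content sandbox, so model that as returning false.
static bool IsContentSandboxEnabled() { return false; }

static void GuardAgainstCrossProcessMutexUse() {
  if (IsContentSandboxEnabled()) {
    // The production behaviour described above: hitting this path while
    // sandboxed crashes immediately, so it is never reached in the wild.
    std::abort();  // stand-in for a MOZ_CRASH-style forced crash
  }
  // With the sandbox disabled, the forced crash never fires and the mutex is
  // used anyway, which is how the crashes in this bug become reachable.
}

int main() {
  GuardAgainstCrossProcessMutexUse();
  return 0;
}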