Closed Bug 1616462 Opened 6 years ago Closed 3 years ago

Intermittent SUMMARY: ThreadSanitizer: data race /builds/worker/workspace/build/src/ipc/chromium/src/third_party/libevent/signal.c:125:16 in evsig_set_base_

Categories

(Core :: IPC, defect, P1)

defect

Tracking

()

RESOLVED DUPLICATE of bug 1658072

People

(Reporter: intermittent-bug-filer, Assigned: jld)

References

(Blocks 1 open bug, )

Details

(Keywords: intermittent-failure, Whiteboard: [stockwell unknown])

Attachments

(1 obsolete file)

Filed by: nbeleuzu [at] mozilla.com
Parsed log: https://treeherder.mozilla.org/logviewer.html#?job_id=289453443&repo=autoland
Full log: https://firefox-ci-tc.services.mozilla.com/api/queue/v1/task/TFmdzKfcRPCMCQ1GziyX-g/runs/0/artifacts/public/logs/live_backing.log


[task 2020-02-19T02:47:11.701Z] 02:47:11 INFO - TEST-INFO | Ran 1 Loops
[task 2020-02-19T02:47:11.708Z] 02:47:11 INFO - SimpleTest FINISHED
[task 2020-02-19T02:47:20.882Z] 02:47:20 INFO - GECKO(1990) | ==================
[task 2020-02-19T02:47:20.882Z] 02:47:20 INFO - GECKO(1990) | WARNING: ThreadSanitizer: data race (pid=1990)
[task 2020-02-19T02:47:20.882Z] 02:47:20 INFO - GECKO(1990) | Write of size 4 at 0x7f23e939ced8 by thread T2:
[task 2020-02-19T02:47:20.882Z] 02:47:20 INFO - GECKO(1990) | #0 evsig_set_base_ /builds/worker/workspace/build/src/ipc/chromium/src/third_party/libevent/signal.c:125:16 (libxul.so+0x126b7cf)
[task 2020-02-19T02:47:20.882Z] 02:47:20 INFO - GECKO(1990) | #1 event_base_loop /builds/worker/workspace/build/src/ipc/chromium/src/third_party/libevent/event.c:1901:3 (libxul.so+0x1264514)
[task 2020-02-19T02:47:20.882Z] 02:47:20 INFO - GECKO(1990) | #2 base::MessagePumpLibevent::Run(base::MessagePump::Delegate*) /builds/worker/workspace/build/src/ipc/chromium/src/base/message_pump_libevent.cc (libxul.so+0x1245ca9)
[task 2020-02-19T02:47:20.882Z] 02:47:20 INFO - GECKO(1990) | #3 RunInternal /builds/worker/workspace/build/src/ipc/chromium/src/base/message_loop.cc:315:10 (libxul.so+0x1242fdc)
[task 2020-02-19T02:47:20.882Z] 02:47:20 INFO - GECKO(1990) | #4 RunHandler /builds/worker/workspace/build/src/ipc/chromium/src/base/message_loop.cc:308:3 (libxul.so+0x1242fdc)
[task 2020-02-19T02:47:20.882Z] 02:47:20 INFO - GECKO(1990) | #5 MessageLoop::Run() /builds/worker/workspace/build/src/ipc/chromium/src/base/message_loop.cc:290:3 (libxul.so+0x1242fdc)
[task 2020-02-19T02:47:20.882Z] 02:47:20 INFO - GECKO(1990) | #6 base::Thread::ThreadMain() /builds/worker/workspace/build/src/ipc/chromium/src/base/thread.cc:192:16 (libxul.so+0x12535e3)
[task 2020-02-19T02:47:20.882Z] 02:47:20 INFO - GECKO(1990) | #7 ThreadFunc(void*) /builds/worker/workspace/build/src/ipc/chromium/src/base/platform_thread_posix.cc:40:13 (libxul.so+0x124d1be)
[task 2020-02-19T02:47:20.882Z] 02:47:20 INFO - GECKO(1990) | Previous read of size 4 at 0x7f23e939ced8 by main thread:
[task 2020-02-19T02:47:20.882Z] 02:47:20 INFO - GECKO(1990) | [failed to restore the stack]
[task 2020-02-19T02:47:20.882Z] 02:47:20 INFO - GECKO(1990) | Location is global 'evsig_base_fd' of size 4 at 0x7f23e939ced8 (libxul.so+0x00000a793ed8)
[task 2020-02-19T02:47:20.882Z] 02:47:20 INFO - GECKO(1990) | Thread T2 'Gecko_IOThread' (tid=1998, running) created by main thread at:
[task 2020-02-19T02:47:20.882Z] 02:47:20 INFO - GECKO(1990) | #0 pthread_create /builds/worker/fetches/llvm-project/llvm/projects/compiler-rt/lib/tsan/rtl/tsan_interceptors.cc:967:3 (firefox+0x571ab)
[task 2020-02-19T02:47:20.882Z] 02:47:20 INFO - GECKO(1990) | #1 CreateThread /builds/worker/workspace/build/src/ipc/chromium/src/base/platform_thread_posix.cc:123:14 (libxul.so+0x124a9b7)
[task 2020-02-19T02:47:20.882Z] 02:47:20 INFO - GECKO(1990) | #2 PlatformThread::Create(unsigned long, PlatformThread::Delegate*, unsigned long*) /builds/worker/workspace/build/src/ipc/chromium/src/base/platform_thread_posix.cc:134:10 (libxul.so+0x124a9b7)
[task 2020-02-19T02:47:20.882Z] 02:47:20 INFO - GECKO(1990) | #3 base::Thread::StartWithOptions(base::Thread::Options const&) /builds/worker/workspace/build/src/ipc/chromium/src/base/thread.cc:97:8 (libxul.so+0x12531c6)
[task 2020-02-19T02:47:20.882Z] 02:47:20 INFO - GECKO(1990) | #4 NS_InitXPCOM /builds/worker/workspace/build/src/xpcom/build/XPCOMInit.cpp:316:9 (libxul.so+0xaa2728)
[task 2020-02-19T02:47:20.882Z] 02:47:20 INFO - GECKO(1990) | #5 Initialize /builds/worker/workspace/build/src/toolkit/xre/nsAppRunner.cpp:1264:8 (libxul.so+0x632e2dd)
[task 2020-02-19T02:47:20.882Z] 02:47:20 INFO - GECKO(1990) | #6 XREMain::XRE_main(int, char**, mozilla::BootstrapConfig const&) /builds/worker/workspace/build/src/toolkit/xre/nsAppRunner.cpp:4697:22 (libxul.so+0x632e2dd)
[task 2020-02-19T02:47:20.882Z] 02:47:20 INFO - GECKO(1990) | #7 XRE_main(int, char**, mozilla::BootstrapConfig const&) /builds/worker/workspace/build/src/toolkit/xre/nsAppRunner.cpp:4752:21 (libxul.so+0x632e6b4)
[task 2020-02-19T02:47:20.882Z] 02:47:20 INFO - GECKO(1990) | #8 mozilla::BootstrapImpl::XRE_main(int, char**, mozilla::BootstrapConfig const&) /builds/worker/workspace/build/src/toolkit/xre/Bootstrap.cpp:45:12 (libxul.so+0x6337432)
[task 2020-02-19T02:47:20.882Z] 02:47:20 INFO - GECKO(1990) | #9 do_main /builds/worker/workspace/build/src/browser/app/nsBrowserApp.cpp:217:22 (firefox+0xc8b33)
[task 2020-02-19T02:47:20.882Z] 02:47:20 INFO - GECKO(1990) | #10 main /builds/worker/workspace/build/src/browser/app/nsBrowserApp.cpp:331:16 (firefox+0xc8b33)
[task 2020-02-19T02:47:20.882Z] 02:47:20 INFO - GECKO(1990) | SUMMARY: ThreadSanitizer: data race /builds/worker/workspace/build/src/ipc/chromium/src/third_party/libevent/signal.c:125:16 in evsig_set_base_
[task 2020-02-19T02:47:20.882Z] 02:47:20 INFO - GECKO(1990) | ==================
[task 2020-02-19T02:47:21.064Z] 02:47:21 INFO - TEST-INFO | Main app process: killed by SIGIOT
[task 2020-02-19T02:47:21.065Z] 02:47:21 INFO - Buffered messages finished
[task 2020-02-19T02:47:21.066Z] 02:47:21 ERROR - TEST-UNEXPECTED-FAIL | Last test finished | application terminated with exit code -6

This is reported upstream as https://github.com/libevent/libevent/issues/779. There is also a PR with a fix at https://github.com/libevent/libevent/pull/741, but it has not been accepted (yet).

In the last 7 days there have been 20 occurrences on linux 64 tsan opt.

Recent failure: https://treeherder.mozilla.org/logviewer.html#/jobs?job_id=295049632&repo=autoland&lineNumber=3224

Summary: Intermittent GECKO(1990) | SUMMARY: ThreadSanitizer: data race /builds/worker/workspace/build/src/ipc/chromium/src/third_party/libevent/signal.c:125:16 in evsig_set_base_ → Intermittent SUMMARY: ThreadSanitizer: data race /builds/worker/workspace/build/src/ipc/chromium/src/third_party/libevent/signal.c:125:16 in evsig_set_base_

There are 26 failures associated to this bug in the last 7 days. These are occurring on linux1804-64-tsan opt builds.

log: https://treeherder.mozilla.org/logviewer.html#/jobs?job_id=299884943&repo=autoland&lineNumber=2529
[task 2020-04-28T23:18:07.550Z] 23:18:07 INFO - GECKO(2698) | WARNING: ThreadSanitizer: data race (pid=2698)
[task 2020-04-28T23:18:07.551Z] 23:18:07 INFO - GECKO(2698) | Write of size 4 at 0x7f99ff683038 by thread T4:
[task 2020-04-28T23:18:07.551Z] 23:18:07 INFO - GECKO(2698) | #0 evsig_set_base_ /builds/worker/checkouts/gecko/ipc/chromium/src/third_party/libevent/signal.c:125:16 (libxul.so+0x1333c1f)
[task 2020-04-28T23:18:07.551Z] 23:18:07 INFO - GECKO(2698) | #1 event_base_loop /builds/worker/checkouts/gecko/ipc/chromium/src/third_party/libevent/event.c:1901:3 (libxul.so+0x132c994)
[task 2020-04-28T23:18:07.551Z] 23:18:07 INFO - GECKO(2698) | #2 base::MessagePumpLibevent::Run(base::MessagePump::Delegate*) /builds/worker/checkouts/gecko/ipc/chromium/src/base/message_pump_libevent.cc (libxul.so+0x130d609)
[task 2020-04-28T23:18:07.552Z] 23:18:07 INFO - GECKO(2698) | #3 RunInternal /builds/worker/checkouts/gecko/ipc/chromium/src/base/message_loop.cc:315:10 (libxul.so+0x130a94c)
[task 2020-04-28T23:18:07.553Z] 23:18:07 INFO - GECKO(2698) | #4 RunHandler /builds/worker/checkouts/gecko/ipc/chromium/src/base/message_loop.cc:308:3 (libxul.so+0x130a94c)
[task 2020-04-28T23:18:07.553Z] 23:18:07 INFO - GECKO(2698) | #5 MessageLoop::Run() /builds/worker/checkouts/gecko/ipc/chromium/src/base/message_loop.cc:290:3 (libxul.so+0x130a94c)
[task 2020-04-28T23:18:07.553Z] 23:18:07 INFO - GECKO(2698) | #6 base::Thread::ThreadMain() /builds/worker/checkouts/gecko/ipc/chromium/src/base/thread.cc:192:16 (libxul.so+0x131ace5)
[task 2020-04-28T23:18:07.553Z] 23:18:07 INFO - GECKO(2698) | #7 ThreadFunc(void*) /builds/worker/checkouts/gecko/ipc/chromium/src/base/platform_thread_posix.cc:40:13 (libxul.so+0x1314afe)
[task 2020-04-28T23:18:07.553Z] 23:18:07 INFO - GECKO(2698) | Previous read of size 4 at 0x7f99ff683038 by main thread:
[task 2020-04-28T23:18:07.553Z] 23:18:07 INFO - GECKO(2698) | [failed to restore the stack]
[task 2020-04-28T23:18:07.553Z] 23:18:07 INFO - GECKO(2698) | Location is global 'evsig_base_fd' of size 4 at 0x7f99ff683038 (libxul.so+0x00000ab57038)
[task 2020-04-28T23:18:07.554Z] 23:18:07 INFO - GECKO(2698) | Thread T4 'Gecko_IOThread' (tid=2708, running) created by main thread at:
[task 2020-04-28T23:18:07.554Z] 23:18:07 INFO - GECKO(2698) | #0 pthread_create /builds/worker/fetches/llvm-project/llvm/projects/compiler-rt/lib/tsan/rtl/tsan_interceptors.cc:967:3 (firefox+0x5749b)
[task 2020-04-28T23:18:07.554Z] 23:18:07 INFO - GECKO(2698) | #1 CreateThread /builds/worker/checkouts/gecko/ipc/chromium/src/base/platform_thread_posix.cc:123:14 (libxul.so+0x1312187)
[task 2020-04-28T23:18:07.554Z] 23:18:07 INFO - GECKO(2698) | #2 PlatformThread::Create(unsigned long, PlatformThread::Delegate*, unsigned long*) /builds/worker/checkouts/gecko/ipc/chromium/src/base/platform_thread_posix.cc:134:10 (libxul.so+0x1312187)
[task 2020-04-28T23:18:07.554Z] 23:18:07 INFO - GECKO(2698) | #3 base::Thread::StartWithOptions(base::Thread::Options const&) /builds/worker/checkouts/gecko/ipc/chromium/src/base/thread.cc:97:8 (libxul.so+0x131a8e1)
[task 2020-04-28T23:18:07.554Z] 23:18:07 INFO - GECKO(2698) | #4 NS_InitXPCOM /builds/worker/checkouts/gecko/xpcom/build/XPCOMInit.cpp:317:9 (libxul.so+0xb1114d)
[task 2020-04-28T23:18:07.555Z] 23:18:07 INFO - GECKO(2698) | #5 ScopedXPCOMStartup::Initialize(bool) /builds/worker/checkouts/gecko/toolkit/xre/nsAppRunner.cpp:1311:8 (libxul.so+0x638dcfd)
[task 2020-04-28T23:18:07.556Z] 23:18:07 INFO - GECKO(2698) | #6 XREMain::XRE_main(int, char**, mozilla::BootstrapConfig const&) /builds/worker/checkouts/gecko/toolkit/xre/nsAppRunner.cpp:4739:22 (libxul.so+0x63968b3)
[task 2020-04-28T23:18:07.557Z] 23:18:07 INFO - GECKO(2698) | #7 XRE_main(int, char**, mozilla::BootstrapConfig const&) /builds/worker/checkouts/gecko/toolkit/xre/nsAppRunner.cpp:4797:21 (libxul.so+0x6396c24)
[task 2020-04-28T23:18:07.557Z] 23:18:07 INFO - GECKO(2698) | #8 mozilla::BootstrapImpl::XRE_main(int, char**, mozilla::BootstrapConfig const&) /builds/worker/checkouts/gecko/toolkit/xre/Bootstrap.cpp:45:12 (libxul.so+0x63a1762)
[task 2020-04-28T23:18:07.558Z] 23:18:07 INFO - GECKO(2698) | #9 do_main /builds/worker/checkouts/gecko/browser/app/nsBrowserApp.cpp:217:22 (firefox+0xc93e3)
[task 2020-04-28T23:18:07.558Z] 23:18:07 INFO - GECKO(2698) | #10 main /builds/worker/checkouts/gecko/browser/app/nsBrowserApp.cpp:331:16 (firefox+0xc93e3)
[task 2020-04-28T23:18:07.558Z] 23:18:07 INFO - GECKO(2698) | SUMMARY: ThreadSanitizer: data race /builds/worker/checkouts/gecko/ipc/chromium/src/third_party/libevent/signal.c:125:16 in evsig_set_base_
[task 2020-04-28T23:18:07.559Z] 23:18:07 INFO - GECKO(2698) | ==================
[task 2020-04-28T23:18:07.735Z] 23:18:07 INFO - TEST-INFO | Main app process: killed by SIGIOT
[task 2020-04-28T23:18:07.735Z] 23:18:07 INFO - Buffered messages finished
[task 2020-04-28T23:18:07.736Z] 23:18:07 ERROR - TEST-UNEXPECTED-FAIL | Last test finished | application terminated with exit code -6
[task 2020-04-28T23:18:07.736Z] 23:18:07 INFO - runtests.py | Application ran for: 0:00:46.715597
[task 2020-04-28T23:18:07.736Z] 23:18:07 INFO - zombiecheck | Reading PID log: /tmp/tmpENcMNrpidlog
[task 2020-04-28T23:18:07.736Z] 23:18:07 INFO - ==> process 2698 launched child process 2713
[task 2020-04-28T23:18:07.736Z] 23:18:07 INFO - ==> process 2698 launched child process 2823
[task 2020-04-28T23:18:07.737Z] 23:18:07 INFO - ==> process 2698 launched child process 2843
[task 2020-04-28T23:18:07.737Z] 23:18:07 INFO - ==> process 2698 launched child process 2919
[task 2020-04-28T23:18:07.737Z] 23:18:07 INFO - ==> process 2698 launched child process 2948
[task 2020-04-28T23:18:07.737Z] 23:18:07 INFO - zombiecheck | Checking for orphan process with PID: 2713
[task 2020-04-28T23:18:07.737Z] 23:18:07 INFO - zombiecheck | Checking for orphan process with PID: 2919
[task 2020-04-28T23:18:07.737Z] 23:18:07 INFO - zombiecheck | Checking for orphan process with PID: 2843
[task 2020-04-28T23:18:07.738Z] 23:18:07 INFO - zombiecheck | Checking for orphan process with PID: 2948
[task 2020-04-28T23:18:07.738Z] 23:18:07 INFO - zombiecheck | Checking for orphan process with PID: 2823
[task 2020-04-28T23:18:07.738Z] 23:18:07 INFO - Stopping web server

:jld can you take a look?

Flags: needinfo?(jld)
Whiteboard: [stockwell needswork]

There's a fundamentally hard problem here, which seems to be why upstream doesn't have a fix: if an async signal handler is using some kind of resource, like a fd or heap memory, it's not enough to unset the signal handler before freeing the resource; it's necessary to ensure that the handler is no longer running on any other thread, and locks can't be used for this.

We may not need to care about that: the fd is owned by the event_base (the abstraction of the native I/O multiplexer) and doesn't appear to be closed until it's destroyed, which for IPC is during shutdown. (Specifically, after ShutdownXPCOM deletes the BrowserProcessSubThread and sends a ThreadQuitTask to the I/O thread.)

But then there's what we're being yelled at about here: the fd is stored into the global variable, then the signal handler is installed, then — according to TSan — it's possible for another thread on another CPU to run the signal handler before it sees the change to the non-atomic global. For that, a change proposed in one of the GitHub issues should suffice: adding the C11 _Atomic qualifier.

Flags: needinfo?(jld)

I have a patch, and it doesn't appear to break the build.

Assignee: nobody → jld
Priority: P5 → P1

Jed, in comment #15 you mentioned you had a patch to fix this, but it's not attached to this bug. What's the state of that?

Flags: needinfo?(jld)

See comment #13 on the bug for an explanation of what this fixes, what it
doesn't fix, and why the upstream bug report about the larger issue here
has stalled.

(In reply to Simon Giesecke [:sg] [he/him] from comment #29)

Jed, in comment #15 you mentioned you had a patch to fix this, but it's not attached to this bug. What's the state of that?

Thanks for the reminder.

Flags: needinfo?(jld)
Pushed by jedavis@mozilla.com: https://hg.mozilla.org/integration/autoland/rev/c574dd7d46ef Make event_base_fd atomic to avoid a TSan error. r=froydnj
Status: NEW → RESOLVED
Closed: 5 years ago
Resolution: --- → FIXED
Target Milestone: --- → 81 Branch

Slightly different: it's complaining about evsig_base, not evsig_base_fd.

There's… a lot going on here. If evsig_base_fd were loaded first in the signal handler, because it's stored last in the code that updates the globals, it would act as a barrier for the accesses to evsig_base. But those globals are set every time a signal handlers registered and every time we enter the event loop (which ought to be just once, but that's the source of the write that's called out in the TSan report).

So, I could keep trying to outwit TSan, or I could rewrite the code in IPC that handles child process termination on Unix, which is overdue for some cleanup, to not use this feature of libevent.

Flags: needinfo?(jld)

(In reply to Jed Davis [:jld] ⟨⏰|UTC-6⟩ ⟦he/him⟧ from comment #35)

Slightly different: it's complaining about evsig_base, not evsig_base_fd.

There's… a lot going on here. If evsig_base_fd were loaded first in the signal handler, because it's stored last in the code that updates the globals, it would act as a barrier for the accesses to evsig_base. But those globals are set every time a signal handlers registered and every time we enter the event loop (which ought to be just once, but that's the source of the write that's called out in the TSan report).

So, I could keep trying to outwit TSan, or I could rewrite the code in IPC that handles child process termination on Unix, which is overdue for some cleanup, to not use this feature of libevent.

Should we file a different bug or can we use this one?

Flags: needinfo?(jld)

(In reply to Andreea Pavel [:apavel] from comment #36)

Should we file a different bug or can we use this one?

I'll file another bug and have it block this one.

Attachment #9167733 - Attachment is obsolete: true
Blocks: tsan
See Also: → 1704824

As far as we know the bug is still present, even if TSan runs aren't finding it anymore for some reason, but bug 1658072 will take care of it.

Status: REOPENED → RESOLVED
Closed: 5 years ago3 years ago
Flags: needinfo?(jld)
Resolution: --- → DUPLICATE
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: