libxcb bug, fixed upstream (https://cgit.freedesktop.org/xcb/libxcb/commit/src/xcb_out.c?id=be0fe56c3bcad5124dcc6c47a2fad01acd16f71a)
Created attachment 8779100 [details] Output of minidump_stackwalk During normal browsing today (I was just opening a bugzilla bug in a new tab), the Firefox parent process crashed inside libxcb.
Comment on attachment 8779100 [details] Output of minidump_stackwalk (Sorry, submitted before I meant to.) The crash was pretty bad, occurring immediately again when I tried to restart Firefox, and causing me to lose all my tabs. I submitted crash reports both times, see  and . Additionally, I obtained the raw dump for the first crash and ran minidump_stackwalk on it. The output is attached. Let me know if there is anything else I can do to help diagnose this crash.  https://crash-stats.mozilla.com/report/index/5c58fba1-e2f0-4365-ac8d-cc91c2160808  https://crash-stats.mozilla.com/report/index/b8acbe25-179d-4027-958b-38a442160808
With Andrew's continued help, we were able to arrive at a diagnosis for this. We observed that instances of the crash were accompanied by the following message in standard error: ../../src/xcb_conn.c:186: write_vec: Assertion `!c->out.queue_len' failed. This assertion has been spotted in other programs that consume libxcb as well; it's a race condition in libxcb, which is fixed upstream , in libxcb 1.11. A recent Firefox change (bug 1291845) made it more likely for us to hit the race condition. Using GL layers also makes it more likely for us to hit it, although it can conceivably happen without GL layers too.  https://cgit.freedesktop.org/xcb/libxcb/commit/src/xcb_out.c?id=be0fe56c3bcad5124dcc6c47a2fad01acd16f71a
Created attachment 8781242 [details] Instructions for building a patched libxcb that includes the fix on a Debian stable system (In reply to Botond Ballo [:botond] from comment #2) > it's a race condition in libxcb, which is fixed upstream , in libxcb 1.11. Unfortunately, the latest version of libxcb packaged for Debian stable (including backports) is 1.10. Therefore, to fix the problem on my Debian stable system, I built a patched version of libxcb 1.10 which includes the fix for this race condition. For convenience, I uploaded the resulting packages in case anyone would like to use them. I also attached some notes on how I built these patched packages. A big thanks for Andrew for all his help with this! http://people.mozilla.org/~bballo/libxcb1_1.10-3_amd64.deb http://people.mozilla.org/~bballo/libxcb1_1.10-3_i386.deb http://people.mozilla.org/~bballo/libxcb1-dbg_1.10-3_amd64.deb http://people.mozilla.org/~bballo/libxcb1-dbg_1.10-3_i386.deb
Finally, Andrew and I discussed whether we should make any changes to Firefox (such as undoing or revising the change in bug 1291845) to work around this libxcb bug. Looking at crash-stats, we weren't able to find any occurrences of this crash other than those reported by me, so for now we are proposing not to work around the bug in Firefox. We'll keep an eye the crash stats (especially as bug 1291845 and the enablement of GL layers makes it to more widely-used release channels) and reconsider that decision if necessary.
Fixed by bug 1296911, switched to separate display to VSync thread to avoid MSC notifications getting stolen.
We're seeing this cause a variety of intermittent test failures, I'm about to dup a bunch of bugs to this one, see their individual logs for details.
I just triggered this while fuzzing ASAN Firefox Nightly Build ID 20161128153732. I can't repro it, but at least I have a stack trace. firefox: ../../src/xcb_conn.c:186: write_vec: Assertion `!c->out.queue_len' failed. [Child 17388] ###!!! ABORT: Aborting on channel error.: file /home/worker/workspace/build/src/ipc/glue/MessageChannel.cpp, line 2155 [Child 17388] ###!!! ABORT: Aborting on channel error.: file /home/worker/workspace/build/src/ipc/glue/MessageChannel.cpp, line 2155 ASAN:DEADLYSIGNAL ================================================================= ###!!! [Child][MessageChannel] Error: (msgtype=0xE40003,name=PTexture::Msg_Destroy) Channel error: cannot send/recv ###!!! [Child][MessageChannel] Error: (msgtype=0x3E0003,name=PCompositable::Msg_Destroy) Channel error: cannot send/recv ==17388==ERROR: AddressSanitizer: SEGV on unknown address 0x000000000000 (pc 0x0000004e114b bp 0x7ff9a2d17090 sp 0x7ff9a2d17080 T2) ###!!! [Child][MessageChannel] Error: (msgtype=0x3E0003,name=PCompositable::Msg_Destroy) Channel error: cannot send/recv Crash Annotation GraphicsCriticalError: |[C0][GFX1-]: Receive IPC close with reason=AbnormalShutdown (t=282.738) #0 0x4e114a in mozalloc_abort(char const*) /home/worker/workspace/build/src/memory/mozalloc/mozalloc_abort.cpp:33:5 #1 0x7ff9a58e2035 in Abort(char const*) /home/worker/workspace/build/src/xpcom/base/nsDebugImpl.cpp:449:3 #2 0x7ff9a58e1ddc in NS_DebugBreak /home/worker/workspace/build/src/xpcom/base/nsDebugImpl.cpp:405:7 #3 0x7ff9a6846e7f in mozilla::ipc::MessageChannel::OnChannelErrorFromLink() /home/worker/workspace/build/src/ipc/glue/MessageChannel.cpp:2155:13 #4 0x7ff9a684c123 in OnChannelError /home/worker/workspace/build/src/ipc/glue/MessageLink.cpp:367:5 #5 0x7ff9a684c123 in non-virtual thunk to mozilla::ipc::ProcessLink::OnChannelError() /home/worker/workspace/build/src/ipc/glue/MessageLink.cpp:359 #6 0x7ff9a68021fb in event_process_active_single_queue /home/worker/workspace/build/src/ipc/chromium/src/third_party/libevent/event.c:1350:4 #7 0x7ff9a68021fb in event_process_active /home/worker/workspace/build/src/ipc/chromium/src/third_party/libevent/event.c:1420 #8 0x7ff9a68021fb in event_base_loop /home/worker/workspace/build/src/ipc/chromium/src/third_party/libevent/event.c:1621 #9 0x7ff9a67c1691 in base::MessagePumpLibevent::Run(base::MessagePump::Delegate*) /home/worker/workspace/build/src/ipc/chromium/src/base/message_pump_libevent.cc:372:7 #10 0x7ff9a67bbaf8 in RunInternal /home/worker/workspace/build/src/ipc/chromium/src/base/message_loop.cc:232:3 #11 0x7ff9a67bbaf8 in RunHandler /home/worker/workspace/build/src/ipc/chromium/src/base/message_loop.cc:225 #12 0x7ff9a67bbaf8 in MessageLoop::Run() /home/worker/workspace/build/src/ipc/chromium/src/base/message_loop.cc:205 #13 0x7ff9a67dbca1 in base::Thread::ThreadMain() /home/worker/workspace/build/src/ipc/chromium/src/base/thread.cc:180:3 #14 0x7ff9a67dc7fc in ThreadFunc(void*) /home/worker/workspace/build/src/ipc/chromium/src/base/platform_thread_posix.cc:38:3 #15 0x7ff9c02b00a3 in start_thread /build/glibc-daoqzt/glibc-2.19/nptl/pthread_create.c:309 #16 0x7ff9bf3b762c in clone /build/glibc-daoqzt/glibc-2.19/misc/../sysdeps/unix/sysv/linux/x86_64/clone.S:111 AddressSanitizer can not provide additional info. SUMMARY: AddressSanitizer: SEGV /home/worker/workspace/build/src/memory/mozalloc/mozalloc_abort.cpp:33:5 in mozalloc_abort(char const*) Thread T2 (Chrome_ChildThr) created by T0 (Web Content) here: #0 0x49a869 in __interceptor_pthread_create /builds/slave/moz-toolchain/src/llvm/projects/compiler-rt/lib/asan/asan_interceptors.cc:238:3 #1 0x7ff9a67db8bb in CreateThread /home/worker/workspace/build/src/ipc/chromium/src/base/platform_thread_posix.cc:137:14 #2 0x7ff9a67db8bb in Create /home/worker/workspace/build/src/ipc/chromium/src/base/platform_thread_posix.cc:148 #3 0x7ff9a67db8bb in base::Thread::StartWithOptions(base::Thread::Options const&) /home/worker/workspace/build/src/ipc/chromium/src/base/thread.cc:98 #4 0x7ff9a684e307 in mozilla::ipc::ProcessChild::ProcessChild(int) /home/worker/workspace/build/src/ipc/glue/ProcessChild.cpp:24:5 #5 0x7ff9adfe3a1b in ContentProcess /home/worker/workspace/build/src/obj-firefox/dist/include/mozilla/dom/ContentProcess.h:31:7 #6 0x7ff9adfe3a1b in XRE_InitChildProcess /home/worker/workspace/build/src/toolkit/xre/nsEmbedFunctions.cpp:660 #7 0x4dfb5b in content_process_main /home/worker/workspace/build/src/browser/app/../../ipc/contentproc/plugin-container.cpp:115:19 #8 0x4dfb5b in main /home/worker/workspace/build/src/browser/app/nsBrowserApp.cpp:438 #9 0x7ff9bf2f0b44 in __libc_start_main /build/glibc-daoqzt/glibc-2.19/csu/libc-start.c:287 ==17388==ABORTING
We're seeing this on Try/inbound as well (see bug 1313562). Not sure the libxcb version on the test machines; my (old) Fedora 22 has 1.11.
This seems to be happening fairly regularly in automation, is it feasible to install either the upgraded upstream library or the patched library mentioned in comment 3 on the machines used in automation?
(In reply to Andrew Swan [:aswan] from comment #18) > This seems to be happening fairly regularly in automation, is it feasible to > install either the upgraded upstream library or the patched library > mentioned in comment 3 on the machines used in automation? In my experience, installing a newer version of a library than the one available in a Linux distribution's repositories can be a tricky business. In the few cases where I've managed it, it was on a local installation, and with heavy guidance from people more familiar with these things (such as Andrew Comminos who kindly helped me put in place the local uprade I described in comment 3). I think a more promising approach in this case would be to upgrade the operating system version itself. I see bug 1290183 is already on file for that, and Karl has already marked it as a dependency of this bug.
Since folks are apparently unwillingly to work on this separately from bug 1290183 would it be possible for whatever tools are used to file intermittent bugs to learn about this particular crash to avoid filing new bugs every time we see it?
Short version: no. Long version: Treeherder has a list of regexes which it matches against log lines to determine what lines indicate a failure. "INFO - firefox: ../../src/xcb_conn.c:180: write_vec: Assertion `!c->out.queue_len' failed." is not matched by any of them. Nor is "INFO - [GFX1-]: Receive IPC close with reason=AbnormalShutdown" nor "INFO - Hit MOZ_CRASH(Aborting on channel error.) at /home/worker/workspace/build/src/ipc/glue/MessageChannel.cpp:2186". Both the filing of new intermittents and the matching of a failure against existing bugs happen based on those parsed out lines. You can get Treeherder to add a regex, with (reasonable) difficulty since it needs to not result in false positives and it's nearly impossible to say whether or not one will result in false positives. Or, probably with even more difficulty, we could crash better than a bunch of INFO lines with the actual information followed by a completely and utterly useless actual minidump which provides zero information. So, yeah, by needinfoing me you now have me knowing that "with exit code 6" plus a minidump with a useless junk signature means I need to open the log, scroll up past the minidump, scroll up past some assertions, to get the the actual assertion that leads to this bug to star a failure as this. I can't speak to how often any other sheriff will remember to do that.
Well, sounds like we're still getting an incremental improvement, thanks!
I talked to :jmaher about the timeline for fixing bug 1290183, and it looks like, realistically speaking, we're probably looking at late Q1 / early Q2. So, given the high volume of these intermittents, it might make sense to try and fix this issue on the existing 12.04 builders after all.
(In reply to Botond Ballo [:botond] from comment #78) > So, given the high volume of these intermittents, it might make sense to try > and fix this issue on the existing 12.04 builders after all. Filed bug 1334641. As I mentioned, I'm a bit out of my depth in this area, so some help with that bug would be greatly appreciated!
13 failures in 749 pushes (0.017 failures/push) were associated with this bug in the last 7 days. Repository breakdown: * mozilla-aurora: 4 * mozilla-inbound: 3 * autoland: 3 * mozilla-beta: 2 * mozilla-central: 1 Platform breakdown: * linux64: 10 * linux32: 3 For more details, see: https://brasstacks.mozilla.com/orangefactor/?display=Bug&bugid=1293474&startday=2017-01-23&endday=2017-01-29&tree=all
41 failures in 733 pushes (0.056 failures/push) were associated with this bug in the last 7 days. This is the #43 most frequent failure this week. Repository breakdown: * autoland: 19 * mozilla-inbound: 13 * mozilla-aurora: 3 * mozilla-central: 2 * mozilla-beta: 2 * try: 1 * mozilla-release: 1 Platform breakdown: * linux64: 27 * linux32: 14 For more details, see: https://brasstacks.mozilla.com/orangefactor/?display=Bug&bugid=1293474&startday=2017-01-30&endday=2017-02-05&tree=all
(In reply to Botond Ballo [:botond] from comment #79) > (In reply to Botond Ballo [:botond] from comment #78) > > So, given the high volume of these intermittents, it might make sense to try > > and fix this issue on the existing 12.04 builders after all. > > Filed bug 1334641. THe libxcb on the 12.04 builders is now patched. Hopefully we should start seeing these intermittents taper off.
24 failures in 836 pushes (0.029 failures/push) were associated with this bug in the last 7 days. Repository breakdown: * mozilla-inbound: 11 * autoland: 10 * mozilla-central: 1 * mozilla-beta: 1 * mozilla-aurora: 1 Platform breakdown: * linux64: 15 * linux32: 9 For more details, see: https://brasstacks.mozilla.com/orangefactor/?display=Bug&bugid=1293474&startday=2017-02-06&endday=2017-02-12&tree=all
Looking pretty good: https://brasstacks.mozilla.com/orangefactor/?display=Bug&bugid=1293474&startday=2017-02-06&endday=2017-02-19&tree=all Looks like bug 1334641 fixed this. Closing.