
Crash in libxcb: "xcb_conn.c:186: write_vec: Assertion `!c->out.queue_len' failed"

RESOLVED FIXED

Status


Core
Graphics
RESOLVED FIXED
8 months ago
26 days ago

People

(Reporter: botond, Unassigned)

Tracking

({intermittent-failure})

Firefox Tracking Flags

(firefox51 affected)

Details

(Whiteboard: [gfx-noted][stockwell fixed])

User Story

libxcb bug, fixed upstream (https://cgit.freedesktop.org/xcb/libxcb/commit/src/xcb_out.c?id=be0fe56c3bcad5124dcc6c47a2fad01acd16f71a)

Attachments

(2 attachments)

(Reporter)

Description

8 months ago
Created attachment 8779100 [details]
Output of minidump_stackwalk

During normal browsing today (I was just opening a bugzilla bug in a new tab), the Firefox parent process crashed inside libxcb.
(Reporter)

Comment 1

8 months ago
Comment on attachment 8779100 [details]
Output of minidump_stackwalk

(Sorry, submitted before I meant to.)

The crash was pretty bad, occurring immediately again when I tried to restart Firefox, and causing me to lose all my tabs.

I submitted crash reports both times, see [1] and [2].

Additionally, I obtained the raw dump for the first crash and ran minidump_stackwalk on it. The output is attached.

Let me know if there is anything else I can do to help diagnose this crash.

[1] https://crash-stats.mozilla.com/report/index/5c58fba1-e2f0-4365-ac8d-cc91c2160808
[2] https://crash-stats.mozilla.com/report/index/b8acbe25-179d-4027-958b-38a442160808
Attachment #8779100 - Attachment description: Output of :q → Output of minidump_stackwalk

Updated

8 months ago
Whiteboard: [gfx-noted]
(Reporter)

Comment 2

7 months ago
diagnosis
With Andrew's continued help, we were able to arrive at a diagnosis for this.

We observed that instances of the crash were accompanied by the following message in standard error:

 ../../src/xcb_conn.c:186: write_vec: Assertion `!c->out.queue_len' failed.

This assertion has been spotted in other programs that consume libxcb as well; it's a race condition in libxcb, which is fixed upstream [1], in libxcb 1.11.

A recent Firefox change (bug 1291845) made it more likely for us to hit the race condition. Using GL layers also makes it more likely for us to hit it, although it can conceivably happen without GL layers too.

[1] https://cgit.freedesktop.org/xcb/libxcb/commit/src/xcb_out.c?id=be0fe56c3bcad5124dcc6c47a2fad01acd16f71a
(Reporter)

Updated

7 months ago
User Story: (updated)
See Also: → bug 1291845
(Reporter)

Comment 3

7 months ago
Created attachment 8781242 [details]
Instructions for building a patched libxcb that includes the fix on a Debian stable system

(In reply to Botond Ballo [:botond] from comment #2)
> it's a race condition in libxcb, which is fixed upstream [1], in libxcb 1.11.

Unfortunately, the latest version of libxcb packaged for Debian stable (including backports) is 1.10.

Therefore, to fix the problem on my Debian stable system, I built a patched version of libxcb 1.10 which includes the fix for this race condition.
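A rough way to tell whether a packaged libxcb predates the upstream fix is to compare its upstream version against 1.11, the release named in comment 2. The following is a minimal sketch only: the helper names are hypothetical, and it assumes Debian-style version strings such as `1.10-3`. A version check is just a heuristic, since a patched 1.10 build like the one described here carries the fix but still reports 1.10.

```python
import re

# Per comment 2, the race condition fix first shipped in libxcb 1.11.
FIXED_UPSTREAM = (1, 11)


def upstream_version(package_version: str) -> tuple:
    """Extract (major, minor) from a version string such as '1.10-3'.

    An optional Debian epoch prefix ('2:1.10-3') is skipped.
    """
    match = re.match(r"(?:\d+:)?(\d+)\.(\d+)", package_version)
    if match is None:
        raise ValueError(f"unrecognised version string: {package_version!r}")
    return int(match.group(1)), int(match.group(2))


def has_upstream_fix(package_version: str) -> bool:
    """Return True if the upstream version is at least 1.11.

    Comparing integer tuples avoids the string-comparison trap
    where '1.10' would sort before '1.9'.
    """
    return upstream_version(package_version) >= FIXED_UPSTREAM


if __name__ == "__main__":
    for version in ("1.10-3", "1.11-1", "1.9-2"):
        print(version, has_upstream_fix(version))
```

Because the patched packages keep the 1.10 version number, this check would report them as unfixed; it is only useful for spotting stock distribution packages.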

For convenience, I uploaded the resulting packages in case anyone would like to use them. I also attached some notes on how I built these patched packages.

A big thanks to Andrew for all his help with this!

http://people.mozilla.org/~bballo/libxcb1_1.10-3_amd64.deb
http://people.mozilla.org/~bballo/libxcb1_1.10-3_i386.deb
http://people.mozilla.org/~bballo/libxcb1-dbg_1.10-3_amd64.deb
http://people.mozilla.org/~bballo/libxcb1-dbg_1.10-3_i386.deb
(Reporter)

Comment 4

7 months ago
Finally, Andrew and I discussed whether we should make any changes to Firefox (such as undoing or revising the change in bug 1291845) to work around this libxcb bug.

Looking at crash-stats, we weren't able to find any occurrences of this crash other than those reported by me, so for now we are proposing not to work around the bug in Firefox. We'll keep an eye on the crash stats (especially as bug 1291845 and the enablement of GL layers make it to more widely-used release channels) and reconsider that decision if necessary.
(Reporter)

Updated

7 months ago
Summary: Crash in xcb_connect_to_fd() → Crash in libxcb: "xcb_conn.c:186: write_vec: Assertion `!c->out.queue_len' failed"
See Also: → bug 1296911
Fixed by bug 1296911, which switched the VSync thread to a separate display to avoid MSC notifications getting stolen.
Status: NEW → RESOLVED
Last Resolved: 7 months ago
Resolution: --- → FIXED

Comment 6

4 months ago
We're seeing this cause a variety of intermittent test failures. I'm about to dup a bunch of bugs to this one; see their individual logs for details.
Status: RESOLVED → REOPENED
Resolution: FIXED → ---

Updated

4 months ago
Duplicate of this bug: 1319335

Updated

4 months ago
Duplicate of this bug: 1312968

Updated

4 months ago
Duplicate of this bug: 1316056

Updated

4 months ago
Duplicate of this bug: 1320247

Updated

4 months ago
Duplicate of this bug: 1315093

Updated

4 months ago
Duplicate of this bug: 1320515
I just triggered this while fuzzing ASAN Firefox Nightly Build ID 20161128153732. I can't repro it, but at least I have a stack trace.

firefox: ../../src/xcb_conn.c:186: write_vec: Assertion `!c->out.queue_len' failed.
[Child 17388] ###!!! ABORT: Aborting on channel error.: file /home/worker/workspace/build/src/ipc/glue/MessageChannel.cpp, line 2155
[Child 17388] ###!!! ABORT: Aborting on channel error.: file /home/worker/workspace/build/src/ipc/glue/MessageChannel.cpp, line 2155
ASAN:DEADLYSIGNAL
=================================================================

###!!! [Child][MessageChannel] Error: (msgtype=0xE40003,name=PTexture::Msg_Destroy) Channel error: cannot send/recv


###!!! [Child][MessageChannel] Error: (msgtype=0x3E0003,name=PCompositable::Msg_Destroy) Channel error: cannot send/recv

==17388==ERROR: AddressSanitizer: SEGV on unknown address 0x000000000000 (pc 0x0000004e114b bp 0x7ff9a2d17090 sp 0x7ff9a2d17080 T2)

###!!! [Child][MessageChannel] Error: (msgtype=0x3E0003,name=PCompositable::Msg_Destroy) Channel error: cannot send/recv

Crash Annotation GraphicsCriticalError: |[C0][GFX1-]: Receive IPC close with reason=AbnormalShutdown (t=282.738)
    #0 0x4e114a in mozalloc_abort(char const*) /home/worker/workspace/build/src/memory/mozalloc/mozalloc_abort.cpp:33:5
    #1 0x7ff9a58e2035 in Abort(char const*) /home/worker/workspace/build/src/xpcom/base/nsDebugImpl.cpp:449:3
    #2 0x7ff9a58e1ddc in NS_DebugBreak /home/worker/workspace/build/src/xpcom/base/nsDebugImpl.cpp:405:7
    #3 0x7ff9a6846e7f in mozilla::ipc::MessageChannel::OnChannelErrorFromLink() /home/worker/workspace/build/src/ipc/glue/MessageChannel.cpp:2155:13
    #4 0x7ff9a684c123 in OnChannelError /home/worker/workspace/build/src/ipc/glue/MessageLink.cpp:367:5
    #5 0x7ff9a684c123 in non-virtual thunk to mozilla::ipc::ProcessLink::OnChannelError() /home/worker/workspace/build/src/ipc/glue/MessageLink.cpp:359
    #6 0x7ff9a68021fb in event_process_active_single_queue /home/worker/workspace/build/src/ipc/chromium/src/third_party/libevent/event.c:1350:4
    #7 0x7ff9a68021fb in event_process_active /home/worker/workspace/build/src/ipc/chromium/src/third_party/libevent/event.c:1420
    #8 0x7ff9a68021fb in event_base_loop /home/worker/workspace/build/src/ipc/chromium/src/third_party/libevent/event.c:1621
    #9 0x7ff9a67c1691 in base::MessagePumpLibevent::Run(base::MessagePump::Delegate*) /home/worker/workspace/build/src/ipc/chromium/src/base/message_pump_libevent.cc:372:7
    #10 0x7ff9a67bbaf8 in RunInternal /home/worker/workspace/build/src/ipc/chromium/src/base/message_loop.cc:232:3
    #11 0x7ff9a67bbaf8 in RunHandler /home/worker/workspace/build/src/ipc/chromium/src/base/message_loop.cc:225
    #12 0x7ff9a67bbaf8 in MessageLoop::Run() /home/worker/workspace/build/src/ipc/chromium/src/base/message_loop.cc:205
    #13 0x7ff9a67dbca1 in base::Thread::ThreadMain() /home/worker/workspace/build/src/ipc/chromium/src/base/thread.cc:180:3
    #14 0x7ff9a67dc7fc in ThreadFunc(void*) /home/worker/workspace/build/src/ipc/chromium/src/base/platform_thread_posix.cc:38:3
    #15 0x7ff9c02b00a3 in start_thread /build/glibc-daoqzt/glibc-2.19/nptl/pthread_create.c:309
    #16 0x7ff9bf3b762c in clone /build/glibc-daoqzt/glibc-2.19/misc/../sysdeps/unix/sysv/linux/x86_64/clone.S:111

AddressSanitizer can not provide additional info.
SUMMARY: AddressSanitizer: SEGV /home/worker/workspace/build/src/memory/mozalloc/mozalloc_abort.cpp:33:5 in mozalloc_abort(char const*)
Thread T2 (Chrome_ChildThr) created by T0 (Web Content) here:
    #0 0x49a869 in __interceptor_pthread_create /builds/slave/moz-toolchain/src/llvm/projects/compiler-rt/lib/asan/asan_interceptors.cc:238:3
    #1 0x7ff9a67db8bb in CreateThread /home/worker/workspace/build/src/ipc/chromium/src/base/platform_thread_posix.cc:137:14
    #2 0x7ff9a67db8bb in Create /home/worker/workspace/build/src/ipc/chromium/src/base/platform_thread_posix.cc:148
    #3 0x7ff9a67db8bb in base::Thread::StartWithOptions(base::Thread::Options const&) /home/worker/workspace/build/src/ipc/chromium/src/base/thread.cc:98
    #4 0x7ff9a684e307 in mozilla::ipc::ProcessChild::ProcessChild(int) /home/worker/workspace/build/src/ipc/glue/ProcessChild.cpp:24:5
    #5 0x7ff9adfe3a1b in ContentProcess /home/worker/workspace/build/src/obj-firefox/dist/include/mozilla/dom/ContentProcess.h:31:7
    #6 0x7ff9adfe3a1b in XRE_InitChildProcess /home/worker/workspace/build/src/toolkit/xre/nsEmbedFunctions.cpp:660
    #7 0x4dfb5b in content_process_main /home/worker/workspace/build/src/browser/app/../../ipc/contentproc/plugin-container.cpp:115:19
    #8 0x4dfb5b in main /home/worker/workspace/build/src/browser/app/nsBrowserApp.cpp:438
    #9 0x7ff9bf2f0b44 in __libc_start_main /build/glibc-daoqzt/glibc-2.19/csu/libc-start.c:287

==17388==ABORTING

Updated

3 months ago
Blocks: 1313562
We're seeing this on Try/inbound as well (see bug 1313562). I'm not sure which libxcb version is on the test machines; my (old) Fedora 22 has 1.11.

Updated

3 months ago
Duplicate of this bug: 1329180

Updated

2 months ago
Duplicate of this bug: 1329863

Updated

2 months ago
Depends on: 1290183

Updated

2 months ago
Duplicate of this bug: 1330802

Comment 18

2 months ago
This seems to be happening fairly regularly in automation, is it feasible to install either the upgraded upstream library or the patched library mentioned in comment 3 on the machines used in automation?
Flags: needinfo?(botond)
(Reporter)

Comment 19

2 months ago
(In reply to Andrew Swan [:aswan] from comment #18)
> This seems to be happening fairly regularly in automation, is it feasible to
> install either the upgraded upstream library or the patched library
> mentioned in comment 3 on the machines used in automation?

In my experience, installing a newer version of a library than the one available in a Linux distribution's repositories can be a tricky business. In the few cases where I've managed it, it was on a local installation, and with heavy guidance from people more familiar with these things (such as Andrew Comminos, who kindly helped me put in place the local upgrade I described in comment 3).

I think a more promising approach in this case would be to upgrade the operating system version itself. I see bug 1290183 is already on file for that, and Karl has already marked it as a dependency of this bug.
Flags: needinfo?(botond)

Updated

2 months ago
Duplicate of this bug: 1332551

Updated

2 months ago
Duplicate of this bug: 1332974

Updated

2 months ago
Duplicate of this bug: 1333231

Updated

2 months ago
Duplicate of this bug: 1334258

Comment 24

2 months ago
Since folks are apparently unwilling to work on this separately from bug 1290183, would it be possible for whatever tools are used to file intermittent bugs to learn about this particular crash, to avoid filing new bugs every time we see it?
Flags: needinfo?(philringnalda)
Short version: no.

Long version: Treeherder has a list of regexes which it matches against log lines to determine what lines indicate a failure. "INFO - firefox: ../../src/xcb_conn.c:180: write_vec: Assertion `!c->out.queue_len' failed." is not matched by any of them. Nor is "INFO - [GFX1-]: Receive IPC close with reason=AbnormalShutdown" nor "INFO - Hit MOZ_CRASH(Aborting on channel error.) at /home/worker/workspace/build/src/ipc/glue/MessageChannel.cpp:2186". Both the filing of new intermittents and the matching of a failure against existing bugs happen based on those parsed out lines. You can get Treeherder to add a regex, with (reasonable) difficulty since it needs to not result in false positives and it's nearly impossible to say whether or not one will result in false positives. Or, probably with even more difficulty, we could crash better than a bunch of INFO lines with the actual information followed by a completely and utterly useless actual minidump which provides zero information.
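To illustrate the kind of regex being discussed, here is a minimal sketch of a pattern that would match the libxcb assertion line while ignoring the other INFO lines quoted above. The pattern and function name are hypothetical and are not taken from Treeherder's actual regex list. The line number is matched loosely because both `xcb_conn.c:186` and `xcb_conn.c:180` appear in this bug.

```python
import re

# Hypothetical pattern for illustration; not Treeherder's real configuration.
# The source line number varies between libxcb builds, so it is matched
# as any run of digits.
XCB_ASSERTION = re.compile(
    r"xcb_conn\.c:\d+: write_vec: Assertion `!c->out\.queue_len' failed\."
)


def is_xcb_assertion(log_line: str) -> bool:
    """Return True if the log line contains the libxcb write_vec assertion."""
    return XCB_ASSERTION.search(log_line) is not None


if __name__ == "__main__":
    sample_lines = [
        "INFO - firefox: ../../src/xcb_conn.c:180: write_vec: "
        "Assertion `!c->out.queue_len' failed.",
        "INFO - [GFX1-]: Receive IPC close with reason=AbnormalShutdown",
    ]
    for line in sample_lines:
        print(is_xcb_assertion(line), line)
```

Anchoring on the file name, function name and assertion text together keeps the pattern narrow, which is one way to limit the false positives mentioned above, though whether it would be narrow enough in practice is not something this sketch can establish.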

So, yeah, by needinfoing me you now have me knowing that "with exit code 6" plus a minidump with a useless junk signature means I need to open the log, scroll up past the minidump, scroll up past some assertions, to get to the actual assertion that leads to this bug to star a failure as this. I can't speak to how often any other sheriff will remember to do that.
Flags: needinfo?(philringnalda)
Keywords: intermittent-failure

Comment 26

2 months ago
Well, sounds like we're still getting an incremental improvement, thanks!
Duplicate of this bug: 1318421
Duplicate of this bug: 1332820
Duplicate of this bug: 1332800
Duplicate of this bug: 1331520
Duplicate of this bug: 1333284
Duplicate of this bug: 1332548
Duplicate of this bug: 1332148
Duplicate of this bug: 1332047
Duplicate of this bug: 1331535
Duplicate of this bug: 1331196
Duplicate of this bug: 1330194
Duplicate of this bug: 1329866
Duplicate of this bug: 1328196
Duplicate of this bug: 1326569
Duplicate of this bug: 1326417
Duplicate of this bug: 1326073
Duplicate of this bug: 1325659
Duplicate of this bug: 1325389
Duplicate of this bug: 1324207
Duplicate of this bug: 1324060
Duplicate of this bug: 1323785
Duplicate of this bug: 1323757
Duplicate of this bug: 1323752
Duplicate of this bug: 1322642
Duplicate of this bug: 1322530
Duplicate of this bug: 1322363
Duplicate of this bug: 1322324
Duplicate of this bug: 1321937
Duplicate of this bug: 1321747
Duplicate of this bug: 1321722
Duplicate of this bug: 1321481
Duplicate of this bug: 1321457
Duplicate of this bug: 1321452
Duplicate of this bug: 1320847
Duplicate of this bug: 1320496
Duplicate of this bug: 1320167
Duplicate of this bug: 1320002
Duplicate of this bug: 1319656
Duplicate of this bug: 1319327
Duplicate of this bug: 1319313
Duplicate of this bug: 1319308
Duplicate of this bug: 1318598
Duplicate of this bug: 1318244
Duplicate of this bug: 1318233
Duplicate of this bug: 1318161
Duplicate of this bug: 1318022
Duplicate of this bug: 1317885
Duplicate of this bug: 1317623
Duplicate of this bug: 1317613
Duplicate of this bug: 1314211
Duplicate of this bug: 1310070
(Reporter)

Comment 78

2 months ago
I talked to :jmaher about the timeline for fixing bug 1290183, and it looks like, realistically speaking, we're probably looking at late Q1 / early Q2.

So, given the high volume of these intermittents, it might make sense to try and fix this issue on the existing 12.04 builders after all.
(Reporter)

Updated

2 months ago
Depends on: 1334641
(Reporter)

Comment 79

2 months ago
(In reply to Botond Ballo [:botond] from comment #78)
> So, given the high volume of these intermittents, it might make sense to try
> and fix this issue on the existing 12.04 builders after all.

Filed bug 1334641. As I mentioned, I'm a bit out of my depth in this area, so some help with that bug would be greatly appreciated!

Comment 80

2 months ago
13 failures in 749 pushes (0.017 failures/push) were associated with this bug in the last 7 days.  

Repository breakdown:
* mozilla-aurora: 4
* mozilla-inbound: 3
* autoland: 3
* mozilla-beta: 2
* mozilla-central: 1

Platform breakdown:
* linux64: 10
* linux32: 3

For more details, see:
https://brasstacks.mozilla.com/orangefactor/?display=Bug&bugid=1293474&startday=2017-01-23&endday=2017-01-29&tree=all
Duplicate of this bug: 1320244
Duplicate of this bug: 1332656
Duplicate of this bug: 1332056
Duplicate of this bug: 1331343
Duplicate of this bug: 1330211
Duplicate of this bug: 1326552
Duplicate of this bug: 1326135
Duplicate of this bug: 1325263
Duplicate of this bug: 1324812
Duplicate of this bug: 1324036
Duplicate of this bug: 1322637
Duplicate of this bug: 1322297
Duplicate of this bug: 1321458
Duplicate of this bug: 1321456
Duplicate of this bug: 1320068
Duplicate of this bug: 1316489
Duplicate of this bug: 1315852
Duplicate of this bug: 1315745

Updated

2 months ago
Duplicate of this bug: 1335950
Duplicate of this bug: 1335951

Comment 101

2 months ago
41 failures in 733 pushes (0.056 failures/push) were associated with this bug in the last 7 days. 

This is the #43 most frequent failure this week. 

Repository breakdown:
* autoland: 19
* mozilla-inbound: 13
* mozilla-aurora: 3
* mozilla-central: 2
* mozilla-beta: 2
* try: 1
* mozilla-release: 1

Platform breakdown:
* linux64: 27
* linux32: 14

For more details, see:
https://brasstacks.mozilla.com/orangefactor/?display=Bug&bugid=1293474&startday=2017-01-30&endday=2017-02-05&tree=all
Duplicate of this bug: 1313562
Duplicate of this bug: 1337094
Duplicate of this bug: 1337591
Duplicate of this bug: 1337582
(Reporter)

Comment 106

a month ago
(In reply to Botond Ballo [:botond] from comment #79)
> (In reply to Botond Ballo [:botond] from comment #78)
> > So, given the high volume of these intermittents, it might make sense to try
> > and fix this issue on the existing 12.04 builders after all.
> 
> Filed bug 1334641.

The libxcb on the 12.04 builders is now patched. Hopefully we should start seeing these intermittents taper off.
(Reporter)

Updated

a month ago
No longer depends on: 1290183
Duplicate of this bug: 1338380
Duplicate of this bug: 1337567

Comment 109

a month ago
24 failures in 836 pushes (0.029 failures/push) were associated with this bug in the last 7 days.  
Repository breakdown:
* mozilla-inbound: 11
* autoland: 10
* mozilla-central: 1
* mozilla-beta: 1
* mozilla-aurora: 1

Platform breakdown:
* linux64: 15
* linux32: 9

For more details, see:
https://brasstacks.mozilla.com/orangefactor/?display=Bug&bugid=1293474&startday=2017-02-06&endday=2017-02-12&tree=all
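The failures/push figures in the weekly summaries (comments 80, 101 and 109) are simply the failure count divided by the push count, rounded to three decimal places. The sketch below reproduces that arithmetic from the counts quoted in those comments; the variable and function names are illustrative only.

```python
# (week, failures, pushes) as reported in comments 80, 101 and 109.
weekly_reports = [
    ("2017-01-23 to 2017-01-29", 13, 749),
    ("2017-01-30 to 2017-02-05", 41, 733),
    ("2017-02-06 to 2017-02-12", 24, 836),
]


def failures_per_push(failures: int, pushes: int) -> float:
    """Failure rate per push, rounded to three decimals as in the reports."""
    return round(failures / pushes, 3)


if __name__ == "__main__":
    for week, failures, pushes in weekly_reports:
        rate = failures_per_push(failures, pushes)
        print(f"{week}: {failures} failures in {pushes} pushes "
              f"({rate:.3f} failures/push)")
```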
(Reporter)

Comment 110

a month ago
Looking pretty good:

https://brasstacks.mozilla.com/orangefactor/?display=Bug&bugid=1293474&startday=2017-02-06&endday=2017-02-19&tree=all

Looks like bug 1334641 fixed this. Closing.
Status: REOPENED → RESOLVED
Last Resolved: 7 months ago → a month ago
Resolution: --- → FIXED
Depends on: 1341387

Updated

26 days ago
Whiteboard: [gfx-noted] → [gfx-noted][stockwell fixed]