Closed Bug 458092 Opened 16 years ago Closed 15 years ago

Crashes with BadIdChoice X errors

Categories

(Core :: Graphics, defect, P2)

x86
Linux
defect

Tracking

()

RESOLVED FIXED

People

(Reporter: dbaron, Assigned: vlad)

References

Details

Attachments

(6 files)

This may be the same as bug 454677, but it started happening quite often to me only in the last few days, so I suspect it's different.

In the past four days, I've crashed three times with a BadIdChoice X error:
2008-09-28 12:42:13
2008-09-29 18:47:44
2008-10-01 10:39:24

The first two times the error looked like this:

The program 'firefox-bin' received an X Window System error.
This probably reflects a bug in the program.
The error was 'BadIDChoice (invalid resource ID chosen for this connection)'.
  (Details: serial 158749545 error_code 14 request_code 155 minor_code 4)
  (Note to programmers: normally, X errors are reported asynchronously;
   that is, you will receive the error a while after causing it.
   To debug your program, run it with the --sync command line
   option to change this behavior. You can then get a meaningful
   backtrace from your debugger if you break on the gdk_x_error() function.)


(different serial the second time)

But the third time it gave me more information:

The program 'firefox-bin' received an X Window System error.
This probably reflects a bug in the program.
The error was 'BadIDChoice (invalid resource ID chosen for this connection)'.
  (Details: serial 150818044 error_code 14 request_code 155 minor_code 4)
  (Note to programmers: normally, X errors are reported asynchronously;
   that is, you will receive the error a while after causing it.
   To debug your program, run it with the --sync command line
   option to change this behavior. You can then get a meaningful
   backtrace from your debugger if you break on the gdk_x_error() function.)
Locking assertion failure.  Backtrace:
#0 /usr/lib/libxcb-xlib.so.0 [0x7f833992297c]
#1 /usr/lib/libxcb-xlib.so.0(xcb_xlib_lock+0x15) [0x7f8339922a15]
#2 /usr/lib/libX11.so.6 [0x7f833d7e4323]
#3 /usr/lib/libXrender.so.1(XRenderFreePicture+0x46) [0x7f833abd87c6]
#4 /home/dbaron/builds/mozilla-central/obj/firefox-debugopt/dist/bin/libthebes.so [0x7f83327bd8f4]
#5 /home/dbaron/builds/mozilla-central/obj/firefox-debugopt/dist/bin/libthebes.so(_moz_cairo_surface_finish+0x3b) [0x7f83327a51cb]
#6 /home/dbaron/builds/mozilla-central/obj/firefox-debugopt/dist/bin/libthebes.so(_moz_cairo_surface_destroy+0x58) [0x7f83327a5248]
#7 /home/dbaron/builds/mozilla-central/obj/firefox-debugopt/dist/bin/libthebes.so(_ZN11gfxASurface7ReleaseEv+0x36) [0x7f83327676c6]
#8 /home/dbaron/builds/mozilla-central/obj/firefox-debugopt/dist/bin/components/libgkgfxthebes.so [0x7f832eed9c6f]
#9 /home/dbaron/builds/mozilla-central/obj/firefox-debugopt/dist/bin/components/libgkgfxthebes.so [0x7f832eeda2ba]
#10 /home/dbaron/builds/mozilla-central/obj/firefox-debugopt/dist/bin/components/libgkgfxthebes.so [0x7f832eee5544]
#11 /home/dbaron/builds/mozilla-central/obj/firefox-debugopt/dist/bin/components/libgkgfxthebes.so [0x7f832eee584f]
#12 /home/dbaron/builds/mozilla-central/obj/firefox-debugopt/dist/bin/libxpcom_core.so(_ZN15nsCOMArray_baseD2Ev+0x3b) [0x7f833fd9b0eb]
#13 /home/dbaron/builds/mozilla-central/obj/firefox-debugopt/dist/bin/components/libimglib2.so [0x7f832f30ed41]
#14 /home/dbaron/builds/mozilla-central/obj/firefox-debugopt/dist/bin/components/libimglib2.so [0x7f832f30cccf]
#15 /home/dbaron/builds/mozilla-central/obj/firefox-debugopt/dist/bin/components/libimglib2.so [0x7f832f31f178]
#16 /home/dbaron/builds/mozilla-central/obj/firefox-debugopt/dist/bin/components/libimglib2.so [0x7f832f31d52c]
#17 /home/dbaron/builds/mozilla-central/obj/firefox-debugopt/dist/bin/components/libimglib2.so [0x7f832f31820a]
#18 /home/dbaron/builds/mozilla-central/obj/firefox-debugopt/dist/bin/components/libimglib2.so [0x7f832f31ac35]
#19 /home/dbaron/builds/mozilla-central/obj/firefox-debugopt/dist/bin/libxpcom_core.so(PL_DHashTableFinish+0x73) [0x7f833fd96b43]


I'm going to attempt to turn the above stack into something more useful, although it will be a little tricky.
Flags: blocking1.9.1?
Was the stack trace from running with --sync?
This happens about once a day, seemingly randomly, so I can't just rerun with --sync and make it crash again.
Likely not -- the minor number corresponds to RenderCreatePicture, not FreePicture.  Might have to dig into the X code to see what generates BadID; owen was suggesting that it might be due due to IDs getting out of sync somehow:

13:03 < otaylor> vlad_: Trying to create two resources with the ID of the 
                 second less than the ID of the first would cause that


But I have no idea how we'd get into that situation, unless the IDs wrapped around?
This one has better symbols and is quite different from the first.
Assigning blocking+ and over to me until we figure out what's going on.
Assignee: nobody → vladimir
Flags: blocking1.9.1? → blocking1.9.1+
Priority: -- → P3
I've now installed better debug symbols for libcairo.so.2, so I should get a better stack next time.
Some of the Mozilla filenames / line numbers might be off, since I'd rebuilt some things since starting.
(In reply to comment #10)
> *** Bug 454678 has been marked as a duplicate of this bug. ***

Note that unlike the crashes mentioned in both this bug and bug 454677 (which have request_code 155 minor_code 4), bug 454678 has request_code 53 minor_code 0. Since the day when I reported bugs 454677 and 454678 (and actually maybe a week or more before), I'm getting them approximately once or twice a day (14/155/4 much more often than 14/53/0), and only after SeaMonkey has run some hours since last restart. I suspect bug 454677 is due to the same cause as this one but I have doubts about bug 454678. We'll see when (and if) this one gets fixed.


Currently running

Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.9.1b2pre) Gecko/20081020 SeaMonkey/2.0a2pre - Build ID: 20081020000453

and I've had a case of this problem (14/155/4) on it. I'm keeping sysout/syserr logs (available on request) but I don't know how to catch a usable stack trace with gdb. (Breakpad is never triggered by these crashes.)
The requests will be different -- it's being triggered by anything that allocates an ID.  dbaron has some good stack traces showing that happening inside this bug, but from reading the X source I don't know what could be causing it.
Gonna take this off the blocking list for now; I haven't been able to reproduce it in a VM, and David says he hasn't seen it for a while.  From reading the X source and talking to X developers, the two things that could cause it are either a corrupt ID making it over, or running out of XIDs on the X side.  The latter would be odd, because we'd probably have run out of memory well before we got to running out of XIDs.  The former is possible, but again odd..

Useful information to get if it happens again would be the ID in _XIDHandler, and xrestop information.
Flags: wanted1.9.1+
Flags: blocking1.9.1-
Flags: blocking1.9.1+
Priority: P3 → P2
(In reply to comment #14)
> Gonna take this off the blocking list for now; I haven't been able to reproduce
> it in a VM, and David says he hasn't seen it for a while. [...]

I used to get these almost every day; the latest case I see is recorded in a logfile with a datestamp of 2008-10-23 13:54 +0200.
Attached file xrestop -b -m 1
Recorded shortly after OpenOffice received:

  X-Error: BadIDChoice (invalid resource ID chosen for this connection)
  Major opcode: 53 (X_CreatePixmap)

Then, ten minutes later the firefox process listed here (version 3.0.3)
received:

  BadIDChoice (invalid resource ID chosen for this connection)
  Details: serial 85482098 error_code 14 request_code 153

I don't see any large XID users.

Two different apps crashing the same way suggests that the bug is at a low
level.  I have xorg-server-1.4.2, libxcb-1.0, libX11-1.1.4, cairo-1.6.4-r1.
(In reply to comment #16)
[...]
> Two different apps crashing the same way suggests that the bug is at a low
> level.  I have xorg-server-1.4.2, libxcb-1.0, libX11-1.1.4, cairo-1.6.4-r1.

I'm on openSUSE 11.0, currently with:
xorg-x11-7.3-96.2
xorg-x11-server-7.3-110.9
xorg-x11-libxcb-7.3-48.1
xorg-x11-libX11-7.3-56.1
cairo-1.4.14-32.1

I just checked my SeaMonkey saved logs (20 of them, the oldest one dated October 24). Neither of the strings "BadIdChoice" and "X Window" happens in any of them, which means the bug (and its dupes bug 454677 and bug 454678, which I reported) have disappeared some time ago on my system. I haven't particularly noted if and when any "SuSE Online Updates" were applied to any Xorg-X11 packages, but I check at least once a day for any available upgrades (and apply them if there are).

However I had a "BadWindow (invalid window parameter)" X-error (error 3 request 20 minor 0) yesterday in BonEcho (Firefox 2 nightly). At the moment, this is an isolated case.

My current builds:
-- SeaMonkey:
Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.9.1b2pre) Gecko/20081112 SeaMonkey/2.0a2pre - Build ID: 20081112033648
-- Firefox:
Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.8.1.19pre) Gecko/20081112 BonEcho/2.0.0.19pre - Build ID: 2008111203
FWIW, same issue with
firefox-3.0.4, xorg-server-1.5.2, libxcb-1.1.90.1, libX11-1.1.5.
I haven't seen this for quite a while... perhaps it was fixed when I upgraded to Ubuntu 8.10 from 8.04.
I see this bug (or something that looks like it!) every few days using the Ubuntu 8.10 shipped Firefox 3.0.5 ("3.0.5+nobinonly-0ubuntu0.8.10.1", ATM), on a 2-core x86-64 (xlib 1.1.5, xcb 1.1.noidea).  The last crash I captured was from CreatePixmap (request code 53), not sure if I'm seeing it on other requests or not.  I've never seen it with any application besides firefox.  (Though firefox is presumably pushing the X library stack a lot harder than anything else I use; I don't run multi-day OOo sessions.)

Restarting firefox is about the most frustrating computing experience I can think of (cf #177175), so I'm certainly eager to help...

...however, I suspect this was tracked down already in March (though ignored until November):
http://osdir.com/ml/freedesktop.xcb/2008-04/msg00071.html
http://cgit.freedesktop.org/xorg/lib/libX11/commit/?id=cc19618d2eb3ed92a0b574aee26a7da8b4aed5d2
(In reply to comment #22)
> ...however, I suspect this was tracked down already in March (though ignored
> until November):
> http://osdir.com/ml/freedesktop.xcb/2008-04/msg00071.html
> http://cgit.freedesktop.org/xorg/lib/libX11/commit/?id=cc19618d2eb3ed92a0b574aee26a7da8b4aed5d2

Thanks for that link.  It looks related but I'm not sure that it's the fix we need, because I don't think Mozilla uses XCreate* functions from more than 1 thread.

I'd be interested to hear if anyone is seeing the crash with a libX11 built without xcb.  (This can be checked with "ldd /usr/lib*/libX11.so | grep xcb".)
All the stacks that I've seen here have libX11 built using libxcb.

All the stacks that I've seen involve a Create function that uses XAllocID.  Usually the result of XAllocID is assigned to local variables in the Create function called "pid" as well as "req->pid" (and would be returned on success).
It could be useful to know what value either of these variables hold if this information can be obtained using a debugger after breaking on gdk_x_error.
(In reply to comment #23)
> Thanks for that link.  It looks related but I'm not sure that it's the fix we
> need, because I don't think Mozilla uses XCreate* functions from more than 1
> thread.

Ah man, and that was such a good guess... but I've just double-checked with gdb, and in about 10 minutes of browsing I wasn't able to catch any calls to _XAllocID, _XAllocIDs, or xcb_generate_id from anything but the main thread, so you're right.  (Unfortunately, the conditional breakpoint is a little expensive to leave running in general, though.)
Just got another crash, and this time I saved a core file from inside gdk_x_error.  It's 250 MB compressed and only useful if you have Ubuntu 8.10 debugging packages available, so I'm not attaching it, but I can provide it on request.

Relevant bit of backtrace:
#5  0x00007f7d774a71ab in _XSyncFunction (dpy=0x8c4c00)
    at ../../src/Synchro.c:37
#6  0x00007f7d774b398b in _XIDHandler (dpy=0x8c4c00) at ../../src/xcb_io.c:275
#7  0x00007f7d77775f77 in XRenderCreatePicture () from /usr/lib/libXrender.so.1

So in _XIDHandler, we have dpy->xcb->next_xid == 57206503.  Not an obviously invalid value.

The xid comes from xcb_generate_id, so I peeked in there at the algorithm it uses, and noticed something funny.  xcb_generate_id stores its state in the dpy->xcb->connection->xid struct.  At the time of the crash, some members of this struct:
  xid.base == 56623104
  xid.inc  == 1
  xid.last == 57206504
  xid.max  == 57206499

By my reading, it should be impossible for xid.last to exceed xid.max, yet here we are.
Thanks, Nathaniel.  Please keep that core file and copies or at least the package versions of xcb xlib and libXrender.

dpy->xcb->next_xid is most likely the xid after the one that caused the problem.
More likely the problem xid is the one in the XRenderCreatePicture frame.
Can you switch to that frame ("f 7"), please, and print req->pid and pid?
I can't seem to get at req->pid or pid easily -- it turns out I didn't have the debug info installed for libXrender when I started firefox, so the gdb attached to the firefox process can't figure out where local variables like that are hidden, and that when I start gdb on the core file, it declines to load any debug symbols whatsoever ("i shar" claims that the only library mapped is ld-linux).

However, the resourceid field in the XErrorEvent we catch is 57206502, and that should be the same as what we'd see from req->pid or pid.
(In reply to comment #25)
> The xid comes from xcb_generate_id, so I peeked in there at the algorithm it
> uses, and noticed something funny.  xcb_generate_id stores its state in the
> dpy->xcb->connection->xid struct.  At the time of the crash, some members of
> this struct:
>   xid.base == 56623104
>   xid.inc  == 1
>   xid.last == 57206504
>   xid.max  == 57206499
> 
> By my reading, it should be impossible for xid.last to exceed xid.max, yet here
> we are.

(In reply to comment #26)
> Asked on the xcb list:
> http://lists.freedesktop.org/archives/xcb/2009-January/004247.html

Nice detective work.  Thanks!
FWIW, I'm seeing this once every 2-4 days on one of my linux boxes, and I'm pretty sure I've seen it on another box as well. The box where I see it for sure is running Firefox in a VNC X server on Fedora Core 10. The other box is running Fedora Core 8, also under VNC.
Nathaniel and Barton found a theoretical situation that would cause this crash
(http://lists.freedesktop.org/archives/xcb/2009-January/004248.html) but it hasn't actually been proven that this situation is happening.
If this situation is happening, it can be detected by

* getting debuginfo packages for libxcb, and
* setting a conditional breakpoint on the line in xcb_generate_id after the
  call to xcb_xc_misc_get_xid_range_reply, with condition "range->count == 1".

(It is not necessary to run with --sync.)
If the breakpoint is hit, then the program will crash soon afterwards.
If the BadIDChoice results without hitting the breakpoint, then there is also another bug.
Ding ding ding!

Breakpoint 2, xcb_generate_id (c=0x1450640) at xcb_xid.c:51
51	        c->xid.last = range->start_id;
$559 = {response_type = 1 '\001', pad0 = 0 '\0', sequence = 29345, length = 0, start_id = 41385235, count = 1}

Breakpoint 1, gdk_x_error (display=0x144fc00, error=0x7ffff91140d0) at /build/buildd/gtk+2.0-2.14.4/gdk/x11/gdkmain-x11.c:613

I finally caught the crash in action, and indeed, it occurred immediately after the first time the X server sent back a xid_range_reply with count=1.

So my crashes are definitely due to this xcb bug.
Nathaniel: what does this suggest should be done?
Is fixing xcb_generate_id in libxcb going to fix this problem or is some fix to firefox required as well?

PS: I'm getting these same crashes in Firefox on Fedora 10.

PPS: I wondered whether this bug report is related https://bugzilla.redhat.com/show_bug.cgi?id=458657
Yes, it looks like the solution is to fix xcb_generate_id.

The relevant patch hasn't been committed upstream yet, see
  http://article.gmane.org/gmane.comp.freedesktop.xcb/4192
for discussion and the patch.

Probably the most useful thing would be for those seeing the crash to test out that patch and see if it prevents the crash (and doesn't cause any new ones!).

The RH bug indeed looks like it might be related (I really hope there aren't *two* ways of getting BadIDChoice errors!), and if the commenters there are seeing the crash so often, then their testing feedback would probably be very useful.
Nathaniel: thanks for all your work on this.

As I read it, Bart's patch still needs a couple of tweaks.  I hope he comes up with a revision to reflect these.

I created a bugzilla entry on the freedesktop.org bugzilla since I think that this needs to be reported against libxcb, not (just) its clients.  Please add to it.

https://bugs.freedesktop.org/show_bug.cgi?id=20254
I rebuilt xcb with the patch from comment 34 and then restarted Firefox, but it wouldn't start due to:

firefox-bin: xcb_xid.c:51: xcb_generate_id: Assertion `c->xid.last == c->xid.max' failed.
"I just need to get a final commit together.  This weekend,
honest.  Thanks much for the note." -- http://article.gmane.org/gmane.comp.freedesktop.xcb/4317
Here's the proposed patch, and he's looking for testers:
http://thread.gmane.org/gmane.comp.freedesktop.xcb/4317/focus=4324
Marking FIXED based on comment 40 and
https://bugs.freedesktop.org/show_bug.cgi?id=20254#c2
Status: NEW → RESOLVED
Closed: 15 years ago
Resolution: --- → FIXED
Any plans on porting this fix to the 1.9.0 branch? I've had it yesterday (2009-05-14) on a GranParadiso nightly (Fx 3.0.11pre).
The bug is in a system library, not our code; you need to talk to your distributor about getting the fix into their copy of libxcb.
(In reply to comment #44)
> The bug is in a system library, not our code; you need to talk to your
> distributor about getting the fix into their copy of libxcb.

I see. Well, my distributor is Novell/SuSE (but I get Mozilla products straight from Mozilla in .tar.bz2 form) and I diligently apply all "online updates" to their distribution (currently openSUSE 11.1). Let's hope they'll upgrade their X libraries some day.
Tony Mechelynck:
Report this bug to Novell.  Some other distro's have already released updates for it.  https://bugs.freedesktop.org/show_bug.cgi?id=20254
(In reply to comment #46)
> Tony Mechelynck:
> Report this bug to Novell.  Some other distro's have already released updates
> for it.  https://bugs.freedesktop.org/show_bug.cgi?id=20254

I did, by commenting on what looked like an appropriate existing bug, see https://bugzilla.novell.com/show_bug.cgi?id=423740#c1

I did not search for duplicates, however. If you think that it is indeed the appropriate Novell bug, you may (or may not) want to add a crossref for it on your freedesktop bug.
This apparently landed on Ubuntu jaunty 9.04 (the current stable release) on July 9th (2009).

https://bugs.launchpad.net/ubuntu/+source/firefox-3.0/+bug/220628
Still happens for me with Firefox 3.5.2

The program 'firefox-bin' received an X Window System error.
This probably reflects a bug in the program.
The error was 'BadIDChoice (invalid resource ID chosen for this connection)'.
  (Details: serial 178785760 error_code 14 request_code 156 minor_code 4)
  (Note to programmers: normally, X errors are reported asynchronously;
   that is, you will receive the error a while after causing it.
   To debug your program, run it with the --sync command line
   option to change this behavior. You can then get a meaningful
   backtrace from your debugger if you break on the gdk_x_error() function.)
libxcb: WARNING! Program tries to lock an already locked connection,
        which indicates a programming error.
        There will be no further warnings about this issue.
Please ignore last comment, I use an old SuSE 10.3, and didn't realize that I need to update X11.
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: