Closed Bug 1397918 Opened 7 years ago Closed 7 years ago

Crash in libc-2.19.so@0x36c37 from nsShmImage

Categories

(Core :: Graphics, defect, P1)

Unspecified
Linux
defect

Tracking

()

RESOLVED FIXED
mozilla58
Tracking Status
firefox-esr52 --- wontfix
firefox55 --- wontfix
firefox56 --- wontfix
firefox57 + fixed
firefox58 --- fixed

People

(Reporter: bkelly, Assigned: lsalzman)

References

(Depends on 1 open bug)

Details

(Keywords: crash, regression, Whiteboard: [gfx-noted])

Crash Data

Attachments

(1 file)

This bug was filed from the Socorro interface and is 
report bp-a7417e82-ffcc-4572-940c-eed7f0170524.
=============================================================

I spoke with a user today complaining of crashes.  He pointed me at this crash report.  In the last week there have been 128 reports on FF55.  It also triggers on other branches.

Milan, any thoughts?
Flags: needinfo?(milan)
Crash happened while typing stuff into the location bar.  It has occurred several times.
Nical, does this look like we're running out of some system resources getting shmem?
Flags: needinfo?(milan) → needinfo?(nical.bugzilla)
Priority: -- → P3
(In reply to Daniel Convissor from comment #1)
> Crash happened while typing stuff into the location bar.  It has occurred
> several times.

Hmm. I may be mis-remembering what was happening at time of this crash.

Recently had another location bar typing crash. It has a different signature:
https://crash-stats.mozilla.com/report/index/9f310c26-beb9-4008-90d5-11ef80170910

Sorry for any confusion.
(In reply to Milan Sreckovic [:milan] from comment #2)
> Nical, does this look like we're running out of some system resources
> getting shmem?

That would be my guess. The crash address is interestingly somewhat consistent, not sure what to make of it, though.
The volume doesn't look very high. My guess would be that xcb isn't dealing with the resource shortage very well, unless there was a potential error earlier that we overlook. I don't know the xcb/shmputimage stuff much.
Flags: needinfo?(nical.bugzilla)
We were hoping this got fixed in bug 1296911 (this crash report was tracked in 2016 in bug 1297326.)  Bug 1297326 comment 6 also seems to suggest a workflow to reproduce the crash:
    1.Launch Fx
    2.Start downloading about 5-6 files
    3.Pause/Resume/Cancel one of them
    4. Wait for the downloads to finish with the download panel opened

Lee what do you make of this (you reviewed the patch that suggested fixing things first time around.)
Flags: needinfo?(lsalzman)
See Also: → 1297326
(In reply to Milan Sreckovic [:milan] from comment #5)
> We were hoping this got fixed in bug 1296911 (this crash report was tracked
> in 2016 in bug 1297326.)  Bug 1297326 comment 6 also seems to suggest a
> workflow to reproduce the crash:
>     1.Launch Fx
>     2.Start downloading about 5-6 files
>     3.Pause/Resume/Cancel one of them
>     4. Wait for the downloads to finish with the download panel opened
> 
> Lee what do you make of this (you reviewed the patch that suggested fixing
> things first time around.)

The entire situation with vsync I think is a different mess that is not really related to this. And the crash reports unfortunately are not specific enough to tell us everything goes to hell. But bug 1334641 which is somewhere linked therein in the maze of see alsos may be related, and if so, this could be an upstream bug in version <= 1.10 of libxcb. If that is the case, and if we can verify these crashes are happening with said versions of libxcb, then one thing we could do is maybe blacklist old versions of libxcb which are still in-use on older stable versions of Debian/Ubuntu/etc.
Flags: needinfo?(lsalzman)
Daniel, can you check in your package manager which version of the libxcb library is installed? Also, what distribution are you using, and what version of that distribution?
Flags: needinfo?(danielc)
I don't see a straight up libxcb package. Which one do you want the version for?
https://packages.debian.org/search?keywords=libxcb&searchon=names&suite=stable&section=all

I'm running Debian 8.9.  uname says "3.16.0-4-amd64 #1 SMP Debian 3.16.43-2+deb8u3 (2017-08-15)"
Flags: needinfo?(danielc)
(In reply to Daniel Convissor from comment #8)
> I don't see a straight up libxcb package. Which one do you want the version
> for?
> https://packages.debian.org/
> search?keywords=libxcb&searchon=names&suite=stable&section=all
> 
> I'm running Debian 8.9.  uname says "3.16.0-4-amd64 #1 SMP Debian
> 3.16.43-2+deb8u3 (2017-08-15)"

You should have a package called libxcb1 installed. That is the one I need the version on.
1.10-3+b1
Based on irc discussion with Karl, we determined that there was no simple way to just check the version of libxcb we were using. He posed the idea that instead we could check for added symbols in newer versions, which still allows us to verify we are using an appropriately patched version of libxcb.

The upstream fix this helps us ensure is mentioned in bug 1293474 (https://cgit.freedesktop.org/xcb/libxcb/commit/src/xcb_out.c?id=be0fe56c3bcad5124dcc6c47a2fad01acd16f71a).

We added the XCB code in version 50 with bug 1286649, so this affects all releases right now.
Assignee: nobody → lsalzman
Status: NEW → ASSIGNED
Attachment #8911019 - Flags: review?(karlt)
Blocks: 1286649
Has Regression Range: --- → yes
Keywords: regression
Priority: P3 → P1
Attachment #8911019 - Flags: review?(karlt) → review+
Pushed by lsalzman@mozilla.com:
https://hg.mozilla.org/integration/mozilla-inbound/rev/a33c563d5428
don't use versions of libxcb before 1.11.1. r=karlt
https://hg.mozilla.org/mozilla-central/rev/a33c563d5428
Status: ASSIGNED → RESOLVED
Closed: 7 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla58
Too late for 56 but we could accept an uplift request to 57. Lee, could you take care of that?
Flags: needinfo?(lsalzman)
Comment on attachment 8911019 [details] [diff] [review]
don't use versions of libxcb before 1.11.1

Approval Request Comment
[Feature/Bug causing the regression]: bug 1286649
[User impact if declined]: Race condition causing random crashes on users of older Linux distributions.
[Is this code covered by automated tests?]: yes
[Has the fix been verified in Nightly?]: yes
[Needs manual test from QE? If yes, steps to reproduce]: no 
[List of other uplifts needed for the feature/fix]:
[Is the change risky?]: no
[Why is the change risky/not risky?]: Disables some acceleration for a subset of Linux users where the underlying X protocol library is unstable and crashing.
[String changes made/needed]: none
Flags: needinfo?(lsalzman)
Attachment #8911019 - Flags: approval-mozilla-esr52?
Attachment #8911019 - Flags: approval-mozilla-beta?
Comment on attachment 8911019 [details] [diff] [review]
don't use versions of libxcb before 1.11.1

Let's take it to 57 to improve stability.
Should be in 57b3
Attachment #8911019 - Flags: approval-mozilla-beta? → approval-mozilla-beta+
(In reply to Ed Lee :Mardak from comment #17)
> It hasn't been reassigned yet, but this change regressed linux64 opt tart by
> ~40%.
> https://treeherder.mozilla.org/perf.html#/
> graphs?timerange=86400&series=mozilla-inbound,1300725,1,
> 1&highlightedRevisions=a33c563d5428&zoom=1506052132814.313,1506056947844.
> 9243,4.189417800080204,6.1847780773455625

There are going to be performance regressions on machines using an outdated version of libxcb (over 4 years old). But we have absolutely no choice in this matter, since the alternative is the thread unsafety causing random crashes.

So, the question here is, which version of libxcb is the linux64 test machine using at this point?

According to bug 1293474, and bug 1334641, it would appear we are using the same old libxcb that is over 4 years old, but with just a spot fix. So this fix will not pick that up, since it's not actually using libxcb 1.11.1. So the proper fix for the test machines would be either to hack in a stub function for the symbol that we are detecting (xcb_discard_reply64) to fake the libxcb version, or just actually upgrade to the relevant libxcb 1.11.1 version. Botond, since you were involved in the patched libxcb, any pointers here?

In any case, for users on recent Linux distributions, the apparent slowdown in tart on our test infrastructure is illusory, and the fix in this bug is definitely necessary to ensure Firefox is stable.
Flags: needinfo?(botond)
I suppose alternatively, just WONTFIX the tart regression bug that will be filed unless we believe the short circuit logic from this bug would prevent measuring actual talos behavior for users. But then again if we want talos to be closer in measuring what users are running, and assuming that means "libxcb 1.11.1 or newer," that seems to point to updating libxcb.
Actually, looks like this is affecting a lot more than just tart, so that seems like the short circuit might hide desired measured behaviors.
Although this change adds risk that the test configuration might not detect
changes in what most users are running, this is just one in a larger
pool of differences in the underlying system over 5.5 years.

It would be better to prioritize updating the talos machines over adding a
solution for this one difference.  This may be a few months away.

https://github.com/taskcluster/taskcluster-rfcs/commit/abaa6a7285e540fad693cf7fdec0d2751fa6ccb7#diff-cd4a1b36ed2b93cfc52abf77a7d25d03R27
https://bugzilla.mozilla.org/show_bug.cgi?id=1384579
Depends on: 1402773
(In reply to Julien Cristau [:jcristau] from comment #23)
> I could try and get
> https://cgit.freedesktop.org/xcb/libxcb/commit/src/xcb_out.
> c?id=be0fe56c3bcad5124dcc6c47a2fad01acd16f71a into debian 8, if that would
> help?

The problem is this patch does not leave any signature that we can detect from within Firefox, so you'd also have to, within Debian's distribution of Firefox, remove the workaround we added here to match. Otherwise, we still need to keep the workaround in for other outdated Linux distributions that would not have a patched libxcb. So that would basically only help Debian, and not help us in the general case.
(In reply to Lee Salzman [:lsalzman] from comment #18)
> According to bug 1293474, and bug 1334641, it would appear we are using the
> same old libxcb that is over 4 years old, but with just a spot fix. So this
> fix will not pick that up, since it's not actually using libxcb 1.11.1. So
> the proper fix for the test machines would be either to hack in a stub
> function for the symbol that we are detecting (xcb_discard_reply64) to fake
> the libxcb version, or just actually upgrade to the relevant libxcb 1.11.1
> version. Botond, since you were involved in the patched libxcb, any pointers
> here?

I would definitely suggest going with the stub function approach. A full-blown upgrade to 1.11.1 on a Ubuntu 12.04 machine is going to run into dependency problems with libc [1].

Note that, in my experience, the process of building and packaging a patched version of libxcb has been a fair bit of hassle (documented in some detail in bug 1334641). In particular, I was only able to successfully build and package it on a physical Ubuntu 12.04 loaner machine. If someone volunteers to go through that process again, great :) Otherwise, I'd suggest going with what Karl said in comment 22, and waiting for the talos machines to be updated to 16.04.

[1] https://bugzilla.mozilla.org/show_bug.cgi?id=1334641#c11
Flags: needinfo?(botond)
(In reply to Lee Salzman [:lsalzman] from comment #24)
> (In reply to Julien Cristau [:jcristau] from comment #23)
> > I could try and get
> > https://cgit.freedesktop.org/xcb/libxcb/commit/src/xcb_out.
> > c?id=be0fe56c3bcad5124dcc6c47a2fad01acd16f71a into debian 8, if that would
> > help?
> 
> The problem is this patch does not leave any signature that we can detect
> from within Firefox, so you'd also have to, within Debian's distribution of
> Firefox, remove the workaround we added here to match. Otherwise, we still
> need to keep the workaround in for other outdated Linux distributions that
> would not have a patched libxcb. So that would basically only help Debian,
> and not help us in the general case.

Looked a bit more into this, we could make this work if you could apply that patch and this one: https://cgit.freedesktop.org/xcb/libxcb/commit/?id=cb621341a62e6d2233db3e337611f6fdd4f675a6

Seems safer to apply that patch than just make a stub function as other applications could be foreseeably dlsym'ing that function, so seems least harmful to just expose the real deal since the patch seems to mostly apply against 1.10.

That would give older Debian a poor man's libxcb 1.11.1, which would allow it to get recognized as fixed and not suffer the performance degradation. Would it be possible for you to build this package for Ubuntu 12.04 as well to get onto the test machines?
Flags: needinfo?(jcristau)
Unfortunately exposing new A{B,P}I is something I would very much avoid in a (old)stable distribution, so I guess I'll leave that alone.

WRT building a new libxcb for ubuntu 12.04, I guess I could do that, though I have no idea what the process to get something on to the test machines looks like.
Flags: needinfo?(jcristau)
(In reply to Julien Cristau [:jcristau] from comment #27)
> WRT building a new libxcb for ubuntu 12.04, I guess I could do that, though
> I have no idea what the process to get something on to the test machines
> looks like.

For building a packaging a patched libxcb, see bug 1334641 comments 13 through 17. The annoying thing is that, when I tried it, it would only succeed if done from a local machine with Ubuntu 12.04 installed (but perhaps you can overcome the problems I ran into in bug 1334641 comment 13 and this won't be necessary).

For getting the modified package onto the test machines, I think it depends on whether they use taskcluster or buildbot. For taskcluster, see bug 1334641 comment 20 (and the patch in bug 1334641); for buildbot, see bug 1341387.
(In reply to Lee Salzman [:lsalzman] from comment #15)
> [Is this code covered by automated tests?]: yes
> [Has the fix been verified in Nightly?]: yes
> [Needs manual test from QE? If yes, steps to reproduce]: no 

Setting qe-verify- based on Lee's assessment on manual testing needs and the fact that this fix has automated coverage.
Flags: qe-verify-
Depends on: 1404323
There are only a couple of crash reports per week here for ESR so I would say we shouldn't risk this for ESR52.  Unless I'm missing something here - please let me know if so.
Attachment #8911019 - Flags: approval-mozilla-esr52? → approval-mozilla-esr52-
(In reply to Liz Henry (:lizzard) (needinfo? me) from comment #30)
> There are only a couple of crash reports per week here for ESR so I would
> say we shouldn't risk this for ESR52.  Unless I'm missing something here -
> please let me know if so.

The problem is this is a race condition, so the full extent of the damage here and what security holes this may be introducing we don't know - given that it is widely disclosed. Since 52 ESR is most likely to be used on older systems (say, Ubuntu 12.04 and 14.04, or similar), which will be using the older affected versions of libxcb1, I would say they are the exact population who are most at risk of this.

The only complication we found to using this patch was resolved in bug 1404323. That fixes the lurking bug that this patch exposed. And that particular bug existed whether or not this patch was used, but rather this patch just made the occurrence more common.
Does comment 31 change your thoughts on this Liz or should we wontfix esr52?
Flags: needinfo?(lhenry)
I can see that point but the uplift risk doesn't seem to outweigh the known benefit - I would have taken this in the 1st two cycles of ESR but I think now it is too late. If the situation changes, let relman know (me or whoever else is the current ESR owner)
Flags: needinfo?(lhenry)
Depends on: 1446905
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: