Closed Bug 1324130 Opened 8 years ago Closed 7 years ago

Crash in _pixman_implementation_lookup_composite

Categories

(Core :: Graphics, defect)

51 Branch
x86
Windows
defect
Not set
critical

Tracking

()

RESOLVED FIXED
mozilla54
Tracking Status
firefox50 --- wontfix
firefox51 --- wontfix
firefox52 --- fixed
firefox-esr52 --- fixed
firefox53 --- fixed
firefox54 --- fixed

People

(Reporter: philipp, Assigned: lsalzman)

References

Details

(Keywords: crash, regression, Whiteboard: [gfx-noted])

Crash Data

Attachments

(2 files, 1 obsolete file)

This bug was filed from the Socorro interface and is 
report bp-744b786c-4db6-450b-a46c-e03122161216.
=============================================================
Crashing Thread (0)
Frame 	Module 	Signature 	Source
0 	xul.dll 	_pixman_implementation_lookup_composite 	gfx/cairo/libpixman/src/pixman-implementation.c:108
1 	xul.dll 	_moz_pixman_image_composite32 	gfx/cairo/libpixman/src/pixman.c:683
2 	xul.dll 	_cairo_image_surface_composite 	gfx/cairo/cairo/src/cairo-image-surface.c:4318
3 	xul.dll 	_cairo_surface_composite 	gfx/cairo/cairo/src/cairo-surface.c:1889
4 	xul.dll 	_cairo_surface_fallback_composite 	gfx/cairo/cairo/src/cairo-surface-fallback.c:1431
5 	xul.dll 	_cairo_surface_composite 	gfx/cairo/cairo/src/cairo-surface.c:1900
6 	xul.dll 	_composite_rectangle 	gfx/cairo/cairo/src/cairo-win32-surface.c:3176
7 	xul.dll 	_clip_and_composite_trapezoids 	gfx/cairo/cairo/src/cairo-win32-surface.c:3226
8 	xul.dll 	_cairo_win32_surface_fallback_fill 	gfx/cairo/cairo/src/cairo-win32-surface.c:3841
9 	xul.dll 	_cairo_path_fixed_fill_rectilinear_to_region 	gfx/cairo/cairo/src/cairo-path-fill.c:258
10 	xul.dll 	_cairo_gstate_fill 	gfx/cairo/cairo/src/cairo-gstate.c:1290
11 	xul.dll 	mozilla::gfx::DrawTargetCairo::DrawPattern(mozilla::gfx::Pattern const&, mozilla::gfx::StrokeOptions const&, mozilla::gfx::DrawOptions const&, mozilla::gfx::DrawTargetCairo::DrawPatternType, bool) 	gfx/2d/DrawTargetCairo.cpp:1032
12 	xul.dll 	mozilla::gfx::DrawTargetCairo::FillRect(mozilla::gfx::RectTyped<mozilla::gfx::UnknownUnits, float> const&, mozilla::gfx::Pattern const&, mozilla::gfx::DrawOptions const&) 	gfx/2d/DrawTargetCairo.cpp:1089
13 	xul.dll 	gfxContext::Fill(mozilla::gfx::Pattern const&) 	gfx/thebes/gfxContext.cpp:198
14 	xul.dll 	gfxContext::SetMatrix(gfxMatrix const&) 	gfx/thebes/gfxContext.cpp:272
15 	xul.dll 	nsCSSRendering::PaintGradient(nsPresContext*, nsRenderingContext&, nsStyleGradient*, nsRect const&, nsRect const&, nsRect const&, nsSize const&, mozilla::gfx::IntRectTyped<mozilla::CSSPixel> const&, nsSize const&) 	layout/base/nsCSSRendering.cpp:3038
16 	xul.dll 	nsImageRenderer::Draw(nsPresContext*, nsRenderingContext&, nsRect const&, nsRect const&, nsRect const&, nsPoint const&, nsSize const&, mozilla::gfx::IntRectTyped<mozilla::CSSPixel> const&) 	layout/base/nsCSSRendering.cpp:5482
17 	xul.dll 	nsImageRenderer::DrawBackground(nsPresContext*, nsRenderingContext&, nsRect const&, nsRect const&, nsPoint const&, nsRect const&, nsSize const&) 	layout/base/nsCSSRendering.cpp:5584

crashes with this signature have been for a while but they are increasing in frequency after firefox 50 and again after 51 - they are mostly happening in the content process. just looking at the beta channel, there have been ~1-5 daily crashes with this signature before 50, in 50 there were around 5-20 daily crashes, and in 51 it's rising to a volume of 20-100 daily crashes.

currently it's the #10 content crasher on 51.0b7 accounting for 0.9% of content crashes. 

Correlations for Firefox Beta:
(100.0% in signature vs 00.44% overall) address = 0x4
(100.0% in signature vs 13.99% overall) reason = EXCEPTION_ACCESS_VIOLATION_READ
(99.49% in signature vs 30.59% overall) Module "wevtapi.dll" = true
(99.49% in signature vs 32.20% overall) Module "Wpc.dll" = true
(100.0% in signature vs 41.18% overall) Module "pnrpnsp.dll" = true
(100.0% in signature vs 41.24% overall) Module "NapiNSP.dll" = true
(100.0% in signature vs 41.25% overall) Module "nlaapi.dll" = true
(92.89% in signature vs 31.13% overall) Module "samcli.dll" = true
(86.80% in signature vs 28.25% overall) Module "RpcRtRemote.dll" = true
(86.80% in signature vs 29.33% overall) Module "samlib.dll" = true
(86.80% in signature vs 29.51% overall) Module "WSHTCPIP.DLL" = true
(85.28% in signature vs 28.63% overall) Module "d3d11.dll" = true
(85.28% in signature vs 29.39% overall) Module "dxgi.dll" = true
(86.80% in signature vs 31.86% overall) Module "dhcpcsvc.dll" = true
(49.75% in signature vs 11.64% overall) os_arch = amd64
(22.84% in signature vs 00.79% overall) adapter_device_id = 0x0042
From user dump, I found cache in [1] was null and caused problem when access cache->cache[i].fast_path.
It looks like the allocation failure of TLS[2] or calloc[3].

I also found the 'Uptime' of these crashes were less than 25 seconds.

[1]http://searchfox.org/mozilla-central/source/gfx/cairo/libpixman/src/pixman-implementation.c#91
[2]http://searchfox.org/mozilla-central/source/gfx/cairo/libpixman/src/pixman-compiler.h#179
[3]http://searchfox.org/mozilla-central/source/gfx/cairo/libpixman/src/pixman-compiler.h#149

Jeff, do you have any idea about this bug?
Flags: needinfo?(jmuizelaar)
Whiteboard: [gfx-noted]
Still alive and well on 52. No idea if 53/54 are affected since this only appears to crop up on Beta and Release. Would be great if we could get someone to investigate this more.
Milan, can you help find an owner for this?
Flags: needinfo?(milan)
Here's a 52 crash - https://crash-stats.mozilla.com/report/index/d8378a6e-1729-4021-870b-3e8352170208.  Bas, can this stack happen with the default prefs?  52 should have skia content, this doesn't look like printing, but there is the DrawTargetRecording and basic layers, and it gets a bit confusing.
Flags: needinfo?(milan)
Flags: needinfo?(jmuizelaar)
Flags: needinfo?(bas)
So the bottom part of that stack makes no sense to me, so I don't know where it's coming from. For the top bit, I agree, it looks like it would have to be printing. Also note the short uptime on this crash. In this case it's actually possible someone was printing, with an uptime of less than 25 seconds it seems a little unlikely. Unless they use Firefox as their PDF handler or something like that?

With non-default prefs I don't see how it could happen, but maybe there's parts of the code where we didn't convert to Skia? The people that ran the conversion would probably know better than me about that particular situation.
Flags: needinfo?(bas)
Lee, thoughts?  Printing or something going wrong with Skia pref'd Firefox, or somebody flipping the preference to Cairo?
Flags: needinfo?(lsalzman)
(In reply to Milan Sreckovic [:milan] from comment #6)
> Lee, thoughts?  Printing or something going wrong with Skia pref'd Firefox,
> or somebody flipping the preference to Cairo?

Looking through some of the 52 bug reports, there are some observations:

Sometimes we are just explicitly calling pixman_image_composite32 directly, like in this stack running through imagelib: https://crash-stats.mozilla.com/report/index/bde89733-2ed9-4943-abf5-7bd702170215

Sometimes we are unconditionally creating temporary software Cairo contexts in various places in layout, like the SVG code: https://crash-stats.mozilla.com/report/index/37256978-f04e-4d2a-8355-3c2142170215

Possibly other places too by some negligible counts, but those seem the main ones. So there are sufficient places we still use Cairo DTs, even when we're not even using Cairo as our content or canvas backend.

The stacks pretty much all end up here: https://hg.mozilla.org/releases/mozilla-beta/annotate/7b8aa893944b/gfx/cairo/libpixman/src/pixman-implementation.c#l108
The crash address is 0x4, which is strange because the line pegged is info->dest_flags, and dest_flags is stored at way higher offset than 4 within the pixman_fast_path_t struct. However, within the cache_t struct, fast_path itself WOULD be at address 0x4 on x86-32, where the pointer size is 4 bytes, given this cache_t definition:

typedef struct
{
    struct
    {
    pixman_implementation_t *   imp;
    pixman_fast_path_t      fast_path;
    } cache [N_CACHED_FAST_PATHS];
} cache_t;

The report listing indicates this only happens on x86 to back up that evidence.

So the only reason accessing &cache[0].fast_path would ever give us 0x4 is if cache is NULL. That would mean this line above is failing: cache = PIXMAN_GET_THREAD_LOCAL (fast_path_cache);

Why that is is somewhat of a mystery, but a trivial workaround for it is to just make it bypass the cache if accessing it fails and do the fallback lookup. I can write up a patch for this.
Flags: needinfo?(lsalzman)
The only place pixman actually uses TLS is this cache, and this seems to be intermittently failing for reasons we can't pin down. So for now, this chooses to just make this function more robust against the cache not being available, but still letting pixman work by going through its normal implementation-walking code that does not access the cache, which is not necessary for functioning.
Assignee: nobody → lsalzman
Status: NEW → ASSIGNED
Attachment #8837764 - Flags: review?(jmuizelaar)
Comment on attachment 8837764 [details] [diff] [review]
verify accessing the TLS fast-path cache succeeded in _pixman_implementation_lookup_composite

We need to understand when cache will be null.
Attachment #8837764 - Flags: review?(jmuizelaar) → review-
(In reply to Jeff Muizelaar [:jrmuizel] from comment #9)
> Comment on attachment 8837764 [details] [diff] [review]
> verify accessing the TLS fast-path cache succeeded in
> _pixman_implementation_lookup_composite
> 
> We need to understand when cache will be null.

Okay, I will leave that investigation to someone else who may be interested for now.
Assignee: lsalzman → nobody
Status: ASSIGNED → NEW
Milan - can you find an owner for this regression?
Flags: needinfo?(milan)
So if this is a TLS problem we can fix it post XP by switching from using the Tls functions and using __declspec(thread). However, this code has not changed in a long time, so if this is a regression the cause came from something else.
Since the last version on which we support Windows XP is 52, and trunk is at 54, we no longer need to use the pixman TLS workaround in trunk.

This just removes the enabling of that workaround in our build, so that we now use __declspec(thread), which will get rid of the inherent bugs of setting up the TLS cache.

We will uplift a different patch for beta and/or aurora to wallpaper over the problem there unless a better solution can eventually be found.
Assignee: nobody → lsalzman
Status: NEW → ASSIGNED
Attachment #8838143 - Flags: review?(jmuizelaar)
Attachment #8838143 - Flags: review?(jmuizelaar) → review+
Jeff, can you r+ the original wallpaper patch on the condition that we will only use it for uplift?
Attachment #8837764 - Flags: review- → review+
Comment on attachment 8837764 [details] [diff] [review]
verify accessing the TLS fast-path cache succeeded in _pixman_implementation_lookup_composite

Review of attachment 8837764 [details] [diff] [review]:
-----------------------------------------------------------------

Please add a better comment to this.
Added a comment explaining the situation with Tls functions failing to set up the fast-path cache on Windows XP and how this may cause the fast-path cache to end up null, yet we can never the less function without the fast-path cache, just having to look through the fast-path implementation list via linear search every time.
Attachment #8837764 - Attachment is obsolete: true
Flags: needinfo?(milan)
Attachment #8838149 - Flags: review+
Pushed by lsalzman@mozilla.com:
https://hg.mozilla.org/integration/mozilla-inbound/rev/debb1b927f6a
use __declspec(thread) instead of Tls calls in pixman on Windows. r=jrmuizel
Comment on attachment 8838149 [details] [diff] [review]
Bug 1324130 - verify accessing the TLS fast-path cache succeeded in _pixman_implementation_lookup_composite. r=jrmuizel

Approval Request Comment
[Feature/Bug causing the regression]: bug 1007702, so 52+
[User impact if declined]: Crashiness for 32-bit Windows.
[Is this code covered by automated tests?]: yes
[Has the fix been verified in Nightly?]: No. This patch is only for aurora/beta, as we are going with a different fix for trunk that can't be used for beta 52 due to the requirement that we still support Windows XP in that build.
[Needs manual test from QE? If yes, steps to reproduce]: no 
[List of other uplifts needed for the feature/fix]: aurora (53), beta (52)
[Is the change risky?]: No.
[Why is the change risky/not risky?]: This code in question can function without the mechanism that would otherwise make it crash. This just bypasses the crash.
[String changes made/needed]: None
Attachment #8838149 - Attachment description: verify accessing the TLS fast-path cache succeeded in _pixman_implementation_lookup_composite → Bug 1324130 - verify accessing the TLS fast-path cache succeeded in _pixman_implementation_lookup_composite. r=jrmuizel
Attachment #8838149 - Flags: approval-mozilla-beta?
Attachment #8838149 - Flags: approval-mozilla-aurora?
https://hg.mozilla.org/mozilla-central/rev/debb1b927f6a
Status: ASSIGNED → RESOLVED
Closed: 7 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla54
Comment on attachment 8838149 [details] [diff] [review]
Bug 1324130 - verify accessing the TLS fast-path cache succeeded in _pixman_implementation_lookup_composite. r=jrmuizel

workaround TLS issue in pixman causing crashes on windows, aurora53+, beta52+
Attachment #8838149 - Flags: approval-mozilla-beta?
Attachment #8838149 - Flags: approval-mozilla-beta+
Attachment #8838149 - Flags: approval-mozilla-aurora?
Attachment #8838149 - Flags: approval-mozilla-aurora+
Setting qe-verify- based on Lee's assessment on manual testing needs (Comment 18) and the fact that this fix has automated coverage.
Flags: qe-verify-
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: