Closed Bug 827946 Opened 7 years ago Closed 7 years ago

Linux crash in skia::BGRAConvolve2D

Categories

(Core :: Graphics, defect, critical)

20 Branch
x86_64
Linux
defect
Not set
critical

Tracking

()

VERIFIED FIXED
mozilla21
Tracking Status
firefox20 + verified
firefox21 + verified

People

(Reporter: scoobidiver, Assigned: tnikkel)

References

()

Details

(5 keywords)

Crash Data

Attachments

(4 files)

It has been hit by three users since the latest build. The regression range is:
http://hg.mozilla.org/mozilla-central/pushloghtml?fromchange=605ae260b7c8&tochange=dccab70d8554
It might be related to bug 817356 although it landed one month ago.

Signature 	skia::BGRAConvolve2D(unsigned char const*, int, bool, skia::ConvolutionFilter1D const&, skia::ConvolutionFilter1D const&, int, unsigned char*, bool) More Reports Search
UUID	924f69e0-fba4-46be-8a7f-dfe922130108
Date Processed	2013-01-08 18:24:21
Uptime	985
Last Crash	16.5 minutes before submission
Install Age	2.9 hours since version was first installed.
Install Time	2013-01-08 15:29:26
Product	Firefox
Version	21.0a1
Build ID	20130108033457
Release Channel	nightly
OS	Linux
OS Version	0.0.0 Linux 3.2.0-24-generic #37-Ubuntu SMP Wed Apr 25 08:43:22 UTC 2012 x86_64
Build Architecture	amd64
Build Architecture Info	family 6 model 42 stepping 7
Crash Reason	SIGSEGV
Crash Address	0x7fdbe6900000
App Notes 	
OpenGL: Tungsten Graphics, Inc -- Mesa DRI Intel(R) Sandybridge Mobile  -- 3.0 Mesa 8.0.2 -- texture_from_pixmap
Processor Notes 	/data/socorro/stackwalk/bin/exploitable: ERROR: unable to analyze dump
EMCheckCompatibility	False

Frame 	Module 	Signature 	Source
0 	libxul.so 	skia::BGRAConvolve2D 	gfx/2d/convolver.cpp:178
1 	libxul.so 	skia::ImageOperations::ResizeBasic 	gfx/2d/image_operations.cpp:535
2 	libxul.so 	skia::ImageOperations::Resize 	gfx/2d/image_operations.cpp:371
3 	libxul.so 	skia::ImageOperations::Resize 	gfx/2d/image_operations.cpp:550
4 	libxul.so 	mozilla::gfx::Scale 	gfx/2d/Scale.cpp:45
5 	libxul.so 	ScaleRunner::Run 	image/src/RasterImage.cpp:326
6 	libxul.so 	nsThread::ProcessNextEvent 	xpcom/threads/nsThread.cpp:627
7 	libxul.so 	NS_ProcessNextEvent_P 	obj-firefox/xpcom/build/nsThreadUtils.cpp:237
8 	libxul.so 	nsThread::ThreadFunc 	xpcom/threads/nsThread.cpp:265
9 	libnspr4.so 	_pt_root 	nsprpub/pr/src/pthreads/ptthread.c:156
10 	libpthread-2.15.so 	libpthread-2.15.so@0x7e99 	

More reports at:
https://crash-stats.mozilla.com/report/list?signature=skia%3A%3ABGRAConvolve2D%28unsigned+char+const*%2C+int%2C+bool%2C+skia%3A%3AConvolutionFilter1D+const%26%2C+skia%3A%3AConvolutionFilter1D+const%26%2C+int%2C+unsigned+char*%2C+bool%29
Now hit by 8 users in one build.
Keywords: topcrash
I hit this clicking on the first link of https://encrypted.google.com/search?hl=en&q=smallworld%20board%20game%20geek

Graphics
Adapter Description: Tungsten Graphics, Inc -- Mesa DRI Intel(R) Sandybridge Mobile 
Device ID: Mesa DRI Intel(R) Sandybridge Mobile 
Driver Version: 2.1 Mesa 8.0.4
GPU Accelerated Windows: 0/1 Basic 
Vendor ID: Tungsten Graphics, Inc
WebGL Renderer: Tungsten Graphics, Inc -- Mesa DRI Intel(R) Sandybridge Mobile 
AzureCanvasBackend: cairo
AzureContentBackend: none
AzureFallbackCanvasBackend: none
Kevin, is it 100% reproducible for you?
Flags: needinfo?(kbrosnan)
Opening http://boardgamegeek.com/boardgame/40692/small-world (from comment #2) crashes most of the times for me. If not when loading it the first time, force-reloading with Ctrl+F5 does.
Duplicate of this bug: 830013
Flags: needinfo?(kbrosnan)
Blocks: 830639
Assignee: nobody → joe
I have tried and tried and just can't reproduce this. The high quality downscaler is turned on, too. Lots of force reloads seem to have no effect. I'm baffled.
George, any thoughts?
Flags: needinfo?(gwright)
Joe - have you been testing on Linux or OS X?
Flags: needinfo?(gwright)
It happens only on 64-bit Linux.
(In reply to Scoobidiver from comment #9)
> It happens only on 64-bit Linux.

I see that comment #0 is on Intel Sandybridge / Mesa 8.0.2, comment #2 is Intel Sandybridge Mobile / Mesa 8.0.4 - is there a commonality on terms of graphics hardware and/or Mesa version for this bug?
If anyone in MTV or SF wants to capture this it has been somewhat reproducible.
NVIDIA GeForce 9600M / Mesa 9.0.1 / x64

I just got this on the 20130115040206 build, 100% reproducible on loading: http://imgur.com/a/SWVTE

https://crash-stats.mozilla.com/report/index/bp-7830d459-5ed0-4807-b35d-3193c2130117

Updating to latest stopped the crashing on that page but I still get crashes on loading: http://sohowww.nascom.nasa.gov/data/realtime-images.html

https://crash-stats.mozilla.com/report/index/bp-1aaa631e-643b-4107-b0f3-0a80b2130117
Aurora (20.0a2) is affected too. Here is a corresponding crash report:

https://crash-stats.mozilla.com/report/index/bp-c7b4de8b-89f8-4331-a07b-e429c2130118
Summary: crash in skia::BGRAConvolve2D → Linux crash in skia::BGRAConvolve2D
Version: 21 Branch → 20 Branch
The timestamps suggest that this happened just before Firefox 20 went from central to aurora.
Joe, could this be another PGO issue?  Could that be related to not being able to reproduce?
Flags: needinfo?(joe)
I'll try to reproduce on a PGO'd build.
Flags: needinfo?(joe)
I managed to reproduce this once while either loading or closing one of
http://imgur.com/a/SWVTE
http://sohowww.nascom.nasa.gov/data/realtime-images.html
but despite much effort have not reproduced it since.
Keywords: reproducible
(In reply to Milan Sreckovic [:milan] from comment #17)
> Joe, could this be another PGO issue?

I thought we didn't do PGO on Linux? Note that this is from what I saw only happening on Linux so far - but there in pretty large numbers (for that platform).
Some of these crashes seem to occur even randomly. Like this one using Firefox 20.0a2 as of Jan. 28th, crashing Firefox while not even accessing a particular page:

https://crash-stats.mozilla.com/report/index/bp-f109ad83-e196-499c-8d6d-36c302130129
(In reply to Joe Drew (:JOEDREW! \o/) from comment #6)
> I have tried and tried and just can't reproduce this. The high quality
> downscaler is turned on, too. Lots of force reloads seem to have no effect.
> I'm baffled.
As Kairo points out in comment 10, are you testing with Intel Sandybridge? Do we have an option of testing with that?
By the way, I can't reproduce this either, but I do have nvidia card...
I created a local copy of http://sohowww.nascom.nasa.gov/data/realtime-images.html and removed as much as possible, and still get Firefox to crash. See soho.html for what I got.

Find the corresponding crash report at https://crash-stats.mozilla.com/report/index/bp-0f6236dd-7896-46b1-bbe6-c3b152130129

I hope this helps.
One more note, be sure to delete all browser data (cache, cookies etc.) and close Firefox completely when trying. My local test page didn't crash every time, but usually it did after cleaning and closing Firefox 21.0a1 completely.
I managed to remove even more, including the CSS and other meta tags. Now the page only adds http://sohowww.nascom.nasa.gov/data/realtime/eit_171/512/latest.jpg using the img tag, and still gets Firefox to crash.

However, if I load http://sohowww.nascom.nasa.gov/data/realtime/eit_171/512/latest.jpg directly, Firefox doesn't crash (at least I couldn't get it to).

So it seems the difference lays somewhere between adding http://sohowww.nascom.nasa.gov/data/realtime/eit_171/512/latest.jpg using the img tag, and the image itself.
I can now pretty consistently reproduce this myself with a nightly, but not with a build I've made myself.
try builds also fail to reproduce the crash.

What is different between try builds and nightlies that might cause this?
(In reply to Timothy Nikkel (:tn) from comment #28)
> What is different between try builds and nightlies that might cause this?

Linux PGO perhaps? I thought we weren't doing that, but I just read some other comments that imply that we actually do. And if MS PGO can cause weird crashes, gcc PGO might just as well.
I was able to reliably reproduce the crash in the pgo build http://ftp.mozilla.org/pub/mozilla.org/firefox/tinderbox-builds/mozilla-central-linux64-pgo/1359511648/ but was not able to reproduce the crash in the non-pgo build http://ftp.mozilla.org/pub/mozilla.org/firefox/tinderbox-builds/mozilla-central-linux64/1359508908/

They are both build from the same revision. As far as I can tell pgo is the only variable here. So we are either looking at a PGO bug in gcc or a bug in our code that is only tickled when PGO is enabled for some reason.
(In reply to Timothy Nikkel (:tn) from comment #30)
> but was not able to reproduce the crash in
> the non-pgo build
> http://ftp.mozilla.org/pub/mozilla.org/firefox/tinderbox-builds/mozilla-
> central-linux64/1359508908/

Confirming this. Couldn't get this build to crash on http://sohowww.nascom.nasa.gov/data/realtime-images.html as opposed to January 29th Nightly.

But I'm confused, isn't this the same source that Nightlies are based on?
Yes, it is exactly the same source, but it is compiled slightly differently. Nightlies have PGO (profile guided optimizations), which take a lot longer to compile but produce slightly faster code. We also make non-PGO builds but they aren't nightlies.

Thanks for testing.
Keywords: testcase
This is starting to sound like a PGO optimizer bug, then. :(
I figured out how to turn on PGO on try and that try build has the problem. This is good news as now we can do some poking using these try builds to try to figure out the problem.
A local PGO build using gcc 4.7 did not have the crash.
IIRC, our official build machines use gcc 4.5 right now.
(In reply to Robert Kaiser (:kairo@mozilla.com) from comment #36)
> IIRC, our official build machines use gcc 4.5 right now.

Correct, you can tell by looking at the build logs.
(In reply to Markus Popp from comment #26)
> Created attachment 707773 [details]
> Further stripped down soho.html
> 
> I managed to remove even more, including the CSS and other meta tags. Now
> the page only adds
> http://sohowww.nascom.nasa.gov/data/realtime/eit_171/512/latest.jpg using
> the img tag, and still gets Firefox to crash.
> 
> However, if I load
> http://sohowww.nascom.nasa.gov/data/realtime/eit_171/512/latest.jpg
> directly, Firefox doesn't crash (at least I couldn't get it to).
> 
> So it seems the difference lays somewhere between adding
> http://sohowww.nascom.nasa.gov/data/realtime/eit_171/512/latest.jpg using
> the img tag, and the image itself.

FWIW, I don't crash with latest aurora and latest nightly on this page.
(In reply to Mike Hommey [:glandium] from comment #38)
> > So it seems the difference lays somewhere between adding
> > http://sohowww.nascom.nasa.gov/data/realtime/eit_171/512/latest.jpg using
> > the img tag, and the image itself.
> 
> FWIW, I don't crash with latest aurora and latest nightly on this page.

Sorry, that was a bit misleading. Accessing latest.jpg directly doesn't crash, but http://sohowww.nascom.nasa.gov/data/realtime-images.html (which includes latest.jpg as img) still does (also in the latest Nightly, just tried it).
(In reply to Markus Popp from comment #39)
> (In reply to Mike Hommey [:glandium] from comment #38)
> > > So it seems the difference lays somewhere between adding
> > > http://sohowww.nascom.nasa.gov/data/realtime/eit_171/512/latest.jpg using
> > > the img tag, and the image itself.
> > 
> > FWIW, I don't crash with latest aurora and latest nightly on this page.
> 
> Sorry, that was a bit misleading. Accessing latest.jpg directly doesn't
> crash, but http://sohowww.nascom.nasa.gov/data/realtime-images.html (which
> includes latest.jpg as img) still does (also in the latest Nightly, just
> tried it).

Sorry, my message wasn't clear, I tried the attached page, not the linked page. I also tried ttp://boardgamegeek.com/boardgame/40692/small-world and only managed to crash once when going back and forth between the various pages.
So, with a crash under a debugger here's what i can see:
(gdb) p $_siginfo._sifields._sigfault
$1 = {si_addr = 0x7fffc6d00000}

Looking at /proc/self/maps:
7fffc6c00000-7fffc6d00000 rw-p 00000000 00:00 0 
7fffc6e00000-7fffc7500000 rw-p 00000000 00:00 0 

That address is not mapped, but is right at the end of an existing mapping. Could be a region that used to be mapped but isn't anymore, or a buffer overflow.
On another crash, the crash address is 0x7fc4ed500000, and the last mmap/munmap calls before the crash read:

munmap(0x7fc4ed400000, 1048576)         = 0
mmap(NULL, 2097152, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7fc4ed3ff000
munmap(0x7fc4ed3ff000, 4096)            = 0
munmap(0x7fc4ed500000, 1044480)         = 0
mmap(NULL, 3145728, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7fc4ed100000
munmap(0x7fc4ed300000, 1048576)         = 0
mmap(NULL, 1048576, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7fc4ed300000
munmap(0x7fc4ed100000, 2097152)         = 0
munmap(0x7fc4ed300000, 1048576)         = 0
mmap(NULL, 8392704, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS|MAP_STACK, -1, 0) = 0x7fc4ecbff000


See the munmap(0x7fc4ed500000, 1044480). It sounds like a race condition.
Actually, it sounds like a buffer overflow:

This is how things are allocated:
mmap(NULL, 2097152, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7fc4ed3ff000
munmap(0x7fc4ed3ff000, 4096)            = 0
munmap(0x7fc4ed500000, 1044480)         = 0

That probably comes from jemalloc chunk allocation, where jemalloc tries to get something 1MB-aligned, so it gets something bigger, and munmaps what it doesn't need.
I'm slowly plodding my way through this x86 assembly stuff.

(gdb) p $_siginfo._sifields._sigfault
$1 = {si_addr = 0x7fffcce00000}

the crashing instruction is
=> 0x00007ffff4175a62 <+1186>:	movdqu 0x30(%rcx),%xmm15
and
rcx            0x7fffccdfffc1

movdqu is a 128 bit or 16 byte read, 0x7fffccdfffc1 + 0x30 + 0xF = 0x7fffcce00000 so it segfaulting on the last byte of the read.
So we are in this loop
    // Apply the filter to the row to get the destination pixel in |accum|.
    int accum[4] = {0};
    for (int filter_x = 0; filter_x < filter_length; filter_x++) {
      ConvolutionFilter1D::Fixed cur_filter = filter_values[filter_x];
      accum[0] += cur_filter * row_to_filter[filter_x * 4 + R_OFFSET_IDX];
      accum[1] += cur_filter * row_to_filter[filter_x * 4 + G_OFFSET_IDX];
      accum[2] += cur_filter * row_to_filter[filter_x * 4 + B_OFFSET_IDX];
      if (has_alpha)
        accum[3] += cur_filter * row_to_filter[filter_x * 4 + A_OFFSET_IDX];
    }
and row_to_filter = 0x7fffcc1fffc0, filter_length = 16. So the largest address of row_to_filter that the C++ might access is row_to_filter + filter_length*4 - 1.

In the disassembly %rax starts out as row_to_filter. The assembly then loads 16 pixels (4 pixels at a time) starting at %rax, and then increments %rax by 16 pixels (interleaved with various SSE instructions operating on those pixels).

The assembly initializes %rcx to %rax+1. The assembly then loads 16 pixels starting at %rcx (so %rax but offset by 1 byte) and the increments %rcx by 16 pixels (interleaved with various SSE instructions operating on those pixels). I'm not sure why the compiler wants to do these unaligned loads. GCC 4.7 with PGO doesn't do any of these unaligned loads.

So you can see the problem here, the code loads 16 pixels starting from %rax+1 and %rax is an array of 16 pixels, so we go one byte past the end.

It looks like the problem is that this "pull 16 pixels at a time" code path needs to be only used on arrays of length > 16 (because it go a few bytes past 16 pixels) but it is currently used on arrays of length >= 16.

Indeed, before the 16 pixels at a time code path we have
    cmp    $0xf,%r9d
    jbe    0x7ffff4177291
where the jbe looks to jump to alternate code if we have 15 or less.
Attached file disassembly
The disassembly for reference.
This fixes the crash for me. "-O2" does not fix it. I'm open to other ideas for how to make the compiler behave. Also if there was an easy way to make this only apply to PGO that would be better.

I haven't tested the perf difference, but I'm open to suggestions for what would be a good way to test that.
Assignee: joe → tnikkel
Attachment #711675 - Flags: review?
Attachment #711675 - Flags: review? → review?(joe)
Attachment #711675 - Flags: review?(joe) → review+
And backed out cause that gcc version macro isn't defined where one would expect
https://hg.mozilla.org/integration/mozilla-inbound/rev/c38ac98bdc95
https://hg.mozilla.org/mozilla-central/rev/be77205650e3
Status: NEW → RESOLVED
Closed: 7 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla21
Comment on attachment 711675 [details] [diff] [review]
Make us use -O1 for this fuction on gcc 4.5

There are some crashes of this on aurora but it seems lower frequency. This is low risk so I think we should take it there.

[Approval Request Comment]
Bug caused by (feature/regressing bug #): it hasn't been tracked down in detail
User impact if declined: crashes on linux
Testing completed (on m-c, etc.): on m-c for a nightly now
Risk to taking this patch (and alternatives if risky): low risk, we just instruct the compiler to use less optimization on gcc 4.5 only (so Linux only for our purposes)
String or UUID changes made by this patch: none
Attachment #711675 - Flags: approval-mozilla-aurora?
Comment on attachment 711675 [details] [diff] [review]
Make us use -O1 for this fuction on gcc 4.5

low risk, go ahead with uplift.
Attachment #711675 - Flags: approval-mozilla-aurora? → approval-mozilla-aurora+
Firefox Nightly on Linux64 has definitely become a lot more stable since yesterday (Feb. 10). Crashed a lot before, no crash since yesterday so far.
Is this supposed to be fixed on Aurora yet?

While Firefox Nightly looks fine, Firefox Aurora still crashes at http://sohowww.nascom.nasa.gov/data/realtime-images.html as of February 13th build. Here is a corresponding crash report:

https://crash-stats.mozilla.com/report/index/bp-90a3fb6c-c444-4469-ba33-aa2812130213
Ah, the reason is that the MOZ_GCC_VERSION_AT_LEAST macro is only defined on mozilla-central and not aurora because it was added recently by bug 833254. I'll see about getting that uplifted.
Depends on: 833254
No longer depends on: 833254
There's been enough churn that the patch for bug 833254 needs some work to apply. I'm just going to make a patch doing it the old way looking at the gcc defines directly without the shiny new macro.
Reproduced the crash on Aurora 20.0a2 2013-02-12.
Verified fixed FF 20b1 Ubuntu 12.04 x64.
Keywords: verifyme
QA Contact: paul.silaghi
Verified fixed FF 21b2 Ubuntu 12.04 x64.
Status: RESOLVED → VERIFIED
Keywords: verifyme
Depends on: 1120060
You need to log in before you can comment on or make changes to this bug.