Closed Bug 1067018 Opened 6 years ago Closed 6 years ago

crash in mozilla::gfx::SourceSurfaceAlignedRawData::InitWithStride(mozilla::gfx::IntSizeTyped<mozilla::gfx::UnknownUnits> const&, mozilla::gfx::SurfaceFormat, int, bool)

Categories

(Core :: Graphics: Layers, defect, critical)

All
Android
defect
Not set
critical

Tracking

()

VERIFIED FIXED
mozilla35
Tracking Status
firefox32 --- unaffected
firefox33 + verified
firefox34 + verified
firefox35 + verified
b2g-v2.0 --- unaffected
b2g-v2.0M --- unaffected
b2g-v2.1 --- fixed
b2g-v2.2 --- fixed
fennec 33+ ---

People

(Reporter: kbrosnan, Assigned: jchen)

References

Details

(Keywords: crash, topcrash-android-armv7)

Crash Data

Attachments

(2 files, 1 obsolete file)

[Tracking Requested - why for this release]: topcrash regression

This bug was filed from the Socorro interface and is 
report bp-cdde3447-cb4b-4dd3-aad0-11edc2140910.
=============================================================

Samsung GT-I9500 (Galaxy S4) is responsible for over half the crashes for this bug.
Assignee: nobody → milan
tracking-fennec: ? → 33+
Milan, could you help us here? thanks
Flags: needinfo?(milan)
Attached patch patch (obsolete) — Splinter Review
Assignee: milan → jwatt
Attachment #8493105 - Flags: review?(milan)
Attachment #8493105 - Flags: review?(milan) → review+
Flags: needinfo?(milan)
Comment on attachment 8493105 [details] [diff] [review]
patch

Actually, this can't be the problem. I was thinking that this class was one of the classes for which we override |operator new| to be fallible, but that's not the case.
Attachment #8493105 - Attachment is obsolete: true
(I spun an opt build with debug symbols for the Nexus 5 and can confirm that we do indeed use mozalloc.h's |operator new(size_t size) MOZALLOC_THROW_BAD_ALLOC| for the |new SourceSurfaceAlignedRawData()| line.)
So we seem to have the stack:

AlignedArray<uint8_t,16>::Realloc
http://hg.mozilla.org/releases/mozilla-beta/annotate/9502b0a5c5f1/gfx/2d/Tools.h#l141
SourceSurfaceAlignedRawData::InitWithStride
http://hg.mozilla.org/releases/mozilla-beta/annotate/9502b0a5c5f1/gfx/2d/SourceSurfaceRawData.cpp#l64
Factory::CreateDataSourceSurfaceWithStride
http://hg.mozilla.org/releases/mozilla-beta/annotate/9502b0a5c5f1/gfx/2d/Factory.cpp#l718

Given comment 4, I don't see anything wrong with Factory::CreateDataSourceSurfaceWithStride. Nor do I see anything wrong with SourceSurfaceAlignedRawData::InitWithStride.

The only thing that I can see that is on the surface a bit suspicious with AlignedArray::Realloc is that we're using CheckedInt32 (instead of CheckedInt<size_t>, because that doesn't work, and because we want to keep the value below the max positive value that uint32_t can represent), so when we have:

  mStorage = new (std::nothrow) uint8_t[storageByteCount.value()];

we're coercing the value returned from storageByteCount.value() from int32_t to size_t. I don't see how that can cause this crash though. We know that the value in storageByteCount is positive, so it must be convertible to size_t without rounding or overflow. On the crashing line we then pass this value to mozalloc.h's |operator new []|, which then passes it to moz_malloc, which then passes it to malloc (size_t all the way through). The only thing that can go wrong is that the requested about of memory can't be malloc'ed and we return nullptr up the stack.

Even _if_ the value passed through to malloc was zero (it can't be), and even _if_ malloc returned non-nullptr in that case, we shouldn't crash on the line that is being blamed by the crash reports. So I'm a bit stumped here.

Jim, do you have any ideas or insights?
Flags: needinfo?(nchen)
Or you, kats?
Flags: needinfo?(bugmail.mozilla)
(In reply to Jonathan Watt [:jwatt] from comment #5)
> positive value that uint32_t can represent

(Sorry, I meant max positive value that *int32_t* can represent.
This and bug 1028802 are probably the same thing.
See Also: → 1028802
See Also: → 1067982
Updating for new post-bug 1063733 InitWithStride signature.
Crash Signature: [@ mozilla::gfx::SourceSurfaceAlignedRawData::InitWithStride(mozilla::gfx::IntSizeTyped<mozilla::gfx::UnknownUnits> const&, mozilla::gfx::SurfaceFormat, int)] → [@ mozilla::gfx::SourceSurfaceAlignedRawData::InitWithStride(mozilla::gfx::IntSizeTyped<mozilla::gfx::UnknownUnits> const&, mozilla::gfx::SurfaceFormat, int, bool)]
Summary: crash in mozilla::gfx::SourceSurfaceAlignedRawData::InitWithStride(mozilla::gfx::IntSizeTyped<mozilla::gfx::UnknownUnits> const&, mozilla::gfx::SurfaceFormat, int) → crash in mozilla::gfx::SourceSurfaceAlignedRawData::InitWithStride(mozilla::gfx::IntSizeTyped<mozilla::gfx::UnknownUnits> const&, mozilla::gfx::SurfaceFormat, int, bool)
Duplicate of this bug: 1067982
Looks like the crash happens inside the new statement at [1], which consists of calling moz_malloc to allocate memory [2] and then initializing the memory to zero. The problem is the initialization step in the generated code doesn't check for nullptr result, and we crash trying to initialize a block of memory starting at 0x00000000. I'm not sure if this is a compiler bug or something with our code. I'm not even sure the memory should be initialized to zero [3]. The same code in my local build using GCC 4.9 doesn't do initialization.

[1] http://mxr.mozilla.org/mozilla-central/source/gfx/2d/Tools.h#157
[2] http://mxr.mozilla.org/mozilla-central/source/memory/mozalloc/mozalloc.h?rev=8252eae8278c#216
[3] http://stackoverflow.com/a/7546745
Flags: needinfo?(nchen)
I wouldn't get *too* caught up in the details here wrt zeroing. If moz_malloc() indeed failed it means we're gonna crash soon anyway, because we aren't hardened against that.

If I had to guess, we are asking for a crazy amount of memory. That has happened before with these sorts of allocations in gfx.
jim and snorp seem to be on top of this. I took a quick look as well but nothing jumped out at me.
Flags: needinfo?(bugmail.mozilla)
Jim: My understanding is that the code should not be zeroing. The fact that more recent versions of GCC do not zero would seem to indicate that the older GCC (4.7.0 apparently) that we use on the build machines either had intended behavior that was later changed, or else was buggy.

I guess we could be setting some defines/compiler flags somewhere to flip on zeroing, but the fact that the crash also exists in Nightly builds seems to make that unlikely:

https://crash-stats.mozilla.com/topcrasher/products/FennecAndroid/versions/35.0a1

Do you get the zeroing in the generated code for that file if you compile it with 4.7?

Assuming the compiler is the issue, where do we go from here? Besides causing this crash, presumably this unrequested zeroing of arrays is a perf issue. But who knows what bugs the zeroing may be hiding elsewhere, making updating the compiler for beta at this stage somewhat risky.
(In reply to James Willcox (:snorp) (jwillcox@mozilla.com) from comment #12)
> I wouldn't get *too* caught up in the details here wrt zeroing. If
> moz_malloc() indeed failed it means we're gonna crash soon anyway, because
> we aren't hardened against that.

Not necessarily if this is a big allocation, which it probably is. Bug 1067998 for example was really easy to repro, but after fixing it I can't get fennec to OOM even after opening loads of tabs containing the crashing page.
Indeed, it looks we can recover from this if we fix the zeroing issue.
Jim, I know you were looking further into the zeroing issue. What's the status there?
Flags: needinfo?(nchen)
So to summarize the issue, when you allocate an array of built-in type in C++, you can tell the compiler to initialize the memory to zero by using,
> int* mem = new int[10]();
Or you can tell it to *not* initialize by using,
> int* mem = new int[10];
However, the version of GCC used by the Android buildbots (4.7) apparently treats the two the same way -- the memory is always initialized to zero.

That by itself should not cause this crash; it's just potentially inefficient. The crash is caused by the fact that the Android NDK has a separate bug in its headers. Its headers declare the nothrow placement new operator as,
> void* operator new[](size_t size, const std::nothrow_t&)
Whereas the correct declaration is
> void* operator new[](size_t size, const std::nothrow_t&) throw()
The 'throw()' part is important, because the point of using nothrow is to say that if the allocation fails, instead of throwing an exception (or crashing), NULL is returned.

When you bring zero-initialization into the equation, you can see that when the compiler is doing zero-initialization, it must check for NULL when using nothrow. GCC only adds NULL-checking to its generated initialization code if we specify 'throw()' as part of the new operator declaration. Because the NDK headers don't specify that, GCC erroneously assumes the nothrow placement new operator will never return NULL, and when it does return NULL in real life, the code tries to initialize the NULL memory block to zero, causing this crash.
Flags: needinfo?(nchen)
This patch overrides the NDK <new> header with our own copy. Our copy correctly specifies "throw()" and makes GCC correctly perform null-checks.
Attachment #8494703 - Flags: review?(mh+mozilla)
Making sure allocation/deallocation all match.
Attachment #8494743 - Flags: review?(jwatt)
Actually, either patch should fix this crash.
Nice work, Jim! :)
Assignee: jwatt → nchen
Comment on attachment 8494743 [details] [diff] [review]
Make sure calloc/malloc/free usages match in Tools.h (v1)

(In reply to Jim Chen [:jchen :nchen] from comment #21)
> Actually, either patch should fix this crash.

Seems like we should definitely take the throw() patch on branches since, although this patch fixes this crash, there's a good chance there are others that only the throw() patch will fix.
Attachment #8494743 - Flags: review?(jwatt) → review+
Jim, can you land that asap so that we can have it in aurora & beta soon? Thanks
Flags: needinfo?(nchen)
I pushed one of the patches since the two are separate and either one fixes the bug,

https://hg.mozilla.org/integration/mozilla-inbound/rev/1af55e86f417

This patch also has lower risk for uplifting.
Status: NEW → ASSIGNED
Flags: needinfo?(nchen)
Keywords: leave-open
Comment on attachment 8494743 [details] [diff] [review]
Make sure calloc/malloc/free usages match in Tools.h (v1)

Approval Request Comment

[Feature/regressing bug #]: N/A

[User impact if declined]: Top crash on Android

[Describe test coverage new/current, TBPL]: Locally, m-c

[Risks and why]: Very small; no change in functionality

[String/UUID change made/needed]: None
Attachment #8494743 - Flags: approval-mozilla-beta?
Attachment #8494743 - Flags: approval-mozilla-aurora?
Comment on attachment 8494703 [details] [diff] [review]
Always specify throw() for nothrow placement new/delete (v1)

Review of attachment 8494703 [details] [diff] [review]:
-----------------------------------------------------------------

Please add a note in build/stlport/README.mozilla that overrides/ is a mozilla thing.

::: build/stlport/overrides/new
@@ +27,5 @@
> + * SUCH DAMAGE.
> + */
> +/*
> + * This header is taken from $ndk/sources/cxx-stl/system/include/new,
> + * and it fixes a bug in the NDK header where the nothrow versions of

Did you report this upstream? They sure should fix it.
Attachment #8494703 - Flags: review?(mh+mozilla) → review+
Attachment #8494743 - Flags: approval-mozilla-beta?
Attachment #8494743 - Flags: approval-mozilla-beta+
Attachment #8494743 - Flags: approval-mozilla-aurora?
Attachment #8494743 - Flags: approval-mozilla-aurora+
Embedlite tests stopped compiling after these changes 
In file included from ../../../dist/include/mozilla/gfx/Rect.h:12:0,
                 from ../../../dist/include/Units.h:12,
                 from ../../../dist/include/InputData.h:12,
                 from ../../../dist/include/mozilla/embedlite/EmbedLiteView.h:13,
/mozilla-central/embedding/embedlite/tests/embedLiteViewInitTest.cpp:8:
../../../dist/include/mozilla/gfx/Tools.h: In member function 'void mozilla::gfx::AlignedArray<T, alignment>::Dealloc()':
../../../dist/include/mozilla/gfx/Tools.h:136:22: error: there are no arguments to 'moz_free' that depend on a template parameter, so a declaration of 'moz_free' must be available [-fpermissive]
../../../dist/include/mozilla/gfx/Tools.h:136:22: note: (if you use '-fpermissive', G++ will accept your code, but allowing the use of an undeclared name is deprecated)
../../../dist/include/mozilla/gfx/Tools.h: In member function 'void mozilla::gfx::AlignedArray<T, alignment>::Realloc(size_t, bool)':
../../../dist/include/mozilla/gfx/Tools.h:143:22: error: there are no arguments to 'moz_free' that depend on a template parameter, so a declaration of 'moz_free' must be available [-fpermissive]
../../../dist/include/mozilla/gfx/Tools.h:157:79: error: there are no arguments to 'moz_calloc' that depend on a template parameter, so a declaration of 'moz_calloc' must be available [-fpermissive]
../../../dist/include/mozilla/gfx/Tools.h:159:76: error: there are no arguments to 'moz_malloc' that depend on a template parameter, so a declaration of 'moz_malloc' must be available [-fpermissive]
make[1]: *** [embedLiteViewInitTest.o] Error 1


Adding #include "mozilla/mozalloc.h" did not help to resolve the issue.
Any ideas what could be wrong ?
here is our moz.build https://github.com/tmeshkova/gecko-dev/blob/embedlite/embedding/embedlite/tests/moz.build
Does this need to be open still? Also, does this affect B2G 2.0 (b2g32)?
status-b2g-v2.0: --- → ?
Flags: needinfo?(nchen)
(In reply to Oleg Romashin (:romaxa) from comment #30)
> Embedlite tests stopped compiling after these changes 
> 
> Adding #include "mozilla/mozalloc.h" did not help to resolve the issue.
> Any ideas what could be wrong ?
> here is our moz.build
> https://github.com/tmeshkova/gecko-dev/blob/embedlite/embedding/embedlite/
> tests/moz.build

Hmm I have no idea. I guess you can print out the preprocessed source (replace -c in the GCC command line with -E) and see why the functions are not included. Maybe try declaring the functions locally as a sanity check?
Keywords: leave-open
(In reply to Ryan VanderMeulen [:RyanVM UTC-4] from comment #32)
> Does this need to be open still? Also, does this affect B2G 2.0 (b2g32)?

Guess not. :mwu said B2G is not affected.
Flags: needinfo?(nchen)
Status: ASSIGNED → RESOLVED
Closed: 6 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla35
I'm seeing 0 reports of new crashes in the last week with Fennec 35.0a1, 34.0a2, or 33.0b8. Marking this bug verified fixed based on crashstats.
You need to log in before you can comment on or make changes to this bug.