Closed Bug 1069369 Opened 10 years ago Closed 10 years ago

Android crash in mozilla::image::RasterImage::DecodeSomeData

Categories

(Core :: Graphics: ImageLib, defect)

All
Android
defect
Not set
critical

Tracking

()

VERIFIED FIXED
Tracking Status
firefox34 --- unaffected
firefox35 + verified
firefox-esr31 - ---
fennec 35+ ---

People

(Reporter: kairo, Assigned: seth)

References

()

Details

(4 keywords)

Crash Data

Attachments

(1 file, 1 obsolete file)

This bug was filed from the Socorro interface and is 
report bp-82d5c5d9-cfdb-45ce-b159-e5d7a2140916.
=============================================================

Top frames:
0 	libxul.so 	mozilla::image::RasterImage::DecodeSomeData(unsigned int, mozilla::image::DecodeStrategy) 	image/src/RasterImage.cpp
1 	libxul.so 	mozilla::image::RasterImage::DecodePool::DecodeSomeOfImage(mozilla::image::RasterImage*, mozilla::image::DecodeStrategy, mozilla::image::RasterImage::DecodePool::DecodeType, unsigned int) 	image/src/RasterImage.cpp
2 	libxul.so 	mozilla::image::RasterImage::DecodePool::DecodeJob::Run() 	image/src/RasterImage.cpp
3 	libxul.so 	nsThreadPool::Run() 	xpcom/threads/nsThreadPool.cpp
4 	libxul.so 	nsThread::ProcessNextEvent(bool, bool*) 	xpcom/threads/nsThread.cpp
5 	libxul.so 	NS_ProcessNextEvent(nsIThread*, bool) 	xpcom/glue/nsThreadUtils.cpp
[...]

This started spiking on 35 Nightly for Android on 2014-09-16 and is now by far the #1 crash on Android Nightly.

All those crashes are SIGSEGV with an address of 0x24.

We also have crashes like bp-3279a2a4-94a0-4e58-aafe-807512140916 in mozilla::image::RasterImage::DecodePool::RequestDecode with an address of 0xc that spiked along with this and is the #2 crash right now, so given that and it also being in RasterImage, it sounds closely related.
[Tracking Requested - why for this release]: topcrash on 35 nightly
tracking-fennec: --- → ?
This does not meet ESR tracking/landing criteria.
This is the #1 top crash in Fennec Nightly 35, accounting for 43.83% of our crashes in the last week.
(In reply to Anthony Hughes, QA Mentor (:ashughes) from comment #3)
> This is the #1 top crash in Fennec Nightly 35, accounting for 43.83% of our
> crashes in the last week.

Actually, I forgot to combine this with the other signature. In combination this is ~55% of our Fennec 35 crashes!
Looking at the crash reports, the first build reported was 20140916030204. I'm flagging for QAWANTED to see if we can find a reproducible case and a better regression range.
I haven't yet tried to reproduce this but here is the regression range based on crash-stats:
https://hg.mozilla.org/mozilla-central/pushloghtml?fromchange=bf5fcc0c4b27&tochange=3b7921328fc1

There are two bugs in this range which touch RasterImage:
Bug 1060200 - Store HQ scaled frames in the SurfaceCache
Bug 1057903 - Refactor RasterImage methods to use DrawableFrameRef's and generally clean up use of imgFrame

CCing Seth Fowler since he's assigned to both bugs.
Using a popular URL from the crash-list, I can reproduce this on my Nexus 5 (Android 4.4.4), Nightly (09/22)

URL: http://www.nexus-wallpaper.com/wallpaper/abstract/galaxy-starry-night/

Steps, visit aforementioned URL, scroll down.
Aaron, are these steps reproducible for you? I'm trying this on my Nexus 5 (as well as some of the top URLs) and have been unable to reproduce a crash.
Yes, I get a mozalloc_abort sigsegv in adb| mozalloc_abort: [12414] ###!!! ABORT: Asked to discard with open decoder!: '!mDecoder', file /builds/slave/m-cen-and-d-000000000000000000/build/image/src/RasterImage.cpp, line 1892 with a corrupt stack unfortunately
Aaron, would you be able to narrow down the regression window further?
I get the same inbound merge to m-c mentioned above 09/15->09/16 with the aforementioned two bugs posted. My guess is bug 1060200 in particular.
100% reproducible on my Samsung Galaxy SII just after page-load on http://neowin.net

https://crash-stats.mozilla.com/report/index/d99bc816-ba37-4ec1-81ca-a37fb2140923

Last good revision: 3b7921328fc1 (2014-09-16)
First bad revision: d2c01d77b9d0 (2014-09-17)

Pushlog:
https://hg.mozilla.org/mozilla-central/pushloghtml?fromchange=3b7921328fc1&tochange=d2c01d77b9d0
(In reply to Aaron Train [:aaronmt] from comment #12)
> Last good revision: 3b7921328fc1 (2014-09-16)
> First bad revision: d2c01d77b9d0 (2014-09-17)

This would seem to contradict the previous information that this first showed up in Nightly on Sept 16. Can you please confirm, Aaron?
It is what I posted as a followup.
[Tracking Requested - why for this release]:

This is now accounting for 60.20% of our crashes on Fennec Nightly.

Seth, since you were involved with bug 1060200, can you please look at this?
Blocks: 1060200
Flags: needinfo?(seth)
FWIW, looking at the historical crashes/ADI trend, Nightly was hovering around 10 C/100 ADI until Sept 16 where it spiked to 25 C/100 ADI. Over the days that followed, instability steadily climbed to ~42 C/100ADI where it seems to have plateaued.
I'm a little confused here. The evidence sounds like it points to bug 1057903. Why is this blocking bug 1060200?
Blocks: 1057903
No longer blocks: 1060200
Flags: needinfo?(seth)
If it helps, I crash 100% (so far) just loading http://junelakeloop.org/ on my HTC One (M8) w/Fennec -- here's the crash report:

https://crash-stats.mozilla.com/report/index/f2e3d4df-01ba-4da5-8244-811ef2140924
(In reply to Seth Fowler [:seth] from comment #17)
> I'm a little confused here. The evidence sounds like it points to bug
> 1057903. Why is this blocking bug 1060200?

Seth, sounds like we need to back out bug 1057903 unless you have a fix in hand
Assignee: nobody → seth
Flags: needinfo?(seth)
The first bad revision is:
changeset:   205275:606aef8e8c16
user:        Seth Fowler <seth@mozilla.com>
date:        Fri Sep 12 18:29:27 2014 -0700
summary:     Bug 1057903 - Refactor RasterImage to use DrawableFrameRef and generally clean up. r=tn
(In reply to Brad Lassey [:blassey] (use needinfo?) from comment #19)
> Seth, sounds like we need to back out bug 1057903 unless you have a fix in
> hand

No, I don't think we should do that yet.

So far I haven't been able to get this failure to happen under gdb, and I haven't been able to reliably reproduce it on try. I've been forced instead to focus on pushing speculative fixes. Bug 1070426 (which hasn't hit m-c yet, although it has been pushed) and bug 1069652 are both possible fixes for this issue.

I plan to continue investigating possible speculative fixes, and I'm hoping that newer STR like comment 18 will make it possible for me to reproduce the problem locally.

Since I cannot reproduce this crash locally right now (at least with the older STR), backing out would be extremely counterproductive. I have no way of knowing if my fixes are working if we back out. And we absolutely need this patch - it is part of a series that is critical for B2G. So we won't be able to just back out the patch and forget it; I'll have to land it again, and we'll be in the same situation.

If a couple more days go by and there's still no fix, a backout may well be warranted. But I really don't want to lose the only signal I have right now that could help me fix this problem.
Flags: needinfo?(seth)
(In reply to Stephen Donner [:stephend] from comment #18)
> https://crash-stats.mozilla.com/report/index/f2e3d4df-01ba-4da5-8244-
> 811ef2140924

Thanks for this, Stephen!

I think I have an idea of what's going on now. I'm putting together a patch.
OK, there's another speculative fix in bug 1072539. (I'm not adding it as a blocker until I can get some verification from others that it fixes the problem for them.)
So all the speculative fixes mentioned so far are pushed now. I'm hoping that we should see this crash signature take a sharp dive as soon as the next Nightly gets spun.
I can repro this reliably on Nightly by scrolling on http://www.tiltbrush.com
tracking-fennec: ? → 35+
(In reply to James Willcox (:snorp) (jwillcox@mozilla.com) from comment #25)
> I can repro this reliably on Nightly by scrolling on http://www.tiltbrush.com

Thanks for the additional testcase. Note that this wasn't tested against a build that included the fixes, though.
So today I got access to a device and tried the STR in comment 25 with the patches applied. Based upon those tests, I think this is now fixed. I'd love verification from others, though.

Note that unfortunately all the patches won't be in Nightly until tonight.
I believe current m-c (9e3d649b80a2) has all your patches, but it's still asserting. From a quick look in the debugger, you have the same bug in multiple places.

Program received signal SIGSEGV, Segmentation fault.
[Switching to Thread 5671]
0x59bca5d4 in mozalloc_abort (msg=
    0x5cefd284 "[5649] ###!!! ABORT: Asked to discard with open decoder!: '!mDecoder', file /home/morbo/hg/mozilla-central/image/src/RasterImage.cpp, line 1859")
    at /home/morbo/hg/mozilla-central/memory/mozalloc/mozalloc_abort.cpp:37
37          MOZ_CRASH();
(gdb) bt
#0  0x59bca5d4 in mozalloc_abort (msg=
    0x5cefd284 "[5649] ###!!! ABORT: Asked to discard with open decoder!: '!mDecoder', file /home/morbo/hg/mozilla-central/image/src/RasterImage.cpp, line 1859")
    at /home/morbo/hg/mozilla-central/memory/mozalloc/mozalloc_abort.cpp:37
#1  0x63ac9aae in Abort (aMsg=aMsg@entry=
    0x5cefd284 "[5649] ###!!! ABORT: Asked to discard with open decoder!: '!mDecoder', file /home/morbo/hg/mozilla-central/image/src/RasterImage.cpp, line 1859")
    at /home/morbo/hg/mozilla-central/xpcom/base/nsDebugImpl.cpp:479
#2  0x63ac9da6 in NS_DebugBreak (aSeverity=<optimized out>, aStr=
    0x663a76e0 "Asked to discard with open decoder!", aExpr=0x663a7704 "!mDecoder", aFile=
    0x663a6a28 "/home/morbo/hg/mozilla-central/image/src/RasterImage.cpp", aLine=1859)
    at /home/morbo/hg/mozilla-central/xpcom/base/nsDebugImpl.cpp:436
#3  0x643ee7f4 in mozilla::image::RasterImage::Discard (this=this@entry=0x6f924020, force=force@entry=true)
    at /home/morbo/hg/mozilla-central/image/src/RasterImage.cpp:1859
#4  0x643f48be in ForceDiscard (this=0x6f924020)
    at /home/morbo/hg/mozilla-central/image/src/RasterImage.h:186
#5  mozilla::image::RasterImage::GetCurrentImage (this=this@entry=0x6f924020)
    at /home/morbo/hg/mozilla-central/image/src/RasterImage.cpp:905
#6  0x643f4c7c in mozilla::image::RasterImage::GetImageContainer (this=0x6f924020, 
    aManager=<optimized out>, _retval=0x5cefd744)
    at /home/morbo/hg/mozilla-central/image/src/RasterImage.cpp:951
#7  0x651f434e in nsDisplayImage::GetContainer (this=<optimized out>, aManager=0x73c34b80, 
    aBuilder=<optimized out>) at /home/morbo/hg/mozilla-central/layout/generic/nsImageFrame.cpp:1343
#8  0x6513625a in mozilla::ThebesLayerData::CanOptimizeImageLayer (this=this@entry=0x73399790, 
    aBuilder=<optimized out>) at /home/morbo/hg/mozilla-central/layout/base/FrameLayerBuilder.cpp:1926
#9  0x6514ab2a in mozilla::ContainerState::PopThebesLayerData (this=this@entry=0x5cefdb3c)
    at /home/morbo/hg/mozilla-central/layout/base/FrameLayerBuilder.cpp:2094
#10 0x6514c9ac in mozilla::ContainerState::Finish (this=this@entry=0x5cefdb3c, 
    aTextContentFlags=aTextContentFlags@entry=0x5cefdab4, aData=aData@entry=0x6e297100, 
    aContainerPixelBounds=..., aChildItems=aChildItems@entry=0x72d4b998, aHasComponentAlphaChildren=
    @0x5cefdaaf: false) at /home/morbo/hg/mozilla-central/layout/base/FrameLayerBuilder.cpp:3623
#11 0x6514d5e8 in mozilla::FrameLayerBuilder::BuildContainerLayerFor (this=0x6e7cf970, aBuilder=
    0x5cefee68, aManager=0x73c34b80, aContainerFrame=0x70395b98, aContainerItem=0x72d4b968, aChildren=
    0x72d4b998, aParameters=..., aTransform=0x0, aFlags=0)
    at /home/morbo/hg/mozilla-central/layout/base/FrameLayerBuilder.cpp:4006
#12 0x65180d78 in nsDisplayOpacity::BuildLayer (this=0x72d4b968, aBuilder=0x5cefee68, aManager=0x73c34b80, 
    aContainerParameters=...) at /home/morbo/hg/mozilla-central/layout/base/nsDisplayList.cpp:3385
#13 0x651411c2 in mozilla::FrameLayerBuilder::AddThebesDisplayItem (this=0x6e7b5bf0, 
    aLayerData=aLayerData@entry=0x73399640, aItem=aItem@entry=0x72d4b968, aClip=..., aItemVisibleRect=..., 
    aContainerState=..., aLayerState=aLayerState@entry=mozilla::LAYER_INACTIVE, aTopLeft=...)
    at /home/morbo/hg/mozilla-central/layout/base/FrameLayerBuilder.cpp:3260
#14 0x6514c7c2 in mozilla::ContainerState::ProcessDisplayItems (this=this@entry=0x5cefe23c, 
    aList=aList@entry=0x730d6550) at /home/morbo/hg/mozilla-central/layout/base/FrameLayerBuilder.cpp:3044
#15 0x6514d5aa in mozilla::FrameLayerBuilder::BuildContainerLayerFor (this=0x6e7b5bf0, aBuilder=
    0x5cefee68, aManager=0x6f71bc00, aContainerFrame=0x6e75c2b8, aContainerItem=0x730d6520, aChildren=
    0x730d6550, aParameters=..., aTransform=0x0, aFlags=0)
    at /home/morbo/hg/mozilla-central/layout/base/FrameLayerBuilder.cpp:3998
#16 0x65169e6e in nsDisplayOwnLayer::BuildLayer (this=this@entry=0x730d6520, aBuilder=aBuilder@entry=
    0x5cefee68, aManager=aManager@entry=0x6f71bc00, aContainerParameters=...)
    at /home/morbo/hg/mozilla-central/layout/base/nsDisplayList.cpp:3663
#17 0x6519d95c in nsDisplaySubDocument::BuildLayer (this=this@entry=0x730d6520, aBuilder=aBuilder@entry=
    0x5cefee68, aManager=aManager@entry=0x6f71bc00, aContainerParameters=...)
    at /home/morbo/hg/mozilla-central/layout/base/nsDisplayList.cpp:3705
#18 0x6519d9e4 in nsDisplayResolution::BuildLayer (this=0x730d6520, aBuilder=0x5cefee68, aManager=
    0x6f71bc00, aContainerParameters=...)
---Type <return> to continue, or q <return> to quit---
    at /home/morbo/hg/mozilla-central/layout/base/nsDisplayList.cpp:3839
#19 0x6514c248 in mozilla::ContainerState::ProcessDisplayItems (this=this@entry=0x5cefe88c, 
    aList=aList@entry=0x5cefedb4) at /home/morbo/hg/mozilla-central/layout/base/FrameLayerBuilder.cpp:2891
#20 0x6514d5aa in mozilla::FrameLayerBuilder::BuildContainerLayerFor (this=0x6e7b5bf0, aBuilder=
    0x5cefee68, aManager=0x6f71bc00, aContainerFrame=0x6e92f2b8, aContainerItem=0x0, aChildren=0x5cefedb4, 
    aParameters=..., aTransform=0x0, aFlags=0)
    at /home/morbo/hg/mozilla-central/layout/base/FrameLayerBuilder.cpp:3998
#21 0x651a0946 in nsDisplayList::PaintForFrame (this=this@entry=0x5cefedb4, aBuilder=aBuilder@entry=
    0x5cefee68, aCtx=aCtx@entry=0x0, aForFrame=0x6e92f2b8, aFlags=aFlags@entry=13)
    at /home/morbo/hg/mozilla-central/layout/base/nsDisplayList.cpp:1316
#22 0x651a0f8e in nsDisplayList::PaintRoot (this=this@entry=0x5cefedb4, aBuilder=aBuilder@entry=
    0x5cefee68, aCtx=aCtx@entry=0x0, aFlags=13)
    at /home/morbo/hg/mozilla-central/layout/base/nsDisplayList.cpp:1234
#23 0x651a17e6 in nsLayoutUtils::PaintFrame (aRenderingContext=0x0, aFrame=0x6e92f2b8, aDirtyRegion=..., 
    aBackstop=0, aFlags=772) at /home/morbo/hg/mozilla-central/layout/base/nsLayoutUtils.cpp:3064
#24 0x65124870 in PresShell::Paint (this=0x6ac44680, aViewToPaint=0x6e9446a0, aDirtyRegion=..., aFlags=1)
    at /home/morbo/hg/mozilla-central/layout/base/nsPresShell.cpp:6231
#25 0x64c58c56 in nsViewManager::ProcessPendingUpdatesPaint (this=0x6e94d0d0, aWidget=aWidget@entry=
    0x6e94e000) at /home/morbo/hg/mozilla-central/view/nsViewManager.cpp:443
#26 0x64c58ecc in nsViewManager::ProcessPendingUpdatesForView (this=this@entry=0x6e94d0d0, 
    aView=<optimized out>, aFlushDirtyRegion=aFlushDirtyRegion@entry=true)
    at /home/morbo/hg/mozilla-central/view/nsViewManager.cpp:384
#27 0x64c58fdc in nsViewManager::ProcessPendingUpdates (this=0x6e94d0d0)
    at /home/morbo/hg/mozilla-central/view/nsViewManager.cpp:1075
#28 0x6513183e in nsRefreshDriver::Tick (this=this@entry=0x6b4aba00, aNowEpoch=<optimized out>, aNowTime=
    ...) at /home/morbo/hg/mozilla-central/layout/base/nsRefreshDriver.cpp:1341
#29 0x65133d9a in TickDriver (now=..., jsnow=1411719034126214, driver=0x6b4aba00)
    at /home/morbo/hg/mozilla-central/layout/base/nsRefreshDriver.cpp:173
#30 mozilla::RefreshDriverTimer::Tick (this=0x6b452700)
    at /home/morbo/hg/mozilla-central/layout/base/nsRefreshDriver.cpp:164
#31 0x65133e70 in mozilla::RefreshDriverTimer::TimerTick (aTimer=<optimized out>, aClosure=<optimized out>)
    at /home/morbo/hg/mozilla-central/layout/base/nsRefreshDriver.cpp:190
#32 0x63b23e22 in nsTimerImpl::Fire (this=0x6e5ead80)
    at /home/morbo/hg/mozilla-central/xpcom/threads/nsTimerImpl.cpp:618
#33 0x63b2404a in nsTimerEvent::Run (this=0x6e746190)
    at /home/morbo/hg/mozilla-central/xpcom/threads/nsTimerImpl.cpp:711
#34 0x63b2067e in nsThread::ProcessNextEvent (this=0x5ca39a10, aMayWait=<optimized out>, 
    aResult=<optimized out>) at /home/morbo/hg/mozilla-central/xpcom/threads/nsThread.cpp:823
#35 0x63b42c92 in NS_ProcessNextEvent (aThread=<optimized out>, aMayWait=<optimized out>)
    at /home/morbo/hg/mozilla-central/xpcom/glue/nsThreadUtils.cpp:265
#36 0x63dc3a10 in mozilla::ipc::MessagePump::Run (this=0x5ca01dc0, aDelegate=0x5ca5a0c0)
    at /home/morbo/hg/mozilla-central/ipc/glue/MessagePump.cpp:99
#37 0x63d9b650 in MessageLoop::RunInternal (this=this@entry=0x5ca5a0c0)
    at /home/morbo/hg/mozilla-central/ipc/chromium/src/base/message_loop.cc:230
#38 0x63d9b66a in MessageLoop::RunHandler (this=this@entry=0x5ca5a0c0)
    at /home/morbo/hg/mozilla-central/ipc/chromium/src/base/message_loop.cc:223
#39 0x63d9b818 in MessageLoop::Run (this=0x5ca5a0c0)
    at /home/morbo/hg/mozilla-central/ipc/chromium/src/base/message_loop.cc:197
#40 0x64c5f16e in nsBaseAppShell::Run (this=0x5caf1f40)
    at /home/morbo/hg/mozilla-central/widget/xpwidgets/nsBaseAppShell.cpp:164
#41 0x6558f910 in nsAppStartup::Run (this=0x5caf6bb0)
    at /home/morbo/hg/mozilla-central/toolkit/components/startup/nsAppStartup.cpp:280
#42 0x655da3f4 in XREMain::XRE_mainRun (this=this@entry=0x5ceff9ec)
    at /home/morbo/hg/mozilla-central/toolkit/xre/nsAppRunner.cpp:4164
#43 0x655da6c4 in XREMain::XRE_main (this=this@entry=0x5ceff9ec, argc=argc@entry=10, argv=argv@entry=
---Type <return> to continue, or q <return> to quit---
    0x5ca2f148, aAppData=aAppData@entry=0x80b76bd0 <sAppData>)
    at /home/morbo/hg/mozilla-central/toolkit/xre/nsAppRunner.cpp:4235
#44 0x655da88a in XRE_main (argc=10, argv=0x5ca2f148, aAppData=0x80b76bd0 <sAppData>, 
    aFlags=<optimized out>) at /home/morbo/hg/mozilla-central/toolkit/xre/nsAppRunner.cpp:4449
#45 0x655e102e in GeckoStart (data=0x5ca39350, appData=0x80b76bd0 <sAppData>)
    at /home/morbo/hg/mozilla-central/toolkit/xre/nsAndroidStartup.cpp:73
#46 0x80b3b164 in Java_org_mozilla_gecko_mozglue_GeckoLoader_nativeRun (jenv=0xf3508, jc=<optimized out>, 
    jargs=0x4091c418) at /home/morbo/hg/mozilla-central/mozglue/android/APKOpen.cpp:417
#47 0xaca11d38 in dvmPlatformInvoke ()
   from /home/morbo/git/android-gdb/moz-gdb/lib/42800C743000157/system/lib/libdvm.so
#48 0xaca41262 in dvmCallJNIMethod_general ()
   from /home/morbo/git/android-gdb/moz-gdb/lib/42800C743000157/system/lib/libdvm.so
#49 0xaca46864 in dvmResolveNativeMethod ()
   from /home/morbo/git/android-gdb/moz-gdb/lib/42800C743000157/system/lib/libdvm.so
#50 0xaca16f60 in dvmJitToInterpNoChain ()
   from /home/morbo/git/android-gdb/moz-gdb/lib/42800C743000157/system/lib/libdvm.so
#51 0xaca16f60 in dvmJitToInterpNoChain ()
   from /home/morbo/git/android-gdb/moz-gdb/lib/42800C743000157/system/lib/libdvm.so
Backtrace stopped: previous frame identical to this frame (corrupt stack?)
(gdb) up
#1  0x63ac9aae in Abort (aMsg=aMsg@entry=
    0x5cefd284 "[5649] ###!!! ABORT: Asked to discard with open decoder!: '!mDecoder', file /home/morbo/hg/mozilla-central/image/src/RasterImage.cpp, line 1859")
    at /home/morbo/hg/mozilla-central/xpcom/base/nsDebugImpl.cpp:479
479       mozalloc_abort(aMsg);
(gdb) up
#2  0x63ac9da6 in NS_DebugBreak (aSeverity=<optimized out>, aStr=
    0x663a76e0 "Asked to discard with open decoder!", aExpr=0x663a7704 "!mDecoder", aFile=
    0x663a6a28 "/home/morbo/hg/mozilla-central/image/src/RasterImage.cpp", aLine=1859)
    at /home/morbo/hg/mozilla-central/xpcom/base/nsDebugImpl.cpp:436
436           Abort(buf.buffer);
(gdb) up
#3  0x643ee7f4 in mozilla::image::RasterImage::Discard (this=this@entry=0x6f924020, force=force@entry=true)
    at /home/morbo/hg/mozilla-central/image/src/RasterImage.cpp:1859
1859      NS_ABORT_IF_FALSE(!mDecoder, "Asked to discard with open decoder!");
(gdb) up
#4  0x643f48be in ForceDiscard (this=0x6f924020)
    at /home/morbo/hg/mozilla-central/image/src/RasterImage.h:186
186       void ForceDiscard() { Discard(/* force = */ true); }
(gdb) up
#5  mozilla::image::RasterImage::GetCurrentImage (this=this@entry=0x6f924020)
    at /home/morbo/hg/mozilla-central/image/src/RasterImage.cpp:905
905         ForceDiscard();
(In reply to Seth Fowler [:seth] from comment #27)
> So today I got access to a device and tried the STR in comment 25 with the
> patches applied. Based upon those tests, I think this is now fixed. I'd love
> verification from others, though.

Were you able to reproduce the original problem?
(In reply to Seth Fowler [:seth] from comment #24)
> So all the speculative fixes mentioned so far are pushed now. I'm hoping
> that we should see this crash signature take a sharp dive as soon as the
> next Nightly gets spun.

With today's nightly build, I still crash 100% on my testcase from comment 18:

https://crash-stats.mozilla.com/report/index/3955e644-f768-4410-86b1-2d8222140926
(In reply to Aaron Train [:aaronmt] from comment #12)
> 100% reproducible on my Samsung Galaxy SII just after page-load on
> http://neowin.net

Still reproducible on Nightly (09/26).
(In reply to Gian-Carlo Pascutto [:gcp] from comment #29)
> (In reply to Seth Fowler [:seth] from comment #27)
> > So today I got access to a device and tried the STR in comment 25 with the
> > patches applied. Based upon those tests, I think this is now fixed. I'd love
> > verification from others, though.
> 
> Were you able to reproduce the original problem?

Apparently not reliably! I got it to happen on both junelakeloop.com (from comment 18) and tiltbrush.com (from comment 25) but this morning I can't make it happen at all with an unpatched build. It seems like its going to be hard for me to debug this if it happens this rarely for me.

Brad suggested this morning that we go ahead and back out and I agree; we've reached that point. I have no idea exactly what makes my device (a Samsung Galaxy Note 10.1) behave differently than the other devices mentioned here. It seems likely that differences in CPU performance, available memory, or screen resolution make this bug much harder to trigger. I'm going to see if I can get a different device and try to reproduce there.
(In reply to Gian-Carlo Pascutto [:gcp] from comment #28)
> I believe current m-c (9e3d649b80a2) has all your patches, but it's still
> asserting. From a quick look in the debugger, you have the same bug in
> multiple places.

Hmm.. I wouldn't call that the same bug; that was existing code that should have been deleted as part of bug 1057903 and the previous bugs in that series, but got overlooked. It looks like I overlooked it because it calls RequestDecodeCore() directly instead of calling WantDecodedFrames().

It's quite possible that that code is the source of the remaining problems. (And I'll double check that there is no other code that manually triggers decoding immediately following a call to GetFrame(), which is now an antipattern, but had been correct in the past.)
So after some offline discussion the conclusion we've come to is that, given that I now have a device that reliably reproduces the issue (a Galaxy SII), we're going to try one more time to fix this. Based upon the information I have so far, this patch *should* fix the problem. I am beginning the process of testing this patch on multiple devices, including the Galaxy SII, immediately after uploading it, but I wanted to get it up here quickly so Timothy can start reviewing it.

If I can't make this work, I'm going to push a backout before leaving tonight.

What this means is that we'll either have a fix in the next Nightly (I'll push this patch directly to m-c if it passes try and review), or we'll have a backout in the following Nightly. Either way, this problem will be going away.

So on to the contents of the patch:

This bug seems to have multiple signatures with one thing in common: we're discarding in a situation where we shouldn't be. I'm also concerned that we may be sync notifying inappropriately, since there seems to be potential for that as well. This patch tries to ensure that we always do the right thing with respect to both discarding and notifying after doing a frame lookup by giving the responsibility for both to LookupFrame() itself. Callers pass in the drawing flags (which allows them to specify if they want sync decoding) and a separate flag to indicate whether we should notify synchronously. LookupFrame() and WantDecodedFrames() then take responsbility for doing the right thing. I think having all the logic in one place should make this *much* easier to get right.

I renamed GetFrame() to LookupFrame() in this patch to make it easier to review. I was already going to make this change in a followup bug, but given how important the difference between imgIContainer::GetFrame() and the internal method named GetFrame() is in reviewing this, I thought it'd be wise to make the change here. I'm hoping this will be helpful.

Also, note that the extra |aShouldSyncNotify| parameter that several methods have in this patch is temporary. I plan to write a followup that adds another decode flag for this, which in my opinion makes a lot more sense. I didn't want to do that here because it would make this patch even larger, and I'm trying to make this patch the smallest thing that could work, modulo the LookupFrame rename. But be aware that I plan to remove this parameter almost immediately.
Attachment #8496197 - Flags: review?(tnikkel)
I just tested this patch. On the Galaxy SII, I was able to reliably reproduce the crash by visiting neowin.net, junelakeloop.org, or tiltbrush.com before the patch. After the patch, I can't reproduce the crash on any of those sites.

On the Galaxy Tab 10.1, I was not able to reliably reproduce the crash even before the patch, though it did happen sometimes. FWIW, I wasn't able to make the crash happen with the patch applied.

Given that the Galaxy SII reproduced the crash every time before the patch, and doesn't at all afterward, I am feeling good about this patch. I'm trying to find more people to test it on different devices in the SF office, though, to increase my level of certainty.

If anyone reading this is interested in testing, you can download a version of Firefox for Android with the patch here:

http://ftp.mozilla.org/pub/mozilla.org/firefox/try-builds/mfowler@mozilla.com-40e8e9dc338a/try-android/
Comment on attachment 8496197 [details] [diff] [review]
Remove all manual discarding during frame lookup

>+RasterImage::LookupFrame(uint32_t aFrameNum,
>   DrawableFrameRef ref = frame->DrawableRef();
>   if (!ref) {
>     // The OS threw this frame away. We need to discard and redecode.
>     MOZ_ASSERT(!mAnim, "Animated frames should be locked");
>     if (CanForciblyDiscardAndRedecode()) {
>       ForceDiscard();
>-      WantDecodedFrames();
>+      WantDecodedFrames(aFlags, aShouldSyncNotify);
>+
>+      // If we sync decoded, we might have the frame now.
>+      if (aFlags & FLAG_SYNC_DECODE) {
>+        frame = LookupFrameNoDecode(aFrameNum);
>+        ref = frame->DrawableRef();
>+      }

I think it will be quite common to be decoded here in cases besides the sync decode flag being passed. Anytime we are allowed to do some decoding (ie StartDecoding) it will probably finish for small images. So just check always, or anytime we are allowed to do work or finish already completed work?
Attachment #8496197 - Flags: review?(tnikkel) → review+
(In reply to Timothy Nikkel (:tn) from comment #37)
> I think it will be quite common to be decoded here in cases besides the sync
> decode flag being passed. Anytime we are allowed to do some decoding (ie
> StartDecoding) it will probably finish for small images. So just check
> always, or anytime we are allowed to do work or finish already completed
> work?

Yes, I think you're right. I'll make that change.

Thanks for the quick review!
So this has been tested on a couple more phones now.

The Nexus 4 is like the Galaxy Tab 10.1; the crash doesn't seem to happen there reliably, even without this patch. So that phone isn't much use for evaluating the patch.

The Galaxy Nexus *does* reproduce the crash on tiltbrush.com and neowin.net. With the patch applied, it no longer crashes on either site.

So we have at least one additional data point that this patch fixes the crash.
This is an updated version of the patch that addresses Timothy's review comments.
Attachment #8496197 - Attachment is obsolete: true
Here's a new try job. This is the final version of the patch that will get landed on central.

https://tbpl.mozilla.org/?tree=Try&rev=05527426f472

To be sure that everything's good, I'll retest on the Galaxy S2 as well.
With the above Try build, I do not crash anymore on my SII using the URL Stephen posted nor on the one I posted.
Try looks good and approval was granted, so I pushed this directly to central:

https://hg.mozilla.org/mozilla-central/rev/818f353b7aa6
Status: NEW → RESOLVED
Closed: 10 years ago
Resolution: --- → FIXED
Status: RESOLVED → VERIFIED
Volume for this crash has dropped 5% over the last few days but still remains at 40% of our crashes. Looking at the reports I don't see any instances of these signatures reported with builds following 20140926030202. I expect this to continue to drop off as more people get updates.
Depends on: 1075479
Depends on: 1083113
The patch in comment 43 broke the testcase of bug 803703.
Depends on: 1098958
It time to back this out for Aurora35.0a2 and Nightly36.0a1. 
Because, Rendering glitch (Bug 1075479, Bug 1083113 and Bug 1098958) is caused by this.
Flags: needinfo?(seth)
(In reply to Alice0775 White from comment #46)
> It time to back this out for Aurora35.0a2 and Nightly36.0a1. 
> Because, Rendering glitch (Bug 1075479, Bug 1083113 and Bug 1098958) is
> caused by this.

We won't back this out. This fixed a topcrash that will absolutely return if this gets backed out. That's more important than a rendering glitch.

We do need to fix those rendering glitches, though. Some of the patches I've already uploaded that are waiting for review may have already fixed the problem; I'll check that later today. If not, I'll get those glitches fixed separately.
Flags: needinfo?(seth)
Depends on: 1124051
No longer depends on: 1124051
Depends on: 1124615
You need to log in before you can comment on or make changes to this bug.