Closed Bug 914823 Opened 6 years ago Closed 6 years ago

crash in mozilla::gl::SharedSurface_Gralloc::~SharedSurface_Gralloc

Categories

(Core :: Graphics, defect, P1, critical)

26 Branch
Other
Gonk (Firefox OS)
defect

Tracking

()

VERIFIED FIXED
1.2 C3(Oct25)
blocking-b2g koi+
Tracking Status
firefox25 --- wontfix
firefox26 --- wontfix
firefox27 --- wontfix
firefox28 --- fixed
b2g-v1.2 --- fixed

People

(Reporter: nhirata, Assigned: bjacob)

References

(Depends on 1 open bug)

Details

(4 keywords, Whiteboard: [b2g-crash] QARegressExclude)

Crash Data

Attachments

(8 files, 4 obsolete files)

1.04 KB, patch
jgilbert
: review+
Details | Diff | Splinter Review
4.56 KB, patch
jgilbert
: review+
nical
: review+
sotaro
: feedback+
Details | Diff | Splinter Review
76.42 KB, text/plain
Details
5.12 KB, patch
Details | Diff | Splinter Review
73.91 KB, text/plain
Details
742 bytes, text/plain
Details
2.74 KB, patch
Details | Diff | Splinter Review
1.71 KB, patch
sotaro
: review+
sotaro
: feedback+
Details | Diff | Splinter Review
This bug was filed from the Socorro interface and is 
report bp-4e6f43df-cc97-436e-a74b-486172130910.
=============================================================
Crashing Thread
Frame 	Module 	Signature 	Source
0 	libxul.so 	mozilla::gl::SharedSurface_Gralloc::~SharedSurface_Gralloc 	gfx/gl/SharedSurfaceGralloc.cpp
1 	libxul.so 	mozilla::gl::SharedSurface_Gralloc::~SharedSurface_Gralloc 	gfx/gl/SharedSurfaceGralloc.cpp
2 	libxul.so 	mozilla::gfx::SurfaceStream::Delete(mozilla::gfx::SharedSurface*&) 	gfx/gl/SurfaceStream.cpp
3 	libxul.so 	mozilla::gfx::SurfaceStream::~SurfaceStream 	gfx/gl/SurfaceStream.cpp
4 	libxul.so 	mozilla::gfx::SurfaceStream_SingleBuffer::~SurfaceStream_SingleBuffer 	gfx/gl/SurfaceStream.cpp
5 	libxul.so 	mozilla::gfx::SurfaceStream_SingleBuffer::~SurfaceStream_SingleBuffer 	gfx/gl/SurfaceStream.cpp
6 	libxul.so 	mozilla::gl::GLScreenBuffer::~GLScreenBuffer 	gfx/gl/GLScreenBuffer.cpp
7 	libxul.so 	mozilla::gl::GLScreenBuffer::~GLScreenBuffer 	gfx/gl/GLScreenBuffer.cpp
8 	libxul.so 	mozilla::gl::GLContext::DestroyScreenBuffer() 	gfx/gl/GLContext.cpp
9 	libxul.so 	mozilla::gl::GLContext::MarkDestroyed() 	gfx/gl/GLContext.cpp
10 	libxul.so 	mozilla::gl::GLContextEGL::~GLContextEGL 	gfx/gl/GLContextProviderEGL.cpp
11 	libxul.so 	mozilla::gl::GLContextEGL::~GLContextEGL 	gfx/gl/GLContextProviderEGL.cpp
12 	libxul.so 	mozilla::detail::GenericRefCounted<(mozilla::detail::RefCountAtomicity)0>::Release() 	/builds/slave/b2g_m-cen_hamachi_ntly-0000000/build/objdir-gecko/gfx/gl/../../dist/include/mozilla/GenericRefCounted.h
13 	libxul.so 	mozilla::gfx::DrawTargetSkia::~DrawTargetSkia 	/builds/slave/b2g_m-cen_hamachi_ntly-0000000/build/objdir-gecko/gfx/2d/../../dist/include/mozilla/RefPtr.h
14 	libxul.so 	mozilla::gfx::DrawTargetSkia::~DrawTargetSkia 	gfx/2d/DrawTargetSkia.cpp
15 	libxul.so 	mozilla::detail::RefCounted<mozilla::gfx::DrawTarget, (mozilla::detail::RefCountAtomicity)1>::Release() const 	/builds/slave/b2g_m-cen_hamachi_ntly-0000000/build/objdir-gecko/image/src/../../dist/include/mozilla/RefPtr.h
16 	libxul.so 	mozilla::RefPtr<mozilla::gfx::DrawTarget>::operator=(mozilla::gfx::DrawTarget*) 	/builds/slave/b2g_m-cen_hamachi_ntly-0000000/build/objdir-gecko/content/canvas/src/../../../dist/include/mozilla/RefPtr.h
17 	libxul.so 	mozilla::dom::CanvasRenderingContext2D::Reset() 	content/canvas/src/CanvasRenderingContext2D.cpp
18 	libxul.so 	mozilla::dom::CanvasRenderingContext2D::~CanvasRenderingContext2D 	content/canvas/src/CanvasRenderingContext2D.cpp
19 	libxul.so 	mozilla::dom::CanvasRenderingContext2D::~CanvasRenderingContext2D 	content/canvas/src/CanvasRenderingContext2D.cpp
20 	libxul.so 	mozilla::dom::CanvasRenderingContext2D::DeleteCycleCollectable() 	content/canvas/src/CanvasRenderingContext2D.cpp
21 	libxul.so 	mozilla::dom::CanvasRenderingContext2D::cycleCollection::DeleteCycleCollectable(void*) 	content/canvas/src/CanvasRenderingContext2D.h
22 	libxul.so 	SnowWhiteKiller::~SnowWhiteKiller 	xpcom/base/nsCycleCollector.cpp
23 	libxul.so 	nsCycleCollector::FreeSnowWhite(bool) 	xpcom/base/nsCycleCollector.cpp
24 	libxul.so 	nsCycleCollector_doDeferredDeletion() 	xpcom/base/nsCycleCollector.cpp
25 	libxul.so 	AsyncFreeSnowWhite::Run() 	js/xpconnect/src/XPCJSRuntime.cpp
26 	libxul.so 	nsThread::ProcessNextEvent(bool, bool*) 	xpcom/threads/nsThread.cpp
27 	libxul.so 	NS_ProcessNextEvent(nsIThread*, bool) 	/builds/slave/b2g_m-cen_hamachi_ntly-0000000/build/objdir-gecko/xpcom/build/nsThreadUtils.cpp
28 	libxul.so 	mozilla::ipc::MessagePump::Run(base::MessagePump::Delegate*) 	ipc/glue/MessagePump.cpp
29 	libxul.so 	mozilla::ipc::MessagePumpForChildProcess::Run(base::MessagePump::Delegate*) 	ipc/glue/MessagePump.cpp
30 	libxul.so 	MessageLoop::RunInternal() 	ipc/chromium/src/base/message_loop.cc
31 	libxul.so 	MessageLoop::Run() 	ipc/chromium/src/base/message_loop.cc
32 	libxul.so 	nsBaseAppShell::Run() 	widget/xpwidgets/nsBaseAppShell.cpp
33 	libxul.so 	XRE_RunAppShell 	toolkit/xre/nsEmbedFunctions.cpp
34 	libxul.so 	mozilla::ipc::MessagePumpForChildProcess::Run(base::MessagePump::Delegate*) 	ipc/glue/MessagePump.cpp
35 	libxul.so 	MessageLoop::RunInternal() 	ipc/chromium/src/base/message_loop.cc
36 	libxul.so 	MessageLoop::Run() 	ipc/chromium/src/base/message_loop.cc
37 	libxul.so 	XRE_InitChildProcess 	toolkit/xre/nsEmbedFunctions.cpp
38 	plugin-container 	main 	ipc/app/MozillaRuntimeMain.cpp
39 	libc.so 	__libc_init 	bionic/libc/bionic/libc_init_dynamic.c
40 		@0xb00045a9 	

On ALCATEL ONE TOUCH FIRE
Build ID 	20130910040201
Release Channel 	hamachi/1.2.0/nightly

More reports : 
https://crash-stats.mozilla.com/report/list?product=B2G&signature=mozilla%3A%3Agl%3A%3ASharedSurface_Gralloc%3A%3A~SharedSurface_Gralloc
Getting this on Unagi/1.2 aswell

Gecko  http://hg.mozilla.org/mozilla-central/rev/740094c07328
Gaia  753bed59566ad14c5e032e45d2b320ef9529ca9a
BuildID 20130909195843
Version 26.0a1

The replication is in our test automation that makes a phone call from the contacts app (Contact details page) but I cannot replicate it manually.

Setting the Firefox OS flags on this.
OS: Android → Gonk (Firefox OS)
Hardware: All → Other
(In reply to Zac C (:zac) from comment #1)
> Getting this on Unagi/1.2 aswell
> 
> Gecko  http://hg.mozilla.org/mozilla-central/rev/740094c07328
> Gaia  753bed59566ad14c5e032e45d2b320ef9529ca9a
> BuildID 20130909195843
> Version 26.0a1
> 
> The replication is in our test automation that makes a phone call from the
> contacts app (Contact details page) but I cannot replicate it manually.
> 
> Setting the Firefox OS flags on this.

Is this test that's generating this crash consistently failing? Or only intermittently?

Also - Can you provide a link to the test that's failing?
I only ran it twice and got 100% replication but I can see it on our Jenkins instances too.


Link to the test:
https://github.com/mozilla/gaia-ui-tests/blob/master/gaiatest/tests/contacts/test_call_contact.py
Okay - sounds like this is reproducible then via the test in question.

Can we get a regression range of when this crash started happening?
blocking-b2g: --- → koi?
Blocks: GFXB2G1.2
The regression range is simply:
Starting from the build in comment #1  (10th Sept)
(In reply to Zac C (:zac) from comment #5)
> The regression range is simply:
> Starting from the build in comment #1  (10th Sept)

Any chance to narrow it down to a changeset?
No longer blocks: GFXB2G1.2
No longer blocks: b2g-central-dogfood
(In reply to Zac C (:zac) from comment #3)
> I only ran it twice and got 100% replication but I can see it on our Jenkins
> instances too.
> 
> 
> Link to the test:
> https://github.com/mozilla/gaia-ui-tests/blob/master/gaiatest/tests/contacts/
> test_call_contact.py

How do I run the testing? I couldn't reproduce it manually.
(In reply to Milan Sreckovic [:milan] from comment #6)
> (In reply to Zac C (:zac) from comment #5)
> > The regression range is simply:
> > Starting from the build in comment #1  (10th Sept)
> 
> Any chance to narrow it down to a changeset?

Not via automation easily. We have builds daily from release engineering, but not down to a changeset. We would need to do this via building manually and running the automation directly - which is going to take quite some time resources-wise. It's going to be easier to grab a push log and infer a potential regressing changeset.

(In reply to peter chang[:pchang] from comment #7)
> (In reply to Zac C (:zac) from comment #3)
> > I only ran it twice and got 100% replication but I can see it on our Jenkins
> > instances too.
> > 
> > 
> > Link to the test:
> > https://github.com/mozilla/gaia-ui-tests/blob/master/gaiatest/tests/contacts/
> > test_call_contact.py
> 
> How do I run the testing? I couldn't reproduce it manually.

See https://github.com/mozilla/gaia-ui-tests/blob/master/README.md.
Whiteboard: [b2g-crash]
Checking the crash stack here, I can't see any obvious changeset in the push log that caused this crash. I'm cc-ing Vlad & Jeff here for input, as it looks like they interact with a lot of source code files seen in this stack.
Looks like this is a top 5 crash on 1.2 so we should investigate.
Assignee: nobody → milan
blocking-b2g: koi? → koi+
Any crash reports since 9/18?
(In reply to Milan Sreckovic [:milan] from comment #11)
> Any crash reports since 9/18?

Yes. https://crash-stats.mozilla.com/report/index/19fbf3bd-0a9c-4ef9-977b-348d02130919.
(In reply to Jason Smith [:jsmith] from comment #12)
> (In reply to Milan Sreckovic [:milan] from comment #11)
> > Any crash reports since 9/18?
> 
> Yes.
> https://crash-stats.mozilla.com/report/index/19fbf3bd-0a9c-4ef9-977b-
> 348d02130919.

That one says: Build ID 	20130916040205

---> it's a build from 9/16.
Right - I meant to ask if there are any crashes on builds from 9/18 and later.
(In reply to Milan Sreckovic [:milan] from comment #14)
> Right - I meant to ask if there are any crashes on builds from 9/18 and
> later.

Oh. Latest build ID shown in crash stats is 20130918152049. If we need to retest, we can just see if the Gaia UI Test from https://bugzilla.mozilla.org/show_bug.cgi?id=914823#c3 is still hitting this crash.
Assign to Benoit while we're sorting out if it's still there the next few days.
Assignee: milan → bjacob
Zac - Are you still seeing this crash in Gaia UI Automation with https://github.com/mozilla/gaia-ui-tests/blob/master/gaiatest/tests/contacts/test_call_contact.py?
Flags: needinfo?(zcampbell)
https://crash-stats.mozilla.com/report/index/2bcb59ad-d67c-401b-95cc-374512130919
https://crash-stats.mozilla.com/report/index/d84c6aed-10cf-4b5e-b2fe-305602130919
along with :
https://crash-stats.mozilla.com/report/index/adf9da9c-e15d-4fc0-ad4f-d3bbd2130919

I saw this after opening and killing the usage app on the first usage.

Gaia   c932c482c6944fa32724ce7af9e5423c4c2bcccd
Gecko 2e560c5ce351
BuildID 20130919004001
Version 26.0a2
Clearing needinfo per comment 18
Flags: needinfo?(zcampbell)
Just hit this crash in 1.3 - when it crashed it said "Twitter crashed."

https://crash-stats.mozilla.com/report/index/3ae6f263-5677-4cb8-a585-37dde2130920

Gaia   78a3e2509b979bd226f01f557bdf229fd3faa42e
SourceStamp 59beb1868522
BuildID 20130920040201
Version 27.0a1
Adding the Skia folks as well.
Alright, sounds like this is still an issue, will look into this today.
I tried running the gaia-ui-tests following these instructions:

 https://github.com/mozilla/gaia-ui-tests/blob/master/README.md

but am running into a lot of trouble. I get

$ gaiatest gaiatest/tests/contacts/test_call_contact.py 
Traceback (most recent call last):
  File "/usr/local/bin/gaiatest", line 5, in <module>
    from pkg_resources import load_entry_point
  File "/usr/lib/python2.7/dist-packages/pkg_resources.py", line 2707, in <module>
    working_set.require(__requires__)
  File "/usr/lib/python2.7/dist-packages/pkg_resources.py", line 686, in require
    needed = self.resolve(parse_requirements(requirements))
  File "/usr/lib/python2.7/dist-packages/pkg_resources.py", line 586, in resolve
    if dist not in req:
  File "/usr/lib/python2.7/dist-packages/pkg_resources.py", line 2560, in __contains__
    if self.index: item = item.parsed_version  # only get if we need it
  File "/usr/lib/python2.7/dist-packages/pkg_resources.py", line 2169, in parsed_version
    self._parsed_version = pv = parse_version(self.version)
  File "/usr/lib/python2.7/dist-packages/pkg_resources.py", line 2185, in version
    "Missing 'Version:' header and/or PKG-INFO file", self
ValueError: ("Missing 'Version:' header and/or PKG-INFO file", marionette-client [unknown version] (/hack/mozilla-graphics/testing/marionette/client))

That's strange as /hack/mozilla-graphics is an old source directory that I haven't used in a while, and isn't the one that I built B2G from, so I don't understand where it's coming from. Also, I don't understand this error. And the gaia-ui-tests repo's wiki is saying that this project was merged into Gaia, but it's not very clear to me in the Gaia wiki what would be the up-to-date steps to run UI tests.
David and/or Dylan, who's a good person on Gaia that BenWa can talk to in order to sort these issues out, and be able to run the tests?
Flags: needinfo?(dscravaglieri)
Flags: needinfo?(doliver)
(In reply to Milan Sreckovic [:milan] from comment #24)
> David and/or Dylan, who's a good person on Gaia that BenWa can talk to in
> order to sort these issues out, and be able to run the tests?

This looks like a marionette setup issue potentially, so I'd ask jgriffin or mdas.
Good news: thanks to great support from mdas I got to run gaia unit tests.

Bad news: test_call_contact.py doesn't exist anymore.

What would be up-to-date STR?
Flags: needinfo?(zcampbell)
(In reply to Benoit Jacob [:bjacob] from comment #26)
> Good news: thanks to great support from mdas I got to run gaia unit tests.

Great, if you still need a Gaia contact for more context on the Comms apps, you can probably start with :etienne.
Flags: needinfo?(dscravaglieri)
Flags: needinfo?(doliver)
Whiteboard: [b2g-crash] → [b2g-crash] QARegressExclude
Alright, so, I can reproduce with some usage of the Contacts app in the B2G emulator, no need for a real device, no need for test runners --- helps a lot to get this in GDB.

Here are some very rough STR. Still not too sure; they work maybe once in 3 attempts (have to restart the emulator between attempts).

 1. Boot b2g emulator
 2. Start Contacts app
 3. Make sure you have at least one contact with a phone number (no need for actual phone network)
 4. Hit his phone number to call him
 5. Attach GDB to the Contacts app, or the Phone app, not too sure --- whichever one is doing graphics work with a 2D canvas at this point. Unfortunately, adb shell b2g-ps doesn't give process names in the emulator. But in adb logcat, notice the graphics spew about SharedSurface's (may require a debug build). Attach to the process that's generating it.
 6. Cancel the phone call, call again, over and over again a few times
 7. If that doesn't work, try killing the phone call app while it's dialing: hold the home button pressed to start the task manager, hit the 'close window' icon that appears.

Here's what the crash looks like:

Program received signal SIGSEGV, Segmentation fault.
0x4115aa6c in ~SharedSurface_Gralloc (this=0x45719400, __in_chrg=<value optimized out>)
    at /hack/mozilla-aurora/gfx/gl/SharedSurfaceGralloc.cpp:135
135         mAllocator->DestroySharedSurface(&desc);
(gdb) bt
#0  0x4115aa6c in ~SharedSurface_Gralloc (this=0x45719400, __in_chrg=<value optimized out>)
    at /hack/mozilla-aurora/gfx/gl/SharedSurfaceGralloc.cpp:135
#1  0x4115aac8 in ~SharedSurface_Gralloc (this=0x45719400, __in_chrg=<value optimized out>)
    at /hack/mozilla-aurora/gfx/gl/SharedSurfaceGralloc.cpp:136
#2  0x41172b66 in mozilla::gfx::SurfaceStream::Delete (this=0x4588ba50, surf=@0x4588ba58)
    at /hack/mozilla-aurora/gfx/gl/SurfaceStream.cpp:85
#3  0x41172c7c in ~SurfaceStream (this=0x4588ba50, __in_chrg=<value optimized out>)
    at /hack/mozilla-aurora/gfx/gl/SurfaceStream.cpp:144
#4  0x41172f32 in ~SurfaceStream_SingleBuffer (this=0x4588ba50, __in_chrg=<value optimized out>)
    at /hack/mozilla-aurora/gfx/gl/SurfaceStream.cpp:206
#5  0x41172f60 in ~SurfaceStream_SingleBuffer (this=0x4588ba50, __in_chrg=<value optimized out>)
    at /hack/mozilla-aurora/gfx/gl/SurfaceStream.cpp:206
#6  0x41169c52 in ~GLScreenBuffer (this=0x4572b400, __in_chrg=<value optimized out>)
    at /hack/mozilla-aurora/gfx/gl/GLScreenBuffer.cpp:70
#7  0x41169cb8 in ~GLScreenBuffer (this=0x4572b400, __in_chrg=<value optimized out>)
    at /hack/mozilla-aurora/gfx/gl/GLScreenBuffer.cpp:73
#8  0x411663b8 in mozilla::gl::GLContext::DestroyScreenBuffer (this=0x45756800)
    at /hack/mozilla-aurora/gfx/gl/GLContext.cpp:3414
#9  0x41163c98 in mozilla::gl::GLContext::MarkDestroyed (this=0x45756800)
    at /hack/mozilla-aurora/gfx/gl/GLContext.cpp:1793
#10 0x4115b7b8 in ~GLContextEGL (this=0x45756800, __in_chrg=<value optimized out>)
    at /hack/mozilla-aurora/gfx/gl/GLContextProviderEGL.cpp:301
#11 0x4115b8ac in ~GLContextEGL (this=0x45756800, __in_chrg=<value optimized out>)
    at /hack/mozilla-aurora/gfx/gl/GLContextProviderEGL.cpp:317
#12 0x4115e19a in mozilla::detail::GenericRefCounted<(mozilla::detail::RefCountAtomicity)0>::Release (
    this=0x45756800) at ../../dist/include/mozilla/GenericRefCounted.h:66
#13 0x412eef86 in mozilla::RefPtr<mozilla::GenericRefCountedBase>::unref (this=0x45823b20, 
    __in_chrg=<value optimized out>) at ../../dist/include/mozilla/RefPtr.h:203
#14 ~RefPtr (this=0x45823b20, __in_chrg=<value optimized out>) at ../../dist/include/mozilla/RefPtr.h:153
#15 ~DrawTargetSkia (this=0x45823b20, __in_chrg=<value optimized out>)
    at /hack/mozilla-aurora/gfx/2d/DrawTargetSkia.cpp:146
#16 0x412eefac in ~DrawTargetSkia (this=0xbef3b2b8, __in_chrg=<value optimized out>)
    at /hack/mozilla-aurora/gfx/2d/DrawTargetSkia.cpp:146
#17 0x412ed490 in mozilla::detail::RefCounted<mozilla::gfx::DrawTarget, (mozilla::detail::RefCountAtomicity)1>::Release (this=0x4430a5a0, t=0x0) at ../../dist/include/mozilla/RefPtr.h:82
#18 mozilla::RefPtr<mozilla::gfx::DrawTargetSkia>::unref (this=0x4430a5a0, t=0x0)
    at ../../dist/include/mozilla/RefPtr.h:203
#19 mozilla::RefPtr<mozilla::gfx::DrawTargetSkia>::assign (this=0x4430a5a0, t=0x0)
    at ../../dist/include/mozilla/RefPtr.h:189
#20 mozilla::RefPtr<mozilla::gfx::DrawTargetSkia>::operator= (this=0x4430a5a0, t=0x0)
    at ../../dist/include/mozilla/RefPtr.h:164
#21 0x412ed6a2 in ~SourceSurfaceSkia (this=0x4430a560, __in_chrg=<value optimized out>)
    at /hack/mozilla-aurora/gfx/2d/SourceSurfaceSkia.cpp:27
#22 0x412ed6c4 in ~SourceSurfaceSkia (this=0xbef3b2b8, __in_chrg=<value optimized out>)
    at /hack/mozilla-aurora/gfx/2d/SourceSurfaceSkia.cpp:29
#23 0x408fb124 in mozilla::detail::RefCounted<mozilla::gfx::SourceSurface, (mozilla::detail::RefCountAtomicity)1>::Release (this=<value optimized out>) at ../../../dist/include/mozilla/RefPtr.h:82
#24 0x408fe542 in mozilla::RefPtr<mozilla::gfx::SourceSurface>::unref (this=0x4502dc00)
    at ../../../dist/include/mozilla/RefPtr.h:203
---Type <return> to continue, or q <return> to quit---
#25 ~RefPtr (this=0x4502dc00) at ../../../dist/include/mozilla/RefPtr.h:153
#26 mozilla::dom::CanvasRenderingContext2D::Demote (this=0x4502dc00)
    at /hack/mozilla-aurora/content/canvas/src/CanvasRenderingContext2D.cpp:764
#27 0x408fe9ac in mozilla::dom::CanvasRenderingContext2D::DemoteOldestContextIfNecessary ()
    at /hack/mozilla-aurora/content/canvas/src/CanvasRenderingContext2D.cpp:791
#28 0x408fd230 in mozilla::dom::CanvasRenderingContext2D::EnsureTarget (this=0x45031000)
    at /hack/mozilla-aurora/content/canvas/src/CanvasRenderingContext2D.cpp:857
#29 0x408fdbb8 in mozilla::dom::CanvasRenderingContext2D::TransformWillUpdate (this=0xbef3b2b8)
    at /hack/mozilla-aurora/content/canvas/src/CanvasRenderingContext2D.cpp:1931
#30 0x408fde0a in mozilla::dom::CanvasRenderingContext2D::Scale (this=0xbef3b2b8, x=-nan(0xa9b4000000000), 
    y=0.62727272727272732, error=...)
    at /hack/mozilla-aurora/content/canvas/src/CanvasRenderingContext2D.cpp:1149
#31 0x40e913de in scale (cx=0x40238320, obj=<value optimized out>, self=0x45031000, args=...)
    at /hack/b2g/B2G/objdir-gecko/dom/bindings/CanvasRenderingContext2DBinding.cpp:778
#32 0x40e8fbfc in genericMethod (cx=0x40238320, argc=<value optimized out>, vp=<value optimized out>)
    at /hack/b2g/B2G/objdir-gecko/dom/bindings/CanvasRenderingContext2DBinding.cpp:4177
#33 0x414bdd32 in CallJSNative (cx=0x40238320, args=..., construct=<value optimized out>)
    at /hack/mozilla-aurora/js/src/jscntxtinlines.h:218
#34 Invoke (cx=0x40238320, args=..., construct=<value optimized out>)
    at /hack/mozilla-aurora/js/src/vm/Interpreter.cpp:478


Where are we right now?


(gdb) frame 0
#0  0x4115aa6c in ~SharedSurface_Gralloc (this=0x45719400, __in_chrg=<value optimized out>)
    at /hack/mozilla-aurora/gfx/gl/SharedSurfaceGralloc.cpp:135
135         mAllocator->DestroySharedSurface(&desc);
(gdb) l
130
131         mGL->MakeCurrent();
132         mGL->fDeleteTextures(1, (GLuint*)&mProdTex);
133
134         SurfaceDescriptor desc(mDesc);
135         mAllocator->DestroySharedSurface(&desc);
136     }
137
138     void
139     SharedSurface_Gralloc::Fence()
(gdb) p mAllocator
$21 = (class mozilla::layers::ISurfaceAllocator * const) 0x43eb8d98
(gdb) p &desc
$22 = (mozilla::layers::SurfaceDescriptor *) 0xbef3b2b8


Note how mAllocator is reported as a pointer of the pure virtual base type, ISurfaceAllocator. That's sort of suspiscious --- suggests a corrupted (maybe deleted, maybe just corrupted) mAllocator, which would explain the immediate crash in this stack frame.

(gdb) p/x *(uint32_t*)mAllocator@10
$25 = {0xfffa9b28, 0x93, 0x0, 0xf8260000, 0xefe, 0x0, 0xd7000000, 0x32ff, 0x127b05, 0xf9190000}

The first 4 bytes pointed to by mAllocator, which have the value 0xfffa9b28, should be the vptr, and indeed,

(gdb) p *mAllocator
$27 = {
  _vptr.ISurfaceAllocator = 0xfffa9b28
}

But 0xfffa9b28 is not a mapped address (not in /proc/201/maps). So, *mAllocator is definitely corrupted.

By looking at registers and disassembly, we can confirm that this is what is causing the crash:

(gdb) info registers
r0             0xbef3b2b8       -1091325256
r1             0x0      0
r2             0x0      0
r3             0xfffa9b40       -353472
r4             0x45823b20       1166162720
r5             0x0      0
r6             0x4430a560       1144038752
r7             0x0      0
r8             0x8c     140
r9             0x0      0
r10            0x4502dc74       1157815412
r11            0xbef3b9b8       -1091323464
r12            0x40000000       1073741824
sp             0xbef3b2b0       0xbef3b2b0
lr             0x40e2845d       1088586845
pc             0x4115aa6c       0x4115aa6c <~SharedSurface_Gralloc+92>
cpsr           0x30     48


Disassembly:

   0x4115aa3c <+44>:    ldr     r3, [sp, #4]
   0x4115aa3e <+46>:    ldr     r2, [r3, #28]
   0x4115aa40 <+48>:    ldr     r3, [sp, #4]
   0x4115aa42 <+50>:    add.w   r3, r3, #60     ; 0x3c
   0x4115aa46 <+54>:    mov     r0, r2
   0x4115aa48 <+56>:    mov.w   r1, #1
   0x4115aa4c <+60>:    mov     r2, r3
   0x4115aa4e <+62>:    bl      0x410efff8 <_ZN7mozilla2gl9GLContext15fDeleteTexturesEiPj>
   0x4115aa52 <+66>:    ldr     r3, [sp, #4]
   0x4115aa54 <+68>:    add.w   r3, r3, #40     ; 0x28
   0x4115aa58 <+72>:    add     r2, sp, #8
   0x4115aa5a <+74>:    mov     r0, r2
   0x4115aa5c <+76>:    mov     r1, r3
   0x4115aa5e <+78>:    bl      0x40e28454 <SurfaceDescriptor>
   0x4115aa62 <+82>:    ldr     r3, [sp, #4]
   0x4115aa64 <+84>:    ldr     r3, [r3, #36]   ; 0x24
   0x4115aa66 <+86>:    ldr     r3, [r3, #0]
   0x4115aa68 <+88>:    add.w   r3, r3, #24
=> 0x4115aa6c <+92>:    ldr     r3, [r3, #0]

At 0x4115aa62 we are loading the 'this' pointer, stored at sp+4, into r3:

(gdb) p this
$7 = (mozilla::gl::SharedSurface_Gralloc *) 0x45719400
(gdb) p/x *(uint32_t*)0xbef3b2b4
$10 = 0x45719400

At 0x4115aa64 we are reading a value stored 36 bytes into *this, storing it into r3:

(gdb) p/x *(uint32_t*)(0x45719400 + 36)
$13 = 0x43eb8d98
(gdb) p mAllocator
$15 = (class mozilla::layers::ISurfaceAllocator * const) 0x43eb8d98

So the value we just stored in r3 is mAllocator.

At 0x4115aa66 we are trying to dereference the vptr, storing the result again into r3.

The value that we are reading is:

(gdb) p/x *(uint32_t*)mAllocator
$24 = 0xfffa9b28

That's where things go wrong: we just read a bad vptr pointer, 0xfffa9b28.

Apparently we want to call the virtual method at offset 24 into the vtable:

At 0x4115aa68 we are adding 24 to r3, so we get 0xfffa9b40, the value of r3 at the time of the crash. Clearly, that method is ISurfaceAllocator::DestroySharedSurface.

The crash occurs at 0x4115aa6c as we try to dereference that to call DestroySharedSurface.

The question is why does the ISurfaceAllocator get corrupted? Does it get deleted too early, or does it get stomped over?
Flags: needinfo?(zcampbell)
Got a stack to where we are destroying that ISurfaceAllocator, which turns out to be a ClientLayerManager!

(gdb) bt
#0  0x411f1b32 in ~ClientLayerManager (this=0x4405e500, __in_chrg=<value optimized out>)
    at /hack/mozilla-aurora/gfx/layers/client/ClientLayerManager.cpp:55
#1  0x411f1bb0 in ~ClientLayerManager (this=0x4405e500, __in_chrg=<value optimized out>)
    at /hack/mozilla-aurora/gfx/layers/client/ClientLayerManager.cpp:60
#2  0x408401e4 in mozilla::layers::LayerManager::Release (this=0x1f) at ../../dist/include/Layers.h:163
#3  0x41259760 in ~nsRefPtr (this=0x446e8378, __in_chrg=<value optimized out>) at ../../dist/include/nsAutoPtr.h:887
#4  0x41259846 in ~SurfaceFactory_Gralloc (this=0x446e8300, __in_chrg=<value optimized out>)
    at /hack/mozilla-aurora/gfx/gl/SharedSurfaceGralloc.h:96
#5  0x4125987c in ~SurfaceFactory_Gralloc (this=0x446e8300, __in_chrg=<value optimized out>)
    at /hack/mozilla-aurora/gfx/gl/SharedSurfaceGralloc.h:96
#6  0x41268786 in ~GLScreenBuffer (this=0x43a47b00, __in_chrg=<value optimized out>)
    at /hack/mozilla-aurora/gfx/gl/GLScreenBuffer.cpp:69
#7  0x41268808 in ~GLScreenBuffer (this=0x43a47b00, __in_chrg=<value optimized out>)
    at /hack/mozilla-aurora/gfx/gl/GLScreenBuffer.cpp:73
#8  0x41264f08 in mozilla::gl::GLContext::DestroyScreenBuffer (this=0x44931400) at /hack/mozilla-aurora/gfx/gl/GLContext.cpp:3414
#9  0x412627f8 in mozilla::gl::GLContext::MarkDestroyed (this=0x44931400) at /hack/mozilla-aurora/gfx/gl/GLContext.cpp:1793
#10 0x4125a31c in ~GLContextEGL (this=0x44931400, __in_chrg=<value optimized out>)
    at /hack/mozilla-aurora/gfx/gl/GLContextProviderEGL.cpp:301
#11 0x4125a410 in ~GLContextEGL (this=0x44931400, __in_chrg=<value optimized out>)
    at /hack/mozilla-aurora/gfx/gl/GLContextProviderEGL.cpp:317
#12 0x4125ccfa in mozilla::detail::GenericRefCounted<(mozilla::detail::RefCountAtomicity)0>::Release (this=0x44931400)
    at ../../dist/include/mozilla/GenericRefCounted.h:66
#13 0x413ec592 in mozilla::RefPtr<mozilla::GenericRefCountedBase>::unref (this=0x449ec820, __in_chrg=<value optimized out>)
    at ../../dist/include/mozilla/RefPtr.h:203
#14 ~RefPtr (this=0x449ec820, __in_chrg=<value optimized out>) at ../../dist/include/mozilla/RefPtr.h:153
#15 ~DrawTargetSkia (this=0x449ec820, __in_chrg=<value optimized out>) at /hack/mozilla-aurora/gfx/2d/DrawTargetSkia.cpp:146
#16 0x413ec5b8 in ~DrawTargetSkia (this=0x1f, __in_chrg=<value optimized out>) at /hack/mozilla-aurora/gfx/2d/DrawTargetSkia.cpp:146
#17 0x4081657c in mozilla::detail::RefCounted<mozilla::gfx::DrawTarget, (mozilla::detail::RefCountAtomicity)1>::Release (
    this=<value optimized out>) at ../../dist/include/mozilla/RefPtr.h:82
#18 0x409fa7b8 in mozilla::RefPtr<mozilla::gfx::DrawTarget>::unref (this=0x453e5c30, t=0x41b57b58)
    at ../../../dist/include/mozilla/RefPtr.h:203
#19 mozilla::RefPtr<mozilla::gfx::DrawTarget>::assign (this=0x453e5c30, t=0x41b57b58) at ../../../dist/include/mozilla/RefPtr.h:189
#20 mozilla::RefPtr<mozilla::gfx::DrawTarget>::operator= (this=0x453e5c30, t=0x41b57b58) at ../../../dist/include/mozilla/RefPtr.h:164
#21 0x409fc848 in mozilla::dom::CanvasRenderingContext2D::Reset (this=0x453e5c00)
    at /hack/mozilla-aurora/content/canvas/src/CanvasRenderingContext2D.cpp:616
---Type <return> to continue, or q <return> to quit---
#22 0x409ffcba in ~CanvasRenderingContext2D (this=0x1f, __in_chrg=<value optimized out>)
    at /hack/mozilla-aurora/content/canvas/src/CanvasRenderingContext2D.cpp:546
#23 0x409ffd70 in ~CanvasRenderingContext2D (this=0x1f, __in_chrg=<value optimized out>)
    at /hack/mozilla-aurora/content/canvas/src/CanvasRenderingContext2D.cpp:559
#24 0x409fa4ee in mozilla::dom::CanvasRenderingContext2D::DeleteCycleCollectable (this=0x1f)
    at /hack/mozilla-aurora/content/canvas/src/CanvasRenderingContext2D.cpp:463
#25 0x409fa392 in mozilla::dom::CanvasRenderingContext2D::cycleCollection::DeleteCycleCollectable (this=<value optimized out>, 
    p=0x41b57b58) at /hack/mozilla-aurora/content/canvas/src/CanvasRenderingContext2D.h:407
#26 0x4118826e in ~SnowWhiteKiller (this=0xbed86ef4, __in_chrg=<value optimized out>)
    at /hack/mozilla-aurora/xpcom/base/nsCycleCollector.cpp:1993
#27 0x41188eba in nsCycleCollector::FreeSnowWhite (this=0x40375000, aUntilNoSWInPurpleBuffer=<value optimized out>)
    at /hack/mozilla-aurora/xpcom/base/nsCycleCollector.cpp:2102
#28 0x41188ef0 in nsCycleCollector_doDeferredDeletion () at /hack/mozilla-aurora/xpcom/base/nsCycleCollector.cpp:3153
#29 0x40d3e91e in AsyncFreeSnowWhite::Run (this=0x403de040) at /hack/mozilla-aurora/js/xpconnect/src/XPCJSRuntime.cpp:231
#30 0x4118238c in nsThread::ProcessNextEvent (this=0x40316940, mayWait=<value optimized out>, result=0xbed86f97)
    at /hack/mozilla-aurora/xpcom/threads/nsThread.cpp:622
#31 0x4115fcc4 in NS_ProcessNextEvent (thread=0xc298144, mayWait=false) at /hack/mozilla-aurora/xpcom/glue/nsThreadUtils.cpp:238
#32 0x40ea1208 in mozilla::ipc::MessagePump::Run (this=0x40301be0, aDelegate=0xbed878a4)
    at /hack/mozilla-aurora/ipc/glue/MessagePump.cpp:81
#33 0x40ea12ba in mozilla::ipc::MessagePumpForChildProcess::Run (this=0x40301be0, aDelegate=0xbed878a4)
    at /hack/mozilla-aurora/ipc/glue/MessagePump.cpp:234
#34 0x4119e968 in MessageLoop::RunInternal (this=0x1000000) at /hack/mozilla-aurora/ipc/chromium/src/base/message_loop.cc:220
#35 0x4119e9e6 in MessageLoop::RunHandler (this=0xbed878a4) at /hack/mozilla-aurora/ipc/chromium/src/base/message_loop.cc:213
#36 MessageLoop::Run (this=0xbed878a4) at /hack/mozilla-aurora/ipc/chromium/src/base/message_loop.cc:187
#37 0x40e50af0 in nsBaseAppShell::Run (this=0x43595d60) at /hack/mozilla-aurora/widget/xpwidgets/nsBaseAppShell.cpp:161
#38 0x406c6576 in XRE_RunAppShell () at /hack/mozilla-aurora/toolkit/xre/nsEmbedFunctions.cpp:679
#39 0x40ea1288 in mozilla::ipc::MessagePumpForChildProcess::Run (this=0x40301be0, aDelegate=0xbed878a4)
    at /hack/mozilla-aurora/ipc/glue/MessagePump.cpp:201
#40 0x4119e968 in MessageLoop::RunInternal (this=0x43595d60) at /hack/mozilla-aurora/ipc/chromium/src/base/message_loop.cc:220
#41 0x4119e9e6 in MessageLoop::RunHandler (this=0xbed878a4) at /hack/mozilla-aurora/ipc/chromium/src/base/message_loop.cc:213
#42 MessageLoop::Run (this=0xbed878a4) at /hack/mozilla-aurora/ipc/chromium/src/base/message_loop.cc:187
#43 0x406c6a6a in XRE_InitChildProcess (aArgc=-1093109196, aArgv=0xbed879ac, aProcess=1077172224)
    at /hack/mozilla-aurora/toolkit/xre/nsEmbedFunctions.cpp:516
#44 0x00008650 in main (argc=7, argv=0xbed87a34) at /hack/mozilla-aurora/ipc/app/MozillaRuntimeMain.cpp:85
This fixes it! Explanation in the in-code comment.
Attachment #810246 - Flags: review?(jgilbert)
Attachment #810246 - Flags: review?(jgilbert) → review+
Comment on attachment 810246 [details] [diff] [review]
Reorder the ScreenBuffer teardown

[Approval Request Comment]
Bug caused by (feature/regressing bug #): A combinations of big changes between gecko 18 and gecko 26: the SurfaceStream stuff from bug 716859, the Gralloc back-end for it from bug 843599, ...
User impact if declined: Crashiness in all the apps that uses canvases, including 2d canvases. That includes the Communications app.
Testing completed (on m-c, etc.): inbound
Risk to taking this patch (and alternatives if risky): very low risk, just reorder deletes. Of course deleting things in wrong order can be crashy, but the current order _is_ crashy ;-) In my local testing, this fixes the crash and doesn't introduce another one so far. Also it inherently makes more sense to destroy things in this order. So, I'm confident.
String or IDL/UUID changes made by this patch: none
Attachment #810246 - Flags: approval-mozilla-aurora?
Comment on attachment 810246 [details] [diff] [review]
Reorder the ScreenBuffer teardown

Oh this is koi+, so no need for approval. Landing.
Attachment #810246 - Flags: approval-mozilla-aurora?
(In reply to Benoit Jacob [:bjacob] from comment #33)
> Oh this is koi+, so no need for approval. Landing.

We do typically prefer that it at *least* be green on trunk first, though.
https://hg.mozilla.org/mozilla-central/rev/3552e205f45e
Status: NEW → RESOLVED
Closed: 6 years ago
Resolution: --- → FIXED
Depends on: 922336
Duplicate of this bug: 922336
I'm very sorry. It turns out that I got confused by my own debugging helpers. The stack that I pasted on comment 29 can't exist with actual mozilla-central or mozilla-aurora code, because the SurfaceFactory does /not/ actually hold a strong reference to mAllocator. Only with my own local debugging patch does it hold a strong reference to mAllocator.
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
And while I did get locally to a point where this wasn't crashing anymore (with my own debugging helpers), that does not give us a hint as to what the real fix is, because this was likely leaking a cycle of strong references like this:


   CanvasLayer ----> CanvasClient -----> TextureClientShmem ----> DrawTargetSkia
         ^                                                              |
         |                                                              |
         |                                                              V
ClientLayerManager <-- SurfaceFactory_Gralloc <-- GLScreenBuffer <--  GLContext

                     ^
                     |
                     |
      This one arrow is the strong reference
      that my local debugging helpers added.

To be clear, by "my local debugging helpers", I'm not talking about stuff that landed, I'm only explaining why it had seemed to me, in my local testing, that the crash was fixed.

Back to debugging this...
I think I got it. I can't tell if it's the only problem, but it is definitely _a_ problem that seems to explain very well what we see in the stack trace to the actual crash, in comment 28.

In that stack, see how everything is triggered by a DOM call (CanvasRenderingContext2D::Scale at frame 30). This crashes as it eventually tries to do something that relies on the layers IPC system (the crash line with mAllocator->DestroySharedSurface(...), where mALlocator is a ClientLayerManager) but that was already destroyed.

So I tried to understand why the ClientLayerManager was already destroyed, but that is in fact not mysterious or interesting at all: as the DOM is managed by cycle collection, assuming that the layers IPC system is not, there is nothing that could exactly tie their lifetimes and ensure that the DOM doesn't survive the layers IPC system!

So we have two conceivable ways to fix this:

 * option #1: have the DOM keep the child-side layers IPC system alive (so that the child-side layers IPC system won't go down until the cycle collector has taken down the DOM). Here 'the DOM' probably means the DOM GlobalWindow / Window / Document (I am not a DOM expert...), and 'the layers IPC system' probably means something around PLayerTransactionChild or ClientLayerManager, but I am not a layers IPC expert.

 * option #2: have the layers IPC system notify the DOM when it's going down, so that the DOM knows that any resource that it referenced, that was managed by the layers IPC system, is now neutered/invalid.

Opinions? By default, I am more tempted to try out option #1 because it seems like it could be a very small patch solving an entire class of problems, whereas option #2 would require ad-hoc code for every occurence of this kind of problems. But I am not confident about the consequences of reordering the child processes shutdown sequence in this way.
I made some investigation on this issue on Unagi:
Gecko  http://hg.mozilla.org/mozilla-central/rev/d71579c316c1
Gaia  67243760c86561e365bdd6def519105366e24be3
BuildID 20131001040206
Version 27.0a1

https://bugzilla.mozilla.org/show_bug.cgi?id=923033#c1


We fixed the test and now it's passing but the phone is still crashing
This crash also continues to be seen by stability testing on v1.2.
Blocks: 916996
(In reply to Benoit Jacob [:bjacob] from comment #40)

So, talked with some people (:roc, :bent). Option #1 as phrased above is bad, because canvas rendering should not depend on Layers whatsoever, and because canvas elements can exist outside of the DOM tree. The closest modification of option #1 that would make sense would be to split the surface allocation protocol out of PLayerTransaction so it's not tied to Layers anymore, and have that be controlled by the PBrowser protocol. That won't guarantee exact ordering of the shutdown sequence vs. HTML canvas elements.

Which takes us to option #2: we need that anyway. That is actually very easy to implement with WeakPtr's. We have a very simple, risk-free path to fixing each of these bugs by turning raw pointers into weak pointers and checking for nullity before dereferencing them. It's tedious and unsatisfactory, but I believe that it's the right thing to do for B2G v1.2 at least.
This fixes the crash in my local testing. As it is a weak pointer, it doesn't change the lifespan of anything, so I feel confident that this time around, this is an actual fix. (Still sorry about the bad 'fix' earlier on this bug!)
Attachment #813622 - Flags: review?(sotaro.ikeda.g)
Depends on: 923530
Changed to use atomic refcounting.
Attachment #813622 - Attachment is obsolete: true
Attachment #813622 - Flags: review?(sotaro.ikeda.g)
Attachment #813633 - Flags: review?(sotaro.ikeda.g)
Comment on attachment 813633 [details] [diff] [review]
Use a WeakPtr to the SurfaceAllocator

The patch seems good to me in general. But I do not well about SharedSurface_Gralloc. It is better to ask a review to a person who know about SharedSurface_Gralloc.
Attachment #813633 - Flags: review?(sotaro.ikeda.g) → feedback+
Attachment #813633 - Flags: review?(vladimir)
Attachment #813633 - Flags: review?(jgilbert)
(In reply to Sotaro Ikeda [:sotaro] from comment #46)
> Comment on attachment 813633 [details] [diff] [review]
> Use a WeakPtr to the SurfaceAllocator
> 
> The patch seems good to me in general. But I do not well about
> SharedSurface_Gralloc. It is better to ask a review to a person who know
> about SharedSurface_Gralloc.

As we discussed, this usage does not need to be atomic weak pointer. And from an usage of atomic weak pointer, attachment 813633 [details] [diff] [review] seem not correct. If we want to use atomic weak pointer, we should not hold raw pointer, otherwise we should hold strong pointer to extend the life time of the allocator.
Indeed (as discussed offline). The 'Atomic' here is purely cosmetic at the moment. Fixing this seems to require changes in MFBT. Specifically, MFBT's WeakPtr allows casting to raw pointer, but not atomically constructing a RefPtr from a WeakPtr ; for Atomic to be meaningful there, that would have to be the converse. Will file bugs about fixing that.

Meanwhile, for B2G v1.2, this is not an immediate concern as the present code (publishing canvas frames) is currently only run on the main thread. That issue will become concrete in the next few months as we enable canvas-on-workers and let it talk directly to the compositor. But as that is all what we need for B2G v1.2 (and for current mozilla-central), let's continue with this patch as-is for now.

https://tbpl.mozilla.org/?tree=Try&rev=76b6f909dd6e
Filed bug 923554 about making AtomicSupportsWeakPtr actually usable atomically.
Comment on attachment 813633 [details] [diff] [review]
Use a WeakPtr to the SurfaceAllocator

It wouldn't hurt to have :nical's review on making ISurfaceAllocator a SupportsWeakPtr.
Attachment #813633 - Flags: review?(nical.bugzilla)
Attachment #813633 - Flags: review?(nical.bugzilla) → review+
No longer blocks: b2g-central-dogfood
Comment on attachment 813633 [details] [diff] [review]
Use a WeakPtr to the SurfaceAllocator

Review of attachment 813633 [details] [diff] [review]:
-----------------------------------------------------------------

::: gfx/gl/SharedSurfaceGralloc.cpp
@@ +131,5 @@
>      mGL->MakeCurrent();
>      mGL->fDeleteTextures(1, (GLuint*)&mProdTex);
>  
> +    if (mAllocator) {
> +        SurfaceDescriptor desc(mDesc);

Let's leave this outside the branch.

@@ +132,5 @@
>      mGL->fDeleteTextures(1, (GLuint*)&mProdTex);
>  
> +    if (mAllocator) {
> +        SurfaceDescriptor desc(mDesc);
> +        mAllocator->DestroySharedSurface(&desc);

Well, I hope we don't get a race slice between the check and the deref.
Attachment #813633 - Flags: review?(jgilbert) → review+
Blocks: 924647
Comment on attachment 813633 [details] [diff] [review]
Use a WeakPtr to the SurfaceAllocator

2 reviews = enough, but it's still very useful to have as many people as possible aware of these issues.
Attachment #813633 - Flags: review?(vladimir)
https://hg.mozilla.org/mozilla-central/rev/d67872a94a98
Status: REOPENED → RESOLVED
Closed: 6 years ago6 years ago
Resolution: --- → FIXED
Deep sadness.  We still see this crash manifesting on v1.2 with the landed patch here.   Should I reopen this bug or create a new one?
Flags: needinfo?(bjacob)
Do you have steps to reproduce? The steps-to-reproduce that used to allow me to reproduce this crash, do not reproduce anymore with the patch that landed (I carefully checked, the second time!).

Maybe once we'll know STR we can decide whether to reopen or file a separate bug.

Also, do you know if at least the crash occurs less frequently with the patch that landed?
Flags: needinfo?(bjacob)
...that said, one can take a guess as to what might still be crashing. The line that used to crash was 

  mAllocator->DestroySharedSurface(&desc);

This could crash for two reasons: either (1) the 'mAllocator' being dead, or (2) the gralloc buffer referenced by 'desc' being dead. Everytime I reproduced this, we were in case (1), so in my patch I fixed (1) and that fixed the bug for me. I assumed that for whatever reason (2) didn't seem to happen and I wanted to make minimal changes to B2G v1.2. But if we lack STR for this crash and have to make guesses, (2) is probably my best guess at this point. If that is the problem, then it's 'the usual' GrallocBufferActor crash, which we've dealt with before in other cases (e.g. bug 912725), so it's not too scary to handle it there too.
I don't have STR beyond run monkey for 8 hours overnight.  We've seen the crash manifest on v1.2 builds from Oct 10, 13, 14, 15 -- but I don't have data on the frequency of the crash per build.   Looking at the crash history, looks like about the same frequency before/after the patch in this bug landed.
(In reply to Michael Vines [:m1] [:evilmachines] from comment #59)
> I don't have STR beyond run monkey for 8 hours overnight.  We've seen the
> crash manifest on v1.2 builds from Oct 10, 13, 14, 15 -- but I don't have
> data on the frequency of the crash per build.   Looking at the crash
> history, looks like about the same frequency before/after the patch in this
> bug landed.

FWIW - we saw just saw this crash today in bug 927433 when launching the music app.
(re-opening for now)
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
(this crash has manifested twice in our testing today)
Target Milestone: mozilla27 → 1.2 C3(Oct25)
Keywords: topcrash-b2g
Priority: -- → P4
Priority: P4 → P1
This is blocking partner certification, but we do need to find some reproducible testcase to have confidence we fixed it.  Is there any information beyond that one stack trace?
I'm confident that we have not fixed this yet as we continue to see this crash regularly on v1.2 as a part of nightly monkey testing.  No better STR from me other than let a monkey work on the device for a couple hours unfortunately.  We can easily locally add WIP patches to improve logging and/or try out fixes though
The ongoing gfx work week in Paris is keeping busy the people who could work on this (me included) but the good news is that we have discussed this class of crashes (user-after-free of GrallocBufferActor, assuming that that is what the remaining crashes here are) and agreed on a short-term solution to this class of crashes.

Meanwhile I still can't reproduce but will submit asap a patch for you to try.
:m1, could you please try this patch?

It should fix this crash if my theory, that the problem is a dead GrallocBufferActor, is correct (and given the crash line and the fact that we already fixed the case of a dead SurfaceAllocator, I don't see any other possibility).

Unfortunately I still can't reproduce so I have to ask you to try this for me. Thanks!
Flags: needinfo?(mvines)
We applied the patch from attachment 821324 [details] [diff] [review] 

But there is a new crash after applying it:

 [@ mozilla::gl::SharedSurface_Gralloc::Create(mozilla::gl::GLContext*, mozilla::gl::GLFormats const&, nsIntSize const&, bool, mozilla::layers::ISurfaceAllocator*) | mozilla::gl::SurfaceFactory_Gralloc::CreateShared(nsIntSize const&) | mozilla::gfx::SurfaceFactory::NewSharedSurface(nsIntSize const&) | mozilla::gfx::SurfaceStream::New(mozilla::gfx::SurfaceFactory*, nsIntSize const&, mozilla::gfx::SharedSurface*&) ]

I'll attach the minidump, etc.
The app URL mentioned in attachment 821797 [details] could be considered suspicious (hi membuster!) so I'll add that we saw the same crash signature manifest in the following apps as well:

app://communications.gaiamobile.org/manifest.webapp
app://crystalskull.gaiamobile.org/manifest.webapp
app://ds-test.gaiamobile.org/manifest.webapp
app://geoloc.gaiamobile.org/manifest.webapp
app://wallpaper.gaiamobile.org/manifest.webapp
Flags: needinfo?(mvines)
hi guys I have just replicated this twice on today's nightly Hamachi master build.

https://crash-stats.mozilla.com/report/index/ede217a3-4900-487e-bb79-b15102131024
https://crash-stats.mozilla.com/report/index/becba26a-e0f8-42e4-91f6-665532131024

The replication is the same as comment #3 using the test automation but it is less frequent, maybe 1 in 5 rather than 5 in 5.

It is also crashing differently, a 'soft' crash with just a prompt in Gaia rather than a reboot crash, which means it's a bit tricky for our automation to detect. 

It is not our automation submitting these reports (unless we observe, interrupt the test and submit the report like I did above) so it's occurring in the wild, too.

Build:
Gecko  http://hg.mozilla.org/mozilla-central/rev/19fd3388c372
Gaia  62a84888f1c10556af8fbbd50e126bb1afe96276
BuildID 20131024040207
Version 27.0a1
Depends on: 932537
Sorry that I didn't get back to you earlier. I still can't reproduce crashes locally, with or without the previous patch that I submitted. I don't know what that is! Note that I am only testing on b2g1.2, not on mozilla-central.

Anyway, rather than testing yet another quick and dirty fix, here is what should be the 'right' fix. It works for me locally, but then again, I couldn't reproduce the crash that it's supposed to fix. But at least, it makes sense.

I'm going to attach a combined patch to make it easier to apply to b2g1.2.
Michael, please test this on b2g 1.2 ! Only apply this patch, no need for anything else.

(Or would you like me to make a mozilla-central version now?)
Flags: needinfo?(mvines)
There were some issues in the previous combined patch; here is the latest and greatest. I'd be very interested in testing!
Attachment #824341 - Attachment is obsolete: true
From the crash info in Comment 71, I feel that the problem might not related to SurfaceDescriptor nor GrallocBufferActor, but related to SharedSurface.
Thanks Sotaro! It seems that you are right about this being unrelated to GrallocBufferActor:

The crash address is 0x18 which is 24, which is the offset of DestroySharedSurface in the ISurfaceAllocator vtable (because it is the 7th virtual method in this class), which we already encountered in comment 28.

So it seems that mAllocator is still nullptr... as if the WeakPtr didn't work. But the other threads stacks in the crash reports do not seem to suggest a race condition, so at the moment I don't understand what is happening.
The code really is

    if (mAllocator) {
        mAllocator->DestroySharedSurface(&desc);
    }

so we must be having a race condition here, for mAllocator to be null by the time we call DestroySharedSurface!
(In reply to Benoit Jacob [:bjacob] from comment #77)

Should I hold off on testing with attachment 824635 [details] [diff] [review] then?
Flags: needinfo?(bjacob)
Flags: needinfo?(mvines)
Yes, please hold off for now. Sorry about the noise.
Flags: needinfo?(bjacob)
Remove 932537 from dependent bug from Comment 75, Comment 76.
No longer depends on: 932537
Strange part of the crash is that stack shows the destructor of SurfaceStream_SingleBuffer. In normal use cases, SurfaceStream_SingleBuffer is destroyed soon after creation. It is replaced by new omtc supported one in GLScreenBuffer::Morph() during ClientCanvasLayer::Initialize().
(In reply to Sotaro Ikeda [:sotaro] from comment #81)
> Strange part of the crash is that stack shows the destructor of
> SurfaceStream_SingleBuffer. In normal use cases, SurfaceStream_SingleBuffer
> is destroyed soon after creation. It is replaced by new omtc supported one
> in GLScreenBuffer::Morph() during ClientCanvasLayer::Initialize().

Oh, I understand it. If canvas is used for thumbnail generation, SurfaceStream_SingleBuffer is used without change.
(In reply to Sotaro Ikeda [:sotaro] from comment #82)
> (In reply to Sotaro Ikeda [:sotaro] from comment #81)
> > Strange part of the crash is that stack shows the destructor of
> > SurfaceStream_SingleBuffer. In normal use cases, SurfaceStream_SingleBuffer
> > is destroyed soon after creation. It is replaced by new omtc supported one
> > in GLScreenBuffer::Morph() during ClientCanvasLayer::Initialize().
> 
> Oh, I understand it. If canvas is used for thumbnail generation,
> SurfaceStream_SingleBuffer is used without change.

I'm just going to reaffirm that this is what should be happening. (from my knowledge of snapshotting)
This combined patch, which applies on B2G 1.2, implements the completely different approach taken in bug 933082 (I filed a separate bug to discuss these patches, because I didn't want to spam the long CC list here).

I really hope that this fixes the crash, and if somehow it doesn't, at least we'll learn something, because it changes completely the way that this mAllocator is managed.

Michael, Greg, please try it!
Attachment #824635 - Attachment is obsolete: true
Flags: needinfo?(mvines)
Flags: needinfo?(ggrisco)
(In reply to Greg Grisco from comment #68)
> Created attachment 821796 [details]
> decoded minidump after applying 821324

From attachment 821796 [details], in SharedSurface_Gralloc::Create(), it become clear that argument allocator is used without nullptr check.
SharedSurface_Gralloc is still holding raw pointer of layers::ISurfaceAllocator.
  http://mxr.mozilla.org/mozilla-central/source/gfx/gl/SharedSurfaceGralloc.h#41

WeakPtr<layers::ISurfaceAllocator> is used only in SurfaceFactory_Gralloc.
bjacob, before trying attachment 825046 [details] [diff] [review], it might be better to fix Comment 85 and Comment 86.
Flags: needinfo?(bjacob)
No longer depends on: RefcntAllocator
Sotaro: Thanks, you are right that we were only using WeakPtr in one of two places. Also, Matt's comments on bug 933082 showed me how my interpretation was wrong. The crash at address 0x18 means that mAllocator->vptr is null, which means that *mAllocator was destroyed already. It does not mean that mAllocator (as a pointer) is null. So with these two things together, the mystery is resolved.... sorry!
Flags: needinfo?(mvines)
Flags: needinfo?(ggrisco)
Flags: needinfo?(bjacob)
Attachment #825046 - Attachment is obsolete: true
Sotaro, do you see more that could be done, than this? In my understanding, this should fix the crashes reported in comment 71. 

The other crash that you mentioned in comment 85 is the one mentioned in comment 68; it is with another patch applied, that we are not considering anymore. So maybe we don't need to worry about it.
Attachment #825255 - Flags: feedback?(sotaro.ikeda.g)
> The other crash that you mentioned in comment 85 is the one mentioned in
> comment 68; it is with another patch applied, that we are not considering
> anymore. So maybe we don't need to worry about it.

Yeah, I checked the code, we do not need to worry about it.
Attachment #825255 - Flags: feedback?(sotaro.ikeda.g) → feedback+
Duplicate of this bug: 933204
(In reply to Benoit Jacob [:bjacob] from comment #89)
> Sotaro, do you see more that could be done, than this? In my understanding,
> this should fix the crashes reported in comment 71. 

About this bug, I do not know another thing need to do. By Bug 933082, all holding reference to ISurfaceAllocator need to change to thread-safe strong reference.
Comment on attachment 825255 [details] [diff] [review]
Use WeakPtr also in SharedSurface_Gralloc

OK, on to review.
Attachment #825255 - Flags: review?(sotaro.ikeda.g)
Attachment #825255 - Flags: review?(sotaro.ikeda.g) → review+
https://hg.mozilla.org/integration/b2g-inbound/rev/e361a255253f

Credits really go to Sotaro and Matt for finding how my interpretation of this crash was wrong!
(In reply to Benoit Jacob [:bjacob] from comment #94)

Can you summarize what patches should be applied for re-test?  Is it just attachment 825255 [details] [diff] [review]?

Thanks,

-Greg
Flags: needinfo?(bjacob)
(In reply to Greg Grisco from comment #95)
> Can you summarize what patches should be applied for re-test?  Is it just
> attachment 825255 [details] [diff] [review]?

Yes, just this one tiny patch. Thanks!
Flags: needinfo?(bjacob)
https://hg.mozilla.org/mozilla-central/rev/e361a255253f
Status: REOPENED → RESOLVED
Closed: 6 years ago6 years ago
Resolution: --- → FIXED
After applying the patch we could not reproduce in last night's test run.  Thanks!
v cool! 
I'll try my steps from comment #71 again too.
\o/ Thanks really go to Sotaro!
Verified via socorro and not seeing crashes based on this crash signature.
Status: RESOLVED → VERIFIED
You need to log in before you can comment on or make changes to this bug.