Closed Bug 1151711 Opened 9 years ago Closed 9 years ago

Frequent test_2d.composite.canvas.color.html | application timed out after 450.0 seconds with no output after ABORT: not on worker thread!: 'mWorkerLoopID == MessageLoop::current()->id()'

Categories

(Core :: IPC, defect)

Other Branch
ARM
Gonk (Firefox OS)
defect
Not set
normal

Tracking

()

RESOLVED FIXED
2.2 S14 (12june)
Tracking Status
b2g-v2.2 --- fixed
b2g-master --- unaffected

People

(Reporter: RyanVM, Assigned: jerry)

References

Details

(Keywords: assertion, crash, intermittent-failure)

Attachments

(2 files, 1 obsolete file)

Attached file test log
I've conclusively been able to pinpoint the uplift of bug 1142360 as the "cause" of this frequent (near-permanent) failure on b2g37. Given the nature of the patch, it strongly points to changing chunking boundaries as exposing whatever underlying issue is at play here.

Looks like we're having stackwalking issues on b2g37 as well. I wonder if we're missing an uplift or something there.
Flags: needinfo?(bkelly)
Bill, any ideas otherwise?
Flags: needinfo?(wmccloskey)
We've seen a couple bugs like this recently. Without a stack, though, it's hard to fix. Maybe a b2g person could run the test locally? It sounds like it's pretty reproducible.

I agree that the "regressing" patch probably didn't cause this.
Flags: needinfo?(wmccloskey)
Ryan, I'm sorry but I'm not realistically going to have time to work on this in the next month.  I have too many other fires to put out.  Is there anyone else you can ask?
Flags: needinfo?(bkelly) → needinfo?(ryanvm)
Jed maybe? To recap, this is near perma-fail on b2g37 (v2.2), our recently announced long-term supported B2G release.
Flags: needinfo?(ryanvm) → needinfo?(jld)
Actually, let's try Cervantes first.
Flags: needinfo?(jld) → needinfo?(cyu)
The stackwalking looks like it's working fine; what's needed is some way to translate the libxul.so offsets into names.  The breakpad symbols would work, but they're on pvtbuilds and I apparently don't have the right flavor of VPN set up to get there from where I currently am.

Piping the logs through fix_b2g_stack.py, with the unstripped binaries from the build in the appropriate places, would have taken care of this; I don't know what part(s) of that aren't happening for automated tests.

It's also a little worrying that the crash reporter isn't working — minidump_stackwalk can be seen complaining about 0-length .dmp files.
You can access pvtbuilds with LDAP at https://pvtbuilds.mozilla.org/pvt/mozilla.org/
Is this failure only seen on 2.2? I ran the test on emulator debug build but didn't see the abort() in question.
Flags: needinfo?(cyu)
(In reply to Cervantes Yu from comment #62)
> Is this failure only seen on 2.2? I ran the test on emulator debug build but
> didn't see the abort() in question.
  ^^^^^^^^^^^^^^^^^^^^^^ on m-c.
I didn't see this crash on local 2.2 built.
Yes, we're only seeing it on v2.2. Note that per comment 0, this is heavily dependent on the specific chunk.
(In reply to Treeherder Robot from comment #71)
> log: https://treeherder.mozilla.org/logviewer.html#?repo=mozilla-b2g37_v2_2&job_id=106044
> revision: cc02ee38b252

With the hint from comment #60, I have the symbols.  There ought to be a tool for this, but it's easy enough to interpret breakpad symbols manually with less(1).

> 03:10:30 INFO - #01: ???[libxul.so +0x5729b6]

mozilla::ipc::MessageChannel::AssertWorkerThread, ipc/MessageChannel.h line 402

> 03:10:30 INFO - #02: ???[libxul.so +0x57db8c]

mozilla::ipc::MessageChannel::Send, ipc/glue/MessageChannel.cpp line 720

> 03:10:30 INFO - #03: ???[libxul.so +0x6c525c]

mozilla::layers::PTextureChild::SendClearTextureHostSync, (generated) PTextureChild.cpp line 85

> 03:10:30 INFO - #04: ???[libxul.so +0x89ad04]

mozilla::layers::TextureClient::ForceRemove, gfx/layers/client/TextureClient.cpp line 511 

> 03:10:30 INFO - #05: ???[libxul.so +0x8c1af2]

mozilla::layers::ShadowLayerForwarder::RemoveTexture, line 523

> 03:10:30 INFO - #06: ???[libxul.so +0x8a06d4]
> 03:10:30 INFO - #07: ???[libxul.so +0x86be98]
> 03:10:30 INFO - #08: ???[libxul.so +0x86c3c4]
> 03:10:30 INFO - #09: ???[libxul.so +0x86c3f4]
> 03:10:30 INFO - #10: ???[libxul.so +0x56bb14]
> 03:10:30 INFO - #11: ???[libxul.so +0x8607b6]
> 03:10:30 INFO - #12: ???[libxul.so +0x86082e]
> 03:10:30 INFO - #13: ???[libxul.so +0x8608e2]
> 03:10:30 INFO - #14: ???[libxul.so +0x860908]
> 03:10:30 INFO - #15: ???[libxul.so +0x56bb14]
> 03:10:30 INFO - #16: ???[libxul.so +0x863ed4]
> 03:10:30 INFO - #17: ???[libxul.so +0x86407a]
> 03:10:30 INFO - #18: ???[libxul.so +0x86410c]
> 03:10:30 INFO - #19: ???[libxul.so +0x857384]
> 03:10:30 INFO - #20: ???[libxul.so +0x87913e]
> 03:10:30 INFO - #21: ???[libxul.so +0xdf2f54]
> 03:10:30 INFO - #22: ???[libxul.so +0xdd10e6]
> 03:10:30 INFO - #23: ???[libxul.so +0xdd10f8]
> 03:10:30 INFO - #24: ???[libxul.so +0xde11fa]
> 03:10:30 INFO - #25: ???[libxul.so +0xde6b9e]

mozilla::WebGLContext::cycleCollection::DeleteCycleCollectable, dom/canvas/WebGLContext.h line 192

> 03:10:30 INFO - #26: ???[libxul.so +0x379b7c]

SnowWhiteKiller::~SnowWhiteKiller, xpcom/base/nsCycleCollector.cpp line 2674

> 03:10:30 INFO - #27: ???[libxul.so +0x379c62]
> 03:10:30 INFO - #28: ???[libxul.so +0x379df2]
> 03:10:30 INFO - #29: ???[libxul.so +0x37a4c4]
> 03:10:30 INFO - #30: ???[libxul.so +0x37a5c4]
> 03:10:30 INFO - #31: ???[libxul.so +0x37a660]
> 03:10:30 INFO - #32: ???[libxul.so +0x3c2466]

mozilla::ShutdownXPCOM, xpcom/build/XPCOMInit.cpp line 942

> 03:10:30 INFO - #33: XRE_TermEmbedding[libxul.so +0x165f504]
> 03:10:30 INFO - #34: ???[libxul.so +0x574222]
> 03:10:30 INFO - #35: XRE_InitChildProcess[libxul.so +0x16602f6]
> 03:10:30 INFO - #36: ???[/system/b2g/plugin-container +0x8dc0]
> 03:10:30 INFO - #37: __libc_init[libc.so +0x16722]

I could resolve the rest of them, but this is already a little weird — if I'm reading the IPDL correctly, PTexture's toplevel protocol is PCompositor, CompositorChild expects to be owned by the main thread, and this *is* the main thread… but we're hitting the assertion for trying to send from the wrong thread.
Here's a more fixed version of the log from comment #71:

> 03:10:30 INFO - [Child 790] ###!!! ABORT: not on worker thread!: 'mWorkerLoopID == MessageLoop::current()->id()', file ../../dist/include/mozilla/ipc/MessageChannel.h, line 401
> 03:10:30 INFO - #01: mozilla::ipc::MessageChannel::AssertWorkerThread /builds/slave/b2g_m-b37_22_emu-d_ntly-000000/build/objdir-gecko/dist/include/mozilla/ipc/MessageChannel.h:400 (libxul.so+0x5729b6)
> 03:10:30 INFO - #02: mozilla::ipc::MessageChannel::Send ipc/glue/MessageChannel.cpp:719 (libxul.so+0x57db8c)
> 03:10:30 INFO - #03: mozilla::layers::PTextureChild::SendClearTextureHostSync /builds/slave/b2g_m-b37_22_emu-d_ntly-000000/build/objdir-gecko/ipc/ipdl/PTextureChild.cpp:84 (libxul.so+0x6c525c)
> 03:10:30 INFO - #04: mozilla::layers::TextureClient::ForceRemove gfx/layers/client/TextureClient.cpp:510 (libxul.so+0x89ad04)
> 03:10:30 INFO - #05: mozilla::layers::ShadowLayerForwarder::RemoveTexture gfx/layers/ipc/ShadowLayers.cpp:523 (libxul.so+0x8c1af2)
> 03:10:30 INFO - #06: mozilla::layers::TextureClient::Finalize gfx/layers/client/TextureClient.cpp:557 (libxul.so+0x8a06d4)
> 03:10:30 INFO - #07: mozilla::AtomicRefCountedWithFinalize<mozilla::layers::TextureClient>::Release /builds/slave/b2g_m-b37_22_emu-d_ntly-000000/build/objdir-gecko/dist/include/mozilla/layers/AtomicRefCountedWithFinalize.h:59 (libxul.so+0x86be98)
> 03:10:30 INFO - #08: mozilla::gl::SharedSurface_Gralloc::~SharedSurface_Gralloc /builds/slave/b2g_m-b37_22_emu-d_ntly-000000/build/objdir-gecko/dist/include/mozilla/RefPtr.h:302 (libxul.so+0x86c3c4)
> 03:10:30 INFO - #09: mozilla::gl::SharedSurface_Gralloc::~SharedSurface_Gralloc gfx/gl/SharedSurfaceGralloc.cpp:164 (libxul.so+0x86c3f4)
> 03:10:30 INFO - #10: nsRefPtr<txACompileObserver>::assign_assuming_AddRef /builds/slave/b2g_m-b37_22_emu-d_ntly-000000/build/objdir-gecko/dist/include/nsRefPtr.h:47 (libxul.so+0x56bb14)
> 03:10:30 INFO - #11: mozilla::gl::ShSurfHandle::~ShSurfHandle /builds/slave/b2g_m-b37_22_emu-d_ntly-000000/build/objdir-gecko/dist/include/mozilla/UniquePtr.h:255 (libxul.so+0x8607b6)
> 03:10:30 INFO - #12: mozilla::RefPtr<mozilla::gl::ShSurfHandle>::unref /builds/slave/b2g_m-b37_22_emu-d_ntly-000000/build/objdir-gecko/dist/include/mozilla/RefPtr.h:142 (libxul.so+0x86082e)
> 03:10:30 INFO - #13: mozilla::gl::GLScreenBuffer::~GLScreenBuffer /builds/slave/b2g_m-b37_22_emu-d_ntly-000000/build/objdir-gecko/dist/include/mozilla/RefPtr.h:243 (libxul.so+0x8608e2)
> 03:10:30 INFO - #14: mozilla::gl::GLScreenBuffer::~GLScreenBuffer gfx/gl/GLScreenBuffer.cpp:88 (libxul.so+0x860908)
> 03:10:30 INFO - #15: nsRefPtr<txACompileObserver>::assign_assuming_AddRef /builds/slave/b2g_m-b37_22_emu-d_ntly-000000/build/objdir-gecko/dist/include/nsRefPtr.h:47 (libxul.so+0x56bb14)
> 03:10:30 INFO - #16: mozilla::gl::GLContext::MarkDestroyed /builds/slave/b2g_m-b37_22_emu-d_ntly-000000/build/objdir-gecko/dist/include/mozilla/UniquePtr.h:281 (libxul.so+0x863ed4)
> 03:10:30 INFO - #17: mozilla::gl::GLContextEGL::~GLContextEGL gfx/gl/GLContextProviderEGL.cpp:248 (libxul.so+0x86407a)
> 03:10:30 INFO - #18: mozilla::gl::GLContextEGL::~GLContextEGL gfx/gl/GLContextProviderEGL.cpp:274 (libxul.so+0x86410c)
> 03:10:30 INFO - #19: mozilla::detail::GenericRefCounted<(mozilla::detail::RefCountAtomicity)0u>::Release /builds/slave/b2g_m-b37_22_emu-d_ntly-000000/build/objdir-gecko/dist/include/mozilla/GenericRefCounted.h:95 (libxul.so+0x857384)
> 03:10:30 INFO - #20: nsRefPtr<mozilla::gl::GLContext>::assign_assuming_AddRef /builds/slave/b2g_m-b37_22_emu-d_ntly-000000/build/objdir-gecko/dist/include/nsRefPtr.h:47 (libxul.so+0x87913e)
> 03:10:30 INFO - #21: mozilla::WebGLContext::~WebGLContext dom/canvas/WebGLContext.cpp:305 (libxul.so+0xdf2f54)
> 03:10:30 INFO - #22: mozilla::WebGL1Context::~WebGL1Context dom/canvas/WebGL1Context.cpp:26 (libxul.so+0xdd10e6)
> 03:10:30 INFO - #23: mozilla::WebGL1Context::~WebGL1Context dom/canvas/WebGL1Context.cpp:26 (libxul.so+0xdd10f8)
> 03:10:30 INFO - #24: mozilla::WebGLContext::DeleteCycleCollectable dom/canvas/WebGLContext.cpp:1874 (libxul.so+0xde11fa)
> 03:10:30 INFO - #25: mozilla::WebGLContext::cycleCollection::DeleteCycleCollectable dom/canvas/WebGLContext.h:192 (libxul.so+0xde6b9e)
> 03:10:30 INFO - #26: SnowWhiteKiller::~SnowWhiteKiller xpcom/base/nsCycleCollector.cpp:2674 (libxul.so+0x379b7c)
> 03:10:30 INFO - #27: nsCycleCollector::FreeSnowWhite xpcom/base/nsCycleCollector.cpp:2841 (libxul.so+0x379c62)
> 03:10:30 INFO - #28: nsCycleCollector::BeginCollection xpcom/base/nsCycleCollector.cpp:3781 (libxul.so+0x379df2)
> 03:10:30 INFO - #29: nsCycleCollector::Collect xpcom/base/nsCycleCollector.cpp:3616 (libxul.so+0x37a4c4)
> 03:10:30 INFO - #30: nsCycleCollector::ShutdownCollect xpcom/base/nsCycleCollector.cpp:3568 (libxul.so+0x37a5c4)
> 03:10:30 INFO - #31: nsCycleCollector_shutdown xpcom/base/nsCycleCollector.cpp:4264 (libxul.so+0x37a660)
> 03:10:30 INFO - #32: mozilla::ShutdownXPCOM xpcom/build/XPCOMInit.cpp:940 (libxul.so+0x3c2466)
> 03:10:30 INFO - #33: XRE_TermEmbedding[libxul.so +0x165f504]
> 03:10:30 INFO - #34: mozilla::ipc::ScopedXREEmbed::Stop ipc/glue/ScopedXREEmbed.cpp:115 (libxul.so+0x574222)
> 03:10:30 INFO - #35: XRE_InitChildProcess[libxul.so +0x16602f6]
> 03:10:30 INFO - #36: content_process_main ipc/contentproc/plugin-container.cpp:211 (plugin-container+0xdc0)
> 03:10:30 INFO - #37: __libc_init[libc.so +0x16722]
(In reply to Jed Davis [:jld] {UTC-7} from comment #72)
> I could resolve the rest of them, but this is already a little weird — if
> I'm reading the IPDL correctly, PTexture's toplevel protocol is PCompositor,
> CompositorChild expects to be owned by the main thread, and this *is* the
> main thread… but we're hitting the assertion for trying to send from the
> wrong thread.

Maybe this is use after free? If the MessageChannel is already dead, it could fail this assertion check (the MessageLoop is still alive when this crash happends).
This appears to be use after free of the MessageChannel instance. I added log instrumentation and found that when the assertion fails, mWorkerLoopID contains a random integer value.
This has shifted to test_2d.composite.canvas.color-dodge.html now.
Summary: Frequent test_2d.composite.canvas.color-burn.html | application timed out after 450.0 seconds with no output after ABORT: not on worker thread!: 'mWorkerLoopID == MessageLoop::current()->id()', file MessageChannel.h, line 401 → Frequent test_2d.composite.canvas.color-burn.html,test_2d.composite.canvas.color-dodge.html | application timed out after 450.0 seconds with no output after ABORT: not on worker thread!: 'mWorkerLoopID == MessageLoop::current()->id()'
Josh, B2G debug mochitest-5 has been failing ~90% of the time on B2G v2.2 for nearly 2 months now. Please help me find someone who can investigate this before I'm forced to hide the suite due to lack of progress.
Flags: needinfo?(jocheng)
Hi Peter,
This seems to be gfx issue.
Could you help to find someone who can help to fix this?
Thanks!
Flags: needinfo?(pchang)
Jerry, please help to check the WebGL resource free problem based on comment 76 and comment 106.
Flags: needinfo?(pchang) → needinfo?(hshih)
I will try to reproduce locally.
Assignee: nobody → hshih
Flags: needinfo?(hshih)
Status: NEW → ASSIGNED
Thanks, Peter and Jerry!
Flags: needinfo?(jocheng)
Summary: Frequent test_2d.composite.canvas.color-burn.html,test_2d.composite.canvas.color-dodge.html | application timed out after 450.0 seconds with no output after ABORT: not on worker thread!: 'mWorkerLoopID == MessageLoop::current()->id()' → Frequent test_2d.composite.canvas.color.html | application timed out after 450.0 seconds with no output after ABORT: not on worker thread!: 'mWorkerLoopID == MessageLoop::current()->id()'
The mWorkerLoopID id in the broken case seems invalid. Investigating...


04:52:53.693 I/GonkMemoryPressure(  793): Observed XPCOM shutdown.
...........
04:53:13.653 I/Gecko   (  793): [SharedSurface_Gralloc 0x4670b820] destroyed
04:53:13.653 I/Gecko   (  793): bignose ForceRemove 1, addr:0x441bcb80, actor:0x451ab100, tid:793
04:53:13.653 I/Gecko   (  793): bignose thread error, tid:793, workerID:1515870810, MessageLoop::current()->id():1   <====
04:53:13.653 I/Gecko   (  793): [Child 793] ###!!! ABORT: not on worker thread!: 'mWorkerLoopID == MessageLoop::current()->id()', file ../../dist/include/mozilla/ipc/MessageChannel.h, line 404
I see some interested log. We delete the actor twice. Please see the address "0x44053af0".
Investigating...

23:00:15     INFO -  06-01 05:46:28.555 I/Gecko   (  791): bignose TextureClient::Finalize, actor:0x44053af0, forwarder:0x44a679b0, tid:815
23:00:15     INFO -  06-01 05:46:28.555 I/Gecko   (  791): bignose ForceRemove 1, addr:0x44053af0, actor:0x45aa3440, tid:815
...........
23:00:17     INFO -  06-01 05:50:51.606 I/Gecko   (  791): [SharedSurface_Gralloc 0x46586b50] destroyed
23:00:17     INFO -  06-01 05:50:51.615 I/Gecko   (  791): bignose TextureClient::Finalize, actor:0x44053af0, forwarder:0x44afb820, tid:791
23:00:17     INFO -  06-01 05:50:51.615 I/Gecko   (  791): bignose ForceRemove 1, addr:0x44053af0, actor:0x44f48d80, tid:791
23:00:17     INFO -  06-01 05:50:51.615 I/Gecko   (  791): bignose thread error, tid:791, worker:1515870810 current:1
TextureClient could be recycled, so ignore the comment 247.
The crash stack from comment 76 show that we use ipc actor after ipc shutdown.

ipc shutdown:
https://hg.mozilla.org/mozilla-central/annotate/56241c1f8a3b/xpcom/build/XPCOMInit.cpp#l854

CC shutdown:
https://hg.mozilla.org/mozilla-central/annotate/56241c1f8a3b/xpcom/build/XPCOMInit.cpp#l946

Let me check our ActorDestroy() function.
I know what's going on here. Will upload a patch later.
Since we call CompositorChild::Shutdown() in gfxPlatform::ShutdownLayersIPC(), we also delete the LayerTransaction ipc. Then we use the ipc during cc. Please check comment 76.
And we don't have this in m-c. Only v2.2(b2g37) has this patch. I try to remove this patch now. https://bugzilla.mozilla.org/show_bug.cgi?id=1065536#c18
Attachment #8614181 - Flags: review?(continuation)
Comment on attachment 8614181 [details] [diff] [review]
remove the shutdown call of CompositorChild in child process. v1

Review of attachment 8614181 [details] [diff] [review]:
-----------------------------------------------------------------

Sorry for causing trouble here. Please add something to the description saying that you are backing out commit 3cf774a0954d. It would probably also make more sense to land this backout in bug 1065536.
Attachment #8614181 - Flags: review?(continuation) → review+
Comment on attachment 8614181 [details] [diff] [review]
remove the shutdown call of CompositorChild in child process. v1

I can't recall the last time I've seen so much green on a b2g37 debug M5 run. *sniff* It's beautiful!
Attachment #8614181 - Flags: feedback+
update patch comment.
Attachment #8614181 - Attachment is obsolete: true
Attachment #8614459 - Attachment description: back out bug 1065536 part 3 for some ipc crash. v2 → back out bug 1065536 part 3 for some ipc crash. v2. r=mccr8
Attachment #8614459 - Flags: review+
Comment on attachment 8614459 [details] [diff] [review]
back out bug 1065536 part 3 for some ipc crash. v2. r=mccr8

[Approval Request Comment]
Bug caused by (feature/regressing bug #): 
hit abort as https://bugzilla.mozilla.org/show_bug.cgi?id=1151711#c76
User impact if declined:
m5 test failed
Testing completed:
b2g37 try link https://treeherder.mozilla.org/#/jobs?repo=try&revision=18d5f48b237e
Risk to taking this patch (and alternatives if risky):
low
maybe we will have leak as https://bugzilla.mozilla.org/show_bug.cgi?id=1065536#c0
We back out a patch in 1065536 as m-c does here.
String or UUID changes made by this patch:
none
Attachment #8614459 - Flags: approval-mozilla-b2g37?
Comment on attachment 8614459 [details] [diff] [review]
back out bug 1065536 part 3 for some ipc crash. v2. r=mccr8

Hi Jerry,
Approving and please check again whether m5 test no longer fail.
Thanks!
Flags: needinfo?(hshih)
Attachment #8614459 - Flags: approval-mozilla-b2g37? → approval-mozilla-b2g37+
Flags: needinfo?(hshih)
Keywords: checkin-needed
(In reply to Josh Cheng [:josh] from comment #267)
> Comment on attachment 8614459 [details] [diff] [review]
> back out bug 1065536 part 3 for some ipc crash. v2. r=mccr8
> 
> Hi Jerry,
> Approving and please check again whether m5 test no longer fail.
> Thanks!

The test result.
https://treeherder.mozilla.org/#/jobs?repo=try&revision=18d5f48b237e
Needs more tests.
Huzzah!

https://hg.mozilla.org/releases/mozilla-b2g37_v2_2/rev/2b930da45291
Status: ASSIGNED → RESOLVED
Closed: 9 years ago
Keywords: checkin-needed
Resolution: --- → FIXED
Target Milestone: --- → 2.2 S14 (12june)
No longer blocks: HiddenAutomationTests
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: