Closed Bug 1022794 Opened 10 years ago Closed 10 years ago

Mochitest-debug-3 canvas test hang on B2G (ICS debug) when enabling generational garbage collection

Categories

(Core :: Graphics: Layers, defect)

defect
Not set
normal

Tracking

()

RESOLVED WORKSFORME

People

(Reporter: sfink, Unassigned)

References

Details

Attachments

(1 file, 2 obsolete files)

Enabling GGC on B2G currently fails because of a timeout in mochitest-debug-3. The symptom is a bunch of error messages "How can you drop others buffer" in test_bug866575.html, followed by a 15 minute timeout.

The problem is that in SharedBufferManagerParent.cpp, ActorDestroy() is called at some point, and then RecvDropGrallocBuffer is called multiple times, each time noticing that mBuffers does not contain the requested key (because the whole map has been cleared.)

I have been struggling to reproduce the problem, because it requires running something or other under the 32-bit emulator under runtestsb2g.py under a 32-bit mock chroot under a 64-bit linux box, and I barely have that working. I have not managed to get gdb working yet (mismatched versions between gdb and gdbserver, apparently?) nor do I know how to connect to the right process once I do get it working. The above analysis is all from printf logging on try.

My guess is that something is using GC to clean up the shared buffer stuff, and it isn't rooted, and with GGC on we discard it sooner. But I don't know what holds onto this stuff; I haven't managed to get a call stack for the ActorDestroy() call yet.
Oh, right. Forgot to mention that all of the relevant code is #ifdef MOZ_HAVE_SURFACEDESCRIPTORGRALLOC, which explains why it isn't hit on desktop. Not sure about Android, though...?

Example log:

https://tbpl.mozilla.org/php/getParsedLog.php?id=41300611&tree=Try&full=1

I also do not see it on emulator-jb.
Blocks: 1020751
Darn. I hoped bug 1037360 would fix this. But with that patch, I no longer get the assertions about freeing up buffers after all of them have been cleared, but I still get the hang in test_bug866575.html.

https://tbpl.mozilla.org/php/getParsedLog.php?id=44200855&tree=Try&full=1

It has an NS_ASSERTION in there from this:

+  if (mBuffers.size() > 0) {
+    NS_ASSERTION(false, "stack for ActorDestroy");
+  }

because I was hoping the test infrastructure would produce a stack dump, but sadly it does not.

Any ideas on what is going on? This is a hang when I enable generational garbage collection on b2g, which might mean that the something (the SharedBufferManagementParent?) is unreachable from any root and GGC is noticing sooner and cleaning it up faster. It would be nice to get a stack trace when ActorDestroy is called, but I don't know how. Any ideas or advice would be appreciated. This is the only known problem blocking GGC on B2G currently.

It may not be related to SharedBufferManagementParent, though. I've mainly been looking at that because of the assertion failures I saw only with GGC enabled, but now that bug 1037360 has fixed those, I don't have solid evidence for suspecting this file anymore.
Flags: needinfo?(sotaro.ikeda.g)
The code does not expect RecvDropGrallocBuffer() is called after ActorDestroy(). It might be caused by that SharedBufferManagementParent is TopLevelProtocol.
Flags: needinfo?(sotaro.ikeda.g)
sfink, does 8459623 fix the problem?
Flags: needinfo?(sphink)
Thank you very much for looking at this, checking now: https://tbpl.mozilla.org/?tree=Try&rev=56f05a90f188
Flags: needinfo?(sphink)
(In reply to Steve Fink [:sfink], PTO Jul 28-Aug 6 from comment #6)
> Thank you very much for looking at this, checking now:
> https://tbpl.mozilla.org/?tree=Try&rev=56f05a90f188

That did not fix the timeout. If I understand correctly, your patch would fix the previous problem, where RecvDropGrallocBuffer would be called for a buffer that was already removed in ActorDestroy(). But that problem was fixed in bug 1037360 (though I don't understand the exact sequence of things involved, so perhaps your patch here is still necessary to avoid intermittent problems.)

The problem that remains here is the timeout. It may not be related to SharedBufferManagerParent.cpp at all.

To add to the confusion, the log for the above try push does *not* show a timeout in test_bug866575.html at all now. In fact, it does not seem to contain that test. It is timing out in /tests/content/canvas/test/test_2d.composite.solid.color-dodge.html . I don't know whether that is a real effect of the change from either the |if (!mDestroyed)| patch here or bug 1037360, or whether the chunked tests have been shifted around a litle. My previous push https://tbpl.mozilla.org/?tree=Try&rev=8cfa5eed915e included only bug 1037360 and not this patch, and it was still hanging on test_bug866575.html. But I have rebased since then. I will try a test without this patch: https://tbpl.mozilla.org/?tree=Try&rev=05d04f09c058
Summary: Buffer management failure on B2G (ICS debug) when enabling generational garbage collection → Mochitest-debug-3 canvas test hang on B2G (ICS debug) when enabling generational garbage collection
(In reply to Sotaro Ikeda [:sotaro PTO July/25 - Aug/3] from comment #4)
> Created attachment 8459623 [details] [diff] [review]
> patch - Add destroyed check to RecvDropGrallocBuffer()

Sorry, I found a problem in the patch :-(
The patch make RecvDropGrallocBuffer() to try get a lock twice. The function falls into deadlock.
Fix dead lock.
Attachment #8459623 - Attachment is obsolete: true
Safer patch.
Attachment #8459778 - Attachment is obsolete: true
That doesn't fix my current problem, though it probably would have fixed my original problem (before bug 1037360 fixed it). Here's the log with patch v3 applied: https://tbpl.mozilla.org/php/getParsedLog.php?id=44303428&tree=Try&full=1

jrmuizel (picking you a little at random): do you know what a hang in this test might imply? The only change necessary to trigger it is to enable GGC. The straightforward explanation would be that GGC notices something is dead sooner and finalizes it. But I don't know what objects are involved here.

anyone: Is there some way to get a hang stack out of a b2g ics emulator test run? (Or any stack at all.)
Flags: needinfo?(jmuizelaar)
(In reply to Steve Fink [:sfink], PTO Jul 28-Aug 6 from comment #11)
> That doesn't fix my current problem, though it probably would have fixed my
> original problem (before bug 1037360 fixed it). Here's the log with patch v3
> applied:
> https://tbpl.mozilla.org/php/getParsedLog.php?id=44303428&tree=Try&full=1
> 
> jrmuizel (picking you a little at random): do you know what a hang in this
> test might imply? The only change necessary to trigger it is to enable GGC.
> The straightforward explanation would be that GGC notices something is dead
> sooner and finalizes it. But I don't know what objects are involved here.
> 
> anyone: Is there some way to get a hang stack out of a b2g ics emulator test
> run? (Or any stack at all.)

sfink, the tryserver log have the following log out. Is it the log that you added?  Default SharedBufferManagerParent does not have such code.

>###!!! ASSERTION: stack for ActorDestroy: 'false', file ../../../gecko/gfx/layers/ipc/SharedBufferManagerParent.cpp, line 162
Flags: needinfo?(sphink)
By the way, just applying patch rev3 result is the following.
https://tbpl.mozilla.org/?tree=Try&rev=19fcfe19d6e5
> sfink, the tryserver log have the following log out. Is it the log that you
> added?  Default SharedBufferManagerParent does not have such code.
> 
> >###!!! ASSERTION: stack for ActorDestroy: 'false', file ../../../gecko/gfx/layers/ipc/SharedBufferManagerParent.cpp, line 162

I assume the above log is from Comment 2 code. We do not need the check if we want to pass the test.
(In reply to Sotaro Ikeda [:sotaro PTO July/25 - Aug/3] from comment #13)
> By the way, just applying patch rev3 result is the following.
> https://tbpl.mozilla.org/?tree=Try&rev=19fcfe19d6e5

Wow, that test took 17 seconds! Still, that's way less than the 450 second timeout.

But yes, the test should normally pass even without any changes. It's enabling GGC that breaks it.

(In reply to Sotaro Ikeda [:sotaro PTO July/25 - Aug/3] from comment #12)
> (In reply to Steve Fink [:sfink], PTO Jul 28-Aug 6 from comment #11)
> sfink, the tryserver log have the following log out. Is it the log that you
> added?  Default SharedBufferManagerParent does not have such code.
> 
> >###!!! ASSERTION: stack for ActorDestroy: 'false', file ../../../gecko/gfx/layers/ipc/SharedBufferManagerParent.cpp, line 162

Yes, I added that. ted told me that NS_ASSERTION failures would get automatically annotated with a stack. Unfortunately, that doesn't work for the emulator. (Or something.) mwu on #b2g told me that I would get a stack for a MOZ_CRASH, but that seems to not work here either. So I currently have no way to get a stack at all.

Anyway, I should remove that NS_ASSERTION(false, "stack for ActorDestroy") call now, but I didn't think it mattered for seeing the hang. It seems like NS_ASSERTION prints its message and then continues anyway. Perhaps it makes the test fail, but the test already fails due to the timeout.
Flags: needinfo?(sphink)
I feel that NS_ASSERTION might be related to the test hung with patch rev3 applied.
(In reply to Sotaro Ikeda [:sotaro PTO July/25 - Aug/3] from comment #16)
> I feel that NS_ASSERTION might be related to the test hung with patch rev3
> applied.

It still hangs without it: https://tbpl.mozilla.org/?tree=Try&rev=0e224c1a5860&showall=1

The final output contains:

19:50:07     INFO -  2528 INFO [Parent 671] WARNING: NS_ENSURE_TRUE(mCallback) failed: file ../../../../gecko/content/base/src/nsFrameMessageManager.cpp, line 669
19:50:07     INFO -  2529 INFO System JS : ERROR chrome://specialpowers/content/SpecialPowersObserver.js:96 - NS_ERROR_NOT_INITIALIZED: Component returned failure code: 0xc1f30001 (NS_ERROR_NOT_INITIALIZED) [nsIMessageSender.sendAsyncMessage]
19:57:39     INFO -  DeviceRunner TEST-UNEXPECTED-FAIL | /tests/content/canvas/test/test_bug866575.html | application timed out after 450.0 seconds with no output

I do not see that with a non-b2g run, eg https://tbpl.mozilla.org/php/getParsedLog.php?id=43977477&tree=Try&full=1

So it looks like the mCallback thing is what I should be investigating.
Ok, weird. The problem seems to have disappeared, and not just because the relevant test was disabled. I re-enabled it here: https://tbpl.mozilla.org/?tree=Try&rev=d2993858d2ae

(It's now hiding in M-8.)
Status: NEW → RESOLVED
Closed: 10 years ago
Flags: needinfo?(jmuizelaar)
Resolution: --- → WORKSFORME
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: