Closed Bug 1252677 Opened 8 years ago Closed 8 years ago

[e10s] Windows-only texture leaks

Categories

(Core :: Graphics: Layers, defect, P1)

defect

Tracking

()

RESOLVED FIXED
mozilla48
Tracking Status
e10s m9+ ---
firefox46 --- wontfix
firefox47 --- affected
firefox48 --- fixed

People

(Reporter: jgriffin, Assigned: bas.schouten)

References

(Blocks 1 open bug)

Details

(Keywords: meta, Whiteboard: [gfx-noted][e10s-orangeblockers][MemShrink:P1])

Attachments

(3 files, 1 obsolete file)

When canvas mochitests are run on Windows 7 debug in e10s mode, they leak:

leakcheck | tab process: 4328 bytes leaked (CondVar, Mutex, PTextureChild, SharedMemory, TextureChild, ...)

In order to get the suite enabled on Treeherder, I will disable the canvas tests only on win7-debug-e10s.

Full log: https://treeherder.mozilla.org/logviewer.html#?job_id=17240284&repo=try
I think this is just a variation on bug 1232549. We could increase the Windows-specific leak thresholds I added in bug 1219919 so that this is green, too.
Depends on: 1232549
Yes, it would be much better to increase that threshold than to disable the tests completely.
Flags: needinfo?(jgriffin)
Whiteboard: [gfx-noted]
I can put together a patch to do that in this bug. Jonathan, do you have links to additional logs? I need to figure out of the number of leaked objects is entirely consistent or might vary a bit.
Assignee: nobody → continuation
Jonathan, does this make the orange go away?
Attachment #8725774 - Flags: feedback?(jgriffin)
(In reply to Andrew McCreight [:mccr8] from comment #3)
> I can put together a patch to do that in this bug. Jonathan, do you have
> links to additional logs? I need to figure out of the number of leaked
> objects is entirely consistent or might vary a bit.

Here's another log with a different number of bytes leaked:

https://treeherder.mozilla.org/logviewer.html#?job_id=17269279&repo=try

I have a couple of more jobs pending on try so we'll see what the range is.
Flags: needinfo?(jgriffin)
Comment on attachment 8725774 [details] [diff] [review]
Increase Windows e10s texture leak threshold even more.

Review of attachment 8725774 [details] [diff] [review]:
-----------------------------------------------------------------

lgtm; I just triggered a try run with this to verify it's effective:

https://treeherder.mozilla.org/#/jobs?repo=try&revision=5324da8c8d45
Attachment #8725774 - Flags: feedback?(jgriffin) → feedback+
I've got a similar-looking leak in dom/html/test on WinXP.
https://treeherder.mozilla.org/logviewer.html#?job_id=17794695&repo=try
Blocks: 1253849
It seems like Windows graphics code can just leak an arbitrary number of textures. (Maybe there is some kind of cap, I don't know.)
I'll move forward with this tomorrow, hopefully. I lost track of it.
Flags: needinfo?(continuation)
At a certain point, we should just ignore these texture leaks entirely. But for now I'll just increase the limit some more.

This also increases it enough to cover Ryan's latest round of even more leaks.
Attachment #8729106 - Flags: review?(erahm)
Flags: needinfo?(continuation)
Attachment #8725774 - Attachment is obsolete: true
Comment on attachment 8729106 [details] [diff] [review]
Increase Windows e10s texture leak threshold even more.

Review of attachment 8729106 [details] [diff] [review]:
-----------------------------------------------------------------

r=me, although I think we should push harder to get the leak (bug 1232549) properly fixed.
Attachment #8729106 - Flags: review?(erahm) → review+
I'll investigate this a bit to try and figure out if we are really leaking one texture per canvas or something and file a more specific bug and try to get it marked tracking.
https://hg.mozilla.org/mozilla-central/rev/021fe8e5b781
Status: NEW → RESOLVED
Closed: 8 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla48
This still occurs, see e.g., https://treeherder.mozilla.org/logviewer.html#?job_id=17962898&repo=try
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
WARNING | leakcheck | tab process: leaked too many Mutex (expected 15, got 31) 
WARNING | leakcheck | tab process: leaked too many PTextureChild (expected 9, got 23) 

Yeah, that's a lot. I think I've narrowed down at least two of the tests that leak, if you want to try disabling those individually, rather than the whole directory: test_imagebitmap_cropping.html and test_imagebitmap.html
Further work here will need somebody to actually fix the leak. I'll file a blocking bug about the two image bitmap leaks I found.
Assignee: continuation → nobody
Depends on: 1255891
(In reply to Andrew McCreight [:mccr8] from comment #16)
> WARNING | leakcheck | tab process: leaked too many Mutex (expected 15, got
> 31) 
> WARNING | leakcheck | tab process: leaked too many PTextureChild (expected
> 9, got 23) 
> 
> Yeah, that's a lot. I think I've narrowed down at least two of the tests
> that leak, if you want to try disabling those individually, rather than the
> whole directory: test_imagebitmap_cropping.html and test_imagebitmap.html

I've disabled just those two tests and the leak persists: https://treeherder.mozilla.org/#/jobs?repo=try&revision=817f0453ed46&selectedJob=17970673
You could also try disabling test_capture.html, which leaks intermittently. I've got a script now for reporting which tests leak textures with my logging added, so I'll try rerunning that a bunch of times over the weekend and see if there are any other tests that leak.
I did 8 or so retriggers of the canvas tests, and the three tests I mentioned before (test_imagebitmap_cropping.html, test_imagebitmap.html and test_capture.html) seem to be the only ones that leak. I'll try triggering M2 XP to reproduce the leak Ryan mentioned in comment 7. I don't know how common that is.

This is the patch I used to log texture creation.

The script I used to print out the tests that these leak during is here:
  https://github.com/amccreight/mochitest-logs/blob/master/leak_matcher.py
This is the various other things you have to bunch in there to get it to run e10s Windows debug tests, based on patches from Jonathan and Ryan.
(In reply to Ryan VanderMeulen [:RyanVM] from comment #7)
> I've got a similar-looking leak in dom/html/test on WinXP.
> https://treeherder.mozilla.org/logviewer.html#?job_id=17794695&repo=try

Are you seeing this leak all of the time or just occasionally? I tried to reproduce and wasn't able to, but maybe I just don't have it set right ( https://treeherder.mozilla.org/#/jobs?repo=try&revision=d89a4b3e3ef0 ).
Flags: needinfo?(ryanvm)
I don't recall offhand, so I've retriggered more runs on Ash. Otherwise, maybe comment 14 took care of it?
Flags: needinfo?(ryanvm)
(In reply to Ryan VanderMeulen [:RyanVM] from comment #23)
> I don't recall offhand, so I've retriggered more runs on Ash. Otherwise,
> maybe comment 14 took care of it?

Oh, that's a good thought. But looking at the logs I see only 3 textures leaking, so I don't know. But from analyzing the logs, I see that these are leaking during dom/html/test/test_video_wakelock.html so maybe it just sometimes leaks more textures. I'll file a separate bug for that.
Depends on: 1256946
Whiteboard: [gfx-noted] → [gfx-noted][e10s-orangeblockers]
Target Milestone: mozilla48 → ---
Depends on: 1258113
I'm just going to turn this into a meta bug for the various Windows-only e10s texture leaks. I've filed bugs for the various tests that leak.
Keywords: meta
Summary: [e10s] canvas mochitests leak with leakcheck | tab process: 4328 bytes leaked (CondVar, Mutex, PTextureChild, SharedMemory, TextureChild, ...) → [e10s] Windows-only texture leaks
Component: Canvas: 2D → Graphics: Layers
this affects debug test runs (currently on Ash, but eventually on mc). Milan, can you fix this in someplace in the near future?
Flags: needinfo?(milan)
Assignee: nobody → bas
Flags: needinfo?(milan)
Priority: -- → P1
Adding to e10s-rc list. Andrew says this bug should at least be looked at enough until we can tell if these texture leaks should block e10s release.
We're going to P1 the meta bug. Blocking bugs are probably of equal priority.
Whiteboard: [gfx-noted][e10s-orangeblockers] → [gfx-noted][e10s-orangeblockers][MemShrink:P1]
It should be noted all the tests that appear to be affected involve drawing video to a Canvas. I don't know this code very well but I will have a look, ni? Nical and Matt Woodrow as they might have an idea how how drawing video into a canvas interacts with hardware video decoding and the imagebridge, so I'll have some idea of where to start looking.
Flags: needinfo?(nical.bugzilla)
Flags: needinfo?(matt.woodrow)
When the ImageContainer receives an Image, it immediately tries to send the Image to the ImageBridge, which creates a TextureClient if the latter wasn't already created and pass it to an ImageClient on the ImageBridge. When the canvas access the ImageContainer, I assume that it will copy/read-back the content of ImageContainer on the main thread. The code that reads from the ImageContainer is in nsLayoutUtils::SurfaceFromElement: among other things it calls Image::GetAsSourceSurface.
So in principle the interactions between ImageContainer+ImageBridge and ImageContainer+Canvas *should* be orthogonal. It could be that the canvas holds the ImageContainer and/or Image longer than usual and we get into a situation that we hadn't thought of as a result, but I don't see what kind of situation that would be. If that was the case I am still confident that TextureClient's destruction is robust enough to be triggered from any thread at any time without problems *as long as it happens before gfx's shutdown*, so I would guess we leak something that holds on to the texture rather than just the texture, maybe. That's all very hand-wavy, just dumping anything that comes to mind as I type.
Flags: needinfo?(nical.bugzilla)
(In reply to Nicolas Silva [:nical] from comment #30)
> It could be that the canvas
> holds the ImageContainer and/or Image longer than usual and we get into a
> situation that we hadn't thought of as a result, but I don't see what kind
> of situation that would be. 

This was my guess too, but I looked over the code and we don't ever hold onto Layers things, only SourceSurfaces.

Do we assert when we try release things after gfx has shut down?

Would be good to confirm that this is the problem, and ideally get a stack trace to who was holding the last ref.
Flags: needinfo?(matt.woodrow)
I might be looking at a similar issue in bug 1244883.

What I've observed is that the main process is trying to shut down the content process, and blocks the I/O thread in EnsureProcessTerminated() until the content process has died. All (AFAICT?) IPDL messages go via the I/O thread so no IPC can happen, including __delete__ calls.

In the log linked in the original description of this bug we even see a small flood of messages attempting to send once the content process dies, but failing (note the PTexture::Msg___delete__ call failing):

23:51:19     INFO -   => Process ID: 1816, Thread ID: 2992
23:51:19     INFO -  ###!!! [Parent][OnMaybeDequeueOne] Error: Channel closing: too late to send/recv, messages will be lost
23:51:19     INFO -  ###!!! [Parent][OnMaybeDequeueOne] Error: Channel closing: too late to send/recv, messages will be lost
23:51:19     INFO -  ###!!! [Parent][OnMaybeDequeueOne] Error: Channel closing: too late to send/recv, messages will be lost
23:51:19     INFO -  [Parent 3624] WARNING: pipe error: 109: file c:/builds/moz2_slave/try-w32-d-00000000000000000000/build/src/ipc/chromium/src/chrome/common/ipc_channel_win.cc, line 333
23:51:19     INFO -  ###!!! [Parent][MessageChannel] Error: (msgtype=0xF60001,name=PTexture::Msg___delete__) Channel error: cannot send/recv
23:51:19     INFO -  ###!!! [Parent][OnMaybeDequeueOne] Error: Channel error: cannot send/recv
23:51:19     INFO -  ###!!! [Parent][OnMaybeDequeueOne] Error: Channel error: cannot send/recv
23:51:19     INFO -  ###!!! [Parent][OnMaybeDequeueOne] Error: Channel error: cannot send/recv
23:51:19     INFO -  ###!!! [Parent][OnMaybeDequeueOne] Error: Channel error: cannot send/recv
23:51:19     INFO -  ###!!! [Parent][OnMaybeDequeueOne] Error: Channel error: cannot send/recv
23:51:19     INFO -  ###!!! [Parent][OnMaybeDequeueOne] Error: Channel error: cannot send/recv
23:51:19     INFO -  ###!!! [Parent][OnMaybeDequeueOne] Error: Channel error: cannot send/recv
23:51:19     INFO -  ###!!! [Parent][OnMaybeDequeueOne] Error: Channel error: cannot send/recv
23:51:19     INFO -  ###!!! [Parent][OnMaybeDequeueOne] Error: Channel error: cannot send/recv
Edwin, I don't understand what you're saying. In the case of a debug/leak-checking build, all IPDL protocols (including PBackground) should be shut down before we call EnsureProcessTerminated. If there is still any protocol system open at that point, that is a bug in the design of the protocols and should be fixed.
(In reply to Benjamin Smedberg  [:bsmedberg] from comment #35)
> Edwin, I don't understand what you're saying. In the case of a
> debug/leak-checking build, all IPDL protocols (including PBackground) should
> be shut down before we call EnsureProcessTerminated. If there is still any
> protocol system open at that point, that is a bug in the design of the
> protocols and should be fixed.

Well there are plenty of things shutting down in ShutdownXPCOM that still need IPDL to work (in particular, media shuts down there, and holds on to graphics resources that are shared between processes and need IPDL to die).

AFAICT, ShutdownXPCOM is kicked off on the child at around the same time that EnsureProcessTerminated is called on the parent -- both coming from Content*::ActorDestroy. (I would rather like to be wrong about this -- see https://bugzilla.mozilla.org/show_bug.cgi?id=1244883#c21)
There should not be any things in XPCOM shutdown that require IPC access. All IPC/IPDL protocols must be completely shut down before you start XPCOM shutdown. The protocol ActorDestroy methods should be responsible for cleaning up any open shared resources.

If we are starting XPCOM shutdown from within the ActorDestroy method, perhaps we need to move that later?
Is there a specific place where IPC-dependent shutdown should be done? ContentChild::ActorDestroy? I see we fire a |content-child-shutdown| notification there, but simply observing for that doesn't afford the ordering guarantees that we rely on (things use graphics -- lots of things -- that need to be dead before we shut down).
(In reply to Edwin Flores [:eflores] [:edwin] from comment #38)
> Is there a specific place where IPC-dependent shutdown should be done?
> ContentChild::ActorDestroy? I see we fire a |content-child-shutdown|
> notification there, but simply observing for that doesn't afford the
> ordering guarantees that we rely on (things use graphics -- lots of things
> -- that need to be dead before we shut down).

Like Javascript for example, since DOM/Javascript has a unidirectional dependency on graphics those really have to shutdown before graphics does.
(In reply to Edwin Flores [:eflores] [:edwin] from comment #38)
> Is there a specific place where IPC-dependent shutdown should be done?
> ContentChild::ActorDestroy? I see we fire a |content-child-shutdown|
> notification there, but simply observing for that doesn't afford the
> ordering guarantees that we rely on (things use graphics -- lots of things
> -- that need to be dead before we shut down).

ContentChild::ActorDestroy appears to be too late, that calls XRE_ShutdownChildProcess() which breaks the main thread out of it's run loop.

What protocol(s) are you having shutdown issues with?

http://mxr.mozilla.org/mozilla-central/source/toolkit/xre/nsEmbedFunctions.cpp#812
Flags: needinfo?(edwin)
Let's not confuse the bugs too much. This bug is actually just because we're not shutting down the ImageBridge and CompositorBridge cleanly on windows. I have patches to address this issue (although in a mildly hacky way). Let's have the shutdown discussions elsewhere.
Depends on: 1262898
Blocks: e10s-gfx
Should be fixed by bug 1262898. Bas, can we close this out or is there something we should confirm first?
Flags: needinfo?(bas)
Looking good!
Status: REOPENED → RESOLVED
Closed: 8 years ago8 years ago
Flags: needinfo?(bas)
Resolution: --- → FIXED
Target Milestone: --- → mozilla48
You need to log in before you can comment on or make changes to this bug.