Closed
Bug 1252677
Opened 9 years ago
Closed 9 years ago
[e10s] Windows-only texture leaks
Categories
(Core :: Graphics: Layers, defect, P1)
Core
Graphics: Layers
Tracking
()
RESOLVED
FIXED
mozilla48
People
(Reporter: jgriffin, Assigned: bas.schouten)
References
(Blocks 1 open bug)
Details
(Keywords: meta, Whiteboard: [gfx-noted][e10s-orangeblockers][MemShrink:P1])
Attachments
(3 files, 1 obsolete file)
1.50 KB,
patch
|
erahm
:
review+
|
Details | Diff | Splinter Review |
1.43 KB,
patch
|
Details | Diff | Splinter Review | |
11.37 KB,
patch
|
Details | Diff | Splinter Review |
When canvas mochitests are run on Windows 7 debug in e10s mode, they leak:
leakcheck | tab process: 4328 bytes leaked (CondVar, Mutex, PTextureChild, SharedMemory, TextureChild, ...)
In order to get the suite enabled on Treeherder, I will disable the canvas tests only on win7-debug-e10s.
Full log: https://treeherder.mozilla.org/logviewer.html#?job_id=17240284&repo=try
Comment 1•9 years ago
|
||
I think this is just a variation on bug 1232549. We could increase the Windows-specific leak thresholds I added in bug 1219919 so that this is green, too.
Yes, it would be much better to increase that threshold than to disable the tests completely.
Flags: needinfo?(jgriffin)
Whiteboard: [gfx-noted]
Comment 3•9 years ago
|
||
I can put together a patch to do that in this bug. Jonathan, do you have links to additional logs? I need to figure out of the number of leaked objects is entirely consistent or might vary a bit.
Updated•9 years ago
|
Assignee: nobody → continuation
Comment 4•9 years ago
|
||
Jonathan, does this make the orange go away?
Attachment #8725774 -
Flags: feedback?(jgriffin)
Reporter | ||
Comment 5•9 years ago
|
||
(In reply to Andrew McCreight [:mccr8] from comment #3)
> I can put together a patch to do that in this bug. Jonathan, do you have
> links to additional logs? I need to figure out of the number of leaked
> objects is entirely consistent or might vary a bit.
Here's another log with a different number of bytes leaked:
https://treeherder.mozilla.org/logviewer.html#?job_id=17269279&repo=try
I have a couple of more jobs pending on try so we'll see what the range is.
Flags: needinfo?(jgriffin)
Reporter | ||
Comment 6•9 years ago
|
||
Comment on attachment 8725774 [details] [diff] [review]
Increase Windows e10s texture leak threshold even more.
Review of attachment 8725774 [details] [diff] [review]:
-----------------------------------------------------------------
lgtm; I just triggered a try run with this to verify it's effective:
https://treeherder.mozilla.org/#/jobs?repo=try&revision=5324da8c8d45
Attachment #8725774 -
Flags: feedback?(jgriffin) → feedback+
Updated•9 years ago
|
tracking-e10s:
--- → +
Comment 7•9 years ago
|
||
I've got a similar-looking leak in dom/html/test on WinXP.
https://treeherder.mozilla.org/logviewer.html#?job_id=17794695&repo=try
Blocks: 1253849
Comment 8•9 years ago
|
||
It seems like Windows graphics code can just leak an arbitrary number of textures. (Maybe there is some kind of cap, I don't know.)
Comment 9•9 years ago
|
||
I'll move forward with this tomorrow, hopefully. I lost track of it.
Flags: needinfo?(continuation)
Comment 10•9 years ago
|
||
At a certain point, we should just ignore these texture leaks entirely. But for now I'll just increase the limit some more.
This also increases it enough to cover Ryan's latest round of even more leaks.
Attachment #8729106 -
Flags: review?(erahm)
Updated•9 years ago
|
Flags: needinfo?(continuation)
Updated•9 years ago
|
Attachment #8725774 -
Attachment is obsolete: true
Comment 11•9 years ago
|
||
Comment on attachment 8729106 [details] [diff] [review]
Increase Windows e10s texture leak threshold even more.
Review of attachment 8729106 [details] [diff] [review]:
-----------------------------------------------------------------
r=me, although I think we should push harder to get the leak (bug 1232549) properly fixed.
Attachment #8729106 -
Flags: review?(erahm) → review+
Updated•9 years ago
|
Keywords: checkin-needed
Comment 12•9 years ago
|
||
I'll investigate this a bit to try and figure out if we are really leaking one texture per canvas or something and file a more specific bug and try to get it marked tracking.
Comment 13•9 years ago
|
||
Keywords: checkin-needed
Comment 14•9 years ago
|
||
bugherder |
Status: NEW → RESOLVED
Closed: 9 years ago
status-firefox48:
--- → fixed
Resolution: --- → FIXED
Target Milestone: --- → mozilla48
Reporter | ||
Comment 15•9 years ago
|
||
This still occurs, see e.g., https://treeherder.mozilla.org/logviewer.html#?job_id=17962898&repo=try
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
Comment 16•9 years ago
|
||
WARNING | leakcheck | tab process: leaked too many Mutex (expected 15, got 31)
WARNING | leakcheck | tab process: leaked too many PTextureChild (expected 9, got 23)
Yeah, that's a lot. I think I've narrowed down at least two of the tests that leak, if you want to try disabling those individually, rather than the whole directory: test_imagebitmap_cropping.html and test_imagebitmap.html
Comment 17•9 years ago
|
||
Further work here will need somebody to actually fix the leak. I'll file a blocking bug about the two image bitmap leaks I found.
Assignee: continuation → nobody
Reporter | ||
Comment 18•9 years ago
|
||
(In reply to Andrew McCreight [:mccr8] from comment #16)
> WARNING | leakcheck | tab process: leaked too many Mutex (expected 15, got
> 31)
> WARNING | leakcheck | tab process: leaked too many PTextureChild (expected
> 9, got 23)
>
> Yeah, that's a lot. I think I've narrowed down at least two of the tests
> that leak, if you want to try disabling those individually, rather than the
> whole directory: test_imagebitmap_cropping.html and test_imagebitmap.html
I've disabled just those two tests and the leak persists: https://treeherder.mozilla.org/#/jobs?repo=try&revision=817f0453ed46&selectedJob=17970673
Comment 19•9 years ago
|
||
You could also try disabling test_capture.html, which leaks intermittently. I've got a script now for reporting which tests leak textures with my logging added, so I'll try rerunning that a bunch of times over the weekend and see if there are any other tests that leak.
Comment 20•9 years ago
|
||
I did 8 or so retriggers of the canvas tests, and the three tests I mentioned before (test_imagebitmap_cropping.html, test_imagebitmap.html and test_capture.html) seem to be the only ones that leak. I'll try triggering M2 XP to reproduce the leak Ryan mentioned in comment 7. I don't know how common that is.
This is the patch I used to log texture creation.
The script I used to print out the tests that these leak during is here:
https://github.com/amccreight/mochitest-logs/blob/master/leak_matcher.py
Comment 21•9 years ago
|
||
This is the various other things you have to bunch in there to get it to run e10s Windows debug tests, based on patches from Jonathan and Ryan.
Comment 22•9 years ago
|
||
(In reply to Ryan VanderMeulen [:RyanVM] from comment #7)
> I've got a similar-looking leak in dom/html/test on WinXP.
> https://treeherder.mozilla.org/logviewer.html#?job_id=17794695&repo=try
Are you seeing this leak all of the time or just occasionally? I tried to reproduce and wasn't able to, but maybe I just don't have it set right ( https://treeherder.mozilla.org/#/jobs?repo=try&revision=d89a4b3e3ef0 ).
Flags: needinfo?(ryanvm)
Comment 23•9 years ago
|
||
I don't recall offhand, so I've retriggered more runs on Ash. Otherwise, maybe comment 14 took care of it?
Flags: needinfo?(ryanvm)
Comment 24•9 years ago
|
||
(In reply to Ryan VanderMeulen [:RyanVM] from comment #23)
> I don't recall offhand, so I've retriggered more runs on Ash. Otherwise,
> maybe comment 14 took care of it?
Oh, that's a good thought. But looking at the logs I see only 3 textures leaking, so I don't know. But from analyzing the logs, I see that these are leaking during dom/html/test/test_video_wakelock.html so maybe it just sometimes leaks more textures. I'll file a separate bug for that.
Updated•9 years ago
|
status-firefox48:
fixed → ---
Whiteboard: [gfx-noted] → [gfx-noted][e10s-orangeblockers]
Target Milestone: mozilla48 → ---
Comment 25•9 years ago
|
||
I'm just going to turn this into a meta bug for the various Windows-only e10s texture leaks. I've filed bugs for the various tests that leak.
Keywords: meta
Summary: [e10s] canvas mochitests leak with leakcheck | tab process: 4328 bytes leaked (CondVar, Mutex, PTextureChild, SharedMemory, TextureChild, ...) → [e10s] Windows-only texture leaks
Updated•9 years ago
|
Component: Canvas: 2D → Graphics: Layers
![]() |
||
Comment 26•9 years ago
|
||
this affects debug test runs (currently on Ash, but eventually on mc). Milan, can you fix this in someplace in the near future?
Flags: needinfo?(milan)
Updated•9 years ago
|
Assignee: nobody → bas
Flags: needinfo?(milan)
![]() |
||
Updated•9 years ago
|
Priority: -- → P1
Comment 27•9 years ago
|
||
Adding to e10s-rc list. Andrew says this bug should at least be looked at enough until we can tell if these texture leaks should block e10s release.
Blocks: e10s-rc
status-firefox46:
--- → affected
status-firefox47:
--- → affected
status-firefox48:
--- → affected
Comment 28•9 years ago
|
||
We're going to P1 the meta bug. Blocking bugs are probably of equal priority.
Whiteboard: [gfx-noted][e10s-orangeblockers] → [gfx-noted][e10s-orangeblockers][MemShrink:P1]
Assignee | ||
Comment 29•9 years ago
|
||
It should be noted all the tests that appear to be affected involve drawing video to a Canvas. I don't know this code very well but I will have a look, ni? Nical and Matt Woodrow as they might have an idea how how drawing video into a canvas interacts with hardware video decoding and the imagebridge, so I'll have some idea of where to start looking.
Flags: needinfo?(nical.bugzilla)
Flags: needinfo?(matt.woodrow)
Comment 30•9 years ago
|
||
When the ImageContainer receives an Image, it immediately tries to send the Image to the ImageBridge, which creates a TextureClient if the latter wasn't already created and pass it to an ImageClient on the ImageBridge. When the canvas access the ImageContainer, I assume that it will copy/read-back the content of ImageContainer on the main thread. The code that reads from the ImageContainer is in nsLayoutUtils::SurfaceFromElement: among other things it calls Image::GetAsSourceSurface.
So in principle the interactions between ImageContainer+ImageBridge and ImageContainer+Canvas *should* be orthogonal. It could be that the canvas holds the ImageContainer and/or Image longer than usual and we get into a situation that we hadn't thought of as a result, but I don't see what kind of situation that would be. If that was the case I am still confident that TextureClient's destruction is robust enough to be triggered from any thread at any time without problems *as long as it happens before gfx's shutdown*, so I would guess we leak something that holds on to the texture rather than just the texture, maybe. That's all very hand-wavy, just dumping anything that comes to mind as I type.
Flags: needinfo?(nical.bugzilla)
Comment 32•9 years ago
|
||
(In reply to Nicolas Silva [:nical] from comment #30)
> It could be that the canvas
> holds the ImageContainer and/or Image longer than usual and we get into a
> situation that we hadn't thought of as a result, but I don't see what kind
> of situation that would be.
This was my guess too, but I looked over the code and we don't ever hold onto Layers things, only SourceSurfaces.
Do we assert when we try release things after gfx has shut down?
Would be good to confirm that this is the problem, and ideally get a stack trace to who was holding the last ref.
Flags: needinfo?(matt.woodrow)
Updated•9 years ago
|
I might be looking at a similar issue in bug 1244883.
What I've observed is that the main process is trying to shut down the content process, and blocks the I/O thread in EnsureProcessTerminated() until the content process has died. All (AFAICT?) IPDL messages go via the I/O thread so no IPC can happen, including __delete__ calls.
In the log linked in the original description of this bug we even see a small flood of messages attempting to send once the content process dies, but failing (note the PTexture::Msg___delete__ call failing):
23:51:19 INFO - => Process ID: 1816, Thread ID: 2992
23:51:19 INFO - ###!!! [Parent][OnMaybeDequeueOne] Error: Channel closing: too late to send/recv, messages will be lost
23:51:19 INFO - ###!!! [Parent][OnMaybeDequeueOne] Error: Channel closing: too late to send/recv, messages will be lost
23:51:19 INFO - ###!!! [Parent][OnMaybeDequeueOne] Error: Channel closing: too late to send/recv, messages will be lost
23:51:19 INFO - [Parent 3624] WARNING: pipe error: 109: file c:/builds/moz2_slave/try-w32-d-00000000000000000000/build/src/ipc/chromium/src/chrome/common/ipc_channel_win.cc, line 333
23:51:19 INFO - ###!!! [Parent][MessageChannel] Error: (msgtype=0xF60001,name=PTexture::Msg___delete__) Channel error: cannot send/recv
23:51:19 INFO - ###!!! [Parent][OnMaybeDequeueOne] Error: Channel error: cannot send/recv
23:51:19 INFO - ###!!! [Parent][OnMaybeDequeueOne] Error: Channel error: cannot send/recv
23:51:19 INFO - ###!!! [Parent][OnMaybeDequeueOne] Error: Channel error: cannot send/recv
23:51:19 INFO - ###!!! [Parent][OnMaybeDequeueOne] Error: Channel error: cannot send/recv
23:51:19 INFO - ###!!! [Parent][OnMaybeDequeueOne] Error: Channel error: cannot send/recv
23:51:19 INFO - ###!!! [Parent][OnMaybeDequeueOne] Error: Channel error: cannot send/recv
23:51:19 INFO - ###!!! [Parent][OnMaybeDequeueOne] Error: Channel error: cannot send/recv
23:51:19 INFO - ###!!! [Parent][OnMaybeDequeueOne] Error: Channel error: cannot send/recv
23:51:19 INFO - ###!!! [Parent][OnMaybeDequeueOne] Error: Channel error: cannot send/recv
The offending line in EnsureProcessTerminated: https://dxr.mozilla.org/mozilla-central/rev/68c0b7d6f16ce5bb023e08050102b5f2fe4aacd8/ipc/chromium/src/chrome/common/process_watcher_win.cc#77
I've a try run here with full symbols that crashes when it hits this and makes a minidump of the parent process: https://treeherder.mozilla.org/#/jobs?repo=try&revision=b1f4af293d1d53e8bb8d1d36a702a1c6da6ba3a8
Comment 35•9 years ago
|
||
Edwin, I don't understand what you're saying. In the case of a debug/leak-checking build, all IPDL protocols (including PBackground) should be shut down before we call EnsureProcessTerminated. If there is still any protocol system open at that point, that is a bug in the design of the protocols and should be fixed.
(In reply to Benjamin Smedberg [:bsmedberg] from comment #35)
> Edwin, I don't understand what you're saying. In the case of a
> debug/leak-checking build, all IPDL protocols (including PBackground) should
> be shut down before we call EnsureProcessTerminated. If there is still any
> protocol system open at that point, that is a bug in the design of the
> protocols and should be fixed.
Well there are plenty of things shutting down in ShutdownXPCOM that still need IPDL to work (in particular, media shuts down there, and holds on to graphics resources that are shared between processes and need IPDL to die).
AFAICT, ShutdownXPCOM is kicked off on the child at around the same time that EnsureProcessTerminated is called on the parent -- both coming from Content*::ActorDestroy. (I would rather like to be wrong about this -- see https://bugzilla.mozilla.org/show_bug.cgi?id=1244883#c21)
Comment 37•9 years ago
|
||
There should not be any things in XPCOM shutdown that require IPC access. All IPC/IPDL protocols must be completely shut down before you start XPCOM shutdown. The protocol ActorDestroy methods should be responsible for cleaning up any open shared resources.
If we are starting XPCOM shutdown from within the ActorDestroy method, perhaps we need to move that later?
Is there a specific place where IPC-dependent shutdown should be done? ContentChild::ActorDestroy? I see we fire a |content-child-shutdown| notification there, but simply observing for that doesn't afford the ordering guarantees that we rely on (things use graphics -- lots of things -- that need to be dead before we shut down).
Assignee | ||
Comment 40•9 years ago
|
||
(In reply to Edwin Flores [:eflores] [:edwin] from comment #38)
> Is there a specific place where IPC-dependent shutdown should be done?
> ContentChild::ActorDestroy? I see we fire a |content-child-shutdown|
> notification there, but simply observing for that doesn't afford the
> ordering guarantees that we rely on (things use graphics -- lots of things
> -- that need to be dead before we shut down).
Like Javascript for example, since DOM/Javascript has a unidirectional dependency on graphics those really have to shutdown before graphics does.
![]() |
||
Comment 41•9 years ago
|
||
(In reply to Edwin Flores [:eflores] [:edwin] from comment #38)
> Is there a specific place where IPC-dependent shutdown should be done?
> ContentChild::ActorDestroy? I see we fire a |content-child-shutdown|
> notification there, but simply observing for that doesn't afford the
> ordering guarantees that we rely on (things use graphics -- lots of things
> -- that need to be dead before we shut down).
ContentChild::ActorDestroy appears to be too late, that calls XRE_ShutdownChildProcess() which breaks the main thread out of it's run loop.
What protocol(s) are you having shutdown issues with?
http://mxr.mozilla.org/mozilla-central/source/toolkit/xre/nsEmbedFunctions.cpp#812
Flags: needinfo?(edwin)
Assignee | ||
Comment 42•9 years ago
|
||
Let's not confuse the bugs too much. This bug is actually just because we're not shutting down the ImageBridge and CompositorBridge cleanly on windows. I have patches to address this issue (although in a mildly hacky way). Let's have the shutdown discussions elsewhere.
Depends on: 1262898
Flags: needinfo?(edwin)
![]() |
||
Comment 45•9 years ago
|
||
Should be fixed by bug 1262898. Bas, can we close this out or is there something we should confirm first?
Flags: needinfo?(bas)
Assignee | ||
Comment 46•9 years ago
|
||
Looking good!
Status: REOPENED → RESOLVED
Closed: 9 years ago → 9 years ago
Flags: needinfo?(bas)
Resolution: --- → FIXED
Updated•9 years ago
|
Target Milestone: --- → mozilla48
Comment 47•9 years ago
|
||
Comment 48•9 years ago
|
||
bugherder |
Comment 49•9 years ago
|
||
You need to log in
before you can comment on or make changes to this bug.
Description
•