Open Bug 1126078 Opened 9 years ago Updated 2 years ago

fds in use by ImageBridgeClient grows to an unreasonable level when in a webrtc call with limited CPU

Categories

(Core :: Graphics: Layers, defect, P3)

x86
Linux
defect

Tracking


Tracking Status
e10s + ---

People

(Reporter: bwc, Unassigned)

References

Details

(Whiteboard: [gfx-noted])

Attachments

(1 file)

See discussion over in bug 1095218 for context.
Blocks: 1180000
Roc, nical: will roc's patchset for changing how compositor selects frames help here?
Flags: needinfo?(roc)
Flags: needinfo?(nical.bugzilla)
If not, we may need to find a way to monitor and limit frame backlog in e10s
(In reply to Randell Jesup [:jesup] from comment #1)
> Roc, nical: will roc's patchset for changing how compositor selects frames
> help here?

I doubt it.
Flags: needinfo?(roc)
Flags: needinfo?(nical.bugzilla)
Hi Milan --  We need a gfx person (likely with compositor expertise) to look at this for us (WebRTC).  We have a test that's about to be disabled (see bug 1180000) because of this issue.  Do you have someone who can look at this?
Flags: needinfo?(milan)
The discussion in bug 1095218 is quite involved, but the claim there is that we are not leaking fds, just using more than the system can handle.  The "group frames into shared shmems" idea is not something we are seriously entertaining right now, certainly not for something that's a blocker.  But then some of the later comments in that bug suggest that perhaps we're not hitting the fd limits after all (Byron's comments 51 and 52), or am I reading that wrong?

Can somebody clarify?  Are we running out of fds, and why do we think that?  If we are running out of fds, are we leaking them, or just sending too many?  If we're sending too many, can we stop?
Flags: needinfo?(milan)
So, to summarize bug 1095218:
*) We don't appear to be leaking fds, but they're building up somewhere (i.e. processing of them, likely in the compositor, is running behind, causing a buildup).
*) On Linux, the per-process fd limit on many distros is 1K (not 4K), and we're hitting 1K
*) We disabled the tests that hit this bug on e10s.  These are important tests, so leaving them disabled as we move to e10s across the board is a problem
*) I suspect this is a combination of CPU overload and a design that is not tolerant of such overloads (no built-in limits to block adding to an overloaded queue, no fixed-size queue, no lockless circular producer/consumer queue in a single shmem or small set of them, etc.; a sketch of such a bounded queue follows this list)
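For illustration, a minimal sketch of the kind of fixed-size producer/consumer queue mentioned in the last item (C++; names like FrameHandle and BoundedFrameRing are invented, this is not Gecko code). The producer refuses to enqueue when the consumer has fallen behind, so the backlog, and the fds it holds, stays bounded:

// Hypothetical sketch: fixed-capacity single-producer/single-consumer ring
// of frame handles. When the ring is full the producer drops the frame
// instead of queuing it, so an overloaded consumer cannot cause unbounded
// fd buildup.
#include <array>
#include <atomic>
#include <cstddef>
#include <optional>

struct FrameHandle { int shmemFd = -1; };

class BoundedFrameRing {
 public:
  static constexpr size_t kCapacity = 8;

  // Producer thread: returns false (caller drops or recycles the frame)
  // when the consumer has fallen behind.
  bool Push(const FrameHandle& aFrame) {
    size_t w = mWrite.load(std::memory_order_relaxed);
    size_t next = (w + 1) % kCapacity;
    if (next == mRead.load(std::memory_order_acquire)) {
      return false;  // full: refuse rather than grow without bound
    }
    mFrames[w] = aFrame;
    mWrite.store(next, std::memory_order_release);
    return true;
  }

  // Consumer (compositor-side) thread.
  std::optional<FrameHandle> Pop() {
    size_t r = mRead.load(std::memory_order_relaxed);
    if (r == mWrite.load(std::memory_order_acquire)) {
      return std::nullopt;  // empty
    }
    FrameHandle f = mFrames[r];
    mRead.store((r + 1) % kCapacity, std::memory_order_release);
    return f;
  }

 private:
  std::array<FrameHandle, kCapacity> mFrames;
  std::atomic<size_t> mRead{0};
  std::atomic<size_t> mWrite{0};
};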

NI to milan for re-assessment, and to roc for any comments on how the compositor might be involved or ways to mitigate this.  If someone else should be thinking about this, please pull them in.
Flags: needinfo?(roc)
Flags: needinfo?(milan)
It would be better to ask mattwoodrow or nical.

We might want to add some about:memory-like infrastructure to track the fds in use by different parts of the system.
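As a rough illustration of what such a reporter could do on Linux (a sketch with invented names, not the actual about:memory reporter API), the process's open fds can be counted by walking /proc/self/fd:

// Counts open fds for the current process by listing /proc/self/fd.
// A reporter like this could feed an about:memory-style dashboard.
#include <dirent.h>
#include <cstdio>

static int CountOpenFds() {
  DIR* dir = opendir("/proc/self/fd");
  if (!dir) {
    return -1;
  }
  int count = 0;
  while (struct dirent* entry = readdir(dir)) {
    if (entry->d_name[0] != '.') {
      ++count;
    }
  }
  closedir(dir);
  return count;  // includes the fd used by opendir itself
}

int main() {
  printf("open fds: %d\n", CountOpenFds());
  return 0;
}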
Flags: needinfo?(roc)
I looked into this a bit since it showed up in rr issue #1534. I have an rr run of dom/media/tests/mochitest/ipc/test_ipc.html which runs out of fds while we have >500 shmems open in the content process.

It looks like we open 1535 shmem fds and close 901 before we run out of fds. That's bad, it sure looks like we're leaking.

Every single allocation is via the ImageBridge for a YCbCr image, and every single deallocation also through the ImageBridge. So it's all video frames.
Matt, Nical - any ideas here?
Flags: needinfo?(nical.bugzilla)
Flags: needinfo?(matt.woodrow)
I don't have any answers right now, but I suspect this could be related to bug 1171209.

It appears that something is racing (though it isn't clear what) and we're getting confused about the state of our shmems.
Flags: needinfo?(matt.woodrow)
Give me some time and I should be able to track this down with rr
Attached file summary data
Here's what I've learned...

The first Shmem allocated that's not freed before we crash out of fds is allocated at rr event 279751.

At rr event 282390 SendRemoveTexture() is called on its associated TextureChild.

Unfortunately we don't start processing that message until rr event 359211! That is a very long time.

The reply message telling the child to destroy its shmem is sent, but simply isn't processed until we crash at rr event 366434.
(In reply to Robert O'Callahan (:roc) (Mozilla Corporation) from comment #12)

> The first Shmem allocated that's not freed before we crash out of fds is
> allocated at rr event 279751.
> 
> At rr event 282390 SendRemoveTexture() is called on its associated
> TextureChild.
> 
> Unfortunately we don't start processing that message until rr event 359211!
> That is a very long time.
> 
> The reply message telling the child to destroy its shmem is sent, but simply
> isn't processed until we crash at rr event 366434.

Ok.  This matches my expectations from comment 6 i.e. shmems are building up in the "pipe", but not being actively leaked/lost.  The processing overload is stopping us from releasing them in a timely manner.

Digging a little, I see PImageBridge is under PContent (not PBackground).  This strongly implies to me that all messages for PImageBridge are touching MainThread on both sender and receiver.  MainThread on Content (or even in some cases Master) is subject to delays, and in severe overload cases (as on test machines/VMs at times) serious queuing delays (seconds, perhaps).  One B2G problem we found with UDPSocket was that the B2G Master process's MainThread was being eaten alive animating the Network Activity statusbar item, causing UDPSocket packets that were looping back (never leaving the phone) to have RTTs in the 3-second range... all because MainThread was hosed.

roc's thread can confirm this (and perhaps what the MainThread queue depth is when the RemoveTexture hits the event queue), but I think it's pretty clear.

No realtime-sensitive IPC should be using PContent, and something carrying major compositing resources falls into that category.  Either that, or we need some way for critical messages to "jump" the queue with priority.  This would require knowing that there are no ordering requirements against messages of other types (it would retain in-order processing for the specific message type).  Mostly this would be used for resource releases and "I must free this on thread X" cases (and not most of those), I imagine, and for things that are realtime-response-related and carefully vetted for a) safety and b) need.  A rough sketch of such a two-lane queue follows.
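The sketch below (invented types, not IPDL) just makes the ordering claim concrete: two FIFO lanes, where vetted resource-release messages go into an urgent lane that is drained first, while in-order processing is preserved within each lane.

#include <deque>
#include <optional>

struct IpcMessage {
  int type;
  bool urgent;  // e.g. RemoveTexture / resource release
};

class TwoLaneQueue {
 public:
  void Enqueue(IpcMessage aMsg) {
    (aMsg.urgent ? mUrgent : mNormal).push_back(aMsg);
  }

  std::optional<IpcMessage> Dequeue() {
    std::deque<IpcMessage>& lane = !mUrgent.empty() ? mUrgent : mNormal;
    if (lane.empty()) {
      return std::nullopt;
    }
    IpcMessage msg = lane.front();
    lane.pop_front();
    return msg;
  }

 private:
  std::deque<IpcMessage> mUrgent;  // jumps ahead of mNormal
  std::deque<IpcMessage> mNormal;  // everything else, still FIFO
};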

While this issue is clearly a gfx/compositor/ipc issue at the moment, NI-ing kyle and nathan for comments about whether there are other things we can do to minimize the impact of MainThread queuing.  (Of course, not being in such an overload is the best option, and likely we should still look to move the ImageBridge stuff over to PBackground - but perhaps that's impossible for some reason...)
Flags: needinfo?(nfroyd)
Flags: needinfo?(khuey)
Alternatively, is there *any* way PContent IPC as a whole could simply get processed as a priority on MainThread (separate queue, or they *all* jump the queue)?  I realize there's a strong likelihood this is going to be infeasible, but better to ask and be told "you're on drugs to even think of that".
PImageBridge is not "under" PContent.  You open it via PContent, but after that it is a top-level protocol, just like PBackground.  So which thread is actually bogged down here?
Flags: needinfo?(rjesup)
Flags: needinfo?(nfroyd)
Flags: needinfo?(khuey)
(In reply to Kyle Huey [:khuey] (khuey@mozilla.com) from comment #15)
> PImageBridge is not "under" PContent.  You open it via PContent, but after
> that it is a top-level protocol, just like PBackground.  So which thread is
> actually bogged down here?

Excellent; I assumed "opened by" meant it was part of PContent and used mainthread.

A quick breakpoint on RecvRemoveTexture shows it's running on the Compositor thread in non-e10s and e10s.  So presumably that's where the backup is.  SendRemoveTexture is being called from MainThread in non-e10s, and ImageBridgeChild thread in e10s in Content.
Flags: needinfo?(rjesup)
(In reply to Randell Jesup [:jesup] from comment #14)
> Alternatively, is there *any* way PContent IPC as a whole could simply get
> processed as a priority on MainThread (separate queue, or they *all* jump
> the queue)?  I realize there's a strong likelihood this is going to be
> infeasible, but better to ask and be told "you're on drugs to even think of
> that".

Depends on which queue they're stuck in, since IPC messages can be stuck in the actual IPC socket, or in the IPCChannel's internal message queue, or in the message loop of the thread that will do the message dispatch. I don't know about the latter case, but if messages are stuck in the channel's queue then that's easy to jump (we already sort of do it for message priority).
(In reply to Randell Jesup [:jesup] from comment #13)
> Digging a little, I see PImageBridge is under PContent (not PBackground). 
> This strongly implies to me that all messages for PImageBridge are touching
> MainThread on both sender and receiver.

This had better not be true, since the entire point of PImageBridge is to support image updates while the main thread is blocked! And that does work, since async video works last I checked.
(In reply to David Anderson [:dvander] from comment #17)
> Depends on which queue they're stuck in, since IPC messages can be stuck in
> the actual IPC socket, or in the IPCChannel's internal message queue, or in
> the message loop of the thread that will do the message dispatch. I don't
> know about the latter case, but if messages are stuck in the channel's queue
> then that's easy to jump (we already sort of do it for message priority).

roc's rr replay should actually be able to answer that, I think.  Be very useful to know.
:nical is on PTO until Thursday, but this conversation is going well, so let's not wait.
Flags: needinfo?(milan)
Looks like we are not trying to reuse any of the textures (at least in the webrtc case). Using a pool to reuse video frames would at least reduce the fd churn, but I am more worried about the accumulation of textures. I am really surprised that messages from PImageBridge would accumulate for that long. Assuming we are most of the time somewhat close to running the compositor at 60 fps (or even if we are at half the target framerate), we shouldn't be building up that much. Or something is blocking the ImageBridgeThread (which it shouldn't be).

What mechanism throttles the creation of video frames in the case of webrtc (or video in general)? I assume there is something that does that in the media pipeline in case the consumer can't catch up. It's worth looking at whether this throttling mechanism (assuming its existence) happens with or without knowledge of ImageBridge, TextureClient etc.
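To make the pool idea concrete, a minimal sketch (names like FramePool and VideoBuffer are invented; this is not the actual TextureClient recycling code): reuse a previously allocated buffer of the right size instead of allocating a fresh shmem, and a fresh fd, per frame.

#include <memory>
#include <vector>

struct VideoBuffer {
  size_t size = 0;
  // In real code this would wrap a Shmem / TextureClient.
};

class FramePool {
 public:
  std::shared_ptr<VideoBuffer> Acquire(size_t aSize) {
    for (auto it = mFree.begin(); it != mFree.end(); ++it) {
      if ((*it)->size == aSize) {
        auto buf = *it;
        mFree.erase(it);
        return buf;  // reuse: no new fd
      }
    }
    auto buf = std::make_shared<VideoBuffer>();
    buf->size = aSize;  // real code would allocate the shmem here
    return buf;
  }

  void Recycle(std::shared_ptr<VideoBuffer> aBuf) {
    if (mFree.size() < kMaxPooled) {
      mFree.push_back(std::move(aBuf));  // keep for reuse
    }  // else: let it drop and release its fd
  }

 private:
  static constexpr size_t kMaxPooled = 8;
  std::vector<std::shared_ptr<VideoBuffer>> mFree;
};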
Flags: needinfo?(nical.bugzilla)
(In reply to Nicolas Silva [:nical] from comment #21)
> Looks like we are not trying to reuse any of the textures (at least in the
> webrtc case). Using a pool to reuse video frames would at least reduce the
> fd churn, but I am more worried about the accumulation of textures. I am
> really surprised that messages from PImageBridge would accumulate for that
> long. Assuming we are most of the time somewhat close to running the
> compositor at 60 fps (or even if we are at half the target framerate), we
> shouldn't be building up that much. Or something is blocking the
> ImageBridgeThread (which it shoudln't).

Roc's rr trace (if he kept it) might be able to answer that.  A profile of this case (even without the failure) might illustrate some of what's going on.

> 
> What mechanism throttles the creation of video frames in the case of webrtc
> (or video in genral)? I assume there is something that does that in the
> media pipeline in case the comsumer can't catch up. It's worth looking at
> whether this throttling mechanism (assuming its existence) happens with or
> without knowledge of ImageBridge, TextureClient etc.

Unless it's being throttled by HTMLMediaElement (and I doubt it is), there is no throttle.  These are all pipes hooked to each other (webRTC or getUserMedia -> MediaStream -> HTMLMediaElement, via srcObject or in less-desired cases a src fed from URL.createObjectURL()), and then HTMLMediaElement passes frames through to the compositor.
I had to make a new trace. In this trace:

Breakpoint 1, IPC::Channel::ChannelImpl::ProcessIncomingMessages (this=0x7f9af54ec000)
    at /home/roc/mozilla-inbound/ipc/chromium/src/chrome/common/ipc_channel_posix.cc:439
439	  msg.msg_iov = &iov;
$888 = 445742
...
Breakpoint 1, IPC::Channel::ChannelImpl::ProcessIncomingMessages (this=0x7f9af3e9c000)
    at /home/roc/mozilla-inbound/ipc/chromium/src/chrome/common/ipc_channel_posix.cc:439
439	  msg.msg_iov = &iov;
$889 = 483623

where the crash out of fds happens just before 483623. Before 445742 we're hitting the breakpoint at least once per thousand rr events.
We enter ProcessOutgoingMessages and then we're stuck there until after the content process crashes out of fds.

Breakpoint 2, IPC::Channel::ChannelImpl::ProcessOutgoingMessages (this=0x7f9af54ec000)
    at /home/roc/mozilla-inbound/ipc/chromium/src/chrome/common/ipc_channel_posix.cc:676
676	  while (!output_queue_.empty()) {
(gdb) when
$1 = 445750
(gdb) finish
...
0x00007f9b29022993 in IPC::Channel::ChannelImpl::OnFileCanWriteWithoutBlocking (this=0x7f9af54ec000, fd=43)
    at /home/roc/mozilla-inbound/ipc/chromium/src/chrome/common/ipc_channel_posix.cc:933
933	  if (!ProcessOutgoingMessages()) {
Value returned is $2 = true
(gdb) when
$3 = 483623
OK, it turned out that there was an rr bug causing unnecessary context switches during the sendmsgs performed by ProcessOutgoingMessages. With that fixed, I can no longer reproduce this bug under rr :-(.

To make further progress on this bug I'd try adding logging for try builds to track the progress of RecvRemoveTexture messages through the system and to make sure that ProcessOutgoingMessages/ProcessIncomingMessages are flushing all queues frequently.
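A rough sketch of the kind of try-build logging suggested here (invented names, not existing Gecko instrumentation; it also assumes send and receive happen in the same process, as in non-e10s, so a steady_clock stamp can be compared directly): record when a RemoveTexture message is sent and how long it takes to be received, so a log scrape shows where it sat in a queue.

#include <chrono>
#include <cstdint>
#include <cstdio>

using Clock = std::chrono::steady_clock;

struct RemoveTextureTrace {
  uint64_t textureId;
  Clock::time_point sent;  // stamped where SendRemoveTexture is called
};

// Called where RecvRemoveTexture is handled.
static void LogRemoveTextureReceived(const RemoveTextureTrace& aTrace) {
  auto delayMs = std::chrono::duration_cast<std::chrono::milliseconds>(
      Clock::now() - aTrace.sent).count();
  if (delayMs > 100) {
    fprintf(stderr, "RemoveTexture %llu took %lld ms to arrive\n",
            (unsigned long long)aTrace.textureId, (long long)delayMs);
  }
}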
Blocks: 1171255
Unable to reach you on IRC nical. Here will do.

I was hitting this pretty hard with bug 1208371, and have experimented some to that effect.

[1] is a try that is reliably green on e10s.

I made the test test_peerConnection_twoAudioVideoStreams.html visible, to get some feedback from the compositor to ImageContainer for logging buildup of frames that have been sent to the compositor but not yet been composited.
I made all sites creating images for getUserMedia and WebRTC use async ImageContainers, to avoid copying to shmem later in the pipe.
I changed the NS_ERROR for "Wrong size for this shmem" to just a printf_stderr to avoid the stack walking.
I added some logs for outgoing ipc message buildup and ipc processing errors (including when writing is blocked).

In [1] I now see e10s passing, with a lot of wrong-sized shmems. I tried to figure out who is destructing the shmems locally, and it very rarely is ImageBridgeClient (a few times early on in ImageClientSingle::UpdateImage()). Most of the time it seems to be MediaStreamGraph as it's sending images to ImageContainer (old ones get dropped). Not sure where the ones whose shmem goes over the wire are destroyed though. Have they been sent but nothing is keeping a reference?

On e10s I also do not see any buildup of frames in ImageContainer. (I log when the FrameID just composited is more than 20 IDs away from the first image in the ImageContainer, at intervals of 10 to reduce the spam, i.e. when the diff hits 30, 40, 50, etc. Not a single such log was seen.)
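For reference, a sketch of that thresholded logging (an invented helper; the exact FrameID relationship in the actual patch may differ, this just shows the "log at 30, 40, 50..." gating):

#include <cstdint>
#include <cstdio>

static void MaybeLogBacklog(uint32_t aCompositedFrameId,
                            uint32_t aOldestQueuedFrameId) {
  if (aOldestQueuedFrameId <= aCompositedFrameId) {
    return;  // nothing queued ahead of what was composited
  }
  uint32_t diff = aOldestQueuedFrameId - aCompositedFrameId;
  if (diff >= 20 && diff % 10 == 0) {
    fprintf(stderr, "ImageContainer backlog: %u frame IDs behind the queue\n",
            diff);
  }
}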

I do it similarly for outgoing IPC messages, when the queue hits 10, 20, 30, etc. No such log messages seen.

[1] also seems to cause a shutdown race for non-e10s, or at least that's what the gfxCriticalLog() suggests. Any input on that would be handy. I see it happening during MSG shutdown - that suggests that jib's MSG shutdown fixes might help somewhat, but I'm not sure. I don't think I had those in this build.


[1] https://treeherder.mozilla.org/#/jobs?repo=try&revision=3e4e9793018f
Flags: needinfo?(nical.bugzilla)
We ended up talking over email and I forgot to answer here

(In reply to Andreas Pehrson [:pehrsons] (Telenor) from comment #26)
> I made the test test_peerConnection_twoAudioVideoStreams.html visible, to
> get some feedback from the compositor to ImageContainer for logging buildup
> of frames that have been sent to the compositor but not yet been composited.
> I made all sites creating images for getUserMedia and WebRTC use async
> ImageContainers, to avoid copying to shmem later in the pipe.

Nice

> I changed the NS_ERROR for "Wrong size for this shmem" to just a
> printf_stderr to avoid the stack walking.

Good idea. This warning can be seen either as ipdl's lack of a way to handle short-lived shared memory segments (we create and destroy a segment before sending it to the compositor), or as us not using ipdl the way it was intended, but either way this race isn't going to cause bugs as far as I know, since the error happens while deserializing a shmem that won't ever be used on the compositor.

It does show, however, that we are using shared memory for something we don't end up sharing, which is certainly wasteful.

> I added some logs for outgoing ipc message buildup and ipc processing errors
> (including when writing is blocked).
> 
> In [1] I now see e10s passing, with a lot of wrong-sized shmems. I tried to
> figure out who is destructing the shmems locally, and it very rarely is
> ImageBridgeClient (a few times early on in
> ImageClientSingle::UpdateImage()). Most of the time it seems to be
> MediaStreamGraph as it's sending images to ImageContainer (old ones get
> dropped).

Makes sense. So this is where the "wastefulness" I mentioned comes from, and I guess it's okay because the reason we don't share the frames is that we skip them since the CPU can't keep up (debug build), so users should not get into this situation too much.

> Not sure where the ones whose shmem goes over the wire are
> destroyed though. Have they been sent but nothing is keeping a reference?

TextureClients whose data does go over the wire are usually destroyed from the compositor side, so depending on where you put the logging you may not see the shared memory being cleaned up internally by ipdl on the content side.
Textures that are currently shared and in use are kept alive by the ImageClient (it contains an array of the current *textures*, but not the *Images*), so the Image may go away while the texture stays alive until the compositor is done with it.

> [1] also seems to cause a shutdown race for non-e10s, or at least that's
> what the gfxCriticalLog() suggests. Any input on that would be handy. I see
> it happening during MSG shutdown - that suggests that jib's MSG shutdown
> fixes might help somewhat, but I'm not sure. I don't think I had those in
> this build.

That's either a TextureClient or an ImageClient being kept alive past the shutdown of gfx.
Flags: needinfo?(nical.bugzilla)
Flags: needinfo?(milan)
Priority: -- → P3
All part of the "use fewer fds" effort.
Assignee: nobody → nical.bugzilla
Flags: needinfo?(milan)
Depends on: 1128503
Assignee: nical.bugzilla → nobody
Severity: normal → S3