Closed Bug 1431056 Opened 3 years ago Closed 3 years ago

test_peerConnection_replaceTrack.html fails with loopback tone enabled.


(Core :: WebRTC: Audio/Video, enhancement, P2)




Tracking Status
firefox60 --- fixed


(Reporter: achronop, Assigned: achronop)




(6 files, 1 obsolete file)

This is a follow up Bug 1406350. It is a failure we get on test_peerConnection_replaceTrack.html test when loopback tone is enabled. The problem is described in:
Assignee: nobody → achronop
Rank: 15
Priority: -- → P2
See Also: → 1406350
Added some more logs in cubeb-rs backend to investigate why Thread 11 of crash report is waiting in cubeb_stream_stop() method. Copy some log lines bellow:


[task 2018-01-17T18:35:28.322Z] 18:35:28     INFO - GECKO(3045) | cubeb-rs: enumerate device enter
... after 5 mins ...
[task 2018-01-17T18:40:26.723Z] 18:40:26     INFO - GECKO(3045) | cubeb-rs: enumerate device: after lock loop
[task 2018-01-17T18:40:28.213Z] 18:40:28     INFO - GECKO(3045) | cubeb-rs: enumerate devices: after default device
[task 2018-01-17T18:40:28.214Z] 18:40:28     INFO - GECKO(3045) | cubeb-rs: enumerate devices: after input devices
[task 2018-01-17T18:40:28.235Z] 18:40:28     INFO - GECKO(3045) | cubeb-rs: Drain - latency: 303202251
[task 2018-01-17T18:40:28.236Z] 18:40:28     INFO - GECKO(3045) | cubeb-rs: Stream stop 
... after 6 mins ...
[task 2018-01-17T18:46:49.724Z] 18:46:49    ERROR - TEST-UNEXPECTED-TIMEOUT | dom/media/tests/mochitest/test_peerConnection_restartIce.html | application timed out after 370 seconds with no output
[task 2018-01-17T18:46:49.725Z] 18:46:49    ERROR - Force-terminating active process(es).

First, this crash report is created from the test next to test_peerConnection_replaceTrack.html. From logs we see that latency is ridiculous high. 303202251 usec is 5.05337085 mins. When we drain we wait `2 * latency` in order to allow device to consume the last frames before we stop and declare drain state. In other words stream stop, stream destroy and declaring new state are waiting on that timer. 

So I believe this is a side effect of a blocked or overloaded mainloop rather than the root cause. In order to verify that I created a patch that does not allow latency go over 100 ms. The result is that the error of this test remains but there is no crash on next test which turns green.
That said I believe we should make this patch permanent and check latency before set it on timer. Since it can take very high values there is no point to wait for that. Waiting for 200ms at most (which means a latency of 100ms) looks reasonable.

@kinetik: thoughts about it.
Flags: needinfo?(kinetik)
Do we know what's causing the latency to grow like this?  Is it just server load causing the PA server to grow latency or is there some other bug involved?  Enabling logging in the PA server when reproducing might give some more clues.  I'd be reluctant to apply any workarounds like capping latency for the drain timer until we better understand what's going on.  If we do end up capping the timer, something in the range of 1-2s seems like a better compromise... 100ms is too short if we're using a higher latency playback stream, especially if the PA server is remote.
Flags: needinfo?(kinetik)
Unfortunately we do not know the root cause. Probably it's the same cause that creates this bug. It's not possible to activate Pulse logs because this is reproducible only in try. I have not reproduced it in a loaner. I will continue looking for the cause here.
Attached file bug1431056-thread-all-gdb.txt (obsolete) —
I managed to repro once in my system. I attached the debugger and got bt for every thread. It looks like threads 21 and 67 are blocked in mainloop. In stack of Thread 67 it's strange that callback makes use of StorageDBThread.

@kinetik: Can you please have a look and tell me what you think?
Flags: needinfo?(kinetik)
(In reply to Alex Chronopoulos [:achronop] from comment #5)
> Created attachment 8944379 [details]
> bug1431056-thread-all-gdb.txt
> I managed to repro once in my system. I attached the debugger and got bt for
> every thread. It looks like threads 21 and 67 are blocked in mainloop. In
> stack of Thread 67 it's strange that callback makes use of StorageDBThread.

The log is messed up, possibly by the Gecko logging intermingled with gdb's output.  If you look at the stack you're talking about, thread 67 (StorageDBThread) has stack entries 0-9, then there's stack entries 4-27 containing cubeb/audioipc stuff.  Those 4-27 entries must be from a different thread (66) and there must be some output missing.

So thread 66 holds the PA mainloop lock while it executes the server side of the data_callback and is waiting on a response from the client side data_callback (which will be running in the content process where the test is executing).

Thread 21 is the audioipc "main" thread and is blocked waiting for the PA mainloop lock while trying to service an enumerate request.  There will be a content process that is waiting on a response for this.

So these stacks alone doesn't tell us much, especially if you can repro this without the remoting enabled.  I think we either need the stacks from the main/chrome Gecko process *and* the content process where the test is running (so we can see what's happening on the client side of the remoting) or we need to repro with remoting disabled and then we can just examine the stacks within the content process where the test is running.
Flags: needinfo?(kinetik)
It's probably also worth trying to repro inside rr so we can debug this more easily, especially since it'll capture all of the processes involved.  And if you could post the exact steps you used to repro this time, I can try running parallel to you.
Attached file gdb-all-threads-bt.txt
I had hit a bug in rr that prevented me from using it. This is just fixed so I came up with a better bt. I post it instead of the other one but I have not done any analysis yet.
Attachment #8944379 - Attachment is obsolete: true
bt from all thread in content process (child 1)
Content process, child 2. Nothing in here but I post anyway.
Thanks Alex.

On the server side, we see the same as the last stacks:
Thread 45 - data_callback - waiting in recv on client response (holding mainloop lock)
Thread 19 - enumerate (on server main thread) - waiting on mainloop lock

On the client side, there's some indication of a problem:
Thread 30 - client enumerate from MSG event loop waiting on server response
Thread 33 - data_callback thread - waiting in recv on server socket!

So both sides of the data_callback threads are waiting on responses.  That suggests that they're out of step somehow (or these are unrelated threads from different streams, but given there's no evidence of other streams running and assuming you're running just a single test, it's unlikely), since they are expected to operate in a send/recv lockstep with each other.  The client side (thread 33) is waiting for a new command, so it thinks it finished processing the last command.  The server side (thread 45) has sent the data_callback request to the client and is waiting for the client to confirm the shm buffers have been updated.  This assumes that 

It's not possible to tell what went wrong from the stacks, so we'd either need to enable audioipc logging (env var RUST_LOG=debug) or step through an rr recording (preferable) to see what's going wrong.  If you've got a good rr recording, stepping through the last iteration of the client side data_callback thread (33) would be the place to start... otherwise let me know what I need to do to repro locally and I'll investigate.

If you can reproduce this with the audio remoting disabled, I think that needs investigating too, because if the above issue is a remoting bug *and* this test timeout issue still reproduces without remoting, then there must be multiple issues involved.
Thank you for the help so far!

I have a good rr. Can you give me a few points in the code to break. I am also preparing the instructions how to repro.
In order to repro follow the steps bellow:
1. Apply the patch. This patch assumes that you have gst-launch-1.0 available in your system instead of the gst-launch-0.10 that is used in try server. If that not the case you need to remove that change.

2. You need to install and load the v4l2loopback module. The way I did it is to clone the repo from build, install and `sudo modprob v4l2loopback`

3. Run the test under stress. In rr it reproduced at once, in GDB it reproduced after a few tries (~10). I have never reproduced outside of those two. I use the following command to apply stress in the system:
`while :; do stress -c 8 -i 8 -m 8 -d 8 -t 2; sleep 1; done`

4. Run the test:
./mach mochitest --use-test-media-devices --keep-open=false dom/media/tests/mochitest/test_peerConnection_replaceTrack.html
In the content process (where thread 33 lives in the stacks you posted), break on  This is where the thread is currently stuck in the posted stacks, so you'll have to rewind a bit and step through a few iterations to see what it's doing.  Given what the chrome thread (45) is doing, we'd expect to see the content thread (33) receive a ClientMessage::StreamDataCallback, enter data_cb, then respond with a ServerMessage::StreamDataCallback (which is what thread 45 is waiting for in the stacks).  Since thread 33 is waiting to receive, it has either pulled that message and the response was lost/not sent, or it has not yet pulled the message for some reason.

Thanks for the repo steps, I'll give it a shot locally.  If you have time to try reproducing with the audio remoting/sandboxing disabled, it'd be good to investigate what's going on in that case too.
I iterate through that part and I see that we enter in data_cb we get 274 frames and we send ServerMessage::StreamDataCallback without an error. This is the same for a number of iterations around that point.
So where and when does that thread stop making progress?  

Any progress on reproducing with remoting disabled?
Reproduced locally, but the stacks don't match what you posted.  It looks like MSG (invoked from the data_callback) is waiting on a MediaEngineRemoteVideoSource::NotifyPull monitor, which is held by the Cameras IPC thread waiting on an IPC to the ImageBridgeChild thread, which is trying to open Gecko IPC shared memory.  It's not clear if this is blocked or just progressing very slowly... but I'm surprised to see the data_callback being blocked on this kind of stuff.

I reproduced the same failure with the same stacks with remoting disabled, so we can eliminate the remoting from the equation.
CamerasChild?  This is e10s I presume.  Why would CamerasChild be waiting on ImageBridge???

Can you post the stacks?
Flags: needinfo?(kinetik)
The rescaling in MediaEngineRemoteVideoSource::DeliverFrame holds the monitor very generously [1] and it all happens on the Cameras IPC thread.

Includes a bunch of allocations, including one of a shmem-backed image.

Bug 1299515 removes holding the monitor while rescaling (which should help this problem).
Optimizing the rescaling to do less copying will happen in a followup to bug 1423582.

Note that this rescaling (and presumably the path to ImageBridgeChild) happens only if you have multiple gUM requests for the same camera in the same child process at different resolutions. This could explain why kinetik and achronop see different failure modes.

The only tests I can see triggering multiple requests however, are test_getUserMedia_constraints.html and test_getUserMedia_basicScreenshare.html. And we check that all tests stop all tracks before they finish.

Stacks might help understand this better.

I imagine the stacks aren't needed, given pehrson's comment
Flags: needinfo?(kinetik)
This is a new all threds stack from child process 1. I do not know why me and Mathew have different stacks. I pulled yesterday so this is a fresh one.

Thread 23 is stopped in ImageBridgeChild. It is stopped inside sandbox methods on a while loop (SandboxBrokerCommon.cpp:68). I will try to verify if that add any delay.

Another good idea would be to repro without sandbox.
I have tested without cubeb remote and I was able to repro (sorry did not report earlier). I have tested also with both backends, rust and plain c, and I just tested without sandbox. I am able to repro in all cases.
Creating a dependency here in case it has anything to say. Re comment 17-19.
Depends on: 1434353
Depends on: 1434538
No longer depends on: 1434353
Alex, could you retest per the original try failures now that the video fix has landed?
Flags: needinfo?(achronop)
I did a pull this morning and I still have it, unfortunately.
Flags: needinfo?(achronop)
Try is green, I will remove the fake:true from the test. Andreas can you mention the bug introduced the issue?
That would have been bug 1388219.
The issue introduced by bug 1388219 and fixed by bug 1434538. I will use this bug to remove the "fake: true" true constraint from the test.
Comment on attachment 8950874 [details]
Bug 1431056 - Stop using fake constraints in a media mochitest.
Attachment #8950874 - Flags: review?(apehrson) → review+
Pushed by
Stop using fake constraints in a media mochitest. r=pehrsons
Closed: 3 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla60
See Also: → 1059240
You need to log in before you can comment on or make changes to this bug.