Closed Bug 1431056 Opened 6 years ago Closed 6 years ago

test_peerConnection_replaceTrack.html fails with loopback tone enabled.

Tracking

()

Status:

RESOLVED FIXED

Milestone:

mozilla60

Tracking Flags:

Tracking

Status

firefox60

---

fixed

People

(Reporter: achronop, Assigned: achronop)

References

Details

Attachments

(6 files, 1 obsolete file)

bug1431056-thread-all-gdb.txt 6 years ago Alex Chronopoulos [:achronop] 208.31 KB, text/plain		Details
gdb-all-threads-bt.txt 6 years ago Alex Chronopoulos [:achronop] 223.25 KB, text/plain		Details
gdb-all-threads-bt-child1-process.txt 6 years ago Alex Chronopoulos [:achronop] 206.53 KB, text/plain		Details
gdb-all-threads-bt-child2-process.txt 6 years ago Alex Chronopoulos [:achronop] 92.09 KB, text/plain		Details
Patch to reproduce the test failure 6 years ago Alex Chronopoulos [:achronop] 2.88 KB, message/rfc822		Details
New stack from child process 1 6 years ago Alex Chronopoulos [:achronop] 192.47 KB, text/plain		Details
Bug 1431056 - Stop using fake constraints in a media mochitest. 6 years ago Alex Chronopoulos [:achronop] 59 bytes, text/x-review-board-request	pehrsons : review+	Details

Alex Chronopoulos [:achronop]

Assignee

Description

•

6 years ago

This is a follow up Bug 1406350. It is a failure we get on test_peerConnection_replaceTrack.html test when loopback tone is enabled. The problem is described in:

https://bugzilla.mozilla.org/show_bug.cgi?id=1406350#c71
https://bugzilla.mozilla.org/show_bug.cgi?id=1406350#c73
https://bugzilla.mozilla.org/show_bug.cgi?id=1406350#c74

Alex Chronopoulos [:achronop]

Assignee

Updated

•

6 years ago

Assignee: nobody → achronop

Rank: 15

Priority: -- → P2

Comment 1

•

6 years ago

Added some more logs in cubeb-rs backend to investigate why Thread 11 of crash report is waiting in cubeb_stream_stop() method. Copy some log lines bellow:

(from https://public-artifacts.taskcluster.net/OHAdSjLSTgCkL-DN7KFX_w/0/public/logs/live_backing.log)

[task 2018-01-17T18:35:28.322Z] 18:35:28     INFO - GECKO(3045) | cubeb-rs: enumerate device enter
... after 5 mins ...
[task 2018-01-17T18:40:26.723Z] 18:40:26     INFO - GECKO(3045) | cubeb-rs: enumerate device: after lock loop
[task 2018-01-17T18:40:28.213Z] 18:40:28     INFO - GECKO(3045) | cubeb-rs: enumerate devices: after default device
[task 2018-01-17T18:40:28.214Z] 18:40:28     INFO - GECKO(3045) | cubeb-rs: enumerate devices: after input devices
[task 2018-01-17T18:40:28.235Z] 18:40:28     INFO - GECKO(3045) | cubeb-rs: Drain - latency: 303202251
[task 2018-01-17T18:40:28.236Z] 18:40:28     INFO - GECKO(3045) | cubeb-rs: Stream stop 
... after 6 mins ...
[task 2018-01-17T18:46:49.724Z] 18:46:49    ERROR - TEST-UNEXPECTED-TIMEOUT | dom/media/tests/mochitest/test_peerConnection_restartIce.html | application timed out after 370 seconds with no output
[task 2018-01-17T18:46:49.725Z] 18:46:49    ERROR - Force-terminating active process(es).

First, this crash report is created from the test next to test_peerConnection_replaceTrack.html. From logs we see that latency is ridiculous high. 303202251 usec is 5.05337085 mins. When we drain we wait `2 * latency` in order to allow device to consume the last frames before we stop and declare drain state. In other words stream stop, stream destroy and declaring new state are waiting on that timer. 

So I believe this is a side effect of a blocked or overloaded mainloop rather than the root cause. In order to verify that I created a patch that does not allow latency go over 100 ms. The result is that the error of this test remains but there is no crash on next test which turns green.

Alex Chronopoulos [:achronop]

Assignee

Comment 2

•

6 years ago

That said I believe we should make this patch permanent and check latency before set it on timer. Since it can take very high values there is no point to wait for that. Waiting for 200ms at most (which means a latency of 100ms) looks reasonable.

@kinetik: thoughts about it.

Flags: needinfo?(kinetik)

Matthew Gregan [:kinetik]

Comment 3

•

6 years ago

Do we know what's causing the latency to grow like this?  Is it just server load causing the PA server to grow latency or is there some other bug involved?  Enabling logging in the PA server when reproducing might give some more clues.  I'd be reluctant to apply any workarounds like capping latency for the drain timer until we better understand what's going on.  If we do end up capping the timer, something in the range of 1-2s seems like a better compromise... 100ms is too short if we're using a higher latency playback stream, especially if the PA server is remote.

Flags: needinfo?(kinetik)

Alex Chronopoulos [:achronop]

Assignee

Comment 4

•

6 years ago

Unfortunately we do not know the root cause. Probably it's the same cause that creates this bug. It's not possible to activate Pulse logs because this is reproducible only in try. I have not reproduced it in a loaner. I will continue looking for the cause here.

Alex Chronopoulos [:achronop]

Assignee

Comment 5

•

6 years ago

Attached file bug1431056-thread-all-gdb.txt (obsolete) — Details

I managed to repro once in my system. I attached the debugger and got bt for every thread. It looks like threads 21 and 67 are blocked in mainloop. In stack of Thread 67 it's strange that callback makes use of StorageDBThread.

@kinetik: Can you please have a look and tell me what you think?

Flags: needinfo?(kinetik)

Matthew Gregan [:kinetik]

Comment 6

•

6 years ago

(In reply to Alex Chronopoulos [:achronop] from comment #5)
> Created attachment 8944379 [details]
> bug1431056-thread-all-gdb.txt
> 
> I managed to repro once in my system. I attached the debugger and got bt for
> every thread. It looks like threads 21 and 67 are blocked in mainloop. In
> stack of Thread 67 it's strange that callback makes use of StorageDBThread.

The log is messed up, possibly by the Gecko logging intermingled with gdb's output.  If you look at the stack you're talking about, thread 67 (StorageDBThread) has stack entries 0-9, then there's stack entries 4-27 containing cubeb/audioipc stuff.  Those 4-27 entries must be from a different thread (66) and there must be some output missing.

So thread 66 holds the PA mainloop lock while it executes the server side of the data_callback and is waiting on a response from the client side data_callback (which will be running in the content process where the test is executing).

Thread 21 is the audioipc "main" thread and is blocked waiting for the PA mainloop lock while trying to service an enumerate request.  There will be a content process that is waiting on a response for this.

So these stacks alone doesn't tell us much, especially if you can repro this without the remoting enabled.  I think we either need the stacks from the main/chrome Gecko process *and* the content process where the test is running (so we can see what's happening on the client side of the remoting) or we need to repro with remoting disabled and then we can just examine the stacks within the content process where the test is running.

Flags: needinfo?(kinetik)

Matthew Gregan [:kinetik]

Comment 7

•

6 years ago

It's probably also worth trying to repro inside rr so we can debug this more easily, especially since it'll capture all of the processes involved.  And if you could post the exact steps you used to repro this time, I can try running parallel to you.

Alex Chronopoulos [:achronop]

Assignee

Comment 8

•

6 years ago

Attached file gdb-all-threads-bt.txt — Details

I had hit a bug in rr that prevented me from using it. This is just fixed so I came up with a better bt. I post it instead of the other one but I have not done any analysis yet.

Attachment #8944379 - Attachment is obsolete: true

Alex Chronopoulos [:achronop]

Assignee

Comment 9

•

6 years ago

Attached file gdb-all-threads-bt-child1-process.txt — Details

bt from all thread in content process (child 1)

Alex Chronopoulos [:achronop]

Assignee

Comment 10

•

6 years ago

Attached file gdb-all-threads-bt-child2-process.txt — Details

Content process, child 2. Nothing in here but I post anyway.

Matthew Gregan [:kinetik]

Comment 11

•

6 years ago

Thanks Alex.

On the server side, we see the same as the last stacks:
Thread 45 - data_callback - waiting in recv on client response (holding mainloop lock)
Thread 19 - enumerate (on server main thread) - waiting on mainloop lock

On the client side, there's some indication of a problem:
Thread 30 - client enumerate from MSG event loop waiting on server response
Thread 33 - data_callback thread - waiting in recv on server socket!

So both sides of the data_callback threads are waiting on responses.  That suggests that they're out of step somehow (or these are unrelated threads from different streams, but given there's no evidence of other streams running and assuming you're running just a single test, it's unlikely), since they are expected to operate in a send/recv lockstep with each other.  The client side (thread 33) is waiting for a new command, so it thinks it finished processing the last command.  The server side (thread 45) has sent the data_callback request to the client and is waiting for the client to confirm the shm buffers have been updated.  This assumes that 

It's not possible to tell what went wrong from the stacks, so we'd either need to enable audioipc logging (env var RUST_LOG=debug) or step through an rr recording (preferable) to see what's going wrong.  If you've got a good rr recording, stepping through the last iteration of the client side data_callback thread (33) would be the place to start... otherwise let me know what I need to do to repro locally and I'll investigate.

If you can reproduce this with the audio remoting disabled, I think that needs investigating too, because if the above issue is a remoting bug *and* this test timeout issue still reproduces without remoting, then there must be multiple issues involved.

Alex Chronopoulos [:achronop]

Assignee

Comment 12

•

6 years ago

Thank you for the help so far!

I have a good rr. Can you give me a few points in the code to break. I am also preparing the instructions how to repro.

Alex Chronopoulos [:achronop]

Assignee

Comment 13

•

6 years ago

Attached file Patch to reproduce the test failure — Details

In order to repro follow the steps bellow:
1. Apply the patch. This patch assumes that you have gst-launch-1.0 available in your system instead of the gst-launch-0.10 that is used in try server. If that not the case you need to remove that change.

2. You need to install and load the v4l2loopback module. The way I did it is to clone the repo from https://github.com/umlaeute/v4l2loopback build, install and `sudo modprob v4l2loopback`

3. Run the test under stress. In rr it reproduced at once, in GDB it reproduced after a few tries (~10). I have never reproduced outside of those two. I use the following command to apply stress in the system:
`while :; do stress -c 8 -i 8 -m 8 -d 8 -t 2; sleep 1; done`

4. Run the test:
./mach mochitest --use-test-media-devices --keep-open=false dom/media/tests/mochitest/test_peerConnection_replaceTrack.html

Matthew Gregan [:kinetik]

Comment 14

•

6 years ago

In the content process (where thread 33 lives in the stacks you posted), break on https://dxr.mozilla.org/mozilla-central/source/media/audioipc/client/src/stream.rs#38.  This is where the thread is currently stuck in the posted stacks, so you'll have to rewind a bit and step through a few iterations to see what it's doing.  Given what the chrome thread (45) is doing, we'd expect to see the content thread (33) receive a ClientMessage::StreamDataCallback, enter data_cb, then respond with a ServerMessage::StreamDataCallback (which is what thread 45 is waiting for in the stacks).  Since thread 33 is waiting to receive, it has either pulled that message and the response was lost/not sent, or it has not yet pulled the message for some reason.

Thanks for the repo steps, I'll give it a shot locally.  If you have time to try reproducing with the audio remoting/sandboxing disabled, it'd be good to investigate what's going on in that case too.

Alex Chronopoulos [:achronop]

Assignee

Comment 15

•

6 years ago

I iterate through that part and I see that we enter in data_cb we get 274 frames and we send ServerMessage::StreamDataCallback without an error. This is the same for a number of iterations around that point.

Matthew Gregan [:kinetik]

Comment 16

•

6 years ago

So where and when does that thread stop making progress?  

Any progress on reproducing with remoting disabled?

Matthew Gregan [:kinetik]

Comment 17

•

6 years ago

Reproduced locally, but the stacks don't match what you posted.  It looks like MSG (invoked from the data_callback) is waiting on a MediaEngineRemoteVideoSource::NotifyPull monitor, which is held by the Cameras IPC thread waiting on an IPC to the ImageBridgeChild thread, which is trying to open Gecko IPC shared memory.  It's not clear if this is blocked or just progressing very slowly... but I'm surprised to see the data_callback being blocked on this kind of stuff.

I reproduced the same failure with the same stacks with remoting disabled, so we can eliminate the remoting from the equation.

Randell Jesup [:jesup] (needinfo me)

Comment 18

•

6 years ago

CamerasChild?  This is e10s I presume.  Why would CamerasChild be waiting on ImageBridge???

Can you post the stacks?

Flags: needinfo?(kinetik)

Andreas Pehrson [:pehrsons]

Comment 19

•

6 years ago

The rescaling in MediaEngineRemoteVideoSource::DeliverFrame holds the monitor very generously [1] and it all happens on the Cameras IPC thread.

Includes a bunch of allocations, including one of a shmem-backed image.

Bug 1299515 removes holding the monitor while rescaling (which should help this problem).
Optimizing the rescaling to do less copying will happen in a followup to bug 1423582.

Note that this rescaling (and presumably the path to ImageBridgeChild) happens only if you have multiple gUM requests for the same camera in the same child process at different resolutions. This could explain why kinetik and achronop see different failure modes.

The only tests I can see triggering multiple requests however, are test_getUserMedia_constraints.html and test_getUserMedia_basicScreenshare.html. And we check that all tests stop all tracks before they finish.

Stacks might help understand this better.


[1] https://searchfox.org/mozilla-central/rev/11d0ff9f36465ce19b0c43d1ecc3025791eeb808/dom/media/webrtc/MediaEngineRemoteVideoSource.cpp#452

Randell Jesup [:jesup] (needinfo me)

Comment 20

•

6 years ago

I imagine the stacks aren't needed, given pehrson's comment

Flags: needinfo?(kinetik)

Alex Chronopoulos [:achronop]

Assignee

Comment 21

•

6 years ago

Attached file New stack from child process 1 — Details

This is a new all threds stack from child process 1. I do not know why me and Mathew have different stacks. I pulled yesterday so this is a fresh one.

Thread 23 is stopped in ImageBridgeChild. It is stopped inside sandbox methods on a while loop (SandboxBrokerCommon.cpp:68). I will try to verify if that add any delay.

Another good idea would be to repro without sandbox.

Alex Chronopoulos [:achronop]

Assignee

Comment 22

•

6 years ago

I have tested without cubeb remote and I was able to repro (sorry did not report earlier). I have tested also with both backends, rust and plain c, and I just tested without sandbox. I am able to repro in all cases.

Andreas Pehrson [:pehrsons]

Comment 23

•

6 years ago

Creating a dependency here in case it has anything to say. Re comment 17-19.

Depends on: 1434353

Andreas Pehrson [:pehrsons]

Updated

•

6 years ago

Depends on: 1434538

Andreas Pehrson [:pehrsons]

Comment 24

•

6 years ago

shuffle

No longer depends on: 1434353

Andreas Pehrson [:pehrsons]

Comment 25

•

6 years ago

Alex, could you retest per the original try failures now that the video fix has landed?

Flags: needinfo?(achronop)

Alex Chronopoulos [:achronop]

Assignee

Comment 26

•

6 years ago

I did a pull this morning and I still have it, unfortunately.

Flags: needinfo?(achronop)

Alex Chronopoulos [:achronop]

Assignee

Comment 27

•

6 years ago

Same version on try:
https://treeherder.mozilla.org/#/jobs?repo=try&revision=8bfd6d9f15117878567e42200e2f52bcb0e185a0

Alex Chronopoulos [:achronop]

Assignee

Comment 28

•

6 years ago

Try is green, I will remove the fake:true from the test. Andreas can you mention the bug introduced the issue?

Andreas Pehrson [:pehrsons]

Comment 29

•

6 years ago

That would have been bug 1388219.

Alex Chronopoulos [:achronop]

Assignee

Comment 30

•

6 years ago

The issue introduced by bug 1388219 and fixed by bug 1434538. I will use this bug to remove the "fake: true" true constraint from the test.

Comment hidden (mozreview-request)

Review commit: https://reviewboard.mozilla.org/r/220120/diff/#index_header
See other reviews: https://reviewboard.mozilla.org/r/220120/

Andreas Pehrson [:pehrsons]

Comment 32

•

6 years ago

mozreview-review

Comment on attachment 8950874 [details]
Bug 1431056 - Stop using fake constraints in a media mochitest.

https://reviewboard.mozilla.org/r/220120/#review226000

Attachment #8950874 - Flags: review?(apehrson) → review+

Pulsebot

Comment 33

•

6 years ago

Pushed by achronop@gmail.com:
https://hg.mozilla.org/integration/autoland/rev/43c9ac3ada6c
Stop using fake constraints in a media mochitest. r=pehrsons

Andreea Pavel [:apavel]

Comment 34

•

6 years ago

bugherder

https://hg.mozilla.org/mozilla-central/rev/43c9ac3ada6c

Status: NEW → RESOLVED

Closed: 6 years ago

status-firefox60: --- → fixed

Resolution: --- → FIXED

Target Milestone: --- → mozilla60

Alex Chronopoulos [:achronop]

Assignee

Updated

•

6 years ago