Closed Bug 1586057 Opened 5 years ago Closed 5 years ago

MediaManager walks descendant docshells of a docshell that it got from a non-current inner, tearing down mediacapture for unrelated documents

Categories

(Core :: WebRTC: Audio/Video, defect, P3)

defect

Tracking

()

RESOLVED FIXED
mozilla72
Tracking Status
firefox-esr68 --- wontfix
firefox70 --- wontfix
firefox71 --- wontfix
firefox72 --- fixed

People

(Reporter: pehrsons, Assigned: jib)

References

(Regression)

Details

(Keywords: regression)

Attachments

(1 file)

STR:

  1. Build rev d202f2122d71 from try
  2. ./mach mochitest dom/media/test/test_imagecapture.html --verify

Expected result:
Test passes the verify suite

Actual result:
When the test reaches the gcTest and calls SpecialPowers.gc() the inner window where the mochitest is running gets released and the MediaStreamTrack ends because of MediaManager::OnNavigation(). That happens sometimes. Sometimes the test passes, but never the whole suite. Sometimes it is just left hanging. Issues seem to only happen on the first test run after launching the browser.

Surely when this happens the browser must have a reference to the inner window? I can see its content and all with my own eyes.

Or did some other window go away that the getUserMedia() call this test did was handed by mistake? That could explain what happens when the track ends, but not when we're left hanging.

To help debug this I managed to catch the track-ending case with rr, and I have a pernosco recording. See the Notebook for where the window goes away and the track ends. The track ending is what fails the test as seen in the Alerts panel.

This might be the cause of bug 1114828, but I can't say for sure if it's the only reason we could have timed out in the old version of this test. The linked version above is my modernized version of it where we still have issues.

I'll add that I have seen similar things with other tests that do getUserMedia early on in the test. Either that gUM succeeds and it gets some tracks that unexpectedly end, or that the async gUM request is in flight and later rejected with an AbortError. Both would be possible if MediaManager::OnNavigation() is called for the window the request was made with.

What makes this easier to reproduce for test_imagecapture.html is the fact that it's triggering GC explicitly, I suppose.

Also see test_imagecapture.html --verify with some more coverage on try. The failure mode that times out seems to later leak the world.

This seems to me like a real and unexpected issue. Could we get this triaged?

Flags: needinfo?(nkochar)

Andrew, can you look into this?

Flags: needinfo?(nkochar) → needinfo?(continuation)

Sorry for the delay. I don't know what could be going wrong here. I think it is very unlikely that the window that is currently being displayed is being collected. Note that the OnNavigation doesn't actually require collection, just navigating away from the page. But I'm not sure how a GC would trigger that, and the test doesn't seem to do any navigation as far as I can see. I'll try retriggering the TV builds in your try run and see if I can come up with anything from the logs.

You should be able to get all the logs (and so much more) you need from pernosco.

Ah, the database for the pernosco recording has expired. I have requested to get it rebuilt.

I had another look, and it's not the window of the gUM request that gets collected, it's the ancestor of the window of the gUM request. I'm updating the title to reflect this.

  • We get a gUM request for window 36 (outer window 30).
  • Window 11 gets cleaned up through FreeInnerObjects, and calls MediaManager::OnNavigation with window id 11.

See the pernosco notebook for details.

Summary: Inner window running a mochitest gets garbage collected mid-test → Ancestor of inner window running a mochitest gets garbage collected mid-test

Sorry for the delay here. It sounds like this is deliberate behavior on the part of the media code? I'm afraid I don't really have any idea what the right behavior for it. Boris, do you have any idea how this media manager code should deal with child windows?

Flags: needinfo?(continuation) → needinfo?(bzbarsky)

OK, so the sequence of events here is:

  1. An about:blank document is placed into bfcache. I expect this is the initial about:blank that pre-exists the initial test harness toplevel load.
  2. Eventually it is evicted from bfcache via GC calling ~nsSHEntryShared (I'm a little surprised this causes nsSHEntryShared::RemoveFromBFCacheSync to do something useful, but I haven't looked into how we maintain our bfcache state recently).
  3. This calls nsGlobalWindowInner::FreeInnerObjects on the about:blank inner, which calls mNavigator->OnNavigation(). This is almost certainly wrong semantically: the navigation happened a long time ago, when we were placed in bfcache. Of course we do call OnNavigation then too, via nsGlobalWindowOuter::SetNewDocument calling it on the inner we are navigating away from.
  4. Navigator::OnNavigation calls MediaManager::OnNavigation, fine.
  5. MediaManager::OnNavigation calls IterateWindowListeners which starts poking at aWindow->GetDocShell(), where aWindow is the dying inner. Importantly, the inner is NOT THE CURRENT INNER FOR THAT DOCSHELL. So now we are walking the child shells, which represent subframes in whatever document is in fact currently loaded in the docshell, not the document being evicted from bfcache, and tearing down their state. This seems pretty bogus.

A minimal fix is to bail out from MediaManager::OnNavigation at the very least when it's called for a non-current inner. But IterateWindowListeners is called in various places, and I have approximately 0 confidence that other callsites don't call it for non-current inners. For example, MediaManager::DeviceListChanged calls it on all inners in the inner window table, which would certainly include non-current ones. That means, incidentally, that it multiple-walks all subframes (once via every ancestor, and once per subframe), which is likely also not great...

Looks like MediaManager::MediaCaptureWindowState will always be calling IterateWindowListeners with a current inner. On the other hand, StopScreensharing looks like it just starts with some random windowid, and no guarantees that it's current.

Someone who actually knows the MediaManager code needs to decide what it's trying to do here and how to accomplish it, but the current setup is just broken.

Component: Document Navigation → WebRTC: Audio/Video
Flags: needinfo?(bzbarsky) → needinfo?(jib)
Summary: Ancestor of inner window running a mochitest gets garbage collected mid-test → Media manager walks descendant doshells of a docshell that it got from a non-current inner, tearing down unrelated documents

I suspect just not walking descendants in IterateWindowListeners if the window is not current would probably be enough to fix most of the issue here, though MediaManager::DeviceListChanged would still be doing things O(N^2) in depth of the window tree...

(In reply to Boris Zbarsky [:bzbarsky, bz on IRC] from comment #11)

NOT THE CURRENT INNER FOR THAT DOCSHELL.

Wow. Thank you gentlemen, that explains things. The iteration code here is from bug 1062876, which is an old sec bug. Prior to that we were not handling iframes properly, so I suspect we still have to do some form of window iteration here. jib, you reviewed that one, wanna take this?

The DeviceListChanged behavior is from bug 1364038, and, well, is not great.

P3 given how old this behavior is.

Priority: -- → P3
Regressed by: CVE-2014-1585, 1364038
Summary: Media manager walks descendant doshells of a docshell that it got from a non-current inner, tearing down unrelated documents → MediaManager walks descendant docshells of a docshell that it got from a non-current inner, tearing down mediacapture for unrelated documents
Assignee: nobody → jib
Flags: needinfo?(jib)
Pushed by jbruaroey@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/6c86995d33de
Avoid stepping on docShells of unrelated docs in MediaManager device stopping code. r=bzbarsky,pehrsons
Status: NEW → RESOLVED
Closed: 5 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla72

Given the age of this bug, I assume we can live with it on older releases. Feel free to nominate for uplift if you feel strongly otherwise.

Has Regression Range: --- → yes
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: