Closed Bug 1254029 Opened 8 years ago Closed 8 years ago

Running e10s mochitest locally takes an extensive time to start

Categories

(Testing :: Mochitest, defect)

x86_64
Windows 10
defect
Not set
normal

Tracking

(e10s+, firefox48 fixed)

RESOLVED FIXED
mozilla48
Tracking Status
e10s + ---
firefox48 --- fixed

People

(Reporter: xidorn, Assigned: sotaro)

References

(Blocks 1 open bug)

Details

Attachments

(2 files, 2 obsolete files)

When I run e10s mochitest on my Windows, it takes a very long time (~1min) before the test starts. In most of the time, the browser window freezes.

Non-e10s mochitest is not affected, e10s reftest is fine as well. Not sure what's wrong there.
Attached file log
This is the log before the test actually started (the last line is the start of the test).

The browser window froze after line
> 1457335296294   Marionette      TRACE   conn1 <- [1,12,null,{"value":null}]

After that, the log was still going but very slow.
This makes debugging issues in e10s tests really annoying... Hopefully we can have this fixed soon.
tracking-e10s: --- → ?
It seems if I remove both sanity-test.* and extensions.* from prefs.js, this would happen.
Is this on debug or opt builds? I have seen a freeze similar to your description on my debug builds, though I just blamed it on "debug being debug", though that may have been naive. I have not seen any similar slow-down in my opt builds.
Flags: needinfo?(quanxunzhen)
I have no idea. My local build is debug build (--enable-debug with --enable-optimize). But it seems I can reproduce this issue on neither opt build nor debug build from the server :/

I think a freeze up to ~1min is far too long even for debug build.
Flags: needinfo?(quanxunzhen)
Summary: Running e10s mochitest takes an extensive time to start → Running e10s mochitest locally takes an extensive time to start
(In reply to Xidorn Quan [:xidorn] (UTC+10) from comment #3)
> It seems if I remove both sanity-test.* and extensions.* from prefs.js, this
> would happen.

I meant, if remove those prefs from the test profile, and execute |mach run|, the freeze also happens.
It seems this gets stuck at this loop: https://dxr.mozilla.org/mozilla-central/rev/55d557f4d73ee58664bdf2fa85aaab555224722e/ipc/glue/GeckoChildProcessHost.cpp#439-441

And when the loop finishes, it reports several pipe error:
> [Parent 764] WARNING: pipe error: 109: file central/ipc/chromium/src/chrome/common/ipc_channel_win.cc, line 333
> [Parent 764] WARNING: pipe error: 232: file central/ipc/chromium/src/chrome/common/ipc_channel_win.cc, line 446
> [Parent 764] WARNING: pipe error: 232: file central/ipc/chromium/src/chrome/common/ipc_channel_win.cc, line 446
> [Parent 764] WARNING: pipe error: 109: file central/ipc/chromium/src/chrome/common/ipc_channel_win.cc, line 333
> [Parent 764] WARNING: pipe error: 109: file central/ipc/chromium/src/chrome/common/ipc_channel_win.cc, line 333
Actually it gets stuck only at the lock.Wait() rather than the whole loop. And it seems to me this happens only if there is another run of LaunchAdnWaitForProcessHandle before.
It seems that is waiting for some notification from the IOThread, and the IOThread is blocked on ProcessWatcher::EnsureProcessTerminated for ~GeckoChildProcessHost().
ProcessWatcher::EnsureProcessTerminated waits for the child process to terminate, and the child process is blocked on MediaShutdown on this loop: https://dxr.mozilla.org/mozilla-central/rev/55d557f4d73ee58664bdf2fa85aaab555224722e/dom/media/MediaShutdownManager.cpp#142-144
And the MediaShutdown seems to be blocked on this stack:
> xul.dll!mozilla::layers::AsyncTransactionWaiter::WaitComplete() Line 29	C++
> xul.dll!mozilla::layers::ImageBridgeChild::FlushAllImages(mozilla::layers::ImageClient * aClient, mozilla::layers::ImageContainer * aContainer) Line 675	C++
> xul.dll!mozilla::layers::ImageContainer::ClearAllImages() Line 283	C++
> xul.dll!mozilla::VideoFrameContainer::ClearCurrentFrame() Line 96	C++
> xul.dll!mozilla::MediaFormatReader::ReleaseMediaResources() Line 1626	C++
> xul.dll!mozilla::MediaDecoderReader::Shutdown() Line 371	C++
> xul.dll!mozilla::MediaFormatReader::Shutdown() Line 142	C++
> xul.dll!mozilla::detail::MethodCallInvokeHelper<RefPtr<mozilla::MozPromise<bool,bool,0> >,mozilla::MediaDecoderReader>(RefPtr<mozilla::MozPromise<bool,bool,0> >(mozilla::MediaDecoderReader::*)() aMethod, mozilla::MediaDecoderReader * aThisVal, mozilla::Tuple<> & aArgs, mozilla::IndexSequence<> __formal) Line 917	C++

AsyncTransactionWaiter waits for a complete 50s. According to the stack, it seems the waiter is waiting for FlushAllImagesSync to finish. However, if I understand correctly, that function is never executed, and thus the waiter waits it for the whole timeout and finally abandons.
Anthony, could you have a look at this issue?

This happens reliably for me on my Windows 10 machine with debug build and e10s. The simplest steps to reproduce are:
1. remove $OBJDIR\tmp\scratch_user (the profile directory)
2. mach run
Then a small window appears and disappears, and then the main window freezes there for ~1min.

It seems to me it only happens if the sanity test window appears before the main browser window. According to the information I got above, I believe it is because the chrome process waits for the content process of sanity test to finish. And that content process is blocked on waiting for media shutdown.

It seems to me builds from servers do not run sanity test, and thus this issue is not triggered. I guess I can disable the sanity test to workaround this.
Flags: needinfo?(ajones)
Blocks: 1243083
Disabling the sanity test during testing seems reasonable to me.
Flags: needinfo?(ajones)
But it is still an issue that the window freeze ~1min every time when I run a debug build with fresh profile.

Why does the MediaShutdown take such a long time (and AsyncTransactionWaiter actually can wait forever if there is no timeout)? Is that expected behavior?
Flags: needinfo?(ajones)
MediaShutdown waits until all media related objects have shut down, then gfx is shutdown.  If not for the wait, and as all media objects are typically asynchronous and running on secondary threads you would find gfx being shutdown and media still attempting to use it causing nasty crashes.

Why gfx is hanging waiting on something apparently non useful is something gfx related, and I'd say what needs fixing.
Flags: needinfo?(nical.bugzilla)
AsyncTransactionWaiter lets us wait for something specific to happen on the compositor side like the creation and delivery to the content side of Fence objects. It looks like the compositor should have sent a message specific to the content side but didn't, and it indeed very much needs fixing.
Passing the needinfo to Sotaro who knows best the ins and outs of AsyncTransactionWaiter (And is in the process of refactoring it pretty heavily).
Flags: needinfo?(sotaro.ikeda.g)
Flags: needinfo?(nical.bugzilla)
Flags: needinfo?(ajones)
ImageBridgeChild::FlushAllImages() waits flush completion synchronously. It is originally implemented for b2g. It could be remove except b2g gonk.
Assignee: nobody → sotaro.ikeda.g
Flags: needinfo?(sotaro.ikeda.g)
Attachment #8739821 - Flags: review?(nical.bugzilla)
Comment on attachment 8739821 [details] [diff] [review]
patch - Do not wait in ImageBridgeChild::FlushAllImages() except gonk

Review of attachment 8739821 [details] [diff] [review]:
-----------------------------------------------------------------

We indeed don't seem to need this on non-gralloc platforms AFAIK, but it'd be very useful to know why we are missing that notification from the compositor to the content process.
Attachment #8739821 - Flags: review?(nical.bugzilla) → review+
See Also: → 1244883
(In reply to Nicolas Silva [:nical] from comment #19)
> Comment on attachment 8739821 [details] [diff] [review]
> We indeed don't seem to need this on non-gralloc platforms AFAIK, but it'd
> be very useful to know why we are missing that notification from the
> compositor to the content process.

I tried with my laptop but I did not see the above. Nor "Waiting async transaction complete" log. I suspect that debug build with E10s is just very heavy weight an slow. Anyway, the patch remove one bottleneck.
attachment 8739821 [details] [diff] [review] removes all wait. But ImageBridgeChild::FlushAllImages() needs to wait FlushAllImagesSync() call to ensure ImageContainer alive.
Hmm, attachment 8740320 [details] [diff] [review] does not work well when e10s is disabled.
Addressed the problem.
Attachment #8740320 - Attachment is obsolete: true
Comment on attachment 8740356 [details] [diff] [review]
patch - Do not wait in ImageBridgeChild::FlushAllImages() except gonk

:nical, can you review the patch again? It was changed since the review.
Attachment #8740356 - Flags: review?(nical.bugzilla)
Attachment #8740356 - Flags: review?(nical.bugzilla) → review+
I can confirm the patch does fix this issue. Thanks!
Thanks for the confirmation.
https://hg.mozilla.org/mozilla-central/rev/0dcdf655c75e
Status: NEW → RESOLVED
Closed: 8 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla48
You need to log in before you can comment on or make changes to this bug.