[e10s-multi] Use the preallocated process manager by default

RESOLVED FIXED in Firefox 54

Status

()

defect
RESOLVED FIXED
3 years ago
2 years ago

People

(Reporter: gkrizsanits, Assigned: gkrizsanits)

Tracking

(Depends on 1 bug)

unspecified
mozilla55
Points:
---
Dependency tree / graph
Bug Flags:
qe-verify -

Firefox Tracking Flags

(firefox53 unaffected, firefox54 fixed, firefox55 fixed)

Details

(Whiteboard: [e10s-multi:+])

Attachments

(5 attachments, 1 obsolete attachment)

Whiteboard: [e10s-multi:?]
Whiteboard: [e10s-multi:?] → [e10s-multi:+]
Whiteboard: [e10s-multi:+] → [e10s-multi:+][MemShrink]
Gabor, do you have details on the preallocated process manager somewhere? In general I'm curious about the memory overhead of a preallocated process.
Flags: needinfo?(gkrizsanits)
Whiteboard: [e10s-multi:+][MemShrink] → [e10s-multi:+]
(In reply to Eric Rahm [:erahm] from comment #1)
> Gabor, do you have details on the preallocated process manager somewhere? In
> general I'm curious about the memory overhead of a preallocated process.

It's basically an extra content process without any tabs. That means it has all the chrome scripts are already loaded in it. I'd say if you open up an about:blank page in a content process, you should get a pretty close estimation for it. It does respect the maximum content process count, so once we hit the maximum number of processes, we stop pre-allocating a new one. I'm trying to enable it on nightly but it is not looking great at the moment: https://treeherder.mozilla.org/#/jobs?repo=try&revision=67391d8bc05021d194d9e5ad260193c8e8c42836
Flags: needinfo?(gkrizsanits)
It looks somewhat better now with some trivial bug fixes: https://treeherder.mozilla.org/#/jobs?repo=try&revision=4af13ed36b7b7208f9ed05a4e024ed99a51d672b&selectedJob=82785861

Although some webextension tests seem to have an issue with the patch.
Posted patch WIP (obsolete) — Splinter Review
Hi Kris, I'm trying to enable the pre-allocated process manager by default for e10s-multi. It seems like it is causing some timing issues for web extensions. One test that keeps timing out reliably with the patch is browser/components/extensions/test/mochitest/test_ext_all_apis.html (seems like the extension loaded promise does not resolve). But I see quite some test-oop-extensions failures as well (some of them are the same issue, have not checked all of them though). Is there a chance you could help me figuring out what's going on? We've got some heavy pressure on the project recently...
Flags: needinfo?(kmaglione+bmo)
My best guess, without looking very deeply, is that it's because we rely on the initial process data to determine what content scripts to inject, and the preloaded process has data from before the test extension was started, and doesn't get the message that we send to existing processes at extension startup.
Flags: needinfo?(kmaglione+bmo)
(In reply to Kris Maglione [:kmag] from comment #6)
> My best guess, without looking very deeply, is that it's because we rely on
> the initial process data to determine what content scripts to inject, and
> the preloaded process has data from before the test extension was started,
> and doesn't get the message that we send to existing processes at extension
> startup.

Thanks for the quick reply, this is a very likely scenario. It is a bit tricky because currently WE is also relying on the invariant that any present and future content process can be updated after a state change by setting the initial process data and broadcast a change message. Preallocated process breaks that assumption. Broadcasting will not reach it and setting initial process data is just too late.

[1] We should have an API to send message to the preallocated process.
[2] We should make sure that where WE assumed it's enough to set initial process data also sends a message to the preallocated process and handle this case on the content side.

[1] should be straightforward, with [2] I will probably need some help from the WE team.

Bill, does this plan makes sense to you or do you have an alternative approach that you would prefer here?
Flags: needinfo?(wmccloskey)
Isn't the preallocated process just a normal content process? Why wouldn't it get broadcast messages about changes to WebExtension data?
Flags: needinfo?(wmccloskey)
(In reply to Bill McCloskey (:billm) from comment #8)
> Isn't the preallocated process just a normal content process? Why wouldn't
> it get broadcast messages about changes to WebExtension data?

Well, we only load ExtensionContent.jsm from frame scripts, so depending on whether the preallocated process is hooked up as a child of cpmm at extension startup or not, we're probably either waiting on a response that will never come, or starting with stale extension data once the first frame script is loaded.
Oh, right. I guess we should move that stuff to a process script.
(In reply to Bill McCloskey (:billm) from comment #8)
> Isn't the preallocated process just a normal content process? Why wouldn't
> it get broadcast messages about changes to WebExtension data?

You're right. I thought we hook that up as a child of ppmm just before we start using it but that is not the case. Which means it can be observed from chrome js from the start which is something I wanted to avoid but for this case it makes my life easier.

(In reply to Kris Maglione [:kmag] from comment #9)
> Well, we only load ExtensionContent.jsm from frame scripts, so depending on
> whether the preallocated process is hooked up as a child of cpmm at
> extension startup or not, we're probably either waiting on a response that
> will never come, or starting with stale extension data once the first frame
> script is loaded.

What about Schemas? It looks to me that we start with stale data there as well. (http://searchfox.org/mozilla-central/source/toolkit/components/extensions/Schemas.jsm#2205)

Similar cases are WebRequestContent, FormAutofillContent, RemoteAddonsChild(?). (http://searchfox.org/mozilla-central/search?q=initialprocessdata&path=)

(In reply to Bill McCloskey (:billm) from comment #10)
> Oh, right. I guess we should move that stuff to a process script.

Thanks, that sounds like the right thing to do.
(In reply to Gabor Krizsanits [:krizsa :gabor] from comment #11)
> What about Schemas? It looks to me that we start with stale data there as
> well.
> (http://searchfox.org/mozilla-central/source/toolkit/components/extensions/Schemas.jsm#2205)

Yeah, those listeners are also initialized from ExtensionContent.jsm, so
loading that from a process script should also fix the problem.

> Similar cases are WebRequestContent, FormAutofillContent,
> RemoteAddonsChild(?).
> (http://searchfox.org/mozilla-central/search?q=initialprocessdata&path=)

WebRequestCotent is loaded as a process script the first time it's needed, so
that should be fine. I'm not sure about the others.
(In reply to Kris Maglione [:kmag] from comment #12)
> Yeah, those listeners are also initialized from ExtensionContent.jsm, so
> loading that from a process script should also fix the problem.

Oh, I see.

> WebRequestCotent is loaded as a process script the first time it's needed, so
> that should be fine. I'm not sure about the others.

Right, I'll check them and fix where needed then.
Fixing FormAutofillContent and extensions were easy and I'm not sure about RemoteAddonsChild yet.

There is also a blob crash I'm currently investigating with Baku.
https://treeherder.mozilla.org/#/jobs?repo=try&revision=78c235b7c79f42e725cd9723719dd1a19c4c4682&selectedJob=83366692
I want to see why this patch cause a crash in BlobImpl. NI myself. I'll try the patch locally tomorrow morning.
Flags: needinfo?(amarchesini)
(In reply to Andrea Marchesini [:baku] from comment #15)
> I want to see why this patch cause a crash in BlobImpl. NI myself. I'll try
> the patch locally tomorrow morning.

Calling RemoveDataEntry here:[1] with |false| for avoid re-broadcasting it fixed the issue, I'm testing the patch and will ask you for a review for that part.

[1]: http://searchfox.org/mozilla-central/rev/ef0b6a528e0e7b354ddf8cdce08392be8b8ca679/dom/ipc/ContentChild.cpp#2995

https://treeherder.mozilla.org/#/jobs?repo=try&revision=f2f4a21cb49f4893da0effe079fe1622530a5061
Attachment #8845479 - Attachment is obsolete: true
Attachment #8846766 - Flags: review?(amarchesini)
Ideally we should have a process script that does the minimal extension initialization, as this whole thing is loaded and executed before we would process the first url in the tab. But that's out of the scope of this bug, I just want to have a minimalist patch that can be uplifted to aurora if needed and let me land this patch fast.
Attachment #8846768 - Flags: review?(kmaglione+bmo)
I think it's enough to enable it on firefox.js instead of all.js.

Launching the first preallocated process from ContentParent::StartUp was just too early and caused some troubles. I removed that part.

In case of e10s is disabled we should not launch the preallocated process (the BrowserTabsRemoteAutostart part).

Finally, I remember you mentioning that mShutdown was not used in the original patch, and I thought I removed the flag because the orignal version didn't handle it either (I didn't... :( ). Anyway, as it turns out I have to handle the shutdown situation and not launch a preallocated process in that case either.
Attachment #8846774 - Flags: review?(wmccloskey)
Assignee: nobody → gkrizsanits
Attachment #8846769 - Flags: review?(wmccloskey) → review+
Attachment #8846774 - Flags: review?(wmccloskey) → review+
Comment on attachment 8846768 [details] [diff] [review]
part2: Preload ExtensionContent.jsm from a process script. v1

Review of attachment 8846768 [details] [diff] [review]:
-----------------------------------------------------------------

::: toolkit/components/extensions/Extension.jsm
@@ +973,5 @@
>    }
>  };
> +
> +Services.ppmm.loadProcessScript("data:,new " + function() {
> +  Components.utils.import("resource://gre/modules/ExtensionContent.jsm");

We should also be able to remove the import at the top of Extension.jsm now.
Attachment #8846768 - Flags: review?(kmaglione+bmo) → review+
Attachment #8846766 - Flags: review?(amarchesini) → review+
I think we need this one too, just yesterday I didn't have time for it.

I think this is the earliest moment when we can safely run the process script.
Attachment #8847125 - Flags: review?(wmccloskey)
Attachment #8847125 - Flags: review?(wmccloskey) → review+
Pushed by gkrizsanits@mozilla.com:
https://hg.mozilla.org/integration/mozilla-inbound/rev/fc94903b628a
Stop rebroadcasting BlobURLUnregistration. r=baku
https://hg.mozilla.org/integration/mozilla-inbound/rev/26389c6e73ff
Use the preallocated process manager by default. r=billm, r=kmag
I merged part2-part5 for easier uplift. Part 1 is really independent from the rest, so I wanted to keep that part separated.
Backed out for reftest failures on debug after asserting at nsLayoutStylesheetCache.cpp:253:

https://hg.mozilla.org/integration/mozilla-inbound/rev/0d5375cf02555456fa3dcbce36fd4b64416bd1a7
https://hg.mozilla.org/integration/mozilla-inbound/rev/121622840147bb70700e414f5946fbf74694acfc

Push with failures: https://treeherder.mozilla.org/#/jobs?repo=mozilla-inbound&revision=26389c6e73ff8fbfca87d5246ff3dd08cb41088b&filter-resultStatus=testfailed&filter-resultStatus=busted&filter-resultStatus=exception&filter-resultStatus=retry&filter-resultStatus=usercancel&filter-resultStatus=runnable
Failure log: https://treeherder.mozilla.org/logviewer.html#?job_id=83998078&repo=mozilla-inbound

[task 2017-03-15T12:36:02.273965Z] 12:36:02     INFO - REFTEST SUITE-END | Shutdown
[task 2017-03-15T12:36:02.276755Z] 12:36:02     INFO - REFTEST INFO | Slowest test took 2479ms (http://localhost:41331/1489580582935/4/default.html)
[task 2017-03-15T12:36:02.278397Z] 12:36:02     INFO - REFTEST INFO | Total canvas count = 0
[task 2017-03-15T12:36:02.416206Z] 12:36:02     INFO - --DOCSHELL 0xd14ce800 == 2 [pid = 1008] [id = {d37d225c-6efb-4d9a-8b19-126f5738a7d9}]
[task 2017-03-15T12:36:02.439920Z] 12:36:02     INFO - --DOCSHELL 0xd14d6000 == 1 [pid = 1008] [id = {8ac0fe03-8991-4821-a38c-970c94a206be}]
[task 2017-03-15T12:36:02.476150Z] 12:36:02     INFO - --DOCSHELL 0xe65c8c00 == 0 [pid = 1008] [id = {eb653a89-f416-4160-9bbe-775284ab218e}]
[task 2017-03-15T12:36:03.289704Z] 12:36:03     INFO - JavaScript error: chrome://reftest/content/reftest.jsm, line 2024: NS_ERROR_NOT_AVAILABLE: Component returned failure code: 0x80040111 (NS_ERROR_NOT_AVAILABLE) [nsIPropertyBag2.getPropertyAsAString]
[task 2017-03-15T12:36:03.306467Z] 12:36:03     INFO - JavaScript error: chrome://reftest/content/reftest.jsm, line 2024: NS_ERROR_NOT_AVAILABLE: Component returned failure code: 0x80040111 (NS_ERROR_NOT_AVAILABLE) [nsIPropertyBag2.getPropertyAsAString]
[task 2017-03-15T12:36:03.348257Z] 12:36:03     INFO - [Child 1128] WARNING: NS_ENSURE_TRUE(maybeContext) failed: file /home/worker/workspace/build/src/xpcom/threads/nsThread.cpp, line 1014
[task 2017-03-15T12:36:03.348418Z] 12:36:03     INFO - [Child 1128] WARNING: NS_ENSURE_TRUE(aObserver) failed: file /home/worker/workspace/build/src/modules/libpref/nsPrefBranch.cpp, line 715
[task 2017-03-15T12:36:03.349056Z] 12:36:03     INFO - [Child 1128] WARNING: NS_ENSURE_SUCCESS(rv, rv) failed with result 0x80070057: file /home/worker/workspace/build/src/modules/libpref/Preferences.cpp, line 1709
[task 2017-03-15T12:36:03.403778Z] 12:36:03     INFO - [Child 1087] WARNING: NS_ENSURE_TRUE(maybeContext) failed: file /home/worker/workspace/build/src/xpcom/threads/nsThread.cpp, line 1014
[task 2017-03-15T12:36:03.406118Z] 12:36:03     INFO - [Child 1087] WARNING: NS_ENSURE_TRUE(aObserver) failed: file /home/worker/workspace/build/src/modules/libpref/nsPrefBranch.cpp, line 715
[task 2017-03-15T12:36:03.406556Z] 12:36:03     INFO - [Child 1087] WARNING: NS_ENSURE_SUCCESS(rv, rv) failed with result 0x80070057: file /home/worker/workspace/build/src/modules/libpref/Preferences.cpp, line 1709
[task 2017-03-15T12:36:03.791885Z] 12:36:03     INFO - Assertion failure: !gUserContentSheetURL (Got the URL but never used?), at /home/worker/workspace/build/src/layout/style/nsLayoutStylesheetCache.cpp:253
[task 2017-03-15T12:36:23.714498Z] 12:36:23     INFO - #01: nsLayoutStylesheetCache::Shutdown [layout/style/nsLayoutStylesheetCache.cpp:253]
[task 2017-03-15T12:36:23.714571Z] 12:36:23     INFO - 
[task 2017-03-15T12:36:23.715579Z] 12:36:23     INFO - #02: nsLayoutStatics::Shutdown [xpcom/base/StaticPtr.h:158]
[task 2017-03-15T12:36:23.715629Z] 12:36:23     INFO - 
[task 2017-03-15T12:36:23.715681Z] 12:36:23     INFO - #03: LayoutModuleDtor [layout/build/nsLayoutModule.cpp:1247]
[task 2017-03-15T12:36:23.715843Z] 12:36:23     INFO - 
[task 2017-03-15T12:36:23.715928Z] 12:36:23     INFO - #04: nsComponentManagerImpl::KnownModule::~KnownModule [xpcom/components/nsComponentManager.h:227]
[task 2017-03-15T12:36:23.717316Z] 12:36:23     INFO - 
[task 2017-03-15T12:36:23.718490Z] 12:36:23     INFO - #05: nsAutoPtr<nsComponentManagerImpl::KnownModule>::~nsAutoPtr [memory/mozalloc/mozalloc.h:218]
[task 2017-03-15T12:36:23.719460Z] 12:36:23     INFO - 
[task 2017-03-15T12:36:23.720647Z] 12:36:23     INFO - #06: nsTArray_Impl<nsAutoPtr<nsComponentManagerImpl::KnownModule>, nsTArrayInfallibleAllocator>::RemoveElementsAt [xpcom/ds/nsTArray.h:559]
[task 2017-03-15T12:36:23.721653Z] 12:36:23     INFO - 
[task 2017-03-15T12:36:23.722766Z] 12:36:23     INFO - #07: nsTArray_Impl<nsAutoPtr<nsComponentManagerImpl::KnownModule>, nsTArrayInfallibleAllocator>::Clear [xpcom/ds/nsTArray.h:1668]
[task 2017-03-15T12:36:23.723765Z] 12:36:23     INFO - 
[task 2017-03-15T12:36:23.724888Z] 12:36:23     INFO - #08: nsComponentManagerImpl::Shutdown [xpcom/components/nsComponentManager.cpp:856]
[task 2017-03-15T12:36:23.725885Z] 12:36:23     INFO - 
[task 2017-03-15T12:36:23.727370Z] 12:36:23     INFO - #09: mozilla::ShutdownXPCOM [xpcom/build/XPCOMInit.cpp:1011]
[task 2017-03-15T12:36:23.728370Z] 12:36:23     INFO - 
[task 2017-03-15T12:36:23.729461Z] 12:36:23     INFO - #10: XRE_TermEmbedding [toolkit/xre/nsEmbedFunctions.cpp:225]
[task 2017-03-15T12:36:23.730419Z] 12:36:23     INFO - 
[task 2017-03-15T12:36:23.731798Z] 12:36:23     INFO - #11: mozilla::ipc::ScopedXREEmbed::Stop [ipc/glue/ScopedXREEmbed.cpp:118]
[task 2017-03-15T12:36:23.732827Z] 12:36:23     INFO - 
[task 2017-03-15T12:36:23.733944Z] 12:36:23     INFO - #12: XRE_InitChildProcess [toolkit/xre/nsEmbedFunctions.cpp:691]
[task 2017-03-15T12:36:23.735001Z] 12:36:23     INFO - 
[task 2017-03-15T12:36:23.761389Z] 12:36:23     INFO - #13: content_process_main [ipc/contentproc/plugin-container.cpp:64]
[task 2017-03-15T12:36:23.762526Z] 12:36:23     INFO - 
[task 2017-03-15T12:36:23.763849Z] 12:36:23     INFO - #14: main [browser/app/nsBrowserApp.cpp:286]
Flags: needinfo?(gkrizsanits)
Hi, the patch I'm trying to land is about enabling a preallocated process mechanism to mitigate the slow content process startup time. It just launches an empty content process in the background without any tabs in it and loads the process scripts and does the basic initializations, so the next time a user opens a tab we can just grab and use it.

You asked for an assertion in your review [1] that I'm hitting here, and I assume the reason is that we send the URI for gUserContentSheetURL at process startup but don't use it until we load the first content into this process, which in the case of these reftests never happen.

My questions:
1. Can this url outdate? (we start the preallocated process with one url but by the time we start using it (possibly much later), the url is changed and we never let the preallocated process know about the change)
2. Can I just remove this assertion, or should I validate it somehow that this case can only happen for content processes that didn't load any content yet?

[1]: https://bugzilla.mozilla.org/show_bug.cgi?id=1046166#c54

Note: this is a quite urgent patch that we plan to uplift to aurora.
Flags: needinfo?(gkrizsanits) → needinfo?(dbaron)
Flags: needinfo?(amarchesini)
(In reply to Gabor Krizsanits [:krizsa :gabor] from comment #26)
> My questions:
> 1. Can this url outdate? (we start the preallocated process with one url but
> by the time we start using it (possibly much later), the url is changed and
> we never let the preallocated process know about the change)

I don't understand this question.

> 2. Can I just remove this assertion, or should I validate it somehow that
> this case can only happen for content processes that didn't load any content
> yet?

It sounds like replacing the assertion with an assignment to null:
  gUserContentSheetURL = nullptr;
would be fine.
Flags: needinfo?(dbaron)
(In reply to David Baron :dbaron: ⌚️UTC-7 from comment #27)
> (In reply to Gabor Krizsanits [:krizsa :gabor] from comment #26)
> > My questions:
> > 1. Can this url outdate? (we start the preallocated process with one url but
> > by the time we start using it (possibly much later), the url is changed and
> > we never let the preallocated process know about the change)
> 
> I don't understand this question.

Pre-patch every time we launch a new content process, we
(1) send this data from parent to the new child process:
http://searchfox.org/mozilla-central/source/dom/ipc/ContentParent.cpp#2187
(2) then store it as gUserContentSheetURL: http://searchfox.org/mozilla-central/source/dom/ipc/ContentChild.cpp#1030
(3) and I would assume quickly after, use it here: http://searchfox.org/mozilla-central/source/layout/style/nsLayoutStylesheetCache.cpp#346

Post-patch there can be a huge delay between step (2) and (3) when the content process was started by the preallocated process manager. So let's say the return value of nsLayoutStylesheetCache::For(StyleBackendType::Gecko)->UserContentSheet()->GetSheetURI() from step (1) is something that can change over time, the content process might end up starting with the old value. And then I should fix that somehow.

Does that make sense?

> 
> > 2. Can I just remove this assertion, or should I validate it somehow that
> > this case can only happen for content processes that didn't load any content
> > yet?
> 
> It sounds like replacing the assertion with an assignment to null:
>   gUserContentSheetURL = nullptr;
> would be fine.

Thanks, that would work for me.
Flags: needinfo?(dbaron)
I think we just load the user sheet once, at startup, so I don't think that's a problem.
Flags: needinfo?(dbaron)
Pushed by gkrizsanits@mozilla.com:
https://hg.mozilla.org/integration/mozilla-inbound/rev/3a6cb84f8cf8
Stop rebroadcasting BlobURLUnregistration. r=baku
https://hg.mozilla.org/integration/mozilla-inbound/rev/8a1c36e9ba80
Use the preallocated process manager by default. r=billm, r=kmag
Backed out for frequently failing toolkit/components/extensions/test/xpcshell/test_ext_i18n.js on Windows:

https://hg.mozilla.org/integration/mozilla-inbound/rev/0c50d37c0cf4b9943a36d278ed712b25a15ae8e9
https://hg.mozilla.org/integration/mozilla-inbound/rev/69d73a6475dd349e56472583356a9ddfa6d1ccca

Push with failures: https://treeherder.mozilla.org/#/jobs?repo=mozilla-inbound&revision=8a1c36e9ba805f20ec6d63793f18268516d69d95&filter-resultStatus=testfailed&filter-resultStatus=busted&filter-resultStatus=exception&filter-resultStatus=retry&filter-resultStatus=usercancel&filter-resultStatus=runnable
Failure log: https://treeherder.mozilla.org/logviewer.html#?job_id=84280550&repo=mozilla-inbound

06:19:23     INFO -  TEST-START | xpcshell-remote.ini:toolkit/components/extensions/test/xpcshell/test_ext_i18n.js
06:24:23  WARNING -  TEST-UNEXPECTED-TIMEOUT | xpcshell-remote.ini:toolkit/components/extensions/test/xpcshell/test_ext_i18n.js | Test timed out
06:24:23     INFO -  TEST-INFO took 300001ms
...
06:24:23     INFO -  xpcshell-remote.ini:toolkit/components/extensions/test/xpcshell/test_ext_i18n.js | Starting test_detect_language
06:24:23     INFO -  (xpcshell/head.js) | test test_detect_language pending (2)
06:24:23     INFO -  "Extension attached"
06:24:23     INFO -  (xpcshell/head.js) | test run_next_test 3 finished (2)
06:24:23     INFO -  PID 11176 | [Child 7608] WARNING: pipe error: 109: file c:/builds/moz2_slave/m-in-w64-000000000000000000000/build/src/ipc/chromium/src/chrome/common/ipc_channel_win.cc, line 346
06:24:23     INFO -  "CONSOLE_MESSAGE: (error) [JavaScript Error: "No WebExtension found with ID {464c1279-19a9-47b1-a236-06e0241611cf}" {file: "resource://gre/modules/ExtensionParent.jsm" line: 530}]
06:24:23     INFO -  createProxyContext@resource://gre/modules/ExtensionParent.jsm:530:13
06:24:23     INFO -  receiveMessage@resource://gre/modules/ExtensionParent.jsm:498:11
06:24:23     INFO -  _do_main@C:\\slave\\test\\build\\tests\\xpcshell\\head.js:211:5
06:24:23     INFO -  _execute_test@C:\\slave\\test\\build\\tests\\xpcshell\\head.js:546:5
06:24:23     INFO -  @-e:1:1
06:24:23     INFO -  "
06:24:23     INFO -  "CONSOLE_MESSAGE: (error) [JavaScript Error: "No WebExtension found with ID {d84c25c8-b85c-482d-9777-2903d3d86573}" {file: "resource://gre/modules/ExtensionParent.jsm" line: 530}]
06:24:23     INFO -  createProxyContext@resource://gre/modules/ExtensionParent.jsm:530:13
06:24:23     INFO -  receiveMessage@resource://gre/modules/ExtensionParent.jsm:498:11
06:24:23     INFO -  _do_main@C:\\slave\\test\\build\\tests\\xpcshell\\head.js:211:5
06:24:23     INFO -  _execute_test@C:\\slave\\test\\build\\tests\\xpcshell\\head.js:546:5
06:24:23     INFO -  @-e:1:1
06:24:23     INFO -  "
Flags: needinfo?(gkrizsanits)
Pushed by gkrizsanits@mozilla.com:
https://hg.mozilla.org/integration/mozilla-inbound/rev/c3831ae7c9ff
Stop rebroadcasting BlobURLUnregistration. r=baku
This is taking longer than I wanted, so I'm landing the critical bug fix part first and try to uplift it ASAP and don't block it by the bigger patch that causes a few test failures.
Flags: needinfo?(gkrizsanits)
Keywords: leave-open
Comment on attachment 8846766 [details] [diff] [review]
part1: Stop rebroadcasting BlobURLUnregistration. v1

We need this uplift for e10s-multi.

Approval Request Comment
[Feature/Bug causing the regression]: bug 1279186
[User impact if declined]: It can get really nasty, an infinite loop of messaging between parent and children with possibly crashing in the end.
[Is this code covered by automated tests?]: Yes
[Has the fix been verified in Nightly?]: I tested it manually, landing it on nightly just now.
[Needs manual test from QE? If yes, steps to reproduce]: It would be quite hard to test it without a debugger... so not really
[List of other uplifts needed for the feature/fix]:
[Is the change risky?]: It is not risky at all.
[Why is the change risky/not risky?]: It's a trivial fix.
[String changes made/needed]: None.
Attachment #8846766 - Flags: approval-mozilla-aurora?
I'm going to land part 2 of this with bug 1348442, since I'm running into that race in tests there, too.
Actually, scratch that. It looks like even with that there's a race, and we still wind up with stale initial process data in new processes.
(In reply to Kris Maglione [:kmag] from comment #37)
> Actually, scratch that. It looks like even with that there's a race, and we
> still wind up with stale initial process data in new processes.

Is it by any chance test_ext_i18n.js? I keep getting timeouts there especially if I launch the pre-allocated process late: https://treeherder.mozilla.org/#/jobs?repo=try&revision=3afae48a4d615b34e081f2bef3248e2b402b3fac&selectedJob=86041342
I'm trying to figure out why this test times out. It seems like any xpcshell remote test with a content script can fail to unload properly. Seems like something goes wrong in the MessageChannel because of the preallocated process. This error is the weirdest:

https://treeherder.mozilla.org/logviewer.html#?job_id=87031418&repo=try&lineNumber=11725

I just don't get how can the |channelId| be undefined and the "No matching message response handler for 7-0" is also surprising to me I thought the message name should be something more human readable...

In the other case it's hard to tell why do promise for unload never gets resolved from the log.

https://treeherder.mozilla.org/logviewer.html#?job_id=87031418&repo=try&lineNumber=11691

Do you have any hints what can go wrong or where should I start looking? I wish I could reproduce this locally then it would be a lot simpler ofc...
Flags: needinfo?(kmaglione+bmo)
(In reply to Gabor Krizsanits [:krizsa :gabor] from comment #38)
> (In reply to Kris Maglione [:kmag] from comment #37)
> > Actually, scratch that. It looks like even with that there's a race, and we
> > still wind up with stale initial process data in new processes.
> 
> Is it by any chance test_ext_i18n.js? I keep getting timeouts there
> especially if I launch the pre-allocated process late:
> https://treeherder.mozilla.org/#/
> jobs?repo=try&revision=3afae48a4d615b34e081f2bef3248e2b402b3fac&selectedJob=8
> 6041342

Oh, sorry, it looks like my original comment for this got lost. No, it's not related for that. It turned out that we just weren't cleaning up initial data for disabled extensions correctly.

I'm not sure what's going on there. It looks like the test is completing properly, but the test harness times out waiting for a child process to exit, or something like that.

(In reply to Gabor Krizsanits [:krizsa :gabor] from comment #39)
> I'm trying to figure out why this test times out. It seems like any xpcshell
> remote test with a content script can fail to unload properly. Seems like
> something goes wrong in the MessageChannel because of the preallocated
> process. This error is the weirdest:
> 
> https://treeherder.mozilla.org/logviewer.
> html#?job_id=87031418&repo=try&lineNumber=11725
> 
> I just don't get how can the |channelId| be undefined and the "No matching
> message response handler for 7-0" is also surprising to me I thought the
> message name should be something more human readable...

This is usually not an error, just a warning that happens when we shut down the extension and kill its response listeners before the response comes in. There's some code in newer builds to silence these warnings in newer versions, but it's not 100% effective.

It's harder to have a more human readable error, because at that point we don't have any data about the response, and the contents are opaque. I suppose we could have the child side include the original message name so we could include that in the error.

> In the other case it's hard to tell why do promise for unload never gets
> resolved from the log.
> 
> https://treeherder.mozilla.org/logviewer.
> html#?job_id=87031418&repo=try&lineNumber=11691
> 
> Do you have any hints what can go wrong or where should I start looking? I
> wish I could reproduce this locally then it would be a lot simpler ofc...

Oh, is that's what's happening? In that case it's probably because we broadcast FlushJarCache to all of the child message managers and wait for a response. The preallocated manager may not be listening for the message yet.

http://searchfox.org/mozilla-central/rev/7419b368156a6efa24777b21b0e5706be89a9c2f/toolkit/components/extensions/Extension.jsm#950-956
Flags: needinfo?(kmaglione+bmo)
(In reply to Gabor Krizsanits [:krizsa :gabor] from comment #33)
> This is taking longer than I wanted, so I'm landing the critical bug fix
> part first and try to uplift it ASAP and don't block it by the bigger patch
> that causes a few test failures.

(In reply to Kris Maglione [:kmag] from comment #37)
> Actually, scratch that. It looks like even with that there's a race, and we
> still wind up with stale initial process data in new processes.

Sorry but maybe due to this bug if any one has more than 20 or 32 process enabled then tabs and data loss occurs.
Is there a bug? eg set 32 processes and click folder to open many/all tabs from 1 folder then quickly click folder to open many/all tabs from 2 folder. tabs of 1 folder are replaced/overwritten.

P.s: sorry for bad english and not so good STR
(In reply to Sunil Kumar from comment #41)
> Sorry but maybe due to this bug if any one has more than 20 or 32 process
> enabled then tabs and data loss occurs.
> Is there a bug? eg set 32 processes and click folder to open many/all tabs
> from 1 folder then quickly click folder to open many/all tabs from 2 folder.
> tabs of 1 folder are replaced/overwritten.
> 
> P.s: sorry for bad english and not so good STR

Hello there. Please file a separate bug for this. It's unrelated to the work happening here.

Thanks.
Does that answer your questions? Is there anything else I can do to help?
Flags: needinfo?(gkrizsanits)
(In reply to Kris Maglione [:kmag] from comment #40)
> Oh, is that's what's happening? In that case it's probably because we
> broadcast FlushJarCache to all of the child message managers and wait for a
> response. The preallocated manager may not be listening for the message yet.
> 
> http://searchfox.org/mozilla-central/rev/
> 7419b368156a6efa24777b21b0e5706be89a9c2f/toolkit/components/extensions/
> Extension.jsm#950-956

Thanks, your comment was really helpful I actually was sure that it must be the reason. I was looking for something like that but could not found this FlushJarCache broadcasting. Anyway, it seems like that wasn't it.

So I tried to kill the preallocated process before broadcasting that message (and remove its mm from the ppmm), which should have fixed this problem, but it seems like it didn't: https://treeherder.mozilla.org/#/jobs?repo=try&revision=b8369b495c2507f0fb6cbf9f82a211dd2370e2a2&selectedJob=87622263

In the meanwhile it also seems like I have to deal with yet another few intermittent tests that my patch upsets and turning them almost perma orange.

I'm running out of ideas about these two web extension tests so I'm considering to run them with the preallocated process manager disabled and figure it out later what's going on there exactly, before I get flooded again with some more intermittent tests. Would you be against that approach?
Flags: needinfo?(gkrizsanits) → needinfo?(kmaglione+bmo)
I added some additional logging, and the tests are definitely completing as expected. The failure comes after the tests are complete and the harness detects that the process is still running:

13:30:49     INFO -  PID 7360 | [Child 5820] WARNING: pipe error: 109: file c:/builds/moz2_slave/try-w32-0000000000000000000000/build/src/ipc/chromium/src/chrome/common/ipc_channel_win.cc, line 346
13:30:49     INFO -  PID 7360 | [Child 5820] WARNING: pipe error: 109: file c:/builds/moz2_slave/try-w32-0000000000000000000000/build/src/ipc/chromium/src/chrome/common/ipc_channel_win.cc, line 346
13:30:49     INFO -  <<<<<<<
13:30:49     INFO -  xpcshell return code: None
13:30:49     INFO -  xpcshell-remote.ini:toolkit/components/extensions/test/xpcshell/test_ext_contentscript.js | Process still running after test!

I don't know why that is, but it seems like something we should figure out.

(In reply to Gabor Krizsanits [:krizsa :gabor] from comment #44)
> I'm running out of ideas about these two web extension tests so I'm
> considering to run them with the preallocated process manager disabled and
> figure it out later what's going on there exactly, before I get flooded
> again with some more intermittent tests. Would you be against that approach?

I really don't want to turn this on while tests are failing, in case it turns out to actually break things in production.
Flags: needinfo?(kmaglione+bmo)
(In reply to Kris Maglione [:kmag] from comment #45)
> I added some additional logging, and the tests are definitely completing as
> expected. The failure comes after the tests are complete and the harness
> detects that the process is still running:
> 
> 13:30:49     INFO -  PID 7360 | [Child 5820] WARNING: pipe error: 109: file
> c:/builds/moz2_slave/try-w32-0000000000000000000000/build/src/ipc/chromium/
> src/chrome/common/ipc_channel_win.cc, line 346
> 13:30:49     INFO -  PID 7360 | [Child 5820] WARNING: pipe error: 109: file
> c:/builds/moz2_slave/try-w32-0000000000000000000000/build/src/ipc/chromium/
> src/chrome/common/ipc_channel_win.cc, line 346
> 13:30:49     INFO -  <<<<<<<
> 13:30:49     INFO -  xpcshell return code: None
> 13:30:49     INFO - 
> xpcshell-remote.ini:toolkit/components/extensions/test/xpcshell/
> test_ext_contentscript.js | Process still running after test!
> 
> I don't know why that is, but it seems like something we should figure out.

Sure.

> 
> (In reply to Gabor Krizsanits [:krizsa :gabor] from comment #44)
> > I'm running out of ideas about these two web extension tests so I'm
> > considering to run them with the preallocated process manager disabled and
> > figure it out later what's going on there exactly, before I get flooded
> > again with some more intermittent tests. Would you be against that approach?
> 
> I really don't want to turn this on while tests are failing, in case it
> turns out to actually break things in production.

I see where you're coming from but my worry is that this can take a while and most likely something very xpcshell test specific. Which is not something that can affect production easily. On the other hand if we don't test for long enough on nightly we might not get bug reports in time if there are some actual breakage somewhere caused by this patch, that is not covered by our tests. So I would take a risk and back it out later if something breaks, because I don't trust our tests to cover everything.

Is there a specific reason why you find this issue scary? I can be totally missing something and underestimate the severity of this problem...
Sorry, I completely missed your last reply. It was in the middle of a work week...

(In reply to Gabor Krizsanits [:krizsa :gabor] from comment #46)
> I see where you're coming from but my worry is that this can take a while
> and most likely something very xpcshell test specific. Which is not
> something that can affect production easily. On the other hand if we don't
> test for long enough on nightly we might not get bug reports in time if
> there are some actual breakage somewhere caused by this patch, that is not
> covered by our tests. So I would take a risk and back it out later if
> something breaks, because I don't trust our tests to cover everything.
> 
> Is there a specific reason why you find this issue scary? I can be totally
> missing something and underestimate the severity of this problem...

Just the fact that I don't know what's causing the problem makes me worry that
it might not be limited to tests, and will wind up biting us in production.
But at this point I'd be willing to disable the preallocated process manager
for those tests, since I really don't want to block this from landing.

You might want to test again since bug 1317697 landed, though. That changed a
lot of things about the way that we handle content process
initialization/deinitialization, and definitely fixed at least one or two bugs
in the process.
(In reply to Kris Maglione [:kmag] from comment #47)
> Sorry, I completely missed your last reply. It was in the middle of a work
> week...

No worries, I had a couple of more urgent release criteria related work so I haven't got time for this, and won't have this week either. That's why I didn't needinfo/pinged you :)

> 
> (In reply to Gabor Krizsanits [:krizsa :gabor] from comment #46)
> Just the fact that I don't know what's causing the problem makes me worry
> that
> it might not be limited to tests, and will wind up biting us in production.
> But at this point I'd be willing to disable the preallocated process manager
> for those tests, since I really don't want to block this from landing.

We'll see, I will give it another try to fix all the tests next week, there were a few patches landed that might helped fixing this issue.

> 
> You might want to test again since bug 1317697 landed, though. That changed a
> lot of things about the way that we handle content process
> initialization/deinitialization, and definitely fixed at least one or two
> bugs
> in the process.

That's great news! I will be working on this next week and find a way to land it.
Comment on attachment 8846766 [details] [diff] [review]
part1: Stop rebroadcasting BlobURLUnregistration. v1

54 was merged to Beta today.
Attachment #8846766 - Flags: approval-mozilla-aurora? → approval-mozilla-beta?
See Also: → 1358899
Iteration: --- → 55.3 - Apr 17
Comment on attachment 8846766 [details] [diff] [review]
part1: Stop rebroadcasting BlobURLUnregistration. v1

sorry for the delay.  beta54+
Attachment #8846766 - Flags: approval-mozilla-beta? → approval-mozilla-beta+
(In reply to Gabor Krizsanits [:krizsa :gabor] from comment #34)
> [Is this code covered by automated tests?]: Yes
> [Has the fix been verified in Nightly?]: I tested it manually, landing it on
> nightly just now.
> [Needs manual test from QE? If yes, steps to reproduce]: It would be quite
> hard to test it without a debugger... so not really

Setting qe-verify- based on Gabor's assessment on manual testing needs and the fact that this fix has automated coverage.
Flags: qe-verify-
With an old build forcing sequential execution and verbose logging I could reproduce the timeout locally (happens very rarely...)

It's the same error that was mentioned in Comment 31 and somehow I missed it, because usually I don't see this error in the logs but that could be because of some filtering.


 0:12.69 TEST_STATUS: Thread-4 test_get_ui_language PASS [test_get_ui_language : 246] Got expected getUILanguage result in contentScript - Expected: he, Actual: he - true == true
 0:12.70 TEST_STATUS: Thread-4 test_get_ui_language PASS [test_get_ui_language : 246] Got expected getMessage result in contentScript - Expected: he, Actual: he - true == true
 0:12.71 LOG: Thread-4 INFO (xpcshell/head.js) | test run_next_test 3 pending (2)
 0:12.71 LOG: Thread-4 INFO (xpcshell/head.js) | test test_get_ui_language finished (2)
 0:12.71 LOG: Thread-4 INFO xpcshell-remote.ini:toolkit/components/extensions/test/xpcshell/test_ext_i18n.js | Starting test_detect_language
 0:12.71 LOG: Thread-4 INFO (xpcshell/head.js) | test test_detect_language pending (2)
 0:12.71 LOG: Thread-4 INFO "Extension attached"
 0:12.71 LOG: Thread-4 INFO (xpcshell/head.js) | test run_next_test 3 finished (2)
 0:12.71 PROCESS_OUTPUT: Thread-4 (pid:7728) "[Child 8648] WARNING: pipe error: 109: file c:/development/mozilla-central/ipc/chromium/src/chrome/common/ipc_channel_win.cc, line 346"
 0:12.72 PROCESS_OUTPUT: Thread-4 (pid:7728) "[Child 8648] WARNING: pipe error: 109: file c:/development/mozilla-central/ipc/chromium/src/chrome/common/ipc_channel_win.cc, line 346"
 0:12.80 LOG: Thread-4 INFO "CONSOLE_MESSAGE: (error) [JavaScript Error: "No WebExtension found with ID {409e4957-7ec4-42ca-8e50-f267e640a7c8}" {file: "resource://gre/modules/ExtensionParent.jsm" line: 530}]
createProxyContext@resource://gre/modules/ExtensionParent.jsm:530:13
receiveMessage@resource://gre/modules/ExtensionParent.jsm:498:11
_do_main@c:\\development\\mozilla-central\\testing\\xpcshell\\head.js:211:5
_execute_test@c:\\development\\mozilla-central\\testing\\xpcshell\\head.js:546:5
@-e:1:1

After this error the test runs just fine except the hang at the very end:

 0:13.02 LOG: Thread-4 INFO (xpcshell/head.js) | test run_next_test 4 pending (2)
 0:13.02 LOG: Thread-4 INFO (xpcshell/head.js) | test test_detect_language finished (2)
 0:13.02 LOG: Thread-4 INFO (xpcshell/head.js) | test run_next_test 4 finished (1)
 0:13.02 LOG: Thread-4 INFO exiting test
 0:13.03 PROCESS_OUTPUT: Thread-4 (pid:7728) "[Child 7984] WARNING: pipe error: 109: file c:/development/mozilla-central/ipc/chromium/src/chrome/common/ipc_channel_win.cc, line 346"
 0:13.03 PROCESS_OUTPUT: Thread-4 (pid:7728) "[Child 8196] WARNING: pipe error: 109: file c:/development/mozilla-central/ipc/chromium/src/chrome/common/ipc_channel_win.cc, line 346"

Now I'm trying to understand how is possible that the extension is in a state that a child process can send a API:CreateProxyContext with its ID but the parent's GlobalManager is not aware of that extension yet.

I'm trying to understand what happens exactly in ExtensionTestUtils.loadExtension that might result something like this before we get to the extension.startup() but I could not find anything yet. Anyway, I'm going to try to update my tree and reproduce the issue with the latest version, but if you have any hints that would be great. (or if I'm on the wrong track and this error should be just ignored and a well known issue that is totally unrelated)
Flags: needinfo?(kmaglione+bmo)
It depends on how old the build is.

I recently fixed a bug where we weren't removing extensions from the initial process data after they shut down, so depending on when processes started, they would wind up trying to execute code for extensions that didn't exist anymore.

If that's what you're running into, the problem isn't that the extensions hasn't started yet, but that it's already started and also already been destroyed.

If it's not, I'd probably need to try to reproduce it myself to tell you more.
Flags: needinfo?(kmaglione+bmo)
Thanks I think that was it. Anyway, with some work I figured out a way to reproduce the issue reliably locally by choosing the right timing. If we launch the preallocated process after the test, so after _do_main, but before leaving _execute_test in xpcshell/head.js the process->LaunchSubprocess call just gets deadlocked. It should be a lot easier from here :)

(In reply to Kris Maglione [:kmag] from comment #54)
> It depends on how old the build is.
> 
> I recently fixed a bug where we weren't removing extensions from the initial
> process data after they shut down, so depending on when processes started,
> they would wind up trying to execute code for extensions that didn't exist
> anymore.
> 
> If that's what you're running into, the problem isn't that the extensions
> hasn't started yet, but that it's already started and also already been
> destroyed.
> 
> If it's not, I'd probably need to try to reproduce it myself to tell you
> more.
(In reply to Gabor Krizsanits [:krizsa :gabor] from comment #55)
> process->LaunchSubprocess call just gets deadlocked. It should be a lot
> easier from here :)

Once we started SW shutdown we should not try to launch a new process. I've added a listener for "profile-change-teardown" in the Preallocated PM to shut itself down.

Looks OK on try: https://treeherder.mozilla.org/#/jobs?repo=try&revision=e981a945b2cab0760f28f4b20f07db7c016e7e5c&selectedJob=95992016

(hard to tell because of some known intermittents...)
Pushed by gkrizsanits@mozilla.com:
https://hg.mozilla.org/integration/mozilla-inbound/rev/b62eed78de3d
Use the preallocated process manager by default. r=billm, r=kmag
Depends on: 1361808
https://hg.mozilla.org/mozilla-central/rev/b62eed78de3d
Status: NEW → RESOLVED
Closed: 2 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla55
It was mentioned in Mozillazine forum that this bug may be the cause of a great deal of slowness. 

Is there any way to test for this to confirm either yes or no?

Thank you,

(a Firefox "tester")
(In reply to Worcester12345 from comment #59)
> It was mentioned in Mozillazine forum that this bug may be the cause of a
> great deal of slowness. 
> 
> Is there any way to test for this to confirm either yes or no?
> 
> Thank you,
> 
> (a Firefox "tester")

Thanks for reporting it! Could you link me to the Mozillazine conversation where this came up? It might be some add-on that the patch broke... otherwise it should not cause any slowness as much as I can tell and our related perf tests.

You can turn it off by setting dom.ipc.processPrelaunch.enabled to false in about:config.
Flags: needinfo?(worcester12345)
(In reply to Gabor Krizsanits [:krizsa :gabor] from comment #60)
> (In reply to Worcester12345 from comment #59)
> > It was mentioned in Mozillazine forum that this bug may be the cause of a
> > great deal of slowness. 
> > 
> > Is there any way to test for this to confirm either yes or no?
> > 
> > Thank you,
> > 
> > (a Firefox "tester")
> 
> Thanks for reporting it! Could you link me to the Mozillazine conversation
> where this came up? ...

http://forums.mozillazine.org/viewtopic.php?f=23&t=3029999&start=15
Flags: needinfo?(worcester12345)
https://hg.mozilla.org/integration/mozilla-inbound/rev/5c78702d43ddccee9527230f59e0e0960b188aef
Bug 1341008: Follow-up: Don't needlessly load ExtensionContent.jsm into all processes.
Hello,
now in nightly 55 the start-up is way faster, thanks for this good job! (I suppose it's due to this fixed bug)
I see a ts_paint improvement on windows 8!

== Change summary for alert #6933 (as of May 28 2017 21:50 UTC) ==

Improvements:

  2%  ts_paint windows8-64 opt e10s     887.44 -> 868.83

For up to date results, see: https://treeherder.mozilla.org/perf.html#/alerts?id=6933
Yeah, this originally landed with some stale code that regressed ts_paint by loading code that wasn't needed.
See Also: → 1349840
You need to log in before you can comment on or make changes to this bug.