Closed Bug 845701 Opened 12 years ago Closed 9 years ago

Reevaluate necko remoting in the context of bug 761935

Categories: Core :: Networking, defect
Platform: x86_64 Linux
Severity: normal
Status: RESOLVED WONTFIX
Reporter: cjones; Unassigned

We chose to remote necko at the data level because we wanted to share HTTP caches etc., and because that approach was the least invasive to the existing code.

(There was also a concern about per-process NSS that I don't recall.)

However, this adds an unknown overhead to the http stack (bug 648433).  It also means that problems like bug 828997 currently jank the entire gecko process tree.  We should of course fix those, but it's not good that the design is vulnerable to them.

The "other" approach to remote low-privilege necko, that we chose not to follow because of above, would have been to proxy requests for *resources* to the parent process and then use those directly in the child.  E.g. ask the parent to hand us back a socket fd for foo.com, and then do what we need with it.
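To make the fd-handoff idea concrete, here is a minimal sketch (not Gecko code) of a parent shipping a connected socket descriptor to a child over a Unix socketpair. It assumes Python 3.9+ on a Unix platform for `socket.send_fds`/`socket.recv_fds`, and uses a local socketpair as a stand-in for the real connection to foo.com:

```python
import socket

# Control channel between "parent" and "child" (would be an IPC channel in Gecko).
parent, child = socket.socketpair(socket.AF_UNIX, socket.SOCK_DGRAM)

# Parent side: a local socketpair stands in for the real connection to foo.com.
net_a, net_b = socket.socketpair()
# Ship the descriptor to the child via SCM_RIGHTS ancillary data.
socket.send_fds(parent, [b"fd for foo.com"], [net_b.fileno()])

# Child side: receive the descriptor and use the socket directly.
msg, fds, _, _ = socket.recv_fds(child, 1024, maxfds=1)
received = socket.socket(fileno=fds[0])

# Bytes written on the "network" end now reach the child with no proxying.
net_a.sendall(b"HTTP/1.1 200 OK\r\n")
data = received.recv(64)
```

Once the child holds the descriptor, data flows without any per-chunk hop through the parent, which is exactly the overhead this approach avoids.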

When processes are nested one more level, the proxying overhead goes up to 2x, so this will be a more important concern.  The proxy will also put level-1-process jank on the paths that network data travels for level-2-process, which is also not desirable.  E.g., browser UI hits a GC, throughput in youtube.com goes to zero.  (This is currently the case for the browser app, but the corners we cut in v1 require browser app to be a model citizen so it's less of an issue.)

We need to make sure we hit bug 761935 without regressing the primary use case, browser app.
(In reply to Chris Jones [:cjones] [:warhammer] from comment #0)
> We chose to remote necko at the data level because we've wanted to share
> HTTP caches etc., and because that approach was least invasive to the
> existing code.
> 
> (There was also a concern about per-process NSS that I don't recall.)

IMO, if we are at all constrained for memory (which we obviously are), we should continue with the current design and fix the core issues (jank).

> However, this adds an unknown overhead to the http stack (bug 648433).

If the e10s Gecko stack has high overhead, I think there is likely a lot of work we can do to improve that (e.g. more efficient IPDL protocols, more efficient IPC mechanisms like shared memory).
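As an illustration of the shared-memory direction (a sketch only, not a proposal for the actual IPDL mechanism), bulk payload bytes can sit in a shared segment so that only a small name/length message needs to cross the IPC channel:

```python
from multiprocessing import shared_memory

payload = b"x" * (1 << 20)  # stand-in for a 1 MiB response body

# "Parent" side: place the payload in a shared segment once.
shm = shared_memory.SharedMemory(create=True, size=len(payload))
shm.buf[:len(payload)] = payload

# "Child" side: attach by name and read without copying the body over the
# channel; only the tiny (name, length) message would cross IPC.
view = shared_memory.SharedMemory(name=shm.name)
received = bytes(view.buf[:len(payload)])

view.close()
shm.close()
shm.unlink()
```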

> It also means that problems like bug 828997 currently jank the entire
> gecko process tree.  We should of course fix those, but it's not good
> that the design is vulnerable to them.

bug 828997 and all those cache bugs need to be fixed for desktop Firefox anyway.

> The "other" approach to remote low-privilege necko, that we chose not to
> follow because of above, would have been to proxy requests for *resources*
> to the parent process and then use those directly in the child.  E.g. ask
> the parent to hand us back a socket fd for foo.com, and then do what we need
> with it.

This would greatly limit the security benefit of process separation. With the current design, we can (but don't *yet*) ensure that HttpOnly cookies (typically auth cookies are HttpOnly) cannot be stolen by a compromised child process. Further, that socket fd is not for foo.com, but rather for whatever IP address that foo.com happened to resolve to at that time. The compromised child process could send HTTP requests on that socket with any "Host: " header it wants, completely violating origin restrictions (including, but not limited to, CSP).
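A toy illustration of that last point: nothing about a raw socket ties the bytes written to it to the origin the fd was vended for, so a compromised child could construct a request like the following (the hostname here is hypothetical):

```python
def http_request(host: str, path: str = "/") -> bytes:
    # Nothing ties `host` to the IP address the socket actually connects to;
    # the child can claim any origin it likes on the wire.
    return (
        f"GET {path} HTTP/1.1\r\n"
        f"Host: {host}\r\n"
        "Connection: close\r\n\r\n"
    ).encode()

# The socket was vended for foo.com, but the request claims another origin:
forged = http_request("accounts.example.com")
```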

> When processes are nested one more level, the proxying overhead goes up to
> 2x, so this will be a more important concern.  The proxy will also put
> level-1-process jank on the paths that network data travels for
> level-2-process, which is also not desirable.  E.g., browser UI hits a GC,
> throughput in youtube.com goes to zero.  (This is currently the case for the
> browser app, but the corners we cut in v1 require browser app to be a model
> citizen so it's less of an issue.)

You know more than I do about the specific requirements, but I am not convinced that the browser needs to be composed of nested child processes. Instead, I can see a design where the browser chrome lives in a specially-privileged *peer* (not parent) process to the browser content processes, and where the content processes communicate directly with the root parent process. One way of thinking about it is that the browser process could spawn one transient "app" per tab, and those apps would be managed much like real apps, and the browser chrome app can control its peer browser content processes, perhaps indirectly by sending commands to the parent process. I would not be surprised to find that this is not only more efficient, but also easier for us to implement.

I strongly suggest that the B2G team talk to the security engineering team about this work. The security engineering team is extremely interested in this, including working on the implementation.
> When processes are nested one more level, the proxying overhead goes up to 2x

If there's any way to avoid doing two levels of IPDL traffic for the nested process case (at least when we know from the start that the traffic target is the chrome process) we should really do that.  Seems really undesirable to add another IPC hop with all the (high average, high variability) latency that adds.

I agree that we should look into benchmarking and optimizing e10s necko. Besides shared memory we might also want to look into off-main-thread IPDL send/receipt. Once bug 497003 happens, the amount of inter-thread event passing under e10s will be much higher than on desktop: we could be passing payload data directly from the socket thread to the IPDL thread (and, on the child, from the IPDL thread to the HTML parser thread) but will have to hop to the main thread in between because of IPDL's main-thread-only rule. Should we open a bug for off-main-thread IPDL?
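The thread topology being described can be sketched abstractly (thread names and the queue are illustrative only, not IPDL): payload hops from a socket thread straight to a parser thread, with no main-thread trampoline in between:

```python
import queue
import threading

ipdl_queue: "queue.Queue[bytes]" = queue.Queue()
parsed = []

def socket_thread():
    # Would normally read from the network; here we fake three chunks.
    for chunk in (b"<html>", b"<body>hi</body>", b"</html>"):
        ipdl_queue.put(chunk)
    ipdl_queue.put(None)  # end-of-stream sentinel

def parser_thread():
    # Consumes payload directly, without bouncing through a main loop.
    while (chunk := ipdl_queue.get()) is not None:
        parsed.append(chunk)

t1 = threading.Thread(target=socket_thread)
t2 = threading.Thread(target=parser_thread)
t1.start(); t2.start()
t1.join(); t2.join()
```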

My sympathies are with Brian re: the security issue of having child processes do the socket I/O themselves.  Chrome seems to be zippy enough having the parent do all the networking.  I suspect the memory cost of having NSS in each child might be high (init is not cheap either IIRC).
We can't skip the second hop in the current design because the top-level process won't have a TabParent for the nested process.  It's trivial to implement at the IPC level, but security implications need to be thought through.
Also note that disk caching makes the "pass fd to child" case more complex.  Almost all necko loads either read from or write to the cache.  Reads might be simple enough (pass the cache fd to the child, with some IPC to release the entry when done).  Writes might involve schlepping the content to the parent.  Certainly anything that involves grabbing the global cache lock ought to be done on the parent: allowing children to take the lock would probably be slow, and would allow a trivial DoS attack.  There are possibly other reasons why write I/O ought to happen only on the parent (I don't know offhand whether the cache block files are shared between different appids: I suspect they probably are, and writing into other apps' caches == bad).
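A hedged sketch of the read path suggested above (not the actual cache code; assumes Python 3.9+ on Unix, and the file here stands in for a cache entry): the parent opens the entry read-only, hands the child the descriptor, and the child signals release over the control channel when done:

```python
import os
import socket
import tempfile

# Stand-in for an on-disk cache entry.
with tempfile.NamedTemporaryFile(delete=False) as f:
    f.write(b"cached response body")
    cache_path = f.name

parent, child = socket.socketpair(socket.AF_UNIX, socket.SOCK_DGRAM)

# Parent: open the entry read-only and pass the fd to the child.
fd = os.open(cache_path, os.O_RDONLY)
socket.send_fds(parent, [b"cache entry"], [fd])

# Child: read directly from the descriptor, then signal release via IPC.
_, fds, _, _ = socket.recv_fds(child, 1024, maxfds=1)
body = os.read(fds[0], 4096)
os.close(fds[0])
child.send(b"release")

released = parent.recv(16)
os.close(fd)
os.unlink(cache_path)
```

The release message is the small piece of IPC the comment mentions; the bulk read itself never crosses the channel.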
Status: NEW → RESOLVED
Closed: 9 years ago
Resolution: --- → WONTFIX