Closed Bug 1231222 Opened 4 years ago Closed Last year

perform service worker interception in parent process nsHttpChannel

Categories

(Core :: DOM: Service Workers, defect, P2)

defect

Tracking


RESOLVED WORKSFORME

People

(Reporter: bkelly, Assigned: asuth)

References

(Blocks 6 open bugs)

Details

(Whiteboard: [e10s-multi:-])

Attachments

(3 files, 16 obsolete files)

106.74 KB, patch
Details | Diff | Splinter Review
5.96 KB, patch
asuth
: feedback+
Details | Diff | Splinter Review
10.12 KB, patch
asuth
: review+
Details | Diff | Splinter Review
As part of our service worker e10s redesign we would like to perform network interception in the parent.  This means essentially just doing the nsHttpChannel interception.  All of the interception code in HttpChannelChild and HttpChannelParent could be removed.
Attached patch WIP (obsolete) — Splinter Review
This makes test_fetch_event.html pass in e10s mode, but there are still a bunch of timeouts and assertion failures when running the WPT service-worker suite.
Assignee: nobody → josh
Status: NEW → ASSIGNED
Attached patch WIP (obsolete) — Splinter Review
Checkpoint. Most WPT tests are passing now; I'm winnowing down the remaining 8 failing tests.
Attachment #8718165 - Attachment is obsolete: true
Attached patch WIP (obsolete) — Splinter Review
Removing some old debugging.
Attachment #8722762 - Attachment is obsolete: true
Note to self:
/_mozilla/service-workers/service-worker/appcache-ordering-main.https.html
--------------------------------------------------------------------------
CRASH [Parent]
/_mozilla/service-workers/service-worker/clients-get-cross-origin.https.html
----------------------------------------------------------------------------
TIMEOUT Test Clients.get() cross origin
TIMEOUT [Parent]
/_mozilla/service-workers/service-worker/fetch-canvas-tainting-cache.https.html
-------------------------------------------------------------------------------
FAIL Verify canvas tainting of fetched image in a Service Worker
/_mozilla/service-workers/service-worker/fetch-canvas-tainting.https.html
-------------------------------------------------------------------------
FAIL Verify canvas tainting of fetched image in a Service Worker
/_mozilla/service-workers/service-worker/fetch-cors-xhr.https.html
------------------------------------------------------------------
FAIL Verify CORS XHR of fetch() in a Service Worker
/_mozilla/service-workers/service-worker/fetch-request-xhr.https.html
---------------------------------------------------------------------
TIMEOUT Verify the body of FetchEvent using XMLHttpRequest
TIMEOUT [Parent]
/_mozilla/service-workers/service-worker/navigation-redirect.https.html
-----------------------------------------------------------------------
TIMEOUT [Parent]
/_mozilla/service-workers/service-worker/shared-worker-controlled.https.html
----------------------------------------------------------------------------
TIMEOUT Verify SharedWorker construction is controlled by a Service Worker
Attached patch WIP (obsolete) — Splinter Review
Fixed half of the WPT errors. The only ones left are the timeouts and the crash.
Attachment #8722772 - Attachment is obsolete: true
Attached patch WIP (obsolete) — Splinter Review
Checkpoint. Only the timeouts in clients-get-cross-origin.https.html and shared-worker-controlled.https.html, then it will be time for another try run.
Attachment #8723195 - Attachment is obsolete: true
Blocks: 1262224
Whiteboard: [e10s-multi:M1]
Some notes from testing the current set of patches:

1) Devtools network monitor no longer reports that requests were handled by "service worker".  Instead it says "cached".
2) When I try to refresh https://blog.wanderview.com with the devtools network monitor open, stylesheets never load.
3) We have pretty massive regressions on this benchmark: https://jakearchibald.github.io/service-worker-benchmark/

The cached worker case seems about 50% worse (~300ms regressed to ~450ms on my machine).  Chrome is at ~375ms on this case.

The data URL case is about 500% worse (~250ms regressed to ~1300ms).  Chrome is at ~320ms on this case.

I think we expect some perf regression as we do the e10s refactor, but these seem a bit large.
I'm going to see if I can hack in body streaming instead of store-and-send.  It seems likely we are missing out on a lot of parallelism without that change.
I think I understand why these patches trigger some failures in:

https://dxr.mozilla.org/mozilla-central/source/testing/web-platform/tests/service-workers/service-worker/unregister-controller.https.html

The issue is that we SendUnregister() to the parent immediately when we set the uninstalling flag in the child:

https://dxr.mozilla.org/mozilla-central/source/dom/workers/ServiceWorkerUnregisterJob.cpp#131

We don't wait for the parent to get it, though, so it's possible this races with other parts of the test.

I believe with these patches the parent will see the RecvUnregister() and stop thinking that registration is controlling anything.

Our current code is designed around the idea that the SendUnregister() is really there to evict the registration from disk.  So if we restart the browser with the uninstalling flag set, then the registration is gone on restart.

Now that we are trying to do more in the parent we might need to change this to "SendSetUninstallingFlag".  We could either still evict from disk or possibly store an uninstalling flag in the file.  Then on restart we would remove any registrations with uninstalling.
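To make the "SendSetUninstallingFlag" idea concrete, here is a minimal sketch of persisting an uninstalling flag with the registration record and purging flagged registrations on restart. All names and structures here are illustrative, not the actual ServiceWorkerManager API:

```cpp
#include <cassert>
#include <map>
#include <string>

struct RegistrationRecord {
  std::string scope;
  bool uninstalling = false;
};

class RegistrationStore {
 public:
  void Add(const std::string& aScope) {
    mRecords[aScope] = RegistrationRecord{aScope, false};
  }

  void SetUninstallingFlag(const std::string& aScope) {
    auto it = mRecords.find(aScope);
    if (it != mRecords.end()) {
      it->second.uninstalling = true;  // persisted instead of evicted now
    }
  }

  // On browser restart, drop any registration still marked uninstalling.
  void PurgeUninstallingOnStartup() {
    for (auto it = mRecords.begin(); it != mRecords.end();) {
      it = it->second.uninstalling ? mRecords.erase(it) : ++it;
    }
  }

  bool Has(const std::string& aScope) const {
    return mRecords.count(aScope) != 0;
  }

 private:
  std::map<std::string, RegistrationRecord> mRecords;
};
```

Either way the registration disappears across a restart with the flag set; the difference is that the parent keeps an authoritative record in the meantime.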

Josh, what do you think?
Flags: needinfo?(josh)
Keep in mind that for subresource requests, the parent relies on TabParent to inform us as to whether the corresponding document in the child is controlled. This occurs independently of SendUnregister, via MaybeStopControlling.
Additionally, I have a patch locally which delays the resolution of the unregister() promise until the parent sends back confirmation that the unregistration was processed in the parent. It caused other failures in that test, unfortunately :<
(In reply to Josh Matthews [:jdm] from comment #17)
> Keep in mind that for subresource requests, the parent relies on TabParent
> to inform us as to whether the corresponding document in the child is
> controlled. This occurs independently of SendUnregister, via
> MaybeStopControlling.

I'm trying to investigate this.  Right now, though, I think our current implementation is correct.  So I suspect a bug in the new patch.
So the problem with unregister-controller.https.html is:

1) We are storing mIsControlled on the TabChild/TabParent actor.
2) I believe there is only one TabChild/TabParent actor for a given window tab.
3) The test runs three tests that create controlled iframes within the same window tab.
4) Sometimes one of the controlled iframes from the other test cases goes uncontrolled while executing the first test case.

I believe TabChild/TabParent is shared here because I logged all the actors used to call SendMarkDocumentControlled().  The same pointer is used throughout the entire test.
It seems the controlled state of the document must always be available in the child.  Even after our re-architecture the window must have synchronous access to navigator.serviceWorker.controller.

Therefore, can we simply attach a "controlled by service worker ID 1234" annotation to the nsIChannel and send it across to the parent?  In theory, we could make this decision completely client-side for subresource loads.  (I think :asuth has suggested this in the past as well.)
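A minimal sketch of that "controlled by service worker ID" annotation, with purely illustrative types (the real nsIChannel/LoadInfo plumbing is far more involved):

```cpp
#include <cassert>
#include <cstdint>
#include <optional>

// Illustrative stand-in for channel metadata sent child -> parent.
struct ChannelLoadInfo {
  std::optional<uint64_t> controllerId;  // set only for controlled clients
};

// Parent-side decision for a subresource load: no registration lookup is
// needed, the channel already says whether its client is controlled.
bool ShouldInterceptSubresource(const ChannelLoadInfo& aInfo) {
  return aInfo.controllerId.has_value();
}
```

This keeps the controlled state synchronously available in the child (for navigator.serviceWorker.controller) while still letting the parent make the interception call.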
Erk, the mismatch between documents and PBrowser makes perfect sense. Humbug.
Flags: needinfo?(josh)
Some perf notes about where time is going.  Here I am using an optimized DEBUG build so I can printf_stderr to a log file.  In this build config we regress from ~850ms to ~1250ms for the cached worker.

In these traces:

SWI START = service worker interception started with a FetchEvent
SWR START = service worker respondWith() called
SWR STOP = body stream passed to respondWith() is complete
IMG START/DATA/STOP = nsIStreamListener callbacks to imgLoader.cpp ProxyListener class
Times are milliseconds from the creation of the nsIChannel

The current e10s necko code produces these timings for the first image in the benchmark:

### ### IMG   NEW https://jakearchibald.github.io/service-worker-benchmark/images/0.png
### ### 0.370 SWI START https://jakearchibald.github.io/service-worker-benchmark/images/0.png
### ### 6.541 SWR START https://jakearchibald.github.io/service-worker-benchmark/images/0.png
### ### 6.683 SWR  STOP https://jakearchibald.github.io/service-worker-benchmark/images/0.png
### ### 229.774 IMG START https://jakearchibald.github.io/service-worker-benchmark/images/0.png
### ### 229.848 IMG  DATA https://jakearchibald.github.io/service-worker-benchmark/images/0.png
### ### 229.913 IMG  STOP https://jakearchibald.github.io/service-worker-benchmark/images/0.png

The parent-side interception produces these timings:

### ### IMG   NEW https://jakearchibald.github.io/service-worker-benchmark/images/0.png
### ### 370.549 SWI START https://jakearchibald.github.io/service-worker-benchmark/images/0.png
### ### 373.609 SWR START https://jakearchibald.github.io/service-worker-benchmark/images/0.png
### ### 373.732 SWR  STOP https://jakearchibald.github.io/service-worker-benchmark/images/0.png
### ### 619.406 IMG START https://jakearchibald.github.io/service-worker-benchmark/images/0.png
### ### 619.525 IMG  DATA https://jakearchibald.github.io/service-worker-benchmark/images/0.png
### ### 619.653 IMG  STOP https://jakearchibald.github.io/service-worker-benchmark/images/0.png

Overall we are about 390ms slower to load this first image.  Almost all of this time (370ms) is lost between starting the request in imgLoader.cpp and when we fire the FetchEvent in the service worker.
I added some more probes:

SAO = send async open from child
RAO = receive async open in parent
SDF = send dispatch fetch from parent
RDF = receive dispatch fetch in child
SWI START = service worker interception started with a FetchEvent
SWR START = service worker respondWith() called
SWR STOP = body stream passed to respondWith() is complete
SSR = send synthesize response from child
RSR = receive synthesize response in parent
IMG START/DATA/STOP = nsIStreamListener callbacks to imgLoader.cpp ProxyListener class

### ### IMG   NEW https://jakearchibald.github.io/service-worker-benchmark/images/0.png
### ### 0.279 NET SAO https://jakearchibald.github.io/service-worker-benchmark/images/0.png
### ### 29.646 NET RAO https://jakearchibald.github.io/service-worker-benchmark/images/0.png
### ### 67.442 NET SDF https://jakearchibald.github.io/service-worker-benchmark/images/0.png
### ### 182.982 NET RDF https://jakearchibald.github.io/service-worker-benchmark/images/0.png
### ### 183.157 SWI START https://jakearchibald.github.io/service-worker-benchmark/images/0.png
### ### 190.017 SWR START https://jakearchibald.github.io/service-worker-benchmark/images/0.png
### ### 190.163 SWR  STOP https://jakearchibald.github.io/service-worker-benchmark/images/0.png
### ### 224.278 NET SSR https://jakearchibald.github.io/service-worker-benchmark/images/0.png
### ### 447.756 NET RSR https://jakearchibald.github.io/service-worker-benchmark/images/0.png
### ### 594.748 IMG START https://jakearchibald.github.io/service-worker-benchmark/images/0.png
### ### 594.875 IMG  DATA https://jakearchibald.github.io/service-worker-benchmark/images/0.png
### ### 595.025 IMG  STOP https://jakearchibald.github.io/service-worker-benchmark/images/0.png

The vast majority of the time is being lost in the IPC hops:

send/recv async open diff:           ~30ms
send/recv dispatch fetch diff:      ~115ms
send/recv synthesize response diff: ~225ms

This is almost certainly due to a busy main thread in the child and parent.
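Those hop numbers are just the receive-minus-send deltas of the probe timestamps; restated as a trivial check (values copied from the trace above):

```cpp
#include <cassert>
#include <cmath>

// Probe values are milliseconds from nsIChannel creation; each hop's cost
// is simply the receive timestamp minus the send timestamp.
double HopMs(double aSendMs, double aRecvMs) { return aRecvMs - aSendMs; }
```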
Oh, I should mention these timings are with what looked like an unnecessary dispatch to the main thread already removed.  The patches add an AsyncNotifyController runnable dispatched to the same thread in nsHttpChannel::OpenCacheEntry().  That adds noticeable delay for this workload since the main thread appears to be saturated.
Here is a profile I'm digging into:

https://cleopatra.io/#report=03ca7a5d631e37ea3d638d62f6c06a0b32a52727

So far it seems the SendDispatchFetch back to child is janking on RemoteAddonsChild.jsm doing a ton of sync IPC.  Not sure if that's GeckoProfiler itself or some other addon.
(In reply to Ben Kelly [:bkelly] from comment #27)
> So far it seems the SendDispatchFetch back to child is janking on
> RemoteAddonsChild.jsm doing a ton of sync IPC.  Not sure if thats
> GeckoProfiler itself or some other addon.

This is caused by GeckoProfiler itself thanks to some stupidity in addon-sdk.
An updated profile without this sync RPC stuff:

https://cleopatra.io/#report=8547e20a97878e1bd4b7fdde02450eb2dd94ddd8

It appears our initial SendDispatchFetch() IPCs back to the child are blocked on html parsing runnables that take ~100ms.
(In reply to Ben Kelly [:bkelly] from comment #29)
> It appears our initial SendDispatchFetch() IPCs back to the child are
> blocked on html parsing runnables that take ~100ms.

In theory we could avoid this sort of thing by dispatching to a dedicated worker process.
Here is an opt profile:

https://cleopatra.io/#report=8b52aa6bdb297f4e7389b02931638a00048eef69

The HTML parsing jank is about 50ms in the opt build.

The parent process, however, spends about 75ms handling RecvPredLearn() messages.  Maybe we can delay predictor learning until after the load completes.
Depends on: 1295565
Commenting out the Predictor::Learn() logic gets us down from ~450ms to around ~375ms.  This puts us roughly equivalent to Chrome's current behavior.
I have patches that dispatch the fetch event in another content process, but it didn't really help much compared to comment 32.  I wrote some more thoughts here:

https://github.com/w3c/ServiceWorker/issues/756#issuecomment-242217294

From working on this prototype I have formed some opinions on perhaps where the code should go. For example, I think it would be preferable to dispatch the fetch event using a separate actor so that we can move the target to a different process in the future.  This is mostly doable, but does require some changes like passing a "service worker identifier" through as meta-data on an intercepted channel.

Josh, since you are splitting your time with other things, would you like me or Andrew to take this bug?
Flags: needinfo?(josh)
Whiteboard: [e10s-multi:M1] → [e10s-multi:M3]
Blocks: 1300464
I picked this up again a few days ago and I'm starting with a new patch that checks for interception in the child process, reusing pieces of the old one as it makes sense. Just got it passing all of test_fetch_events.html.
Flags: needinfo?(josh)
I think wires may have gotten crossed as plans changed.

The current ServiceWorker remoting patch I'm working on right now is to remote ServiceWorkerPrivate to a dedicated child process from the parent.  We enforce an abstraction boundary such that ServiceWorkerManager expects an (Internal)Request to characterize the request, not a channel.  Likewise, the response is an (Internal)Response.  At this interaction boundary we hope for nsIInterceptedChannel to be able to generate the request and consume the response.

Which is to say, I don't think we want any interception in the child anymore.  I think consistent with your (:jdm's) desires, this enables moving the intercept logic exclusively to nsHttpChannel and wholesale elimination of the complexity of nsINetworkInterceptController and its related actors.

There may be some gaps in my conceptualization of this; it's been slow going for a few reasons and I'm not fully at the strawman hookup of this yet.  I do plan to have this resolved and to be in Toronto from Oct 11-14 when Catalin is there.  Ben, I believe, will also be back from PTO at that point, so it might be preferable to defer any deep discussions until then, although I can absolutely attempt to further clarify what I'm doing/trying to do currently if that helps.
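As a rough sketch of the abstraction boundary described above, where ServiceWorkerManager only ever sees a request going out and a response coming back (types here are illustrative stand-ins for (Internal)Request, (Internal)Response, and nsIInterceptedChannel):

```cpp
#include <cassert>
#include <optional>
#include <string>
#include <utility>

// Illustrative stand-ins for (Internal)Request and (Internal)Response.
struct InternalRequest {
  std::string url;
  std::string method;
};

struct InternalResponse {
  int status = 0;
  std::string body;
};

// The SWM-facing boundary: it characterizes the load as a request and
// consumes a response, never touching a channel directly.
class InterceptedChannelBoundary {
 public:
  explicit InterceptedChannelBoundary(InternalRequest aRequest)
      : mRequest(std::move(aRequest)) {}

  const InternalRequest& GetRequest() const { return mRequest; }

  void SynthesizeResponse(InternalResponse aResponse) {
    mResponse = std::move(aResponse);
  }

  bool HasResponse() const { return mResponse.has_value(); }
  const InternalResponse& GetResponse() const { return *mResponse; }

 private:
  InternalRequest mRequest;
  std::optional<InternalResponse> mResponse;
};
```

Keeping the boundary channel-free is what lets the fetch event target move to a different (even dedicated) process later without touching the interception logic.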
Well, if the network request is from a controlled client then the "interception check" is effectively done in the child.  The controlling service worker unambiguously will intercept and we know that in the child.

Is that what you meant, Josh?
Yes. Andrew and I chatted after the last message and it sounds like we're working on orthogonal, complementary pieces.
Andrew, if I understand correctly, you suggested in our last conversation that the check for whether to intercept a non-subresource request would need to be performed in the parent process. However, a naive way of performing that check (i.e. duplicating the relevant check from nsDocShell::ShouldPrepareForIntercept, which invokes ServiceWorkerManager::IsAvailable) fails because the resulting registration returns a null pointer for GetActive. Is this just a race (i.e. background threads managing SW registration in the parent haven't finished before the content process' promise from register() resolves), or do I need to find another way of determining whether a non-subresource request should be intercepted?
Flags: needinfo?(bugmail)
It's the race you surmise.

We may be able to alter the state machine so we don't fully advance an SW to active until we hear the Register echoed back from the parent process.

Specifically, we would:
- Modify ServiceWorkerRegistrationInfo::FinishActivate to no longer call `mActiveWorker->UpdateState(ServiceWorkerState::Activated)`, so only the StoreRegistration call is made for any side-effect.  (Our propagation mechanism is PBackground-based so exists even without PContent.)
- Modify ServiceWorkerManagerService::PropagateRegistration to tell all content processes instead of skipping the one that told us about the registration.
- Modify ServiceWorkerManager::LoadRegistration's `if (!registration)` else-case to cause the postponed `mActiveWorker->UpdateState(ServiceWorkerState::Activated)` to be triggered.  Probably by amending its clause to be:
  if (registration->GetActive() &&
      registration->GetActive()->CacheName() == aRegistration.cacheName()) {
    if (registration->GetActive()->State() == ServiceWorkerState::Activating) {
      // Complete the activation.
      registration->GetActive()->UpdateState(ServiceWorkerState::Activated);
    } else {
      // No need for updates.
      return;
    }
  }

NB: ServiceWorkerManagerService::PropagateRegistration is notifying the parent process ServiceWorkerManager at the same time it's announcing back to the child content process.  If we assume the SWM in the parent gets registered prior to any content process SWMs (which seems safe), then I think that rules out any pathological races.

Disclaimer: If tests are setting the SW-killing timers to 0, because of how the KeepAliveTokens are released, this may open a window for the SW to get reaped when it would not previously have been reaped in a way that could impact sketchy tests.
Flags: needinfo?(bugmail)
Attached patch WIP (obsolete) — Splinter Review
Checkpoint. test_fetch_events.html passes; browser_download.js does not due to the aforementioned race.
Attachment #8732427 - Attachment is obsolete: true
Double apologies; I inferred some wrong things about what was going wrong in comment 38 and then poor time management delayed me actually figuring out what went wrong.

browser_download.js's test failed because the download/window.html was listening for the "controllerchange" event.  This event is dispatched to the main thread at ServiceWorkerRegistrationInfo::Activate() time, the same time the "activate" extendable event is dispatched to the worker.  FinishActivate, which is still when we were triggering the StoreRegistration call that would notify the parent process, would not run until the extendable event resolves.  This was a race that was deterministically guaranteed to be lost.  (If the test had been listening on the ServiceWorker bindings, things would have worked out.)

The attached, much simpler fix is that we invoke StoreRegistration prior to dispatching the "controllerchange" event.  This is sufficiently spec-friendly to not be crazy, but regrettably does still allow for the potential race originally discussed.  Namely, there is a "content process main thread -> parent process pbackground thread -> content process main thread" registration information flow that could potentially race the "content process main thread -> parent process main thread" channel construction.

I'll see if I can come up with an improved-but-still-simple fix that rules this race out too, but this at least makes the test pass for me now.  (Unfortunately, one of my other original assumptions about the ordering of the ServiceWorkerManagerService actors was wrong; the child process ended up getting notified prior to the parent too.)
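The ordering fix reduces to this trivial sketch: StoreRegistration (which notifies the parent) must precede the "controllerchange" dispatch. The names mirror the real SWM methods, but this is only an ordering illustration:

```cpp
#include <cassert>
#include <string>
#include <vector>

// Ordering-only sketch of the fix: StoreRegistration (which notifies the
// parent process) runs before "controllerchange" is dispatched, so a fetch
// started from that event handler cannot outrun the registration data.
std::vector<std::string> ActivationOrder() {
  std::vector<std::string> log;
  log.push_back("StoreRegistration");  // propagate to the parent first
  log.push_back("controllerchange");   // then let content react
  return log;
}
```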
Attachment #8802245 - Attachment is obsolete: true
The remaining test failures I'm getting seem to be hitting the remaining race :( I've got an rr recording showing that I'm ending up in HttpChannelParentListener::ShouldPrepareForIntercept before the ServiceWorkerManager stores any service worker registration info related to the URI that's being checked.
Specifically, the race is that ServiceWorkerManagerChild::RecvRegister is received, but then it dispatches CheckPrincipalWithCallbackRunnable to the main thread, which then sends a NotifyRegister message via PServiceWorkerManager, which is then received on a background thread in the parent process (very confusingly!), which proceeds to actually store registration details. Of course, this happens significantly after the child has already decided that the service worker has been activated and has already initiated the HTTP request.
Attached patch WIP (obsolete) — Splinter Review
Rebased against recent m-c.
Attachment #8801979 - Attachment is obsolete: true
Attached patch WIP (obsolete) — Splinter Review
Attachment #8807292 - Attachment is obsolete: true
browser_net_service-worker-status.js fails with this patch because:
* the service-worker-synthesized-response notification is observed in the content process
* the http-on-examine-cached-response notification is observed in the parent process
Any ideas of the best way to deal with this? The new code is now using the same codepaths as the old non-e10s code except that the Service Worker still executes in the content process.
Flags: needinfo?(poirot.alex)
Sorry for the delay!

Here is a patch to forward the service-worker-synthesized-response notification
to the parent, but it is racy: the message sent to the parent via sendSyncMessage
happens to be processed after http-on-examine-cached-response is fired (though
not always). This patch also removes a bunch of stuff that is no longer
needed thanks to your patch.

Ideally, the nsIHttpChannel would have a flag saying that the channel has been
intercepted. Then we wouldn't need any of this junk code in the network monitor.
We would just have something like:
  let fromServiceWorker = channel.hasBeenIntercepted;
Flags: needinfo?(poirot.alex)
Depends on: 1318142
(In reply to Alexandre Poirot [:ochameau] from comment #55)
> Ideally, the nsIHttpChannel would have a flag saying that the channel has
> been
> intercepted. Then we wouldn't need any of this junk code in the network
> monitor.
> We would just have something like:
>   let fromServiceWorker = channel.hasBeenIntercepted;

nsIHttpChannelInternal has a "responseSynthesized" flag that I believe covers this.  I haven't fully paged this patch back into my brain so that may be flawed somehow, but this patch does that (and removes the message-propagating logic) and it makes the (previously failing) test green for me.

Does this make sense?

Note: I've pushed a full try run with the ServiceWorkerManager race fixes from bug 1318142, the WIP patch, and the 2 devtools fixes; hopefully it goes well!
Attachment #8803132 - Attachment is obsolete: true
Attachment #8812696 - Flags: feedback?(poirot.alex)
Comment on attachment 8812696 [details] [diff] [review]
Devtools fix, use nsIHttpChannelInternal.responseSynthesized and avoid message passing. Apply on top of ochameau fix.

Hey Andrew, glad to see you're working on Firefox!

Yes, that looks exactly like what is needed.
I can polish my patch and get it reviewed, but I think it will have to land in coordination with the other patches in this bug.
Attachment #8812696 - Flags: feedback?(poirot.alex) → feedback+
Also stop listening for events in the child, as the observer service notifications are now correctly fired in the parent process.
(We no longer have to instantiate a NetworkMonitor instance in the child just for the sake of service workers)


MozReview-Commit-ID: A3RGd31fVIC
Attachment #8812753 - Flags: review?(odvarko)
Attachment #8812695 - Attachment is obsolete: true
Attachment #8812696 - Attachment is obsolete: true
Last full try run showed the xpcshell test netwerk/test/unit_ipc/test_synthesized_response_wrap.js failing.  (Your try runs hadn't involved xpcshell and I had cargo culted your try string for my try runs.)

This test (which is really just: `run_test_in_child("../unit/test_synthesized_response.js");`), was asserting that its nsINetworkInterceptController::ShouldPrepareForIntercept method should be called at most once.  However, it was getting called twice.

Investigating, it appeared the following problems were happening:
- HttpChannelChild::OnRedirectVerifyCallback (triggered by the ResetInterception call inducing a redirect to perform the actual network load) was invoking its own ShouldIntercept(uri) method and its load flags did not include LOAD_BYPASS_SERVICE_WORKER so it re-ran the ShouldPrepareForIntercept call (and returned true!).
- If you fix that call to be against the mRedirectChannelChild, its loadFlags also don't include LOAD_BYPASS_SERVICE_WORKER so the same thing happens.  Which is to say, we send shouldIntercept=true up to the parent again and even though the parent knows LOAD_BYPASS_SERVICE_WORKER, the value still gets passed to mParentListener->InterceptNextChannel(aShouldIntercept) 

Both changes in behavior seem like artifacts of the refactoring of the HttpChannelChild::SetupRedirect code (and its call to ShouldIntercept via the now-removed ShouldInterceptURI wrapper) that got moved up to nsHttpChannel::SetupReplacementChannel.  That logic would call ShouldInterceptURI against the newChannel (instead of the current channel), and more significantly would check that the redirectFlags included REDIRECT_TEMPORARY or REDIRECT_PERMANENT (which the automatic REDIRECT_INTERNAL would not match).  The refactored logic in nsHttpChannel::SetupReplacementChannel uses similar checks (albeit a De Morgan's law negation) to decide to set LOAD_BYPASS_SERVICE_WORKER.  That's also where the `printf("marking SW bypass for %p (%s)\n", raw.get(), spec.get());` happens, suggesting maybe this was an area of concern?

The patch I'm attaching here addresses the first problem by doing the QI/cast dance to be able to call ShouldIntercept on the new channel.  The second I address by introducing an explicit REDIRECT_INTERCEPTION_RESET redirectFlag that we use to effectively propagate LOAD_BYPASS_SERVICE_WORKER from the parent to the child for the redirect.

My primary goal was to propagate LOAD_BYPASS_SERVICE_WORKER since that makes ShouldIntercept return false and avoids spreading the logic everywhere.  The choice of the redirect flag was primarily due to redirectFlags being passed but loadFlags not being passed.  But there is a nice explicitness to it.  Also, I added the very explicit redirect flag because REDIRECT_INTERNAL is used in other places where it would be confusing/wrong to map to LOAD_BYPASS_SERVICE_WORKER.

This patch goes on top of your un-bitrotted WIP.  Please feel free to suggest alternate approaches or provide a better patch or say we should just delete the incriminating xpcshell test!
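To illustrate the intent of the new redirect flag (the constant values below are placeholders, not Gecko's real flag bits): an interception-reset redirect maps to LOAD_BYPASS_SERVICE_WORKER on the replacement channel, while a plain internal redirect does not:

```cpp
#include <cassert>
#include <cstdint>

// Placeholder bit values; Gecko's real constants differ.
constexpr uint32_t LOAD_BYPASS_SERVICE_WORKER  = 1u << 0;
constexpr uint32_t REDIRECT_INTERNAL           = 1u << 1;
constexpr uint32_t REDIRECT_INTERCEPTION_RESET = 1u << 2;

// Child-side setup of the replacement channel: an explicit interception
// reset propagates the bypass load flag from the parent, while an
// ordinary internal redirect leaves the load flags alone.
uint32_t ReplacementLoadFlags(uint32_t aRedirectFlags, uint32_t aLoadFlags) {
  if (aRedirectFlags & REDIRECT_INTERCEPTION_RESET) {
    aLoadFlags |= LOAD_BYPASS_SERVICE_WORKER;
  }
  return aLoadFlags;
}

bool ShouldIntercept(uint32_t aLoadFlags, bool aClientControlled) {
  return aClientControlled && !(aLoadFlags & LOAD_BYPASS_SERVICE_WORKER);
}
```

Centralizing on LOAD_BYPASS_SERVICE_WORKER means ShouldIntercept returns false for the replacement channel without spreading interception-reset logic across call sites.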
Attachment #8813103 - Flags: feedback?(josh)
Comment on attachment 8812753 [details] [diff] [review]
Use responseSynthesized instead of parent/child message passing to identify service worker requests

Review of attachment 8812753 [details] [diff] [review]:
-----------------------------------------------------------------

The changes look good to me, but I am seeing a bunch of test timeouts on Try:

https://treeherder.mozilla.org/#/jobs?repo=try&revision=0f77df597b9348c14840dcd3b07480c59d1531c7&selectedJob=31592111

Honza
Attachment #8812753 - Flags: review?(odvarko)
Comment on attachment 8812753 [details] [diff] [review]
Use responseSynthesized instead of parent/child message passing to identify service worker requests

Honza, this has to land with the other patches attached on that bug. That's why I didn't open a specific devtools bug.
See this try:
  https://treeherder.mozilla.org/#/jobs?repo=try&revision=2159cf9cc333
Attachment #8812753 - Flags: review?(odvarko)
Comment on attachment 8812753 [details] [diff] [review]
Use responseSynthesized instead of parent/child message passing to identify service worker requests

Review of attachment 8812753 [details] [diff] [review]:
-----------------------------------------------------------------

R+

Honza
Attachment #8812753 - Flags: review?(odvarko) → review+
Comment on attachment 8813103 [details] [diff] [review]
Add redirect flag to convey reset interception; use ShouldIntercept check of redirected channel

I concur with your analysis (good writeup!), and I haven't come up with a better solution than this.
Attachment #8813103 - Flags: feedback?(josh) → feedback+
De-bitrotted patches with the diversion suspend count issue addressed; green try run:
https://treeherder.mozilla.org/#/jobs?repo=try&revision=7a2dc3f096190df707ed2144b7bbb866225fab0c

I'm going to refresh the patches on the bugs now.  Because of where we are in the release/merge calendar I think we'd want to wait until after the merge to land parent intercept, but I'd like to try and get jdm's patch up for review from the proper peoples.  I'll set flags as appropriate.
Although it would be fantastic if you could verify I didn't screw up anything majorly in my various de-bitrottings, I mainly would just appreciate it if you could suggest an appropriate reviewer for your patch (and my redirect flag patch you feedback+'ed, please :).

To any reviewers, please note that the bugs in this patch also depend on the patches in bug 1318142.  Those are present in the try run from comment 70.  (And the patches attached on this bug and that bug are currently exactly the patches from the try run in comment 70.)
Attachment #8812693 - Attachment is obsolete: true
Attachment #8827767 - Flags: feedback?(josh)
(just de-bitrotting, carrying over :jdm's feedback+, although bugzilla seems to not propagate the original flag source for me anymore?)
Attachment #8813103 - Attachment is obsolete: true
Attachment #8827768 - Flags: feedback+
Comment on attachment 8827767 [details] [diff] [review]
jdm's parent intercept patch (de-bitrotted)

Review of attachment 8827767 [details] [diff] [review]:
-----------------------------------------------------------------

De-bitrotting looks sensible to me. For reviewers, mayhemer has the most experience reviewing this code, but it would be worth checking with jduell if there are any other Necko folks that should understand this stuff.

I've added some comments about things that I never got around to dealing with when I was working on this. It's tempting to suggest splitting this patch into two: one that removes all of the old code and breaks IPC interception; a second that adds the new model. There are large parts that don't read well when reviewing because these things are entwined.

::: netwerk/protocol/http/HttpChannelChild.cpp
@@ +2446,4 @@
>  
> +  mResponseCouldBeSynthesized = false;
> +  if (mIPCOpen) {
> +    printf("actually resetting\n");

Let's remove this.

::: netwerk/protocol/http/HttpChannelChild.h
@@ +217,5 @@
>    // Set if SendSuspend is called. Determines if SendResume is needed when
>    // diverting callbacks to parent.
>    bool mSuspendSent;
>  
> +  bool mResponseIsSynthesized;

We should add a comment here.

::: netwerk/protocol/http/HttpChannelParent.cpp
@@ +1797,5 @@
>    return NS_OK;
>  }
>  
> +already_AddRefed<nsIChannel>
> +HttpChannelParent::GetActiveChannel()

This isn't used any more.

::: netwerk/protocol/http/HttpChannelParent.h
@@ +195,5 @@
>  
>    nsresult ReportSecurityMessage(const nsAString& aMessageTag,
>                                   const nsAString& aMessageCategory) override;
>  
> +  already_AddRefed<nsIChannel> GetActiveChannel();

This isn't used any more.

::: netwerk/protocol/http/HttpChannelParentListener.cpp
@@ +363,5 @@
> +  RefPtr<HttpChannelParent> channel = do_QueryObject(mNextListener);
> +  MOZ_ASSERT(channel);
> +  nsAutoCString spec;
> +  channel->mChannel->GetName(spec);
> +  printf("resetting for %s\n", spec.get());

Remove this.

@@ +366,5 @@
> +  channel->mChannel->GetName(spec);
> +  printf("resetting for %s\n", spec.get());
> +
> +  //XXXjdm unsure why we receive a ResetInterception message after ActorDestroy...
> +  if (mInterceptedChannel) {

We should figure out if this is still necessary. It may have been a symptom of something that got fixed along the way.

@@ +373,5 @@
> +    mInterceptedChannel = nullptr;
> +  }
> +}
> +
> +class FinishSynthesizeOnMainThread : public Runnable

I wonder if this could be a NewRunnableMethod thing as well.

@@ +525,2 @@
>  
> +  if (mInterceptedChannel && channel == caller) {

I no longer remember why this was necessary; we should definitely have an explanation or remove this if it's no longer needed.

::: netwerk/protocol/http/HttpChannelParentListener.h
@@ +80,4 @@
>  
>    nsAutoPtr<nsHttpResponseHead> mSynthesizedResponseHead;
> +  nsCOMPtr<nsIInputStream> mSynthesizedBody;
> + 

nit: trailing space.

::: netwerk/protocol/http/InterceptedChannel.cpp
@@ +373,2 @@
>                             getter_AddRefs(mResponseBody),
> +                           0, UINT32_MAX, true, true);*/

We should remove this and file a followup about using a PSendStream for the synthesized body instead of this storage stream, since bug 1294450 was fixed.

::: netwerk/protocol/http/nsHttpChannel.cpp
@@ +3451,5 @@
> +    }
> +
> +    NS_IMETHODIMP Run()
> +    {
> +        mIntercepted->NotifyController();

This whole class can probably be replaced with NewRunnableMethod or whatever that's called.

@@ +5339,5 @@
> +            nsCOMPtr<nsIURI> url;
> +            newChannel->GetURI(getter_AddRefs(url));
> +            nsAutoCString spec;
> +            url->GetSpec(spec);
> +            printf("marking SW bypass for %p (%s)\n", raw.get(), spec.get());

This can be removed now.

::: xpcom/base/Logging.cpp
@@ +191,5 @@
>      bool addTimestamp = false;
>      bool isSync = false;
>      int32_t rotate = 0;
>      const char* modules = PR_GetEnv("MOZ_LOG");
> +    //const char* modules = "nsHttp:5";

Nix this.
Attachment #8827767 - Flags: feedback?(josh) → feedback+
Jason, this is the big redesign of Service Worker interception that I described in Hawaii. Honza is the person who has traditionally reviewed changes to this code, but is there anybody else who it would be valuable to have understand the new model?
Flags: needinfo?(jduell.mcbugs)
There are lots of people in Necko who I'd like to see understand this stuff :)  Honza is still the obvious person to review, but if he has ideas about someone else who could take it, I'm fine with him delegating it.
Flags: needinfo?(jduell.mcbugs) → needinfo?(honzab.moz)
I'd always at least look over the changes to nsHttpChannel to keep my knowledge in sync, so I'll spend some amount of time on this after all.

Ideas about someone else?  Valentin?

jdm or Andrew, is there some overview of the patch architecture?  either in this bug or some wiki/blog post/whatever so that a reviewer can read it to understand the patch(es) better?  Thanks.
Flags: needinfo?(honzab.moz) → needinfo?(josh)
Whiteboard: [e10s-multi:M3] → [e10s-multi:?]
Whiteboard: [e10s-multi:?] → [e10s-multi:+]
Status update: I'm finishing up updating Josh's patch and documenting what's going on in a blog post.  The blog post has had a bit of scope creep as I found myself investigating and describing the various things going on (channel opening, redirects, channel event queues as motivated by IPC, diversions, etc.) at a higher level and with various diagrams.  (Necko has really great comments in the various interfaces and throughout the codebase, but apart from some older wiki pages, there isn't a lot in the way of higher level overviews/documentation.  Hopefully my efforts here mean everyone won't need to synthesize it for themselves.  Or, at the very least, I'll understand what's going on! :)
Assignee: josh → bugmail
Flags: needinfo?(josh) → needinfo?(bugmail)
Comment on attachment 8827767 [details] [diff] [review]
jdm's parent intercept patch (de-bitrotted)

Review of attachment 8827767 [details] [diff] [review]:
-----------------------------------------------------------------

I have 2 outstanding action items marked as "PENDING" below.  I'm publishing now because I think splinter's draft storage is based on localStorage and I'm changing machines to finish this out.  All of :jdm's specific comments have responses here.

Other changes I'm making to this patch:
- HttpChannelChild::FinishInterceptedRedirect, and the members it referenced no longer had a reason to exist and so have been removed.
  - mInterceptingChannel, mInterceptedRedirectListener, mInterceptedRedirectContext removed.

I'm also folding in my "Add redirect flag to convey reset interception; use ShouldIntercept check of redirected channel" patch to this patch.

::: netwerk/protocol/http/HttpChannelChild.cpp
@@ +2446,4 @@
>  
> +  mResponseCouldBeSynthesized = false;
> +  if (mIPCOpen) {
> +    printf("actually resetting\n");

done

::: netwerk/protocol/http/HttpChannelChild.h
@@ +217,5 @@
>    // Set if SendSuspend is called. Determines if SendResume is needed when
>    // diverting callbacks to parent.
>    bool mSuspendSent;
>  
> +  bool mResponseIsSynthesized;

PENDING

::: netwerk/protocol/http/HttpChannelParent.cpp
@@ +1797,5 @@
>    return NS_OK;
>  }
>  
> +already_AddRefed<nsIChannel>
> +HttpChannelParent::GetActiveChannel()

done

::: netwerk/protocol/http/HttpChannelParent.h
@@ +195,5 @@
>  
>    nsresult ReportSecurityMessage(const nsAString& aMessageTag,
>                                   const nsAString& aMessageCategory) override;
>  
> +  already_AddRefed<nsIChannel> GetActiveChannel();

done

::: netwerk/protocol/http/HttpChannelParentListener.cpp
@@ +363,5 @@
> +  RefPtr<HttpChannelParent> channel = do_QueryObject(mNextListener);
> +  MOZ_ASSERT(channel);
> +  nsAutoCString spec;
> +  channel->mChannel->GetName(spec);
> +  printf("resetting for %s\n", spec.get());

done

@@ +366,5 @@
> +  channel->mChannel->GetName(spec);
> +  printf("resetting for %s\n", spec.get());
> +
> +  //XXXjdm unsure why we receive a ResetInterception message after ActorDestroy...
> +  if (mInterceptedChannel) {

The underlying problems have been fixed.

If both nsIInterceptedChannel::ResetInterception and nsIInterceptedChannel::FinishSynthesizedResponse got invoked (or one of them twice, probably ResetInterception), then we could end up here in ResetInterception with a null mInterceptedChannel.  (Both code paths culminate in HttpChannelParentListener consuming mInterceptedChannel and then nulling it out.)

Bug 1301678 (from around when your revised strategy came into existence) addressed this problem by introducing an InterceptedChannel{Chrome,Content} mClosed guard state.  After the first call to either of those methods, the intercepted channel will be marked closed and all future calls will early return with NS_ERROR_NOT_AVAILABLE.  It is no longer possible for bad nsIInterceptedChannel users to induce the state machine to try and do conflicting things at the same time.

A related and illuminating bug is the more recent bug 1342255 where:
- A "fetch" event handler would both invoke respondWith() and then throw an error.
- The error would cause interception to be reset, but the respondWith was still in play and would be expected to eventually resolve or reject.  (And in particular the race could result in a crash.)

So we could expect such races to occur under normal execution behavior.

HttpChannelParentListener::SynthesizeResponse already had an MOZ_ASSERT(mInterceptedChannel) followed by an if-guard.  I've updated the block here to be:
  MOZ_ASSERT(mInterceptedChannel, "nsIInterceptedChannel mClosed guard violated");
  if (mInterceptedChannel) {
and updated SynthesizeResponse's assertion to have the same annotation.

There is enough going on that I think the assertions are a good way to help avoid breakage and the annotations provide a nice hint about where the alleged invariant comes from.

@@ +373,5 @@
> +    mInterceptedChannel = nullptr;
> +  }
> +}
> +
> +class FinishSynthesizeOnMainThread : public Runnable

Converted to NewRunnableMethod.  The SynthesizedBodyCopyComplete callback is still required to deal with refcounting and to invoke NS_DispatchToMainThread.

@@ +525,2 @@
>  
> +  if (mInterceptedChannel && channel == caller) {

PENDING

This exists because there is a period of time during a redirect when both the old and new HttpChannelParent have mParentListener set.

This happens because HttpChannelParentListener::OnRedirectResult:
 * Calls oldParentChannel->Delete() which is an async operation that calls DoSendDeleteSelf which calls SendDeleteSelf() and sets mIPCClosed = true.
 * Calls newParentChannel->SetParentListener(this)

Tentatively planned fix: Explicitly null out the parent listener in the old channel prior to calling Delete.

::: netwerk/protocol/http/HttpChannelParentListener.h
@@ +80,4 @@
>  
>    nsAutoPtr<nsHttpResponseHead> mSynthesizedResponseHead;
> +  nsCOMPtr<nsIInputStream> mSynthesizedBody;
> + 

done.

::: netwerk/protocol/http/InterceptedChannel.cpp
@@ +373,2 @@
>                             getter_AddRefs(mResponseBody),
> +                           0, UINT32_MAX, true, true);*/

PENDING.  Bug 1204254 "service workers should stream response bodies when intercepting a channel" is related.

::: netwerk/protocol/http/nsHttpChannel.cpp
@@ +3451,5 @@
> +    }
> +
> +    NS_IMETHODIMP Run()
> +    {
> +        mIntercepted->NotifyController();

done

@@ +5339,5 @@
> +            nsCOMPtr<nsIURI> url;
> +            newChannel->GetURI(getter_AddRefs(url));
> +            nsAutoCString spec;
> +            url->GetSpec(spec);
> +            printf("marking SW bypass for %p (%s)\n", raw.get(), spec.get());

done

::: xpcom/base/Logging.cpp
@@ +191,5 @@
>      bool addTimestamp = false;
>      bool isSync = false;
>      int32_t rotate = 0;
>      const char* modules = PR_GetEnv("MOZ_LOG");
> +    //const char* modules = "nsHttp:5";

done
Flags: needinfo?(bugmail)
Flags: needinfo?(bugmail)
At today's weekly ServiceWorker meeting :bkelly raised concerns about the expected performance regression from the patch due to main-thread contention (per his previous investigations starting at comment 24) and our current limited ability to gather metrics.  Of particular interest was the possibility of being able to flip a pref between parent intercept and pre-patch child intercept so we could run A/B experiments.  I'm going to investigate options for this.

Note that a motivating concern was to avoid this patch series experiencing significant bit-rot by landing it.  Given that this bug is in large part about reducing the combinatorial explosion that comes from trying to avoid talking to the parent, and that the most naive approach of supporting both the old and new behavior is yet another permutation, I'm going to keep the following options in mind:

1. Have "child intercept" mode operate in a "write-through" style.  Much of the complexity of the current intercept implementation is due to attempting to entirely avoid creating a parent channel, requiring the child to accumulate and track a lot of state.  And then there's extra complexity from the fallout of redirects where state is maintained so that the parent channel can be opened, the redirect performed, and then the parent channel torn down again.  If we allow the child intercept case to always open the parent channel and depend on the parent-intercept machinery to avoid going to the network, we may be able to keep a lot of the cleanup wins of the patch while avoiding introducing main-thread latency-inducing round-trips.  (And we allow redirects and other slow-path cases like HSTS upgrades to fall off the fast path and end up handled by parent intercept.)  There is the wasted overhead cost of creating the parent channel if the SW invokes respondWith(), but it's a win if there is no respondWith() call made (or we implement fetch(event.request) pass-through, unlikely for the relevant timeline), and may be an acceptable compromise in the interest of forward progress.

2. Just accept the bit-rot and wait until we have a) proper ServiceWorkerPrivate remoting and possibly b) the response streams are able to avoid the parent process main thread and controlled document's content process's main thread so that we can avoid the performance regressions.


I was thinking TaskTracer might also help with the investigation, but I just skimmed the current set of bugs and there are still a ton of patches waiting to land, so :bkelly's comment 24-era patches are probably the best bet.
Whiteboard: [e10s-multi:+] → [e10s-multi:-]
:asuth, are you still working on this?  It was noted that bug 1262224 (a high frequency intermittent failure) is pending on this being finished.
(In reply to Joel Maher ( :jmaher) (UTC-5) from comment #81)
> :asuth, are you still working on this?  It was noted that bug 1262224 (a
> high frequency intermittent failure) is pending on this being finished.

This bug had gotten shelved for a while for a variety of reasons.  It's back, but it is going to be transformed by :bkelly's aggressive pursuit of changing how we generate fetch responses in bug 1391693, which lets us simplify away a lot of this bug's complexity.  In terms of your interest via bug 1262224, it should be resolved either directly by the changes there or by the modified changes to this bug.
Depends on: 1391693
Flags: needinfo?(bugmail)
Blocks: 1337783
No longer blocks: 1351521
Blocks: 1428130
Depends on: 1431847
Priority: -- → P2
Status: ASSIGNED → RESOLVED
Closed: Last year
Resolution: --- → WORKSFORME
Whiteboard: [e10s-multi:-] → [e10s-multi:-],
Works for me when the pref is flipped; Andrew, please confirm.
Flags: needinfo?(bugmail)
removing need-info
Flags: needinfo?(bugmail)
You need to log in before you can comment on or make changes to this bug.