FetchDriver channel stalls when Service Worker returns from fetch event too early

RESOLVED FIXED in Firefox 39

People

(Reporter: bkelly, Assigned: mayhemer)

Tracking

Version: unspecified
Target Milestone: mozilla39

Firefox Tracking Flags

(firefox39 fixed)

Attachments

(4 attachments, 1 obsolete attachment)

I spent some more time this weekend trying to track down why trained-to-thrill won't store the train images in the Cache.  I am using this combined patch queue:

  https://github.com/wanderview/gecko-patches/tree/serviceworkers

I resolved some issues (for example, I needed to apply the clone() patches), but I'm now stuck on a particularly thorny issue.  It seems that the FetchDriver's channel stalls if the Service Worker resolves the promise passed to RespondWith() too early.

To see this I added instrumentation to FetchDriver::OnDataAvailable() and FetchDriver::OnStopRequest().  When the problem presents I see we get one or two OnDataAvailable() calls, but never an OnStopRequest().

The code in trained-to-thrill looks like this:

    return fetch(request.clone()).then(function(response) {
      caches.open('trains-imgs').then(function(cache) {
        cache.put(request, response).then(function() {
          console.log('yey img cache');
        }, function() {
          console.log('nay img cache');
        });
      });

      return response.clone();
    });

If I remove the Cache completely and just do a fetch, it still stalls:

    return fetch(request.clone()).then(function(response) {
      return response;
    });

Note that fetch() resolves its promise when headers are available, but before the body is fully downloaded.  So returning the response here resolves the fetch event RespondWith() promise before the channel has completed.  This *should* work, but, as I said, it seems to stall currently.
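For context, this is a hypothetical sketch of the fetch handler surrounding the snippet above (the bug comments only show the promise chain; the name onFetch and the registration guard are mine). It shows why respondWith()'s promise settles early: it settles as soon as fetch()'s promise does, i.e. once headers arrive and before the body finishes streaming.

```javascript
// Illustrative fetch handler: respondWith() gets a promise that resolves
// as soon as fetch() resolves (headers available), while the response body
// is still in flight over the channel.
function onFetch(event) {
  event.respondWith(
    fetch(event.request.clone()).then(function (response) {
      return response; // early resolution: body is still streaming
    })
  );
}

// Register only when running inside a worker global scope.
if (typeof self !== 'undefined' && typeof self.addEventListener === 'function') {
  self.addEventListener('fetch', onFetch);
}
```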

If I change the code to store in Cache, but only return the response after the Cache.put() completes, then everything works:

    return fetch(request.clone()).then(function(response) {
      return caches.open('trains-imgs').then(function(cache) {
        return cache.put(request, response.clone());
      }).then(function() {
        console.log('yey img cache');
        return response;
      }, function() {
        console.log('nay img cache');
        return response;
      });
    });

This is different in that Cache.put() does not resolve until the full body has been received and stored to disk.  So this in effect delays resolving the promise passed to RespondWith() until the channel has completed.

I don't know if it's relevant, but these are <img> elements and are created with NewImageChannel().
Josh, do you have any theories here?  The abstract channel interfaces make it a bit hard to trace what's happening in the nsHttpChannel.  For example, it's not clear to me what calls nsHttpChannel::OnDataAvailable() (which then calls FetchDriver::OnDataAvailable()).

I did verify with instrumentation that the channel is not getting suspended or destroyed.
Flags: needinfo?(josh)
Oh, I also thought maybe this was just a general "fetch(img) doesn't work on workers" problem.  For example, maybe the retargeting to STS doesn't work for image channels.

I wrote a separate standalone test, though, and fetch() was able to load the image just fine.
The relevant place to start looking here is InterceptedChannelChrome::FinishSynthesizedResponse and InterceptedChannelContent::FinishSynthesizedResponse. Does the presence of e10s change the observed behaviour here? The OnDataAvailable will either come from the cache pump (nsHttpChannel::mCachePump) in non-e10s, or from the input stream pump (InterceptedChannelContent::mStoragePump). It's possible that this stall is observed because we clear mResponseBody/mSynthesizedCacheEntry in FinishSynthesizedResponse; I'm not entirely certain what the effect of removing those statements would be.
Flags: needinfo?(josh)
I don't see the FinishSynthesizedResponse calls being made.  I think it stalls before it gets to that point.

I can't test e10s mode because we have a problem with SW registrations failing with e10s.

I do however, see this in the log:

  [59312] WARNING: NS_ENSURE_SUCCESS(rv, rv) failed with result 0x80040111: file /Users/bkelly/devel/hg/mozilla-central/netwerk/protocol/http/nsHttpChannel.cpp, line 2945

My current theory is that we're running afoul of the speculative http caching that is only done for images.  Perhaps something like this:

  1) The DOM creates an <img> element.
  2) The imgLoader creates its channel and puts it into the http cache immediately after AsyncOpen().
  3) The imgLoader channel is intercepted.
  4) The ServiceWorker script's onfetch handler does a fetch() for the same image URL
  5) FetchDriver creates a new channel
  6) The new channel finds the cache entry that was created in (2) above.
  7) Stream deadlock because we're trying to read from intercepted channel in order to populate the synthesized response.

That's just a theory, though.

Josh, does that sound plausible?
Flags: needinfo?(josh)
If we could determine if the channel is being intercepted after AsyncOpen, maybe we could change imgLoader not to put it into the cache in that case.
Or better, necko should not in general use cache entries whose associated channel is currently being intercepted.  (If this is the problem...)
Patrick suggests we might want to set BYPASS_LOCAL_CACHE_IF_BUSY for all fetch requests.  See bug 1127429 for where this is being added to XHR.
See Also: → bug 1127429
Honza has requested logs with these settings:

  NSPR_LOG_FILE=filepath
  NSPR_LOG_MODULES=cache2:5,nsHttp:4,timestamp
I added BYPASS_LOCAL_CACHE_IF_BUSY and it did not help.  I will debug further.
In the debugger I see these channels closing with NS_ERROR_DOM_BAD_URI.  So we're probably hitting some CORS thing and going off in the weeds when we try to handle the error.
Although I can't explain why we would hit a CORS error with an early respondWith(), but not hit a CORS error with a late respondWith(). :-\
When I inadvertently change the timing of things with the debugger I sometimes see the FetchDriver OnStopRequest() method called for these requests.  I'm beginning to suspect the thread retargeting code in nsInputStreamPump.  I'll investigate that more tomorrow.
Retargeting is failing because nsCORSListener does not implement nsIThreadRetargetableStreamListener.  So it's not the retargeting stuff.

I added a bunch of instrumentation to nsInputStreamPump and nsHttpTransaction ODA.  With this slowing things down the stall does not occur.  So definitely seems like a race somewhere.
Assignee: nobody → bkelly
Status: NEW → ASSIGNED
It started to look like this was a problem with reading/writing a pipe from multiple threads.  I tried backing out bug 1100398 and temporarily implementing clone() with nsInputStreamTee.  This seems to avoid the stall.

Either I broke something in bug 1100398 or there is a subtle timing issue at work.  If I can't figure it out in the next day or two I will back out bug 1100398.
Blocks: 1100398
No longer depends on: 940273, 1065216
Flags: needinfo?(josh)
I believe this is a problem only with cloned streams.  It does not seem to affect existing pipe usage.  For example, if I simply back out the use of the new NS_CloneInputStream() call in Request/Response, then the stall is not hit.

My current theory is that the pipe is not updating the pipe read state when an input stream is closed.  This means cloning a stream, forgetting the original, and then trying to continue to use the clone will hit the pipe size limit and stall.
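The web-visible analogue of this clone-then-forget hazard is Response.clone(), which tees the underlying body stream. A minimal sketch (assuming a Node 18+ or browser environment with the Fetch API globals; the string body is illustrative) of the clone-and-read-both pattern the Request/Response code exercises:

```javascript
// Both branches of a cloned Response body can be read independently.
// The hazard described above is the pipe-level equivalent of reading only
// one branch while the forgotten branch backs up against the buffer limit.
async function cloneAndReadBoth() {
  const original = new Response('train image bytes');
  const copy = original.clone(); // tees the underlying body stream
  // Read both branches so neither side of the tee can stall the other.
  const a = await original.text();
  const b = await copy.text();
  return [a, b];
}
```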
The implication of comment 15 is that I don't intend to back out bug 1100398, since this appears to be a problem only for new code.
Created attachment 8565759 [details]
necko.log.zip

I'm no longer convinced this is a pipe issue.  I fixed the problem I saw there, but I still get the deadlock.  It just seems very sensitive to timing changes.

I've traced things as far back as the nsHttpTransaction.  As far as I can tell it just stops trying to write to the pipe.  It's hard for me to trace further, though, because adding more printf's tends to change the timing enough to hide the deadlock.  Using the debugger likewise changes the timing enough to avoid the deadlock.

I'm getting kind of stuck here and could use some necko help to go further.

Honza, I grabbed the NSPR logs you requested earlier.  Can you take a look at these?  See anything in there that could explain this?
Flags: needinfo?(honzab.moz)
Actually, of the 8 or so <img> requests made at the same time, I only see one OnStartRequest() callback:

### ### [0x1292cc940] FetchDriver created channel for https://farm8.staticflickr.com/7403/16562967042_4150c389b6_c.jpg
### ### [0x1292cc940] FetchDriver::HttpFetch() calling channel AsyncOpen()
### ### [0x12b5db0d0] FetchDriver created channel for https://farm8.staticflickr.com/7386/16562406871_a802511879_c.jpg
### ### [0x12b5db0d0] FetchDriver::HttpFetch() calling channel AsyncOpen()
### ### [0x12b5db430] FetchDriver created channel for https://farm9.staticflickr.com/8592/16538584096_8b1de4b226_c.jpg
### ### [0x12b5db430] FetchDriver::HttpFetch() calling channel AsyncOpen()
### ### [0x126e0c080] FetchDriver created channel for https://farm8.staticflickr.com/7293/16538069006_1d68696a62_c.jpg
### ### [0x126e0c080] FetchDriver::HttpFetch() calling channel AsyncOpen()
### ### [0x12b5dbaf0] FetchDriver created channel for https://farm8.staticflickr.com/7439/16378155217_7e3bdfbed9_c.jpg
### ### [0x12b5dbaf0] FetchDriver::HttpFetch() calling channel AsyncOpen()
### ### [0x126e0c470] FetchDriver created channel for https://farm8.staticflickr.com/7314/16378147657_240baf67fb_c.jpg
### ### [0x126e0c470] FetchDriver::HttpFetch() calling channel AsyncOpen()
### ### [0x126e0c740] FetchDriver created channel for https://farm9.staticflickr.com/8614/16376416028_fd986e6880_c.jpg
### ### [0x126e0c740] FetchDriver::HttpFetch() calling channel AsyncOpen()
### ### [0x126e0ca10] FetchDriver created channel for https://farm8.staticflickr.com/7290/16376413058_086e940c92_c.jpg
### ### [0x126e0ca10] FetchDriver::HttpFetch() calling channel AsyncOpen()
### ### [0x12b5e58f0] FetchDriver created channel for https://farm8.staticflickr.com/7301/16376405958_a725bafdaf_c.jpg
### ### [0x12b5e58f0] FetchDriver::HttpFetch() calling channel AsyncOpen()
### ### [0x126e0cc50] FetchDriver created channel for https://farm8.staticflickr.com/7460/15943893073_25d299dc70_c.jpg
### ### [0x126e0cc50] FetchDriver::HttpFetch() calling channel AsyncOpen()
### ### [0x126e0c080] FetchDriver::OnStartRequest()
### ### [0x126e0c080] FetchDriver::OnDataAvailable()
honza is out of office for a bit - adding ni for :michal who also has a lot of cache experience.
Flags: needinfo?(michal.novotny)
Thanks Patrick.

I need to go tend to some review issues on other bugs.  I probably won't get back to this for a few days.

The next thing I will probably try is creating a reduced test case.  I expect it will probably look like this:

1) register service worker script
2) dynamically create <img> element(s)
3) service worker onfetch handler simply does |event.respondWith(fetch(event.request));|

The need to run a minimal server on localhost or https for the SW makes this marginally more work.
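A hedged sketch of the page side of steps 1 and 2 above (the file name 'sw.js', the image path, and the setupPage wrapper are my assumptions, not taken from the eventual test repo):

```javascript
// Page-side setup for the reduced test case; wrapped in a function so the
// logic can be exercised outside a real browser by passing stand-ins for
// navigator and document.
function setupPage(nav, doc) {
  // Step 1: register the service worker script.
  return nav.serviceWorker.register('sw.js').then(function (reg) {
    // Step 2: dynamically create an <img> element once registration succeeds.
    var img = doc.createElement('img');
    img.src = 'test.jpg'; // illustrative image path
    doc.body.appendChild(img);
    return reg.scope;
  });
}

if (typeof navigator !== 'undefined' && typeof document !== 'undefined') {
  setupPage(navigator, document);
}
```

Step 3 is then just `event.respondWith(fetch(event.request))` in the worker's fetch handler, as comment 38 describes.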
Comment 21 (Assignee)
Created attachment 8566169 [details]
necko.log - http top levels/http cache subset

Patrick, this looks to me like a spdy issue.  For one particular failing URL I can see two channels, one intercepted and one that does a regular network fetch.  Both use different http cache entries.  I also see the transaction getting to SENDING_TO state and there it ends, no more entries for the transaction in the log (I consider the log complete.)  

I can see X-Firefox-Spdy: 3.1 resp header when manually going to https://farm8.staticflickr.com/.


Log snip filters: 
60d1000 - the intercepted channel
c4ad800 - the net fetch channel
c4af800 - and its transaction
f2e0e20 - intercepted artificial cache entry
f2e1f50 - net fetch cache entry
https://farm8.staticflickr.com/7403/16562967042_4150c389b6_c.jpg - the url loading
Flags: needinfo?(honzab.moz)
Flags: needinfo?(michal.novotny) → needinfo?(mcmanus)
Obviously that log is really complicated, and I can't find STR anywhere in this bug that I can follow.

Ben, can you disable spdy (network.http.spdy.enabled) and rerun the test to confirm honza's suspicion?

If you can't, that's ok; it just seems easier than starting with the log.
Flags: needinfo?(mcmanus)
I disabled spdy, but the stall/deadlock still occurs.
Created attachment 8566230 [details]
necko-no-spdy.log.zip

New log with spdy disabled.  I'll post URLs in next comment.
URLs for log in comment 24:

### ### [0x11778e740] FetchDriver created channel for https://farm8.staticflickr.com/7452/16573123155_ee0c56de07_c.jpg
### ### [0x11778e740] FetchDriver::HttpFetch() calling channel AsyncOpen()
### ### [0x11778faf0] FetchDriver created channel for https://farm9.staticflickr.com/8587/16571716612_5b608f5dd3_c.jpg
### ### [0x11778faf0] FetchDriver::HttpFetch() calling channel AsyncOpen()
### ### [0x11e94fce0] FetchDriver created channel for https://farm8.staticflickr.com/7404/16571487691_05e894c85c_c.jpg
### ### [0x11e94fce0] FetchDriver::HttpFetch() calling channel AsyncOpen()
### ### [0x11e950a60] FetchDriver created channel for https://farm8.staticflickr.com/7386/16571380211_4d25f23e7d_c.jpg
### ### [0x11e950a60] FetchDriver::HttpFetch() calling channel AsyncOpen()
### ### [0x12ddf44c0] FetchDriver created channel for https://farm8.staticflickr.com/7380/16547069256_d6fb9f398a_c.jpg
### ### [0x12ddf44c0] FetchDriver::HttpFetch() calling channel AsyncOpen()
### ### [0x12e2fc1a0] FetchDriver created channel for https://farm8.staticflickr.com/7356/16386914787_59b15c0eea_c.jpg
### ### [0x12e2fc1a0] FetchDriver::HttpFetch() calling channel AsyncOpen()
### ### [0x12e2fc500] FetchDriver created channel for https://farm8.staticflickr.com/7440/16385635610_cd19dec034_c.jpg
### ### [0x12e2fc500] FetchDriver::HttpFetch() calling channel AsyncOpen()
### ### [0x11ec7df70] FetchDriver created channel for https://farm9.staticflickr.com/8592/16385513108_d674d47ca8_c.jpg
### ### [0x11ec7df70] FetchDriver::HttpFetch() calling channel AsyncOpen()
### ### [0x11eca7aa0] FetchDriver created channel for https://farm9.staticflickr.com/8571/15952917143_6d484ac8d3_c.jpg
### ### [0x11eca7aa0] FetchDriver::HttpFetch() calling channel AsyncOpen()
### ### [0x12e2fcf20] FetchDriver created channel for https://farm9.staticflickr.com/8606/15950272594_1aef346d1d_c.jpg
Start of a reduced test case.  This still does not trigger the bug, though.

  https://github.com/wanderview/sw-fetch-stall-test

To try it:

  git clone git@github.com:wanderview/sw-fetch-stall-test.git
  cd sw-fetch-stall-test
  npm install
  npm start

Then browse to localhost:3333/index.html with the build from here:

  https://blog.wanderview.com/sw-builds/
Honza, looking at the network panel I actually see a partial entry show up for this URL:

https://api.flickr.com/services/rest/?method=flickr.photos.search&extras=description&format=json&api_key=f2cca7d09b75c6cdea6864aca72e9895&text=train station&license=4,5,6,7&content_type=1&nojsoncallback=1&per_page=10

This request seems to get made twice.  The first one succeeds and the second one gets only partially displayed in the network panel.

Perhaps this is the request that is getting stuck and breaking all the following img requests.

Do you see this URL in the log at all?
Flags: needinfo?(honzab.moz)
Actually, that second request only happens with a hacked version I have here.  It won't be in the logs.
Flags: needinfo?(honzab.moz)
I have a feeling this is related to the CORS bug. The CORS bug occurs because the interception patch does not correctly set the necko response's status and headers from what the SW returned. It is plausible that Necko stalls because it thinks the response hasn't been delivered yet due to not having headers. I'm going to attempt a fix for the CORS and upload it then we can test this.
I changed the number of images to fetch to 1, added some debugging and the outcome is that FetchDriver behaves oddly in TtT. I fetched the same image using fetch() from the window scope and all the bytes were downloaded, but in TtT the FetchDriver is not reading the entire reply from the network. Now I'm going to check if this is a ServiceWorker lifetime issue or some wrong http channel setting or something to do with workers.
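A sketch of that window-scope comparison (the function name and the idea of counting delivered bytes are mine, not from the bug): fully consuming the body is what forces the channel to run to completion, which is exactly what stalls for the FetchDriver inside the SW.

```javascript
// Fetch a URL and report how many body bytes were actually delivered.
// From the window scope this returns the full image size; in the stalled
// SW case the body never finishes arriving.
async function fetchedByteCount(url) {
  const response = await fetch(url);
  const buffer = await response.arrayBuffer(); // completes only when the channel does
  return buffer.byteLength;
}
```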
Nikhil, is it possible we have the LoadGroup from the original document when we try to open the refreshed new document?  Perhaps the LoadGroup is getting canceled by the navigation.  I have a bug to make SharedWorker use a separate LoadGroup in all cases, but it's not reviewed yet.
You could try the patch from bug 1118845, perhaps with some extra logic in ServiceWorkers.
Assignee: bkelly → nsm.nikhil
This doesn't seem like a loadgroup issue, rather some issue with NS_AsyncCopy and both streams being passed to it being pipes. I'm not 100% sure yet, and keeping all the factors identical on every run is a little difficult. I'll continue looking into it.
Did some more digging. This may be hitting some Necko or OS max-sockets limit or something. The TtT page loads 10 images from Flickr by default. For each, the SW interception means that there are 2 channels open per connection, the one originally initiated by the img, and then the one initiated by the call to fetch() in the SW. With spdy disabled, I changed the number of images. With other networking params set to default, on my Linux machine 6 is the number of images which load successfully, and at 7 it stops. This is the most common limit, although a few times it was lower (4), and once, when I had NSPR_LOG_MODULES=nsHttp:5, it even loaded all 10, but I could not get it to reproduce.

Anyway, when these limits are hit, the FetchDriver's ::OnDataAvailable callback doesn't get called with new data, stalling the pipe. I'm not sure exactly which prefs in necko govern various limits, happy to try twiddling them if someone could point them out.

Josh, Honza, any ideas about how to verify that this is some pref issue, and how it can be mitigated in the case of interception?
Flags: needinfo?(josh)
Flags: needinfo?(honzab.moz)
The intercepted channel doesn't create a socket, so that shouldn't be hitting any limits.
Flags: needinfo?(josh)
No, but the call to fetch() in the SW script does create a channel, so we have 2 channels.
Comment 37 (Assignee)
:nsm, can I try to reproduce this somehow locally?  Maybe it will be better than trying to get a log from you and then seeing nothing in it that I could catch on to.
Flags: needinfo?(honzab.moz) → needinfo?(nsm.nikhil)
1) Visit https://blog.wanderview.com/sw-builds/, make sure you are on the m-c changeset it says, and then follow the right patch-queue link to make sure you don't get rejects when applying them.
2) Visit https://jakearchibald.github.io/trained-to-thrill/. This will register a serviceworker.
3) Reload the page; the ServiceWorker should now be handling the fetches. There are 10 images, and you should find that they don't load.

FetchDriver::OnDataAvailable() seems to be stalling due to the channel not feeding it new data.
If you want to run Trained to Thrill locally to make changes and simplify it, you'll need to install node.js and npm and then clone https://github.com/jakearchibald/trained-to-thrill
then:
npm install .
npm start

This will start TtT on localhost:3000. All the sources are in www/static/js-unmin. Edit these and the |npm start| command will autogenerate the compressed sources, so you can just refresh the browser. Note that since we don't have devtools support, you'll have to restart the browser if you make changes to the service worker script; otherwise it will fetch from the HTTP cache (or you could globally disable caching if necko has some flag like that).
Flags: needinfo?(nsm.nikhil)

Comment 39

I successfully reproduced this under rr, which allowed me to debug this extensively.  Unfortunately it's still not clear why this happens.  In my recorded test session, I have a Flickr image for which FetchDriver receives OnStartRequest and 2 OnDataAvailable notifications.

Comment 40

Sorry, the previous comment was half written.

The OnDataAvailable's give us 4096, 4096 and 3141 bytes, respectively.  When we receive the 3141 bytes, we successfully set up the next nsInputStreamReadyEvent object, but that object's OnInputStreamReady function never gets called, and I was unable to find anything higher up the chain of streams and listeners, so to the best of my knowledge, we really don't receive any more data from Necko on the input pipe.  I tried to understand where in Necko to look for more information but got stuck.
Comment 41 (Assignee)
Could it be bug 1034309?
Workers have a special event target that notifies a monitor amongst other things.  I wonder if we are somehow dispatching runnables, but not going through that event target.
Comment 43 (Assignee)
Created attachment 8572040 [details] [diff] [review]
v1

Cause:
- there is a pipe into which the Fetch side pushes the payload
- the pipe's input stream is copied to (in this case) the cache entry's output stream; NS_AsyncCopy is used on the STS thread (should it use a thread pool?)
- the pipe's input stream is opened as blocking
- when the reader is fast and there is no data to copy, we wait indefinitely on the STS thread for data on the pipe to come *)
- this effectively kills (deadlocks) the networking stack

Solution:
- the input stream of the pipe has to be non-blocking, so that the copier gets WOULD_BLOCK errors when there is no data in the pipe and waits for notifications


*) Socket Thread

      if (rv == NS_BASE_STREAM_WOULD_BLOCK) {
        // pipe is empty
        if (!mBlocking) {
          break;
        }
        // wait for some data to be written to the pipe
>       rv = Wait();
        if (NS_SUCCEEDED(rv)) {
          continue;
        }
      }


 	xul.dll!nsPipeInputStream::Wait() Line 1001	C++
 	xul.dll!nsPipeInputStream::ReadSegments(0x0f73fada, 0x03aaf46c, 4096, 0x03aaf474) Line 1122	C++
 	xul.dll!nsStreamCopierIB::DoCopy(0x03aaf4f4, 0x03aaf4f0) Line 549	C++
 	xul.dll!nsAStreamCopier::Process() Line 305	C++
 	xul.dll!nsAStreamCopier::Run() Line 431	C++
 	xul.dll!nsThread::ProcessNextEvent(true, 0x03aaf5c9) Line 855	C++
 	xul.dll!NS_ProcessNextEvent(0x027adbc0, true) Line 265	C++
 	xul.dll!nsSocketTransportService::Run() Line 739	C++
Assignee: nsm.nikhil → honzab.moz
Attachment #8572040 - Flags: review?(bkelly)
Comment on attachment 8572040 [details] [diff] [review]
v1

Review of attachment 8572040 [details] [diff] [review]:
-----------------------------------------------------------------

Thanks Honza!  Please just remove the extra hunk in the patch.

Flag ehsan for DOM peer sign off.

::: dom/cache/FileUtils.cpp
@@ +226,5 @@
>      if (NS_WARN_IF(NS_FAILED(rv))) { return rv; }
>  
>      rv = tmpFile->Remove(false /* recursive */);
> +    if (rv == NS_ERROR_FILE_TARGET_DOES_NOT_EXIST ||
> +        rv == NS_ERROR_FILE_NOT_FOUND) {

This is already fixed in a separate patch.  Please remove it from this one.  Also, this code is not in m-c, yet.
Attachment #8572040 - Flags: review?(ehsan)
Attachment #8572040 - Flags: review?(bkelly)
Attachment #8572040 - Flags: review+
thanks honza - that's awesome

Attachment #8572040 - Flags: review?(ehsan) → review+
Comment 46 (Assignee)
Created attachment 8572089 [details] [diff] [review]
v1.1
Attachment #8572040 - Attachment is obsolete: true
Attachment #8572089 - Flags: review+
Keywords: checkin-needed
Keywords: checkin-needed
Keywords: checkin-needed
https://hg.mozilla.org/mozilla-central/rev/f222c7e15308
Status: ASSIGNED → RESOLVED
Last Resolved: 4 years ago
status-firefox39: --- → fixed
Resolution: --- → FIXED
Target Milestone: --- → mozilla39