Open Bug 1132041 Opened 7 years ago Updated 2 years ago

Service Worker Cache should be smarter about compressing body files


(Core :: Storage: Cache API, defect, P5)





(Reporter: bkelly, Unassigned)



Currently the Service Worker Cache implementation snappy compresses all body streams when it writes them to the file.  This is not great for saving things like jpg which are already compressed.  The Cache should be smarter about what to compress.

Some ideas:

1) Look for a content-encoding header in Response objects.  If its set to "gzip", or another compression string, then do our compression.
2) Look at the content-type header.  Don't compress things like image/jpg, etc.
3) Tee the stream and write both compressed and uncompressed to temp files on disk.  Pick the smallest one to keep.
4) A modified version of (2) that checks for smallest size after only 128kb have been read from the source stream.  Cancel the file writing for the larger file.  Complete the smaller file.  This would avoid blowing up disk usage for giant resources.

Option (3) is the most robust, but at the cost of wasted cpu and temp disk space.  We probably want a heuristic that does (1) and if content-encoding is not specified then does (4).

For synthetically created Responses, we will always need to do something like (4).
Or do content-encoding sniffing to a small temp buffer, decide on compress-or-not, then write our leader and rest of stream to file.
Duplicate of this bug: 1365171
If bug 1365171 is to be considered a dup of this bug, then the additional idea of preserving the wire compression rather than stripping it in the network stack should be included here.
Just to set expectations, I doubt this will be worked any time soon.  So far I have not seen any reports of actual problems due to the current setup.  Our Cache API is also still more performant than chrome's implementation AFAIK.  Optimizing our compression scheme is probably going to be a low priority for a while.
Priority: -- → P5
Note, it may not be worth doing anything for this bug.  Since writing comment 0 I have learned that snappy already automatically disables compression when it detects the content does not benefit from it.  See:

Where it says:

"Typical compression ratios (based on the benchmark suite) are about 1.5-1.7x for plain text, about 2-4x for HTML, and of course 1.0x for JPEGs, PNGs and other already-compressed data."

This came to my attention when I was talking with one of the snappy maintainers.  He said that applications should not need to manually remove snappy from their data stream for some content.  Snappy is designed to be neutral or better with all data streams.

Comment 3 talks about trying to use the line encoding for responses.  This is likely to be extremely complex given the way fetch Responses disconnect cache API from the network stack.  Its also probably a relatively small compression gain a large complexity cost.

Given all this, it might make sense to just mark this bug WONTFIX.
On a large text (HTML, CSS, SVG) test corpus I measured snappy to be 1.8 times the size of brotli. Obviously I chose to store brotli on the server, rather than snappy.

You may be quite correct regarding the complexity of dealing with the server data, but a factor of 1.8 times is NOT a relatively small compression gain. Granted, not every object is going to see that compression factor, because many sites are going to be dominated by image sizes... but if you expanded JPG and PNG to BMP when received, you'd notice that in lots of applications.

Preserving the wire compression is just a generalization of preserving an internal compression. If FF would always preserve the compressed form received, then it should be just a bit of bookkeeping regarding the compression type, so it can be decompressed when used instead of decompressed when received... just like JPG/PNG.  There shouldn't be any need for additional communication between the fetch process and the cache process, just an enhancement of the file type data, and a move of the decompression step from the fetch process to the cache-reference process.
First, let me say I'm not working on firefox any more, so its not up to me.  I'm just trying to provide new information I came across relative to this bug.

Second, let me try to highlight the problem with propagating forward the line encoding to cache_storage as I recall them:

1) Necko strips line encoding automatically before the fetch() code gets passed the data.  (There are ways to disable this, but read on.)
2) Normally fetch() *wants* line encoding stripped because its required to provide decoded bytes via Response.
3) Many Response objects do not end up in Cache API and therefore never need line encoding data again.
4) Many Response objects that do end up in Cache API do not directly come from fetch().  For example, a network Response may be Clone()'d effectively creating a tee where some data might end up in Cache API and some data may be read by content script.  Other Response objects are synthesized by content script completely.
5) Recompressing the stream using the same encoder used by the server is not practical for a variety of reasons.
6) Cache API would have to learn to decode all the types of data that necko currently handles conditionally for different Responses.

All of this is fixable, but it comes at a complexity cost.  Personally I don't think disk compression ratio is nearly as critical as network compression ratio.  I don't think this work would be worth the marginal improvement, but as I said, its not up to me.
Component: DOM → DOM: Core & HTML
Component: DOM: Core & HTML → Storage: Cache API
No longer blocks: 1110136
You need to log in before you can comment on or make changes to this bug.