Closed Bug 1448476 Opened 6 years ago Closed 6 years ago

Cache entry corruption after writing the alternate data.

Categories

(Core :: Networking: Cache, defect, P2)

Version: 59 Branch
Type: defect

Tracking

RESOLVED FIXED
mozilla61
Tracking Status
firefox61 --- fixed

People

(Reporter: mar.kolya, Assigned: michal)

References

Details

(Whiteboard: [necko-triaged])

Attachments

(1 file, 1 obsolete file)

User Agent: Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:59.0) Gecko/20100101 Firefox/59.0
Build ID: 20180316021353

Steps to reproduce:

Visit some sites with very large JS files (e.g. corporate Jira)


Actual results:

The pages are very slow to open. Upon further investigation it turned out that large (15+MB) JS files were reloaded on every visit even though the headers suggested they should be cached.
I've looked in about:cache and that URL was there - but it was not used.


Expected results:

All cache entries should be used.

I did a bit more investigation into this problem:
If I load that JS URL directly in the browser, I can then find a cache entry (in cache2/entries) that looks 'sensible': it contains the JavaScript text and, at the end, the headers and, most importantly, the original URL. If I visit the Jira site after that, the URL is loaded from cache. But after a few visits it stops being loaded from cache and is always transferred from the remote server. At that point, if I look into the file, the beginning of it still contains JS code, but the end looks like something encoded in a two-byte encoding (i.e. lots of '@<char>' kind of sequences). And most importantly, this cache entry no longer contains the original URL.

It looks like disabling 'dom.script_loader.bytecode_cache.enabled' prevents the problem from happening.
Component: Untriaged → Networking: Cache
Product: Firefox → Core
Flags: needinfo?(nicolas.b.pierron)
Component: Networking: Cache → JavaScript Engine
Based on the description, ...

This is weird; it sounds like the addition of the bytecode cache content causes a cache entry corruption, preventing the cache entry from being loaded again from the cache, while not being invalidated by another entry.

The 2-byte encoding is the beginning of the encoded bytecode cache, which, from what I recall, is appended at the end of the JS file. I have no idea how a large bytecode cache could prevent the indexing of a cache entry; this is more likely a question for Valentin.

Have you tried multiplying the amount of available cache space by 4? Could this problem arise when the cache is full?
Maybe this is a race-network-with-cache issue which is not supposed to be enabled if we request any alternate data.

One simple fix might be to cap the size of the bytecode cache entry.
Flags: needinfo?(nicolas.b.pierron) → needinfo?(valentin.gosu)
Summary: bytecode cache corruption → Cache entry corruption after writing the alternate data.
Thanks for this bug report. Could you maybe post the contents of about:cache both before and after you notice the corruption?
It would help us figure out what's going on here.
If the problem isn't obvious from the contents of the cache entry, we may also ask you to gather some logs, if that's OK.
Flags: needinfo?(valentin.gosu) → needinfo?(mar.kolya)
(In reply to Nicolas B. Pierron [:nbp] {backlog: ~36} from comment #1)
> Based on the description, ...
> 
> This is weird, it sounds that the addition of the bytecode cache content
> would cause a cache entry corruption, thus preventing the cache entry from
> being loaded again from the cache, while not being invalidated by another
> entry.
> 
> The 2 byte encoding is the beginning of the encoded bytecode cache, which
> from what I recall is appended at the end of the JS file. I have no idea how
> a large bytecode cache could prevent the indexing of a cache entry, this
> would be more likely a question for valentin.
> 
> Have you tried multiplying by 4 the amount of cache space available, could
> this problem arise when the cache is full?
> Maybe this is a race-network-with-cache issue which is not supposed to be
> enabled if we request any alternate data.

We don't race if the cache index says the entry has alt-data.
But I think there could be cases where the cache index is unavailable or doesn't return the correct answer.

> 
> One simple fix might be to cap the size of the bytecode cache entry.

If it's the size that's causing the problem, maybe we're exceeding the maximum size of the cache file? (question for Honza or Michal)
Flags: needinfo?(michal.novotny)
Valentin, could you please clarify whether you would like me to post the literal contents of 'about:cache', which is a sort of summary, or would you like me to paste the actual entry that gets broken?

If the former, then I should be able to do that. I can reproduce the problem after a complete cache wipe-out, so 'before' would be an empty cache. And even before I do that, I know for a fact that the cache fills up very quickly, so 'after' would be a full cache. This entry is also big enough not to go into RAM.

If you would like the actual broken entry I might not be able to post it publicly. I might be able to share it privately, however.

Please let me know in a bit more detail what information you require.
Thanks!
Flags: needinfo?(mar.kolya)
(In reply to Nikolay Martynov from comment #4) 
> If you would like the actual broken entry I might not be able to post it
> publicly. I might be able to share it privately, however.

If you could, please send me the complete contents of the cache entries (both the one that works and the one that is corrupted): about:cache-entry?storage=disk&... 
Send it to me by email if you are worried about sensitive data it contains.

Thanks!
Flags: needinfo?(mar.kolya)
(In reply to Valentin Gosu [:valentin] from comment #3)
> If it's the size causing the problem. Maybe we're exceeding the maximum size
> of the cache file? (question for Honza or Michal)

The maximum cache file size is browser.cache.disk.max_entry_size or browser.cache.disk.capacity/8 (the lower one). This includes data, alt-data and metadata. We can fail early at https://searchfox.org/mozilla-central/rev/a0665934fa05158a5a943d4c8b277465910c029c/netwerk/cache2/CacheEntry.cpp#1345 or later at https://searchfox.org/mozilla-central/rev/a0665934fa05158a5a943d4c8b277465910c029c/netwerk/cache2/CacheFileOutputStream.cpp#101 which would be this case.

In any case, this would doom the entry immediately so it shouldn't be stored in cache2/entries anymore.
Flags: needinfo?(michal.novotny)
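For illustration, the per-entry limit described above could be computed roughly like this. This is a minimal sketch, not the actual cache code, and it assumes both pref values are expressed in KB:

#include <algorithm>
#include <stdint.h>

// Effective per-entry limit: the lower of browser.cache.disk.max_entry_size
// and browser.cache.disk.capacity / 8. The limit covers data + alt-data +
// metadata together.
int64_t EffectiveMaxEntrySizeBytes(int64_t maxEntrySizeKB, int64_t capacityKB)
{
  return std::min(maxEntrySizeKB, capacityKB / 8) * 1024;
}

// With the 350MB default cache size mentioned later in this bug,
// capacity / 8 is about 43.75MB, which is then the effective per-entry cap.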
Actual URLs would be good to test with. Anyway, I can imagine we easily run over the limit, as a 15+MB JS source can produce more than 35MB of alt-data, easily filling the 50MB per-entry limit.

What's bad is that we don't at least keep the source in the cache instead of throwing everything away. Could we introduce a limit in the script loader to not store alt-data when the source is above a certain size?

How complicated would it be to not doom the entry on reaching the limit, but only cut the alt-data away?
(In reply to Honza Bambas (:mayhemer) from comment #7)
> Bad is that we could at least keep the source in the cache and not throw
> everything away.  Could we introduce a limit in the script loader to not
> store alt-data when the source is above a certain size?

This is something which should be easy to add:

https://searchfox.org/mozilla-central/rev/a0665934fa05158a5a943d4c8b277465910c029c/dom/script/ScriptLoader.cpp#2121,2152-2155

> How complicated would be to not doom the entry on reaching the limit but
> only cut the alt data away?

If we do so, then we would need an extra flag on the channel, once completed, to avoid re-encoding it every time.
(In reply to Honza Bambas (:mayhemer) from comment #7)
> How complicated would be to not doom the entry on reaching the limit but
> only cut the alt data away?

After a quick look at the code I think it's not easy to do. IMO, it would be much easier to check the limit before we start writing alt-data. Does JS know the exact size of the alt-data before it opens the output stream?
Thanks for sending the cache entries, Nikolay.
Flags: needinfo?(mar.kolya)
Assigning to Michal until we get to the root of the problem.
Assignee: nobody → michal.novotny
Component: JavaScript Engine → Networking: Cache
Priority: -- → P2
Whiteboard: [necko-triaged]
(In reply to Michal Novotny (:michal) from comment #9)
> (In reply to Honza Bambas (:mayhemer) from comment #7)
> > How complicated would be to not doom the entry on reaching the limit but
> > only cut the alt data away?
> 
> After a quick look at the code I think it's not easy to do. IMO, much easier
> would be to check the limit before we start writing alt-data. Does JS know
> exact size of the alt-data before it opens the output stream?

Yes.

https://searchfox.org/mozilla-central/rev/a0665934fa05158a5a943d4c8b277465910c029c/dom/script/ScriptLoader.cpp#2472,2490
The entry isn't written completely because the data is written to the output stream faster than the disk is able to write it. Unwritten CacheFileChunks start to accumulate, and when the limit is reached CacheFileChunk::CanAllocate() fails.

The entry isn't doomed because the error is propagated down to CacheFile::RemoveOutput() (https://searchfox.org/mozilla-central/rev/2ce99e8054b0ff6ed1adf484aeaacacf2fea084c/netwerk/cache2/CacheFile.cpp#2170) where we only set mStatus instead of calling SetError() which would doom the entry.
Status: UNCONFIRMED → ASSIGNED
Ever confirmed: true
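To make the mechanism just described concrete, here is a toy model of the backlog limit. The names and structure are illustrative assumptions, not the real CacheFileChunk code:

#include <stdint.h>

// Writes accepted from the output stream are queued as chunks; once the
// unwritten bytes exceed the configured memory limit, the next allocation
// is refused -- which is what CacheFileChunk::CanAllocate() reports in the
// real code when the disk cannot keep up.
struct ChunkBacklog {
  int64_t pendingBytes = 0; // accepted but not yet flushed to disk
  int64_t limitBytes = 0;   // cf. browser.cache.disk.max_chunks_memory_usage

  bool CanAllocate(int64_t chunkSize) const {
    return pendingBytes + chunkSize <= limitBytes;
  }
  void OnChunkQueued(int64_t chunkSize) { pendingBytes += chunkSize; }
  void OnChunkWritten(int64_t chunkSize) { pendingBytes -= chunkSize; }
};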
(In reply to Michal Novotny (:michal) from comment #13)
> The entry isn't written completely because the data is written to output
> stream faster than the disk is able to write. Unwritten CacheFileChunks
> start to cumulate and when the limit is reached
> CacheFileChunk::CanAllocate() fails.

I'm not sure how to fix this. We could increase browser.cache.disk.max_chunks_memory_usage and browser.cache.disk.max_priority_chunks_memory_usage, but the question is what the right value is. I did some tests with a 13MB JS file which produces 30MB of alt-data (the total entry size is just below the max entry size for the default 350MB cache size), and sometimes the backlog of unwritten data reached 30MB.

Another option is to bypass chunk usage limiting for alt-data, but I don't like that idea because the memory usage could then grow a lot, and that is exactly what this limiting mechanism is there to prevent. On the other hand, we probably won't write large alt-data very often, so this would happen rarely.

Honza, what's your opinion?
Flags: needinfo?(honzab.moz)
Can we just disable alternate data caching for js resources over a certain high threshold?
(In reply to Michal Novotny (:michal) from comment #14)
> (In reply to Michal Novotny (:michal) from comment #13)
> > The entry isn't written completely because the data is written to output
> > stream faster than the disk is able to write. Unwritten CacheFileChunks
> > start to cumulate and when the limit is reached
> > CacheFileChunk::CanAllocate() fails.
> 
> I'm not sure how to fix this. We could increase
> browser.cache.disk.max_chunks_memory_usage and
> browser.cache.disk.max_priority_chunks_memory_usage, but the question is
> what's the right value? I did some tests with 13MB JS which produces 30MB
> alt-data (total entry size is just below the max entry size for default
> 350MB cache size) and sometimes the backlog of unwritten data reaches 30MB.
> 
> Other option is to bypass chunk usage limiting for alt-data, but I don't
> like this idea because then the memory usage can grow a lot and that's
> exactly why we have this limiting mechanism. 

That was one reason; the other was to keep the write backlog short on slow HW, I think. We break both by not enforcing the limit for alt-data, yes...

Anyway, is there at least some data on whether a 30MB bytecode cache has a positive effect on performance? 30MB takes some time to load...

I had a (somewhat wild) idea of doing a WriteSegments-style thing on the cache output stream, but the bytecode is generated in the content process and we write in the parent process, so we do an IPC message data copy anyway.

> On the other hand we won't
> probably write large alt-data very often, so this would probably happen
> rarely.
> 
> Honza, what's your opinion?

I would agree with Ben here, and it also seems we know the bytecode size soon enough to prevent the cache write. So, have a limit (a JS engine pref?) that decides whether or not to write it in the script loader.
Flags: needinfo?(honzab.moz)
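A minimal sketch of that size gate follows. The default value, the "no limit" convention, and the helper name are assumptions made for illustration, not the actual ScriptLoader change:

#include <stdint.h>

// Hypothetical gate: skip bytecode encoding for sources above a
// configurable threshold, so an oversized cache entry is never attempted
// in the first place.
static const int64_t kAssumedMaxEncodableSourceBytes = 5 * 1024 * 1024; // made-up default

bool ShouldEncodeBytecode(int64_t sourceLength,
                          int64_t limit = kAssumedMaxEncodableSourceBytes)
{
  // A limit of 0 or less means "no limit" here; that convention is also assumed.
  return limit <= 0 || sourceLength <= limit;
}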
Sorry to jump in, but I would like to clarify one thing: this problem exhibits itself on a quite powerful laptop: Intel(R) Xeon(R) CPU E3-1505M v5, 4C/8T, plenty of RAM, reasonably fast SSD. There is no reason it should be too slow for any sort of web browsing operation :). My observation is that downloading the (unreasonably) large JS is slow because of the slow network, not the slowness of the laptop; it takes tens of seconds to download. Modern SSDs can read hundreds of MB per second, which probably makes storing anything locally faster than downloading it.

Just wanted to clarify that this is not an issue observed due to some underpowered HW.
Thanks for looking into this problem!
(In reply to Honza Bambas (:mayhemer) from comment #16)
> Anyway, are there at least some data if a 30MB bytecode cache has a positive
> effect on performance?  30MB takes some time to load...

Nicolas, do you have numbers on that? Also note that the testing JS was not gzipped; if the server compresses the response it's about 5MB. So in this case, to avoid JS compilation we need to read 6x more data from the disk.
Flags: needinfo?(nicolas.b.pierron)
(In reply to Nikolay Martynov from comment #17)
> Sorry to jump in 

We evict the entry because the compiled data are too big.  We then get rid of the original server data as well.  The outcome of this bug will likely be to not store the compiled data, which will preserve the server content in the cache.
(In reply to Nikolay Martynov from comment #17)
> My observation is that downloading (unreasonably) large JS is slow due 
> to slow networks, not slowness of the laptop. It takes 10s of seconds to
> download that.

Unfortunately, I don't have numbers on how much storing compiled JS improves performance. But when it fails, we doom the already downloaded plain JS, which makes things much worse. We really need to be able to just throw the alt-data away when we're unable to write it completely, for whatever reason. As I already wrote, this probably isn't easy, but I'll have a look at this solution again...
(In reply to Michal Novotny (:michal) from comment #18)
> (In reply to Honza Bambas (:mayhemer) from comment #16)
> > Anyway, are there at least some data if a 30MB bytecode cache has a positive
> > effect on performance?  30MB takes some time to load...
> 
> Nicolas, do you have numbers about that? Also note that the testing JS was
> not gzipped, if server compress the response it's about 5MB. So in this case
> to avoid JS compiling we need to read 6x more data from the disk.

I have no size-based measurements, only time-based ones.

Bug 900784 comment 116 suggests that this optimization improves the long tail of time-consuming scripts, which "might" be related to large scripts.
Flags: needinfo?(nicolas.b.pierron)
Attached patch patch (obsolete) — Splinter Review
See the commit message in the patch for description.

Try looks good:
https://treeherder.mozilla.org/#/jobs?repo=try&revision=df2e7486cc2df417b4e4c7b54c4ab4918c2f55fe&selectedJob=174638951

There are test-verify failures in netwerk/test/unit/test_http2.js, but it fails without this patch too.
Attachment #8970539 - Flags: review?(honzab.moz)
Comment on attachment 8970539 [details] [diff] [review]
patch

Review of attachment 8970539 [details] [diff] [review]:
-----------------------------------------------------------------

::: netwerk/base/nsICacheInfoChannel.idl
@@ +106,5 @@
>     * The consumer may choose to replace the saved alt representation.
>     * Opening the output stream will fail if there are any open input streams
>     * reading the already saved alt representation.
>     */
> +  nsIOutputStream openAlternativeOutputStream(in ACString type, in long long predictedSize);

some doc for the new arg should be added

::: netwerk/cache2/CacheFile.cpp
@@ +2188,5 @@
> +      }
> +      if (altDataInputExists) {
> +        SetError(aStatus);
> +      } else {
> +        rv = Truncate(mAltDataOffset);

maybe MOZ_ASSERT(mAltDataOffset > 0) ?

::: netwerk/cache2/CacheFileChunk.cpp
@@ -779,5 @@
>  
>  nsresult
>  CacheFileChunk::GetStatus()
>  {
> -  AssertOwnsLock();

why this change?

::: netwerk/cache2/CacheFileOutputStream.cpp
@@ -104,2 @@
>  
> -    mFile->DoomLocked(nullptr);

can you please explain why we no longer doom here and what consequences this has?

::: netwerk/protocol/http/nsHttpChannel.cpp
@@ +5390,5 @@
>      mCacheInputStream.CloseAndRelease();
>  
> +    int64_t predictedSize = mResponseHead->TotalEntitySize();
> +    if (predictedSize != -1) {
> +      predictedSize -= offset;

4 spaces indent

::: netwerk/test/unit/test_alt-data_too_big.js
@@ +35,5 @@
> +  var written = os.write(data, data.length);
> +  Assert.equal(written, data.length);
> +  os.close();
> +
> +  os = entry.openAlternativeOutputStream("text/binary", -1);

it would be good to also clone this test and pass the alt-data length here to check we fail and leave the data intact, just as a regression test.

::: netwerk/test/unit/test_bug248970_cache.js
@@ +54,5 @@
>  }
>  
>  var store_data = function(status, entry) {
>    Assert.equal(status, Cr.NS_OK);
> +  var os = entry.openOutputStream(0, entries[store_idx][1].length);

can this now fail here? If so, it might turn this test into something that no longer works...
Attachment #8970539 - Flags: review?(honzab.moz) → review+
Comment on attachment 8970539 [details] [diff] [review]
patch

Review of attachment 8970539 [details] [diff] [review]:
-----------------------------------------------------------------

::: dom/script/ScriptLoader.cpp
@@ +2493,4 @@
>                                                           getter_AddRefs(output));
>    if (NS_FAILED(rv)) {
>      LOG(("ScriptLoadRequest (%p): Cannot open bytecode cache (rv = %X, output = %p)",
>           aRequest, unsigned(rv), output.get()));

I presume that I would have to make a patch to check for NS_ERROR_FILE_TOO_BIG, and use SetMetaDataElement & GetMetaDataElement to prevent future bytecode encoding on such failures.

Is that right?
(In reply to Nicolas B. Pierron [:nbp] {backlog: ~38} from comment #24)
> I presume that I would have to make a patch to check for
> NS_ERROR_FILE_TOO_BIG, 

Yes, but I think you can just ignore it for your use case?

> and use SetMetaDataElement & GetMetaDataElement to
> prevent future bytecode encoding on such failures.

The check is fast, and when users change the preferences (enlarge the limits) we may allow the bytecode caching again. So I don't think you need it, unless you are also saving some additional work that we would otherwise always throw away here. In that case, probably yes, do remember it.

> 
> Is that right?
(In reply to Honza Bambas (:mayhemer) from comment #25)
> (In reply to Nicolas B. Pierron [:nbp] {backlog: ~38} from comment #24)
> > and use SetMetaDataElement & GetMetaDataElement to
> > prevent future bytecode encoding on such failures.
> 
> The check is fast and when users change preferences (enlarge the limits) we
> may allow the bytecode caching.  So, I don't think you need it unless you
> are also saving some additional work we would just always throw away here. 
> Then probably yes, do remember it.

Which is the case. Therefore, I opened a follow-up bug (Bug 1456834) to handle it.
Thanks.
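A hypothetical illustration of that metadata flag, not the actual follow-up patch in bug 1456834; the key name and how the nsICacheEntry is obtained from the channel are assumptions:

#include "mozilla/Unused.h"
#include "nsICacheEntry.h"

// Record the "alt-data was too big" failure once in the entry's metadata,
// so the script loader can skip re-encoding bytecode for this entry later
// (checked via GetMetaDataElement before starting the encoder).
static const char* const kBytecodeTooBigKey = "bytecode-cache-too-big"; // assumed key

void RememberBytecodeTooBig(nsICacheEntry* aEntry, nsresult aRv)
{
  if (aEntry && aRv == NS_ERROR_FILE_TOO_BIG) {
    // Ignore failures; this is only an optimization hint.
    mozilla::Unused << aEntry->SetMetaDataElement(kBytecodeTooBigKey, "1");
  }
}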
Attached patch patch v2 — Splinter Review
(In reply to Honza Bambas (:mayhemer) from comment #23)
> ::: netwerk/cache2/CacheFile.cpp
> @@ +2188,5 @@
> > +      }
> > +      if (altDataInputExists) {
> > +        SetError(aStatus);
> > +      } else {
> > +        rv = Truncate(mAltDataOffset);
> 
> maybe MOZ_ASSERT(mAltDataOffset > 0) ?

I've added MOZ_ASSERT(mAltDataOffset != -1); I can imagine alt-data being added to an entry with no data, so there is no reason not to allow 0.

> ::: netwerk/cache2/CacheFileChunk.cpp
> @@ -779,5 @@
> >  
> >  nsresult
> >  CacheFileChunk::GetStatus()
> >  {
> > -  AssertOwnsLock();
> 
> why this change?

I've added a call to this function from CacheFile::Truncate to make sure the error didn't happen in the last chunk, and CacheFileChunk::AssertOwnsLock dereferences mFile, which is null when the chunk is inactive.

> ::: netwerk/cache2/CacheFileOutputStream.cpp
> @@ -104,2 @@
> >  
> > -    mFile->DoomLocked(nullptr);
> 
> can you please explain why we no longer doom here and what consequences this
> has?

CloseWithStatusLocked is called immediately, which calls CacheFile::RemoveOutput, which handles the failure: it either cuts the alt-data away or dooms the entry.
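In other words, a simplified sketch of that decision with assumed names, not the actual CacheFile::RemoveOutput code:

enum class FailureAction { TruncateAltData, DoomEntry };

// wasAltData:         the failing output stream was writing alternative data
// altDataInputExists: a consumer already holds an input stream on that alt-data
FailureAction OnOutputStreamFailure(bool wasAltData, bool altDataInputExists)
{
  if (wasAltData && !altDataInputExists) {
    return FailureAction::TruncateAltData; // keep the original response data
  }
  return FailureAction::DoomEntry;         // the entry is unusable, drop it
}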

> ::: netwerk/test/unit/test_alt-data_too_big.js
> @@ +35,5 @@
> > +  var written = os.write(data, data.length);
> > +  Assert.equal(written, data.length);
> > +  os.close();
> > +
> > +  os = entry.openAlternativeOutputStream("text/binary", -1);
> 
> it would be good to also clone this test and pass the alt-data length here
> to check we fail and leave the data intact, just as a regression test.

I've rewritten the test to cover this case too.

> ::: netwerk/test/unit/test_bug248970_cache.js
> @@ +54,5 @@
> >  }
> >  
> >  var store_data = function(status, entry) {
> >    Assert.equal(status, Cr.NS_OK);
> > +  var os = entry.openOutputStream(0, entries[store_idx][1].length);
> 
> can this now fail here?  if so, it might turn this test to something no
> longer working..

No, the data is too small. Since the test doesn't exercise the size limit, it doesn't matter whether the real size or -1 is used.
Attachment #8970539 - Attachment is obsolete: true
Attachment #8970885 - Flags: review+
Keywords: checkin-needed
Pushed by nbeleuzu@mozilla.com:
https://hg.mozilla.org/integration/mozilla-inbound/rev/db2555234bfc
Cache entry corruption after writing the alternate data. r=honzab
Keywords: checkin-needed
https://hg.mozilla.org/mozilla-central/rev/db2555234bfc
Status: ASSIGNED → RESOLVED
Closed: 6 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla61
Pushed by mozilla@jorgk.com:
https://hg.mozilla.org/comm-central/rev/d3137aa77a2d
Port bug 1448476 to mailnews (IMAP, NNTP). rs=bustage-fix
Depends on: 1475701