Closed Bug 1448476 Opened 6 years ago Closed 6 years ago

Cache entry corruption after writing the alternate data.

Categories

(Core :: Networking: Cache, defect, P2)

Version: 59 Branch
Type: defect

Tracking

RESOLVED FIXED
mozilla61
Tracking Status
firefox61 --- fixed

People

(Reporter: mar.kolya, Assigned: michal)

References

Details

(Whiteboard: [necko-triaged])

Attachments

(1 file, 1 obsolete file)

User Agent: Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:59.0) Gecko/20100101 Firefox/59.0
Build ID: 20180316021353

Steps to reproduce:

Visit some sites with very large JS files (e.g. corporate Jira)


Actual results:

The pages are very slow to open. Upon further investigation it turned out that large (15+MB) JS files were reloaded on every visit even though the headers suggested they should be cached.
I've looked in about:cache and that URL was there - but it was not used.


Expected results:

All cache entries should be used.

I did a bit more investigation into this problem:
If I load that JS URL directly in the browser, I can then find a cache entry (in cache2/entries) that looks 'sensible': it contains the JavaScript text and, at the end, the headers and, most importantly, the original URL. If I visit the Jira site after that, the URL is loaded from cache. But after a few visits it stops being loaded from cache and is always transferred from the remote server. At that point, if I look into the file, the beginning of it still contains JS code, but the end looks like something encoded in a two-byte encoding (i.e. lots of '@<char>' kind of sequences). And most importantly, this cache entry no longer contains the original URL.

It looks like disabling 'dom.script_loader.bytecode_cache.enabled' prevents the problem from happening.
Component: Untriaged → Networking: Cache
Product: Firefox → Core
Flags: needinfo?(nicolas.b.pierron)
Component: Networking: Cache → JavaScript Engine
Based on the description, ...

This is weird; it sounds like the addition of the bytecode cache content causes a cache entry corruption, preventing the cache entry from being loaded again from the cache, while not being invalidated by another entry.

The 2-byte encoding is the beginning of the encoded bytecode cache, which, from what I recall, is appended at the end of the JS file. I have no idea how a large bytecode cache could prevent the indexing of a cache entry; this is more likely a question for Valentin.

Have you tried multiplying the amount of available cache space by 4? Could this problem arise when the cache is full?
Maybe this is a race-network-with-cache issue which is not supposed to be enabled if we request any alternate data.

One simple fix might be to cap the size of the bytecode cache entry.
Flags: needinfo?(nicolas.b.pierron) → needinfo?(valentin.gosu)
Summary: bytecode cache corruption → Cache entry corruption after writing the alternate data.
Thanks for this bug report. Could you maybe post the contents of about:cache both before and after you notice the corruption?
It would help us figure out what's going on here.
If the problem isn't obvious from the contents of the cache entry, we may also ask you to gather some logs, if that's OK.
Flags: needinfo?(valentin.gosu) → needinfo?(mar.kolya)
(In reply to Nicolas B. Pierron [:nbp] {backlog: ~36} from comment #1)
> Based on the description, ...
> 
> This is weird, it sounds that the addition of the bytecode cache content
> would cause a cache entry corruption, thus preventing the cache entry from
> being loaded again from the cache, while not being invalidated by another
> entry.
> 
> The 2 byte encoding is the beginning of the encoded bytecode cache, which
> from what I recall is appended at the end of the JS file. I have no idea how
> a large bytecode cache could prevent the indexing of a cache entry, this
> would be more likely a question for valentin.
> 
> Have you tried multiplying by 4 the amount of cache space available, could
> this problem arise when the cache is full?
> Maybe this is a race-network-with-cache issue which is not supposed to be
> enabled if we request any alternate data.

We don't race if the cache index says the entry has alt-data.
But I think there could be cases where the cache index is unavailable or doesn't return the correct answer.

> 
> One simple fix might be to cap the size of the bytecode cache entry.

If it's the size that's causing the problem, maybe we're exceeding the maximum size of the cache file? (question for Honza or Michal)
Flags: needinfo?(michal.novotny)
Valentin, could you please clarify whether you would like me to post the literal contents of 'about:cache', which is a sort of summary, or would you like me to paste the actual entry that gets broken?

If the former, then I should be able to do that. I can reproduce the problem after a complete cache wipe-out, so 'before' would be an empty cache. And even before I do that, I know for a fact that the cache fills up very quickly, so 'after' would be a full cache. This entry is also big enough not to go into RAM.

If you would like the actual broken entry I might not be able to post it publicly. I might be able to share it privately, however.

Please let me know in a bit more detail what information you require.
Thanks!
Flags: needinfo?(mar.kolya)
(In reply to Nikolay Martynov from comment #4) 
> If you would like the actual broken entry I might not be able to post it
> publicly. I might be able to share it privately, however.

If you could, please send me the complete contents of the cache entries (both the one that works and the one that is corrupted): about:cache-entry?storage=disk&... 
Send it to me by email if you are worried about sensitive data it contains.

Thanks!
Flags: needinfo?(mar.kolya)
(In reply to Valentin Gosu [:valentin] from comment #3)
> If it's the size causing the problem. Maybe we're exceeding the maximum size
> of the cache file? (question for Honza or Michal)

The maximum cache file size is browser.cache.disk.max_entry_size or browser.cache.disk.capacity/8 (the lower one). This includes data, alt-data and metadata. We can fail early at https://searchfox.org/mozilla-central/rev/a0665934fa05158a5a943d4c8b277465910c029c/netwerk/cache2/CacheEntry.cpp#1345 or later at https://searchfox.org/mozilla-central/rev/a0665934fa05158a5a943d4c8b277465910c029c/netwerk/cache2/CacheFileOutputStream.cpp#101 which would be this case.

In any case, this would doom the entry immediately so it shouldn't be stored in cache2/entries anymore.
Flags: needinfo?(michal.novotny)
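For illustration, the per-entry limit described above could be computed roughly like this. This is a minimal sketch, not the actual cache code, and it assumes both pref values are expressed in KB:

#include <algorithm>
#include <stdint.h>

// Effective per-entry limit: the lower of browser.cache.disk.max_entry_size
// and browser.cache.disk.capacity / 8. The limit covers data + alt-data +
// metadata together.
int64_t EffectiveMaxEntrySizeBytes(int64_t maxEntrySizeKB, int64_t capacityKB)
{
  return std::min(maxEntrySizeKB, capacityKB / 8) * 1024;
}

// With the 350MB default cache size mentioned later in this bug,
// capacity / 8 is about 43.75MB, which is then the effective per-entry cap.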
Actual URLs would be good to test with. Anyway, I can imagine we easily run over the limit, as a 15+MB JS source can produce more than 35MB of alt-data, easily filling the 50MB per-entry limit.

What's bad is that we don't at least keep the source in the cache instead of throwing everything away. Could we introduce a limit in the script loader to not store alt-data when the source is above a certain size?

How complicated would it be to not doom the entry on reaching the limit, but only cut the alt-data away?
(In reply to Honza Bambas (:mayhemer) from comment #7)
> Bad is that we could at least keep the source in the cache and not throw
> everything away.  Could we introduce a limit in the script loader to not
> store alt-data when the source is above a certain size?

This is something which should be easy to add:

https://searchfox.org/mozilla-central/rev/a0665934fa05158a5a943d4c8b277465910c029c/dom/script/ScriptLoader.cpp#2121,2152-2155

> How complicated would be to not doom the entry on reaching the limit but
> only cut the alt data away?

If we do so, then we would need an extra flag on the channel, once completed, to avoid re-encoding it every time.
(In reply to Honza Bambas (:mayhemer) from comment #7)
> How complicated would be to not doom the entry on reaching the limit but
> only cut the alt data away?

After a quick look at the code I think it's not easy to do. IMO, it would be much easier to check the limit before we start writing alt-data. Does JS know the exact size of the alt-data before it opens the output stream?
Thanks for sending the cache entries, Nikolay.
Flags: needinfo?(mar.kolya)
Assigning to Michal until we get to the root of the problem.
Assignee: nobody → michal.novotny
Component: JavaScript Engine → Networking: Cache
Priority: -- → P2
Whiteboard: [necko-triaged]
(In reply to Michal Novotny (:michal) from comment #9)
> (In reply to Honza Bambas (:mayhemer) from comment #7)
> > How complicated would be to not doom the entry on reaching the limit but
> > only cut the alt data away?
> 
> After a quick look at the code I think it's not easy to do. IMO, much easier
> would be to check the limit before we start writing alt-data. Does JS know
> exact size of the alt-data before it opens the output stream?

Yes.

https://searchfox.org/mozilla-central/rev/a0665934fa05158a5a943d4c8b277465910c029c/dom/script/ScriptLoader.cpp#2472,2490
The entry isn't written completely because the data is written to the output stream faster than the disk is able to write it. Unwritten CacheFileChunks start to accumulate, and when the limit is reached CacheFileChunk::CanAllocate() fails.

The entry isn't doomed because the error is propagated down to CacheFile::RemoveOutput() (https://searchfox.org/mozilla-central/rev/2ce99e8054b0ff6ed1adf484aeaacacf2fea084c/netwerk/cache2/CacheFile.cpp#2170) where we only set mStatus instead of calling SetError() which would doom the entry.
Status: UNCONFIRMED → ASSIGNED
Ever confirmed: true
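To make the mechanism just described concrete, here is a toy model of the backlog limit. The names and structure are illustrative assumptions, not the real CacheFileChunk code:

#include <stdint.h>

// Writes accepted from the output stream are queued as chunks; once the
// unwritten bytes exceed the configured memory limit, the next allocation
// is refused -- which is what CacheFileChunk::CanAllocate() reports in the
// real code when the disk cannot keep up.
struct ChunkBacklog {
  int64_t pendingBytes = 0; // accepted but not yet flushed to disk
  int64_t limitBytes = 0;   // cf. browser.cache.disk.max_chunks_memory_usage

  bool CanAllocate(int64_t chunkSize) const {
    return pendingBytes + chunkSize <= limitBytes;
  }
  void OnChunkQueued(int64_t chunkSize) { pendingBytes += chunkSize; }
  void OnChunkWritten(int64_t chunkSize) { pendingBytes -= chunkSize; }
};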
(In reply to Michal Novotny (:michal) from comment #13)
> The entry isn't written completely because the data is written to output
> stream faster than the disk is able to write. Unwritten CacheFileChunks
> start to cumulate and when the limit is reached
> CacheFileChunk::CanAllocate() fails.

I'm not sure how to fix this. We could increase browser.cache.disk.max_chunks_memory_usage and browser.cache.disk.max_priority_chunks_memory_usage, but the question is what the right value is. I did some tests with a 13MB JS file which produces 30MB of alt-data (the total entry size is just below the max entry size for the default 350MB cache size), and sometimes the backlog of unwritten data reached 30MB.

Another option is to bypass chunk usage limiting for alt-data, but I don't like that idea because the memory usage could then grow a lot, and that is exactly what this limiting mechanism is there to prevent. On the other hand, we probably won't write large alt-data very often, so this would happen rarely.

Honza, what's your opinion?
Flags: needinfo?(honzab.moz)
Can we just disable alternate data caching for js resources over a certain high threshold?
(In reply to Michal Novotny (:michal) from comment #14)
> (In reply to Michal Novotny (:michal) from comment #13)
> > The entry isn't written completely because the data is written to output
> > stream faster than the disk is able to write. Unwritten CacheFileChunks
> > start to cumulate and when the limit is reached
> > CacheFileChunk::CanAllocate() fails.
> 
> I'm not sure how to fix this. We could increase
> browser.cache.disk.max_chunks_memory_usage and
> browser.cache.disk.max_priority_chunks_memory_usage, but the question is
> what's the right value? I did some tests with 13MB JS which produces 30MB
> alt-data (total entry size is just below the max entry size for default
> 350MB cache size) and sometimes the backlog of unwritten data reaches 30MB.
> 
> Other option is to bypass chunk usage limiting for alt-data, but I don't
> like this idea because then the memory usage can grow a lot and that's
> exactly why we have this limiting mechanism. 

That was one reason; the other was to keep the write backlog short on slow HW, I think. We break both by not enforcing the limit for alt-data, yes...

Anyway, is there at least some data on whether a 30MB bytecode cache has a positive effect on performance? 30MB takes some time to load...

I had a (somewhat wild) idea of doing a WriteSegments-style thing on the cache output stream, but the bytecode is generated in the content process and we write in the parent process, so we do an IPC message data copy anyway.

> On the other hand we won't
> probably write large alt-data very often, so this would probably happen
> rarely.
> 
> Honza, what's your opinion?

I would agree with Ben here, and it also seems we know the bytecode size soon enough to prevent the cache write. So, have a limit (a JS engine pref?) that decides whether or not to write it in the script loader.
Flags: needinfo?(honzab.moz)
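A minimal sketch of that size gate follows. The default value, the "no limit" convention, and the helper name are assumptions made for illustration, not the actual ScriptLoader change:

#include <stdint.h>

// Hypothetical gate: skip bytecode encoding for sources above a
// configurable threshold, so an oversized cache entry is never attempted
// in the first place.
static const int64_t kAssumedMaxEncodableSourceBytes = 5 * 1024 * 1024; // made-up default

bool ShouldEncodeBytecode(int64_t sourceLength,
                          int64_t limit = kAssumedMaxEncodableSourceBytes)
{
  // A limit of 0 or less means "no limit" here; that convention is also assumed.
  return limit <= 0 || sourceLength <= limit;
}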
Sorry to jump in, but I would like to clarify one thing: this problem exhibits itself on a quite powerful laptop: Intel(R) Xeon(R) CPU E3-1505M v5, 4C/8T, plenty of RAM, reasonably fast SSD. There is no reason it should be too slow for any sort of web browsing operation :). My observation is that downloading the (unreasonably) large JS is slow because of the slow network, not the slowness of the laptop; it takes tens of seconds to download. Modern SSDs can read hundreds of MB per second, which probably makes storing anything locally faster than downloading it.

Just wanted to clarify that this is not an issue observed due to some underpowered HW.
Thanks for looking into this problem!
(In reply to Honza Bambas (:mayhemer) from comment #16)
> Anyway, are there at least some data if a 30MB bytecode cache has a positive
> effect on performance?  30MB takes some time to load...

Nicolas, do you have numbers on that? Also note that the testing JS was not gzipped; if the server compresses the response it's about 5MB. So in this case, to avoid JS compilation we need to read 6x more data from the disk.
Flags: needinfo?(nicolas.b.pierron)
(In reply to Nikolay Martynov from comment #17)
> Sorry to jump in 

We evict the entry because the compiled data are too big.  We then get rid of the original server data as well.  The outcome of this bug will likely be to not store the compiled data, which will preserve the server content in the cache.
(In reply to Nikolay Martynov from comment #17)
> My observation is that downloading (unreasonably) large JS is slow due 
> to slow networks, not slowness of the laptop. It takes 10s of seconds to
> download that.

Unfortunately, I don't have numbers on how much storing compiled JS improves performance. But when it fails, we doom the already downloaded plain JS, which makes things much worse. We really need to be able to just throw the alt-data away when we're unable to write it completely, for whatever reason. As I already wrote, this probably isn't easy, but I'll have a look at this solution again...
(In reply to Michal Novotny (:michal) from comment #18)
> (In reply to Honza Bambas (:mayhemer) from comment #16)
> > Anyway, are there at least some data if a 30MB bytecode cache has a positive
> > effect on performance?  30MB takes some time to load...
> 
> Nicolas, do you have numbers about that? Also note that the testing JS was
> not gzipped, if server compress the response it's about 5MB. So in this case
> to avoid JS compiling we need to read 6x more data from the disk.

I have no size-based measurements, only time-based ones.

Bug 900784 comment 116 suggests that this optimization improves the long tail of time-consuming scripts, which "might" be related to large scripts.
Flags: needinfo?(nicolas.b.pierron)
Attached patch patch (obsolete) — Splinter Review
See the commit message in the patch for description.

Try looks good:
https://treeherder.mozilla.org/#/jobs?repo=try&revision=df2e7486cc2df417b4e4c7b54c4ab4918c2f55fe&selectedJob=174638951

There are test-verify failures in netwerk/test/unit/test_http2.js, but it fails without this patch too.
Attachment #8970539 - Flags: review?(honzab.moz)
Comment on attachment 8970539 [details] [diff] [review]
patch

Review of attachment 8970539 [details] [diff] [review]:
-----------------------------------------------------------------

::: netwerk/base/nsICacheInfoChannel.idl
@@ +106,5 @@
>     * The consumer may choose to replace the saved alt representation.
>     * Opening the output stream will fail if there are any open input streams
>     * reading the already saved alt representation.
>     */
> +  nsIOutputStream openAlternativeOutputStream(in ACString type, in long long predictedSize);

some doc for the new arg should be added

::: netwerk/cache2/CacheFile.cpp
@@ +2188,5 @@
> +      }
> +      if (altDataInputExists) {
> +        SetError(aStatus);
> +      } else {
> +        rv = Truncate(mAltDataOffset);

maybe MOZ_ASSERT(mAltDataOffset > 0) ?

::: netwerk/cache2/CacheFileChunk.cpp
@@ -779,5 @@
>  
>  nsresult
>  CacheFileChunk::GetStatus()
>  {
> -  AssertOwnsLock();

why this change?

::: netwerk/cache2/CacheFileOutputStream.cpp
@@ -104,2 @@
>  
> -    mFile->DoomLocked(nullptr);

can you please explain why we no longer doom here and what consequences this has?

::: netwerk/protocol/http/nsHttpChannel.cpp
@@ +5390,5 @@
>      mCacheInputStream.CloseAndRelease();
>  
> +    int64_t predictedSize = mResponseHead->TotalEntitySize();
> +    if (predictedSize != -1) {
> +      predictedSize -= offset;

4 spaces indent

::: netwerk/test/unit/test_alt-data_too_big.js
@@ +35,5 @@
> +  var written = os.write(data, data.length);
> +  Assert.equal(written, data.length);
> +  os.close();
> +
> +  os = entry.openAlternativeOutputStream("text/binary", -1);

it would be good to also clone this test and pass the alt-data length here to check we fail and leave the data intact, just as a regression test.

::: netwerk/test/unit/test_bug248970_cache.js
@@ +54,5 @@
>  }
>  
>  var store_data = function(status, entry) {
>    Assert.equal(status, Cr.NS_OK);
> +  var os = entry.openOutputStream(0, entries[store_idx][1].length);

can this now fail here? If so, it might turn this test into something that no longer works...
Attachment #8970539 - Flags: review?(honzab.moz) → review+
Comment on attachment 8970539 [details] [diff] [review]
patch

Review of attachment 8970539 [details] [diff] [review]:
-----------------------------------------------------------------

::: dom/script/ScriptLoader.cpp
@@ +2493,4 @@
>                                                           getter_AddRefs(output));
>    if (NS_FAILED(rv)) {
>      LOG(("ScriptLoadRequest (%p): Cannot open bytecode cache (rv = %X, output = %p)",
>           aRequest, unsigned(rv), output.get()));

I presume that I would have to make a patch to check for NS_ERROR_FILE_TOO_BIG, and use SetMetaDataElement & GetMetaDataElement to prevent future bytecode encoding on such failures.

Is that right?
(In reply to Nicolas B. Pierron [:nbp] {backlog: ~38} from comment #24)
> I presume that I would have to make a patch to check for
> NS_ERROR_FILE_TOO_BIG, 

Yes, but I think you can just ignore it for your use case?

> and use SetMetaDataElement & GetMetaDataElement to
> prevent future bytecode encoding on such failures.

The check is fast, and when users change the preferences (enlarge the limits) we may allow the bytecode caching again. So I don't think you need it, unless you are also saving some additional work that we would otherwise always throw away here. In that case, probably yes, do remember it.

> 
> Is that right?
(In reply to Honza Bambas (:mayhemer) from comment #25)
> (In reply to Nicolas B. Pierron [:nbp] {backlog: ~38} from comment #24)
> > and use SetMetaDataElement & GetMetaDataElement to
> > prevent future bytecode encoding on such failures.
> 
> The check is fast and when users change preferences (enlarge the limits) we
> may allow the bytecode caching.  So, I don't think you need it unless you
> are also saving some additional work we would just always throw away here. 
> Then probably yes, do remember it.

Which is the case. Therefore, I opened a follow-up bug (Bug 1456834) to handle it.
Thanks.
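A hypothetical illustration of that metadata flag, not the actual follow-up patch in bug 1456834; the key name and how the nsICacheEntry is obtained from the channel are assumptions:

#include "mozilla/Unused.h"
#include "nsICacheEntry.h"

// Record the "alt-data was too big" failure once in the entry's metadata,
// so the script loader can skip re-encoding bytecode for this entry later
// (checked via GetMetaDataElement before starting the encoder).
static const char* const kBytecodeTooBigKey = "bytecode-cache-too-big"; // assumed key

void RememberBytecodeTooBig(nsICacheEntry* aEntry, nsresult aRv)
{
  if (aEntry && aRv == NS_ERROR_FILE_TOO_BIG) {
    // Ignore failures; this is only an optimization hint.
    mozilla::Unused << aEntry->SetMetaDataElement(kBytecodeTooBigKey, "1");
  }
}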
Attached patch patch v2 — Splinter Review
(In reply to Honza Bambas (:mayhemer) from comment #23)
> ::: netwerk/cache2/CacheFile.cpp
> @@ +2188,5 @@
> > +      }
> > +      if (altDataInputExists) {
> > +        SetError(aStatus);
> > +      } else {
> > +        rv = Truncate(mAltDataOffset);
> 
> maybe MOZ_ASSERT(mAltDataOffset > 0) ?

I've added MOZ_ASSERT(mAltDataOffset != -1); I can imagine alt-data being added to an entry with no data, so there is no reason not to allow 0.

> ::: netwerk/cache2/CacheFileChunk.cpp
> @@ -779,5 @@
> >  
> >  nsresult
> >  CacheFileChunk::GetStatus()
> >  {
> > -  AssertOwnsLock();
> 
> why this change?

I've added a call to this function from CacheFile::Truncate to make sure the error didn't happen in the last chunk, and CacheFileChunk::AssertOwnsLock dereferences mFile, which is null when the chunk is inactive.

> ::: netwerk/cache2/CacheFileOutputStream.cpp
> @@ -104,2 @@
> >  
> > -    mFile->DoomLocked(nullptr);
> 
> can you please explain why we no longer doom here and what consequences this
> has?

CloseWithStatusLocked is called immediately, which calls CacheFile::RemoveOutput, which handles the failure: it either cuts the alt-data away or dooms the entry.
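In other words, a simplified sketch of that decision with assumed names, not the actual CacheFile::RemoveOutput code:

enum class FailureAction { TruncateAltData, DoomEntry };

// wasAltData:         the failing output stream was writing alternative data
// altDataInputExists: a consumer already holds an input stream on that alt-data
FailureAction OnOutputStreamFailure(bool wasAltData, bool altDataInputExists)
{
  if (wasAltData && !altDataInputExists) {
    return FailureAction::TruncateAltData; // keep the original response data
  }
  return FailureAction::DoomEntry;         // the entry is unusable, drop it
}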

> ::: netwerk/test/unit/test_alt-data_too_big.js
> @@ +35,5 @@
> > +  var written = os.write(data, data.length);
> > +  Assert.equal(written, data.length);
> > +  os.close();
> > +
> > +  os = entry.openAlternativeOutputStream("text/binary", -1);
> 
> it would be good to also clone this test and pass the alt-data length here
> to check we fail and leave the data intact, just as a regression test.

I've rewritten the test to cover this case too.

> ::: netwerk/test/unit/test_bug248970_cache.js
> @@ +54,5 @@
> >  }
> >  
> >  var store_data = function(status, entry) {
> >    Assert.equal(status, Cr.NS_OK);
> > +  var os = entry.openOutputStream(0, entries[store_idx][1].length);
> 
> can this now fail here?  if so, it might turn this test to something no
> longer working..

No, the data is too small. Since the test doesn't exercise the size limit, it doesn't matter whether the real size or -1 is used.
Attachment #8970539 - Attachment is obsolete: true
Attachment #8970885 - Flags: review+
Keywords: checkin-needed
Pushed by nbeleuzu@mozilla.com:
https://hg.mozilla.org/integration/mozilla-inbound/rev/db2555234bfc
Cache entry corruption after writing the alternate data. r=honzab
Keywords: checkin-needed
https://hg.mozilla.org/mozilla-central/rev/db2555234bfc
Status: ASSIGNED → RESOLVED
Closed: 6 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla61
Pushed by mozilla@jorgk.com:
https://hg.mozilla.org/comm-central/rev/d3137aa77a2d
Port bug 1448476 to mailnews (IMAP, NNTP). rs=bustage-fix
Depends on: 1475701