Closed Bug 1013395 Opened 5 years ago Closed 5 years ago

HTTP cache v2: have a limit for write backlog

Categories

(Core :: Networking: Cache, defect)


Tracking


RESOLVED FIXED
mozilla33
Tracking Status
firefox30 --- unaffected
firefox31 --- unaffected
firefox32 --- affected
firefox33 --- fixed
b2g-v1.4 --- unaffected
b2g-v2.0 --- affected
b2g-v2.1 --- fixed

People

(Reporter: mayhemer, Assigned: michal)

References

(Blocks 1 open bug)

Details

Attachments

(1 file, 7 obsolete files)

Slow storage devices may make writing the data take longer than downloading it from the network.  In that case we should cache only the critical content (html/css/js) and not images/media.  The easiest way to implement this seems to be to monitor the number and size of chunks waiting for write.  When over the limit, first fail writes of the non-critical data; if still over the limit, fail writes of the critical data too.  I don't have a concrete plan on how to sort/prioritize the entries to be thrown away.
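The proposed backlog check could be sketched roughly like this. This is a minimal illustration only, assuming a separate running total and limit per bucket; none of these names or constants exist in the actual cache2 code:

```cpp
#include <cstdint>

namespace sketch {

// Illustrative limits, not the real pref defaults.
const uint32_t kNormalLimit   = 40 * 1024 * 1024; // bytes of unwritten normal chunks
const uint32_t kPriorityLimit =  2 * 1024 * 1024; // html/css/js bucket

// Running totals of chunk data queued for write, one per bucket.
static uint32_t gNormalBacklog = 0;
static uint32_t gPriorityBacklog = 0;

// Returns true when a write of aSize bytes may still be buffered.
// Non-critical data has its own (larger) budget, so big media writes
// cannot starve the small critical-content bucket.
bool CanBufferWrite(uint32_t aSize, bool aPriority)
{
  uint32_t& usage = aPriority ? gPriorityBacklog : gNormalBacklog;
  uint32_t limit  = aPriority ? kPriorityLimit : kNormalLimit;
  if (limit == 0) {
    return true; // 0 means "no limit"
  }
  return usage + aSize <= limit;
}

} // namespace sketch
```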
Blocks: 1020345
Attached patch WIP patch (obsolete) — Splinter Review
The logic is quite simple. There is no sophisticated prioritization. Priority and normal entries have separate limits and the idea is that priority resources are typically small and they will usually fit into the limit and won't be affected by normal entries that could be potentially very large.

We also need to somehow limit PreloadChunkCount according to these new limits since e.g. it makes no sense to limit memory occupied by non-written chunks to 2 MB and at the same time try to preload 4 chunks in advance.
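Clamping the preload count by the memory limit could look like the following sketch (256 kB is the cache2 chunk size; the function and constant names here are made up for illustration):

```cpp
#include <algorithm>
#include <cstdint>

const uint32_t kChunkSize = 256 * 1024;       // cache2 chunk size (256 kB)
const uint32_t kDefaultPreloadChunkCount = 4;

// If preloading the default number of chunks would alone exceed the
// unwritten-chunks memory limit, shrink the preload count so the two
// settings cannot contradict each other.
uint32_t EffectivePreloadChunkCount(uint32_t aMemoryLimitBytes)
{
  if (aMemoryLimitBytes == 0) {
    return kDefaultPreloadChunkCount; // no limit configured
  }
  uint32_t derived = aMemoryLimitBytes / kChunkSize;
  return std::min(kDefaultPreloadChunkCount, derived);
}
```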
Assignee: nobody → michal.novotny
Attachment #8437563 - Flags: feedback?(honzab.moz)
Comment on attachment 8437563 [details] [diff] [review]
WIP patch

Review of attachment 8437563 [details] [diff] [review]:
-----------------------------------------------------------------

::: netwerk/cache2/CacheFileChunk.cpp
@@ +13,5 @@
>  
>  #define kMinBufSize        512
>  
> +mozilla::Atomic<uint32_t> CacheFileChunk::mChunksMemoryUsage(0);
> +mozilla::Atomic<uint32_t> CacheFileChunk::mPriorityChunksMemoryUsage(0);

bad.. this is a static initializer...  have a lazy getter:

mozilla::Atomic<uint32_t>& CacheFileChunk::ChunksMemoryUsage()
{
  static mozilla::Atomic<uint32_t> chunksMemoryUsage(0);
  static mozilla::Atomic<uint32_t> prioChunksMemoryUsage(0);
  return mIsPriorityChunk ? prioChunksMemoryUsage : chunksMemoryUsage;
}
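For reference, the lazy-getter pattern compiles standalone like this. An illustrative sketch using std::atomic in place of mozilla::Atomic, with a bool parameter standing in for the member flag; function-local statics are initialized on first use, so no global static initializer runs at load time:

```cpp
#include <atomic>
#include <cstdint>

// Lazily initialized counters: nothing runs before first call.
std::atomic<uint32_t>& ChunksMemoryUsage(bool aPriority)
{
  static std::atomic<uint32_t> sUsage{0};
  static std::atomic<uint32_t> sPriorityUsage{0};
  return aPriority ? sPriorityUsage : sUsage;
}
```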

@@ +161,5 @@
> +    if (mBuf) {
> +      MemoryAllocationChanged(0, kMinBufSize);
> +      mBufSize = kMinBufSize;
> +    }
> +  }

is it checked we are OK further in the code path when this fails?

@@ +500,5 @@
>  
>            // Merge data with write buffer
>            if (mRWBufSize < mBufSize) {
>              mRWBuf = static_cast<char *>(moz_xrealloc(mRWBuf, mBufSize));
> +            MemoryAllocationChanged(mRWBufSize, mBufSize);

how complicated would be to make this also fallible?

@@ +668,5 @@
> +      return NS_ERROR_OUT_OF_MEMORY;
> +    }
> +  } else {
> +    return NS_ERROR_OUT_OF_MEMORY;
> +  }

maybe:

if (!CanAllocate())
  return OOM;

newBuf = moz_realloc()
if (!newBuf)
  return OOM;

...

@@ +713,5 @@
> +    usage = mChunksMemoryUsage;
> +    limit = CacheObserver::MaxDiskChunksMemoryUsage();
> +  }
> +
> +  if (usage + aSize > limit) {

maybe just |return usage + aSize <= limit;| ?

::: netwerk/cache2/CacheFileChunk.h
@@ +102,5 @@
>    void     SetError(nsresult aStatus);
>  
>    char *       BufForWriting() const;
>    const char * BufForReading() const;
> +  nsresult     EnsureBufSize(uint32_t aBufSize);

also called from CacheFile::PadChunkWithZeroes.  

use MOZ_WARN_UNUSED_RESULT for this method.

@@ +117,5 @@
>  
>    virtual ~CacheFileChunk();
>  
> +  bool CanAllocate(uint32_t aSize);
> +  void MemoryAllocationChanged(uint32_t aFreed, uint32_t aAllocated);

this may cause confusion with DoMemoryReport (that is more generic in the cache context)

maybe call it just ChunkAllocationChanged() ?

@@ +139,5 @@
>    uint32_t mDataSize;
>  
> +  bool     mLimitAllocation; // Whether this chunk respects limit for disk
> +                             // chunks memory usage.
> +  bool     mIsPriorityChunk;

wouldn't mIsPriority be enough?

and what is it actually?  how does it change?

seems like both these don't change during lifetime, if possible, please make them |bool const| to express that at compile time.

::: netwerk/cache2/CacheObserver.cpp
@@ +160,5 @@
>  
> +  mozilla::Preferences::AddUintVarCache(
> +    &sMaxDiskChunksMemoryUsage, "browser.cache.disk.max_chunks_memory_usage", kDefaultMaxDiskChunksMemoryUsage);
> +  mozilla::Preferences::AddUintVarCache(
> +    &sMaxDiskPriorityChunksMemoryUsage, "browser.cache.disk.max_priority_chunks_memory_usage", kDefaultMaxDiskPriorityChunksMemoryUsage);

please add them to all.js as well.

::: netwerk/cache2/CacheObserver.h
@@ +48,5 @@
>      { return sMaxDiskEntrySize << 10; }
> +  static uint32_t const MaxDiskChunksMemoryUsage() // result in bytes.
> +    { return sMaxDiskChunksMemoryUsage << 10; }
> +  static uint32_t const MaxDiskPriorityChunksMemoryUsage() // result in bytes.
> +    { return sMaxDiskPriorityChunksMemoryUsage << 10; }

*maybe* have a single method with an argument bool priority.
Attachment #8437563 - Flags: feedback?(honzab.moz) → feedback+
BTW, I like this is actually also a fix for bug 976217 :)

Regarding the algo itself, it seems ok - simple enough.  Though, we need telemetry on
how often this triggers during normal use.

Also, we can cause intermittent test failures with this...

Then, is there a way to just turn the limiting off?

And, can we have some kind of an automated test?  At least to exercise the code paths.  To have a real check, I think it's OK to just reread the same (disk) entry and see we won't get the data back when the limit is exceeded.

Do you think this patch will be easily "upliftable"?  Since this IMO must go to aurora (we wanted to have some solution here for 32).
Attached patch patch v1 (obsolete) — Splinter Review
> @@ +161,5 @@
> > +    if (mBuf) {
> > +      MemoryAllocationChanged(0, kMinBufSize);
> > +      mBufSize = kMinBufSize;
> > +    }
> > +  }
> 
> is it checked we are OK further in the code path when this fails?

I've removed this allocation completely because the write almost never fits into the minimum buffer, so we have to reallocate during Write().


> @@ +500,5 @@
> >  
> >            // Merge data with write buffer
> >            if (mRWBufSize < mBufSize) {
> >              mRWBuf = static_cast<char *>(moz_xrealloc(mRWBuf, mBufSize));
> > +            MemoryAllocationChanged(mRWBufSize, mBufSize);
> 
> how complicated would be to make this also fallible?

I did it. See the code, it's neither nice nor too ugly.


> @@ +713,5 @@
> > +    usage = mChunksMemoryUsage;
> > +    limit = CacheObserver::MaxDiskChunksMemoryUsage();
> > +  }
> > +
> > +  if (usage + aSize > limit) {
> 
> maybe just |return usage + aSize <= limit;| ?

I kept it because I want to log when we return false here.


> @@ +139,5 @@
> >    uint32_t mDataSize;
> >  
> > +  bool     mLimitAllocation; // Whether this chunk respects limit for disk
> > +                             // chunks memory usage.
> > +  bool     mIsPriorityChunk;
> 
> wouldn't mIsPriority be enough?
> 
> and what is it actually?  how does it change?
> 
> seems like both these don't change during lifetime, if possible, please make
> them |bool const| to express that at compile time.

It's a priority flag passed to CacheFile::Init() and it doesn't change. I made it const.

 
> According the algo itself, it seems ok - simple enough.  Tho, we need a
> telemetry 
> how often this triggers during normal use.

But what exactly to log? Only denied allocations? Or both, so we would know the percentage of failures? And what if the user changes the memory limit to some very low value? That would provide bad statistics.


> Also, we can cause intermittent test failures with this...
> 
> Then, is there a way to just turn the limiting off?

0 means no limit.


> And, can we have some kind of an automated test?  At least to exercise the
> code paths.  To have a real check, I think it's OK to just reread the same
> (disk) entry and see we won't get the data back when the limit is
> overreached.

I'll create some test as a separate patch.


> Do you think this patch will be easily "upliftable"?  Since this IMO must go
> to aurora (we wanted to have some solution here for 32).

It's quite a big change, but I think we still have enough time to test it properly.
Attachment #8437563 - Attachment is obsolete: true
Attachment #8447468 - Flags: review?(honzab.moz)
(In reply to Michal Novotny (:michal) from comment #4)
> > According the algo itself, it seems ok - simple enough.  Tho, we need a
> > telemetry 
> > how often this triggers during normal use.
> 
> But what exactly to log? Only denied allocations? Or both so we would know
> the percentage of failures? And what if user changes the memory limit to
> some very low value? This would provide bad statistics.

The number of users that will change the pref will be pretty negligible.

Maybe log whole-entry successful writes vs failed writes?  So that we know what percentage of entries we lose?
Attached patch test (obsolete) — Splinter Review
Attachment #8448027 - Flags: review?(honzab.moz)
Blocks: 979900
Any try runs?
https://tbpl.mozilla.org/?tree=Try&rev=b3c7424de471

The patch in this push didn't include limiting of the preloadChunkCount.
Attached patch patch v2 - added telemetry (obsolete) — Splinter Review
Attachment #8448779 - Flags: review?(honzab.moz)
Comment on attachment 8447468 [details] [diff] [review]
patch v1

Review of attachment 8447468 [details] [diff] [review]:
-----------------------------------------------------------------

This actually fixes 3 or even 4 bugs at once!

r+ with some changes.

::: modules/libpref/src/init/all.js
@@ +58,5 @@
>  // Max-size (in KB) for entries in memory cache. Set to -1 for no limit.
>  // (Note: entries bigger than than 90% of the mem-cache are never cached)
>  pref("browser.cache.memory.max_entry_size",  5120);
> +// Memory limit (in kB) for chunks of cache entries stored on the disk.
> +// Note: 0 means no limit.

This comment doesn't explain very well what this is doing; feel free to go deeper into details here, also because this really significantly influences how the cache works (and if at all..)

::: netwerk/cache2/CacheFile.cpp
@@ +1859,5 @@
>  CacheFile::SetError(nsresult aStatus)
>  {
>    if (NS_SUCCEEDED(mStatus)) {
>      mStatus = aStatus;
> +    if (mHandle) {

maybe assert owns lock here?  or is it that mHandle never nulls out?
can this be called exactly at the moment when on another thread mHandle is being assigned in OnFileOpened, so that there could be a race when mHandle is assigned but not yet AddRef()'ed when we get here?

::: netwerk/cache2/CacheFileChunk.cpp
@@ +160,2 @@
>  
>    DoMemoryReport(MemorySize());

please remove this one, probably no longer needed here.

@@ +495,5 @@
> +            // from the disk. Simply copy the valid pieces.
> +            mValidityMap.Log();
> +            for (uint32_t i = 0; i < mValidityMap.Length(); i++) {
> +              memcpy(mRWBuf + mValidityMap[i].Offset(),
> +                     mBuf + mValidityMap[i].Offset(), mValidityMap[i].Len());

hmm... we should crop the validity map values to the buf len, so that when mValidityMap bits get somehow out of sync we don't end up with a serious security bug.  (I really must do a complete review of this code..)

or, just memcpy all regardless what's valid or not, but this piece of code is already well tested, so we should be OK here w/o any more changes.

@@ +509,5 @@
> +          } else {
> +            // Buffer holding the new data is larger. Use it as the destination
> +            // buffer to avoid reallocating mRWBuf. We need to copy those pieces
> +            // from mRWBuf which are not valid in mBuf.
> +            uint32_t invalid_offset = 0;

use camel case: invalidOffset

@@ +514,5 @@
> +            uint32_t invalid_length;
> +            mValidityMap.Log();
> +            for (uint32_t i = 0; i < mValidityMap.Length(); i++) {
> +              MOZ_ASSERT(invalid_offset <= mValidityMap[i].Offset());
> +              invalid_length =  mValidityMap[i].Offset() - invalid_offset;

two spaces after =

@@ +518,5 @@
> +              invalid_length =  mValidityMap[i].Offset() - invalid_offset;
> +              if (invalid_length > 0) {
> +                MOZ_ASSERT(invalid_offset + invalid_length <= mRWBufSize);
> +                memcpy(mBuf + invalid_offset, mRWBuf + invalid_offset,
> +                       invalid_length);

as well here.  note that ASSERT is not enough.  we could however do MOZ_CRASH (that AFAIK crashes in release too) when the boundaries are off.

or, here just really copy all and not just valid pieces to keep the code simpler and safer.

@@ +606,5 @@
>  {
> +  MOZ_ASSERT(NS_FAILED(aStatus));
> +
> +  if (NS_FAILED(mStatus)) {
> +    // Remember onlu the first error code.

onlu

@@ +748,5 @@
> +    return;
> +  }
> +
> +  ChunksMemoryUsage() += aAllocated;
> +  ChunksMemoryUsage() -= aFreed;

I went manually through all the places this is called and where the buffers and their respective sizes are changed.  all seems ok, also some amount of manual testing shows this works as expected.

however, I would strongly suggest to keep the aFreed (a.k.a the last reported count) as a member to make the counter 100% reliable and always report mBufSize + mRWBufSize as the new size (so that this method will not take any arguments at all.)  Just update the sizes and then call ChunkAllocationChanged(), similar way as DoMemoryReport() works, also see [1] as an example what I have in mind.  

The reason is that when we screw up the counter so that it leaks in either positive or negative way we actually effectively disable the whole disk cache!  This really needs to work absolutely perfectly and from my experience it's not possible to keep absolutely perfect track the way you did it - there is simply a design risk it can count wrong.

this will make us sure this code is correct and also make any future changes (and uplifts) safer, faster to review and easier to test.

[1] http://hg.mozilla.org/mozilla-central/annotate/ffb8b976548b/netwerk/cache2/CacheStorageService.cpp#l1033
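The suggested self-correcting scheme could look roughly like the sketch below (illustrative names, not the real CacheFileChunk members). The point is that each chunk remembers the size it last reported and always re-reports its full current total, so the global counter cannot drift even if an individual resize path is missed or reordered:

```cpp
#include <atomic>
#include <cstdint>

// Global counter of memory held by all chunks (sketch).
static std::atomic<uint32_t> gChunksMemoryUsage{0};

class ChunkSketch {
public:
  void SetBufSizes(uint32_t aBufSize, uint32_t aRWBufSize)
  {
    mBufSize = aBufSize;
    mRWBufSize = aRWBufSize;
    ChunkAllocationChanged();
  }

private:
  // Takes no arguments: it derives the new size itself and undoes the
  // previous report, so the counter is always exact.
  void ChunkAllocationChanged()
  {
    uint32_t newSize = mBufSize + mRWBufSize;
    gChunksMemoryUsage += newSize;      // add first: no unsigned underflow
    gChunksMemoryUsage -= mReportedSize; // then retract the old report
    mReportedSize = newSize;
  }

  uint32_t mBufSize = 0;
  uint32_t mRWBufSize = 0;
  uint32_t mReportedSize = 0; // last value added to the global counter
};
```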

::: netwerk/cache2/CacheFileChunk.h
@@ +139,5 @@
>    uint32_t mDataSize;
>  
> +  bool const mLimitAllocation; // Whether this chunk respects limit for disk
> +                               // chunks memory usage.
> +  bool const mIsPriority;

to save some memory, could we make these bit fields?

::: netwerk/cache2/CacheFileInputStream.cpp
@@ +544,5 @@
>    uint32_t chunkOffset = mPos - (mPos / kChunkSize) * kChunkSize;
>    *aCanRead = mChunk->DataSize() - chunkOffset;
> +  if (*aCanRead > 0) {
> +    *aBuf = mChunk->BufForReading() + chunkOffset;
> +  } else {

nit: please add a comment that *aCanRead is signed.

also, shouldn't we crop it to 0?  To make sure that any checks like |if (canRead) { do_something(); }| will not pass for negative values.

::: netwerk/cache2/CacheObserver.cpp
@@ +316,5 @@
> +  static uint32_t const kDefaultMemoryUsagePerPreloadedChunk =
> +    (kDefaultMaxDiskChunksMemoryUsage << 10) / kDefaultPreloadChunkCount;
> +  static uint32_t const kDefaultMemoryUsagePerPreloadedPriorityChunk =
> +    (kDefaultMaxDiskPriorityChunksMemoryUsage << 10) /
> +    kDefaultPreloadChunkCount;

maybe move this to the global scope, this is called often and we may save the internal was-initialized check when not local.

@@ +323,5 @@
> +  if (memoryLimit == 0) {
> +    return sPreloadChunkCount;
> +  }
> +
> +  uint32_t derivedMaxCount =  memoryLimit /

two spaces after =

::: netwerk/protocol/http/nsHttpChannel.cpp
@@ +4947,5 @@
>  
> +    // cache file could be deleted on our behalf, it could contain errors or
> +    // it failed to allocate memory, reload from network here.
> +    if (mCacheEntry && mCachePump && RECOVER_FROM_CACHE_FILE_ERROR(mStatus)) {
> +        LOG(("  cache file error, reloading from server"));

this definitely needs telemetry (followup); I thought this would only affect writes.  Reads are, though, serious (and I'm really worried about intermittent failures in our infra tests).

can we get a read error of OOM in the middle of reading?  because there is no reliable way to recover from that!  and that is pretty bad...

you can say we have the same problem when a disk entry is screwed up, but that will happen much less often than hitting the memory boundary.
Attachment #8447468 - Flags: review?(honzab.moz) → review+
Comment on attachment 8448027 [details] [diff] [review]
test

Review of attachment 8448027 [details] [diff] [review]:
-----------------------------------------------------------------

::: netwerk/test/unit/test_cache2-25-chunk-memory-limit.js
@@ +26,5 @@
> +  // set max chunks memory so that only one full chunk fits within the limit
> +  prefBranch.setIntPref("browser.cache.disk.max_chunks_memory_usage", 300);
> +
> +  asyncOpenCacheEntry("http://a/", "disk", Ci.nsICacheStorage.OPEN_NORMALLY, null,
> +    new OpenCallback(NEW|DONTFILL, "", "", function(entry) {

in this combination you may go with just:

asyncOpenCacheEntry("http://a/", "disk", Ci.nsICacheStorage.OPEN_NORMALLY, null, function(entry) {
});

No need for instantiating OpenCallback I think, but up to you if you want the 'newness' and other checks.

@@ +27,5 @@
> +  prefBranch.setIntPref("browser.cache.disk.max_chunks_memory_usage", 300);
> +
> +  asyncOpenCacheEntry("http://a/", "disk", Ci.nsICacheStorage.OPEN_NORMALLY, null,
> +    new OpenCallback(NEW|DONTFILL, "", "", function(entry) {
> +      oStr = entry.openOutputStream(0);

var oStr

@@ +34,5 @@
> +
> +      asyncOpenCacheEntry("http://b/", "disk", Ci.nsICacheStorage.OPEN_NORMALLY, null,
> +        new OpenCallback(NEW|DONTFILL, "", "", function(entry) {
> +          var oStr2 = entry.openOutputStream(0);
> +          var data = gen_200k();

reuse data in the caller context?
Attachment #8448027 - Flags: review?(honzab.moz) → review+
Comment on attachment 8448779 [details] [diff] [review]
patch v2 - added telemetry

Review of attachment 8448779 [details] [diff] [review]:
-----------------------------------------------------------------

nice!  r+ only for the telemetry added stuff

::: toolkit/components/telemetry/Histograms.json
@@ +6136,5 @@
> +  "NETWORK_CACHE_V2_OUTPUT_STREAM_STATUS": {
> +    "expires_in_version": "never",
> +    "kind": "enumerated",
> +    "n_values": "7",
> +    "description": "Final status of the CacheFileOutputStream."

please add the enum values meaning (at least abbreviated) to the description
Attachment #8448779 - Flags: review?(honzab.moz) → review+
(In reply to Honza Bambas (:mayhemer) from comment #12)
> > +    new OpenCallback(NEW|DONTFILL, "", "", function(entry) {
> > +      oStr = entry.openOutputStream(0);
> 
> var oStr

Ah, you probably don't want this to be gc'ed (global).  Got it.
(In reply to Honza Bambas (:mayhemer) from comment #11)
> this definitely needs a telemetry (followup), I though this would only
> affect writes.  reads are tho serious (and I'm really worried about our
> infra tests intermittent failures).

Limiting just writes is a bit tricky. We could avoid limiting writes when there is also an input stream reading the same entry. But should we count such allocations toward the limit? And should we also count toward the limit memory that was allocated for reading?


> can we get a read error of OOM in the middle of reading?  because there is
> no reliable way to recover from that!  and that is pretty bad...

Do you mean in the middle of the whole entry, or in the middle of what we have in the cache?
(In reply to Michal Novotny (:michal) from comment #15)
> (In reply to Honza Bambas (:mayhemer) from comment #11)
> > this definitely needs a telemetry (followup), I though this would only
> > affect writes.  reads are tho serious (and I'm really worried about our
> > infra tests intermittent failures).
> 
> Limiting just writes is a bit tricky. We could not limit writes when there
> is also an input stream reading the same entry. But should we count such
> allocations to the limit? And should we count to the limit also memory that
> was allocated for reading?

Not sure and I don't think it's that important to know exactly to fix this bug.  I'm mostly fine with the patch.

> 
> 
> > can we get a read error of OOM in the middle of reading?  because there is
> > no reliable way to recover from that!  and that is pretty bad...
> 
> Do you mean in the middle of the whole entry, or in the middle of what we
> have in the cache?

I'm concerned about a case when e.g. a page with a large number of large images, all found valid in the cache and loading at once, fills the limits and one of them hits the limit on, say, chunk 10 of 20.  Then the user may end up with just half of the image.  The rest will probably not restart.

I'm trying to reproduce, but it works well for me so far.  So maybe there isn't any bug, or it's just because I'm hitting bug 1034309 (unrelated to this patch).
Attached patch patch v3 (obsolete) — Splinter Review
Carrying r+ from the previous reviews, feel free to check the patch again.


(In reply to Honza Bambas (:mayhemer) from comment #11)
> ::: netwerk/cache2/CacheFile.cpp
> @@ +1859,5 @@
> >  CacheFile::SetError(nsresult aStatus)
> >  {
> >    if (NS_SUCCEEDED(mStatus)) {
> >      mStatus = aStatus;
> > +    if (mHandle) {
> 
> maybe assert owns lock here?  or is it that mHandle never nulls out?
> can this be called exactly at the moment when on another threads mHandle is
> being assigned in OnFileOpened so that there could be a race when mHandle is
> assigned but not yet AddRef()'ed when we get here?

I've added AssertOwnsLock() here. We never null out mHandle and we assign it also under the lock. And at that place we doom the handle if NS_FAILED(mStatus), so there should be no race condition.


(In reply to Honza Bambas (:mayhemer) from comment #16)
> I'm concerned about a case when e.g. a page with some larger number of
> larger images that are found valid in the cache and want to load all at once
> may fill the limits and one of them will hit the limit on say chunk 10 of
> 20.  Then user may end up with just half of the image.  The rest will not
> restart, probably.

This can IMO happen. Couldn't this be solved somehow in nsHttpChannel::OnStopRequest() similarly to the check in nsHttpChannel::OnStartRequest()?
Attachment #8447468 - Attachment is obsolete: true
Attachment #8448027 - Attachment is obsolete: true
Attachment #8448779 - Attachment is obsolete: true
Attachment #8451666 - Flags: review+
(In reply to Michal Novotny (:michal) from comment #17)
> (In reply to Honza Bambas (:mayhemer) from comment #16)
> > I'm concerned about a case when e.g. a page with some larger number of
> > larger images that are found valid in the cache and want to load all at once
> > may fill the limits and one of them will hit the limit on say chunk 10 of
> > 20.  Then user may end up with just half of the image.  The rest will not
> > restart, probably.
> 
> This can IMO happen. Couldn't this be solved somehow in
> nsHttpChannel::OnStopRequest() similarly to the check in
> nsHttpChannel::OnStartRequest()?

I'm afraid it can't :/  And IMO it shouldn't anyway.  We only resume from the network (do a range request) when we do a concurrent read.  This is not a concurrent read and it's also not certain that the response is always resumable.

Why is it actually so complicated to exclude chunks initiated for read from the limit counter?  AFAIK, we throw them away as soon as they are read (except preload, but that is what you also throttle, so we should be OK).
Blocks: 913822
Attached patch patch v4 (obsolete) — Splinter Review
(In reply to Honza Bambas (:mayhemer) from comment #18)
> > This can IMO happen. Couldn't this be solved somehow in
> > nsHttpChannel::OnStopRequest() similarly to the check in
> > nsHttpChannel::OnStartRequest()?
> 
> That's what I'm afraid not :/  And IMO shouldn't anyway.  We only resume
> from network (do a range request) when we do a concurrent read.  This is not
> a concurrent read and it's also not sure the response is always resumable.
> 
> Why is it actually so complicated to exclude chunks initiated for read from
> the limit counter?  AFAIK, we throw them away ASAP they are read (except
> preload, but that is what you also throttle, so we should be OK).

OK, so if I understand correctly, it isn't a problem when the input stream fails at EOF in the following scenario:

- output stream writes to the entry
- input stream read from the same entry
- output stream fails with NS_ERROR_OUT_OF_MEMORY
- input stream reads till the end and then fails with NS_ERROR_OUT_OF_MEMORY

In this case it's not a big problem to limit just writes. The patch now limits memory only for chunks that were created by a writer. Unfortunately, it's not easy to handle the case when the output stream appends data to a chunk that was previously created by an input stream and stayed cached. But this case will be very rare. Also, limiting the preload chunk count was removed since it doesn't make sense anymore.
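The v4 approach described above can be sketched like this (all names and the limit value are illustrative; the idea is that only writer-created chunks, flagged at creation time, are counted toward the limit, while read/preload chunks are exempt):

```cpp
#include <cstdint>

// Running total of memory held by writer-created chunks (sketch).
static uint32_t gWriteBacklog = 0;
static const uint32_t kWriteLimit = 2 * 1024 * 1024; // illustrative limit

struct ChunkInfo {
  bool mLimitAllocation = false; // true only for writer-created chunks
  uint32_t mSize = 0;
};

// Buffer allocation: reader/preload chunks always succeed; writer
// chunks fail once the backlog limit would be exceeded.
bool AllocateChunkBuf(ChunkInfo& aChunk, uint32_t aSize)
{
  if (aChunk.mLimitAllocation && kWriteLimit != 0 &&
      gWriteBacklog + aSize > kWriteLimit) {
    return false; // caller fails the write with NS_ERROR_OUT_OF_MEMORY
  }
  aChunk.mSize = aSize;
  if (aChunk.mLimitAllocation) {
    gWriteBacklog += aSize;
  }
  return true;
}
```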

Please review just the difference between v3-v4. I'll attach an interdiff.
Attachment #8451666 - Attachment is obsolete: true
Attachment #8452278 - Flags: review?(honzab.moz)
Attached patch v3-v4 interdiff (obsolete) — Splinter Review
(In reply to Michal Novotny (:michal) from comment #19)
> OK, so if I understand correctly, it isn't problem when the input stream
> fails at EOF in following scenario:
> 
> - output stream writes to the entry
> - input stream read from the same entry
> - output stream fails with NS_ERROR_OUT_OF_MEMORY
> - input stream reads till the end and then fails with NS_ERROR_OUT_OF_MEMORY

Yes, in other words - concurrent read!

> 
> In this case it's not a big problem to limit just writes. The patch now
> limits memory only for chunks that were created by writer. Unfortunately,
> it's not easy to handle a case when the output stream append a data to a
> chunk that was previously created by an input stream and stayed cached. But
> this case will be very rare. Also, limiting preload chunk count was removed
> since it doesn't make sense anymore.

Agree.

> 
> Please review just the difference between v3-v4. I'll attach an interdiff.
(In reply to Michal Novotny (:michal) from comment #19)
> Also, limiting preload chunk count was removed
> since it doesn't make sense anymore.

Actually, why doesn't it make sense anymore?  I kinda liked the property of the code that limiting the write backlog also limited the preload count.  The overall goal of this patch is to save memory (and have a pref for it).

Anyway, this can be a followup if found to be still a problem.
(In reply to Honza Bambas (:mayhemer) from comment #22)
> (In reply to Michal Novotny (:michal) from comment #19)
> > Also, limiting preload chunk count was removed
> > since it doesn't make sense anymore.
> 
> Actually, why doesn't it make sense anymore?  I kinda liked the property of
> the code that limiting write backlog also limited the preload count.  The
> overall goal of this patch is to save memory (and have a pref for it).
> 
> Anyway, this can be a followup if found to be still a problem.

I'm not against limiting the preload chunk count somehow, but it is now totally unrelated to the max_chunks_memory_usage preference. In patch v3, preloaded chunks were counted toward the allocated memory and it could happen that the whole limit would be used just by preloaded chunks of one entry and no other entry could be read or written. In patch v4 they are not counted because preloaded chunks are created by the preloader/reader.
Comment on attachment 8452279 [details] [diff] [review]
v3-v4 interdiff

Review of attachment 8452279 [details] [diff] [review]:
-----------------------------------------------------------------

::: modules/libpref/src/init/all.js
@@ +69,1 @@
>  // NS_ERROR_OUT_OF_MEMORY.

I'd probably write something like this (up to you to adopt it or not):

Memory limit (in kB) for new cache data not yet written to disk.  Writes to the cache are buffered and written to disk in the background with low priority.  With slow persistent storage these buffers may grow when data is coming fast from the network.  When the amount of unwritten data exceeds the limit, new writes will simply fail.  We have two buckets, one for important (priority) data like html, css, fonts and js, and one for other data like images and video.
Attachment #8452279 - Flags: review+
Attachment #8452278 - Flags: review?(honzab.moz) → review+
(In reply to Michal Novotny (:michal) from comment #23)
> I'm not against limiting preload chunk count somehow, but they are now
> totally unrelated to max_chunks_memory_usage preference. In patch v3,
> preloaded chunk were counted to the allocated memory and it could happen
> that all the limit would be used just by preloaded chunks of one entry and
> no other entry could be read or written. In patch v4 they are not counted
> because preloaded chunks are created by preloader/reader.

Aha, now I understand.  So if we ever want to limit preloads, it's for a different bug that this one would not depend on.
https://hg.mozilla.org/mozilla-central/rev/8e969cdb2251
Status: NEW → RESOLVED
Closed: 5 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla33
Michal/Honza: can you give a risk/reward tradeoff here for uplifting this to Aurora?  Do we want to uplift?
Approval Request Comment

[Feature/regressing bug #]:
When the disk storage is slower than the network, we could end up with a lot of queued data waiting to be written. There is a suspicion that this could be the cause of some of the reported OOMs. This patch limits the size of the data waiting to be written. The patch also fixes some error handling issues in the cache code.

[User impact if declined]:
1) Heavy memory usage during large network activity on devices with slow disk.
2) Possible asserts in debug builds in case of any disk failure. Probably corrupted entries in release builds.

[Describe test coverage new/current, TBPL]:
https://tbpl.mozilla.org/?tree=Try&rev=1534628a7c9f
Slow storage tested locally.
Landed on m-c 2 days ago and so far no problems.

[Risks and why]:
Hard to say. This is quite a large change, but it was extensively tested in both situations - with fast storage as well as with very slow storage. If there is any bug in the new code, it would probably cause failures when reading/writing data from/to the cache. It would probably be hidden from the user and we would just cache the data less effectively.

[String/UUID change made/needed]:
none
Attachment #8452278 - Attachment is obsolete: true
Attachment #8452279 - Attachment is obsolete: true
Attachment #8454751 - Flags: review+
Attachment #8454751 - Flags: checkin+
Attachment #8454751 - Flags: approval-mozilla-aurora?
In case it's not obvious from Michal's comment, the platform that we're most worried about here is probably B2G (it's unlikely that we'll shove so much stuff into the write queue on desktop that we'll hit OOM).  But I can imagine some of our lower-end B2G devices being vulnerable to this.  Not that we've got much data here--we don't get any stack-trace-like data from OOM reports, do we?
Ping on the a? flag, this needs to go in ASAP!
Comment on attachment 8454751 [details] [diff] [review]
patch v5 - commited patch

I spoke with Michal about this on irc. This fix is needed for HTTP cache v2, which landed in 32 and impacts B2G 2.0. This patch has been fairly well tested and has been on nightly for several days. (Issues should have surfaced in the desktop browser along with B2G.) Aurora+
Attachment #8454751 - Flags: approval-mozilla-aurora? → approval-mozilla-aurora+
Blocks: 976217
Comment on attachment 8454751 [details] [diff] [review]
patch v5 - commited patch

Honza asked to hold off on uplift while he checks into whether the primary need has been addressed elsewhere. I'm reverting to a? until Honza responds.
Attachment #8454751 - Flags: approval-mozilla-aurora+ → approval-mozilla-aurora?
Flags: needinfo?(honzab.moz)
(In reply to Lawrence Mandel [:lmandel] from comment #33)
> Comment on attachment 8454751 [details] [diff] [review]
> patch v5 - commited patch
> 
> Honza asked to hold off on uplift while he checks into whether the primary
> need has been addressed elsewhere. I'm reverting to a? until Honza responds.

Please see bug 976217 comment 8.  

* There is no need to uplift this. * 

Somebody should have taken a look sooner...  It's good to fix this, but in 4 weeks there were only 2 such crashes on *all branches and products*...
Flags: needinfo?(honzab.moz)
Attachment #8454751 - Flags: approval-mozilla-aurora?
Depends on: 1056919
Blocks: 1066726