Closed Bug 913822 Opened 11 years ago Closed 8 years ago

HTTP cache v2: fix shutdown time regression on a slow storage

Categories

(Core :: Networking: Cache, defect)

defect
Not set
major

Tracking

()

RESOLVED FIXED
mozilla46
Tracking Status
firefox45 --- fixed
firefox46 --- fixed

People

(Reporter: mayhemer, Assigned: mayhemer)

References

(Blocks 1 open bug)

Details

(Whiteboard: [cache2])

Attachments

(1 file, 1 obsolete file)

Writes and file closes are done as the last operation on the I/O thread, with the least priority.  When user visits a page with a lot (400+) resources on a page, like images or so, writes to disk are serialized and slow.  

When we shut down, we wait for all the writes to finish.  This may keep the browser open for minutes (!).  We should just abandon never written files and partially written just leave as broken or delete them when there is just few of such files and it won't take a lot of time.

On a slow storage this introduces a huge and unbearable shutdown time regression.
Blocks: cache2enable
Blocks: 920573
Suggestion:
- at the earliest moment we know the process is going to exit:
  - set CacheIOThread to a FastShutdown state
  - create a timestamp now() + say 2s 
- when executing events on levels WRITE or CLOSE, check we are in the FastShutdown
- if so, check that now() is smaller then the timestamp we took
- if not, it means we are over the shutdown time limit and just prevent call of Run() for events on those two levels, thus just throw file writes and closings away

Not sure this may introduce any leaks or hangs, but seems as a simple solution.
It's been decided this shutdown time regression is not worth the risk of breaking the code since we are approaching final stage.  Also, I don't believe we will be so much blocked even on mobile devices.  I was testing with an extremely slow storage and an extreme number of data to write so this problem was visible.  For production I think we can fix this as one of first bugs after cache2 enable.

Also doesn't block the never-remember-case since we tend to keep data only in memory for that setting.  Hence that bug's not dependent on any shutdown cleanup we would handle/introduce/base here.
Assignee: michal.novotny → nobody
Blocks: cache2afterenable
No longer blocks: cache2enable, 920573
Blocks: 1020345
Blocks: 1029213
No longer blocks: 1029213
Depends on: 1013395
I think it's simpler now to fix and becomes pretty important.  During my tests with cache on a slower sdcard I can hang for several tens of seconds to write all the unwritten stuff to the cache.  We should not spend more then 2-3 seconds by delayed writes, after that they should be just bypassed as NOOP.  

We unfortunatelly don't have tele for write and close times...  so it's hard to find out if we also have to skip PR_Close.  But according my research in the past, we probably have to skip it too...  So, option just for a release build w/o leak detection?
Assignee: nobody → honzab.moz
Status: NEW → ASSIGNED
Blocks: 1215970
Based on bug 1215970 comment 86, it seems like leaking is a way we can go with.

I'll try a patch according comment 1.
Attached patch v1 (obsolete) — Splinter Review
- there is a 2 seconds limit to write and close anything
- the limit has a preference "browser.cache.max_shutdown_io_lag"
- when we pass that limit we never write anything and leave the files open
- files that have been doomed before shutdown are tho deleted from disk every time
- "invalid" handles (=with unwritten metadata) are left to prevent I/O (dooming) ; these will never load back, checksums will not validate

https://treeherder.mozilla.org/#/jobs?repo=try&revision=95e6f9365a5b
Attachment #8706475 - Flags: review?(michal.novotny)
Comment on attachment 8706475 [details] [diff] [review]
v1

Review of attachment 8706475 [details] [diff] [review]:
-----------------------------------------------------------------

::: netwerk/cache2/CacheFileIOManager.cpp
@@ +2292,5 @@
>    DebugOnly<bool> found;
>    found = mHandlesByLastUsed.RemoveElement(aHandle);
>    MOZ_ASSERT(found);
>  
> +  if (aHandle->mIsDoomed || aIgnoreShutdownLag || !IsPastShutdownIOLag()) {

Why we need to close handles of doomed entries?
(In reply to Michal Novotny (:michal) from comment #7)
> Comment on attachment 8706475 [details] [diff] [review]
> v1
> 
> Review of attachment 8706475 [details] [diff] [review]:
> -----------------------------------------------------------------
> 
> ::: netwerk/cache2/CacheFileIOManager.cpp
> @@ +2292,5 @@
> >    DebugOnly<bool> found;
> >    found = mHandlesByLastUsed.RemoveElement(aHandle);
> >    MOZ_ASSERT(found);
> >  
> > +  if (aHandle->mIsDoomed || aIgnoreShutdownLag || !IsPastShutdownIOLag()) {
> 
> Why we need to close handles of doomed entries?

Because entries that are doomed prior shutdown must be removed from disk.  Later you are deleting the files, hence the handle must be closed (at least on windows, but osx has a problem too I think).
(In reply to Honza Bambas (:mayhemer) from comment #8)
> Because entries that are doomed prior shutdown must be removed from disk. 
> Later you are deleting the files, hence the handle must be closed (at least
> on windows, but osx has a problem too I think).

If CacheFileHandle::mIsDoomed is true then the file was already moved to doomed directory. It's not critical to remove these files during shutdown because they will be deleted on next startup.
(In reply to Michal Novotny (:michal) from comment #9)
> (In reply to Honza Bambas (:mayhemer) from comment #8)
> > Because entries that are doomed prior shutdown must be removed from disk. 
> > Later you are deleting the files, hence the handle must be closed (at least
> > on windows, but osx has a problem too I think).
> 
> If CacheFileHandle::mIsDoomed is true then the file was already moved to
> doomed directory. It's not critical to remove these files during shutdown
> because they will be deleted on next startup.

You are right!  I will update the patch.  Thanks.
Attachment #8706475 - Flags: review?(michal.novotny)
Attached patch v1.1Splinter Review
So, the only one difference is removal of |aHandle->mIsDoomed| from the condition.  There is no need to do anything else, since we don't break on mFile->Remove() failure later in the code.

Actually, I'm thinking of not PR_Close()'ing the doomed and invalid files immediately after shutdown at all, just leak them, they won't load next or will be deleted next time anyway.  This may save even more I/O and get more metadata be successfully written to disk.  What do you think?

And I'm also thinking of giving the metadata write a bit more priority somehow during the shutdown time.  But that would definitely be an optimization that may or may not have any real affect and definitely for a different bug.

https://treeherder.mozilla.org/#/jobs?repo=try&revision=469a0a6cc7c8
Attachment #8706475 - Attachment is obsolete: true
Attachment #8707675 - Flags: review?(michal.novotny)
Attachment #8707675 - Flags: review?(michal.novotny) → review+
Keywords: checkin-needed
https://hg.mozilla.org/mozilla-central/rev/825be41800cd
Status: ASSIGNED → RESOLVED
Closed: 8 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla46
If we think this can impact top crashers 1215970 we should consider uplift to 45. Honza do you think its a reasonable candidate? looks that way to me..
Flags: needinfo?(honzab.moz)
There has been no report of error since we've landed this.  So, yes.  I will request beta approval.
Flags: needinfo?(honzab.moz)
Comment on attachment 8707675 [details] [diff] [review]
v1.1

Approval Request Comment
[Feature/regressing bug #]: we could say the HTTP cache v2
[User impact if declined]: very long shutdown times when storage media where HTTP cache is located is slow
[Describe test coverage new/current, TreeHerder]: currently rides aurora
[Risks and why]: hard to say, but according zero error feedback so far I think minimal
[String/UUID change made/needed]: none
Attachment #8707675 - Flags: approval-mozilla-beta?
Comment on attachment 8707675 [details] [diff] [review]
v1.1

If it can fix the skype issue, taking it.
Should be in 45 beta 3.
Attachment #8707675 - Flags: approval-mozilla-beta? → approval-mozilla-beta+
Unfortunately, this patch seems to not have helped bug 1215970 or bug 1158189 in 45.0b3.
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: