
HTTP cache v2: fix shutdown time regression on a slow storage

RESOLVED FIXED in Firefox 45

Status

Product: Core
Component: Networking: Cache
Priority: --
Severity: major
Status: RESOLVED FIXED
Reported: 4 years ago
Last modified: 2 years ago

People

(Reporter: mayhemer, Assigned: mayhemer)

Tracking

(Blocks: 1 bug)

Version: unspecified
Target Milestone: mozilla46
Points: ---
Dependency tree / graph

Firefox Tracking Flags

(firefox45 fixed, firefox46 fixed)

Details

(Whiteboard: [cache2])

Attachments

(1 attachment, 1 obsolete attachment)

(Assignee)

Description

4 years ago
Writes and file closes are done as the last operations on the I/O thread, with the lowest priority.  When a user visits a page with many (400+) resources, such as images, writes to disk are serialized and slow.

When we shut down, we wait for all the writes to finish.  This may keep the browser open for minutes (!).  We should just abandon files that were never written, leave partially written files as broken, or delete them when there are only a few of them and it won't take long.

On slow storage this introduces a huge and unbearable shutdown time regression.
(Assignee)

Updated

4 years ago
Blocks: 913806
(Assignee)

Updated

4 years ago
Blocks: 920573
(Assignee)

Comment 1

4 years ago
Suggestion:
- at the earliest moment we know the process is going to exit:
  - set CacheIOThread to a FastShutdown state
  - create a timestamp of now() + say 2s
- when executing events on the WRITE or CLOSE levels, check whether we are in the FastShutdown state
- if so, check that now() is smaller than the timestamp we took
- if not, it means we are over the shutdown time limit; just prevent the call to Run() for events on those two levels, thus throwing file writes and closings away

Not sure whether this may introduce leaks or hangs, but it seems like a simple solution.
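The steps above could be sketched roughly like this. This is a hypothetical illustration using std::chrono instead of Gecko's TimeStamp; the class, method, and enum names are made up for the sketch and do not come from the actual patch:

```cpp
#include <chrono>

// Illustrative event levels; the real CacheIOThread has more of them.
enum class EventLevel { Open, Read, Write, Close };

class FastShutdownSketch {
public:
  // Called at the earliest moment we know the process is going to exit:
  // enter the FastShutdown state and take a deadline of now() + aLag
  // (the comment suggests roughly 2 seconds).
  void EnterFastShutdown(
      std::chrono::milliseconds aLag = std::chrono::seconds(2)) {
    mFastShutdown = true;
    mDeadline = std::chrono::steady_clock::now() + aLag;
  }

  // Decide whether an event may still call Run().  Only WRITE and CLOSE
  // level events are ever thrown away, and only past the deadline.
  bool ShouldRun(EventLevel aLevel) const {
    if (!mFastShutdown)
      return true;  // normal operation: run everything
    if (aLevel != EventLevel::Write && aLevel != EventLevel::Close)
      return true;  // other levels are never skipped
    return std::chrono::steady_clock::now() < mDeadline;
  }

private:
  bool mFastShutdown = false;
  std::chrono::steady_clock::time_point mDeadline{};
};
```

Events past the deadline are simply dropped rather than run, which is what makes the shutdown time bounded regardless of how slow the storage is.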
(Assignee)

Comment 2

4 years ago
It's been decided that this shutdown time regression is not worth the risk of breaking the code, since we are approaching the final stage.  Also, I don't believe we will be blocked that badly even on mobile devices.  I was testing with extremely slow storage and an extreme amount of data to write, so the problem was visible.  For production, I think we can fix this as one of the first bugs after cache2 is enabled.

This also doesn't block the never-remember case, since we tend to keep data only in memory for that setting.  Hence that bug is not dependent on any shutdown cleanup we would handle/introduce/base here.
Assignee: michal.novotny → nobody
Blocks: 941841
No longer blocks: 913806, 920573
(Assignee)

Updated

3 years ago
Blocks: 1020345
(Assignee)

Updated

3 years ago
Blocks: 1029213
(Assignee)

Updated

3 years ago
No longer blocks: 1029213
(Assignee)

Updated

3 years ago
Depends on: 1013395
(Assignee)

Comment 3

3 years ago
I think this is simpler to fix now, and it has become pretty important.  During my tests with the cache on a slower sdcard, the browser can hang for several tens of seconds writing all the unwritten data to the cache.  We should not spend more than 2-3 seconds on delayed writes; after that they should just be bypassed as no-ops.

We unfortunately don't have telemetry for write and close times...  so it's hard to find out whether we also have to skip PR_Close.  But according to my research in the past, we probably have to skip it too...  So, an option just for release builds without leak detection?
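The "release build without leak detection" idea could be expressed as a compile-time switch, roughly like this. This is a hypothetical sketch; LEAK_CHECKING is an illustrative define for the example, not an actual Gecko macro:

```cpp
// In a leak-checking (debug) build every handle must still be closed so
// the leak detector stays quiet; in a release build, once the shutdown
// I/O budget is exhausted, we may deliberately leak the handle and skip
// the slow PR_Close.
#ifdef LEAK_CHECKING
constexpr bool kLeakChecking = true;
#else
constexpr bool kLeakChecking = false;
#endif

constexpr bool ShouldSkipCloseAtShutdown(bool aPastShutdownBudget) {
  return aPastShutdownBudget && !kLeakChecking;
}
```

In other words, the skip is gated twice: once at compile time (never in leak-checking builds) and once at run time (only past the shutdown budget).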
Assignee: nobody → honzab.moz
Status: NEW → ASSIGNED
(Assignee)

Updated

2 years ago
Duplicate of this bug: 1020345
(Assignee)

Updated

2 years ago
Blocks: 1215970
(Assignee)

Comment 5

2 years ago
Based on bug 1215970 comment 86, it seems that leaking is a way we can go.

I'll try a patch according to comment 1.
(Assignee)

Comment 6

2 years ago
Created attachment 8706475 [details] [diff] [review]
v1

- there is a 2-second limit for writing and closing anything
- the limit is controlled by the preference "browser.cache.max_shutdown_io_lag"
- when we pass that limit, we never write anything and leave the files open
- files that have been doomed before shutdown are, though, deleted from disk every time
- "invalid" handles (= those with unwritten metadata) are left open to prevent I/O (dooming); these will never load back, since their checksums will not validate

https://treeherder.mozilla.org/#/jobs?repo=try&revision=95e6f9365a5b
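The bullet points above might translate into a close-time decision roughly like the following. This is a hypothetical sketch, not the actual patch: only IsPastShutdownIOLag and the "browser.cache.max_shutdown_io_lag" preference name appear in this bug; every other name here is illustrative:

```cpp
// Handle sketch: only the doomed flag matters for this decision.
struct HandleSketch {
  bool mIsDoomed = false;
};

struct IOManagerSketch {
  // Budget of post-shutdown I/O, mirroring the
  // "browser.cache.max_shutdown_io_lag" preference (2 seconds in v1).
  int mMaxShutdownIOLagMs = 2000;
  int mMsSinceShutdownStart = 0;  // stand-in for a real monotonic clock

  bool IsPastShutdownIOLag() const {
    return mMsSinceShutdownStart > mMaxShutdownIOLagMs;
  }

  // v1 behavior: doomed handles are always closed (so their files can be
  // deleted from disk); everything else is closed only while we are still
  // within the shutdown I/O budget, otherwise the file is left open.
  bool ShouldClose(const HandleSketch& aHandle,
                   bool aIgnoreShutdownLag) const {
    return aHandle.mIsDoomed || aIgnoreShutdownLag || !IsPastShutdownIOLag();
  }
};
```

The mIsDoomed term in the condition is exactly the part questioned in the review below and dropped in v1.1.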
Attachment #8706475 - Flags: review?(michal.novotny)
Comment on attachment 8706475 [details] [diff] [review]
v1

Review of attachment 8706475 [details] [diff] [review]:
-----------------------------------------------------------------

::: netwerk/cache2/CacheFileIOManager.cpp
@@ +2292,5 @@
>    DebugOnly<bool> found;
>    found = mHandlesByLastUsed.RemoveElement(aHandle);
>    MOZ_ASSERT(found);
>  
> +  if (aHandle->mIsDoomed || aIgnoreShutdownLag || !IsPastShutdownIOLag()) {

Why do we need to close handles of doomed entries?
(Assignee)

Comment 8

2 years ago
(In reply to Michal Novotny (:michal) from comment #7)
> Comment on attachment 8706475 [details] [diff] [review]
> v1
> 
> Review of attachment 8706475 [details] [diff] [review]:
> -----------------------------------------------------------------
> 
> ::: netwerk/cache2/CacheFileIOManager.cpp
> @@ +2292,5 @@
> >    DebugOnly<bool> found;
> >    found = mHandlesByLastUsed.RemoveElement(aHandle);
> >    MOZ_ASSERT(found);
> >  
> > +  if (aHandle->mIsDoomed || aIgnoreShutdownLag || !IsPastShutdownIOLag()) {
> 
> Why we need to close handles of doomed entries?

Because entries that are doomed prior to shutdown must be removed from disk.  Later you are deleting the files, hence the handle must be closed (at least on Windows, but OS X has a problem too, I think).
(In reply to Honza Bambas (:mayhemer) from comment #8)
> Because entries that are doomed prior shutdown must be removed from disk. 
> Later you are deleting the files, hence the handle must be closed (at least
> on windows, but osx has a problem too I think).

If CacheFileHandle::mIsDoomed is true, then the file was already moved to the doomed directory. It's not critical to remove these files during shutdown because they will be deleted on the next startup.
(Assignee)

Comment 10

2 years ago
(In reply to Michal Novotny (:michal) from comment #9)
> (In reply to Honza Bambas (:mayhemer) from comment #8)
> > Because entries that are doomed prior shutdown must be removed from disk. 
> > Later you are deleting the files, hence the handle must be closed (at least
> > on windows, but osx has a problem too I think).
> 
> If CacheFileHandle::mIsDoomed is true then the file was already moved to
> doomed directory. It's not critical to remove these files during shutdown
> because they will be deleted on next startup.

You are right!  I will update the patch.  Thanks.
(Assignee)

Updated

2 years ago
Attachment #8706475 - Flags: review?(michal.novotny)
(Assignee)

Comment 11

2 years ago
Created attachment 8707675 [details] [diff] [review]
v1.1

So, the only difference is the removal of |aHandle->mIsDoomed| from the condition.  There is no need to do anything else, since we don't break on mFile->Remove() failure later in the code.

Actually, I'm thinking of not PR_Close()'ing the doomed and invalid files immediately after shutdown at all, and just leaking them; they won't load next time or will be deleted anyway.  This may save even more I/O and let more metadata be successfully written to disk.  What do you think?

And I'm also thinking of giving the metadata write a bit more priority somehow during shutdown.  But that would definitely be an optimization that may or may not have any real effect, and it is definitely for a different bug.

https://treeherder.mozilla.org/#/jobs?repo=try&revision=469a0a6cc7c8
Attachment #8706475 - Attachment is obsolete: true
Attachment #8707675 - Flags: review?(michal.novotny)
Attachment #8707675 - Flags: review?(michal.novotny) → review+
(Assignee)

Updated

2 years ago
Keywords: checkin-needed

Comment 12

2 years ago
https://hg.mozilla.org/integration/mozilla-inbound/rev/825be41800cd
Keywords: checkin-needed

Comment 13

2 years ago
bugherder
https://hg.mozilla.org/mozilla-central/rev/825be41800cd
Status: ASSIGNED → RESOLVED
Last Resolved: 2 years ago
status-firefox46: --- → fixed
Resolution: --- → FIXED
Target Milestone: --- → mozilla46
If we think this can impact top crasher bug 1215970, we should consider uplift to 45. Honza, do you think it's a reasonable candidate? Looks that way to me.
Flags: needinfo?(honzab.moz)
(Assignee)

Comment 15

2 years ago
There have been no error reports since we landed this.  So, yes.  I will request beta approval.
Flags: needinfo?(honzab.moz)
(Assignee)

Comment 16

2 years ago
Comment on attachment 8707675 [details] [diff] [review]
v1.1

Approval Request Comment
[Feature/regressing bug #]: we could say the HTTP cache v2
[User impact if declined]: very long shutdown times when the storage medium where the HTTP cache is located is slow
[Describe test coverage new/current, TreeHerder]: currently rides aurora
[Risks and why]: hard to say, but according to the zero error feedback so far, I think the risk is minimal
[String/UUID change made/needed]: none
Attachment #8707675 - Flags: approval-mozilla-beta?
status-firefox45: --- → affected
Comment on attachment 8707675 [details] [diff] [review]
v1.1

If it can fix the Skype issue, taking it.
Should be in 45 beta 3.
Attachment #8707675 - Flags: approval-mozilla-beta? → approval-mozilla-beta+

Comment 18

2 years ago
bugherder uplift
https://hg.mozilla.org/releases/mozilla-beta/rev/e4b4c84f6607
status-firefox45: affected → fixed

Comment 19

2 years ago
Unfortunately, this patch does not seem to have helped bug 1215970 or bug 1158189 in 45.0b3.