Closed
Bug 707436
Opened 13 years ago
Closed 13 years ago
nsSetSmartSizeEvent can cause a lot of IO on the main thread
Categories
(Core :: Networking: Cache, defect)
Core
Networking: Cache
Tracking
()
RESOLVED
FIXED
mozilla12
People
(Reporter: michal, Assigned: michal)
References
Details
(Keywords: hang, main-thread-io, Whiteboard: [inbound])
Crash Data
Attachments
(1 file, 5 obsolete files)
18.80 KB,
patch
|
bjarne
:
review+
akeybl
:
approval-mozilla-aurora-
|
Details | Diff | Splinter Review |
We count the smart size of the cache on the background thread to avoid IO on the main thread, but we set the capacity on the main thread and if the capacity is lower than the current cache usage then we evict necessary space on the main thread.
Frame Module Signature [Expand] Source
0 ntdll.dll ZwQueryFullAttributesFile
1 KERNELBASE.dll KERNELBASE.dll@0x81cc
2 xul.dll nsAString_internal::Assign xpcom/string/src/nsTSubstring.cpp:336
3 xul.dll GetFileInfo xpcom/io/nsLocalFileWin.cpp:460
4 xul.dll nsLocalFile::ResolveAndStat xpcom/io/nsLocalFileWin.cpp:814
5 msvcr90.dll sprintf
6 nspr4.dll nspr4.dll@0xa69f
7 xul.dll IsShortcutPath xpcom/io/nsLocalFileWin.cpp:320
8 xul.dll nsLocalFile::IsDirectory xpcom/io/nsLocalFileWin.cpp:2471
9 xul.dll nsLocalFile::Remove xpcom/io/nsLocalFileWin.cpp:1833
10 xul.dll nsDiskCacheMap::DeleteStorage
11 nspr4.dll MD_CURRENT_THREAD nsprpub/pr/src/md/windows/w95thred.c:308
12 xul.dll nsDiskCacheMap::DeleteStorage netwerk/cache/nsDiskCacheMap.cpp:996
13 xul.dll nsDiskCacheEvictor::VisitRecord netwerk/cache/nsDiskCacheDevice.cpp:189
14 xul.dll nsDiskCacheMap::VisitEachRecord netwerk/cache/nsDiskCacheMap.cpp:541
15 xul.dll nsDiskCacheMap::EvictRecords netwerk/cache/nsDiskCacheMap.cpp:614
16 xul.dll nsDiskCacheDevice::EvictDiskCacheEntries
17 xul.dll nsDiskCacheDevice::SetCapacity netwerk/cache/nsDiskCacheDevice.cpp:1148
18 xul.dll nsCacheService::SetDiskCacheCapacity netwerk/cache/nsCacheService.cpp:2059
19 xul.dll nsSetSmartSizeEvent::Run netwerk/cache/nsCacheService.cpp:230
20 xul.dll nsThread::ProcessNextEvent xpcom/threads/nsThread.cpp:625
21 xul.dll CallCreateInstance obj-firefox/xpcom/build/nsComponentManagerUtils.cpp:170
22 nspr4.dll MD_CURRENT_THREAD nsprpub/pr/src/md/windows/w95thred.c:308
23 xul.dll MessageLoop::DoWork ipc/chromium/src/base/message_loop.cc:412
24 xul.dll BaseStringEnumerator::Release xpcom/components/nsCategoryManager.cpp:129
25 xul.dll xul.dll@0xee7af
26 xul.dll mozilla::ipc::MessagePump::Run ipc/glue/MessagePump.cpp:110
27 xul.dll nsCommandLine::EnumerateHandlers toolkit/components/commandlines/nsCommandLine.cpp:600
28 xul.dll xul.dll@0x4621f
29 xul.dll MessageLoop::RunHandler ipc/chromium/src/base/message_loop.cc:201
30 xul.dll MessageLoop::Run ipc/chromium/src/base/message_loop.cc:175
31 msvcr90.dll getenv_helper_nolock
32 xul.dll MessageLoop::current ipc/chromium/src/base/message_loop.cc:85
33 xul.dll nsBaseAppShell::Run widget/src/xpwidgets/nsBaseAppShell.cpp:189
34 xul.dll nsAppStartup::Run toolkit/components/startup/nsAppStartup.cpp:220
35 xul.dll XRE_main toolkit/xre/nsAppRunner.cpp:3558
Updated•13 years ago
|
Updated•13 years ago
|
Keywords: main-thread-io
Comment 1•13 years ago
|
||
(In reply to Michal Novotny (:michal) from comment #0)
> We count the smart size of the cache on the background thread to avoid IO on
> the main thread, but we set the capacity on the main thread and if the
> capacity is lower than the current cache usage then we evict necessary space
> on the main thread.
(In email) Michal Novotny wrote:
> https://bugzilla.mozilla.org/show_bug.cgi?id=707436
>
> I'd like to postpone evicting of the entries in case of startup and
> maybe spread the eviction in a longer timerange to avoid massive IO.
> This would mean that we wouldn't respect the cache size for a limited
> time. Is this acceptable?
It seems reasonable to me?
Dumb question #1: Why don't we just assume that the cache is under capacity at startup? Under what circumstances would this assumption be wrong? And, how often would that occur?
Dumb question #2: Why don't we cache the size of the cache, so that we don't have to calculate the size of the cache at all? Doesn't calculating the size of the cache require a LOT of seeks/reads?
Assignee | ||
Comment 2•13 years ago
|
||
(In reply to Brian Smith (:bsmith) from comment #1)
> Dumb question #1: Why don't we just assume that the cache is under capacity
> at startup? Under what circumstances would this assumption be wrong? And,
> how often would that occur?
We need to calculate the smart size sometimes and we do it at startup. The assumption would be wrong whenever SmartCacheSize() returns smaller size than is the actual cache usage. But this is a good point, we could simply delay the smart size calculation instead of delaying the eviction.
> Dumb question #2: Why don't we cache the size of the cache, so that we don't
> have to calculate the size of the cache at all? Doesn't calculating the size
> of the cache require a LOT of seeks/reads?
We do cache the smart size (http://hg.mozilla.org/mozilla-central/annotate/4b71b1e9cc0c/netwerk/cache/nsCacheService.cpp#l231) but the smart size should respect the free space, so we need to recalculate it. In fact, I think that we should do it periodically and not only at startup. It doesn't require a lot of IO, we take a free space on drive and add the usage of the current cache (we store this information in the cache header).
Assignee | ||
Comment 3•13 years ago
|
||
Attachment #582859 -
Flags: review?(jduell.mcbugs)
Comment 4•13 years ago
|
||
Comment on attachment 582859 [details] [diff] [review]
patch v1
Review of attachment 582859 [details] [diff] [review]:
-----------------------------------------------------------------
Sorry I didn't get to this before the code work--was busy with websockets.
I have only one question--looks good otherwise.
::: netwerk/cache/nsCacheService.cpp
@@ +204,5 @@
> + NS_DECL_ISUPPORTS
> +
> + NS_IMETHOD Notify(nsITimer* aTimer) {
> + if (nsCacheService::gService)
> + nsCacheService::gService->SetDiskSmartSize_Locked(true);
How can you call the _Locked version here? We're called by timer here--have you been holding the lock the whole time since the timer init, or is there some other way you're guaranteed to have the lock (or don't need it)?
Assignee | ||
Comment 5•13 years ago
|
||
(In reply to Jason Duell (:jduell) from comment #4)
> > + if (nsCacheService::gService)
> > + nsCacheService::gService->SetDiskSmartSize_Locked(true);
>
> How can you call the _Locked version here? We're called by timer here--have
> you been holding the lock the whole time since the timer init, or is there
> some other way you're guaranteed to have the lock (or don't need it)?
Good catch! The lock was missing.
Attachment #582859 -
Attachment is obsolete: true
Attachment #582859 -
Flags: review?(jduell.mcbugs)
Attachment #583166 -
Flags: review?(jduell.mcbugs)
Comment 6•13 years ago
|
||
Comment on attachment 583166 [details] [diff] [review]
patch v2
Review of attachment 583166 [details] [diff] [review]:
-----------------------------------------------------------------
Interdiff between v1 and v2 is broken, but +r assuming the only significant change is that you're using gService->SetDiskSmartSize instead of gService->SetDiskSmartSize_Locked.
Attachment #583166 -
Flags: review?(jduell.mcbugs) → review+
Assignee | ||
Comment 7•13 years ago
|
||
Whiteboard: [inbound]
Target Milestone: --- → mozilla12
Comment 8•13 years ago
|
||
Backed out in https://hg.mozilla.org/integration/mozilla-inbound/rev/eb614b45e478 - so far only Linux and Mac have finished, but they both have a Mutex/ReentrantMonitor/nsTArray_base/nsThread/nsTimerImpl leak in mochitest-ipcplugins (which, remember, is named backwards, so that means with ipcplugins *disabled*).
Whiteboard: [inbound]
Assignee | ||
Comment 9•13 years ago
|
||
The timer is now canceled on shutdown to avoid the leak.
Result from try server: https://tbpl.mozilla.org/?tree=Try&rev=a382361806c0
All oranges seem to be known intermittent failures...
Attachment #583166 -
Attachment is obsolete: true
Attachment #584235 -
Flags: review?(jduell.mcbugs)
Assignee | ||
Comment 10•13 years ago
|
||
Yet a minor change in nsSetDiskSmartSizeCallback::Notify(). Check if we still have gService before nulling out mSmartSizeTimer.
Attachment #584235 -
Attachment is obsolete: true
Attachment #584235 -
Flags: review?(jduell.mcbugs)
Attachment #584248 -
Flags: review?(jduell.mcbugs)
Updated•13 years ago
|
Attachment #584248 -
Flags: review?(jduell.mcbugs) → review+
Assignee | ||
Comment 11•13 years ago
|
||
pushed to try server https://tbpl.mozilla.org/?tree=Try&rev=128df2696064
Attachment #584248 -
Attachment is obsolete: true
Attachment #585718 -
Flags: review?(bjarne)
Comment 12•13 years ago
|
||
Comment on attachment 585718 [details] [diff] [review]
patch v6 - removes usage of preferences off the main thread
Review of attachment 585718 [details] [diff] [review]:
-----------------------------------------------------------------
r- because of comments below but seems ok otherwise
::: netwerk/cache/nsCacheService.cpp
@@ +1471,5 @@
> + rv = mSmartSizeTimer->InitWithCallback(new nsSetDiskSmartSizeCallback(),
> + 1000*60*3,
> + nsITimer::TYPE_ONE_SHOT);
> + if (NS_FAILED(rv))
> + return rv;
Nit: The 'if' seems a little unnecessary since |rv| is returned below anyway. But the question is whether we can do something better here..? Or is this unlikely to fail (in which case: why do we bother to check it..)?
@@ +2682,4 @@
> }
>
> nsresult
> nsCacheService::SetDiskSmartSize_Locked(bool checkPref)
I don't see the value of the param anymore since this version caches the pref. Afaics, we can just check the cached value.
Attachment #585718 -
Flags: review?(bjarne) → review-
Assignee | ||
Comment 13•13 years ago
|
||
- removed checkPref argument
- added warning when timer initialization fails
Attachment #585718 -
Attachment is obsolete: true
Attachment #585763 -
Flags: review?(bjarne)
Comment 14•13 years ago
|
||
Comment on attachment 585763 [details] [diff] [review]
patch v7
Review of attachment 585763 [details] [diff] [review]:
-----------------------------------------------------------------
Ok, assuming it passes tryserver
Attachment #585763 -
Flags: review?(bjarne) → review+
Assignee | ||
Comment 15•13 years ago
|
||
(In reply to Bjarne (:bjarne) from comment #14)
>
> Ok, assuming it passes tryserver
One orange on WinXP debug - https://tbpl.mozilla.org/?tree=Try&rev=128df2696064
Green on a second try - https://tbpl.mozilla.org/?tree=Try&rev=6ad28964be62
Assignee | ||
Comment 16•13 years ago
|
||
Whiteboard: [inbound]
Assignee | ||
Comment 17•13 years ago
|
||
Comment on attachment 585763 [details] [diff] [review]
patch v7
[Approval Request Comment]
User impact if declined: Slow startup on mobile. Disk cache would need to be disabled.
Testing completed (on m-c, etc.): try server (see comment #15) and local testing
Risk to taking this patch (and alternatives if risky): unknown
Attachment #585763 -
Flags: approval-mozilla-aurora?
Comment 18•13 years ago
|
||
(In reply to Michal Novotny (:michal) from comment #17)
> Comment on attachment 585763 [details] [diff] [review]
> Risk to taking this patch (and alternatives if risky): unknown
Hi Michal - this risk assessment and the size of the patch gives me some pause about approving for Aurora. I've got a couple of questions.
1) Do we have an understanding of how much benefit this will be to the user (% of startup time)?
2) Is this benefit for both XUL and Native on mobile?
3) Since the benefit is on mobile, could we make this patch mobile only to reduce risk for desktop?
Assignee | ||
Comment 19•13 years ago
|
||
(In reply to Alex Keybl [:akeybl] from comment #18)
> (In reply to Michal Novotny (:michal) from comment #17)
> > Comment on attachment 585763 [details] [diff] [review]
> > Risk to taking this patch (and alternatives if risky): unknown
>
> Hi Michal - this risk assessment and the size of the patch gives me some
> pause about approving for Aurora. I've got a couple of questions.
>
> 1) Do we have an understanding of how much benefit this will be to the user
> (% of startup time)?
I don't know the percentage but the benefit is huge, because this can block the main thread for several seconds or even more (depends on how much of space we need to evict and how slow is the disk).
> 2) Is this benefit for both XUL and Native on mobile?
> 3) Since the benefit is on mobile, could we make this patch mobile only to
> reduce risk for desktop?
Sorry, my description wasn't clear enough. The problem is on all: desktop, native mobile and XUL mobile. So the desktop will benefit too. The difference is that we didn't have the disk cache enabled on mobile until FF 11. And if we don't land this fix in FF 11, we need to postpone enabling of the disk cache on mobile.
Comment 20•13 years ago
|
||
(In reply to Alex Keybl [:akeybl] from comment #18)
> 1) Do we have an understanding of how much benefit this will be to the user
> (% of startup time)?
Set hangmonitor.timeout to 30 during a few days in 12.0a1 and compare crash stats with the ones from 11.0a1/20111126 to 11.0a1/20111129 (about 150 crashes per build).
That will also give a Snappy progress snapshot.
Comment 21•13 years ago
|
||
Status: NEW → RESOLVED
Closed: 13 years ago
Resolution: --- → FIXED
Comment 22•13 years ago
|
||
(In reply to Michal Novotny (:michal) from comment #17)
> Risk to taking this patch (and alternatives if risky): unknown
Thanks for the explanation Michal, and the good suggestion Scoobidiver. I don't think this bug alone would drive turning on the hangmonitor in 12 at this point, but CC'ing Sheila since I believe she'll make the call on when we should try our experiment again.
Given the unknown risk, and the fact that nothing points to this being a recent regression, let's let this ride the train.
Updated•13 years ago
|
Attachment #585763 -
Flags: approval-mozilla-aurora? → approval-mozilla-aurora-
You need to log in
before you can comment on or make changes to this bug.
Description
•