Closed Bug 707436 Opened 13 years ago Closed 13 years ago

nsSetSmartSizeEvent can cause a lot of IO on the main thread

Categories

(Core :: Networking: Cache, defect)

defect
Not set
critical

Tracking

()

RESOLVED FIXED
mozilla12

People

(Reporter: michal, Assigned: michal)

References

Details

(Keywords: hang, main-thread-io, Whiteboard: [inbound])

Crash Data

Attachments

(1 file, 5 obsolete files)

We count the smart size of the cache on the background thread to avoid IO on the main thread, but we set the capacity on the main thread and if the capacity is lower than the current cache usage then we evict necessary space on the main thread. Frame Module Signature [Expand] Source 0 ntdll.dll ZwQueryFullAttributesFile 1 KERNELBASE.dll KERNELBASE.dll@0x81cc 2 xul.dll nsAString_internal::Assign xpcom/string/src/nsTSubstring.cpp:336 3 xul.dll GetFileInfo xpcom/io/nsLocalFileWin.cpp:460 4 xul.dll nsLocalFile::ResolveAndStat xpcom/io/nsLocalFileWin.cpp:814 5 msvcr90.dll sprintf 6 nspr4.dll nspr4.dll@0xa69f 7 xul.dll IsShortcutPath xpcom/io/nsLocalFileWin.cpp:320 8 xul.dll nsLocalFile::IsDirectory xpcom/io/nsLocalFileWin.cpp:2471 9 xul.dll nsLocalFile::Remove xpcom/io/nsLocalFileWin.cpp:1833 10 xul.dll nsDiskCacheMap::DeleteStorage 11 nspr4.dll MD_CURRENT_THREAD nsprpub/pr/src/md/windows/w95thred.c:308 12 xul.dll nsDiskCacheMap::DeleteStorage netwerk/cache/nsDiskCacheMap.cpp:996 13 xul.dll nsDiskCacheEvictor::VisitRecord netwerk/cache/nsDiskCacheDevice.cpp:189 14 xul.dll nsDiskCacheMap::VisitEachRecord netwerk/cache/nsDiskCacheMap.cpp:541 15 xul.dll nsDiskCacheMap::EvictRecords netwerk/cache/nsDiskCacheMap.cpp:614 16 xul.dll nsDiskCacheDevice::EvictDiskCacheEntries 17 xul.dll nsDiskCacheDevice::SetCapacity netwerk/cache/nsDiskCacheDevice.cpp:1148 18 xul.dll nsCacheService::SetDiskCacheCapacity netwerk/cache/nsCacheService.cpp:2059 19 xul.dll nsSetSmartSizeEvent::Run netwerk/cache/nsCacheService.cpp:230 20 xul.dll nsThread::ProcessNextEvent xpcom/threads/nsThread.cpp:625 21 xul.dll CallCreateInstance obj-firefox/xpcom/build/nsComponentManagerUtils.cpp:170 22 nspr4.dll MD_CURRENT_THREAD nsprpub/pr/src/md/windows/w95thred.c:308 23 xul.dll MessageLoop::DoWork ipc/chromium/src/base/message_loop.cc:412 24 xul.dll BaseStringEnumerator::Release xpcom/components/nsCategoryManager.cpp:129 25 xul.dll xul.dll@0xee7af 26 xul.dll mozilla::ipc::MessagePump::Run ipc/glue/MessagePump.cpp:110 27 xul.dll nsCommandLine::EnumerateHandlers toolkit/components/commandlines/nsCommandLine.cpp:600 28 xul.dll xul.dll@0x4621f 29 xul.dll MessageLoop::RunHandler ipc/chromium/src/base/message_loop.cc:201 30 xul.dll MessageLoop::Run ipc/chromium/src/base/message_loop.cc:175 31 msvcr90.dll getenv_helper_nolock 32 xul.dll MessageLoop::current ipc/chromium/src/base/message_loop.cc:85 33 xul.dll nsBaseAppShell::Run widget/src/xpwidgets/nsBaseAppShell.cpp:189 34 xul.dll nsAppStartup::Run toolkit/components/startup/nsAppStartup.cpp:220 35 xul.dll XRE_main toolkit/xre/nsAppRunner.cpp:3558
Crash Signature: [@ chromehang | ZwQueryFullAttributesFile ]
Keywords: hang
Blocks: 701909
(In reply to Michal Novotny (:michal) from comment #0) > We count the smart size of the cache on the background thread to avoid IO on > the main thread, but we set the capacity on the main thread and if the > capacity is lower than the current cache usage then we evict necessary space > on the main thread. (In email) Michal Novotny wrote: > https://bugzilla.mozilla.org/show_bug.cgi?id=707436 > > I'd like to postpone evicting of the entries in case of startup and > maybe spread the eviction in a longer timerange to avoid massive IO. > This would mean that we wouldn't respect the cache size for a limited > time. Is this acceptable? It seems reasonable to me? Dumb question #1: Why don't we just assume that the cache is under capacity at startup? Under what circumstances would this assumption be wrong? And, how often would that occur? Dumb question #2: Why don't we cache the size of the cache, so that we don't have to calculate the size of the cache at all? Doesn't calculating the size of the cache require a LOT of seeks/reads?
(In reply to Brian Smith (:bsmith) from comment #1) > Dumb question #1: Why don't we just assume that the cache is under capacity > at startup? Under what circumstances would this assumption be wrong? And, > how often would that occur? We need to calculate the smart size sometimes and we do it at startup. The assumption would be wrong whenever SmartCacheSize() returns smaller size than is the actual cache usage. But this is a good point, we could simply delay the smart size calculation instead of delaying the eviction. > Dumb question #2: Why don't we cache the size of the cache, so that we don't > have to calculate the size of the cache at all? Doesn't calculating the size > of the cache require a LOT of seeks/reads? We do cache the smart size (http://hg.mozilla.org/mozilla-central/annotate/4b71b1e9cc0c/netwerk/cache/nsCacheService.cpp#l231) but the smart size should respect the free space, so we need to recalculate it. In fact, I think that we should do it periodically and not only at startup. It doesn't require a lot of IO, we take a free space on drive and add the usage of the current cache (we store this information in the cache header).
Attached patch patch v1 (obsolete) — Splinter Review
Attachment #582859 - Flags: review?(jduell.mcbugs)
Comment on attachment 582859 [details] [diff] [review] patch v1 Review of attachment 582859 [details] [diff] [review]: ----------------------------------------------------------------- Sorry I didn't get to this before the code work--was busy with websockets. I have only one question--looks good otherwise. ::: netwerk/cache/nsCacheService.cpp @@ +204,5 @@ > + NS_DECL_ISUPPORTS > + > + NS_IMETHOD Notify(nsITimer* aTimer) { > + if (nsCacheService::gService) > + nsCacheService::gService->SetDiskSmartSize_Locked(true); How can you call the _Locked version here? We're called by timer here--have you been holding the lock the whole time since the timer init, or is there some other way you're guaranteed to have the lock (or don't need it)?
Attached patch patch v2 (obsolete) — Splinter Review
(In reply to Jason Duell (:jduell) from comment #4) > > + if (nsCacheService::gService) > > + nsCacheService::gService->SetDiskSmartSize_Locked(true); > > How can you call the _Locked version here? We're called by timer here--have > you been holding the lock the whole time since the timer init, or is there > some other way you're guaranteed to have the lock (or don't need it)? Good catch! The lock was missing.
Attachment #582859 - Attachment is obsolete: true
Attachment #582859 - Flags: review?(jduell.mcbugs)
Attachment #583166 - Flags: review?(jduell.mcbugs)
Comment on attachment 583166 [details] [diff] [review] patch v2 Review of attachment 583166 [details] [diff] [review]: ----------------------------------------------------------------- Interdiff between v1 and v2 is broken, but +r assuming the only significant change is that you're using gService->SetDiskSmartSize instead of gService->SetDiskSmartSize_Locked.
Attachment #583166 - Flags: review?(jduell.mcbugs) → review+
Whiteboard: [inbound]
Target Milestone: --- → mozilla12
Backed out in https://hg.mozilla.org/integration/mozilla-inbound/rev/eb614b45e478 - so far only Linux and Mac have finished, but they both have a Mutex/ReentrantMonitor/nsTArray_base/nsThread/nsTimerImpl leak in mochitest-ipcplugins (which, remember, is named backwards, so that means with ipcplugins *disabled*).
Whiteboard: [inbound]
Attached patch patch v4 (obsolete) — Splinter Review
The timer is now canceled on shutdown to avoid the leak. Result from try server: https://tbpl.mozilla.org/?tree=Try&rev=a382361806c0 All oranges seem to be known intermittent failures...
Attachment #583166 - Attachment is obsolete: true
Attachment #584235 - Flags: review?(jduell.mcbugs)
Attached patch patch v5 (obsolete) — Splinter Review
Yet a minor change in nsSetDiskSmartSizeCallback::Notify(). Check if we still have gService before nulling out mSmartSizeTimer.
Attachment #584235 - Attachment is obsolete: true
Attachment #584235 - Flags: review?(jduell.mcbugs)
Attachment #584248 - Flags: review?(jduell.mcbugs)
Attachment #584248 - Flags: review?(jduell.mcbugs) → review+
Attachment #584248 - Attachment is obsolete: true
Attachment #585718 - Flags: review?(bjarne)
Comment on attachment 585718 [details] [diff] [review] patch v6 - removes usage of preferences off the main thread Review of attachment 585718 [details] [diff] [review]: ----------------------------------------------------------------- r- because of comments below but seems ok otherwise ::: netwerk/cache/nsCacheService.cpp @@ +1471,5 @@ > + rv = mSmartSizeTimer->InitWithCallback(new nsSetDiskSmartSizeCallback(), > + 1000*60*3, > + nsITimer::TYPE_ONE_SHOT); > + if (NS_FAILED(rv)) > + return rv; Nit: The 'if' seems a little unnecessary since |rv| is returned below anyway. But the question is whether we can do something better here..? Or is this unlikely to fail (in which case: why do we bother to check it..)? @@ +2682,4 @@ > } > > nsresult > nsCacheService::SetDiskSmartSize_Locked(bool checkPref) I don't see the value of the param anymore since this version caches the pref. Afaics, we can just check the cached value.
Attachment #585718 - Flags: review?(bjarne) → review-
Attached patch patch v7Splinter Review
- removed checkPref argument - added warning when timer initialization fails
Attachment #585718 - Attachment is obsolete: true
Attachment #585763 - Flags: review?(bjarne)
Comment on attachment 585763 [details] [diff] [review] patch v7 Review of attachment 585763 [details] [diff] [review]: ----------------------------------------------------------------- Ok, assuming it passes tryserver
Attachment #585763 - Flags: review?(bjarne) → review+
(In reply to Bjarne (:bjarne) from comment #14) > > Ok, assuming it passes tryserver One orange on WinXP debug - https://tbpl.mozilla.org/?tree=Try&rev=128df2696064 Green on a second try - https://tbpl.mozilla.org/?tree=Try&rev=6ad28964be62
Comment on attachment 585763 [details] [diff] [review] patch v7 [Approval Request Comment] User impact if declined: Slow startup on mobile. Disk cache would need to be disabled. Testing completed (on m-c, etc.): try server (see comment #15) and local testing Risk to taking this patch (and alternatives if risky): unknown
Attachment #585763 - Flags: approval-mozilla-aurora?
(In reply to Michal Novotny (:michal) from comment #17) > Comment on attachment 585763 [details] [diff] [review] > Risk to taking this patch (and alternatives if risky): unknown Hi Michal - this risk assessment and the size of the patch gives me some pause about approving for Aurora. I've got a couple of questions. 1) Do we have an understanding of how much benefit this will be to the user (% of startup time)? 2) Is this benefit for both XUL and Native on mobile? 3) Since the benefit is on mobile, could we make this patch mobile only to reduce risk for desktop?
(In reply to Alex Keybl [:akeybl] from comment #18) > (In reply to Michal Novotny (:michal) from comment #17) > > Comment on attachment 585763 [details] [diff] [review] > > Risk to taking this patch (and alternatives if risky): unknown > > Hi Michal - this risk assessment and the size of the patch gives me some > pause about approving for Aurora. I've got a couple of questions. > > 1) Do we have an understanding of how much benefit this will be to the user > (% of startup time)? I don't know the percentage but the benefit is huge, because this can block the main thread for several seconds or even more (depends on how much of space we need to evict and how slow is the disk). > 2) Is this benefit for both XUL and Native on mobile? > 3) Since the benefit is on mobile, could we make this patch mobile only to > reduce risk for desktop? Sorry, my description wasn't clear enough. The problem is on all: desktop, native mobile and XUL mobile. So the desktop will benefit too. The difference is that we didn't have the disk cache enabled on mobile until FF 11. And if we don't land this fix in FF 11, we need to postpone enabling of the disk cache on mobile.
(In reply to Alex Keybl [:akeybl] from comment #18) > 1) Do we have an understanding of how much benefit this will be to the user > (% of startup time)? Set hangmonitor.timeout to 30 during a few days in 12.0a1 and compare crash stats with the ones from 11.0a1/20111126 to 11.0a1/20111129 (about 150 crashes per build). That will also give a Snappy progress snapshot.
Status: NEW → RESOLVED
Closed: 13 years ago
Resolution: --- → FIXED
(In reply to Michal Novotny (:michal) from comment #17) > Risk to taking this patch (and alternatives if risky): unknown Thanks for the explanation Michal, and the good suggestion Scoobidiver. I don't think this bug alone would drive turning on the hangmonitor in 12 at this point, but CC'ing Sheila since I believe she'll make the call on when we should try our experiment again. Given the unknown risk, and the fact that nothing points to this being a recent regression, let's let this ride the train.
Attachment #585763 - Flags: approval-mozilla-aurora? → approval-mozilla-aurora-
Depends on: 716289
Depends on: 728915
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: