Closed Bug 707436 Opened 13 years ago Closed 13 years ago

nsSetSmartSizeEvent can cause a lot of IO on the main thread

Categories

(Core :: Networking: Cache, defect)

defect
Not set
critical

Tracking

()

RESOLVED FIXED
mozilla12

People

(Reporter: michal, Assigned: michal)

References

Details

(Keywords: hang, main-thread-io, Whiteboard: [inbound])

Crash Data

Attachments

(1 file, 5 obsolete files)

We count the smart size of the cache on the background thread to avoid IO on the main thread, but we set the capacity on the main thread and if the capacity is lower than the current cache usage then we evict necessary space on the main thread.

Frame 	Module 	Signature [Expand] 	Source
0 	ntdll.dll 	ZwQueryFullAttributesFile 	
1 	KERNELBASE.dll 	KERNELBASE.dll@0x81cc 	
2 	xul.dll 	nsAString_internal::Assign 	xpcom/string/src/nsTSubstring.cpp:336
3 	xul.dll 	GetFileInfo 	xpcom/io/nsLocalFileWin.cpp:460
4 	xul.dll 	nsLocalFile::ResolveAndStat 	xpcom/io/nsLocalFileWin.cpp:814
5 	msvcr90.dll 	sprintf 	
6 	nspr4.dll 	nspr4.dll@0xa69f 	
7 	xul.dll 	IsShortcutPath 	xpcom/io/nsLocalFileWin.cpp:320
8 	xul.dll 	nsLocalFile::IsDirectory 	xpcom/io/nsLocalFileWin.cpp:2471
9 	xul.dll 	nsLocalFile::Remove 	xpcom/io/nsLocalFileWin.cpp:1833
10 	xul.dll 	nsDiskCacheMap::DeleteStorage 	
11 	nspr4.dll 	MD_CURRENT_THREAD 	nsprpub/pr/src/md/windows/w95thred.c:308
12 	xul.dll 	nsDiskCacheMap::DeleteStorage 	netwerk/cache/nsDiskCacheMap.cpp:996
13 	xul.dll 	nsDiskCacheEvictor::VisitRecord 	netwerk/cache/nsDiskCacheDevice.cpp:189
14 	xul.dll 	nsDiskCacheMap::VisitEachRecord 	netwerk/cache/nsDiskCacheMap.cpp:541
15 	xul.dll 	nsDiskCacheMap::EvictRecords 	netwerk/cache/nsDiskCacheMap.cpp:614
16 	xul.dll 	nsDiskCacheDevice::EvictDiskCacheEntries 	
17 	xul.dll 	nsDiskCacheDevice::SetCapacity 	netwerk/cache/nsDiskCacheDevice.cpp:1148
18 	xul.dll 	nsCacheService::SetDiskCacheCapacity 	netwerk/cache/nsCacheService.cpp:2059
19 	xul.dll 	nsSetSmartSizeEvent::Run 	netwerk/cache/nsCacheService.cpp:230
20 	xul.dll 	nsThread::ProcessNextEvent 	xpcom/threads/nsThread.cpp:625
21 	xul.dll 	CallCreateInstance 	obj-firefox/xpcom/build/nsComponentManagerUtils.cpp:170
22 	nspr4.dll 	MD_CURRENT_THREAD 	nsprpub/pr/src/md/windows/w95thred.c:308
23 	xul.dll 	MessageLoop::DoWork 	ipc/chromium/src/base/message_loop.cc:412
24 	xul.dll 	BaseStringEnumerator::Release 	xpcom/components/nsCategoryManager.cpp:129
25 	xul.dll 	xul.dll@0xee7af 	
26 	xul.dll 	mozilla::ipc::MessagePump::Run 	ipc/glue/MessagePump.cpp:110
27 	xul.dll 	nsCommandLine::EnumerateHandlers 	toolkit/components/commandlines/nsCommandLine.cpp:600
28 	xul.dll 	xul.dll@0x4621f 	
29 	xul.dll 	MessageLoop::RunHandler 	ipc/chromium/src/base/message_loop.cc:201
30 	xul.dll 	MessageLoop::Run 	ipc/chromium/src/base/message_loop.cc:175
31 	msvcr90.dll 	getenv_helper_nolock 	
32 	xul.dll 	MessageLoop::current 	ipc/chromium/src/base/message_loop.cc:85
33 	xul.dll 	nsBaseAppShell::Run 	widget/src/xpwidgets/nsBaseAppShell.cpp:189
34 	xul.dll 	nsAppStartup::Run 	toolkit/components/startup/nsAppStartup.cpp:220
35 	xul.dll 	XRE_main 	toolkit/xre/nsAppRunner.cpp:3558
Crash Signature: [@ chromehang | ZwQueryFullAttributesFile ]
Keywords: hang
Blocks: 701909
(In reply to Michal Novotny (:michal) from comment #0)
> We count the smart size of the cache on the background thread to avoid IO on
> the main thread, but we set the capacity on the main thread and if the
> capacity is lower than the current cache usage then we evict necessary space
> on the main thread.

(In email) Michal Novotny wrote:
> https://bugzilla.mozilla.org/show_bug.cgi?id=707436
> 
> I'd like to postpone evicting of the entries in case of startup and
> maybe spread the eviction in a longer timerange to avoid massive IO.
> This would mean that we wouldn't respect the cache size for a limited
> time. Is this acceptable?

It seems reasonable to me?

Dumb question #1: Why don't we just assume that the cache is under capacity at startup? Under what circumstances would this assumption be wrong? And, how often would that occur?

Dumb question #2: Why don't we cache the size of the cache, so that we don't have to calculate the size of the cache at all? Doesn't calculating the size of the cache require a LOT of seeks/reads?
(In reply to Brian Smith (:bsmith) from comment #1)
> Dumb question #1: Why don't we just assume that the cache is under capacity
> at startup? Under what circumstances would this assumption be wrong? And,
> how often would that occur?

We need to calculate the smart size sometimes and we do it at startup. The assumption would be wrong whenever SmartCacheSize() returns smaller size than is the actual cache usage. But this is a good point, we could simply delay the smart size calculation instead of delaying the eviction.

> Dumb question #2: Why don't we cache the size of the cache, so that we don't
> have to calculate the size of the cache at all? Doesn't calculating the size
> of the cache require a LOT of seeks/reads?

We do cache the smart size (http://hg.mozilla.org/mozilla-central/annotate/4b71b1e9cc0c/netwerk/cache/nsCacheService.cpp#l231) but the smart size should respect the free space, so we need to recalculate it. In fact, I think that we should do it periodically and not only at startup. It doesn't require a lot of IO, we take a free space on drive and add the usage of the current cache (we store this information in the cache header).
Attached patch patch v1 (obsolete) — Splinter Review
Attachment #582859 - Flags: review?(jduell.mcbugs)
Comment on attachment 582859 [details] [diff] [review]
patch v1

Review of attachment 582859 [details] [diff] [review]:
-----------------------------------------------------------------

Sorry I didn't get to this before the code work--was busy with websockets.

I have only one question--looks good otherwise.

::: netwerk/cache/nsCacheService.cpp
@@ +204,5 @@
> +    NS_DECL_ISUPPORTS
> +
> +    NS_IMETHOD Notify(nsITimer* aTimer) {
> +        if (nsCacheService::gService)
> +            nsCacheService::gService->SetDiskSmartSize_Locked(true);

How can you call the _Locked version here?  We're called by timer here--have you been holding the lock the whole time since the timer init, or is there some other way you're guaranteed to have the lock (or don't need it)?
Attached patch patch v2 (obsolete) — Splinter Review
(In reply to Jason Duell (:jduell) from comment #4)
> > +        if (nsCacheService::gService)
> > +            nsCacheService::gService->SetDiskSmartSize_Locked(true);
> 
> How can you call the _Locked version here?  We're called by timer here--have
> you been holding the lock the whole time since the timer init, or is there
> some other way you're guaranteed to have the lock (or don't need it)?

Good catch! The lock was missing.
Attachment #582859 - Attachment is obsolete: true
Attachment #582859 - Flags: review?(jduell.mcbugs)
Attachment #583166 - Flags: review?(jduell.mcbugs)
Comment on attachment 583166 [details] [diff] [review]
patch v2

Review of attachment 583166 [details] [diff] [review]:
-----------------------------------------------------------------

Interdiff between v1 and v2 is broken, but +r assuming the only significant change is that you're using gService->SetDiskSmartSize instead of gService->SetDiskSmartSize_Locked.
Attachment #583166 - Flags: review?(jduell.mcbugs) → review+
http://hg.mozilla.org/integration/mozilla-inbound/rev/8d1b087b9713
Whiteboard: [inbound]
Target Milestone: --- → mozilla12
Backed out in https://hg.mozilla.org/integration/mozilla-inbound/rev/eb614b45e478 - so far only Linux and Mac have finished, but they both have a Mutex/ReentrantMonitor/nsTArray_base/nsThread/nsTimerImpl leak in mochitest-ipcplugins (which, remember, is named backwards, so that means with ipcplugins *disabled*).
Whiteboard: [inbound]
Attached patch patch v4 (obsolete) — Splinter Review
The timer is now canceled on shutdown to avoid the leak.

Result from try server: https://tbpl.mozilla.org/?tree=Try&rev=a382361806c0
All oranges seem to be known intermittent failures...
Attachment #583166 - Attachment is obsolete: true
Attachment #584235 - Flags: review?(jduell.mcbugs)
Attached patch patch v5 (obsolete) — Splinter Review
Yet a minor change in nsSetDiskSmartSizeCallback::Notify(). Check if we still have gService before nulling out mSmartSizeTimer.
Attachment #584235 - Attachment is obsolete: true
Attachment #584235 - Flags: review?(jduell.mcbugs)
Attachment #584248 - Flags: review?(jduell.mcbugs)
Attachment #584248 - Flags: review?(jduell.mcbugs) → review+
pushed to try server https://tbpl.mozilla.org/?tree=Try&rev=128df2696064
Attachment #584248 - Attachment is obsolete: true
Attachment #585718 - Flags: review?(bjarne)
Comment on attachment 585718 [details] [diff] [review]
patch v6 - removes usage of preferences off the main thread

Review of attachment 585718 [details] [diff] [review]:
-----------------------------------------------------------------

r- because of comments below but seems ok otherwise

::: netwerk/cache/nsCacheService.cpp
@@ +1471,5 @@
> +        rv = mSmartSizeTimer->InitWithCallback(new nsSetDiskSmartSizeCallback(),
> +                                               1000*60*3,
> +                                               nsITimer::TYPE_ONE_SHOT);
> +        if (NS_FAILED(rv))
> +            return rv;

Nit: The 'if' seems a little unnecessary since |rv| is returned below anyway. But the question is whether we can do something better here..? Or is this unlikely to fail (in which case: why do we bother to check it..)?

@@ +2682,4 @@
>  }
>  
>  nsresult
>  nsCacheService::SetDiskSmartSize_Locked(bool checkPref)

I don't see the value of the param anymore since this version caches the pref. Afaics, we can just check the cached value.
Attachment #585718 - Flags: review?(bjarne) → review-
Attached patch patch v7Splinter Review
- removed checkPref argument
- added warning when timer initialization fails
Attachment #585718 - Attachment is obsolete: true
Attachment #585763 - Flags: review?(bjarne)
Comment on attachment 585763 [details] [diff] [review]
patch v7

Review of attachment 585763 [details] [diff] [review]:
-----------------------------------------------------------------

Ok, assuming it passes tryserver
Attachment #585763 - Flags: review?(bjarne) → review+
(In reply to Bjarne (:bjarne) from comment #14)
> 
> Ok, assuming it passes tryserver

One orange on WinXP debug - https://tbpl.mozilla.org/?tree=Try&rev=128df2696064
Green on a second try - https://tbpl.mozilla.org/?tree=Try&rev=6ad28964be62
Comment on attachment 585763 [details] [diff] [review]
patch v7

[Approval Request Comment]
User impact if declined: Slow startup on mobile. Disk cache would need to be disabled.
Testing completed (on m-c, etc.): try server (see comment #15) and local testing
Risk to taking this patch (and alternatives if risky): unknown
Attachment #585763 - Flags: approval-mozilla-aurora?
(In reply to Michal Novotny (:michal) from comment #17)
> Comment on attachment 585763 [details] [diff] [review]
> Risk to taking this patch (and alternatives if risky): unknown

Hi Michal - this risk assessment and the size of the patch gives me some pause about approving for Aurora. I've got a couple of questions.

1) Do we have an understanding of how much benefit this will be to the user (% of startup time)?
2) Is this benefit for both XUL and Native on mobile?
3) Since the benefit is on mobile, could we make this patch mobile only to reduce risk for desktop?
(In reply to Alex Keybl [:akeybl] from comment #18)
> (In reply to Michal Novotny (:michal) from comment #17)
> > Comment on attachment 585763 [details] [diff] [review]
> > Risk to taking this patch (and alternatives if risky): unknown
> 
> Hi Michal - this risk assessment and the size of the patch gives me some
> pause about approving for Aurora. I've got a couple of questions.
> 
> 1) Do we have an understanding of how much benefit this will be to the user
> (% of startup time)?

I don't know the percentage but the benefit is huge, because this can block the main thread for several seconds or even more (depends on how much of space we need to evict and how slow is the disk).

> 2) Is this benefit for both XUL and Native on mobile?
> 3) Since the benefit is on mobile, could we make this patch mobile only to
> reduce risk for desktop?

Sorry, my description wasn't clear enough. The problem is on all: desktop, native mobile and XUL mobile. So the desktop will benefit too. The difference is that we didn't have the disk cache enabled on mobile until FF 11. And if we don't land this fix in FF 11, we need to postpone enabling of the disk cache on mobile.
(In reply to Alex Keybl [:akeybl] from comment #18)
> 1) Do we have an understanding of how much benefit this will be to the user
> (% of startup time)?
Set hangmonitor.timeout to 30 during a few days in 12.0a1 and compare crash stats with the ones from 11.0a1/20111126 to 11.0a1/20111129 (about 150 crashes per build).
That will also give a Snappy progress snapshot.
https://hg.mozilla.org/mozilla-central/rev/461a7c923594
Status: NEW → RESOLVED
Closed: 13 years ago
Resolution: --- → FIXED
(In reply to Michal Novotny (:michal) from comment #17)
> Risk to taking this patch (and alternatives if risky): unknown

Thanks for the explanation Michal, and the good suggestion Scoobidiver. I don't think this bug alone would drive turning on the hangmonitor in 12 at this point, but CC'ing Sheila since I believe she'll make the call on when we should try our experiment again.

Given the unknown risk, and the fact that nothing points to this being a recent regression, let's let this ride the train.
Attachment #585763 - Flags: approval-mozilla-aurora? → approval-mozilla-aurora-
Depends on: 716289
Depends on: 728915
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: