Closed Bug 670911 Opened 13 years ago Closed 13 years ago

"clear cache now" (or cache delete after crash) is slow (and hangs UI) on Windows/NTFS

Categories

(Core :: Networking: Cache, defect)

x86
Windows XP
defect
Not set
critical

Tracking

()

RESOLVED FIXED
mozilla8

People

(Reporter: neil, Assigned: jduell.mcbugs)

References

Details

(Keywords: hang)

Attachments

(2 files, 1 obsolete file)

When the cache service detects that the cache is corrupt after a crash, it tries to clear the cache quickly by renaming the folder and then unlocking the cache while it deletes the files in the background. Unfortunately bug 224692 means that all of the files and subfolders in the cache have their security information reset, and the cache is still locked while this happens freezing the UI as soon as it tries to access the cache. Bug 559942 exacerbates the problem by allowing the cache to become bigger and therefore hold more files.
This sounds like something that needs to be fixed on the cache's side.
Component: XPCOM → Networking: Cache
QA Contact: xpcom → networking.cache
The actual hang occurs inside the call to nsLocalFileWin::MoveTo which is why I filed the bug under XPCOM...
Neil, can you explain a bit more why the UI hangs during the folder deletion?  We're renaming the "Cache" folder to "Cache.trash", then deleting it (too quickly: in bug 651011 we're going to slow the deletion down).  Are you saying that the directory rename operation itself causes security info updates on all the files in the directory tree?  Or that even once we've created a new, different "Cache" directory, we're somehow locked while Cache.trash is deleted?

I've really been wondering why this has caused hangs (as opposed to just annoying amounts of I/O)--thanks for reporting!
(In reply to comment #4)
> Neil, can you explain a bit more why the UI hangs during the folder
> deletion?  We're renaming the "Cache" folder to "Cache.trash", then deleting
> it (too quickly: in bug 651011 we're going to slow the deletion down).  Are
> you saying that the directory rename operation itself causes security info
> updates on all the files in the directory tree?
Bug 224692 causes the security to be updated on all the files in the directory tree. Although this happens on a separate thread, the cache is locked until the call to MoveTo completes, causing the UI to hang when it needs the cache.
> Bug 224692 causes the security to be updated on all the files in the directory tree. 

Do we have an API that will allow us to rename the directory w/o all the NTFS permissions getting touched for the whole tree?  From the discussion in bug 224692 it sounds like the native windows move command doesn't touch the info, and we added code to support the case of wanting downloads to inherit permission of their enclosing folder when we move them from TEMP.  Can we turn the code off easily?

I suppose the other alternative is to not rename the Cache directory, and create a new one called Cache-1, etc., but it involves jumping through hoops at startup (which directory is the 'real' Cache?) that I'd rather avoid if we can do a rename quickly.
(In reply to comment #6)
> > Bug 224692 causes the security to be updated on all the files in the directory tree. 
> Do we have an API that will allow us to rename the directory w/o all the
> NTFS permissions getting touched for the whole tree?  From the discussion in
> bug 224692 it sounds like the native windows move command doesn't touch the
> info, and we added code to support the case of wanting downloads to inherit
> permission of their enclosing folder when we move them from TEMP.  Can we
> turn the code off easily?
Well, that depends on your definition of easily. Ideally I would have suggested that we turn the code off if the first parameter for MoveTo is null, but I wasn't sure how easy it was to tell that from the point where the call is made.
Resetting ACLs (as added in bug 224692) is pointless when we're already in the target directory, and hangs UI during corrupted cache rename.
Attachment #547350 - Flags: review?(benjamin)
Currently, for no good reason, we take a bad "Cache" directory and 1) create a Cache.Trash directory, 2) create a "Trash" subdirectory of that, and then 3) move Cache into it, winding up with "Cache.Trash/Trash/Cache".

This changes the logic to simply rename "Cache" -> "Cache.trash".  We need to keep the operation as a rename in the same directory (rather than a move into a subdirectory) in order to keep NTFS ACL resets from slowing down the rename into a potentially minutes-long operation (locking the cache and thus the UI the whole while).

Also adds telemetry calls.  Some of this will conflict with Taras' already half-landed (inbound) patch for bug 672651.  I'll fix conflicts before I land.

Still to come: 1) batching delete of cache dir tree into smaller chunks separated by sleep calls, and 2) adding a 'delay' parameter to delay start of all deletes for some amount of time (at startup we don't want the extra I/O.  But these two patches cover the most horrible part of the bug (the UI freeze), so I'd like to see them land ASAP.
Assignee: nobody → jduell.mcbugs
Status: NEW → ASSIGNED
Attachment #547355 - Flags: review?(michal.novotny)
OS: Windows XP → All
Hardware: x86 → All
Summary: Cache slow to "clear" after a crash → Cache slow to clear (and hangs UI) after a crash
On Windows at least, lowering the thread priority will lower its IO priority as well, so that might be something to try too.
If some folks would be willing to test what I've got so far to verify that the UI no longer locks up, that would be great (I've got an SSD drive and can't replicate any noticeable lockup).  There's a build available at

  http://ftp.mozilla.org/pub/mozilla.org/firefox/try-builds/jduell@mozilla.com-d3f1bb76aa01/try-win32/firefox-8.0a1.en-US.win32.zip

Steps: backup your profile (with a big Cache directory preferably), kill your browser with the task manager, and then try running this build and see if you hang at startup (make sure your cache directory is tiny after the restart, or Firefox may have managed to exit "cleanly": for some reason I don't always find that "End Task' always results in a "corrupted" cache).  Then you can exit, restore your profile and get your old, big cache back.  (Extra credit: try the corrupt restart with your stock Firefox and compare).  Then note in this bug how big your cache was and how it went.  Thanks!

James--thanks for the suggestion: the thread is already getting launched with PR_PRIORITY_LOW, which the NSPR docs say is the lowest possible priority.
Status: ASSIGNED → NEW
OS: All → Windows XP
Hardware: All → x86
And if you have about:telemetry installed

   https://addons.mozilla.org/en-US/firefox/addon/abouttelemetry/

you can also go to "about:telemetry" in the ftp build after you restart, and report the numbers for "Time spent renaming bad Cache to Cache.Trash (ms)" and "Time spent deleting disk cache (ms)"  (about:telemetry won't work on Firefox 5, so no comparison #s alas)
Attachment #547355 - Flags: review?(michal.novotny) → review+
Attachment #547350 - Flags: superreview?(neil)
Attachment #547350 - Flags: review?(benjamin)
Attachment #547350 - Flags: review+
Comment on attachment 547350 [details] [diff] [review]
let renames w/o changing directory skip NTFS ACL reset

>+        // Set security permissions to inherit from parent.
>+        // Note: propagates to all children: slow for big file trees
Nit: needs trailing full stop.

> nsresult
> nsLocalFile::CopyMove(nsIFile *aParentDir, const nsAString &newName, PRBool followSymlinks, PRBool move)
> {
>+    PRBool sameDirRename = PR_FALSE;
>+
>     nsCOMPtr<nsIFile> newParentDir = aParentDir;
>     // check to see if this exists, otherwise return an error.
>     // we will check this by resolving.  If the user wants us
>     // to follow links, then we are talking about the target,
>     // hence we can use the |followSymlinks| parameter.
>     nsresult rv  = ResolveAndStat();
>     if (NS_FAILED(rv))
>         return rv;
> 
>     if (!newParentDir)
>     {
>+        // No need to reset ACLs to "new" dir if not changing directory
>+        sameDirRename = PR_TRUE;
I was looking at this when it dawned on me that sameDirRename == !newParentDir == !aParentDir, and as the latter isn't used anywhere else in the method, you can just pass !aParentDir to CopySingleFile. Or am I missing something?
> you can just pass !aParentDir to CopySingleFile

Yes, that looks like it works.  I don't think it makes the code as easy to follow as the named variable and accompanying comment (and the compiler will presumably make them equally efficient).  But I'm fine with using your logic if you prefer.

Is that a +sr?
Comment on attachment 547350 [details] [diff] [review]
let renames w/o changing directory skip NTFS ACL reset

Sure, if you're happy to do that, then sr=me with that changed.
Attachment #547350 - Flags: superreview?(neil) → superreview+
incorporates comment 16.
Attachment #547350 - Attachment is obsolete: true
Attachment #549053 - Flags: superreview+
Attachment #549053 - Flags: review+
I've inserted a old, non-SSD drive in my machine, and here are some numbers with and without these patches:

With a 877 MB Cache folder:
-----------------------------
Rename to *.Trash (blocks UI)    15179 ms -> 217 ms
Cache open at Startup            15182 ms -> 222 ms
Delete Trash (off-main-thread)   21337 ms -> 19450

With a 293 MB Cache folder:
-----------------------------
Rename to *.Trash (blocks UI)    1728 ms -> 62 ms
Cache open at Startup            1733 ms -> 66 ms
Delete Trash (off-main-thread)   3172 ms -> 8000 ms

So this fix is a big win--we go from blocking the UI (very noticeably) for 15 seconds to 200 ms, a 70x speedup.   (Interesting to note that the I/O times here do not scale linearly--the delete and old-style ACL rename are 3x longer per-byte for the bigger cache).
Blocks: 674869
http://hg.mozilla.org/integration/mozilla-inbound/rev/f5272524806b
http://hg.mozilla.org/integration/mozilla-inbound/rev/85efefa85d5f

I'm moving the rest of the work (delaying the delete IO) to a followup--bug 674869--so we can look at landing this on aurora/beta ASAP.
Whiteboard: [inbound]
Comment on attachment 547355 [details] [diff] [review]
Change Cache.trash logic to not use extra subdirectories

The UI hang here is really gross, and this fix is pretty simple.
Attachment #547355 - Flags: approval-mozilla-beta?
Attachment #547355 - Flags: approval-mozilla-aurora?
Comment on attachment 549053 [details] [diff] [review]
let renames w/o changing directory skip NTFS ACL reset  v2

Note: need both patches to get any benefit.
Attachment #549053 - Flags: approval-mozilla-beta?
Attachment #549053 - Flags: approval-mozilla-aurora?
Oh, got one data point wrong:

Delete Trash (off-main-thread)   3172 ms -> 8000 ms

This is actually inverted (delete got faster, from 8->3 seconds).  But I suspect that's actually noise--there's nothing in this patch that changes the delete logic, just the rename.
Rename to *.Trash (blocks UI)    1728 ms -> 62 ms
Cache open at Startup            1733 ms -> 66 ms

Also, in case it's not obvious, Cache open at Startup blocks on the rename, which is why the Cache open time timing is always just a touch longer.
(In reply to comment #20)
> (Interesting to note that the I/O times here do not scale linearly--the
> delete and old-style ACL rename are 3x longer per-byte for the bigger cache).
It's probably the disk size of the metadata that influences the ACL rename, so just depends on the number of files rather than their size.
http://hg.mozilla.org/mozilla-central/rev/85efefa85d5f
http://hg.mozilla.org/mozilla-central/rev/f5272524806b
Status: NEW → RESOLVED
Closed: 13 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla8
> It's probably the disk size of the metadata that influences the ACL rename,
> so just depends on the number of files rather than their size.

But I did the 877 MB test by making two copies of the 293 directory and storing them as subdirs, so the number of files per MB is the same.  So I'm still not sure what the increased time is about.  But the takeaway is that we should make sure we always test at the upper range of the max cache size...
Whiteboard: [inbound]
Target Milestone: mozilla8 → ---
Comment on attachment 547355 [details] [diff] [review]
Change Cache.trash logic to not use extra subdirectories

We don't want this in aurora or beta and will wait for it to bubble up from central.
Attachment #547355 - Flags: approval-mozilla-beta?
Attachment #547355 - Flags: approval-mozilla-beta-
Attachment #547355 - Flags: approval-mozilla-aurora?
Attachment #547355 - Flags: approval-mozilla-aurora-
Attachment #549053 - Flags: approval-mozilla-beta?
Attachment #549053 - Flags: approval-mozilla-beta-
Attachment #549053 - Flags: approval-mozilla-aurora?
Attachment #549053 - Flags: approval-mozilla-aurora-
There might be a regression in bug 675420, but I'm not sure. I'm only seeing a problem in Nightly, while another tester claims that it also exists in 5.0 and up.
bug 675420 is caused by a different bugfix.  This bugfix is innocent of regressions (so far :)
Severity: normal → blocker
Severity: blocker → critical
Renaming to make it clear that this also covers the case where a user hits "Clear Now" to clear the cache. (so not just cache delete after a crash).

The fix lands in Firefox/Gecko 8.
Summary: Cache slow to clear (and hangs UI) after a crash → "clear cache now" (or cache delete after crash) is slow (and hangs UI) on Windows/NTFS
Target Milestone: --- → mozilla8
I've tried clearing cache on Windows XP and it's still slow (Firefox Beta).
There's also bug 730662 that is about slow clearing of Crash.Trash folder after a crash (on Windows 7).
Sorry, Cache.Trash.
(In reply to Marco Castelluccio from comment #39)
> I've tried clearing cache on Windows XP and it's still slow (Firefox Beta).

Was this slow because of deleting Cache.Trash or because of evicting all entries (bug #648232)?
I don't know, I'll try to investigate with your script to fill the cache (however I don't know how much time it will take, as the computer wasn't mine and I haven't access to it currently).
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Creator:
Created:
Updated:
Size: