Last Comment Bug 670911 - "clear cache now" (or cache delete after crash) is slow (and hangs UI) on Windows/NTFS
: "clear cache now" (or cache delete after crash) is slow (and hangs UI) on Win...
Status: RESOLVED FIXED
: hang
Product: Core
Classification: Components
Component: Networking: Cache (show other bugs)
: unspecified
: x86 Windows XP
: -- critical with 1 vote (vote)
: mozilla8
Assigned To: Jason Duell [:jduell] (needinfo me)
:
: Patrick McManus [:mcmanus]
Mentors:
: 630420 648888 649530 651011 669140 687232 691075 696292 (view as bug list)
Depends on:
Blocks: 224692 559942 674869
  Show dependency treegraph
 
Reported: 2011-07-12 06:10 PDT by neil@parkwaycc.co.uk
Modified: 2012-02-29 00:06 PST (History)
25 users (show)
See Also:
Crash Signature:
(edit)
QA Whiteboard:
Iteration: ---
Points: ---
Has Regression Range: ---
Has STR: ---


Attachments
let renames w/o changing directory skip NTFS ACL reset (4.72 KB, patch)
2011-07-21 02:23 PDT, Jason Duell [:jduell] (needinfo me)
benjamin: review+
neil: superreview+
Details | Diff | Splinter Review
Change Cache.trash logic to not use extra subdirectories (5.40 KB, patch)
2011-07-21 03:19 PDT, Jason Duell [:jduell] (needinfo me)
michal.novotny: review+
christian: approval‑mozilla‑aurora-
christian: approval‑mozilla‑beta-
Details | Diff | Splinter Review
let renames w/o changing directory skip NTFS ACL reset v2 (4.24 KB, patch)
2011-07-28 03:16 PDT, Jason Duell [:jduell] (needinfo me)
jduell.mcbugs: review+
jduell.mcbugs: superreview+
christian: approval‑mozilla‑aurora-
christian: approval‑mozilla‑beta-
Details | Diff | Splinter Review

Description neil@parkwaycc.co.uk 2011-07-12 06:10:57 PDT
When the cache service detects that the cache is corrupt after a crash, it tries to clear the cache quickly by renaming the folder and then unlocking the cache while it deletes the files in the background. Unfortunately bug 224692 means that all of the files and subfolders in the cache have their security information reset, and the cache is still locked while this happens freezing the UI as soon as it tries to access the cache. Bug 559942 exacerbates the problem by allowing the cache to become bigger and therefore hold more files.
Comment 1 Josh Matthews [:jdm] (on vacation until Dec 5) 2011-07-12 07:23:08 PDT
This sounds like something that needs to be fixed on the cache's side.
Comment 2 neil@parkwaycc.co.uk 2011-07-12 07:44:26 PDT
The actual hang occurs inside the call to nsLocalFileWin::MoveTo which is why I filed the bug under XPCOM...
Comment 3 :Gavin Sharp [email: gavin@gavinsharp.com] 2011-07-12 16:21:32 PDT
See also bug 651011.
Comment 4 Jason Duell [:jduell] (needinfo me) 2011-07-13 18:50:57 PDT
Neil, can you explain a bit more why the UI hangs during the folder deletion?  We're renaming the "Cache" folder to "Cache.trash", then deleting it (too quickly: in bug 651011 we're going to slow the deletion down).  Are you saying that the directory rename operation itself causes security info updates on all the files in the directory tree?  Or that even once we've created a new, different "Cache" directory, we're somehow locked while Cache.trash is deleted?

I've really been wondering why this has caused hangs (as opposed to just annoying amounts of I/O)--thanks for reporting!
Comment 5 neil@parkwaycc.co.uk 2011-07-14 01:11:54 PDT
(In reply to comment #4)
> Neil, can you explain a bit more why the UI hangs during the folder
> deletion?  We're renaming the "Cache" folder to "Cache.trash", then deleting
> it (too quickly: in bug 651011 we're going to slow the deletion down).  Are
> you saying that the directory rename operation itself causes security info
> updates on all the files in the directory tree?
Bug 224692 causes the security to be updated on all the files in the directory tree. Although this happens on a separate thread, the cache is locked until the call to MoveTo completes, causing the UI to hang when it needs the cache.
Comment 6 Jason Duell [:jduell] (needinfo me) 2011-07-14 02:01:11 PDT
> Bug 224692 causes the security to be updated on all the files in the directory tree. 

Do we have an API that will allow us to rename the directory w/o all the NTFS permissions getting touched for the whole tree?  From the discussion in bug 224692 it sounds like the native windows move command doesn't touch the info, and we added code to support the case of wanting downloads to inherit permission of their enclosing folder when we move them from TEMP.  Can we turn the code off easily?

I suppose the other alternative is to not rename the Cache directory, and create a new one called Cache-1, etc., but it involves jumping through hoops at startup (which directory is the 'real' Cache?) that I'd rather avoid if we can do a rename quickly.
Comment 7 neil@parkwaycc.co.uk 2011-07-14 02:44:07 PDT
(In reply to comment #6)
> > Bug 224692 causes the security to be updated on all the files in the directory tree. 
> Do we have an API that will allow us to rename the directory w/o all the
> NTFS permissions getting touched for the whole tree?  From the discussion in
> bug 224692 it sounds like the native windows move command doesn't touch the
> info, and we added code to support the case of wanting downloads to inherit
> permission of their enclosing folder when we move them from TEMP.  Can we
> turn the code off easily?
Well, that depends on your definition of easily. Ideally I would have suggested that we turn the code off if the first parameter for MoveTo is null, but I wasn't sure how easy it was to tell that from the point where the call is made.
Comment 8 Jason Duell [:jduell] (needinfo me) 2011-07-14 12:40:10 PDT
*** Bug 651011 has been marked as a duplicate of this bug. ***
Comment 9 Jason Duell [:jduell] (needinfo me) 2011-07-14 12:40:37 PDT
*** Bug 669140 has been marked as a duplicate of this bug. ***
Comment 10 Jason Duell [:jduell] (needinfo me) 2011-07-19 23:13:35 PDT
*** Bug 672651 has been marked as a duplicate of this bug. ***
Comment 11 Jason Duell [:jduell] (needinfo me) 2011-07-21 02:23:53 PDT
Created attachment 547350 [details] [diff] [review]
let renames w/o changing directory skip NTFS ACL reset

Resetting ACLs (as added in bug 224692) is pointless when we're already in the target directory, and hangs UI during corrupted cache rename.
Comment 12 Jason Duell [:jduell] (needinfo me) 2011-07-21 03:19:04 PDT
Created attachment 547355 [details] [diff] [review]
Change Cache.trash logic to not use extra subdirectories

Currently, for no good reason, we take a bad "Cache" directory and 1) create a Cache.Trash directory, 2) create a "Trash" subdirectory of that, and then 3) move Cache into it, winding up with "Cache.Trash/Trash/Cache".

This changes the logic to simply rename "Cache" -> "Cache.trash".  We need to keep the operation as a rename in the same directory (rather than a move into a subdirectory) in order to keep NTFS ACL resets from slowing down the rename into a potentially minutes-long operation (locking the cache and thus the UI the whole while).

Also adds telemetry calls.  Some of this will conflict with Taras' already half-landed (inbound) patch for bug 672651.  I'll fix conflicts before I land.

Still to come: 1) batching delete of cache dir tree into smaller chunks separated by sleep calls, and 2) adding a 'delay' parameter to delay start of all deletes for some amount of time (at startup we don't want the extra I/O.  But these two patches cover the most horrible part of the bug (the UI freeze), so I'd like to see them land ASAP.
Comment 13 James May [:fowl] 2011-07-21 03:27:59 PDT
On Windows at least, lowering the thread priority will lower its IO priority as well, so that might be something to try too.
Comment 14 Jason Duell [:jduell] (needinfo me) 2011-07-22 00:32:49 PDT
If some folks would be willing to test what I've got so far to verify that the UI no longer locks up, that would be great (I've got an SSD drive and can't replicate any noticeable lockup).  There's a build available at

  http://ftp.mozilla.org/pub/mozilla.org/firefox/try-builds/jduell@mozilla.com-d3f1bb76aa01/try-win32/firefox-8.0a1.en-US.win32.zip

Steps: backup your profile (with a big Cache directory preferably), kill your browser with the task manager, and then try running this build and see if you hang at startup (make sure your cache directory is tiny after the restart, or Firefox may have managed to exit "cleanly": for some reason I don't always find that "End Task' always results in a "corrupted" cache).  Then you can exit, restore your profile and get your old, big cache back.  (Extra credit: try the corrupt restart with your stock Firefox and compare).  Then note in this bug how big your cache was and how it went.  Thanks!

James--thanks for the suggestion: the thread is already getting launched with PR_PRIORITY_LOW, which the NSPR docs say is the lowest possible priority.
Comment 15 Jason Duell [:jduell] (needinfo me) 2011-07-22 00:52:09 PDT
And if you have about:telemetry installed

   https://addons.mozilla.org/en-US/firefox/addon/abouttelemetry/

you can also go to "about:telemetry" in the ftp build after you restart, and report the numbers for "Time spent renaming bad Cache to Cache.Trash (ms)" and "Time spent deleting disk cache (ms)"  (about:telemetry won't work on Firefox 5, so no comparison #s alas)
Comment 16 neil@parkwaycc.co.uk 2011-07-26 16:44:05 PDT
Comment on attachment 547350 [details] [diff] [review]
let renames w/o changing directory skip NTFS ACL reset

>+        // Set security permissions to inherit from parent.
>+        // Note: propagates to all children: slow for big file trees
Nit: needs trailing full stop.

> nsresult
> nsLocalFile::CopyMove(nsIFile *aParentDir, const nsAString &newName, PRBool followSymlinks, PRBool move)
> {
>+    PRBool sameDirRename = PR_FALSE;
>+
>     nsCOMPtr<nsIFile> newParentDir = aParentDir;
>     // check to see if this exists, otherwise return an error.
>     // we will check this by resolving.  If the user wants us
>     // to follow links, then we are talking about the target,
>     // hence we can use the |followSymlinks| parameter.
>     nsresult rv  = ResolveAndStat();
>     if (NS_FAILED(rv))
>         return rv;
> 
>     if (!newParentDir)
>     {
>+        // No need to reset ACLs to "new" dir if not changing directory
>+        sameDirRename = PR_TRUE;
I was looking at this when it dawned on me that sameDirRename == !newParentDir == !aParentDir, and as the latter isn't used anywhere else in the method, you can just pass !aParentDir to CopySingleFile. Or am I missing something?
Comment 17 Jason Duell [:jduell] (needinfo me) 2011-07-26 22:30:11 PDT
> you can just pass !aParentDir to CopySingleFile

Yes, that looks like it works.  I don't think it makes the code as easy to follow as the named variable and accompanying comment (and the compiler will presumably make them equally efficient).  But I'm fine with using your logic if you prefer.

Is that a +sr?
Comment 18 neil@parkwaycc.co.uk 2011-07-27 00:29:51 PDT
Comment on attachment 547350 [details] [diff] [review]
let renames w/o changing directory skip NTFS ACL reset

Sure, if you're happy to do that, then sr=me with that changed.
Comment 19 Jason Duell [:jduell] (needinfo me) 2011-07-28 03:16:04 PDT
Created attachment 549053 [details] [diff] [review]
let renames w/o changing directory skip NTFS ACL reset  v2

incorporates comment 16.
Comment 20 Jason Duell [:jduell] (needinfo me) 2011-07-28 04:02:18 PDT
I've inserted a old, non-SSD drive in my machine, and here are some numbers with and without these patches:

With a 877 MB Cache folder:
-----------------------------
Rename to *.Trash (blocks UI)    15179 ms -> 217 ms
Cache open at Startup            15182 ms -> 222 ms
Delete Trash (off-main-thread)   21337 ms -> 19450

With a 293 MB Cache folder:
-----------------------------
Rename to *.Trash (blocks UI)    1728 ms -> 62 ms
Cache open at Startup            1733 ms -> 66 ms
Delete Trash (off-main-thread)   3172 ms -> 8000 ms

So this fix is a big win--we go from blocking the UI (very noticeably) for 15 seconds to 200 ms, a 70x speedup.   (Interesting to note that the I/O times here do not scale linearly--the delete and old-style ACL rename are 3x longer per-byte for the bigger cache).
Comment 21 Jason Duell [:jduell] (needinfo me) 2011-07-28 04:27:45 PDT
http://hg.mozilla.org/integration/mozilla-inbound/rev/f5272524806b
http://hg.mozilla.org/integration/mozilla-inbound/rev/85efefa85d5f

I'm moving the rest of the work (delaying the delete IO) to a followup--bug 674869--so we can look at landing this on aurora/beta ASAP.
Comment 22 Jason Duell [:jduell] (needinfo me) 2011-07-28 04:29:43 PDT
Comment on attachment 547355 [details] [diff] [review]
Change Cache.trash logic to not use extra subdirectories

The UI hang here is really gross, and this fix is pretty simple.
Comment 23 Jason Duell [:jduell] (needinfo me) 2011-07-28 04:30:26 PDT
Comment on attachment 549053 [details] [diff] [review]
let renames w/o changing directory skip NTFS ACL reset  v2

Note: need both patches to get any benefit.
Comment 24 Jason Duell [:jduell] (needinfo me) 2011-07-28 04:35:49 PDT
Oh, got one data point wrong:

Delete Trash (off-main-thread)   3172 ms -> 8000 ms

This is actually inverted (delete got faster, from 8->3 seconds).  But I suspect that's actually noise--there's nothing in this patch that changes the delete logic, just the rename.
Comment 25 Jason Duell [:jduell] (needinfo me) 2011-07-28 04:40:00 PDT
Rename to *.Trash (blocks UI)    1728 ms -> 62 ms
Cache open at Startup            1733 ms -> 66 ms

Also, in case it's not obvious, Cache open at Startup blocks on the rename, which is why the Cache open time timing is always just a touch longer.
Comment 26 neil@parkwaycc.co.uk 2011-07-28 04:41:35 PDT
(In reply to comment #20)
> (Interesting to note that the I/O times here do not scale linearly--the
> delete and old-style ACL rename are 3x longer per-byte for the bigger cache).
It's probably the disk size of the metadata that influences the ACL rename, so just depends on the number of files rather than their size.
Comment 28 Jason Duell [:jduell] (needinfo me) 2011-07-28 12:58:44 PDT
> It's probably the disk size of the metadata that influences the ACL rename,
> so just depends on the number of files rather than their size.

But I did the 877 MB test by making two copies of the 293 directory and storing them as subdirs, so the number of files per MB is the same.  So I'm still not sure what the increased time is about.  But the takeaway is that we should make sure we always test at the upper range of the max cache size...
Comment 29 christian 2011-07-28 14:30:09 PDT
Comment on attachment 547355 [details] [diff] [review]
Change Cache.trash logic to not use extra subdirectories

We don't want this in aurora or beta and will wait for it to bubble up from central.
Comment 30 Jo Hermans 2011-07-30 01:33:11 PDT
*** Bug 649530 has been marked as a duplicate of this bug. ***
Comment 31 Jo Hermans 2011-07-31 15:19:11 PDT
There might be a regression in bug 675420, but I'm not sure. I'm only seeing a problem in Nightly, while another tester claims that it also exists in 5.0 and up.
Comment 32 Jason Duell [:jduell] (needinfo me) 2011-08-01 21:47:34 PDT
bug 675420 is caused by a different bugfix.  This bugfix is innocent of regressions (so far :)
Comment 33 Michal Novotny (:michal) 2011-08-29 11:00:03 PDT
*** Bug 630420 has been marked as a duplicate of this bug. ***
Comment 34 Jo Hermans 2011-09-18 02:34:38 PDT
*** Bug 687232 has been marked as a duplicate of this bug. ***
Comment 35 Jason Duell [:jduell] (needinfo me) 2011-09-29 13:56:13 PDT
*** Bug 648888 has been marked as a duplicate of this bug. ***
Comment 36 Jason Duell [:jduell] (needinfo me) 2011-10-03 10:36:30 PDT
Renaming to make it clear that this also covers the case where a user hits "Clear Now" to clear the cache. (so not just cache delete after a crash).

The fix lands in Firefox/Gecko 8.
Comment 37 Jason Duell [:jduell] (needinfo me) 2011-10-03 10:39:12 PDT
*** Bug 691075 has been marked as a duplicate of this bug. ***
Comment 38 Lawrence Mandel [:lmandel] (use needinfo) 2011-10-21 12:44:30 PDT
*** Bug 696292 has been marked as a duplicate of this bug. ***
Comment 39 Marco Castelluccio [:marco] 2012-02-26 03:47:58 PST
I've tried clearing cache on Windows XP and it's still slow (Firefox Beta).
There's also bug 730662 that is about slow clearing of Crash.Trash folder after a crash (on Windows 7).
Comment 40 Marco Castelluccio [:marco] 2012-02-26 03:49:10 PST
Sorry, Cache.Trash.
Comment 41 Michal Novotny (:michal) 2012-02-28 15:53:43 PST
(In reply to Marco Castelluccio from comment #39)
> I've tried clearing cache on Windows XP and it's still slow (Firefox Beta).

Was this slow because of deleting Cache.Trash or because of evicting all entries (bug #648232)?
Comment 42 Marco Castelluccio [:marco] 2012-02-29 00:06:35 PST
I don't know, I'll try to investigate with your script to fill the cache (however I don't know how much time it will take, as the computer wasn't mine and I haven't access to it currently).

Note You need to log in before you can comment on or make changes to this bug.