Closed Bug 701909 Opened 13 years ago Closed 8 years ago

Disk cache seems to cause exceptionally slow startups(1min+)

Categories

(Core :: Networking: Cache, defect)

x86
Windows 7
defect
Not set
normal

Tracking

()

RESOLVED WONTFIX

People

(Reporter: taras.mozilla, Assigned: michal)

References

(Blocks 2 open bugs)

Details

(Keywords: perf, Whiteboard: [Snappy:p1][necko-would-take])

Attachments

(1 file, 1 obsolete file)

See first attachment in bug 701872, atleast one NETWORK_DISK_CACHE_OPEN is closely related to firstPaint speed
Blocks: 701872
Whiteboard: [Snappy]
Whiteboard: [Snappy] → [Snappy:p1]
Taras, can you explain how that attachment shows what this bug is about?
These are telemetry histograms in json form. the .sum field is of interest.

NETWORK_DISK_CACHE_OPEN":{"range":[1,10000],"bucket_count":10,"histogram_type":0,"values":{"3172":0,"10000":1},"sum":31684}
NETWORK_DISK_CACHE_OPEN":{"range":[1,10000],"bucket_count":10,"histogram_type":0,"values":{"3172":0,"10000":1},"sum":53181}

53seconds is a bit long to wait on disk cache to do something.
"sum" is the sum of something?  The highest value?  

Most of the cases in the json file point to the *rename* (not delete) of a corrupt existing cache as the main culprit here (NETWORK_DISK_CACHE_TRASHRENAME consumes 52619 of the 53181ms in the worst case in comment 2, and 31263 of 31684 in the other case).

We added the NETWORK_DISK_CACHE_TRASHRENAME in the same checkin that "fixed" the issue with NTFS taking a looong time to rename directories.  Looking over the code, the most obvious possibility is the case where we there's already a "cache.trash" left over from a previous run:

 // TrashDir may already exist (if we crashed while deleting it, etc.)
 // In that case current Cache dir should be small--just move it to
 // subdirectory of TrashDir and eat the NTFS ACL overhead.

In this case we take the slow path of moving the directory into a subdirectory, with the overhead of resetting NTFS ACLs for each file in it.  I'm surprised that users could accumulate a large enough cache in the small window between 1) creating a new cache -> 2) crashing before the old cache was deleted (should be minutes maximum) to take 53 seconds to rename it into a subdirectory at the next startup.  But I don't see any other obvious reasons for the slowness.

Michal's patch for bug 695003 will eliminate this edge case--we'll now be moving to "cache.trash.RANDOMNUMBER" (with repeated tried to avoid filename collisions from getting the same random number), so we'll never move a cache directory into a subdirectory during the rename.

Do we have a good way to split out telemetry results by date?  Comparing results of at nightlies before/after we land bug 695003 (hopefully today or tomorrow) would be very useful.

(P.S.  Just to make life difficult there's also a 18023ms NETWORK_DISK_CACHE_OPEN in the json file that doesn't appear to involve a cache deletion.  So there may be multiple issues here.)
Depends on: 695003
Michal: I'm less confident that bug 695003 will completely fix this.  Try to think about any other reasons why we might be hitting these startup renaming slowness issues.

Taras: there's no reason moving a (large) directory is slow on FAT32, is there?  Should just be as fast as a single file move, IIRC.
Assignee: nobody → michal.novotny
I don't know any more than you in this case. Fact that opening cache files is slow might suggest disk fragmentation. Can someone hook up the code that measures remaining disk space to report to telemetry?

My thoughts on addressing this:
a) disable disk cache while opening cache..ie go to network(could have a 100ms timeout)
b) do not do anything on shutdown. Perform cache maintenance only on a thread during runtime. Should probably also bypass disk cache while doing heavy maintenance.
(In reply to Taras Glek (:taras) from comment #5)
> I don't know any more than you in this case. Fact that opening cache files
> is slow might suggest disk fragmentation. Can someone hook up the code that
> measures remaining disk space to report to telemetry?

Which code are you thinking about measuring?
Attached patch available disk space telemetry (obsolete) — Splinter Review
Did you mean something like this?
(In reply to Jason Duell (:jduell) from comment #3)
> "sum" is the sum of something?  The highest value?  
Sum is the raw sum of values that went into the histogram..in this case histograms have a single value so it's literally the time that this operation took.
> 
> Do we have a good way to split out telemetry results by date?  Comparing
> results of at nightlies before/after we land bug 695003 (hopefully today or
> tomorrow) would be very useful.

There is some rudimentary filtering by date in the newer frontend at http://tinyurl.com/makeTarasHappy
You can also file metrics bugs with requests for specific reports.
(In reply to Nathan Froyd (:froydnj) from comment #7)
> Created attachment 578697 [details] [diff] [review] [diff] [details] [review]
> available disk space telemetry
> 
> Did you mean something like this?

No. We need to check for diskspace on the Cache/ directory itself, not random files firefox looks at and how much space is left on those partitions.
> Can someone hook up the code that
> measures remaining disk space to report to telemetry?

Opened bug 707402 for that--patch is there.

> a) disable disk cache while opening cache..ie go to network (could have a 100ms 
> timeout)

That's a possibility, but I'm not sure if it would mess up things that depend on items being in cache (view-source comes to mind, though I wouldn't consider it mission-critical).  There may be other factors that make this tricky, or it might be fairly straightforward--michal/bjarne can hopefully answer that (biesi/bz may also know).

One other alternative is that if we land bug 602611 (store items to RAM cache first), we'd be able to launch network requests at startup w/o waiting for the disk cache to open: they'd store into RAM cache, then get stored to disk later.  But I'm not sure how close that patch is to be landable, and I think it uses write-through now (so might block on disk cache once network reply is received, which is still a win of a network round-trip, but still bad if disk cache open takes very long time).

> b) do not do anything on shutdown. Perform cache maintenance only on a thread 
> during runtime. Should probably also bypass disk cache while doing heavy 
> maintenance.

I assume this is directed at the shutdown bug, and general cache perf, not so much this startup bug.  AFAIK we do almost all disk cache I/O on a non-main thread already.  Some variant of bug 602611 that does periodic RAM->disk cache flush (instead of write-through) could possibly wait for the deletedir thread to finish before flushing, but getting the details right could be hard (ex: if it takes a long time, the RAM cache could fill up).  We already set the deletedir thread to have the lowest possible thread priority, but I'm guessing this doesn't help much in the case of I/O (i.e. if the cache thread does I/O and sleeps, the low-priority thread can wake up and still schedule I/O that slows down the original I/O).  Figuring out a way to not wait for the disk cache to open at startup seems both easier and more urgent than this.
P.S. 

> ie go to network (could have a 100ms timeout)

We would definitely want a timeout, else we'd load every single resource in every tab off the network, whereas now we force-load any resource we have on disk cache (w/o revalidating over the network).

Hopefully we can solve the slow dir rename issue, and this will all be moot.
Depends on: 707436
What exactly measures firstPaint? Is it the delay between startup and the moment when we start painting UI, or the moment when we start rendering the first page? If it is the latter case, then I'm almost sure that this is now fixed by #695003. The scenario described in the description of that bug happens all the time we need to drop the cache at startup.
first paint is literally the first time the browser chrome is painted. Page can start loading before or after that.
Blocks: 447581
If it helps writing it down here, on x64 Windows 7, Firefox 10.0.2, it just took around a whooping 30 minutes (not exaggerating!) on my machine to clear the Cache.Trash folder after Firefox was crashed for some reason and I restarted it. All that time, it was deleting files from that folder or doing something with that in my FF profile. It is totally unacceptable. More details can be found on the above duplicate marked bug. Thanks...
So I suspect that this bug (or at least a large % of cases) is a result of the initial rename of Cache -> Cache.Trash.  We're still seeing 3% of NETWORK_DISK_CACHE_TRASHRENAME timings taking 10 seconds or more for Firefox 10 (the timing is capped at 10 seconds, so it would include long hangs).  This is only on Windows--OSX and Linux are fine.  The numbers are better for 11/12 (down to .2% for 11.0a2) but that might be an artifact of a small sample size.  All the bugs that touch this code since 10 are AFIAK crash fixes not perf ones (ex: bug 695003), so there's no reason I can think of for a perf improvement since FF 8.

I'd really, really like to track this down.  Piyush, are you using an encrypted filesystem?  Is your drive almost full?  Anything else exceptional about your setup that might cause the drive to be slow?  If you are willing, I can give you some steps to follow so we can try to reproduce the issue on your system while keeping a copy of your profile so we can debug what's going on.

1) browse for a while to build up a large cache again (you start from scratch after a crash).  The bigger the better (close to 1 GB, or at least 500 MB).  You can check your current cache size via Preferences | Advanced | Network

2) save a backup of your profile directory (including Cache directory).

3) Deliberately crash Firefox (install the crashme plugin: http://code.google.com/p/crashme/ which lets you crash whenever you want)

4) See if you hang again on startup.  If you do, please keep a copy of your old profile around (you can restore a copy of it to your regular profile, so you can keep browsing as normal), and we can send you builds with debug instrumentation to figure out where things are going wrong.

5) If you can reliably duplicate the hang, try it with a nightly Firefox build to see if it does any better.

Piyush:  let me know if that seems doable, and if you're willing to help us track this down.  It's been hard to get users who both have seen the issue and are willing to do these steps.
Sure! Jive me a day as I prepare for the next office meeting finding excuses of why I didn't do anything in the past week *again*, and I'll do this experiment. :D

Why do we need an extension for crashing Firefox? :
taskkill /f /im Firefox.exe
on the command prompt works just as fine. (Actually, I lied. The last time it hadn't crashed for 'some' reason. I had killed it deliberately, just using this command I have saved as a batch, to kill offensive processes. I routinely kill firefox (:sob) as it grows to 1 to 2 GB in every 3-4 few days on my machine. Even properly closing takes minutes. I'm guilty. :) )

Question:
- Could you as well tell me how to increase the cache usage quickly to 1 GB or so without waiting for days?
> Why do we need an extension for crashing Firefox? 
> taskkill /f /im Firefox.exe

That will probably work fine too.  You just need to make sure Firefox doesn't get a chance to shut down in an orderly way.

> Could you as well tell me how to increase the cache usage quickly to 1 GB or so without waiting for days?

I've had good luck doing google image searches and scrolling down the page of results--fills up cache with lots of images.  Although the bug that was originally causing these hangs was suffering from the number of files in the cache, so filling it up with big files (youtube videos etc) is less likely to show the hang.  At least if it's the same hang we've had before (with NTFS ACL permission changes being the cause of slowness).

Thanks a lot for offering to help.  I look forward to seeing your results.
(In reply to Piyush Soni from comment #17)
> - Could you as well tell me how to increase the cache usage quickly to 1 GB
> or so without waiting for days?

You can use the attached file to fill the cache quickly.
Comment on attachment 601394 [details]
script to fill the cache

Michal, 

when I try to use this script (i.e hit "fill") I wind up seeing "count: undefined/undefined".  And cache isn't getting filled.  Looks like it's because you use enablePrivilege("UniversalXPConnect"):  here's web console error:

[14:14:06.958] Use of enablePrivilege is deprecated.  Please use code that runs with the system principal (e.g. an extension) instead. @ https://bug701909.bugzilla.mozilla.org/attachment.cgi?id=601394

[14:14:06.958] A script from "https://bug701909.bugzilla.mozilla.org" was denied UniversalXPConnect privileges. @ https://bug701909.bugzilla.mozilla.org/attachment.cgi?id=601394:3
Attachment #601394 - Flags: feedback-
It works if you save the file to disk and then allow the enhanced abilities.
" ... and allow the enhanced abilities"
And how do we do that? 

I tried putting the script stuff into a Greasemonkey script (runs with slightly more privileges, but still much lesser than an extension) and ran but got the error of 'denied UniversalXPConnect privileges'
(In reply to Piyush Soni from comment #22)
> " ... and allow the enhanced abilities"
> And how do we do that? 

Firefox should IMO show a modal dialog if the page contains UniversalXPConnect() and the scheme is file (at least my FF does this).
Piyush:  does that makes sense?  Just download the file and open it (with control-O), then say OK to the scary dialog.
Sorry for not responding earlier. Yes, I had done that and it had worked, but unfortunately, since my bug report I have tried it 4 times on different days, still it has not happened that *before* even painting the window it takes so much time. It is still taking a few minutes, but after showing me the 'well, this is embarrassing' window. Only when I click restore, then it does something with the Cache.Trash folder. But it's not going to those gigantic times again. I will keep on trying. Oh, btw, yes, my cache folder was around 1 GB. 

And I don't think my file system is encrypted. Most of these things are standard run of the mill, except all the exotic software on my machine which few people would probably install. But as I said, I am sure it will happen again.
> It is still taking a few minutes, but after showing me the 
> 'well, this is embarrassing' window.

OK, so you're not getting a freeze before even painting the initial chrome at startup.  But you're seeing your computer freeze for a few minutes after you hit "restore session"?  Freeze as in you can't do anything at all with the UI, or just that page loading is sluggish while you restore lots of tabs?
Attachment #578697 - Attachment is obsolete: true
Keywords: perf
Thunderbird user reports reproducible long startup and "unresponsive script" message at https://getsatisfaction.com/mozilla_messaging/topics/unresponsive_script-c4ael 

no unresponsive script if cache is deleted before startup. cache described at https://getsatisfaction.com/mozilla_messaging/topics/unresponsive_script-c4ael#reply_11814877
(In reply to Jason Duell [:jduell] (needinfo? me for lower latency) from comment #26)
> > It is still taking a few minutes, but after showing me the 
> > 'well, this is embarrassing' window.
> 
> OK, so you're not getting a freeze before even painting the initial chrome
> at startup.  But you're seeing your computer freeze for a few minutes after
> you hit "restore session"?  Freeze as in you can't do anything at all with
> the UI, or just that page loading is sluggish while you restore lots of tabs?

piyush seems to be gone, but let's try needinfo anyway
Flags: needinfo?(piyush_soni)
I'm here, but I'm afraid I don't have much to report. May be it doesn't happen with the new UI? I've seen the "couple of minutes before painting even the initial Chrome" problem a couple of times after that, but there's no consistent way to reproduce that. It just happens randomly, and when I try to do that again, it doesn't. I'm not sure what can be done here to this bug. These days when it becomes slow I just go and delete those Local cache folders.
Flags: needinfo?(piyush_soni)
Forgot to write: I'm open to doing any experiments or generating reports you might need from me.
Whiteboard: [Snappy:p1] → [Snappy:p1[necko-would-take]
Whiteboard: [Snappy:p1[necko-would-take] → [Snappy:p1][necko-would-take]
This bug refers to deleting the cache at startup after the crash. We don't do this anymore in the new cache so I doubt the startup delay is caused by the cache. If you think it is caused by the cache, please try to reproduce the problem with "cache2:5" NSPR logging.
Flags: needinfo?(piyush_soni)
No reaction on this bug and it's likely it's fixed/invalid.
Status: NEW → RESOLVED
Closed: 8 years ago
Flags: needinfo?(piyush_soni)
Resolution: --- → WONTFIX
You need to log in before you can comment on or make changes to this bug.