701909 - Disk cache seems to cause exceptionally slow startups(1min+)

These are telemetry histograms in json form. the .sum field is of interest. NETWORK_DISK_CACHE_OPEN":{"range":[1,10000],"bucket_count":10,"histogram_type":0,"values":{"3172":0,"10000":1},"sum":31684} NETWORK_DISK_CACHE_OPEN":{"range":[1,10000],"bucket_count":10,"histogram_type":0,"values":{"3172":0,"10000":1},"sum":53181} 53seconds is a bit long to wait on disk cache to do something.

Jason Duell

Comment 3

•

14 years ago

"sum" is the sum of something? The highest value? Most of the cases in the json file point to the *rename* (not delete) of a corrupt existing cache as the main culprit here (NETWORK_DISK_CACHE_TRASHRENAME consumes 52619 of the 53181ms in the worst case in comment 2, and 31263 of 31684 in the other case). We added the NETWORK_DISK_CACHE_TRASHRENAME in the same checkin that "fixed" the issue with NTFS taking a looong time to rename directories. Looking over the code, the most obvious possibility is the case where we there's already a "cache.trash" left over from a previous run: // TrashDir may already exist (if we crashed while deleting it, etc.) // In that case current Cache dir should be small--just move it to // subdirectory of TrashDir and eat the NTFS ACL overhead. In this case we take the slow path of moving the directory into a subdirectory, with the overhead of resetting NTFS ACLs for each file in it. I'm surprised that users could accumulate a large enough cache in the small window between 1) creating a new cache -> 2) crashing before the old cache was deleted (should be minutes maximum) to take 53 seconds to rename it into a subdirectory at the next startup. But I don't see any other obvious reasons for the slowness. Michal's patch for bug 695003 will eliminate this edge case--we'll now be moving to "cache.trash.RANDOMNUMBER" (with repeated tried to avoid filename collisions from getting the same random number), so we'll never move a cache directory into a subdirectory during the rename. Do we have a good way to split out telemetry results by date? Comparing results of at nightlies before/after we land bug 695003 (hopefully today or tomorrow) would be very useful. (P.S. Just to make life difficult there's also a 18023ms NETWORK_DISK_CACHE_OPEN in the json file that doesn't appear to involve a cache deletion. So there may be multiple issues here.)

Jason Duell

Updated

•

14 years ago

Depends on: 695003

Jason Duell

Comment 4

•

14 years ago

Michal: I'm less confident that bug 695003 will completely fix this. Try to think about any other reasons why we might be hitting these startup renaming slowness issues. Taras: there's no reason moving a (large) directory is slow on FAT32, is there? Should just be as fast as a single file move, IIRC.

Assignee: nobody → michal.novotny

(dormant account)

Reporter

Comment 5

•

14 years ago

I don't know any more than you in this case. Fact that opening cache files is slow might suggest disk fragmentation. Can someone hook up the code that measures remaining disk space to report to telemetry? My thoughts on addressing this: a) disable disk cache while opening cache..ie go to network(could have a 100ms timeout) b) do not do anything on shutdown. Perform cache maintenance only on a thread during runtime. Should probably also bypass disk cache while doing heavy maintenance.

Nathan Froyd [:froydnj]

Comment 6

•

14 years ago

(In reply to Taras Glek (:taras) from comment #5) > I don't know any more than you in this case. Fact that opening cache files > is slow might suggest disk fragmentation. Can someone hook up the code that > measures remaining disk space to report to telemetry? Which code are you thinking about measuring?

Nathan Froyd [:froydnj]

Comment 7

•

14 years ago

Attached patch available disk space telemetry (obsolete) — Details — Splinter Review

Did you mean something like this?

(dormant account)

Reporter

Comment 8

•

14 years ago

(In reply to Jason Duell (:jduell) from comment #3) > "sum" is the sum of something? The highest value? Sum is the raw sum of values that went into the histogram..in this case histograms have a single value so it's literally the time that this operation took. > > Do we have a good way to split out telemetry results by date? Comparing > results of at nightlies before/after we land bug 695003 (hopefully today or > tomorrow) would be very useful. There is some rudimentary filtering by date in the newer frontend at http://tinyurl.com/makeTarasHappy You can also file metrics bugs with requests for specific reports.

(dormant account)

Reporter

Comment 9

•

14 years ago

(In reply to Nathan Froyd (:froydnj) from comment #7) > Created attachment 578697 [details] [diff] [review] [diff] [details] [review] > available disk space telemetry > > Did you mean something like this? No. We need to check for diskspace on the Cache/ directory itself, not random files firefox looks at and how much space is left on those partitions.

Jason Duell

Comment 10

•

14 years ago

> Can someone hook up the code that > measures remaining disk space to report to telemetry? Opened bug 707402 for that--patch is there. > a) disable disk cache while opening cache..ie go to network (could have a 100ms > timeout) That's a possibility, but I'm not sure if it would mess up things that depend on items being in cache (view-source comes to mind, though I wouldn't consider it mission-critical). There may be other factors that make this tricky, or it might be fairly straightforward--michal/bjarne can hopefully answer that (biesi/bz may also know). One other alternative is that if we land bug 602611 (store items to RAM cache first), we'd be able to launch network requests at startup w/o waiting for the disk cache to open: they'd store into RAM cache, then get stored to disk later. But I'm not sure how close that patch is to be landable, and I think it uses write-through now (so might block on disk cache once network reply is received, which is still a win of a network round-trip, but still bad if disk cache open takes very long time). > b) do not do anything on shutdown. Perform cache maintenance only on a thread > during runtime. Should probably also bypass disk cache while doing heavy > maintenance. I assume this is directed at the shutdown bug, and general cache perf, not so much this startup bug. AFAIK we do almost all disk cache I/O on a non-main thread already. Some variant of bug 602611 that does periodic RAM->disk cache flush (instead of write-through) could possibly wait for the deletedir thread to finish before flushing, but getting the details right could be hard (ex: if it takes a long time, the RAM cache could fill up). We already set the deletedir thread to have the lowest possible thread priority, but I'm guessing this doesn't help much in the case of I/O (i.e. if the cache thread does I/O and sleeps, the low-priority thread can wake up and still schedule I/O that slows down the original I/O). Figuring out a way to not wait for the disk cache to open at startup seems both easier and more urgent than this.

Jason Duell

Comment 11

•

14 years ago

P.S. > ie go to network (could have a 100ms timeout) We would definitely want a timeout, else we'd load every single resource in every tab off the network, whereas now we force-load any resource we have on disk cache (w/o revalidating over the network). Hopefully we can solve the slow dir rename issue, and this will all be moot.

Michal Novotny [:michal]

Assignee

Updated

•

14 years ago

Depends on: 707436

Michal Novotny [:michal]

Assignee

Comment 12

•

14 years ago

What exactly measures firstPaint? Is it the delay between startup and the moment when we start painting UI, or the moment when we start rendering the first page? If it is the latter case, then I'm almost sure that this is now fixed by #695003. The scenario described in the description of that bug happens all the time we need to drop the cache at startup.

(dormant account)

Reporter

Comment 13

•

14 years ago

first paint is literally the first time the browser chrome is painted. Page can start loading before or after that.

Mardeg

Updated

•

13 years ago

Blocks: 447581

Piyush Soni

Comment 15

•

13 years ago

If it helps writing it down here, on x64 Windows 7, Firefox 10.0.2, it just took around a whooping 30 minutes (not exaggerating!) on my machine to clear the Cache.Trash folder after Firefox was crashed for some reason and I restarted it. All that time, it was deleting files from that folder or doing something with that in my FF profile. It is totally unacceptable. More details can be found on the above duplicate marked bug. Thanks...

Jason Duell

Comment 16

•

13 years ago

So I suspect that this bug (or at least a large % of cases) is a result of the initial rename of Cache -> Cache.Trash. We're still seeing 3% of NETWORK_DISK_CACHE_TRASHRENAME timings taking 10 seconds or more for Firefox 10 (the timing is capped at 10 seconds, so it would include long hangs). This is only on Windows--OSX and Linux are fine. The numbers are better for 11/12 (down to .2% for 11.0a2) but that might be an artifact of a small sample size. All the bugs that touch this code since 10 are AFIAK crash fixes not perf ones (ex: bug 695003), so there's no reason I can think of for a perf improvement since FF 8. I'd really, really like to track this down. Piyush, are you using an encrypted filesystem? Is your drive almost full? Anything else exceptional about your setup that might cause the drive to be slow? If you are willing, I can give you some steps to follow so we can try to reproduce the issue on your system while keeping a copy of your profile so we can debug what's going on. 1) browse for a while to build up a large cache again (you start from scratch after a crash). The bigger the better (close to 1 GB, or at least 500 MB). You can check your current cache size via Preferences | Advanced | Network 2) save a backup of your profile directory (including Cache directory). 3) Deliberately crash Firefox (install the crashme plugin: http://code.google.com/p/crashme/ which lets you crash whenever you want) 4) See if you hang again on startup. If you do, please keep a copy of your old profile around (you can restore a copy of it to your regular profile, so you can keep browsing as normal), and we can send you builds with debug instrumentation to figure out where things are going wrong. 5) If you can reliably duplicate the hang, try it with a nightly Firefox build to see if it does any better. Piyush: let me know if that seems doable, and if you're willing to help us track this down. It's been hard to get users who both have seen the issue and are willing to do these steps.

Piyush Soni

Comment 17

•

13 years ago

Sure! Jive me a day as I prepare for the next office meeting finding excuses of why I didn't do anything in the past week *again*, and I'll do this experiment. :D Why do we need an extension for crashing Firefox? : taskkill /f /im Firefox.exe on the command prompt works just as fine. (Actually, I lied. The last time it hadn't crashed for 'some' reason. I had killed it deliberately, just using this command I have saved as a batch, to kill offensive processes. I routinely kill firefox (:sob) as it grows to 1 to 2 GB in every 3-4 few days on my machine. Even properly closing takes minutes. I'm guilty. :) ) Question: - Could you as well tell me how to increase the cache usage quickly to 1 GB or so without waiting for days?

Jason Duell

Comment 18

•

13 years ago

> Why do we need an extension for crashing Firefox? > taskkill /f /im Firefox.exe That will probably work fine too. You just need to make sure Firefox doesn't get a chance to shut down in an orderly way. > Could you as well tell me how to increase the cache usage quickly to 1 GB or so without waiting for days? I've had good luck doing google image searches and scrolling down the page of results--fills up cache with lots of images. Although the bug that was originally causing these hangs was suffering from the number of files in the cache, so filling it up with big files (youtube videos etc) is less likely to show the hang. At least if it's the same hang we've had before (with NTFS ACL permission changes being the cause of slowness). Thanks a lot for offering to help. I look forward to seeing your results.

Michal Novotny [:michal]

Assignee

Comment 19

•

13 years ago

Attached file script to fill the cache — Details

(In reply to Piyush Soni from comment #17) > - Could you as well tell me how to increase the cache usage quickly to 1 GB > or so without waiting for days? You can use the attached file to fill the cache quickly.

Jason Duell

Comment 20

•

13 years ago

Comment on attachment 601394 [details] script to fill the cache Michal, when I try to use this script (i.e hit "fill") I wind up seeing "count: undefined/undefined". And cache isn't getting filled. Looks like it's because you use enablePrivilege("UniversalXPConnect"): here's web console error: [14:14:06.958] Use of enablePrivilege is deprecated. Please use code that runs with the system principal (e.g. an extension) instead. @ https://bug701909.bugzilla.mozilla.org/attachment.cgi?id=601394 [14:14:06.958] A script from "https://bug701909.bugzilla.mozilla.org" was denied UniversalXPConnect privileges. @ https://bug701909.bugzilla.mozilla.org/attachment.cgi?id=601394:3

Attachment #601394 - Flags: feedback-

Michal Novotny [:michal]

Assignee

Comment 21

•

13 years ago

It works if you save the file to disk and then allow the enhanced abilities.

Piyush Soni

Comment 22

•

13 years ago

" ... and allow the enhanced abilities" And how do we do that? I tried putting the script stuff into a Greasemonkey script (runs with slightly more privileges, but still much lesser than an extension) and ran but got the error of 'denied UniversalXPConnect privileges'

Michal Novotny [:michal]

Assignee

Comment 23

•

13 years ago

(In reply to Piyush Soni from comment #22) > " ... and allow the enhanced abilities" > And how do we do that? Firefox should IMO show a modal dialog if the page contains UniversalXPConnect() and the scheme is file (at least my FF does this).

Jason Duell

Comment 24

•

13 years ago

Piyush: does that makes sense? Just download the file and open it (with control-O), then say OK to the scary dialog.

Piyush Soni

Comment 25

•

13 years ago

Sorry for not responding earlier. Yes, I had done that and it had worked, but unfortunately, since my bug report I have tried it 4 times on different days, still it has not happened that *before* even painting the window it takes so much time. It is still taking a few minutes, but after showing me the 'well, this is embarrassing' window. Only when I click restore, then it does something with the Cache.Trash folder. But it's not going to those gigantic times again. I will keep on trying. Oh, btw, yes, my cache folder was around 1 GB. And I don't think my file system is encrypted. Most of these things are standard run of the mill, except all the exotic software on my machine which few people would probably install. But as I said, I am sure it will happen again.

Jason Duell

Comment 26

•

13 years ago

> It is still taking a few minutes, but after showing me the > 'well, this is embarrassing' window. OK, so you're not getting a freeze before even painting the initial chrome at startup. But you're seeing your computer freeze for a few minutes after you hit "restore session"? Freeze as in you can't do anything at all with the UI, or just that page loading is sluggish while you restore lots of tabs?

Nathan Froyd [:froydnj]

Updated

•

12 years ago

Attachment #578697 - Attachment is obsolete: true

Wayne Mery (:wsmwk)

Updated

•

12 years ago

Keywords: perf

Wayne Mery (:wsmwk)

Comment 27

•

12 years ago

Thunderbird user reports reproducible long startup and "unresponsive script" message at https://getsatisfaction.com/mozilla_messaging/topics/unresponsive_script-c4ael no unresponsive script if cache is deleted before startup. cache described at https://getsatisfaction.com/mozilla_messaging/topics/unresponsive_script-c4ael#reply_11814877

Wayne Mery (:wsmwk)

Comment 28

•

11 years ago

(In reply to Jason Duell [:jduell] (needinfo? me for lower latency) from comment #26) > > It is still taking a few minutes, but after showing me the > > 'well, this is embarrassing' window. > > OK, so you're not getting a freeze before even painting the initial chrome > at startup. But you're seeing your computer freeze for a few minutes after > you hit "restore session"? Freeze as in you can't do anything at all with > the UI, or just that page loading is sluggish while you restore lots of tabs? piyush seems to be gone, but let's try needinfo anyway

Flags: needinfo?(piyush_soni)

Piyush Soni

Comment 29

•

10 years ago

I'm here, but I'm afraid I don't have much to report. May be it doesn't happen with the new UI? I've seen the "couple of minutes before painting even the initial Chrome" problem a couple of times after that, but there's no consistent way to reproduce that. It just happens randomly, and when I try to do that again, it doesn't. I'm not sure what can be done here to this bug. These days when it becomes slow I just go and delete those Local cache folders.

Flags: needinfo?(piyush_soni)

Piyush Soni

Comment 30

•

10 years ago

Forgot to write: I'm open to doing any experiments or generating reports you might need from me.

Patrick McManus [:mcmanus]

Updated

•

9 years ago

Whiteboard: [Snappy:p1] → [Snappy:p1[necko-would-take]

Marco Castelluccio [:marco]

Updated

•

9 years ago

Whiteboard: [Snappy:p1[necko-would-take] → [Snappy:p1][necko-would-take]

Michal Novotny [:michal]

Assignee

Comment 31

•

9 years ago

This bug refers to deleting the cache at startup after the crash. We don't do this anymore in the new cache so I doubt the startup delay is caused by the cache. If you think it is caused by the cache, please try to reproduce the problem with "cache2:5" NSPR logging.

Flags: needinfo?(piyush_soni)

Honza Bambas (:mayhemer)

Comment 32

•

9 years ago

No reaction on this bug and it's likely it's fixed/invalid.

Status: NEW → RESOLVED

Closed: 9 years ago

Flags: needinfo?(piyush_soni)

Resolution: --- → WONTFIX

available disk space telemetry 14 years ago Nathan Froyd [:froydnj] 2.25 KB, patch		Details \| Diff \| Splinter Review
script to fill the cache 13 years ago Michal Novotny [:michal] 2.33 KB, text/html	jduell.mcbugs : feedback-	Details