Created attachment 597364: SsdReady Firefox numbers

After I moved my system to a solid-state drive, I began to examine which of the applications I use are SSD-unfriendly in terms of performance and drive lifetime. Unfortunately, Firefox is one of them.

Below is only a basic explanation of how an SSD works. Anyone interested in more detail can read this article: http://www.anandtech.com/print/2829. Readers familiar with SSD internals can skip ahead to the test results.

In contrast to HDDs, SSDs support a limited number of writes, and in some scenarios they also write data differently. First, the minimum write unit is one page, which is 4096 bytes on most current drives, moving to 8192 bytes in newer models. This means that even if we write only a couple of bytes, the whole page is used.

The second problem arises when we need to change existing data: an SSD cannot simply write the changes, it has to perform a read-modify-(erase)-write cycle. Older drives erased the same cell that held the changed data and wrote it back, but this led to low performance, because erasing is slow. So today an SSD writes the changed data to a new, unused cell (from free space or a special spare area); the old cell is marked dirty and is later erased during garbage collection and returned to the pool of clean cells. But this also means that if a program writes changes in small portions, each write "consumes" 4 KB, shortening the SSD's life by a factor of 10-100 or even more.

Next, performance. In a simplified view, an SSD can be represented as a RAID 0 (striped array) of flash chips, so to achieve maximum performance data must be written in blocks of "page size" × "number of SSD channels"; for modern drives with 8-10 channels this gives a block size of 32-40 KB. Looking at this IOMeter diagram, http://www.ixbt.com/storage/SSD/p14/diags/seqwrite-all.swf, we can see that anything below 4 KB leads to relatively poor performance, and maximum speed is reached at a 64 KB block size.
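The page-granularity model described above can be expressed in a few lines. This is a minimal Python sketch under the simplified assumptions of this comment (every write occupies whole 4096-byte pages, nothing is coalesced by the OS or the controller, and throughput peaks at page size × channel count); the function names are illustrative, and real controllers are considerably more complex.

```python
import math

def pages_touched(write_bytes: int, page_size: int = 4096) -> int:
    """Flash pages programmed by one write in the simplified model
    (assumption: each write occupies whole pages, no coalescing)."""
    return math.ceil(write_bytes / page_size)

def write_amplification(write_bytes: int, page_size: int = 4096) -> float:
    """Bytes programmed to flash divided by bytes the application wrote."""
    return pages_touched(write_bytes, page_size) * page_size / write_bytes

def optimal_block_size(page_size: int, channels: int) -> int:
    """Block size that lets every channel program one page in parallel."""
    return page_size * channels

if __name__ == "__main__":
    for size in (24, 4096, 32768):
        print(f"{size:>6} B write -> {pages_touched(size)} page(s), "
              f"amplification x{write_amplification(size):.1f}")
    for channels in (8, 10):
        kib = optimal_block_size(4096, channels) // 1024
        print(f"{channels} channels -> {kib} KiB blocks")
```

Under this model a lone 24-byte write costs a full page, an amplification of about 170x, while 32768-byte writes map onto eight pages with no waste; 8 and 10 channels give the 32 KB and 40 KB block sizes mentioned above.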
The same diagram for reads, if anybody is interested: http://www.ixbt.com/storage/SSD/p14/diags/seqread-all.swf

Now, armed with this knowledge, let's look at how Firefox writes data. I captured its activity with Process Monitor (Procmon) during 30 minutes of normal browsing, using a clean profile with no preferences changed and no extensions, on a nightly build:
Mozilla/5.0 (Windows NT 6.1; WOW64; rv:13.0a1) Gecko/20120208 Firefox/13.0a1

During that session there were 7652 WriteFile events. Sorting them by Length and counting gives this top list:

Length    Num
 32768   2325
    24   1638
 16384   1494
  4096    108
    32     94
     4     78
   266     48
   226     41
 98304     32

We see that most writes use an almost optimal block size, but in second place are very small writes of just 24 bytes, and each of them wastes a 4096-byte page.

Next, I sorted by the number of writes to each file:

File                                          Num
C:\Storage\FF\Profile\cookies.sqlite-wal     2225
C:\Storage\FF\Profile\places.sqlite-wal      1200
C:\Storage\FF\Profile\Cache\_CACHE_001_      1097
C:\Storage\FF\Profile\places.sqlite           496
C:\Storage\FF\Profile\Cache\8\07\E995Cd01     301
C:\Storage\FF\Profile\Cache\0\9C\6453Ad01     253
C:\Storage\FF\Profile\cookies.sqlite          186
C:\Storage\FF\Profile\Cache\7\D4\C4D9Cd01     182
C:\Storage\FF\Profile\Cache\_CACHE_002_       150
C:\Storage\FF\Profile\Cache\_CACHE_003_       146

If all Cache writes are merged into one item, Cache accounts for 2987 of the total writes.

Now let's see which write sizes the most-written files use.

cookies.sqlite-wal:
Length    Num
 32768   1081
    32     63
    24   1081

In the Procmon log, the 32k and 24-byte writes follow one after another. After a quick look at what -wal files are, my guess is that the 24-byte writes are SQL statements.

places.sqlite-wal:
Length    Num
495616     18
 32768    557
  4096     29
    32     29
    24    557

Similar to the above.

_CACHE_001_: write sizes are completely random, from 1 to 731 bytes. And the funny thing is that other cache files, such as E995Cd01 or 6453Ad01, use exactly 16k writes.
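The two tallies above (writes by length, writes by file) can be reproduced from a Procmon CSV export with a short script. This is a Python sketch assuming the export has `Operation`, `Path` and `Detail` columns and that `WriteFile` rows contain `Length: <n>` in `Detail`; check the column names and format against your own export before relying on it.

```python
import csv
import re
from collections import Counter
from typing import Dict, Iterable, Iterator, Tuple

# Assumed Procmon export format: WriteFile rows carry "Length: <n>" in the
# Detail column, possibly with thousands separators ("Length: 32,768").
LENGTH_RE = re.compile(r"Length: (\d[\d,]*)")

def write_events(rows: Iterable[Dict[str, str]]) -> Iterator[Tuple[str, int]]:
    """Yield (path, length) for each WriteFile row with a parsable length."""
    for row in rows:
        if row.get("Operation") != "WriteFile":
            continue
        match = LENGTH_RE.search(row.get("Detail", ""))
        if match:
            # drop separators, including a trailing comma before the next field
            yield row["Path"], int(match.group(1).replace(",", ""))

def tally(events: Iterable[Tuple[str, int]]) -> Tuple[Counter, Counter]:
    """Count writes per length and per file."""
    events = list(events)
    by_length = Counter(length for _, length in events)
    by_file = Counter(path for path, _ in events)
    return by_length, by_file

def lengths_for(events: Iterable[Tuple[str, int]], suffix: str) -> Counter:
    """Write-size histogram for files whose path ends with `suffix`."""
    return Counter(length for path, length in events if path.endswith(suffix))

def tally_csv(filename: str) -> Tuple[Counter, Counter]:
    # utf-8-sig drops a byte-order mark if the export starts with one
    with open(filename, newline="", encoding="utf-8-sig") as f:
        return tally(write_events(csv.DictReader(f)))
```

Usage would be along the lines of `by_length, by_file = tally_csv("Logfile.CSV")` followed by `by_length.most_common(9)`; the file name here is hypothetical.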
From my point of view, these writes are the first candidates for optimization.

Also, while reading forums, I found that someone had posted similar results before. I've attached a screenshot from the SsdReady program showing how much data was written during his daily session; his comment was: "1.5GB is session data, 700 MB in cookies, other don't know, so unfortunately I moved profile to HDD".

I'm not sure whether this single bug is enough, or whether it should become a meta bug with separate issues filed for the individual parts of Firefox.
Created attachment 597365: Write logs

Also attaching the write logs in native Procmon format, plus exports in CSV and MS Excel format.
This is an interesting insight. We don't have much control over SQLite's internal writes, though it may be worth doing more general research on how SQLite handles writes and publishing it on the sqlite-users mailing list. The 32768-byte writes are pages, the 24-byte writes are probably headers for those pages, and the larger writes are likely WAL checkpoints. I'm not sure about the 32- and 4096-byte writes.

It's puzzling how many writes cookies do; maybe they should be buffered somehow in the service. Reducing writes would be a good global win wherever possible.

I'll send a mail to SQLite support pointing out this bug, since I suppose they may be interested in it. Even so, those 24-byte writes may just be similar to the header that is kept in memory and that causes jemalloc to round allocations up.

The cache is probably a completely separate issue, since it may store any kind of file, from images to a simple HTML file. That said, you can change the settings to use an in-memory cache.
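The idea of buffering cookie writes in the service could look roughly like this: changes are collected in memory and written in one transaction, so SQLite commits once per batch instead of once per cookie. This is a hypothetical Python sketch; `BufferedCookieWriter`, the `cookies` table layout and the `max_pending` threshold are illustrative and are not Firefox's cookie service or schema.

```python
import sqlite3
from typing import Dict, Tuple

class BufferedCookieWriter:
    """Hypothetical sketch: queue cookie updates in memory and flush them in
    one transaction. Table and column names are illustrative only."""

    def __init__(self, conn: sqlite3.Connection, max_pending: int = 100):
        self.conn = conn
        self.max_pending = max_pending
        self.pending: Dict[Tuple[str, str], str] = {}  # later updates win
        self.commits = 0
        conn.execute(
            "CREATE TABLE IF NOT EXISTS cookies ("
            "host TEXT, name TEXT, value TEXT, PRIMARY KEY (host, name))")

    def set(self, host: str, name: str, value: str) -> None:
        self.pending[(host, name)] = value
        if len(self.pending) >= self.max_pending:
            self.flush()

    def flush(self) -> None:
        if not self.pending:
            return
        rows = [(host, name, value) for (host, name), value in self.pending.items()]
        with self.conn:  # one transaction, committed on successful exit
            self.conn.executemany(
                "INSERT OR REPLACE INTO cookies (host, name, value) VALUES (?, ?, ?)",
                rows)
        self.pending.clear()
        self.commits += 1
```

A real service would also need to flush on a timer and at shutdown so buffered changes are not lost; the trade-off is how much cookie state you are willing to lose on a crash in exchange for fewer commits.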
Windows and all flavors of Unix should be buffering write() system calls and only sending content to persistent storage (flash) when FlushFileBuffers() or fsync() is called. SQLite assumes as much, at least. SQLite does make many small, sequential writes to the WAL file, but it only calls FlushFileBuffers()/fsync() after writing large chunks that are usually multiples of 4096 bytes. We are familiar with the limitations and issues of flash memory and work hard to make SQLite flash-friendly.

Are you saying that Windows does not buffer writes, but instead translates every call to WriteFile() into one or more flash-memory page writes? If so, please let us know and we will add a buffering layer to the Windows OS interface to overcome what must then be considered a glaring deficiency in the Windows filesystem implementation. Or perhaps there is some setting or configuration we are failing to enable that is necessary to activate buffering on Windows? Can you direct us to the appropriate documentation?
(In reply to D. Richard Hipp from comment #3)
> Are you saying that Windows does not buffer writes, but instead translates
> every call to WriteFile() into one or more flash-memory page writes?

No, I'm saying that every write whose size is not a multiple of the flash page size uses whole pages, and that doesn't depend on the OS. Yes, in some cases buffering may help by reducing the number of writes, but even if it merges two consecutive writes of 32768 and 24 bytes into one 32792-byte write request to the drive, the drive still writes nine 4096-byte pages (eight full ones and one holding only 24 bytes).
The way filesystems normally work is that they accumulate many small, out-of-order writes in memory and then combine them all into a single big contiguous write to disk or flash. They've been doing this for a long time (four decades), because even though spinning disks do not suffer from the wear-out problems of SSDs, they do go faster when you write in big chunks rather than small ones, and filesystem designers have long been focused on performance. So if the Windows filesystem works as we think it does, it will not matter whether Firefox and SQLite write a total of 131072 bytes to a file using 100000 separate WriteFile() calls with 3 or 4 bytes in each call, in a haphazard order, or with a single big WriteFile() call containing all 131072 bytes. In either case, the writes will be coalesced and sent to flash all at once, resulting in a single 131072-byte write to flash and a minimal amount of wear on the NAND flash chip.

Do you have information suggesting that the Windows filesystem does not work as described in the previous paragraph? Can you share that information with us?

If we need to modify SQLite so that it coalesces writes into big chunks by itself, we can certainly do that. It will involve an extra memcpy() per write, so there is a small performance penalty, and our belief is that the operating system already coalesces the writes for us automatically. Hence our preference would be to leave this out. But if our understanding of how the Windows filesystem works is incorrect, then we should probably take the performance hit and coalesce writes in user space. Please guide us with whatever authoritative information you have on the subject.
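The coalescing argument can be checked with a toy model that counts 4096-byte flash pages two ways: once with every `WriteFile()` call going straight to flash, and once with an idealised cache that flushes each dirty page exactly once. This Python sketch assumes that idealised cache; it says nothing about what the Windows cache manager or an SSD controller actually does.

```python
from typing import Iterable, Tuple

PAGE = 4096  # assumed flash page size

def pages_if_unbuffered(writes: Iterable[Tuple[int, int]], page: int = PAGE) -> int:
    """Pages programmed if each (offset, length) write went straight to flash:
    every call touches each page its byte range overlaps."""
    total = 0
    for offset, length in writes:
        first = offset // page
        last = (offset + length - 1) // page
        total += max(0, last - first + 1)
    return total

def pages_if_coalesced(writes: Iterable[Tuple[int, int]], page: int = PAGE) -> int:
    """Pages programmed if an idealised cache gathers all writes and flushes
    each dirty page once."""
    dirty = set()
    for offset, length in writes:
        dirty.update(range(offset // page, (offset + length - 1) // page + 1))
    return len(dirty)

if __name__ == "__main__":
    # 131072 bytes written sequentially in 4-byte calls
    tiny = [(off, 4) for off in range(0, 131072, 4)]
    print(pages_if_unbuffered(tiny), pages_if_coalesced(tiny))
```

For the example above with 4-byte calls, 131072 bytes in 32768 calls touch 32768 pages without buffering but only 32 pages when coalesced.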
I definitely think this should be a meta bug with two separate bugs for the cache and SQLite issues.

Richard, I think what Phoenix is suggesting is that if the 32768- and 24-byte writes are actually contiguous on disk (data + header), there may be a benefit to reducing the data size to 32744 bytes, so that the total data + header size is an exact multiple of 4096. Of course, if they're not contiguous, changing the data size wouldn't help anything.
32768 is apparently the page size of the places.sqlite database, at least for the database provided by Phoenix. For any SQLite database with a page size of N, the WAL file consists of a short header (32 bytes) followed by alternating 24-byte and N-byte blocks. The page size N must be a power of 2. We could make N less than a power of two for the WAL file, but then the database would have non-aligned pages, so you would just be shifting the problem from the WAL file to the database. Changing the page size so that it is not a power of two would also break file format compatibility.

The writes to the WAL file are contiguous and sequential. And in FF, the PRAGMA synchronous=NORMAL setting is used, which means that FlushFileBuffers() is called only for a checkpoint operation, which is rare. Hence, the filesystem should be (ought to be) accumulating all writes to the WAL file in memory, then writing about 1MB to flash all at once whenever a checkpoint operation occurs. Because of the 24-byte records between the pages, the total amount written is unlikely to be a multiple of 4096 bytes and will probably have to be padded out to the next multiple of 4096, but we are only talking about an average of 2048 bytes of padding per 1 MiB of writing, or about 0.2%.

Correction: SQLite tries to run a checkpoint after about 1000 pages of content have accumulated in the WAL file, so it writes about 1MB only if the page size is 1K. If the page size is 32K, the checkpoint occurs when the WAL grows to about 32MB. Hence, in the case of FF, the padding will be only about 0.006% of all bytes written, assuming the filesystem cache and the track buffers in the flash controller on the SSD are doing their jobs.
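The WAL layout and padding arithmetic above can be sketched as follows. This Python sketch follows the WAL format as described here and in SQLite's file-format documentation (a 32-byte header of eight big-endian 32-bit integers, then frames made of a 24-byte header plus one page); the helper names are illustrative, and the frame iterator does not verify salts or checksums, so it may also report stale frames left over from earlier WAL generations.

```python
import struct
from typing import Dict, Iterator, Tuple

WAL_HEADER_SIZE = 32
FRAME_HEADER_SIZE = 24
WAL_MAGICS = (0x377F0682, 0x377F0683)  # low bit selects checksum byte order

def frame_offset(index: int, page_size: int) -> int:
    """Byte offset of 0-based frame `index`: WAL header, then 24+N byte frames."""
    return WAL_HEADER_SIZE + index * (FRAME_HEADER_SIZE + page_size)

def wal_size(frames: int, page_size: int) -> int:
    """WAL file size after `frames` frames have been appended."""
    return frame_offset(frames, page_size)

def padding_fraction(frames: int, page_size: int, flash_page: int = 4096) -> float:
    """Average padding (half a flash page) relative to one WAL generation,
    assuming the whole WAL reaches flash in one contiguous flush."""
    return (flash_page / 2) / wal_size(frames, page_size)

def parse_wal_header(data: bytes) -> Dict[str, object]:
    """Decode the 32-byte WAL header."""
    magic, version, page_size, ckpt_seq, salt1, salt2, _ck1, _ck2 = struct.unpack(
        ">8I", data[:WAL_HEADER_SIZE])
    if magic not in WAL_MAGICS:
        raise ValueError(f"not a WAL file (magic {magic:#x})")
    return {"version": version, "page_size": page_size,
            "checkpoint_seq": ckpt_seq, "salts": (salt1, salt2)}

def iter_frame_headers(data: bytes) -> Iterator[Tuple[int, int]]:
    """Yield (page number, db size after commit or 0) for each complete frame."""
    page_size = parse_wal_header(data)["page_size"]
    off = WAL_HEADER_SIZE
    while off + FRAME_HEADER_SIZE + page_size <= len(data):
        pgno, commit_size, *_ = struct.unpack(
            ">6I", data[off:off + FRAME_HEADER_SIZE])
        yield pgno, commit_size
        off += FRAME_HEADER_SIZE + page_size
```

`padding_fraction(1000, 1024)` comes out at about 0.2% and `padding_fraction(1000, 32768)` at about 0.006%, matching the figures above.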
We don't rely on the default of 1000 pages; we set wal_autocheckpoint to the number of pages that gives a 512KB journal before a checkpoint, so any eventual loss is not too bad.
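A setting like the one described, a WAL checkpoint every 512 KB rather than every 1000 pages, can be sketched with Python's sqlite3 module. This is an illustration only: Firefox's actual storage code is not shown here, and the 32768-byte page size and the helper name `open_with_bounded_wal` are assumptions taken from this thread.

```python
import os
import sqlite3
import tempfile

TARGET_WAL_BYTES = 512 * 1024  # checkpoint roughly every 512 KB of WAL

def open_with_bounded_wal(path: str, page_size: int = 32768) -> sqlite3.Connection:
    """Open `path` in WAL mode with an auto-checkpoint limit sized in bytes."""
    conn = sqlite3.connect(path)
    # page_size only takes effect while the database is still empty (or on VACUUM).
    conn.execute(f"PRAGMA page_size = {int(page_size)}")
    conn.execute("PRAGMA journal_mode = WAL")
    actual = conn.execute("PRAGMA page_size").fetchone()[0]
    # wal_autocheckpoint is expressed in pages, so convert the byte target.
    pages = max(1, TARGET_WAL_BYTES // actual)
    conn.execute(f"PRAGMA wal_autocheckpoint = {pages}")
    return conn

if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        conn = open_with_bounded_wal(os.path.join(tmp, "example.sqlite"))
        print(conn.execute("PRAGMA page_size").fetchone()[0],
              conn.execute("PRAGMA wal_autocheckpoint").fetchone()[0])
        conn.close()
```

With 32768-byte pages this works out to 16 pages per checkpoint. Reading the page size back after opening keeps the arithmetic correct even if the database already existed with a different page size.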
(In reply to D. Richard Hipp from comment #5)
> The way filesystems normally work is that they accumulate many small,
> out-of-order writes in memory, and then combine them all into a single big
> contiguous write to disk or flash. They've been doing this for a long time
> (4 decades) since even though spinning disks do not suffer from the wear-out
> problems of SSD, they do go faster when you write in big chunks versus small
> chunks, and filesystem designers have long been focused on performance.

I checked this with Diskmon, and it seems you are correct and I was not, so write size shouldn't be much of a problem here (at least on Windows 7).

On the other hand, the total number of -wal file rewrites looks too high to me: during six minutes of Diskmon measurements, those files were rewritten nearly twice a minute.

cookies.sqlite-wal: 22:26:05, 22:26:11, 22:26:20, 22:26:27, 22:26:36, 22:26:43, 22:26:49, 22:27:43, 22:29:39, 22:30:57, 22:31:42
places.sqlite-wal: 22:26:05, 22:26:43, 22:29:39, 22:30:57, 22:31:11, 22:31:13, 22:31:29, 22:31:32, 22:31:42

(In reply to Boris Zbarsky (:bz) from comment #6)
> I definitely think this should be a meta-bug with two separate bugs for
> cache and sqlite stuff.

I'm not very familiar with Bugzilla; can you do this?
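The rewrite rate can be computed directly from the Diskmon timestamps in this comment. A small Python sketch follows; the helper names are illustrative, and the rate depends on whether you divide by the six-minute measurement window or by the span between the first and last rewrite.

```python
from datetime import datetime
from typing import List, Optional

COOKIES_WAL = ["22:26:05", "22:26:11", "22:26:20", "22:26:27", "22:26:36",
               "22:26:43", "22:26:49", "22:27:43", "22:29:39", "22:30:57",
               "22:31:42"]
PLACES_WAL = ["22:26:05", "22:26:43", "22:29:39", "22:30:57", "22:31:11",
              "22:31:13", "22:31:29", "22:31:32", "22:31:42"]

def _parse(stamps: List[str]) -> List[datetime]:
    return [datetime.strptime(s, "%H:%M:%S") for s in stamps]

def gaps_seconds(stamps: List[str]) -> List[int]:
    """Seconds between consecutive rewrites."""
    times = _parse(stamps)
    return [int((b - a).total_seconds()) for a, b in zip(times, times[1:])]

def rewrites_per_minute(stamps: List[str],
                        window_minutes: Optional[float] = None) -> float:
    """Rewrites per minute over a given window, or over the first-to-last span."""
    times = _parse(stamps)
    if window_minutes is None:
        if len(times) < 2:
            raise ValueError("need at least two timestamps or an explicit window")
        window_minutes = (max(times) - min(times)).total_seconds() / 60
    return len(times) / window_minutes
```

Over a six-minute window this gives about 1.8 rewrites per minute for cookies.sqlite-wal and 1.5 for places.sqlite-wal; over the first-to-last span it is closer to 2 for cookies.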
As far as I know, no team is currently actively tracking this, and since the two bugs here are well-filed, I'm going to close this tracker.