Open Bug 337927 Opened 19 years ago Updated 2 years ago

Fragmented downloads, apparently because Firefox does not pre-allocate space for downloads

Categories

(Toolkit :: Downloads API, enhancement)

x86
Windows XP
enhancement

Tracking


REOPENED

People

(Reporter: zumbi9in, Unassigned)

References

Details

(Whiteboard: tpi?)

User-Agent: Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.0.3) Gecko/20060426 Firefox/1.5.0.3
Build Identifier: Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.0.3) Gecko/20060426 Firefox/1.5.0.3

Hi, I would like you to add a feature (option) to allocate disk space before downloading a file. For example: I downloaded a 19 MB file and my defrag program told me it was in 230 pieces. Each time I download something I have to make a copy to defragment it, so my HD doesn't go crazy :) Hope you can do that in the next release, Firefox is the best! Thanks, Alexandre.

Reproducible: Always

Actual Results: Download a big file and run a defragmentation program that shows fragmentation reports, like PerfectDisk 7.
Mozilla/5.0 (Macintosh; U; PPC Mac OS X Mach-O; en-US; rv:1.9a1) Gecko/20060509 Minefield/3.0a1

How full is your disk? Does your OS actually mind files being fragmented? I suspect that pre-allocating space will be (or has been) decided on using other factors.
I have a 400 GB HD with 30% free space, defragmented.
see the discussion in bug 251876 (which was WONTFIX'ed)
Mozilla/5.0 (Macintosh; U; PPC Mac OS X Mach-O; en-US; rv:1.9a1) Gecko/20060509 Minefield/3.0a1

On Mac OS, many files created by Mozilla are found in numerous fragments, but the HFS+ filesystem performs well even in the case of fragmented files. (I have always believed that this is because the filesystem uses a 'fill first extent' policy, rather than filling the largest free extent.) It is hard to see how this can be practically improved, given that we want the allocated file size to reflect the true size on disk at all times. I tend to copy significant downloads to an external drive, and that copy should produce a defragmented file, as it is a single act of creation rather than data dribbling in from the network. If it is any consolation, the BitTorrent protocol is even more prone to tickle filesystem infelicities.
Summary: Fragmented Downloads → Fragmented downloads, apparently because Firefox does not pre-allocate space for downloads
I confirm this for 3 or more files downloading at once (v1.5.0.3); it likely happens for 2 files also. Defragmenting large files (.iso's) is extremely slow and usually fails unless the disk has much more free space than the largest fragmented file. A pair of 2 GB files can have thousands of fragments, causing Windows XP to suggest a defrag that would otherwise not be needed. Pre-allocation should at least be the default for a second download started while the first continues.
Whiteboard: WONTFIX? #c3
WONTFIX, see comment 3
Status: UNCONFIRMED → RESOLVED
Closed: 17 years ago
Resolution: --- → WONTFIX
Verified, per above.
Status: RESOLVED → VERIFIED
Product: Firefox → Toolkit
I'm reopening this. The discussion in bug 251876 is about fixing it in the generic case, which is much harder especially when having to deal with a crash. We've already realized that avoiding fragmentation can cause massive performance improvements, and spent time to make sure many of our own database files are not fragmented and provided cross platform APIs to deal with it. So it seems to me the original reasons for WONTFIXING no longer hold. On the contrary, downloads are getting ever bigger, and especially the largest ones will end up on HDDs, not SSDs, whose rotational speeds have *not* evolved...
Status: VERIFIED → REOPENED
Ever confirmed: true
Resolution: WONTFIX → ---
(In reply to Gian-Carlo Pascutto [:gcp] from comment #8)
> The discussion in bug 251876 is about fixing it in the generic case

More relevantly, bug 251876 suggests using the SetEndOfFile API, in other words the file size is preallocated, and if the file is closed abruptly, attempts to read the file past the portion that has already been written will result in NUL characters. We cannot do that, because we use the file size to achieve the most reliable resuming, and if we stored the amount of data written elsewhere, we'd lose this reliability.

> which is much harder especially when having to deal with a crash. We've
> already realized that avoiding fragmentation can cause massive performance
> improvements

I'm not sure exactly which performance improvement you're referring to. I presume you refer to read performance, and I expect little difference when writing. We only read back a small subset of our downloads (resumable downloads for which we need to compute the hash). Moreover, we do this on a background thread, so it doesn't directly affect responsiveness. I don't think we should be concerned with optimizing read performance for other applications, which should be done by the operating system.

> and spent time to make sure many of our own database files are
> not fragmented and provided cross platform APIs to deal with it.

As I mentioned, database files have a quite different I/O profile.

> So it seems to me the original reasons for WONTFIXING no longer hold. On the
> contrary, downloads are getting ever bigger, and especially the largest ones
> will end up on HDDs, not SSDs, whose rotational speeds have *not* evolved...

I believe the original discussion is still valid. That said, if there is an API to _advise_ the operating system that we are about to write a file of a given size, _without_ actually setting the file size, I think we could accept a properly executed patch to add this feature to OS.File. The difficulty of testing this automatically means we'll treat it as a best-effort improvement with a known risk of regression. If you know of such an API for Windows, feel free to file a separate bug.

Note that, if I remember correctly, at least in some places we already advise the operating system that our downloads will be sequentially read or written (but don't specify the expected final file size).
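For context, the SetEndOfFile approach discussed in bug 251876 looks roughly like the sketch below. This is a minimal illustration using the documented Win32 calls, not code from any actual patch; the function name is made up. It shows exactly the property being objected to above: once EOF is set to the final size, the on-disk size can no longer serve as the resume offset.

```cpp
// Minimal sketch of the SetEndOfFile-style preallocation from bug 251876.
// Windows-only illustration; not code from the Firefox tree.
#include <windows.h>

bool PreallocateBySettingEOF(HANDLE file, LONGLONG finalSize) {
  LARGE_INTEGER size;
  size.QuadPart = finalSize;
  // Move the file pointer to the intended final size...
  if (!SetFilePointerEx(file, size, nullptr, FILE_BEGIN)) {
    return false;
  }
  // ...and set EOF there. From this point on the reported file size already
  // equals finalSize, so it no longer tells us how many bytes were actually
  // written; after a crash, reads past the written portion return zeros.
  return SetEndOfFile(file) != FALSE;
}
```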
(In reply to :Paolo Amadini from comment #9)
> More relevantly, bug 251876 suggests using the SetEndOfFile API,

We have a generic fallocate API now.

> We cannot do that, because we use the file size to achieve the most reliable
> resuming, and if we stored the amount of data written elsewhere, we'd lose
> this reliability.

What reliability? If you fsync the relevant writes, the worst that can happen is that the download has to rewind a bit on resuming if we crashed during the download. If you don't, I wonder what reliability you truly have. And yes, if you currently use the file size to see where to resume, that'll become harder and need extra work.

> I don't think we should be concerned with optimizing read performance for
> other applications, which should be done by the operating system.

I believe that's incredibly short-sighted. If that were true, what was the point of optimizing our own databases? Just let the OS do it. Even worse, users are likely to want to use the downloaded file soon, which doesn't exactly leave much time for the OS to do the optimization (not to mention most Linux installations don't even have this capability!). Our users are using Firefox to download files they want to read. To pretend the performance of that is not our problem is at the very least somewhat disingenuous.

> I believe the original discussion is still valid.

The one which says this isn't possible because the file APIs are frozen? We're not in 2004 any more; that ship sailed long ago.

> That said, if there is an API to _advise_ the operating system that we are
> about to write a file of a given size, _without_ actually setting the file
> size, I think we could accept a properly executed patch to add this feature
> to OS.File.

Either this, or figure out a way to recover the file size information another way. I'm not even looking into the specifics of implementing this right now. I'm just pointing out that the original reasons for shelving this are 10 years out of date and no longer valid, and that if someone comes up with a good approach for this, we should take it.
Whiteboard: WONTFIX? #c3
> We have a generic fallocate API now.

If we need support for fallocate in OS.File, don't hesitate to file a bug.

> What reliability? If you fsync the relevant writes, the worst that can
> happen is that the download has to rewind a bit on resuming if we crashed
> during the download. If you don't, I wonder what reliability you truly have.
> And yes, if you currently use the file size to see where to resume, that'll
> become harder and need extra work.

FWIW, I seem to remember that we do not want to fsync downloads, ever. It increases reliability (but not enough) at the expense of overall system responsiveness (and that cost can be large, depending on the case).
(In reply to Gian-Carlo Pascutto [:gcp] from comment #10)
> > More relevantly, bug 251876 suggests using the SetEndOfFile API,
>
> We have a generic fallocate API now.

"in other words the file size is preallocated, and if the file is closed abruptly, attempts to read the file past the portion that has already been written will result in NUL characters."

As far as I can tell, this is true of fallocate as well (<http://man7.org/linux/man-pages/man2/fallocate.2.html>).

> What reliability? If you fsync the relevant writes, the worst that can
> happen is that the download has to rewind a bit on resuming if we crashed
> during the download. If you don't, I wonder what reliability you truly have.

After a recovery, you will never have extra NUL characters in the file. This is the reliability property. The fsync is unimportant; it only affects how much data you can recover.

> And yes, if you currently use the file size to see where to resume, that'll
> become harder and need extra work.

Anything stored elsewhere won't be as reliable, as it won't necessarily be in sync with how much data the OS has actually written.

> > I don't think we should be concerned with optimizing read performance for
> > other applications, which should be done by the operating system.
>
> I believe that's incredibly short-sighted.

This is unhelpful.

> If that were true, what was the point of optimizing our own databases?

I already replied to this question: the databases have different I/O profiles.

> Just let the OS do it. Even worse, users are likely to want to use the
> downloaded file soon, which doesn't exactly leave much time for the OS to do
> the optimization (not to mention most Linux installations don't even have
> this capability!).

Also, "optimizing" may mean different things. I think the discussion will be easier if we try to scope it down to the case at hand, instead of finding (possibly unrelated) examples elsewhere.

> Our users are using Firefox to download files they want to read. To pretend
> the performance of that is not our problem is at the very least somewhat
> disingenuous.

We always have to make trade-offs; that's part of our work as software designers. Again, "performance" may or may not be our problem, based on what we mean exactly. In this case, my opinion on the trade-off is that we should err on the side of reliability. That said, I agree that we should do our best to help the OS optimize the allocation of large downloads, and this is why I proposed a solution I think is valid in comment 9.

> > I believe the original discussion is still valid.
>
> The one which says this isn't possible because the file APIs are frozen?
> We're not in 2004 any more; that ship sailed long ago.

Not that part, obviously.

> I'm just pointing out that the original reasons for shelving this are 10
> years out of date and no longer valid, and that if someone comes up with a
> good approach for this, we should take it.

Some of them are no longer valid. To better reflect what we need, I think the best thing to do is to file a new bug along the lines of "Investigate a way to advise the operating system on the final file size, without preallocating it".
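For concreteness, the fallocate-style preallocation under discussion looks roughly like the sketch below. It is a minimal Linux-only illustration (glibc, with _GNU_SOURCE as g++ defines by default), not the actual OS.File or download code, and the function names are made up. The man page linked above documents that the extended region reads back as zeros, which is the caveat raised here; the second helper uses FALLOC_FL_KEEP_SIZE, which reserves blocks without changing the reported file size and is arguably the closest Linux match to the "advise without setting the size" behaviour requested in comment 9.

```cpp
// Minimal sketch of fallocate()-based preallocation on Linux (glibc).
// Not code from the Firefox tree; error handling trimmed for brevity.
#include <fcntl.h>

bool PreallocateWholeFile(int fd, off_t expectedSize) {
  // Reserve blocks for the whole download up front. With mode = 0 the file
  // size is extended to expectedSize, and any part of that range that hasn't
  // been written yet reads back as zeros after a crash - the same caveat as
  // SetEndOfFile on Windows.
  return fallocate(fd, 0, 0, expectedSize) == 0;
}

bool ReserveWithoutChangingSize(int fd, off_t expectedSize) {
  // FALLOC_FL_KEEP_SIZE reserves the blocks but leaves the reported file
  // size untouched, so "file size == bytes written" keeps holding.
  return fallocate(fd, FALLOC_FL_KEEP_SIZE, 0, expectedSize) == 0;
}
```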
(In reply to :Paolo Amadini from comment #12)
> After a recovery, you will never have extra NUL characters in the file. This
> is the reliability property. The fsync is unimportant; it only affects how
> much data you can recover.

You may not have *extra* characters - file length is considered metadata and is usually more strongly protected by filesystems - but there's no guarantee at all that the data in there is valid. It may be zeroes, or it may be the previous inode contents at that physical location. If you want that guarantee, you need an fsync.

> Anything stored elsewhere won't be as reliable, as it won't necessarily be
> in sync with how much data the OS has actually written.

There are ways around this, and in any case we probably don't care as long as it's an under-estimation, not an over-estimation.

> This is unhelpful.

So is fragmenting the files when it could have been avoided. Files are downloaded because the user wants to use them, which means reading them back. Our own performance increases if the user's disk isn't seeking all over the place reading non-Firefox files. There are no downsides to not fragmenting the files when we can do so without loss of reliability; I hope that's not a point of contention.

> In this case, my opinion on the trade-off is that we should err on the side
> of reliability.

I fully agree. I just don't think the two are necessarily exclusive here.

> That said, I agree that we should do our best to help the OS optimize the
> allocation of large downloads, and this is why I proposed a solution I think
> is valid in comment 9.
> ...
> Some of them are no longer valid. To better reflect what we need, I think
> the best thing to do is to file a new bug along the lines of "Investigate a
> way to advise the operating system on the final file size, without
> preallocating it".

Feel free to file such a bug, but I don't know of any API that actually exposes the semantics you want and is actually effective in solving this issue. posix_fadvise has such semantics, but as far as I know it doesn't help much with avoiding fragmented writes (I'd love to be wrong about that), and it's not available where most of our users are (Windows).

Preallocation *is* effective and available on most platforms, but we need to be careful if we want to keep the valid downloaded file size and the writes in sync. Here are two possible approaches I can think of:

1) Notice when the file is above a certain size (say 5 MB). If so, start doing fallocate calls in 1 MB chunks. If we resume a crashed download, assume the last chunk was incomplete and rewind a maximum of 1 MB. This is very safe and reasonably simple, at the cost of the occasional rewind on a crash/power failure and of not preventing all fragmentation.

2) Extend the fallocated file length by 8 bytes (PRInt64) and store the currently-valid value there, updating it while writing out data. When finished, truncate to the correct length. This is harder and involves extra writes, with the benefit of not rewinding and being able to fallocate the full size at once.
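To make option 1 above concrete, a rough sketch of the chunked-preallocation idea might look like the following. The helper names, threshold, and chunk size are taken from the comment but the code itself is hypothetical, not a proposed patch.

```cpp
// Rough sketch of approach 1: preallocate in fixed-size chunks so that,
// after a crash, resuming only ever has to rewind by at most one chunk.
// Hypothetical helpers; not code from the Firefox tree.
#include <fcntl.h>
#include <algorithm>
#include <cstdint>

constexpr int64_t kPreallocThreshold = 5 * 1024 * 1024;  // 5 MB
constexpr int64_t kChunkSize = 1 * 1024 * 1024;          // 1 MB

// Called before writing the bytes in [offset, offset + len).
void EnsureChunkAllocated(int fd, int64_t expectedTotal,
                          int64_t offset, int64_t len) {
  if (expectedTotal < kPreallocThreshold) {
    return;  // Small downloads: keep the current behaviour.
  }
  // Round the end of this write up to a chunk boundary and reserve up to it;
  // ranges that are already allocated are effectively a no-op.
  int64_t chunkEnd =
      ((offset + len + kChunkSize - 1) / kChunkSize) * kChunkSize;
  chunkEnd = std::min(chunkEnd, expectedTotal);
  // Best effort: on failure, fall back to ordinary unallocated writes.
  (void)fallocate(fd, 0, 0, chunkEnd);
}

// Called when resuming after a crash: the last chunk may be only partly
// written, so rewind by at most one chunk from the on-disk size.
int64_t SafeResumeOffset(int64_t fileSize) {
  return std::max<int64_t>(0, fileSize - kChunkSize);
}
```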
(In reply to Gian-Carlo Pascutto [:gcp] from comment #13)
> You may not have *extra* characters - file length is considered metadata and
> is usually more strongly protected by filesystems - but there's no guarantee
> at all that the data in there is valid. It may be zeroes, or it may be the
> previous inode contents at that physical location. If you want that
> guarantee, you need an fsync.

Yeah, I think you're right in the general case. In practice, at least in the case of killing the browser because it has become unresponsive, with an NTFS file system, I don't think I've ever observed corruption. I didn't run any formal tests, however.

> There are no downsides to not fragmenting the files when we can do so
> without loss of reliability; I hope that's not a point of contention.

You're right, though code complexity and the size of the benefit are also factors in the equation.

> Preallocation *is* effective and available on most platforms, but we need to
> be careful if we want to keep the valid downloaded file size and the writes
> in sync. Here are two possible approaches I can think of:
>
> 1) Notice when the file is above a certain size (say 5 MB). If so, start
> doing fallocate calls in 1 MB chunks. If we resume a crashed download,
> assume the last chunk was incomplete and rewind a maximum of 1 MB. This is
> very safe and reasonably simple, at the cost of the occasional rewind on a
> crash/power failure and of not preventing all fragmentation.
>
> 2) Extend the fallocated file length by 8 bytes (PRInt64) and store the
> currently-valid value there, updating it while writing out data. When
> finished, truncate to the correct length. This is harder and involves extra
> writes, with the benefit of not rewinding and being able to fallocate the
> full size at once.

These are two possible solutions that decrease fragmentation and have similar (and maybe slightly better) reliability properties compared to the current state. Compared to simple advising, it should be noted that both carry a non-trivial code complexity cost. Whether we're willing to pay this cost is up for debate; it seems to me your opinion differs from mine here. Having a way to measure the benefit would certainly help in deciding (though I'm not saying we necessarily need to make a measurement).
On Windows there is an easy way to set the file allocation size without also setting the real file size, but it works only on NTFS volumes and requires the use of the poorly documented native function NtSetInformationFile with the information class FileAllocationInformation. Firefox already uses some native APIs, so I don't think that would be a problem.
See Also: → 1329437
(In reply to M8R-dcy0a from comment #15)
> On Windows there is an easy way to set the file allocation size without also
> setting the real file size, but it works only on NTFS volumes and requires
> the use of the poorly documented native function NtSetInformationFile with
> the information class FileAllocationInformation. Firefox already uses some
> native APIs, so I don't think that would be a problem.

There is a source from within MS confirming this: https://blogs.msdn.microsoft.com/oldnewthing/20160714-00/?p=93875

So we might be able to fix this.
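A hedged sketch of that idea is below: rather than calling NtSetInformationFile directly, it uses the documented Win32 wrapper SetFileInformationByHandle with the FileAllocationInfo class, which sets the allocation size without changing EOF. The function name is made up and this is not code from the Firefox tree. If I recall correctly, NTFS releases any unused reservation when the last handle is closed, which is acceptable here since the goal is only to steer block allocation while the download is being written.

```cpp
// Minimal sketch of reserving on-disk allocation on Windows without touching
// the reported file size (EOF), via SetFileInformationByHandle.
// Not code from the Firefox tree.
#include <windows.h>

bool ReserveAllocationSize(HANDLE file, LONGLONG expectedSize) {
  FILE_ALLOCATION_INFO info = {};
  info.AllocationSize.QuadPart = expectedSize;
  // Only the on-disk allocation grows; GetFileSizeEx() still reports the
  // number of bytes actually written, so size-based resume keeps working.
  return SetFileInformationByHandle(file, FileAllocationInfo,
                                    &info, sizeof(info)) != FALSE;
}
```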
Whiteboard: tpi?
Severity: normal → S3