Closed Bug 594172 Opened 14 years ago Closed 14 years ago

ZipWriter write()s too often

Categories

(Core :: Networking: JAR, defect)

x86
Linux
defect
Not set
normal

Tracking


RESOLVED FIXED
mozilla2.0b7
Tracking Status
blocking2.0 --- betaN+

People

(Reporter: taras.mozilla, Assigned: taras.mozilla)

References

Details

(Whiteboard: [ts])

Attachments

(1 file, 3 obsolete files)

Attached patch Cutting out of bonus writes (obsolete) — Splinter Review
The combination of a tiny output buffer and overwriting the header seems to result in up to 1000x more write() calls than necessary.
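To put rough numbers on the buffer part of this (an illustrative back-of-the-envelope sketch, not Mozilla code; the entry size is hypothetical, the buffer sizes are the old 0x800 value and the 3 MiB value from the later patch):

    // Illustrative only: how many buffer flushes (write() calls) one entry needs,
    // before counting any extra flush + seek caused by rewriting the header.
    #include <cstddef>
    #include <cstdio>

    int main() {
        const size_t entrySize = 2 * 1024 * 1024;   // a hypothetical 2 MiB entry
        const size_t smallBuf  = 0x800;             // ZipWriter's old buffer size
        const size_t largeBuf  = 3 * 1024 * 1024;   // size used in the patch

        std::printf("flushes with 2 KiB buffer: %zu\n",
                    entrySize / smallBuf);                          // 1024
        std::printf("flushes with 3 MiB buffer: %zu\n",
                    (entrySize + largeBuf - 1) / largeBuf);         // 1
        return 0;
    }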
Whiteboard: [ts]
Blocks: 593349
Comment on attachment 472805 [details] [diff] [review]
Cutting out of bonus writes

Bumping the buffer up doesn't reduce writes due to excessive seeking while writing. Turns out the seeking is to write the header, which seems to happen on Close() anyway. With this change zips only need one write.
Attachment #472805 - Flags: review?(dtownsend)
Comment on attachment 472805 [details] [diff] [review]
Cutting out of bonus writes

(In reply to comment #1)
> Bumping the buffer up doesn't reduce writes due to excessive seeking while
> writing. Turns out the seeking is to write the header, which seems to happen on
> Close() anyway. With this change zips only need one write.

This isn't true. Zip files are meant to contain two headers for every entry: one is the local file header that appears immediately before the file data (and is written in nsZipDataStream::CompleteEntry); the other is in the central directory, which is written in nsZipWriter::Close. They are redundant, and most tools seem to just use the central directory, which is why this change passes tests, but it does not leave a correct zip file. You can see the problem by running zip -FF on a generated zip: it will complain about CRC and file size mismatches, and the "fixed" zip will be full of zero-length entries.
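For reference, a minimal sketch of the two redundant records described above, with the field layout taken from the PKWARE APPNOTE (these are not the structs nsZipWriter actually uses):

    #include <cstdint>

    #pragma pack(push, 1)
    struct LocalFileHeader {            // immediately precedes each entry's data
        uint32_t signature;             // 0x04034b50
        uint16_t versionNeeded;
        uint16_t flags;
        uint16_t method;
        uint16_t modTime, modDate;
        uint32_t crc32;                 // only known after the data is compressed,
        uint32_t compressedSize;        // which is why the writer seeks back here
        uint32_t uncompressedSize;
        uint16_t nameLength, extraLength;
        // followed by the file name, extra field, then the file data
    };

    struct CentralDirectoryEntry {      // written at the end, in nsZipWriter::Close
        uint32_t signature;             // 0x02014b50
        uint16_t versionMadeBy, versionNeeded;
        uint16_t flags, method;
        uint16_t modTime, modDate;
        uint32_t crc32, compressedSize, uncompressedSize;
        uint16_t nameLength, extraLength, commentLength;
        uint16_t diskNumberStart;
        uint16_t internalAttributes;
        uint32_t externalAttributes;
        uint32_t localHeaderOffset;     // points back at the LocalFileHeader above
        // followed by the file name, extra field and comment
    };
    #pragma pack(pop)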
Attachment #472805 - Flags: review?(dtownsend) → review-
I'm not sure you can avoid the seeking here.

You could perhaps punt it all to the end and go back and write all file headers and then the central directory in Close, but that loses a certain amount of fault tolerance (currently if the app crashes or something before nsZipWriter::Close is called the zip file should be repairable with common zip tools). It is still the same amount of seeking but maybe removing the seeking while writing the main file data gives you a win anyway.

The only way to truly solve this is to write the file header before the file data, but that requires knowing the compressed size of the data and the crc value in advance.
(In reply to comment #3)
> I'm not sure you can avoid the seeking here.
> 
> You could perhaps punt it all to the end and go back and write all file headers
> and then the central directory in Close, but that loses a certain amount of
> fault tolerance (currently if the app crashes or something before
> nsZipWriter::Close is called the zip file should be repairable with common zip
> tools). It is still the same amount of seeking but maybe removing the seeking
> while writing the main file data gives you a win anyway.

That's reasonable.

> 
> The only way to truly solve this is to write the file header before the file
> data, but that requires knowing the compressed size of the data and the crc
> value in advance.

you can also buffer new zip items before writing them out.

I wonder if we should just buffer the whole zip file in memory.
(In reply to comment #4)
> (In reply to comment #3)
> > The only way to truly solve this is to write the file header before the file
> > data, but that requires knowing the compressed size of the data and the crc
> > value in advance.
> 
> you can also buffer new zip items before writing them out.
> 
> I wonder if we should just buffer the whole zip file in memory.

That is certainly a choice that works for small enough files but it was originally designed to allow for large zip files to be created. I guess a way to indicate that you want faster operation for smaller zip files might work.
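A rough sketch of the "buffer each entry before writing" idea from comment 4, using zlib directly with a hypothetical helper (not existing ZipWriter code). Note that real zip entries use raw deflate, whereas compress2() emits zlib-wrapped data, so this only illustrates the shape of the approach:

    #include <cstdint>
    #include <vector>
    #include <zlib.h>

    struct BufferedEntry {
        std::vector<uint8_t> compressed;
        uint32_t crc = 0;
        uint32_t uncompressedSize = 0;
    };

    // Deflate one entry into memory and compute its CRC-32; once this returns,
    // the caller knows the CRC and both sizes, so the local file header can be
    // written before the data with no seek afterwards.
    bool CompressEntry(const uint8_t* data, size_t len, BufferedEntry& out) {
        out.uncompressedSize = static_cast<uint32_t>(len);
        out.crc = static_cast<uint32_t>(::crc32(0L, data, static_cast<uInt>(len)));

        uLongf destLen = compressBound(len);
        out.compressed.resize(destLen);
        // compress2() keeps the sketch short; a real writer would use raw deflate.
        if (compress2(out.compressed.data(), &destLen, data, len,
                      Z_DEFAULT_COMPRESSION) != Z_OK)
            return false;
        out.compressed.resize(destLen);
        return true;
    }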
So I just confirmed my theory: having lots of write() calls causes Windows to fragment the file a lot. The above patch gets rid of fragmentation nicely.

So given your requirements, updating offsets after the file is written is the best option.
Attached patch now with working zip -FF (obsolete) — Splinter Review
This passes tests. The file is still more fragmented than it needs to be on updates. 

I wonder if we should add an API to leave some space between the last file and the central directory. Either that, or support opening zip files from a file handle so the user can call mozilla::fallocate on it (bug 592520). This would also involve not doing SetEOF in that case.
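Something like this is what the fallocate idea would look like on POSIX (a hypothetical helper; mozilla::fallocate from bug 592520 would be the cross-platform wrapper, and its exact interface isn't shown in this bug):

    #include <fcntl.h>
    #include <sys/types.h>

    // Hypothetical helper: pre-extend the zip file so later appends (new entries,
    // the rewritten central directory) reuse the reserved extent instead of
    // fragmenting the file on disk.
    bool ReserveZipSpace(int fd, off_t expectedSize) {
        return posix_fallocate(fd, 0, expectedSize) == 0;
    }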
Assignee: nobody → tglek
Attachment #472805 - Attachment is obsolete: true
Attachment #473156 - Flags: review?(dtownsend)
Attached patch Now updates work with zip -FF (obsolete) — Splinter Review
Attachment #473156 - Attachment is obsolete: true
Attachment #473183 - Flags: review?(dtownsend)
FWIW, zip is designed to work with streaming. There is a mode to write the zip length information after the data so no seeking is required.
(In reply to comment #9)
> FWIW, zip is designed to work with streaming. There is a mode to write the zip
> length information after the data so no seeking is required.

Can you point to details on this? The main zip spec is pretty explicit about file headers coming before their file data, and file headers normally carry no offset information.
From the spec: http://www.pkware.com/documents/APPNOTE/APPNOTE-6.3.0.TXT

  C.  Data descriptor:

        crc-32                          4 bytes
        compressed size                 4 bytes
        uncompressed size               4 bytes

      This descriptor exists only if bit 3 of the general
      purpose bit flag is set (see below).  It is byte aligned
      and immediately follows the last byte of compressed data.
      This descriptor is used only when it was not possible to
      seek in the output .ZIP file, e.g., when the output .ZIP file
      was standard output or a non-seekable device.

Basically, this optional post-data descriptor contains information which can only be calculated after the fact while streaming.
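A sketch of that streaming-mode tail, assuming a little-endian host (the leading signature is optional in the spec but widely emitted; this is not ZipWriter code, just the layout):

    #include <cstdint>
    #include <cstdio>

    // With bit 3 (0x0008) set in the local header's flags, the crc/size fields
    // there are written as zero and this descriptor follows the compressed data.
    void WriteDataDescriptor(std::FILE* out, uint32_t crc32,
                             uint32_t compressedSize, uint32_t uncompressedSize) {
        const uint32_t fields[4] = {
            0x08074b50u,        // optional data descriptor signature
            crc32,              // computed while streaming the data
            compressedSize,     // known once compression finishes
            uncompressedSize,
        };
        std::fwrite(fields, sizeof(uint32_t), 4, out);  // little-endian host assumed
    }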
Oh yeah. For some reason I recall that there were problems with that, but we could try it.
Comment on attachment 473183 [details] [diff] [review]
Now updates work with zip -FF

>diff --git a/modules/libjar/zipwriter/src/nsZipWriter.cpp b/modules/libjar/zipwriter/src/nsZipWriter.cpp
>--- a/modules/libjar/zipwriter/src/nsZipWriter.cpp
>+++ b/modules/libjar/zipwriter/src/nsZipWriter.cpp
>@@ -301,7 +301,7 @@ NS_IMETHODIMP nsZipWriter::Open(nsIFile 
>         return rv;
>     }
> 
>-    rv = NS_NewBufferedOutputStream(getter_AddRefs(mStream), stream, 0x800);
>+    rv = NS_NewBufferedOutputStream(getter_AddRefs(mStream), stream, 3*1024*1024);

Spaces around the *s please.
Attachment #473183 - Flags: review?(dtownsend) → review+
Requesting approval to land because this fixes a significant startup performance problem.
Attachment #473183 - Attachment is obsolete: true
Attachment #474758 - Flags: review+
Attachment #474758 - Flags: approval2.0?
What's the cost/benefit of taking this in terms of risk of regression? We're getting close to code freeze, and I'd rather ratchet down regression risk until we re-open for the next beta. Opinions?
This will help startup perf significantly, and is fairly safe. I think we should take it, and block on it since it's related to a startup cache regression.
blocking2.0: --- → betaN+
Keywords: checkin-needed
http://hg.mozilla.org/mozilla-central/rev/b4c78a544949
Status: NEW → RESOLVED
Closed: 14 years ago
Keywords: checkin-needed
Resolution: --- → FIXED
Target Milestone: --- → mozilla2.0b7
Comment on attachment 474758 [details] [diff] [review]
write out in bigger chunks.

Canceling unnecessary approval request, since this is blocking+
Attachment #474758 - Flags: approval2.0?
Version: unspecified → Trunk
Depends on: 610040
This may have caused bug 610040.