Closed Bug 356470 Opened 18 years ago Closed 17 years ago

HTTPS File Upload results in partial and/or corrupted files on server

Categories

(Core :: Networking, defect)

Version: 1.8 Branch
Type: defect
Priority: Not set
Severity: major

Tracking

Status: RESOLVED INVALID

People

(Reporter: jhuckaby, Assigned: Dolske)

Details

(Keywords: dataloss, regression; Whiteboard: ssl)

Attachments

(2 files, 2 obsolete files)

User-Agent:       Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; .NET CLR 1.1.4322)
Build Identifier: Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.1) Gecko/20061003 Firefox/2.0

When I upload a large (1 MB) file using a simple form and HTTPS, the file arrives on the server smaller (800K) than it should be.  This same upload works fine in Firefox 1.5, MSIE 6, Safari 2.0, etc.  The bug only occurs in Firefox 2.0 RC2 Win32, and only under HTTPS.

Reproducible: Always

Steps to Reproduce:
1. Visit this page: https://staging.sr.admission.net/joetemp/file_upload.html
2. Upload a large (1 MB) file

Actual Results:  
See the results page which lists the file size under "File Uploads"; it will be smaller than it should be.  My test file is 1163446 bytes, and with Firefox 2.0 RC2 it arrives on the server with a size of 905215 bytes.

Expected Results:  
The file size on the server should be the actual size of the file.

My server-side script is using mod_perl 1.29 on Apache 1.3.37 under Fedora Core 4, i386 arch.  The Perl module used to parse the HTTP POST is Apache::Request version 1.33.  Source code available on request: jhuckaby at gmail dot com.

The Content-Length header that Firefox is sending seems to be correct, so there must be something wrong with the file section of the multi-part HTTP POST request.

Oh, last thing.  This bug is NOT reproducible in the Mac OS X version of Firefox 2.0 RC2.  It only occurs in the Windows version.
Here is the file I used to reproduce the bug (1163446 bytes):
http://staging.sr.admission.net/joetemp/12-images.zip
Version: unspecified → 2.0 Branch
It appears that this bug is reproducible under the testcase provided using "Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.1) Gecko/20061012 BonEcho/2.0".  I can't seem to reproduce with other files, though.  Perhaps this only occurs in a very small range of file sizes?

It "might" be useful to test it further by uploading the file via HTTPS, then redownloading it to see if the file is intact.  Have you , the reporter ever, in a "real-life" situation uploaded a file via HTTPS, and found it to be corrupted or incomplete when you downloaded it?

Yes, using the test file provided, I did upload, then re-download the file and it is indeed corrupted.  The file size of the re-downloaded file is much smaller than the original (883K != 1136K), WinZip reports "This file is not a valid archive", and gunzip reports "Cannot find EOCD signature".  I have also reproduced this with another file that is 900K.  I have now tried several other browsers / OSes, and can ONLY reproduce this with Firefox 2.0 RC2 Win32 with HTTPS (HTTP works fine).

I have tried: Firefox 2.0 RC2 Mac OS X, Firefox 1.5.0.6 Win32, Firefox 1.5.0.6 Mac OS X, Safari 2.0.4 Mac OS X, Netscape 7.2 Win32, Netscape 7.2 Mac OS X, MSIE 6.0 Win32 SP1, MSIE 6.0 Win32 SP2.  None of these other browsers has the bug.
I can also reproduce using a 968KB .exe file.  When uploaded, it becomes just 263.99K.  Moving to "critical" severity and confirming...
Severity: normal → critical
Status: UNCONFIRMED → NEW
Ever confirmed: true
This isn't a critical bug...
Severity: critical → normal
(In reply to comment #5)
> This isn't a critical bug...
> 

Why isn't this considered a critical bug?  It results in loss of data.  This functionality works perfectly fine on earlier (pre-2.0) versions of Firefox and is now broken.  This potentially affects hundreds of file upload services, including the many photo upload services where files are likely to be large due to the high resolution of digital cameras.  For our customers alone this will affect dozens of major newspaper sites posting classified ads.
A critical bug is a bug that causes a crash, hang, or loss of data. While this bug apparently results in some data not being uploaded for certain files, it doesn't fall into the "critical" category. I'm not trying to dismiss this bug as unimportant - it seems pretty bad, in fact, more so than I'd initially realized when I first read the bug. Jackie has graciously offered to try and find a narrower regression range, by testing mozilla-1.8 builds at http://archive.mozilla.org/pub/firefox/nightly/ and/or http://ftp.mozilla.org/pub/mozilla.org/firefox/nightly/. That's probably the best way forward.
Alpha 1 - WFM "Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8) Gecko/20060321 Firefox/2.0a1"

Alpha 2 - WFM "Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.1a2) Gecko/20060512 BonEcho/2.0a2"

Alpha 3 - WFM "Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.1a3) Gecko/20060526 BonEcho/2.0a3"

Beta 1 - Doesn't work "Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.1b1) Gecko/20060710 Firefox/2.0b1"

It probably broke somewhere between alpha 3 and beta 1.
I have pinpointed a regression window for this bug, as follows:

Works in "Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.1a3) Gecko/20060626 BonEcho/2.0a3"
Broken in "Broken in "Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.1a3) Gecko/20060628 BonEcho/2.0a3"

There seems to be no 2006-06-27 build available!

(In reply to comment #4)
> I can also reproduce using a 968KB .exe file.  When uploaded, it becomes just
> 263.99K.  Moving to "critical" severity and confirming...
> 

This file can be accessed by...

1. Go to http://ideas.live.com/signup.aspx?versionId=b2456790-90e6-4d28-9219-5d7207d94d45
2. Enter your e-mail address.
3. Download the file by clicking on the "Go to Windows Live OneCare Beta v1.5 Setup" link.
4. Try to reproduce the bug with the file you just downloaded.

Good work Jackie! I can confirm that there was a change in behavior in that range.

Bonsai checkins:
http://bonsai.mozilla.org/cvsquery.cgi?module=PhoenixTinderbox&branch=MOZILLA_1_8_BRANCH&branchtype=match&date=explicit&mindate=2006-06-26+01%3A00&maxdate=2006-06-28+04%3A00

mozilla/netwerk checkins:
http://bonsai.mozilla.org/cvsquery.cgi?branch=MOZILLA_1_8_BRANCH&branchtype=match&dir=mozilla%2Fnetwerk&date=explicit&mindate=1151308800&maxdate=1151492400

I'll try backing out some of the patches that may have caused this.
Severity: normal → major
Flags: blocking1.9?
Flags: blocking1.8.1.1?
Keywords: regression
Gavin, does backing out bug 334142 change anything?
(In reply to comment #12)
> Gavin, does backing out bug 334142 change anything?

Hmm, backing out the patch from bug 334142 (which re-fixes bug 137155) has the weird effect of making the server-reported size vary seemingly randomly.

These tests were made with the 1,163,446 byte file in comment 1:

Current branch: server always reports 905,215 bytes
Current branch with bug 137155 fixed (bug 334142 backed out): server reports different sizes each time
Other data points: I can't reproduce this bug using a 1.5.0.7 build (which also has the patch to bug 334142), but I can reproduce it on the trunk. I'll get a trunk regression range to try to narrow down the 1.8 range above.
The trunk regression range is:
http://bonsai.mozilla.org/cvsquery.cgi?module=PhoenixTinderbox&date=explicit&mindate=2006-06-22+02%3A00&maxdate=2006-06-23+05%3A00

Which also includes bug 331825 and bug 334142. I tried backing out bug 331825 and it didn't have any effect. I guess the next step is to try pulling by date around that time to make sure that subsequent changes aren't interfering with my testing.
I'll do that next.
Gavin, thanks for looking into this!  You could also try backing out both of those bugs together to see whether it's the combination of the two patches that causes issues...
Flags: blocking-firefox2?
Keywords: dataloss
Backing out both netwerk patches has the same effect as just backing out bug 334142.

The only other checkin common to the two regression ranges is bug 235773 (enable ECC). Since this seems to only affect HTTPS (Joseph or Robert, can you confirm this?), that might actually be the culprit.

If it is related to ECC (building now to confirm), that may explain why this problem wasn't identified before: it would only affect SSL, and even then perhaps only on certain servers depending on their SSL setup.
Yes, this only affects HTTPS file uploads.  I cannot reproduce the bug with plain HTTP.
This bug seems to have multiple causes. A build pulled by date from the 1.8 branch just before the patch for bug 334142 landed works, and one pulled from right after it landed fails, so this bug was at one point caused by the backout of bug 137155. However, a build from before the patch for bug 137155 landed also works. So something must have landed between bug 137155 (2006-02-08 09:39) and bug 334142 (2006-06-26 18:21) that depended on the fix for bug 137155 to avoid causing this bug. I'll try to identify which change that is.
Is this issue specific to the server?  If this has been on the trunk since June, I'm curious why we haven't yet gotten other reports...
Can this be reproduced on any other server? [Joseph, I'll send you an email request for your Perl code.]

I'll also do a little network sniffing to see what's going on.
The following information is critical to helping drivers decide the impact of this bug:

- does it affect marquee sites with https upload (what marquee sites have https upload?)

- does it affect ALL https upload, or just https w/ECC?

- what range of file sizes does it affect? 900 KB-1.2 MB? Anything over a certain size?

- what other effects might this have for https traffic? Will it affect downloads?

Thanks. 
(In reply to comment #22)
> - does it affect ALL https upload, or just https w/ECC?

Given the findings from comment 19, I'm no longer very confident that my initial hypothesis that this bug is related to ECC was correct.

> - what range of file sizes does it affect? 900 KB-1.2 MB? Anything over a certain size?

I think failure depends on the size of the file relative to the size of the packets being sent. I can reproduce with the file from comment 1 (1,163,446 bytes), but not with a file that's 2,097,152 bytes.
Thinking out loud here.  Has there perhaps been a change that caused
the NSPR sockets used for file upload to be non-blocking, where they were
blocking sockets before?

Is it possible that the code for uploading those files assumes that 
the sockets are blocking, but they are now non-blocking?   

Blocking NSPR sockets have the property that they never return short writes.  
That is, when a write function is called to write a fixed amount, 
it returns either that fixed amount or -1, but no value in between.

However, a non-blocking NSPR socket may return -1, or any positive value up 
to the amount specified to be written.  If code assumed that 
any return value other than -1 was equivalent to having written the full 
amount requested, (i.e. if the code only checked the return value for -1)
that might be OK for blocking NSPR sockets but not for non-blocking NSPR 
sockets.  This applies to SSL sockets, too.

Given that the reported results are that the file transferred is shorter
than the one sent, this makes me suspect that the code is not handling
"short writes" properly.  Just a guess.  This hypothesis doesn't explain 
why 1MB files fail and 2MB files succeed!
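To make the distinction concrete, here is a minimal sketch (not Mozilla code; the function name and error handling are illustrative) of the loop a caller needs around a possibly non-blocking NSPR write, treating a short write as partial progress rather than as completion:

#include <prio.h>
#include <prerror.h>

/* Write all `len` bytes to `fd`, tolerating short writes.  On a
   non-blocking socket, PR_Write may accept fewer bytes than requested,
   so any non-negative return must be treated as partial progress. */
PRStatus write_fully(PRFileDesc *fd, const char *buf, PRInt32 len)
{
    PRInt32 sent = 0;
    while (sent < len) {
        PRInt32 n = PR_Write(fd, buf + sent, len - sent);
        if (n < 0) {
            if (PR_GetError() == PR_WOULD_BLOCK_ERROR)
                continue;           /* real code would PR_Poll() here */
            return PR_FAILURE;      /* hard error */
        }
        sent += n;                  /* n may be less than len - sent */
    }
    return PR_SUCCESS;
}

Code that instead treats any return other than -1 as "the whole buffer was written" would silently drop the unwritten tail, which would be consistent with the shortened uploads reported here.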
I confirm it works on Linux.

I confirm I get the problem on Windows.

I compiled with ECC disabled, and it still fails on Windows.
* youtube uploads over http
* flickr.com ""
* gmail will run happily either way

I'm really surprised we didn't see this with bz attachments.. 
Answering my own question - bz doesn't allow uploads greater than 1024 bytes.  So can't test there.  Has anyone tried reproducing with other servers?
I'm still doing some debugging, but thought I should post an update:

What appears to be happening is that the server is interrupting the SSL conversation. A network snoop shows the upload progressing normally, until the server suddenly sends a packet with an Alert message (in the SSL protocol). [This might be the result of bad data from the browser, so I can't say for sure whose fault it is.]

I'm getting bogus output from ssltap, so I don't know what type of SSL Alert it is. [Seems like an ssltap bug?] I'm poking around some more to see if I can get something useful out of it.
Testing on other servers seems to check out... The file in question has no problem being uploaded (over SSL) on Gmail, Yousendit.com, wiki.ubuntu.com, or a test Apache server set up during discussion on IRC.

As another data point, I created a few new files with the same size as the problematic "12-images.zip", but pulling the contents from /dev/urandom and /dev/zero. On Windows, the /dev/zero file uploaded fine but 3 /dev/urandom files all failed, at different points (ie, the server reported uploads of different sizes). On Linux, all 4 files uploaded ok.
Since making comment 19, I've built by date in the time period that bug 137155 was on the trunk, with the patch backed out (essentially a simulation of that patch never landing). With that setup, the regression range that I find with this bug points to the landing of bug 337770.

Conclusion: the upgrade of NSS from bug 337770 is the immediate cause of this bug. However, fixing bug 137155 also causes this bug to not appear. In this case, bug 137155 was fixed first, then NSS was upgraded, which means that the regression didn't appear until the patch for bug 137155 was backed out one month later (nearly 5 months after it landed).

I don't really know enough about NSS/netwerk interaction to be able to draw a useful conclusion from this information, and the diffs in bug 337770 are a little too big for me to actually read through looking for hints.
Based on comment 28 and comment 29, we won't block the release of Firefox 2 for this. However, I hope gavin's research in comment 30 is helpful for getting the appropriate fixes to NSS for 1.8.1.1. 
Flags: blocking-firefox2? → blocking-firefox2-
This is clearly not a Firefox issue.  It's either networking or NSS, or both...  Trying NSS for a start.
Assignee: nobody → nobody
Component: File Handling → Test
Flags: blocking-firefox2-
Product: Firefox → NSS
QA Contact: file.handling → test
Version: 2.0 Branch → 3.9
Component: Test → Libraries
QA Contact: test → libraries
Update to Comment 28...

Ugh -  I somehow managed to sniff a freak connection, and it's unlike all the other times I've reproduced the problem. So, the bit about getting an unexpected SSL alert isn't what's normally happening. I can't seem to reproduce this first freak sniff, so whatever caused it is probably something else.

The good news is that a nominal sniff of the problem (using 12-images.zip) shows that this sure seems like a server problem... We're sending the full file contents to the server (an uninterrupted block of 1,185,483 encrypted bytes, which would be the 1,163,446 bytes of the file plus 22,037 bytes of SSL overhead). The resulting web page claims only 905,215 bytes of the file were received, so the remote end is discarding the extra data somewhere.

I had mconner test a debug build with the SSLTRACE and SSLDEBUG environment variables set (to "50"), and the resulting output from reproducing the problem appears to indicate the same thing.

I'm still at a loss to explain the exact reason for the failure, though. If the submitter could look at gathering some additional data from the server's end (Apache/OpenSSL debug logging?), that might help explain what the server thinks is happening.
(In reply to comment #28)
> the server is interrupting the SSL conversation. 
> server suddenly sends a packet with an Alert message (in the SSL protocol).
> I don't know what type of SSL Alert it is. 

The SSL alert record content is encrypted.  Record headers are not encrypted,
but the record contents are.  For an alert record, the encrypted content 
is very small, but it's the part you want. SSLTap cannot decrypt it.
You could try using an SSL cipher suite with NULL encryption.

But any alert you receive from the server during the data transfer (after the
handshake) is effectively a sign that the server is aborting the transfer.
Justin, 
Please attach the portion of your ssltap output showing the handshake(s).
Also, the very tail end of the output might be useful.
Also, Justin, please attach (or paste into a comment) the "raw request header"
output from the page returned by the server in response to a flawed upload.
Justin sent me some very large log files of a connection with this problem.
(Thanks! Justin)

I see nothing yet that obviously indicates an SSL error of any kind.

Here are some observations gleaned from those logs.
The negotiated protocol version was 3.1, TLS 1.0
The server chose cipher suite 0x0039, TLS_DHE_RSA_WITH_AES_256_CBC_SHA

ssl_SecureSend was being called alternately with writes of size 4095, and 1.
This behavior MIGHT be expected if writes to the underlying TCP socket were experiencing "short writes" (full output buffers).  But the log contains considerable evidence contradicting that.  If we were experiencing short TCP
writes, we would not be seeing a new SSL record being generated for each 
1-byte write from above.  We would see one SSL record generated for every 
TWO (or more) writes from above.  But in the log file, we do see ssl records 
being generated with alternating sizes of 4133 and 37 (which are the 
expected values for payloads of 4095 and 1 with that ciphersuite).  This 
suggests that the odd alternating write sizes are due to strange behavior
in the layer above ssl_SecureSend.

Note that 4K is not a record size boundary for TLS.  TLS's record size
boundary is 16K.

I would like to know which of the many forms of write functions is being 
used by the layers above to send this file.  Is the layer above libSSL 
calling PR_Write, PR_Send, PR_SendFile, PR_TransmitFile, PR_Writev, or what?

I continue to suspect a server problem.  There have been a handful of bugs
filed that turned out to be TLS servers with flawed block cipher logic that
was manifest only with AES 256 ciphersuites.  So, I'd suggest conducting 
an experiment with this server, where the AES 256 cipher suites have been
disabled in the client.  See if the problem can be reproduced with AES_128
or RC4.  If not, that's quite telling.
(In reply to comment #37)
> This 
> suggests that the odd alternating write sizes are due to strange behavior
> in the layer above ssl_SecureSend.

The direct layer above libssl is PSM's layer. It will use the same buffer sizes for reading and writing as used by the layer above it, which is Mozilla's networking layer Necko.


> I would like to know which of the many forms of write functions is being 
> used by the layers above to send this file.  Is the layer above libSSL 
> calling PR_Write, PR_Send, PR_SendFile, PR_TransmitFile, PR_Writev, or what?

Out of the functions you listed, the PSM layer implements the "write" function only.
Just wanted to add my comment here to let you all know that this doesn't seem to be limited merely to truncated files on the receiving end; entire files can also be skipped in a multiple-file upload.

I have a simple file upload app I wrote that is used here at Goddard Space Flight Center, running on mod_perl 1.29 on Apache 1.3.27. It provides a form with 5 separate file upload input boxes, one right below the other. Prior to Firefox 2.0, you could select up to 5 files (one for each field), and it would successfully upload them all. You could even leave some fields empty and it would still upload the rest fine...

ie - if the file upload boxes are on the page like:
|BOX1| <Browse>
|BOX2| <Browse>
|BOX3| <Browse>
|BOX4| <Browse>
|BOX5| <Browse>

You didn't HAVE to fill them sequentially, you could have only a file in Box1 and Box4 for example, and they would both still get there fine.

Anyhow, I noticed tonight that at times Firefox 2.0 was only uploading 1 file out of 5, sometimes only 2 out of 5, sometimes 3, but so far never more than three, even if all 5 inputs were full. More often than not, though, it will only upload 1 file and then stop processing.

The code that actually reads in the file is quite rudimentary, and just uses the Apache::Request module:

for (my $f_upload = $$r_apr->upload; $f_upload; $f_upload = $f_upload->next) {
   my ($curr_file) = $f_upload->filename;
   my ($curr_size) = $f_upload->size;

   if ($curr_file) {
       $curr_file =~ m/^.*(\\|\/)(.*)/;  # strip the remote path
       my ($file_name) = "";

       if (! $2) {  $file_name = $curr_file;  }
       else {   $file_name = $2;  }

       $$r_apr->log_error("The uploaded file was called: \"$file_name\"");

       if (-e "/tempfiles/$fullname/$uid/$file_name") {
            $$r_apr->log_error("File already exists!!!");
       }
       else {
           $f_upload->link("/tempfiles/$fullname/$uid/$file_name") ||
                   die "Error linking file!! Error was: $!";
           $$r_apr->log_error("Your file \"$file_name\" uploaded successfully!!!");
           push(@uploaded_files, "$file_name"."|||||"."$curr_size");
       }
   }
   else {
       $$r_apr->log_error("NO FILE TO UPLOAD!!!");
   }
}

foreach(@uploaded_files) {
    $$r_apr->log_error("Array Item: $_\n");
}

---------------

No matter how many files were set to be uploaded, only 1-3 "Array Item: xxxxxx" lines ever show up in the logs, and more often than not it's just 1, like I said before. The application is working perfectly fine apart from this issue, though; the files that make it there don't seem to be corrupted (although I'm only testing with at most 30 KB PDF files at the moment, so larger ones might well get corrupted).

I look forward to any progress that is made with this, and eagerly await a solution, since at the moment (this being an app used widely on center), I'll have to tell people to stick with Firefox 1.5 till this gets wrapped up.

PS - yes, all uploads are done through https exclusively.
Without an owner or current activity, this looks unrealistic for 1.8.1.1. We want it fixed, though.
Flags: wanted1.8.1.x+
Flags: blocking1.8.1.1?
Flags: blocking1.8.1.1-
Moving back to Core. The bug might be in NSS (dunno if that's been shown) but they're not likely to care enough to track down if the only symptom is in Firefox file uploading. If we want it fixed we'll have to own it until this is further along.
Assignee: nobody → dolske
Component: Libraries → Networking
Product: NSS → Core
Version: 3.9 → 1.8 Branch
It would be most helpful if either the original reporter (Joe) or  the comment 39 reporter (Tim) could provide some more details/logs of what Apache thinks is happening.

From the snooping I did in comment 33, Firefox *is* sending the data, but the server is apparently discarding it before the script gets it. It would sure be nice to know why.
Hello all.  As the original reporter of this bug, I would be happy to add more info.  First, I cranked up both my Apache LogLevel and mod_ssl SSLLogLevel to "debug", restarted Apache, and reproduced the bug by uploading the 1.1 MB file from Firefox 2.0 Win32 over HTTPS.  The Apache error_log contained nothing at all, but the ssl_engine_log contained a LARGE amount of information (to the tune of 5.5 MB).  Here is that excerpt if anyone is interested, gzipped for downloading:

http://staging.sr.admission.net/joetemp/ssl_engine.log.gz

I then decided to try having the server simply dump out the raw HTTP POST data without using Apache::Request at all, to see the size, and if submitting over HTTP differs from HTTPS.  Here are those raw dumps:

http://staging.sr.admission.net/joetemp/rawcontent_ff20_win32_http.bin
http://staging.sr.admission.net/joetemp/rawcontent_ff20_win32_https.bin

The sizes are correct (1.1 MB), and they differ slightly, but it looks like it's just the multipart/form-data "boundary", which is randomly generated.  I do not have software to perform a deep analysis of these dumps, so I do not know for sure if the binary data differs.  I also dumped the raw HTTP POST data from a few other browsers, for further comparison:

http://staging.sr.admission.net/joetemp/rawcontent_ff1508_mac_https.bin
http://staging.sr.admission.net/joetemp/rawcontent_ie6_win32_https.bin
http://staging.sr.admission.net/joetemp/rawcontent_safari2_mac_https.bin

Keep in mind that submitting from ALL these browsers / OSes works just fine (the bug is not reproducible), EXCEPT for Firefox 2.0 Win32 HTTPS.  However, I do not see any large differences in the raw dumps, which is interesting.  These raw dumps certainly make it seem like a bug in Apache::Request (the library parsing the HTTP POST), but why only Firefox 2.0 Win32 HTTPS?  All other browsers / OSes work fine.

Next I took a deeper look at the Apache::Request module (docs available here: http://search.cpan.org/~stas/libapreq-1.33/Request/Request.pm ).  I made use of the UPLOAD_HOOK feature to have libapreq call a function while it is parsing the HTTP POST.  The results were as expected.  I incremented a counter of bytes received, and wrote all the data to a filehandle.  Looks like the function is called for every 4096 bytes.  The total size was 905215 bytes (same problem, much too small -- actual file is 1163446 bytes), which doesn't shed any further light.  Somehow the remaining 200K is just not getting read.  Apache::Request doesn't seem to have a "debug" mode, and it doesn't warn or write anything to STDERR.

I also wrote a file upload receiver script in PHP, and submitted to that.  The bug is not reproducible.  So PHP's built-in HTTP POST parser seems to deal with the submission just fine.  That also points the finger at Apache::Request, but again, why only with Firefox 2.0 Win32 HTTPS?

I can't think of anything else to try, but I did send an e-mail to the author of Apache::Request (Stas Bekman, stas at stason dot org) and will reference this bug so he can take a look himself.
AH HA!  I think I may be onto something here.  I took a closer look at the Apache::Request UPLOAD_HOOK callback, and examined the number of times it was called, and the number of bytes sent to it for each call.  The results are VERY interesting, and show a GREAT variance between Firefox 2.0 Win32 HTTPS vs. HTTP vs. other browsers.  Check this out:

http://staging.sr.admission.net/joetemp/upload_hook_log_ff20_win32_https.txt
http://staging.sr.admission.net/joetemp/upload_hook_log_ff20_win32_http.txt

See the difference?  When the file is uploaded via Firefox 2.0 Win32 HTTPS, the upload hook calls start out with various sizes, then they get into a pattern that is 4095 bytes, followed by 1 byte, followed by 4095 bytes, followed by 1 byte, and so on.  Now compare that to the HTTP upload.  Weird, eh?  Here are some other browsers / OSes for further comparison:

http://staging.sr.admission.net/joetemp/upload_hook_log_ff1508_mac_https.txt
http://staging.sr.admission.net/joetemp/upload_hook_log_ie6_win32_https.txt
http://staging.sr.admission.net/joetemp/upload_hook_log_safari2_mac_https.txt

These results are VERY interesting.  The Firefox 2.0 Win32 HTTPS upload is the ONLY instance where the upload hook is called for 1 single byte every other iteration.  Could it be that Apache::Request is trying to read chunks from the socket that are "too small"?  So Firefox is "not ready" with the next packet so the socket select() call returns false for can_read() or something, and Apache::Request gives up?  Very, very interesting indeed.
Apache::Request appears to have a "hack" to deal with Mozilla 0.97.  I direct your attention to this snippet of code from inside libapreq (Apache::Request), apache_multipart_buffer.c, lines 65 - 70:

    if (bytes_to_read >= self->r->remaining) {
        bytes_to_read = self->r->remaining - strlen(self->boundary);
#ifdef DEBUG
        ap_log_rerror(MPB_ERROR, "mozilla 0.97 hack: '%ld'", self->r->remaining);
#endif
    }

And this from apache_request.c, Lines 582 - 588:

/* mozilla empty-file (missing CRLF) hack */
            fill_buffer(mbuff);
            if( strEQN(mbuff->buf_begin, mbuff->boundary, 
                      strlen(mbuff->boundary)) ) {
                r->remaining -= 2;
                continue; 
            }

I could be totally off base here, but there may be a possibility this "hack" is affecting Firefox 2.0's multi-part upload.  I'll try to recompile my Apache::Request module with the DEBUG flag set, and see if anything gets logged while reproducing the bug.
Oh boy.  Re-compiling libapreq with the DEBUG flag adds TONS of useful information to the Apache error_log.  Check this out.  This was captured while reproducing the bug (1.1 MB file upload via Firefox 2.0 Win32 HTTPS):

http://staging.sr.admission.net/joetemp/libapreq_debug_ff20_win32_https.txt

This has pretty much everything we need to know.  And the best part?  Look at the last line logged:

[Thu Dec 21 10:24:18 2006] [error] [client 10.40.9.22] [libapreq] silently drop remaining '254180' bytes

That is approximately the number of bytes missing from the file.  libapreq is dropping it!  Now, compare this against using HTTP (not HTTPS), and some other browsers / OSes:

http://staging.sr.admission.net/joetemp/libapreq_debug_ff20_win32_http.txt
http://staging.sr.admission.net/joetemp/libapreq_debug_ie6_win32_https.txt
http://staging.sr.admission.net/joetemp/libapreq_debug_ff1508_mac_https.txt
http://staging.sr.admission.net/joetemp/libapreq_debug_safari2_mac_https.txt

All these other logs look "clean" compared to the libapreq_debug_ff20_win32_https (which is the only place the bug is reproducible).  That log has all kinds of extra stuff.  Of interest is the number of times the "find_boundary" thing is logged.  Almost like the library is having trouble locating the boundary between multipart objects.
I took a peek at this.  What's happening is that occasionally we get a 0-byte payload (as can be seen in the above log).

Once that happens, multipart_buffer_read returns 0 and ApacheRequest_parse_multipart breaks out of the data-reading loop for that parameter and starts reading line-by-line looking for the next boundary/header (which it may or may not find since we're treating binary data as text, so all bets are off).

(In reply to comment #37)
It still looks like the core of the issue is the way output is being sent.  Not being a maven, I don't know if this odd 4095/1/4095/1 packet sequence is normal, but it sure looks unhealthy to me...

Darin, biesi, is there a way to log the sizes of the data buffers that HTTP passes to the downstream (NSS, in this case?) stuff?

Nelson, is there a way to log the sizes of the data buffers NSS is receiving and the packets it's creating?
Attached file trace file for Boris
Boris,

Here is a trace log from a typical run uploading the file 12-images.zip.

In addition to the usual log stuff, this log file includes the actual 
plain text and actual cipher text of the records, in hex and ASCII.  
Since the vast bulk of the data is binary (a large compressed zip file), 
I stripped out the data buffer dumps for most of the binary data.  That 
reduced this log to a managable size.  The unzipped log file is 4900 lines, 
~175 KB.

ECC is not used.  The cipher suite is TLS_DHE_RSA_WITH_AES_256_CBC_SHA.
So, yes, this server is using AES 256.

The lines in the log are grouped in blocks, one per call to SSL's
SecureSend function by the layer above it.  Each block begins with
a line that looks like
   SecureSend: sending NNNN bytes      (where NNNN varies)
and ends with a line that looks like
   SecureSend: returning NNNN count     (where NNNN varies)
or
   SecureSend: returning -1 count, error -5998  (that's WOULD_BLOCK)

In between those lines are other lines telling what libSSL did
between the call and the return.  Here is some interpretation of
those lines.

   sending NNNN bytes of saved data
      Means that before libSSL tried to do anything with the new
      data that is being passed down to it, it tried to send some
      data that was previously buffered after a previous write
      attempt to the underlying socket got a WOULD_BLOCK.
      If this attempt to send the previous buffered data also gets
      a WOULD_BLOCK, this will be reported immediately to the caller.

   SendRecord type: application_data (23) nIn=NNNN
      libSSL will attempt to take this much of the data being passed
      down and turn it into one or more SSL records, to be encrypted
      and sent.  In this log file, this amount will be either
      equal to, or exactly one less than, the amount the caller is
      passing down.

   Send record (plain text) [Len: NNNN]
      libSSL has prepared an SSL data record to send (not yet
      encrypted) containing this much data.  In this log, this amount
      will always be equal to the amount in the preceding line.

   send (encrypted) record data: [Len: NNNN]
      libSSL has padded and encrypted the SSL record, and put a 5-byte
      SSL record header on it.  In this log file, the encrypted record
      will generally be 26-57 bytes longer than the corresponding
      unencrypted record, because the SHA MAC is 20 bytes, padding is
      1-32 bytes, plus the 5-byte header.  The encrypted record will
      always have a size of 5+32*n.  This amount is generally written
      to the underlying socket.

   saving NNNN bytes of data (NNNN total saved so far)
      if the write of the encrypted record to the underlying socket
      got a "short write", or got WOULD_BLOCK, the unwritten portion
      is buffered, and we will attempt to send this remainder again
      at the next call to libSSL.  In this case, the amount returned
      to the caller will be 1 less than the amount the caller tried
      to write, or (if that would be zero), -1 with WOULD_BLOCK.
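As a cross-check of the record sizes above: assuming a 20-byte SHA-1 MAC, minimal CBC padding (1-16 bytes) to the 16-byte AES block, and the 5-byte record header, the expected encrypted record length can be computed as below.  This little program is illustrative only (not NSS code), but it reproduces the sizes seen in this trace.

#include <stdio.h>

/* Expected length of a TLS 1.0 AES-CBC/SHA application data record:
   payload plus 20-byte MAC, padded (1..16 bytes) up to a 16-byte
   block boundary, plus the 5-byte record header. */
static int record_len(int payload)
{
    int maced = payload + 20;
    int pad = 16 - (maced % 16);   /* always 1..16 */
    return 5 + maced + pad;
}

int main(void)
{
    /* Prints 4133 37, matching the encrypted record sizes observed
       for the 4095-byte and 1-byte payloads in this log. */
    printf("%d %d\n", record_len(4095), record_len(1));
    return 0;
}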
Attachment #251266 - Attachment mime type: application/octet-stream → application/zip
So if I see:

  SecureSend: sending 1 bytes

in the log, that means the caller of SecureSend only passed NSS one byte, right?  Is that caller just NSS itself?  I don't see any mention of SecureSend outside NSS...

Basically, I'd like to track down where this 1-byte thing is coming from in our code.
I wonder if

  SecureSend: sending 1 bytes

has to do with the patch in bug 80092 comment 47.
(In reply to comment #50)
> So if I see:   SecureSend: sending 1 bytes    in the log, that means the 
> caller of SecureSend only passed NSS one byte, right?  

Yes.  

> Is that caller just NSS itself?  I don't see any mention of SecureSend
> outside NSS...

On an SSL socket, SecureSend is called (indirectly, through the PRIOMethods table) from  PR_Write, PR_Send, PR_SendFile, PR_TransmitFile, and PR_Writev.  
That's why I asked, in comment 37, which of these functions is being called 
by the layer above.  

> Basically, I'd like to track down where this 1-byte thing is coming from 
> in our code.

I hope we can, too.  But I hasten to add that, while it is inefficient,
it is not wrong or incorrect.  This behavior is entirely allowed by http,
ssl, TCP, and what-have-you.  So, if the receiver cannot handle this data
stream simply because of the unusual way in which it is packetized, IMO,
that's not a bug we should try to fix by cobbling our own code.  

I'm all for making our code more efficient, and getting rid of those one-
byte writes for the sake of efficiency.  IINM, bug 137155 is precisely 
about that very issue, and this bug's problem was seen after the "fix" for 
that bug was backed out.

But I would hope we wouldn't change our code (assuming it is correct) just 
because of a bug in a third party server that tries to decode uploaded zip 
files on the fly (if that is indeed the cause).  The server's log file 
claims it received a zero-length input, yet the attached trace file shows 
no evidence that we ever sent an SSL application data record with zero data, 
and even if we did, such a record should simply be silently discarded by 
the recipient, not treated as some EOF-like event.  

For all I know, the real problem here has something to do with the contents 
of the particular zip file being uploaded, and is not the fault of SSL at all.
It may be that some combination of a particular compression data construct 
in the zip and some unusual pattern of the data packet sizes is causing the 
on-the-fly ZIP decoder to choke.
(In reply to comment #51)
> I wonder if  SecureSend: sending 1 bytes has to do with the patch in 
> bug 80092 comment 47.

Sometimes the 4095-4096 byte writes succeed, writing all the data without
would_block.  At other times, they return "short writes" for that case.
The behavior of the alternating 4095,1 byte writes is seen EVEN in the 
cases where the entire write succeeded without any short write occurring.

The scheme implemented for bug 80092 has the property that if the application 
does a large write, which is turned into an encrypted record, part of which
is unsent, we return a "short write" count to the caller, one less than the 
amount the caller requested.  The entire amount that the caller previously
passed has been committed to be sent (is in the SSL record).  To the caller
it appears that the last byte he tried to send was unsent, and he may try to 
write that last byte again immediately.  But when that happens, we will not
send that one byte in a new SSL record, because it has already been sent in 
a previous SSL record.  

In the log file, this is seen in a sequence like the following:

SecureSend: sending 160 bytes
SendRecord type: application_data (23) nIn=160
Send record (plain text) [Len: 160]
send (encrypted) record data: [Len: 197]
saving 197 bytes of data (197 total saved so far)
SecureSend: returning 159 count

SecureSend: sending 1 bytes
sending 197 bytes of saved data
SecureSend: returning -1 count, error -5998

SecureSend: sending 1 bytes
sending 197 bytes of saved data
SecureSend: returning 1 count

The first write of 160 bytes put all 160 bytes of application data into
the SSL record which was then queued to be sent, but we told the caller
that we sent only 159 bytes.  The caller tried to send that one byte again.
When the previous 160-byte record was sent, we returned 1 to the caller,
telling him that that last byte was finally sent, but we did not send that
last byte again in a new SSL record.  

This behavior does NOT result in SSL records being sent over the wire that
contain alternating large and small amounts of sent data.  The would_block
logic results in one large SSL record being sent for a series of calls,
large and small, by the caller.  

But this trace (and the server's log) clearly shows that SSL records were
being sent alternating in size between 4095 and 1 byte.  And from the 
trace file, we see that the caller was sending these alternating sizes
even when no would_block behavior was being seen (and also when would_block
behavior was being seen).  For example:

SecureSend: sending 4095 bytes
SendRecord type: application_data (23) nIn=4095
Send record (plain text) [Len: 4095]
send (encrypted) record data: [Len: 4133]
SecureSend: returning 4095 count

SecureSend: sending 1 bytes

Here we see a 4096-byte write being sent completely, on the first try,
in its entirety.  No short write, no previously saved data being sent,
no new data being saved.  But then immediately thereafter, the caller 
calls with 1 byte anyway.  This shows that the alternating 4095,1 
behavior was not caused by would_blocks and short writes.  
So the simple thing to do to find the caller seems to be to breakpoint in SecureSend, condition the breakpoint on the size being 1 byte, and look at the stack...
Yeah, boris, conditional breakpoint should reveal all.  I'm testing with 
debug builds of NSS but not debug browser builds.  

suggested experiments:
a) produce a file of all zero bytes, exactly the same length as 12-images.zip
and name it to be a zip file, and upload it in the same way.  Netlib and NSS
should treat it exactly the same as 12-images.zip.  I'd expect to see the 
same pattern of write sizes.  Does that fail?

b) rename the file from 12-images.zip to something else (e.g. 12-images.bin)
and upload it, not as a zip file, but as an appliation/octet-stream.  That 
should disable the on-the-fly zip file decoder at the receiver.  Of course, 
netlib and NSS won't treat the file any differently due to the changed name 
and MIME type.  I'd expect to see the same pattern of write sizes.  
Does that fail?
I did experiments like the ones I suggested above.  Here are the results:

-) copy 12-images.zip to 12-images.exe and upload it.  Different name,
different MIME type.  Uploaded length: exactly the same as for 12-images.zip,
namely Size: 883.99 K (905215 bytes)

-) using program dd, create a new file of exactly the same length as 
12-images.zip, but whose contents are NOT a zip file.  Named it 12-omages.zip.  
Uploaded it.  Result: Size: 19.99 K (20479 bytes)  truncated more severely!!   
Problem is sensitive to data content.  

Next I'll repeat that last experiment while running an SSL trace as before.
SSL trace for uploading file 12-omages.zip shows the same exact pattern 
of sizes of encrypted SSL record buffers, and essentially identical 
pattern of write sizes coming from layer above.  Yet the file was truncated
on the server in an entirely different place, after only ~20 KB.  
Netlib didn't behave differently, NSS didn't behave differently.  
The file contents were different.  
In reply to Comment #56: There is no zip decompression happening on the server.  The file type is ignored, and the file is stored in its original format.  However, Nelson's test using dd is very interesting!  I also created my own "blank" version of the file, containing exactly 1,163,446 "o" characters.  Here is a link to download it if anyone is interested:

http://staging.sr.admission.net/joetemp/12-omages.zip

Amazingly, I am UNABLE to reproduce the bug with this file!  It seems to upload just fine every time (I tried 15 times in a row).  Here is the Apache::Request (libapreq) debug output while uploading it:

http://staging.sr.admission.net/joetemp/12-omages-ff20-win32-https.txt

That is exactly what it SHOULD look like, discarding only 42 bytes at the end (which is normal).  But as soon as I switch back to the REAL 12-images.zip file, the behavior is 100% reproducible again (chopping off 200K).

On a side-note, I just upgraded my Mac to Firefox 2 (v2.0.0.1, up from v1.5.0.9), and the bug is NOT reproducible!  It remains reproducible ONLY on the Win32 version of Firefox 2.0.
Wanted, along with bug 137155.
Flags: blocking1.9? → blocking1.9-
Whiteboard: [wanted-1.9]
We are having the exact same problem (incomplete uploads of files through ssl).

On Windows, Firefox 2.0.0.1 works fine, but 2.0.0.2 does not.
On Windows, Firefox 1.5.0.9 works fine, but 1.5.0.10 does not.

Works fine on Mac and Fedora Core.

Works fine on Windows for any Firefox version when not using ssl.

I have been able to replicate the problem with smaller files (80 k).
OK.  So we're calling nsSSLThread::requestWrite with an |amount| of 1.  This ends up setting the members that eventually get passed to ssl_SecureSend (by writing to a socket, looks like?).

In any case, the caller of nsSSLThread::requestWrite is nsSocketOutputStream::Write.  The caller of that is nsHttpConnection::OnReadSegment.

Tracing further back, we end up in nsBufferedInputStream::ReadSegments, where we have:

    PRUint32 amt = PR_MIN(count, mFillPoint - mCursor);

(gdb) p count
$9 = 4096
(gdb) p mFillPoint
$10 = 8192
(gdb) p mCursor
$11 = 8191

The next time through, we have:

(gdb) p count
$17 = 4095
(gdb) p mFillPoint
$18 = 8192
(gdb) p mCursor
$19 = 0

because we subtracted the 1 byte we read from |count|.

nsBufferedInputStream::ReadSegments is called by nsMultiplexInputStream::ReadSegments, which always passes 4096 as the count.

Tracing further back through several levels of MIME and multiplex input stream ReadSegments nesting, we get to nsHttpTransaction::ReadSegments, which passes nsHttpTransaction::ReadRequestSegment as the callback function...  This just writes the data out to the socket.  So the "real issue" is somehow the interaction of all these buffered and multiplex streams...
So I tried tracking the calls to nsSSLThread::requestWrite:

The first call comes from:

nsStringInputStream::ReadSegments
nsMultiplexInputStream::ReadSegments
nsHttpTransaction::ReadSegments

This is basically the headers for the POST.  This is 460 bytes.

Next we have 102 bytes (the start of the multipart body) with the stack:

nsStringInputStream::ReadSegments
nsMultiplexInputStream::ReadSegments
nsMIMEInputStream::ReadSegments
nsMultiplexInputStream::ReadSegments
nsHttpTransaction::ReadSegments

Then 27 bytes for the Content-Length header for the file part, with basically the same stack.

Then 165 bytes for the rest of the headers for the file part.

Then 3931 bytes of the file.  Note that 3931 + 165 = 4096.

Then I see the 165 and 3931 alternate for the rest of the file upload.

Basically, as far as I can see, this thing ends up hitting the buffered stream around the file at some offset (which depends on the exact headers being sent, possibly on the part separator, etc), and then oscillates in the amount of data it sends.

So to fix the oscillation, we either need to buffer up the data somewhere (which is what bug 137155 had effectively done) or someone (the multiplex input stream?) needs to be smarter about how they do things.  In particular, instead of spanning read requests across the sub-streams, it could just return early when it gets to the end of a sub-stream....  That way we'd always start the file upload with a clean read of 4096 bytes.  Given that every single call to the callback does a separate network socket write anyway, there's really no win to trying to span across to the next stream, imo.  It might even be more efficient to do this, in general, than what we do now.

Darin, Christian, what do you think?
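To make the suggestion concrete, here is a sketch of the early-return idea with a hypothetical API (not the real nsMultiplexInputStream interface): when the current sub-stream runs dry, return what we have instead of topping up from the next sub-stream, so each subsequent read starts aligned at a sub-stream boundary.

#include <stddef.h>

typedef size_t (*readf)(void *stream, char *buf, size_t count);

/* Read up to `count` bytes, but never span a sub-stream boundary. */
size_t multiplex_read(void **streams, size_t nstreams, size_t *current,
                      readf read_one, char *buf, size_t count)
{
    while (*current < nstreams) {
        size_t n = read_one(streams[*current], buf, count);
        if (n > 0)
            return n;   /* stop at the sub-stream boundary */
        ++*current;     /* sub-stream exhausted; move to the next */
    }
    return 0;           /* all sub-streams done */
}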
Boris,  Thanks for doing all that digging.  
I have a question (er two or three) about what you've found.  

You've found code that causes alternating big and little writes, 
and that's suboptimal, but is it wrong?  

Have you found anything that would cause data to be lost, or repeated,
or the wrong data to be sent?  

IOW, do we have any reason to believe that the behavior you found is 
really responsible for the data loss reported by users of various servers
in this bug?
> and that's suboptimal, but is it wrong?  

As far as I can tell, no.

I'm still not quite certain under what conditions Apache is dropping data here, but I was assuming based on the previous comments that the 1-byte packets were relevant there.  Rereading, this might not have been a correct assumption....
Would anyone happen to know whether or not this bug pops up when using mod_perl 2.0 on Apache 2.x instead of the Apache 1.3.x and mod_perl 1.29 config that both I and the original reporter were using? If the upgrade will fix this issue, I might have to pursue that route, since this is becoming quite a headache for our app.

Thanks in advance.
Tim, 
We have not yet identified ANY erroneous behavior in the browser 
(related to this bug).

The possibility that this error is due to *server* software has not been
eliminated, IMO.  So, if you have a fix for this issue in the form of a 
server software upgrade, I think that would be a worthwhile pursuit. 
If you try that, please let us know the results.
Thanks for the reply. I'll likely try the server-side upgrade this weekend to see if that helps at all.

And just so I'm not misunderstanding you: did you mean to say that you don't believe this to be a bug in the browser itself, but rather a problem with the server-side software that has merely been exposed by changes in Mozilla? I suppose that could be the case, but only if every other browser is simply letting the server-side issues slip through without notice. Opera 9, IE (5, 6 & 7), Safari, and Konqueror all successfully handle the upload without issue. Only starting with Firefox 2.0 (and now, unfortunately, 1.5.0.10) did this issue come to light. Would a better place to look for the issue perhaps be the changes made to the 1.8.0.x branch between .9 and .10? 1.5.0.9 and earlier all worked without any problems at all, but starting with the recent .10 upgrade, even the older versions of Firefox have been exhibiting this bug on our end. It pains me that I've had no option but to recommend that our users use Opera, Safari, or IE in place of Firefox if they're looking to successfully upload files.

I don't mean to seem ungrateful for Boris' work in diagnosis, just trying to make clear that this IS an issue that has manifested itself in recent Firefox versions, most specifically on the change from .9 to .10 of the 1.5.0 branch.

Thanks again, and I'll keep you posted on my results with the mod_perl upgrade.
Gavin already looked into that.  See comment 19 and comment 30 for his conclusions.

Which raises an interesting point: Nelson, the NSS update changed _something_ here.  Any idea what?  I really don't have the knowledge I'd need to debug NSS effectively, but I think figuring out what part of bug 337770 causes the regression here (by local backout if needed, which requires knowing which parts depend on each other) would help determine whether this is a necko problem, NSS problem, or server problem...

Is there a way to use a by-date pull off the development NSS branch with a given Gecko build?  Ideally we'd find which exact checkin between NSS_3_11_1_RTM and NSS_3_11_20060512_TAG  that way.

Nelson, is comment 41 correct in its assessment, or would you guys be willing to try to figure out which of your changes had an effect here?
Boris, try backing out the NSS patch described in comment 51.
I've written and rewritten lots of comments in reply to the above questions,
but each time I do, I find that I have merely repeated and elaborated on
comments I've already made in this bug.  

The patch for bug 80092 causes the behavior described in comment 54.
Perhaps it would be good to re-read comment 54, all the way through.

Yes, NSS will sometimes return a write count that is 1 less than the
amount the caller was trying to write.  But as seen MANY times in
the attached trace, the alternating 1B,4KB pattern is seen even in 
the more typical case where NSS returns the full write count.

If the patch for bug 80092 is found to be INCORRECT, introducing 
ERRONEOUS behavior, then we should change it, else we should NOT. 

Please re-read comment 57 and comment 58.  The truncation problem 
is sensitive to file CONTENTS.  NSS/SSL is sensitive only to the 
amounts of data written in each call, not to the contents.  
NSS treats the application data as an opaque byte stream.
What code is sensitive to contents?  

In reply to comment 65, Boris, I think the 1-byte packets are exactly
the issue.  I think they are triggering the problem.  I think that if
you eliminate them, by buffering the data above SSL and doing writes
of larger sizes, the problem will vanish.  If you really want to 
change FireFox to eliminate this problem, whether FireFox is at fault 
or not, then by all means, implement that buffering.  

But I do not think that 1 byte writes are an error, they are not *wrong*.  
NSS handles them fine. I suspect strongly that the server does not.
> Yes, NSS will sometimes return a write count that is 1 less than the
> amount the caller was trying to write.

I assume it never does this for a 1-byte write, right?

> If the patch for bug 80092 is found to be INCORRECT

My point is that I don't know enough about either SSL or the NSS APIs to evaluate its correctness.  Furthermore, if that patch doesn't cause the alternating pattern per se (as you say), and the alternating pattern is responsible for the issue (as you say), then what does that patch have to do with anything?

I'll try backing it out when I get back if no one has a chance to try that before then, I guess...  But I still think that it would make more sense if someone more familiar with NSS debugged what happens at that level.

> What code is sensitive to contents?  

Apache, as far as I can tell.  See comments 45 and 46.

> I think that if you eliminate them, by buffering the data above SSL

Then I reintroduce bug 334142.  Or is that bug gone with the buffering and the new SSL code?  I can't tell; can you?

> If you really want to change Firefox to eliminate this problem,

I really want us not to make significant web compat changes like this on stable branches.  From where I stand, an NSS change (which is probably correct in its own right) which broke sites was made on branch.  So the options I see are basically to try to somehow work around it in a layer above NSS (without a clear idea of what NSS is really doing here, at least on my end), to revert to the previous NSS tag, or to accept that we made an incompatible change that's breaking sites on a stable branch.  I don't really like any of those options, to be honest.  Care to propose others?
I attached this patch to help us find the truth, not to
suggest that the NSS patch in bug 80092 comment 47 is
incorrect.

This patch backs out the patch in bug 80092 comment 47
(The original patch can't be backed out cleanly because
more changes have been made on top of it.)

After I applied this patch to Firefox 2.0.0.2, I can upload
the test file with the correct file size.  This experiment
showed that the patch in bug 80092 comment 47 is the NSS
change Boris is looking for.

The bug is most likely on the server side.  The interaction
between Firefox and this NSS change causes Firefox to upload
data to this server in a different but correct sequence of
chunks.  The server should be able to handle that.
I'm changing the platform and OS to "All" and "All". Been getting numerous reports from my users now that this is also happening on both Linux (Fedora Core 5) and MacOS 10.4.x with Firefox 2.0.0.x.
OS: Windows XP → All
Hardware: PC → All
Bug 369414 is a duplicate of this bug, but I am skeptical about whether the reported similar problems on Mac and Linux are indeed the same bug.

In addition to the three apparently necessary conditions of Firefox 2, Windows, and https, I believe this bug also only occurs when uploading to a server running libapreq1 (i.e. mod_perl on Apache 1.3).

File uploads are being truncated after 0A bytes that occur immediately before a 1K boundary (or possibly 2K or 4K; 4K makes the most sense).  In other words, it seems that all the truncated file sizes are 1 byte shy of being a perfect multiple of 4096 (for example, 905215 = 221 * 4096 - 1 and 20479 = 5 * 4096 - 1), and 0A (that is, a line feed) is the final byte in these files.

Why only Firefox 2 Windows?  I don't know, but I have observed the "4095, 1, 4095, 1" buffer read pattern as well, and I think when the single byte read gets a 0A byte, it's triggering a server bug with this function in c/apache_multipart_buffer.c in libapreq1:

/*
  gets the next CRLF terminated line from the input buffer.
  if it doesn't find a CRLF, and the buffer isn't completely full, returns
  NULL; otherwise, returns the beginning of the null-terminated line,
  minus the CRLF.

  note that we really just look for LF terminated lines. this works
  around a bug in internet explorer for the macintosh which sends mime
  boundaries that are only LF terminated when you use an image submit
  button in a multipart/form-data form.
 */
char* next_line(multipart_buffer *self)
{
    /* look for LF in the data */
    char* line = self->buf_begin;
    char* ptr = memchr(self->buf_begin, '\n', self->bytes_in_buffer);

    /* LF found */
    if(ptr) {
	/* terminate the string, remove CRLF */
	if((ptr - line) > 0 && *(ptr-1) == '\r') *(ptr-1) = 0;
	else *ptr = 0;

	/* bump the pointer */
	self->buf_begin = ptr + 1;
	self->bytes_in_buffer -= (self->buf_begin - line);
    }

    /* no LF found */
    else {
	/* buffer isn't completely full, fail */
	if(self->bytes_in_buffer < self->bufsize)
	  return NULL;

	/* return entire buffer as a partial line */
	line[self->bufsize] = 0;
	self->buf_begin = ptr;
	self->bytes_in_buffer = 0;
    }

    return line;
}
In any case, this is Apache::Request code, not Mozilla code, and although I haven't been able to confirm that this is the cause of the problem, it certainly sounds promising to me.
The apreq developers are planning a maintenance release of
libapreq1.  This version primarily addresses an issue noted
with FireFox 2.0 truncating file uploads in SSL mode.

Please give the tarball at

http://people.apache.org/~issac/libapreq-1.34-RC1.tar.gz

a try and report comments/problems/etc. to the apreq-dev list
at apreq-dev@httpd.apache.org. 
I can happily report that the linked libapreq-1.34-RC1 DOES in fact clear up this issue for me. Many thanks to all involved in getting a fix put out for this rather frustrating issue.
Issac, may I trouble you to attach to this bug a diff between the previous 
libapreq source and this new one?  Please make it a "unified diff" (from 
diff -u  or diff -u10).  Thanks.
So I tried to eliminate the oscillation with this patch.  I'm not sure it works... but maybe.

With this patch, I see the following pattern of calls:

1)  Attempt to write 4096 bytes via nsSSLThread::requestWrite
2)  nsSSLThread::requestWrite only accepts 3931 bytes
3)  Attempt to write the remaining 163 bytes via nsSSLThread::requestWrite
4)  nsSSLThread::requestWrite accepts none of these
5)  Attempt to write 4096 bytes via nsSSLThread::requestWrite
6)  nsSSLThread::requestWrite only accepts 193 bytes
7)  Attempt to write the remaining 3931 bytes via nsSSLThread::requestWrite
8)  nsSSLThread::requestWrite accepts none of these

repeated over and over.  This is perhaps better than what we used to have; without this patch we were making some extra 193-byte calls into requestWrite between steps 4 and 5 (and requestWrite was accepting these 193 bytes).  But I don't know that it's actually changing what goes on the wire...

Nelson, do you think this would prevent the oscillation in packet sizes?
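
For reference, here is roughly what "buffering the data above SSL" could look like (a minimal sketch under assumed names -- ssl_write() stands in for a lower-level write such as nsSSLThread::requestWrite; this is not the attached patch):

/* Hedged sketch: coalesce outgoing bytes into full 4096-byte chunks
 * so the alternating 4095/1 write pattern never reaches the SSL
 * layer.  A final flush (not shown) would send any trailing partial
 * chunk at the end of the upload. */
#include <string.h>

#define CHUNK 4096

typedef struct {
    char buf[CHUNK];
    int  used;
} coalescer;

extern int ssl_write(const char *data, int len); /* assumed lower layer */

int coalesced_write(coalescer *c, const char *data, int len)
{
    int consumed = 0;
    while (consumed < len) {
        int space = CHUNK - c->used;
        int n = (len - consumed < space) ? (len - consumed) : space;
        memcpy(c->buf + c->used, data + consumed, n);
        c->used += n;
        consumed += n;
        if (c->used == CHUNK) {
            if (ssl_write(c->buf, CHUNK) != CHUNK)
                return -1; /* simplistic error handling */
            c->used = 0;
        }
    }
    return consumed;
}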
Nelson,
For reference, a patch for this specific issue can be found at http://svn.apache.org/viewvc/httpd/apreq/branches/1.x/c/apache_multipart_buffer.c?r1=531272&r2=531273&pathrev=531273&view=patch

Barring any surprises, this new version will be released later this week as stable legacy, and for best performance all legacy users will be encouraged to upgrade to this version (if not to the latest releases of version 2.2 of the Apache HTTP Server and libapreq2, which, to the best of my knowledge, is not affected by this issue).
In comment 75, ralp.no.h@gmail.com suggests that the problem occurs when FF 
does a one-byte write of the character 0x0A (a.k.a. \n).  Bug 369414 seems 
to confirm.  Bug 369414 shows that the first character AFTER the point of 
truncation, the first character lost due to truncation, is 0x0A.  I repeated
the test with the original 12-images.zip file and found that with it also,
the first character lost due to truncation is 0x0A.  

I made a file of 2^20 pairs of \r\n ( 0D 0A ), and tried to upload it. 
It was truncated after only 20479 bytes, and the first character lost 
due to truncation was 0A.

I also made a file of 2^20 0x0A bytes, and tested with it, thinking that 
it would also hit the problem early in the file, perhaps after even fewer
characters than the file of 0D0A pairs.  To my surprise, that entire file
transferred without truncation.  So apparently that single-byte 0A write
must follow a write that did NOT end in 0A.
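
For anyone repeating the experiment, the two test files described above can be generated with something like this (a reconstruction, not the tool actually used; file names are arbitrary):

/* Hedged sketch: writes crlf.bin (2^20 "\r\n" pairs, i.e. 2^21
 * bytes) and lf.bin (2^20 0x0A bytes). */
#include <stdio.h>

int main(void)
{
    FILE *f;
    long i;

    f = fopen("crlf.bin", "wb");
    if (!f) return 1;
    for (i = 0; i < (1L << 20); i++)
        fwrite("\r\n", 1, 2, f);
    fclose(f);

    f = fopen("lf.bin", "wb");
    if (!f) return 1;
    for (i = 0; i < (1L << 20); i++)
        fputc(0x0A, f);
    fclose(f);

    return 0;
}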

Curiously, the server seems to receive the whole file, and THEN do the
truncation.  The upload transfer doesn't stop at the point of truncation.
All 2^20 or 2^21 characters seem to get sent.  Only a fraction are retained.

This explains why Joseph Huckaby was unable to reproduce the problem
with a file of 1.2 million 'o' characters.  The problem only occurs in the
presence of a 0x0A character, and his file of 'o' characters had none.

Assuming that ralp.no.h@gmail.com's analysis is correct (which seems to be
pretty well confirmed here), a patch that prevents FF from ever sending a
single byte SSL record containing the byte 0x0A should eliminate the 
server's truncation.  
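
A minimal sketch of that mitigation, under assumed names (ssl_write() again stands in for the lower-level write; this is not actual Firefox code):

/* Hedged sketch: refuse to emit a single-byte SSL record containing
 * 0x0A.  Returning 0 mimics a short write, so the caller retries
 * later with the deferred byte plus whatever data follows it. */
extern int ssl_write(const char *data, int len); /* assumed lower layer */

int lf_safe_write(const char *data, int len)
{
    if (len == 1 && data[0] == 0x0A)
        return 0; /* hold the lone line feed back for coalescing */
    return ssl_write(data, len);
}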

I'll reply to Boris's comment 80 separately.
Boris, in light of Issac's comment 82, should we just resolve this
WORKSFORME or WONTFIX?  (The proposed NOTOURBUG resolution would come in
mighty handy here.)
Sure, but then we should file a separate bug on our suboptimal packet sizes -- sending 4096-byte packets should be a lot faster than sending alternating 1-byte and 4095-byte packets, if we can manage it.  So I'd still want an answer to my question.

Filing a separate bug for that is the right thing to do, probably.
> a patch that prevents FF from ever sending a 
> single byte SSL record containing the byte 0x0A should eliminate the 
> server's truncation.

This would correct the behavior, but I don't think it's necessary.  I tracked down and managed to hack around the real culprit within libapreq on the server (although it wasn't specifically the code I pasted above), and a thread on the Apache mailing list prompted development of a proper patch by the real devs (thanks!), which Issac has linked to in comment 82.

I would recommend resolving this bug as INVALID (and bug 369414 as DUPLICATE), since the 4095/1/4095/1 behavior isn't necessarily *wrong*, it's just really *dumb*, and it happens to be really good at triggering a real bug on the server-side (which is fixed, thanks to the comments on this bug).

I agree with Boris that a separate bug (Enhancement) probably ought to be opened for the inefficient packet size pattern.
Comment on attachment 262592 [details] [diff] [review]
Actual attempt to eliminate oscillation -- other patch is an impostor

>+                // suboptimal if each callback call writes to a socker

A typo in the comment.  Change "socker" to "socket".

I'm glad we are getting to the bottom of this.
Re: Comment 85, if your objective is to speed up uploads, then you should 
also increase the minimum write size to a much larger value for all writes
except the last.  I might suggest 64KB.  The reason is that writes of any
amount less than the TCP transmit window size are subject to Nagle delays.
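
As a general illustration of the Nagle interaction (standard BSD sockets usage, not Firefox code), a sender can also sidestep the delay by disabling Nagle's algorithm on the socket, though batching writes as suggested above avoids the small-write penalty in the first place:

/* Standard way to disable Nagle's algorithm on a TCP socket. */
#include <netinet/in.h>
#include <netinet/tcp.h>
#include <sys/socket.h>

int disable_nagle(int fd)
{
    int one = 1;
    return setsockopt(fd, IPPROTO_TCP, TCP_NODELAY, &one, sizeof(one));
}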
Filed bug 378629.  Thank you, everyone, for the help in getting to the bottom of this and for getting it fixed in Apache!
Status: NEW → RESOLVED
Closed: 17 years ago
Resolution: --- → INVALID
As of today, is the server at
  https://staging.sr.admission.net/joetemp/file_upload.html
running a broken or a fixed apache server?

I am unable to reproduce an incomplete upload with the provided file and latest 1.8 branch.

It would be interesting to have access to a broken server, while I work on improvement bug 378629.
Comment on attachment 262592 [details] [diff] [review]
Actual attempt to eliminate oscillation -- other patch is an impostor

This patch is wrong, because it violates the nsIInputStream API contract.
Attachment #262592 - Attachment is obsolete: true
Kai: Yes, the staging.sr.admission.net server still has the original, unpatched libapreq 1.33, and I can still reproduce the bug.  Just tried this morning: when I upload the "12-images.zip" file (1,163,446 bytes), only 905,215 bytes arrive on the server, as shown on the result page.  I used Firefox 2.0.0.1 on WinXP SP2, HTTPS of course.

User-Agent: Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.1.1) Gecko/20061204 Firefox/2.0.0.1
(In reply to comment #91)
> I am unable to reproduce an incomplete upload with the provided file and latest
> 1.8 branch.

(In reply to comment #93)
> Kai: Yes, the staging.sr.admission.net server still has the original, unpatched
> libqpreq 1.33, and I can still reproduce the bug.


I forgot. Linux always works for me.

For whatever reason, I can reproduce this bug on Windows only!

Joseph, thanks for keeping up the broken server.
It's an important help for me.
Flags: wanted1.8.1.x+
Whiteboard: [wanted-1.9]
If you experienced this bug and would like to see an improvement in stable versions of Firefox, you might want to grab a nightly experimental trunk build and provide feedback in bug 378629.
Whiteboard: ssl
Depends on: 705755