Bug 356470 (Closed)
Opened 18 years ago; closed 17 years ago
HTTPS File Upload results in partial and/or corrupted files on server
Categories: Core :: Networking, defect
Status: RESOLVED INVALID
People: Reporter: jhuckaby; Assigned: Dolske
Keywords: dataloss, regression; Whiteboard: ssl
Attachments (2 files, 2 obsolete files):
  12.64 KB, application/zip
  28.53 KB, patch
User-Agent: Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; .NET CLR 1.1.4322)
Build Identifier: Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.1) Gecko/20061003 Firefox/2.0

When I upload a large (1 MB) file using a simple form and HTTPS, the file arrives on the server smaller (800K) than it should be. This same upload works fine in Firefox 1.5, MSIE 6, Safari 2.0, etc. The bug only occurs in Firefox 2.0 RC2 Win32, and only under HTTPS.

Reproducible: Always

Steps to Reproduce:
1. Visit this page: https://staging.sr.admission.net/joetemp/file_upload.html
2. Upload a large (1 MB) file

Actual Results: See the results page, which lists the file size under "File Uploads"; it will be smaller than it should be. My test file is 1163446 bytes, and with Firefox 2.0 RC2 it arrives on the server with a size of 905215 bytes.

Expected Results: The file size on the server should be the actual size of the file.

My server-side script uses mod_perl 1.29 on Apache 1.3.37 under Fedora Core 4, i386 arch. The Perl module used to parse the HTTP POST is Apache::Request version 1.33. Source code available on request: jhuckaby at gmail dot com.

The Content-Length header that Firefox is sending seems to be correct, so there must be something wrong with the file section of the multi-part HTTP POST request.

Oh, last thing. This bug is NOT reproducible in the Mac OS X version of Firefox 2.0 RC2. It only occurs in the Windows version.
Reporter
Comment 1•18 years ago
Here is the file I used to reproduce the bug (1163446 bytes): http://staging.sr.admission.net/joetemp/12-images.zip
Version: unspecified → 2.0 Branch
Comment 2•18 years ago
It appears that this bug is reproducible under the testcase provided using "Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.1) Gecko/20061012 BonEcho/2.0". I can't seem to reproduce with other files, though. Perhaps this only occurs in a very small range of file sizes? It "might" be useful to test it further by uploading the file via HTTPS, then redownloading it to see if the file is intact. Have you, the reporter, ever uploaded a file via HTTPS in a real-life situation and found it to be corrupted or incomplete when you downloaded it?
Reporter
Comment 3•18 years ago
Yes, using the test file provided, I did upload, then re-download the file, and it is indeed corrupted. The file size of the re-downloaded file is much smaller than the original (883K != 1136K), WinZip reports "This file is not a valid archive", and gunzip reports "Cannot find EOCD signature". I have also reproduced this with another file that is 900K.

I have now tried several other browsers / OSes, and can ONLY reproduce this with Firefox 2.0 RC2 Win32 with HTTPS (HTTP works fine). I have tried: Firefox 2.0 RC2 Mac OS X, Firefox 1.5.0.6 Win32, Firefox 1.5.0.6 Mac OS X, Safari 2.0.4 Mac OS X, Netscape 7.2 Win32, Netscape 7.2 Mac OS X, MSIE 6.0 Win32 SP1, MSIE 6.0 Win32 SP2. None of these other browsers has the bug.
Comment 4•18 years ago
I can also reproduce using a 968KB .exe file. When uploaded, it becomes just 263.99K. Moving to "critical" severity and confirming...
Severity: normal → critical
Status: UNCONFIRMED → NEW
Ever confirmed: true
Comment 6•18 years ago
(In reply to comment #5)
> This isn't a critical bug...

Why isn't this considered a critical bug? It results in loss of data. This functionality works perfectly fine on earlier (pre-2.0) versions of Firefox and is now broken. This potentially affects hundreds of file upload services, including the many photo upload services where files are likely to be large due to the high resolution of digital cameras. For our customers alone this will affect dozens of major newspaper sites posting classified ads.
Comment 7•18 years ago
A critical bug is a bug that causes a crash, hang, or loss of data. While this bug apparently results in some data not being uploaded for certain files, it doesn't fall into the "critical" category. I'm not trying to dismiss this bug as unimportant - it seems pretty bad, in fact, more so than I'd initially realized when I first read the bug. Jackie has graciously offered to try and find a narrower regression range, by testing mozilla-1.8 builds at http://archive.mozilla.org/pub/firefox/nightly/ and/or http://ftp.mozilla.org/pub/mozilla.org/firefox/nightly/. That's probably the best way forward.
Comment 8•18 years ago
Alpha 1 - WFM "Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8) Gecko/20060321 Firefox/2.0a1"
Alpha 2 - WFM "Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.1a2) Gecko/20060512 BonEcho/2.0a2"
Alpha 3 - WFM "Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.1a3) Gecko/20060526 BonEcho/2.0a3"
Beta 1 - Doesn't work "Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.1b1) Gecko/20060710 Firefox/2.0b1"

It probably broke somewhere between Alpha 3 and Beta 1.
Comment 9•18 years ago
I have pinpointed a regression window for this bug, as follows:

Works in "Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.1a3) Gecko/20060626 BonEcho/2.0a3"
Broken in "Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.1a3) Gecko/20060628 BonEcho/2.0a3"

There seems to be no 2006-06-27 build available!
Comment 10•18 years ago
(In reply to comment #4)
> I can also reproduce using a 968KB .exe file. When uploaded, it becomes just
> 263.99K. Moving to "critical" severity and confirming...

This file can be accessed by...
1. Go to http://ideas.live.com/signup.aspx?versionId=b2456790-90e6-4d28-9219-5d7207d94d45
2. Enter your e-mail address.
3. Download the file by clicking on the "Go to Windows Live OneCare Beta v1.5 Setup" link.
4. Try to reproduce the bug with the file you just downloaded.
Comment 11•18 years ago
Good work Jackie! I can confirm that there was a change in behavior in that range.

Bonsai checkins:
http://bonsai.mozilla.org/cvsquery.cgi?module=PhoenixTinderbox&branch=MOZILLA_1_8_BRANCH&branchtype=match&date=explicit&mindate=2006-06-26+01%3A00&maxdate=2006-06-28+04%3A00

mozilla/netwerk checkins:
http://bonsai.mozilla.org/cvsquery.cgi?branch=MOZILLA_1_8_BRANCH&branchtype=match&dir=mozilla%2Fnetwerk&date=explicit&mindate=1151308800&maxdate=1151492400

I'll try backing out some of the patches that likely caused this.
Comment 12•18 years ago
Gavin, does backing out bug 334142 change anything?
Comment 13•18 years ago
(In reply to comment #12)
> Gavin, does backing out bug 334142 change anything?

Hmm, backing out the patch from bug 334142 (which re-fixes bug 137155) has the weird effect of making the server-reported size vary seemingly randomly. These tests were made with the 1,163,446-byte file from comment 1:

Current branch: server always reports 905,215 bytes
Current branch with bug 137155 fixed (bug 334142 backed out): server reports different sizes each time
Comment 14•18 years ago
Other data points: I can't reproduce this bug using a 1.5.0.7 build (which also has the patch for bug 334142), but I can reproduce it on the trunk. I'll get a trunk regression range to try to narrow down the 1.8 range above.
Comment 15•18 years ago
The trunk regression range is: http://bonsai.mozilla.org/cvsquery.cgi?module=PhoenixTinderbox&date=explicit&mindate=2006-06-22+02%3A00&maxdate=2006-06-23+05%3A00 Which also includes bug 331825 and bug 334142. I tried backing out bug 331825 and it didn't have any effect. I guess the next step is to try pulling by date around that time to make sure that subsequent changes aren't interfering with my testing. I'll do that next.
Comment 16•18 years ago
Gavin, thanks for looking into this! You could also try backing out both of those bugs together to see whether it's the combination of the two patches that causes issues...
Flags: blocking-firefox2?
Keywords: dataloss
Comment 17•18 years ago
Backing out both netwerk patches has the same effect as just backing out bug 334142. The only other checkin common to the two regression ranges is bug 235773 (enable ECC). Since this seems to only affect HTTPS (Joseph or Robert, can you confirm this?), that might actually be the culprit. If it is related to ECC (building now to confirm), that may explain why this problem wasn't identified before: it would only affect SSL, and even then perhaps only on certain servers depending on their SSL setup.
Reporter
Comment 18•18 years ago
Yes, this only affects HTTPS file uploads. I cannot reproduce the bug with plain HTTP.
Comment 19•18 years ago
This bug seems to have multiple causes. Pulling by date from the 1.8 branch just before the patch for bug 334142 landed works, right after fails, so this bug was once caused by the backout of bug 137155. However, a build from before the patch for bug 137155 landed works. So something must have landed between bug 137155 (2006-02-08 09:39) and bug 334142 (2006-06-26 18:21) that depended on the fix for bug 137155 to not cause this bug. I'll try to identify which change that is.
Comment 20•18 years ago
Is this issue specific to the server? I'm curious why, if this has been on the trunk since June, we haven't yet gotten other reports...
Assignee
Comment 21•18 years ago
Can this be reproduced on any other server? [Joseph, I'll send you an email request for your Perl code.] I'll also do a little network sniffing to see what's going on.
Comment 22•18 years ago
The following information is critical to helping drivers decide the impact of this bug:

- does it affect marquee sites with https upload (what marquee sites have https upload?)
- does it affect ALL https upload, or just https w/ECC?
- what range of file sizes does it affect? 900KB-1.2MB? Anything over a certain size?
- what other effects might this have for https traffic? Will it affect downloads?

Thanks.
Comment 23•18 years ago
(In reply to comment #22)
> - does it affect ALL https upload, or just https w/ECC?

Given the findings from comment 19, I'm no longer very confident that my initial hypothesis that this bug is related to ECC was correct.

> - what range of file sizes does it affect? 900KB-1.2MB? Anything over a certain size?

I think failure depends on the size of the file relative to the size of the packets being sent. I can reproduce with the file from comment 1 (1,163,446 bytes), but not with a file that's 2,097,152 bytes.
Comment 24•18 years ago
Thinking out loud here. Has there perhaps been a change that caused the NSPR sockets used for file upload to be non-blocking, where they were blocking sockets before? Is it possible that the code for uploading those files assumes that the sockets are blocking, but they are now non-blocking?

Blocking NSPR sockets have the property that they never return short writes. That is, when a write function is called to write a fixed amount, it returns either that fixed amount or -1, but no value in between. However, a non-blocking NSPR socket may return any non-zero value between -1 and the amount specified to be written. If code assumed that any return value other than -1 was equivalent to having written the full amount requested (i.e., if the code only checked the return value for -1), that might be OK for blocking NSPR sockets but not for non-blocking ones. This applies to SSL sockets, too.

Given that the reported results are that the file transferred is shorter than the one sent, this makes me suspect that the code is not handling "short writes" properly. Just a guess. This hypothesis doesn't explain why 1MB files fail and 2MB files succeed!
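For readers unfamiliar with the failure mode Nelson describes, here is a minimal sketch in plain C with a simulated non-blocking sink (the `send_fn`, `short_writer`, and `send_all*` names are hypothetical illustrations, not Mozilla code): a sender that checks the return value only for -1 silently drops the unsent tail on a short write, while a correct sender loops until every byte is accepted.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* A write callback in the style of PR_Write on a non-blocking socket:
   returns how many bytes were accepted (possibly fewer than len), or
   -1 on error. */
typedef long (*send_fn)(void *ctx, const char *buf, size_t len);

/* Simulated non-blocking sink that accepts at most 3 bytes per call
   (an exaggerated "short write"). */
struct sink { char data[64]; size_t used; };

static long short_writer(void *ctx, const char *buf, size_t len)
{
    struct sink *s = ctx;
    size_t n = len > 3 ? 3 : len;
    memcpy(s->data + s->used, buf, n);
    s->used += n;
    return (long)n;
}

/* Buggy pattern: only checks for -1, so a short write silently
   truncates the data -- the file arrives shorter than it was sent. */
static long send_all_buggy(void *ctx, send_fn fn, const char *buf, size_t len)
{
    return (fn(ctx, buf, len) < 0) ? -1 : (long)len;
}

/* Correct pattern: loop until every byte has been accepted. */
static long send_all(void *ctx, send_fn fn, const char *buf, size_t len)
{
    size_t sent = 0;
    while (sent < len) {
        long n = fn(ctx, buf + sent, len - sent);
        if (n < 0)
            return -1;            /* real error (or WOULD_BLOCK) */
        sent += (size_t)n;
    }
    return (long)sent;
}
```

With the 3-byte sink, `send_all_buggy` reports success after delivering only 3 of 11 bytes, while `send_all` delivers all 11, which is exactly the shape of truncation this hypothesis predicts.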
Comment 25•18 years ago
I confirm it works on Linux. I confirm I get the problem on Windows. I compiled with ECC disabled, and it still fails on Windows.
Comment 26•18 years ago
* youtube uploads over http
* flickr.com ""
* gmail will run happily either way

I'm really surprised we didn't see this with bz attachments...
Comment 27•18 years ago
Answering my own question - bz doesn't allow uploads greater than 1024 bytes. So can't test there. Has anyone tried reproducing with other servers?
Assignee
Comment 28•18 years ago
I'm still doing some debugging, but thought I should post an update: What appears to be happening is that the server is interrupting the SSL conversation. A network snoop shows the upload progressing normally, until the server suddenly sends a packet with an Alert message (in the SSL protocol). [This might be the result of bad data from the browser, so I can't say for sure whose fault this is.] I'm getting bogus output from ssltap, so I don't know what type of SSL Alert it is. [Seems like an ssltap bug?] I'm poking around some more to see if I can get something useful out of it.
Assignee
Comment 29•18 years ago
Testing on other servers seems to check out... The file in question has no problem being uploaded (over SSL) on Gmail, Yousendit.com, wiki.ubuntu.com, or a test Apache server set up during discussion on IRC. As another data point, I created a few new files with the same size as the problematic "12-images.zip", but pulling the contents from /dev/urandom and /dev/zero. On Windows, the /dev/zero file uploaded fine but 3 /dev/urandom files all failed, at different points (ie, the server reported uploads of different sizes). On Linux, all 4 files uploaded ok.
Comment 30•18 years ago
Since making comment 19, I've built by date in the time period that bug 137155 was on the trunk, with the patch backed out (essentially a simulation of that patch never landing). With that setup, the regression range that I find with this bug points to the landing of bug 337770. Conclusion: the upgrade of NSS from bug 337770 is the immediate cause of this bug. However, fixing bug 137155 also causes this bug to not appear. In this case, bug 137155 was fixed first, then NSS was upgraded, which means that the regression didn't appear until the patch for bug 137155 was backed out one month later (nearly 5 months after it landed). I don't really know enough about NSS/netwerk interaction to be able to draw a useful conclusion from this information, and the diffs in bug 337770 are a little too big for me to actually read through looking for hints.
Comment 31•18 years ago
Based on comment 28 and comment 29, we won't block the release of Firefox 2 for this. However, I hope gavin's research in comment 30 is helpful for getting the appropriate fixes to NSS for 1.8.1.1.
Flags: blocking-firefox2? → blocking-firefox2-
Comment 32•18 years ago
This is clearly not a Firefox issue. It's either networking or NSS, or both... Trying NSS for a start.
Assignee: nobody → nobody
Component: File Handling → Test
Flags: blocking-firefox2-
Product: Firefox → NSS
QA Contact: file.handling → test
Version: 2.0 Branch → 3.9
Updated•18 years ago
Component: Test → Libraries
QA Contact: test → libraries
Updated•18 years ago
Assignee
Comment 33•18 years ago
Update to comment 28... Ugh - I somehow managed to sniff a freak connection, and it's unlike all the other times I've reproduced the problem. So, the bit about getting an unexpected SSL alert isn't what normally happens. I can't seem to reproduce that first freak sniff, so whatever caused it is probably something else.

The good news is that a nominal sniff of the problem (using 12-images.zip) shows that this sure seems like a server problem... We're sending the full file contents to the server (an uninterrupted block of 1,185,483 encrypted bytes, which would be the 1,163,446 bytes of the file plus 22,037 bytes of SSL overhead). The resulting web page claims only 905,215 bytes of the file were received, so the remote end is discarding the extra data somewhere.

I had mconner test a debug build with the SSLTRACE and SSLDEBUG environment variables set (to "50"), and the resulting output from reproducing the problem appears to indicate the same thing. I'm still at a loss to explain the exact reason for the failure, though. If the submitter could look at gathering some additional data from the server's end (Apache/OpenSSL debug logging?), that might help explain what the server thinks is happening.
Comment 34•18 years ago
(In reply to comment #28)
> the server is interrupting the SSL conversation.
> server suddenly sends a packet with an Alert message (in the SSL protocol).
> I don't know what type of SSL Alert it is.

The SSL alert record content is encrypted. Record headers are not encrypted, but the record contents are. For an alert record, the encrypted content is very small, but it's the part you want. SSLTap cannot decrypt it. You could try using an SSL cipher suite with NULL encryption. But any alert you receive from the server during the data transfer (after the handshake) is effectively a sign that the server is aborting the transfer.
Comment 35•18 years ago
Justin, Please attach the portion of your ssltap output showing the handshake(s). Also, the very tail end of the output might be useful.
Comment 36•18 years ago
Also, Justin, please attach (or paste into a comment) the "raw request header" output from the page returned by the server in response to a flawed upload.
Comment 37•18 years ago
Justin sent me some very large log files of a connection with this problem. (Thanks, Justin!) I see nothing yet that obviously indicates an SSL error of any kind. Here are some observations gleaned from those logs.

The negotiated protocol version was 3.1, TLS 1.0. The server chose cipher suite 0x0039, TLS_DHE_RSA_WITH_AES_256_CBC_SHA.

ssl_SecureSend was being called alternately with writes of size 4095 and 1. This behavior MIGHT be expected if writes to the underlying TCP socket were experiencing "short writes" (full output buffers). But the log contains considerable evidence contradicting that. If we were experiencing short TCP writes, we would not be seeing a new SSL record being generated for each 1-byte write from above. We would see one SSL record generated for every TWO (or more) writes from above. But in the log file, we do see SSL records being generated with alternating sizes of 4133 and 37 (which are the expected values for payloads of 4095 and 1 with that cipher suite). This suggests that the odd alternating write sizes are due to strange behavior in the layer above ssl_SecureSend.

Note that 4K is not a record size boundary for TLS. TLS's record size boundary is 16K. I would like to know which of the many forms of write functions is being used by the layers above to send this file. Is the layer above libSSL calling PR_Write, PR_Send, PR_SendFile, PR_TransmitFile, PR_Writev, or what?

I continue to suspect a server problem. There have been a handful of bugs filed that turned out to be TLS servers with flawed block cipher logic that was manifest only with AES 256 cipher suites. So, I'd suggest conducting an experiment with this server, where the AES 256 cipher suites have been disabled in the client. See if the problem can be reproduced with AES_128 or RC4. If not, that's quite telling.
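As a sanity check on the "expected values" Nelson quotes: for TLS 1.0 with an AES-CBC / HMAC-SHA-1 suite such as TLS_DHE_RSA_WITH_AES_256_CBC_SHA, a record on the wire is a 5-byte header plus the CBC ciphertext of (payload + 20-byte MAC), padded with at least one padding byte up to a 16-byte block boundary. A quick sketch of that arithmetic (the helper name is hypothetical):

```c
/* TLS 1.0 record size for an AES-CBC / HMAC-SHA-1 cipher suite:
   5-byte record header, plus the plaintext (payload + 20-byte SHA-1 MAC)
   padded, with at least one padding byte, to a 16-byte AES block. */
static int tls10_cbc_record_size(int payload)
{
    int plain = payload + 20 + 1;           /* data + MAC + minimum padding */
    int padded = (plain + 15) / 16 * 16;    /* round up to the block size */
    return 5 + padded;
}
```

This gives 4133 for a 4095-byte payload and 37 for a 1-byte payload, matching the alternating record sizes observed in the log.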
Comment 38•18 years ago
(In reply to comment #37)
> This suggests that the odd alternating write sizes are due to strange
> behavior in the layer above ssl_SecureSend.

The direct layer above libssl is PSM's layer. It will use the same buffer sizes for reading and writing as used by the layer above it, which is Mozilla's networking layer, Necko.

> I would like to know which of the many forms of write functions is being
> used by the layers above to send this file. Is the layer above libSSL
> calling PR_Write, PR_Send, PR_SendFile, PR_TransmitFile, PR_Writev, or what?

Out of the functions you listed, the PSM layer implements the "write" function only.
Comment 39•18 years ago
Just wanted to add my comment here to let you all know that this doesn't seem to be limited merely to truncated file sizes on the other end, but rather the skipping entirely of elements of a multiple file upload.

I have a simple file upload app I wrote that is used here at Goddard Space Flight Center, running on mod_perl 1.29 on Apache 1.3.27. It provides a form that gives 5 separate file upload input boxes one right below the other. Prior to Firefox 2.0, you could select up to 5 files (one for each field), and it would successfully upload them all. You could even leave some fields empty and it would still upload them all fine... ie - if the file upload boxes are on the page like:

|BOX1| <Browse>
|BOX2| <Browse>
|BOX3| <Browse>
|BOX4| <Browse>
|BOX5| <Browse>

You didn't HAVE to fill them sequentially; you could have a file in only Box1 and Box4, for example, and they would both still get there fine.

Anyhow, I noticed tonight that at times Firefox 2.0 was only uploading 1 file out of 5, sometimes only 2 out of 5, sometimes 3, but so far never any more than three, even if all 5 inputs were full. More often than not, though, it will seem to only upload 1 file and then stop processing.

The code that actually reads in the file is quite rudimentary, and just uses the Apache::Request module:

    for (my $f_upload = $$r_apr->upload; $f_upload; $f_upload = $f_upload->next) {
        my ($curr_file) = $f_upload->filename;
        my ($curr_size) = $f_upload->size;
        if ($curr_file) {
            $curr_file =~ m/^.*(\\|\/)(.*)/;   # strip the remote path
            my ($file_name) = "";
            if (! $2) {
                $file_name = $curr_file;
            } else {
                $file_name = $2;
            }
            $$r_apr->log_error("The uploaded file was called: \"$file_name\"");
            if (-e "/tempfiles/$fullname/$uid/$file_name") {
                $$r_apr->log_error("File already exists!!!");
            } else {
                $f_upload->link("/tempfiles/$fullname/$uid/$file_name")
                    || die "Error linking file!! Error was: $!";
                $$r_apr->log_error("Your file \"$file_name\" uploaded successfully!!!");
                push(@uploaded_files, "$file_name" . "|||||" . "$curr_size");
            }
        } else {
            $$r_apr->log_error("NO FILE TO UPLOAD!!!");
        }
    }

    foreach (@uploaded_files) {
        $$r_apr->log_error("Array Item: $_\n");
    }

No matter how many files were set to be uploaded, only 1-3 "Array Item: xxxxxx" lines ever show up in the logs, and more often than not it's just 1, like I said before. The application is working perfectly fine apart from this issue; the files that make it there don't seem to be corrupted (though I'm only testing with at most 30kb PDF files at the moment, so with larger ones it very well might get corrupted).

I look forward to any progress that is made with this, and eagerly await a solution, since at the moment (this being an app used widely on center), I'll have to tell people to stick with Firefox 1.5 till this gets wrapped up.

PS - yes, all uploads are done through https exclusively.
Comment 40•18 years ago
Without an owner or current activity this looks unrealistic for 1.8.1.1, but we want it fixed.
Flags: wanted1.8.1.x+
Flags: blocking1.8.1.1?
Flags: blocking1.8.1.1-
Comment 41•18 years ago
Moving back to Core. The bug might be in NSS (dunno if that's been shown) but they're not likely to care enough to track down if the only symptom is in Firefox file uploading. If we want it fixed we'll have to own it until this is further along.
Assignee: nobody → dolske
Component: Libraries → Networking
Product: NSS → Core
Version: 3.9 → 1.8 Branch
Assignee
Comment 42•18 years ago
It would be most helpful if either the original reporter (Joe) or the comment 39 reporter (Tim) could provide some more details/logs of what Apache thinks is happening. From the snooping I did in comment 33, Firefox *is* sending the data, but the server is apparently discarding it before the script gets it. It would sure be nice to know why.
Reporter
Comment 43•18 years ago
Hello all. As the original reporter of this bug, I would be happy to add more info.

First, I cranked up both my Apache LogLevel and mod_ssl SSLLogLevel to "debug", restarted Apache, and reproduced the bug by uploading the 1.1 MB file from Firefox 2.0 Win32 over HTTPS. The Apache error_log contained nothing at all, but the ssl_engine_log contained a LARGE amount of information (to the tune of 5.5 MB). Here is that excerpt if anyone is interested, gzipped for downloading:

http://staging.sr.admission.net/joetemp/ssl_engine.log.gz

I then decided to try having the server simply dump out the raw HTTP POST data without using Apache::Request at all, to see the size, and whether submitting over HTTP differs from HTTPS. Here are those raw dumps:

http://staging.sr.admission.net/joetemp/rawcontent_ff20_win32_http.bin
http://staging.sr.admission.net/joetemp/rawcontent_ff20_win32_https.bin

The sizes are correct (1.1 MB), and they differ slightly, but it looks like it's just the multipart/form-data "boundary", which is randomly generated. I do not have software to perform a deep analysis of these dumps, so I do not know for sure if the binary data differs. I also dumped the raw HTTP POST data from a few other browsers, for further comparison:

http://staging.sr.admission.net/joetemp/rawcontent_ff1508_mac_https.bin
http://staging.sr.admission.net/joetemp/rawcontent_ie6_win32_https.bin
http://staging.sr.admission.net/joetemp/rawcontent_safari2_mac_https.bin

Keep in mind that submitting from ALL these browsers / OSes works just fine (the bug is not reproducible), EXCEPT for Firefox 2.0 Win32 HTTPS. However, I do not see any large differences in the raw dumps, which is interesting. These raw dumps certainly make it seem like a bug in Apache::Request (the library parsing the HTTP POST), but why only Firefox 2.0 Win32 HTTPS? All other browsers / OSes work fine.

Next I took a deeper look at the Apache::Request module (docs available here: http://search.cpan.org/~stas/libapreq-1.33/Request/Request.pm ). I made use of the UPLOAD_HOOK feature to have libapreq call a function while it is parsing the HTTP POST. The results were as expected. I incremented a counter of bytes received, and wrote all the data to a filehandle. Looks like the function is called for every 4096 bytes. The total size was 905215 bytes (same problem, much too small -- actual file is 1163446 bytes), which doesn't shed any further light. Somehow the remaining 200K is just not getting read. Apache::Request doesn't seem to have a "debug" mode, and it doesn't warn or write anything to STDERR.

I also wrote a file upload receiver script in PHP, and submitted to that. The bug is not reproducible. So PHP's built-in HTTP POST parser seems to deal with the submission just fine. That also points the finger at Apache::Request, but again, why only with Firefox 2.0 Win32 HTTPS?

I can't think of anything else to try, but I did send an e-mail to the author of Apache::Request (Stas Bekman, stas at stason dot org) and will reference this bug so he can take a look himself.
Reporter
Comment 44•18 years ago
AH HA! I think I may be onto something here. I took a closer look at the Apache::Request UPLOAD_HOOK callback, and examined the number of times it was called, and the number of bytes sent to it for each call. The results are VERY interesting, and show a GREAT variance between Firefox 2.0 Win32 HTTPS vs. HTTP vs. other browsers. Check this out:

http://staging.sr.admission.net/joetemp/upload_hook_log_ff20_win32_https.txt
http://staging.sr.admission.net/joetemp/upload_hook_log_ff20_win32_http.txt

See the difference? When the file is uploaded via Firefox 2.0 Win32 HTTPS, the upload hook calls start out with various sizes, then they get into a pattern of 4095 bytes, followed by 1 byte, followed by 4095 bytes, followed by 1 byte, and so on. Now compare that to the HTTP upload. Weird, eh? Here are some other browsers / OSes for further comparison:

http://staging.sr.admission.net/joetemp/upload_hook_log_ff1508_mac_https.txt
http://staging.sr.admission.net/joetemp/upload_hook_log_ie6_win32_https.txt
http://staging.sr.admission.net/joetemp/upload_hook_log_safari2_mac_https.txt

These results are VERY interesting. The Firefox 2.0 Win32 HTTPS upload is the ONLY instance where the upload hook is called for 1 single byte every other iteration. Could it be that Apache::Request is trying to read chunks from the socket that are "too small"? So Firefox is "not ready" with the next packet, the socket select() call returns false for can_read() or something, and Apache::Request gives up? Very, very interesting indeed.
Reporter
Comment 45•18 years ago
Apache::Request appears to have a "hack" to deal with Mozilla 0.97. I direct your attention to this snippet of code from inside libapreq (Apache::Request), apache_multipart_buffer.c, lines 65 - 70:

    if (bytes_to_read >= self->r->remaining) {
        bytes_to_read = self->r->remaining - strlen(self->boundary);
    #ifdef DEBUG
        ap_log_rerror(MPB_ERROR, "mozilla 0.97 hack: '%ld'", self->r->remaining);
    #endif
    }

And this from apache_request.c, lines 582 - 588:

    /* mozilla empty-file (missing CRLF) hack */
    fill_buffer(mbuff);
    if (strEQN(mbuff->buf_begin, mbuff->boundary, strlen(mbuff->boundary))) {
        r->remaining -= 2;
        continue;
    }

I could be totally off base here, but there may be a possibility this "hack" is affecting Firefox 2.0's multi-part upload. I'll try to recompile my Apache::Request module with the DEBUG flag set, and see if anything gets logged while reproducing the bug.
Reporter
Comment 46•18 years ago
Oh boy. Re-compiling libapreq with the DEBUG flag adds TONS of useful information to the Apache error_log. Check this out. This was captured while reproducing the bug (1.1 MB file upload via Firefox 2.0 Win32 HTTPS):

http://staging.sr.admission.net/joetemp/libapreq_debug_ff20_win32_https.txt

This has pretty much everything we need to know. And the best part? Look at the last line logged:

[Thu Dec 21 10:24:18 2006] [error] [client 10.40.9.22] [libapreq] silently drop remaining '254180' bytes

That is approximately the number of bytes missing from the file. libapreq is dropping it! Now, compare this against using HTTP (not HTTPS), and some other browsers / OSes:

http://staging.sr.admission.net/joetemp/libapreq_debug_ff20_win32_http.txt
http://staging.sr.admission.net/joetemp/libapreq_debug_ie6_win32_https.txt
http://staging.sr.admission.net/joetemp/libapreq_debug_ff1508_mac_https.txt
http://staging.sr.admission.net/joetemp/libapreq_debug_safari2_mac_https.txt

All these other logs look "clean" compared to libapreq_debug_ff20_win32_https (the only place the bug is reproducible). That log has all kinds of extra stuff. Of interest is the number of times the "find_boundary" thing is logged. Almost like the library is having trouble locating the boundary between multipart objects.
Comment 47•18 years ago
I took a peek at this. What's happening is that occasionally we get a 0-byte payload (as can be seen in the above log). Once that happens, multipart_buffer_read returns 0 and ApacheRequest_parse_multipart breaks out of the data-reading loop for that parameter and starts reading line-by-line looking for the next boundary/header (which it may or may not find, since we're treating binary data as text, so all bets are off).

(In reply to comment #37)

It still looks like the core of the issue is the way output is being sent. Not being a maven, I don't know if this odd 4095/1/4095/1 packet sequence is normal, but it sure looks unhealthy to me...

> Justin sent me some very large log files of a connection with this problem.
> (Thanks, Justin!)
>
> I see nothing yet that obviously indicates an SSL error of any kind.
>
> Here are some observations gleaned from those logs.
> The negotiated protocol version was 3.1, TLS 1.0.
> The server chose cipher suite 0x0039, TLS_DHE_RSA_WITH_AES_256_CBC_SHA.
>
> ssl_SecureSend was being called alternately with writes of size 4095 and 1.
> This behavior MIGHT be expected if writes to the underlying TCP socket were
> experiencing "short writes" (full output buffers). But the log contains
> considerable evidence contradicting that. If we were experiencing short TCP
> writes, we would not be seeing a new SSL record being generated for each
> 1-byte write from above. We would see one SSL record generated for every
> TWO (or more) writes from above. But in the log file, we do see SSL records
> being generated with alternating sizes of 4133 and 37 (which are the
> expected values for payloads of 4095 and 1 with that cipher suite). This
> suggests that the odd alternating write sizes are due to strange behavior
> in the layer above ssl_SecureSend.
>
> Note that 4K is not a record size boundary for TLS. TLS's record size
> boundary is 16K.
>
> I would like to know which of the many forms of write functions is being
> used by the layers above to send this file. Is the layer above libSSL
> calling PR_Write, PR_Send, PR_SendFile, PR_TransmitFile, PR_Writev, or what?
>
> I continue to suspect a server problem. There have been a handful of bugs
> filed that turned out to be TLS servers with flawed block cipher logic that
> was manifest only with AES 256 cipher suites. So, I'd suggest conducting
> an experiment with this server, where the AES 256 cipher suites have been
> disabled in the client. See if the problem can be reproduced with AES_128
> or RC4. If not, that's quite telling.
Comment 48•18 years ago
Darin, biesi, is there a way to log the sizes of the data buffers that HTTP passes to the downstream (NSS, in this case?) stuff? Nelson, is there a way to log the sizes of the data buffers NSS is receiving and the packets it's creating?
Comment 49•18 years ago
Boris, here is a trace log from a typical run uploading the file 12-images.zip. In addition to the usual log stuff, this log file includes the actual plain text and actual cipher text of the records, in hex and ASCII. Since the vast bulk of the data is binary (a large compressed zip file), I stripped out the data buffer dumps for most of the binary data. That reduced this log to a manageable size. The unzipped log file is 4900 lines, ~175 KB.

ECC is not used. The cipher suite is TLS_DHE_RSA_WITH_AES_256_CBC_SHA. So, yes, this server is using AES 256.

The lines in the log are grouped in blocks, one per call to SSL's SecureSend function by the layer above it. Each block begins with a line that looks like

SecureSend: sending NNNN bytes

(where NNNN varies) and ends with a line that looks like

SecureSend: returning NNNN count

(where NNNN varies) or

SecureSend: returning -1 count, error -5998

(that's WOULD_BLOCK). In between those lines are other lines telling what libSSL did between the call and the return. Here is some interpretation of those lines.

sending NNNN bytes of saved data

Means that before libSSL tried to do anything with the new data being passed down to it, it tried to send some data that was previously buffered after an earlier write attempt to the underlying socket got a WOULD_BLOCK. If this attempt to send the previously buffered data also gets a WOULD_BLOCK, this will be reported immediately to the caller.

SendRecord type: application_data (23) nIn=NNNN

libSSL will attempt to take this much of the data being passed down and turn it into one or more SSL records, to be encrypted and sent. In this log file, this amount will be either equal to, or exactly one less than, the amount the caller is passing down.

Send record (plain text) [Len: NNNN]

libSSL has prepared an SSL data record to send (not yet encrypted) containing this much data. In this log, this amount will always be equal to the amount in the preceding line.

send (encrypted) record data: [Len: NNNN]

libSSL has padded and encrypted the SSL record, and put a 5-byte SSL record header on it. In this log file, the encrypted record will generally be 26-57 bytes longer than the corresponding unencrypted record, because the SHA MAC is 20 bytes, padding is 1-32 bytes, plus the 5-byte header. The encrypted record will always have a size of 5+32*n. This amount is generally written to the underlying socket.

saving NNNN bytes of data (NNNN total saved so far)

If the write of the encrypted record to the underlying socket got a "short write", or got WOULD_BLOCK, the unwritten portion is buffered, and we will attempt to send this remainder again at the next call to libSSL. In this case, the amount returned to the caller will be 1 less than the amount the caller tried to write, or (if that would be zero), -1 with WOULD_BLOCK.
Updated•18 years ago
Attachment #251266 - Attachment mime type: application/octet-stream → application/zip
Comment 50•18 years ago
So if I see:

SecureSend: sending 1 bytes

in the log, that means the caller of SecureSend only passed NSS one byte, right? Is that caller just NSS itself? I don't see any mention of SecureSend outside NSS... Basically, I'd like to track down where this 1-byte thing is coming from in our code.
Comment 51•18 years ago
I wonder if SecureSend: sending 1 bytes has to do with the patch in bug 80092 comment 47.
Comment 52•18 years ago
(In reply to comment #50)
> So if I see: SecureSend: sending 1 bytes in the log, that means the
> caller of SecureSend only passed NSS one byte, right?

Yes.

> Is that caller just NSS itself? I don't see any mention of SecureSend
> outside NSS...

On an SSL socket, SecureSend is called (indirectly, through the PRIOMethods table) from PR_Write, PR_Send, PR_SendFile, PR_TransmitFile, and PR_Writev. That's why I asked, in comment 37, which of these functions is being called by the layer above.

> Basically, I'd like to track down where this 1-byte thing is coming from
> in our code.

I hope we can, too. But I hasten to add that, while it is inefficient, it is not wrong or incorrect. This behavior is entirely allowed by HTTP, SSL, TCP, and what-have-you. So, if the receiver cannot handle this data stream simply because of the unusual way in which it is packetized, IMO, that's not a bug we should try to fix by cobbling our own code.

I'm all for making our code more efficient, and getting rid of those one-byte writes for the sake of efficiency. IINM, bug 137155 is precisely about that very issue, and this bug's problem was seen after the "fix" for that bug was backed out. But I would hope we wouldn't change our code (assuming it is correct) just because of a bug in a third-party server that tries to decode uploaded zip files on the fly (if that is indeed the cause).

The server's log file claims it received a zero-length input, yet the attached trace file shows no evidence that we ever sent an SSL application data record with zero data, and even if we did, such a record should simply be silently discarded by the recipient, not treated as some EOF-like event. For all I know, the real problem here has something to do with the contents of the particular zip file being uploaded, and is not the fault of SSL at all. It may be that some combination of a particular compression data construct in the zip and some unusual pattern of the data packet sizes is causing the on-the-fly ZIP decoder to choke.
Comment 53•18 years ago
(In reply to comment #51)
> I wonder if SecureSend: sending 1 bytes has to do with the patch in
> bug 80092 comment 47.

Sometimes the 4095-4096 byte writes succeed, writing all the data without WOULD_BLOCK. At other times, they return "short writes". The alternating 4095, 1 byte writes are seen EVEN in the cases where the entire write succeeded without any short write occurring.
Comment 54•18 years ago
The scheme implemented for bug 80092 has the property that if the application does a large write, which is turned into an encrypted record, part of which is unsent, we return a "short write" count to the caller, one less than the amount the caller requested. The entire amount that the caller previously passed has been committed to be sent (is in the SSL record). To the caller it appears that the last byte he tried to send was unsent, and he may try to write that last byte again immediately. But when that happens, we will not send that one byte in a new SSL record, because it has already been sent in a previous SSL record. In the log file, this is seen in a sequence like the following:

SecureSend: sending 160 bytes
SendRecord type: application_data (23) nIn=160
Send record (plain text) [Len: 160]
send (encrypted) record data: [Len: 197]
saving 197 bytes of data (197 total saved so far)
SecureSend: returning 159 count
SecureSend: sending 1 bytes
sending 197 bytes of saved data
SecureSend: returning -1 count, error -5998
SecureSend: sending 1 bytes
sending 197 bytes of saved data
SecureSend: returning 1 count

The first write of 160 bytes put all 160 bytes of application data into the SSL record, which was then queued to be sent, but we told the caller that we sent only 159 bytes. The caller tried to send that one byte again. When the previous 160-byte record was sent, we returned 1 to the caller, telling him that that last byte was finally sent, but we did not send that last byte again in a new SSL record.

This behavior does NOT result in SSL records being sent over the wire that contain alternating large and small amounts of data. The would_block logic results in one large SSL record being sent for a series of calls, large and small, by the caller. But this trace (and the server's log) clearly shows that SSL records were being sent alternating in size between 4095 and 1 byte. And from the trace file, we see that the caller was sending these alternating sizes even when no would_block behavior was being seen (and also when it was). For example:

SecureSend: sending 4095 bytes
SendRecord type: application_data (23) nIn=4095
Send record (plain text) [Len: 4095]
send (encrypted) record data: [Len: 4133]
SecureSend: returning 4095 count
SecureSend: sending 1 bytes

Here we see a 4095-byte write being sent completely, on the first try, in its entirety. No short write, no previously saved data being sent, no new data being saved. But then immediately thereafter, the caller calls with 1 byte anyway. This shows that the alternating 4095, 1 behavior was not caused by would_blocks and short writes.
Comment 55•18 years ago
So the simple thing to do to find the caller seems to be to breakpoint in SecureSend, condition the breakpoint on the size being 1 byte, and look at the stack...
Comment 56•18 years ago
Yeah, Boris, a conditional breakpoint should reveal all. I'm testing with debug builds of NSS but not debug browser builds. Suggested experiments:

a) Produce a file of all zero bytes, exactly the same length as 12-images.zip, name it as a zip file, and upload it in the same way. Netlib and NSS should treat it exactly the same as 12-images.zip. I'd expect to see the same pattern of write sizes. Does that fail?

b) Rename the file from 12-images.zip to something else (e.g. 12-images.bin) and upload it, not as a zip file, but as application/octet-stream. That should disable any on-the-fly zip file decoder at the receiver. Of course, netlib and NSS won't treat the file any differently due to the changed name and MIME type. I'd expect to see the same pattern of write sizes. Does that fail?
Comment 57•18 years ago
I did experiments like the ones I suggested above. Here are the results:

-) Copy 12-images.zip to 12-images.exe and upload it. Different name, different MIME type. Uploaded length: exactly the same as for 12-images.zip, namely Size: 883.99 K (905215 bytes).

-) Using the program dd, create a new file of exactly the same length as 12-images.zip, but whose contents are NOT a zip file. Named it 12-omages.zip. Uploaded it. Result: Size: 19.99 K (20479 bytes). Truncated more severely!! The problem is sensitive to data content.

Next I'll repeat that last experiment while running an SSL trace as before.
Comment 58•18 years ago
SSL trace for uploading file 12-omages.zip shows the same exact pattern of sizes of encrypted SSL record buffers, and essentially identical pattern of write sizes coming from layer above. Yet the file was truncated on the server in an entirely different place, after only ~20 KB. Netlib didn't behave differently, NSS didn't behave differently. The file contents were different.
Reporter | Comment 59•18 years ago
In reply to Comment #56: There is no zip decompression happening on the server. The file type is ignored, and the file is stored in its original format.

However, Nelson's test using dd is very interesting! I also created my own "blank" version of the file, containing exactly 1,163,446 "o" characters. Here is a link to download it if anyone is interested: http://staging.sr.admission.net/joetemp/12-omages.zip

Amazingly, I am UNABLE to reproduce the bug with this file! It seems to upload just fine every time (I tried 15 times in a row). Here is the Apache::Request (libapreq) debug output while uploading it: http://staging.sr.admission.net/joetemp/12-omages-ff20-win32-https.txt

That is exactly what it SHOULD look like, discarding only 42 bytes at the end (which is normal). But as soon as I switch back to the REAL 12-images.zip file, the behavior is 100% reproducible again (chopping off 200K).

On a side note, I just upgraded my Mac to Firefox 2 (v2.0.0.1, up from v1.5.0.9), and the bug is NOT reproducible there! It remains reproducible ONLY on the Win32 version of Firefox 2.0.
Comment 60•18 years ago
Wanted, along with bug 137155.
Flags: blocking1.9? → blocking1.9-
Whiteboard: [wanted-1.9]
Comment 61•17 years ago
We are having the exact same problem (incomplete uploads of files through ssl). On Windows, Firefox 2.0.0.1 works fine, but 2.0.0.2 does not. On Windows, Firefox 1.5.0.9 works fine, but 1.5.0.10 does not. Works fine on Mac and Fedora Core. Works fine on Windows for any Firefox version when not using ssl. I have been able to replicate the problem with smaller files (80 k).
Comment 62•17 years ago
OK. So we're calling nsSSLThread::requestWrite with an |amount| of 1. This ends up setting the members that eventually get passed to ssl_SecureSend (by writing to a socket, looks like?). In any case, the caller of nsSSLThread::requestWrite is nsSocketOutputStream::Write. The caller of that is nsHttpConnection::OnReadSegment. Tracing further back, we end up in nsBufferedInputStream::ReadSegments, where we have:

328         PRUint32 amt = PR_MIN(count, mFillPoint - mCursor);
(gdb) p count
$9 = 4096
(gdb) p mFillPoint
$10 = 8192
(gdb) p mCursor
$11 = 8191

The next time through, we have:

(gdb) p count
$17 = 4095
(gdb) p mFillPoint
$18 = 8192
(gdb) p mCursor
$19 = 0

because we subtracted the 1 byte we read from |count|.

nsBufferedInputStream::ReadSegments is called by nsMultiplexInputStream::ReadSegments, which always passes 4096 as the count. Tracing further back through several levels of MIME and multiplex input stream ReadSegments nesting, we get to nsHttpTransaction::ReadSegments, which passes nsHttpTransaction::ReadRequestSegment as the callback function... This just writes the data out to the socket.

So the "real issue" is somehow the interaction of all these buffered and multiplex streams...
Comment 63•17 years ago
So I tried tracking the calls to nsSSLThread::requestWrite. The first call comes from:

nsStringInputStream::ReadSegments
nsMultiplexInputStream::ReadSegments
nsHttpTransaction::ReadSegments

This is basically the headers for the POST. This is 460 bytes.

Next we have 102 bytes (the start of the multipart body) with the stack:

nsStringInputStream::ReadSegments
nsMultiplexInputStream::ReadSegments
nsMIMEInputStream::ReadSegments
nsMultiplexInputStream::ReadSegments
nsHttpTransaction::ReadSegments

Then 27 bytes for the Content-Length header for the file part, with basically the same stack. Then 165 bytes for the rest of the headers for the file part. Then 3931 bytes of the file. Note that 3931 + 165 = 4096. Then I see the 165 and 3931 alternate for the rest of the file upload.

Basically, as far as I can see, this thing ends up hitting the buffered stream around the file at some offset (which depends on the exact headers being sent, possibly on the part separator, etc.), and then oscillates in terms of the amount of stuff it's going to send.

So to fix the oscillation, we either need to buffer up the data somewhere (which is what bug 137155 had effectively done) or someone (the multiplex input stream?) needs to be smarter about how they do things. In particular, instead of spanning read requests across the sub-streams, it could just return early when it gets to the end of a sub-stream... That way we'd always start the file upload with a clean read of 4096 bytes. Given that every single call to the callback does a separate network socket write anyway, there's really no win in trying to span across to the next stream, imo. It might even be more efficient to do this, in general, than what we do now.

Darin, Christian, what do you think?
Comment 64•17 years ago
Boris, Thanks for doing all that digging. I have a question (er two or three) about what you've found. You've found code that causes alternating big and little writes, and that's suboptimal, but is it wrong? Have you found anything that would cause data to be lost, or repeated, or the wrong data to be sent? IOW, do we have any reason to believe that the behavior you found is really responsible for the data loss reported by users of various servers in this bug?
Comment 65•17 years ago
> and that's suboptimal, but is it wrong?
As far as I can tell, no.
I'm still not quite certain under what conditions Apache is dropping data here, but I was assuming based on the previous comments that the 1-byte packets were relevant there. Rereading, this might not have been a correct assumption....
Comment 66•17 years ago
Would anyone happen to know whether or not this bug pops up when using mod_perl 2.0 on Apache 2.x instead of the Apache 1.3.x and mod_perl 1.29 config that both I and the original reporter were using? If the upgrade will fix this issue, I might have to pursue that route, since this is becoming quite a headache for our app. Thanks in advance.
Comment 67•17 years ago
Tim, We have not yet identified ANY erroneous behavior in the browser (related to this bug). The possibility that this error is due to *server* software has not been eliminated, IMO. So, if you have a fix for this issue in the form of a server software upgrade, I think that would be a worthwhile pursuit. If you try that, please let us know the results.
Comment 68•17 years ago
Thanks for the reply. I'll likely try the server-side upgrade this weekend to see if that helps at all.

And just so I'm not misunderstanding you, did you mean to say that you don't believe this to be a bug in the browser itself, but rather a problem with the server-side software that has merely been exposed by changes in Mozilla? I suppose that could be the case, but only if every other browser is simply letting the server-side issues slip through without notice. Opera 9, IE (5, 6 & 7), Safari, and Konqueror all successfully handle the upload without issue. Only starting with Firefox 2.0 (and now, unfortunately, 1.5.0.10) did this issue come to light. Would a better area to look for the issue perhaps be the changes made to the 1.8.0.x branch between .9 and .10? 1.5.0.9 and earlier all worked without any problems at all, but starting with the recent .10 upgrade, even the older versions of Firefox have been exhibiting this bug on our end.

It pains me that I've had no option but to recommend our users use Opera, Safari or IE in place of Firefox if they're looking to successfully upload files. I don't mean to seem ungrateful for Boris's work in diagnosis, just trying to make clear that this IS an issue that has manifested itself in recent Firefox versions, most specifically in the change from .9 to .10 of the 1.5.0 branch. Thanks again, and I'll keep you posted on my results with the mod_perl upgrade.
Comment 69•17 years ago
Gavin already looked into that. See comment 19 and comment 30 for his conclusions. Which raises an interesting point: Nelson, the NSS update changed _something_ here. Any idea what? I really don't have the knowledge I'd need to debug NSS effectively, but I think figuring out what part of bug 337770 causes the regression here (by local backout if needed, which requires knowing which parts depend on each other) would help determine whether this is a necko problem, NSS problem, or server problem... Is there a way to use a by-date pool off the development NSS branch with a given Gecko build? Ideally we'd find which exact checkin between NSS_3_11_1_RTM and NSS_3_11_20060512_TAG that way. Nelson, is comment 41 correct in its assessment, or would you guys be willing to try to figure out which of your changes had an effect here?
Comment 70•17 years ago
Boris, try backing out the NSS patch described in comment 51.
Comment 71•17 years ago
I've written and rewritten lots of comments in reply to the above questions, but each time I do, I find that I have merely repeated and elaborated on comments I've already made in this bug.

The patch for bug 80092 causes the behavior described in comment 54. Perhaps it would be good to re-read comment 54, all the way through. Yes, NSS will sometimes return a write count that is 1 less than the amount the caller was trying to write. But as seen MANY times in the attached trace, the alternating 1B/4KB pattern is seen even in the more typical case where NSS returns the full write count. If the patch for bug 80092 is found to be INCORRECT, introducing ERRONEOUS behavior, then we should change it; otherwise we should NOT.

Please re-read comment 57 and comment 58. The truncation problem is sensitive to file CONTENTS. NSS/SSL is sensitive only to the amounts of data written in each call, not to the contents. NSS treats the application data as an opaque byte stream. What code is sensitive to contents?

In reply to comment 65: Boris, I think the 1-byte packets are exactly the issue. I think they are triggering the problem. I think that if you eliminate them, by buffering the data above SSL and doing writes of larger sizes, the problem will vanish. If you really want to change Firefox to eliminate this problem, whether Firefox is at fault or not, then by all means, implement that buffering. But I do not think that 1-byte writes are an error; they are not *wrong*. NSS handles them fine. I suspect strongly that the server does not.
Comment 72•17 years ago
> Yes, NSS will sometimes return a write count that is 1 less than the > amount the caller was trying to write. I assume it never does this for a 1-byte write, right? > If the patch for bug 80092 is found to be INCORRECT My point is that I don't know enough about either SSL or the NSS APIs to evaluate its correctness. Furthermore, if that patch doesn't cause the alternating pattern per se (as you say), and the alternating pattern is responsible for the issue (as you say), then what does that patch have to do with anything? I'll try backing it out when I get back if no one has a chance to try that before then, I guess... But I still think that it would make more sense if someone more familiar with NSS debugged what happens at that level. > What code is sensitive to contents? Apache, as far as I can tell. See comment 45 and 46. > I think that if you eliminate them, by buffering the data above SSL Then I reintroduce bug 334142. Or is that bug gone with the buffering and the new SSL code? I can't tell; can you? > If you really want to change FireFox to eliminate this problem, I really want us to not make significant web compat changes like this on stable branches. From where I stand, an NSS change (which is probably correct in its own right) which broke sites was made on branch. So the options I see are basically to try to somehow work around it in a layer above NSS (without a clear idea of what NSS is really doing here, at least on my end), to revert back to the previous NSS tag, or to accept that we made an incompatible change that's breaking sites on a stable branch. I don't really like any of those options, to be honest. Care to propose others?
Comment 73•17 years ago
I attached this patch to help us find the truth, not to suggest that the NSS patch in bug 80092 comment 47 is incorrect. This patch backs out the patch in bug 80092 comment 47 (The original patch can't be backed out cleanly because more changes have been made on top of it.) After I applied this patch to Firefox 2.0.0.2, I can upload the test file with the correct file size. This experiment showed that the patch in bug 80092 comment 47 is the NSS change Boris is looking for. The bug is most likely on the server side. The interaction between Firefox and this NSS change causes Firefox to upload data to this server in a different but correct sequence of chunks. The server should be able to handle that.
Comment 74•17 years ago
I'm changing the platform and OS to "All" and "All". Been getting numerous reports from my users now that this is also happening on both Linux (Fedora Core 5) and MacOS 10.4.x with Firefox 2.0.0.x.
OS: Windows XP → All
Hardware: PC → All
Comment 75•17 years ago
Bug 369414 is a duplicate of this bug, but I am skeptical about whether the reported similar problems on Mac and Linux are indeed the same bug. In addition to the three apparently necessary conditions of Firefox 2, Windows, and https, I believe this bug also only occurs when uploading to a server running libapreq1 (i.e. mod_perl on Apache 1.3).

File uploads are being truncated after 0A bytes that occur immediately before a 1K boundary (or possibly 2K or 4K; 4K makes the most sense). In other words, it seems that all the truncated file sizes are 1 byte shy of being a perfect multiple of 4096 (or at least 1024), and 0A (that is, a line feed) is the final byte in these files.

Why only Firefox 2 on Windows? I don't know, but I have observed the "4095, 1, 4095, 1" buffer read pattern as well, and I think when the single-byte read gets a 0A byte, it's triggering a server bug with this function in c/apache_multipart_buffer.c in libapreq1:

/*
  gets the next CRLF terminated line from the input buffer.
  if it doesn't find a CRLF, and the buffer isn't completely full,
  returns NULL; otherwise, returns the beginning of the
  null-terminated line, minus the CRLF.

  note that we really just look for LF terminated lines. this works
  around a bug in internet explorer for the macintosh which sends mime
  boundaries that are only LF terminated when you use an image submit
  button in a multipart/form-data form.
*/
char* next_line(multipart_buffer *self)
{
    /* look for LF in the data */
    char* line = self->buf_begin;
    char* ptr = memchr(self->buf_begin, "\r\n", self->bytes_in_buffer);

    /* LF found */
    if (ptr) {
        /* terminate the string, remove CRLF */
        if ((ptr - line) > 0 && *(ptr-1) == '\r')
            *(ptr-1) = 0;
        else
            *ptr = 0;

        /* bump the pointer */
        self->buf_begin = ptr + 1;
        self->bytes_in_buffer -= (self->buf_begin - line);
    }
    /* no LF found */
    else {
        /* buffer isn't completely full, fail */
        if (self->bytes_in_buffer < self->bufsize)
            return NULL;

        /* return entire buffer as a partial line */
        line[self->bufsize] = 0;
        self->buf_begin = ptr;
        self->bytes_in_buffer = 0;
    }

    return line;
}
Comment 76•17 years ago
Sorry, the memchr call line actually reads:

char* ptr = memchr(self->buf_begin, '\n', self->bytes_in_buffer);

In any case, this is Apache::Request code, not Mozilla code, and although I haven't been able to confirm that this is the cause of the problem, it certainly sounds promising to me.
Comment 77•17 years ago
The apreq developers are planning a maintenance release of libapreq1. This version primarily addresses an issue noted with FireFox 2.0 truncating file uploads in SSL mode. Please give the tarball at http://people.apache.org/~issac/libapreq-1.34-RC1.tar.gz a try and report comments/problems/etc. to the apreq-dev list at apreq-dev@httpd.apache.org.
Comment 78•17 years ago
I can happily report that the linked libapreq-1.34-RC1 DOES in fact clear up this issue for me. Many thanks to all involved in getting a fix put out for this rather frustrating issue.
Comment 79•17 years ago
Isaac, may I trouble you to attach to this bug a diff between the previous libapreq source and this new one? Please make it a "unified diff" (from diff -u or diff -u10). Thanks.
Comment 80•17 years ago
So I tried to eliminate the oscillation, with this patch. I'm not sure it works... but maybe. With this patch, I see the following pattern of calls:

1) Attempt to write 4096 bytes via nsSSLThread::requestWrite
2) nsSSLThread::requestWrite only accepts 3931 bytes
3) Attempt to write the remaining 163 bytes via nsSSLThread::requestWrite
4) nsSSLThread::requestWrite accepts none of these
5) Attempt to write 4096 bytes via nsSSLThread::requestWrite
6) nsSSLThread::requestWrite only accepts 193 bytes
7) Attempt to write the remaining 3931 bytes via nsSSLThread::requestWrite
8) nsSSLThread::requestWrite accepts none of these

repeated over and over. This is perhaps better than what we used to have; without this patch we were making some extra 193-byte calls into requestWrite between steps 4 and 5 (and requestWrite was accepting those 193 bytes). But I don't know that it's actually changing what goes on the wire...

Nelson, do you think this would prevent the oscillation in packet sizes?
Comment 81•17 years ago
Attachment #262591 - Attachment is obsolete: true
Comment 82•17 years ago
Nelson, For reference, a patch for this specific issue can be found at http://svn.apache.org/viewvc/httpd/apreq/branches/1.x/c/apache_multipart_buffer.c?r1=531272&r2=531273&pathrev=531273&view=patch Barring any surprises, this new version will be released later this week as stable legacy, and for best performance all legacy users will be encouraged to upgrade to this version (if not to the latest releases of version 2.2 of the Apache HTTP Server and libapreq2, which to the best of my knowledge, is not affected by this issue).
Comment 83•17 years ago
In comment 75, ralp.no.h@gmail.com suggests that the problem occurs when FF does a one-byte write of the character 0x0A (a.k.a. \n). Bug 369414 seems to confirm this: it shows that the first character AFTER the point of truncation, the first character lost due to truncation, is 0x0A. I repeated the test with the original 12-images.zip file and found that with it also, the first character lost due to truncation is 0x0A.

I made a file of 2^20 pairs of \r\n ( 0D 0A ), and tried to upload it. It was truncated after only 20479 bytes, and the first character lost due to truncation was 0A. I also made a file of 2^20 0x0A bytes, and tested with it, thinking that it would also hit the problem early in the file, perhaps after even fewer characters than the file of 0D0A pairs. To my surprise, that entire file transferred without truncation. So apparently that single-byte 0A write must follow a write that did NOT end in 0A.

Curiously, the server seems to upload the whole file, and THEN do the truncation. The upload transfer doesn't stop at the point of truncation. All 2^20 or 2^21 characters seem to get sent. Only a fraction are retained. This explains why Joseph Huckaby was unable to reproduce the problem with a file of 1.2 million 'o' characters. It requires the presence of a 0x0A character, and his file of 'o' characters had none.

Assuming that ralp.no.h@gmail.com's analysis is correct (which seems to be pretty well confirmed here), a patch that prevents FF from ever sending a single-byte SSL record containing the byte 0x0A should eliminate the server's truncation. I'll reply to Boris's comment 80 separately.
Comment 84•17 years ago
Boris, in light of Issac's comment 82, should we just resolve this WORKSFORME or WONTFIX ? (The proposed NOTOURBUG resolution would come in mighty handy here.)
Comment 85•17 years ago
Sure, but then we should file a separate bug on our suboptimal packet sizes -- sending 4096-byte packets should be a lot faster than sending alternating 1-byte and 4095-byte packets, if we can manage it. So I'd still want an answer to my question. Filing a separate bug for that is the right thing to do, probably.
Comment 86•17 years ago
> a patch that prevents FF from ever sending a
> single byte SSL record containing the byte 0x0A should eliminate the
> server's truncation.

This would correct the behavior, but I don't think it's necessary. I tracked down and managed to hack around the real culprit within libapreq on the server (although it wasn't specifically the code I pasted above), and a thread on the apache mailing list prompted development of a proper patch by the real devs (thanks!), which Issac has linked to in comment 82.

I would recommend resolving this bug as INVALID (and bug 369414 as DUPLICATE), since the 4095/1/4095/1 behavior isn't necessarily *wrong*, it's just really *dumb*, and it happens to be really good at triggering a real bug on the server side (which is fixed, thanks to the comments on this bug). I agree with Boris that a separate bug (Enhancement) probably ought to be opened for the inefficient packet size pattern.
Comment 87•17 years ago
Comment on attachment 262592 [details] [diff] [review]
Actual attempt to eliminate oscillation -- other patch is an impostor

>+ // suboptimal if each callback call writes to a socker

A typo in the comment. Change "socker" to "socket". I'm glad we are getting to the bottom of this.
Comment 88•17 years ago
Re: Comment 85, if your objective is to speed up uploads, then you should also increase the minimum write size to a much larger value for all writes except the last. I might suggest 64KB. The reason is that writes of any amount less than the TCP transmit window size are subject to Nagle delays.
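To illustrate the suggestion above (the function name is illustrative, not from any Firefox patch; 64KB is the figure proposed in this comment):

```python
import socket

WRITE_SIZE = 64 * 1024   # the 64KB minimum write size suggested above

def send_upload(sock, data):
    # Send in large fixed-size writes; only the last write may be
    # smaller, so at most one write is exposed to a Nagle delay.
    for i in range(0, len(data), WRITE_SIZE):
        sock.sendall(data[i:i + WRITE_SIZE])

# The other common mitigation is to disable Nagle's algorithm outright:
# sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY, 1)
```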
Comment 89•17 years ago
Filed bug 378629. Thank you, everyone, for the help in getting to the bottom of this and for getting it fixed in Apache!
Status: NEW → RESOLVED
Closed: 17 years ago
Resolution: --- → INVALID
Comment 91•17 years ago
As of today, is the server at https://staging.sr.admission.net/joetemp/file_upload.html running a broken or a fixed Apache server? I am unable to reproduce an incomplete upload with the provided file and the latest 1.8 branch. It would be interesting to have access to a broken server while I work on improvement bug 378629.
Comment 92•17 years ago
Comment on attachment 262592 [details] [diff] [review]
Actual attempt to eliminate oscillation -- other patch is an impostor

This patch is wrong, because it violates the nsIInputStream API contract.
Attachment #262592 - Attachment is obsolete: true
Reporter
Comment 93•17 years ago
Kai: Yes, the staging.sr.admission.net server still has the original, unpatched libapreq 1.33, and I can still reproduce the bug. Just tried this morning: when I upload the "12-images.zip" file, which is 1,163,446 bytes, only 905,215 bytes arrive on the server, as shown on the result page. I used Firefox 2.0.0.1 on WinXP SP2, over HTTPS of course.

User-Agent: Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.1.1) Gecko/20061204 Firefox/2.0.0.1
Comment 94•17 years ago
(In reply to comment #91)
> I am unable to reproduce an incomplete upload with the provided file and
> latest 1.8 branch.

(In reply to comment #93)
> Kai: Yes, the staging.sr.admission.net server still has the original,
> unpatched libapreq 1.33, and I can still reproduce the bug.

I forgot: Linux always works for me. For whatever reason, I can reproduce this bug on Windows only! Joseph, thanks for keeping the broken server up. It's an important help for me.
Updated•17 years ago
Flags: wanted1.8.1.x+
Whiteboard: [wanted-1.9]
Comment 95•17 years ago
If you experienced this bug and would like to see an improvement in stable versions of Firefox, you might want to grab a nightly experimental trunk build and provide feedback in bug 378629.
Updated•17 years ago
Whiteboard: ssl