Closed Bug 1349862 Opened 3 years ago Closed 3 years ago

XMLHttpRequest returning corrupt data for large blobs

Categories

(Core :: DOM: Core & HTML, defect, blocker)

Version: 52 Branch
Type: defect
Priority: Not set
Severity: blocker

Tracking


VERIFIED FIXED
mozilla55
Tracking Status
firefox-esr45 --- unaffected
firefox52 - wontfix
firefox-esr52 53+ verified
firefox53 + verified
firefox54 + verified
firefox55 + verified

People

(Reporter: keean, Assigned: baku)

References

Details

(4 keywords)

Attachments

(2 files, 1 obsolete file)

User Agent: Mozilla/5.0 (Linux; Android 7.0; Nexus 6 Build/NBD92D) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/56.0.2924.87 Mobile Safari/537.36

Steps to reproduce:

An XMLHttpRequest for large data (approx. 140MB) with response type "blob" returns a corrupt file (about 600,000 bytes, starting around 9MB into the blob, appear incorrect). This only happens in Firefox on macOS; the same code works fine in Firefox on Linux, and in other browsers (Chrome and Safari) on the Mac. Setting the response type to "moz-blob" also works around the problem in Firefox on macOS.

The test was to XMLHttpRequest the large file, SHA-256 checksum it, and compare the checksum to the one produced on other browsers/platforms. The corrupt data appears to be random(ish), as the checksum changed every time. I confirmed the bug was not in WebCrypto by using an XMLHttpRequest with responseType = 'arraybuffer' and converting the result to a blob before feeding it to the checksum code; the correct checksum was then produced on all browsers and platforms. However, loading a large file into an ArrayBuffer uses a lot more RAM than a Blob, which prevents large files from being downloaded on some platforms. This bug is present in the latest production Firefox (52).
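For reference, a minimal sketch of that workaround (the helper name is illustrative, mirroring the get_blob helper in the testcase below): download as 'arraybuffer', then wrap the result in a Blob before hashing.

```
// Workaround sketch (hypothetical helper): responseType 'arraybuffer' is
// unaffected by this bug, so wrap the buffer in a Blob before hashing.
// This costs more RAM than a file-backed blob, but produced correct
// checksums on all browsers and platforms.
function get_blob_via_arraybuffer(url) {
    return new Promise(function(succ, fail) {
        var xhr = new XMLHttpRequest();
        xhr.open('GET', url, true);
        xhr.responseType = 'arraybuffer';
        xhr.onload = function() {
            if (xhr.status === 200) {
                succ(new Blob([xhr.response]));
            } else {
                fail(xhr.statusText);
            }
        };
        xhr.onerror = fail;
        xhr.send(null);
    });
}
```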



Expected results:

The checksum should have agreed with the other browsers and platforms.
Here's a test for this bug (you will need to set the file to fetch; something around 200MB should be large enough to trigger the bug).

```
<!DOCTYPE html>
<html xmlns="http://www.w3.org/1999/xhtml" lang="en" xml:lang="en">
    <head>
        <title>Test XMLHttpRequest</title>
    </head>
    <body>
        <script type="text/javascript">
//<![CDATA[

// Download `url` as a Blob and resolve the promise with it.
function get_blob(url) {
    return new Promise(function(succ, fail) {
        var xhr = new XMLHttpRequest();
        xhr.open('GET', url, true);
        xhr.responseType = 'blob';
        xhr.onprogress = function(e) {
            console.log('.');
        };
        xhr.onreadystatechange = function() {
            if (xhr.readyState === XMLHttpRequest.DONE) {
                if (xhr.status === 200) {
                    console.log('downloaded file size: ', xhr.response.size);
                    succ(xhr.response);
                } else {
                    fail(this.statusText);
                }
            }
        };
        xhr.send(null);
    });
}

var subtle = crypto.subtle || crypto.webkitSubtle;

// Read a Blob into an ArrayBuffer; pass anything else through unchanged.
function readBlob(blob) {
    return new Promise(function blob_promise(succ, fail) {
        if (blob instanceof Blob) {
            var fileReader = new FileReader();
            fileReader.onload = function() {
                succ(this.result);
            };
            fileReader.onerror = fail;
            fileReader.readAsArrayBuffer(blob);
        } else {
            succ(blob);
        }
    });
}

// Hash the buffer with WebCrypto and hex-encode the digest.
function sha256(buf) {
    return subtle.digest('SHA-256', buf).then(function hex_encode(hash) {
        var value = new Uint32Array(hash);
        var str = '';
        for (var i = 0; i < value.length; ++i) {
            str += ('00000000' + value[i].toString(16)).slice(-8);
        }
        return str;
    });
}

get_blob('large_binary_file_200MB')
.then(readBlob)
.then(sha256)
.then(function(hex) {
    console.log(hex);
});

//]]>
        </script>
    </body>
</html>
```
As you're able to reproduce the issue on Mac OSX, could you install the tool mozregression to narrow down a regression range in FF52?
See http://mozilla.github.io/mozregression/ for details (you need Python 2.7).

Run the command "mozregression --good=51", then for each build downloaded by the tool, run the test and enter whether the build is good or bad.
After the run, copy here the final pushlog from the mozilla-inbound repository.
Component: Untriaged → DOM
Flags: needinfo?(keean)
OS: Unspecified → Mac OS X
Product: Firefox → Core
Hardware: Unspecified → x86_64
[Tracking Requested - why for this release]: regression corrupt data for large blobs

I can reproduce on Windows 10 64-bit with Firefox 52.0.1 and Nightly 55.0a1, but not with Firefox 51.0.1.
Status: UNCONFIRMED → NEW
Ever confirmed: true
Keywords: regression
OS: Mac OS X → All
Hardware: x86_64 → All
Severity: normal → major

Regression window:
https://hg.mozilla.org/integration/mozilla-inbound/pushloghtml?fromchange=ca17ce6a2c9a3e906c9527c1e44c98185325cabe&tochange=c634201ba01d846403e692921a44038d2e55817a

Regressed by: Bug 1202006
Blocks: 1202006
Severity: major → blocker
Flags: needinfo?(amarchesini)
Summary: XMLHttpRequest returning corrupt data for large blobs on macos-x → XMLHttpRequest returning corrupt data for large blobs
Flags: needinfo?(keean)
Has Regression Range: --- → yes
Has STR: --- → yes
Component: DOM → DOM: Core & HTML
Flags: needinfo?(bugs)
Keywords: dataloss
I assume baku will take a look at this.
Flags: needinfo?(bugs)
Assignee: nobody → amarchesini
Flags: needinfo?(amarchesini)
I need extra information:

1. Does it also happen in non-e10s mode?
2. When the operation is completed, can you check whether you have a temporary file called mozilla-temp-<something>? It will be in your temporary directory.
3. If yes, does that file have the same content as the large_binary_file_200MB?
Flags: needinfo?(alice0775)
(In reply to Andrea Marchesini [:baku] from comment #6)
> I need extra information:
> 
> 1. Does it also happen in non-e10s mode?

Yes, I can reproduce the problem in both non-e10s and e10s.

> 2. When the operation is completed, can you check whether you have a
> temporary file called mozilla-temp-<something>? It will be in your
> temporary directory.

C:\Users\[userID]\AppData\Local\Temp\mozilla-temp-files\mozilla-temp-41
I got the file after reloading Explorer.
The file is deleted automatically after a while.

> 3. If yes, does that file have the same content as the
> large_binary_file_200MB?

No, it is not the same.
Attachment #8851042 - Attachment is obsolete: true
Flags: needinfo?(kyle)
Build Identifier:
https://hg.mozilla.org/mozilla-central/rev/01d1dedf400d4be413b1a0d48090dca7acf29637
Mozilla/5.0 (Windows NT 10.0; WOW64; rv:55.0) Gecko/20100101 Firefox/55.0 ID:20170324030205



STR:
1. Save the attached "modified reporter's testcase html" to the local file system
2. Clear everything: Menu > History > Clear Recent History…
3. Just to be sure, restart the browser
4. Open the downloaded html file
5. Wait a minute or two (depending on network speed; this will download an approx. 50MB file from http://archive.mozilla.org/pub/firefox/nightly/2017/03/2017-03-01-03-02-02-mozilla-central/firefox-54.0a1.en-US.win32.zip )
   --- observe results
6. Press the F5 key to reload
   --- observe results
7. Repeat step 6 as needed

Actual Results:
On the first attempt, the hash result is incorrect.
After a reload, the result is random; sometimes the correct result is obtained.

Expected Results:
The results of hash should always be ed932fabd383a2132fdab7ff51edc8b8ed9d2857c0a73bd98f388a8a04815320 .
I'm building m-c for OS X and Windows now, will update when I verify repros. Thanks for the STR!
Ok. STR verified on macOS 10.12.3 running FF55 (almost constant failure) and Debian Stretch running FF55 (intermittent failures, maybe ~40% of the time). The only platform I could not repro on was Win64 with FF55: multiple runs of both locally compiled and distribution builds never returned an incorrect hash.
Flags: needinfo?(kyle)
That was win64 with a 64-bit FF build, btw.
Ok, actually, strike that. After many retries of 64-bit FF on Win64, it finally failed there too.
Tracking for 53 onwards. Release date for 53 is April 18. 
Marking wontfix for 52 though.
Didn't make much progress on this today outside of repro'ing, doing some binary diffs to see where the corruption happened (it usually seems to start <10MB into the file; some chunks afterward are OK again, but there's lots of corruption otherwise), and trying to figure out code paths. Debug builds seemed to show failures less often than release on macOS, but I'm not sure yet whether that's related to timing or the file system or what.
Could the I/O runnables ever run out of order? I could not find a guarantee.
nsStreamTransportService uses a thread pool, so multiple WriteRunnables can race, I think.
(In reply to Masatoshi Kimura [:emk] from comment #18)
> nsStreamTransportService uses a thread pool, so multiple WriteRunnables can
> race, I think.

Yeah, that's definitely the case.  This implementation treats nsStreamTransportService as if it behaved like AbstractThread's TaskQueue, which provides the desired ordering guarantees.


Aside: it also looks like this logic bypasses Necko streams because NS_OpenAnonymousTemporaryFile does same-thread I/O; it probably wanted to create a new class, nsTemporaryFileOutputStream, to complement the existing nsTemporaryFileInputStream, so it could use the existing Necko logic that avoids problems like these.  If this bug doesn't end up going that way and there isn't already a follow-up to clean that up, it seems like there should be.
Flags: needinfo?(amarchesini)
Attached patch xhr_bug.patchSplinter Review
Flags: needinfo?(amarchesini)
Attachment #8851492 - Flags: review?(bugmail)
Keywords: site-compat
Comment on attachment 8851492 [details] [diff] [review]
xhr_bug.patch

Review of attachment 8851492 [details] [diff] [review]:
-----------------------------------------------------------------

I think you also need to invoke BeginShutdown() if you want TaskQueue's destructor to not assert.  Please make sure you've run a test for this under a debug build before landing!  Relatedly, TaskQueue::Dispatch as used will do a MOZ_DIAGNOSTIC_ASSERT in the event of dispatch failure.  This is nice for ensuring dispatch, but unless you're sure this logic will never run during/race the stream transport service's shutdown, you might want to provide the DontAssertDispatchSuccess parameter.  (STS shuts down at xpcom-shutdown-threads, it looks like.  Not sure what constraints this code is operating under.)
Attachment #8851492 - Flags: review?(bugmail) → review+
Pushed by amarchesini@mozilla.com:
https://hg.mozilla.org/integration/mozilla-inbound/rev/40d99f377bdd
MutableBlobStorage must use a TaskQueue in order to preserve the order of runnables for I/O ops, r=asuth
https://hg.mozilla.org/mozilla-central/rev/40d99f377bdd
Status: NEW → RESOLVED
Closed: 3 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla55
Can we land an automated test for this?
Flags: needinfo?(amarchesini)
Flags: in-testsuite?
We have a test for MutableBlobStorage but I don't know if we can force a random scheduling of runnables in the thread pool.
I can write a test, but I don't know if that would cover this issue.
Andrew, do you have a better approach here?
Flags: needinfo?(amarchesini) → needinfo?(bugmail)
One possibility is creating an "adversarial" implementation of nsIStreamTransportService/nsIThreadPool/nsIEventTarget that temporarily replaces itself as the factory for the "@mozilla.org/network/stream-transport-service;1" contract.  There's some existing testing code along these lines for different contracts.  See http://searchfox.org/mozilla-central/source/browser/components/preferences/in-content/tests/browser_advanced_update.js#30 and http://searchfox.org/mozilla-central/source/testing/xpcshell/head.js#243 for examples of implementations that seem to work (but disclaimer, see later).

This isn't as crazy an idea as it seems because implementations like nsInputStreamPump access the service via CID not ContractID so we're less likely to break everything by introducing this behavior.  However, this means the above examples' idiom of unregisterFactory/registerFactory should be changed to not call unregisterFactory since it operates on the CID.  Instead, the first registerFactory for the fake implementation clobbers the ContractID but leaves the original CID intact.  Then when it's time to restore things, a second registerFactory call with the ContractID and original CID but a null factory will cause the component manager to re-establish the original mapping.

The difficult bit is ensuring that there are at least 2 write runnables and that we have a chance to shuffle them, especially because implementations like nsInputStreamPump are very clever about consolidating events.  The good news is that dom/xhr/tests/temporaryFileBlob.sjs just uses setTimeout for a timeline of: @0s: send 256 bytes, @1s: send the rest, @2s: close the connection.  That definitely should net at least 2 OnDataAvailable invocations.  So have the adversarial STS have a "reverseEventsDispatchAndBecomePassthrough()" method, do a setTimeout(1250), invoke that, unregister the fake STS, and then verify the XHR completes and the data is right.  (Also try running the test in a tree without this bug's fix and ensure that it fails.)
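Here is a hedged sketch of just the ContractID-clobbering idiom described above (all names and the CID are hypothetical, the event-buffering/reversing logic is elided, and it assumes an xpcshell-style scope where Components, Ci, and Cr are defined):

```
// Sketch only: shadow the stream transport service's ContractID with a fake
// factory while leaving the original CID registration intact.
Components.utils.import("resource://gre/modules/XPCOMUtils.jsm");

const STS_CONTRACT = "@mozilla.org/network/stream-transport-service;1";
// Hypothetical CID; a real test would generate a fresh UUID.
const FAKE_CID = Components.ID("{b2a3c8f0-9f1e-4c56-9abc-def012345678}");

const fakeSTS = {
  QueryInterface: XPCOMUtils.generateQI([Ci.nsIStreamTransportService]),
  // ... wrap the real service here, buffering dispatched write events so a
  // reverseEventsDispatchAndBecomePassthrough() method can replay them ...
};

const fakeFactory = {
  createInstance: function(aOuter, aIID) {
    if (aOuter) {
      throw Components.Exception("", Cr.NS_ERROR_NO_AGGREGATION);
    }
    return fakeSTS.QueryInterface(aIID);
  },
};

const registrar = Components.manager.QueryInterface(Ci.nsIComponentRegistrar);
// Registering under a new CID clobbers the ContractID mapping but leaves the
// original CID alone, so CID-based lookups (e.g. nsInputStreamPump) keep
// reaching the real service.
registrar.registerFactory(FAKE_CID, "Fake STS", STS_CONTRACT, fakeFactory);

// To restore the original mapping later, re-point the ContractID at the
// original CID with a null factory:
// registrar.registerFactory(ORIGINAL_STS_CID, "", STS_CONTRACT, null);
```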
Flags: needinfo?(bugmail)
Can this ride the trains or should we be tracking this for possible backport to affected branches?
Flags: needinfo?(amarchesini)
We should uplift given the severity of the problem (random data corruption).

If we do not uplift, we should at least disable file-backed blobs on branches.
Comment on attachment 8851492 [details] [diff] [review]
xhr_bug.patch

Approval Request Comment
[Feature/Bug causing the regression]: MutableBlobStorage implementation
[User impact if declined]: corrupted blob data
[Is this code covered by automated tests?]: not yet.
[Has the fix been verified in Nightly?]: not yet.
[Needs manual test from QE? If yes, steps to reproduce]: Yes. Follow the description of the bug: create a blob starting from a big resource in XHR.
[List of other uplifts needed for the feature/fix]: none
[Is the change risky?]: no
[Why is the change risky/not risky?]: We just introduce a TaskQueue when MutableBlobStorage dispatches runnables.
[String changes made/needed]: none
Flags: needinfo?(amarchesini)
Attachment #8851492 - Flags: approval-mozilla-beta?
Attachment #8851492 - Flags: approval-mozilla-aurora?
Comment on attachment 8851492 [details] [diff] [review]
xhr_bug.patch

Avoids data corruption; may be risky; let's try this on beta 10.
qdot, would you mind testing once this lands on beta later this week, since you already had a look for nightly?
Flags: needinfo?(kyle)
Attachment #8851492 - Flags: approval-mozilla-beta?
Attachment #8851492 - Flags: approval-mozilla-beta+
Attachment #8851492 - Flags: approval-mozilla-aurora?
Attachment #8851492 - Flags: approval-mozilla-aurora+
I'll pull beta now and get it built, and will try to test by this evening. That said, we have a manual testcase with STR in comment 10 if you want to add that to a test set as well.
Blocks: 1353629
Seems to be working fine for me on beta now. No errors on any platform.
Flags: needinfo?(kyle)
Excellent!

baku, could you request an uplift or disable file-backed blobs on ESR 52?
Flags: needinfo?(amarchesini)
Flagging this for manual testing; STR and testcase in comment 10.
Flags: qe-verify+
Comment on attachment 8851492 [details] [diff] [review]
xhr_bug.patch

[Approval Request Comment]
User impact if declined:  corrupted Blob can be returned to content
Risk to taking this patch (and alternatives if risky): none. The issue here was that the runnables were scheduled in the wrong order. We need to use a TaskQueue.
String or UUID changes made by this patch:  none.
Flags: needinfo?(amarchesini)
Attachment #8851492 - Flags: approval-mozilla-esr52?
I am unable to reproduce the initial issue on Windows 10 using the 52.0.2 build. I may be missing something, but after following the steps from comment 10, once I reach step 5 and load the testcase, nothing happens. I don't see anything particularly different in the Browser Console either (comparing 52.0.2 and 53 beta 9).

Should I pay attention to something in particular here in order to differentiate the bad build from the fixed one?
Flags: needinfo?(alice0775)
It will take a bit (the test is downloading a 50MB file in the background), but a dialog should pop up with the hash of the downloaded file.
(In reply to Bogdan Maris, QA [:bogdan_maris] from comment #39)
> I am unable to reproduce the initial issue on Windows 10 using the 52.0.2
> build. I may be missing something, but after following the steps from
> comment 10, once I reach step 5 and load the testcase, nothing happens. I
> don't see anything particularly different in the Browser Console either
> (comparing 52.0.2 and 53 beta 9).
> 
> Should I pay attention to something in particular here in order to
> differentiate the bad build from the fixed one?

I can reproduce the problem on Firefox 52.0.2 32-bit / Windows 10 Home 64-bit.
Flags: needinfo?(alice0775)
Comment on attachment 8851492 [details] [diff] [review]
xhr_bug.patch

fix a regression with large blobs on esr52
Attachment #8851492 - Flags: approval-mozilla-esr52? → approval-mozilla-esr52+
(In reply to Alice0775 White from comment #41)
> (In reply to Bogdan Maris, QA [:bogdan_maris] from comment #39)
> > I am unable to reproduce the initial issue on Windows 10 using the 52.0.2
> > build. I may be missing something, but after following the steps from
> > comment 10, once I reach step 5 and load the testcase, nothing happens. I
> > don't see anything particularly different in the Browser Console either
> > (comparing 52.0.2 and 53 beta 9).
> > 
> > Should I pay attention to something in particular here in order to
> > differentiate the bad build from the fixed one?
> 
> I can reproduce the problem on Firefox 52.0.2 32-bit / Windows 10 Home 64-bit.

Hello, Alice! I didn't manage to reproduce the issue on Windows 10 x64, using 52.0.2 (20170323105023) (x86), following the steps from comment 10. The behavior is the same as in 55.0a1 (2017-04-11). As Bogdan already mentioned, there is no visible result after step 5. Moreover, the Web Console result is the same in both of the above-mentioned builds (the affected and the fixed one). Could you please mention what we should particularly focus on in order to reproduce/verify this issue?
Flags: needinfo?(alice0775)
You should wait 3-4 minutes (ADSL 9bps) at steps 5 and 6; then an alert box pops up.

I am not sure, but it may need a low-performance PC (Core2Quad @ 2.5GHz, 8GB RAM, 1TB HDD, ADSL 9bps)
Flags: needinfo?(alice0775)
(In reply to Alice0775 White from comment #45)
> You should wait 3-4 minutes (ADSL 9bps) at steps 5 and 6; then an alert box
> pops up.
> 
> I am not sure, but it may need a low-performance PC (Core2Quad @ 2.5GHz,
> 8GB RAM, 1TB HDD, ADSL 9bps)

I couldn't reproduce the issue, even using NetLimiter. I encountered the same behavior as in comment 39 and comment 44. Alice, do you think you can try verifying the fix? Thank you!
Flags: needinfo?(alice0775)
I can reproduce the issue on Firefox 52.0.2 [1].

[1]https://hg.mozilla.org/releases/mozilla-release/rev/e81854d6ce91f3174774a50c9c5813c33b9aac58
Mozilla/5.0 (Windows NT 10.0; WOW64; rv:52.0) Gecko/20100101 Firefox/52.0 ID:20170323105023

And I verified that the issue is fixed in 53 [2], 54 beta [3], and Nightly 55 [4], as well as in ESR 52.1.0 [5].

[2]https://hg.mozilla.org/releases/mozilla-release/rev/d345b657d381ade5195f1521313ac651618f54a2
Mozilla/5.0 (Windows NT 10.0; WOW64; rv:53.0) Gecko/20100101 Firefox/53.0 ID:20170413192749
[3]https://hg.mozilla.org/releases/mozilla-aurora/rev/105e456d811b8c4616de688e9afdcf0af620b80b
Mozilla/5.0 (Windows NT 10.0; WOW64; rv:54.0) Gecko/20100101 Firefox/54.0 ID:20170418004027
[4]https://hg.mozilla.org/mozilla-central/rev/c0ea5ed7f91a6be996a4a3c5ab25e2cdf6b4377e
Mozilla/5.0 (Windows NT 10.0; WOW64; rv:55.0) Gecko/20100101 Firefox/55.0 ID:20170419030223
[5]https://hg.mozilla.org/releases/mozilla-esr52/rev/3ea0e075203185d7f2d42f439455e97735bd1b20
Mozilla/5.0 (Windows NT 10.0; WOW64; rv:52.0) Gecko/20100101 Firefox/52.0 ID:20170417065206
Status: RESOLVED → VERIFIED
Flags: needinfo?(alice0775)