XMLHttpRequest state must not be LOADING or DONE is sometimes triggered when HTTP2 is used
Categories
(Core :: Networking: HTTP, defect, P2)
Tracking
()
People
(Reporter: mail, Assigned: dragana)
Details
(Whiteboard: [necko-triaged])
User Agent: Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:66.0) Gecko/20100101 Firefox/66.0
Steps to reproduce:
As part of a larger web app, upload photos with XMLHttpRequest. There are 32 photos a ~4MB each, and 4-6 of them are uploaded in parallel.
The photo upload https://benaco.com
Actual results:
Nondeterministically, and roughly every ~30 photos, one of the uploads hangs forever.
The console shows
InvalidStateError: XMLHttpRequest state must not be LOADING or DONE
when that happens.
The progress counter I've built using ProgressEvent.loaded
shows the upload hanging at 00%.
The Network dev panel shows the HTTP POST request hanging (without HTTP status code).
Disabling HTTP2 via network.http.spdy.enabled.http2 = false
appears to fix the issue.
I have observed this issue in Firefox 61 and 66 on OSX (not yet with other versions or OSs, but only tested very little with them).
The issue does not appear in Chrome 72 running on the same machine, also using HTTPs.
The counterside is an nginx HTTP server.
Expected results:
Firefox should not have logged an internal error message to the console.
Reporter | ||
Comment 1•6 years ago
|
||
The error message in question is defined here: https://hg.mozilla.org/integration/mozilla-inbound/rev/9fb93fcf2b8e#l1.10
Usages in the Firefox code:
Reporter | ||
Comment 2•6 years ago
|
||
My message about trying this out yourself was unfortunately cut off.
Unfortunately I don't have a minimal repro for it yet, but you should be able to reproduce this issue after creating a (free) account at https://benaco.com, creating a "New scan" and uploading photos there. (This should work at least until I put a workaround in place, e.g. by disabling HTTP2 support in Nginx.)
Reporter | ||
Comment 3•6 years ago
|
||
I am wondering if this is something similar to bug 1050329 and https://trac.nginx.org/nginx/ticket/1250#comment:4, but relevant only in Firefox this time.
![]() |
||
Updated•6 years ago
|
Comment 4•6 years ago
|
||
(In reply to Niklas Hambüchen from comment #0)
Nondeterministically, and roughly every ~30 photos, one of the uploads hangs forever.
I'm thinking this could be caused by the nginx max_requests issue:
http://nginx.org/en/docs/http/ngx_http_v2_module.html#http2_max_requests
It automatically closes the connection after 1000 requests.
Reporter | ||
Comment 5•6 years ago
|
||
(In reply to Valentin Gosu [:valentin] from comment #4)
I'm thinking this could be caused by the nginx max_requests issue:
Sorry, I forgot to mention it:
Given that I found that max requests issue before, I have long set http2_max_requests 1000000000
in nginx, so that should not be the issue.
(Unless it's somehow ignored, but I found that setting to fix connection reset errors I had in Chrome, so it should be absent, and the fact that this seems to happen only on FF on OSX also speaks against it given that before I experienced the nignx problem on Linux.)
Reporter | ||
Comment 6•6 years ago
|
||
I have managed to reproduce this problem on a second OSX machine in another country now, and there too it doesn't seem to happen on Linux.
Reporter | ||
Comment 7•6 years ago
|
||
I have now also managed to reproduce it in a OSX VM, running an older version of OSX (10.10.4) and Firefox, where I can see it in both 57.0.4 and 66.
So I'm pretty sure by now that this is a general issue.
I can also provide remote access to the OSX VM if that helps a developer who can't reproduce it taking a look.
Reporter | ||
Comment 8•6 years ago
|
||
I have recorded a network log while the problem occurs.
First, the console messages:
20:47:12.772 My log: ... An XHR request failed (0 attempt) ...
20:47:12.800 My log: Got exception: InvalidStateError: XMLHttpRequest state must not be LOADING or DONE.
20:47:12.801 MRA@https://cdn.benaco.com/...
Relevant sections from the network log around those times (note there's a time mapping involved, but since I recorded only for a few minutes, there is no ambiguity even when ignoring hours):
Relevant logs around those times:
2019-04-05 03:47:12.771897 UTC - [Parent 4534: Main Thread]: E/nsHttp nsHttpTransaction 0x125830000 SetRequestContext 0x125d37300
2019-04-05 03:47:12.772025 UTC - [Parent 4534: DNS Resolver #15]: D/nsHostResolver different_rrset add to set 2 88.99.64.230
2019-04-05 03:47:12.772053 UTC - [Parent 4534: Socket Thread]: D/nsSocketTransport SocketContext::TimeoutIn socket=0x12635a400, timeout=65535s
2019-04-05 03:47:12.772081 UTC - [Parent 4534: Main Thread]: V/nsHttp nsHttpConnectionMgr::AddTransaction [trans=0x125830000 0]
2019-04-05 03:47:12.772209 UTC - [Parent 4534: DNS Resolver #15]: D/nsHostResolver different_rrset false
2019-04-05 03:47:12.772234 UTC - [Parent 4534: Socket Thread]: D/nsSocketTransport not engaged
2019-04-05 03:47:12.772261 UTC - [Parent 4534: Main Thread]: D/nsSocketTransport STS dispatch [0x12582d900]
2019-04-05 03:47:12.772380 UTC - [Parent 4534: DNS Resolver #15]: D/nsHostResolver Caching host [benaco.com] record for 60 seconds (grace 60).
2019-04-05 03:47:12.772410 UTC - [Parent 4534: Socket Thread]: D/nsSocketTransport SocketContext::TimeoutIn socket=0x12583a000, timeout=65535s
2019-04-05 03:47:12.772439 UTC - [Parent 4534: Main Thread]: D/nsSocketTransport PollableEvent::Signal
2019-04-05 03:47:12.772566 UTC - [Parent 4534: DNS Resolver #15]: D/nsHostResolver CompleteLookup: benaco.com has 195.201.174.100
2019-04-05 03:47:12.772602 UTC - [Parent 4534: Socket Thread]: D/nsSocketTransport not engaged
2019-04-05 03:47:12.772697 UTC - [Parent 4534: Main Thread]: D/nsHttp nsHttpChannel::ContinueBeginConnectWithResult result [this=0x117a5f000 rv=0 mCanceled=0]
2019-04-05 03:47:12.772815 UTC - [Parent 4534: DNS Resolver #15]: D/nsHostResolver CompleteLookup: benaco.com has 138.201.20.33
2019-04-05 03:47:12.772845 UTC - [Parent 4534: Socket Thread]: D/nsSocketTransport SocketContext::TimeoutIn socket=0x12635ac00, timeout=65535s
2019-04-05 03:47:12.772872 UTC - [Parent 4534: Main Thread]: D/nsHttp nsHttpChannel [0x117a5f000] created nsChannelClassifier [0x123cfb640]
2019-04-05 03:47:12.772992 UTC - [Parent 4534: DNS Resolver #15]: D/nsHostResolver CompleteLookup: benaco.com has 88.99.64.230
2019-04-05 03:47:12.773016 UTC - [Parent 4534: Socket Thread]: D/nsSocketTransport not engaged
...
2019-04-05 03:47:12.787517 UTC - [Parent 4534: Socket Thread]: D/nsSocketTransport poll timeout: none
2019-04-05 03:47:12.787522 UTC - [Parent 4534: Socket Thread]: D/nsSocketTransport timeout = -1 milliseconds
-- no 03:47:12.800 here!
2019-04-05 03:47:12.911978 UTC - [Parent 4534: Socket Thread]: D/nsSocketTransport ...returned after 125 milliseconds
2019-04-05 03:47:12.912010 UTC - [Parent 4534: Socket Thread]: D/nsSocketTransport SocketContext::TimeoutIn socket=0x125836400, timeout=65535s
2019-04-05 03:47:12.912019 UTC - [Parent 4534: Socket Thread]: D/nsSocketTransport not engaged
Any idea what might be going on?
The only suspicious thing I could find is that at the time the first log message was printed, some DNS stuff was done.
Also note that (as you can see from the log), the host benaco.com resolves to 3 DNS A entries. I also have a TTL of 10 seconds set for those, but Firefox ignores them and does Caching host [benaco.com] record for 60 seconds (grace 60)
.
Is it possible that the DNS stuff happening can interfere in some way with the XHR?
Reporter | ||
Comment 9•6 years ago
|
||
OK, I've finally managed to also observe this problem on Linux (Ubuntu 16.04), Firefox 66.0.2, in a fresh Firefox profile.
Comment 10•6 years ago
|
||
(In reply to Niklas Hambüchen from comment #8)
Also note that (as you can see from the log), the host benaco.com resolves to 3 DNS A entries. I also have a TTL of 10 seconds set for those, but Firefox ignores them and does
Caching host [benaco.com] record for 60 seconds (grace 60)
.
We bump it up to a minimum of 60, because 10 seconds is too low to even load an entire website nowadays :)
Is it possible that the DNS stuff happening can interfere in some way with the XHR?
That seems likely. Does your application's logic require that we don't use those IPs past their expiration?
Also, do you maybe serve the same certificate from different hosts that share some IPs? I'm wondering if there's a chance that bug 1420777 might be involved here.
Reporter | ||
Comment 11•6 years ago
|
||
(In reply to Valentin Gosu [:valentin] from comment #10)
That seems likely. Does your application's logic require that we don't use those IPs past their expiration?
Generally yes, but in this case that's unlikely. This service uses short TTLs to ensure the minimum possible user disruption when we deploy things or a server goes down for some reason (using AWS Route53 health checks), but I've made sure that no deploying was happening and all servers were up while this error occurred, so the DNS query should have returned the same contents every time.
Also note this is the first tiem I'm reading Firefox network log reports, so I may be missing something (and would be very happy if somebody could explain what we're looking at in this output), but it seems to me that different_rrset false
indicates that this is triggering this code path (https://searchfox.org/mozilla-central/rev/d1f516dbfcddb9c3f4cf1bfa76ba1dc5622465c4/netwerk/dns/nsHostResolver.cpp#1649) which means nothing has changed in Firefox's perception of that DNS query, is that right?
Also, do you maybe serve the same certificate from different hosts that share some IPs? I'm wondering if there's a chance that bug 1420777 might be involved here.
Hmm, I don't think so. There are 2 domains at play (benaco.com and cdn.benaco.com), but they resolve to the same set of 3 IP addresses (though about:networking#dns shows the 3 IPs in a different order), and they currently provide exactly the same TLS certificate (which has Subject Alt Names for benaco.com, *.benaco.com and *.cdn.benaco.com).
What other things should be done to investigate this issue further?
This issue leaves me in a rough spot:
- The bug happens frequently enough that I must solve it or work around it to not get my users think that my service is broken.
- I already have some code in place to retry on failed XHRs, but it doesn't seem to kick in when this bug appears.
- I could disable HTTP2 for the service globally, possibly accepting a performance degradation.
- I could display a notice to my users that upload does not work reliably in Firefox and that they should use Chrome instead, until the issue is fixed, but as a 15 years Firefox user and advocate I would not like to do that.
In comment #1 I linked the 2 occurrences in the code that can trigger this error. Should
- that code be improved so that we can figure out which of the 2 functions it is that emits that error,
SetResponseType()
orOverrideMimeType()
? - that code be improved to not say "must not be LOADING or DONE", but say instead which of these two cases it is, so that we get more info on what the bug might be?
- I perform some more recording of any fashion that could help here?
Also, can anybody explain this invariant of why the state must not be LOADING or DONE in those 2 functions, and what code paths could lead to it being one of them?
Finally, any hints on how I could reproduce this issue more reliably? On my friend's Macbook it appears pretty frequently with one per ~40 images uploaded via XHR, but on my own Linux or OSX it can easily take me 30 minutes of effort to make it happen once.
Comment 12•6 years ago
|
||
Hi Niklas,
Thanks for all the investigation you did.
Might I ask you to share some of the logging you did, to figure out if this is a bug in our HTTP2 implementation?
I see you've figured out how to do that already, but here's a detailed set of instructions in case you need it: https://developer.mozilla.org/en-US/docs/Mozilla/Debugging/HTTP_logging
Please check this with the latest Nightly, if you're not doing that already.
If we don't see anything in the logs, then I can provide custom builds with additional logging for you to try.
Thanks a lot!
Reporter | ||
Comment 13•6 years ago
|
||
(In reply to Valentin Gosu [:valentin] from comment #12)
Thanks!
Might I ask you to share some of the logging you did, to figure out if this is a bug in our HTTP2 implementation?
The full log from which I quoted in comment #8 is available here (~200MB):
https://nh2.me/firefox-issue-1540574/OSX-vm-network-log-of-error.txt.gz
I've recorded that with the default logging settings, please let me know if I should change those to capture more detail.
Please check this with the latest Nightly, if you're not doing that already.
OK, I'll try that later today.
If we don't see anything in the logs, then I can provide custom builds with additional logging for you to try.
That would be very appreciated!
Reporter | ||
Comment 14•6 years ago
|
||
I can confirm the bug happens in Nightly too (I just tried in Nightly 2019-04-15).
Assignee | ||
Updated•6 years ago
|
Reporter | ||
Comment 15•6 years ago
|
||
Note I have now disabled HTTP2 on our servers to work around this bug, so that Firefox users have no further broken uploads.
That means you can't repro it with my repro in the issue description any more.
If you need that for working on that bug, please let me know, and I'll try and put up a second deployment or so that doesn't have the workaround.
Comment 16•6 years ago
•
|
||
Assigning to Dragana for now.
Comment 17•6 years ago
|
||
(In reply to Niklas Hambüchen from comment #15)
If you need that for working on that bug, please let me know, and I'll try and put up a second deployment or so that doesn't have the workaround.
Would it be possible to have the deployment available next week?
Updated•6 years ago
|
Reporter | ||
Comment 18•6 years ago
|
||
(In reply to Valentin Gosu [:valentin] from comment #17)
Would it be possible to have the deployment available next week?
Already ready now.
https://benaco-stagingfirefox.benaco.co.uk/register
You can type in an arbitary email address and password (can be fake). Then go to /scans, click "New", and upload image files. Mine are ~2.5 MB each.
I usually queue up 80 of them, and wait, while having the Firefox dev console open, until the error appears.
I've just tried it there, and the error reproduced 3 times within 80 photos so far.
To make interaction smooth, I've joined #developers on irc.mozilla.org (as nh2
), ping me for any questions.
Reporter | ||
Comment 19•6 years ago
|
||
A short ping about this, as I continue to maintain the separate deployment and it does have some running costs (not huge but it'd be nice if it was used).
Comment 20•6 years ago
|
||
(In reply to Niklas Hambüchen from comment #19)
A short ping about this, as I continue to maintain the separate deployment and it does have some running costs (not huge but it'd be nice if it was used).
Dragana, do you have time to check this in the next couple of days? Otherwise someone else should get on it.
Reporter | ||
Comment 21•6 years ago
|
||
Just another gentle ping, as this deployment has eaten up 200 EUR in the last 20 days.
Assignee | ||
Comment 22•6 years ago
|
||
(In reply to Niklas Hambüchen from comment #21)
Just another gentle ping, as this deployment has eaten up 200 EUR in the last 20 days.
I am sorry I have not get to this earlier. I will take a look today.
Reporter | ||
Comment 23•6 years ago
|
||
I will take a look today.
Please let me know any way I can help make it happen :)
Reporter | ||
Comment 24•6 years ago
|
||
I guess I should send another friendly ping about this.
Assignee | ||
Comment 25•6 years ago
|
||
(In reply to Niklas Hambüchen from comment #24)
I guess I should send another friendly ping about this.
I had a look at the log, but still do not know what is going on, I need to try it myself. I will be get to it this week.
Assignee | ||
Comment 26•6 years ago
|
||
I could not reproduce it using link from comment 18.
Junior can you try to reproduce it?
Comment 27•6 years ago
|
||
I can not reproduce it in OSX for nightly and release.
I get the same message when I change wifi network during upload, but I don't think it's the same issue (and also happened in H1)
Assignee | ||
Comment 28•6 years ago
|
||
I forgot to mention, I looked at the log as well, but I do not see anything strange there. All request are served successfully, there is not error there. If there is a problem it is probably in XHR code.
Reporter | ||
Comment 29•6 years ago
|
||
You will have to try with more images.
I just retried it with Nightly, and it happened after uploading a few hundred images this time.
Here is a screenshot of it happening just now: https://nh2.me/firefox-http2-xhr-error-2019-07-04.png
The easiest way to test is is to create a directory with lots of images (e.g. using symlinks), and queue them up.
I use the following for that:
# Download example image: https://nh2.me/Water_web_by_Tom_Kijas.jpg
mkdir firefox-upload-test && cd firefox-upload-test
for i in {0..5000}; do ln -s $HOME/Downloads/Water_web_by_Tom_Kijas.jpg $i.jpg; done
This creates 5000 images that I can queue up.
I have moved the deployment to https://firefox-test.aws.nh2.me, where I've also lifted the restriction to upload 100 images, so you can queue infinite amounts now now for easier reproduction.
Please give it another shot!
Reporter | ||
Comment 30•6 years ago
|
||
I also managed to reproduce it with a significantly smaller file to upload:
This is 160 Bytes instead of 1.7 MB; in this case it took ~4000 of these files to upload until the problem triggered.
Reporter | ||
Comment 31•6 years ago
|
||
OK, I have some updates here.
I think the error message
XMLHttpRequest state must not be LOADING or DONE
was just a misleading error message.
That error message happened because of retry logic in our code, that retries XHRs when they fail. While doing so, it incorrectly set .responseType
on the retried XHR that already had send()
called, which is illegal.
Firefox's error message is very confusing here, suggesting that is an internal error when it's really a programmer error; Chromium's error message is much better.
I've filed bug 1564930 to improve the error messages.
A question is why the XHR that triggered the bad retry failed in the first place. I first suspected that there may be some other underlying HTTP2 problem in Firefox, but my current data doesn't support this:
On my nginx web server, I observe the problem as HTTP 408 Request Timeout
(the client timed out sending the data):
[10/Jul/2019:13:12:31 +0000] "POST /api/scans/id/7c281a11-a454-4904-8326-0a4cef32fa45/add-photos HTTP/2.0" 408 0 "https://firefox-test.aws.nh2.me/scans/id/7c281a11-a454-4904-8326-0a4cef32fa45" "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:67.0) Gecko/20100101 Firefox/67.0"
And I have by now observed other HTTP 408 errors from other browsers and also on HTTP/1.1.
I think this means we can close this as invalid, and focus on bug 1564930 to improve the error messages emitted by Firefox, pointing out that this is a javascript-developer error, not an internal Firefox bug.
Reporter | ||
Updated•6 years ago
|
Assignee | ||
Updated•5 years ago
|
Description
•