Closed Bug 1540574 Opened 6 years ago Closed 6 years ago

XMLHttpRequest state must not be LOADING or DONE is sometimes triggered when HTTP2 is used

Categories

(Core :: Networking: HTTP, defect, P2)

66 Branch
defect

Tracking

()

RESOLVED INVALID

People

(Reporter: mail, Assigned: dragana)

Details

(Whiteboard: [necko-triaged])

User Agent: Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:66.0) Gecko/20100101 Firefox/66.0

Steps to reproduce:

As part of a larger web app, upload photos with XMLHttpRequest. There are 32 photos a ~4MB each, and 4-6 of them are uploaded in parallel.

The photo upload https://benaco.com

Actual results:

Nondeterministically, and roughly every ~30 photos, one of the uploads hangs forever.

The console shows

InvalidStateError: XMLHttpRequest state must not be LOADING or DONE

when that happens.

The progress counter I've built using ProgressEvent.loaded shows the upload hanging at 00%.

The Network dev panel shows the HTTP POST request hanging (without HTTP status code).

Disabling HTTP2 via network.http.spdy.enabled.http2 = false appears to fix the issue.

I have observed this issue in Firefox 61 and 66 on OSX (not yet with other versions or OSs, but only tested very little with them).

The issue does not appear in Chrome 72 running on the same machine, also using HTTPs.

The counterside is an nginx HTTP server.

Expected results:

Firefox should not have logged an internal error message to the console.

My message about trying this out yourself was unfortunately cut off.

Unfortunately I don't have a minimal repro for it yet, but you should be able to reproduce this issue after creating a (free) account at https://benaco.com, creating a "New scan" and uploading photos there. (This should work at least until I put a workaround in place, e.g. by disabling HTTP2 support in Nginx.)

I am wondering if this is something similar to bug 1050329 and https://trac.nginx.org/nginx/ticket/1250#comment:4, but relevant only in Firefox this time.

Component: Untriaged → DOM: Networking
Product: Firefox → Core

(In reply to Niklas Hambüchen from comment #0)

Nondeterministically, and roughly every ~30 photos, one of the uploads hangs forever.

I'm thinking this could be caused by the nginx max_requests issue:
http://nginx.org/en/docs/http/ngx_http_v2_module.html#http2_max_requests

It automatically closes the connection after 1000 requests.

(In reply to Valentin Gosu [:valentin] from comment #4)

I'm thinking this could be caused by the nginx max_requests issue:

Sorry, I forgot to mention it:

Given that I found that max requests issue before, I have long set http2_max_requests 1000000000 in nginx, so that should not be the issue.

(Unless it's somehow ignored, but I found that setting to fix connection reset errors I had in Chrome, so it should be absent, and the fact that this seems to happen only on FF on OSX also speaks against it given that before I experienced the nignx problem on Linux.)

I have managed to reproduce this problem on a second OSX machine in another country now, and there too it doesn't seem to happen on Linux.

I have now also managed to reproduce it in a OSX VM, running an older version of OSX (10.10.4) and Firefox, where I can see it in both 57.0.4 and 66.

So I'm pretty sure by now that this is a general issue.

I can also provide remote access to the OSX VM if that helps a developer who can't reproduce it taking a look.

I have recorded a network log while the problem occurs.

First, the console messages:

20:47:12.772 My log: ... An XHR request failed (0 attempt) ...
20:47:12.800 My log: Got exception: InvalidStateError: XMLHttpRequest state must not be LOADING or DONE.
20:47:12.801 MRA@https://cdn.benaco.com/...

Relevant sections from the network log around those times (note there's a time mapping involved, but since I recorded only for a few minutes, there is no ambiguity even when ignoring hours):

Relevant logs around those times:

2019-04-05 03:47:12.771897 UTC - [Parent 4534: Main Thread]: E/nsHttp nsHttpTransaction 0x125830000 SetRequestContext 0x125d37300
2019-04-05 03:47:12.772025 UTC - [Parent 4534: DNS Resolver #15]: D/nsHostResolver different_rrset add to set 2 88.99.64.230
2019-04-05 03:47:12.772053 UTC - [Parent 4534: Socket Thread]: D/nsSocketTransport SocketContext::TimeoutIn socket=0x12635a400, timeout=65535s
2019-04-05 03:47:12.772081 UTC - [Parent 4534: Main Thread]: V/nsHttp nsHttpConnectionMgr::AddTransaction [trans=0x125830000 0]
2019-04-05 03:47:12.772209 UTC - [Parent 4534: DNS Resolver #15]: D/nsHostResolver different_rrset false
2019-04-05 03:47:12.772234 UTC - [Parent 4534: Socket Thread]: D/nsSocketTransport   not engaged
2019-04-05 03:47:12.772261 UTC - [Parent 4534: Main Thread]: D/nsSocketTransport STS dispatch [0x12582d900]
2019-04-05 03:47:12.772380 UTC - [Parent 4534: DNS Resolver #15]: D/nsHostResolver Caching host [benaco.com] record for 60 seconds (grace 60).
2019-04-05 03:47:12.772410 UTC - [Parent 4534: Socket Thread]: D/nsSocketTransport SocketContext::TimeoutIn socket=0x12583a000, timeout=65535s
2019-04-05 03:47:12.772439 UTC - [Parent 4534: Main Thread]: D/nsSocketTransport PollableEvent::Signal
2019-04-05 03:47:12.772566 UTC - [Parent 4534: DNS Resolver #15]: D/nsHostResolver CompleteLookup: benaco.com has 195.201.174.100
2019-04-05 03:47:12.772602 UTC - [Parent 4534: Socket Thread]: D/nsSocketTransport   not engaged
2019-04-05 03:47:12.772697 UTC - [Parent 4534: Main Thread]: D/nsHttp nsHttpChannel::ContinueBeginConnectWithResult result [this=0x117a5f000 rv=0 mCanceled=0]
2019-04-05 03:47:12.772815 UTC - [Parent 4534: DNS Resolver #15]: D/nsHostResolver CompleteLookup: benaco.com has 138.201.20.33
2019-04-05 03:47:12.772845 UTC - [Parent 4534: Socket Thread]: D/nsSocketTransport SocketContext::TimeoutIn socket=0x12635ac00, timeout=65535s
2019-04-05 03:47:12.772872 UTC - [Parent 4534: Main Thread]: D/nsHttp nsHttpChannel [0x117a5f000] created nsChannelClassifier [0x123cfb640]
2019-04-05 03:47:12.772992 UTC - [Parent 4534: DNS Resolver #15]: D/nsHostResolver CompleteLookup: benaco.com has 88.99.64.230
2019-04-05 03:47:12.773016 UTC - [Parent 4534: Socket Thread]: D/nsSocketTransport   not engaged
...
2019-04-05 03:47:12.787517 UTC - [Parent 4534: Socket Thread]: D/nsSocketTransport poll timeout: none
2019-04-05 03:47:12.787522 UTC - [Parent 4534: Socket Thread]: D/nsSocketTransport     timeout = -1 milliseconds
-- no 03:47:12.800 here!
2019-04-05 03:47:12.911978 UTC - [Parent 4534: Socket Thread]: D/nsSocketTransport     ...returned after 125 milliseconds
2019-04-05 03:47:12.912010 UTC - [Parent 4534: Socket Thread]: D/nsSocketTransport SocketContext::TimeoutIn socket=0x125836400, timeout=65535s
2019-04-05 03:47:12.912019 UTC - [Parent 4534: Socket Thread]: D/nsSocketTransport   not engaged

Any idea what might be going on?

The only suspicious thing I could find is that at the time the first log message was printed, some DNS stuff was done.

Also note that (as you can see from the log), the host benaco.com resolves to 3 DNS A entries. I also have a TTL of 10 seconds set for those, but Firefox ignores them and does Caching host [benaco.com] record for 60 seconds (grace 60).

Is it possible that the DNS stuff happening can interfere in some way with the XHR?

OK, I've finally managed to also observe this problem on Linux (Ubuntu 16.04), Firefox 66.0.2, in a fresh Firefox profile.

(In reply to Niklas Hambüchen from comment #8)

Also note that (as you can see from the log), the host benaco.com resolves to 3 DNS A entries. I also have a TTL of 10 seconds set for those, but Firefox ignores them and does Caching host [benaco.com] record for 60 seconds (grace 60).

We bump it up to a minimum of 60, because 10 seconds is too low to even load an entire website nowadays :)

Is it possible that the DNS stuff happening can interfere in some way with the XHR?

That seems likely. Does your application's logic require that we don't use those IPs past their expiration?
Also, do you maybe serve the same certificate from different hosts that share some IPs? I'm wondering if there's a chance that bug 1420777 might be involved here.

(In reply to Valentin Gosu [:valentin] from comment #10)

That seems likely. Does your application's logic require that we don't use those IPs past their expiration?

Generally yes, but in this case that's unlikely. This service uses short TTLs to ensure the minimum possible user disruption when we deploy things or a server goes down for some reason (using AWS Route53 health checks), but I've made sure that no deploying was happening and all servers were up while this error occurred, so the DNS query should have returned the same contents every time.

Also note this is the first tiem I'm reading Firefox network log reports, so I may be missing something (and would be very happy if somebody could explain what we're looking at in this output), but it seems to me that different_rrset false indicates that this is triggering this code path (https://searchfox.org/mozilla-central/rev/d1f516dbfcddb9c3f4cf1bfa76ba1dc5622465c4/netwerk/dns/nsHostResolver.cpp#1649) which means nothing has changed in Firefox's perception of that DNS query, is that right?

Also, do you maybe serve the same certificate from different hosts that share some IPs? I'm wondering if there's a chance that bug 1420777 might be involved here.

Hmm, I don't think so. There are 2 domains at play (benaco.com and cdn.benaco.com), but they resolve to the same set of 3 IP addresses (though about:networking#dns shows the 3 IPs in a different order), and they currently provide exactly the same TLS certificate (which has Subject Alt Names for benaco.com, *.benaco.com and *.cdn.benaco.com).

What other things should be done to investigate this issue further?

This issue leaves me in a rough spot:

  • The bug happens frequently enough that I must solve it or work around it to not get my users think that my service is broken.
  • I already have some code in place to retry on failed XHRs, but it doesn't seem to kick in when this bug appears.
  • I could disable HTTP2 for the service globally, possibly accepting a performance degradation.
  • I could display a notice to my users that upload does not work reliably in Firefox and that they should use Chrome instead, until the issue is fixed, but as a 15 years Firefox user and advocate I would not like to do that.

In comment #1 I linked the 2 occurrences in the code that can trigger this error. Should

  • that code be improved so that we can figure out which of the 2 functions it is that emits that error, SetResponseType() or OverrideMimeType()?
  • that code be improved to not say "must not be LOADING or DONE", but say instead which of these two cases it is, so that we get more info on what the bug might be?
  • I perform some more recording of any fashion that could help here?

Also, can anybody explain this invariant of why the state must not be LOADING or DONE in those 2 functions, and what code paths could lead to it being one of them?

Finally, any hints on how I could reproduce this issue more reliably? On my friend's Macbook it appears pretty frequently with one per ~40 images uploaded via XHR, but on my own Linux or OSX it can easily take me 30 minutes of effort to make it happen once.

Hi Niklas,

Thanks for all the investigation you did.
Might I ask you to share some of the logging you did, to figure out if this is a bug in our HTTP2 implementation?
I see you've figured out how to do that already, but here's a detailed set of instructions in case you need it: https://developer.mozilla.org/en-US/docs/Mozilla/Debugging/HTTP_logging

Please check this with the latest Nightly, if you're not doing that already.

If we don't see anything in the logs, then I can provide custom builds with additional logging for you to try.

Thanks a lot!

Component: DOM: Networking → Networking: HTTP

(In reply to Valentin Gosu [:valentin] from comment #12)

Thanks!

Might I ask you to share some of the logging you did, to figure out if this is a bug in our HTTP2 implementation?

The full log from which I quoted in comment #8 is available here (~200MB):

https://nh2.me/firefox-issue-1540574/OSX-vm-network-log-of-error.txt.gz

I've recorded that with the default logging settings, please let me know if I should change those to capture more detail.

Please check this with the latest Nightly, if you're not doing that already.

OK, I'll try that later today.

If we don't see anything in the logs, then I can provide custom builds with additional logging for you to try.

That would be very appreciated!

I can confirm the bug happens in Nightly too (I just tried in Nightly 2019-04-15).

Flags: needinfo?(dd.mozilla)

Note I have now disabled HTTP2 on our servers to work around this bug, so that Firefox users have no further broken uploads.

That means you can't repro it with my repro in the issue description any more.

If you need that for working on that bug, please let me know, and I'll try and put up a second deployment or so that doesn't have the workaround.

Assigning to Dragana for now.

Assignee: nobody → dd.mozilla
Priority: -- → P2
Whiteboard: [necko-triaged]

(In reply to Niklas Hambüchen from comment #15)

If you need that for working on that bug, please let me know, and I'll try and put up a second deployment or so that doesn't have the workaround.

Would it be possible to have the deployment available next week?

Flags: needinfo?(mail)
Status: UNCONFIRMED → ASSIGNED
Ever confirmed: true

(In reply to Valentin Gosu [:valentin] from comment #17)

Would it be possible to have the deployment available next week?

Already ready now.

https://benaco-stagingfirefox.benaco.co.uk/register

You can type in an arbitary email address and password (can be fake). Then go to /scans, click "New", and upload image files. Mine are ~2.5 MB each.

I usually queue up 80 of them, and wait, while having the Firefox dev console open, until the error appears.

I've just tried it there, and the error reproduced 3 times within 80 photos so far.

To make interaction smooth, I've joined #developers on irc.mozilla.org (as nh2), ping me for any questions.

Flags: needinfo?(mail)

A short ping about this, as I continue to maintain the separate deployment and it does have some running costs (not huge but it'd be nice if it was used).

(In reply to Niklas Hambüchen from comment #19)

A short ping about this, as I continue to maintain the separate deployment and it does have some running costs (not huge but it'd be nice if it was used).

Dragana, do you have time to check this in the next couple of days? Otherwise someone else should get on it.

Just another gentle ping, as this deployment has eaten up 200 EUR in the last 20 days.

(In reply to Niklas Hambüchen from comment #21)

Just another gentle ping, as this deployment has eaten up 200 EUR in the last 20 days.

I am sorry I have not get to this earlier. I will take a look today.

Flags: needinfo?(dd.mozilla)

I will take a look today.

Please let me know any way I can help make it happen :)

I guess I should send another friendly ping about this.

Flags: needinfo?(dd.mozilla)

(In reply to Niklas Hambüchen from comment #24)

I guess I should send another friendly ping about this.

I had a look at the log, but still do not know what is going on, I need to try it myself. I will be get to it this week.

Flags: needinfo?(dd.mozilla)

I could not reproduce it using link from comment 18.
Junior can you try to reproduce it?

Flags: needinfo?(juhsu)

I can not reproduce it in OSX for nightly and release.

I get the same message when I change wifi network during upload, but I don't think it's the same issue (and also happened in H1)

Flags: needinfo?(juhsu)

I forgot to mention, I looked at the log as well, but I do not see anything strange there. All request are served successfully, there is not error there. If there is a problem it is probably in XHR code.

You will have to try with more images.

I just retried it with Nightly, and it happened after uploading a few hundred images this time.

Here is a screenshot of it happening just now: https://nh2.me/firefox-http2-xhr-error-2019-07-04.png

The easiest way to test is is to create a directory with lots of images (e.g. using symlinks), and queue them up.

I use the following for that:

# Download example image: https://nh2.me/Water_web_by_Tom_Kijas.jpg
mkdir firefox-upload-test && cd firefox-upload-test
for i in {0..5000}; do ln -s $HOME/Downloads/Water_web_by_Tom_Kijas.jpg $i.jpg; done

This creates 5000 images that I can queue up.

I have moved the deployment to https://firefox-test.aws.nh2.me, where I've also lifted the restriction to upload 100 images, so you can queue infinite amounts now now for easier reproduction.

Please give it another shot!

Flags: needinfo?(dd.mozilla)

I also managed to reproduce it with a significantly smaller file to upload:

https://nh2.me/1px.jpg

This is 160 Bytes instead of 1.7 MB; in this case it took ~4000 of these files to upload until the problem triggered.

OK, I have some updates here.

I think the error message

XMLHttpRequest state must not be LOADING or DONE

was just a misleading error message.

That error message happened because of retry logic in our code, that retries XHRs when they fail. While doing so, it incorrectly set .responseType on the retried XHR that already had send() called, which is illegal.

Firefox's error message is very confusing here, suggesting that is an internal error when it's really a programmer error; Chromium's error message is much better.

I've filed bug 1564930 to improve the error messages.

A question is why the XHR that triggered the bad retry failed in the first place. I first suspected that there may be some other underlying HTTP2 problem in Firefox, but my current data doesn't support this:

On my nginx web server, I observe the problem as HTTP 408 Request Timeout (the client timed out sending the data):

[10/Jul/2019:13:12:31 +0000] "POST /api/scans/id/7c281a11-a454-4904-8326-0a4cef32fa45/add-photos HTTP/2.0" 408 0 "https://firefox-test.aws.nh2.me/scans/id/7c281a11-a454-4904-8326-0a4cef32fa45" "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:67.0) Gecko/20100101 Firefox/67.0"

And I have by now observed other HTTP 408 errors from other browsers and also on HTTP/1.1.

I think this means we can close this as invalid, and focus on bug 1564930 to improve the error messages emitted by Firefox, pointing out that this is a javascript-developer error, not an internal Firefox bug.

Status: ASSIGNED → RESOLVED
Closed: 6 years ago
Resolution: --- → INVALID
Flags: needinfo?(dd.mozilla)
You need to log in before you can comment on or make changes to this bug.