Closed Bug 1050329 Opened 10 years ago Closed 6 years ago

h2: Better behavior when sending/receiving GOAWAY

Categories

(Core :: Networking: HTTP, defect, P3)

defect

Tracking

()

RESOLVED FIXED
mozilla65
Tracking Status
firefox65 --- fixed

People

(Reporter: u408661, Assigned: u408661)

References

Details

(Whiteboard: [spdy][necko-backlog])

Attachments

(5 files)

Right now when we receive a GOAWAY, we just kill the connection and show no error to the user (see also bug 1050063). We should at least show an error, perhaps even re-dial without h2 in the mix.
OS: Mac OS X → All
Hardware: x86 → All
See Also: → 1050063
Actually, GOAWAY in general might be ok. It's when we get a GOAWAY when we're expecting the server preamble (a SETTINGS frame) that got us into trouble in bug 1050063. In that case, we should almost definitely dial back without h2, since so far as we know, the other endpoint isn't speaking h2 at all...
Whiteboard: [spdy] → [spdy][necko-backlog]
Recently nginx started recycling HTTP/2 connections after `http2_max_requests`[1], which means that after making 1000(by default) requests through the same h2 connection browser will receive GOAWAY frame.

Currently Firefox does not properly handle cases where there are multiple streams inflight and `GOAWAY` is received. At that point browser should fetch all the responses until `last_accepted_stream_id`, and retry anything after it in the new connection(regardless of whether requests were idempotent or not, since webserver explicitly specified that it did not even try to handle them).

Easiest way to reproduce this is to set `http2_max_requests` to 1. Here is how gophertiles benchmark[2] looks when fronted by nginx[3] and fetched by Firefox Nightly:
https://www.dropbox.com/scl/fi/t971yx77jbs5f34la301i/Screenshot%202017-02-17%2022.58.11.png?dl=0


[1] http://hg.nginx.org/nginx/rev/9027991e2f37
[2] https://http2.golang.org/gophertiles
[3] https://paste.ubuntu.com/24018560/
SaveTheRbtz - this is a different bug (the pre-existing one here is for a server that is not compliant with 7540). It would be great if you could open the new one - if not I will do so and cc you.

I'm surprised at your report though - there is code to handle GOAWAY in the way that you descirbe. I'll investigate. what is the testcase in comment 3 supposed to look like?

you mention the idempotent issue - is that a factor in your testcase?
Closing after N transactions
============================
Silly as it is, sadly I do not see other way for long poll servers, since if client polls on the multiple endpoints, then h2 connection will always have at least one active transaction and idle timeout will never go off.
Same goes for DDoS scenarios when client continuously creates streams within the same connection.


Reproducible test
=================
Feel free to use following Docker container: https://github.com/SaveTheRbtz/h2-goaway

Inside you find:
* latest nginx-mainline version with h2 support
* self signed tls certs and nginx config
* snapshot of the Gophertiles http2 test
* very basic instructions

tcpdump
=======
Here is a network capture: https://www.dropbox.com/scl/fo/tkli0z5j36yobhsa144vi/AABrG9LI3Th56PBdiMWRKr7Ka?dl=0

Inside you'll find:
* pcap for the same gophertiles test
* sslkeylog for it

Here is a terse version of it: https://www.dropbox.com/scl/fi/koui2ivz36jv52aj8kp0x/Screenshot%202017-02-19%2023.29.59.png?dl=0

nginx there looks spec compliant, by sending(in order):
* h2: send GOAWAY
* h2: send HEADERS
* h2: send DATA
* tls: send close notify alert
* tcp: send FIN

firefox on the other hand:
* tcp: send ACK to the FIN packet
* h2: send GOAWAY
* tcp: recv RST

For that scenario spec says following:
"A GOAWAY frame might not immediately precede closing of the connection; a receiver of a GOAWAY that has no more use for the connection SHOULD still send a GOAWAY frame before terminating the connection."

Which, to my understanding, implies that either close(2) or shutdown(2) is OK after GOAWAY is sent.

When firefox receives TLS "close notify" alert, there is no reason to send any data on that TLS connection.
Also firefox can just read all the data from the socket before sending GOAWAY.
Flags: needinfo?(SaveTheRbtz)
you don't mention nginx sending the RST, just FF receiving it,  but it must :) The problem with RST is that it results in data loss - its not like a FIN with a sequence number. It can even result in data that has been acked by the client's kernel being dropped before firefox can read it and certainly any data that needs retransmission or whatever will not be retransmitted on a RST connection.

Basically, if you are generating a RST you can't expect the application protocol to work. That's why the server needs to half-close (shutdown() instead of close()) and wait for the GOAWAY from the client to arrive if it wants to work reliably. (step 3 in comment 6).

please see in the capture I attached that because of the RST the close/notify and fin never appear in the trace. The RST is a huge hammer - if it appears in a trace you can't expect things to work reliably.
(In reply to SaveTheRbtz from comment #8)
> Closing after N transactions
> ============================
> Silly as it is,

I want to be clear, I'm interested in discussing this point because I don't think its the best engineering choice for nginx (and I like nginx!) - but such an algorithm certainly is spec compliant (and timeout based scenarios devolve to similar code paths) so I want it to work.. I'm not debating that :)

> sadly I do not see other way for long poll servers, since if
> client polls on the multiple endpoints, then h2 connection will always have
> at least one active transaction and idle timeout will never go off.

your definition of idle can be based on frames.. so an idle long poll would just be killed with no frame level activity after a certain amount of time (RST_STREAM presumably before the GOAWAY).

> Same goes for DDoS scenarios when client continuously creates streams within
> the same connection.

on DDoS you should just FIN the connection - you're obviously not worried about graceful retry for DDoS - presumably they've then picked up a firewall rule as well.
I appreciate the docker container from comment 8 (that will really help if we need to troubleshoot further), but as long as the server is generating RST that's going to be the crux of the issue.. its probably INVALID until that's resolved or I understand the problem differently.

Someone else suggested to me that there should be some retrying in the case of the RST (even though the GOAWAY is never seen). I need to think about that some, but its certainly not about GOAWAY handling.
> you don't mention nginx sending the RST, just FF receiving it,  but it must :) The problem with RST is that it results in data loss - its not like a FIN with a sequence number. It can even result in data that has been acked by the client's kernel being dropped before firefox can read it and certainly any data that needs retransmission or whatever will not be retransmitted on a RST connection.

Because of the strict ordering of these events, my points here are:
* There is no data loss if firefox reads all the data from the socket after either "close notify" alert(TLS connection close) or FIN(tcp close/shutdown) is received.
* Firefox should not send GOAWAY in response to GOAWAY, instead it should send it as an indication of the  connection being closed from its side, again, only after all the data from buffer has been read.
* After receiving `close_notify` Firefox should only send it's own `close_notify` and close the connection.  Sending GOAWAY is against rfc5246[1]:
>   The other party MUST respond with a close_notify
>   alert of its own and close down the connection immediately,
>   discarding any pending writes.  It is not required for the initiator
>   of the close to wait for the responding close_notify alert before
>   closing the read side of the connection.


[1] https://tools.ietf.org/html/rfc5246#section-7.2.1
Flags: needinfo?(SaveTheRbtz)
I know my response is going to sound kind of pedantic, but please believe me its actually rooted in some very deep practical issues.

> Because of the strict ordering of these events, my points here are:

Packet receipt is not strictly ordered - things get lost, reordered and retransmitted. Its TCP's job to put them in order - unfortunately the TCP server sent a RST which aborts all TCP processing. No retransmissions happen, no ordering problems are fixed, unread kernel buffers are simply deleted (even if acked!). You can't ignore the RST in this scenario - generating that is what causes these problems and the fact that it is generated after those other pieces of information is simply a favorable race condition - it is not a guarantee.

> * There is no data loss if firefox reads all the data from the socket after
> either "close notify" alert(TLS connection close) or FIN(tcp close/shutdown)
> is received.

So look closely at the capture I attached here - its from Firefox's perspective. There is no TLS connection close or TCP FIN or even GOAWAY. There are however 31 TCP RSTs from the server (starting at frame #4199)

The last server frame prior to the first is 4195 - it contains part of a TLS frame, but not the whole thing so it cannot be decrypted. Perhaps that has the GOAWAY in it - no way to tell.

I'm not saying nginx didn't do write() with all of this stuff in it, its just that when it called close() with pending data in its read buffers (HEADERS requests and WINDOW_UPDATEs no doubt) the TCP stack was aborted. Not everything was sent, stuff that was sent isn't acked or retransmitted, and even stuff that is received by the kernel cannot be passed up to userspace.

If you see a RST in a packet trace you're generally risking data loss.

> * Firefox should not send GOAWAY in response to GOAWAY, instead it should
> send it as an indication of the  connection being closed from its side,
> again, only after all the data from buffer has been read.

it is an indication it is going to close its connection and it wouldn't do it until the outstanding streams were read.

> * After receiving `close_notify` Firefox should only send it's own
> `close_notify` and close the connection.  Sending GOAWAY is against
> rfc5246[1]:
> >   The other party MUST respond with a close_notify
> >   alert of its own and close down the connection immediately,
> >   discarding any pending writes.  It is not required for the initiator
> >   of the close to wait for the responding close_notify alert before
> >   closing the read side of the connection.
> 
> 
> [1] https://tools.ietf.org/html/rfc5246#section-7.2.1

because of the rst firefox never receives close_notify - see packet trace.

I do think this is a structural flaw in h2 fwiw. But the only way to deal with it is for the server to be (more confident) it isn't generating TCP RSTs.. the reasonable way to do that is some variation of sending GOAWAY and waiting for the client in turn to say its also done and some variation of sending its own GOAWAY or closing the connection. It makes sense that some timeout would be required to bound that - which sucks. But nothing is going to be reliable in the face of RST so we need to work around that first.
See Also: → 1411659
Update title to match reality - we need to handle both receiving *and* sending GOAWAY frames better with an error page.
Summary: h2: Better behavior when receiving GOAWAY → h2: Better behavior when sending/receiving GOAWAY
I have written up the TCP RST problem discussed by mcmanus and SaveTheRbtz here on the nginx bug tracker: https://trac.nginx.org/nginx/ticket/1250#comment:4
Previously this was a macro, but in later patches in this series, I want to make the macro a bit smarter. There's no particular need to have it as a macro (and macros that return from functions are often considered no-nos), so this makes it a method on the session, instead.
Previously, we would just let these fail. But, when a peer claims to speak h2 via ALPN, and then plainly doesn't speak h2 (by not doing the opening handshake properly), we should re-try any transactions dispatched to that session using http/1.1 only. No use in giving the user a horrible experience. We will also collect telemetry on how often we have sessions where this happens, so we can see how big of a problem this is (and thus if we need to do any kind of outreach).

Depends on D8432
This is kind of like the previous patch (where we had a not-very-friendly user experience shutting down misbehaving h2 sessions), but in this case the server has proven to us that it can speak a minimum of h2, so we don't want to just fallback. Instead, when we send a GOAWAY frame because we have detected some error on the part of the server, if it's a top-level page load, we'll show an error page explaining that the server spoke bad http/2, and the site admin(s) need to be contacted. We already did this for INADEQUATE_SECURITY (which is its own special case still), but that didn't cover all the cases.
Attached file h2-data-review.txt
Attachment #9016711 - Flags: review?(francois)
Comment on attachment 9016711 [details]
h2-data-review.txt

1) Is there or will there be **documentation** that describes the schema for the ultimate data set available publicly, complete and accurate?

Yes, in Histograms.json.

2) Is there a control mechanism that allows the user to turn the data collection on and off?

Yes, telemetry setting.

3) If the request is for permanent data collection, is there someone who will monitor the data over time?**

Yes, Nicholas Hurley.

4) Using the **[category system of data types](https://wiki.mozilla.org/Firefox/Data_Collection)** on the Mozilla wiki, what collection type of data do the requested measurements fall under?  **

Category 1.

5) Is the data collection request for default-on or default-off?

Default ON only in pre-release channels.

6) Does the instrumentation include the addition of **any *new* identifiers** (whether anonymous or otherwise; e.g., username, random IDs, etc.  See the appendix for more details)?

No.

7) Is the data collection covered by the existing Firefox privacy notice?

Yes.

8) Does there need to be a check-in in the future to determine whether to renew the data?

No, permanent.
Attachment #9016711 - Flags: review?(francois) → review+
See Also: → 1500743
Pushed by hurley@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/336552a30e06
(prereq) Make RETURN_SESSION_ERROR a method. r=dragana
https://hg.mozilla.org/integration/autoland/rev/dd75cefee794
part 1 - Re-start transactions on totally busted h2 sessions. r=francois,dragana
https://hg.mozilla.org/integration/autoland/rev/8ee3ec221442
part 2 - Add error page for h2 goaway. r=dragana,bzbarsky
networkProtocolError=Firefox has experienced a network protocol violation that cannot be repaired.

Is it correct to have "Firefox" mentioned in the /dom file? As far as I remember, we can't easily use replacements in /browser and therefore we hard-code Firefox, but I'm not sure that's OK for dom (which should be brand agnostic)?
Flags: needinfo?(hurley)
Flags: needinfo?(u408661)
You need to log in before you can comment on or make changes to this bug.