Open Bug 1868987 Opened 10 months ago Updated 6 months ago

HTTP/2 upload to nginx breaks down with low bandwidth even though data is still being sent

Categories

(Core :: Networking: HTTP, defect, P2)

Firefox 121
defect

Tracking

Status: UNCONFIRMED

People

(Reporter: el, Unassigned, NeedInfo)

References

(Blocks 1 open bug, Regression)

Details

(Keywords: regression, Whiteboard: [necko-triaged])

Attachments

(1 file, 1 obsolete file)

User Agent: Mozilla/5.0 (X11; Linux x86_64; rv:121.0) Gecko/20100101 Firefox/121.0

Steps to reproduce:

HTTP/2 upload to nginx breaks down with low bandwidth even though data is still being sent. (This bug is NOT about the UI not showing an error as a result, that would be ticket https://bugzilla.mozilla.org/show_bug.cgi?id=1842429 instead.)

Here's how to see the upload break when it shouldn't:

  1. Make sure network.http.http2.enabled in about:config is true
  2. Use a connection hard-limited to 2mbit/s or less. (IMPORTANT: this might need to be enforced at the OS or router level, not just with Firefox's integrated limiter, which I didn't test with.)
  3. Get the file https://webtest.horse64.org/file.dat
  4. Go to the form at https://webtest.horse64.org/
  5. For the file to upload, pick "file.dat"
  6. Press "Upload now" and wait, it will take a long time, expect 5 to 15 minutes or something.

Actual results:

With network.http.http2.enabled set to true, the upload eventually fails even though the connection is never interrupted and never fully stalls either. Also, due to https://bugzilla.mozilla.org/show_bug.cgi?id=1842429 no error is shown at all; eventually the tab's loading-in-progress indicator just silently disappears with no feedback whatsoever that anything even happened.

With network.http.http2.enabled set to false, it works!

Expected results:

The page changes to a different one with "Success" as a big heading. This is what happens with network.http.http2.enabled set to false, but it should also work with HTTP/2 enabled.

Component: Untriaged → Networking: HTTP
Product: Firefox → Core

Just for clarity, by 2mbit/s I mean the equivalent of 250 kilobytes per second (2,000,000 bits/s divided by 8, so about 0.24 MiB/s), not a full 2 megabytes per second. I'm not sure whether the exact value matters, but the connection needs to be slow enough that whatever ping detection HTTP/2 does breaks down under the upload itself; 2mbit/s seems to satisfy that, while higher speeds might not.

Thanks for the report.

Unfortunately, I can't reproduce this, probably because my slow-network condition is different (maybe a different kind of packet loss or something else). A networking log would be useful here. Can you record an HTTP log with the error occurring? https://firefox-source-docs.mozilla.org/networking/http/logging.html (about:logging with the Networking preset)
Using a new profile (about:profiles) on your test page https://webtest.horse64.org/ would be great, so that the log doesn't contain sensitive data.

Flags: needinfo?(el)

Yeah, the limit might need to be enforced on a device past yours, like the router, since I'm not entirely sure whether the exact way the timeout and the connection stalling happen is slightly influenced by how the Linux kernel prioritizes things on the networking layer. Nevertheless, I can reproduce it with a nearly 100% rate here, so I'm happy to provide all the logs you need. If you limit both up- and downstream to 2mbit/s somewhere behind your device, you should also be able to reproduce it with a nearly 100% success rate if you ever need to.

I uploaded the logs now, I hope the right kind: https://horse64.org/temp/errorlogs-firefox-1868987.zip

(It should be lean but if you actually spot sensitive data let me know lol, I used a blank profile but it's hard for me to check if I accidentally still included something I shouldn't have. It does seem to include the actual bytes sent so that'll hopefully be useful for debugging)

Flags: needinfo?(el)

sorry, I made a small typo: *"It should be clean but if you actually spot..."

Manuel, do you see any obvious Necko issue from the logs?

Flags: needinfo?(manuel)

I should mention that Opera has a similar issue when HTTP/2 is enabled, showing this error: https://bug1842429.bmoattachments.org/attachment.cgi?id=9342934 I don't know HTTP/2 well, just HTTP/1.1, but it feels like it might be a design issue with slow connections and however the pinging is handled. (For example, does that happen on a separate connection? I could easily see that getting drowned out when the bandwidth is fully saturated. But concluding anything from that about the health of the upload connection itself seems incorrect.) Unless this somehow is actually an nginx bug?

Attached file relevant_log_lines.txt
Still analyzing. Will add result of analysis in next comment. Accidentally submitted.

Thanks for providing the log file. It is helpful and seems to contain the relevant info.

We close the HTTP/2 connection due to a ping timeout. The PING mechanism in HTTP/2 should only be used on idle connections [1]. We are in the middle of uploading, so we should not time out and send a GOAWAY frame. The bug seems to be on our side.

2023-12-11 18:05:37.278939 UTC - [Parent 19692: Socket Thread]: I/nsHttp Http2Session::ReadTimeoutTick 7faf1adf6600 delta since last read 68s
2023-12-11 18:05:37.278952 UTC - [Parent 19692: Socket Thread]: I/nsHttp Http2Session::ReadTimeoutTick 7faf1adf6600 handle outstanding ping, timeout=8000
2023-12-11 18:05:37.278965 UTC - [Parent 19692: Socket Thread]: I/nsHttp Http2Session::ReadTimeoutTick 7faf1adf6600 Ping Timer Exhaustion
2023-12-11 18:05:37.278978 UTC - [Parent 19692: Socket Thread]: I/nsHttp Http2Session::Close 7faf1adf6600 804B000E
2023-12-11 18:05:37.279005 UTC - [Parent 19692: Socket Thread]: I/nsHttp MaybeDecrementConcurrent 7faf1adf6600 id=0x13 concurrent=1 active=1
2023-12-11 18:05:37.279021 UTC - [Parent 19692: Socket Thread]: V/nsHttp nsHttpTransaction::Close [this=7faf18abf600 reason=80004004]
2023-12-11 18:05:37.279037 UTC - [Parent 19692: Socket Thread]: V/nsHttp nsHttpConnectionMgr::RemoveActiveTransaction t=7faf18abf600 tabid=10(1) thr=0
2023-12-11 18:05:37.279091 UTC - [Parent 19692: Socket Thread]: V/nsHttp nsHttpTransaction::ShouldRestartOn0RttError [this=7faf18abf600, mEarlyDataWasAvailable=0 error=80004004]
2023-12-11 18:05:37.279125 UTC - [Parent 19692: Socket Thread]: V/nsHttp nsHttpTransaction::RemoveDispatchedAsBlocking this=7faf18abf600 not blocking
2023-12-11 18:05:37.279139 UTC - [Parent 19692: Socket Thread]: V/nsHttp nsHttpTransaction 7faf18abf600 request context set to null in ReleaseBlockingTransaction() - was 7faf13d5e580
2023-12-11 18:05:37.279193 UTC - [Parent 19692: Socket Thread]: I/nsHttp Http2Session::GenerateGoAway 7faf1adf6600 code=0
2023-12-11 18:05:37.279206 UTC - [Parent 19692: Socket Thread]: V/nsHttp Http2Session::LogIO 7faf1adf6600 stream=0 id=0x0 [Generate GoAway]
2023-12-11 18:05:37.279262 UTC - [Parent 19692: Socket Thread]: V/nsHttp nsHttpConnection::OnReadSegment [this=7faf1aed6800]
[...]
2023-12-11 18:05:37.282791 UTC - [Parent 19692: Socket Thread]: V/nsHttp nsHttpConnection::Close [this=7faf1aed6800 reason=80004004 mExperienceState=7]
[...]
2023-12-11 18:05:37.290382 UTC - [Parent 19692: Socket Thread]: I/nsHttp Http2Session::WriteSegments 7faf1adf6600 InternalState 1
 state = BUFFERING_FRAME_HEADER
2023-12-11 18:05:37.290398 UTC - [Parent 19692: Socket Thread]: V/nsHttp Http2Session::WriteSegments 7faf1adf6600 already closed
2023-12-11 18:05:37.290411 UTC - [Parent 19692: Socket Thread]: V/nsHttp nsHttpConnection::OnSocketReadable 7faf1aed6800 trans->ws rv=80004004 n=0 socketin=0
2023-12-11 18:05:37.290425 UTC - [Parent 19692: Socket Thread]: V/nsHttp nsHttpConnection::CloseTransaction[this=7faf1aed6800 trans=7faf1adf6600 reason=80004004]
2023-12-11 18:05:37.290437 UTC - [Parent 19692: Socket Thread]: V/nsHttp nsHttpConnection::DontReuse 7faf1aed6800 spdysession=7faf1adf6600
2023-12-11 18:05:37.290449 UTC - [Parent 19692: Socket Thread]: V/nsHttp nsHttpConnection::MarkAsDontReuse 7faf1aed6800
2023-12-11 18:05:37.290461 UTC - [Parent 19692: Socket Thread]: I/nsHttp Http2Session::DontReuse 7faf1adf6600
2023-12-11 18:05:37.290486 UTC - [Parent 19692: Socket Thread]: I/nsHttp Http2Session::~Http2Session 7faf1adf6600 mDownstreamState=1
2023-12-11 18:05:37.290533 UTC - [Parent 19692: Socket Thread]: V/nsHttp nsHttpConnection::Close [this=7faf1aed6800 reason=80004004 mExperienceState=7]

So the channel does produce an error (reason=80004004 is NS_ERROR_ABORT, and the session close code 804B000E looks like NS_ERROR_NET_TIMEOUT, if I read the error codes correctly). Our error handling seems to be somewhat non-existent; we do need to communicate better that an error occurred, but that is a task for Bug 1842429.

The question for this bug is why we time out on a ping during an active upload. Will investigate further.

Further investigation:

  • From the log it first seemed like we don't send any PING frames out, because no Http2Session::SendPing line exists for the connection (edit: wrong early conclusion, the PING is sent out as explained further on)
  • The ping timeout network.http.http2.ping-timeout is currently 8 and was changed from 5 in Bug 1706377 changeset. Both are incredibly short timeouts for slow connections.
  • The relevant function is Http2Session::ReadTimeoutTick, specifically the part handling an outstanding ping even though we seemingly didn't send any (timeout=8000 means an 8000 ms = 8 s timeout):
2023-12-11 18:05:28.508844 UTC - [Parent 19692: Socket Thread]: I/nsHttp Http2Session::ReadTimeoutTick 7faf1adf6600 delta since last read 59s
2023-12-11 18:05:28.508859 UTC - [Parent 19692: Socket Thread]: I/nsHttp Http2Session::ReadTimeoutTick 7faf1adf6600 generating ping
2023-12-11 18:05:28.508874 UTC - [Parent 19692: Socket Thread]: I/nsHttp Http2Session::GeneratePing 7faf1adf6600 isAck=0
2023-12-11 18:05:28.508890 UTC - [Parent 19692: Socket Thread]: V/nsHttp Http2Session::LogIO 7faf1adf6600 stream=0 id=0x0 [Generate Ping]
[...]
2023-12-11 18:05:32.267194 UTC - [Parent 19692: Socket Thread]: I/nsHttp Http2Session::ReadTimeoutTick 7faf1adf6600 delta since last read 63s
2023-12-11 18:05:32.267202 UTC - [Parent 19692: Socket Thread]: I/nsHttp Http2Session::ReadTimeoutTick 7faf1adf6600 handle outstanding ping, timeout=8000
[...]
2023-12-11 18:05:33.270426 UTC - [Parent 19692: Socket Thread]: I/nsHttp Http2Session::ReadTimeoutTick 7faf1adf6600 delta since last read 64s
2023-12-11 18:05:33.270463 UTC - [Parent 19692: Socket Thread]: I/nsHttp Http2Session::ReadTimeoutTick 7faf1adf6600 handle outstanding ping, timeout=8000
[...]
2023-12-11 18:05:34.273552 UTC - [Parent 19692: Socket Thread]: I/nsHttp Http2Session::ReadTimeoutTick 7faf1adf6600 delta since last read 65s
2023-12-11 18:05:34.273567 UTC - [Parent 19692: Socket Thread]: I/nsHttp Http2Session::ReadTimeoutTick 7faf1adf6600 handle outstanding ping, timeout=8000
[...]
2023-12-11 18:05:35.275648 UTC - [Parent 19692: Socket Thread]: I/nsHttp Http2Session::ReadTimeoutTick 7faf1adf6600 delta since last read 66s
2023-12-11 18:05:35.275658 UTC - [Parent 19692: Socket Thread]: I/nsHttp Http2Session::ReadTimeoutTick 7faf1adf6600 handle outstanding ping, timeout=8000
[...]
2023-12-11 18:05:36.277109 UTC - [Parent 19692: Socket Thread]: I/nsHttp Http2Session::ReadTimeoutTick 7faf1adf6600 delta since last read 67s
2023-12-11 18:05:36.277122 UTC - [Parent 19692: Socket Thread]: I/nsHttp Http2Session::ReadTimeoutTick 7faf1adf6600 handle outstanding ping, timeout=8000
[...]
2023-12-11 18:05:37.278939 UTC - [Parent 19692: Socket Thread]: I/nsHttp Http2Session::ReadTimeoutTick 7faf1adf6600 delta since last read 68s
2023-12-11 18:05:37.278952 UTC - [Parent 19692: Socket Thread]: I/nsHttp Http2Session::ReadTimeoutTick 7faf1adf6600 handle outstanding ping, timeout=8000
2023-12-11 18:05:37.278965 UTC - [Parent 19692: Socket Thread]: I/nsHttp Http2Session::ReadTimeoutTick 7faf1adf6600 Ping Timer Exhaustion
  • this means mPingSentEpoch is non-zero (semantically a PING frame has been sent). The variable is only set to a non-zero value at
  • It seems like the PING is only generated after roughly 58 s without reading any data from the connection. But our timeouts still look too short for low-bandwidth connections (a simplified model of this timeout logic is sketched right below).
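
To make the timing above easier to follow, here is a minimal, self-contained model of that timeout behaviour as it can be reconstructed from the log lines; the struct, field names and the 58 s threshold are illustrative assumptions for this sketch, not the actual Http2Session code:

// Minimal model of the ping timeout behaviour seen in the log above.
// Illustrative sketch only; the real logic lives in Http2Session::ReadTimeoutTick.
#include <chrono>
#include <cstdio>

using Clock = std::chrono::steady_clock;

struct SessionModel {
    Clock::time_point lastRead;              // last time bytes were read from the socket
    Clock::time_point pingSentEpoch{};       // non-zero once a PING has been generated
    std::chrono::seconds pingThreshold{58};  // assumed value, matches the ~59 s gap in the log
    std::chrono::seconds pingTimeout{8};     // network.http.http2.ping-timeout (8 s)
    bool closed = false;

    // Called roughly once per second, like ReadTimeoutTick.
    void Tick(Clock::time_point now) {
        if (closed) return;
        if (pingSentEpoch != Clock::time_point{}) {
            // A PING is outstanding: if nothing was read within the timeout,
            // the session is closed ("Ping Timer Exhaustion" in the log).
            if (now - pingSentEpoch >= pingTimeout) closed = true;
            return;
        }
        // Nothing read for longer than the threshold: generate a PING probe,
        // even though an upload may still be actively sending data.
        if (now - lastRead >= pingThreshold) pingSentEpoch = now;
    }
};

int main() {
    SessionModel s;
    const auto t0 = Clock::now();
    s.lastRead = t0;
    // Simulate 70 seconds of ticks with no reads, as in the log above.
    for (int sec = 1; sec <= 70 && !s.closed; ++sec)
        s.Tick(t0 + std::chrono::seconds(sec));
    std::printf("session closed: %s\n", s.closed ? "yes" : "no");
}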

@valentin: What do you recommend for how to proceed here? Can we take into account that we are in the process of uploading / have active connections, and not generate a PING frame in that case?

Flags: needinfo?(manuel) → needinfo?(moz.valentin)

@Ellie: As a workaround for you: Setting network.http.http2.ping-threshold to 0 should disable the ping timeout behavior.

Thanks, that's very useful! For what it's worth, I just disabled HTTP/2 in about:config altogether. I never quite understood why it exists anyway; I haven't seen any touted benefits beyond a negligible performance increase in some scenarios. Not that I mind it existing whenever it actually works, and I'll likely re-enable it once the bug is fixed. But it's not like I'm missing it at all.

For what it's worth, I tried reaching Opera about this and wasn't very successful (I got an auto-response that doesn't really suggest it will reach a human soon). But Opera seems to have a similar bug, so I assume upstream Chromium and Google Chrome might as well. If anyone at Mozilla is interested and has a contact, it might be worth letting them know.

I talked with valentin about this bug and we reached the following conclusions:

  • We have two code paths towards GeneratePing.

    • One detects whether the connection is still active after switching networks (like WLAN): OnMsgVerifyTraffic
    • One prunes unused, stale connections: TimeoutTick
  • -> At least the code path for pruning unused, stale connections should consider whether we are in the process of uploading when determining whether to prune a connection. The PING frame exists because we need to send data over TCP to detect whether the TCP connection is still alive. With upload data available, there is no need for the extra PING frame.

  • For the PING timeout detection we might want to use the time the PING was actually sent (written to the TCP socket), not the time the PING frame was created. Funnily, when writing to the socket we want to write 15290 bytes, but PR_Write only writes 15273 bytes. That is 17 bytes less, the exact length of a PING frame (see the frame-layout sketch below the log excerpt). This seems like a weird coincidence and might be specific to this log. What we see here is that we still have data to send, and a full buffer that needs to be received by the server before it can respond to the PING. Having the short timeout of 8 s should be reconsidered.

    2023-12-11 18:05:28.508874 UTC - [Parent 19692: Socket Thread]: I/nsHttp Http2Session::GeneratePing 7faf1adf6600 isAck=0
    2023-12-11 18:05:28.508890 UTC - [Parent 19692: Socket Thread]: V/nsHttp Http2Session::LogIO 7faf1adf6600 stream=0 id=0x0 [Generate Ping]
    2023-12-11 18:05:28.508910 UTC - [Parent 19692: Socket Thread]: V/nsHttp 00000000: 00 00 08 06 00 00 00 00 00 00 00 00 00 00 00 00
    2023-12-11 18:05:28.508924 UTC - [Parent 19692: Socket Thread]: V/nsHttp 00000010: 00
    2023-12-11 18:05:28.508939 UTC - [Parent 19692: Socket Thread]: V/nsHttp nsHttpConnection::OnReadSegment [this=7faf1aed6800]
    2023-12-11 18:05:28.508955 UTC - [Parent 19692: Socket Thread]: D/nsSocketTransport nsSocketOutputStream::Write [this=7faf19dc26c8 count=15290]
    2023-12-11 18:05:28.508969 UTC - [Parent 19692: Socket Thread]: D/nsSocketTransport   calling PR_Write [count=15290]
    2023-12-11 18:05:28.509047 UTC - [Parent 19692: Socket Thread]: D/nsSocketTransport   PR_Write returned [n=15273]
    2023-12-11 18:05:28.509066 UTC - [Parent 19692: Socket Thread]: E/nsSocketTransport nsSocketTransport::SendStatus [this=7faf19dc2400 status=4b0005]
    2023-12-11 18:05:28.509087 UTC - [Parent 19692: Socket Thread]: I/nsHttp Http2Session::FlushOutputQueue 7faf1adf6600 sz=15290 rv=0 actual=15273
    

We might want to have a small discussion in our next networking team meeting about the changes we want to do here.
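
For reference, the 17 bytes in the hex dump of the log excerpt above decode as the generic 9-byte HTTP/2 frame header followed by the 8-byte PING payload (RFC 9113, section 6.7). A minimal sketch that reconstructs exactly those bytes:

// Rebuilds the 17-byte PING frame shown in the hex dump above:
// 9-byte HTTP/2 frame header + 8-byte opaque PING payload (RFC 9113, section 6.7).
#include <array>
#include <cstdint>
#include <cstdio>

int main() {
    std::array<uint8_t, 17> ping{};                  // zero-initialized
    ping[0] = 0x00; ping[1] = 0x00; ping[2] = 0x08;  // 24-bit payload length = 8
    ping[3] = 0x06;                                  // frame type 0x06 = PING
    ping[4] = 0x00;                                  // flags = 0 (a PING ACK would set 0x01)
    // bytes 5..8: 31-bit stream identifier = 0 (PING applies to the whole connection)
    // bytes 9..16: opaque data, all zero in this log
    for (uint8_t b : ping) std::printf("%02x ", b);
    std::printf("\n");  // 00 00 08 06 00 00 00 00 00 00 00 00 00 00 00 00 00
}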

Flags: needinfo?(moz.valentin)
Assignee: nobody → manuel
Blocks: 1711687
Severity: -- → S3
Priority: -- → P2
Whiteboard: [necko-triaged]

Note that I tried reproducing this by limiting Windows upload bandwidth via QoS in PowerShell:

New-NetQosPolicy -Name "FirefoxLimit" -AppPathNameMatchCondition "firefox.exe" -ThrottleRateActionBitsPerSecond 1MB

The upload did eventually complete successfully for me, so I don't think this is purely a bandwidth-throttling scenario. The delay and RTT are probably also relevant to reproducing this.

Apart from the thing Manuel noted, namely that the generated ping doesn't actually get sent on the socket right away (it did get sent about 3 seconds later):

2023-12-11 18:05:31.341399 UTC - [Parent 19692: Socket Thread]: D/nsSocketTransport nsSocketOutputStream::Write [this=7faf19dc26c8 count=17]
2023-12-11 18:05:31.341407 UTC - [Parent 19692: Socket Thread]: D/nsSocketTransport   calling PR_Write [count=17]
2023-12-11 18:05:31.341439 UTC - [Parent 19692: Socket Thread]: D/nsSocketTransport   PR_Write returned [n=17]

but the response still hasn't come back 5 more seconds later, so we close the session.

2023-12-11 18:05:37.278965 UTC - [Parent 19692: Socket Thread]: I/nsHttp Http2Session::ReadTimeoutTick 7faf1adf6600 Ping Timer Exhaustion

This is assuming that the upstream HTTP/2 server is indeed receiving and responding to the pings.
I find it interesting that between:
2023-12-11 18:06:17.601207 UTC - [Parent 19692: Socket Thread]: I/nsHttp Http2Session::RecvPing 7faf1a517a00 PING Flags 0x1.
and
2023-12-11 18:07:47.981815 UTC - [Parent 19692: Socket Thread]: I/nsHttp Http2Session::RecvPing 7faf1abdc100 PING Flags 0x0.
we don't receive any pings at all.

In fact, between:
2023-12-11 18:05:23.974982 UTC - [Parent 19692: Socket Thread]: V/nsHttp nsHttpConnection::OnSocketReadable 7faf1aed6800 trans->ws rv=0 n=0 socketin=80470007
and
2023-12-11 18:05:36.212248 UTC - [Parent 19692: Socket Thread]: V/nsHttp nsHttpConnection::OnSocketReadable [this=7faf1b882a00]
we don't get any read events on any socket, with the first one with actual data being at:
2023-12-11 18:06:10.581464 UTC - [Parent 19692: Socket Thread]: V/nsHttp nsHttpConnection::OnSocketReadable [this=7faf1b883c00]

Ellie, could you tell us how the bandwidth limitation is enforced? It would be useful to be able to reproduce this exact scenario.

Do you think you could also capture a Wireshark trace, ideally along with the SSL keys, so we can get an exact idea of what's going on?

Flags: needinfo?(el)

Previously, slow uploading connections could time out due to the PING frame being delayed, likely by buffer bloat.

Notes from discussion with the team:

  • The long return time is likely due to buffer bloat
  • measure the buffer bloat with a test setup

Possible solutions:

  • use ping times to determine timeout (send pings often) and make timeout adapt to measured RTT
  • only send a ping when we don't receive any data AND can't send data on the socket due to a full buffer <- our approach (rough sketch below)

I'll open a bug to track that the gap between creating the PING frame and it actually going out on the socket can inflate the measured RTT and make timeout detection harder.
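
To make the chosen approach a bit more concrete, here is a rough, self-contained sketch of the intended decision; the struct and helper names are hypothetical, not the actual Necko code:

// Sketch of the "only probe when truly stuck" idea from the list above.
// Hypothetical names; not the actual Necko implementation.
#include <cstdio>

struct ConnectionState {
    bool receivedDataRecently;  // any bytes read within the threshold window
    bool sendBufferFull;        // the socket is not accepting more data
    bool haveDataToSend;        // e.g. an upload is still in progress
};

// Only fall back to a PING probe when we neither receive anything nor can make
// progress sending: if upload data is still being written to the socket, that
// traffic already proves the TCP connection is alive.
bool ShouldSendLivenessPing(const ConnectionState& c) {
    if (c.receivedDataRecently) return false;
    if (c.haveDataToSend && !c.sendBufferFull) return false;
    return true;
}

int main() {
    ConnectionState uploading{false, false, true};  // slow upload, socket still writable
    ConnectionState stuck{false, true, true};       // nothing read, can't write either
    std::printf("uploading -> ping? %d\n", ShouldSendLivenessPing(uploading));  // 0
    std::printf("stuck     -> ping? %d\n", ShouldSendLivenessPing(stuck));      // 1
}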

> Ellie, could you tell us how the bandwidth limitation is enforced?

My internet is simply just 2mbit/s 😅 (the up/down limit is the same), and I use LTE and don't have a landline. Since it's all mobile data, the potential influences are basically infinite. Not much breaks with it though; if I don't open too many connections at once it's usually super reliable. But I couldn't tell you how exactly the provider enforces this limit, sadly.

Flags: needinfo?(el)

Could you still record a Wireshark capture along with an HTTP log, please?
Thank you!

Please see comment #18. Thanks.

Flags: needinfo?(el)

This is the same behavior observed in Bug 790184 comment 11 and following. The workaround for Bug 790184 was the cause of the H2 upload problem on high-bandwidth networks. The workaround was removed in Bug 1830918 to fix the upload speed: netwerk/protocol/http/Http2StreamBase.cpp#558-561. The pref network.http.http2.send-buffer-size can likely be removed again after fixing this bug.

Attachment #9368910 - Attachment is obsolete: true

I think the gist of patch D196583 would be a way of fixing the bug. There is probably no clean way of fixing this, because the PING frame in HTTP/2 mixes a transport-layer responsibility (whether the connection is still alive) into the application layer. Marking as a regression of the H2 upload-speed bug, because this is very likely a regression from flipping that pref (see comment 20).

Assignee: manuel → nobody
Keywords: regression
Regressed by: 1596576
Whiteboard: [necko-triaged] → [necko-triaged][necko-priority-new]
Flags: needinfo?(kershaw)
Whiteboard: [necko-triaged][necko-priority-new] → [necko-triaged]
