Closed Bug 1556005 Opened 5 years ago Closed 5 years ago

Connections hang when using HTTP/2 proxy

Categories

(Core :: Networking: HTTP, defect, P1)

68 Branch
x86_64
macOS
defect

Tracking

()

RESOLVED WORKSFORME

People

(Reporter: bugzilla-cloudflare, Assigned: mayhemer)

References

Details

(Whiteboard: [necko-triaged][secure-proxy-mvp])

User Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10.14; rv:68.0) Gecko/20100101 Firefox/68.0

Steps to reproduce:

Actual results:

Some internal stream limit is reached and newer connections cannot be made, breaking the speed test and preventing navigation to other sites.

I also got it to crash once; there should be a crash report with my email attached.

Expected results:

The browser opens more HTTP/2 sessions to accommodate the large number of connections.

OS: Unspecified → macOS
Hardware: Unspecified → x86_64

Thanks for this report! Checking this use case with mayhemer/node_http2_proxy and latest Firefox Nightly I can confirm there is such a problem.

Blocks: secure-proxy
Status: UNCONFIRMED → NEW
Ever confirmed: true
Priority: -- → P1
Assignee: nobody → honzab.moz

From a preliminary look at backtrack and logan this seems to be a duplicate or a variant of bug 1463729 for what I can reproduce.

Andrew, few questions:

  • have you been using MOZ_LOG during your test?
  • what is your download speed? (as the bug I mention above depends on hi-speed rx)

thanks

Flags: needinfo?(asun)

I did not use MOZ_LOG. But it shouldn't be too big of a problem since you can reproduce it yourself.

My download/upload speed ranges from 150-200 Mbps on the Wi-Fi here. I also noticed this bug occasionally during normal Web browsing, so it might be both throughput and the number of streams.

Flags: needinfo?(asun)

(In reply to Andrew Sun from comment #3)

Thanks for a quick answer.

I did not use MOZ_LOG. But it shouldn't be too big of a problem since you can reproduce it yourself.

To explain my question, the I/O that MOZ_LOG/_FILE do obviously slows the socket thread processing and makes that bug way more reproducible; explanation below.

My download/upload speed ranges from 150-200 Mbps on the Wi-Fi here. I also noticed this bug occasionally during normal Web browsing, so it might be both throughput and the number of streams.

That high speed is probably the trigger. The thing is that we have an "optimization" in our h2 implementation which simply loops as long as there are new data on the socket available (w/o querying poll()). If the downstream speed is high, it may loop that way for significant amount of time, blocking our single socket thread, which we poll and process socket events on, from processing new requests.

Whiteboard: [necko-triaged]

Hi Andrew, can you please check the test build at target.dmg which contains a draft fix for bug 1463729 in your setup again?

Thanks.

(Build from try: Bug 1463729 - Break the loop in nsHttpConnection::OnSocketReadable ...)

Flags: needinfo?(asun)

(In reply to Honza Bambas (:mayhemer) from comment #5)

Unfortunately, this only partially fixed the issue. fast.com still has lots of hanging/failed connections, and navigation to other sites during the speed test is sometimes very slow. Also, Firefox eats about 120% CPU during the speed test while transferring very little data.

I also got it to crash again when visiting another site during a speed test. There should be another crash report with my email attached.

Flags: needinfo?(asun)

(In reply to Andrew Sun from comment #6)

(In reply to Honza Bambas (:mayhemer) from comment #5)

Unfortunately, this only partially fixed the issue. fast.com still has lots of hanging/failed connections, and navigation to other sites during the speed test is sometimes very slow. Also, Firefox eats about 120% CPU during the speed test while transferring very little data.

Thanks for testing this. I have to confess I have similar results (modulo the CPU load, which I didn't look at, actually) with a nodejs based h2 proxy. Hence, this indicates there is another problem in Firefox and not a proxy.

I also got it to crash again when visiting another site during a speed test. There should be another crash report with my email attached.

You can go to about:crashes, there you will find all reports and you can share links (or IDs) here. Thanks.

(In reply to Honza Bambas (:mayhemer) from comment #7)

You can go to about:crashes, there you will find all reports and you can share links (or IDs) here. Thanks.
Here's a link to the crash: https://crash-stats.mozilla.org/report/index/039501f9-e9c1-4794-969f-0054c0190612

Looks like it's in the same place as the other crash report I submitted.

(In reply to Andrew Sun from comment #8)

(In reply to Honza Bambas (:mayhemer) from comment #7)

You can go to about:crashes, there you will find all reports and you can share links (or IDs) here. Thanks.
Here's a link to the crash: https://crash-stats.mozilla.org/report/index/039501f9-e9c1-4794-969f-0054c0190612

Looks like it's in the same place as the other crash report I submitted.

Looks like. The missing symbols are from the try run (we apparently don't expose symbols for them) Thanks.

Hi Andrew, sorry for not much update lately on this bug, we had an all company meeting last week.

I don't have a final fix at the moment, but actively working on it now, with some progress.

I have a new test build at [1] for OSX that has another version of the patch from bug 1463729. If you can, please test it again under the same conditions. Note that the crash (bug 1556491) is not fixed yet in that build.

It might be helpful for me if you could (as another separate test) run the test build with logging on [2] and try to catch the crash. MOZ_LOG would have to be added sync,pipnss:5 modules. Note that this is changing timing which may affect reproducibility (both directions.)

Thank you.

[1] https://queue.taskcluster.net/v1/task/a7ecOpWIStCnRjpnEo1HTw/runs/0/artifacts/public/build/target.tar.bz2 (apparently we no longer provide dmgs...)
[2] https://developer.mozilla.org/en-US/docs/Mozilla/Debugging/HTTP_logging

Flags: needinfo?(asun)
Whiteboard: [necko-triaged] → [necko-triaged][secure-proxy-mvp]

(In reply to Honza Bambas (:mayhemer) from comment #10)

The binaries in the linked build are for Linux. Could you make sure the CI server is building for Mac?

Flags: needinfo?(asun) → needinfo?(honzab.moz)

(In reply to Andrew Sun from comment #11)

(In reply to Honza Bambas (:mayhemer) from comment #10)

The binaries in the linked build are for Linux. Could you make sure the CI server is building for Mac?

Ah! Clicked on a wrong build, here it goes (.dmg):
https://queue.taskcluster.net/v1/task/LVgC4-QQTAO8DZKuqSoqfw/runs/0/artifacts/public/build/target.dmg

Thanks!

Flags: needinfo?(honzab.moz) → needinfo?(asun)

The connections still hang on fast.com and speedtest.net (shows speed test errors). Navigation to other sites is possible during the speed test... until it isn't, and their page loads hang too.

With this build, I cannot reproduce the crash anymore.

Flags: needinfo?(asun)

(In reply to Andrew Sun from comment #13)

The connections still hang on fast.com and speedtest.net (shows speed test errors). Navigation to other sites is possible during the speed test... until it isn't, and their page loads hang too.

It's not clear what the "their page loads hang too" mean - can you clarify please?

How long do you have to wait to let pages actually load? Or does it end with an error of some kind?

With this build, I cannot reproduce the crash anymore.

Thanks.

(In reply to Honza Bambas (:mayhemer) from comment #14)

It's not clear what the "their page loads hang too" mean - can you clarify please?

This means either getting stuck at a white screen or loading halfway with missing fonts/images/content. Both cases have an indefinite loading spinner.

How long do you have to wait to let pages actually load? Or does it end with an error of some kind?

I waited at least a minute or two for pages to load (unsuccessfully). I don't think I waited much longer.

I think we can close this now, we have identified bugs causing various problems: 1563538, 1563695 - fixed. There are few more that may take affect in extreme cases (+250 sharding), will be fixed later.

Status: NEW → RESOLVED
Closed: 5 years ago
Resolution: --- → WORKSFORME
You need to log in before you can comment on or make changes to this bug.