Closed Bug 1523427 Opened 6 years ago Closed 6 years ago

Firefox 65: Web socket over h2 causes us to loop CONNECT request to a proxy

Categories

(Core :: Networking: HTTP, defect, P1)

65 Branch
Desktop
Windows 10
defect

Tracking

()

RESOLVED FIXED
mozilla67
Tracking Status
relnote-firefox --- 65+
firefox-esr60 --- unaffected
firefox65 blocking fixed
firefox66 blocking fixed
firefox67 + fixed

People

(Reporter: rainer.meier, Assigned: dragana)

References

Details

(Keywords: 64bit, regression, Whiteboard: [necko-triaged])

Attachments

(2 files)

User Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:64.0) Gecko/20100101 Firefox/64.0

Steps to reproduce:

Actual results:

Firefox starts flooding my network with continued requests to edge-chat.facebook.com. My squid proxy log is flooded and looks like this:

...
1548713564.108 35 fdea:5b48:d4c1:1:e82d:509e:c563:fb6e TCP_TUNNEL/200 3309 CONNECT edge-chat.facebook.com:443 - HIER_DIRECT/2a03:2880:f008:1:face:b00c:0:1 -
1548713564.145 33 fdea:5b48:d4c1:1:e82d:509e:c563:fb6e TCP_TUNNEL/200 3309 CONNECT edge-chat.facebook.com:443 - HIER_DIRECT/2a03:2880:f008:1:face:b00c:0:1 -
1548713564.181 33 fdea:5b48:d4c1:1:e82d:509e:c563:fb6e TCP_TUNNEL/200 3308 CONNECT edge-chat.facebook.com:443 - HIER_DIRECT/2a03:2880:f008:1:face:b00c:0:1 -
1548713564.217 33 fdea:5b48:d4c1:1:e82d:509e:c563:fb6e TCP_TUNNEL/200 3309 CONNECT edge-chat.facebook.com:443 - HIER_DIRECT/2a03:2880:f008:1:face:b00c:0:1 -
1548713564.301 82 fdea:5b48:d4c1:1:e82d:509e:c563:fb6e TCP_TUNNEL/200 3310 CONNECT edge-chat.facebook.com:443 - HIER_DIRECT/2a03:2880:f008:1:face:b00c:0:1 -
1548713564.333 30 fdea:5b48:d4c1:1:e82d:509e:c563:fb6e TCP_TUNNEL/200 3309 CONNECT edge-chat.facebook.com:443 - HIER_DIRECT/2a03:2880:f008:1:face:b00c:0:1 -
...

Firefox seems to request the page in a loop. Effectively flooding my upload and download (approx. 1Mbps upload/download on a single machine in my case on). Rolling out this Firefox version in a larger network might cause a huge network load.

Also other services (mainly chat-like services) seem to experience those problem.

Even closing Facebook or messenger.com tabs does not stop the flood unless Firefox is closed down completely.

Expected results:

Firefox might fail to establish some socket or busy waiting and immediately re-tries. Effectively causing a connection flood.

Downgrading to Firefox 64 fixes the problem.
Re-upgrading to Firefox 65 immediately restarts the problem.

I tried to disable all add-ons at no avail. I can't make Firefox stop flooding the network.

Keywords: 64bit
OS: Unspecified → Windows 10
Hardware: Unspecified → Desktop

Is this only an issue if you use a proxy ?

Flags: needinfo?(rainer.meier)

I tried some debugging:
I did create an entirely new Firefox 65 profile and the flooding still occurs immediately after logging on to Facebook. So for sure it's not related to the profile or any add-ons.

(In reply to Matthias Versen [:Matti] from comment #1)

Is this only an issue if you use a proxy ?

Actually you are right. It's harder to trace then but I just tried. It stops if proxy is disabled.
However this is actually a critical issue for company networks and everybody running a proxy then.

Flags: needinfo?(rainer.meier)

I can't test this because I do not have a Facebook Account.
ni? Honza because this sounds critical and this could be a regression from bug 1520125

Component: Untriaged → Networking: HTTP
Flags: needinfo?(honzab.moz)
Product: Firefox → Core

(In reply to Rainer Meier from comment #3)

(In reply to Matthias Versen [:Matti] from comment #1)

Is this only an issue if you use a proxy ?

Actually you are right. It's harder to trace then but I just tried. It stops if proxy is disabled.
However this is actually a critical issue for company networks and everybody running a proxy then.

Confirmed. This seems to happen only on proxy connections.
I need to block the rollout to company network now as this bug might cause havok in proxied networks.

Component: Networking: HTTP → Untriaged
Product: Core → Firefox
Component: Untriaged → Networking: HTTP
Product: Firefox → Core

Haha, found another page where it happens, although unlikely to happen in corporate environments with proxies usually (just trying to find some chat-enabled pages)... lovoo.com shows the same behavior.

Component: Networking: HTTP → Untriaged
Product: Core → Firefox
Keywords: regression
Version: 64 Branch → 65 Branch
Flags: needinfo?(honzab.moz) → needinfo?(rainer.meier)

This is a log of the proxy flood generated by Firefox 65. Steps to reproduce:

Flooding to edge-chat.messenger.com starts immediately. Look for it in the logs.

The log has been zipped-up by 7-zip as even a quick logon to messenger.com immediately shows up with > 30MB of logs.

Flags: needinfo?(rainer.meier)

Thanks! The problem is captured there, hopefully we will be able to find the cause.

Preliminarily:

  • we try to open websocket at https://edge-chat.messenger.com/chat?region=lla
  • the transaction (nsHttpTransaction @15d49cb2000 created 2019-01-29 18:30:06.295) handling the upgrade channel request is looping between active/don't reuse/restart and sends CONNECT on every reactivation

seems related to bug 1434137 (websockets over h2).

Blocks: 1434137
Status: UNCONFIRMED → NEW
Ever confirmed: true
Priority: -- → P1
Summary: Firefox 65: High network load (looped requests?) → Firefox 65: Web socket over h2 causes us to loop CONNECT request to a proxy
Assignee: nobody → dd.mozilla
Status: NEW → ASSIGNED
Whiteboard: [necko-triaged]

[Tracking Requested - why for this release]: using a proxy causes Firefox to make network attempts infinitely when visiting a site using web sockets.

Hi Rainer,

I have a fix for the problem and would like to verify it. Can you download this version:
https://queue.taskcluster.net/v1/task/Yys1zhYeQjWe_tnSUvEAWA/runs/0/artifacts/public/build/target.zip

You do not need to install it. You can just run firefox.exe

Thank you

Flags: needinfo?(rainer.meier)
Flags: in-testsuite?

Dragana, can we work around this issue by setting network.http.spdy.websockets to false? Reporter, can you please try that as well?

Flags: needinfo?(dd.mozilla)

This may block wider rollout of Fx65, so need to fully understand the impact of this issue ASAP.

(In reply to Ryan VanderMeulen [:RyanVM] from comment #15)

Dragana, can we work around this issue by setting network.http.spdy.websockets to false? Reporter, can you please try that as well?

We can. It should work.

This will affect people using http proxy that are trying to use websockets. The server hosting websockets speaks h2 (the serve may not serve websockets over h2, but it speaks h2 generaly)

Flags: needinfo?(dd.mozilla)
Severity: normal → critical
Depends on: 1523978

Dragana has a patch, but it sounds like the next step is to disable the feature and let her patch ride the trains.

(In reply to Selena Deckelmann :selenamarie :selena use ni? pronoun: she from comment #18)

Dragana has a patch, but it sounds like the next step is to disable the feature and let her patch ride the trains.

We can uplift it to beta and maybe to release as well.

I am surprised this was not caught in our automated tests. Also, adding Tom, Tania to suggest including this as an additional scenario to test in our manual test process.

Flags: needinfo?(tmaity)
Flags: needinfo?(tgrabowski)

Tracked as a blocker so we don't forget including it as a patch in the next 65 dot release and in beta66.

Alright... I am sorry for late reply but I was busy at work and we might also not be in the same timezone ;-)

Here are my results:

Test 1: Stock Firefox 65

  • Install Firefox 65
  • create a fresh profile
    Result: Successfully reproduced issue, loop occurs.

Test 2: Config change

  • Set network.http.spdy.websockets=false
    Result: Issue does NOT occur, no problem.

Test 3: Use test-build

  • Install Firefox 67a build attached here
  • Reset network.http.spdy.websockets to true
    Result: Issue does NOT occur, seems to be fixed.

Note: I used the same profile in all tests.

So the issue seems identified and solved in the code as well as there is a work-around. However most companies/users will likely have a hard time to identify why the network load increases and even worse some of them might not be able to roll out the work-around easily.

Many thanks for taking this serious and fixing it that quickly.

Flags: needinfo?(rainer.meier)

Thanks, Rainer! We're working to flip the pref remotely as a short-term workaround until we can ship the proper fix in later release. Thanks again for your help narrowing this down so quickly.

I was able to reproduce the original problem in 65, but after trying the build in Comment 14, I tested it with the pref network.http.spdy.websockets set to true and false. Attached are the logs. Please review and see if this is in line with expectations of the bug fix.

https://drive.google.com/file/d/1SY-eFuyZPeFYN7fAFMgm-uJASsZrDIgE/view?usp=sharing

Flags: needinfo?(dd.mozilla)

Attached are my proxy logs for 65 when I did a pref flip without restarting.

https://drive.google.com/file/d/1FQZvZJnGTAY_5zGiazWJAPXDvIHL3AiA/view?usp=sharing

(In reply to Dragana Damjanovic [:dragana] from comment #17)

This will affect people using http proxy that are trying to use websockets. The server hosting websockets speaks h2 (the serve may not serve websockets over h2, but it speaks h2 generaly)

Dragana, when the flooding occurs, can users still use the website/service normally? I'm trying to understand whether this affects the user experience, or is it just generating lots of additional useless traffic? Thanks.

Rainer, we've pushed out the remote pref flip change.

You should see within the next 6 hours a new "hotfix-disable-websockets-over-h2-65-1523978" study in the "Active studies" section of about:studies. Once you show as enrolled in that, you should be seeing normal behavior with your proxy server again.

Flags: needinfo?(rainer.meier)

Hello Ryan, I just checked about:studies and up to now I just see pref-flip-screenshots-release-1369150 which seems to change extensions.screenshots.system-disabled to true.
So I guess I have to wait a bit or is there a way to update the studies?

I will report as soon as I received the websocket hotfix.

Flags: needinfo?(rainer.meier)

Yeah, it can take up to 6 hours to enroll. Thanks for checking!

Ryan, I guess I might not get this study. Simply due to the fact that participations to studies are disabled within my organization (via GPO). So this bug is bad in multiple ways. It seems to affect mainly corporate users. Corporate users however tend to disable participations to studies and/or feedback systems, automatic reports, data collection and telemetry.

Maybe you can confirm that I will not receive the study if the following option is disabled in "Firefox Data Collection and Use" settings:
[ ] Allow Firefox to send technical and interation data to Mozilla
[ ] Allow Firefox to install and run studies
[ ] Allow Firefox to make personalized extension recommendations

All those options are off in all corporate environments where I rolled out Firefox as those settings are incompatible with corporate policies.

"Allow Firefox to install and run studies" being disabled will indeed not enroll you :-\

Ryan, I think it's clear to me that it will. However if it would, then my next question as an operator of Firefox in corporate (small and medium business) environments would have been: How can you manipulate settings in my company-wide enrolled Firefox and how does this comply to corporate policies?
Indeed I used the official GPO to disable exactly this - unwanted communication and remote-management and control of software running in corporate environment.

So what I am saying is that I think my setup is not uncommon. Especially within the group of corporate users using a proxy it is also likely that the IT guidelines do not allow for telemetry and similar things. So the hotfix will exactly fail in those cases where the hotfix is needed. It's not needed for private users without proxies but it's needed in corporate environments with proxies and in such environments the hotfix is very likely not to be effective.

Alright, I have tested enabling studies and yes I got two hotfixes pushed now:

  • hotfix-disable-websockets-over-h2-65-1523978, network.http.spdy.websockets=false
  • prefflip-push-performance-1491171, dom.push.alwaysConnect=true

But I repeat again: I cannot rollout Firefox 65 in any of my managed environments as all those environments have "studies" disabled by policy for good reason. If I explain those companies that I have to enable participation in some "studies" in production environment just to get this fixed I better look for another job then as I am not doing mine right...

I think that's taking this specific bug off-topic. There's plenty of documentation about what data collection we do and how our studies work linked directly from the preferences section controlling them. As an org that cares about user agency and privacy, we certainly understand the trade-offs of having systems like these in place and do our best to explain the decisions made.

Anyway, yes, it's clear we're going to need to ship a 65.0.l release for this issue. Realistically, that won't be able to happen until next week at the earliest. This bug doesn't even have a patch up for review yet. I'm sorry for the inconvenience - this is obviously the kind of issue that doesn't easily turn up during pre-release testing, which is certainly painful for us all.

Hi Ryan, don't get me wrong. I think you should deploy the study anyway and flip the option. Just to fix at least the ones you can reach.
My intention was just to point out that this "hotfix" is probably not as effective as you might think as it does not reach the users actually affected.

So yes, I confirm the study works and puts the hotfix in place. I will however have to wait for Firefox 65.0.1 release for a rollout and I recommend any admin in a proxied environment to do the same. Not more, not less.

Many thanks again for your quick responses and investigations to everyone participating here.

I think we have nailed down the issue and hopefully we will see Firefox 65.0.1 soon as well as some updated test cases to prevent it re-appearing.

Thanks again!

We'll certainly look into what we can do to catch this kind of problem in pre-release, but honestly, feedback from admins such as yourself during the Beta cycle is likely to find more issues that we ever can on our own given the number of edge cases which can exist within those managed environments. That's an area where we'd love to see more admins trying things out when there's still time to get bugs fixed before release.

I fully agree. Granted I could have tested Firefox 65 pre-release or at least the release candidates before the release. Also granted the issue happens on pages (facebook?) not mainly used within companies, but might potentially also affect corporate communication tools. I also admit that in many corporate environments Firefox ESR might be the better option to go but in my (medium sized) companies I did not see a reason to go for ESR. I am also not sure if I would have seen this specific bug during testing of beta releases upfront as I just noticed "by accident" when checking the gateway traffic and proxy logs for a completely unrelated issue.

So yes, it is indeed an edge-case. But we got it nailed down pretty quickly and Firefox 65.0.1 is in the pipe. So I just stick with version 64.0.2 meanwhile. Not a huge deal.

There is a good reason I still stick to Firefox also in corporate application. With recent changes (GPO support, MSI installer) it became much more manageable. Just waiting for other fixes like correct placement of data in %LocalAppData% for further corporate compatibility improvements. But this is certainly going beyond this report.

I will for sure keep opening issue if I detect some. But the fact I didn't had to open critical bug reports in the past is sufficient for me to keep supporting Firefox.

Thanks again to everybody involved solving the issue and testing.

(In reply to Grover Wimberly IV [:Grover-QA] Use NeedInfo from comment #24)

I was able to reproduce the original problem in 65, but after trying the build in Comment 14, I tested it with the pref network.http.spdy.websockets set to true and false. Attached are the logs. Please review and see if this is in line with expectations of the bug fix.

https://drive.google.com/file/d/1SY-eFuyZPeFYN7fAFMgm-uJASsZrDIgE/view?usp=sharing

The log looks as expected. The patch solves the problem. Thank you.

Flags: needinfo?(dd.mozilla)

(In reply to Tom Grabowski [:TomGrab] from comment #27)

(In reply to Dragana Damjanovic [:dragana] from comment #17)

This will affect people using http proxy that are trying to use websockets. The server hosting websockets speaks h2 (the serve may not serve websockets over h2, but it speaks h2 generaly)

Dragana, when the flooding occurs, can users still use the website/service normally? I'm trying to understand whether this affects the user experience, or is it just generating lots of additional useless traffic? Thanks.

No they cannot. The http request is stack in a loop and it will only go out the loop if the request is canceled :(

Keywords: checkin-needed

Pushed by rvandermeulen@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/4fa023708467
Check whether we really have a response for a upgrade request before making a connection DontReuse. r=michal

Keywords: checkin-needed

Comment on attachment 9040377 [details]
Check whether we really have a response for a upgrade request before making a connection DontReuse. r=michal

Beta/Release Uplift Approval Request

Feature/Bug causing the regression

Bug 1434137

User impact if declined

Users using websockets hosted on a server that speaks http/2.0 and also use a http proxy will get in a loop and will not be able to connect to the websocket. They can get out of the loop only by canceling the websocket request.

Is this code covered by automated tests?

No

Has the fix been verified in Nightly?

No

Needs manual test from QE?

No

If yes, steps to reproduce

comment #24 confirms that QA can reproduce the issue and has tested the fix on a try build.

List of other uplifts needed

None

Risk to taking this patch

Low

Why is the change risky/not risky? (and alternatives if risky)

It is very small change. I have also looked at the log with the fix and all possible code paths around that code were used in the log and the behavior were as expected, without any new regressions.

String changes made/needed

Attachment #9040377 - Flags: approval-mozilla-release?
Attachment #9040377 - Flags: approval-mozilla-beta?

Comment on attachment 9040377 [details]
Check whether we really have a response for a upgrade request before making a connection DontReuse. r=michal

Hoping to get this into the 66 beta 4 build today so that we can see results for including it in a possible 65 dot release.

Attachment #9040377 - Flags: approval-mozilla-beta? → approval-mozilla-beta+

(In reply to Ritu Kothari (:ritu) from comment #20)

I am surprised this was not caught in our automated tests. Also, adding Tom, Tania to suggest including this as an additional scenario to test in our manual test process.

This will be added to our analysis document for 65 regarding including this test in our manual test process

Flags: needinfo?(tmaity)
Flags: needinfo?(tgrabowski)
Status: ASSIGNED → RESOLVED
Closed: 6 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla67

Rainer, we shipped 66.0b4 on Friday with this patch included. Would you be willing/able to test it in your environment to confirm that everything behaves as expected? Would be great to get that confirmation prior to accepting it for the 65.0.1 release. Thanks!

Flags: needinfo?(rainer.meier)

I just tested this:

  • Using Firefox 65.0 (release)
  • Set network.http.spdy.websockets=true (default)
    -> Result: I could reproduce the loop

Then I closed Firefox and Installed Firefox 66.0b4.
-> Result: No loop.

So I confirm the bug to be fixed in Firefox 66.0b4.
Can we expect this to be in the next point release 65.0.1 as well or do we need to skip Firefox 65 entirely and wait for Firefox 66.0 release?

Flags: needinfo?(rainer.meier)

I intend to take it in 65.0.1, yes. Aiming to ship that early next week.

Comment on attachment 9040377 [details]
Check whether we really have a response for a upgrade request before making a connection DontReuse. r=michal

[Triage Comment]
Simple fix which has been verified on Nightly and Beta. Approved for 65.0.1 (shipping early next week).

Attachment #9040377 - Flags: approval-mozilla-release? → approval-mozilla-release+

Rainer, the fix for this bug is included in the 65.0.1 release being built today and due to ship tomorrow morning Pacific time. If you could verify that things work as expected for you, I would be most grateful :). Thanks again for all your help getting this diagnosed and verified quickly!

Flags: needinfo?(rainer.meier)

Hi Ryan, I would be glad to do this. But where can I get the 65.0.1 build?
I kept monitoring https://ftp.mozilla.org/pub/firefox/candidates/ for 65.0.1 builds but couln't spot it yet. But maybe I am just monitoring the wrong location. Can you provide some hint? Thank you!

That's where it'll be once it's built. Should be there in a few hours.

Alright. Here it's currently 9pm local time. I hope I can do the test this night. I will keep monitoring the candidates and provide feedback as soon as I can.

I just did the tests:

The problem seems to be resolved.

Many thanks again.

Flags: needinfo?(rainer.meier)
Flags: in-qa-testsuite?(cornel.ionce)
Flags: in-qa-testsuite?(cornel.ionce) → in-qa-testsuite+

Please specify a root cause for this bug. See :tmaity for more information.

Root Cause: --- → ?
Flags: needinfo?(dd.mozilla)
Root Cause: ? → Testing Error
Flags: needinfo?(dd.mozilla)
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Creator:
Created:
Updated:
Size: