1719046 - TCP RST will cause Firefox UI to lag or become unresponsive

OUER

Reporter

Description

•

3 years ago

Attached video 2021-07-03.mp4

User Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:91.0) Gecko/20100101 Firefox/91.0

Steps to reproduce:

Download clumsy (https://github.com/jagt/clumsy/releases/tag/0.3rc3) and run it.
Start after setting the chance of TCP RST to 30%.
Open a few pages in Firefox.

Actual results:

The browser UI is lagging or even unresponsive.

I'm sorry, my English is poor.

OUER

Reporter

Updated

•

3 years ago

Severity: -- → S3

Priority: -- → P3

Whiteboard: [necko-triaged]

plumerlis

Comment 1

•

3 years ago

same here

byzod

Comment 2

•

3 years ago

This bug heavily affects countries with limited access or bad connections, for example China.
My Firefox freezes every 30 or 40 minutes because of network conditions that I can't do anything to improve. I have barely been able to browse the web normally since version 89.

It might be a P3 for people in the first world, but it is essentially a P1 for people in developing countries.

I hope Mozilla is aware of how many people are affected by this bug and how important it actually is for millions of people worldwide

suguangming95

Comment 3

•

3 years ago

same bug here, I'm in China.

plumerlis

Comment 4

•

3 years ago

same bug here
https://bugzilla.mozilla.org/show_bug.cgi?id=1718719

Kershaw Chang [:kershaw]

Comment 5

•

3 years ago

We recently fixed bug 1720079, which could cause high CPU load. I'm not sure whether that bug is related to this one, but it may be worth trying the latest Nightly to see if this issue can still be reproduced. Thanks.

OUER

Reporter

Comment 6

•

3 years ago

(In reply to Kershaw Chang [:kershaw] from comment #5)

We recently fixed bug 1720079, which could cause high CPU load. I'm not sure whether that bug is related to this one, but it may be worth trying the latest Nightly to see if this issue can still be reproduced. Thanks.

The issue can still be reproduced on Nightly 92.0a1 (2021-07-15).

aeghn

Comment 7

•

3 years ago

I have had the same problem since Firefox 89, and for this reason some of my friends have tried using chrom* instead. I hope you can fix this problem soon.

skyecook320

Comment 8

•

3 years ago

To everyone facing this issue: I suggest you vote for this bug so Firefox will pay attention to it : )

hsingko

Comment 9

•

3 years ago

Me too, this bug really bothers me.

zzjjzzgggg

Comment 10

•

3 years ago

Same problem here. I'm using FoxyProxy in China. I'm considering throwing away Firefox and using Chrome instead.

docrage

Comment 11

•

3 years ago

I still face the problem on 91 stable.
Enabling fission.autostart relieves the symptom a little.

Dragana Damjanovic [:dragana]

Comment 12

•

3 years ago

Can someone make an HTTP log?
https://firefox-source-docs.mozilla.org/networking/http/logging.html
and
a profile:
https://profiler.firefox.com/

Thank you

skyecook320

Comment 13

•

3 years ago

(In reply to zzjjzzgggg from comment #10)

Same problem here. I'm using FoxyProxy in China. I'm considering throwing away Firefox and using Chrome instead.

Just go away, impolite guy!

byzod

Comment 14

•

3 years ago

Attached file tcp rst test log.zip

(In reply to Dragana Damjanovic [:dragana] from comment #12)

Can someone make an HTTP log?
https://firefox-source-docs.mozilla.org/networking/http/logging.html
and
a profile:
https://profiler.firefox.com/

Thank you

I reproduced this bug on FDE 91b9, with a fresh new profile

Steps:

1. Set up firefox-91.0b9 and create a new profile
2. Open the AMO add-on list page
3. Start the HTTP log and profile recording
4. Open the first 4 add-on pages and wait for the pages to load
5. Stop logging, save the HTTP log (test log normal in attachment) and the profile
6. Close all pages except the AMO add-on list page
7. Run clumsy as mentioned above and set the TCP RST chance to 30%
8. Repeat steps 4 and 5
9. Stop after only 1 page has loaded, because it's too slow and the GUI is laggy like hell, then save the HTTP log (test log tcp rst 30% in attachment) and the profile

yxu

Comment 15

•

3 years ago

After talking with users who encountered this problem (unfortunately I failed to reproduce it myself), it appears that setting security.tls.enable_0rtt_data to false can fix it. Can anyone help verify?

If it is related to this value, maybe we should consider whether Firefox has a problem processing TLS 1.3 0-RTT.

Zhenbo Li

Comment 16

•

3 years ago

(In reply to yxu from comment #15)

After talking with users who encountered this problem (unfortunately I failed to reproduce it myself), it appears that setting security.tls.enable_0rtt_data to false can fix it. Can anyone help verify?

If it is related to this value, maybe we should consider whether Firefox has a problem processing TLS 1.3 0-RTT.

Thank you for your comment. Setting security.tls.enable_0rtt_data to false is a good workaround for me. I will use Firefox with this workaround for a few days and report my test results here if needed.

IMHO, even if Firefox's TLS 1.3 implementation has some deficiencies, the UI should not be blocked. However, it's very common for the UI to freeze completely or respond very slowly, which makes it impossible to close the slow page, and I have had to kill the whole Firefox process.

byzod

Comment 17

•

3 years ago

Attached file tcp rst test log #2&#3.zip

(In reply to yxu from comment #15)

After talking with users who encountered this problem (unfortunately I failed to reproduce it myself), it appears that setting security.tls.enable_0rtt_data to false can fix it. Can anyone help verify?

If it is related to this value, maybe we should consider whether Firefox has a problem processing TLS 1.3 0-RTT.

Changing the setting fixes the problem...partially
The first few pages still load slowly and freeze Firefox, but later pages seem normal (after ~1 minute, up to 5 minutes; I didn't test longer)

New behavior I noticed in the TCP RST 30% test: the first 3~4 pages load extremely slowly and freeze Firefox whether 0rtt_data is true or false, but later pages load faster (still much slower than normal, though) and freeze Firefox less

Steps:

Repeat steps 1~9 of comment #14, but replace step 4 and step 9 with: Open the first 2 add-on pages and wait for the pages to load, then open more pages

Result log:
test #2 in attachment: tcp rst 30% with security.tls.enable_0rtt_data = true (profile)
test #3 in attachment: tcp rst 30% with security.tls.enable_0rtt_data = false (profile)

Dragana Damjanovic [:dragana]

Comment 18

•

3 years ago

The patch for bug 1382886 has landed. Can you try to reproduce the issue? If you can reproduce it, please make an HTTP log.

Thanks!

Flags: needinfo?(543080122)

byzod

Comment 19

•

3 years ago

Attached file test log tcp rst 30% #4 enable_0rtt_data True nightly.zip

(In reply to Dragana Damjanovic [:dragana] from comment #18)

The patch for bug 1382886 has landed. Can you try to reproduce the issue? If you can reproduce it, please make an HTTP log.

Thanks!

It didn't fix the problem...completely (if I was using the correct version; tested on Nightly 93a1, build ID 20210819214942)

The GUI seems to freeze less than before (maybe that's just my feeling) but it still fails to respond. The already loaded parent page (AMO) is even blank when I try to switch back (as shown in the snapshot of the profile)

Steps:

Similar to comment #14

Result log:

test #4 in attachment: test log tcp rst 30% #4 enable_0rtt_data True nightly (profile)

Flags: needinfo?(543080122)

Dragana Damjanovic [:dragana]

Comment 20

•

3 years ago

(In reply to byzod from comment #19)

Result log:

test #4 in attachment: test log tcp rst 30% #4 enable_0rtt_data True nightly (profile)

Thank you for the log.
I think that clumsy, which you are using to emulate a bad network, is interfering with the TCP socket pair that Firefox uses internally. In the log I see that the socket pair is being broken all the time. That will badly affect Firefox's behavior.
I think that clumsy is creating an unrealistic setup.

I am currently working on a better fix for bug 1382886 that should fix the issues when proxies are used. I hope to have the fix next week. I will ask you to test it once it lands in Nightly.

Thank you for your help!!!!

Manuel Bucher [:manuel]

Updated

•

3 months ago

Blocks: necko-perf

Randell Jesup [:jesup] (needinfo me)

Comment 21

•

2 months ago

Moving to Monitor queue, upping priority to verify.

The profile from that comment shows 4, 8 and 13s janks (and others). This is really bad, IF it is still occurring.

If anyone on this bug who saw this originally can comment if they still see it, that would help. Dragana landed a patch that may have helped here 2 1/2 years ago (a few weeks after her last comment).

If you had flipped the 0RTT pref, please put it back to the default before testing.

Flags: needinfo?(skyecook320)

Flags: needinfo?(plumerlis)

Flags: needinfo?(litimetal)

Flags: needinfo?(docrage)

Flags: needinfo?(aeghn)

Flags: needinfo?(543080122)

Priority: P3 → P2

Whiteboard: [necko-triaged] → [necko-triaged][necko-priority-monitor]

byzod

Comment 22

•

2 months ago

I see improvement but might be just because I upgraded my pc ;)

Steps:

1. Set up firefox-124.0b3 and create a new profile
2. Open the AMO add-on list page
3. Download and run clumsy, set the chance of TCP RST to 30%
4. Start logging, active modules: timestamp,sync,nsHttp:5,cache2:5,nsSocketTransport:5,nsHostResolver:5 (default)
5. Open the first 4 add-on pages and wait for the pages to load
6. When all pages have fully loaded (no loading icon on tabs), stop logging and upload the log
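
For anyone who prefers to capture the same log from a launcher script, here is a minimal sketch. It assumes the MOZ_LOG and MOZ_LOG_FILE environment variables described in the HTTP logging documentation linked in comment 12, and reuses the module list from the logging step above. The helper name `logging_env`, the log path, and the `firefox` binary name are illustrative assumptions, not something used in this test run.

```python
import os

# Module list copied from the logging step above.
LOG_MODULES = "timestamp,sync,nsHttp:5,cache2:5,nsSocketTransport:5,nsHostResolver:5"


def logging_env(log_file, base_env=None):
    """Return a copy of the environment with Firefox network logging enabled.

    log_file is a hypothetical path; base_env defaults to the current environment.
    """
    env = dict(os.environ if base_env is None else base_env)
    env["MOZ_LOG"] = LOG_MODULES       # which log modules to record, and at what level
    env["MOZ_LOG_FILE"] = log_file     # where the log is written
    return env


# Example launch (assumes a `firefox` binary on PATH):
#   import subprocess
#   subprocess.Popen(["firefox"], env=logging_env("/tmp/firefox-http.log"))
```

The helper only builds the environment, so it can be checked without starting the browser.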

Log: http://share.firefox.dev/3uJ5Hec

It's still laggy and Firefox keeps freezing, but at least it's not dead; lagging like hell improved to lagging like shit

Flags: needinfo?(543080122)

byzod

Comment 23

•

2 months ago

(In reply to byzod from comment #22)

...
Steps:
3. Download and run clumsy, set the chance of TCP RST to 30%
...

A clearer step 3:

Download and run clumsy, set the chance of TCP RST to 30%, then click start

Note that clumsy has been updated a few times, but to keep the variables controlled you should still use 0.3rc3

BugBot [:suhaib / :marco/ :calixte]

Comment 24

•

1 month ago

Redirect needinfos that are pending on inactive users to the triage owner.
:jesup, since the bug has recent activity, could you have a look please?

For more information, please visit BugBot documentation.

Flags: needinfo?(skyecook320)

Flags: needinfo?(rjesup)

Flags: needinfo?(plumerlis)

Flags: needinfo?(docrage)

Flags: needinfo?(aeghn)

Randell Jesup [:jesup] (needinfo me)

Comment 25

•

1 month ago

byzod: thanks for the new info! We'll try to take a look (moving back to triage)

Flags: needinfo?(rjesup)

Whiteboard: [necko-triaged][necko-priority-monitor] → [necko-triaged][necko-priority-new]

Valentin Gosu [:valentin] (he/him)

Assignee

Updated

•

1 month ago

Whiteboard: [necko-triaged][necko-priority-new] → [necko-triaged][necko-priority-next]

Randell Jesup [:jesup] (needinfo me)

Updated

•

1 month ago

Whiteboard: [necko-triaged][necko-priority-next] → [necko-triaged][necko-priority-queue]

Randell Jesup [:jesup] (needinfo me)

Comment 26

•

13 days ago

I've tried this on Linux, using:
sudo iptables -A INPUT -p tcp -m statistic --mode random --probability 0.30 --tcp-flags PSH PSH -j REJECT --reject-with tcp-reset
I don't see anything like the long delays in the poll loop. No jank at all (though with that high reset rate loading things is hit or miss)

Looking at https://share.firefox.dev/3xxVbrf from your trace, I see LongTaskSocketProcessing on SocketThread times that exactly match up with the MainThread jank times (unresponsive to events).

LongTaskSocketProcessing markers are added when the socket poll loop takes more than a few ms. So something is making this simple loop take several seconds... https://searchfox.org/mozilla-central/source/netwerk/base/nsSocketTransportService2.cpp#1384-1438
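
To illustrate how such a marker can line up with the jank, here is a small Python sketch of the idea only, not the actual Gecko code. The 5 ms threshold is an assumed value (the comment above says only "a few ms"), and `run_poll_iteration` and the marker list are hypothetical names.

```python
import time

LONG_TASK_THRESHOLD_MS = 5.0  # assumed value; the real threshold is not stated here


def run_poll_iteration(poll_once, markers,
                       threshold_ms=LONG_TASK_THRESHOLD_MS,
                       clock=time.monotonic):
    """Run one iteration of a poll loop and record a marker if it ran long."""
    start = clock()
    poll_once()                                  # the work done in one loop pass
    elapsed_ms = (clock() - start) * 1000.0
    if elapsed_ms > threshold_ms:
        # A multi-second value here would correspond to the jank in the profile.
        markers.append(("LongTaskSocketProcessing", elapsed_ms))
    return elapsed_ms
```

Anything that blocks inside `poll_once` shows up as one long marker, which is why the marker times match the unresponsive periods on the main thread when the two threads contend for the same lock.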

Perhaps this is a result of exactly what clumsy is doing. It doesn't appear to happen on Linux when we get RSTs. It may be a Windows-specific issue, or an issue with how clumsy simulates RSTs.

OS: Unspecified → Windows

Hardware: Unspecified → Desktop

Valentin Gosu [:valentin] (he/him)

Assignee

Comment 27

•

12 days ago

The problem seems to be that the socket thread is stuck in TryRepairPollableEvent > new PollableEvent while holding the mutex, which makes the main thread hang while trying to get the mutex here.

Andrew fixed a similar shutdown hang in bug 1843384 (though I'm still seeing some related crashes, often with fcagff64.dll on the stack).
@reporter, do you maybe have McAfee installed?

In any case, I'm thinking we could use "busy waiting" while the pollable event is being repaired:

1. Make TryRepairPollableEvent set mPollableEvent and mPollList[0].fd to null, then unlock the mutex.
2. Dispatch a background task that calls new PollableEvent without holding the lock.
3. Once that is done, acquire the mutex and actually repair the pollable event.

I think this should get rid of the hang, at the cost of increased CPU usage on the socket thread (but I'm assuming that's better than a hang).
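
The three steps above can be sketched as follows. This is a toy Python model of the locking pattern only, not Gecko code: `PollableEventOwner`, `make_event`, and the method names are all hypothetical, and `make_event` stands in for a constructor that may block for a long time.

```python
import threading


class PollableEventOwner:
    """Toy stand-in for the object that owns the pollable event."""

    def __init__(self, make_event):
        self._lock = threading.Lock()
        self._make_event = make_event   # may block for a long time
        self._event = make_event()

    def get_event(self):
        # What another thread (e.g. the main thread) does: a short critical section.
        with self._lock:
            return self._event

    def repair_blocking(self):
        # The problematic pattern: the slow call runs while the lock is held,
        # so every get_event() caller stalls until it returns.
        with self._lock:
            self._event = self._make_event()

    def repair_in_background(self):
        # Step 1: clear the event under the lock, then release the lock.
        with self._lock:
            self._event = None

        def worker():
            # Step 2: create the replacement without holding the lock.
            new_event = self._make_event()
            # Step 3: take the lock only to install the finished event.
            with self._lock:
                self._event = new_event

        thread = threading.Thread(target=worker)
        thread.start()
        return thread
```

While the worker is running, `get_event()` returns `None` immediately instead of blocking. In the real proposal the socket thread would have to keep polling without the event during that window, which is where the extra CPU usage comes from.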

Alternatively we can try to figure out if there's a way to avoid using PR_NewTCPSocketPair which we know can block for a long time.

Comment 28

•

8 days ago

I found this change, https://gitlab.com/openconnect/openconnect/-/merge_requests/320, which seems to address a similar issue.
It seems Windows now has AF_UNIX sockets, so possibly we could do the same to avoid hanging in PR_NewTCPSocketPair, given that TCP sockets are affected by network conditions and tools such as clumsy. Note that most of the crashes in bug 1843384 that are still happening also have the McAfee DLL on the stack, so that is also being affected by third-party software.
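
To make the difference concrete, here is a hedged Python sketch contrasting the two kinds of socket pair. `tcp_socket_pair` only loosely mirrors what a TCP-based pair has to do (listen, connect, accept over loopback); it is not the NSPR implementation, and both function names are hypothetical. The second helper relies on `socket.socketpair()`, which defaults to AF_UNIX on POSIX systems; whether the same approach is workable for Firefox on Windows is exactly the open question above.

```python
import socket


def tcp_socket_pair():
    """Build a connected pair over loopback TCP.

    Every step goes through the TCP stack, so packet filters and tools that
    inject resets on TCP traffic can interfere with it.
    """
    listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    try:
        listener.bind(("127.0.0.1", 0))          # let the OS pick a free port
        listener.listen(1)
        client = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
        client.connect(listener.getsockname())   # a real TCP handshake
        server, _ = listener.accept()
    finally:
        listener.close()
    return client, server


def local_socket_pair():
    """Build a connected pair without TCP where the platform allows it."""
    return socket.socketpair()
```

Both pairs behave the same once connected (bytes written to one end can be read from the other), so the choice mainly affects how exposed the pair is to network conditions.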

Valentin Gosu [:valentin] (he/him)

Assignee

Updated

•

6 days ago

Assignee: nobody → valentin.gosu

2021-07-03.mp4, 3 years ago, OUER, 2.51 MB, video/mp4
tcp rst test log.zip, 3 years ago, byzod, 934.06 KB, application/zip
tcp rst test log #2&#3.zip, 3 years ago, byzod, 6.51 MB, application/zip
test log tcp rst 30% #4 enable_0rtt_data True nightly.zip, 3 years ago, byzod, 488.72 KB, application/zip