TCP RST causes the Firefox UI to lag or become unresponsive
Categories
(Core :: Networking, defect, P2)
Tracking
()
People
(Reporter: doudou1041, Assigned: valentin, NeedInfo)
References
(Blocks 1 open bug)
Details
(Whiteboard: [necko-triaged][necko-priority-queue])
Attachments
(4 files)
User Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:91.0) Gecko/20100101 Firefox/91.0
Steps to reproduce:
- Download clumsy (https://github.com/jagt/clumsy/releases/tag/0.3rc3) and run it.
- Start after setting the chance of TCP RST to 30%.
- Open a few pages in Firefox.
Actual results:
The browser UI is lagging or even unresponsive.
I'm sorry, my English is poor.
This bug heavily affects countries that have limited access or bad connections, for example China.
My Firefox freezes every 30 or 40 minutes due to network conditions that I can't do anything to improve. I have barely been able to browse the web normally since version 89.
It might be P3 for people in the first world, but it's essentially P1 for people in developing countries.
I hope Mozilla is aware of how many people are affected by this bug and how important it actually is for millions of people worldwide.
Comment 3•3 years ago
same bug here, I'm in China.
same bug here
https://bugzilla.mozilla.org/show_bug.cgi?id=1718719
Comment 5•3 years ago
We recently fixed bug 1720079, which could cause high CPU load. I'm not sure whether that bug is related to this one, but it may be worth trying the latest Nightly to see if this issue can still be reproduced. Thanks.
(In reply to Kershaw Chang [:kershaw] from comment #5)
We recently fixed bug 1720079, which could cause high CPU load. I'm not sure whether that bug is related to this one, but it may be worth trying the latest Nightly to see if this issue can still be reproduced. Thanks.
The issue can still be reproduced on Nightly 92.0a1 (2021-07-15).
I have had the same problem since Firefox 89, and for this reason some of my friends have tried chrom* instead. I hope you can fix this problem soon.
Comment 8•3 years ago
If you face this issue, I suggest you vote for this bug so Firefox will pay attention to it : )
Comment 10•3 years ago
Same problem here. I'm using FoxyProxy in China. I'm considering dropping Firefox and using Chrome instead.
Comment 11•3 years ago
I still see the problem on 91 stable.
Enabling fission.autostart relieves the symptoms a little.
Comment 12•3 years ago
Can someone make an HTTP log?
https://firefox-source-docs.mozilla.org/networking/http/logging.html
and
a profile:
https://profiler.firefox.com/
Thank you
Comment 13•3 years ago
(In reply to zzjjzzgggg from comment #10)
Same problem here. I'm using FoxyProxy in China. I'm considering dropping Firefox and using Chrome instead.
Just go away, impolite guy!
Comment 14•3 years ago
(In reply to Dragana Damjanovic [:dragana] from comment #12)
Can someone make an HTTP log?
https://firefox-source-docs.mozilla.org/networking/http/logging.html
and
a profile:
https://profiler.firefox.com/
Thank you
I reproduced this bug on FDE 91b9 with a fresh new profile.
Steps:
1. Set up firefox-91.0b9 and create a new profile
2. Open the AMO add-on list page
3. Start the HTTP log and profile recording
4. Open the first 4 add-on pages and wait for the pages to load
5. Stop the log, then save the HTTP log (test log normal in attachment) and the profile log
6. Close all pages but the AMO add-on list page
7. Run clumsy as mentioned above and set the tcp rst chance to 30%
8. Repeat steps 4 and 5
9. Stop after only 1 page has loaded because it's too slow and the GUI is laggy like hell, then save the HTTP log (test log tcp rst 30% in attachment) and the profile log
Comment 15•3 years ago
After talking with users who encountered this problem (unfortunately I failed to reproduce it myself), it appears that setting security.tls.enable_0rtt_data to false can fix it. Can anyone help verify?
If it is related to this pref, maybe we should consider whether there is a problem with how Firefox processes TLS 1.3 0-RTT.
Comment 16•3 years ago
(In reply to yxu from comment #15)
After talking with users who encountered this problem (unfortunately I failed to reproduce it myself), it appears that setting security.tls.enable_0rtt_data to false can fix it. Can anyone help verify?
If it is related to this pref, maybe we should consider whether there is a problem with how Firefox processes TLS 1.3 0-RTT.
Thank you for your comment. Setting security.tls.enable_0rtt_data to false is a good workaround for me. I will use Firefox with this workaround for a few days and report my test results here if needed.
IMHO, even if Firefox's implementation has some deficiency in its TLS 1.3 handling, the UI should not be blocked. However, it's very common for the UI to be totally frozen or to respond very slowly, making it impossible to kill the slow page, so I have had to kill the whole Firefox process.
Comment 17•3 years ago
(In reply to yxu from comment #15)
After talking with users who encountered this problem (unfortunately I failed to reproduce it myself), it appears that setting security.tls.enable_0rtt_data to false can fix it. Can anyone help verify?
If it is related to this pref, maybe we should consider whether there is a problem with how Firefox processes TLS 1.3 0-RTT.
Changing the setting fixes the problem... partially.
The first few pages still load slowly and freeze Firefox, but later pages seem normal (after ~1 minute, up to 5 minutes; I didn't test longer).
New behavior I noticed in the tcp rst 30% test: the first 3~4 pages load extremely slowly and freeze Firefox whether 0rtt_data is true or false, but later pages load faster (still much slower than normal, though) and freeze Firefox less.
Steps:
- Repeat steps 1~9 of comment #14, but replace step 4 and step 9 with: open the first 2 add-on pages and wait for them to load, then open more pages
Result log:
test #2 in attachment: tcp rst 30% with security.tls.enable_0rtt_data = true (profile)
test #3 in attachment: tcp rst 30% with security.tls.enable_0rtt_data = false (profile)
Comment 18•3 years ago
The patch for bug 1382886 has landed. Can you try to reproduce the issue? If you can still reproduce it, please make an HTTP log.
Thanks!
Comment 19•3 years ago
(In reply to Dragana Damjanovic [:dragana] from comment #18)
The patch for bug 1382886 has landed. Can you try to reproduce the issue? If you can still reproduce it, please make an HTTP log.
Thanks!
It didn't fix the problem... completely (if I was using the correct version; tested on Nightly 93a1, build id 20210819214942).
The GUI seems to freeze less than before (maybe it's just my feeling) but it still fails to respond. The already loaded parent page (AMO) is even blank when I try to switch back (as shown in the snapshot of the profile).
Steps:
- Similar to comment #14
Result log:
test #4 in attachment: test log tcp rst 30% #4 enable_0rtt_data True nightly (profile)
Comment 20•3 years ago
(In reply to byzod from comment #19)
Result log:
test #4 in attachment: test log tcp rst 30% #4 enable_0rtt_data True nightly (profile)
Thank you for the log.
I think that clumsy, which you are using to emulate a bad network, is interfering with the TCP socket pair that Firefox uses internally. In the log I see that the socket pair is being broken all the time. That will badly affect Firefox's behavior.
I think that clumsy is creating an unrealistic setup.
I am currently working on a better fix for bug 1382886 that should fix the issues when proxies are used. I hope to have the fix next week. I will ask you to test it once it lands in Nightly.
Thank you for your help!!!!
Comment 21•2 months ago
Moving to Monitor queue, upping priority to verify.
The profile from that comment shows 4, 8 and 13s janks (and others). This is really bad, IF it is still occurring.
If anyone on this bug who saw this originally can comment if they still see it, that would help. Dragana landed a patch that may have helped here 2 1/2 years ago (a few weeks after her last comment).
If you had flipped the 0RTT pref, please put it back to default to test
Comment 22•2 months ago
I see an improvement, but it might just be because I upgraded my PC ;)
Steps:
1. Set up firefox-124.0b3 and create a new profile
2. Open the AMO add-on list page
3. Download and run clumsy, set the chance of TCP RST to 30%
4. Start logging, active modules: timestamp,sync,nsHttp:5,cache2:5,nsSocketTransport:5,nsHostResolver:5 (default)
5. Open the first 4 add-on pages and wait for the pages to load
6. When all pages are fully loaded (no loading icon on tabs), stop logging and upload the log
Log: http://share.firefox.dev/3uJ5Hec
It's still laggy and Firefox keeps freezing, but at least it's not dead: "lagging like hell" improved to "lagging like shit".
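For anyone who prefers to capture the same log from the command line, the module list from step 4 can also be passed through the MOZ_LOG and MOZ_LOG_FILE environment variables described in the HTTP logging documentation linked in comment 12. The following is only a minimal sketch: the helper names and the paths passed to them are hypothetical, and nothing here launches clumsy.

```python
import os
import subprocess

# Module list copied from step 4 of the comment above.
LOG_MODULES = "timestamp,sync,nsHttp:5,cache2:5,nsSocketTransport:5,nsHostResolver:5"


def build_logging_env(log_file, base_env=None):
    """Return a copy of the environment with Firefox logging variables set.

    Hypothetical helper: log_file is whatever path the caller wants the
    log written to.
    """
    env = dict(os.environ if base_env is None else base_env)
    env["MOZ_LOG"] = LOG_MODULES
    env["MOZ_LOG_FILE"] = log_file
    return env


def launch_firefox(firefox_path, log_file):
    """Start Firefox with logging enabled (firefox_path is supplied by the caller)."""
    return subprocess.Popen([firefox_path], env=build_logging_env(log_file))
```

Calling launch_firefox with the path to the Firefox binary and a log path starts a session whose network log can then be attached to the bug.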
Comment 23•2 months ago
(In reply to byzod from comment #22)
...
Steps:
3. Download and run clumsy, set the chance of TCP RST to 30%
...
A clearer version of step 3:
- Download and run clumsy, set the chance of TCP RST to 30%, then click start
Note that clumsy has been updated a few times, but to keep the variables controlled you should still use 0.3rc3.
Comment 24•1 month ago
Redirect needinfos that are pending on inactive users to the triage owner.
:jesup, since the bug has recent activity, could you have a look please?
For more information, please visit BugBot documentation.
Comment 25•1 month ago
byzod: thanks for the new info! We'll try to take a look (moving back to triage)
Comment 26•13 days ago
I've tried this on Linux, using sudo iptables -A INPUT -p tcp -m statistic --mode random --probability 0.30 --tcp-flags PSH PSH -j REJECT --reject-with tcp-reset
I don't see anything like the long delays in the poll loop. No jank at all (though with that high reset rate loading things is hit or miss)
Looking at https://share.firefox.dev/3xxVbrf from your trace, I see LongTaskSocketProcessing on SocketThread times that exactly match up with the MainThread jank times (unresponsive to events).
LongTaskSocketProcessing markers are added when the socket poll loop takes more than a few ms. So something is making this simple loop take several seconds... https://searchfox.org/mozilla-central/source/netwerk/base/nsSocketTransportService2.cpp#1384-1438
Perhaps this is a result of exactly what clumsy is doing. It doesn't appear to happen on Linux when we get RSTs. It may be a Windows-specific issue, or an issue with how clumsy simulates RSTs.
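To illustrate the kind of marker described above, here is a toy sketch of a poll loop that records an entry whenever one iteration exceeds a threshold. It is not the Firefox implementation: the 5 ms threshold, the function name, and the marker format are all assumptions made for illustration.

```python
import time

# Assumed threshold; the comment above only says "more than a few ms".
LONG_TASK_THRESHOLD_S = 0.005


def run_poll_loop(poll_once, iterations, clock=time.monotonic):
    """Call poll_once repeatedly and collect a marker for every slow iteration.

    Each marker is an (iteration index, elapsed seconds) tuple.
    """
    markers = []
    for i in range(iterations):
        start = clock()
        poll_once()
        elapsed = clock() - start
        if elapsed > LONG_TASK_THRESHOLD_S:
            markers.append((i, elapsed))
    return markers
```

In a healthy loop the returned list stays empty; multi-second entries would correspond to the jank seen in the trace.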
Comment 27•12 days ago
The problem seems to be that the socket thread is stuck in TryRepairPollableEvent > new PollableEvent while holding the mutex, which makes the main thread hang while trying to acquire the mutex.
Andrew fixed a similar shutdown hang in bug 1843384 (though I'm still seeing some related crashes, often with fcagff64.dll on the stack)
@reporter, do you maybe have McAfee installed?
In any case, I'm thinking we could use "busy waiting" while the pollable event is being repaired:
- Make TryRepairPollableEvent set mPollableEvent and mPollList[0].fd to null, and unlock the mutex
- Dispatch a background task that would call new PollableEvent without holding the lock.
- Once that is done, acquire the mutex, and actually repair the pollable event.
I think this should get rid of the hang, at the cost of increased CPU usage on the socket thread (but I'm assuming that's better than a hang).
Alternatively we can try to figure out if there's a way to avoid using PR_NewTCPSocketPair which we know can block for a long time.
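The three steps proposed above can be sketched as a toy model. This is not Gecko code: PollState, try_repair, and make_event are hypothetical names, and the sketch only shows the locking pattern (clear under the lock, rebuild without it, re-acquire for the swap).

```python
import threading


class PollState:
    """Toy stand-in for the socket thread's shared state (hypothetical names)."""

    def __init__(self):
        self.lock = threading.Lock()
        self.pollable_event = object()

    def try_repair(self, make_event):
        # Step 1: clear the broken event under the lock, then release the lock.
        with self.lock:
            self.pollable_event = None
        # Step 2: build the replacement on a background thread, lock not held.
        worker = threading.Thread(target=self._rebuild, args=(make_event,))
        worker.start()
        return worker

    def _rebuild(self, make_event):
        new_event = make_event()  # may block for a long time
        # Step 3: take the lock only for the quick swap.
        with self.lock:
            self.pollable_event = new_event
```

While make_event is blocked, other threads can still take the lock and simply observe that no pollable event exists yet, which is where the extra polling cost mentioned above would come from.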
Comment 28•8 days ago
I found this change https://gitlab.com/openconnect/openconnect/-/merge_requests/320 that seems to address a similar issue.
It seems Windows now has AF_UNIX sockets, so possibly we could do the same to avoid hanging in PR_NewTCPSocketPair, and to avoid the fact that TCP sockets are affected by network conditions and tools such as clumsy. Note that most of the crashes in bug 1843384 that are still happening also have the McAfee DLL on the stack, so it is also being affected by third-party software.
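As a rough illustration of the idea (not the openconnect change and not a proposed Gecko patch), a wake-up event can be built on an AF_UNIX socket pair so that it never touches the TCP stack. The class name and methods below are hypothetical, and the sketch assumes a platform where Python exposes socket.AF_UNIX, such as Linux.

```python
import select
import socket


class WakeupEvent:
    """Wake-up channel backed by an AF_UNIX socket pair (hypothetical sketch)."""

    def __init__(self):
        # Assumes socket.AF_UNIX is available on this platform.
        self.reader, self.writer = socket.socketpair(socket.AF_UNIX, socket.SOCK_STREAM)
        self.reader.setblocking(False)

    def signal(self):
        # One byte is enough to make the reader end readable.
        self.writer.send(b"\x01")

    def wait(self, timeout):
        # Return True if a wake-up arrived within timeout seconds.
        ready, _, _ = select.select([self.reader], [], [], timeout)
        if not ready:
            return False
        self.reader.recv(64)  # drain pending wake-up bytes
        return True

    def close(self):
        self.reader.close()
        self.writer.close()
```

Because the pair is a local domain socket rather than a loopback TCP connection, tools that inject TCP RSTs would not be expected to break it.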