Open Bug 1719046 Opened 3 years ago Updated 6 days ago

TCP RST causes the Firefox UI to lag or become unresponsive

Categories

(Core :: Networking, defect, P2)

Firefox 91
Desktop
Windows
defect

Tracking

()

UNCONFIRMED

People

(Reporter: doudou1041, Assigned: valentin, NeedInfo)

References

(Blocks 1 open bug)

Details

(Whiteboard: [necko-triaged][necko-priority-queue])

Attachments

(4 files)

Attached video 2021-07-03.mp4

User Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:91.0) Gecko/20100101 Firefox/91.0

Steps to reproduce:

  1. Download clumsy (https://github.com/jagt/clumsy/releases/tag/0.3rc3) and run it.
  2. Set the chance of TCP RST to 30%, then click Start.
  3. Open a few pages in Firefox.

Actual results:

The browser UI is lagging or even unresponsive.


I'm sorry, my English is poor.

See Also: → 1703232
Severity: -- → S3
Priority: -- → P3
Whiteboard: [necko-triaged]

same here

This bug heavily affects countries with restricted access or poor connections, for example China.
My Firefox freezes every 30 or 40 minutes because of network conditions that I can't do anything to improve. I have barely been able to browse the web normally since version 89.

It might be a P3 for people in the first world, but it's essentially a P1 for people in developing countries.

I hope Mozilla is aware of how many people are affected by this bug and how important it actually is for millions of people worldwide.

same bug here, I'm in China.

We recently fixed bug 1720079, which could cause high CPU load. I'm not sure whether that bug is related to this one, but it may be worth trying the latest Nightly to see if this issue can still be reproduced. Thanks.

(In reply to Kershaw Chang [:kershaw] from comment #5)

We recently fixed bug 1720079, which could cause high CPU load. I'm not sure whether that bug is related to this one, but it may be worth trying the latest Nightly to see if this issue can still be reproduced. Thanks.

The issue can still be reproduced on Nightly 92.0a1 (2021-07-15).

I have had the same problem since Firefox 89, and for this reason some of my friends have tried to use chrom* instead. I hope you guys can fix this problem soon.

To everyone facing this issue: I suggest you vote for this bug so that Firefox pays attention to it :)

Me too, this bug really bothers me.

Same problem here. I'm using FoxyProxy in China. I'm considering dropping Firefox and using Chrome instead.

I still face the problem on 91 stable.
Setting fission.autostart to true relieves the symptom a little.

(In reply to zzjjzzgggg from comment #10)

Same problem here. I'm using FoxyProxy in China. I'm considering dropping Firefox and using Chrome instead.

Just go away, impolite guy!

Attached file tcp rst test log.zip

(In reply to Dragana Damjanovic [:dragana] from comment #12)

Can someone make a http log?
https://firefox-source-docs.mozilla.org/networking/http/logging.html
and
a profile:
https://profiler.firefox.com/

Thank you

I reproduced this bug on FDE 91b9, with a fresh new profile

Steps:

  1. Set up firefox-91.0b9 and create a new profile
  2. Open the AMO add-on list page
  3. Start the HTTP log and profile recording
  4. Open the first 4 add-on pages and wait for them to load
  5. Stop logging, then save the HTTP log ("test log normal" in the attachment) and the profile
  6. Close all pages except the AMO add-on list page
  7. Run clumsy as mentioned above, with the TCP RST chance set to 30%
  8. Repeat steps 4 and 5
  9. Stop after only 1 page has loaded, because it's too slow and the GUI is laggy like hell, then save the HTTP log ("test log tcp rst 30%" in the attachment) and the profile

After talking with users who encountered this problem (unfortunately I failed to reproduce it myself), it appears that setting security.tls.enable_0rtt_data to false can fix it. Can anyone help verify?

If it is related to this value, maybe we should consider whether Firefox has a problem processing TLS 1.3 0-RTT.

(In reply to yxu from comment #15)

After talking with users who encountered this problem (unfortunately I failed to reproduce it myself), it appears that setting security.tls.enable_0rtt_data to false can fix it. Can anyone help verify?

If it is related to this value, maybe we should consider whether Firefox has a problem processing TLS 1.3 0-RTT.

Thank you for your comment. Setting security.tls.enable_0rtt_data to false is a good workaround for me. I will use Firefox with this workaround for a few days and report my test results here if needed.

IMHO, even if Firefox's TLS 1.3 implementation has some deficiencies, the UI should not be blocked. However, it is very common for the UI to freeze completely or respond very slowly, which makes it impossible to close the slow page, so I have had to kill the whole Firefox process.

(In reply to yxu from comment #15)

After talking with users who encountered this problem (unfortunately I failed to reproduce it myself), it appears that setting security.tls.enable_0rtt_data to false can fix it. Can anyone help verify?

If it is related to this value, maybe we should consider whether Firefox has a problem processing TLS 1.3 0-RTT.

Changing the setting fixes the problem... partially.
The first few pages still load slowly and freeze Firefox, but later pages seem normal (after ~1 minute and up to 5 minutes; I didn't test longer).

New behavior I noticed in the TCP RST 30% test: the first 3~4 pages load extremely slowly and freeze Firefox whether 0rtt_data is true or false, but later pages load faster (still much slower than normal, though) and freeze Firefox less.

Steps:

  1. Repeat steps 1~9 of comment #14, but replace steps 4 and 9 with: open the first 2 add-on pages and wait for them to load, then open more pages

Result log:
test #2 in attachment: tcp rst 30% with security.tls.enable_0rtt_data = true (profile)
test #3 in attachment: tcp rst 30% with security.tls.enable_0rtt_data = false (profile)

Patch for bug 1382886 landed. Can you try to reproduce the issue? If you can reproduce the issue please make a http log.

Thanks!

Flags: needinfo?(543080122)

(In reply to Dragana Damjanovic [:dragana] from comment #18)

Patch for bug 1382886 landed. Can you try to reproduce the issue? If you can reproduce the issue please make a http log.

Thanks!

It didn't fix the problem... completely (if I was using the correct version; tested on Nightly 93a1, build ID 20210819214942).

The GUI seems to freeze less than before (maybe it's just my feeling) but it still fails to respond. The already loaded parent page (AMO) is even blank when I try to switch back to it (as shown in the snapshot of the profile).

Steps:
  1. Similar to comment #14
Result log:

test #4 in attachment: test log tcp rst 30% #4 enable_0rtt_data True nightly (profile)

Flags: needinfo?(543080122)

(In reply to byzod from comment #19)

Result log:

test #4 in attachment: test log tcp rst 30% #4 enable_0rtt_data True nightly (profile)

Thank you for the log.
I think that clumsy, which you are using to emulate a bad network, is interfering with the TCP socket pair that Firefox uses internally. In the log I can see that the socket pair is being broken all the time. That will badly affect Firefox's behavior.
I think that clumsy is creating an unrealistic setup.

I am currently working on a better fix for bug 1382886 that should fix the issues when proxies are used. I hope to have the fix next week. I will ask you to test it once it lands in Nightly.

Thank you for your help!

Moving to Monitor queue, upping priority to verify.

The profile from that comment shows 4, 8 and 13s janks (and others). This is really bad, IF it is still occurring.

If anyone on this bug who saw this originally can comment if they still see it, that would help. Dragana landed a patch that may have helped here 2 1/2 years ago (a few weeks after her last comment).

If you had flipped the 0RTT pref, please put it back to default to test

Flags: needinfo?(skyecook320)
Flags: needinfo?(plumerlis)
Flags: needinfo?(litimetal)
Flags: needinfo?(docrage)
Flags: needinfo?(aeghn)
Flags: needinfo?(543080122)
Priority: P3 → P2
Whiteboard: [necko-triaged] → [necko-triaged][necko-priority-monitor]

I see improvement but might be just because I upgraded my pc ;)

Steps:

  1. Setup firefox-124.0b3 and create a new profile
  2. Open amo addon list page
  3. Download and run clumsy, set the chance of TCP RST to 30%
  4. Start logging, active modules: timestamp,sync,nsHttp:5,cache2:5,nsSocketTransport:5,nsHostResolver:5 (default)
  5. Open the first 4 add-on pages and wait for them to load
  6. When all pages are fully loaded (no loading icon on the tabs), stop logging and upload the log

Log: http://share.firefox.dev/3uJ5Hec

It's still laggy and Firefox keeps freezing, but at least it's not dead; "lagging like hell" has improved to "lagging like shit".

Flags: needinfo?(543080122)

(In reply to byzod from comment #22)

...
Steps:
3. Download and run clumsy, set the chance of TCP RST to 30%
...


A clearer version of step 3:

  1. Download and run clumsy, set the chance of TCP RST to 30%, then click Start

Note that clumsy has been updated a few times since, but to keep the variables controlled you should still use 0.3rc3.

Redirect needinfos that are pending on inactive users to the triage owner.
:jesup, since the bug has recent activity, could you have a look please?

For more information, please visit BugBot documentation.

Flags: needinfo?(skyecook320)
Flags: needinfo?(rjesup)
Flags: needinfo?(plumerlis)
Flags: needinfo?(docrage)
Flags: needinfo?(aeghn)

byzod: thanks for the new info! We'll try to take a look (moving back to triage)

Flags: needinfo?(rjesup)
Whiteboard: [necko-triaged][necko-priority-monitor] → [necko-triaged][necko-priority-new]
Whiteboard: [necko-triaged][necko-priority-new] → [necko-triaged][necko-priority-next]
Whiteboard: [necko-triaged][necko-priority-next] → [necko-triaged][necko-priority-queue]

I've tried this on Linux, using:

    sudo iptables -A INPUT -p tcp -m statistic --mode random --probability 0.30 --tcp-flags PSH PSH -j REJECT --reject-with tcp-reset

I don't see anything like the long delays in the poll loop. No jank at all (though with that high a reset rate, loading things is hit or miss).

Looking at https://share.firefox.dev/3xxVbrf from your trace, I see LongTaskSocketProcessing on SocketThread times that exactly match up with the MainThread jank times (unresponsive to events).

LongTaskSocketProcessing markers are added when the socket poll loop takes more than a few ms. So something is making this simple loop take several seconds... https://searchfox.org/mozilla-central/source/netwerk/base/nsSocketTransportService2.cpp#1384-1438
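To illustrate the marker idea only (this is not the Firefox code), here is a minimal Python sketch that times one poll-loop iteration and records a marker when it runs long. The function name, the 5 ms threshold and the marker tuple layout are all assumptions made for the example.

```python
import time

# Assumed threshold standing in for "a few ms"; not Firefox's actual value.
LONG_TASK_THRESHOLD_S = 0.005


def run_poll_iteration(poll_once, markers,
                       threshold_s=LONG_TASK_THRESHOLD_S,
                       clock=time.monotonic):
    """Run one poll-loop iteration and record a marker if it ran long.

    poll_once is a hypothetical callable standing in for the body of the
    socket poll loop; markers is a plain list used as the marker sink.
    """
    start = clock()
    poll_once()
    elapsed = clock() - start
    if elapsed > threshold_s:
        # Record name, start time and duration of the long iteration.
        markers.append(("LongTaskSocketProcessing", start, elapsed))
    return elapsed
```

With markers like these, a multi-second iteration on the socket thread can be lined up against the periods when the main thread was unresponsive, which is the correlation described above.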

Perhaps this is a result of exactly what clumsy is doing. It doesn't appear to happen in Linux when we get RSTs. It may be a windows-specific issue, or an issue with how clumsy simulates RSTs.

OS: Unspecified → Windows
Hardware: Unspecified → Desktop

The problem seems to be that the socket thread is stuck in TryRepairPollableEvent > new PollableEvent while holding the mutex, which makes the main thread hang while trying to get the mutex here.

Andrew fixed a similar shutdown hang in bug 1843384 (though I'm still seeing some related crashes, often with fcagff64.dll on the stack).
@reporter, do you maybe have McAfee installed?

In any case, I'm thinking we could use "busy waiting" while the pollable event is being repaired:

  • Make TryRepairPollableEvent set mPollableEvent and mPollList[0].fd to null, then unlock the mutex
  • dispatch a background task that would call new PollableEvent without holding the lock.
  • Once that is done, acquire the mutex, and actually repair the pollable event. I think this should get rid of the hang, at the cost of increased CPU usage on the socket thread (but I'm assuming that's better than a hang).
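The three steps above can be sketched in Python as follows. This is a hedged illustration of the locking pattern only, not the Necko implementation: the class, method and attribute names are hypothetical, and `make_event` stands in for the slow `new PollableEvent` call.

```python
import threading


class SocketServiceSketch:
    """Hypothetical stand-in for the socket transport service."""

    def __init__(self, make_event):
        self._lock = threading.Lock()
        self._make_event = make_event  # assumed to block for a long time
        self._event = make_event()
        self._repairing = False

    def try_repair_pollable_event(self):
        # Step 1: clear the broken event under the lock, then release it.
        with self._lock:
            if self._repairing:
                return None
            self._event = None
            self._repairing = True
        # Step 2: create the replacement on a background thread,
        # without holding the lock.
        worker = threading.Thread(target=self._repair, daemon=True)
        worker.start()
        return worker

    def _repair(self):
        new_event = self._make_event()  # slow call, lock not held
        # Step 3: take the lock only for the quick swap.
        with self._lock:
            self._event = new_event
            self._repairing = False

    def has_pollable_event(self):
        # Stand-in for main-thread code that needs the same mutex.
        with self._lock:
            return self._event is not None
```

While `has_pollable_event()` returns False, the poll loop would have to poll with a short timeout instead of waiting to be woken up, which is where the extra CPU usage mentioned above comes from.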

Alternatively we can try to figure out if there's a way to avoid using PR_NewTCPSocketPair which we know can block for a long time.

See Also: → 1843384

I found this change https://gitlab.com/openconnect/openconnect/-/merge_requests/320 that seems to have a similar issue.
It seems Windows now has AF_UNIX sockets, so possibly we could do the same to avoid hanging in PR_NewTCPSocketPair, given that the TCP sockets are affected by network conditions and by tools such as clumsy. Note that most of the crashes in bug 1843384 that are still happening also have the McAfee DLL on the stack, so this is also being affected by third-party software.
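For illustration, the difference between the two kinds of socket pair can be sketched in Python. This is not NSPR's code; `tcp_socket_pair` and `unix_socket_pair` are hypothetical helper names. The first builds a pair from a loopback listen/connect/accept, so every step goes through the TCP stack and can be disturbed by anything that tampers with TCP traffic. The second asks for an AF_UNIX pair and assumes the platform's Python exposes `socket.AF_UNIX`.

```python
import socket


def tcp_socket_pair():
    """Build a connected pair over loopback TCP (goes through the TCP stack)."""
    listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    try:
        listener.bind(("127.0.0.1", 0))  # port 0: let the OS pick a free port
        listener.listen(1)
        client = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
        try:
            client.connect(listener.getsockname())
            server, _ = listener.accept()
        except OSError:
            client.close()
            raise
    finally:
        listener.close()
    return client, server


def unix_socket_pair():
    """Build an AF_UNIX pair, which does not involve TCP at all.

    Assumes socket.AF_UNIX is available in this Python build.
    """
    return socket.socketpair(socket.AF_UNIX, socket.SOCK_STREAM)
```

Either pair can serve as a wake-up channel: one end is written to, and the other end sits in the poll list.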

Assignee: nobody → valentin.gosu