Closed Bug 1248358 Opened 9 years ago Closed 8 years ago

Experimental TCP auto tuning makes firefox event pipe unusable

Categories

(Core :: Networking, defect)

defect
Not set
normal

Tracking

()

RESOLVED FIXED

People

(Reporter: dragana, Assigned: dragana)

References

Details

(Keywords: crash, hang, Whiteboard: [necko-active])

Crash Data

Attachments

(12 files)

We got a log from a user with a hang in PR_Poll during shutdown.

Looking at the log it does not look like bug 1244749 (hang in PR_Poll after laptop wakes up).

https://bugzilla.mozilla.org/show_bug.cgi?id=1158189#c53
Blocks: 1158189
Attached file Vista log.zip
so in this log:

nsHttpConnectionMgr::Shutdown is called. It posts an event on socketThread

2016-02-13 15:55:11.536000 UTC - 0[1c11140]: nsHttpConnectionMgr::Shutdown
2016-02-13 15:55:11.536000 UTC - 0[1c11140]: STS dispatch [8a89a00]
2016-02-13 15:55:11.536000 UTC - 0[1c11140]: Acquiring lock on thread 1c11140
2016-02-13 15:55:11.536000 UTC - 0[1c11140]: Acquired lock on thread 1c11140
2016-02-13 15:55:11.536000 UTC - 0[1c11140]: Released lock on thread 1c11140
2016-02-13 15:55:11.536000 UTC - 0[1c11140]: Acquiring lock on thread 1c11140
2016-02-13 15:55:11.536000 UTC - 0[1c11140]: Acquired lock on thread 1c11140
2016-02-13 15:55:11.536000 UTC - 0[1c11140]: Released lock on thread 1c11140

....

The event is executed
2016-02-13 15:55:17.621000 UTC - 2608[1c117d0]: nsHttpConnectionMgr::OnMsgShutdown
this closes all connections and posts an event on the main thread

....

this is the event
2016-02-13 15:55:17.648000 UTC - 0[1c11140]: nsHttpConnectionMgr::OnMsgShutdownConfirm
this finishes nsHttpConnectionMgr shutdown
....


hostResolver is shutdown
2016-02-13 15:55:17.651000 UTC - 0[1c11140]: Shutting down host resolver.
...
2016-02-13 15:55:17.653000 UTC - 308[10940140]: DNS lookup thread - queue empty, thread finished.
2016-02-13 15:55:17.653000 UTC - 2780[11a902e0]: DNS lookup thread - queue empty, thread finished.
2016-02-13 15:55:17.653000 UTC - 3172[1c14da0]: DNS lookup thread - queue empty, thread finished.
2016-02-13 15:55:17.654000 UTC - 3804[13efa290]: DNS lookup thread - queue empty, thread finished.
2016-02-13 15:55:17.654000 UTC - 3820[102de530]: DNS lookup thread - queue empty, thread finished.

then comes socketThread shutdown

2016-02-13 15:55:17.655000 UTC - 0[1c11140]: nsSocketTransportService::Shutdown
2016-02-13 15:55:17.655000 UTC - 0[1c11140]: Acquiring lock on thread 1c11140
2016-02-13 15:55:17.655000 UTC - 0[1c11140]: Acquired lock on thread 1c11140
2016-02-13 15:55:17.655000 UTC - 0[1c11140]: Released lock on thread 1c11140
2016-02-13 15:55:17.655000 UTC - 0[1c11140]: Acquiring lock on thread 1c11140
2016-02-13 15:55:17.655000 UTC - 0[1c11140]: Acquired lock on thread 1c11140
2016-02-13 15:55:17.655000 UTC - 0[1c11140]: Released lock on thread 1c11140

The previous lines are a posting of an event on the socket thread that should wake up thread, but it does not.

then something else posts an event on socket thread
2016-02-13 15:55:22.700000 UTC - 2792[1c12a30]: Acquiring lock on thread 1c12a30
2016-02-13 15:55:22.700000 UTC - 2792[1c12a30]: Acquired lock on thread 1c12a30
2016-02-13 15:55:22.700000 UTC - 2792[1c12a30]: Released lock on thread 1c12a30

but socketThreat returns from poll 45s later.

2016-02-13 15:56:05.452000 UTC - 2608[1c117d0]:     ...returned after 47822 milliseconds
2016-02-13 15:56:05.452000 UTC - 2608[1c117d0]: STS dispatch [b8991a0]

the last call to PR_Poll was with 0 sockets attacked, only the ,ThreadEvent:
2016-02-13 15:55:17.633000 UTC - 2608[1c117d0]: STS poll iter [1]
2016-02-13 15:55:17.633000 UTC - 2608[1c117d0]:   calling PR_Poll [active=0 idle=0]
2016-02-13 15:55:17.633000 UTC - 2608[1c117d0]:     timeout = -1 milliseconds
We will need more logging here, and would be good to use Patrick's rework from bug 698882.
Yeah, all those gaps in time are... interesting.
Whiteboard: [necko-active]
I have used patch from bug 1244749(Daniel will not  mind :) ) there are some good checks for mThreadEvent and add some more logging.

https://treeherder.mozilla.org/#/jobs?repo=try&revision=4ac70d73a64c
from comments mentioned in comment #5 maybe some firefall is causing this, although there are 2 computers too much of a coincident. But we can use this chance to find a way to repair threadEvent in such cases.
May I ask you to download the zip file from:
http://archive.mozilla.org/pub/firefox/try-builds/dd.mozilla@gmail.com-4ac70d73a64c33b17683079d9be9375bf9c362b3/try-win32-debug/firefox-47.0a1.en-US.win32.zip

It contains a firefox version that you do not need to install, just run firefox.exe

This one is based on Nightly with some logging and some checks added.

If you could make a log following the instructions from:
https://developer.mozilla.org/en-US/docs/Mozilla/Debugging/HTTP_logging

just please use:
NSPR_LOG_MODULES=timestamp,nsHttp:5,nsSocketTransport:5,nsStreamPump:5,nsHostResolver:5,io:5


Can you explain what is happening: If you start firefox it actually does not want to load anything on Vista with "use system proxy" set? Or this behavior appears after some time?

Thanks a lot!

I moved you issue to this bug, it will contain only your issue. It is easier to track.
Flags: needinfo?(thane)
I'll try this build right now.

On the Vista computer, with "use system proxy" checked I can't go to any pages (and I get a crash on exit).  When I switch to "no proxy" I can get to pages, but it's slow.  I still get the error on exit.
Flags: needinfo?(thane)
This is the log from the Vista computer.  Nightly crashed almost immediately on load (the FF Windows never opened).
On this computer, Nightly loaded, but no homepage would load.  As soon as I clicked in the address bar, it went to "not responding" and I had to kill it.
On this computer, Nightly opened, and I was able to type news.google.ca into the address bar.  After five minutes, it had not loaded the page, so I clicked X.  The logging page then went to "not responding" and after a few seconds Windows closed it.
Can you run this build:
http://archive.mozilla.org/pub/firefox/try-builds/dd.mozilla@gmail.com-636b2ffab5a85cc63568299dce43d3e2f535e72b/try-win32-debug/firefox-47.0a1.en-US.win32.zip

I have added even more logging.

and also add nsThread:5 to NSPR_LOG_MODULES. So use:
NSPR_LOG_MODULES=timestamp,nsHttp:5,nsSocketTransport:5,nsStreamPump:5,nsHostResolver:5,io:5,nsThread:5

Thanks a lot!!!
Flags: needinfo?(thane)
On the Vista computer, same result as last time - crash before FF window opened.
Flags: needinfo?(thane)
On this computer FF opened, went to "not responding" as soon as I clicked in the address bar.
On this computer, FF opened, but wouldn't go to any pages (tried www.google.ca and news.google.ca).  After 60 seconds, I closed FF and it went to "no responding".
On both Win7 PCs, I'm running Vipre Business and Windows Firewall.  

On the Vista computer, no AV, Windows Firewall.

All have Flash, Reader, Silverlight and Java running latest versions.
Can you look at the about:crashes and post the link to the crashes on Vista?

Thanks!
Flags: needinfo?(thane)
These are strange behaviors and on 3 different computers. PR_Poll is called 3 time, the first 2 times it returns and the last time it hangs in select.

Next things to try:
1) turn on logging for networkchange events.(this will be done together with other 3)
2) it is strange that it is happening on 3 computers, I will ask reporter to disconnect computers and turn of firewalls and try to see what is happening. 
3) set select timeout to max 25ms always, just to see what is happening.
4) when a event is dispatched on the socketThread setup a timer and when the timer returns and socketThread is still in poll (still have not went out of poll it was in when the event was dispatched) try to close ThreadEvent read socket to see if this is going to cause a return from select. (I will use Patrick's code, bug 698882, maybe it has already landed)

note: changing  "use system proxy" to no proxy fixes the behavior on win7 but not on Vista

Daniel, Patrick, any other suggestions?
Flags: needinfo?(mcmanus)
Flags: needinfo?(daniel)
Note that we don't poll the pollable event socket for EXCEPT...
(In reply to Honza Bambas (:mayhemer) from comment #20)
> Note that we don't poll the pollable event socket for EXCEPT...

That is true. This one is easy to try out here. I will make a patch.
Thanks!
I believe these are the crashes from the Vista machine (the build you sent me crashes so fast, I can't go to about:crashes), so I opened 44.02 and did it there.
https://crash-stats.mozilla.com/report/index/bp-c2bfd190-0655-4ea5-a3e1-ead2f2160217
https://crash-stats.mozilla.com/report/index/bp-6b1e5c50-169c-4bc1-9e26-702672160216

T
Flags: needinfo?(thane)
(In reply to Dragana Damjanovic [:dragana] from comment #21)
> (In reply to Honza Bambas (:mayhemer) from comment #20)
> > Note that we don't poll the pollable event socket for EXCEPT...
> 
> That is true. This one is easy to try out here. I will make a patch.
> Thanks!

Please first see https://bugzilla.mozilla.org/show_bug.cgi?id=698882#c36
I ran another test on the Windows 7 machine after removing Flash.  Still getting crashes and can't get to any pages.
I have another Vista computer with the same symptoms.  I tried to attach the log files, but they are too big.

You can download it from here: http://stuffbythane.com/fflogs/Vistax32-System2.log
I wonder if Thane should give this a try? Seems a minor config change to test. Document is quite old so maybe nothing other that a long shot.

On vista setting HKEY_LOCAL_MACHINE\SOFTWARE\Policies\Microsoft\Windows NT\Rpc\Server2003NegotiateDisable 1 and restart.

https://support.microsoft.com/en-us/kb/899148#mt3
On the Vista machine above, I disabled DHCP Client, DNS Client, and Windows Firewall services and tried the new build.  It still couldn't open pages, and crashed on exit.  
You can download the log file here: 
http://www.stuffbythane.com/fflogs/Vista-x32-Test2.log
(In reply to Jonathan Howard from comment #26)
> I wonder if Thane should give this a try? Seems a minor config change to
> test. Document is quite old so maybe nothing other that a long shot.
> 
> On vista setting HKEY_LOCAL_MACHINE\SOFTWARE\Policies\Microsoft\Windows
> NT\Rpc\Server2003NegotiateDisable 1 and restart.
> 
> https://support.microsoft.com/en-us/kb/899148#mt3

No change when I do this (the key HKEY_LOCAL_MACHINE\SOFTWARE\Policies\Microsoft\Windows NT\Rpc) doesn't exist on either Vista computer.

T
This is a build that have changes suggested in comment #20:

http://archive.mozilla.org/pub/firefox/try-builds/dd.mozilla@gmail.com-d037424a2e7c0a7886e669c6b1c37b511e96e4c4/try-win32-debug/firefox-47.0a1.en-US.win32.zip

Can you try it. Turn on the logging as well. 

Thank you very much for your help.
Flags: needinfo?(thane)
Vista x32 System - nightly.exe crashes before FF window opens.

FF 42.0 opens and will surf to news.google.ca but does so very slowly (page takes about six seconds to load).  Firefox.exe (42.0) process takes 42 seconds to close when I close FF.  No crash on exit.

I have removed Flash, Adobe Reader, Silverlight, ITunes, Quicktime, VLC Player from this computer to see if that would help.
Flags: needinfo?(thane)
On this Win7 x64 machine, Nightly loaded and I was able to open news.google.ca (took about 15 seconds to load the page).  I was unable to open any links from that page.

When I closed Nightly, it gave the error that it had crashed (Windows error, not FF crash report).
On this Vista x32 computer, no pages would load (news.google.ca spun for 5 minutes before I gave up).  When I shutdown Nightly, I got the Windows crash error.

T
I may have found the root problem.  All four computers (belonging to the same user) all had the same program run on them.
SG TCP Optimizer from http://www.speedguide.net/downloads.php

All had been set to 100+Mbps (not sure if the speed is relevant) in the optimizer and set to "optimal" settings.

I re-ran the program, and selected "Windows Default" and restarted.  So far, it looks like 44.0.2 is working normally on one Win7 computer and one Vista computer.  I will continue to test here. 

If this is a dead-end, let me know and I'll stop. :)

T
Both Vista machines appear to running fine with the TCP "tweaks" from SG TCP Optimizer removed.

Oddly enough these tweaks don't impact Chrome or IE since all the machines would work fine with those browsers.  

I will continue to run tests and report back, but I think we have this resolved.
It would be useful to check why SG TCP Optimizer breaks us, and only us.  Might bring some answers what we do differently from other browsers.
(In reply to Honza Bambas (:mayhemer) from comment #20)
> Note that we don't poll the pollable event socket for EXCEPT...

fwiw in general we've found that EXCEPT isn't maskable - you get it whether you ask for it or not. nonetheless, in 698882 I did add it to the set
Flags: needinfo?(mcmanus)
(In reply to Honza Bambas (:mayhemer) from comment #35)
> It would be useful to check why SG TCP Optimizer breaks us, and only us. 
> Might bring some answers what we do differently from other browsers.

definitely.

Dragana, I do have a vista VM hanging around somewhere (though SP might be an issue?) if we need it..
the only think that i know till now, it is slowing down our threadevent socket pair a lot. Think chrome does not use that.
I run a computer repair shop, so I get a lot of computers in each week.  I install FF on all of them, so I'd be happy to do some more testing (including trying the SG TCP Optimizer tweaks) if that helps.
Thanks Thane, I will try to reproduce this locally, it is easier to debug. Probably I will need your help setting SG TCP Ot.

Patrick, I thought at first to try my very old laptop running win XP (I do not know why I still have it around) but newer wireshark is not running on winXP so maybe I will need your VM.
Crash Signature: [@ shutdownhang | WaitForSingleObjectEx | WaitForSingleObject | PR_WaitCondVar | nsThread::ProcessNextEvent | nsThread::Shutdown ]
Keywords: crash, hang
Can we keep this separately. In this bug we are going to investigate why is ff so slow with SG TCP Optimizer.

The title does not fit, but at the beginning we did not know what it is.

The crash with this signature are happening after socketThread is shutdown.
Summary: Hang in Pr_Poll during shutdown → SG TCP Optimizer make Firefox very slow
(In reply to Dragana Damjanovic [:dragana] from comment #40)
> Thanks Thane, I will try to reproduce this locally, it is easier to debug.
> Probably I will need your help setting SG TCP Ot.
> 
> Patrick, I thought at first to try my very old laptop running win XP (I do
> not know why I still have it around) but newer wireshark is not running on
> winXP so maybe I will need your VM.

does this involve xp (confirmed anyhow) or just vista? They are totally different tcp stacks.. vista is a lot more like win7.

Anyhow, happy to help. I've got both a xp sp3 and a vista VM and because the VMs are hosted on linux I can just capture any traffic to/from the guest easily without worrying about running tools on the guest.
maybe related to bug 986545 ?
The SG TCP settings only seem to affect Vista and 7 (Vista is more seriously impacted).  XP and 10 don't seem to be affected.
(In reply to Patrick McManus [:mcmanus] from comment #43)
> maybe related to bug 986545 ?

It look like that might be the case. I can't reproduce this problem every time with 'autotuningl=experimental', but it certainly makes Firefox MUCH slower and I get a shutdown hang every once in a while (maybe once in 4 or 5 attempts) when I exit with this option set.
Flags: needinfo?(daniel)
(In reply to Daniel Stenberg [:bagder] from comment #45)

> I get a shutdown hang every once in a while

I should add that I ran my tests on a win7 in a VM.
This is the same issue as bug 986545.

The Issue:

Experimental TCP auto tuning on Windows sets a windows scaling option on each syn packet and set it to 14.
On our socket pair that we use for event signalling we do not change receiver windows and it is by default 8192. So it gets scaled down by 14 bits to 0. So window size that the pipe sockets are sending are 0. And the event pipe does not send any data - it sends data but they are interpreted as zero window probes...

If we increase receiver window size to something larger then 1<<14 everything works fine. 

So in bug 698882 we should increase rwin. On the current version we could fix it as well from nsSocketTransportService.
Summary: SG TCP Optimizer make Firefox very slow → Experimental TCP auto tuning makes firefox event pipe unusable
> 
> So in bug 698882 we should increase rwin. On the current version we could
> fix it as well from nsSocketTransportService.


I need to correct myself: We cannot fix this from STS because rwin needs to be set before connect is called and STS only gets sockets when they are already connected.

So we need to fix this in bug 698882.
Depends on: 698882
it looks like bug 698882 will close this - but I would still like to backport a "fix" that just makes mThreadEvent null (and uses the 25ms timer). Is there a registry setting for "experimental" we can read to use in this case?
(In reply to Patrick McManus [:mcmanus] from comment #50)
> it looks like bug 698882 will close this - but I would still like to
> backport a "fix" that just makes mThreadEvent null (and uses the 25ms
> timer). Is there a registry setting for "experimental" we can read to use in
> this case?

Sorry for a delay, I am just now working on a fix here.
I only found that is possible to get that info with command netsh 
netsh int tcp show global
https://treeherder.mozilla.org/#/jobs?repo=try&revision=fdad27859bbc

try is really slow today... I still need to test the build.
Comment on attachment 8732310 [details] [diff] [review]
bug_1248358_v1.patch

I have tested it with win7.
Attachment #8732310 - Flags: review?(mcmanus)
Comment on attachment 8732310 [details] [diff] [review]
bug_1248358_v1.patch

Review of attachment 8732310 [details] [diff] [review]:
-----------------------------------------------------------------

wow. that's ugly.

gotta think about it.
(In reply to Patrick McManus [:mcmanus] from comment #55)
> Comment on attachment 8732310 [details] [diff] [review]
> bug_1248358_v1.patch
> 
> Review of attachment 8732310 [details] [diff] [review]:
> -----------------------------------------------------------------
> 
> wow. that's ugly.
> 
> gotta think about it.

I know. It is a too big change to uplift it, but could not find different way to do it :)
Comment on attachment 8732310 [details] [diff] [review]
bug_1248358_v1.patch

Review of attachment 8732310 [details] [diff] [review]:
-----------------------------------------------------------------

698882 should also resolve this problem on nightly and as this isn't suitable to uplift, let's just close this when 698882 lands without another patch. I'm mostly concerned about adding the popen for 99% of users that don't have this configuration - not worth it.
Attachment #8732310 - Flags: review?(mcmanus) → review-
Bug 698882 has landed, so closing this bug.
Status: ASSIGNED → RESOLVED
Closed: 8 years ago
Resolution: --- → FIXED
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: