Closed
Bug 1248358
Opened 9 years ago
Closed 9 years ago
Experimental TCP auto tuning makes firefox event pipe unusable
Categories
(Core :: Networking, defect)
Core
Networking
Tracking
()
RESOLVED
FIXED
People
(Reporter: dragana, Assigned: dragana)
References
Details
(Keywords: crash, hang, Whiteboard: [necko-active])
Crash Data
Attachments
(12 files)
2.64 MB,
application/zip
|
Details | |
9.22 KB,
text/x-log
|
Details | |
125.16 KB,
text/x-log
|
Details | |
387.49 KB,
text/x-log
|
Details | |
44.47 KB,
text/x-log
|
Details | |
1.49 MB,
text/x-log
|
Details | |
1.11 MB,
application/zip
|
Details | |
1.28 MB,
text/x-log
|
Details | |
72.77 KB,
text/x-log
|
Details | |
2.86 MB,
application/zip
|
Details | |
933.83 KB,
application/zip
|
Details | |
3.24 KB,
patch
|
mcmanus
:
review-
|
Details | Diff | Splinter Review |
We got a log from a user with a hang in PR_Poll during shutdown.
Looking at the log it does not look like bug 1244749 (hang in PR_Poll after laptop wakes up).
https://bugzilla.mozilla.org/show_bug.cgi?id=1158189#c53
Assignee | ||
Comment 1•9 years ago
|
||
Assignee | ||
Comment 2•9 years ago
|
||
so in this log:
nsHttpConnectionMgr::Shutdown is called. It posts an event on socketThread
2016-02-13 15:55:11.536000 UTC - 0[1c11140]: nsHttpConnectionMgr::Shutdown
2016-02-13 15:55:11.536000 UTC - 0[1c11140]: STS dispatch [8a89a00]
2016-02-13 15:55:11.536000 UTC - 0[1c11140]: Acquiring lock on thread 1c11140
2016-02-13 15:55:11.536000 UTC - 0[1c11140]: Acquired lock on thread 1c11140
2016-02-13 15:55:11.536000 UTC - 0[1c11140]: Released lock on thread 1c11140
2016-02-13 15:55:11.536000 UTC - 0[1c11140]: Acquiring lock on thread 1c11140
2016-02-13 15:55:11.536000 UTC - 0[1c11140]: Acquired lock on thread 1c11140
2016-02-13 15:55:11.536000 UTC - 0[1c11140]: Released lock on thread 1c11140
....
The event is executed
2016-02-13 15:55:17.621000 UTC - 2608[1c117d0]: nsHttpConnectionMgr::OnMsgShutdown
this closes all connections and posts an event on the main thread
....
this is the event
2016-02-13 15:55:17.648000 UTC - 0[1c11140]: nsHttpConnectionMgr::OnMsgShutdownConfirm
this finishes nsHttpConnectionMgr shutdown
....
hostResolver is shutdown
2016-02-13 15:55:17.651000 UTC - 0[1c11140]: Shutting down host resolver.
...
2016-02-13 15:55:17.653000 UTC - 308[10940140]: DNS lookup thread - queue empty, thread finished.
2016-02-13 15:55:17.653000 UTC - 2780[11a902e0]: DNS lookup thread - queue empty, thread finished.
2016-02-13 15:55:17.653000 UTC - 3172[1c14da0]: DNS lookup thread - queue empty, thread finished.
2016-02-13 15:55:17.654000 UTC - 3804[13efa290]: DNS lookup thread - queue empty, thread finished.
2016-02-13 15:55:17.654000 UTC - 3820[102de530]: DNS lookup thread - queue empty, thread finished.
then comes socketThread shutdown
2016-02-13 15:55:17.655000 UTC - 0[1c11140]: nsSocketTransportService::Shutdown
2016-02-13 15:55:17.655000 UTC - 0[1c11140]: Acquiring lock on thread 1c11140
2016-02-13 15:55:17.655000 UTC - 0[1c11140]: Acquired lock on thread 1c11140
2016-02-13 15:55:17.655000 UTC - 0[1c11140]: Released lock on thread 1c11140
2016-02-13 15:55:17.655000 UTC - 0[1c11140]: Acquiring lock on thread 1c11140
2016-02-13 15:55:17.655000 UTC - 0[1c11140]: Acquired lock on thread 1c11140
2016-02-13 15:55:17.655000 UTC - 0[1c11140]: Released lock on thread 1c11140
The previous lines are a posting of an event on the socket thread that should wake up thread, but it does not.
then something else posts an event on socket thread
2016-02-13 15:55:22.700000 UTC - 2792[1c12a30]: Acquiring lock on thread 1c12a30
2016-02-13 15:55:22.700000 UTC - 2792[1c12a30]: Acquired lock on thread 1c12a30
2016-02-13 15:55:22.700000 UTC - 2792[1c12a30]: Released lock on thread 1c12a30
but socketThreat returns from poll 45s later.
2016-02-13 15:56:05.452000 UTC - 2608[1c117d0]: ...returned after 47822 milliseconds
2016-02-13 15:56:05.452000 UTC - 2608[1c117d0]: STS dispatch [b8991a0]
the last call to PR_Poll was with 0 sockets attacked, only the ,ThreadEvent:
2016-02-13 15:55:17.633000 UTC - 2608[1c117d0]: STS poll iter [1]
2016-02-13 15:55:17.633000 UTC - 2608[1c117d0]: calling PR_Poll [active=0 idle=0]
2016-02-13 15:55:17.633000 UTC - 2608[1c117d0]: timeout = -1 milliseconds
Assignee | ||
Comment 3•9 years ago
|
||
We will need more logging here, and would be good to use Patrick's rework from bug 698882.
Comment 4•9 years ago
|
||
Yeah, all those gaps in time are... interesting.
Assignee | ||
Comment 5•9 years ago
|
||
Comments:
https://bugzilla.mozilla.org/show_bug.cgi?id=1158189#c62
https://bugzilla.mozilla.org/show_bug.cgi?id=1158189#c63
are interesting.
Assignee | ||
Updated•9 years ago
|
Whiteboard: [necko-active]
Assignee | ||
Comment 6•9 years ago
|
||
I have used patch from bug 1244749(Daniel will not mind :) ) there are some good checks for mThreadEvent and add some more logging.
https://treeherder.mozilla.org/#/jobs?repo=try&revision=4ac70d73a64c
Assignee | ||
Comment 7•9 years ago
|
||
from comments mentioned in comment #5 maybe some firefall is causing this, although there are 2 computers too much of a coincident. But we can use this chance to find a way to repair threadEvent in such cases.
Assignee | ||
Comment 8•9 years ago
|
||
May I ask you to download the zip file from:
http://archive.mozilla.org/pub/firefox/try-builds/dd.mozilla@gmail.com-4ac70d73a64c33b17683079d9be9375bf9c362b3/try-win32-debug/firefox-47.0a1.en-US.win32.zip
It contains a firefox version that you do not need to install, just run firefox.exe
This one is based on Nightly with some logging and some checks added.
If you could make a log following the instructions from:
https://developer.mozilla.org/en-US/docs/Mozilla/Debugging/HTTP_logging
just please use:
NSPR_LOG_MODULES=timestamp,nsHttp:5,nsSocketTransport:5,nsStreamPump:5,nsHostResolver:5,io:5
Can you explain what is happening: If you start firefox it actually does not want to load anything on Vista with "use system proxy" set? Or this behavior appears after some time?
Thanks a lot!
I moved you issue to this bug, it will contain only your issue. It is easier to track.
Flags: needinfo?(thane)
Comment 9•9 years ago
|
||
I'll try this build right now.
On the Vista computer, with "use system proxy" checked I can't go to any pages (and I get a crash on exit). When I switch to "no proxy" I can get to pages, but it's slow. I still get the error on exit.
Flags: needinfo?(thane)
Comment 10•9 years ago
|
||
This is the log from the Vista computer. Nightly crashed almost immediately on load (the FF Windows never opened).
Comment 11•9 years ago
|
||
On this computer, Nightly loaded, but no homepage would load. As soon as I clicked in the address bar, it went to "not responding" and I had to kill it.
Comment 12•9 years ago
|
||
On this computer, Nightly opened, and I was able to type news.google.ca into the address bar. After five minutes, it had not loaded the page, so I clicked X. The logging page then went to "not responding" and after a few seconds Windows closed it.
Assignee | ||
Comment 13•9 years ago
|
||
Can you run this build:
http://archive.mozilla.org/pub/firefox/try-builds/dd.mozilla@gmail.com-636b2ffab5a85cc63568299dce43d3e2f535e72b/try-win32-debug/firefox-47.0a1.en-US.win32.zip
I have added even more logging.
and also add nsThread:5 to NSPR_LOG_MODULES. So use:
NSPR_LOG_MODULES=timestamp,nsHttp:5,nsSocketTransport:5,nsStreamPump:5,nsHostResolver:5,io:5,nsThread:5
Thanks a lot!!!
Flags: needinfo?(thane)
Comment 14•9 years ago
|
||
On the Vista computer, same result as last time - crash before FF window opened.
Flags: needinfo?(thane)
Comment 15•9 years ago
|
||
On this computer FF opened, went to "not responding" as soon as I clicked in the address bar.
Comment 16•9 years ago
|
||
On this computer, FF opened, but wouldn't go to any pages (tried www.google.ca and news.google.ca). After 60 seconds, I closed FF and it went to "no responding".
Comment 17•9 years ago
|
||
On both Win7 PCs, I'm running Vipre Business and Windows Firewall.
On the Vista computer, no AV, Windows Firewall.
All have Flash, Reader, Silverlight and Java running latest versions.
Assignee | ||
Comment 18•9 years ago
|
||
Can you look at the about:crashes and post the link to the crashes on Vista?
Thanks!
Flags: needinfo?(thane)
Assignee | ||
Comment 19•9 years ago
|
||
These are strange behaviors and on 3 different computers. PR_Poll is called 3 time, the first 2 times it returns and the last time it hangs in select.
Next things to try:
1) turn on logging for networkchange events.(this will be done together with other 3)
2) it is strange that it is happening on 3 computers, I will ask reporter to disconnect computers and turn of firewalls and try to see what is happening.
3) set select timeout to max 25ms always, just to see what is happening.
4) when a event is dispatched on the socketThread setup a timer and when the timer returns and socketThread is still in poll (still have not went out of poll it was in when the event was dispatched) try to close ThreadEvent read socket to see if this is going to cause a return from select. (I will use Patrick's code, bug 698882, maybe it has already landed)
note: changing "use system proxy" to no proxy fixes the behavior on win7 but not on Vista
Daniel, Patrick, any other suggestions?
Flags: needinfo?(mcmanus)
Flags: needinfo?(daniel)
![]() |
||
Comment 20•9 years ago
|
||
Note that we don't poll the pollable event socket for EXCEPT...
Assignee | ||
Comment 21•9 years ago
|
||
(In reply to Honza Bambas (:mayhemer) from comment #20)
> Note that we don't poll the pollable event socket for EXCEPT...
That is true. This one is easy to try out here. I will make a patch.
Thanks!
Comment 22•9 years ago
|
||
I believe these are the crashes from the Vista machine (the build you sent me crashes so fast, I can't go to about:crashes), so I opened 44.02 and did it there.
https://crash-stats.mozilla.com/report/index/bp-c2bfd190-0655-4ea5-a3e1-ead2f2160217
https://crash-stats.mozilla.com/report/index/bp-6b1e5c50-169c-4bc1-9e26-702672160216
T
Flags: needinfo?(thane)
![]() |
||
Comment 23•9 years ago
|
||
(In reply to Dragana Damjanovic [:dragana] from comment #21)
> (In reply to Honza Bambas (:mayhemer) from comment #20)
> > Note that we don't poll the pollable event socket for EXCEPT...
>
> That is true. This one is easy to try out here. I will make a patch.
> Thanks!
Please first see https://bugzilla.mozilla.org/show_bug.cgi?id=698882#c36
Comment 24•9 years ago
|
||
I ran another test on the Windows 7 machine after removing Flash. Still getting crashes and can't get to any pages.
Comment 25•9 years ago
|
||
I have another Vista computer with the same symptoms. I tried to attach the log files, but they are too big.
You can download it from here: http://stuffbythane.com/fflogs/Vistax32-System2.log
Comment 26•9 years ago
|
||
I wonder if Thane should give this a try? Seems a minor config change to test. Document is quite old so maybe nothing other that a long shot.
On vista setting HKEY_LOCAL_MACHINE\SOFTWARE\Policies\Microsoft\Windows NT\Rpc\Server2003NegotiateDisable 1 and restart.
https://support.microsoft.com/en-us/kb/899148#mt3
Comment 27•9 years ago
|
||
On the Vista machine above, I disabled DHCP Client, DNS Client, and Windows Firewall services and tried the new build. It still couldn't open pages, and crashed on exit.
You can download the log file here:
http://www.stuffbythane.com/fflogs/Vista-x32-Test2.log
Comment 28•9 years ago
|
||
(In reply to Jonathan Howard from comment #26)
> I wonder if Thane should give this a try? Seems a minor config change to
> test. Document is quite old so maybe nothing other that a long shot.
>
> On vista setting HKEY_LOCAL_MACHINE\SOFTWARE\Policies\Microsoft\Windows
> NT\Rpc\Server2003NegotiateDisable 1 and restart.
>
> https://support.microsoft.com/en-us/kb/899148#mt3
No change when I do this (the key HKEY_LOCAL_MACHINE\SOFTWARE\Policies\Microsoft\Windows NT\Rpc) doesn't exist on either Vista computer.
T
Assignee | ||
Comment 29•9 years ago
|
||
This is a build that have changes suggested in comment #20:
http://archive.mozilla.org/pub/firefox/try-builds/dd.mozilla@gmail.com-d037424a2e7c0a7886e669c6b1c37b511e96e4c4/try-win32-debug/firefox-47.0a1.en-US.win32.zip
Can you try it. Turn on the logging as well.
Thank you very much for your help.
Flags: needinfo?(thane)
Comment 30•9 years ago
|
||
Vista x32 System - nightly.exe crashes before FF window opens.
FF 42.0 opens and will surf to news.google.ca but does so very slowly (page takes about six seconds to load). Firefox.exe (42.0) process takes 42 seconds to close when I close FF. No crash on exit.
I have removed Flash, Adobe Reader, Silverlight, ITunes, Quicktime, VLC Player from this computer to see if that would help.
Flags: needinfo?(thane)
Comment 31•9 years ago
|
||
On this Win7 x64 machine, Nightly loaded and I was able to open news.google.ca (took about 15 seconds to load the page). I was unable to open any links from that page.
When I closed Nightly, it gave the error that it had crashed (Windows error, not FF crash report).
Comment 32•9 years ago
|
||
On this Vista x32 computer, no pages would load (news.google.ca spun for 5 minutes before I gave up). When I shutdown Nightly, I got the Windows crash error.
T
Comment 33•9 years ago
|
||
I may have found the root problem. All four computers (belonging to the same user) all had the same program run on them.
SG TCP Optimizer from http://www.speedguide.net/downloads.php
All had been set to 100+Mbps (not sure if the speed is relevant) in the optimizer and set to "optimal" settings.
I re-ran the program, and selected "Windows Default" and restarted. So far, it looks like 44.0.2 is working normally on one Win7 computer and one Vista computer. I will continue to test here.
If this is a dead-end, let me know and I'll stop. :)
T
Comment 34•9 years ago
|
||
Both Vista machines appear to running fine with the TCP "tweaks" from SG TCP Optimizer removed.
Oddly enough these tweaks don't impact Chrome or IE since all the machines would work fine with those browsers.
I will continue to run tests and report back, but I think we have this resolved.
![]() |
||
Comment 35•9 years ago
|
||
It would be useful to check why SG TCP Optimizer breaks us, and only us. Might bring some answers what we do differently from other browsers.
Comment 36•9 years ago
|
||
(In reply to Honza Bambas (:mayhemer) from comment #20)
> Note that we don't poll the pollable event socket for EXCEPT...
fwiw in general we've found that EXCEPT isn't maskable - you get it whether you ask for it or not. nonetheless, in 698882 I did add it to the set
Flags: needinfo?(mcmanus)
Comment 37•9 years ago
|
||
(In reply to Honza Bambas (:mayhemer) from comment #35)
> It would be useful to check why SG TCP Optimizer breaks us, and only us.
> Might bring some answers what we do differently from other browsers.
definitely.
Dragana, I do have a vista VM hanging around somewhere (though SP might be an issue?) if we need it..
Assignee | ||
Comment 38•9 years ago
|
||
the only think that i know till now, it is slowing down our threadevent socket pair a lot. Think chrome does not use that.
Comment 39•9 years ago
|
||
I run a computer repair shop, so I get a lot of computers in each week. I install FF on all of them, so I'd be happy to do some more testing (including trying the SG TCP Optimizer tweaks) if that helps.
Assignee | ||
Comment 40•9 years ago
|
||
Thanks Thane, I will try to reproduce this locally, it is easier to debug. Probably I will need your help setting SG TCP Ot.
Patrick, I thought at first to try my very old laptop running win XP (I do not know why I still have it around) but newer wireshark is not running on winXP so maybe I will need your VM.
![]() |
||
Updated•9 years ago
|
Assignee | ||
Comment 41•9 years ago
|
||
Can we keep this separately. In this bug we are going to investigate why is ff so slow with SG TCP Optimizer.
The title does not fit, but at the beginning we did not know what it is.
The crash with this signature are happening after socketThread is shutdown.
Assignee | ||
Updated•9 years ago
|
Summary: Hang in Pr_Poll during shutdown → SG TCP Optimizer make Firefox very slow
Comment 42•9 years ago
|
||
(In reply to Dragana Damjanovic [:dragana] from comment #40)
> Thanks Thane, I will try to reproduce this locally, it is easier to debug.
> Probably I will need your help setting SG TCP Ot.
>
> Patrick, I thought at first to try my very old laptop running win XP (I do
> not know why I still have it around) but newer wireshark is not running on
> winXP so maybe I will need your VM.
does this involve xp (confirmed anyhow) or just vista? They are totally different tcp stacks.. vista is a lot more like win7.
Anyhow, happy to help. I've got both a xp sp3 and a vista VM and because the VMs are hosted on linux I can just capture any traffic to/from the guest easily without worrying about running tools on the guest.
Comment 43•9 years ago
|
||
maybe related to bug 986545 ?
Comment 44•9 years ago
|
||
The SG TCP settings only seem to affect Vista and 7 (Vista is more seriously impacted). XP and 10 don't seem to be affected.
Comment 45•9 years ago
|
||
(In reply to Patrick McManus [:mcmanus] from comment #43)
> maybe related to bug 986545 ?
It look like that might be the case. I can't reproduce this problem every time with 'autotuningl=experimental', but it certainly makes Firefox MUCH slower and I get a shutdown hang every once in a while (maybe once in 4 or 5 attempts) when I exit with this option set.
Flags: needinfo?(daniel)
Comment 46•9 years ago
|
||
(In reply to Daniel Stenberg [:bagder] from comment #45)
> I get a shutdown hang every once in a while
I should add that I ran my tests on a win7 in a VM.
Assignee | ||
Comment 47•9 years ago
|
||
This is the same issue as bug 986545.
The Issue:
Experimental TCP auto tuning on Windows sets a windows scaling option on each syn packet and set it to 14.
On our socket pair that we use for event signalling we do not change receiver windows and it is by default 8192. So it gets scaled down by 14 bits to 0. So window size that the pipe sockets are sending are 0. And the event pipe does not send any data - it sends data but they are interpreted as zero window probes...
If we increase receiver window size to something larger then 1<<14 everything works fine.
So in bug 698882 we should increase rwin. On the current version we could fix it as well from nsSocketTransportService.
Assignee | ||
Updated•9 years ago
|
Summary: SG TCP Optimizer make Firefox very slow → Experimental TCP auto tuning makes firefox event pipe unusable
Assignee | ||
Comment 49•9 years ago
|
||
>
> So in bug 698882 we should increase rwin. On the current version we could
> fix it as well from nsSocketTransportService.
I need to correct myself: We cannot fix this from STS because rwin needs to be set before connect is called and STS only gets sockets when they are already connected.
So we need to fix this in bug 698882.
Comment 50•9 years ago
|
||
it looks like bug 698882 will close this - but I would still like to backport a "fix" that just makes mThreadEvent null (and uses the 25ms timer). Is there a registry setting for "experimental" we can read to use in this case?
Assignee | ||
Comment 51•9 years ago
|
||
(In reply to Patrick McManus [:mcmanus] from comment #50)
> it looks like bug 698882 will close this - but I would still like to
> backport a "fix" that just makes mThreadEvent null (and uses the 25ms
> timer). Is there a registry setting for "experimental" we can read to use in
> this case?
Sorry for a delay, I am just now working on a fix here.
I only found that is possible to get that info with command netsh
netsh int tcp show global
Assignee | ||
Comment 52•9 years ago
|
||
https://treeherder.mozilla.org/#/jobs?repo=try&revision=fdad27859bbc
try is really slow today... I still need to test the build.
Assignee | ||
Comment 53•9 years ago
|
||
Assignee | ||
Comment 54•9 years ago
|
||
Comment on attachment 8732310 [details] [diff] [review]
bug_1248358_v1.patch
I have tested it with win7.
Attachment #8732310 -
Flags: review?(mcmanus)
Comment 55•9 years ago
|
||
Comment on attachment 8732310 [details] [diff] [review]
bug_1248358_v1.patch
Review of attachment 8732310 [details] [diff] [review]:
-----------------------------------------------------------------
wow. that's ugly.
gotta think about it.
Assignee | ||
Comment 56•9 years ago
|
||
(In reply to Patrick McManus [:mcmanus] from comment #55)
> Comment on attachment 8732310 [details] [diff] [review]
> bug_1248358_v1.patch
>
> Review of attachment 8732310 [details] [diff] [review]:
> -----------------------------------------------------------------
>
> wow. that's ugly.
>
> gotta think about it.
I know. It is a too big change to uplift it, but could not find different way to do it :)
Comment 57•9 years ago
|
||
Comment on attachment 8732310 [details] [diff] [review]
bug_1248358_v1.patch
Review of attachment 8732310 [details] [diff] [review]:
-----------------------------------------------------------------
698882 should also resolve this problem on nightly and as this isn't suitable to uplift, let's just close this when 698882 lands without another patch. I'm mostly concerned about adding the popen for 99% of users that don't have this configuration - not worth it.
Attachment #8732310 -
Flags: review?(mcmanus) → review-
Assignee | ||
Comment 58•9 years ago
|
||
Bug 698882 has landed, so closing this bug.
Status: ASSIGNED → RESOLVED
Closed: 9 years ago
Resolution: --- → FIXED
You need to log in
before you can comment on or make changes to this bug.
Description
•