Closed Bug 404059 Opened 15 years ago Closed 15 years ago
connection to t-mobile hotspot startup page takes very long
14.99 KB, text/plain
127.69 KB, image/png
5.33 KB, text/html
43.66 KB, application/octet-stream
11.75 KB, application/x-gzip
7.80 KB, text/plain
This is the output in german t-mobile hotspot/starbucks with the settings suggested by kai in Comment #29
8.37 KB, text/plain
1.22 KB, patch
|Details | Diff | Splinter Review|
User-Agent: Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.9b2pre) Gecko/2007111519 Minefield/3.0b2pre Build Identifier: Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.9b2pre) Gecko/2007111519 Minefield/3.0b2pre The connection to the t-mobile hotspot (one you find at starbucks) takes very long with the latest firefox. The load time with the 2.0 version is like expected - quick, immediate. On my T61 the loading of the webpage is in a range of 2 minutes. Reproducible: Always Steps to Reproduce: 1. go to the nearest t-mobile hotspot (in my case starbucks) 2. open your browser latest Mozilla Firefox 3.0b2pre and try to open any webpage 3. wait 4. open your 2.0x version of the browser and try to connect as well Actual Results: The browser will try to redirect you to the starbucks site. On my machine (T61) it takes about 1 minute to bring up the upper part of the page with the 'connect to internet' options. The frame below this part takes another 1-2 minutes to load and stays black for that amount of time. Then I can use the page fine, login ad browse the internet without any problems and with the expected speed. One interesting part is that while loading and when I see the black screen i tried to use the 'connect to internet' link on top of the page. The debug output is: WARNING: Attempt to close SSL socket from a thread that is not the main thread. Can not cancel pending HTTP reques\ t from NSS: file /home/erikos/mozilla_debug/security/manager/ssl/src/nsSSLThread.cpp, line 415 WARNING: Never finished decoding the JPEG.: file /home/erikos/mozilla_debug/modules/libpr0n/decoders/jpeg/nsJPEGDe\ coder.cpp, line 200 WARNING: empty langgroup: file /home/erikos/mozilla_debug/gfx/thebes/src/gfxFont.cpp, line 874 ###!!! ASSERTION: imgContainer::ReloadImages(): mRestoreDataDone shoudl be true!: 'mRestoreDataDone', file /home/e\ rikos/mozilla_debug/modules/libpr0n/src/imgContainer.cpp, line 1486 WARNING: Never finished decoding the JPEG.: file /home/erikos/mozilla_debug/modules/libpr0n/decoders/jpeg/nsJPEGDe\ coder.cpp, line 200 ###!!! ASSERTION: imgContainer::ReloadImages(): mRestoreDataDone shoudl be true!: 'mRestoreDataDone', file /home/e\ rikos/mozilla_debug/modules/libpr0n/src/imgContainer.cpp, line 1486 JPEG decoding error: Invalid JPEG file structure: two SOI markers Expected Results: Similar results with 3.0 than with 2.0 I have done my tests with a version with debug=enabled. Full log output can be found at: http://dev.laptop.org/~erikos/verbose_starbucks.log I will attach the debug logs to this bug and the source of the page.
Interesting! I've seen this bug as well in the SJC airport. I never bothered to try and older Firefox. I just figured that it was a problem with the t-mobile stuff.
Status: UNCONFIRMED → NEW
Ever confirmed: true
Is this a linux specific bug?
Unfortunately, we don't know; no one here runs development builds on Windows or OSX. It would not surprise me either way. It is a clear regression from FF 2.x on Linux.
I'm pretty sure that I saw this bug while sitting in a (terrible!) internet cafe in SJC. I was using OSX at the time.
I went down to the local starbucks and verified that: 1. The problem does not happen with FF2 on OSX. 2. The problem does show up on a trunk build (Gecko/2007111904) in OSX 3. The problem does show up on a trunk debug build (on or about Nov 16) in Linux I took some HTTP logs that might help in deciphering the issue.
This is a log of the http transactions on the wire under linux. The long delay takes place during this bit of the log: troying nsHttpChannel @9b8d2e8 -1209029952: nsHttpResponseHead::Reset -1209029952: nsHttpHandler::Observe [topic="timer-callback"] -1211130992[92f77a0]: nsHttpConnectionMgr::OnMsgPruneDeadConnections -1209029952: nsHttpHandler::Observe [topic="timer-callback"] -1211130992[92f77a0]: nsHttpConnectionMgr::OnMsgPruneDeadConnections -1209029952: nsHttpHandler::Observe [topic="timer-callback"] -1211130992[92f77a0]: nsHttpConnectionMgr::OnMsgPruneDeadConnections Everything before that and after that is relatively fast. Not sure if this is related to the problem but someone who knows the HTTP state machine might want to look at this log to see if there's anything fishy going on.
Marking as a regression.
I wonder if the dead connections might be related to this error, which appears repeatedly in Simon log. -1210811504[a54acd0]: [ae70678] Lower layer connect error: -5934 It's coming from nsNSSIOLayer.cpp, ccing people which worked on that code recently.
Thanks Chris for testing with OSX, so we know this in not Linux only - I only did test under Linux so far. Marco is correct the ''Lower layer connect error: -5934'' is suspicious.
(In reply to comment #11) > -1210811504[a54acd0]: [ae70678] Lower layer connect error: -5934 I don't know yet what's going on, but -5934 means: PR_IN_PROGRESS_ERROR See http://developer.mozilla.org/en/docs/PR_GetConnectStatus "If PR_GetError returns PR_IN_PROGRESS_ERROR, the nonblocking connection is still in progress and has not completed yet."
From my limited understanding of that code it seem like the error might be "normal". It's a non-blocking connect and we are checking status just after the method call. http://lxr.mozilla.org/seamonkey/source/security/manager/ssl/src/nsNSSIOLayer.cpp#1183 Looking into Simon and Chris logs it seem like, for several of the https connection we make, we are blocking into nsSocketTransportService::Poll until the timeout registered by nsHttpHandler::StartPruneDeadConnectionsTimer (15 secs) expires and prune the connection.
If I remember properly the socket transport service is generic. So it's just processing data on the connection. The question is, what causes an http handler to finish normally and why aren't we hitting it in this case?
if you're saying this is reliably reproducable, and happens on t-mobile hotspots in germany, I'm willing to go to one
It's reproducable here. But I don't think I can get funding to fly to germany to test it there. :)
Component: General → Networking
Product: Firefox → Core
QA Contact: general → networking
Version: unspecified → Trunk
I tested on a t-mobile hotspot in Boston/Starbucks. Tomorrow I want to test on one in Germany/Berlin. I will let you know :)
(In reply to comment #17) > It's reproducable here. But I don't think I can get funding to fly to germany > to test it there. :) There is a burger venue 1.4 km away from me that is said to have a t-mobile hotspot, I should be able to travel to that place. I even got a login. I will bring firefox 2 and firefox 3 latest nightly builds and I will be able to test both linux and xp. (In reply to comment #19) > I tested on a t-mobile hotspot in Boston/Starbucks. Tomorrow I want to test on > one in Germany/Berlin. I will let you know :) Ok, please do, thanks!
Moving OS from Linux to All.
OS: Linux → All
ccing Dolske and DCamp - can you guys help out?
Just guessing but based on conversation I just had with dcamp, I suspect this might have to do with a laggy/blocked OCSP responder for the cert on the t-mobile hotspot's page. I'd let him mention this himself, but I think he's running sprints between the MV office and the nearest starbucks, to track down the problem. :)
A large drink of your choice to whomever finds the fix for this bug... Thank you for all your efforts!
It looks like this is related to ocsp verification - the main connection isn't resuming until that oscp request (which is being redirected by tmobile) is pruned. Sure enough, turning off ocsp verification works around it. I'll poke around that and see what I see.
Possible side effect from enabling OCSP by default. Commercial hotspots use a secure page for signing up, until you're registered they redirect all http and https traffic to their login server, including ocsp requests (which are http, too). I wonder what happens when the OCSP request gets redirected. I would expect the server immediately responds, be it an error message or a web page. But it should differ from what our OCSP code expects, and it should fail immediately. At that point, our default graceful failure should allow the https connection to proceed. I wonder why there is a delay. Are we retrying in a loop?
More detail for those following at home who might be confused: OCSP is Online Certificate Status Protocol - it's a way for browsers to verify in real time that a site's certificate is still valid (hasn't been revoked.) Whether a particular certificate authority chooses to provide an OCSP server or not is really up to them but we generally want them to, because it lets us be certain that a certificate is valid and because the old way of checking, through manually downloaded Certificate Revocation Lists (CRLs) was patent-encumbered until recently. A couple months ago we turned on the ability to use OCSP by default on trunk. Again, it's not a requirement that certs support them, but prior to flipping that switch, certs whose providers were trying to give us that information were having it ignored, because no one ever knew how to turn the checking on. Technically one could argue that if a cert provides an OCSP responder and you can't connect to it, you should treat the cert as invalid, but in practical terms that seemed like it would cause far too much headache for the payoff of preventing an imaginary attack, so the current default is "Check OCSP if it's provided, but if that check fails, assume it's a valid cert and move along." You can find these preferences in Options->Advanced->Encryption->Verification. As I said to Dave earlier when he asked, unchecking the top box will disable the checks, and as he mentions in comment 25, that seems to fix the problem. T-mobile would seem to be using a Verisign certificate with an OCSP responder specified, even though they then turn around and block access to the OCSP responder. That's going to be a pain for them over time because IIRC, IE7 turns OCSP checking on by default as well. They could fix this either by presenting a certificate without an OCSP responder, or by allowing the OCSP response through their redirect. As Kai mentions though, we should also see why we are taking so long here, because that seems to be our bug, not theirs.
Sounds like we have a problem to report to TMobile as well. We have their attention here at OLPC and will do so in the morning...
if you can run a debug version at the hotspot, you could do: export NSPR_LOG_FILE="log.txt" export NSPR_LOG_MODULES="pipnss:5" export NSS_TRACE_OCSP=1
Looks like we're checking OCSP on the redirected OCSP check, quickly hitting the active connection limit for the domain. I'll go get you that pipnss log now.
(In reply to comment #28) > Sounds like we have a problem to report to TMobile as well. We have their > attention here at OLPC and will do so in the morning... I'd prefer if they don't fix it until we understand why we delay so long. Right now we have a test case :-) and I'm sure other providers will eventually make the same mistake.
OK, while I'll get in touch with them tomorrow, I'll try to make sure your test case doesn't immediately vanish, particularly since it appears there is a simple work-around by setting a preference (which can make us sleep much better).
Oh, if the are blocking the address for the OCSP responder, won't you do a normal TCP connection timeout trying to access the address? That might cause the long delays....
OCSP is usually going to port 80, too. If tmobile redirects access to any webserver to their own login server, I'd expect they redirect the ocsp request to their own server too. But who knows what magic is listening on the http requests. Maybe they have a sniffer and only redirect it when it looks like a simple "HTTP GET" request? I realize that OCSP is using a "HTTP POST" request, so maybe they don't redirect that. If they simply block such request, that would indeed explain the timeout. FWIW, our code retries to do OCSP twice.
Ok, comment 0 says it takes about 2 minutes to load the page. I can see in mozilla/security/nss/lib/certhigh/ocsp.c in fetchOcspHttpClientV1 it requests a 60 seconds timeout for the http connection, and in nsNSSHttpRequestSession::trySendAndReceiveFcn a retry count of 2 is being used.
As an added data point, https://www.t-mobile.com/ also presents a verisign cert with an OCSP responder specified, and the responder is just plain old: http://ocsp.verisign.com So yes, port 80. Attempting to load that url directly does get me a result instantaneously (loading the t-mobile page was also normal speed), so I think we can rule out the possibility that verisign's OCSP responder is just laggy/unresponsive.
(In reply to comment #34) > But who knows what magic is listening on the http requests. > Maybe they have a sniffer and only redirect it when it looks like a simple > "HTTP GET" request? > > I realize that OCSP is using a "HTTP POST" request, so maybe they don't > redirect that. > If they simply block such request, that would indeed explain the timeout. There's definitely a redirect happening there in the http log. It's a 302, so we redirect it as a GET. tmobile seems to be responding quickly to requests for that redirected URL if OCSP is turned off, but with OCSP turned on we aren't getting a response to that GET. I suspect it's because we're holding that OCSP request for a new OCSP request (because it's asking for another https site) etc. Between the original request and the OCSP request, we're hitting the connection managers active request limit for the tmobile site.
(fwiw the extra column in front of the nspr output is timestamps)
I'm not sure how to read the timestamps in the log. However: Our OCSP code always tries to verify the cert for "now". It attempts to verify cert1 for timestamp 01:02:03 One minute later it probably has timed out. Some component tries again, this time for timestamp 01:03:03 We find a cache entry (which says: we recently tried, don't try again right now). Two seconds later the next OCSP request is started. This time for cert2, for timestamp 01:03:05 One minute later it times out, and a retry is made for timestamp 01:04:05 Again, because of the cache entry, no further attempt is made. Contrary to my comment 40, I believe, no OCSP requests are attempted for cert3 and cert 4. While they show up in the log, no http requested is started, probably cert3 and cert4 don't specify an OCSP responder. I think that explains the 2 minutes timeout, 2 different certs that specify an OCSP responder are involved in loading the start page, and we run into a timeout of 1 minute for each.
The log also tells us that we try to connect to http://ocsp.verisign.com:80/ for both verification attempts. Either that address is blocked/not redirected (unlikely) or the redirection target is not giving back any response to our POST request (seems more likely). An NSPR log that includes details from the http layer might answer this question, but I'm not sure if the answer will help us. Is there anything we can do to detect that we are running in a "network sandbox" like it happens in a "hotspot not yet paid for" scenario? I don't think it's reasonable to use a timeout as small 1-2 seconds for OCSP, IMHO that would really make it useless. But any bigger timeout will make us look broken. Should we pop up some UI that gives feedback on the "waiting for OCSP" state, with the option to temporarily turn off OCSP? I wonder what happens with IE7
I just went to Starbucks with an OLPC laptop and verified that setting security.OCSP.enabled=0 makes the problem go away, which means this bug is no longer a blocker at OLPC's end. Thanks very much for everyone's help!
German starbucks seems to be different from the American one. You get a real cup when you drink your tee in house and the milk has 3.8% of fat. Besides that the t-mobile access point interface is different and therefore the error seen in the starbucks in Boston does not occur. I have attached the logs. Great work by everyone, thanks!
Unless someone violently complains, I'm going to tell TMobile to get their hotspots fixed. It isn't fair to have them fail indefinitely while a permanent fix gets into Firefox...
Our theory is that IE7 might have similar problems as well since they turned on the same functionality by default. (They might not have the same timeout problems that we do, but they probably aren't checking the cert as they should based on the configuration.)
(In reply to comment #48) > Our theory is that IE7 might have similar problems as well since they turned on > the same functionality by default. (They might not have the same timeout > problems that we do, but they probably aren't checking the cert as they should > based on the configuration.) > Can we get a workaround in FF3 since I doubt TMobile will fix this before we ship and I'd guess others have this problem as well...
(In reply to comment #42) > I don't think it's reasonable to use a timeout as small 1-2 seconds for OCSP, > IMHO that would really make it useless. But any bigger timeout will make us > look broken. Kai - it feels like there's a middle ground here. The CAs in the CABForum have accepted language that talks about guaranteeing revocation responses within 3 seconds. I know that EV is a fraction of overall SSL, even overall SSL with OCSP responders specified, but nevertheless it makes it seem to me that, for instance, a 10s timeout might be reasonable? I don't know if our current 60s timeout was just a nice round number, or if it's based on guidelines for OCSP, but it makes the difference between looking like hotel wireless is slow, and looking like there is no internet access. (In reply to comment #49) > Can we get a workaround in FF3 since I doubt TMobile will fix this before we > ship and I'd guess others have this problem as well... Yeah, I am seeing similar behaviour with my hotel internet here from iBAHN, which specifies an OCSP responder at http://ocsp.digicert.com/. I really suspect that these sites are just not realizing what their certs are trying to do, given that FF2, IE6, and Safari don't check by default. So yeah, they will probably eventually fix it themselves, doubly so if we get tech-evang on them, but mitigating the timeout still feels like a good idea to me. The other option, turning OCSP back off by default, feels like a step backwards in terms of keeping users safe, especially given that we don't auto-fetch CRLs right now, so I'd rather not go down that route if we have other choices.
(In reply to comment #50) > I know that EV is a fraction of overall SSL, even overall SSL with > OCSP responders specified, but nevertheless it makes it seem to me that, for > instance, a 10s timeout might be reasonable? Yes. Let's do that.
Right now the timeout value is defined within NSS, and can not be controlled by the application. I filed bug 406120 to get a clean solution, a new API function to set the value. Luckily NSS is using "http delegation", it calls a callback into PSM for executing the http request, and that callback is used for OCSP requests, only. So we have a change to look at the requested timeout, and forcefully lower it. It's a little hack, but probably ok until we get bug 406120 fixed.
The "http delegation" callback will someday (soon ?) also be used to fetch CRLs. A 10 second timeout might be too little for fetching a large CRL (some are megabytes in size).
This patch could be used as a short term fix. But this workaround patch changes the default for any requests, be it OCSP, CRL or anything else. I agree with Nelson's comment, and we should use a better fix once bug 406120 is ready (use the new API function and set a separate OCSP timeout value).
Attachment #290835 - Flags: review?(rrelyea)
(In reply to comment #53) > The "http delegation" callback will someday (soon ?) also be used to fetch > CRLs. A 10 second timeout might be too little for fetching a large CRL > (some are megabytes in size). To be clear, my expectation is that this is the timeout on connection building - I wouldn't expect us to, for instance, kill connections already halfway through a CRL transfer?
(In reply to comment #55) > > To be clear, my expectation is that this is the timeout on connection building > - I wouldn't expect us to, for instance, kill connections already halfway > through a CRL transfer? The action is happening asynchronously on separate threads, it's difficult for the waiting thread to learn that the other thread has got a connection but is not yet done with transferring. In other words, right now we only have a single timeout: The maximum amount of time the consumer wants to wait for the data to come in.
(In reply to comment #56) > (In reply to comment #55) > > > > To be clear, my expectation is that this is the timeout on connection building > > - I wouldn't expect us to, for instance, kill connections already halfway > > through a CRL transfer? > > The action is happening asynchronously on separate threads, it's difficult for > the waiting thread to learn that the other thread has got a connection but is > not yet done with transferring. > > In other words, right now we only have a single timeout: The maximum amount of > time the consumer wants to wait for the data to come in. Ah, well then I even more forcefully agree with you and Nelson in comment 54 that we should make sure to separate the cases more cleanly once bug 406120 is fixed. :)
Once bug 406120 gets fixed, we could: - revert the workaround patch attached to this bug - add a call to the new API to lower the timeout for OCSP (independently of timeouts NSS will be using for other downloads)
Any chance we can get either the workaround or the real solution by the b2 code freeze tomorrow? Would be great to get this into FF2 b2..
Comment on attachment 290835 [details] [diff] [review] Workaround Patch v1 [checked in] r+ rrelyea, The concensus says that this is an ok work around until bug 406120 is fixed.
Attachment #290835 - Flags: review?(rrelyea) → review+
Comment on attachment 290835 [details] [diff] [review] Workaround Patch v1 [checked in] Workaround patch checked in. We should leave this open for the better fix once we have the NSS update.
Attachment #290835 - Attachment description: Workaround Patch v1 → Workaround Patch v1 [checked in]
Can someone confirm that this does indeed reduce the lag experienced?
any update from people who were able to reproduce this bug? Did the patch help?
I tried to get a test account through tmobile but they don't give test accounts. So unless someone who has an account can verify this, we may need to get an account.
I'll see what I can do; by chance we have a call with TMobile tomorrow on other issues, and this topic is near the top of my list to discuss with them.... So I will ask...
I signed up for a Day Pass and I am testing this at a tmobile hotspot (Borders location in Sunnyvale) running Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10.5; en-US; rv:1.9b3pre) Gecko/2008010804 Minefield/3.0b3pre. I did not experience any of the issues reported in Comment 0 and all the various web pages I tried to visit loaded fine.
(In reply to comment #66) > I signed up for a Day Pass and I am testing this at a tmobile hotspot (Borders > location in Sunnyvale) running Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10.5; > en-US; rv:1.9b3pre) Gecko/2008010804 Minefield/3.0b3pre. I did not experience > any of the issues reported in Comment 0 and all the various web pages I tried > to visit loaded fine. > Should we close this?
(In reply to comment #67) > (In reply to comment #66) > > I signed up for a Day Pass and I am testing this at a tmobile hotspot (Borders > > location in Sunnyvale) running Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10.5; > > en-US; rv:1.9b3pre) Gecko/2008010804 Minefield/3.0b3pre. I did not experience > > any of the issues reported in Comment 0 and all the various web pages I tried > > to visit loaded fine. > > > > Should we close this? I think we have identified and mitigated the problem, verified the solution, and there's certainly no new, active work happening here. Marking this FIXED. If anyone disagrees, holler.
Status: NEW → RESOLVED
Closed: 15 years ago
Resolution: --- → FIXED
Issue is Resolved - removing QA-Wanted Keywords - QA-Wanted query clean-up task
You need to log in before you can comment on or make changes to this bug.