Closed Bug 646417 Opened 13 years ago Closed 5 years ago

On MeeGo - Channel which lose connection can not reestablish it but shows error page

Categories

(Firefox for Android Graveyard :: General, defect)

x86
MeeGo
defect
Not set
normal

Tracking

(Not tracked)

RESOLVED WONTFIX

People

(Reporter: jbos, Unassigned)

References

Details

Attachments

(1 file)

1. Establish a Channel i.e. a huge Download.
2. Cut Connection from external (ie. switch of accesspoint)
3. See Connection Dialog
4. Select another Connection (i.e. another accesspoint)
5. See Connection Established

Actual Outcome:
browser showes error page "host not found"

Expected Outcome:
browser reestablish and revoke channel and keep downloading.
I found that we return with "OnLookupCompleted" when the lookup is NOT completed, 

The state of the hostresolver is *still* resolving.

I added a workaround to check in OnLookupCompleted if the host is still resolving. If so -> we do not tell the channel "onLookupCompleted" and wait for the time when its really completed.

With this fix the expected outcome is actual outcome.
Attachment #522971 - Flags: review?(romaxa)
Depends on: 646396, 646408
Blocks: 609844
Comment on attachment 522971 [details] [diff] [review]
Avoid wrong interpretation of onLookupCompleted

Assigning to Patrick since he knows DNS better than I.

Assuming checking nsHostResolver::resolving is ok like this, presumably we can make it public without the #ifdef?
Attachment #522971 - Flags: review?(romaxa) → review?(mcmanus)
I wasn't sure / able to test for side effects of this change in other systems, thats why i restricted it. 

One thing i expect. You can simply see / test it on N900 Fennec (GTK, Maemo). This change should actually fix a similar bug (seeing error page after connecting).


( ofcourse you need to disable the ifdef than)
Comment on attachment 522971 [details] [diff] [review]
Avoid wrong interpretation of onLookupCompleted

First, we cannot have a member variable (resolving) be public vs private depending on compile flags. That's just too fragile. If you really need that, just add a public read-only accessor available to all platforms.

Second, I can't figure out what you're trying to accomplish here. The DNS lookup finished resolving before going into the callback you're modifying - and the resolving attribute was updated - see nsHostResolver::OnLookupComplete() .. am I missing something there?

But you're not checking "resolving" on your DNS lookup, you're checking it on the underlying DNS record. In the case that we failed to get a lookup it is very likely that a subsequent use has caused us to retry nearly immediately (we'll even retry in the background while giving out cached negative results)..

so it looks like you're trying to hook into some kind of race condition..

or do you believe that you're getting to the code you modify through some path other than nsHostResolver::OnLookupComplete() or at least through some path in there that doesn't reset resolving?

Third, if your code removes mlistener->onlookupcomplete() here, when does it get called?

I think we need to get at the underlying issue here.
Attachment #522971 - Flags: review?(mcmanus) → review-
(In reply to comment #4)

> 
> or do you believe that you're getting to the code you modify through some path
> other than nsHostResolver::OnLookupComplete() or at least through some path in
> there that doesn't reset resolving?
> 

I phrased this badly - I made it sound like that's the only way the callback can happen - and of course that isn't true. There are certainly cases (cache hits and detachcallback() among them) where the callback can happen and resolving could be true.

But in all those cases we want to complete the onlookupcomplete() chain.. so my comment about the underlying issue remains true - in what circumstance are you getting the onlookupcomplete that you don't want it - and how does ifdef'ing it out make it complete later? (I wouldn't think it would still be on one of the callback linked lists at that point..)

I don't yet see any reason there should be platform specific ifdef code in this code path, so the root issue is key.
Thanks for Review!

Mhm I think what you say is of course correct and I support your opinion. I never said I know what I do in this area. The bug is not only hunting MeeGo but also (exactly the same bug) is hunting Maemo for over 2 Years now!

Its top bug for Maemo Bugs.


(In reply to comment #4)
> Comment on attachment 522971 [details] [diff] [review]
> Avoid wrong interpretation of onLookupCompleted
> 
> First, we cannot have a member variable (resolving) be public vs private
> depending on compile flags. That's just too fragile. If you really need that,
> just add a public read-only accessor available to all platforms.
>

Yes. 
 
> Second, I can't figure out what you're trying to accomplish here. The DNS
> lookup finished resolving before going into the callback you're modifying - and
> the resolving attribute was updated - see nsHostResolver::OnLookupComplete() ..
> am I missing something there?


The stuff is really complicated and I just describe what I learned from debug output.

1. Connection Thread Failed through loose of network 
2. Connection Thread hit "Recover from Error" http://mxr.mozilla.org/mozilla-central/source/netwerk/base/src/nsSocketTransport2.cpp#1250  

And returns with "TryAgain=true"
3. We trigger a setup of the socket and go into an async hostresolving.

----

4. We do our first return from the resolving with a onLookupCompleted. In our first return the Variable resolving is TRUE. 
5. We check if we can establish the Connection / have a valid host, since the Lookup is still resolving this failes and cause the error page to appear.


6. After some time, there comes a second call onLookupCompleted! This time resolving is false. 
7. The Networksthread / Socket get already killed and Error Page is already triggered. But this time Step 5 would not fail.


 

> 
> But you're not checking "resolving" on your DNS lookup, you're checking it on
> the underlying DNS record. In the case that we failed to get a lookup it is
> very likely that a subsequent use has caused us to retry nearly immediately
> (we'll even retry in the background while giving out cached negative results)..
> 
> so it looks like you're trying to hook into some kind of race condition..
> 
> or do you believe that you're getting to the code you modify through some path
> other than nsHostResolver::OnLookupComplete() or at least through some path in
> there that doesn't reset resolving?
> 
> Third, if your code removes mlistener->onlookupcomplete() here, when does it
> get called?
> 
> I think we need to get at the underlying issue here.


Yes, I think there is something hidden in the shadow, It might fit together that on Maemo / MeeGo the Network is not online when the interface is up. There needs to get some internal DNS Masks setuped (on the device runs an dns server). This is NOT signaled and take up to 10 seconds to accomplish.


(In reply to comment #5)
> (In reply to comment #4)
> 
> > 
> > or do you believe that you're getting to the code you modify through some path
> > other than nsHostResolver::OnLookupComplete() or at least through some path in
> > there that doesn't reset resolving?
> > 
> 
> I phrased this badly - I made it sound like that's the only way the callback
> can happen - and of course that isn't true. There are certainly cases (cache
> hits and detachcallback() among them) where the callback can happen and
> resolving could be true.
> 
> But in all those cases we want to complete the onlookupcomplete() chain.. so my
> comment about the underlying issue remains true - in what circumstance are you
> getting the onlookupcomplete that you don't want it - and how does ifdef'ing it
> out make it complete later? (I wouldn't think it would still be on one of the
> callback linked lists at that point..)
> 
> I don't yet see any reason there should be platform specific ifdef code in this
> code path, so the root issue is key.
> 
> The stuff is really complicated and I just describe what I learned from debug
> output.
> 
> 1. Connection Thread Failed through loose of network 
> 2. Connection Thread hit "Recover from Error"
> http://mxr.mozilla.org/mozilla-central/source/netwerk/base/src/nsSocketTransport2.cpp#1250 
> 
> And returns with "TryAgain=true"
> 3. We trigger a setup of the socket and go into an async hostresolving.
> 

that's super information.. I suspect the code is calling AsyncResolve(hostname, flags, listener, nsnull).. then the network error is happening.. and then it calls AsyncResolve(hostname, flags, listener, nsnull) again.. and then you get two callbacks - which is to be expected because you called asyncresolve() twice. The first one dates from when you had the network error, so it isn't surprising that you don't want it. But it should be canceled properly, not just filtered out. Filtering it has potential for memory leaks plus all kinds of false positive race conditions where some requests won't get any callbacks at all.

how do you know that recoverFromError returns with tryagain=true from the logs?

just a guess.. but it seems it could be fixed by having the nsSocketTransport2 code just cancel the first request before generating the second one. Could you try out something like this, and report back?

diff --git a/netwerk/base/src/nsSocketTransport2.cpp b/netwerk/base/src/nsSocketTransport2.cpp
--- a/netwerk/base/src/nsSocketTransport2.cpp
+++ b/netwerk/base/src/nsSocketTransport2.cpp
@@ -1300,16 +1300,23 @@ nsSocketTransport::RecoverFromError()
 
         if (mState == STATE_CONNECTING) {
             mState = STATE_RESOLVING;
             msg = MSG_DNS_LOOKUP_COMPLETE;
         }
         else {
             mState = STATE_CLOSED;
             msg = MSG_ENSURE_CONNECT;
+            
+            // make sure there isn't any pending DNS request before starting
+            // over.
+            if (mDNSRequest) {
+                mDNSRequest->Cancel(NS_ERROR_ABORT);
+                mDNSRequest = nsnull;
+            }
         }
 
         rv = PostEvent(msg, NS_OK);
         if (NS_FAILED(rv))
             tryAgain = PR_FALSE;
     }
 
     return tryAgain;
i tried this, but didn't help.

I also found that autodial is false (even when pref is true) and only changed when you change pref from true->false->true.

IOService should initialize the autodial at init.
(In reply to comment #8)
> i tried this, but didn't help.
> 

ok. that's too bad :(

But you get the idea, right? Let's confirm if ->AsyncResolve() is being called twice with the same callback object and without a cancel. If that is true, then we should expect two callbacks and the fix from there should be pretty straight forward. Perhaps the onlookupcomplete() has been scheduled, but not recevied, before my cancel in comment 7.. in that case the right thing to do is a little more complicated, but it still isn't in the dns code directly.
Closing all opened bug in a graveyard component
Status: NEW → RESOLVED
Closed: 5 years ago
Resolution: --- → WONTFIX
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: