Closed Bug 1191253 Opened 9 years ago Closed 9 years ago

"Link monitor" thread isn't started properly on Linux

Categories

(Core :: Networking, defect)

Unspecified
Linux
defect
Not set
normal

Tracking

()

RESOLVED FIXED
mozilla43
Tracking Status
firefox42 --- fixed
firefox43 --- fixed

People

(Reporter: bagder, Assigned: bagder)

References

Details

Attachments

(1 file, 2 obsolete files)

When landing that work, the function call was changed from NS_NewThread() to NS_NewNamedThread() late in the game but without using the latter function with the correct set of arguments and thus the "Link Monitor" thread is never actually running!

(Linux-specific code-path)
Status: NEW → ASSIGNED
The patch only adds an argument to make the initial event gets dispatched
Attachment #8643597 - Flags: review?(mcmanus)
Attachment #8643597 - Flags: review?(mcmanus) → review+
https://hg.mozilla.org/mozilla-central/rev/fea6620f36b8
Status: ASSIGNED → RESOLVED
Closed: 9 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla42
See Also: → 1193796
Backed out in https://hg.mozilla.org/integration/mozilla-inbound/rev/3c8f6736b07c since it caused Bug 1193541.

Backing this out, we're back to not running the thread so it will not properly detect network changes and online/offline situations.
Blocks: 1179568
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
Thanks to the guest account I got from :botond I could (finally) see the problem trigger in real-time and figure out what it was going on that triggered bug 1193541 and why we had to back this patch out.

In these networks we apparently rather frequently get a whole bunch of IPv6 routing table updates involving a link-local address as gateway. I'm not sure that the significance of these updates are, but they're not really the sort of updates we're looking for here anyway since we use the routing table updates as a signal that the "network" changed and these updates on the link-local level are clearly not changing the network in a notable way that would need us to trigger network-changed events.

I'll polish up my patch and take it through a try-run before I post it. Stay tuned.
so these are extraneous events that lead to too many network change events.. and that's going to clear and reset a bunch of things un-necessarily, so it makes sense to filter them out.

but why did it break connectivity to google? It doesn't seem like we should get into a fail state?
Flags: needinfo?(daniel)
These events keep coming on this network, so it caused network-changed events get sent in bursts every few seconds, which we should avoid. But then as to exactly what those network-changed events did that caused google and spotify-webplayer to fail, I'm not sure. I've not seen a very accurate description of what exactly happened.

It could indeed be valuable for me to make a local build that just sends network-change repeatedly at a fair interval to see what breaks and why since we I can't think of any reason why we shouldn't be able to should handle that pretty gracefully. I'll run some tests like this and see what I learn.

(the actual patch for this issue is still pending mostly since try has been closed this evening)
Flags: needinfo?(daniel)
Here's a version of this patch that starts the helper thread properly _and_ ignores all routing table updates using a link-local IPv6 address as gateway.

Try: https://treeherder.mozilla.org/#/jobs?repo=try&revision=f6c1945a20dd
Attachment #8643597 - Attachment is obsolete: true
Attachment #8653985 - Flags: review?(mcmanus)
:botond, do you think you can give this a quick try in your end to see if anything weird pops up? I ran with this using my guest account in your machine but it isn't really possible to run the browser properly and for an extended period in such a running-remote setup. (Sweden<->Canada)
Flags: needinfo?(botond)
Comment on attachment 8653985 [details] [diff] [review]
v2-0001-bug-1191253-start-the-link-monitor-ignore-link-lo.patch

Review of attachment 8653985 [details] [diff] [review]:
-----------------------------------------------------------------

I think its important to understand the failure mode before re-enabling this.. so let's review it then.
Attachment #8653985 - Flags: review?(mcmanus)
(In reply to Daniel Stenberg [:bagder] from comment #10)
> :botond, do you think you can give this a quick try in your end to see if
> anything weird pops up? I ran with this using my guest account in your
> machine but it isn't really possible to run the browser properly and for an
> extended period in such a running-remote setup. (Sweden<->Canada)

I've been using a build with this patch for a few hours, and I haven't experienced any issues accessing Google properties, or any other networking-related issues.
Flags: needinfo?(botond)
(In reply to Patrick McManus [:mcmanus] from comment #11)

> I think its important to understand the failure mode before re-enabling
> this.. so let's review it then.

Okay, let's do that. I'm fairly confident though that this bug that caused too many network-change events to happen was one bug that we have to fix and the inability to deal with that large amount of events is a different one that should warrant its own bug and remedies. I'm on it.
Here's what I've learned so far:

My test-Firefox sends a fake network change event every 100ms, non-stop, to really torture everything.

I then pop a youtube video (meaning HTTP/2) and it really doesn't work properly.

Turns out I can make it really usable again by simply not sending off the HTTP/2 pings nsHttpConnection::CheckForTraffic() will result in. And then it struck me that I've previously seen that Google's HTTP/2 server doesn't like a series of pings and sends a protocol error on the third ping within a short period of time (I don't know what heuristics it uses). I could easily verify this with h2i too.
(In reply to Daniel Stenberg [:bagder] from comment #14)
> Turns out I can make it really usable again by simply not sending off the
> HTTP/2 pings nsHttpConnection::CheckForTraffic() will result in. And then it
> struck me that I've previously seen that Google's HTTP/2 server doesn't like
> a series of pings and sends a protocol error on the third ping within a
> short period of time (I don't know what heuristics it uses). I could easily
> verify this with h2i too.

that makes sense - and its a great find, because that was a latent issue that was going to impact more than Linux :)

So it all works fine (if suboptimally, obviously) if just the ping is suppressed?

what do you think we should do here?
(In reply to Patrick McManus [:mcmanus] from comment #15)

> So it all works fine (if suboptimally, obviously) if just the ping is
> suppressed?

Yes. Removing the pings made Firefox usable again.

> what do you think we should do here?

1) land this patch, as it makes the Linux version more in line with how Firefox works on Windows and Mac already

2) file a new bug and work on rate limitation logic for network-change events. I'll write down my thinking first thing in that bug. Primarily I think delaying the first event so that we can catch rapid subsequent ones and avoid generating many, and then I'll work on logic to avoid sending further pings over a h2/spdy connection that hasn't transferred any data since the previous ping. (and instead switch over to the more passive HTTP/1.1 method)
See Also: → 1201037
make it so!
Comment on attachment 8653985 [details] [diff] [review]
v2-0001-bug-1191253-start-the-link-monitor-ignore-link-lo.patch

Review of attachment 8653985 [details] [diff] [review]:
-----------------------------------------------------------------

::: netwerk/system/linux/nsNotifyAddrListener_Linux.cpp
@@ +179,2 @@
>  
>              // We are just intrested in main routing table

this isn't part of this patch, so we can open a new bug if necessary, but this is a odd assumption right?

@@ +196,5 @@
> +                    if (route_entry->rtm_family == AF_INET6) {
> +                        unsigned char *g = (unsigned char *)
> +                            RTA_DATA(attr);
> +                        if((g[0] == 0xFE) && ((g[1] & 0xc0) == 0x80)) {
> +                            link_local = true;

break;

@@ +206,5 @@
> +            if (!link_local) {
> +                LOG(("OnNetlinkMessage: route update\n"));
> +                networkChange = true;
> +            }
> +            else {

same line as }
Attachment #8653985 - Flags: review+
(In reply to Patrick McManus [:mcmanus] from comment #18)

> >              // We are just intrested in main routing table
> 
> this isn't part of this patch, so we can open a new bug if necessary, but
> this is a odd assumption right?

Probably. I've struggled to understand and found documentation on exactly how these tables are used and not used so I've tried to take the conservative path -- trying to avoid superfluous notifications. But yes, we can deal with this separately.
Fixed review remarks.
Attachment #8653985 - Attachment is obsolete: true
Attachment #8656512 - Flags: review+
Keywords: checkin-needed
(In reply to Daniel Stenberg [:bagder] from comment #19)
> (In reply to Patrick McManus [:mcmanus] from comment #18)
> 
> > >              // We are just intrested in main routing table
> > 
> > this isn't part of this patch, so we can open a new bug if necessary, but
> > this is a odd assumption right?
> 
> Probably. I've struggled to understand and found documentation on exactly
> how these tables are used and not used so I've tried to take the
> conservative path -- trying to avoid superfluous notifications. But yes, we
> can deal with this separately.

I've used them to introduce src address based routing in the past. (i.e. you use the src address to find a table, and each table has a different gateway).
https://hg.mozilla.org/mozilla-central/rev/20ab8a8bdd20
Status: REOPENED → RESOLVED
Closed: 9 years ago9 years ago
Resolution: --- → FIXED
Target Milestone: mozilla42 → mozilla43
Depends on: 1234548
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: