"Link monitor" thread isn't started properly on Linux

RESOLVED FIXED in Firefox 42

Status

()

RESOLVED FIXED
3 years ago
3 years ago

People

(Reporter: bagder, Assigned: bagder)

Tracking

Trunk
mozilla43
Unspecified
Linux
Points:
---
Dependency tree / graph

Firefox Tracking Flags

(firefox42 fixed, firefox43 fixed)

Details

Attachments

(1 attachment, 2 obsolete attachments)

(Assignee)

Description

3 years ago
When landing that work, the function call was changed from NS_NewThread() to NS_NewNamedThread() late in the game but without using the latter function with the correct set of arguments and thus the "Link Monitor" thread is never actually running!

(Linux-specific code-path)
(Assignee)

Updated

3 years ago
Status: NEW → ASSIGNED
(Assignee)

Comment 1

3 years ago
Created attachment 8643597 [details] [diff] [review]
0001-bug-1191253-start-the-Link-Monitor-thread-appropriat.patch

The patch only adds an argument to make the initial event gets dispatched
Attachment #8643597 - Flags: review?(mcmanus)
Attachment #8643597 - Flags: review?(mcmanus) → review+
https://hg.mozilla.org/mozilla-central/rev/fea6620f36b8
Status: ASSIGNED → RESOLVED
Last Resolved: 3 years ago
status-firefox42: affected → fixed
Resolution: --- → FIXED
Target Milestone: --- → mozilla42
See Also: → bug 1193796
(Assignee)

Comment 5

3 years ago
Backed out in https://hg.mozilla.org/integration/mozilla-inbound/rev/3c8f6736b07c since it caused Bug 1193541.

Backing this out, we're back to not running the thread so it will not properly detect network changes and online/offline situations.
Blocks: 1179568
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
(Assignee)

Comment 6

3 years ago
Thanks to the guest account I got from :botond I could (finally) see the problem trigger in real-time and figure out what it was going on that triggered bug 1193541 and why we had to back this patch out.

In these networks we apparently rather frequently get a whole bunch of IPv6 routing table updates involving a link-local address as gateway. I'm not sure that the significance of these updates are, but they're not really the sort of updates we're looking for here anyway since we use the routing table updates as a signal that the "network" changed and these updates on the link-local level are clearly not changing the network in a notable way that would need us to trigger network-changed events.

I'll polish up my patch and take it through a try-run before I post it. Stay tuned.
so these are extraneous events that lead to too many network change events.. and that's going to clear and reset a bunch of things un-necessarily, so it makes sense to filter them out.

but why did it break connectivity to google? It doesn't seem like we should get into a fail state?
Flags: needinfo?(daniel)
(Assignee)

Comment 8

3 years ago
These events keep coming on this network, so it caused network-changed events get sent in bursts every few seconds, which we should avoid. But then as to exactly what those network-changed events did that caused google and spotify-webplayer to fail, I'm not sure. I've not seen a very accurate description of what exactly happened.

It could indeed be valuable for me to make a local build that just sends network-change repeatedly at a fair interval to see what breaks and why since we I can't think of any reason why we shouldn't be able to should handle that pretty gracefully. I'll run some tests like this and see what I learn.

(the actual patch for this issue is still pending mostly since try has been closed this evening)
Flags: needinfo?(daniel)
(Assignee)

Comment 9

3 years ago
Created attachment 8653985 [details] [diff] [review]
v2-0001-bug-1191253-start-the-link-monitor-ignore-link-lo.patch

Here's a version of this patch that starts the helper thread properly _and_ ignores all routing table updates using a link-local IPv6 address as gateway.

Try: https://treeherder.mozilla.org/#/jobs?repo=try&revision=f6c1945a20dd
Attachment #8643597 - Attachment is obsolete: true
Attachment #8653985 - Flags: review?(mcmanus)
(Assignee)

Comment 10

3 years ago
:botond, do you think you can give this a quick try in your end to see if anything weird pops up? I ran with this using my guest account in your machine but it isn't really possible to run the browser properly and for an extended period in such a running-remote setup. (Sweden<->Canada)
Flags: needinfo?(botond)
Comment on attachment 8653985 [details] [diff] [review]
v2-0001-bug-1191253-start-the-link-monitor-ignore-link-lo.patch

Review of attachment 8653985 [details] [diff] [review]:
-----------------------------------------------------------------

I think its important to understand the failure mode before re-enabling this.. so let's review it then.
Attachment #8653985 - Flags: review?(mcmanus)
(In reply to Daniel Stenberg [:bagder] from comment #10)
> :botond, do you think you can give this a quick try in your end to see if
> anything weird pops up? I ran with this using my guest account in your
> machine but it isn't really possible to run the browser properly and for an
> extended period in such a running-remote setup. (Sweden<->Canada)

I've been using a build with this patch for a few hours, and I haven't experienced any issues accessing Google properties, or any other networking-related issues.
Flags: needinfo?(botond)
(Assignee)

Comment 13

3 years ago
(In reply to Patrick McManus [:mcmanus] from comment #11)

> I think its important to understand the failure mode before re-enabling
> this.. so let's review it then.

Okay, let's do that. I'm fairly confident though that this bug that caused too many network-change events to happen was one bug that we have to fix and the inability to deal with that large amount of events is a different one that should warrant its own bug and remedies. I'm on it.
(Assignee)

Comment 14

3 years ago
Here's what I've learned so far:

My test-Firefox sends a fake network change event every 100ms, non-stop, to really torture everything.

I then pop a youtube video (meaning HTTP/2) and it really doesn't work properly.

Turns out I can make it really usable again by simply not sending off the HTTP/2 pings nsHttpConnection::CheckForTraffic() will result in. And then it struck me that I've previously seen that Google's HTTP/2 server doesn't like a series of pings and sends a protocol error on the third ping within a short period of time (I don't know what heuristics it uses). I could easily verify this with h2i too.
(In reply to Daniel Stenberg [:bagder] from comment #14)
> Turns out I can make it really usable again by simply not sending off the
> HTTP/2 pings nsHttpConnection::CheckForTraffic() will result in. And then it
> struck me that I've previously seen that Google's HTTP/2 server doesn't like
> a series of pings and sends a protocol error on the third ping within a
> short period of time (I don't know what heuristics it uses). I could easily
> verify this with h2i too.

that makes sense - and its a great find, because that was a latent issue that was going to impact more than Linux :)

So it all works fine (if suboptimally, obviously) if just the ping is suppressed?

what do you think we should do here?
(Assignee)

Comment 16

3 years ago
(In reply to Patrick McManus [:mcmanus] from comment #15)

> So it all works fine (if suboptimally, obviously) if just the ping is
> suppressed?

Yes. Removing the pings made Firefox usable again.

> what do you think we should do here?

1) land this patch, as it makes the Linux version more in line with how Firefox works on Windows and Mac already

2) file a new bug and work on rate limitation logic for network-change events. I'll write down my thinking first thing in that bug. Primarily I think delaying the first event so that we can catch rapid subsequent ones and avoid generating many, and then I'll work on logic to avoid sending further pings over a h2/spdy connection that hasn't transferred any data since the previous ping. (and instead switch over to the more passive HTTP/1.1 method)
(Assignee)

Updated

3 years ago
See Also: → bug 1201037
make it so!
Comment on attachment 8653985 [details] [diff] [review]
v2-0001-bug-1191253-start-the-link-monitor-ignore-link-lo.patch

Review of attachment 8653985 [details] [diff] [review]:
-----------------------------------------------------------------

::: netwerk/system/linux/nsNotifyAddrListener_Linux.cpp
@@ +179,2 @@
>  
>              // We are just intrested in main routing table

this isn't part of this patch, so we can open a new bug if necessary, but this is a odd assumption right?

@@ +196,5 @@
> +                    if (route_entry->rtm_family == AF_INET6) {
> +                        unsigned char *g = (unsigned char *)
> +                            RTA_DATA(attr);
> +                        if((g[0] == 0xFE) && ((g[1] & 0xc0) == 0x80)) {
> +                            link_local = true;

break;

@@ +206,5 @@
> +            if (!link_local) {
> +                LOG(("OnNetlinkMessage: route update\n"));
> +                networkChange = true;
> +            }
> +            else {

same line as }
Attachment #8653985 - Flags: review+
(Assignee)

Comment 19

3 years ago
(In reply to Patrick McManus [:mcmanus] from comment #18)

> >              // We are just intrested in main routing table
> 
> this isn't part of this patch, so we can open a new bug if necessary, but
> this is a odd assumption right?

Probably. I've struggled to understand and found documentation on exactly how these tables are used and not used so I've tried to take the conservative path -- trying to avoid superfluous notifications. But yes, we can deal with this separately.
(Assignee)

Comment 20

3 years ago
Created attachment 8656512 [details] [diff] [review]
v3-0001-bug-1191253-start-the-link-monitor-ignore-link-lo.patch

Fixed review remarks.
Attachment #8653985 - Attachment is obsolete: true
Attachment #8656512 - Flags: review+
(Assignee)

Updated

3 years ago
Keywords: checkin-needed
(In reply to Daniel Stenberg [:bagder] from comment #19)
> (In reply to Patrick McManus [:mcmanus] from comment #18)
> 
> > >              // We are just intrested in main routing table
> > 
> > this isn't part of this patch, so we can open a new bug if necessary, but
> > this is a odd assumption right?
> 
> Probably. I've struggled to understand and found documentation on exactly
> how these tables are used and not used so I've tried to take the
> conservative path -- trying to avoid superfluous notifications. But yes, we
> can deal with this separately.

I've used them to introduce src address based routing in the past. (i.e. you use the src address to find a table, and each table has a different gateway).
https://hg.mozilla.org/mozilla-central/rev/20ab8a8bdd20
Status: REOPENED → RESOLVED
Last Resolved: 3 years ago3 years ago
status-firefox43: --- → fixed
Resolution: --- → FIXED
Target Milestone: mozilla42 → mozilla43
(Assignee)

Updated

3 years ago
Duplicate of this bug: 1159889

Updated

3 years ago
Depends on: 1234548
Duplicate of this bug: 396781
You need to log in before you can comment on or make changes to this bug.