Hangs during the network:link-status-changed event (with nsNotifyAddrListener::ChangeEvent::Run() on the stack)

RESOLVED FIXED in Firefox 55

Status

()

Core
Networking
RESOLVED FIXED
2 months ago
a month ago

People

(Reporter: Ehsan, Assigned: valentin)

Tracking

unspecified
mozilla55
Points:
---

Firefox Tracking Flags

(firefox55 fixed)

Details

(Whiteboard: [qf:p1][bhr][necko-quantum][necko-active])

MozReview Requests

()

Submitter Diff Changes Open Issues Last Updated
Loading...
Error loading review requests:

Attachments

(1 attachment)

There are some hangs being reported through BHR from telemetry with nsNotifyAddrListener::ChangeEvent::Run() on the stack while dispatching the network:link-status-changed event when we don't process any events on the UI thread for more than 8 *seconds*:

Some example call stacks can be seen on this page:

https://people-mozilla.org/~mlayzell/bhr/20170405/15.html (search for "nsNotifyAddrListener" on this page.)

It's possible that by searching in the rest of the data at https://people-mozilla.org/~mlayzell/bhr/20170405.html you can find more example hangs.

It seems like we may need to break down this work.  A lot goes down under this.  For example we flush the DNS cache here which involves walking a linked list which is extremely cache hostile and inefficient and traversing a hashtable which is also very slow (I don't know much about the sizes of the data structures we can have here) <https://searchfox.org/mozilla-central/rev/ce5ccb6a8ca803271c946ccb8b43b7e7d8b64e7a/netwerk/dns/nsHostResolver.cpp#617>, we can dispatch the offline event to all of the windows in the process (for example here <https://people-mozilla.org/~mlayzell/bhr/20170405/15.html#71> which involves running some JS), respawning some threads for things like the DNS resolver and probably unnecessary in the first place, reload the PAC, and synchronously run captive portal detection when we come back online!  And to make things worse, add-ons can currently also make this arbitrarily more expensive but at least that may stop being a problem after 57.

We should look at all of the inefficiencies involved here, but at the very least we should try to yield to the event loop in between these operations.
A few thoughts (thanks btw!)

1] this code is going to often be run when a laptop is coming back from sleep.. A lot of OS thrashing happens at that time (at least for me), so some unknown amount of this 'hang' is probably elapsed time we're not really getting the CPU for simply because this code correlates with wakeup.

2] realoadPac() based on another bug filed today can go to disk via a windows library call implementation... again wakeup is a crappy time to be using other services, so it wouldn't suprise me if that library call was less responsive than usual. Hopefully 1360164 resolves that. ni bagder

3] that dns code was ancient when I started here.. but iirc its working with at max on the order of 500 elements and a hashtable sized for that.. so while there are certainly quicker ways to cleanup (and in this case cleanup async - the code is shared with shutdown which is why it is sync.. but that need not be so) its not on the order of 8000ms in any universe - more likely a blocking io somewhere which is why I'm skeptical a yield is going to help much.

4] we do want CP detection here, I don't understand why its _sync_ tho. ni valentin.
Flags: needinfo?(daniel)
Flags: needinfo?(valentin.gosu)
(In reply to Patrick McManus [:mcmanus] from comment #1)
> A few thoughts (thanks btw!)
> 
> 1] this code is going to often be run when a laptop is coming back from
> sleep.. A lot of OS thrashing happens at that time (at least for me), so
> some unknown amount of this 'hang' is probably elapsed time we're not really
> getting the CPU for simply because this code correlates with wakeup.

About this, see bug 1360361 also.
(In reply to Patrick McManus [:mcmanus] from comment #1)
> 4] we do want CP detection here, I don't understand why its _sync_ tho. ni
> valentin.

I don't exactly understand how it is _sync_. It calls into captivedetect.js and performs the check asyncly.
This may be related to bug 1350470 - the fact that we have several JS observers listening for the "network:offline-status-changed" topic, may slow things down a bit.
Flags: needinfo?(valentin.gosu)

Updated

2 months ago
Whiteboard: [qf:p1] → [qf:p1][bhr]
The synchronicity in question here is from the viewpoint of the main thread event loop, IOW when I said "synchronously run captive portal detection" I meant directly calling a function that would start off the captive portal detection work, as opposed to posting a runnable to the event queue to trigger off that work (aka, the captive portal detection kick off) asynchronously.

Basically, a *lot* of things happen during network:link-status-changed, and during this time we cannot process any events from our event queue and that is a problem.
Bug 1360164 is for when using "system proxy" settings on Windows and seems to be responsible for some 100ms "waste".
Flags: needinfo?(daniel)
Comment hidden (mozreview-request)
this patch is fine, but the real win is in 1360164

Comment 8

2 months ago
mozreview-review
Comment on attachment 8865389 [details]
Bug 1360328 - Dispatch a runnable to RecheckCaptivePortal instead of calling it immediately

https://reviewboard.mozilla.org/r/137066/#review140084
Attachment #8865389 - Flags: review?(mcmanus) → review+
Assignee: nobody → valentin.gosu
Whiteboard: [qf:p1][bhr] → [qf:p1][bhr][necko-quantum][necko-active]

Comment 9

a month ago
Pushed by valentin.gosu@gmail.com:
https://hg.mozilla.org/integration/autoland/rev/5d4eb3398998
Dispatch a runnable to RecheckCaptivePortal instead of calling it immediately r=mcmanus

Comment 10

a month ago
bugherder
https://hg.mozilla.org/mozilla-central/rev/5d4eb3398998
Status: NEW → RESOLVED
Last Resolved: a month ago
status-firefox55: --- → fixed
Resolution: --- → FIXED
Target Milestone: --- → mozilla55
Did we want to do more investigation on the issues in comment 0?  It seems like this patch addressed only one of the problems.  I just checked the latest BHR data available <https://s3-us-west-2.amazonaws.com/bhr-data/v1/20170516/all.html> (warning, ~450MB file) and there are still many hangs with signatures involving nsNotifyAddrListener (but note that now the native stacks are collected 128ms into the hang, so the comparison with the data in comment 0 is difficult...)

At any rate, delving into the BHR data may reveal interesting things...
See Also: → bug 1366133
You need to log in before you can comment on or make changes to this bug.