Closed Bug 1360328 Opened 7 years ago Closed 7 years ago

Hangs during the network:link-status-changed event (with nsNotifyAddrListener::ChangeEvent::Run() on the stack)

Status:

RESOLVED FIXED

Milestone:

mozilla55

Project Flags:

Performance Impact

high

Tracking Flags:

Tracking

Status

firefox55

---

fixed

People

(Reporter: ehsan.akhgari, Assigned: valentin)

References

Details

(Whiteboard: [bhr][necko-quantum][necko-active])

Attachments

(1 file)

Bug 1360328 - Dispatch a runnable to RecheckCaptivePortal instead of calling it immediately 7 years ago Valentin Gosu [:valentin] (he/him) 59 bytes, text/x-review-board-request	mcmanus : review+	Details

(no longer active)

Reporter

Description

7 years ago

BHR telemetry is reporting hangs with nsNotifyAddrListener::ChangeEvent::Run() on the stack while the network:link-status-changed event is being dispatched. In these reports we don't process any events on the UI thread for more than 8 *seconds*.

Some example call stacks can be seen on this page:

https://people-mozilla.org/~mlayzell/bhr/20170405/15.html (search for "nsNotifyAddrListener" on this page.)

It's possible that by searching in the rest of the data at https://people-mozilla.org/~mlayzell/bhr/20170405.html you can find more example hangs.

It seems like we may need to break down this work, because a lot happens under this notification. We flush the DNS cache, which involves walking a linked list (extremely cache hostile and inefficient) and traversing a hashtable, which is also very slow (I don't know much about the sizes of the data structures we can have here) <https://searchfox.org/mozilla-central/rev/ce5ccb6a8ca803271c946ccb8b43b7e7d8b64e7a/netwerk/dns/nsHostResolver.cpp#617>. We can dispatch the offline event to all of the windows in the process, which involves running some JS (for example here <https://people-mozilla.org/~mlayzell/bhr/20170405/15.html#71>). We respawn some threads for things like the DNS resolver, which is probably unnecessary in the first place. We reload the PAC, and we synchronously run captive portal detection when we come back online! To make things worse, add-ons can currently make this arbitrarily more expensive, but at least that may stop being a problem after 57.

We should look at all of the inefficiencies involved here, but at the very least we should try to yield to the event loop in between these operations.
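
One way to picture the "yield to the event loop" suggestion is the toy Python model below. It is only a sketch under stated assumptions: `EventLoop`, the two handler functions and the step names are hypothetical stand-ins for the work listed above, not Gecko APIs. It shows how splitting the handler into separately dispatched steps lets an already-queued event run in between.

```python
from collections import deque

class EventLoop:
    """Toy single-threaded event loop: a FIFO queue of callables."""
    def __init__(self):
        self.queue = deque()
        self.log = []

    def dispatch(self, fn):
        self.queue.append(fn)

    def run(self):
        while self.queue:
            self.queue.popleft()()

# Hypothetical names for the work described in the comment above.
STEPS = ["flush_dns_cache", "notify_windows", "reload_pac", "recheck_captive_portal"]

def handle_link_change_monolithic(loop):
    # Everything runs inside one event; nothing else can interleave.
    for step in STEPS:
        loop.log.append(step)

def handle_link_change_chunked(loop, steps=STEPS):
    # Run one step, then re-dispatch the rest so queued events get a turn.
    if not steps:
        return
    loop.log.append(steps[0])
    loop.dispatch(lambda: handle_link_change_chunked(loop, steps[1:]))

def run_scenario(handler):
    loop = EventLoop()
    loop.dispatch(lambda: handler(loop))
    # An unrelated event that is already queued behind the handler.
    loop.dispatch(lambda: loop.log.append("user_input"))
    loop.run()
    return loop.log
```

In the monolithic scenario `user_input` is logged only after all four steps; in the chunked scenario it is logged right after the first step. Chunking does not reduce the total work, it only bounds how long any single event blocks the queue.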

Patrick McManus [:mcmanus]

Comment 1

7 years ago

A few thoughts (thanks btw!)

1] this code is going to often be run when a laptop is coming back from sleep.. A lot of OS thrashing happens at that time (at least for me), so some unknown amount of this 'hang' is probably elapsed time we're not really getting the CPU for simply because this code correlates with wakeup.

2] reloadPac(), based on another bug filed today, can go to disk via a Windows library call implementation. Again, wakeup is a crappy time to be using other services, so it wouldn't surprise me if that library call was less responsive than usual. Hopefully bug 1360164 resolves that. ni bagder

3] that DNS code was ancient when I started here, but iirc it's working with at most on the order of 500 elements and a hashtable sized for that. So while there are certainly quicker ways to clean up (and in this case to clean up async; the code is shared with shutdown, which is why it is sync, but that need not be so), it's not on the order of 8000ms in any universe. More likely there is blocking IO somewhere, which is why I'm skeptical a yield is going to help much.

4] we do want CP detection here, I don't understand why it's _sync_ tho. ni valentin.
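
A rough back-of-the-envelope check of point 3] can be written down as follows. This is a sketch, not a measurement: the 100 ns cache-miss cost and the ten misses per entry are assumptions picked to be pessimistic, and the function names are made up for illustration.

```python
HANG_MS = 8000          # hang duration reported in the description
ENTRIES = 500           # upper bound on entries mentioned in point 3]
ASSUMED_MISS_NS = 100   # assumption: rough cost of one main-memory cache miss

def per_entry_budget_ms(hang_ms=HANG_MS, entries=ENTRIES):
    """Time each entry would have to take for the walk alone to fill the hang."""
    return hang_ms / entries

def estimated_walk_ms(entries=ENTRIES, miss_ns=ASSUMED_MISS_NS, misses_per_entry=10):
    """Pessimistic walk time if every entry costs several cache misses."""
    return entries * misses_per_entry * miss_ns / 1e6
```

Under these assumptions each entry would need about 16 ms for the flush to account for the whole hang, while the estimated walk takes about half a millisecond, which is consistent with the suspicion that something else (such as blocking IO) dominates.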

Flags: needinfo?(daniel)

Patrick McManus [:mcmanus]

Updated

7 years ago

Flags: needinfo?(valentin.gosu)

(no longer active)

Reporter

Comment 2

7 years ago

(In reply to Patrick McManus [:mcmanus] from comment #1)
> A few thoughts (thanks btw!)
> 
> 1] this code is going to often be run when a laptop is coming back from
> sleep.. A lot of OS thrashing happens at that time (at least for me), so
> some unknown amount of this 'hang' is probably elapsed time we're not really
> getting the CPU for simply because this code correlates with wakeup.

About this, see bug 1360361 also.

Valentin Gosu [:valentin] (he/him)

Assignee

Comment 3

7 years ago

(In reply to Patrick McManus [:mcmanus] from comment #1)
> 4] we do want CP detection here, I don't understand why it's _sync_ tho. ni
> valentin.

I don't exactly understand how it is _sync_. It calls into captivedetect.js and performs the check asynchronously.
This may be related to bug 1350470: having several JS observers listening for the "network:offline-status-changed" topic may slow things down a bit.

Flags: needinfo?(valentin.gosu)

Mike Conley (:mconley) (:⚙️)

Updated

7 years ago

Whiteboard: [qf:p1] → [qf:p1][bhr]

(no longer active)

Reporter

Comment 4

7 years ago

The synchronicity in question here is from the viewpoint of the main thread event loop. IOW, when I said "synchronously run captive portal detection" I meant directly calling a function that starts the captive portal detection work, as opposed to posting a runnable to the event queue that kicks off that work asynchronously.

Basically, a *lot* of things happen during network:link-status-changed, and during this time we cannot process any events from our event queue and that is a problem.
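
The difference between calling directly and posting a runnable can be sketched with a toy event queue. This is a minimal Python model, not the Gecko code: `queue`, `on_link_status_changed` and `recheck_captive_portal` are hypothetical stand-ins, and the model only tracks ordering, not cost.

```python
from collections import deque

queue = deque()   # stand-in for the main thread's event queue
trace = []

def recheck_captive_portal():
    # Hypothetical stand-in for kicking off captive portal detection.
    trace.append("recheck started")

def on_link_status_changed(dispatch_recheck):
    trace.append("observer begins")
    if dispatch_recheck:
        # Post a runnable: the kick-off happens on a later turn of the loop.
        queue.append(recheck_captive_portal)
    else:
        # Direct call: the kick-off happens inside the current event.
        recheck_captive_portal()
    trace.append("observer returns")

def run(dispatch_recheck):
    trace.clear()
    queue.clear()
    queue.append(lambda: on_link_status_changed(dispatch_recheck))
    while queue:
        queue.popleft()()
    return list(trace)
```

With `run(False)` the kick-off is recorded before the observer returns; with `run(True)` the observer returns first and the kick-off runs as its own event, so its cost is no longer charged to the network:link-status-changed notification.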

Daniel Stenberg [:bagder]

Comment 5

7 years ago

Bug 1360164 is for when using "system proxy" settings on Windows and seems to be responsible for some 100ms "waste".

Flags: needinfo?(daniel)

Comment hidden (mozreview-request)

Review commit: https://reviewboard.mozilla.org/r/137066/diff/#index_header
See other reviews: https://reviewboard.mozilla.org/r/137066/

Patrick McManus [:mcmanus]

Comment 7

7 years ago

this patch is fine, but the real win is in bug 1360164

Patrick McManus [:mcmanus]

Comment 8

7 years ago

mozreview-review

Comment on attachment 8865389 [details]
Bug 1360328 - Dispatch a runnable to RecheckCaptivePortal instead of calling it immediately

https://reviewboard.mozilla.org/r/137066/#review140084

Attachment #8865389 - Flags: review?(mcmanus) → review+

Patrick McManus [:mcmanus]

Updated

7 years ago

Assignee: nobody → valentin.gosu

Whiteboard: [qf:p1][bhr] → [qf:p1][bhr][necko-quantum][necko-active]

Pulsebot

Comment 9

7 years ago

Pushed by valentin.gosu@gmail.com:
https://hg.mozilla.org/integration/autoland/rev/5d4eb3398998
Dispatch a runnable to RecheckCaptivePortal instead of calling it immediately r=mcmanus

Carsten Book [:Tomcat]

Comment 10

7 years ago

bugherder

https://hg.mozilla.org/mozilla-central/rev/5d4eb3398998

Status: NEW → RESOLVED

Closed: 7 years ago

status-firefox55: --- → fixed

Resolution: --- → FIXED

Target Milestone: --- → mozilla55

(no longer active)

Reporter

Comment 11

7 years ago

Did we want to do more investigation on the issues in comment 0?  It seems like this patch addressed only one of the problems.  I just checked the latest BHR data available <https://s3-us-west-2.amazonaws.com/bhr-data/v1/20170516/all.html> (warning, ~450MB file) and there are still many hangs with signatures involving nsNotifyAddrListener (but note that now the native stacks are collected 128ms into the hang, so the comparison with the data in comment 0 is difficult...)

At any rate, delving into the BHR data may reveal interesting things...

(no longer active)

Reporter

Updated

7 years ago

Updated

2 years ago

Performance Impact: --- → P1

Whiteboard: [qf:p1][bhr][necko-quantum][necko-active] → [bhr][necko-quantum][necko-active]

Categories

(Core :: Networking, enhancement)
