Open Bug 1350470 Opened 7 years ago Updated 2 months ago

SetConnectivity message is slow to run

Categories

(Core :: Networking, enhancement, P2)

enhancement

Tracking

()

People

(Reporter: billm, Assigned: valentin, NeedInfo)

Details

(Keywords: stale-bug, Whiteboard: [necko-triaged])

Attachments

(1 file)

We have telemetry on how long all runnables take to run. If you sort them, you end up with this:

[(u'PVsync::Msg_Notify', 24308880900),
 (u'setTimeout(labeled)', 13472920095),
 (u'setInterval(labeled)', 8386581738),
 (u'PCompositorBridge::Msg_DidComposite', 4355274389),
 (u'anonymous runnable', 4040951138),
 (u'PostMessageEvent(labeled)', 1839384353),
 (u'PHttpChannel::Msg_OnStopRequest(labeled)', 1839366084),
 (u'CCTimerFired(labeled)', 1806332943),
 (u'InterSliceGCTimerFired(labeled)', 1804956911),
 (u'Anonymous_callback_timer', 1755452972),
 (u'PBrowser::Msg_RealMouseMoveEvent(labeled)', 1494906689),
 (u'non-nsINamed runnable', 1111543939),
 (u'ICCTimerFired(labeled)', 931788436),
 (u'NotifyOffThreadScriptLoadCompletedRunnable(labeled)', 919349567),
 (u'nsDocument::DispatchContentLoadedEvents(labeled)', 752455048),
 (u'Anonymous_interface_timer', 697262137),
 (u'PHttpChannel::Msg_OnStartRequest(labeled)', 575561100),
 (u'PContent::Msg_SetConnectivity', 561082376),
 ...

Most of these things you expect to run a lot and take a while: running JS, painting, loading stuff, handling events, GC/CC. But PContent::Msg_SetConnectivity looks out of place to me. Its frequency is pretty low, but the total time we spend in it is high: it's ranked #308 by frequency (how many times it runs) but #18 by total time spent in that runnable.

Is it expected to take a long time? Why is it so slow?
Flags: needinfo?(valentin.gosu)
Flags: needinfo?(honzab.moz)
While the method by itself doesn't do anything would account for that, it does end up calling nsIOService::SetConnectivityInternal which then calls NotifyObservers. There are a few of them, most implemented in JS, which may account for the slightly higher delay.

I am now pondering a fix. Is the IPC message queue the same as the main thread event queue? If they were different, it should be easy to dispatch the event over to the main thread, to prevent it from blocking other IPC events. If they are the same, dispatching wouldn't make things any faster on the main thread, but maybe we should do it anyway?
Flags: needinfo?(wmccloskey)
Flags: needinfo?(valentin.gosu)
Flags: needinfo?(honzab.moz)
Everything here is on the main thread, so dispatching won't improve anything. I think the next step would be to add timers around the different parts of nsIOService::SetConnectivityInternal to see which part is slow.
Flags: needinfo?(wmccloskey)
(In reply to Valentin Gosu [:valentin] from comment #3)
> It's probably one of these:
> http://searchfox.org/mozilla-central/search?q=%22network%3Aoffline-status-
> changed%22

It looks like the update service might be doing I/O on the main thread. That would be a good place to add timers.
Assignee: nobody → valentin.gosu
Whiteboard: [necko-active]
(In reply to Bill McCloskey (:billm) from comment #4)
> (In reply to Valentin Gosu [:valentin] from comment #3)
> > It's probably one of these:
> > http://searchfox.org/mozilla-central/search?q=%22network%3Aoffline-status-
> > changed%22
> 
> It looks like the update service might be doing I/O on the main thread. That
> would be a good place to add timers.
Should we file a separate bug for this?

It seems odd to me that the update service would be running in the content process.
I added timers to the observers to figure out which part of the code is delaying the message.
Comment on attachment 8857476 [details]
Bug 1350470 - Add telemetry to diagnose why the SetConnectivity IPC message is so slow to run

f? for data review
Attachment #8857476 - Flags: feedback?(benjamin)
Comment on attachment 8857476 [details]
Bug 1350470 - Add telemetry to diagnose why the SetConnectivity IPC message is so slow to run

https://reviewboard.mozilla.org/r/128908/#review132094

::: browser/base/content/browser.js:6153
(Diff revision 1)
>      ioService.offline = !ioService.offline;
>    },
>  
>    // nsIObserver
>    observe(aSubject, aTopic, aState) {
> +    let time = Date.now();

I highly recommend using TelemetryStopwatch here instead of hand-rolled timing. In particular it deals with clock skew and monotonic clocks correctly.

Or if that doesn't work because of keyed histograms, at least use the monotonic Cu.now() rather that Date.now()

See https://gecko.readthedocs.io/en/latest/toolkit/components/telemetry/telemetry/collection/measuring-time.html for some docs!

::: toolkit/components/telemetry/Histograms.json:3764
(Diff revision 1)
>      "n_buckets": 10,
>      "description": "Result of nsSocketTransportService::ProbeMaxCount()",
>      "bug_numbers": [1260218],
>      "alert_emails": ["necko@mozilla.com"]
>    },
> +  "NETWORK_SENDCONNECTIVITY_TIME_MS": {

I probably would have written this as five separate histograms rather than keyed, but it's not a big deal.
Attachment #8857476 - Flags: review+
Comment on attachment 8857476 [details]
Bug 1350470 - Add telemetry to diagnose why the SetConnectivity IPC message is so slow to run

to be clear, this is a data-r, I did not review the code in detail and I would recommend using the monotonic timers consistently
Attachment #8857476 - Flags: feedback?(benjamin)
Comment on attachment 8857476 [details]
Bug 1350470 - Add telemetry to diagnose why the SetConnectivity IPC message is so slow to run

https://reviewboard.mozilla.org/r/128908/#review132258

I noticed that we also need to cover:
http://searchfox.org/mozilla-central/search?q=%22network%3Aoffline-about-to-go-offline%22

And what about this code?
http://searchfox.org/mozilla-central/rev/944f87c575e8a0bcefc1ed8efff10b34cf7a5169/dom/push/PushService.jsm#298

And this code:
http://searchfox.org/mozilla-central/rev/944f87c575e8a0bcefc1ed8efff10b34cf7a5169/toolkit/components/jsdownloads/src/DownloadIntegration.jsm#1072

::: browser/base/content/browser.js:6161
(Diff revision 1)
>  
>      // This notification is also received because of a loss in connectivity,
>      // which we ignore by updating the UI to the current value of io.offline
>      this._updateOfflineUI(Services.io.offline);
> +
> +    if (Services.appinfo.processType == Services.appinfo.PROCESS_TYPE_CONTENT) {

I don't think this code will ever run in the child process, so we can ignore it.

::: netwerk/base/nsIOService.cpp:1174
(Diff revision 1)
>              static_cast<CaptivePortalService*>(mCaptivePortalService.get())->Start();
>          } else {
>              static_cast<CaptivePortalService*>(mCaptivePortalService.get())->Stop();
>          }
> +        Telemetry::Accumulate(Telemetry::NETWORK_SENDCONNECTIVITY_TIME_MS,
> +                              NS_LITERAL_CSTRING("nsIOService.cpp"),

Maybe give a more descriptive name here. Something about captive portal service.

::: netwerk/base/nsIOService.cpp:1195
(Diff revision 1)
>      if (aConnectivity) {
>          // If we were previously offline due to connectivity=false,
>          // send the ONLINE notification
>          observerService->NotifyObservers(
>              static_cast<nsIIOService *>(this),
>              NS_IOSERVICE_OFFLINE_STATUS_TOPIC,
>              (u"" NS_IOSERVICE_ONLINE));
>      } else {
>          // If we were previously online and lost connectivity
>          // send the OFFLINE notification
>          observerService->NotifyObservers(static_cast<nsIIOService *>(this),
>                                           NS_IOSERVICE_GOING_OFFLINE_TOPIC,
>                                           u"" NS_IOSERVICE_OFFLINE);
>          observerService->NotifyObservers(static_cast<nsIIOService *>(this),
>                                           NS_IOSERVICE_OFFLINE_STATUS_TOPIC,
>                                           u"" NS_IOSERVICE_OFFLINE);
>      }

Can you also time these notifications? Ideally we would cover them individually, but it would be good to know that we're not missing anything.
Attachment #8857476 - Flags: review?(wmccloskey)
Bulk priority update: https://bugzilla.mozilla.org/show_bug.cgi?id=1399258
Priority: -- → P1
Moving to p3 because no activity for at least 24 weeks.
Priority: P1 → P3
Severity: normal → S3

I'll take a look at whether this is still an issue or if we can close it.

Flags: needinfo?(valentin.gosu)
Priority: P3 → P2
Whiteboard: [necko-active] → [necko-triaged]
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: