Open Bug 1657159 Opened 4 years ago Updated 3 years ago

Canonicalize/expose a "network readiness" state for use in feature code

Categories

(Core :: Networking, enhancement, P3)

People

(Reporter: nhnt11, Unassigned)

Details

(Whiteboard: [necko-triaged])

We currently have a few different ways to infer network connectivity:

  1. Network change events and nsINetworkLinkService.isLinkUp
  2. "network:captive-portal-connectivity" observer notification from Captive Portal Service
  3. navigator.onLine Web API
  4. State of various network capabilities individually in nsINetworkConnectivityService
  5. Others?

Without networking expertise, it's hard to know which of these is the right way to infer internet connectivity. It would be great to have a canonically accepted "internet state" that one could look at, with each possible state and its limitations clearly documented, known to be well-tested on all major platforms, and accompanied by a "recommended usage" guideline. That would streamline development of features that depend on this information.

The use case that inspires this request is the DoH heuristics. The ideal API for that code would be a single canonical observer notification that signals changes in connectivity (as opposed to link state), plus a corresponding state that can be queried.
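To make the requested shape concrete, here is a minimal sketch of "one notification plus one queryable state". It is written in Python purely as a model and is not Gecko code; `ConnectivityState`, `Connectivity`, `add_observer`, and `update` are hypothetical names that do not exist in the tree.

```python
from enum import Enum
from typing import Callable, List


class Connectivity(Enum):
    """Illustrative states; the real set would need to be agreed on."""
    UNKNOWN = "unknown"
    CONNECTED = "connected"
    NOT_CONNECTED = "not_connected"


class ConnectivityState:
    """Hypothetical single source of truth for internet connectivity."""

    def __init__(self) -> None:
        self._state = Connectivity.UNKNOWN
        self._observers: List[Callable[[Connectivity], None]] = []

    @property
    def state(self) -> Connectivity:
        # The queryable half of the API.
        return self._state

    def add_observer(self, callback: Callable[[Connectivity], None]) -> None:
        # The notification half of the API.
        self._observers.append(callback)

    def update(self, new_state: Connectivity) -> None:
        # Notify only on real transitions, so consumers do not see
        # link-level churn that leaves connectivity unchanged.
        if new_state is self._state:
            return
        self._state = new_state
        for callback in list(self._observers):
            callback(new_state)
```

A consumer such as the DoH heuristics would register one observer and read `state` whenever it needs the current value, instead of combining several signals itself.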

Priority: -- → P2
Whiteboard: [necko-triaged]

Brain dump:

It doesn't really feel practical to me to constantly monitor internet connectivity throughout a session. Perhaps what we need instead is for a consumer to be able to say "I need to do something that depends on internet connectivity" and query network state on demand, alongside a global notification for generic network events. For example, a check could be triggered at startup, upon network changes, when resuming from sleep, etc.

If this sounds familiar, it's because this is pretty much exactly how the captive portal service works. So the solution here might be to canonicalize the captive portal service into something slightly more generic, e.g. a "network readiness service". In my experience, the captive portal detection mechanism of querying a canonical domain and comparing the result to an expected canonical response is extremely reliable for detecting network connectivity, even if not actual portals. There's a sense of distrust around the captive portal service (CPS) though, so this is where documentation on good practice and limitations might be very helpful.
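The "query a canonical domain and compare to an expected response" mechanism can be sketched as below. This is a Python model, not the CaptiveDetect.jsm implementation; `probe`, `fetch_body`, the result strings, and the URL and body passed in by the caller are all assumptions made for illustration.

```python
import urllib.request
from typing import Callable, Optional


def fetch_body(url: str, timeout: float = 5.0) -> Optional[str]:
    """Return the response body, or None on a network-level failure."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.read().decode("utf-8", errors="replace")
    except OSError:
        # URLError, HTTPError and socket timeouts are all OSError subclasses.
        return None


def probe(
    url: str,
    expected_body: str,
    fetch: Callable[[str], Optional[str]] = fetch_body,
) -> str:
    """Classify connectivity from a single canonical request."""
    body = fetch(url)
    if body is None:
        return "no_connectivity"   # request failed outright
    if body.strip() == expected_body.strip():
        return "connected"         # got exactly the canonical response
    return "intercepted"           # something answered, but not the canonical host
```

The `fetch` parameter is injectable so that the comparison logic can be exercised without touching the network, which also hints at how such a check could be tested per platform.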

I suspect one flow that would work very well for the DoH heuristics would be something like:

  1. Upon network change, start a captive portal detection, not via the Service but directly via CaptiveDetect.jsm (or maybe even via the Service's recheckCaptivePortal API).
  2. When connectivity is available, perform heuristics.
  3. After heuristics results are available, perform another detection.
  4. If detection says internet access is still available, use the heuristics results. Else, discard results and go back to step 2.
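The steps above can be sketched as a small guard around the heuristics run. This is a Python model under stated assumptions: `run_with_connectivity_guard`, `is_connected`, `run_heuristics`, and the `max_attempts` bound are hypothetical and not part of the DoH heuristics code; the bound is added here only so the retry loop terminates.

```python
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


def run_with_connectivity_guard(
    is_connected: Callable[[], bool],
    run_heuristics: Callable[[], T],
    max_attempts: int = 3,
) -> Optional[T]:
    """Return heuristics results only if connectivity held before and after."""
    for _ in range(max_attempts):
        # Steps 1-2: only proceed once detection reports connectivity.
        if not is_connected():
            return None  # caller waits for the next network change
        # Step 2: perform the heuristics.
        result = run_heuristics()
        # Steps 3-4: re-check; trust the result only if still connected.
        if is_connected():
            return result
        # Otherwise discard the result and try again.
    return None
```

Returning `None` when connectivity is absent leaves the "wait for connectivity" part to whatever triggers the flow, for example the network change notification.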

Canonicalizing a version of this flow that is not specific to "captive portals" might be one way to go.

Every train of thought I have around this leads me back to the conclusion I mentioned above: we need to canonicalize/abstract good practice and limitations around this.

Another use case for this is instrumentation of network code. It's really important to be able to associate data with the readiness state of the network.

Also: it would be even more awesome if all network telemetry was keyed on a high-level readiness state like "not_ready", "ready", and "unknown".

I did some more digging and the NetworkConnectivityService already does a lot of the relevant work, e.g. observing connectivity and link changes, keeping state updated, etc. I think some good upgrades would be:

  1. Define a standard heuristic that is a combination of the capabilities that the NCS is validating, to be used by general feature code, and document it.
  2. Expose a unified API for querying the state of the network, rather than per-capability. I imagine it would return the standard heuristic result by default, and consumers could request specific capabilities as needed.
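A sketch of what these two upgrades could look like together, modeled in Python rather than against the real NCS interface. `network_state`, `Status`, the capability names, and especially `DEFAULT_CAPABILITIES` are assumptions; the default set is a placeholder, not a recommendation for what the standard heuristic should be.

```python
from enum import Enum
from typing import Dict, Iterable, Optional


class Status(Enum):
    UNKNOWN = "unknown"
    OK = "ok"
    NOT_AVAILABLE = "not_available"


# Assumption: which capabilities make up the "standard heuristic".
DEFAULT_CAPABILITIES = ("dns_v4", "ipv4")


def network_state(
    capabilities: Dict[str, Status],
    required: Optional[Iterable[str]] = None,
) -> Status:
    """Collapse per-capability results into a single answer."""
    names = tuple(required) if required is not None else DEFAULT_CAPABILITIES
    # A capability that was never checked counts as UNKNOWN.
    statuses = [capabilities.get(name, Status.UNKNOWN) for name in names]
    if any(s is Status.NOT_AVAILABLE for s in statuses):
        return Status.NOT_AVAILABLE
    if any(s is Status.UNKNOWN for s in statuses):
        return Status.UNKNOWN
    return Status.OK
```

General feature code would call `network_state(current)` and get the documented default, while a consumer that only cares about, say, IPv6 could pass `required=["ipv6"]`.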

Beyond this, I'm trying to figure out whether we can leverage this to key all necko telemetry by the network state at the time of recording. Currently there's an implicit (maybe explicit, but I don't yet know where this might be captured) assumption that our performance and correctness guarantees depend on the stability and availability of the network. We can make this explicit by classifying the data we're recording and establishing a process around it for analysis.

My vision is e.g. a helper class for telemetry that exposes wrapper APIs that automatically key the data on the current network state, with all necko telemetry accumulation happening through this class. It should be possible to statically enforce this in netwerk/, and we can support keyed histograms by simply concatenating the keys with a delimiter. One might wonder about the cost of multiplying the number of keys by the number of network state keys, but histogram definitions, for example, do not require explicit enumeration of the keys that will be used, and I don't see any reason this would be problematic during analysis.
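A rough model of such a helper, in Python with an in-memory counter standing in for the real telemetry backend. `NetworkKeyedTelemetry`, `accumulate`, the `"|"` delimiter, and the histogram names in the usage are hypothetical; only the idea of keying on network state and concatenating keys with a delimiter comes from the comment above.

```python
from collections import Counter
from typing import Callable, Optional

KEY_DELIMITER = "|"  # assumption: any character not used in existing keys


class NetworkKeyedTelemetry:
    """Wrapper that keys every accumulation on the current network state."""

    def __init__(self, get_network_state: Callable[[], str]) -> None:
        self._get_network_state = get_network_state
        # Stand-in for the real telemetry store: (histogram, key) -> count.
        self.counts: Counter = Counter()

    def _key(self, key: Optional[str]) -> str:
        state = self._get_network_state()
        # Plain histograms become keyed by state alone; already-keyed
        # histograms get the state appended after the delimiter.
        return state if key is None else f"{key}{KEY_DELIMITER}{state}"

    def accumulate(
        self, histogram: str, key: Optional[str] = None, amount: int = 1
    ) -> None:
        self.counts[(histogram, self._key(key))] += amount
```

Because the state is read at recording time, callers never pass it explicitly, which is what would make a static "all accumulation goes through this class" rule practical.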

Priority: P2 → P3
Severity: -- → N/A