Closed Bug 650143 Opened 13 years ago Closed 8 years ago

Necko telemetry: make list of all stats we want to track

Categories

(Core :: Networking, defect)

RESOLVED INCOMPLETE

People

(Reporter: jduell.mcbugs, Unassigned)

We need to figure out what we want to track for Necko telemetry.  Once we have that list, we also need to work out 1) any privacy issues; 2) whether we already collect the data in Necko or need to start collecting it; and 3) which JS APIs we need to expose so the telemetry code can collect the stats.  We can break implementation out into separate bugs; let's keep this one focused on what we want to collect (and any privacy issues).

The best list I've seen so far is Patrick's, below.  We probably also want to collect aggregate stats from the web timing spec (see bug 576006).

----
a (perhaps partial) wishlist:

distributions of RTTs are a great way to get started, crosstabbed
against hardware type as best we can (wired, 802.11, non-802.11
wireless), probably measured as handshake time.
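
One way the RTT crosstab above might be tabulated is sketched below. This is only an illustration: the bucket boundaries, the link-type labels, and the function names are assumptions made for the example, not anything Necko or the telemetry code defines.

```python
from bisect import bisect_right
from collections import defaultdict

# Hypothetical bucket lower bounds in milliseconds (not taken from this bug).
BUCKETS_MS = [0, 10, 25, 50, 100, 250, 500, 1000]


def bucket_for(sample_ms):
    """Return the lower bound of the bucket that sample_ms falls into."""
    index = bisect_right(BUCKETS_MS, sample_ms) - 1
    # Negative samples are clamped into the first bucket.
    return BUCKETS_MS[max(index, 0)]


def crosstab(samples):
    """Count handshake times per (link type, bucket).

    samples: iterable of (link_type, handshake_ms) pairs, where link_type
    is whatever label the client can determine ("wired", "802.11", ...).
    """
    table = defaultdict(lambda: defaultdict(int))
    for link_type, handshake_ms in samples:
        table[link_type][bucket_for(handshake_ms)] += 1
    return {link: dict(counts) for link, counts in table.items()}
```

Only bucket counts per link type are kept, never individual samples, which may also help with the privacy question.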

A series of timestamped events covering major UI events (such as tab
switching) along with necko requests, the eventual sizes and types of
their responses, and when chunks of those responses were consumed by the
DOM. That's really interesting from a scheduling perspective. I've
always thought there might be some interesting cases where we have all N
parallel connections active and a request for something really high
priority (e.g. css for the viewable tab) comes in. It might well be
worth pausing one or more of the active connections and starting a new one
for the css. With the data, we would at least have a model for
evaluating that.

distributions of XHR frequency, its latency as compared to handshake
rtt, and transfer size. 

make a guess at the server's congestion window peaks.. try and figure
out if that is bound by bandwidth or congestion control.. bundle that
together with rtt and the data on size and we can at least figure out
what a best-case bound for certain scenarios can be.
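
As a rough illustration of the best-case bound mentioned above, the sketch below counts round trips under an idealized slow start: the window doubles every RTT, there is no loss, and the handshake is not counted. The default segment size, the initial window, and the function names are assumptions for the example; `max_cwnd` stands in for whatever cap (bandwidth or receive window) the measurements suggest.

```python
import math


def best_case_round_trips(size_bytes, mss=1460, init_cwnd=3, max_cwnd=None):
    """Round trips needed to deliver size_bytes under idealized slow start.

    mss and init_cwnd are illustrative defaults, not measured values.
    max_cwnd (in segments) optionally caps the window.
    """
    if init_cwnd < 1 or (max_cwnd is not None and max_cwnd < 1):
        raise ValueError("window sizes must be at least one segment")
    segments = math.ceil(size_bytes / mss)
    cwnd = init_cwnd if max_cwnd is None else min(init_cwnd, max_cwnd)
    rounds = 0
    sent = 0
    while sent < segments:
        sent += cwnd       # one window's worth of segments per round trip
        rounds += 1
        cwnd *= 2          # slow start: double the window each round
        if max_cwnd is not None:
            cwnd = min(cwnd, max_cwnd)
    return rounds


def best_case_transfer_ms(size_bytes, rtt_ms, **kwargs):
    """Lower bound on transfer time: round trips times RTT."""
    return best_case_round_trips(size_bytes, **kwargs) * rtt_ms
```

Comparing observed transfer times against a bound like this, per size and RTT bucket, would show how far real transfers are from the ideal.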

our rwin

when we get a non-idempotent method, what else is going on temporally?
Are they so uncommon we just shouldn't mix them into the pconn pool?
What are their transfer rates? That seems like a simple question, but
due to an undersized send buffer on Windows XP and earlier, Firefox had
horrible upload rates as recently as 2 years ago.

number of parallel connections per host and per tab.

fraction of connections and hosts that are eligible for persistent
connections and/or pipelining, and the fraction that actually use them

lifetime of an idle (and separately all) persistent connection(s), along
with information on who closes them. rate of unexpected pconn reuse
failures and subsequent reschedules.

dns lookup latencies and retry and failure rates..

dns cache hit rates (both normal and prefetched)...

peak and typical queue sizes for both http transactions and dns lookups
along with their arrival patterns

hit rates and eviction history for the disk cache..

fraction of transactions that are cancelled and the reason (dom
cancelled it, timed out, stalled, etc..)

rates of gzip encoding.. rates of ssl.. impact of either of those on
latency or transfer time as compared to other documents of similar size
and rtt..

any IPv6 activity and any successful IPv6 activity.

rate of successful revalidations.

how many redirects go to the same hostname? how many redirects go to the
same IP?
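
The two redirect questions above could be answered with a tally like the following sketch. It assumes redirects are recorded as pairs of absolute URLs and that the caller supplies a `resolve` function mapping a hostname to an IP; both the record format and the function names are hypothetical.

```python
from urllib.parse import urlsplit


def classify_redirects(redirects, resolve=None):
    """Tally redirects that stay on the same hostname or the same IP.

    redirects: iterable of (from_url, to_url) pairs of absolute URLs.
    resolve: optional callable mapping hostname -> IP string. It is an
    assumption of this sketch; no DNS lookup is performed here.
    """
    total = same_host = same_ip = 0
    for src, dst in redirects:
        total += 1
        # urlsplit lowercases the hostname; None means a relative URL.
        src_host = urlsplit(src).hostname
        dst_host = urlsplit(dst).hostname
        if not src_host or not dst_host:
            continue
        if src_host == dst_host:
            same_host += 1
        if resolve is not None and resolve(src_host) == resolve(dst_host):
            same_ip += 1
    return {"total": total, "same_host": same_host, "same_ip": same_ip}
```

A redirect to the same hostname also counts as same-IP here whenever the resolver is deterministic, so the interesting number is the difference between the two counts.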

I'm sure other people have other things to add that they would like to
know about too, and not all of this has an immediate and obvious use of
course but having a full picture can be quite helpful when trying to
explore any particular theory. I've got a couple colleagues in academia
that did some really useful (but 10 years old -
http://www.amazon.com/Web-Protocols-Practice-Networking-Measurement/dp/0201710889) work on characterization and I will query them to see if there is other data they think a study like this could helpfully generate and update.
Blocks: 650129
Also:

- Whether using a proxy (and what type: regular, SOCKS)

- Whether using a PAC file/URL.
The "rates of gzip encoding" item mentioned in comment #0 is pretty relevant for bug #648429. It could be rephrased as "ratio of compressed vs uncompressed files".
I started a page on the wiki for collecting and organizing these:

https://wiki.mozilla.org/Necko:Telemetry

When you sign into the wiki, you can add a watch to the page to see updates.
Oh, right--duh!  A wiki is much better than Bugzilla for this.  Thanks!
bug 585196 landed. See the cycle collector probe for an example of how to add timing probes.
Install http://people.mozilla.com/~tglek/telemetry/ping.telemetry.xpi and go to about:histograms to see your measurements. Let's get some probes landed ASAP.
(In reply to comment #5)
> go to about:histograms to see your measurements. 

The address is about:telemetry.
Blocks: 658894
Blocks: 659396
closing idle trackers
Status: NEW → RESOLVED
Closed: 8 years ago
Resolution: --- → INCOMPLETE