Closed Bug 1431168 Opened 6 years ago Closed 6 years ago

Evaluate common patterns of usage for TMO for performance

Categories

(Data Platform and Tools :: General, enhancement, P1)


Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: chutten, Assigned: chutten)

Details

One of the loudest pain points we heard from Firefox engineers using TMO dashboards was performance. And it is pretty lousy.

This bug is to track the work determining why and where it is poor, and what can be done to fix it.

This will be done when we have:

1) An inventory of the steps dashboards take to render various data, both from a fresh load and within a load
2) Performance analysis of those same steps to determine worst offenders
3) A list of actions (or a bug tree) covering what we can do immediately, or later this year, to address the perceived performance issues.
Firefox queues any requests that exceed the limit in `network.http.max-persistent-connections-per-server` (default of 6).

This means if we request more than 6 resources from TMO, or hit the aggregates.tmo API more than 6 times in quick succession, we will see long "blocked" times in the Timings panel of the Network tab of devtools.
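
To see where that time goes without clicking through each request by hand, something like this in the devtools console gives a rough per-resource picture. A sketch only; the "blocked" figure here is my approximation and lumps queueing in with connection setup:

```js
// Rough sketch, run in the console on dist.html: for each t.m.o subresource,
// report the gap between when the request was queued (startTime) and when the
// browser actually started sending it (requestStart). This lumps queueing
// together with DNS/TCP/TLS setup, so it's an upper bound on "blocked".
performance.getEntriesByType("resource")
  .filter(e => e.name.includes("telemetry.mozilla.org"))
  .map(e => ({
    resource: e.name.split("/").pop(),
    blockedMs: Math.round(e.requestStart - e.startTime),
    totalMs: Math.round(e.duration),
  }))
  .sort((a, b) => b.blockedMs - a.blockedMs);
```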

And we do both of those things on load of either aggregate dashboard. Specifically, for dist, we load 14 subresources from telemetry.mozilla.org: 5 stylesheets and 9 scripts. Scripts are loaded serially, but the stylesheets will likely fill most of the six slots while the JS is being fetched and run. These subresources are to support:

bootstrap-multiselect (css and js) -- the main mechanism for choosing which parameters to view.
daterangepicker (css and js) -- Appears to be unused. QUICK WIN: remove these.
elessar (css and js) -- for constructing and styling the range selector in Advanced Settings (dist only)
metricsgraphics (css and js) -- plots
dashboards (css and js) -- appearance and behaviour common to both dist and evo
analytics.js -- Analytics, if DNT isn't on
d3pie.min.js -- For displaying pie charts (flag & boolean histograms and use counters)
telemetry.js -- data fetching
dist.js -- dist-specific behaviour

For evo, the picture is largely the same, but without elessar and d3pie, and with evo.js instead of dist.js.

Aside from removing daterangepicker, nothing here looks terribly easy. We could load some things later, or only when the Advanced Settings section is opened, but I don't see that improving things much.

We could look into bundling all of the js into a single file in a pack step. This introduces process complexity, but may prove to be worth it to eliminate the 150-200ms blocked delay currently experienced on a fresh load of dist. This can wait for now as we investigate bigger fish, but is something to keep in mind.
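
For reference, the pack step wouldn't have to be elaborate. A minimal sketch, assuming terser is installed; the file list and output name are illustrative and don't match the repo layout exactly:

```js
// build.js -- minimal sketch of a pack step (npm install terser).
// File paths and the output name are placeholders, not the actual layout.
const { minify } = require("terser");
const fs = require("fs");

const inputs = [
  "lib/bootstrap-multiselect.js",
  "lib/daterangepicker.js",
  "lib/elessar.js",
  "lib/metricsgraphics.js",
  "js/telemetry.js",
  "js/dashboards.js",
  "js/dist.js",
];

(async () => {
  // Concatenation order matters: libraries first, then our own scripts.
  const sources = {};
  for (const file of inputs) {
    sources[file] = fs.readFileSync(file, "utf8");
  }
  const { code } = await minify(sources, { compress: true, mangle: true });
  fs.writeFileSync("dist.bundle.min.js", code);
})();
```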
Oh, poop. We use daterangepicker for the left-hand side date control and elessar to do the range.

Let's see how many users use the daterangepicker. May be able to remove it if it's unused.
(I can at least remove daterangepicker from evo.html)

On to looking at network blocking for aggregates.tmo. It appears as though aggregates.tmo connections do not count towards t.m.o totals for the magic "6" throttling number. That being said, "blocked" numbers still show up in the Timings tab even on the first requests (even if I'm hosting the frontend locally) so maybe there's another limit we're hitting.

This mystery aside, let's cover some basics. telemetry.js (v2) is the way the aggregates dashboards (dist, evo) interact with aggregates.tmo. (And the way several other things interact with aggregates.tmo, see the architectural diagram: http://bit.ly/2Dfpzd2)

telemetry.js' method of interacting with aggregates.tmo's API is... interesting. The first thing it does upon `Telemetry.init(...)` is ask nine questions of aggregates.tmo.

The first is: "What channels are there?" (/aggregates_by/build_id/channels/)
This can take over 100ms to answer (excluding connection blocking, TLS setup, DNS...), and is unlikely to change its four answers any time soon. But still telemetry.js asks.

The other eight wait for the answer to the first and are: "For each of those channels, for each of submission_date and build_id, what dates and versions do you have?" (/aggregates_by/${build_id_or_submission_date}/channels/${channel}/dates/ which, internally, is the get_dates_metrics endpoint)
These are sent off as fast as Firefox will allow, and have responses that vary in size from 40k-400k, taking between 4ms-440ms for the network to pull down. Service delays are anywhere from 100ms-400ms.
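
In fetch terms, the startup traffic looks roughly like this. This is a reconstruction from the endpoints above, not telemetry.js's actual code, and the base URL is assumed:

```js
// Approximation of the requests Telemetry.init() fires; base URL assumed.
const AGGREGATES = "https://aggregates.telemetry.mozilla.org";

async function initSketch() {
  // 1) "What channels are there?" (assuming the endpoint returns a JSON array of names)
  const channels = await fetch(`${AGGREGATES}/aggregates_by/build_id/channels/`)
    .then(r => r.json());

  // 2) For each channel, for each of build_id and submission_date:
  //    "what dates and versions do you have?" -- eight requests, fired as fast
  //    as the browser will let them go.
  const requests = [];
  for (const channel of channels) {
    for (const prefix of ["build_id", "submission_date"]) {
      requests.push(
        fetch(`${AGGREGATES}/aggregates_by/${prefix}/channels/${channel}/dates/`)
          .then(r => r.json()));
    }
  }
  return Promise.all(requests);
}
```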

telemetry.js stores all of these things in in-memory structures so that, when a follow-up question like "What data do you have for nightly/59 for GC_MS, by build_id?" is asked of it, it can translate that into the API call "What data do you have for nightly/59 for GC_MS, by build_id, for dates {date1, date2, date3...}?", where {date1...} is the list of all available dates (or, if the user has messed with the date slider at the bottom, a slice of them). On the wire that's /aggregates_by/build_id/channels/nightly/?version=59&metric=GC_MS&dates=date1,date2,...
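
Roughly, in code terms (a sketch; `cachedDates` stands in for whatever in-memory structure telemetry.js actually keeps, and the base URL is assumed):

```js
// Sketch of that translation. `cachedDates` is the list populated at init
// (or the slice the user selected with the date slider).
const AGGREGATES = "https://aggregates.telemetry.mozilla.org"; // assumed base URL

function evolutionUrl(channel, version, metric, cachedDates) {
  return `${AGGREGATES}/aggregates_by/build_id/channels/${channel}/` +
         `?version=${version}&metric=${metric}&dates=${cachedDates.join(",")}`;
}

// e.g. evolutionUrl("nightly", "59", "GC_MS", ["20180101", "20180102", "20180103"])
```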

Before that, telemetry.js consumers generally ask "What are acceptable filters for ${channel}/${version}?" (/filters/?channel=${channel}&version=${version})
This covers all "filters" including all acceptable metrics (histograms for which there is an entry in that channel's Histograms.json and for which we've seen some data in the provided channel and version, and all scalars for which the second half of that is true) for the channel and version we're interested in.

If we added a new service endpoint that'd allow us to batch those eight simultaneous get_dates_metrics hits into a single request, I think we'd save a lot of client-side delay here. Aside from that, nothing really sticks out. telemetry.js needs those dates for future calls, and needs the filters to populate its dropdowns. 

Alternatively, if the service were redesigned to report and accept date ranges, or to provide data for all dates, then we could probably skip get_dates_metrics altogether.
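
For the batching idea, something like the following is what I have in mind. This is entirely hypothetical: the path, payload, and response shape are invented here, nothing like it exists on aggregates.tmo today:

```js
// Hypothetical batched endpoint: one POST replaces the eight
// get_dates_metrics round trips. Path, payload, and response are invented.
const AGGREGATES = "https://aggregates.telemetry.mozilla.org"; // assumed base URL

async function fetchAllDates(channels) {
  const response = await fetch(`${AGGREGATES}/aggregates_by/dates_batch/`, {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({
      channels,                                      // e.g. ["nightly", "beta", ...]
      aggregates_by: ["build_id", "submission_date"],
    }),
  });
  // Imagined response: { nightly: { build_id: [...dates], submission_date: [...] }, ... }
  return response.json();
}
```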
Well, here's something odd.

When you change the measure within evo, we call Telemetry.getHistogramInfo for each channel+version on the new metric. This does a full submission_date data request of the service... which we then drop on the floor, because it's missing the &application=Firefox filter.

To be clear, we probably don't need to do this in the first place. When we Telemetry.getEvolution(...) we are given the `kind` and the `buckets`, which are the only pieces of information needed.
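
Something like this is the idea (a sketch only; I'm guessing at getEvolution's argument order and at the shape of the callback payload, so treat both as assumptions):

```js
// Sketch: pull kind/buckets off the evolution we already fetched instead of
// issuing a separate getHistogramInfo request. Argument order and callback
// shape are assumptions, not copied from telemetry.js.
function logKindAndBuckets(channel, version, metric) {
  Telemetry.getEvolution(channel, version, metric, {}, /* useSubmissionDate */ false,
    function (evolutionMap) {
      for (const label of Object.keys(evolutionMap)) {
        const { kind, buckets } = evolutionMap[label];
        console.log(label, kind, buckets.length);
      }
    });
}
```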

I've whipped up a little PR to fix these things I've found along the way: https://github.com/mozilla/telemetry-dashboard/pull/377
For the client-side, this might be helpful:
https://developers.google.com/speed/pagespeed/insights/?utm_source=analytics&tab=desktop&url=http%3A%2F%2Ftelemetry.mozilla.org%2Fnew-pipeline%2Fdist.html

This points out, for example, that we are missing compression on multiple resources.

A thought from here, for when we've sorted the insights a bit:
Before bigger changes, we might want to instrument user timings for the important scenarios (time to interactive, time to "useful", time to "useful" when switching probes, ...).
Maybe something that integrates into the "User timings" report in GA.
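
Something like this might be enough to start with (a sketch; the mark names and the GA category are invented):

```js
// Sketch only: mark the moments we care about with the User Timing API and
// forward the measure to the GA "User timings" report via analytics.js.
performance.mark("tmo-start");

function reportTimeTo(label) {
  performance.mark(`tmo-${label}`);
  performance.measure(`tmo-${label}-measure`, "tmo-start", `tmo-${label}`);
  const entry = performance.getEntriesByName(`tmo-${label}-measure`).pop();
  if (window.ga) {
    // analytics.js timing hit: category, variable, value (ms)
    ga("send", "timing", "TMO Dashboards", label, Math.round(entry.duration));
  }
}

// e.g. call reportTimeTo("interactive") once the default plot has rendered,
// and reportTimeTo("probe-switch") after a measure change finishes re-plotting.
```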
Nice perspective, Georg!

Oh geez, we don't have Content-Encoding: gzip on the responses? Yeesh. If that's a server-side switch to flip, we should go ahead and save that 0.4MB on every load.

However, it's just 400k.

The other recommendations are also micro:
1) Move analytics.js below the fold. -- Sure, should be simple enough. Won't save much, but it won't cost much.
2) Optimize CSS Delivery -- "Find out what styles can be inlined." Nope! Not gonna do that. Sensible CSP rules forbid it.
3) Specify cache expiry on static resources -- Everything already has an ETag. I suppose we could Cache-Control everything to a day/week.
4) Minifying would save 165k of JS, 5k of HTML, and another 3k of CSS. -- metricsgraphics is the biggest slice of this, and we treat it as a black box so we don't even need source maps. We update it very seldom, so we don't even need vast changes to our process to do this. bootstrap-multiselect and daterangepicker are other notable libs we should minify here.

Action items from the pagespeed:
* :whd, is it easy to add gzip content encoding to telemetry.mozilla.org-served content? How about specifying per-resource cache policies?
* Move analytics.js below the fold
* Minify vendored libraries. daterangepicker and multiselect don't have minified editions, so we'll need to do those manually. metrics-graphics does have a minified edition for 2.7.0 that we ought to be able to slot in[1]. 

[1]: https://github.com/mozilla/metrics-graphics/releases
Flags: needinfo?(whd)
(In reply to Chris H-C :chutten from comment #6)
> :whd, is it easy to add gzip content encoding to telemetry.mozilla.org-served content? How about specifying per-resource cache policies?

It is possible to have CloudFront automatically compress objects [1]. I have enabled this setting and invalidated the cache, after which the PSI score went up to 83.

It's possible to specify per-resource cache policies.

[1] https://docs.aws.amazon.com/AmazonCloudFront/latest/DeveloperGuide/ServingCompressedFiles.html
Flags: needinfo?(whd)
As per our Vidyo conversation I've enabled HTTP/2 on this CloudFront distribution.
Thanks to :jason and :whd for the quick wins!

Per that conversation with :jason, it seems per-resource caching is totally a thing we _could_ do, but it would be best configured with some sort of machine-readable file during the deploy step. That way it's automated.

Oh, and we should probably have a deploy step.

:rhudson has already done some thinking aloud about deploy over in github [1] which dovetails with this caching discussion. A deploy step changes the resources we make available, which changes caching policy... which means we should settle on how we want to deploy before we consider how we want to cache. At least now we know what's possible and what is easy.

[1]: https://github.com/mozilla/telemetry-dashboard/issues/392
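
To make the caching discussion concrete, the machine-readable file could be as small as this. Entirely illustrative; the file name, patterns, and values are invented, and no such file exists in telemetry-dashboard:

```js
// cache-policy.js -- illustrative only. A deploy step could read this and set
// Cache-Control per resource pattern when uploading to the CDN.
module.exports = {
  "*.html":        "no-cache",               // entry points: always revalidate
  "lib/*.min.js":  "public, max-age=604800", // vendored libs change rarely: one week
  "lib/*.min.css": "public, max-age=604800",
  "*.js":          "public, max-age=86400",  // our own code: one day
  "*.css":         "public, max-age=86400",
};
```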
The outcome of this bug should be a document detailing how the frontend (dist, evo, dashboard js) interacts with the library (telemetry.js) and how the library interacts with the backend (aggregates.tmo service) so that it can be used in planning the evolution of the TMO architecture and design.
Priority: P2 → P1
Everything except for the bug tree of actions to perform has been completed. The bug tree will be created as part of broader TMO 2018 planning.

The doc I produced (as mentioned in comment #10) is here: https://docs.google.com/document/d/1o6p38pnMwEOfxib6-GxM0Li8iNHZpUAmvS5Sl1D9sxk/edit
Status: ASSIGNED → RESOLVED
Closed: 6 years ago
Resolution: --- → FIXED
Product: Webtools → Data Platform and Tools
Component: Telemetry Dashboards (TMO) → General