Closed Bug 818711 Opened 12 years ago Closed 7 years ago

Get two telemetry probes to measure how long it takes for us to start painting since we kick off a load, and how long it takes for that load to get onto the screen

Categories

(Core :: DOM: Navigation, defect)

x86
macOS
defect
Not set
normal

RESOLVED INCOMPLETE

People

(Reporter: ehsan.akhgari, Unassigned)

So I was thinking about ways that we can evaluate the change made in bug 792438, and I think we should get a couple of telemetry probes:

1. A probe which measures how long it takes for us to begin the first paint after we've kicked off a load.

One of the first things that we do when we kick off a load is to create a load group.  We can measure the current time when we create a new load group, and keep it there.  Then later on when we get to PresShell::Paint, we can look at the timestamp for the docshell's load group and report the difference.

2. A probe which measures how long it takes for that paint to get to the screen.

Here, we basically feed the timestamp stored on the load group off to the layer manager, and mark the next paint as important.  Then when the layer manager is done painting, we report the difference.
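
As a rough illustration of the two probes above, here is a minimal Python model (not Gecko code). The class names mirror the concepts in this comment, but the probe names LOAD_TO_FIRST_PAINT_MS and LOAD_TO_SCREEN_MS, the Telemetry sink, and the method names are all hypothetical, chosen only for this sketch.

```python
import time


class Telemetry:
    """Hypothetical telemetry sink: collects samples per probe name."""

    def __init__(self):
        self.samples = {}

    def accumulate(self, probe, value_ms):
        self.samples.setdefault(probe, []).append(value_ms)


class LoadGroup:
    """Remembers when the load was kicked off (creation of the load group)."""

    def __init__(self, clock=time.monotonic):
        self.created_at = clock()


class LayerManager:
    """Reports probe 2 once a paint marked as important is done."""

    def __init__(self, telemetry, clock=time.monotonic):
        self.telemetry = telemetry
        self.clock = clock
        self.pending_load_start = None

    def mark_next_paint_important(self, load_start):
        self.pending_load_start = load_start

    def end_transaction(self):
        # Called when the layer manager is done painting.
        if self.pending_load_start is not None:
            elapsed_ms = (self.clock() - self.pending_load_start) * 1000.0
            self.telemetry.accumulate("LOAD_TO_SCREEN_MS", elapsed_ms)
            self.pending_load_start = None


class PresShell:
    """Reports probe 1 on the first paint after the load started."""

    def __init__(self, load_group, layer_manager, telemetry, clock=time.monotonic):
        self.load_group = load_group
        self.layer_manager = layer_manager
        self.telemetry = telemetry
        self.clock = clock
        self.first_paint_reported = False

    def paint(self):
        if self.first_paint_reported:
            return
        start = self.load_group.created_at
        elapsed_ms = (self.clock() - start) * 1000.0
        self.telemetry.accumulate("LOAD_TO_FIRST_PAINT_MS", elapsed_ms)
        # Hand the same timestamp to the layer manager for probe 2.
        self.layer_manager.mark_next_paint_important(start)
        self.first_paint_reported = True
```

The clock is injectable so that the model can be driven with fixed timestamps in a test; only the first paint after the load reports a sample, and later paints are ignored by both probes.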

The idea is to get these probes backported to Aurora and Beta and then measure the difference between them and Nightly.  If bug 792438 has improved anything, the first probe absolutely needs to go down.  Ideally the second probe will as well.  If the first probe goes down but the second one doesn't, then we should investigate and figure out what we need to do.
+1 for this - I've been asking for better metrics for a while. Apparently I've been asking the wrong people :) Thanks - this is great!

I'd also like some kind of "last paint" metric that captures the concept of when rendering is basically done, even though the page load might still be going on. Call it "page usable". I foresee practical problems with figuring this out, but it makes sense to me as the primary thing to optimize for.

I have to raise a little skepticism that comparative telemetry using absolute timings between channels will do the trick. There are a lot of different combinations of variables at play (cache, URI, network speed, router buffer depth, CPU speed), and I suspect it will take a very large amount of data to smooth all of that out. It's not obvious to me that Nightly (or any channel?) has enough telemetry data to do that...

If you look at the existing "evolution" medians for TOTAL_CONTENT_PAGE_LOAD_TIME on Aurora, you'll see a wide scatter of values from 600 to 1100ms, more or less evenly distributed over time. I suspect that is because of variance in what is actually being measured (again - {uri, cache, network, cpu}) in the dataset. The Beta numbers are a bit better (ranging from 800 to 1000ms) but still have a lot of variance. Nightly looks a lot like Aurora.

If we decide the telemetry isn't clear, the new metrics can still be used in synthetic tests that do single-variable comparisons.

We would use TOTAL_CONTENT_PAGE_LOAD_TIME for page-load-time, right? I think that's fine, but there possibly aren't enough buckets in the histogram at the moment: if you look at the existing telemetry, there are 100ms gaps between data points in common ranges. I'd like to see more granularity there while we're updating things.
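
To make the granularity point concrete, here is a small Python sketch of geometrically spaced histogram buckets. The range (100 ms to 30000 ms) and the bucket counts are assumptions picked for illustration, not the probe's actual definition, and the spacing rule is a simplified stand-in rather than the real telemetry bucketing algorithm.

```python
def exponential_buckets(low, high, n_buckets):
    """Return n_buckets lower bounds spaced geometrically from low to high."""
    ratio = (high / low) ** (1.0 / (n_buckets - 1))
    return [low * ratio ** i for i in range(n_buckets)]


def gap_at(bounds, value_ms):
    """Return the width of the bucket that contains value_ms."""
    for lower, upper in zip(bounds, bounds[1:]):
        if lower <= value_ms < upper:
            return upper - lower
    raise ValueError("value outside histogram range")


# Assumed parameters, for illustration only.
coarse = exponential_buckets(100, 30000, 50)
fine = exponential_buckets(100, 30000, 200)

print(round(gap_at(coarse, 900)), round(gap_at(fine, 900)))
```

Under these assumed parameters, the bucket containing 900 ms is roughly 100 ms wide with 50 buckets and about a quarter of that with 200, which is the kind of extra resolution being asked for in the common range.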
(In reply to comment #1)
> +1 for this - I've been asking for better metrics for a while. Apparently I've
> been asking the wrong people :) Thanks - this is great!

Seems like we both have done too much thinking and too little communication with each other.  But that is a nice problem to have!  :-)

> I'd also like some kind of "last paint" metric that captures the concept of
> when rendering is basically done - but pageload might certainly still be going
> on. Call it page usable. I foresee practical problems with figuring this out -
> but it makes sense to me to be the primary thing to optimize for.

Hmm, I'm having difficulty imagining how we would define that.  The problem is that each time that we paint off of the refresh driver, the paint is considered final.  More data might come from the network in the future, and that might cause a future paint off of the refresh driver, but we don't necessarily know if a given paint is before or after "page usable".

There might be a couple of things which we can use.  One is the DOMContentReady event.  The other is info from Necko on whether all loads have finished.  But we should remember that various things on a page can trigger continual repaints even after the page has finished loading, such as animations, scripts modifying the DOM, animated GIFs, etc.  It's not clear to me where we would draw this line...

> I have to raise a little skepticism that comparative telemetry using absolute
> timings between channels will do the trick. There are a lot of different
> combinations of variables at play (cache, uri, network speed, router buffer
> depth, cpu speed) and I suspect it will take a very large amount of data to
> smooth all of that out and it's not obvious to me that nightly (or any channel?)
> has enough telemetry data to do that... 
> 
> If you look at the existing "evolution" medians for
> TOTAL_CONTENT_PAGE_LOAD_TIME on aurora you'll see a wide scatter of values from
> 600 to 1100ms more or less evenly distributed over time.. I suspect that is
> because of variance in what is actually being measured (again - {uri, cache,
> network, cpu}) in the dataset. The beta numbers are a bit better (ranging from
> 800 to 1000) but still have a lot of variance. nightly looks a lot like aurora.

Yeah, I hear you on this.  Two things to keep in mind though.  We don't have any data measuring how soon we can trigger the first paint, which is what this bug is suggesting to add.  Also, once we have the data, we might be able to throw some statistics at it to see if we can derive any sensible conclusions from it.  (We may see that the first paint metric is 4 times higher on Nightly on average compared to Aurora/Beta, and that might tell us that this optimization can hurt more than it can help, although I don't personally expect to see such huge differences...)
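
One possible way to "throw some statistics at it", sketched in Python: a percentile bootstrap for the difference in median first-paint time between two channels. This is only one option among many and is not something the bug prescribes. The function name, the default parameters, and the idea of passing raw per-session samples are assumptions for the sake of the example.

```python
import random
import statistics


def bootstrap_median_diff(a, b, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile-bootstrap confidence interval for median(a) - median(b).

    a and b are lists of timings in milliseconds, e.g. one per channel.
    """
    rng = random.Random(seed)
    diffs = []
    for _ in range(n_resamples):
        # Resample each channel with replacement, keeping its sample size.
        resampled_a = rng.choices(a, k=len(a))
        resampled_b = rng.choices(b, k=len(b))
        diffs.append(statistics.median(resampled_a) - statistics.median(resampled_b))
    diffs.sort()
    lower = diffs[int((alpha / 2) * n_resamples)]
    upper = diffs[int((1 - alpha / 2) * n_resamples) - 1]
    return lower, upper
```

If the whole interval sits below zero when `a` is Nightly and `b` is Aurora or Beta, that would suggest the first probe really did go down. An interval that straddles zero would mean the data cannot yet separate the channels, which is the concern raised in comment 1.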

> If we decide the telemetry isn't clear the new metrics can still be used in
> synthetic tests that do single variable comparisons.

Absolutely!  about:telemetry for the win!

> We would use TOTAL_CONTENT_PAGE_LOAD_TIME for page-load-time, right? I think
> that's fine - but there possibly aren't currently enough buckets in the
> histogram if you look at the existing telemetry.. 100ms gaps between data
> points in common ranges. I'd like to see more granularity there while we're
> updating things.

I don't think TOTAL_CONTENT_PAGE_LOAD_TIME is what we want here.  It includes so much more stuff than what we're interested in.  Also, because it's measuring a much longer time period, it's inherently more prone to noise.
(In reply to Ehsan Akhgari [:ehsan] from comment #2)
> (In reply to comment #1)
> > +1 for this - I've been asking for better metrics for a while. Apparently I've
> > been asking the wrong people :) Thanks - this is great!
> 
> Seems like we both have done too much thinking and too little communication
> with each other.  But that is a nice problem to have!  :-)
> 
> > I'd also like some kind of "last paint" metric that captures the concept of
> > when rendering is basically done - but pageload might certainly still be going
> > on. Call it page usable. I foresee practical problems with figuring this out -
> > but it makes sense to me to be the primary thing to optimize for.
> 

I think we should track DOMContentReady too, good idea. (Is that different from DOMContentLoaded? I didn't see it in the telemetry list, but maybe I missed it.) It is a little closer to what I was thinking.

But I'm really hoping somebody smarter than me can come up with a proxy for "page usable" or "page pretty stable above the fold" which is the metric I think we want to be driving for. "last paint" was too literal of a description. I knew it would be complicated and probably never precise, but there are several possible optimizations I think we need it to measure:
 * Are we loading images off screen that are slowing down loading on screen images?
 * Could we be getting image geometry information (from head of image) earlier in order to reduce the number of reflows? And can we do that while still maintaining decent overall page load time?

I don't have code for either of those things (and don't even know quite how to approach the first one), but to even take on the project requires a metric to optimize for and I don't think we've got a good one right now.


> 
> Yeah, I hear you on this.  Two things to keep in mind though.  We don't have
> any data measuring how soon we can trigger the first paint, which is what
> this bug is suggesting to add.

+1!


> I don't think TOTAL_CONTENT_PAGE_LOAD_TIME is what we want here.  It

I was really unclear, sorry. All I meant was:
 a] confirm that T_C_P_L_T is the normal definition of pageload (i.e. tp5 numbers). It looks that way from where it is recorded in docshell, but when I get outside of netwerk I shouldn't be trusted :)
 b] if so, we should track it, to the extent there is enough data there, in conjunction with the CSS/JS prioritization patch to figure out how much the patch regresses total page load, because that's the inherent tradeoff in that patch's strategy.
(In reply to comment #3)
> (In reply to Ehsan Akhgari [:ehsan] from comment #2)
> > (In reply to comment #1)
> > > +1 for this - I've been asking for better metrics for a while. Apparently I've
> > > been asking the wrong people :) Thanks - this is great!
> > 
> > Seems like we both have done too much thinking and too little communication
> > with each other.  But that is a nice problem to have!  :-)
> > 
> > > I'd also like some kind of "last paint" metric that captures the concept of
> > > when rendering is basically done - but pageload might certainly still be going
> > > on. Call it page usable. I foresee practical problems with figuring this out -
> > > but it makes sense to me to be the primary thing to optimize for.
> > 
> 
> I think we should track DOMContentReady, good idea. (Is that different than
> DOMContentLoaded? I didn't see it in the telem list - but maybe I missed it)
> too.. it is a little closer to what I was thinking.

Yes, DOMContentReady is an imaginary event that never gets dispatched.  DOMContentLoaded is the real-world equivalent.  ;-)

> But I'm really hoping somebody smarter than me can come up with a proxy for
> "page usable" or "page pretty stable above the fold" which is the metric I
> think we want to be driving for. "last paint" was too literal of a description.
> I knew it would be complicated and probably never precise, but there are
> several possible optimizations I think we need it to measure:
>  * Are we loading images off screen that are slowing down loading on screen
> images?
>  * Could we be getting image geometry information (from head of image) earlier
> in order to reduce the number of reflows? And can we do that while still
> maintaining decent overall page load time?

(Note that while I agree that this probe would be useful, it's different to what I'm proposing here -- and I think we should focus on getting this telemetry first.)
Will some of the Quantum DOM probes or the dependencies of bug 1307244 suffice here?
Flags: needinfo?(ehsan)
I think probe #1 in comment 0 is not captured there, and I can't remember why #2 was useful any more...  That being said, this bug probably doesn't serve any purpose any more, since years have passed...
Status: NEW → RESOLVED
Closed: 7 years ago
Flags: needinfo?(ehsan)
Resolution: --- → INCOMPLETE