Bug 957274 (toolbox-perf-stats)

Instrument toolbox opening performance

Status

RESOLVED WONTFIX

Product: Firefox
Component: Developer Tools
Opened: 4 years ago
Last modified: 4 years ago

People

(Reporter: canuckistani, Assigned: miker)

Tracking

Version: 26 Branch
Hardware: x86
OS: Mac OS X
Points: ---

Firefox Tracking Flags

(Not tracked)

We do a lot when the toolbox opens. Performance seems to be pretty good right now compared to Chrome (Chrome seems a little slower on my machine), but it would be great to track the performance of this initial, critical action with telemetry or, ideally, FHR.

Critical questions this helps us answer are:

 - are we regressing?
 - is performance variable on specific platforms?
 - are (increasingly popular) devtools toolbox extensions impacting performance here?

This measurement not only helps us make a good impression on developers who open the toolbox dozens of times a day, but could also help us prevent a poor experience for non-developers who accidentally open the toolbox. I've also logged bug 957261 to track a first-run experience for the devtools.
I'd very much welcome this, but since our panels initialize lazily, we should take into consideration the fact that different tools will take different amounts of time to load. Moreover, the most common shortcuts, F12 and Cmd-Shift-I, can open different tools at different times, depending on which panel was open last. So we should measure all tools that have separate menu entries in the Web Developer menu (or separate keyboard shortcuts) to identify regressions, but perhaps use an average over all tools as the number to look at more often.
We would do best to time the opening of the first tool to load rather than the toolbox itself, as each tool takes a different amount of time to load and any tool could be the first to open in the toolbox. We should have:

DEVTOOLS_TOOLBOX_INIT_DEBUGGER_INSPECTOR
DEVTOOLS_TOOLBOX_INIT_DEBUGGER_JSDEBUGGER
DEVTOOLS_TOOLBOX_INIT_DEBUGGER_JSPROFILER
DEVTOOLS_TOOLBOX_INIT_DEBUGGER_NETMONITOR
DEVTOOLS_TOOLBOX_INIT_DEBUGGER_OPTIONS
DEVTOOLS_TOOLBOX_INIT_DEBUGGER_SHADEREDITOR
DEVTOOLS_TOOLBOX_INIT_DEBUGGER_STYLEEDITOR
DEVTOOLS_TOOLBOX_INIT_DEBUGGER_WEBCONSOLE

New metrics URL:
http://telemetry.mozilla.org/#nightly/29/DEVTOOLS_NETMONITOR_OPENED_BOOLEAN
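For reference, a minimal sketch of how one of these probes could be defined and recorded using the Telemetry histogram API available to chrome-privileged code; the bucket parameters and the exact placement in a tool's init path are assumptions, not what eventually landed:

  Cu.import("resource://gre/modules/Services.jsm");

  // Histogram definition in toolkit/components/telemetry/Histograms.json
  // (illustrative values; "exponential" buckets suit timing data):
  //   "DEVTOOLS_TOOLBOX_INIT_DEBUGGER_INSPECTOR": {
  //     "kind": "exponential",
  //     "high": "10000",
  //     "n_buckets": 100,
  //     "description": "Time (ms) to open the inspector as the first tool"
  //   }

  // Recording side, somewhere in the tool's init path:
  let startTime = Date.now();
  // ... tool initialization runs ...
  Services.telemetry
          .getHistogramById("DEVTOOLS_TOOLBOX_INIT_DEBUGGER_INSPECTOR")
          .add(Date.now() - startTime);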
Assignee: nobody → mratcliffe
Status: NEW → ASSIGNED
(In reply to Panos Astithas [:past] from comment #1)
> I'd very much welcome this, but since our panels initialize lazily, we
> should take into consideration the fact that different tools will take
> different amounts of time to load. Moreover, the most common shortcuts, F12
> and Cmd-Shift-I, can open different tools at different times, depending on
> which panel was open last. So we should measure all tools that have separate
> menu entries in the Web Developer menu (or separate keyboard shortcuts) to
> identify regressions, but perhaps use an average over all tools as the
> number to look at more often.

Is there a way to zero in on the first paint of the toolbox area? Visually, it looks to me like we open up something very basic at the very beginning and then fill in the selected or most recent tool.
(In reply to Michael Ratcliffe [:miker] [:mratcliffe] from comment #2)
> We would do best to time the opening of the first tool to load rather than
> the toolbox itself, as each tool takes a different amount of time to load
> and any tool could be the first to open in the toolbox. We should have:
> 
> DEVTOOLS_TOOLBOX_INIT_DEBUGGER_INSPECTOR
> DEVTOOLS_TOOLBOX_INIT_DEBUGGER_JSDEBUGGER
> DEVTOOLS_TOOLBOX_INIT_DEBUGGER_JSPROFILER
> DEVTOOLS_TOOLBOX_INIT_DEBUGGER_NETMONITOR
> DEVTOOLS_TOOLBOX_INIT_DEBUGGER_OPTIONS
> DEVTOOLS_TOOLBOX_INIT_DEBUGGER_SHADEREDITOR
> DEVTOOLS_TOOLBOX_INIT_DEBUGGER_STYLEEDITOR
> DEVTOOLS_TOOLBOX_INIT_DEBUGGER_WEBCONSOLE
> 
> New metrics URL:
> http://telemetry.mozilla.org/#nightly/29/DEVTOOLS_NETMONITOR_OPENED_BOOLEAN

Yeah exactly, and the new telemetry dashboard and 'hack your own' library are really promising for us.

I wonder if the best approach is to measure the most basic 'first paint' performance in FHR (a very diverse set of data and a single number that represents toolbox opening perf generically) and also add in tool-specific timings in telemetry.

Comment 5

4 years ago
I thought I left a comment here… meh.

Toolbox loading needs to be instrumented too (not just the first tool). We keep making changes to the lazy-loading mechanism, and we need to make sure we don't regress. There are also some interesting performance differences between very-first startup (no XUL cache), cold startup (not in memory), and 2nd startup.

(In reply to Jeff Griffiths (:canuckistani) from comment #3)
> Is there a way to zero in on the first paint of the toolbox area? Visually,
> it looks to me like we open up something very basic at the very beginning
> and then fill in the selected or most recent tool.

We open an empty box (for immediate feedback when the user presses F12), then add the tabbar, then load the selected tool. We can measure any of these steps.
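A rough sketch of how those three steps could each be timed from a common starting point; the phase names and histogram ids below are hypothetical:

  // All phases are measured from the moment the user asks for the
  // toolbox (e.g. presses F12):
  let openStart = Date.now();

  function recordPhase(phase) {
    // phase is one of "HOST_READY" (empty box shown), "TABBAR_READY"
    // (tabbar added), or "TOOL_READY" (selected tool loaded).
    Services.telemetry
            .getHistogramById("DEVTOOLS_TOOLBOX_PHASE_" + phase + "_MS")
            .add(Date.now() - openStart);
  }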
Adding gps and bsmedberg as they are the FHR gurus.
(In reply to Paul Rouget [:paul] from comment #5)
...
> Toolbox loading needs to be instrumented too (not just the first tool). We
> keep making changes to the lazy-loading mechanism, and we need to make sure
> we don't regress. There are also some interesting performance differences
> between very-first startup (no XUL cache), cold startup (not in memory),
> and 2nd startup.

For the fine-grained measures I think additional telemetry probes are the way to go, with some way of differentiating the 3 start-up scenarios you mentioned. I also want to consider a single, very generic FHR measurement that we can use as a comparison across a wide variety of systems and as a way to calibrate what telemetry gives us.

> (In reply to Jeff Griffiths (:canuckistani) from comment #3)
> > Is there a way to zero in on the first paint of the toolbox area?
> > Visually, it looks to me like we open up something very basic at the very
> > beginning and then fill in the selected or most recent tool.
> 
> We open an empty box (for immediate feedback when the user presses F12),
> then add the tabbar, then load the selected tool. We can measure any of
> these steps.

I bet the tabbar addition is the sweet spot, based on almost no knowledge. It feels right to me because it's soon enough that it might have less variance, but might also regress if we do something silly. We would need the same measure in telemetry, and this would allow us to do things like:

 - get FHR & telemetry numbers
 - infer other numbers by comparing the FHR & telemetry numbers and then applying numbers from telemetry for the loading of specific tools
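A back-of-the-envelope version of that inference, with purely illustrative numbers; the idea is to scale telemetry's per-tool timings by the ratio between the FHR and telemetry toolbox numbers:

  // Hypothetical medians: FHR covers all channels, telemetry skews nightly.
  let fhrToolboxMs = 180;     // generic toolbox-open time from FHR
  let telToolboxMs = 120;     // same measure from telemetry
  let telInspectorMs = 300;   // inspector init time from telemetry
  // Estimate what the wider FHR population likely sees for the inspector:
  let inferredInspectorMs = telInspectorMs * (fhrToolboxMs / telToolboxMs); // 450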
Alias: toolbox-perf-stats

Comment 8

4 years ago
This seems like an unusual thing to measure in FHR. I'm surprised that, of all the things related to devtools, this particular performance number is either 1) likely to regress or 2) likely to benefit the user. And because you proposed to measure time-to-tabbox, it would be surprising if debugging addons affected this time, no? I would expect that normally, any performance effects would be seen in the actual time to open the tools.

It almost seems like you really want to know something else, but we can't measure it directly and so we're using toolbox times as some sort of proxy.

Also, it's not clear whether you expect this to be a measured item for a limited period of time. We need to understand who's going to make the report to monitor this item and who's going to look at it, especially because monitoring performance numbers across different machine types is tricky business and typically requires cohort analysis.
(In reply to Benjamin Smedberg  [:bsmedberg] from comment #8)
> This seems like an unusual thing to measure in FHR. I'm surprised that, of
> all the things related to devtools, this particular performance number is
> either 1) likely to regress or 2) likely to benefit the user. And because
> you proposed to measure time-to-tabbox, it would be surprising if debugging
> addons affected this time, no? I would expect that normally, any
> performance effects would be seen in the actual time to open the tools.
>
> It almost seems like you really want to know something else, but we can't
> measure it directly and so we're using toolbox times as some sort of proxy.

My goal is to strike a balance wrt FHR & telemetry and focus on something that applies to both tools users and the general population. I actually got the idea from something you previously said - that FHR data could be used to normalize Telemetry data biases. So in that sense, sure, there are two things I'm looking for: the FHR numbers themselves, to give me a depth of results, but also the ratio of total reports from FHR to these specific reports, allowing me to also make sense of the more fine-grained measures in telemetry.

The idea for measuring toolbox perf seemed like a good fit for a few reasons:
* as previously stated (and especially since we took over F12), this is something that could happen to any user, and it does
* this measure is designed to be a canary signal for a variety of problems that could affect us, and may well be affecting us right now in a wider population of users than we currently see.

So I wasn't really thinking when I said 'tabbox'. That's dumb; instead we should measure the time until the current tool is ready, whatever that tool is. The question I have for you is this: how feasible would it be to also capture which tool is being opened? I know there is a concern that FHR data could be used to fingerprint users. Collecting this perf number from the toolbox and the tool that got opened as a result *seems* innocuous to me?

> Also, it's not clear whether you expect this to be a measured item for a
> limited period of time. We need to understand who's going to make the report
> to monitor this item and who's going to look at it, especially because
> monitoring performance numbers across different machine types is tricky
> business and typically requires cohort analysis.

I logged this bug, so I assumed it was on me to generate the report and consume the results. This is something I need to know more about, and it is exactly the reason why I cc'd you on this bug.

Is there an existing measure with similar properties:
* a performance measurement involving timings
* it needs to be tracked over time to detect regressions
* it can be variable amongst different types of systems, requiring grouping of reports / cohort analysis

How does that measure work / how have we dealt with issues like this in the past? When you currently do cohort analysis, is this something you've automated, or is it a manual step every time?

Thanks for the feedback, I really appreciate any advice you might have for us.
Seems like these should be telemetry to me, not FHR.
Ah, because Telemetry is only on by default for nightly, it is better for us to use FHR. This was denied to us in the past, but Telemetry does not fulfill our needs.

So, FHR it is.
To summarize the above, we should add the following to Telemetry:
- DEVTOOLS_INSPECTOR_INIT
- DEVTOOLS_JSDEBUGGER_INIT
- DEVTOOLS_JSPROFILER_INIT
- DEVTOOLS_NETMONITOR_INIT
- DEVTOOLS_OPTIONS_INIT
- DEVTOOLS_SHADEREDITOR_INIT
- DEVTOOLS_STYLEEDITOR_INIT
- DEVTOOLS_WEBCONSOLE_INIT

And the following to FHR:
- DEVTOOLS_TOOLBOX_INIT

It seems strange only to have DEVTOOLS_TOOLBOX_INIT in FHR as the tools themselves are far more likely to regress.

Telemetry pings are fairly useless to us as they are only on by default in nightly.
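A single helper could route every tool's init time to its per-tool histogram from the list above; the helper itself is hypothetical:

  // Map a tool id ("inspector", "jsdebugger", "netmonitor", ...) to its
  // DEVTOOLS_<TOOL>_INIT histogram and record the elapsed init time.
  function recordToolInitTime(toolId, startTime) {
    let histogramId = "DEVTOOLS_" + toolId.toUpperCase() + "_INIT";
    try {
      Services.telemetry.getHistogramById(histogramId)
              .add(Date.now() - startTime);
    } catch (e) {
      // Unregistered histogram id; don't let telemetry break tool loading.
    }
  }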

Comment 13

4 years ago
Michael, telemetry is currently on by default in nightly and aurora, and I plan on turning it on by default in beta (see firefox-dev post from yesterday).

If the primary purpose is to track performance regressions caused by *our* code changes, nightly is by far the best place for data, and telemetry should be fine. By the time this gets to beta/release, the regression ranges are entire 6-week chunks of time and it's very difficult to figure out what caused a regression. That's why for most performance metrics we try to focus foremost on nightly users: even if they aren't representative, the granularity and the ability to react quickly usually trump.

If the primary purpose is to see whether some addons are causing devtools to load more slowly, then beta or even release populations make more sense, because nightly/aurora populations typically don't use a wide variety of addons because they break much more often.

Any data collection that goes into FHR (on by default for all users) has to provide direct user benefit. It's still not clear to me how we'd expose tips based on this measurement to users via about:healthreport.

(In reply to Jeff Griffiths (:canuckistani) from comment #9)

> My goal is to strike a balance wrt FHR & telemetry and focus on something

My question was a little more general: what are the questions we expect to answer using this data, and how do we expect to help users by answering these questions?

I'm surprised that performance is the primary question we want to answer, because when we talked earlier you were asking about usage in general. So if we mainly want to detect performance regressions, we should focus on telemetry and nightly. If we want to answer other kinds of questions, let's write down exactly what those are.

> So I wasn't really thinking when I said 'tabbox'. That's dumb; instead we
> should measure the time until the current tool is ready, whatever that tool
> is. The question I have for you is this: how feasible would it be to also
> capture which tool is being opened? I know there is a concern that FHR data
> could be used to fingerprint users. Collecting this perf number from the
> toolbox and the tool that got opened as a result *seems* innocuous to me?

I'm not particularly worried about privacy fingerprinting. I'm mainly worried that I don't see a path to any user benefit.

> I logged this bug, so I assumed it was on me to generate the report and
> consume the results.

Right now anything requiring cohort analysis is very hard and will probably require significant assistance from the metrics team. My hope is that we can publish the algorithms and scripts for more people to use, but I doubt that it's something you or I could run directly at the moment.

> Is there an existing measure with similar properties:
> * a performance measurement involving timings
> * it needs to be tracked over time to detect regressions
> * it can be variable amongst different types of systems, requiring grouping
>   of reports / cohort analysis

The closest thing we have now is startup time measurements.
Okay, so DEVTOOLS_TOOLBOX_INIT should also go in Telemetry and not FHR.
(In reply to Benjamin Smedberg  [:bsmedberg] from comment #13)
> Michael, telemetry is currently on by default in nightly and aurora, and I
> plan on turning it on by default in beta (see firefox-dev post from
> yesterday).
...
> Right now anything requiring cohort analysis is very hard and will probably
> require significant assistance from the metrics team. My hope is that we can
> publish the algorithms and scripts for more people to use, but I doubt that
> it's something you or I could run directly at the moment.

These two factors give me pause:

 - it sounds like processing the perf data from FHR is a ton of work
 - if we stick with telemetry and put these performance measures there, we can add all the measures we need

As you know, I had an earlier plan that involved a lot of metrics added to FHR, but when I asked around about limitations wrt FHR, I was told that any measures added to FHR need to provide a user benefit. I identified the toolbox opening performance measure as the one that might affect any user and that would continue to be a concern as devtools extensions become more prevalent. What you're telling me is:

a) the analysis of this measure is really expensive, and 
b) we can probably get this info from the newly-opted-in beta population.

Mike: yeah, it sounds like this should stay in Telemetry for now.
If our goal is to track performance regressions based on code changes, I think we'd be better off creating something akin to a Talos test for Toolbox. We'll get quicker turn-around if not immediate feedback and it absolves us of the pain of having to request and process information from outside our normal development channels.

Correlating FHR data to a specific changeset is liable to be tricky, as we'll be based off of a single build id which could represent a merge of hundreds if not thousands of changesets. Talos runs per checkin.
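For comparison, this is the kind of measurement a Talos-style test would take on every checkin; the module paths match the devtools layout of that era, but treat this as a sketch rather than an actual test:

  const { gDevTools } = Cu.import("resource:///modules/devtools/gDevTools.jsm", {});
  const { devtools } = Cu.import("resource://gre/modules/devtools/Loader.jsm", {});

  // Time a full toolbox open against the current tab.
  let target = devtools.TargetFactory.forTab(gBrowser.selectedTab);
  let start = Date.now();
  gDevTools.showToolbox(target, "inspector").then(toolbox => {
    info("toolbox open took " + (Date.now() - start) + " ms");
    return toolbox.destroy();
  });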
(In reply to Rob Campbell [:rc] (:robcee) from comment #16)
> If our goal is to track performance regressions based on code changes, I
> think we'd be better off creating something akin to a Talos test for
> Toolbox. We'll get quicker turn-around if not immediate feedback and it
> absolves us of the pain of having to request and process information from
> outside our normal development channels.
> 
> Correlating FHR data to a specific changeset is liable to be tricky, as
> we'll be based off of a single build id which could represent a merge of
> hundreds if not thousands of changesets. Talos runs per checkin.

100% agree, Talos tests would make a heckuvalot more sense for this use case.
Okay, I think this bug is dead as-is, so I'm wont-fixing it. Thanks for the feedback. It looks like the followups should be:

* move ahead with leveraging the existing telemetry data (I've moved to preserve our existing probes over in bug 971360) and creating a devtools dashboard for this and other data
* consider other uses for FHR with devtools
* hope current proposals for turning on telemetry in the Beta channel go well

In particular, it would be nice to instrument some basic performance measures across a wide pool of systems, but FHR data analysis is simply too expensive to provide this for us efficiently.
Status: ASSIGNED → RESOLVED
Last Resolved: 4 years ago
Resolution: --- → WONTFIX