Closed Bug 1454724 Opened 2 years ago Closed 2 years ago

Add a telemetry measure for ghost windows on a per-session basis

Categories

(Toolkit :: Telemetry, enhancement)

enhancement
Not set

Tracking

()

RESOLVED FIXED
mozilla61
Tracking Status
firefox61 --- fixed

People

(Reporter: mccr8, Assigned: mccr8)

References

(Blocks 1 open bug)

Details

Attachments

(1 file)

The existing ghost windows telemetry records the number of ghost windows every few minutes. The problem with this is that if a user has a session open for a few hours (recording hundreds of instances of no ghost windows), then they get some ghost windows, they're still reporting < 1% ghost windows. This means that many ghost window regressions only show up in the 95th percentile bucket, which telemetry alerting does not deal with well.

This is made worse because if somebody has a lot of ghost windows, their browser performance may suffer, so they may restart, which I guess is a kind of sampling bias.

Instead, for such a user I think it would make more sense to record a single "yes!" for the entire session if the user ever experiences a ghost window. I don't know if recording a high water mark is better than a single bool. Getting 5 ghost windows is not as bad as getting 100, but they are still bad if your system is not powerful. I guess I can start with a bool and see how that goes.

I'd like to have this alongside the existing GHOST_WINDOWS telemetry, to see how they look relative to each other. GHOST_WINDOWS has been around a while, and I'm familiar with its behavior.
It sounds like a scalar value is what I want.
Component: DOM → Telemetry
Product: Core → Toolkit
1. What questions will you answer with this data?

How often are users experiencing leaking windows?

2. Why does Mozilla need to answer these questions? Are there benefits for users? Do we need this information to address product or business requirements?

The main goal is to catch regressions like bug 1447871. Leaking windows cause severe performance problems for users (for instance, 1 second hangs, and high memory usage), and we cause regressions in this behavior a few times a year. With telemetry, we can hopefully catch these sooner.

3. What alternative methods did you consider to answer these questions? Why were they not sufficient?

We have automated leak testing on TreeHerder, which catches some of these leaks, but our tests don't catch every thing a user can do. For instance, we had one of these leaks that would only happen if you closed a tab with a pending geolocation permission prompt. Another leak was caused by some kind of interaction with AdblockPlus. Another thing we've considered in some kind of "leaking site" reporter (like the crash reporter), where Firefox would ask the user if they want to submit a URL to Mozilla if the browser notices a leak, but that is a lot more involved and invasive than a telemetry measure like this.

4. Can current instrumentation answer these questions?

This is actually the same as the existing GHOST_WINDOW telemetry, except that instead of recording the value many times during a session, the highest value is taken. My hope is that this will make it easier to get automated alerts when we regress this value. Right now, the regressions are clearest in the 95th percentile, and apparently whatever algorithm is used for alerts doesn't deal with that. I filed bug 1305142 on that. (I have no idea why that was graveyarded...) If this new measurement works better, I'll remove GHOST_WINDOW and make this one permanent (GHOST_WINDOW is permanent).

5. List all proposed measurements and indicate the category of data collection for each measurement, using the Firefox data collection categories on the Mozilla wiki.

This measures the maximum number of leaked windows. It falls under Technical Data. I have this as opt-out because GHOST_WINDOWS is also opt out, and my new measurement is strictly weaker, and intended as a replacement.

6. How long will this data be collected? Choose one of the following:
    I want this data to be collected for 6 months initially (potentially renewable).
If this works well, I'll want to make it permanent instead of GHOST_WINDOWS, but I'll deal with that in a later bug. We only have these bad regressions a few times a year, so I may need to wait a while to see what the telemetry looks like in that case.

7. What populations will you measure?

All Firefox users.

8. If this data collection is default on, what is the opt-out mechanism for users?

I have no idea. I assume that Services.telemetry.scalarSetMaximum() will do the right thing, but maybe it doesn't.

9. Please provide a general description of how you will analyze this data.

First, I need to get some indication that something has gone wrong with ghost windows. If telemetry alerts were working, I'd see an email from that. Otherwise, somebody ends up filing a bug about getting leaks. Then I can look back through telemetry evolution and see if there was a particular day where the numbers got worse. GHOST_WINDOWS has been a good indicator of the major leaks we've added in the past, and I'm hoping the new more focused thing will be more useful.

10. Where do you intend to share the results of your analysis?

I file bugs about the leaks, or post in existing bugs about the leaks. For instance, bug 1447871 comment 45.
You didn't set ni? for a data steward, but I was reading my bugmail and happened across this so I'll take a look. :)

Before we begin, allow me to mention that I'd love to be involved in the review of the patch to add the collection. Adding it as a scalar is an excellent idea (and I'm not just saying that because I had the same idea a year ago: bug 1364502 comment#5 ), but I might have comments about how it will be defined.

Also, I've revived bug 1305142 and rehomed it under the current alerting bugtree. Thank you for filing it. It was graveyarded as part of a component shuffle a while back.

Also Also, The opt-out mechanism for users is the usual "Send technical data to Mozilla to improve Firefox" checkmark in about:preferences#privacy

Now, on to the review

DATA COLLECTION REVIEW RESPONSE:

    Is there or will there be documentation that describes the schema for the ultimate data set available publicly, complete and accurate? 

Yes, standard Telemetry considerations apply.

    Is there a control mechanism that allows the user to turn the data collection on and off?

Yes, standard Telemetry considerations apply.

    If the request is for permanent data collection, is there someone who will monitor the data over time?**

Not for permanent collection, though it may be in future.

    Using the category system of data types on the Mozilla wiki, what collection type of data do the requested measurements fall under?

Category 1, Technical data

    Is the data collection request for default-on or default-off?

default-on

    Does the instrumentation include the addition of any new identifiers (whether anonymous or otherwise; e.g., username, random IDs, etc. See the appendix for more details)?

No.

    Is the data collection covered by the existing Firefox privacy notice?

Yes.

    Does there need to be a check-in in the future to determine whether to renew the data?

Yes. :mccr8, when you file the follow-up please mention it in this bug.

---
Result: datareview+
(In reply to Chris H-C :chutten from comment #3)
> You didn't set ni? for a data steward, but I was reading my bugmail and
> happened across this so I'll take a look. :)

Yeah, I was waiting to flag you for review and data review until after a try run came back. Thanks for looking at it.

Try run was green except for an eslint issue that I've fixed:
  https://treeherder.mozilla.org/#/jobs?repo=try&revision=53d0dd415f3c43219211655b9ea2b42e6364441c
Comment on attachment 8969044 [details]
Bug 1454724 - Add scalar telemetry for max ghost windows.

https://reviewboard.mozilla.org/r/237728/#review243508

Looks good! I have two... questions? comments? ...two "things" I want to bring up (possibly just to sate my curiosity), but the code would do exactly what it ought to do so I'm okay if it's pushed as-is.

::: toolkit/components/telemetry/Scalars.yaml:1638
(Diff revision 1)
> +    kind: uint
> +    notification_emails:
> +      - memshrink-telemetry-alerts@mozilla.com
> +    release_channel_collection: opt-out
> +    record_in_processes:
> +      - 'content'

You don't want to record ghost windows in the 'main' process as well? They sometimes creep in: https://mzl.la/2qH2aIn

::: toolkit/components/telemetry/TelemetrySession.jsm:1085
(Diff revision 1)
>  
>      // GHOST_WINDOWS is opt-out as of Firefox 55
>      c("GHOST_WINDOWS", "ghostWindows");
>  
> +    // MAX_GHOST_WINDOWS_SCALAR_NAME is opt-out because it is derived from GHOST_WINDOWS.
> +    Services.telemetry.scalarSetMaximum(MAX_GHOST_WINDOWS_SCALAR_NAME, mgr.ghostWindows);

Huh, I kinda figured you'd record it directly inside the memory manager (like, say, here: https://searchfox.org/mozilla-central/rev/f65d7528e34ef1a7665b4a1a7b7cdb1388fcd3aa/dom/base/nsWindowMemoryReporter.cpp#896 )

That's not to say this is wrong. This is fine! Nothing wrong with this. But I've never been happy with the whole `gatherMemory` mechanism for occasionally, at poorly-defined times, recording values. Especially when we already know when they change.
Attachment #8969044 - Flags: review?(chutten) → review+
(In reply to Chris H-C :chutten from comment #6)
> You don't want to record ghost windows in the 'main' process as well? They
> sometimes creep in: https://mzl.la/2qH2aIn

I'm not sure what a ghost window in the main process actually is. The only ghost windows I've come across are webpages, so I just set it to record them only in the content process to avoid adding a bunch of zeroes. That is strange that they do exist, though!

> That's not to say this is wrong. This is fine! Nothing wrong with this. But
> I've never been happy with the whole `gatherMemory` mechanism for
> occasionally, at poorly-defined times, recording values. Especially when we
> already know when they change.

Yeah, that makes more sense. I was just following the existing code. I guess for this, where there's only a single value we're recording, reporting it more often is okay.
Scalars.yaml is the same as before.
Comment on attachment 8969044 [details]
Bug 1454724 - Add scalar telemetry for max ghost windows.

I guess you should look at this, though it is pretty trivial.
Attachment #8969044 - Flags: review+ → review?(chutten)
Comment on attachment 8969044 [details]
Bug 1454724 - Add scalar telemetry for max ghost windows.

https://reviewboard.mozilla.org/r/237728/#review243812

Thanks for bringing me back to review the change. LGTM.
Attachment #8969044 - Flags: review?(chutten) → review+
Pushed by amccreight@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/7b01ccfbcc6e
Add scalar telemetry for max ghost windows. r=chutten
https://hg.mozilla.org/mozilla-central/rev/7b01ccfbcc6e
Status: NEW → RESOLVED
Closed: 2 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla61
You need to log in before you can comment on or make changes to this bug.