Closed Bug 1542833 Opened 8 months ago Closed 8 months ago

Add a measure to telemetry to detect if this ping comes from a cold startup

Categories

(Toolkit :: Startup and Profile System, enhancement, P2)

Tracking

RESOLVED FIXED
mozilla68
Tracking Status
firefox68 --- fixed

People

(Reporter: dthayer, Assigned: dthayer)

Details

(Whiteboard: [fxperf:p1])

Attachments

(2 files)

Because cold startups (startups after an OS restart) can perform substantially differently from warm startups, it would be ideal to have some numbers on how many cold startups we see versus warm startups, and to get a sense from telemetry of how startup performance differs between them.

This will help prioritize work on startup performance, as well as provide concrete reasons to measure cold startups through automation, something that is currently not done (talos currently does one cold startup, ignores it, and then measures the warm startups.)

The simplest way to implement this, as far as I can tell, is by including a "timeSinceOSStart" measure in our main ping. Then, we could do the following:

let A be a main ping
let B be the next main ping (by creationDate) with clientID == A.clientID

isColdStartup(B) := A.creationDate < B.creationDate - B.timeSinceOSStart
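
A minimal sketch of how this could be computed over a batch of pings, assuming each ping is a plain object with clientID, creationDate (milliseconds since the epoch) and the proposed timeSinceOSStart (milliseconds). The function name markColdStartups and the in-memory array are hypothetical and do not correspond to an existing telemetry pipeline API. A ping counts as cold when the previous ping from the same client was created before the OS boot time implied by the current ping; treating a client's first ping as cold is an assumption of this sketch.

```javascript
// Hypothetical ping shape: { clientID, creationDate, timeSinceOSStart },
// with creationDate in ms since the epoch and timeSinceOSStart in ms.
function markColdStartups(pings) {
  // Sort a copy by client, then by creation time within each client.
  const sorted = [...pings].sort(
    (a, b) =>
      a.clientID.localeCompare(b.clientID) || a.creationDate - b.creationDate
  );
  const lastSeen = new Map(); // clientID -> creationDate of the previous ping
  return sorted.map((ping) => {
    // OS boot time implied by this ping.
    const osStart = ping.creationDate - ping.timeSinceOSStart;
    const previous = lastSeen.get(ping.clientID);
    // Assumption: with no earlier ping for the client, call the startup cold.
    const isColdStartup = previous === undefined || previous < osStart;
    lastSeen.set(ping.clientID, ping.creationDate);
    return { ...ping, isColdStartup };
  });
}
```

Because the result depends on ordering by creationDate, it inherits whatever error the client clock has.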

Chris, is there a way of computing this server-side and storing the result in some nicely accessible place so when analyzing it we don't have to write the same code over and over? Or can you think of an easy way to compute this client side without having to store / read anything from disk that we're not already reading from?

Flags: needinfo?(chutten)

I'd recommend against doing datetime math using client clocks.

If all we're trying to do is identify the first "main" ping since the OS started up, then the ideal collection from a Lean Data POV would be a bool scalar "was_cold_start". How to compute it is the question.

What data do we have to identify cold/warm starts? Is the only information we can get "timeSinceOSStart"? How does that work for multiple OS starts where we start Firefox earlier after restarting the OS?

Flags: needinfo?(chutten)

So the idea is basically that "cold startup" would mean "first start since the last OS reboot"? But that would be affected by Windows prefetch / SuperFetch, right? Would there be other alternatives like trying to detect real disk activity when the libraries are being loaded (is it even possible to detect that vs. loading from paged memory?), or looking at the .pf files on disk, or learning client side what a cold start vs. warm startup means for that system?

Whiteboard: [fxperf] → [fxperf:p2]

(In reply to :Felipe Gomes (needinfo me!) from comment #3)

So the idea is basically that "cold startup" would mean "first start since the last OS reboot"? But that would be affected by Windows prefetch / SuperFetch, right?

I think it's tricky whether we want to lump prefetched runs with warm or with cold startups. Truthfully I don't have a deep enough understanding of Superfetch to say exactly what it does and how it interacts with Firefox. What I have noticed is that when we launch Firefox, we load a prefetch file which seems to load a sequence of segments that we normally would load during startup. If that's all we're dealing with I think it's entirely reasonable to still call that a cold startup. However, if Superfetch is actually loading xul.dll into physical memory before we even decide to launch the program, I agree that is a bit of a different case. But in the end I think I'm still more interested in the user's behaviors around startup, rather than the physical realities.

Would there be other alternatives like trying to detect real disk activity when the libraries are being loaded (is it even possible to detect that vs. loading from paged memory?), or looking at the .pf files on disk, or learning client side what a cold start vs. warm startup means for that system?

I think that we might be able to query whether any part of a file's contents is in the system file cache. RAMMap does this; I don't know if we need special privileges to do so. I would like to take a look at Process Hacker and its source, as it apparently offers similar features, but I haven't had a chance to yet. However, I suspect that we won't be able to do this performantly, so I doubt there's any chance we can do it before we load xul.dll.

(In reply to Chris H-C :chutten from comment #2)

I'd recommend against doing datetime math using client clocks.

I trust you've run into things with this, but can you clarify a bit? While I would love an absolutely reliable measure, even an approximation would be helpful. Do you have an idea of how often a user's clock on a particular machine is unstable enough that macro-scale measurements like this would be erroneous?

If all we're trying to do is identify the first "main" ping since the OS started up, then the ideal collection from a Lean Data POV would be a bool scalar "was_cold_start". How to compute it is the question.

What data do we have to identify cold/warm starts? Is the only information we can get "timeSinceOSStart"? How does that work for multiple OS starts where we start Firefox earlier after restarting the OS?

Can you clarify what you mean by "multiple OS starts where we start Firefox earlier after restarting the OS?" I can't seem to parse that out.

Flags: needinfo?(chutten)

(In reply to Doug Thayer [:dthayer] from comment #4)

(In reply to Chris H-C :chutten from comment #2)

I'd recommend against doing datetime math using client clocks.

I trust you've run into things with this, but can you clarify a bit? While I would love an absolutely reliable measure, even an approximation would be helpful. Do you have an idea of how often a user's clock on a particular machine is unstable enough that macro-scale measurements like this would be erroneous?

Clock skew and its stability is an active area of research. :frank was looking at it most recently, but I don't remember what the findings were beyond supporting my thesis of "avoid client clocks where possible". He'd probably know, though.

It's true, though, that when the rubber hits the road all you were suggesting was an order over pings using creationDate. It should be fairly obvious from your analyses when you include pathological cases (like the clock being reset on each boot), so it shouldn't be too much of a chore to ignore them.

If all we're trying to do is identify the first "main" ping since the OS started up, then the ideal collection from a Lean Data POV would be a bool scalar "was_cold_start". How to compute it is the question.

What data do we have to identify cold/warm starts? Is the only information we can get "timeSinceOSStart"? How does that work for multiple OS starts where we start Firefox earlier after restarting the OS?

Can you clarify what you mean by "multiple OS starts where we start Firefox earlier after restarting the OS?" I can't seem to parse that out.

What I mean is: imagine two boots in a day, one at 9am and another at 10am. Firefox was started and stopped twice, once at 0930 and again at 1010.

With one ping with a creationDate of 0930 and a timeSinceOSStart of 30min, and a second ping with a creationDate of 1010 and a timeSinceOSStart of 10min, how does your algorithm identify that both of these pings are from cold starts?

I think a client-side measure of "was_cold_start" is the way to go as, much like Visual Effects, the closer you can get to the final picture on the set (measurements), the easier a time you'll have in post (analysis).

Flags: needinfo?(chutten)

(In reply to Chris H-C :chutten from comment #5)

With one ping with a creationDate of 0930 and a timeSinceOSStart of 30min, and a second ping with a creationDate of 1010 and a timeSinceOSStart of 10min, how does your algorithm identify that both of these pings are from cold starts?

So, I think we could do this client-side with just a pref, and I'm not sure how "use a pref" evaded me as an idea. But it should work the same either way, provided we have a "last ping"-like thing and a "current ping"-like thing.

let lastCollectionTime = serverSide ? lastPing.creationDate : pref("...lastCollectionTime");
let currentTime = serverSide ? currentPing.creationDate : now();
let isColdStart = false;
if (currentTime - timeSinceOSStart > lastCollectionTime) {
  isColdStart = true;
}

(And if lastPing or similar doesn't exist, then just set isColdStart = true)

So, currentTime - timeSinceOSStart == 10:10 - 0:10 == 10:00, and lastCollectionTime == 09:30, so 10:00 > 09:30, thus isColdStart == true
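
The same check, written as a self-contained function and run against the two-boot scenario from comment #5 (boots at 09:00 and 10:00, Firefox started at 09:30 and 10:10). This is only a sketch: the function signature, the use of null for "no previous collection", and the calendar date are assumptions made for illustration, and a third warm start at 10:40 is added to show the negative case.

```javascript
// All times are in milliseconds. lastCollectionTime is null when nothing
// has been recorded before (treated as a cold start, per the note above).
function isColdStart(currentTime, timeSinceOSStart, lastCollectionTime) {
  if (lastCollectionTime === null) {
    return true;
  }
  const osStartTime = currentTime - timeSinceOSStart;
  // Cold if the OS booted after the last time we recorded a collection.
  return osStartTime > lastCollectionTime;
}

const MINUTE = 60 * 1000;
// Arbitrary date; only the time of day matters for the example.
const at = (hours, minutes) => Date.UTC(2019, 3, 8, hours, minutes);

const first = isColdStart(at(9, 30), 30 * MINUTE, null);
const second = isColdStart(at(10, 10), 10 * MINUTE, at(9, 30));
// Hypothetical third start at 10:40 with no reboot in between.
const third = isColdStart(at(10, 40), 40 * MINUTE, at(10, 10));

console.log(first, second, third); // true true false
```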

Sounds like a solution to me. This can then supply isColdStart to the value of a bool scalar and we're off to the races.

Assignee: nobody → dothayer
Status: NEW → ASSIGNED
Whiteboard: [fxperf:p2] → [fxperf:p1]
Attached file Data review request
Attachment #9059060 - Flags: data-review?(chutten)
Priority: -- → P2
Comment on attachment 9059060 [details]
Data review request

Preliminary notes:

For permanent collections like these it is strongly recommended that you include a regression test to ensure the collection doesn't break while our attention is elsewhere.

DATA COLLECTION REVIEW RESPONSE:

    Is there or will there be documentation that describes the schema for the ultimate data set available publicly, complete and accurate?

Yes. This collection is Telemetry so is documented in its definitions file [Scalars.yaml](https://hg.mozilla.org/mozilla-central/file/tip/toolkit/components/telemetry/Scalars.yaml) and the [Probe Dictionary](https://telemetry.mozilla.org/probe-dictionary/).

    Is there a control mechanism that allows the user to turn the data collection on and off?

Yes. This collection is Telemetry so can be controlled through Firefox's Preferences.

    If the request is for permanent data collection, is there someone who will monitor the data over time?

Yes, Doug Thayer is responsible.

    Using the category system of data types on the Mozilla wiki, what collection type of data do the requested measurements fall under?

Category 1, Technical. (The request says Cat2 Interaction, but "starting the browser" isn't really interaction... either way, the process is the same)

    Is the data collection request for default-on or default-off?

Default on for all channels.

    Does the instrumentation include the addition of any new identifiers?

No.

    Is the data collection covered by the existing Firefox privacy notice?

Yes.

    Does there need to be a check-in in the future to determine whether to renew the data?

No. This collection is permanent.

---
Result: datareview+
Attachment #9059060 - Flags: data-review?(chutten) → data-review+
Pushed by dothayer@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/cb964035e639
Collect cold startup scalar r=chutten,florian
Status: ASSIGNED → RESOLVED
Closed: 8 months ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla68

Current numbers on Nightly for this:

startup.is_cold == false: 498806
startup.is_cold == true: 103045

So, roughly 1 in 6 startups in Nightly is cold. It will be interesting to see how this differs in Beta.
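
For reference, the arithmetic behind that estimate, using only the two counts quoted above (the variable names are illustrative):

```javascript
const warm = 498806; // startup.is_cold == false
const cold = 103045; // startup.is_cold == true

const coldShare = cold / (warm + cold);
console.log((coldShare * 100).toFixed(1) + "%"); // 17.1%
console.log((1 / coldShare).toFixed(1)); // 5.8
```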
