Closed Bug 1635242 Opened 5 years ago Closed 4 years ago

Lifecycle Design for Project FOG

Categories

(Toolkit :: Telemetry, task, P1)

RESOLVED FIXED

People

(Reporter: chutten, Assigned: chutten)

Details

(Whiteboard: [telemetry:fog:m6])

Attachments

(1 file)

The Glean SDK currently has four pings sent on triggers provided by the application:

  • The "baseline" ping is sent on application foreground and background, and is sent on application start if the background-reason ping failed to be sent. (docs)
  • The "metrics" ping is sent at 4AM client-local if the app is open, otherwise on the first application start after 4AM client-local (this is a simplification)
  • The "events" ping is sent on application background or whenever 500 unsent events are pending. The on-background ping sometimes fails to send due to the application closing and in that case the ping is sent on the next application start. (docs)
  • The "deletion-request" ping is sent immediately when the user opts out of data collection using the application's UI, or if the user opts out between application invocations. It is retried on successive application starts. (docs)

Firefox Desktop needs to supply these signals to the Glean SDK in order for it to know when to send these pings. But the question suggests itself: "Do these even make sense for a Desktop application like Firefox?"

This bug is about working with Data Science to propose a design for scheduling these pings and having that design proposal reviewed by the team and Data Science.

Assignee: nobody → chutten
Status: NEW → ASSIGNED
Priority: P3 → P1

Going to put this design on hold a little while as we collect some information about whether window focus/blur has suitable characteristics for use. (will file a blocking bug for instrumenting)

Assignee: chutten → nobody
Status: ASSIGNED → NEW
Priority: P1 → P3
Depends on: 1647876
Whiteboard: [telemetry:fog:m4] → [telemetry:fog:m6]

I've been doing some EDA with window raised and user activity and I'm liking the flexibility user activity provides. So I think we should move forward with user activity as the trigger with the aim of creating "baseline" pings that describe the same active population, but with reasons that aren't "foreground" and "background" to mark them as having distinct causes. (reason "dirty_startup" might remain)

For the "foreground" analogue, I propose reason "active" that is sent:

  • When the user first becomes active (user-interaction-active is received) after X seconds of inactivity, or upon startup.
    • inactivity measured from the first user-interaction-inactive received after a user-interaction-active.

For the "background" analogue, I propose reason "inactive" that is sent:

  • When the user first becomes inactive (user-interaction-inactive is received) after Y seconds of activity.
    • optionally, it could be sent Z seconds after the user becomes inactive after Y seconds of activity.

For the values of X and Y (and optionally Z) I don't have decent estimates. My EDA cannot calculate X because we don't have a measure of windows of inactivity. As for Y, well, that I can find a first estimate for by taking the minimum of the maximum activity periods in a day's "main" pings: eleven seconds.
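As a rough illustration of that first estimate for Y, here is a minimal Python sketch. It assumes one reading of "the minimum of the maximum activity periods": take the longest activity period in each "main" ping, then take the smallest of those. The function name and the sample values are invented for the example and are not taken from the actual analysis.

```python
from typing import Iterable, Sequence


def first_estimate_for_y(per_ping_activity_s: Iterable[Sequence[float]]) -> float:
    """Return the smallest of the per-ping longest activity periods, in seconds.

    Pings with no recorded activity periods are skipped.
    """
    maxima = [max(periods) for periods in per_ping_activity_s if periods]
    if not maxima:
        raise ValueError("no activity periods recorded")
    return min(maxima)


# Invented sample: each inner list is the activity periods (seconds) of one ping.
sample_day = [[30, 11], [45, 200, 12], [11, 5], []]
print(first_estimate_for_y(sample_day))
```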

To find the first estimate for X we will need to measure periods of inactivity like we do for activity. (To estimate Z would require event math or paired data collection, neither of which I think we need to do for the first attempt). Then with the first estimates from independent analyses we can ratchet up the values one bucket at a time to tighten our estimates up.

I won't be able to get to this instrumentation for a week, though.

Depends on: 1660887

I've been looking at Beta data and I have a tentative set of lifecycle signals that will let us cover over 99% of clients that spend any time at all interacting with the browser with only an average of 3 to 8 "baseline" pings per client per day.

Tentative Criteria: Send a "baseline" ping

  • At app startup, with reason "startup" or "active"
  • When the user becomes active after 20min or more of inactivity, with reason "active"
  • When the user becomes inactive after 2min or more of activity, with reason "inactive"
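
The criteria above can be sketched as a small state machine. This is only an illustration in Python, not FOG's implementation: the class and method names are hypothetical, timestamps are plain seconds, startup is assumed to use reason "startup" and to begin an activity period, and every return to activity is assumed to start a fresh activity period.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple

INACTIVITY_THRESHOLD_S = 20 * 60  # "20min or more of inactivity"
ACTIVITY_THRESHOLD_S = 2 * 60     # "2min or more of activity"


@dataclass
class BaselineScheduler:
    """Hypothetical scheduler that records which "baseline" pings would be sent."""

    sent: List[Tuple[float, str]] = field(default_factory=list)
    active_since: Optional[float] = None
    inactive_since: Optional[float] = None

    def on_startup(self, now: float) -> None:
        # Assumption: startup always submits and counts as the start of activity.
        self.sent.append((now, "startup"))
        self.active_since = now
        self.inactive_since = None

    def on_user_active(self, now: float) -> None:
        if self.active_since is not None:
            return  # already active: ignore repeated notifications
        idle = 0.0 if self.inactive_since is None else now - self.inactive_since
        if idle >= INACTIVITY_THRESHOLD_S:
            self.sent.append((now, "active"))
        self.active_since = now
        self.inactive_since = None

    def on_user_inactive(self, now: float) -> None:
        if self.active_since is None:
            return  # only the first inactive signal after an active one counts
        if now - self.active_since >= ACTIVITY_THRESHOLD_S:
            self.sent.append((now, "inactive"))
        self.inactive_since = now
        self.active_since = None


scheduler = BaselineScheduler()
scheduler.on_startup(0)
scheduler.on_user_inactive(30)    # 30s of activity: below threshold, no ping
scheduler.on_user_active(60)      # 30s of inactivity: below threshold, no ping
scheduler.on_user_inactive(300)   # 240s of activity: "inactive" ping
scheduler.on_user_active(1800)    # 1500s of inactivity: "active" ping
print([reason for _, reason in scheduler.sent])
```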

Assumption: I continue to assume that baseline engagement information delivered via "baseline" ping is only useful on the resolution of the submission day. (ie, no one's looking at hourly active users (and if they did, HAU were they doing it? ba dum tish)).

Spanning Method: To determine the number of clients we'll miss with these criteria, count all clients who, within a given submission day:

  • have only pings with subsession_counter > 1 (if they didn't, we'd send reason "startup")
  • have only periods of inactivity <= 1200s (if any exceeded that, we'd send reason "active")
  • have only periods of activity <= 120s (if any exceeded that, we'd send reason "inactive")

I also exclude any clients who we did receive "main" pings from but have no periods of activity or inactivity over the submission day. I view these clients as not engaging with the browser, and thus not needing to send "baseline" pings. (And probably shouldn't be sending us "main" pings in Telemetry, but I'm not gonna change that). If we're interested in these (picking a term out of thin air) Idle Clients in FOG, we can count them by the client_ids that send us "metrics" pings but no "baseline" pings.
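
A minimal Python sketch of the Spanning Method, including the Idle Client exclusion, might look like the following. The `ClientDay` record and its field names are hypothetical stand-ins for one client's "main" ping data on one submission day; the real analysis ran against Telemetry data, not in-memory lists.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, List


@dataclass
class ClientDay:
    """Hypothetical summary of one client's "main" pings for one submission day."""

    subsession_counters: List[int]
    inactive_periods_s: List[float]
    active_periods_s: List[float]


def is_idle(client: ClientDay) -> bool:
    # No periods of activity or inactivity at all: excluded from the count.
    return not client.inactive_periods_s and not client.active_periods_s


def is_missed(client: ClientDay) -> bool:
    """True if none of the three criteria would have sent a "baseline" ping."""
    if is_idle(client):
        return False
    no_startup = all(n > 1 for n in client.subsession_counters)
    no_active = all(p <= 1200 for p in client.inactive_periods_s)
    no_inactive = all(p <= 120 for p in client.active_periods_s)
    return no_startup and no_active and no_inactive


def summarize(clients: Iterable[ClientDay]) -> Dict[str, int]:
    counts = {"missed": 0, "engaged": 0, "idle": 0}
    for client in clients:
        if is_idle(client):
            counts["idle"] += 1
            continue
        counts["engaged"] += 1
        if is_missed(client):
            counts["missed"] += 1
    return counts


# Invented sample data covering each case.
sample = [
    ClientDay([1, 2], [100], [50]),   # has a startup ping
    ClientDay([3], [1500], [10]),     # long inactivity: would send "active"
    ClientDay([2, 3], [600], [90]),   # nothing crosses a threshold: missed
    ClientDay([4], [], []),           # idle client: excluded
]
print(summarize(sample))
```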

Ping Counting Method: To determine the number of "baseline" pings these criteria will send, sum these counts:

  • Number of pings with subsession_counter = 1 (reason "startup")
  • Number of samples in FOG_EVAL_USER_INACTIVE_S > 1200s (reason "active")
  • Number of samples in FOG_EVAL_USER_ACTIVE_S > 120s (reason "inactive")
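
The Ping Counting Method is a sum of three counts, sketched here in Python. The function name is hypothetical and the inputs are treated as raw per-sample values for simplicity; the FOG_EVAL_* probes are presumably bucketed, so a real query would count by bucket instead.

```python
from typing import Dict, Iterable


def count_baseline_pings(
    subsession_counters: Iterable[int],
    inactive_samples_s: Iterable[float],
    active_samples_s: Iterable[float],
) -> Dict[str, int]:
    """Estimate how many "baseline" pings the tentative criteria would send."""
    startup = sum(1 for n in subsession_counters if n == 1)
    active = sum(1 for s in inactive_samples_s if s > 1200)
    inactive = sum(1 for s in active_samples_s if s > 120)
    return {
        "startup": startup,
        "active": active,
        "inactive": inactive,
        "total": startup + active + inactive,
    }


# Invented sample values for one client-day.
print(count_baseline_pings([1, 2, 3], [30, 1500, 4000], [10, 300, 121, 120]))
```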

Further Work: This work must be redone on Release after we have about a week's data (so next Tuesday the 29th at the absolute earliest). We can also keep twiddling the numbers to see if we can lower the thresholds of (in)activity without ballooning the number of pings (is it worth an extra 1 ping per day to get that last tenth of a percent of clients?), or if there are other local maxima yet to be found.

This isn't the most rigorous work: I just increased the numbers until there were more than 0 missed clients. A better statistician could come up with better numbers. But if these hold up in Release, we can still use them while awaiting better numbers, because until Firefox Telemetry is sunsetted, any change in these thresholds will produce a change in population that we can measure in advance and decide whether to opt into at the time.

Assignee: nobody → chutten
Status: NEW → ASSIGNED
Priority: P3 → P1

We haven't reached saturation on 81, but we're getting there and the numbers are looking good.

We'll lose sight of 0.2% of users who had at least one active/inactive period with the thresholds as proposed (2min/20min). This, coupled with losing sight of the users who have 0 active/inactive periods, will result in a 1.3% drop in DAU (slightly different on weekends and across OSes as the interaction models differ).

I propose that this is okay. Losing sight of the 0.2% least active profiles each day is no biggie, and no longer receiving empty pings from another 1.1% of them will have a concentrating effect on analyses. I'm not a Data Scientist, though. Luckily, this is exactly the point at which I was planning on bringing in Data Science to show me what I'm missing. ni?marissa -- Should I file a JIRA ticket for getting Data Science's opinion on the aforementioned application lifecycle design?

And on the "pings per client per day" front, we're at about 10. A little higher than Beta's numbers, but still 50% lower than mobile. I think this should be fine. I'm not a Data Platform Engineer, though. Luckily, this is exactly the point at which I was planning on bringing in Data Platform Engineering to show me what I'm missing. ni?mreid -- "baseline" pings are small. Are we cool (budget and alerting wise) if nearly every Firefox Desktop profile started sending 10 of them each day?

Flags: needinfo?(mreid)
Flags: needinfo?(mgorlick)

(In reply to Chris H-C :chutten from comment #5)
> And on the "pings per client per day" front, we're at about 10. A little higher than Beta's numbers, but still 50% lower than mobile. I think this should be fine. I'm not a Data Platform Engineer, though. Luckily, this is exactly the point at which I was planning on bringing in Data Platform Engineering to show me what I'm missing. ni?mreid -- "baseline" pings are small. Are we cool (budget and alerting wise) if nearly every Firefox Desktop profile started sending 10 of them each day?

How small are we talking here? Is it safe to say that the FOG baseline pings should have about the same size characteristics as baseline pings for other Glean apps?

Flags: needinfo?(mreid) → needinfo?(chutten)

(In reply to Mark Reid [:mreid] from comment #6)
> (In reply to Chris H-C :chutten from comment #5)
> > And on the "pings per client per day" front, we're at about 10. A little higher than Beta's numbers, but still 50% lower than mobile. I think this should be fine. I'm not a Data Platform Engineer, though. Luckily, this is exactly the point at which I was planning on bringing in Data Platform Engineering to show me what I'm missing. ni?mreid -- "baseline" pings are small. Are we cool (budget and alerting wise) if nearly every Firefox Desktop profile started sending 10 of them each day?
>
> How small are we talking here? Is it safe to say that the FOG baseline pings should have about the same size characteristics as baseline pings for other Glean apps?

Yes. We're not adding anything FOG-specific to a Glean SDK builtin ping. It would make Alessio angry :D

So it's pretty much ping_info plus client_info plus glean.baseline.duration plus headers. (a la docs)

Flags: needinfo?(chutten) → needinfo?(mreid)

Cool. I put together a spreadsheet to estimate the cost (based on Jason's excellent prior work along these lines).

Conservatively, it looks like it should cost no more than $2000/month which seems acceptable to me.

Jason, does that math look right to you? Does that budget impact sound ok?

Flags: needinfo?(mreid) → needinfo?(jthomas)

Also ni? Dave on the budget question.

Flags: needinfo?(dparfitt)

FWIW we hope that this will result in net savings when we decrease the amount of Telemetry being sent. I guess landing this starts a ticking clock for Migration to hurry up and recoup the costs :)

(In reply to Mark Reid [:mreid] from comment #8)
> Cool. I put together a spreadsheet to estimate the cost (based on Jason's excellent prior work along these lines).
>
> Conservatively, it looks like it should cost no more than $2000/month which seems acceptable to me.
>
> Jason, does that math look right to you? Does that budget impact sound ok?

I've updated the spreadsheet to correct the GB/TB conversions and added a compressed-storage pricing calculation. With those adjustments and compressed-storage pricing, I think it is going to cost less than $3000/month.

Flags: needinfo?(jthomas)

+1 on the budget

Flags: needinfo?(dparfitt)

JIRA Data Science Request filed https://jira.mozilla.com/browse/DO-338

Flags: needinfo?(mgorlick)
See Also: → 1673645

Leif and I had a chat, and the approach has been approved in broad strokes. We're okay to proceed with implementation.

Blocks: 1670262
Status: ASSIGNED → RESOLVED
Closed: 4 years ago
Resolution: --- → FIXED
See Also: → 1685522