Desktop Firefox New Profile Pings reporting "default" and "restart" as startup_profile_selection_reason
Categories
(Data Platform and Tools :: General, defect)
Tracking
(Not tracked)
People
(Reporter: shong, Unassigned)
References
(Blocks 1 open bug)
Details
Issue
From May 4th - May 19th 2023, an unusually high number of new profile pings are reporting "restart" or "default" as the startup_profile_selection_reason
.
new profile pings - profile selection reason
This effect is even more pronounced when we isolate to the subset of installs and new profiles that qualify as "mozorg windows funnel" (defined here).
Here is the ratio of mozorg windows funnel installs vs new profiles:
Overview NP and Installs MozOrg Only
From May 4th to May 19th 2023, we see the ratio of new profiles to new installs drop from the 90s to the 70s. That's because one of the qualifiers for mozorg windows funnel is startup_profile_selection_reason = 'firstrun-created-default'
.
New profiles that would normally report startup_profile_selection_reason = 'firstrun-created-default'
seem to be reporting "restart" or "default" instead over this period.
deepdive miscategorized pings - Startup reason
After this period, behavior seems to return to normal.
Can we find out what happened over this period in the product (installer, pingsender, etc.) that caused new profile pings to report "restart" or "default" as the profile selection reason instead of 'firstrun-created-default'?
Reporter | ||
Comment 1•8 months ago
|
||
After speaking with Client Telemetry DE, it sounds like New Profile Pings should never be reporting "default" or "restart" for startup_profile_selection_reason.
However, it's happening very regularly. Not just in the incident cited above, but day to day.
So this sounds like a straight up bug.
Request:
- Fix NPPs so they stop reporting "default" or "restart" for startup_profile_selection_reason
- Investigate and explain the edge cases that are leading to these values being reported currently
- Or, provide explanation for why this is happening if it is indeed expected behavior (and not a bug)
Reporter | ||
Updated•8 months ago
|
Reporter | ||
Comment 2•8 months ago
|
||
Blocking these 2 Jira tickets:
Comment 3•8 months ago
|
||
That value is set here: https://searchfox.org/mozilla-central/rev/669fac9888b173c02baa4c036e980c0c204dfe02/toolkit/profile/nsToolkitProfileService.cpp#570-571
mStartupReason
is determined in this function: SelectStartupProfile
.
After speaking with Client Telemetry DE, it sounds like New Profile Pings should never be reporting "default" or "restart" for startup_profile_selection_reason.
That is my assumption, but I don't know the code good enough to understand under what circumstances it ends up being default
or restart
for a fresh profile.
Regardless that particular piece of code and data collection is not owned by us (we should probably also move the bug to the right component after).
Dave, are you the right person for this?
Comment 4•8 months ago
|
||
(In reply to Jan-Erik Rediger [:janerik] from comment #3)
That value is set here: https://searchfox.org/mozilla-central/rev/669fac9888b173c02baa4c036e980c0c204dfe02/toolkit/profile/nsToolkitProfileService.cpp#570-571
mStartupReason
is determined in this function:SelectStartupProfile
.After speaking with Client Telemetry DE, it sounds like New Profile Pings should never be reporting "default" or "restart" for startup_profile_selection_reason.
That is my assumption, but I don't know the code good enough to understand under what circumstances it ends up being
default
orrestart
for a fresh profile.
Regardless that particular piece of code and data collection is not owned by us (we should probably also move the bug to the right component after).
There is at least one case where it makes sense to submit default
as the value. That is when the user crashes in the first 30 minutes of using Firefox. In this case the new profile ping will be submitted in their second session using the startup reason for that second startup which would likely be default
. Hopefully that happens less often than we are seeing.
The restart
case is more odd. From the code we can only set that if the XRE_PROFILE_PATH
environment variable is set, which we only really set when restarting the app, but the new profile ping is meant to be submitted on shutdown in that case so should have the correct value. Possibly someone is setting that environment variable in the wild but unlikely to be in numbers large enough for us to really see this.
Seeing both of these more commonly suggests that something is broken in our attempt to submit the new profile ping during the first session. Perhaps the code that deals with submitting the ping on shutdown isn't working, in which case any restart of Firefox, or quit and relaunch within that first 30 minute window would cause us to see default
and restart
frequently.
Reporter | ||
Comment 5•8 months ago
|
||
Ah, interesting. Thanks so much Dave!
Yes, these seem far too common for the expected behavior to explain.
For "default", we see this happen about 15% of the time (currently, with up to 40% of the time for 6 months in 2023), and for "restart", we see this happen about 3% of the time (currently, with one month spikes of up to 6% and 10% in 2022 and 2023).
Prevelance of Default and Restart in NPP
One thing that doesn't make sense is that we see the prevalence of "default" and "restart" spike temporarily sometimes, these sudden changes don't align with Firefox release dates.
If this is due to a bug with how New Profile Pings submit on shutdown isn't working, would that code be determined by the Firefox version? Or is that handled by Pingsender? Is Pingsender versioned independently of Firefox versions?
In either case, this sounds like a bug we should fix. Which component should I file this under to get it in the right place?
Comment 6•8 months ago
|
||
(In reply to Su-Young Hong from comment #5)
Ah, interesting. Thanks so much Dave!
Yes, these seem far too common for the expected behavior to explain.
For "default", we see this happen about 15% of the time (currently, with up to 40% of the time for 6 months in 2023), and for "restart", we see this happen about 3% of the time (currently, with one month spikes of up to 6% and 10% in 2022 and 2023).
Prevelance of Default and Restart in NPP
One thing that doesn't make sense is that we see the prevalence of "default" and "restart" spike temporarily sometimes, these sudden changes don't align with Firefox release dates.
Do we also see spikes of other reasons, i.e. are more users creating profiles than normal during those spikes?
If this is due to a bug with how New Profile Pings submit on shutdown isn't working, would that code be determined by the Firefox version? Or is that handled by Pingsender? Is Pingsender versioned independently of Firefox versions?
No, pingsender ships as part of Firefox.
In either case, this sounds like a bug we should fix. Which component should I file this under to get it in the right place?
The submission of the new profile ping is handled by the telemetry code. I can't see any way that the startup reason could be set to restart
inadvertently so I guess I would start looking at the ping submission. I did take a quick look myself but couldn't see anything obviously wrong.
Reporter | ||
Comment 7•8 months ago
|
||
No, not linked to a general spike of New Profiles. The spikes seem specific to the reasons. See this chart: https://sql.telemetry.mozilla.org/queries/101703/source#250575
Hmm, that just makes it more curious.
Comment 8•8 months ago
•
|
||
The spike in "default"-reason "new-profile" pings from comment 7's https://sql.telemetry.mozilla.org/queries/101703/source#250575 resembles very closely a Firefox Desktop adoption+abandonment curve. The adoption begins around 2022-12-11 which lines up with Fx108 hitting release. The abandonment starts on release week for Fx114 (2023-06-06). This is evidence consistent with a code change riding the 108 train being a culprit, with the behaviour remedied in Fx114.
Also: though the upload of the "new-profile" ping is handled by Legacy Telemetry code, it doesn't (and arguably can't) change the startup_profile_selection_reason
which is inside the payload. That's handled entirely inside nsToolkitProfileService as Jan-Erik noted in comment 3. (( There were no code changes to nsToolkitProfileService
between July of 2022 and May of 2023 meaning no changes to the code under instrumentation in any of the affected versions, for what that's worth. It is possible there were changes to configuration, though. ))
Also, since I was curious, I noticed the spike is cross-platform: all three of Windows, Mac, and Linux are equally affected with identical timing (query). A similar look across countries shows a wide geographical impact with no clear culprit country. This evidence is inconsistent with a theory that it's another instance of One Weird Client or One Weird Datacentre causing an outsized effect (see BrowserStack, Aurora 51, French Schoolchildren, and many many other historical instances of narrow populations with outsized effect)
Moving to Startup & Profiles since that's the code under instrumentation.
Comment 9•8 months ago
|
||
(In reply to Chris H-C :chutten from comment #8)
Also: though the upload of the "new-profile" ping is handled by Legacy Telemetry code, it doesn't (and arguably can't) change the
startup_profile_selection_reason
which is inside the payload. That's handled entirely inside nsToolkitProfileService
This is true, but the scheduling the the ping is handled by Legacy Telemetry code and that is my best guess for where the issue lies. If we fail to send it in the first run of Firefox (user doesn't keep running for more than 30 minutes and the code that handles sending on shutdown somehow doesn't run) then we will send the ping in a subsequent run at which point the startup reason will have changed.
Comment 10•8 months ago
|
||
Oh, I see what you mean. It could be that code near sendNewProfilePing() is more relevant to the discussion, in which case it definitely would be Toolkit :: Telemetry
.
Alas, the "new-profile" ping doesn't contain any instrumentation that'd tell us whether it's the first session or not (it'd be handy if it had the profileSubsessionCount
in there at least, though that's very "main"-ping specific). You'd think the top-level ping reason
("startup" v. "shutdown") proportion would provide some clues, but a quick check shows that the proportion remains pretty constant across the '22-'23 spike.
To your point of 30mins, that's not 100% correct. The "new-profile" ping is scheduled to be submitted inside TelemetryController
's "delayed init", which happens 60s after profile-after-change
. It's true we only record the bookkeeping that it's been sent after submitExternalPing
's returned promise resolves, but that resolves on ping save if we're early in the first session. So we can assume human users who've allowed the browser to run more than a couple of minutes have mostly guaranteed to have submitted and persisted their "new-profile" ping in that first session.
Now, if something goes wrong with the bookkeeping, we could submit and send additional "new-profile" pings later as well. We've seen these sorts of weird I/O failures in small JSON files before (bug 1697045 and friends). Since all these analyses have been per-client-day-reason, we'd not necessarily notice if the same client sent us multiple "new-profile" pings across multiple days, or on the same day with multiple reasons.
Ah, and wouldn't you know it, only half of "new-profile" pings during the '22-'23 spike are solitary. Half of all "new-profile" pings are likely coming from subsequent sessions.
Probably best to only pay attention to the first "new-profile" ping from a given client, then. Passing to Data :: General
and Su to figure out what to do with this information.
Comment 11•8 months ago
|
||
One thing I suggested elsewhere is that it may make sense to gather the new profile ping during the first session so that if we end up submitting it in a later session we are still submitting what would be the original data.
Comment 12•8 months ago
|
||
That is indeed how the code is written. During delayed init (60s after profile-after-change) we assemble the ping as part of submission and upload it if possible, saving it for subsequent sending if not. The trick is that there appears to be a fault in detecting that there's been a "new-profile" ping uploaded or saved, so the same client can send more than one.
(( Not one to miss an opportunity to talk about Glean, Glean solves this problem through its use of persistent storage for metrics data and makes it trivially detectable in data through ping_info.seq
. It also has C++ APIs for ping submission, so there's no need to farm out to TelemetryControllerParent
for something that more properly should be scheduled by nsToolkitProfileService
. ))
Comment 13•8 months ago
|
||
I am incorrect and :mossop is correct: I missed a step. Though scheduleNewProfilePing
happens during delayed init, I glossed straight over the (default) 30min send delay between it and sendNewProfilePing
.
So it looks like a "new-profile" ping will not collect the information supplied by nsToolkitProfileService
until at least 31mins after profile-after-change, which is quite some time. But in the event of a shutdown, we do finalize the 30min delay early, so if the session's short, we will still make an effort. This is how we get "new-profile" pings with payload.reason = 'shutdown'
, which account for between 1.5x-2x as many pings as those with payload.reason = 'startup'
(which is the ping reason when the first session lasts long enough).
So though I was wrong (and not just a little bit), I ended up still providing some correct information: We do collect, assemble, and submit (a word here meaning "call submitExternalPing
") the "new-profile" ping in the first session.
Reporter | ||
Comment 14•7 months ago
•
|
||
For clients sending "default" reason in New Profile Pings, Chutten's intuition that most of these are coming from users sending a second New Profile Ping on a later session is correct.
Most of the enormous spike in "default" reason New Profile Pings in 2023 (see scary blue line here: https://sql.telemetry.mozilla.org/queries/101818/source#250852) are extraneous pings coming from profiles that have already sent us a New Profile Ping with the correct startup reason (firstrun, etc.).
I don't think we need to worry too much about this group, since the New Profile definition in our downstream datasets de-dupes these.
There is a population of users who send us a "default" reason New Profile Ping as their first New Profile Pings. Before August 2022, this group was about 5-6% of new profile ping clients (deduped). After August 2022, we see a big spike (incident?) and then the level shifts to about 10%.
- see: https://sql.telemetry.mozilla.org/queries/101828/source#250881
- see: https://sql.telemetry.mozilla.org/queries/101849/source#250926
That still seems like a lot. Would "Firefox crashed before NPP got sent" be common enough to explain 10% (of daily new client ids that send us a New Profile Ping that day)?
Comment 15•4 months ago
•
|
||
I noticed that OS pre-installs seem to be a particularly strong driver in "default" - see https://sql.telemetry.mozilla.org/queries/104298/source#256890
Maybe it helps understand the root cause?
Description
•