Focus Android appears to be, in some infrequent cases, unable to promptly upload pings that have been submitted
Categories: Data Platform and Tools :: Glean: SDK (task, P1)
Tracking: Not tracked
People: Reporter: chutten; Assigned: chutten
Attachments: 1 file
In Focus Android we've noticed some differences between how Legacy Telemetry pings (the "core" ping) and Glean pings (the "baseline" ping with reason "inactive") arrive at the Data Platform. As you can see in the attached screenshot, for some (as-yet-unknown) number of clients that are active on many days, the Data Platform did not receive any Glean SDK pings for sequences of several days. However, after those periods of quiet, the Glean SDK-sent pings appear to "catch up": all the previous days' pings arrive within the same day.
It is known and acknowledged that, though the trigger for submitting Glean "baseline" pings with reason "inactive" and Legacy "core" pings is the same, the mechanisms used to upload them are different. Legacy uses an older API that trades power for reliable/frequent upload. Glean was asked, in Fenix, not to wake up the device with its uploads. So is it possible that this is an understandable difference between the two approaches?
It could be part of it... but Glean will send pending pings on any session that is long enough, and the periods of quiet in some of the examined ping records contain session durations in the thousands of seconds. The Glean SDK ought to have had plenty of opportunity to send pings.
This bug is about looking into the character of this effect. Can we discern from available data whether this effect is due to a code fault that we can fix? Can more data help? Will other fixes change the behaviour?
Expect: Exploratory Data Analysis, Reading Focus Android code, Wondering How Android APIs Work.
Comment 1 • 2 years ago (Assignee)
An approach I've already tried was to find a population-wide way to detect these clients (in the absence of a proper linking mechanism like the one bug 1805256 will provide) by comparing, per client, the number of distinct days on which the Data Platform received a ping (distinct client_id, DATE(submission_timestamp)), which contributes to CDOU, against the number of distinct days on which the client says it submitted a ping (distinct client_id, DATE(ping_info.parsed_end_time)), which contributes to activity-CDOU. If Glean is submitting pings reliably but not uploading them reliably, then the ratio of activity-CDOU to CDOU should be high.
The highest proportion of clients whose activity-CDOU to CDOU ratio exceeded 2 was amongst those measured via "baseline" pings with reason "inactive" from Focus Android:
- Focus Android via "baseline" pings with reason "inactive": 0.66%
- Focus Android via "core" pings: 0.45%
- Fenix via "baseline" pings with reason "inactive": 0.15%
This suggests to me that there is something unique to Focus Android amongst products and to "baseline" pings with reason "inactive" amongst data collection systems within Focus Android.
With bug 1805256 incoming, I'm not sure how useful this information might be, but I felt I should write it up since I've referenced it in meetings. The next step on this might be to wait a month or so and see what we can learn when cross-system linking becomes possible.
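For illustration, a minimal sketch of the sort of per-client comparison described above (this is not the query I actually ran; the table name, date range, and filter placement are assumptions):

```sql
-- Sketch only: per-client comparison of server-receipt days vs. client-reported
-- activity days for Focus Android "baseline" pings with reason "inactive".
-- The table name and date range are assumptions, not the real query.
WITH per_client AS (
  SELECT
    client_info.client_id AS client_id,
    COUNT(DISTINCT DATE(submission_timestamp)) AS received_days,      -- feeds CDOU
    COUNT(DISTINCT DATE(ping_info.parsed_end_time)) AS activity_days  -- feeds activity-CDOU
  FROM `mozdata.org_mozilla_focus.baseline`
  WHERE DATE(submission_timestamp) BETWEEN '2022-11-01' AND '2022-12-15'
    AND ping_info.reason = 'inactive'
  GROUP BY client_id
)
SELECT
  -- Clients reporting more than twice as many activity days as receipt days
  COUNTIF(activity_days > 2 * received_days) / COUNT(*) AS share_of_suspected_batchers
FROM per_client
```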
Comment 2 • 2 years ago (Assignee)
(See Also Brad's robust analysis here: https://docs.google.com/document/d/1SNS3yoCIZZ5_2Idbmj8Dbs8mYdK5EU0RVjrj1XyjYAc/edit# )
Comment 3 • 2 years ago
Bumping this to P1/S1. This is of high importance to our team.
Comment 4 • 2 years ago
One idea I had, though it's unlikely to actually work:
Look at the network errors reported. If uploads of the "baseline" pings are being attempted and failing, the failure counts should increase in each subsequent ping.
Problem: network errors are only reported in the "metrics" ping, so we either need to see whether the same gaps happen for "metrics" pings, or whether we can overlap those error counts with the "baseline" pings enough.
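A sketch of what checking those counts might look like against the "metrics" ping tables (the metric column glean_upload_ping_upload_failure, the table name, and the client_id placeholder are assumptions about the BigQuery schema, not a verified query):

```sql
-- Sketch only: per-day upload-failure counts recorded by Glean and reported in
-- "metrics" pings. Metric path and table name are assumptions.
SELECT
  DATE(submission_timestamp) AS submission_date,
  failure.key AS failure_type,        -- e.g. "recoverable", "unrecoverable"
  SUM(failure.value) AS failures
FROM `mozdata.org_mozilla_focus.metrics`
CROSS JOIN UNNEST(metrics.labeled_counter.glean_upload_ping_upload_failure) AS failure
WHERE DATE(submission_timestamp) BETWEEN '2022-09-01' AND '2022-11-30'
  AND client_info.client_id = 'EXAMPLE-CLIENT-ID'  -- placeholder: the client under investigation
GROUP BY submission_date, failure_type
ORDER BY submission_date, failure_type
```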
Comment 5 • 2 years ago
For those specific clients we indeed also don't get "metrics" pings in that timeframe: https://sql.telemetry.mozilla.org/queries/89253/source#221008
Comment 6 • 2 years ago
Looking at the network errors, we see a spike in recoverable errors: https://sql.telemetry.mozilla.org/queries/89254/source#221010
(time frame picked randomly based on gaps in https://sql.telemetry.mozilla.org/queries/89113/source#220665)
Notably, for that client there are no "metrics" pings between Sep 24 and Oct 3 (inclusive) or between Oct 29 and Nov 5 (inclusive).
Network errors show up in pings on Oct 5, 13, 15, and 16, and on Nov 9 and 16.
That doesn't even neatly line up with the missing days. I'm more confused.
Comment 7 • 2 years ago (Assignee)
Line of Investigation: Please check whether the uploader is configured the same between Focus Android and Fenix.
Comment 8 • 2 years ago
(In reply to Chris H-C :chutten from comment #7)
> Line of Investigation: Please check whether the uploader is configured the same between Focus Android and Fenix.
Yes, there is a minor difference, it seems:
Focus sets usePrivateRequest, whereas Fenix does not.
Comment 9 • 2 years ago
Hey Chris, did you look further into the differences from comment 8?
Comment 10 • 2 years ago (Assignee)
Yes, and the use of private requests doesn't look like it would have an impact on the number of network errors we're seeing (but we'd need a Necko person to step in to make sure I'm not off the rails here).
Setting usePrivateRequest: true gets you down to this part of the WebExecutor if you're using GeckoViewFetchClient, which (as far as I can tell) both Fenix and Focus Android use. The way I read this, it tells the channel to behave as it would in Private Browsing, and for that it'll use cookie jar settings appropriate for PBM (based on the pref network.cookie.cookieBehavior.pbmode, which on my Firefox Desktop Beta is 5, i.e. REJECT_TRACKER_AND_PARTITION_FOREIGN).
This appears only to make us stricter in what we'll accept content-wise, and since we only POST to incoming.tmo, which is a host that doesn't care about cookies, I'm not sure how changes to cookie policy or storage principal would affect an upload.
This may be a dead end. Though it might be fun to hook that up to a NimbusFeature and A/B it.
Comment 11 • 2 years ago (Assignee)
While we're waiting for the legacy client_id (uplift approved and merged) I'm still looking at things in terms of how activity manifests itself in data. In other words, I've looked at CDOU : activity-CDOU again.
A reminder that the "batching+catchup" behaviour results in a bunch of days' pings all arriving on our server on the same day, artificially reducing the "days seen". A signature for this is having more "activity days seen" (which, when summed, I call "activity-CDOU") than "submission days seen" (which, when summed, I call "CDOU") (a recast of the old submission_date/activity_date discussions). Previous investigations have found differences within Focus Android and between Focus Android and other products in the proportion of clients having low ratios of CDOU : activity-CDOU
... but I was only looking at summary statistics. I wanted a better view of the populations.
So I threw them all into a CDF. What you're looking at is how much of the population has a ratio of CDOU : activity-CDOU lower than the value on the x-axis. This is helpful for finding out how much of the client population is contributing too few days to CDOU-like measures. I'll draw your attention to three areas:
- ratio < 0.6: You can see the gap where Focus Android clients' activities measured via Glean "baseline" pings with reason "inactive" (glean-baseline-inactive) are higher on the graph. This shows it has a larger proportion of terrible ratios than anything else (as was previously reported). The crossover point appears to be around 0.5.
- ratio = 0.99: The values at this x-value show the proportion of clients in November who were undercounted. Legacy "core"-measured users were far and away the worst represented here, with 14.6% of the user base having a deflationary impact on CDOU. (How much of an impact depends both on how far away the client is from 1.00 and on how many days that particular client was seen/had activity.)
- ratio = 1.00: The values at this x-value show the proportion of clients in November counted, at best, fairly. Subtracting the 0.99 values, we can find the proportions that are counted "correctly" (within 0.01 of ratio = 1.00) by CDOU-style measures:
- Focus Android Legacy "core": 0.90166 - 0.14561 = 75.6%
- Focus Android Glean "baseline" pings with reason "inactive": 0.97222 - 0.06150 = 91%
- (Focus iOS: 87.6%, Fenix: 92.3%, Firefox Desktop: 96.6%)
- ratio = 1.00 (yes, again): Given that the values at this x-value show the proportion of clients in November counted, at best, fairly, it follows that 1.00 minus the value at this x shows the proportion of clients exerting inflationary pressure on CDOU-style measures. Of course, how much effect this has on the final value requires taking into account how much each client is above ratio = 1.00 and how many days that client was seen/had activity. (Of note: all populations have "three nines" of their population no worse than ratio = 2.00.)
- Focus Android Legacy "core": 9.8%
- Focus Android Glean "baseline" pings with reason "inactive": 2.8%
- (Focus iOS: 2.8%, Fenix: 3.6%, Firefox Desktop: 1.3%)
In conclusion, there's evidence that at least some of the discrepancy between measurement systems can be explained by Legacy "core" having nearly 10% of its population inflating its activity by reporting the same day's activity on pings received on multiple different days. Glean's comparable reporting has less than a third of the population showing that effect.
If this is an interesting point of analysis, the next step I would consider would be to look at this in terms of proportion of days seen instead of in proportion of population... in fact, now that I think on it, it should be rather easy to extend the query to look at those. So I did that. But I was past my EOW when I did that, so analysis will have to wait. It's enough for now to note that the proportions we're seeing wrt total CDOU are more or less the same as the previous CDF's proportions of client_count (ie, that I didn't waste too much time writing up that report instead of writing and running this better query).
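For the curious, a sketch of how a CDF like the one above could be assembled from per-client ratios (not the actual query; the table name, date range, and bucketing are assumptions):

```sql
-- Sketch only: cumulative share of clients at or below each CDOU : activity-CDOU
-- ratio, which is what the CDF described above plots. Table name is an assumption.
WITH per_client AS (
  SELECT
    client_info.client_id AS client_id,
    COUNT(DISTINCT DATE(submission_timestamp))
      / COUNT(DISTINCT DATE(ping_info.parsed_end_time)) AS ratio
  FROM `mozdata.org_mozilla_focus.baseline`
  WHERE DATE(submission_timestamp) BETWEEN '2022-11-01' AND '2022-11-30'
    AND ping_info.reason = 'inactive'
  GROUP BY client_id
),
bucketed AS (
  SELECT ROUND(ratio, 2) AS ratio_bucket, COUNT(*) AS clients
  FROM per_client
  GROUP BY ratio_bucket
)
SELECT
  ratio_bucket,
  -- Running total of clients up to this bucket, as a share of all clients
  SUM(clients) OVER (ORDER BY ratio_bucket) / SUM(clients) OVER () AS cumulative_share_of_clients
FROM bucketed
ORDER BY ratio_bucket
```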
Comment 12 • 2 years ago (Assignee)
Now that we have a month's data (albeit over calendar-ending holidays), let's take a look at these batching clients, shall we?
...uh, well that's weird. Over the first fourteen days of the year, looking at all the Legacy "core" pings or Glean "baseline" pings from all Focus Android clients that said anything... there are over 3x as many client-days with Glean "baseline" pings but no Legacy "core" pings as there are the reverse. (query here)
Which is to say: if batching is a problem that is reducing the number of days seen... in modern builds of Focus from clients in the first two weeks of the year, it weirdly appears that Legacy's the one with the problem, not Glean?
Brad, could you check over my query to see if anything's amiss in there? It seems as though there's a much smaller than expected effect. Does this match differences in days_seen if we similarly only look at modern builds?
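For readers without access to the linked query, a rough sketch of what this sort of client-day pairing could look like (the legacy client_id metric path, the table names, and the Focus filter are assumptions, not the actual query):

```sql
-- Sketch only: count client-days seen by each system, pairing on the legacy
-- client_id that recent builds record in the "baseline" ping. The metric path
-- and both table names are assumptions (use whatever bug 1805256 actually added).
WITH core_days AS (
  SELECT DISTINCT client_id, DATE(submission_timestamp) AS day
  FROM `mozdata.telemetry.core`
  WHERE DATE(submission_timestamp) BETWEEN '2023-01-01' AND '2023-01-14'
    AND app_name = 'Focus'  -- assumed field; adjust to the real schema
),
baseline_days AS (
  SELECT DISTINCT
    metrics.uuid.legacy_telemetry_client_id AS client_id,  -- hypothetical field path
    DATE(submission_timestamp) AS day
  FROM `mozdata.org_mozilla_focus.baseline`
  WHERE DATE(submission_timestamp) BETWEEN '2023-01-01' AND '2023-01-14'
)
SELECT
  COUNTIF(c.client_id IS NULL) AS baseline_only_client_days,
  COUNTIF(b.client_id IS NULL) AS core_only_client_days,
  COUNTIF(c.client_id IS NOT NULL AND b.client_id IS NOT NULL) AS both_client_days
FROM core_days AS c
FULL OUTER JOIN baseline_days AS b
  ON c.client_id = b.client_id AND c.day = b.day
```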
Comment 13 • 2 years ago
Thanks for putting this together, Chris! The query looks correct to me. Pairing the client_ids is a really useful way to slice the data; it seems that if we only receive a ping from a single telemetry system on any given client-day, it's most likely that the ping will be from Glean. I don't think that's something we could have stated with confidence before. My only question with the query is: why only look at the first two weeks of the year instead of all days after the instrumentation was implemented?
I think this is consistent with the story we get when we look at the "lag days" plots; I've updated the "lag days" query to now differentiate between the "old builds" (that we would have been looking at before) and the "new builds".
What we see, which is consistent with what you're reporting, is that Glean is much more likely to report data on the correct day -- at least as much as we can determine what the "correct day" is. 97% of the rows that Glean reports have lag in days = 0, compared to 86% for Legacy. I think the issue comes in when we have more than 7 days' lag in reporting, which is when we'd start to see differences in reporting for activity_segment. For Glean, 0.17% of rows have more than 7 days' reporting lag, whereas for Legacy it's 0.07% (I did that calculation by hand because I didn't want to add another subquery, but you can also see that result in the histograms, where Glean has higher bars when the lag is long). These are admittedly small percentages, but I think they're still our best explanation for the activity segment differences.
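Roughly, a lag-days calculation in this spirit might look like the sketch below (not the actual "lag days" query; the table name and the use of parsed_end_time as the activity-date proxy are assumptions):

```sql
-- Sketch only: distribution of lag (in days) between the activity date a
-- "baseline" ping reports and the date the server received it.
SELECT
  DATE_DIFF(DATE(submission_timestamp), DATE(ping_info.parsed_end_time), DAY) AS lag_days,
  COUNT(*) AS rows_at_lag,
  COUNT(*) / SUM(COUNT(*)) OVER () AS share_of_rows
FROM `mozdata.org_mozilla_focus.baseline`
WHERE DATE(submission_timestamp) BETWEEN '2023-01-01' AND '2023-01-14'
GROUP BY lag_days
ORDER BY lag_days
```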
It's possible that the modified "lag days" query is missing part of the story; please do let me know if you think there could be important information that it's not accounting for.
I'm not sure what the next steps would be. In the spirit of brainstorming solutions, can we confirm either of these statements?
A - Glean's reporting lag is the same or better than Legacy's (i.e. the "lag days" query logic is somehow misleading)
B - Glean's reporting lag is different, but this is expected behavior given the way Glean's/Legacy's ping sending is implemented
I think your CDF plots are getting at something like (A), since Legacy is inflating its activity, but I don't think I follow the argument all the way through.
Comment 14 • 2 years ago (Assignee)
(In reply to Brad Ochocki from comment #13)
> My only question with the query is: why only look at the first two weeks of the year instead of all days after the instrumentation was implemented?
Oh, I was assuming the query'd take absolute ages if I looked any wider. I've forked it to look at everything since Dec 16 and the numbers just double, so the proportions appear to be stable over the available data (and it didn't take any longer, so my concerns were unfounded. Oh well, it won't be the first thing I'm wrong about.).
> I think this is consistent with the story we get when we look at the "lag days" plots; I've updated the "lag days" query to now differentiate between the "old builds" (that we would have been looking at before) and the "new builds".
Verrry interesting plot. I see no problems with the lag days query, and I see what you mean with your explanations. It's lacking a client dimension which would give more strength to your comment about activity segments: that pings are more likely to be received either on time or very late is important, but we don't show how often "on time + very late" is exhibited by the same client (ie, the batching). 0.17% vs 0.07% is infuriatingly too small to account for the size of segment change we were seeing before, but maybe in new builds the segments are that close together.
(I don't think we need to strengthen the activity segment comment. It's a shame we don't have build ids in unified_metrics... ah, what the heck. I can build it from pieces (see below).)
But to your specific statements:
> A - Glean's reporting lag is the same or better than Legacy's (i.e. the "lag days" query logic is somehow misleading)
Well, it definitely is better on average (look at all those same-day ping receipts: 97% vs 86%), but when there's a delay of more than one day, Glean's is more likely to be worse (on a per-ping basis).
> B - Glean's reporting lag is different, but this is expected behavior given the way Glean's/Legacy's ping sending is implemented
Well I am beginning to expect it... but that may be because I've been living with this for a while. We know there are differences between the mechanisms that legacy and Glean use to upload their pings. They're differences within Android, so rather unfortunately I can't be terribly certain they're to blame (even if I found and read and understood the code, I probably wouldn't be able to understand how it'd exhibit differences at scale).
> I'm not sure what the next steps would be.
You and me both.
Me, I'd like to stop looking at this because it makes my head hurt. Do you think we are at a point where we can stop with the plots and say Glean's behaviour on Focus Android is satisfactory for DS' needs?
For extra credit, I hacked together an "only new build ids" version of your activity segment query, and it is rather different: there is no longer much of a difference at all between counts of "core" clients, and it really shows what we discovered last year: that Glean is really good at counting infrequent clients via "baseline" pings (and that it counts more of every kind of client).
However, I'm not 100% behind this query: I'm not sure the weekly combination is helping us here (maybe a seven-day running average would be nicer), I'm not sure whether the size of the population and duration they've had the fix will be enough to conclude anything... Brad, what do you think? Is this a sign that things are good/better enough?
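(A sketch of the sort of "new builds only" restriction described above, filtering on app version as a stand-in for build ids; the version cutoff is a hypothetical placeholder, and the table and field names are assumptions:)

```sql
-- Sketch only: restrict analysis to "new" builds. The major-version cutoff (108)
-- is a hypothetical placeholder; substitute the first release that actually
-- contains the fix/instrumentation of interest.
SELECT DISTINCT
  client_info.client_id AS client_id,
  DATE(submission_timestamp) AS submission_date
FROM `mozdata.org_mozilla_focus.baseline`
WHERE DATE(submission_timestamp) >= '2022-12-16'
  AND SAFE_CAST(SPLIT(client_info.app_display_version, '.')[SAFE_OFFSET(0)] AS INT64) >= 108
```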
Comment 15 • 2 years ago
Thanks for the follow-up Chris, I think we're on the same (headache-inducing) page.
The new plot you put together is extremely encouraging -- Glean seems to be more likely to count infrequent clients, and we don't see activity segment differences for the most important group (core users). Additionally, the fact that Glean counts many more "day-of" pings is extremely useful since we're moving to DAU this year, and we know there are caveats around inferring activity date from ping receipt date. Glean being better at reporting pings when they happen is a valuable improvement in that respect.
From my perspective, we've done due diligence here. I really appreciate the Glean team implementing the client_id pairing in Glean and putting so much thoughtful effort into helping me understand the differences between the systems.
I'm happy to close this bug and to move forward with the recommendation to rely on Glean telemetry.
Comment 16 • 2 years ago (Assignee)
Excellent. Then we are agreed. Glad this is working out, and glad we now know of a few more ways to look at data.