Closed Bug 1651110 Opened 4 years ago Closed 4 years ago

Validate FOG in Nightly

Categories

(Toolkit :: Telemetry, task, P1)

RESOLVED FIXED

People

(Reporter: chutten, Assigned: chutten)

References

Details

(Whiteboard: [telemetry:fog:m5])

As of the end of M4 we have the capability to, via custom pings, validate the operation of FOG in the wild. Let's do that.

The minimal things we should check include:

  • Errors
  • Population counts
    • comparing to Telemetry will be fuzzy until we get C++ and JS APIs, but for now we can do broad strokes counts of per-platform, per-architecture, per-locale sorts of things

With confidence that things aren't going awry (we're not initing in more than one process, we're not experiencing bad errors, we're not missing most of the population, etc) we'll be able to let FOG ride the trains.

Blocks: 1651111
Depends on: 1654339
Assignee: nobody → chutten
Status: NEW → ASSIGNED
Depends on: 1658359
Depends on: 1664461

"fog-validation" pings are, as of the most recent schema deploy, flowing in. Earlier today I spotted 88k pings from about 16k clients in the live tables.

Analysis should be against the stable tables and can begin to be written tomorrow. Given Telemetry's known timing character, we may wish to delay the deciding analysis to later in the week.
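
For reference, the basic rollup I'm eyeballing is roughly this shape (a sketch only: the stable view name and the date cutoff are assumptions on my part, not necessarily what the eventual analysis queries will use):

  -- Per-day ping and client counts for "fog-validation" pings.
  -- Table name is an assumption; adjust to wherever the stable view lives.
  SELECT
    DATE(submission_timestamp) AS submission_date,
    COUNT(*) AS pings,
    COUNT(DISTINCT client_info.client_id) AS clients
  FROM `moz-fx-data-shared-prod.firefox_desktop.fog_validation`
  WHERE DATE(submission_timestamp) >= '2020-10-01'  -- illustrative cutoff
  GROUP BY submission_date
  ORDER BY submission_date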

A casual look at 20 sampled pings and a rollup of all the client_info data shows:

  • os, device_manufacturer, and device_model are all set to "unknown" when they shouldn't be.
    • os should be correct, and the others should be absent. This is due to us waiting on an updated SDK version and forgetting to take out our coverups when we updated. Oops. See bug 1672184 for the fix.
  • Already getting some pings from a "release" channel.
    • Only from Linux. This means that they have APP_UPDATE_CHANNEL set to "release" (whatever that means for their distro specifically) but their build milestone is_nightly. Maybe we should prioritize collecting some indication of whether the update url is non-Mozilla, or emphasize in FOG's docs that app_channel means the "Default Update Channel" and therefore might be weird on distributed Firefoxen.
  • Speaking of channel, got some pings from "nightly-autoland" as well.
    • This must've been one heckuva long test run, as these pings aren't sent for an hour after a session starts.
    • I'm reaching out to #build:mozilla.org to see if we can set up Source Tagging sooner rather than later.
  • Locale is blank (as expected).
    • See bug 1626969 for discussion about what locale should be and how it might not fit in the client_info.
  • normalized_app_name is blank.
    • I... am confused by this column. I'm reaching out to DE about how to get this column to have a proper value.

So far so "good" : D

See Also: → 1672191
See Also: → 1672184

Durations of the pings look more-or-less fine: https://sql.telemetry.mozilla.org/queries/75690/source#188840

Some handfuls of weirdos posting sub-one-hour timings. Some thousands of weirdos posting > 16h timings. Mostly, though, it's people reporting within 5min of exactly an hour's duration.

See Also: → 1672455

Alrighty, taking a closer look now.

Client Counts

It's a tricky business trying to validate Glean-sent pings by comparing them to Telemetry-sent pings... just ask anyone on our team who's done this before on mobile.

To try and get the most accurate comparison between these two systems, I've had to filter to only "main" pings with at least one hour of session time on the clock (since "fog-validation" pings are only sent hourly). Unfortunately, even with that filter, FOG is reporting from more clients than Telemetry is.

How much more? About 5.4% over this past week, starting at 30% high and then going down to 6.4% high per-day.
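
(For the curious, the comparison is roughly of this shape. This is a sketch, not the exact query: the table names and the subsession_length field path are assumptions.)

  -- Per-day client counts: FOG "fog-validation" pings vs. Telemetry "main"
  -- pings with at least an hour of subsession time.
  WITH fog AS (
    SELECT
      DATE(submission_timestamp) AS day,
      COUNT(DISTINCT client_info.client_id) AS fog_clients
    FROM `moz-fx-data-shared-prod.firefox_desktop.fog_validation`  -- assumed
    WHERE DATE(submission_timestamp) BETWEEN '2020-10-24' AND '2020-10-30'
    GROUP BY day
  ),
  main AS (
    SELECT
      DATE(submission_timestamp) AS day,
      COUNT(DISTINCT client_id) AS main_clients
    FROM `moz-fx-data-shared-prod.telemetry.main`
    WHERE DATE(submission_timestamp) BETWEEN '2020-10-24' AND '2020-10-30'
      AND normalized_channel = 'nightly'
      AND payload.info.subsession_length >= 60 * 60  -- seconds
    GROUP BY day
  )
  SELECT
    day,
    fog_clients,
    main_clients,
    SAFE_DIVIDE(fog_clients - main_clients, main_clients) AS relative_diff
  FROM fog
  JOIN main USING (day)
  ORDER BY day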

This is weird. I don't expect "fog-validation" pings to be getting to us at any greater a rate than Telemetry. In fact, I expected the opposite, since FOG's code hasn't had nearly as much time to have reliability bugs found and fixed in it.

Now, we know for a fact that there's some profile inflation going on in Telemetry. Perhaps as much as 2.5% of pings are coming from profiles that share a client_id. That's not enough to account for the full difference here unless inflation is particularly bad on Nightly.

I think we'll have to start adding the Telemetry client_id to the "fog-validation" pings so we can more easily narrow down whether this is a timing effect, a profile inflation effect, or something completely else.

But let's carry on for now.

OS

When we look at the figures split by OS, we notice that the largest of FOG's client count gains (proportionally speaking) are coming from earlier versions of the OS families. The one particularly egregious example is Windows 7 (Windows_NT 6.1), which is counted 10% higher by "fog-validation" pings than by "main" pings with sessionLength >= 60 * 60.

Since profile inflation isn't expected to be a uniform process, this would be weak evidence for such a theory.


Y'know what, let's just add the Telemetry client_id in there. We need the JS API for UUID sooner or later, let's implement it sooner, eh?

Priority: P3 → P1
Depends on: 1674233

Even without the Telemetry client_id, we can validate that FOG isn't a menace by counting the number of weird things happening.

Scope

All "fog-validation" pings sent by clients in the week of days between Oct 24 and Oct 30 (inclusive) measured by submission timestamp.

1,444,706 pings sent from 43,421 clients. A mean of 33 pings per client over that 7-day span.

OS

The proportions of OS names and versions look good. A lot of Windows, modern versions are more popular than old ones, and only 9 clients ever reported "unknown". 43 clients reported multiple os name + version combinations, most of which could be legitimate OS upgrades (to Windows 10, from Darwin 18 to 19 to 20, Linux kernels from 4 to 5, etc.) or just a side-effect of FOG storing its data in the profile dir while some users port their profile dir around with them to different machines.
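
(The multiple-combination check is roughly the sketch below; the table name is an assumption, and the real query may slice things differently.)

  -- Clients reporting more than one os + os_version combination in the week.
  SELECT
    client_info.client_id AS client_id,
    COUNT(DISTINCT CONCAT(client_info.os, ' ', client_info.os_version)) AS combos
  FROM `moz-fx-data-shared-prod.firefox_desktop.fog_validation`
  WHERE DATE(submission_timestamp) BETWEEN '2020-10-24' AND '2020-10-30'
  GROUP BY client_id
  HAVING combos > 1
  ORDER BY combos DESC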

seq holes and dupes

Glean pings have sequence numbers that should never duplicate and should very rarely result in holes in the seq number record.

Looking at the week's pings, we see 82.8% of clients have no dupes or holes of any size or description. Of the remaining 17.2% that have some sort of problem, 10.4% have exactly one problem. 3.1% have exactly two. And then the long tail begins.

If you tweak that query to look at the disposition of problems, you'll find that only 64 clients (0.15%) have reported dupes (some of them have a real problem with it, with dupe counts as high as 69). That means the number of problems is approximately just the number of holes in the seq record. 8% of all these problems are down to a single missing ping.

I'm not sure what the Glean SDK's current expected rate of ping loss is, but 17.2% of clients experiencing it at least once seems high. This could be a sign that there's something wrong with ping sending in FOG.
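
(One way to approximate the hole-and-dupe counting is sketched below. The table name is an assumption and the real query may differ in the details, e.g. in how it treats the first seq seen per client.)

  -- Per-client dupes and holes in ping_info.seq: order each client's pings
  -- by seq and look at the gap from the previous one. A gap of 0 is a dupe,
  -- a gap greater than 1 is a hole of (gap - 1) missing pings.
  WITH gaps AS (
    SELECT
      client_info.client_id AS client_id,
      ping_info.seq
        - LAG(ping_info.seq) OVER (
            PARTITION BY client_info.client_id
            ORDER BY ping_info.seq) AS gap
    FROM `moz-fx-data-shared-prod.firefox_desktop.fog_validation`
    WHERE DATE(submission_timestamp) BETWEEN '2020-10-24' AND '2020-10-30'
  )
  SELECT
    client_id,
    COUNTIF(gap = 0) AS dupes,
    SUM(IF(gap > 1, gap - 1, 0)) AS holes
  FROM gaps
  GROUP BY client_id
  ORDER BY dupes DESC, holes DESC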

Other client_info Fields

  • app_display_version should be 100% 84.0a1, but isn't quite, with 3 clients sending at least one ping purporting to be from 83.0a1. This is the standard level of weirdness I'd expect.
  • There's 1 client claiming to be on PowerPC and the rest seem in the right proportion for Nightly: 94% 64-bit machines.
  • First Run Dates are only occasionally really off, going as far back as Jan 4, 1980. But it's far more likely that these clients run Nightly often enough to have caught the first time we shipped FOG actually init'ing back at the beginning of September or when I refactored the init better in bug 1658359.
  • No one's found a way to lie about the SDK version yet... and apparently we haven't bothered to update it since the end of July : |

Pings per client

Looking at the distribution of pings per client we see a nice clean cutoff at 168 (that's 7 days in the sample times 24 "fog-validation" pings per day, since they're sent hourly), as we'd hope. There's one client out at 232, perhaps suggesting that there's some profile duplication going on (where multiple machines are sending pings as the same client). If that's the client with 69 dupes from the seq section above, then the true counts are 163 and 69.
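
(Roughly, that distribution comes from something like the sketch below; table name assumed.)

  -- Pings per client over the sample week, then the count of clients at
  -- each ping count. With hourly pings over 7 days, this should top out
  -- around 168.
  WITH per_client AS (
    SELECT
      client_info.client_id AS client_id,
      COUNT(*) AS pings
    FROM `moz-fx-data-shared-prod.firefox_desktop.fog_validation`
    WHERE DATE(submission_timestamp) BETWEEN '2020-10-24' AND '2020-10-30'
    GROUP BY client_id
  )
  SELECT
    pings,
    COUNT(*) AS clients
  FROM per_client
  GROUP BY pings
  ORDER BY pings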

Conclusion

We need to get a handle on what's with the seq holes, but I don't see anything that should prevent us from shipping to Beta.

ni?travis for a pair of eyes on my analysis and opinions on my conclusion.

Flags: needinfo?(tlong)

(In reply to Chris H-C :chutten from comment #5)

Even without the Telemetry client_id, we can validate that FOG isn't a menace by counting the number of weird things happening.

Scope

All "fog-validation" pings sent by clients in the week of days between Oct 24 and Oct 30 (inclusive) measured by submission timestamp.

1,444,706 pings sent from 43,421 clients. A mean of 33 pings per client over that 7-day span.

OS

The proportions of OS names and versions look good. A lot of Windows, modern versions are more popular than old ones, and only 9 clients ever reported "unknown". 43 clients reported multiple os name + version combinations, most of which could be legitimate OS upgrades (to Windows 10, from Darwin 18 to 19 to 20, Linux kernels from 4 to 5, etc.) or just a side-effect of FOG storing its data in the profile dir while some users port their profile dir around with them to different machines.

Just a matter of curiosity, but do we understand under what conditions we'd see "unknown"? I like that we can see cases like OS upgrades, and the portable-profile users seem like an interesting demographic that would be worth studying.

seq holes and dupes

Glean pings have sequence numbers that should never duplicate and should very rarely result in holes in the seq number record.

Looking at the week's pings, we see 82.8% of clients have no dupes or holes of any size or description. Of the remaining 17.2% that have some sort of problem, 10.4% have exactly one problem. 3.1% have exactly two. And then the long tail begins.

If you tweak that query to look at the disposition of problems, you'll find that only 64 clients (0.15%) have reported dupes (some of them have a real problem with it, with dupe counts as high as 69). That means the number of problems is approximately just the number of holes in the seq record. 8% of all these problems are down to a single missing ping.

I'm not sure what the Glean SDK's current expected rate of ping loss is, but 17.2% of clients experiencing it at least once seems high. This could be a sign that there's something wrong with ping sending in FOG.

I forked your query and aimed it at Fenix baseline pings to compare, and it looks like the "at least one" column is definitely coming in at < 17.2%. It seems prudent to file a bug to investigate this further. In fact, taking another look at this across each of the platforms might be a good idea to do from time to time; it's not something we currently do (or have alerting for).

Other client_info Fields

  • app_display_version should be 100% 84.0a1, but isn't quite, with 3 clients sending at least one ping purporting to be from 83.0a1. This is the standard level of weirdness I'd expect.
  • There's 1 client claiming to be on PowerPC and the rest seem in the right proportion for Nightly: 94% 64-bit machines.

What! PowerPC? This conjures up all sorts of images of a lone hacker in a dark room with an old Macintosh :) I love data!

Pings per client

Looking at the distribution of pings per client we see a nice clean cutoff at 168 (that's 7 days in the sample times 24 "fog-validation" pings per day, since they're sent hourly), as we'd hope. There's one client out at 232, perhaps suggesting that there's some profile duplication going on (where multiple machines are sending pings as the same client). If that's the client with 69 dupes from the seq section above, then the true counts are 163 and 69.

This looks legit to me. A nice-looking curve.

Conclusion

We need to get a handle on what's with the seq holes, but I don't see anything that should prevent us from shipping to Beta.

Agreed

ni?travis for a pair of eyes on my analysis and opinions on my conclusion.

A great analysis to read, as usual. I have no arguments against your hypotheses and I'd say it is reporting as intended. One thing I'm curious about that I expected to see in your analysis but didn't: a distribution of durations. But perhaps this was due to instrumentation? Overall, amazing and you have my r+

Flags: needinfo?(tlong)

(In reply to Travis Long [:travis_] from comment #6)

(In reply to Chris H-C :chutten from comment #5)

OS

The proportions of OS names and versions look good. A lot of Windows, modern versions are more popular than old ones, and only 9 clients ever reported "unknown". 43 clients reported multiple os name + version combinations, most of which could be legitimate OS upgrades (to Windows 10, from Darwin 18 to 19 to 20, Linux kernels from 4 to 5, etc.) or just a side-effect of FOG storing its data in the profile dir while some users port their profile dir around with them to different machines.

Just a matter of curiosity, but do we understand under what conditions we'd see "unknown"? I like that we can see cases like OS upgrades, and the portable-profile users seem like an interesting demographic that would be worth studying.

It'd have to be something foreign enough to fall out of this batch of target_os. This is the OS it was built for, not the OS it's running on, so it's a big ol' shrug from me.

seq holes and dupes

Glean pings have sequence numbers that should never duplicate and should very rarely result in holes in the seq number record.

Looking at the week's pings, we see 82.8% of clients have no dupes or holes of any size or description. Of the remaining 17.2% that have some sort of problem, 10.4% have exactly one problem. 3.1% have exactly two. And then the long tail begins.

If you tweak that query to look at the disposition of problems, you'll find that only 64 clients (0.15%) have reported dupes (some of them have a real problem with it, with dupe counts as high as 69). That means the number of problems is approximately just the number of holes in the seq record. 8% of all these problems are down to a single missing ping.

I'm not sure what the Glean SDK's current expected rate of ping loss is, but 17.2% of clients experiencing it at least once seems high. This could be a sign that there's something wrong with ping sending in FOG.

I forked your query and aimed it at Fenix baseline pings to compare, and it looks like the "at least one" column is definitely coming in at < 17.2%. It seems prudent to file a bug to investigate this further. In fact, taking another look at this across each of the platforms might be a good idea to do from time to time; it's not something we currently do (or have alerting for).

Looking at the scope of that query (https://sql.telemetry.mozilla.org/queries/76130/source), it covers about 13.2M clients and shows that Fenix's "clients with at least one" is about 9.8%. This suggests two things to me:

  1. That it's probably not a small absolute number of weird clients that seem overlarge because of the population size (In the FOG sample it's 250k-ish clients, in the Fenix sample it's 1.3M). If Fenix also had a similar number of weirdos, the proportion'd be 2%.
  2. Maybe age of clients matters. Maybe a hole is something that happens early in a profile so it's a function of how many clients are new. Nightly churns through clients fairly quickly (see first_use_date distribution), so maybe it's an overrepresentation of a problem that solves itself?

Anyhoo, definitely gotta file a bug for this.

Conclusion

A great analysis to read, as usual. I have no arguments against your hypotheses and I'd say it is reporting as intended. One thing I'm curious about that I expected to see in your analysis but didn't: a distribution of durations. But perhaps this was due to instrumentation? Overall, amazing and you have my r+

Right. I did that up in bug 1651110 Comment #3, but didn't update it to the new sample. Lemme do that real quick.

"Durations" (end_time - start_time)

Every single "fog-validation" ping should, in an ideal world, have a duration of exactly one hour... since that's the time I ask it to sleep between submissions. However, the real world's not ideal, and data is messy. Clients can change their wall clocks between start and end. Laptops can fall asleep.

At the very least we can be sure nothing will ever go backward? Well, no, clients can change the wall clocks backwards. Not backwards by more than a day or two, right? Well... at least we don't have any ping in the sample that went backward further than two months.

81.7% of pings (at least one from each of 93.6% of the clients in the sample) come in within 5min (up or down) of on time.

If timezone shifts were causing some of these, we'd see small jumps 60min up and down from the target 60... which we do see, but weakly (especially around 0min). Weirdly, once you get above about 10h duration, you start seeing nearly twice as many pings and clients per interval as you were seeing back in the four hundreds of minutes.

Overall I'm not concerned. The duration is highly artificial and depends on wall clock. Not like "baseline" which purports to have a monotonic "real" duration.

Blocks: 1675877
Status: ASSIGNED → RESOLVED
Closed: 4 years ago
Resolution: --- → FIXED
No longer depends on: 1674233
Blocks: 1678146