Closed Bug 1675877 Opened 5 years ago Closed 5 years ago

What is up with `seq` holes in FOG's "fog-validation" pings?

Categories: Toolkit :: Telemetry, task, P1
Status: RESOLVED FIXED
Target Milestone: 86 Branch
Tracking Status: firefox86 fixed
Reporter: chutten|PTO
Assignee: chutten|PTO
Whiteboard: [telemetry:fog:m?]
Attachments: 3 files

With context from the validation analysis, figure out what's going on with 17.2% of clients reporting at least one hole in the seq record.

  • Is this rate stable over time? Is it the same 17.2% each time, or a different set each week?
  • Is it common across all facets (os family and version, architecture, etc)?
  • Is it related to Glean db age?
  • Has a Glean SDK update (see bug 1675534) changed the situation?
Whiteboard: [telemetry:fog:m?]

A renewed look, this time looking at a week of RLB-powered beta, shows 99.94% of clients have no problems with dupes, but only 73.26% of clients have a complete "unholey" seq record. That's 26.73% with holes. 11.5% have exactly one hole of size one, leaving about 15% of clients in the sample having more than one hole (12%) or one hole of size > 1 (3%).

So the problem persists, and seems worse in Beta.
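To make the categories above concrete, here is a minimal sketch of how one client's observed `seq` values could be sorted into them. It is not the query used for the analysis; the function names and the input shape (a plain list of the `seq` integers received from one client) are assumptions for illustration, and holes are only counted between the smallest and largest value seen.

```python
from collections import Counter


def summarize_seq(seqs):
    """Summarize duplicates and holes in one client's observed seq values."""
    counts = Counter(seqs)
    # Every repeat of an already-seen seq value counts as one dupe.
    dupes = sum(n - 1 for n in counts.values())
    ordered = sorted(counts)
    # A hole is a run of missing values between two neighbouring observed values;
    # its size is the number of missing values in that run.
    holes = [b - a - 1 for a, b in zip(ordered, ordered[1:]) if b - a > 1]
    return {
        "dupes": dupes,
        "holes": len(holes),
        "largest_hole": max(holes, default=0),
    }


def bucket(summary):
    """Map a summary onto the hole categories used in the comment above."""
    if summary["holes"] == 0:
        return "complete"
    if summary["holes"] > 1:
        return "more than one hole"
    if summary["largest_hole"] == 1:
        return "one hole of size 1"
    return "one hole of size > 1"
```

For example, `bucket(summarize_seq([7, 8, 10]))` lands in "one hole of size 1", since only `9` is missing.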

Per-day we're seeing a fairly constant 8-9% of clients with problems. This suggests that the 26.73% represents a semi-stable subpopulation of hole-prone clients who have holes on some days and not on others (either because they aren't using Firefox or because they managed to escape holes over that 24h).
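The per-day and per-week figures measure different things, which a small sketch can show. It assumes a hypothetical input of one `(client_id, day, has_hole)` row per client per active day; neither that shape nor the function name comes from the actual analysis.

```python
from collections import defaultdict


def daily_and_period_rates(observations):
    """Return ({day: share of that day's clients with a hole},
    share of all clients with a hole on at least one day)."""
    by_day = defaultdict(lambda: [0, 0])  # day -> [holey clients, active clients]
    ever_holey = {}  # client_id -> True once any of its days had a hole
    for client, day, has_hole in observations:
        by_day[day][0] += int(has_hole)
        by_day[day][1] += 1
        ever_holey[client] = ever_holey.get(client, False) or bool(has_hole)
    daily = {day: holey / active for day, (holey, active) in by_day.items()}
    period = sum(ever_holey.values()) / len(ever_holey) if ever_holey else 0.0
    return daily, period
```

The whole-period rate can only be as high as or higher than the client-weighted daily rates, because a client counts as holey for the period if any single day had a hole.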

Windows is the worst offender when you look at it per os, and of the Windowses, Windows 7 is the worst (31.7% holey) of the bunch. But it's not like the others' 17%ish numbers are great either, so platform is a factor in the magnitude of the issue, not the presence of one.

As for the age of these things, I looked at client_info.first_run_date. I originally thought all of the dbs should be shiny and new because we only shipped in 85, but I had once again forgotten that 84 was when the original db shipped, so I needed to widen my gaze. It was illuminating to see that the holeyness rate is a flat 20% of clients no matter how long they've been using Glean. All the patterns on there are explained by population effects (notably the ramp to the right, which is an artefact of how rapidly people update their beta builds; look at that shelf around Dec 21 from folks who turned off their computers for the holidays).


Why are these clients so holey? I still don't have an answer. Glean (as with Telemetry) is supposed to be a reliable data delivery system. We're not supposed to miss things.

Maybe there's something to do with the behaviour of "fog-validation" pings themselves. Maybe it's tied to network error logic. Maybe there's a difference between sending pings that were submitted in a previous session over pings submitted just before they were sent.

At least that one I can test. If there was a problem sending things that had to be persisted between sessions, we would see a definite difference between pings received around the time they were sent, and pings received far later (like a day later) than they were sent. We don't see such a difference, with the rate of holey clients remaining fairly flat no matter the delay of the pings they sent.
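A rough sketch of that comparison, under assumed inputs: `pings` is a hypothetical iterable of `(client_id, submitted_at, received_at)` tuples with `datetime` values, `holey_clients` is a set of client ids already known to have holes, and the one-day threshold mirrors the "like a day later" wording. None of these names come from the real query.

```python
from datetime import datetime, timedelta


def holey_rate_by_delay(pings, holey_clients, threshold=timedelta(days=1)):
    """Share of holey clients among those with prompt vs. delayed pings."""
    clients = {"prompt": set(), "delayed": set()}
    for client, submitted_at, received_at in pings:
        delay = received_at - submitted_at
        key = "delayed" if delay >= threshold else "prompt"
        clients[key].add(client)
    # None marks a bucket with no clients, so it cannot be mistaken for a 0% rate.
    return {
        key: (len(ids & holey_clients) / len(ids) if ids else None)
        for key, ids in clients.items()
    }
```

If persistence between sessions were the culprit, the "delayed" rate would be expected to sit clearly above the "prompt" rate; similar rates in both buckets match the flat picture described above.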

There is insufficient information contained in the "fog-validation" ping to determine the cause of the holes. All we can do is be vigilant when we start sending more information-rich pings like "baseline" and "metrics" to ensure we're not leaving a similar number of them unreceived.

Jan-Erik: it's not a fun conclusion, but it is as far as I think I can take this.

Assignee: nobody → chutten
Status: NEW → ASSIGNED
Flags: needinfo?(jrediger)
Priority: P3 → P1

(In reply to Chris H-C :chutten from comment #1)

> A renewed look, this time looking at a week of RLB-powered beta, shows 99.94% of clients have no problems with dupes, but only 73.26% of clients have a complete "unholey" seq record. That's 26.73% with holes. 11.5% have exactly one hole of size one, leaving about 15% of clients in the sample having more than one hole (12%) or one hole of size > 1 (3%).
>
> So the problem persists, and seems worse in Beta.

Is this using the previous FOG Glean-like implementation? It's not using the newest RLB? If so, does the problem persist with RLB?

That's scary high numbers indeed and very worth a deeper investigation, but I agree that right now we can't track that in the wild a lot.
Do you have an idea what additional data could help us track this down? Maybe we can extend the fog-validation ping to get that data.

And as :Dexter asked: what code are we running? Might this have been fixed already by the RLB?
If so, when will it hit beta/stable?

Flags: needinfo?(jrediger) → needinfo?(chutten)

(In reply to Alessio Placitelli [:Dexter] from comment #2)

> (In reply to Chris H-C :chutten from comment #1)
>
> > A renewed look, this time looking at a week of RLB-powered beta, shows 99.94% of clients have no problems with dupes, but only 73.26% of clients have a complete "unholey" seq record. That's 26.73% with holes. 11.5% have exactly one hole of size one, leaving about 15% of clients in the sample having more than one hole (12%) or one hole of size > 1 (3%).
> >
> > So the problem persists, and seems worse in Beta.
>
> Is this using the previous FOG Glean-like implementation? It's not using the newest RLB? If so, does the problem persist with RLB?

This is using post-RLB Firefox Beta 85, I'm afraid. RLB broadly landed in bug 1662868 which hit the mid/end of the 85 train (Nightlies starting 20201204094005. No uplift, so Beta data started with the 85 merge on Dec 14th).

(In reply to Jan-Erik Rediger [:janerik] from comment #3)

> That's scary high numbers indeed and very worth a deeper investigation, but I agree that right now we can't track that in the wild a lot.
> Do you have an idea what additional data could help us track this down? Maybe we can extend the fog-validation ping to get that data.

After a conversation in the Team Meeting we came to a theory that this is due to I/O errors on the client when trying to save the ping to disk. To detect this properly we'd need to instrument glean-core right about here, which FOG can't do. And since FOG doesn't send any builtin pings that glean-core owns (yet), glean-core can't instrument it in a way we can observe in FOG.

So I'm gonna file a bug for improving I/O error handling in the Glean SDK ping submission parts (something we want to do anyway), and use this bug to add a 'whether the disk holding the Firefox profile is an SSD or an HDD' data collection to "fog-validation" pings. If these seq holes happen entirely or predominantly on spinning disks, that would be consistent with the theory that I/O errors in ping submission are a likely culprit and we should prioritize that I/O error handling improvement bug I'm about to file.
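Once that collection is in, the check could look roughly like the sketch below. The input is a hypothetical list of `(is_ssd, has_hole)` pairs, one per client; the function and key names are made up for illustration and are not the metric names added by the patch.

```python
def hole_rate_by_disk(clients):
    """Compare the share of holey clients on SSDs against spinning disks."""
    totals = {True: [0, 0], False: [0, 0]}  # is_ssd -> [holey clients, all clients]
    for is_ssd, has_hole in clients:
        totals[bool(is_ssd)][0] += int(has_hole)
        totals[bool(is_ssd)][1] += 1

    def rate(holey, n):
        # None means no clients were seen for that disk type.
        return holey / n if n else None

    ssd = rate(*totals[True])
    hdd = rate(*totals[False])
    # The ratio is only defined when both rates exist and the SSD rate is non-zero.
    ratio = hdd / ssd if ssd and hdd is not None else None
    return {"ssd": ssd, "hdd": hdd, "hdd_to_ssd_ratio": ratio}
```

A ratio well above 1 would be consistent with the I/O error theory; a ratio near 1 would suggest that disk type is not the deciding factor.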

And in the meantime, we keep chugging along at granting RLB its builtin ping schedulers.

Flags: needinfo?(chutten)
See Also: → 1685745
Attached file data collection review
Attachment #9196117 - Flags: data-review?(teon)

Comment on attachment 9196117 [details]
data collection review

  1. Is there or will there be documentation that describes the schema for the ultimate data set in a public, complete, and accurate way?
    Yes, in the metrics.md file.

  2. Is there a control mechanism that allows the user to turn the data collection on and off?
    Yes, through The Data Choices preference in Firefox's Preferences.

  3. If the request is for permanent data collection, is there someone who will monitor the data over time?
    N/A, The collection expires in Firefox 89.

  4. Using the category system of data types on the Mozilla wiki, what collection type of data do the requested measurements fall under?
    Category 1: Technical data

  5. Is the data collection request for default-on or default-off?
    Default-on

  6. Does the instrumentation include the addition of any new identifiers (whether anonymous or otherwise; e.g., username, random IDs, etc. See the appendix for more details)?
    No

  7. Is the data collection covered by the existing Firefox privacy notice?
    Yes

  8. Does there need to be a check-in in the future to determine whether to renew the data?
    No

  9. Does the data collection use a third-party collection tool?
    No

Attachment #9196117 - Flags: data-review+
Pushed by chutten@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/b30f1f159f47
Record for fog validation whether profile is on a ssd r=janerik

It was a too-long line in the metric description that wasn't picked up in the build due to FOG not linting for YAML problems, only metrics problems.

Should be much better now.

Flags: needinfo?(chutten)
Pushed by chutten@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/f6372171d104
Record for fog validation whether profile is on a ssd r=janerik
https://hg.mozilla.org/integration/autoland/rev/81896a1fdf2e
Ensure FOG lints the YAML and the metrics r=janerik
Status: ASSIGNED → RESOLVED
Closed: 5 years ago
Resolution: --- → FIXED
Target Milestone: --- → 86 Branch
Attachment #9196117 - Flags: data-review?(teon)