Closed Bug 1675877 Opened 5 years ago Closed 5 years ago

What is up with `seq` holes in FOG's "fog-validation" pings?

Status:

RESOLVED FIXED

Milestone:

86 Branch

Tracking Flags:

firefox86: tracking ---, status fixed

People

(Reporter: chutten|PTO, Assigned: chutten|PTO)

References

Details

(Whiteboard: [telemetry:fog:m?])

Attachments

(3 files)

data collection review: attached 5 years ago by Chris H-C :chutten|PTO (back Oct 23); 2.55 KB, text/plain; travis_: data-review+
Bug 1675877 - Record for fog validation whether profile is on a ssd r?janerik!: attached 5 years ago by Chris H-C :chutten|PTO (back Oct 23); 48 bytes, text/x-phabricator-request
Bug 1675877 - Ensure FOG lints the YAML and the metrics r?janerik!: attached 5 years ago by Chris H-C :chutten|PTO (back Oct 23); 48 bytes, text/x-phabricator-request

Chris H-C :chutten|PTO (back Oct 23)

Assignee

Description

•

5 years ago

With context from the validation analysis, figure out what's going on with 17.2% of clients reporting at least one hole in the seq record.

Is this rate stable over time? Is it the same 17.2% each time, or a different set each week?
Is it common across all facets (os family and version, architecture, etc)?
Is it related to Glean db age?
Has a Glean SDK update (see bug 1675534) changed the situation?

Chris H-C :chutten|PTO (back Oct 23)

Assignee

Updated

•

5 years ago

Whiteboard: [telemetry:fog:m?]

Chris H-C :chutten|PTO (back Oct 23)

Assignee

Comment 1

•

5 years ago

A renewed look, this time looking at a week of RLB-powered beta, shows 99.94% of clients have no problems with dupes, but only 73.26% of clients have a complete "unholey" seq record. That's 26.73% with holes. 11.5% have exactly one hole of size one, leaving about 15% of clients in the sample having more than one hole (12%) or one hole of size > 1 (3%).
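
For anyone wanting to reproduce the categories above, here is a minimal sketch of how one client's seq record could be classified. It assumes `seq` is an integer that increases by one per submitted ping, and that we only see the values that were actually received. The names `find_holes` and `classify` are hypothetical helpers for illustration, not part of FOG or the Glean SDK.

```python
from collections import Counter


def find_holes(seqs):
    """Return (holes, dupes) for one client's received seq values.

    holes is a list of (first_missing_seq, size) tuples, and dupes is the
    number of repeated seq values. Only gaps between the smallest and
    largest received value are visible; pings missing before the first or
    after the last received seq cannot be detected this way.
    """
    counts = Counter(seqs)
    dupes = sum(n - 1 for n in counts.values())
    ordered = sorted(counts)
    holes = []
    for prev, cur in zip(ordered, ordered[1:]):
        if cur - prev > 1:
            holes.append((prev + 1, cur - prev - 1))
    return holes, dupes


def classify(seqs):
    """Bucket a client the same way the percentages in this comment do."""
    holes, _ = find_holes(seqs)
    if not holes:
        return "complete"
    if len(holes) == 1:
        if holes[0][1] == 1:
            return "one hole of size 1"
        return "one hole of size > 1"
    return "more than one hole"
```

For example, a client whose received seqs are 0, 1, 3 would land in the "one hole of size 1" bucket, while 0, 2, 4 would count as "more than one hole".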

So the problem persists, and seems worse in Beta.

Per-day we're seeing a fairly constant 8-9% of clients with problems. This suggests that the 26.73% represents a semi-stable, hole-prone subpopulation whose members have holes on some days and not on others (either because they aren't using Firefox or because they managed to escape holes over that 24h).
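
A rough back-of-the-envelope check of that reasoning, as a sketch only. It assumes a daily rate of 8.5% (the midpoint of the 8-9% above) and, unrealistically, that the same clients are active on all seven days. Both function names are hypothetical.

```python
def weekly_rate_if_independent(daily_rate, days=7):
    """Share of clients with at least one holey day in the week, if every
    client independently had the same chance of holes each day."""
    return 1 - (1 - daily_rate) ** days


def daily_hit_rate_within_subpopulation(daily_rate, weekly_rate):
    """If only the weekly holey subpopulation ever has holes, the share of
    that subpopulation that must be holey on any given day."""
    return daily_rate / weekly_rate


# Assumed inputs: 8.5% per day, 26.73% over the week (from this comment).
independent = weekly_rate_if_independent(0.085)
within = daily_hit_rate_within_subpopulation(0.085, 0.2673)
```

Under these assumptions, independent holes would put roughly 46% of clients in the holey group over a week, well above the observed 26.73%. A fixed subpopulation in which about a third of members are holey on any given day fits the numbers better.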

Windows is the worst offender when you look at it per os, and of the Windowses, Windows 7 is the worst (31.7% holey) of the bunch. But it's not like the others' 17%ish numbers are great either, so platform is a factor in the magnitude of the issue, not the presence of one.

As for the age of these things, I looked at client_info.first_run_date. I originally thought all of the dbs should be shiny and new because we only shipped in 85, but I had once again forgotten that 84 was when the original db shipped so I needed to widen my gaze. It was illuminating to see that the holeliness rate is a flat 20% of clients no matter how long they've been using Glean. All the patterns on there are explained by population effects (notably the ramp to the right, which is an artefact of how rapidly people update their beta builds. Look at that shelf around Dec 21 from folks who turned off their computers for the holidays).

Why are these clients so holey? I still don't have an answer. Glean (as with Telemetry) is supposed to be a reliable data delivery system. We're not supposed to miss things.

Maybe there's something to do with the behaviour of "fog-validation" pings themselves. Maybe it's tied to network error logic. Maybe there's a difference between sending pings that were submitted in a previous session over pings submitted just before they were sent.

At least that one I can test. If there was a problem sending things that had to be persisted between sessions, we would see a definite difference between pings received around the time they were sent, and pings received far later (like a day later) than they were sent. We don't see such a difference, with the rate of holey clients remaining fairly flat no matter the delay of the pings they sent.
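
The delay comparison can be sketched roughly like this. It assumes one row per client holding the send and receive timestamps of that client's most-delayed ping plus a holey flag. The bucket boundaries and the names `delay_bucket` and `holey_rate_by_delay` are illustrative assumptions, not the actual analysis.

```python
from datetime import timedelta


def delay_bucket(sent_at, received_at):
    """Coarse bucket for how long a ping waited before it was received."""
    delay = received_at - sent_at
    if delay < timedelta(hours=1):
        return "under 1h"
    if delay < timedelta(days=1):
        return "1h to 1d"
    return "1d or more"


def holey_rate_by_delay(clients):
    """clients: iterable of (sent_at, received_at, has_holes) tuples.

    Returns {bucket: share of clients in that bucket that have holes}.
    """
    totals, holey = {}, {}
    for sent_at, received_at, has_holes in clients:
        bucket = delay_bucket(sent_at, received_at)
        totals[bucket] = totals.get(bucket, 0) + 1
        holey[bucket] = holey.get(bucket, 0) + int(has_holes)
    return {bucket: holey[bucket] / totals[bucket] for bucket in totals}
```

If persisting pings across sessions were the problem, the "1d or more" bucket would be expected to show a clearly higher rate than "under 1h".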

There is insufficient information contained in the "fog-validation" ping to determine the cause of the holes. All we can do is be vigilant when we start sending more information-rich pings like "baseline" and "metrics" to ensure we're not leaving a similar number of them unreceived.

Jan-Erik: it's not a fun conclusion, but it is as far as I think I can take this.

Assignee: nobody → chutten

Status: NEW → ASSIGNED

Flags: needinfo?(jrediger)

Priority: P3 → P1

Alessio Placitelli [:Dexter]

Comment 2

•

5 years ago

(In reply to Chris H-C :chutten from comment #1)

> A renewed look, this time looking at a week of RLB-powered beta, shows 99.94% of clients have no problems with dupes, but only 73.26% of clients have a complete "unholey" seq record. That's 26.73% with holes. 11.5% have exactly one hole of size one, leaving about 15% of clients in the sample having more than one hole (12%) or one hole of size > 1 (3%).
>
> So the problem persists, and seems worse in Beta.

Is this using the previous FOG Glean-like implementation? It's not using the newest RLB? If so, does the problem persist with RLB?

Jan-Erik Rediger [:janerik]

Comment 3

•

5 years ago

Those are scary-high numbers indeed and well worth a deeper investigation, but I agree that right now we can't do much to track this in the wild.
Do you have an idea what additional data could help us track this down? Maybe we can extend the fog-validation ping to get that data.

And as :Dexter asked: what code are we running? Might this have been fixed already by the RLB?
If so, when will it hit beta/stable?

Flags: needinfo?(jrediger) → needinfo?(chutten)

Chris H-C :chutten|PTO (back Oct 23)

Assignee

Comment 4

•

5 years ago

(In reply to Alessio Placitelli [:Dexter] from comment #2)

> (In reply to Chris H-C :chutten from comment #1)
>
> > A renewed look, this time looking at a week of RLB-powered beta, shows 99.94% of clients have no problems with dupes, but only 73.26% of clients have a complete "unholey" seq record. That's 26.73% with holes. 11.5% have exactly one hole of size one, leaving about 15% of clients in the sample having more than one hole (12%) or one hole of size > 1 (3%).
> >
> > So the problem persists, and seems worse in Beta.
>
> Is this using the previous FOG Glean-like implementation? It's not using the newest RLB? If so, does the problem persist with RLB?

This is using post-RLB Firefox Beta 85, I'm afraid. RLB broadly landed in bug 1662868 which hit the mid/end of the 85 train (Nightlies starting 20201204094005. No uplift, so Beta data started with the 85 merge on Dec 14th).

(In reply to Jan-Erik Rediger [:janerik] from comment #3)

> Those are scary-high numbers indeed and well worth a deeper investigation, but I agree that right now we can't do much to track this in the wild.
> Do you have an idea what additional data could help us track this down? Maybe we can extend the fog-validation ping to get that data.

After a conversation in the Team Meeting we came to a theory that this is due to I/O errors on the client when trying to save the ping to disk. To detect this properly we'd need to instrument glean-core right about here, which FOG can't do. And since FOG doesn't send any builtin pings that glean-core owns (yet), glean-core can't instrument it in a way we can observe in FOG.

So I'm gonna file a bug for improving I/O error handling in the Glean SDK ping submission parts (something we want to do anyway), and use this bug to add a 'whether the disk holding the Firefox profile is an SSD or an HDD' data collection to "fog-validation" pings. If these seq holes happen entirely or predominantly on spinning disks, that would be consistent with the theory that I/O errors in ping submission are a likely culprit, and we should prioritize that I/O error handling improvement bug I'm about to file.
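
Once the disk-type data arrives, the check could look something like the sketch below. It assumes one `(is_ssd, has_holes)` pair per client. The function name `hdd_share` and the input shape are hypothetical, and a real analysis would also want to look at sample sizes before drawing conclusions.

```python
def hdd_share(clients):
    """clients: iterable of (is_ssd, has_holes) booleans, one per client.

    Returns (share of holey clients on HDD, share of all clients on HDD).
    If holes were unrelated to disk type, the two shares should be close.
    """
    clients = list(clients)
    holey_disks = [is_ssd for is_ssd, has_holes in clients if has_holes]
    if not clients or not holey_disks:
        raise ValueError("need at least one client and one holey client")
    hdd_among_holey = sum(not is_ssd for is_ssd in holey_disks) / len(holey_disks)
    hdd_overall = sum(not is_ssd for is_ssd, _ in clients) / len(clients)
    return hdd_among_holey, hdd_overall
```

A holey-client HDD share far above the overall HDD share would be consistent with the I/O error theory. Similar shares would point elsewhere.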

And in the meantime, we keep chugging along at granting RLB its builtin ping schedulers.

Flags: needinfo?(chutten)

Chris H-C :chutten|PTO (back Oct 23)

Assignee

Comment 5

•

5 years ago

Attached file data collection review — Details

Attachment #9196117 - Flags: data-review?(teon)

Chris H-C :chutten|PTO (back Oct 23)

Assignee

Comment 6

•

5 years ago

Attached file Bug 1675877 - Record for fog validation whether profile is on a ssd r?janerik! — Details

Travis Long [:travis_]

Comment 7

•

5 years ago

Comment on attachment 9196117 [details]
data collection review

Is there or will there be documentation that describes the schema for the ultimate data set in a public, complete, and accurate way?
Yes, in the metrics.md file.
Is there a control mechanism that allows the user to turn the data collection on and off?
Yes, through the Data Choices preference in Firefox's Preferences.
If the request is for permanent data collection, is there someone who will monitor the data over time?
N/A, The collection expires in Firefox 89.
Using the category system of data types on the Mozilla wiki, what collection type of data do the requested measurements fall under?
Category 1: Technical data
Is the data collection request for default-on or default-off?
Default-on
Does the instrumentation include the addition of any new identifiers (whether anonymous or otherwise; e.g., username, random IDs, etc. See the appendix for more details)?
No
Is the data collection covered by the existing Firefox privacy notice?
Yes
Does there need to be a check-in in the future to determine whether to renew the data?
No
Does the data collection use a third-party collection tool?
No

Attachment #9196117 - Flags: data-review+

Pulsebot

Comment 8

•

5 years ago

Pushed by chutten@mozilla.com: https://hg.mozilla.org/integration/autoland/rev/b30f1f159f47 Record for fog validation whether profile is on a ssd r=janerik

Atila Butkovits

Comment 9

•

5 years ago

Backed out for causing a documentation failure

Backout link: https://hg.mozilla.org/integration/autoland/rev/050b1d03ee25a3a0c1667a2f6e6f4a1e5c7310ef

Push with failures: https://treeherder.mozilla.org/jobs?repo=autoland&selectedTaskRun=Td1U1Z-fQEmzQ2rOzWu_4Q.0&searchStr=documentation%2Copt%2Cdocumentation%2Csource-test-doc-generate%2Cgenerate&revision=b30f1f159f4735c3d19f27ab304c893569b4c7f5

Failure log: https://treeherder.mozilla.org/logviewer?job_id=326541802&repo=autoland&lineNumber=236

Flags: needinfo?(chutten)

Chris H-C :chutten|PTO (back Oct 23)

Assignee

Comment 10

•

5 years ago

Attached file Bug 1675877 - Ensure FOG lints the YAML and the metrics r?janerik! — Details

Depends on D101368

Chris H-C :chutten|PTO (back Oct 23)

Assignee

Comment 11

•

5 years ago

It was a too-long line in the metric description that wasn't picked up in the build due to FOG not linting for YAML problems, only metrics problems.

Should be much better now.
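
To illustrate the kind of check that was missing, here is a minimal sketch of a line-length lint over YAML text. The 80-character limit and the function name `overlong_lines` are assumptions for illustration; this is not the linter FOG actually runs.

```python
def overlong_lines(text, max_length=80):
    """Return (line_number, length) for every line longer than max_length.

    Line numbers are 1-based, matching how linters usually report them.
    """
    return [
        (number, len(line))
        for number, line in enumerate(text.splitlines(), start=1)
        if len(line) > max_length
    ]
```

A metric description that runs past the limit would show up in the returned list, so a build step could fail whenever the list is non-empty.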

Flags: needinfo?(chutten)

Pulsebot

Comment 12

•

5 years ago

Pushed by chutten@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/f6372171d104
Record for fog validation whether profile is on a ssd r=janerik
https://hg.mozilla.org/integration/autoland/rev/81896a1fdf2e
Ensure FOG lints the YAML and the metrics r=janerik

Sebastian Hengst [:aryx] (needinfo me if it's about an intermittent or backout)

Comment 13

•

5 years ago

bugherder

https://hg.mozilla.org/mozilla-central/rev/f6372171d104
https://hg.mozilla.org/mozilla-central/rev/81896a1fdf2e

Status: ASSIGNED → RESOLVED

Closed: 5 years ago

status-firefox86: --- → fixed

Resolution: --- → FIXED

Target Milestone: --- → 86 Branch

Chris H-C :chutten|PTO (back Oct 23)

Assignee

Updated

•

5 years ago

Attachment #9196117 - Flags: data-review?(teon)

Categories

(Toolkit :: Telemetry, task, P1)
