schema error in `org_mozilla_fenix.health_v1` for #/metrics/schema: counter
Categories
(Data Platform and Tools :: General, defect, P1)
Tracking
(firefox-esr140 unaffected, firefox145 unaffected, firefox146+ fixed, firefox147+ fixed)
| Version | Tracking | Status |
|---|---|---|
| firefox-esr140 | --- | unaffected |
| firefox145 | --- | unaffected |
| firefox146 | + | fixed |
| firefox147 | + | fixed |
People
(Reporter: efilho, Assigned: janerik)
References
Details
(Whiteboard: [dataquality])
Attachments
(2 files)
- 48 bytes, text/x-phabricator-request
- 48 bytes, text/x-phabricator-request (phab-bot: approval-mozilla-beta+)
Schema error for #/metrics/schema: counter in org_mozilla_fenix.health_v1.
Encountered 127218 errors in the past week which is a change of from the previous week. The error count is 0.90% of 14116325.0 valid pings.
See runbook for resolution: https://mozilla-hub.atlassian.net/wiki/spaces/DATA/pages/1134854153/Schema+Errors
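The percentage quoted above can be re-derived from the two counts in the report. A minimal sketch in Python; the helper name `error_rate_percent` is made up for illustration and is not part of any Mozilla tooling.

```python
def error_rate_percent(error_count: int, valid_pings: int) -> float:
    """Schema errors expressed as a percentage of valid pings."""
    if valid_pings <= 0:
        raise ValueError("valid_pings must be positive")
    return 100.0 * error_count / valid_pings


# Counts taken from the report above.
rate = error_rate_percent(127218, 14116325)
print(f"{rate:.2f}%")  # prints 0.90%
```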
Comment 2•2 months ago
Example of error metric payloads: https://sql.telemetry.mozilla.org/queries/112174/source
```
{
  "metrics": {
    "schema: counter": {
      "glean.validation.pings_submitted": {
        "events": 1
      }
    },
    ...
  },
  ...
}
```
`glean.validation.pings_submitted` is a labeled counter, but the payloads list it under "schema: counter". See https://dictionary.telemetry.mozilla.org/apps/fenix/metrics/glean_validation_pings_submitted
This is only seen in fenix right now and also seen in the baseline ping starting in version 146.
Jan-Erik, do you know what could be causing this?
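One way to spot such payloads programmatically is to compare the keys under "metrics" against the expected metric type names. This is only a sketch: `KNOWN_TYPES` is an illustrative subset rather than the full schema, and `unexpected_metric_sections` is a hypothetical helper.

```python
import json

# Illustrative subset of metric type names; the real schema defines more.
KNOWN_TYPES = {"counter", "labeled_counter", "string", "boolean"}


def unexpected_metric_sections(payload: dict) -> list[str]:
    """Return the keys under "metrics" that are not known metric type names."""
    return sorted(k for k in payload.get("metrics", {}) if k not in KNOWN_TYPES)


# Payload shaped like the example above, with the elided parts left out.
ping = json.loads("""
{
  "metrics": {
    "schema: counter": {
      "glean.validation.pings_submitted": {"events": 1}
    }
  }
}
""")
print(unexpected_metric_sections(ping))  # ['schema: counter']
```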
Comment 4•2 months ago (Assignee)
I'm a bit baffled by this. It should be impossible for Glean to generate this.
We read data from the database and decode it into a metric type.
From that metric type we determine the metric type name it goes in ("counter", etc.).
Due to how we encode things, labeled metrics are handled slightly differently. A / in the name means we're looking at a labeled thing.
For those, we prefix the metric type name with `labeled_`.
"schema: counter" is simply not a string that is constructed in our code base. And yet we see it with what looks like otherwise valid data?
I'll keep the ni? for a moment to look at this again.
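The naming rule described in this comment can be sketched roughly as follows. This is a simplified Python illustration, not Glean's actual (Rust) code; `section_name` and its parameters are hypothetical, and the stored-name format is an assumption.

```python
def section_name(metric_type: str, stored_name: str) -> str:
    """Pick the payload section for a stored metric.

    Assumption for this sketch: a "/" in the stored name marks a
    labeled metric (for example "category.name/label").
    """
    if "/" in stored_name:
        return "labeled_" + metric_type
    return metric_type


print(section_name("counter", "glean.validation.pings_submitted/events"))  # labeled_counter
print(section_name("counter", "some.plain_counter"))  # counter
```

Under a rule like this, the only possible outputs are the type name itself or the type name with a `labeled_` prefix, which is why "schema: counter" is so surprising.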
Comment 5•2 months ago (Assignee)
So I queried a bit more: https://sql.telemetry.mozilla.org/queries/112407/source
- This is visible on both beta and nightly on Android, across metrics, baseline and health ping.
- It seems to have leveled off (or at least significantly slowed down) over the past few days
- It started on 2025-11-05 on nightly, 2025-11-11 on beta.
- Firefox Desktop shows a handful of instances, since 2025-08-24 (how long do we keep data in the structured errors table?)
  - metrics & baseline ping
- Nothing on iOS
Glean v65.0.0 was released on 2025-08-18 and landed in m-c on 2025-08-22.
So if those 2025-08-24 instances are indeed the first ones, it could be because of changes in that release.
Why not earlier on Android though?
Why not on iOS at all?
(Update: clarified which pings this is seen on)
Comment 6•2 months ago (Assignee)
So the initial look at it didn't turn up anything in particular that stands out.
Maybe a slight overrepresentation of lower-spec devices, but that would need a bit more work to confirm. And still wouldn't tell us too much about where the bug could be hiding.
:efilho, what are the current numbers? Am I correct that it's currently still low enough that we don't need to look much further for now?
Comment 7•2 months ago
> how long do we keep data in the structured errors table?
Errors are kept for 775 days
These are the currently affected pings: https://mozilla.cloud.looker.com/x/A6bs7J4FcSqXVWZqgM6JD7
The percentage of affected pings is the error count divided by the number of rows in the stable table. It goes up to 2.18% for fenix beta metrics pings, which is much higher than the usual non-issue schema errors, and 1.35% for baseline.
Error counts over time per app and error path: https://mozilla.cloud.looker.com/x/u8m1sWZBoeGsXfUdy9uNR8
The growth has slowed down which lines up with the release uptake. I would say this amount of errors is too high to ignore and we should try to figure it out before this hits the release channel where it could become problematic. Does looking at what went out in the 2025-11-05 fenix nightly give any hints?
Comment 8•2 months ago (Assignee)
In yesterday's meeting, at the last minute, chutten correctly recognized that the broken "schema: counter" and the correct "labeled_counter" consist of the same number of characters (15). And the data in that somehow-broken part is indeed what is otherwise a labeled counter.
This reeks of some memory corruption or re-use of another buffer.
I'll take another look and see if there's indication based on the date this started happening and also check the code again.
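The length observation is easy to check, and a position-by-position comparison shows that every difference falls within the first 8 characters, i.e. the `labeled_` prefix. A small Python sketch using only the two strings from this bug:

```python
broken = "schema: counter"
expected = "labeled_counter"

print(len(broken), len(expected))  # 15 15

# Positions where the two strings differ.
diff = [i for i, (a, b) in enumerate(zip(broken, expected)) if a != b]
print(diff)  # [0, 1, 2, 4, 5, 6, 7]

# Everything after the 8-character prefix is identical.
print(broken[8:] == expected[8:])  # True
```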
Comment 9•2 months ago (Assignee)
Other instances of this, as per the Looker view above, are always labeled metric types, and the `labeled_` prefix gets replaced by `schema: `. Labeled counters are just the most frequently used of these types.
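For analysis purposes (for example, grouping error rows by the metric type they should have had), the prefix substitution can be reversed. This is a hedged sketch for data exploration only, not the fix that landed; `repaired_section` is a hypothetical helper and "schema: boolean" is an illustrative input.

```python
BROKEN_PREFIX = "schema: "
EXPECTED_PREFIX = "labeled_"


def repaired_section(name: str) -> str:
    """Map a corrupted section name back to its labeled_ form; leave others alone."""
    if name.startswith(BROKEN_PREFIX):
        return EXPECTED_PREFIX + name[len(BROKEN_PREFIX):]
    return name


for name in ["schema: counter", "schema: boolean", "counter"]:
    print(name, "->", repaired_section(name))
```

Both prefixes are 8 characters long, so the repaired name always has the same length as the corrupted one.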
Comment 10•2 months ago (Assignee)
It seems to have started on Android with the v66.1.0 release.
It did land in m-c on 2025-11-04 (bug 1997923).
That was a rather small release, focusing on a single feature. Most of the fixes are nowhere close to the ping payload generation.
Is this spooky action at a distance?
The only thing from the changeset going into that release that sticks out to me at first look is this:
https://github.com/mozilla/glean/commit/22cd16658e01c4740e125370842c8cc946977a8f
We're (logic-correctly) triggering the uploader when we know we have pings to send.
This does call back into Kotlin, from which it eventually gets back to Rust.
But at the point of the call the ping is already assembled. I'm not sure how we could corrupt the payload at that point. So is it already generated corrupted there?
Comment 11•1 month ago (Assignee)
Comment 12•1 month ago
Comment 13•1 month ago
(In reply to Jan-Erik Rediger [:janerik] from comment #8)
> In yesterday's meeting, last minute, chutten correctly recognized that the broken "schema: counter" and the correct "labeled_counter" consist of the same amount of characters (15). And data in that somehow-broken part is indeed what is otherwise a labeled counter.
> This reeks of some memory corruption or re-use of another buffer.
Yes, but it also has the faint whiff of compiler errors. I believe that we have bumped Rust versions and clang versions in the relevant time frames, is it possible that compiler changes line up with the Glean changes?
Comment 14•1 month ago (bugherder)
Comment 15•1 month ago (Assignee)
(In reply to Nick Alexander :nalexander [he/him] from comment #13)
> Yes, but it also has the faint whiff of compiler errors. I believe that we have bumped Rust versions and clang versions in the relevant time frames, is it possible that compiler changes line up with the Glean changes?
I'll double-check those. This might be a combination of changed compilers plus just enough changes in the code to trigger a compiler bug. On Slack the theory is that it's due to LTO (or similar) and differences between arm32 and arm64.
Comment 16•1 month ago (Assignee)
Original Revision: https://phabricator.services.mozilla.com/D274027
Comment 17•1 month ago
firefox-beta Uplift Approval Request
- User impact if declined: Telemetry data from a subset of users will not be correctly ingested.
- Code covered by automated testing: yes
- Fix verified in Nightly: yes
- Needs manual QE test: no
- Steps to reproduce for manual QE testing: -
- Risk associated with taking this patch: low
- Explanation of risk level: Minimal patch of the Glean SDK that doesn't change logic
- String changes made/needed: -
- Is Android affected?: yes
Comment 18•1 month ago (uplift)
Comment 19•1 month ago (Assignee)
(In reply to Nick Alexander :nalexander [he/him] from comment #13)
> Yes, but it also has the faint whiff of compiler errors. I believe that we have bumped Rust versions and clang versions in the relevant time frames, is it possible that compiler changes line up with the Glean changes?
The timelines don't match up, see bug 1948826.
It's more likely this triggered a pre-existing bug in the toolchain targeting armv7.
Comment 20•1 month ago
:janerik, does this still need the leave-open keyword, or should it be resolved?
Comment 21•1 month ago (Assignee)
No, with the patch landed and the early data showing it works we have this covered. I will check the data again on Monday, so we can close out the incident.