Closed Bug 1999791 Opened 2 months ago Closed 1 month ago

schema error in `org_mozilla_fenix.health_v1` for #/metrics/schema: counter

Categories

(Data Platform and Tools :: General, defect, P1)

defect

Tracking

(firefox-esr140 unaffected, firefox145 unaffected, firefox146+ fixed, firefox147+ fixed)

RESOLVED FIXED
Tracking Status
firefox-esr140 --- unaffected
firefox145 --- unaffected
firefox146 + fixed
firefox147 + fixed

People

(Reporter: efilho, Assigned: janerik)

References

Details

(Whiteboard: [dataquality])

Attachments

(2 files)

Schema error for #/metrics/schema: counter in org_mozilla_fenix.health_v1.

Encountered 127218 errors in the past week, which is a change from the previous week. The error count is 0.90% of 14116325.0 valid pings.

See runbook for resolution: https://mozilla-hub.atlassian.net/wiki/spaces/DATA/pages/1134854153/Schema+Errors

Duplicate of this bug: 1999794

Example of error metric payloads: https://sql.telemetry.mozilla.org/queries/112174/source

{
    "metrics": {
        "schema: counter": {
            "glean.validation.pings_submitted": {
                "events": 1
            }
        },
        ...
    },
    ...
}

glean.validation.pings_submitted is a labeled counter, but the payloads report it under "schema: counter": https://dictionary.telemetry.mozilla.org/apps/fenix/metrics/glean_validation_pings_submitted
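
For comparison, a well-formed payload would carry the same data under labeled_counter instead (a sketch of the expected shape; values illustrative):

{
    "metrics": {
        "labeled_counter": {
            "glean.validation.pings_submitted": {
                "events": 1
            }
        },
        ...
    },
    ...
}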

This is only seen in Fenix right now; starting in version 146 it is also seen in the baseline ping.

Jan-Erik, do you know what could be causing this?

Oops, missed the ni?

Flags: needinfo?(jrediger)

I'm a bit baffled by this. It should be impossible for Glean to generate this.
We read data from the database and decode it into a metric type.
From that metric type we determine the metric type name it goes in ("counter", etc.).
Due to how we encode things, labeled metrics are handled slightly differently: a / in the stored name means we're looking at a labeled metric.
For those we prefix the metric type name with labeled_.
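
A minimal sketch of that lookup (not the actual Glean code; function and parameter names here are made up for illustration):

// Simplified sketch, not the actual glean-core code.
// `stored_name` is how the metric is keyed in the database; a '/' in it
// marks a labeled metric (base name / label).
fn metric_type_name(stored_name: &str, base_type: &str) -> String {
    if stored_name.contains('/') {
        // labeled metric: prefix the base type, e.g. "labeled_counter"
        format!("labeled_{}", base_type)
    } else {
        base_type.to_string()
    }
}

fn main() {
    assert_eq!(metric_type_name("pings_submitted/events", "counter"), "labeled_counter");
    assert_eq!(metric_type_name("pings_submitted", "counter"), "counter");
    // the only strings this can produce here are "counter" and "labeled_counter"
}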

"schema: counter" is simply not a string that is constructed in our code base. And yet we see it with what looks like otherwise valid data?
I'll keep the ni? for a moment to look at this again.

So I queried a bit more: https://sql.telemetry.mozilla.org/queries/112407/source

  • This is visible on both beta and nightly on Android, across the metrics, baseline, and health pings.
  • It seems to have leveled off (or at least significantly slowed down) over the past few days.
  • It started on 2025-11-05 on nightly and on 2025-11-11 on beta.
  • Firefox Desktop shows a handful of instances since 2025-08-24 (how long do we keep data in the structured errors table?)
    • metrics & baseline ping
  • Nothing on iOS

Glean v65.0.0 was released on 2025-08-18 and landed in m-c on 2025-08-22.
So if those 2025-08-24 instances are indeed the first ones, it could be because of changes in that release.

Why not earlier on Android though?
Why not on iOS at all?

(Update: clarified which pings this is seen on)

Flags: needinfo?(jrediger)

So the initial look at it didn't turn up anything that particularly stands out.
Maybe a slight overrepresentation of lower-spec devices, but that would need a bit more work to confirm. And still wouldn't tell us too much about where the bug could be hiding.

:efilho, what are the current numbers? Am I correct that it's currently still low enough that we don't need to look much further for now?

Flags: needinfo?(efilho)

how long do we keep data in the structured errors table?

Errors are kept for 775 days

These are the currently affected pings: https://mozilla.cloud.looker.com/x/A6bs7J4FcSqXVWZqgM6JD7

The % of affected pings is error count / rows in the stable table. This goes up to 2.18% for the Fenix beta metrics ping, which is much higher than the usual non-issue schema errors, and 1.35% for baseline.

Error counts over time per app and error path: https://mozilla.cloud.looker.com/x/u8m1sWZBoeGsXfUdy9uNR8

The growth has slowed down, which lines up with the release uptake. I would say this amount of errors is too high to ignore, and we should try to figure it out before this hits the release channel, where it could become problematic. Does looking at what went out in the 2025-11-05 Fenix nightly give any hints?

Flags: needinfo?(efilho)
Assignee: nobody → jrediger
Priority: -- → P1

In yesterday's meeting, last minute, chutten correctly recognized that the broken "schema: counter" and the correct "labeled_counter" consist of the same number of characters (15). And data in that somehow-broken part is indeed what is otherwise a labeled counter.
This reeks of some memory corruption or re-use of another buffer.
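
A quick sanity check of that observation (nothing Glean-specific, just string lengths):

fn main() {
    // the bogus prefix and the expected prefix are both 8 bytes
    assert_eq!("schema: ".len(), "labeled_".len());
    // so the full type names are both 15 bytes
    assert_eq!("schema: counter".len(), 15);
    assert_eq!("labeled_counter".len(), 15);
}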

I'll take another look and see if there's indication based on the date this started happening and also check the code again.

Other instances of this, as per the Looker view above, always involve labeled metric types, with the labeled_ prefix replaced by schema: . Labeled counters are just the most frequently used of these types.

It seems to have started on Android with the v66.1.0 release.
It did land in m-c on 2025-11-04 (bug 1997923).

That was a rather small release, focusing on a single feature. Most of the fixes are nowhere close to the ping payload generation.
Is this spooky action at a distance?

The only thing from the changeset going into that release that sticks out to me on a first look is this:
https://github.com/mozilla/glean/commit/22cd16658e01c4740e125370842c8cc946977a8f

We're now (correctly, logic-wise) triggering the uploader when we know we have pings to send.
This does call back into Kotlin, from which it eventually gets back to Rust.
But at the point of that call the ping is already assembled. I'm not sure how we could corrupt the payload at that point. So is it already being generated corrupted there?
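
To make that ordering concrete, a heavily simplified sketch with invented names (not the actual glean-core code):

// Invented names; only the ordering matters here: the payload is assembled and
// persisted before the foreign-language (Kotlin) uploader callback runs.
trait PingUploadTrigger {
    fn trigger_upload(&self);
}

struct NoopTrigger;
impl PingUploadTrigger for NoopTrigger {
    fn trigger_upload(&self) {
        println!("would call back into Kotlin here");
    }
}

fn persist_to_pending_pings_dir(payload: &str) {
    // in reality: write the finished payload to the pending pings directory
    println!("persisted {} bytes", payload.len());
}

fn main() {
    let payload = String::from("{\"metrics\":{}}"); // already fully assembled
    persist_to_pending_pings_dir(&payload);
    NoopTrigger.trigger_upload(); // any corruption would have to precede this call
}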

See Also: → 2002282
Keywords: leave-open
Pushed by jrediger@mozilla.com: https://github.com/mozilla-firefox/firefox/commit/8dd3430b7a02 https://hg.mozilla.org/integration/autoland/rev/9a55d6e8c855 Update to Glean v66.1.2 r=chutten,supply-chain-reviewers,mach-reviewers,ahochheiden

(In reply to Jan-Erik Rediger [:janerik] from comment #8)

In yesterday's meeting, last minute, chutten correctly recognized that the broken "schema: counter" and the correct "labeled_counter" consist of the same number of characters (15). And data in that somehow-broken part is indeed what is otherwise a labeled counter.
This reeks of some memory corruption or re-use of another buffer.

Yes, but it also has the faint whiff of compiler errors. I believe that we have bumped Rust versions and clang versions in the relevant time frames; is it possible that compiler changes line up with the Glean changes?

Flags: needinfo?(jrediger)

(In reply to Nick Alexander :nalexander [he/him] from comment #13)

Yes, but it also has the faint whiff of compiler errors. I believe that we have bumped Rust versions and clang versions in the relevant time frames; is it possible that compiler changes line up with the Glean changes?

I'll double-check those. This might be a combination of changed compilers + just enough changes in the code to trigger a compiler bug. On Slack the theory is that it's due to LTO (or similar) and differences between arm32 and arm64.

Attachment #9529217 - Flags: approval-mozilla-beta?

firefox-beta Uplift Approval Request

  • User impact if declined: Telemetry data from a subset of users will not be correctly ingested.
  • Code covered by automated testing: yes
  • Fix verified in Nightly: yes
  • Needs manual QE test: no
  • Steps to reproduce for manual QE testing: -
  • Risk associated with taking this patch: low
  • Explanation of risk level: Minimal patch to the Glean SDK that doesn't change logic
  • String changes made/needed: -
  • Is Android affected?: yes
Attachment #9529217 - Flags: approval-mozilla-beta? → approval-mozilla-beta+

(In reply to Nick Alexander :nalexander [he/him] from comment #13)

Yes, but it also has the faint whiff of compiler errors. I believe that we have bumped Rust versions and clang versions in the relevant time frames; is it possible that compiler changes line up with the Glean changes?

The timelines don't match up; see bug 1948826.
It's more likely this triggered a pre-existing bug in the toolchain targeting armv7.

Flags: needinfo?(jrediger)

:janerik, does this still need to have the leave-open keyword, or should it be resolved?

Flags: needinfo?(jrediger)

No, with the patch landed and the early data showing it works, we have this covered. I will check the data again on Monday so we can close out the incident.

Status: NEW → RESOLVED
Closed: 1 month ago
Flags: needinfo?(jrediger)
Keywords: leave-open
Resolution: --- → FIXED