Closed Bug 1999791 Opened 2 months ago Closed 1 month ago

schema error in `org_mozilla_fenix.health_v1` for #/metrics/schema: counter

Categories

(Data Platform and Tools :: General, defect, P1)

defect

Tracking

(firefox-esr140 unaffected, firefox145 unaffected, firefox146+ fixed, firefox147+ fixed)

RESOLVED FIXED
Tracking Status
firefox-esr140 --- unaffected
firefox145 --- unaffected
firefox146 + fixed
firefox147 + fixed

People

(Reporter: efilho, Assigned: janerik)

References

Details

(Whiteboard: [dataquality])

Attachments

(2 files)

Schema error for #/metrics/schema: counter in org_mozilla_fenix.health_v1.

Encountered 127218 errors in the past week, which is a change from the previous week. The error count is 0.90% of 14116325.0 valid pings.

See runbook for resolution: https://mozilla-hub.atlassian.net/wiki/spaces/DATA/pages/1134854153/Schema+Errors

Duplicate of this bug: 1999794

Example of error metric payloads: https://sql.telemetry.mozilla.org/queries/112174/source

{
    "metrics": {
        "schema: counter": {
            "glean.validation.pings_submitted": {
                "events": 1
            }
        },
        ...
    },
    ...
}

glean.validation.pings_submitted is a labeled counter, but the payloads report it under "schema: counter": https://dictionary.telemetry.mozilla.org/apps/fenix/metrics/glean_validation_pings_submitted
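
For comparison, a well-formed payload would carry the same data under labeled_counter instead (a sketch of the expected shape; values illustrative):

{
    "metrics": {
        "labeled_counter": {
            "glean.validation.pings_submitted": {
                "events": 1
            }
        },
        ...
    },
    ...
}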

This is only seen in Fenix right now; starting in version 146 it is also seen in the baseline ping.

Jan-Erik, do you know what could be causing this?

Oops, missed the ni?

Flags: needinfo?(jrediger)

I'm a bit baffled by this. It should be impossible for Glean to generate this.
We read data from the database and decode it into a metric type.
From that metric type we determine the metric type name it goes in ("counter", etc.).
Due to how we encode things, labeled metrics are handled slightly differently: a / in the stored name means we're looking at a labeled metric.
For those we prefix the metric type name with labeled_.
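
A minimal sketch of that lookup (not the actual Glean code; function and parameter names here are made up for illustration):

// Simplified sketch, not the actual glean-core code.
// `stored_name` is how the metric is keyed in the database; a '/' in it
// marks a labeled metric (base name / label).
fn metric_type_name(stored_name: &str, base_type: &str) -> String {
    if stored_name.contains('/') {
        // labeled metric: prefix the base type, e.g. "labeled_counter"
        format!("labeled_{}", base_type)
    } else {
        base_type.to_string()
    }
}

fn main() {
    assert_eq!(metric_type_name("pings_submitted/events", "counter"), "labeled_counter");
    assert_eq!(metric_type_name("pings_submitted", "counter"), "counter");
    // the only strings this can produce here are "counter" and "labeled_counter"
}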

"schema: counter" is simply not a string that is constructed in our code base. And yet we see it with what looks like otherwise valid data?
I'll keep the ni? for a moment to look at this again.

So I queried a bit more: https://sql.telemetry.mozilla.org/queries/112407/source

  • This is visible on both beta and nightly on Android, across the metrics, baseline, and health pings.
  • It seems to have leveled off (or at least significantly slowed down) over the past few days.
  • It started on 2025-11-05 on nightly and on 2025-11-11 on beta.
  • Firefox Desktop shows a handful of instances since 2025-08-24 (how long do we keep data in the structured errors table?)
    • metrics & baseline ping
  • Nothing on iOS

Glean v65.0.0 was released on 2025-08-18 and landed in m-c on 2025-08-22.
So if those 2025-08-24 instances are indeed the first ones, it could be because of changes in that release.

Why not earlier on Android though?
Why not on iOS at all?

(Update: clarified which pings this is seen on)

Flags: needinfo?(jrediger)

So the initial look at it didn't turn up anything that particularly stands out.
Maybe a slight overrepresentation of lower-spec devices, but that would need a bit more work to confirm. And still wouldn't tell us too much about where the bug could be hiding.

:efilho, what are the current numbers? Am I correct that it's currently still low enough that we don't need to look much further for now?

Flags: needinfo?(efilho)

how long do we keep data in the structured errors table?

Errors are kept for 775 days

These are the currently affected pings: https://mozilla.cloud.looker.com/x/A6bs7J4FcSqXVWZqgM6JD7

The % of affected pings is error count / rows in the stable table. This goes up to 2.18% for the Fenix beta metrics ping, which is much higher than the usual non-issue schema errors, and 1.35% for baseline.

Error counts over time per app and error path: https://mozilla.cloud.looker.com/x/u8m1sWZBoeGsXfUdy9uNR8

The growth has slowed down, which lines up with the release uptake. I would say this amount of errors is too high to ignore, and we should try to figure it out before this hits the release channel, where it could become problematic. Does looking at what went out in the 2025-11-05 Fenix nightly give any hints?

Flags: needinfo?(efilho)
Assignee: nobody → jrediger
Priority: -- → P1

In yesterday's meeting, last minute, chutten correctly recognized that the broken "schema: counter" and the correct "labeled_counter" consist of the same number of characters (15). And data in that somehow-broken part is indeed what is otherwise a labeled counter.
This reeks of some memory corruption or re-use of another buffer.
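
A quick sanity check of that observation (nothing Glean-specific, just string lengths):

fn main() {
    // the bogus prefix and the expected prefix are both 8 bytes
    assert_eq!("schema: ".len(), "labeled_".len());
    // so the full type names are both 15 bytes
    assert_eq!("schema: counter".len(), 15);
    assert_eq!("labeled_counter".len(), 15);
}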

I'll take another look and see if there's indication based on the date this started happening and also check the code again.

Other instances of this, as per the Looker view above, always involve labeled metric types, with the labeled_ prefix replaced by schema: . Labeled counters are just the most frequently used of these types.

It seems to have started on Android with the v66.1.0 release.
It did land in m-c on 2025-11-04 (bug 1997923).

That was a rather small release, focusing on a single feature. Most of the fixes are nowhere close to the ping payload generation.
Is this spooky action at a distance?

The only thing from the changeset going into that release that sticks out to me on a first look is this:
https://github.com/mozilla/glean/commit/22cd16658e01c4740e125370842c8cc946977a8f

We're now (correctly, logic-wise) triggering the uploader when we know we have pings to send.
This does call back into Kotlin, from which it eventually gets back to Rust.
But at the point of that call the ping is already assembled. I'm not sure how we could corrupt the payload at that point. So is it already being generated corrupted there?
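
To make that ordering concrete, a heavily simplified sketch with invented names (not the actual glean-core code):

// Invented names; only the ordering matters here: the payload is assembled and
// persisted before the foreign-language (Kotlin) uploader callback runs.
trait PingUploadTrigger {
    fn trigger_upload(&self);
}

struct NoopTrigger;
impl PingUploadTrigger for NoopTrigger {
    fn trigger_upload(&self) {
        println!("would call back into Kotlin here");
    }
}

fn persist_to_pending_pings_dir(payload: &str) {
    // in reality: write the finished payload to the pending pings directory
    println!("persisted {} bytes", payload.len());
}

fn main() {
    let payload = String::from("{\"metrics\":{}}"); // already fully assembled
    persist_to_pending_pings_dir(&payload);
    NoopTrigger.trigger_upload(); // any corruption would have to precede this call
}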

See Also: → 2002282
Keywords: leave-open
Pushed by jrediger@mozilla.com: https://github.com/mozilla-firefox/firefox/commit/8dd3430b7a02 https://hg.mozilla.org/integration/autoland/rev/9a55d6e8c855 Update to Glean v66.1.2 r=chutten,supply-chain-reviewers,mach-reviewers,ahochheiden

(In reply to Jan-Erik Rediger [:janerik] from comment #8)

In yesterday's meeting, last minute, chutten correctly recognized that the broken "schema: counter" and the correct "labeled_counter" consist of the same number of characters (15). And data in that somehow-broken part is indeed what is otherwise a labeled counter.
This reeks of some memory corruption or re-use of another buffer.

Yes, but it also has the faint whiff of compiler errors. I believe that we have bumped Rust versions and clang versions in the relevant time frames; is it possible that compiler changes line up with the Glean changes?

Flags: needinfo?(jrediger)

(In reply to Nick Alexander :nalexander [he/him] from comment #13)

Yes, but it also has the faint whiff of compiler errors. I believe that we have bumped Rust versions and clang versions in the relevant time frames; is it possible that compiler changes line up with the Glean changes?

I'll double-check those. This might be a combination of changed compilers + just enough changes in the code to trigger a compiler bug. On Slack the theory is that it's due to LTO (or similar) and differences between arm32 and arm64.

Attachment #9529217 - Flags: approval-mozilla-beta?

firefox-beta Uplift Approval Request

  • User impact if declined: Telemetry data from a subset of users will not be correctly ingested.
  • Code covered by automated testing: yes
  • Fix verified in Nightly: yes
  • Needs manual QE test: no
  • Steps to reproduce for manual QE testing: -
  • Risk associated with taking this patch: low
  • Explanation of risk level: Minimal patch to the Glean SDK that doesn't change logic
  • String changes made/needed: -
  • Is Android affected?: yes
Attachment #9529217 - Flags: approval-mozilla-beta? → approval-mozilla-beta+

(In reply to Nick Alexander :nalexander [he/him] from comment #13)

Yes, but it also has the faint whiff of compiler errors. I believe that we have bumped Rust versions and clang versions in the relevant time frames; is it possible that compiler changes line up with the Glean changes?

The timelines don't match up; see bug 1948826.
It's more likely this triggered a pre-existing bug in the toolchain targeting armv7.

Flags: needinfo?(jrediger)

:janerik, does this still need to have the leave-open keyword, or should it be resolved?

Flags: needinfo?(jrediger)

No, with the patch landed and the early data showing it works, we have this covered. I will check the data again on Monday so we can close out the incident.

Status: NEW → RESOLVED
Closed: 1 month ago
Flags: needinfo?(jrediger)
Keywords: leave-open
Resolution: --- → FIXED