Open Bug 1762568 Opened 3 years ago Updated 3 years ago

Some data appears with incorrect labels, with a 1 character variation compared to the label that exists

Categories

(Data Platform and Tools :: Glean: SDK, defect, P4)

defect

Tracking

(Not tracked)

People

(Reporter: florian, Unassigned)

Details

See https://sql.telemetry.mozilla.org/queries/84163#208466

The incorrect labels are web.bAckground-perceivable (A instead of a), web*background (* instead of .), parent.ac|ive.playing-video (| instead of t), parent.ioactive (o instead of n), and an empty label.

It happens on so little data that I wonder if this might be from (a) user(s) with faulty hardware.

In all the above cases, there's only 1 bit of difference:

"A".charCodeAt() - "a".charCodeAt()
-32
"*".charCodeAt() - ".".charCodeAt()
-4
"|".charCodeAt() - "t".charCodeAt()
8
"o".charCodeAt() - "n".charCodeAt()
1

Alrighty, let's catalogue these:

expected actual index of differing character character difference
parent.Active parent.active 7 A - a = -32
web.bAckground-perceivable web.background-perceivable 5 A - a = -32
web*background web.background 3 * - . = -4
parent.ac|ive.playing-video parent.active.playing-video 9 | - t = 8
parent.ioactive parent.inactive 8 o - n = 1

Hm. No real pattern here yet. Let's look at the clients sending us these things... They're from five separate pings from five separate clients. Let's look at the five pings

Times

The pings were received on March 14, 17, 19, 21, and 25. They covered periods of time for the single day prior to their received dates aside from the one received March 25 which covered data starting from Tuesday March 22 (suggesting the client wasn't active between the Tuesday and the Friday).

Clients

  • All on the release channel. All on Windows.
  • Windows 6.1 (Win7): 3, Windows 6.3 (Win8.1): 1, Windows 10.0 (Win10): 1
  • 3 on x86_64, 2 on x86 (both on Win7)
  • 4 on 98.0.1, 1 (the one from March 14) on 98.0. 98.0.1 was released on March 14, so all of these versions were current to their time period.

Geo

  • Hungary, China, Belgium, Belarus, and Poland.
  • All of the client-local timestamps in ping_info have timezones matching the country (+1 and +8)

power.cpu_time_per_process_type_ms

  • The ping with the parent.ioactive label did not have a parent.inactive label. Ditto with parent.ac|ive.playing-video.
  • The ping with the web.bAckground-perceivable label did have a web.background-perceivable label (both values around 67k). Ditto with parent.Active (42k, with 504k in parent.active) and web*background (1, with 5.8k in web.background).

Theories

Okay, so nothing immediately jumps out. The software versions were all current. The clients are from multiple global regions. These happened over the course of two weeks. There's nothing to say it won't happen again: there's no appreciable time before these five instances (bug 1747138 which landed this per-process probe shipped on Fx98 (March 8)), and very little time has passed since the most recent weird sample.

Maybe we can determine where the bitflip happened? The code that constructs these labels does so on every call to RecordPowerMetrics (which happens very frequently) suggesting (to me) that a single bitflip would be unlikely to have the perceived effect. Instead I'd more expect that our bitflipped labels would have very low values and the normal labels would always be present and hold a reasonable value.

Also, this is a statically-labeled labeled_counter: these bitflipped labels aren't permitted and should result in __other__ values (only two of the five pings have values in the __other__ label) meaning that the bitflip almost certainly did not happen outside of glean-core.

We could have a bitflip in the in-memory glean-core CounterMetric's name. Instead of being cpu_time_per_process_type_ms#parent.active it could've flipped to cpu_time_per_process_type_ms#parent.Active... but no, that wouldn't have worked either since every call to glean-core's LabeledMetric::get results in the creation of a new metric object. (seems like this might become a performance problem. Filed bug 1762612)

To me it seems most likely that the bitflip happens on disk to the stored value. That would help explain why some of the pings have both bitflipped and unbitflipped labels: once the bit is flipped, the db will add another record for the unbitflipped variant on the next data recording.

...or. Or I suppose it could happen on the way to the disk. FOG configures the Rust Glean SDK to delay the storing to disk of ping-lifetime data (which power.cpu_time_per_process_type_ms is). The intermediary data gets stored in-memory... so it could be an in-memory bitflip here, which would act like an on-disk bitflip in the db.

...or. Or I suppose bitflips could happen somewhere inside rkv. I'm not sure how it works and am not familiar enough with the code to go diving for answers in it.

The way forward

Given that this is so unlikely to happen above the level of the Glean SDK, I'm moving this bug to that component next. I think there are probably things we can instrument in the storage layer and report via the Glean Error Reporting Mechanism (GERM) (like maybe checking labeled_ metrics with static labels against the labels in the db when assembling pings?) that could help us narrow down where these flips happen.

Aside from that I think we're stuck waiting for more of these to come in to perhaps give us enough of a pattern to guide investigation... or at least that's how it looks from where I'm sitting.

Severity: -- → S4
Component: Telemetry → Glean: SDK
Priority: -- → P1
Product: Toolkit → Data Platform and Tools

Oh, and some other things that just occurred to me:

  • The bitflips are unlikely to have occcured at any time during or after ping assembly, otherwise we wouldn't have any ping with both bitflipped and unbitflipped labels.
  • If storage is bitflipping labels, then what else might it bitflip? If it bitflips metrics names then we might have more than just five instances to build patterns from (we'll have to check additional_properties for off-by-one names)

Moving to backlog. I don't think there's something actionable or urgent to do about it right now.

Priority: P1 → P4
You need to log in before you can comment on or make changes to this bug.