Some data appears with incorrect labels, with a 1 character variation compared to the label that exists
Categories
(Data Platform and Tools :: Glean: SDK, defect, P4)
Tracking
(Not tracked)
People
(Reporter: florian, Unassigned)
Details
See https://sql.telemetry.mozilla.org/queries/84163#208466
The incorrect labels are web.bAckground-perceivable
(A
instead of a
), web*background
(*
instead of .
), parent.ac|ive.playing-video
(|
instead of t
), parent.ioactive
(o
instead of n
), and an empty label.
It happens on so little data that I wonder if this might be from (a) user(s) with faulty hardware.
Reporter | ||
Comment 1•3 years ago
|
||
In all the above cases, there's only 1 bit of difference:
"A".charCodeAt() - "a".charCodeAt()
-32
"*".charCodeAt() - ".".charCodeAt()
-4
"|".charCodeAt() - "t".charCodeAt()
8
"o".charCodeAt() - "n".charCodeAt()
1
Comment 2•3 years ago
|
||
Alrighty, let's catalogue these:
expected | actual | index of differing character | character difference |
---|---|---|---|
parent.Active |
parent.active |
7 | A - a = -32 |
web.bAckground-perceivable |
web.background-perceivable |
5 | A - a = -32 |
web*background |
web.background |
3 | * - . = -4 |
parent.ac|ive.playing-video |
parent.active.playing-video |
9 | | - t = 8 |
parent.ioactive |
parent.inactive |
8 | o - n = 1 |
Hm. No real pattern here yet. Let's look at the clients sending us these things... They're from five separate pings from five separate clients. Let's look at the five pings
Times
The pings were received on March 14, 17, 19, 21, and 25. They covered periods of time for the single day prior to their received dates aside from the one received March 25 which covered data starting from Tuesday March 22 (suggesting the client wasn't active between the Tuesday and the Friday).
Clients
- All on the release channel. All on Windows.
- Windows 6.1 (Win7): 3, Windows 6.3 (Win8.1): 1, Windows 10.0 (Win10): 1
- 3 on x86_64, 2 on x86 (both on Win7)
- 4 on 98.0.1, 1 (the one from March 14) on 98.0. 98.0.1 was released on March 14, so all of these versions were current to their time period.
Geo
- Hungary, China, Belgium, Belarus, and Poland.
- All of the client-local timestamps in
ping_info
have timezones matching the country (+1 and +8)
power.cpu_time_per_process_type_ms
- The ping with the
parent.ioactive
label did not have aparent.inactive
label. Ditto withparent.ac|ive.playing-video
. - The ping with the
web.bAckground-perceivable
label did have aweb.background-perceivable
label (both values around 67k). Ditto withparent.Active
(42k, with 504k inparent.active
) andweb*background
(1, with 5.8k inweb.background
).
Theories
Okay, so nothing immediately jumps out. The software versions were all current. The clients are from multiple global regions. These happened over the course of two weeks. There's nothing to say it won't happen again: there's no appreciable time before these five instances (bug 1747138 which landed this per-process probe shipped on Fx98 (March 8)), and very little time has passed since the most recent weird sample.
Maybe we can determine where the bitflip happened? The code that constructs these labels does so on every call to RecordPowerMetrics (which happens very frequently) suggesting (to me) that a single bitflip would be unlikely to have the perceived effect. Instead I'd more expect that our bitflipped labels would have very low values and the normal labels would always be present and hold a reasonable value.
Also, this is a statically-labeled labeled_counter
: these bitflipped labels aren't permitted and should result in __other__
values (only two of the five pings have values in the __other__
label) meaning that the bitflip almost certainly did not happen outside of glean-core.
We could have a bitflip in the in-memory glean-core CounterMetric's name. Instead of being cpu_time_per_process_type_ms#parent.active
it could've flipped to cpu_time_per_process_type_ms#parent.Active
... but no, that wouldn't have worked either since every call to glean-core's LabeledMetric::get
results in the creation of a new metric object. (seems like this might become a performance problem. Filed bug 1762612)
To me it seems most likely that the bitflip happens on disk to the stored value. That would help explain why some of the pings have both bitflipped and unbitflipped labels: once the bit is flipped, the db will add another record for the unbitflipped variant on the next data recording.
...or. Or I suppose it could happen on the way to the disk. FOG configures the Rust Glean SDK to delay the storing to disk of ping-lifetime data (which power.cpu_time_per_process_type_ms
is). The intermediary data gets stored in-memory... so it could be an in-memory bitflip here, which would act like an on-disk bitflip in the db.
...or. Or I suppose bitflips could happen somewhere inside rkv
. I'm not sure how it works and am not familiar enough with the code to go diving for answers in it.
The way forward
Given that this is so unlikely to happen above the level of the Glean SDK, I'm moving this bug to that component next. I think there are probably things we can instrument in the storage layer and report via the Glean Error Reporting Mechanism (GERM) (like maybe checking labeled_
metrics with static labels against the labels in the db when assembling pings?) that could help us narrow down where these flips happen.
Aside from that I think we're stuck waiting for more of these to come in to perhaps give us enough of a pattern to guide investigation... or at least that's how it looks from where I'm sitting.
Comment 3•3 years ago
|
||
Oh, and some other things that just occurred to me:
- The bitflips are unlikely to have occcured at any time during or after ping assembly, otherwise we wouldn't have any ping with both bitflipped and unbitflipped labels.
- If storage is bitflipping labels, then what else might it bitflip? If it bitflips metrics names then we might have more than just five instances to build patterns from (we'll have to check
additional_properties
for off-by-one names)
Comment 4•3 years ago
|
||
Moving to backlog. I don't think there's something actionable or urgent to do about it right now.
Description
•