Closed Bug 1621757 Opened 4 years ago Closed 4 years ago

Investigate: What happens if a metric changes data type in the client?

Categories

(Data Platform and Tools :: Glean: SDK, task, P1)

task

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: mdroettboom, Assigned: Dexter)

Details

Attachments

(2 files)

We have a good understanding of what happens to the database schema if a metric changes its metric type. We should investigate what happens in the client database if a metric is recorded with one type, and later again with the same name as a different type, to understand the scope of the problem (if any) and what steps we should take to make that safer or more robust.

Assignee: nobody → alessio.placitelli
Priority: P3 → P1
Whiteboard: [telemetry:glean-rs:m?]

Metrics are recorded in different stores based on their lifetimes. Within a single lifetime store, metrics are additionally persisted partitioned by the 'storage name' (i.e. the ping they will end up in) and metric id.

When changing the type of a metric, the metric id will remain the same, since the metric identifier is made up of its category and the metric name.

This implies that the Glean SDK will face some data-loss scenarios if the the type of a metric changes, but the lifetime and the container ping name stay the same.

Lifetime change Store change Outcome
Y N No data loss. Depending on the old and new lifetime, data from both the metric type may be sent in the same ping.
Y Y No data loss. The new ping will contain only the data from the new type. The old ping/store will contain data from the old type.
N Y No data loss. The new ping will contain only the data from the new type. The old ping/store will contain data from the old type.
N N Some data loss. The ping will contain only the data from the new type even if some data was previously stored with the old type.

The following is a breakdown of the different data-loss scenarios:

Ping lifetime

If delay_ping_lifetime_io is active

  • there was a prevously stored value: if the transform function uses the previous stored value, the result depends on how the metric type handles faulty old values;
  • no prevously stored value: nothing odd happens

If delay_ping_lifetime_io is inactive

  • there was a prevously stored value: depends on the transform function as well.
  • no prevously stored value: nothing odd happens

Other lifetimes

The behaviour is similar to the ping lifetime, without the distinction for the delayed I/O.

@Jan-Erik, does the above make sense given your knowledge of the core part?
@Mike, assuming the above is correct, I think there's minimal chance for unexpected data loss. Ideally, that would only really be an issue for user lifetime metrics that changed the type (other lifetimes are, in theory, shorter so even if there's data loss it would be less dramatic).

This seems like something that should rather be addressed through documentation. Thoughts?

Flags: needinfo?(jrediger)
Attached file poc_test.rs

Here's the test that I used to investigate the problem.

Whoops, forgot to flag you Mike on comment 1.

Flags: needinfo?(mdroettboom)

Yes, I agree, this is something that could be addressed with docs, and reduces the importance of the glinter history checking proposed in bug 1621975.

Flags: needinfo?(mdroettboom)

That investigation sounds largely correct to me.
Maybe one thing to call out explicitly: we overwrite "invalid" data on metric modification functions. E.g. for a timing distribution we replace whatever data was stored with a fresh histogram if we can't decode it (same for counter, string list, etc). This is relevant for "same lifetime, changed type".

Would it make sense to put that test into our code base somehow? Even if incomplete and we #[ignore] it it could act as a reminder for us (and link back to the bug here)

Flags: needinfo?(jrediger)
Attached file GitHub Pull Request
Status: NEW → RESOLVED
Closed: 4 years ago
Resolution: --- → FIXED
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: