Closed Bug 1608585 Opened 6 years ago Closed 6 years ago

Label definition in Rust code doesn't match the schema

Categories

(Data Platform and Tools :: Glean: SDK, defect, P3)

defect

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: mdroettboom, Assigned: mdroettboom)

Details

(Whiteboard: [telemetry:glean-rs:m11])

Attachments

(2 files)

Just reading through Rust code on a lazy Friday afternoon, noticed that:

While the regular expression used to validate dynamic labels in the client, and the regular expression used to validate static labels in glean_parser match, they don't match the one used to validate labels in the schema.

Client SDK and glean_parser: "^[a-z_][a-z0-9_-]{0,29}(\\.[a-z0-9_-]{0,29})*$"
Pipeline schema: "^[a-z_][a-z0-9_]{0,29}(\\.[a-z_][a-z0-9_]{0,29})+$"

  1. we have a theoretical situation where the client might generate pings that are invalid against the schema.
  2. I don't think either of these regexes is correct.

The pipeline schema doesn't allow hyphens, but I think that is something we want to do. Can't find the original bug, but this PR allowed them in Glean.

The client SDK regex allows sections following the period to begin with a digit, which is unlike most conventions, and is very likely to be disallowed by something further down the pipeline.

I would suggest we correct them both to the following, and update unit tests accordingly:

"^[a-z_-][a-z0-9_-]{0,29}(\\.[a-z_-][a-z0-9_-]{0,29})+$"

What's that joke about regular expressions again?

For reference we had a similar regex in Glean-ac: https://github.com/mozilla-mobile/android-components/commit/237cb5b90b9ac9889912bee064010625361feb6e

private val labelRegex = Regex("^[a-z_][a-z0-9_]{0,29}(\\.[a-z0-9_]{0,29})*$")

Your proposed regex doesn't match label1 which leads me to believe this is far more complex then what we want.

Simplified proposal:

^[a-z_](\.[a-z0-9]|[a-z0-9_-])*$
  1. Start with a lower-case letter or underscore
  2. Followed repeatedly by:
    • A dot, followed by a number or character.
    • Or alphanumeric character/dash/underscore

This requires the length check to be done on top of that (likely before applying the regex in our codebase, the schema already does it anyway)

I tested this against our listed inline docs.

Maybe I'm overlooking a case though?

Flags: needinfo?(mdroettboom)

(In reply to Jan-Erik Rediger [:janerik] from comment #3)

Your proposed regex doesn't match label1 which leads me to believe this is far more complex then what we want.

Indeed. It should be (* rather than + at the end):

"^[a-z_-][a-z0-9_-]{0,29}(\\.[a-z_-][a-z0-9_-]{0,29})*$"

The advantage of this more complicated regex is that it confirms that labels follow the category and name length of 30 characters enforced elsewhere, which allows a more seamless transfer between labeled and non-labeled metrics.

But the simpler one is probably "good enough".

Flags: needinfo?(mdroettboom)

Ah, I was unaware that we do split labels up in category/name with limits on its own.

Whiteboard: [telemetry:glean-rs:m?] → [telemetry:glean-rs:m11]
Assignee: nobody → mdroettboom

On further investigation, there is no regex used in the pipeline schemas:

https://github.com/mozilla-services/mozilla-pipeline-schemas/blob/master/templates/include/glean/labeled_group.1.schema.json

We still should fix the regex using in glean and glean_parser, but this makes all of this much lower risk for data loss.

Status: NEW → RESOLVED
Closed: 6 years ago
Resolution: --- → FIXED
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: