Label definition in Rust code doesn't match the schema
Categories
(Data Platform and Tools :: Glean: SDK, defect, P3)
Tracking
(Not tracked)
People
(Reporter: mdroettboom, Assigned: mdroettboom)
Details
(Whiteboard: [telemetry:glean-rs:m11])
Attachments
(2 files)
Just reading through Rust code on a lazy Friday afternoon, noticed that:
While the regular expression used to validate dynamic labels in the client, and the regular expression used to validate static labels in glean_parser match, they don't match the one used to validate labels in the schema.
Client SDK and glean_parser: "^[a-z_][a-z0-9_-]{0,29}(\\.[a-z0-9_-]{0,29})*$"
Pipeline schema: "^[a-z_][a-z0-9_]{0,29}(\\.[a-z_][a-z0-9_]{0,29})+$"
- we have a theoretical situation where the client might generate pings that are invalid against the schema.
- I don't think either of these regexes is correct.
The pipeline schema doesn't allow hyphens, but I think that is something we want to do. Can't find the original bug, but this PR allowed them in Glean.
The client SDK regex allows sections following the period to begin with a digit, which is unlike most conventions, and is very likely to be disallowed by something further down the pipeline.
I would suggest we correct them both to the following, and update unit tests accordingly:
"^[a-z_-][a-z0-9_-]{0,29}(\\.[a-z_-][a-z0-9_-]{0,29})+$"
What's that joke about regular expressions again?
Comment 1•6 years ago
|
||
For reference we had a similar regex in Glean-ac: https://github.com/mozilla-mobile/android-components/commit/237cb5b90b9ac9889912bee064010625361feb6e
private val labelRegex = Regex("^[a-z_][a-z0-9_]{0,29}(\\.[a-z0-9_]{0,29})*$")
Comment 2•6 years ago
|
||
and then we allowed dashes there too: https://github.com/mozilla-mobile/android-components/pull/3801/files
and this came from https://bugzilla.mozilla.org/show_bug.cgi?id=1566764
also relevant: https://bugzilla.mozilla.org/show_bug.cgi?id=1585017
Comment 3•6 years ago
|
||
Your proposed regex doesn't match label1 which leads me to believe this is far more complex then what we want.
Comment 4•6 years ago
|
||
Simplified proposal:
^[a-z_](\.[a-z0-9]|[a-z0-9_-])*$
- Start with a lower-case letter or underscore
- Followed repeatedly by:
- A dot, followed by a number or character.
- Or alphanumeric character/dash/underscore
This requires the length check to be done on top of that (likely before applying the regex in our codebase, the schema already does it anyway)
I tested this against our listed inline docs.
Maybe I'm overlooking a case though?
| Assignee | ||
Comment 5•6 years ago
|
||
(In reply to Jan-Erik Rediger [:janerik] from comment #3)
Your proposed regex doesn't match
label1which leads me to believe this is far more complex then what we want.
Indeed. It should be (* rather than + at the end):
"^[a-z_-][a-z0-9_-]{0,29}(\\.[a-z_-][a-z0-9_-]{0,29})*$"
The advantage of this more complicated regex is that it confirms that labels follow the category and name length of 30 characters enforced elsewhere, which allows a more seamless transfer between labeled and non-labeled metrics.
But the simpler one is probably "good enough".
Comment 6•6 years ago
|
||
Ah, I was unaware that we do split labels up in category/name with limits on its own.
Updated•6 years ago
|
| Assignee | ||
Updated•6 years ago
|
| Assignee | ||
Comment 7•6 years ago
|
||
On further investigation, there is no regex used in the pipeline schemas:
We still should fix the regex using in glean and glean_parser, but this makes all of this much lower risk for data loss.
| Assignee | ||
Comment 8•6 years ago
|
||
| Assignee | ||
Comment 9•6 years ago
|
||
| Assignee | ||
Updated•6 years ago
|
Description
•