Proposal to change `labeled_*` metrics' label format to be more lenient
Categories: Data Platform and Tools :: Glean Metric Types (task)
Tracking: Not tracked
People: Reporter: chutten; Unassigned
References: Blocks 1 open bug
Proposal for changing an existing or adding a new Glean metric type
Who is the individual/team requesting this change?
Chris H-C (:chutten), Glean SDK team. On behalf of :florian and :nika.
Is this about changing an existing metric type or creating a new one?
Changing labeled metric types in the present and into the future.
Can you describe the data that needs to be recorded?
Thread names, IPC Message names, Search engine names, <other capitalized or punctuated strings>
Can you provide a raw sample of the data that needs to be recorded? (This is in the abstract, and not any particular implementation details about its representation in the payload or the database.)
PSocketProcess__Msg_OnHttpActivityDistributorObserveConnection
What is the business question/use-case that requires the data to be recorded?
Performance and power use in this case. In the broader case, being able to send unconjugated search engine names may be tied to business purposes.
How would the data be consumed?
Same as now (no change): Looker, GLAM, SQL, etc.
Why existing metric types are not enough?
`labeled_*` metrics apply a variety of slightly different regexes to determine which labels are permitted. Most attempt to conform to the label format, which mandates words of up to 30 characters delimited by `.`, with a max length of around 71 (to ensure Glean metrics can be encoded).
According to the docs, this is "To ensure maximum support in database columns." But this is incorrect: if we're trying to support these as valid identifiers, field names, or column names in BQ, then we can't use `.`, and we have a much wider variety of characters available to us.
Also, because `labeled_*` metric types support dynamic labels, we can never make them into column names anyway. They could contain any (valid) string.
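To make the mismatch concrete, here is a minimal sketch of a strict label check along the lines described above. The per-word character class (lowercase alphanumerics, underscore, dash) and the names `STRICT_LABEL` and `is_strict_label` are assumptions for illustration, not the regex any particular `labeled_*` implementation uses.

```python
import re

# Assumed per-word pattern: a lowercase letter or underscore, followed by up
# to 29 more lowercase alphanumerics, underscores, or dashes (30 in total).
WORD = r"[a-z_][a-z0-9_-]{0,29}"

# Words delimited by "." as the label format describes.
STRICT_LABEL = re.compile(rf"{WORD}(\.{WORD})*")

# "Around 71" per the proposal; the exact limit is an assumption here.
MAX_LABEL_LENGTH = 71


def is_strict_label(label: str) -> bool:
    """Return True if the label fits the (assumed) strict label format."""
    if len(label) > MAX_LABEL_LENGTH:
        return False
    return STRICT_LABEL.fullmatch(label) is not None


sample = "PSocketProcess__Msg_OnHttpActivityDistributorObserveConnection"
conjugated = "ipc.message_name"  # hypothetical hand-conjugated label
```

Under these assumptions the raw sample is rejected (it is capitalized and its single "word" is longer than 30 characters), while a hand-conjugated label such as the hypothetical `ipc.message_name` passes.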
What is the timeline by which the data needs to be collected?
It's already being collected. Folks are conjugating into label format themselves.
Comment 1 (Reporter) • 2 years ago
Relevant BQ docs for column names say:
"A column name must contain only letters (a-z, A-Z), numbers (0-9), or underscores (_), and it must start with a letter or underscore. The maximum column name length is 300 characters. A column name cannot use any of the following prefixes: `_TABLE_`, `_FILE_`, `_PARTITION`, `_ROW_TIMESTAMP`, `__ROOT__`, `_COLIDENTIFIER`. Duplicate column names are not allowed even if the case differs. For example, a column named Column1 is considered identical to a column named column1."
In other words: up to 300 characters of case-insensitive alphanumerics plus underscore (`_`), starting with a letter or underscore.
Relevant BQ docs for `STRUCT` (and thus `RECORD`) field names only specify a max nesting depth of 15.
Relevant BQ docs for unquoted SQL identifiers permit dashes (`-`), but otherwise mimic the column name restrictions.
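The column-name rules quoted above can be sketched as a small check. This is an illustration of the quoted text only, not BigQuery's own validation; the function name `is_valid_bq_column_name` is hypothetical, and treating the reserved prefixes as case-insensitive is an assumption based on column names being case-insensitive.

```python
import re

# Letters, digits, underscores; must start with a letter or underscore.
COLUMN_NAME = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")

# Prefixes listed in the quoted documentation.
RESERVED_PREFIXES = (
    "_TABLE_",
    "_FILE_",
    "_PARTITION",
    "_ROW_TIMESTAMP",
    "__ROOT__",
    "_COLIDENTIFIER",
)

MAX_COLUMN_NAME_LENGTH = 300


def is_valid_bq_column_name(name: str) -> bool:
    """Check a name against the column-name rules quoted in this comment."""
    if not 1 <= len(name) <= MAX_COLUMN_NAME_LENGTH:
        return False
    if COLUMN_NAME.fullmatch(name) is None:
        return False
    # Assumption: prefixes are compared case-insensitively.
    upper = name.upper()
    return not any(upper.startswith(prefix) for prefix in RESERVED_PREFIXES)


sample = "PSocketProcess__Msg_OnHttpActivityDistributorObserveConnection"
dotted = "ipc.message_name"  # hypothetical label in the dotted label format
```

By these rules the raw IPC message name is an acceptable column name, while a label in the dotted label format is not, which is the opposite of what the "maximum support in database columns" rationale suggests.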
Comment 3 (Reporter) • 1 year ago
Earlier this year we expanded the limit to 71 characters of printable ASCII in bug 1672273. That is more than enough for `PSocketProcess__Msg_OnHttpActivityDistributorObserveConnection`.
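For illustration, the relaxed rule could be sketched as below. This is not the code from bug 1672273: the name `is_lenient_label` is hypothetical, and taking "printable ASCII" to mean the range 0x20 to 0x7E (space included) and rejecting empty labels are both assumptions.

```python
MAX_LABEL_LENGTH = 71


def is_lenient_label(label: str) -> bool:
    """Return True for 1 to 71 characters of printable ASCII (assumed 0x20-0x7E)."""
    if not 1 <= len(label) <= MAX_LABEL_LENGTH:
        return False
    return all(0x20 <= ord(ch) <= 0x7E for ch in label)


sample = "PSocketProcess__Msg_OnHttpActivityDistributorObserveConnection"
```

With this check the sample label is accepted unchanged, while control characters, non-ASCII characters, and labels longer than 71 characters are still rejected.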