Closed Bug 1792025 Opened 2 years ago Closed 1 year ago

Proposal to change `labeled_*` metrics' label format to be more lenient

Categories

(Data Platform and Tools :: Glean Metric Types, task)

Tracking

(Not tracked)

RESOLVED WORKSFORME

People

(Reporter: chutten, Unassigned)

References

(Blocks 1 open bug)

Details

Proposal for changing an existing or adding a new Glean metric type

Who is the individual/team requesting this change?

Chris H-C (:chutten), Glean SDK team. On behalf of :florian and :nika.

Is this about changing an existing metric type or creating a new one?

Changing labeled metric types in the present and into the future.

Can you describe the data that needs to be recorded?

Thread names, IPC Message names, Search engine names, <other capitalized or punctuated strings>

Can you provide a raw sample of the data that needs to be recorded? (This is in the abstract, and not any particular implementation details about its representation in the payload or the database.)

PSocketProcess__Msg_OnHttpActivityDistributorObserveConnection

What is the business question/use-case that requires the data to be recorded?

Performance and power use in this case. In the broader case, being able to send unconjugated search engine names may be tied to business purposes.

How would the data be consumed?

Same as now (no change): Looker, GLAM, SQL, etc.

Why are existing metric types not enough?

`labeled_*` metrics apply a variety of slightly different regexes to determine which labels are permitted. Most attempt to conform to the label format, which mandates words of up to 30 characters delimited by `.`, with a maximum total length of around 71 (to ensure Glean metrics can be encoded).
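To make the restriction concrete, here is a minimal sketch of a check in the spirit of the label format described above. The two limits come from that description; the character class (lowercase letters, digits, `_`, `-`) and the function name are assumptions for illustration, not the regex any Glean implementation actually uses.

```python
import re

# Limits taken from the description above, not from the Glean source.
MAX_WORD_LENGTH = 30
MAX_LABEL_LENGTH = 71

# Hypothetical character class: the real regexes differ slightly per metric type.
_WORD = re.compile(r"[a-z0-9_-]{1,%d}" % MAX_WORD_LENGTH)


def fits_dotted_label_format(label: str) -> bool:
    """Return True if `label` is made of dot-delimited words within the assumed limits."""
    if len(label) > MAX_LABEL_LENGTH:
        return False
    # Every dot-delimited word must be non-empty and match the assumed class.
    return all(_WORD.fullmatch(word) is not None for word in label.split("."))
```

Under these assumptions, a hand-conjugated label such as `socket_process.on_http_activity` passes, while the raw IPC message name from the sample does not: it is a single word longer than 30 characters and contains capital letters.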

According to the docs, this is "to ensure maximum support in database columns." But this is incorrect: if we were trying to support these as valid identifiers, field names, or column names in BQ, then we couldn't use `.` at all, and we would have a much wider variety of characters available to us.

Also, because labeled_* metric types support dynamic labels, we can never make them into column names anyway. They could contain any (valid) string.

What is the timeline by which the data needs to be collected?

It's already being collected. Folks are conjugating into label format themselves.

Relevant BQ docs for column names say:

"A column name must contain only letters (a-z, A-Z), numbers (0-9), or underscores (_), and it must start with a letter or underscore. The maximum column name length is 300 characters. A column name cannot use any of the following prefixes: _TABLE_, _FILE_, _PARTITION, _ROW_TIMESTAMP, __ROOT__, _COLIDENTIFIER. Duplicate column names are not allowed even if the case differs. For example, a column named Column1 is considered identical to a column named column1."

In short: up to 300 characters of case-insensitive alphanumerics plus underscore (_), starting with a letter or underscore.
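For comparison, the column-name rules quoted above can be sketched as a small check. The function names are hypothetical, and matching the reserved prefixes case-insensitively is an assumption on my part (the quoted text only says duplicates are case-insensitive).

```python
import re

# A letter or underscore first, then letters, digits, or underscores; 300 characters at most.
_COLUMN_NAME = re.compile(r"[A-Za-z_][A-Za-z0-9_]{0,299}")

# Reserved prefixes as listed in the quoted BQ documentation.
_RESERVED_PREFIXES = (
    "_TABLE_",
    "_FILE_",
    "_PARTITION",
    "_ROW_TIMESTAMP",
    "__ROOT__",
    "_COLIDENTIFIER",
)


def is_valid_bq_column_name(name: str) -> bool:
    """Check `name` against the quoted column-name rules."""
    if _COLUMN_NAME.fullmatch(name) is None:
        return False
    # Assumption: prefixes are compared case-insensitively.
    return not name.upper().startswith(_RESERVED_PREFIXES)


def has_case_insensitive_duplicates(names: list[str]) -> bool:
    """Return True if two names differ only by case, e.g. Column1 and column1."""
    lowered = [name.lower() for name in names]
    return len(set(lowered)) != len(lowered)
```

Under this sketch, `PSocketProcess__Msg_OnHttpActivityDistributorObserveConnection` is accepted as a column name, while any label containing `.` is rejected, which is the mismatch with the current label format pointed out above.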

Relevant BQ docs for STRUCT (and thus RECORD) field names only specify a max nesting depth of 15.

Relevant BQ docs for unquoted SQL identifiers permit dashes (-), but otherwise mimic the column name restrictions.

Duplicate of this bug: 1800491

Earlier this year we expanded the limit to 71 characters of printable ASCII in bug 1672273. That is more than enough for PSocketProcess__Msg_OnHttpActivityDistributorObserveConnection.
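As a rough illustration of the expanded limit, the following sketch checks a label against "71 characters of printable ASCII". It assumes printable ASCII means code points 0x20 through 0x7E and that empty labels are rejected; the function name is hypothetical and this is not the check the SDK itself performs.

```python
MAX_LABEL_LENGTH = 71  # limit mentioned above


def is_printable_ascii_label(label: str) -> bool:
    """Return True if `label` is 1 to 71 characters of printable ASCII."""
    if not 1 <= len(label) <= MAX_LABEL_LENGTH:
        return False
    # Assumption: "printable ASCII" covers space (0x20) through tilde (0x7E).
    return all(0x20 <= ord(ch) <= 0x7E for ch in label)


sample = "PSocketProcess__Msg_OnHttpActivityDistributorObserveConnection"
print(len(sample), is_printable_ascii_label(sample))
```

The sample IPC message name is shorter than 71 characters and uses only ASCII letters and underscores, so it passes this check without any conjugation.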

Status: NEW → RESOLVED
Closed: 1 year ago
Resolution: --- → WORKSFORME