Investigate whether string metrics can have longer values
Categories
(Data Platform and Tools :: Glean: SDK, defect, P1)
Tracking
(Not tracked)
People
(Reporter: mdroettboom, Assigned: Dexter)
Details
(Whiteboard: [telemetry:mobilesdk:m?])
Attachments
(2 files)
The maximum length of a string scalar value is currently hard coded to 50 characters. This is the same as Desktop Telemetry.
:boek raised concerns that this may not be enough. One possible solution is to make this configurable on a per-metric basis, though there may be implications further down in the pipeline that will be difficult to address.
Assignee | ||
Updated•6 years ago
|
Assignee | ||
Comment 1•6 years ago
|
||
Hey Ryan! In Glean, we generally limit strings to be at most 50 characters long. However, the submission URL for search engines sent by Fenix might be longer than that, and might get cropped.
Is the submission URL really required for analysis? Since we ship the engines and we just report user-added engines as "custom"
, can't we model this data point different? E.g. use some id or something?
Comment 2•6 years ago
|
||
Thanks, Alessio!
Using an identifier seems like it would cause problems. We'd need to reference a separate table to understand what the identifier means. It's also nice to have a secondary measure of what the URL was in practice. For example, can we be sure that searches against the google
engine always have the same URL - across all app versions and across time?
What's the reasoning behind the 50 character limit? I'm skeptical of changing the data we collect unless there's a compelling technical restriction.
Thanks again!
Comment 3•6 years ago
|
||
To just add a high-level perspective: having one hard limit for string lengths seems important, to guide our data collection practices & specify worst-case behaviors.
It could be more than 50 if that's generally useful, but should be capped at a reasonable defined length.
Assignee | ||
Comment 4•6 years ago
|
||
(In reply to Ryan Harter [:harter] from comment #2)
Thanks, Alessio!
My pleasure!
Using an identifier seems like it would cause problems. We'd need to reference a separate table to understand what the identifier means. It's also nice to have a secondary measure of what the URL was in practice. For example, can we be sure that searches against the
I see, now I understand the problem a bit better, thanks for expanding on this. This seems to be a genuinely useful thing to have.
What's the reasoning behind the 50 character limit? I'm skeptical of changing the data we collect unless there's a compelling technical restriction.
With Glean, we're providing an SDK that tries to do the right thing out of the box, with sane defaults and limits. We offer higher level metric types, and all the collected data must be collected with these types. For the specific example of String
metric types, we ported over the string limits that were in place on desktop as well. Unfortunately, this limit isn't being enforced for this search field on Desktop, because it lives in the Telemetry environment.
The current limit for string values is 50 characters.
(In reply to Georg Fritzsche [:gfritzsche] from comment #3)
To just add a high-level perspective: having one hard limit for string lengths seems important, to guide our data collection practices & specify worst-case behaviors.
It could be more than 50 if that's generally useful, but should be capped at a reasonable defined length.
As Georg mentioned, the limit can be adjusted. However, let's keep in mind that this will be adjusted, if needed, for the metric type itself (so for all consumers using the type).
Your explanation in comment 2 makes sense (of course!), so I think it's ok for us to consider bumping the string limit a bit more. Do you have any statistic on the average length of the submission URLs for search engines, to inform our limit?
Comment 5•6 years ago
|
||
Here's a query looking at data for May.
Looks like the median URL is 51 characters but the maximum is just under 1k characters. I don't have context on what the cost function looks like on your side, but it seems like the benefit starts maxing out between 100 and 1k characters.
Does it make sense to have a separate long_string
type instead of affecting all existing measurements?
Comment 6•6 years ago
|
||
(In reply to Ryan Harter [:harter] from comment #5)
Looks like the median URL is 51 characters but the maximum is just under 1k characters. I don't have context on what the cost function looks like on your side, but it seems like the benefit starts maxing out between 100 and 1k characters.
Does it make sense to have a separate
long_string
type instead of affecting all existing measurements?
The problem is that once we have a type that allows significantly longer strings, they will probably get used more and more.
String lengths of up to 1k don't sound like something we want to make easy to collect at scale without a bandwidth and pipeline impact discussion.
For unblocking the specific Fenix MVP delivery issue here, could we get high value out of a string length limit of 100 and have a discussion on how we can address this use-case better?
Comment 7•6 years ago
|
||
(In reply to Ryan Harter [:harter] from comment #5)
Here's a query looking at data for May.
It looks like we don't have permission to view this.
Comment 8•6 years ago
|
||
(In reply to Georg Fritzsche [:gfritzsche] from comment #6)
For unblocking the specific Fenix MVP delivery issue here, could we get high value out of a string length limit of 100 and have a discussion on how we can address this use-case better?
Sounds good to me! ni? :boek to confirm all of the mobile URLs will fit within the 100 character limit.
Hey :harter, :gfritzsche
It looks like the largest string that I could find is 76 characters which is for Wikipedia. So I think the 100 character limit should be fine!
Assignee | ||
Comment 10•6 years ago
|
||
(In reply to Jeff Boek [:boek] from comment #9)
Hey :harter, :gfritzsche
It looks like the largest string that I could find is 76 characters which is for Wikipedia. So I think the 100 character limit should be fine!
All right, thanks Jeff, I'll work on a PR.
Assignee | ||
Comment 11•6 years ago
|
||
Assignee | ||
Comment 12•6 years ago
|
||
Assignee | ||
Comment 13•6 years ago
|
||
Both PRs were merged! Closing this, new data should be coming in good with the next Fenix build.
Description
•