Closed Bug 1552905 Opened 5 years ago Closed 5 years ago

Investigate whether string metrics can have longer values

Categories

(Data Platform and Tools :: Glean: SDK, defect, P1)

defect

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: mdroettboom, Assigned: Dexter)

Details

(Whiteboard: [telemetry:mobilesdk:m?])

Attachments

(2 files)

The maximum length of a string scalar value is currently hard coded to 50 characters. This is the same as Desktop Telemetry.

:boek raised concerns that this may not be enough. One possible solution is to make this configurable on a per-metric basis, though there may be implications further down in the pipeline that will be difficult to address.

Assignee: nobody → alessio.placitelli
Priority: -- → P1

Hey Ryan! In Glean, we generally limit strings to be at most 50 characters long. However, the submission URL for search engines sent by Fenix might be longer than that, and might get cropped.

Is the submission URL really required for analysis? Since we ship the engines and we just report user-added engines as "custom", can't we model this data point different? E.g. use some id or something?

Flags: needinfo?(rharter)

Thanks, Alessio!

Using an identifier seems like it would cause problems. We'd need to reference a separate table to understand what the identifier means. It's also nice to have a secondary measure of what the URL was in practice. For example, can we be sure that searches against the google engine always have the same URL - across all app versions and across time?

What's the reasoning behind the 50 character limit? I'm skeptical of changing the data we collect unless there's a compelling technical restriction.

Thanks again!

Flags: needinfo?(rharter) → needinfo?(alessio.placitelli)

To just add a high-level perspective: having one hard limit for string lengths seems important, to guide our data collection practices & specify worst-case behaviors.
It could be more than 50 if that's generally useful, but should be capped at a reasonable defined length.

(In reply to Ryan Harter [:harter] from comment #2)

Thanks, Alessio!

My pleasure!

Using an identifier seems like it would cause problems. We'd need to reference a separate table to understand what the identifier means. It's also nice to have a secondary measure of what the URL was in practice. For example, can we be sure that searches against the google engine always have the same URL - across all app versions and across time?

I see, now I understand the problem a bit better, thanks for expanding on this. This seems to be a genuinely useful thing to have.

What's the reasoning behind the 50 character limit? I'm skeptical of changing the data we collect unless there's a compelling technical restriction.

With Glean, we're providing an SDK that tries to do the right thing out of the box, with sane defaults and limits. We offer higher level metric types, and all the collected data must be collected with these types. For the specific example of String metric types, we ported over the string limits that were in place on desktop as well. Unfortunately, this limit isn't being enforced for this search field on Desktop, because it lives in the Telemetry environment.

The current limit for string values is 50 characters.

(In reply to Georg Fritzsche [:gfritzsche] from comment #3)

To just add a high-level perspective: having one hard limit for string lengths seems important, to guide our data collection practices & specify worst-case behaviors.
It could be more than 50 if that's generally useful, but should be capped at a reasonable defined length.

As Georg mentioned, the limit can be adjusted. However, let's keep in mind that this will be adjusted, if needed, for the metric type itself (so for all consumers using the type).

Your explanation in comment 2 makes sense (of course!), so I think it's ok for us to consider bumping the string limit a bit more. Do you have any statistic on the average length of the submission URLs for search engines, to inform our limit?

Flags: needinfo?(alessio.placitelli) → needinfo?(rharter)

Here's a query looking at data for May.

Looks like the median URL is 51 characters but the maximum is just under 1k characters. I don't have context on what the cost function looks like on your side, but it seems like the benefit starts maxing out between 100 and 1k characters.

Does it make sense to have a separate long_string type instead of affecting all existing measurements?

Flags: needinfo?(rharter)

(In reply to Ryan Harter [:harter] from comment #5)

Looks like the median URL is 51 characters but the maximum is just under 1k characters. I don't have context on what the cost function looks like on your side, but it seems like the benefit starts maxing out between 100 and 1k characters.

Does it make sense to have a separate long_string type instead of affecting all existing measurements?

The problem is that once we have a type that allows significantly longer strings, they will probably get used more and more.
String lengths of up to 1k don't sound like something we want to make easy to collect at scale without a bandwidth and pipeline impact discussion.

For unblocking the specific Fenix MVP delivery issue here, could we get high value out of a string length limit of 100 and have a discussion on how we can address this use-case better?

Flags: needinfo?(rharter)

(In reply to Ryan Harter [:harter] from comment #5)

Here's a query looking at data for May.

It looks like we don't have permission to view this.

(In reply to Georg Fritzsche [:gfritzsche] from comment #6)

For unblocking the specific Fenix MVP delivery issue here, could we get high value out of a string length limit of 100 and have a discussion on how we can address this use-case better?

Sounds good to me! ni? :boek to confirm all of the mobile URLs will fit within the 100 character limit.

Flags: needinfo?(rharter) → needinfo?(j)

Hey :harter, :gfritzsche

It looks like the largest string that I could find is 76 characters which is for Wikipedia. So I think the 100 character limit should be fine!

Flags: needinfo?(j)

(In reply to Jeff Boek [:boek] from comment #9)

Hey :harter, :gfritzsche

It looks like the largest string that I could find is 76 characters which is for Wikipedia. So I think the 100 character limit should be fine!

All right, thanks Jeff, I'll work on a PR.

Attached file Fix for glean-ac
Attached file Fix for glean-core

Both PRs were merged! Closing this, new data should be coming in good with the next Fenix build.

Status: NEW → RESOLVED
Closed: 5 years ago
Resolution: --- → FIXED
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: