New Glean metric types being incorrectly added as repeated key/value in schemas
Categories
(Data Platform and Tools :: General, defect)
Tracking
(Not tracked)
People
(Reporter: klukas, Unassigned)
References
Details
(Whiteboard: [dataplatform])
Attachments
(4 files)
Previous bug https://bugzilla.mozilla.org/show_bug.cgi?id=1697698 discusses how the labeled_rate and jwe metric types had incorrectly flowed downstream to BQ tables as repeated key/value fields.
It looks like the same has now happened for the new url
and text
metric types. We need to figure out how to improve processes to prevent this from happening in the future.
Reporter | ||
Comment 1•3 years ago
|
||
I'm reporting this after noticing the problem in the burnham_stable.metrics_v1
table, but I assume this applies to all deployed Glean tables.
Reporter | ||
Updated•3 years ago
|
Comment 2•3 years ago
|
||
I suspect the problem here is that mozilla-schema-generator is generating some kind of default column for glean metric types it doesn't understand, although I don't know that for certain.
I can take this, but would be good to find someone who can pair with me on this, to spread the domain expertise here around a bit.
Comment 3•3 years ago
|
||
Ok, :relud and I figured out the root cause, we think. It appears that mozilla-schema-generator augments the glean core schema with additional schema information, including some things from this file called glean.yaml
: https://github.com/mozilla/mozilla-schema-generator/blob/main/mozilla_schema_generator/configs/glean.yaml
In the case of url
and text
, there are no entries for them, so jsonschema-transpiler just maps directly from their schema entries (which are just objects) directly to the BigQuery map type.
We think the solution is probably to validate that for each glean metric type (as defined in its schema), a corresponding entry exists in glean.yaml. If not, bail. Filed bug 1739239 for that.
We'll have to allow url
and text
(as well as labeled rate and jwe from bug 1697698) for now, until we figure out to remediate that issue. For now, none of those metric types will be ingested.
Comment 4•3 years ago
|
||
As of this moment, Rally does not have any live, active studies that are using glean.
Though, we are actively developing a new study that will use be using glean and at least one, if not both, of these types. We expect to launch in December.
Comment 5•3 years ago
|
||
Ok, so to follow up, bug 1739239 should make sure this doesn't happen anymore. I'm not really sure what the best options are for cleanup? What I can think of is:
- Remove these metrics from the existing glean schema (or have MSG do it) so no bigquery columns are generated for them by MSG. I think it's ok to have columns in our tables that are not generated by our schemas, though correct me if I'm wrong.
- Rewrite the existing tables to not include the bogus columns (this will be every glean table so is probably going to be time consuming / expensive, though maybe we can piggy back on top of shredder).
- Add the right metric mappings to glean.yaml (re-adding them to the glean core schema if necessary), so that future metrics of these kinds can come in.
Does that sound reasonable? Anything I'm missing?
Comment 6•3 years ago
|
||
I think it's ok to have columns in our tables that are not generated by our schemas, though correct me if I'm wrong.
I'm fairly certain that would cause schema deploys to fail.
maybe we can piggy back on top of shredder
shredder changes one partition at a time, so it can't delete columns
Reporter | ||
Comment 7•3 years ago
|
||
I think it's going to be hard to justify the amount of work it would take to remove these fields from all Glean tables. As relud points out, we can't do this partition-by-partition, so we'd need to rewrite the entire table at once. There may come a time where BQ's DELETE COLUMN machinery allows deleting nested fields, which could make this much easier.
For the time being, I can think of two possible ways forward.
One option would be to add some special casing in the pipeline and in schema generation to rename these metric types to url2
and text2
. We'd leave the existing url
and text
fields in place indefinitely in all schemas. url2
and text2
would only show up in schemas for pings where they're actually used. We'd never need to delete a column.
We should figure out the Rally use cases for url
and text
. If these don't need to be added to any existing pings, then I think we can focus on a solution where we modify MSG to do the "right thing" for all new pings, and somehow annotate all existing pings to preserve the map-like behavior.
Comment 8•3 years ago
|
||
We should figure out the Rally use cases for url and text. If these don't need to be added to any existing pings, then I think we can focus on a solution where we modify MSG to do the "right thing" for all new pings, and somehow annotate all existing pings to preserve the map-like behavior.
Per https://bugzilla.mozilla.org/show_bug.cgi?id=1737656#c4 these don't need to be added to any existing pings. I agree with the approach outlined here.
Comment 9•3 years ago
|
||
Reporter | ||
Comment 10•3 years ago
|
||
PR is merged, which should allow all new Glean tables to support these types correctly. The updated MSG image is deployed, and I'm letting the MSG task run again in https://workflow.telemetry.mozilla.org/tree?dag_id=probe_scraper&root= to verify behavior that there are no surprises in production behavior.
Reporter | ||
Comment 11•3 years ago
|
||
The rerun of the MSG task finished with "Everything up-to-date" from the git push
, indicating that the output was identical.
Comment 12•3 years ago
•
|
||
(In reply to Jeff Klukas [:klukas] (UTC-4) from comment #11)
The rerun of the MSG task finished with "Everything up-to-date" from the
git push
, indicating that the output was identical.
Awesome! Follow up items that I can think of:
- (short term) Add docs to https://mozilla.github.io/glean documenting this gotcha (that these types won't be available for most existing applications and pings)
- (short term) Perform some kind of retrospective on how this happened.
- (medium-to-long term) Fix this problem in the source tables or come up with an alternate column name schema as outlined in comment 7 (maybe we should file another bug for this).
Reporter | ||
Comment 13•3 years ago
|
||
(In reply to William Lachance (:wlach) from comment #12)
- (short term) Add docs to https://mozilla.github.io/glean documenting this gotcha (that these types won't be available for most existing applications and pings)
- (short term) Perform some kind of retrospective on how this happened.
:wlach - Would you be able to take on either or both of these tasks?
- (medium-to-long term) Fix this problem in the source tables or come up with an alternate column name schema as outlined in comment 7 (maybe we should file another bug for this).
We can remove these from user-facing views by updating the generate_views
machinery in bigquery-etl. There's already a similar special case for cleaning fenix metrics specifically and the logic for this would be similar. Unless we find a use case for one of text
or url
in an existing ping, I don't think we should invest in trying to remove the columns from the source tables; the longer we wait to handle that, the more BQ functionality may be in place to ease the transition.
Comment 14•3 years ago
|
||
(In reply to Jeff Klukas [:klukas] (UTC-4) from comment #13)
(In reply to William Lachance (:wlach) from comment #12)
- (short term) Add docs to https://mozilla.github.io/glean documenting this gotcha (that these types won't be available for most existing applications and pings)
- (short term) Perform some kind of retrospective on how this happened.
:wlach - Would you be able to take on either or both of these tasks?
Yup, I'd be happy to. I can do (1) right now. I'll organize something for (2) next week.
- (medium-to-long term) Fix this problem in the source tables or come up with an alternate column name schema as outlined in comment 7 (maybe we should file another bug for this).
We can remove these from user-facing views by updating the
generate_views
machinery in bigquery-etl. There's already a similar special case for cleaning fenix metrics specifically and the logic for this would be similar. Unless we find a use case for one oftext
orurl
in an existing ping, I don't think we should invest in trying to remove the columns from the source tables; the longer we wait to handle that, the more BQ functionality may be in place to ease the transition.
Cool, that makes sense. I'd be tempted to defer any work here for a while. It's not clear to me these new metric types are useful for anything besides rally, which your work today unblocks.
Comment 15•3 years ago
|
||
Comment 16•3 years ago
|
||
(In reply to Jeff Klukas [:klukas] (UTC-4) from comment #13)
Unless we find a use case for one of
text
orurl
in an existing ping, I don't think we should invest in trying to remove the columns from the source tables; the longer we wait to handle that, the more BQ functionality may be in place to ease the transition.
Ugh. That's bad news.
While text
is probably not of much use outside of special use cases like Rally, url
seems much more useful and we had plans to e.g. migrate the default search URL to this type to avoid length restrictions for usual strings.
Desktop currently doesn't report that data, but it would have really been great if we could use url
here instead of string
.
Reporter | ||
Comment 17•3 years ago
|
||
(In reply to Jan-Erik Rediger [:janerik] from comment #16)
While
text
is probably not of much use outside of special use cases like Rally,url
seems much more useful and we had plans to e.g. migrate the default search URL to this type to avoid length restrictions for usual strings.
Desktop currently doesn't report that data, but it would have really been great if we could useurl
here instead ofstring
.
I'd love to hear some more detail on this. Is there a specific set of existing pings you have in mind? Are these all metrics
pings for various apps? If we know the set of target tables, we can do some analysis on data sizes to help figure out how difficult it would be to recreate the table to drop the legacy fields.
:whd - Is there a clear path in your mind for how we'd deal with terraform changes if we manually recreated a table to drop these metrics.url
fields, etc? We'd update the tables manually, and adjust mozilla-schema-generator to output schemas to match the new structure. But then I expect terraform would throw an error when trying to deploy the new schemas. Do we have prior art for importing schema changes performed manually?
Comment 18•3 years ago
|
||
It came up in a Fenix bug, so it would go into the Fenix metrics
ping.
Similar for Desktop, though as said we don't have that data there yet and to my knowledge it's not something that's gonna be implemented in the next few weeks.
Reporter | ||
Comment 19•3 years ago
|
||
so it would go into the Fenix metrics ping.
Of course, it's our largest Glean ping that would be affected :) That org_mozilla_firefox_stable.metrics_v1
table is ~116 TB. Which honestly shouldn't cause too much of a problem.
I think finding a way forward in the near term is doable here. It's a matter of figuring out what an end-to-end change would look like for the two possible ways forward here (recreating tables on a case-by-case basis or adding pipeline support for massaging these types to url2
, etc.) and picking which one seems like the best balance of effort and maintainability.
Comment 20•3 years ago
|
||
(In reply to Jeff Klukas [:klukas] (UTC-4) from comment #17)
:whd - Is there a clear path in your mind for how we'd deal with terraform changes if we manually recreated a table to drop these
metrics.url
fields, etc?
I maintain the opinion from https://bugzilla.mozilla.org/show_bug.cgi?id=1627722#c1 that this operation has significant operational complexity we should avoid (since again, this appears to be a nested column that can't be dropped and re-added using native BQ APIs). To my knowledge, bug #1565074, which took place before the AWS cutover was complete, is the last time we made a change that required ingestion pipeline stoppage.
We'd update the tables manually, and adjust mozilla-schema-generator to output schemas to match the new structure. But then I expect terraform would throw an error when trying to deploy the new schemas. Do we have prior art for importing schema changes performed manually?
Every call to terraform plan
refreshes terraform state from BQ before computing a plan of changes (or terraform refresh
to refresh terraform state without computing a plan). I wouldn't expect terraform to error in the case described unless the input schemas and the actual BQ resource schemas are mismatched at plan time.
Reporter | ||
Comment 21•3 years ago
|
||
the last time we made a change that required ingestion pipeline stoppage
I'm usually focused on _stable
tables in my mind, so often forget about the additional complexity involved for live tables. I agree that we should avoid going down a path that's going to require pipeline stoppage.
I'm convinced at this point to move forward with introducing url2
and text2
fields. Basic steps will look like:
- Update the logic for prepping records for loading to live tables in gcp-ingestion to rename
metrics.url
tometrics.url2
, etc. - Update mozilla-schema-generator to make the same name change
- Update bigquery-etl's generate_views module to ignore legacy
url
etc. fields and to renamemetrics.url2
tometrics.url
, etc.
Comment 22•3 years ago
|
||
Comment 23•3 years ago
|
||
Reporter | ||
Comment 24•3 years ago
|
||
The following two PRs are ready for review:
- https://github.com/mozilla/gcp-ingestion/pull/1881
- https://github.com/mozilla/mozilla-schema-generator/pull/214
Neither of these should affect current behavior, since no metrics of the affected types are being sent. But both of these need to be in place before we add a url or text metric to a glean ping in any app.
Once those are in place, we can work on changing generated views to drop the legacy fields and to rename the fixed ones (removing the "2").
Reporter | ||
Comment 25•3 years ago
|
||
These two PRs are now in place, so we should now be capable of defining new url
and text
type metrics in both new and existing Glean pings, and they will show up under metrics.url2
and metrics.text2
in the live and stable tables.
Reporter | ||
Comment 26•3 years ago
|
||
I filed https://bugzilla.mozilla.org/show_bug.cgi?id=1741487 for the follow-up of updating generated views.
Assignee | ||
Updated•2 years ago
|
Updated•1 year ago
|
Description
•