Closed Bug 1737656 Opened 3 years ago Closed 3 years ago

New Glean metric types being incorrectly added as repeated key/value in schemas

Categories

(Data Platform and Tools :: General, defect)

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: klukas, Unassigned)

References

Details

(Whiteboard: [dataplatform])

Attachments

(4 files)

Previous bug https://bugzilla.mozilla.org/show_bug.cgi?id=1697698 discusses how the labeled_rate and jwe metric types had incorrectly flowed downstream to BQ tables as repeated key/value fields.

It looks like the same has now happened for the new url and text metric types. We need to figure out how to improve processes to prevent this from happening in the future.

I'm reporting this after noticing the problem in the burnham_stable.metrics_v1 table, but I assume this applies to all deployed Glean tables.

Whiteboard: [data-platform-infra-wg]

I suspect the problem here is that mozilla-schema-generator is generating some kind of default column for glean metric types it doesn't understand, although I don't know that for certain.

I can take this, but it would be good to find someone who can pair with me on it, to spread the domain expertise around a bit.

Assignee: nobody → wlachance
See Also: → 1739239

Ok, :relud and I think we figured out the root cause. It appears that mozilla-schema-generator augments the glean core schema with additional schema information, including entries from a file called glean.yaml: https://github.com/mozilla/mozilla-schema-generator/blob/main/mozilla_schema_generator/configs/glean.yaml

In the case of url and text, there are no entries for them, so jsonschema-transpiler just maps their schema entries (which are bare objects) directly to the BigQuery map type.
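To make the failure mode concrete, here is a minimal sketch (illustrative Python, not the exact schemas or transpiler output): an open-ended object with no glean.yaml entry can only be represented in BigQuery as a repeated key/value record.

    # Roughly what the transpiler sees for an unconfigured metric type:
    # an object with unknown keys and string values.
    json_schema_fragment = {
        "url": {
            "type": "object",
            "additionalProperties": {"type": "string"},
        },
    }

    # The only faithful BigQuery representation of an open-ended object
    # is a "map": a repeated RECORD of key/value pairs.
    bigquery_map_field = {
        "name": "url",
        "type": "RECORD",
        "mode": "REPEATED",
        "fields": [
            {"name": "key", "type": "STRING", "mode": "REQUIRED"},
            {"name": "value", "type": "STRING", "mode": "NULLABLE"},
        ],
    }

    # With a glean.yaml entry, MSG would instead expand the object into a
    # STRUCT with one typed field per metric name, e.g.
    # metrics.url.my_url_metric (hypothetical metric name).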

We think the solution is probably to validate that for each glean metric type (as defined in its schema), a corresponding entry exists in glean.yaml. If not, bail. Filed bug 1739239 for that.
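A minimal sketch of what that check could look like (hypothetical function; it assumes metric types appear as keys in glean.yaml, and the real file layout and the actual implementation in bug 1739239 may differ):

    import sys
    import yaml  # PyYAML

    def validate_metric_types(schema_metric_types, glean_yaml_path):
        """Bail if any Glean metric type lacks a glean.yaml entry."""
        with open(glean_yaml_path) as f:
            configured = set(yaml.safe_load(f))
        missing = schema_metric_types - configured
        if missing:
            # Better to fail the generation run than to silently emit
            # map-typed columns downstream.
            sys.exit(f"glean.yaml has no entry for: {sorted(missing)}")

    # Example: a check like this would have caught the current bug.
    # validate_metric_types({"counter", "url", "text"}, "configs/glean.yaml")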

We'll have to allow url and text (as well as labeled_rate and jwe from bug 1697698) for now, until we figure out how to remediate that issue. Until then, none of those metric types will be ingested.

As of this moment, Rally does not have any live, active studies that are using Glean.
However, we are actively developing a new study that will be using Glean and at least one, if not both, of these types. We expect to launch in December.

Ok, so to follow up, bug 1739239 should make sure this doesn't happen anymore. I'm not really sure what the best options are for cleanup. What I can think of:

  1. Remove these metrics from the existing glean schema (or have MSG do it) so no bigquery columns are generated for them by MSG. I think it's ok to have columns in our tables that are not generated by our schemas, though correct me if I'm wrong.
  2. Rewrite the existing tables to not include the bogus columns (this will be every Glean table, so it's probably going to be time-consuming / expensive, though maybe we can piggyback on top of shredder).
  3. Add the right metric mappings to glean.yaml (re-adding them to the glean core schema if necessary), so that future metrics of these kinds can come in.

Does that sound reasonable? Anything I'm missing?

Assignee: wlachance → nobody
Flags: needinfo?(jklukas)
Flags: needinfo?(dthorn)

I think it's ok to have columns in our tables that are not generated by our schemas, though correct me if I'm wrong.

I'm fairly certain that would cause schema deploys to fail.

maybe we can piggyback on top of shredder

shredder changes one partition at a time, so it can't delete columns

Flags: needinfo?(dthorn)

I think it's going to be hard to justify the amount of work it would take to remove these fields from all Glean tables. As relud points out, we can't do this partition-by-partition, so we'd need to rewrite the entire table at once. There may come a time when BQ's DROP COLUMN machinery allows deleting nested fields, which could make this much easier.
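For reference, a sketch of the rewrite that dropping a nested field currently requires (dataset/table names hypothetical; ALTER TABLE ... DROP COLUMN, where available, only handles top-level columns, not fields nested inside a STRUCT):

    from google.cloud import bigquery

    # A full-table rewrite is the only way to prune metrics.url/metrics.text
    # today. In practice you'd also need to preserve partitioning, clustering,
    # and other table options, which is part of what makes this expensive.
    sql = """
    CREATE OR REPLACE TABLE dataset.glean_ping_v1 AS
    SELECT
      * REPLACE (
        (SELECT AS STRUCT metrics.* EXCEPT (url, text)) AS metrics
      )
    FROM dataset.glean_ping_v1
    """
    bigquery.Client().query(sql).result()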

For the time being, I can think of two possible ways forward.

One option would be to add some special casing in the pipeline and in schema generation to rename these metric types to url2 and text2. We'd leave the existing url and text fields in place indefinitely in all schemas. url2 and text2 would only show up in schemas for pings where they're actually used. We'd never need to delete a column.
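A minimal sketch of the special-casing this implies in schema generation (hypothetical names, not the actual MSG code):

    # Metric types whose original column names are already burned as
    # map types in existing tables; new data goes to a suffixed column.
    RENAMED_METRIC_TYPES = {"url": "url2", "text": "text2"}

    def column_name_for(metric_type: str) -> str:
        """BigQuery column name to generate for a Glean metric type."""
        return RENAMED_METRIC_TYPES.get(metric_type, metric_type)

    assert column_name_for("url") == "url2"
    assert column_name_for("counter") == "counter"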

We should figure out the Rally use cases for url and text. If these don't need to be added to any existing pings, then I think we can focus on a solution where we modify MSG to do the "right thing" for all new pings, and somehow annotate all existing pings to preserve the map-like behavior.

Flags: needinfo?(jklukas)

We should figure out the Rally use cases for url and text. If these don't need to be added to any existing pings, then I think we can focus on a solution where we modify MSG to do the "right thing" for all new pings, and somehow annotate all existing pings to preserve the map-like behavior.

Per https://bugzilla.mozilla.org/show_bug.cgi?id=1737656#c4 these don't need to be added to any existing pings. I agree with the approach outlined here.

The PR is merged, which should allow all new Glean tables to support these types correctly. The updated MSG image is deployed, and I'm letting the MSG task run again in https://workflow.telemetry.mozilla.org/tree?dag_id=probe_scraper&root= to verify that there are no surprises in production behavior.

The rerun of the MSG task finished with "Everything up-to-date" from the git push, indicating that the output was identical.

(In reply to Jeff Klukas [:klukas] (UTC-4) from comment #11)

The rerun of the MSG task finished with "Everything up-to-date" from the git push, indicating that the output was identical.

Awesome! Follow-up items that I can think of:

  1. (short term) Add documentation to https://mozilla.github.io/glean about this gotcha (that these types won't be available for most existing applications and pings).
  2. (short term) Perform some kind of retrospective on how this happened.
  3. (medium-to-long term) Fix this problem in the source tables or come up with an alternate column naming scheme as outlined in comment 7 (maybe we should file another bug for this).

(In reply to William Lachance (:wlach) from comment #12)

  1. (short term) Add documentation to https://mozilla.github.io/glean about this gotcha (that these types won't be available for most existing applications and pings).
  2. (short term) Perform some kind of retrospective on how this happened.

:wlach - Would you be able to take on either or both of these tasks?

  1. (medium-to-long term) Fix this problem in the source tables or come up with an alternate column naming scheme as outlined in comment 7 (maybe we should file another bug for this).

We can remove these from user-facing views by updating the generate_views machinery in bigquery-etl. There's already a special case for cleaning Fenix metrics specifically, and the logic for this would be similar. Unless we find a use case for one of text or url in an existing ping, I don't think we should invest in trying to remove the columns from the source tables; the longer we wait to handle that, the more BQ functionality may be in place to ease the transition.

(In reply to Jeff Klukas [:klukas] (UTC-4) from comment #13)

(In reply to William Lachance (:wlach) from comment #12)

  1. (short term) Add documentation to https://mozilla.github.io/glean about this gotcha (that these types won't be available for most existing applications and pings).
  2. (short term) Perform some kind of retrospective on how this happened.

:wlach - Would you be able to take on either or both of these tasks?

Yup, I'd be happy to. I can do (1) right now. I'll organize something for (2) next week.

  1. (medium-to-long term) Fix this problem in the source tables or come up with an alternate column naming scheme as outlined in comment 7 (maybe we should file another bug for this).

We can remove these from user-facing views by updating the generate_views machinery in bigquery-etl. There's already a special case for cleaning Fenix metrics specifically, and the logic for this would be similar. Unless we find a use case for one of text or url in an existing ping, I don't think we should invest in trying to remove the columns from the source tables; the longer we wait to handle that, the more BQ functionality may be in place to ease the transition.

Cool, that makes sense. I'd be tempted to defer any work here for a while. It's not clear to me that these new metric types are useful for anything besides Rally, which your work today unblocks.

(In reply to Jeff Klukas [:klukas] (UTC-4) from comment #13)

Unless we find a use case for one of text or url in an existing ping, I don't think we should invest in trying to remove the columns from the source tables; the longer we wait to handle that, the more BQ functionality may be in place to ease the transition.

Ugh. That's bad news.
While text is probably not of much use outside of special use cases like Rally, url seems much more useful; we had plans to e.g. migrate the default search URL to this type to avoid the length restrictions on regular strings.
Desktop currently doesn't report that data, but it would have been really great if we could use url here instead of string.

(In reply to Jan-Erik Rediger [:janerik] from comment #16)

While text is probably not of much use outside of special use cases like Rally, url seems much more useful; we had plans to e.g. migrate the default search URL to this type to avoid the length restrictions on regular strings.
Desktop currently doesn't report that data, but it would have been really great if we could use url here instead of string.

I'd love to hear some more detail on this. Is there a specific set of existing pings you have in mind? Are these all metrics pings for various apps? If we know the set of target tables, we can do some analysis on data sizes to help figure out how difficult it would be to recreate the table to drop the legacy fields.

:whd - Is there a clear path in your mind for how we'd deal with terraform changes if we manually recreated a table to drop these metrics.url fields, etc? We'd update the tables manually, and adjust mozilla-schema-generator to output schemas to match the new structure. But then I expect terraform would throw an error when trying to deploy the new schemas. Do we have prior art for importing schema changes performed manually?

Flags: needinfo?(whd)

It came up in a Fenix bug, so it would go into the Fenix metrics ping.
Similar for Desktop, though as mentioned we don't have that data there yet, and to my knowledge it's not something that's going to be implemented in the next few weeks.

so it would go into the Fenix metrics ping.

Of course, it's our largest Glean ping that would be affected :) That org_mozilla_firefox_stable.metrics_v1 table is ~116 TB, which honestly shouldn't cause too much of a problem (at on-demand pricing of roughly $5 per TB scanned, a one-time full rewrite would cost on the order of a few hundred dollars).

I think finding a way forward in the near term is doable here. It's a matter of figuring out what an end-to-end change would look like for the two possible ways forward here (recreating tables on a case-by-case basis or adding pipeline support for massaging these types to url2, etc.) and picking which one seems like the best balance of effort and maintainability.

(In reply to Jeff Klukas [:klukas] (UTC-4) from comment #17)

:whd - Is there a clear path in your mind for how we'd deal with terraform changes if we manually recreated a table to drop these metrics.url fields, etc?

I maintain the opinion from https://bugzilla.mozilla.org/show_bug.cgi?id=1627722#c1 that this operation has significant operational complexity we should avoid (since again, this appears to be a nested column that can't be dropped and re-added using native BQ APIs). To my knowledge, bug 1565074, which took place before the AWS cutover was complete, is the last time we made a change that required ingestion pipeline stoppage.

We'd update the tables manually, and adjust mozilla-schema-generator to output schemas to match the new structure. But then I expect terraform would throw an error when trying to deploy the new schemas. Do we have prior art for importing schema changes performed manually?

Every call to terraform plan refreshes terraform state from BQ before computing a plan of changes (terraform refresh does the same without computing a plan). I wouldn't expect terraform to error in the case described unless the input schemas and the actual BQ resource schemas are mismatched at plan time.

Flags: needinfo?(whd)

the last time we made a change that required ingestion pipeline stoppage

I'm usually focused on _stable tables, so I often forget about the additional complexity involved for live tables. I agree that we should avoid going down a path that would require pipeline stoppage.

I'm convinced at this point to move forward with introducing url2 and text2 fields. Basic steps will look like:

  1. Special-case these types in the pipeline and in schema generation so that new url and text metrics land in url2 and text2 columns, leaving the existing map-typed fields in place.
  2. Update the generated views to drop the legacy fields and rename the fixed ones (removing the "2").

The following two PRs (see attachments) are ready for review.

Neither of these should affect current behavior, since no metrics of the affected types are being sent yet, but both need to be in place before we add a url or text metric to a Glean ping in any app.

Once those are in place, we can work on changing generated views to drop the legacy fields and to rename the fixed ones (removing the "2").
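A sketch of what such a generated view might do (dataset and field names hypothetical; the real generate_views output will differ): hide the bogus map columns and surface the suffixed fields under their intended names.

    from google.cloud import bigquery

    view_sql = """
    CREATE OR REPLACE VIEW dataset.metrics AS
    SELECT
      * REPLACE (
        (SELECT AS STRUCT
           metrics.* EXCEPT (url, text, url2, text2),
           metrics.url2 AS url,
           metrics.text2 AS text
         ) AS metrics
      )
    FROM dataset.metrics_v1
    """
    bigquery.Client().query(view_sql).result()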

See Also: → 1741399

These two PRs are now in place, so we should be capable of defining new url and text type metrics in both new and existing Glean pings; they will show up under metrics.url2 and metrics.text2 in the live and stable tables.
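For example, once an app defines a url-type metric, querying it would look something like this (project and metric name assumed for illustration; the table is the one discussed above):

    from google.cloud import bigquery

    # metrics.url2 is a STRUCT with one STRING field per url-type metric.
    sql = """
    SELECT
      metrics.url2.search_default_url,  -- hypothetical metric name
      COUNT(*) AS pings
    FROM `moz-fx-data-shared-prod.org_mozilla_firefox_stable.metrics_v1`
    WHERE DATE(submission_timestamp) = '2021-11-15'
    GROUP BY 1
    ORDER BY pings DESC
    """
    for row in bigquery.Client().query(sql).result():
        print(row)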

See Also: → 1741487

I filed https://bugzilla.mozilla.org/show_bug.cgi?id=1741487 for the follow-up of updating generated views.

Status: NEW → RESOLVED
Closed: 3 years ago
Resolution: --- → FIXED
See Also: → 1751739
Component: Datasets: General → General
Whiteboard: [data-platform-infra-wg] → [dataplatform]