Closed Bug 1737656 Opened 3 years ago Closed 3 years ago

New Glean metric types being incorrectly added as repeated key/value in schemas

Categories

(Data Platform and Tools :: General, defect)

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: klukas, Unassigned)

References

Details

(Whiteboard: [dataplatform])

Attachments

(4 files)

Previous bug https://bugzilla.mozilla.org/show_bug.cgi?id=1697698 discusses how the labeled_rate and jwe metric types had incorrectly flowed downstream to BQ tables as repeated key/value fields.

It looks like the same has now happened for the new url and text metric types. We need to figure out how to improve processes to prevent this from happening in the future.

I'm reporting this after noticing the problem in the burnham_stable.metrics_v1 table, but I assume this applies to all deployed Glean tables.

Whiteboard: [data-platform-infra-wg]

I suspect the problem here is that mozilla-schema-generator is generating some kind of default column for glean metric types it doesn't understand, although I don't know that for certain.

I can take this, but it would be good to find someone who can pair with me on it, to spread the domain expertise around a bit.

Assignee: nobody → wlachance
See Also: → 1739239

Ok, :relud and I think we figured out the root cause. It appears that mozilla-schema-generator augments the glean core schema with additional schema information, including entries from a file called glean.yaml: https://github.com/mozilla/mozilla-schema-generator/blob/main/mozilla_schema_generator/configs/glean.yaml

In the case of url and text, there are no entries for them, so jsonschema-transpiler just maps their schema entries (which are bare objects) directly to the BigQuery map type.
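To make the failure mode concrete, here is a minimal sketch (illustrative Python, not the exact schemas or transpiler output): an open-ended object with no glean.yaml entry can only be represented in BigQuery as a repeated key/value record.

    # Roughly what the transpiler sees for an unconfigured metric type:
    # an object with unknown keys and string values.
    json_schema_fragment = {
        "url": {
            "type": "object",
            "additionalProperties": {"type": "string"},
        },
    }

    # The only faithful BigQuery representation of an open-ended object
    # is a "map": a repeated RECORD of key/value pairs.
    bigquery_map_field = {
        "name": "url",
        "type": "RECORD",
        "mode": "REPEATED",
        "fields": [
            {"name": "key", "type": "STRING", "mode": "REQUIRED"},
            {"name": "value", "type": "STRING", "mode": "NULLABLE"},
        ],
    }

    # With a glean.yaml entry, MSG would instead expand the object into a
    # STRUCT with one typed field per metric name, e.g.
    # metrics.url.my_url_metric (hypothetical metric name).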

We think the solution is probably to validate that for each glean metric type (as defined in its schema), a corresponding entry exists in glean.yaml. If not, bail. Filed bug 1739239 for that.
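A minimal sketch of what that check could look like (hypothetical function; it assumes metric types appear as keys in glean.yaml, and the real file layout and the actual implementation in bug 1739239 may differ):

    import sys
    import yaml  # PyYAML

    def validate_metric_types(schema_metric_types, glean_yaml_path):
        """Bail if any Glean metric type lacks a glean.yaml entry."""
        with open(glean_yaml_path) as f:
            configured = set(yaml.safe_load(f))
        missing = schema_metric_types - configured
        if missing:
            # Better to fail the generation run than to silently emit
            # map-typed columns downstream.
            sys.exit(f"glean.yaml has no entry for: {sorted(missing)}")

    # Example: a check like this would have caught the current bug.
    # validate_metric_types({"counter", "url", "text"}, "configs/glean.yaml")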

We'll have to allow url and text (as well as labeled_rate and jwe from bug 1697698) for now, until we figure out how to remediate that issue. Until then, none of those metric types will be ingested.

As of this moment, Rally does not have any live, active studies that are using Glean.
However, we are actively developing a new study that will be using Glean and at least one, if not both, of these types. We expect to launch in December.

Ok, so to follow up, bug 1739239 should make sure this doesn't happen anymore. I'm not really sure what the best options are for cleanup. What I can think of:

  1. Remove these metrics from the existing glean schema (or have MSG do it) so no bigquery columns are generated for them by MSG. I think it's ok to have columns in our tables that are not generated by our schemas, though correct me if I'm wrong.
  2. Rewrite the existing tables to not include the bogus columns (this will be every Glean table, so it's probably going to be time-consuming / expensive, though maybe we can piggyback on top of shredder).
  3. Add the right metric mappings to glean.yaml (re-adding them to the glean core schema if necessary), so that future metrics of these kinds can come in.

Does that sound reasonable? Anything I'm missing?

Assignee: wlachance → nobody
Flags: needinfo?(jklukas)
Flags: needinfo?(dthorn)

I think it's ok to have columns in our tables that are not generated by our schemas, though correct me if I'm wrong.

I'm fairly certain that would cause schema deploys to fail.

maybe we can piggyback on top of shredder

shredder changes one partition at a time, so it can't delete columns

Flags: needinfo?(dthorn)

I think it's going to be hard to justify the amount of work it would take to remove these fields from all Glean tables. As relud points out, we can't do this partition-by-partition, so we'd need to rewrite the entire table at once. There may come a time when BQ's DROP COLUMN machinery allows deleting nested fields, which could make this much easier.
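For reference, a sketch of the rewrite that dropping a nested field currently requires (dataset/table names hypothetical; ALTER TABLE ... DROP COLUMN, where available, only handles top-level columns, not fields nested inside a STRUCT):

    from google.cloud import bigquery

    # A full-table rewrite is the only way to prune metrics.url/metrics.text
    # today. In practice you'd also need to preserve partitioning, clustering,
    # and other table options, which is part of what makes this expensive.
    sql = """
    CREATE OR REPLACE TABLE dataset.glean_ping_v1 AS
    SELECT
      * REPLACE (
        (SELECT AS STRUCT metrics.* EXCEPT (url, text)) AS metrics
      )
    FROM dataset.glean_ping_v1
    """
    bigquery.Client().query(sql).result()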

For the time being, I can think of two possible ways forward.

One option would be to add some special casing in the pipeline and in schema generation to rename these metric types to url2 and text2. We'd leave the existing url and text fields in place indefinitely in all schemas. url2 and text2 would only show up in schemas for pings where they're actually used. We'd never need to delete a column.
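A minimal sketch of the special-casing this implies in schema generation (hypothetical names, not the actual MSG code):

    # Metric types whose original column names are already burned as
    # map types in existing tables; new data goes to a suffixed column.
    RENAMED_METRIC_TYPES = {"url": "url2", "text": "text2"}

    def column_name_for(metric_type: str) -> str:
        """BigQuery column name to generate for a Glean metric type."""
        return RENAMED_METRIC_TYPES.get(metric_type, metric_type)

    assert column_name_for("url") == "url2"
    assert column_name_for("counter") == "counter"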

We should figure out the Rally use cases for url and text. If these don't need to be added to any existing pings, then I think we can focus on a solution where we modify MSG to do the "right thing" for all new pings, and somehow annotate all existing pings to preserve the map-like behavior.

Flags: needinfo?(jklukas)

We should figure out the Rally use cases for url and text. If these don't need to be added to any existing pings, then I think we can focus on a solution where we modify MSG to do the "right thing" for all new pings, and somehow annotate all existing pings to preserve the map-like behavior.

Per https://bugzilla.mozilla.org/show_bug.cgi?id=1737656#c4 these don't need to be added to any existing pings. I agree with the approach outlined here.

The PR is merged, which should allow all new Glean tables to support these types correctly. The updated MSG image is deployed, and I'm letting the MSG task run again in https://workflow.telemetry.mozilla.org/tree?dag_id=probe_scraper&root= to verify that there are no surprises in production behavior.

The rerun of the MSG task finished with "Everything up-to-date" from the git push, indicating that the output was identical.

(In reply to Jeff Klukas [:klukas] (UTC-4) from comment #11)

The rerun of the MSG task finished with "Everything up-to-date" from the git push, indicating that the output was identical.

Awesome! Follow-up items that I can think of:

  1. (short term) Add documentation to https://mozilla.github.io/glean about this gotcha (that these types won't be available for most existing applications and pings).
  2. (short term) Perform some kind of retrospective on how this happened.
  3. (medium-to-long term) Fix this problem in the source tables or come up with an alternate column naming scheme as outlined in comment 7 (maybe we should file another bug for this).

(In reply to William Lachance (:wlach) from comment #12)

  1. (short term) Add documentation to https://mozilla.github.io/glean about this gotcha (that these types won't be available for most existing applications and pings).
  2. (short term) Perform some kind of retrospective on how this happened.

:wlach - Would you be able to take on either or both of these tasks?

  1. (medium-to-long term) Fix this problem in the source tables or come up with an alternate column naming scheme as outlined in comment 7 (maybe we should file another bug for this).

We can remove these from user-facing views by updating the generate_views machinery in bigquery-etl. There's already a special case for cleaning Fenix metrics specifically, and the logic for this would be similar. Unless we find a use case for one of text or url in an existing ping, I don't think we should invest in trying to remove the columns from the source tables; the longer we wait to handle that, the more BQ functionality may be in place to ease the transition.

(In reply to Jeff Klukas [:klukas] (UTC-4) from comment #13)

(In reply to William Lachance (:wlach) from comment #12)

  1. (short term) Add documentation to https://mozilla.github.io/glean about this gotcha (that these types won't be available for most existing applications and pings).
  2. (short term) Perform some kind of retrospective on how this happened.

:wlach - Would you be able to take on either or both of these tasks?

Yup, I'd be happy to. I can do (1) right now. I'll organize something for (2) next week.

  1. (medium-to-long term) Fix this problem in the source tables or come up with an alternate column naming scheme as outlined in comment 7 (maybe we should file another bug for this).

We can remove these from user-facing views by updating the generate_views machinery in bigquery-etl. There's already a special case for cleaning Fenix metrics specifically, and the logic for this would be similar. Unless we find a use case for one of text or url in an existing ping, I don't think we should invest in trying to remove the columns from the source tables; the longer we wait to handle that, the more BQ functionality may be in place to ease the transition.

Cool, that makes sense. I'd be tempted to defer any work here for a while. It's not clear to me that these new metric types are useful for anything besides Rally, which your work today unblocks.

(In reply to Jeff Klukas [:klukas] (UTC-4) from comment #13)

Unless we find a use case for one of text or url in an existing ping, I don't think we should invest in trying to remove the columns from the source tables; the longer we wait to handle that, the more BQ functionality may be in place to ease the transition.

Ugh. That's bad news.
While text is probably not of much use outside of special use cases like Rally, url seems much more useful; we had plans to e.g. migrate the default search URL to this type to avoid the length restrictions on regular strings.
Desktop currently doesn't report that data, but it would have been really great if we could use url here instead of string.

(In reply to Jan-Erik Rediger [:janerik] from comment #16)

While text is probably not of much use outside of special use cases like Rally, url seems much more useful; we had plans to e.g. migrate the default search URL to this type to avoid the length restrictions on regular strings.
Desktop currently doesn't report that data, but it would have been really great if we could use url here instead of string.

I'd love to hear some more detail on this. Is there a specific set of existing pings you have in mind? Are these all metrics pings for various apps? If we know the set of target tables, we can do some analysis on data sizes to help figure out how difficult it would be to recreate the table to drop the legacy fields.

:whd - Is there a clear path in your mind for how we'd deal with terraform changes if we manually recreated a table to drop these metrics.url fields, etc? We'd update the tables manually, and adjust mozilla-schema-generator to output schemas to match the new structure. But then I expect terraform would throw an error when trying to deploy the new schemas. Do we have prior art for importing schema changes performed manually?

Flags: needinfo?(whd)

It came up in a Fenix bug, so it would go into the Fenix metrics ping.
Similar for Desktop, though as mentioned we don't have that data there yet, and to my knowledge it's not something that's going to be implemented in the next few weeks.

so it would go into the Fenix metrics ping.

Of course, it's our largest Glean ping that would be affected :) That org_mozilla_firefox_stable.metrics_v1 table is ~116 TB, which honestly shouldn't cause too much of a problem (at on-demand pricing of roughly $5 per TB scanned, a one-time full rewrite would cost on the order of a few hundred dollars).

I think finding a way forward in the near term is doable here. It's a matter of figuring out what an end-to-end change would look like for the two possible ways forward here (recreating tables on a case-by-case basis or adding pipeline support for massaging these types to url2, etc.) and picking which one seems like the best balance of effort and maintainability.

(In reply to Jeff Klukas [:klukas] (UTC-4) from comment #17)

:whd - Is there a clear path in your mind for how we'd deal with terraform changes if we manually recreated a table to drop these metrics.url fields, etc?

I maintain the opinion from https://bugzilla.mozilla.org/show_bug.cgi?id=1627722#c1 that this operation has significant operational complexity we should avoid (since again, this appears to be a nested column that can't be dropped and re-added using native BQ APIs). To my knowledge, bug 1565074, which took place before the AWS cutover was complete, is the last time we made a change that required ingestion pipeline stoppage.

We'd update the tables manually, and adjust mozilla-schema-generator to output schemas to match the new structure. But then I expect terraform would throw an error when trying to deploy the new schemas. Do we have prior art for importing schema changes performed manually?

Every call to terraform plan refreshes terraform state from BQ before computing a plan of changes (terraform refresh does the same without computing a plan). I wouldn't expect terraform to error in the case described unless the input schemas and the actual BQ resource schemas are mismatched at plan time.

Flags: needinfo?(whd)

the last time we made a change that required ingestion pipeline stoppage

I'm usually focused on _stable tables, so I often forget about the additional complexity involved for live tables. I agree that we should avoid going down a path that would require pipeline stoppage.

I'm convinced at this point to move forward with introducing url2 and text2 fields. Basic steps will look like:

  1. Special-case these types in the pipeline and in schema generation so that new url and text metrics land in url2 and text2 columns, leaving the existing map-typed fields in place.
  2. Update the generated views to drop the legacy fields and rename the fixed ones (removing the "2").

The following two PRs (see attachments) are ready for review.

Neither of these should affect current behavior, since no metrics of the affected types are being sent yet, but both need to be in place before we add a url or text metric to a Glean ping in any app.

Once those are in place, we can work on changing generated views to drop the legacy fields and to rename the fixed ones (removing the "2").
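A sketch of what such a generated view might do (dataset and field names hypothetical; the real generate_views output will differ): hide the bogus map columns and surface the suffixed fields under their intended names.

    from google.cloud import bigquery

    view_sql = """
    CREATE OR REPLACE VIEW dataset.metrics AS
    SELECT
      * REPLACE (
        (SELECT AS STRUCT
           metrics.* EXCEPT (url, text, url2, text2),
           metrics.url2 AS url,
           metrics.text2 AS text
         ) AS metrics
      )
    FROM dataset.metrics_v1
    """
    bigquery.Client().query(view_sql).result()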

See Also: → 1741399

These two PRs are now in place, so we should be capable of defining new url and text type metrics in both new and existing Glean pings; they will show up under metrics.url2 and metrics.text2 in the live and stable tables.
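For example, once an app defines a url-type metric, querying it would look something like this (project and metric name assumed for illustration; the table is the one discussed above):

    from google.cloud import bigquery

    # metrics.url2 is a STRUCT with one STRING field per url-type metric.
    sql = """
    SELECT
      metrics.url2.search_default_url,  -- hypothetical metric name
      COUNT(*) AS pings
    FROM `moz-fx-data-shared-prod.org_mozilla_firefox_stable.metrics_v1`
    WHERE DATE(submission_timestamp) = '2021-11-15'
    GROUP BY 1
    ORDER BY pings DESC
    """
    for row in bigquery.Client().query(sql).result():
        print(row)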

See Also: → 1741487

I filed https://bugzilla.mozilla.org/show_bug.cgi?id=1741487 for the follow-up of updating generated views.

Status: NEW → RESOLVED
Closed: 3 years ago
Resolution: --- → FIXED
See Also: → 1751739
Component: Datasets: General → General
Whiteboard: [data-platform-infra-wg] → [dataplatform]