Closed Bug 1621943 Opened 5 years ago Closed 4 years ago

Switch Telemetry BigQuery in STMO to point at shared-prod project

Categories

(Data Platform and Tools :: General, task, P2)

Tracking

(Not tracked)

RESOLVED WONTFIX

People

(Reporter: klukas, Unassigned)

Details

Once we have moved remaining derived tables out of moz-fx-data-derived-datasets to moz-fx-data-shared-prod, we want to switch the Telemetry BigQuery data source in STMO to point at shared-prod rather than derived-datasets.

For queries that follow best practices and hit views rather than tables, this switch will be transparent since we already publish all views to both derived-datasets and shared-prod. We will be sending more complete instructions to users about the implications before we make the switch.

I want to make sure Data Tools is aware of our intention to make this switch. Again, it should be mostly transparent to users and we'll be sending out communication in advance of the switch to inform users about edge cases that may require intervention.

One impact that Data Tools may need to be directly involved in, though, is that this change will likely increase the burden on STMO for schema caching. STMO will see a nearly identical set of derived tables (like telemetry_derived.*) and user-facing views (like telemetry.*), but it will now also see the live and historical ping tables (like telemetry_live.* and telemetry_stable.*) that have always lived in shared-prod. When we make the switch, there will be 2x to 3x the number of tables/views that will be visible to STMO since each type of ping will be present as a user-facing view, as a live table, and as a historical table.

Rob - Can you comment on whether this is something we need to be concerned about? We can provide more info and discuss timelines if you expect intervention is going to be needed to ensure stable STMO performance.

Flags: needinfo?(rmiller)

Oof... this is likely to have an impact, yes... STMO's schema fetching queries are already generating a significant amount of load, so I imagine this may have a significant performance impact. Marina knows more about this than anyone, so I'm pulling her into this so we can figure out how to proceed.

Flags: needinfo?(rmiller) → needinfo?(msamuel)

A potential option here is to ignore all live and historical ping tables, so that we don't fetch schema data for any dataset ending in _live or _stable. That would maintain the status quo in terms of what users can see in STMO today, and there's an argument that it's safer to keep those hidden, since we encourage users to hit the user-facing views rather than the ping tables themselves.

I don't know how feasible it would be on the STMO side to filter out _live and _stable datasets from schema fetching.
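
To make the proposal concrete, here is a minimal sketch of the suffix-based exclusion described above. It is not STMO's actual code: the function name `visible_datasets`, the constant `HIDDEN_DATASET_SUFFIXES`, and the sample dataset list are hypothetical and only illustrate skipping any dataset whose name ends in _live or _stable before schema data is fetched.

```python
from typing import Iterable, List

# Assumption: the two suffixes named in this bug; not an existing STMO setting.
HIDDEN_DATASET_SUFFIXES = ("_live", "_stable")


def visible_datasets(dataset_ids: Iterable[str]) -> List[str]:
    """Return only the dataset IDs whose schemas would still be fetched."""
    # str.endswith accepts a tuple, so a dataset is dropped if it ends
    # with any of the hidden suffixes.
    return [d for d in dataset_ids if not d.endswith(HIDDEN_DATASET_SUFFIXES)]


if __name__ == "__main__":
    # Hypothetical sample of dataset names following the patterns in this bug.
    datasets = ["telemetry", "telemetry_derived", "telemetry_live", "telemetry_stable"]
    print(visible_datasets(datasets))
```

Matching on the dataset suffix rather than on individual table names keeps the rule short and leaves the user-facing views and derived tables untouched.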

I believe we already have the ability to filter tables from showing up in the schema browser by matching on a regex against table names, so this is def a possibility.

(In reply to Rob Miller [:RaFromBRC :rmiller] from comment #4)

> I believe we already have the ability to filter tables from showing up in the schema browser by matching on a regex against table names, so this is def a possibility.

Hopefully that can at least serve as an interim solution that decouples the project switch from the concern around showing _live and _stable datasets.

A few thoughts on this:

  1. I suggest we try connecting the new moz-fx-data-shared-prod data source in stage to see if there is any visible explosion.
  2. The filter is a substring match rather than a regex, so as it currently works we can only filter out one of the two (_live or _stable). We would need a code change to support more than one.
  3. Jannis recently added caching, which speeds up loading the metadata tables, so that is certainly a benefit in this case.
Flags: needinfo?(msamuel)
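
To illustrate point 2 above, the following sketch contrasts a single-substring filter with a filter that accepts several patterns. Both functions are hypothetical stand-ins written for this bug, not STMO's implementation, and the table names are only examples of the naming patterns discussed here.

```python
import re
from typing import Iterable, List, Sequence


def filter_by_substring(table_names: Iterable[str], excluded: str) -> List[str]:
    """Single-substring behaviour as described above: only one value can be excluded."""
    return [t for t in table_names if excluded not in t]


def filter_by_patterns(table_names: Iterable[str], patterns: Sequence[str]) -> List[str]:
    """Hypothetical extension: drop a table if any of several regexes matches it."""
    compiled = [re.compile(p) for p in patterns]
    return [t for t in table_names if not any(c.search(t) for c in compiled)]


if __name__ == "__main__":
    # Example names only; they follow the dataset.table pattern used in this bug.
    tables = [
        "telemetry.main",
        "telemetry_derived.clients_daily_v6",
        "telemetry_live.main_v4",
        "telemetry_stable.main_v4",
    ]
    # One substring hides _live but leaves _stable visible.
    print(filter_by_substring(tables, "_live"))
    # Two patterns hide both kinds of ping table.
    print(filter_by_patterns(tables, [r"_live\.", r"_stable\."]))
```

The patterns include the trailing dot so they match only the dataset part of the name, not a table that happens to contain "_live" or "_stable" elsewhere.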

STMO now points to mozdata, so this issue is no longer relevant.

Status: NEW → RESOLVED
Closed: 4 years ago
Resolution: --- → WONTFIX