GCP ingestion deployment for multiple incompatible changes
Categories
(Data Platform and Tools Graveyard :: Operations, task)
Tracking
(Not tracked)
People
(Reporter: whd, Assigned: whd)
Details
The deployment plan is as follows.
Stage:
- Stop jenkins triggers and automatic deploys for schemas builds and dataflow deploys
- Merge the following PRs and wait for docker images to be available:
jsonschema-transpiler: Add normalize-case option for snake_casing column names https://github.com/mozilla/jsonschema-transpiler/pull/79
mozilla-schema-generator: Drop under-specified columns https://github.com/mozilla/mozilla-schema-generator/pull/41
Add main ping: https://github.com/mozilla/mozilla-schema-generator/pull/44
(not yet filed) PR to use new jsonschema-transpiler snake case option in mozilla-schema-generator
- Trigger the schema generator job from airflow, and verify the generated-schemas branch looks correct
- Drain stage dataflow jobs
- Delete stage tables that have incompatible changes per previous steps
This could happen as part of the jenkins job anyway since we're moving to live and raw created tables. The cloudops-infra PR below is currently structured to leave them in, but they will be schema-incompatible and will hence require manual deletion.
- Merge the following PRs:
cloudops-infra: Add clustered live and raw tables: https://github.com/mozilla-services/cloudops-infra/pull/1183
Cut over beam jobs to live tables: https://github.com/mozilla-services/cloudops-infra/pull/1184
gcp-ingestion: Coerce camelCase field names to snake_case in BQ sink: https://github.com/mozilla/gcp-ingestion/pull/689
- Run the beam template build and bigquery stage schemas jenkins jobs manually
This will create the new live/raw dataset table structure with the snake-cased schemas
- Run the beam stage deploy jenkins job manually
At this point, we should be populating the new live tables with snake-cased schemas
- Verify everything is working as expected in stage
- Re-enable automatic stage jenkins deploys and triggers
Prod, which can be done separately, would then follow:
- Drain dataflow jobs
- Delete empty tables with schema incompatibilities
- Run bigquery prod jenkins job manually
- Copy existing data from prod tables into new live locations (these should have no schema incompatibility)
- Run beam prod deploy jenkins job manually
- Verify everything is working
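To illustrate why the table deletion steps above are needed, here is a minimal sketch of a schema comparison. It assumes a flat schema represented as a plain dict of column name to (type, mode); the `Schema` alias and the `incompatible_changes` helper are hypothetical names for illustration and are not part of any of the repositories linked here. It relies on the fact that BigQuery only permits additive in-place changes (new NULLABLE or REPEATED columns, or relaxing REQUIRED to NULLABLE), so a renamed column such as a camelCase name becoming snake_case shows up as an incompatibility.

```python
from typing import Dict, List, Tuple

# Hypothetical flat representation: column name -> (type, mode).
# Nested RECORD fields are ignored in this sketch.
Schema = Dict[str, Tuple[str, str]]


def incompatible_changes(old: Schema, new: Schema) -> List[str]:
    """List changes that cannot be applied to an existing table in place."""
    problems = []
    for name, (old_type, old_mode) in old.items():
        if name not in new:
            # A rename looks like a removal plus an addition.
            problems.append(f"{name}: column removed or renamed")
            continue
        new_type, new_mode = new[name]
        if new_type != old_type:
            problems.append(f"{name}: type {old_type} -> {new_type}")
        relaxed = old_mode == "REQUIRED" and new_mode == "NULLABLE"
        if new_mode != old_mode and not relaxed:
            problems.append(f"{name}: mode {old_mode} -> {new_mode}")
    for name, (_, mode) in new.items():
        # New columns are only allowed if they are not REQUIRED.
        if name not in old and mode == "REQUIRED":
            problems.append(f"{name}: new REQUIRED column")
    return problems


old = {"submissionDate": ("DATE", "NULLABLE")}
new = {"submission_date": ("DATE", "NULLABLE")}
print(incompatible_changes(old, new))
```

A table whose comparison returns an empty list could be updated in place; anything else would need to be deleted and recreated, as the plan describes.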
A separate but related change can then also be deployed, enabling the remaining telemetry ping types now that
https://github.com/mozilla/gcp-ingestion/pull/678 has been merged: https://github.com/mozilla/mozilla-schema-generator/pull/45
Given the complexity of this deploy and the increased likelihood of issues arising from this complexity, I'm going to wait until :klukas is back to deploy this. The primary blocking issue is to ensure that the snake casing libraries we're using in the transpiler (rust) and the dataflow job (java) are precisely compatible for all potential schemas we will ever generate. :amiyaguchi is working on verifying this.
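One way to picture the compatibility check described above is a differential test: feed the same field names to both implementations and report every name on which they disagree. The sketch below is only an illustration of that idea. The `find_mismatches` helper and the two stand-in functions are hypothetical; in practice the two callables would wrap the real Rust and Java code (for example by comparing their emitted outputs), which is not shown here.

```python
import re
from typing import Callable, Iterable, List, Tuple


def find_mismatches(
    impl_a: Callable[[str], str],
    impl_b: Callable[[str], str],
    names: Iterable[str],
) -> List[Tuple[str, str, str]]:
    """Return (input, output_a, output_b) for every input where outputs differ."""
    mismatches = []
    for name in names:
        a, b = impl_a(name), impl_b(name)
        if a != b:
            mismatches.append((name, a, b))
    return mismatches


# Stand-in implementations (assumptions for illustration only).
def split_lower_upper(name: str) -> str:
    # Underscore only where a lowercase letter or digit meets an uppercase one.
    return re.sub(r"([a-z0-9])([A-Z])", r"\1_\2", name).lower()


def split_every_capital(name: str) -> str:
    # Underscore before every uppercase letter except the first character.
    out = []
    for i, ch in enumerate(name):
        if ch.isupper() and i > 0:
            out.append("_")
        out.append(ch.lower())
    return "".join(out)


names = ["camelCase", "untrustedModules", "HTTPResponse"]
for row in find_mismatches(split_lower_upper, split_every_capital, names):
    print(row)
```

The two stand-ins agree on simple camelCase names and diverge on runs of capitals, which is the kind of edge case a shared corpus of names is meant to surface.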
Not to throw a wrench in the plans, but we're considering this PR so our handling of maps (and perhaps lists) in derived datasets matches that of the live tables: https://github.com/mozilla/parquet2bigquery/pull/28
Given we're going to be reloading all the derived datasets into clustered tables, if we do decide to unnest maps/lists, we should piggyback on that reload
Okay wait, I re-read the bug and it seems the derived datasets will be handled separately. Is that correct?
Comment 3 • 6 years ago
(In reply to Sunah Suh (she/her) [:sunahsuh] from comment #2)
> Okay wait, I re-read the bug and it seems the derived datasets will be handled separately. Is that correct?
That's correct. The changes described in this bug concern only the destination tables for "raw pings" populated by the pipeline. Derived datasets will be handled separately.
Comment 4 • 6 years ago (Assignee)
Additional changes / in-flight dependencies:
Update snake case logic to be consistent across https://github.com/mozilla/gcp-ingestion/pull/689 and https://github.com/mozilla/jsonschema-transpiler/pull/79
Per discussion, also introduce snake-casing for docTypes in table destination output for the special case of untrusted modules: https://github.com/mozilla-services/mozilla-pipeline-schemas/tree/master/schemas/telemetry/untrustedModules
This needs to be implemented in gcp-ingestion and mozilla-schema-generator/jsonschema-transpiler.
Low risk change we should also roll into this round: replace : with - in file names: https://github.com/mozilla/gcp-ingestion/pull/699/files
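As a small illustration of the file name change mentioned above, the substitution amounts to a single character replacement. The `sanitize_file_name` helper and the sample file name are hypothetical; the actual naming pattern used by the sink is not stated in this bug.

```python
def sanitize_file_name(name: str) -> str:
    """Replace ':' with '-' so the name avoids colons (hypothetical helper)."""
    return name.replace(":", "-")


# The sample name below is made up for illustration.
print(sanitize_file_name("2019-07-08T11:00:00Z.ndjson"))
```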
Comment 5 • 6 years ago
:amiyaguchi has developed a regex-based algorithm for snake casing that's easily implemented and testable across different languages. The gcp-ingestion PR now uses that methodology and snake-cases table names, so that side should be ready to go.
The jsonschema-transpiler PR now uses the regex-based algorithm for snake-casing and passes tests, but doesn't yet handle modifying untrustedModules to untrusted_modules.
https://github.com/mozilla/gcp-ingestion/pull/699 is approved and ready to be merged as part of this deployment.
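For readers unfamiliar with the approach, a regex-based snake-casing routine can be sketched as a short chain of substitutions. The expressions below are an assumption chosen for illustration and are not the ones used in the gcp-ingestion or jsonschema-transpiler PRs; in particular, how digits and runs of capitals are treated here is just one possible choice.

```python
import re

# Illustrative patterns only; not the expressions from the linked PRs.
_NON_ALNUM = re.compile(r"[^A-Za-z0-9]+")
_ACRONYM_BOUNDARY = re.compile(r"([A-Z]+)([A-Z][a-z])")
_LOWER_UPPER = re.compile(r"([a-z0-9])([A-Z])")


def snake_case(name: str) -> str:
    """Convert a camelCase or separator-delimited name to snake_case."""
    # Treat any run of non-alphanumeric characters as a word separator.
    s = _NON_ALNUM.sub("_", name)
    # Split a run of capitals from a following capitalized word.
    s = _ACRONYM_BOUNDARY.sub(r"\1_\2", s)
    # Split where a lowercase letter or digit is followed by a capital.
    s = _LOWER_UPPER.sub(r"\1_\2", s)
    return s.strip("_").lower()


for example in ["untrustedModules", "camelCase", "submission-date"]:
    print(example, "->", snake_case(example))
```

Because each step is a plain regular-expression substitution, the same sequence can be ported to other languages and checked against a shared list of inputs and expected outputs.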
Comment 6 • 6 years ago (Assignee)
Most of the pieces are now ready and reviewed. Since I'm out this afternoon and would like some overlap with :klukas during this deploy I'm aiming to deploy this tomorrow starting around 11AM Pacific.
The remaining things that should be done today are to merge the jsonschema-transpiler change and bump its version, and to file (but not merge) the MSG PR to use this new version. This latter PR is also a good place to put the snake-casing logic for untrustedModules.
Comment 7 • 6 years ago
(In reply to Wesley Dawson [:whd] from comment #6)
> Most of the pieces are now ready and reviewed. Since I'm out this afternoon and would like some overlap with :klukas during this deploy I'm aiming to deploy this tomorrow starting around 11AM Pacific.
That sounds good. I am blocking off my calendar for 11AM to 2PM tomorrow to walk through this.
> The remaining things that should be done today are to merge the jsonschema-transpiler change and bump its version, and to file (but not merge) the MSG PR to use this new version. This latter PR is also a good place to put the snake-casing logic for untrustedModules.
Prepping these changes in https://github.com/mozilla/mozilla-schema-generator/pull/46
Comment 8 • 6 years ago
All known needed changes are now staged. Step 2 can now read:
- Merge the following PRs and wait for docker images to be available:
Drop under-specified columns https://github.com/mozilla/mozilla-schema-generator/pull/41
Add main ping: https://github.com/mozilla/mozilla-schema-generator/pull/44
Use new jsonschema-transpiler snake case option in mozilla-schema-generator: https://github.com/mozilla/mozilla-schema-generator/pull/46
Comment 9 • 6 years ago
In step 6, we have two gcp-ingestion PRs we need to merge:
- Coerce camelCase to snake_case: https://github.com/mozilla/gcp-ingestion/pull/689
- Patch Beam to write into clustered tables: https://github.com/mozilla/gcp-ingestion/pull/672
Comment 10 • 6 years ago (Assignee)
This round of changes has been deployed, including the production changes. All steps were completed successfully and mostly without issue (so far). We have a couple of followups:
Verify main ping looks as expected WRT additional properties (:klukas)
Investigate protobuf serialization non-fatal errors (:klukas)
Check why bigquery table update failure didn't trigger an alert to telemetry-alerts (:whd)
Deploy https://bugzilla.mozilla.org/show_bug.cgi?id=1563742 (:whd)
Deploy https://github.com/mozilla/mozilla-schema-generator/pull/45 tomorrow (:whd)
Monitor telemetry bq sink for latency and drain time (:whd)
Comment 11 • 6 years ago
(In reply to Wesley Dawson [:whd] from comment #10)
> Verify main ping looks as expected WRT additional properties (:klukas)
Tracking this investigation in https://github.com/mozilla/mozilla-schema-generator/issues/47
> Investigate protobuf serialization non-fatal errors (:klukas)
Opened a case with Google: https://enterprise.google.com/supportcenter/managecases#Case/001000000040sBR/U-20037201
Updated•2 years ago