Closed Bug 1614813 Opened 5 years ago Closed 5 years ago

Deploy ISP lookup ingestion changes

Categories

(Data Platform and Tools Graveyard :: Operations, task)


Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: whd, Assigned: whd)

Attachments

(1 file)

The procedure as I understand it:

  1. Disable automated Jenkins beam-deploy-stage
  2. Merge https://github.com/mozilla-services/mozilla-pipeline-schemas/pull/501/files, https://github.com/mozilla/gcp-ingestion/pull/1117/files, and https://github.com/mozilla-services/cloudops-infra/pull/1826. This will trigger beam-build.
  3. Run the probe-scraper DAG to push changes to generated-schemas, which will trigger bigquery-stage
  4. Run Jenkins beam-geoip job to generate new geoip config
  5. Once beam-build, beam-geoip, and bigquery-stage have run successfully, run beam-deploy-stage manually
  6. Verify stage decoding includes ISP data
  7. Run bigquery-prod
  8. Run beam-deploy-prod
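
For reference, the ordering constraints in the procedure above can be sketched as a small dependency graph. This is only an illustration, not the actual Jenkins or Airflow configuration: the job names beam-build, beam-geoip, bigquery-stage, beam-deploy-stage, bigquery-prod, and beam-deploy-prod come from the list, while the other node names are hypothetical labels for the manual steps, and any edge not stated explicitly in the list (for example, beam-geoip waiting on the merge) is an assumption.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Each key may run only after every step in its value set has finished.
# Node names that are not Jenkins job names are hypothetical labels.
DEPLOY_GRAPH = {
    "disable-auto-beam-deploy-stage": set(),                  # step 1
    "merge-prs": {"disable-auto-beam-deploy-stage"},          # step 2
    "beam-build": {"merge-prs"},                              # triggered by the merge
    "probe-scraper": {"merge-prs"},                           # step 3
    "bigquery-stage": {"probe-scraper"},                      # triggered by step 3
    "beam-geoip": {"merge-prs"},                              # step 4 (edge assumed)
    "beam-deploy-stage": {"beam-build", "beam-geoip", "bigquery-stage"},  # step 5
    "verify-stage-isp": {"beam-deploy-stage"},                # step 6
    "bigquery-prod": {"verify-stage-isp"},                    # step 7
    "beam-deploy-prod": {"bigquery-prod"},                    # step 8
}


def deploy_order(graph):
    """Return one valid linear order in which the steps can be run."""
    return list(TopologicalSorter(graph).static_order())


if __name__ == "__main__":
    for position, step in enumerate(deploy_order(DEPLOY_GRAPH), start=1):
        print(position, step)
```

Writing it this way makes the one real join point visible: beam-deploy-stage is the only step that waits on three independent jobs, which is why its automated trigger is disabled in step 1.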

I'm going to take this opportunity to update geoip to the latest version; the version has been frozen due to https://github.com/mozilla/gcp-ingestion/issues/1059. If we encounter issues again, the fix is simply to re-run beam-geoip with the old geoip version and disable the job again.

That looks good to me. We only need to wait for data review to go through before deploying. I'll give a heads up when that is done.

Data review went through. This is good to go now.

I'm planning to roll this out around 13:00 Pacific tomorrow.

I'm rescheduling this for some time next week while we look into some issues, found while testing bug #1612367, that may or may not affect this deploy. Specifically, there may be issues deploying schema updates to tables with a high volume of BQ streaming inserts, which would include the payload_bytes tables (which haven't had schemas updates since we started streaming data into them).

Multiple issues turned a week into a month. We've finally arrived at a relatively stable pipeline configuration and I expect to roll these changes out early next week.

I planned to roll this out today, but due to a GCP incident yesterday that caused some instability in our probe scraper pipeline, I'm going to wait until tomorrow: clearing and re-running the latest successful MSG run caused a schemas regression.

Schemas updates generally are at present blocked by bug #1604919, so this deploy will block on a resolution to that.

Assignee: nobody → whd

Some additional code was required to propagate this change in the way we expect and that has been added to https://github.com/mozilla/gcp-ingestion/pull/1117/files.

Because of recent pipeline cost/instability issues and the upcoming testing in stage of ingestion-sink to address some of those issues (bug #1625330) taking priority, I'm tentatively scheduling this deploy for April 23rd, or (hopefully) April 27-28th if we deploy batch loads changes on the 23rd.

Reviewing this for deploy, I observe two things:

  1. It appears that we don't propagate ISP information to the errors stream

Given that the errors stream no longer contains IP information (but does contain geoip information), this means we can't recover ISP information from decoded errors anymore. I think we probably want to add the isp block to the errors schema as well to maintain parity with other geoip-derived information. NI :scholtzan to determine whether that's the case.

  2. The procedure should be revised to deploy the ingestion-sink code (with no schemas change) between steps 2 and 3

It's my understanding that the new sink code will work with the old decoded format (no geoip info) as well as with geoip info, but maybe not the other way around. In the case that it will only work if decoder and sink versions are synchronized, the deploy procedure will need to involve full drains/stoppage of the decoder and payload bytes sink. If there is no ordering dependency at all that would be great to know as well. NI :relud for confirmation.

Flags: needinfo?(dthorn)
Flags: needinfo?(ascholtz)
  1. It appears that we don't propagate ISP information to the errors stream

I agree, the ISP information should be added to the errors schema. We'll probably want to store the ISP name and organization, if that's what you mean by ISP block.
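
As a rough illustration of what checking for this could look like when verifying decoded output or error records, here is a minimal sketch. The record layout is an assumption made for the example: the `metadata` / `isp` nesting and the `name` and `organization` keys are hypothetical and should be checked against the actual schema in mozilla-pipeline-schemas.

```python
def missing_isp_fields(record, required=("name", "organization")):
    """Return the ISP fields that are absent or empty in a decoded record.

    Assumes (hypothetically) that ISP data lives under record["metadata"]["isp"].
    """
    isp = record.get("metadata", {}).get("isp") or {}
    return [field for field in required if not isp.get(field)]


if __name__ == "__main__":
    sample = {"metadata": {"geo": {"country": "US"}}}  # made-up record without ISP data
    print(missing_isp_fields(sample))
```

An empty list means both fields are present; anything else names what is still missing from that record.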

Flags: needinfo?(ascholtz)

This sink should not have an ordering dependency here. ISP data will just be lost until decoder/sink/tables are all updated.

Flags: needinfo?(dthorn)

(In reply to Anna Scholtz from comment #10)

  1. It appears that we don't propagate ISP information to the errors stream

I agree, the ISP information should be added to the errors schema. We'll probably want to store the ISP name and organization, if that's what you mean by ISP block.

:ascholtz, can you make this change? I'm not sure if it requires a code change beyond the simple schemas update.

Flags: needinfo?(ascholtz)

:relud, can you confirm all k8s sink OUTPUT_TYPEs will propagate ISP information? I'm not sure if the change described in comment #8 applied to both decoded and raw.

Flags: needinfo?(dthorn)

I updated the schema in https://github.com/mozilla-services/mozilla-pipeline-schemas/pull/551. There shouldn't be any code changes required for this.

Flags: needinfo?(ascholtz)

I'm scheduling this for May 21st 13:00 Pacific.

There shouldn't be any code changes required for this.

I'm going to take this as the confirmation I was seeking from :relud that no additional ingestion-sink changes are needed for OUTPUT_TYPE raw, so cancelling his NI.

Flags: needinfo?(dthorn)

(In reply to Wesley Dawson [:whd] from comment #14)

:relud, can you confirm all k8s sink OUTPUT_TYPEs will propagate ISP information? I'm not sure if the change described in comment #8 applied to both decoded and raw.

The code changes for ingestion-sink in https://github.com/mozilla/gcp-ingestion/pull/1117 are needed for ISP information to reach tables for OUTPUT_FORMAT=decoded. No code changes are needed for raw or payload.

This was deployed successfully today, so 2020-05-22 UTC will be the first day with complete ISP information. We had a few issues in https://github.com/mozilla-services/cloudops-infra/pull/2173 and https://github.com/mozilla/gcp-ingestion/pull/1247 due to how long ago these PRs were prepared, but the deploy otherwise went smoothly. The procedure included deploying sink schemas updates before beam code/schemas updates to ensure that pbd and live tables match for all ISP data.

Status: NEW → RESOLVED
Closed: 5 years ago
Resolution: --- → FIXED
Product: Data Platform and Tools → Data Platform and Tools Graveyard