Closed Bug 1502057 Opened 6 years ago Closed 3 years ago

Publish schema changes to PubSub

Categories

(Data Platform and Tools :: General, enhancement, P3)

Tracking

(Not tracked)

RESOLVED WONTFIX

People

(Reporter: klukas, Unassigned)

Details

(Whiteboard: [DataPlatform])

In the new GCP data architecture, we'd like our ingestion process to have access to the latest ping schemas as defined in the mozilla-pipeline-schemas repository. We have also identified other potential consumers of a stream of schema changes, such as triggering creation of new monitors when new doctypes are created, or generating and updating BigQuery tables.

If we were still using Kafka in our GCP architecture, I'd suggest publishing schemas to a log-compacted Kafka topic on each push to m-p-s. Arbitrary consumers would then be able to read from the beginning of the topic and always get the full current state of schemas.
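For illustration only, here is a minimal sketch of that Kafka approach (which we are not taking), assuming the kafka-python client, a compacted topic named `schemas` keyed by schema path, and a placeholder broker address:

```python
# Hypothetical sketch: bootstrapping full schema state from a log-compacted
# Kafka topic. Topic name, broker address, and key scheme are all assumptions.
from kafka import KafkaConsumer

consumer = KafkaConsumer(
    "schemas",
    bootstrap_servers="kafka:9092",  # assumed broker address
    auto_offset_reset="earliest",    # replay from the beginning of the log
    consumer_timeout_ms=5000,        # stop iterating once caught up
)

# Compaction retains at least the latest message per key, so replaying the
# topic yields the full current state; later values overwrite earlier ones.
schemas = {}
for record in consumer:
    schemas[record.key.decode()] = record.value.decode()
```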

The concept of a persistent log-compacted topic doesn't exist for PubSub, so applications will need to pull an initial copy of all schemas from some other source on startup, with PubSub providing only changes. The source for the current state of all schemas could be GitHub itself, or we could publish to a location in Google Cloud Storage (GCS), which would likely provide better uptime.
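A minimal sketch of that startup-plus-refresh pattern, using the google-cloud-storage and google-cloud-pubsub Python clients; the bucket and subscription names are hypothetical:

```python
# Hedged sketch: bucket and subscription names are placeholders, not real
# infrastructure. PubSub carries only a "refresh" signal; GCS is the source
# of truth for the full set of schemas.
from google.cloud import pubsub_v1, storage

BUCKET = "moz-pipeline-schemas"  # assumed bucket name
SUBSCRIPTION = "projects/my-project/subscriptions/schema-refresh"  # assumed

storage_client = storage.Client()

def load_all_schemas():
    """Pull the full current state of schemas from GCS."""
    return {
        blob.name: blob.download_as_bytes()
        for blob in storage_client.list_blobs(BUCKET)
    }

schemas = load_all_schemas()  # initial copy on startup

def on_refresh(message):
    global schemas
    schemas = load_all_schemas()  # re-pull everything; message body is ignored
    message.ack()

subscriber = pubsub_v1.SubscriberClient()
# Block and process refresh notifications indefinitely.
subscriber.subscribe(SUBSCRIPTION, callback=on_refresh).result()
```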

A proposed flow would be that we set up a CircleCI workflow for pushes to the dev branch of m-p-s that does the following (a code sketch follows the list):

- Publish the entire contents of the branch to a location in GCS
- Publish a message to a PubSub topic indicating that consumers should refresh their cache of schemas from GCS
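A sketch of what that CircleCI job might run, using the google-cloud Python clients; bucket, topic, and directory names are placeholder assumptions:

```python
# Hedged sketch of the CircleCI job body: mirror the branch's schemas into
# GCS, then publish a single refresh message. All names are assumptions.
import pathlib
from google.cloud import pubsub_v1, storage

BUCKET = "moz-pipeline-schemas"                      # assumed bucket name
TOPIC = "projects/my-project/topics/schema-updates"  # assumed topic name
SCHEMA_DIR = pathlib.Path("schemas")                 # m-p-s checkout directory

bucket = storage.Client().bucket(BUCKET)
for path in SCHEMA_DIR.rglob("*.schema.json"):
    # Mirror the repository layout into the bucket.
    bucket.blob(path.as_posix()).upload_from_filename(str(path))

publisher = pubsub_v1.PublisherClient()
# The message body is informational only; consumers refresh their cache
# from GCS rather than parsing the message.
publisher.publish(TOPIC, b"schemas updated").result()
```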

Timely delivery of the update message in this case relies on the uptime of GitHub webhooks, CircleCI, GCS, and Google PubSub.

## Alternative ideas

We could pursue a more general-purpose solution with a GitHub organization-level webhook to publish GitHub events. In this case, we'd likely need to set up a Google Cloud Function as the webhook destination, and the function would handle republishing events to PubSub. The Google Cloud community docs have a Cloud Function example that accepts GitHub webhooks [0].

[0] https://cloud.google.com/community/tutorials/cloud-functions-github-auto-deployer
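As a rough sketch (the function, topic name, and security handling below are hypothetical), such a Cloud Function could be little more than a relay:

```python
# Hypothetical HTTP Cloud Function (Python runtime) that republishes GitHub
# webhook deliveries to a PubSub topic. The topic name is an assumption.
from google.cloud import pubsub_v1

TOPIC = "projects/my-project/topics/github-events"  # assumed topic name
publisher = pubsub_v1.PublisherClient()

def github_webhook(request):
    """Entry point: forward the raw event payload to PubSub."""
    event_type = request.headers.get("X-GitHub-Event", "unknown")
    # A real deployment should verify the X-Hub-Signature header first.
    publisher.publish(TOPIC, request.get_data(), event=event_type).result()
    return ("", 204)
```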
Priority: -- → P2
GCP also offers PubSub notifications for changes in a GCS bucket [1], so we could write a CircleCI job that just uploads the contents to GCS, and the bucket configuration would handle sending a message to PubSub.

[1] https://cloud.google.com/storage/docs/pubsub-notifications
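Configuring this is a one-time setup step. A sketch using the google-cloud-storage Python client, with bucket and topic names as assumptions:

```python
# One-time setup sketch: attach a PubSub notification config to the bucket.
# Bucket and topic names are placeholders. After this, any object change in
# the bucket produces a PubSub message with no extra code in the upload job.
from google.cloud import storage

bucket = storage.Client().bucket("moz-pipeline-schemas")  # assumed name
notification = bucket.notification(
    topic_name="schema-updates",      # assumed topic in the same project
    payload_format="JSON_API_V1",     # include object metadata in messages
    event_types=["OBJECT_FINALIZE"],  # fire on object create/overwrite
)
notification.create()
```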
I like the idea of having GCS write a notification on change.

This would also mean we could potentially have multiple repositories of schemas (something we've talked about in the past, but haven't yet needed), each of which could signal a refresh by writing to a similarly configured bucket or by publishing directly to the topic.

Do you think it would make sense to have this live in some GCS bucket that's an analog of net-mozaws-prod-us-west-2-pipeline-metadata? Or have a bucket that's just for pipeline schemas?
Priority: P2 → P3
Whiteboard: [DataPlatform]

We have built a robust pipeline for writing schema artifacts, and we now have a "dry run" endpoint that allows fetching the git hash of the currently deployed schema artifact. This has fulfilled all of our needs for tracking schema changes so far.

Status: NEW → RESOLVED
Closed: 3 years ago
Resolution: --- → WONTFIX