(NOTE: this does not need to be solved for our 2018H1 goals) https://public.etherpad-mozilla.org/p/jonasfj-postgres-migration-questions Copying from some emails --- (from me) --- It's worth thinking about r17y in this context, too -- in a multi-instance situation, we can't assume that step A happens long before step B. For example, for a hard migration we might set up a second table, modify the code to write to both, and wait until everything in the first table had expired, or manually backfill it. We would need to explain how to do all of that to any other instances of Taskcluster. Maybe we say soft migrations are downtime-free, and hard migrations always involve downtime (and a major version bump along with detailed migration instructions, once we have r17y). So we explicitly exclude the middle ground of adding "temporary" code to handle a hard migration, then later removing that code. --- --- (from Jonas) --- I like this 99% of what we do is soft migrations, add column, etc... Thinking about this from a versioning perspective is interesting. Perhaps we should say that breaking changes to the database requires major version bumps. Example: v1.0 CREATE TABLE tasks v1.1 ALTER TABLE tasks ADD COLUMN rank INT v2.0 ALTER TABLE tasks RENAME COLUMN rank TO task_rank So the requirement is that: version 1.0 of taskcluster-queue can run using database schema from >= 1.0 But if you move the database to schema from version 2.0, then you MUST also update the taskcluster-queue application code first. Isn't that kind of how libraries works too? it keeps working with newer libraries until major version mismatch happens. I guess if we wanted downtime-free migrations from 1.x to 2.x we could do it by saying that 1.y for y > x (typically 1.9) should support working with a database containing either version 1.x schema or version 2.x schema. I have no idea how one would write an app like that though :) Or we could just advice people to use the "follower" pattern if they want zero downtime (but read-only state) migrations from 1.x to 2.x. --- Or is it possible that we say that database and app must not be more than one minor version out of sync. And if they are more than one minor version out of sync, then you either: A) migrate both app and database through all the minor versions B) accept downtime and migrate directly to latest version (while the application is offline) Hmm, I guess that's similar to what I suggest by saying the last minor version in the 1.x series should be compatible the version 2.x database schema. Maybe we have: X.Y.Z Where we bump conditions are: Z: bugfixes only, no significant changes (don't add database columns or new API methods) Y: Functionality is added, but backwards compatibility is maintained (add database columns, new API methods; don't remove columns or API methods) X: Backwards compatibility breaks, anything goes... Then we probably have to tag in git, and push docker containers to docker hub. When we do r17y. --- --- (from me) --- I like the auth:postgres idea. I'd like it more if it was another service, but that's an argument for another day :) Regarding versions, I like the first bit you suggest. I think the rest sounds too fancy (and really hard to guarantee). So TC-1.0 -- DB-1.0 == OK TC-1.1 -- DB-1.1 == BAD TC-1.1 -- DB-1.2 == BAD TC-1.0 -- DB-1.1 == OK TC-1.1 -- DB-1.1 == OK TC-1.0 -- DB-1.2 == untested (but probably OK) TC-1.1 -- DB-1.2 == OK TC-2.x -- DB-1.x == BAD TC-1.x -- DB-2.x == BAD in other words, for minor version changes, upgrade the DB first, preferably one minor version at a time. We would need to build a system for testing that -- probably something with some DB dumps of each version stored in the repo. I think ensuring that TC-1.9 was compatible with DB-2.0 would be too difficult and that we should just require downtime for major version upgrades. Mozilla has a method to support such things, and they will be rare. --- --- (from Jonas) --- I assume: TC-1.1 -- DB-1.1 == BAD was supposed to say: TC-1.1 -- DB-1.1 == OK We would need to build a system for testing that -- probably something with some DB dumps of each version stored in the repo. We could test migration on data left behind, by integration tests... IMO; we should just up testing this too much, it's unlikely to pay of... We should test that migrations can be rolled back, that should be fine.. We can also say that migration changes must land before we land we land code changes.. hence, if integration tests pass, then the migration must be backwards compatible. (note: land on master, who cares if they deploy first) I think ensuring that TC-1.9 was compatible with DB-2.0 would be too difficult and that we should just require downtime for major version upgrades. fair enough. Mozilla has a method to support such things, and they will be rare. Still I prefer to do it without telling people :) ---
So any of these methods mean that users need to take a downtime to run DB migrations. That stinks. The existing azure-entities migration system has served us quite well so far, though. We could do something similar with postgres, but using different table names for different versions. So: sprockets_v1: sprocketId * sprocketProperty1 sprockets_v2: sprocketId * sprocketProperty2 sprocketProperty3 with some code in place to read first from sprockets_v2 then _v1, migrating (using a migration function) on write. In fact, I think we could implement this with triggers and views, entirely in the DB. That means the DB can even perform joins over these structures! It will limit our ability to use the full expressiveness of an RDBMS a little, but we've been fine with non-joinable azure tables for a long time now, so that's probably OK. And it will require a careful testing regime, but we can build some tooling for that purpose, too. A few notes from having done so in Buildbot: * capture small database snapshots before each migration, and then apply those and test that the most recent version works as expected -- thereby testing migration from every previous version * unit test the migration functions to exercise corner cases
John, I'm going to hand this off to you since you're working on the postgres migration.
Assignee: dustin → jhford
Component: Redeployability → Services
Product: Taskcluster → Taskcluster
You need to log in before you can comment on or make changes to this bug.