Closed Bug 1635525 Opened 10 months ago Closed 9 months ago

[meta] switch bouncer production DB to use Nazgul

Categories

(Cloud Services :: Operations: Product Delivery, task)

Tracking

(firefox-esr68 fixed, firefox77 fixed, firefox78 fixed, firefox79 fixed)

RESOLVED FIXED
Firefox 79

People

(Reporter: mtabara, Assigned: oremj)

References

(Blocks 1 open bug)

Attachments

(2 files)

As discussed here, we're planning to migrate the bouncer admin from Tuxedo to Nazgul in H1. This bug tracks that work on the CloudOps side.

RelEng automation currently pushes to both Tuxedo and Nazgul successfully. We still need to uplift patches to ESR68 and consequently uplift those to comm-esr68, but once that's done, we're ready to migrate.

Filing this bug for now; I'll ping once all the RelEng bits are done (I hope by EOW) so that we can coordinate on a time/date for the production migration.

Please feel free to open other bugs if needed on CloudOps side.

Assignee: nobody → oremj
Blocks: 1629944, 1635159
Attached image migration-plan

With 77.0 released earlier last week, we are now entering the mid-beta cycle, so it's a good time to migrate to bouncer Nazgul. In preparation for that, I just wanted to confirm a few things:

  1. The Nazgul DB is a mere temporary clone of the production DB
  2. The upcoming “migration” means that you’re killing Tuxedo and pointing Nazgul to talk to the Production-DB
  3. Eventually we can retire Nazgul’s Copy-DB, once things are confirmed to be working fine.
  4. Downstream consumers like go-bouncer and bedrock should notice no downtime/difference in querying the DB

Check things are fine after the migration:
a) Nazgul automation will go smoothly, but this time affecting the Production DB, instead of the Copy-DB
b) go-bouncer tests are still passing and green
c) no downtime for bedrock
d) Nazgul automation should work as expected (bouncer-locations on nightly shall be first, then the Wednesday's beta for bouncer-aliases + bouncer-submission jobs)
e) our own cron-tests-jobs that run in tree should run checks against download.mozilla.org

Plan in case things go wrong:

  • the biggest risk is Nazgul corrupting the production database and us having to roll back
  • we will have a snapshot of the DB, so we can revert from Nazgul to Tuxedo, rerun the tasks and also roll back the DB

This migration is happening on Tuesday, 9th of June, 10:00 PT.
RelEng (mtabara) / Cloudops (oremj) involved.

Migration is done, Tuxedo is shut down, and Nazgul is now talking to the Production DB.

  • we've manually tested writes via Nazgul to prod-db and they went smoothly
  • we've manually tested reads via Nazgul by rerunning the bouncer-locations-nazgul job from the most recent nightly; it went green.

Next steps:

  • wait for tonight's beta 78.0b5 and ensure that it works smoothly, bouncer-wise, end-to-end
  • confirm with @oremj that things were fine
  • start patching up the removal of the old Tuxedo jobs in-tree + configurations from bouncerscript.

Issues we've hit in beta last night:

  1. vanilla bouncer jobs are still blocking various barrier tasks, such as notify-promotion or push-to-release --> we need to remove them asap
  2. for some reason, the vanilla bouncer jobs are set as deps for notify-promotion, while the nazgul jobs are not --> we need to replace that dep when we clean up the old jobs; we had to force 2 x release-notify-promote to unblock release promotion
  3. both release-bouncer-check-devedition (WjjJOTJhS8K335gM9NGY2g) and release-bouncer-check-firefox (bDOKnxecRLydslIsoMyb-A) failed unexpectedly; the failures are related to partial update products, which still return 404s after pushing to mirrors, while other products are fine
    the task definition for release-bouncer-sub-nazgul-firefox (LD3jsVOmQiqL_wBTakOriA) doesn't have the three partial products that release-bouncer-sub-firefox (EgEz_nIzTl-zAEF6gisKyg) has. The diff between them is https://gist.github.com/nthomas-mozilla/98694ad3773e444719384228b5ee1029#file-task-def-diff-L172-L213
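
The comparison between the two task definitions can be automated rather than eyeballed. Below is a minimal sketch, assuming each task definition has been downloaded as JSON and that the product entries sit under `payload["submission_entries"]`; that key name and the sample product names are assumptions for illustration, not taken from the tasks above.

```python
def missing_products(vanilla_task, nazgul_task, key="submission_entries"):
    """Return product names found in the vanilla payload but not in the Nazgul one.

    `key` is an assumed payload field name; adjust it to match the real task definition.
    """
    vanilla = set(vanilla_task.get("payload", {}).get(key, {}))
    nazgul = set(nazgul_task.get("payload", {}).get(key, {}))
    return sorted(vanilla - nazgul)


# Hypothetical, trimmed-down task definitions for illustration only.
vanilla_task = {
    "payload": {
        "submission_entries": {
            "Firefox-78.0b5": {},
            "Firefox-78.0b5-Partial-78.0b4": {},
        }
    }
}
nazgul_task = {"payload": {"submission_entries": {"Firefox-78.0b5": {}}}}

print(missing_products(vanilla_task, nazgul_task))
```

Running the same check in the other direction (swapping the arguments) would show products that only the Nazgul job submits.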

Sum-up:
0. In order to unblock the betas, we need to backfill the missing partial products and rerun the bouncer-check-{firefox,devedition} jobs to unblock the rest of the release automation; I'll file a separate bug and wait for CloudOps to come online

  1. Start fixing the deps - each task that depends on vanilla bouncer jobs should depend on the nazgul counterpart too, e.g. release-notify-promotion, but there could be others as well.
  2. Clean up the vanilla jobs so that they are no longer red in-tree + uplift to beta and let the trains ride
  3. Fix in-tree so that we're including the missing partials too. We should have no difference between nazgul and vanilla jobs. I remember running taskgraph diff against this but for some reason I missed something.
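
Step 1 can be checked mechanically against a generated graph. The sketch below assumes the graph is available as a mapping of task label to a dict with a `dependencies` mapping (edge name to dependency label), and that the vanilla-to-nazgul label pairs are supplied by hand; the labels in the sample data are illustrative, not copied from a real graph.

```python
def tasks_missing_nazgul_dep(graph, counterparts):
    """Find tasks that depend on a vanilla bouncer job but not on its nazgul counterpart.

    graph:        {task_label: {"dependencies": {edge_name: dep_label}}}  (assumed shape)
    counterparts: {vanilla_label: nazgul_label}
    Returns {task_label: [missing nazgul labels]}.
    """
    problems = {}
    for label, task in graph.items():
        deps = set(task.get("dependencies", {}).values())
        missing = sorted(
            nazgul
            for vanilla, nazgul in counterparts.items()
            if vanilla in deps and nazgul not in deps
        )
        if missing:
            problems[label] = missing
    return problems


# Illustrative labels only.
counterparts = {"release-bouncer-sub-firefox": "release-bouncer-sub-nazgul-firefox"}
graph = {
    "release-notify-promote-firefox": {
        "dependencies": {"bouncer-sub": "release-bouncer-sub-firefox"}
    },
    "release-bouncer-check-firefox": {
        "dependencies": {
            "bouncer-sub": "release-bouncer-sub-firefox",
            "bouncer-sub-nazgul": "release-bouncer-sub-nazgul-firefox",
        }
    },
}

print(tasks_missing_nazgul_dep(graph, counterparts))
```

An empty result would mean every task gated on a vanilla job is also gated on the nazgul one, which is the state step 1 is aiming for.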

@oremj:

Seems like there were some missing products in the Nazgul jobs as opposed to the vanilla jobs. We need to backfill those manually to unblock today's beta:

"Firefox-78.0b5-Partial-78.0b2": {
    "options": {
        "add_locales": true,
        "ssl_only": false
    },
    "paths_per_bouncer_platform": {
        "linux": "/firefox/releases/78.0b5/update/linux-i686/:lang/firefox-78.0b2-78.0b5.partial.mar",
        "linux64": "/firefox/releases/78.0b5/update/linux-x86_64/:lang/firefox-78.0b2-78.0b5.partial.mar",
        "osx": "/firefox/releases/78.0b5/update/mac/:lang/firefox-78.0b2-78.0b5.partial.mar",
        "win": "/firefox/releases/78.0b5/update/win32/:lang/firefox-78.0b2-78.0b5.partial.mar",
        "win64": "/firefox/releases/78.0b5/update/win64/:lang/firefox-78.0b2-78.0b5.partial.mar",
        "win64-aarch64": "/firefox/releases/78.0b5/update/win64-aarch64/:lang/firefox-78.0b2-78.0b5.partial.mar"
    }
},
"Firefox-78.0b5-Partial-78.0b3": {
    "options": {
        "add_locales": true,
        "ssl_only": false
    },
    "paths_per_bouncer_platform": {
        "linux": "/firefox/releases/78.0b5/update/linux-i686/:lang/firefox-78.0b3-78.0b5.partial.mar",
        "linux64": "/firefox/releases/78.0b5/update/linux-x86_64/:lang/firefox-78.0b3-78.0b5.partial.mar",
        "osx": "/firefox/releases/78.0b5/update/mac/:lang/firefox-78.0b3-78.0b5.partial.mar",
        "win": "/firefox/releases/78.0b5/update/win32/:lang/firefox-78.0b3-78.0b5.partial.mar",
        "win64": "/firefox/releases/78.0b5/update/win64/:lang/firefox-78.0b3-78.0b5.partial.mar",
        "win64-aarch64": "/firefox/releases/78.0b5/update/win64-aarch64/:lang/firefox-78.0b3-78.0b5.partial.mar"
    }
},
"Firefox-78.0b5-Partial-78.0b4": {
    "options": {
        "add_locales": true,
        "ssl_only": false
    },
    "paths_per_bouncer_platform": {
        "linux": "/firefox/releases/78.0b5/update/linux-i686/:lang/firefox-78.0b4-78.0b5.partial.mar",
        "linux64": "/firefox/releases/78.0b5/update/linux-x86_64/:lang/firefox-78.0b4-78.0b5.partial.mar",
        "osx": "/firefox/releases/78.0b5/update/mac/:lang/firefox-78.0b4-78.0b5.partial.mar",
        "win": "/firefox/releases/78.0b5/update/win32/:lang/firefox-78.0b4-78.0b5.partial.mar",
        "win64": "/firefox/releases/78.0b5/update/win64/:lang/firefox-78.0b4-78.0b5.partial.mar",
        "win64-aarch64": "/firefox/releases/78.0b5/update/win64-aarch64/:lang/firefox-78.0b4-78.0b5.partial.mar"
    }
},
"Devedition-78.0b5-Partial-78.0b4": {
  "options": {
    "add_locales": true,
    "ssl_only": false
  },
  "paths_per_bouncer_platform": {
    "osx": "/devedition/releases/78.0b5/update/mac/:lang/firefox-78.0b4-78.0b5.partial.mar",
    "win64": "/devedition/releases/78.0b5/update/win64/:lang/firefox-78.0b4-78.0b5.partial.mar",
    "linux64": "/devedition/releases/78.0b5/update/linux-x86_64/:lang/firefox-78.0b4-78.0b5.partial.mar",
    "linux": "/devedition/releases/78.0b5/update/linux-i686/:lang/firefox-78.0b4-78.0b5.partial.mar",
    "win64-aarch64": "/devedition/releases/78.0b5/update/win64-aarch64/:lang/firefox-78.0b4-78.0b5.partial.mar",
    "win": "/devedition/releases/78.0b5/update/win32/:lang/firefox-78.0b4-78.0b5.partial.mar"
  }
},
"Devedition-78.0b5-Partial-78.0b2": {
  "options": {
    "add_locales": true,
    "ssl_only": false
  },
  "paths_per_bouncer_platform": {
    "osx": "/devedition/releases/78.0b5/update/mac/:lang/firefox-78.0b2-78.0b5.partial.mar",
    "win64": "/devedition/releases/78.0b5/update/win64/:lang/firefox-78.0b2-78.0b5.partial.mar",
    "linux64": "/devedition/releases/78.0b5/update/linux-x86_64/:lang/firefox-78.0b2-78.0b5.partial.mar",
    "linux": "/devedition/releases/78.0b5/update/linux-i686/:lang/firefox-78.0b2-78.0b5.partial.mar",
    "win64-aarch64": "/devedition/releases/78.0b5/update/win64-aarch64/:lang/firefox-78.0b2-78.0b5.partial.mar",
    "win": "/devedition/releases/78.0b5/update/win32/:lang/firefox-78.0b2-78.0b5.partial.mar"
  }
},
"Devedition-78.0b5-Partial-78.0b3": {
  "options": {
    "add_locales": true,
    "ssl_only": false
  },
  "paths_per_bouncer_platform": {
    "osx": "/devedition/releases/78.0b5/update/mac/:lang/firefox-78.0b3-78.0b5.partial.mar",
    "win64": "/devedition/releases/78.0b5/update/win64/:lang/firefox-78.0b3-78.0b5.partial.mar",
    "linux64": "/devedition/releases/78.0b5/update/linux-x86_64/:lang/firefox-78.0b3-78.0b5.partial.mar",
    "linux": "/devedition/releases/78.0b5/update/linux-i686/:lang/firefox-78.0b3-78.0b5.partial.mar",
    "win64-aarch64": "/devedition/releases/78.0b5/update/win64-aarch64/:lang/firefox-78.0b3-78.0b5.partial.mar",
    "win": "/devedition/releases/78.0b5/update/win32/:lang/firefox-78.0b3-78.0b5.partial.mar"
  }
}


Let me know if you want me to translate this into the curl commands, similarly to https://bugzilla.mozilla.org/show_bug.cgi?id=1631526#c5 or whether you already have that in shell history/scripts.
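
As a starting point for either approach, the payload above can be flattened into one row per product and platform, which is roughly the unit a manual backfill would work through. This is only a sketch: it does not call any Nazgul or bouncer admin API, and the check URL format (`product`, `os`, `lang` query parameters on download.mozilla.org) is an assumption about how the entries would be verified afterwards.

```python
from urllib.parse import urlencode


def location_rows(entries):
    """Flatten {product: {"paths_per_bouncer_platform": {...}}} into (product, platform, path) rows."""
    rows = []
    for product, entry in entries.items():
        for platform, path in entry["paths_per_bouncer_platform"].items():
            rows.append((product, platform, path))
    return rows


def check_url(product, platform, lang="en-US", base="https://download.mozilla.org/"):
    """Build a download URL to verify an entry by hand (assumed query format)."""
    return base + "?" + urlencode({"product": product, "os": platform, "lang": lang})


# One product from the list above, trimmed to two platforms.
entries = {
    "Firefox-78.0b5-Partial-78.0b4": {
        "options": {"add_locales": True, "ssl_only": False},
        "paths_per_bouncer_platform": {
            "linux64": "/firefox/releases/78.0b5/update/linux-x86_64/:lang/firefox-78.0b4-78.0b5.partial.mar",
            "win64": "/firefox/releases/78.0b5/update/win64/:lang/firefox-78.0b4-78.0b5.partial.mar",
        },
    }
}

for product, platform, path in location_rows(entries):
    print(product, platform, path)
    print("  check:", check_url(product, platform))
```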

Flags: needinfo?(oremj)

Note to self: found one of the reasons why we ended up here. Most of my testing in preparation for this was done via taskgraph target-task and params.yml for each of the promotion phases. I completely missed the fact that partials come from Ship-it, which builds the AC task and sets them up in the payload, which then, at runtime, bakes them into an environment variable here, so that when we run the taskgraph to build the rest of the graph we find them in the release_config here. Because that's empty locally, I never actually tested that part. Also, note that the nazgul jobs are missing from https://hg.mozilla.org/mozilla-central/file/tip/taskcluster/taskgraph/util/scriptworker.py#l316. Had I tested this right, I would've noticed that the vanilla bouncer jobs generate the partials entries while the nazgul jobs don't. I need to update that there.

TIL for me:

  • when testing against AC tasks, I should use the taskgraph action-callback as opposed to taskgraph target-graph
    e.g. (with action_input.json taken from the AC task's extra payload)
./mach taskgraph test-action-callback --verbose --task-group-id <decision-task-id-to-promote> --input action_input.json release-promotion > dirty_action_callback.json

  • I should've grepped for bouncer entries across the whole tree, rather than just duplicating the existing jobs. Vanilla bouncer jobs appear as dep tasks, in if conditions, etc.
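
A tree-wide search like the one described in the second point could be scripted as below. This is a rough sketch, not part of the landed patch: the file suffixes and the plain substring match are assumptions, and a plain `grep -rn bouncer` over the tree would do the same job.

```python
from pathlib import Path


def find_references(root, needle="bouncer", suffixes=(".yml", ".py")):
    """Return (relative_path, line_number, stripped_line) for every line containing `needle`."""
    root = Path(root)
    hits = []
    for path in sorted(root.rglob("*")):
        # Only look at regular files with the chosen suffixes.
        if not path.is_file() or path.suffix not in suffixes:
            continue
        text = path.read_text(encoding="utf-8", errors="replace")
        for lineno, line in enumerate(text.splitlines(), start=1):
            if needle in line:
                hits.append((path.relative_to(root).as_posix(), lineno, line.strip()))
    return hits
```

Calling `find_references("taskcluster")` from the top of a checkout would list every kind definition, transform, or helper that mentions bouncer, including dependency lists and conditionals that simply duplicating the job definitions would miss.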

Just to confirm, plan going forward is:
a) fix nazgul to properly mirror vanilla bouncer
b) let one beta go successfully without any glitch (that is expected failures for vanilla and green for nazgul and all related bouncer checks green - hopefully this Friday’s beta since I broke last night’s one)
c) remove all vanilla jobs from tree + configs + SOPS, etc
d) rename nazgul back to vanilla

Comment on attachment 9155585 [details]
Bug 1635525 - fix broken nazgul links and deps. r=rail

Beta/Release Uplift Approval Request

  • User impact if declined: Broken updates in release automation and delays in QA. Without this patch, bouncer no longer submits the partials information correctly, so we need to change it manually, which adds delays to the release.
  • Is this code covered by automated tests?: No
  • Has the fix been verified in Nightly?: No
  • Needs manual test from QE?: No
  • If yes, steps to reproduce:
  • List of other uplifts needed: None
  • Risk to taking this patch: Low
  • Why is the change risky/not risky? (and alternatives if risky):
  • String changes made/needed:

ESR Uplift Approval Request

  • If this is not a sec:{high,crit} bug, please state case for ESR consideration: Broken updates in release automation and delays in QA. Without this patch, bouncer no longer submits the partials information correctly, so we need to change it manually, which adds delays to the release.
  • User impact if declined:
  • Fix Landed on Version:
  • Risk to taking this patch: Low
  • Why is the change risky/not risky? (and alternatives if risky):
  • String or UUID changes made by this patch:
Attachment #9155585 - Flags: approval-mozilla-release?
Attachment #9155585 - Flags: approval-mozilla-esr68?
Attachment #9155585 - Flags: approval-mozilla-beta?

@rjl: can you please graft this in Thunderbird as well? Without it, some deps are missing in the graphs and partials are no longer working in bouncer (which means we need to backfill them manually via CloudOps)

Flags: needinfo?(rob)
Pushed by mtabara@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/7777540437e9
fix broken nazgul links and deps. r=rail
Depends on: 1644857

Created bug 1644857 to track the Thunderbird end of things.

Flags: needinfo?(rob)
Status: NEW → RESOLVED
Closed: 9 months ago
Resolution: --- → FIXED
Target Milestone: --- → Firefox 79

Let's reopen this until Friday's beta confirms things are smooth. I'll file a separate bug to track the removal of vanilla bouncer and only keep Nazgul.

Status: RESOLVED → REOPENED
Resolution: FIXED → ---

Removing the NI, kudos to @oremj for his support yesterday!

Flags: needinfo?(oremj)
Depends on: 1644973
Depends on: 1644974
Attachment #9155585 - Flags: approval-mozilla-release? → approval-mozilla-release+
Attachment #9155585 - Flags: approval-mozilla-esr68? → approval-mozilla-esr68+
Attachment #9155585 - Flags: approval-mozilla-beta? → approval-mozilla-beta+
Depends on: 1645001

Mihai, do you have removing the old BncLoc jobs for nightly on your radar ? eg https://treeherder.mozilla.org/#/jobs?repo=mozilla-central&searchStr=BncLoc

Flags: needinfo?(mtabara)

Actually, I bet that's bug 1644973.

(In reply to Nick Thomas [:nthomas] (UTC+13) from comment #20)

Mihai, do you have removing the old BncLoc jobs for nightly on your radar ? eg https://treeherder.mozilla.org/#/jobs?repo=mozilla-central&searchStr=BncLoc

Thanks for raising this! Yes, that's bug 1644973 indeed. I'll land that today, at the same time as the bouncerscript/SOPS changes, to minimize breakage.

Flags: needinfo?(mtabara)
Status: REOPENED → RESOLVED
Closed: 9 months ago
Resolution: --- → FIXED