Closed Bug 730008 Opened 12 years ago Closed 12 years ago

[bedrock] Integrate chief/commander to allow devs to push bedrock live

Categories

Product: Infrastructure & Operations Graveyard
Component: WebOps: Other
Type: task
Hardware: x86
OS: macOS
Priority: Not set
Severity: normal

Tracking

Tracking flags: Not tracked
Status: RESOLVED FIXED

People

Reporter: jlong
Assignee: nmaul

Details

This is somewhat related to bug 728386.

A critical component of adopting bedrock will be the ability for devs to push content live. It's already difficult enough that the engagement team can't push stuff live, and if we move this ability down to the IT level, there will be two layers to go through just to push some content live.

AMO has a big red button to push stuff live (https://github.com/jbalogh/chief). I'd like to integrate this functionality in bedrock. We can figure out the right workflow, and it's fine if it's restricted in certain ways, but as long as we can push the button a few times a week, that will work.

This would probably remove the need for a fourth site that auto-updates from master, as discussed in bug 728386. If we can push to stage too, then we don't need that.

When we know some heavy changes are coming through, we'd still involve IT, and they can use stage to test the rollout.

How does this sound? I'm willing to write the chief script and get it all hooked up but obviously we'll need some help from IT.
Summary: Integrate chief/commander to allow devs to push bedrock live → [bedrock] Integrate chief/commander to allow devs to push bedrock live
As part of this, please also reconcile this:
https://github.com/mozilla/bedrock/blob/master/bin/update_site.py

with whatever script we're currently using to do a production push for bedrock. This will a) keep us from losing the script and b) keep us from having to guess what the pull/compile/push procedure for bedrock is.
chief was designed with AMO in mind, and recent attempts to use it elsewhere have shown that it is not easily portable. This will take some time since it is not a high priority, but we'll get it done.
Depends on: 737925
Blocks: 736338
Finding a solution to our push problem is a high priority, though it doesn't have to be in the form of chief. How would you suggest we structure our pushes with IT? We update frequently and also need emergency pushes every now and then when something is mysteriously broken on the live site.

Something like two pushes a week (Tuesday and Thursday) would be great, but I'm not sure whether that would be a strain on IT. mozilla.org doesn't even have a database, so it's usually a simple push.
No longer blocks: 736338
I know there are some other bugs tracking our work on making commander/chief our standard solution; is this bug still necessary, particularly with the move to bedrock on Django?
(In reply to James Long (:jlongster) from comment #3)
> Finding a solution to our push problem is a high priority, though it doesn't
> have to be in the form of chief. How would you suggest we structure our
> pushes with IT? We update frequently and also need emergency pushes every
> now and then when something is mysteriously broken on the live site.

I suggest filing blocker bugs to get emergency fixes pushed.  Planned updates can be coordinated with Ops (who does not sleep) so that you don't feel there are too many levels to go through.

The QA process should catch most/all of those and this shouldn't be something that needs to happen often.
I agree with mrz.

Privilege & responsibility are coupled -- allowing pushes to be triggered by people who aren't prepared to fix mistakes/errors is asking for outages.  Thus, the team with the pager is the team who should push releases.

With Ops being global, there's no reason why having them do the push should delay anything, and if something *does* go wrong, they're right there to fix it.
(In reply to Jeff Vier [:jvier] from comment #6)
> Privilege & responsibility are coupled -- allowing pushes to be triggered by
> people who aren't prepared to fix mistakes/errors is asking for outages. 
> Thus, the team with the pager is the team who should push releases.

For www.mozilla.org we can't risk any downtime; I have to agree with this.
Status: NEW → RESOLVED
Closed: 12 years ago
Resolution: --- → WONTFIX
For the record, I don't expect the usual simple content changes on mozilla.org to cause significant problems. As with AMO and SUMO, simple pushes and simple fixes are key. Not everything will be 100% perfect all the time. Just last week, the SUMO team gave a very enlightening brownbag on this issue.

I respectfully disagree that every simple push needs direct Ops action (if it does, the dev/release process is broken and should be fixed). But if you can't stomach extending to mozilla.org the same privilege that AMO and SUMO have, that's fine; just expect a *lot* of content pushes.
I agree with Fred, here are a few points:

* The QA process does not catch everything, and it's quite common for something to be broken (a newsletter form not signing people up, etc.). mozilla.org changes too quickly for QA to catch everything, and regressions are common. We are trying our best to fix this.

* Content pushes need to happen at least 2-3 times a week (just template updates)

* Pushing *would* be restricted to people who are prepared to fix errors. Probably only 1 (maybe 2) core developers would be able to push.

* If an error occurs, it's highly likely that Ops would not be able to fix it, since they don't know the codebase or the history of what was pushed. If you really want people around who can fix the problem, we need to schedule a time when a core developer can be there for the push too.

Like Fred said, I respect the decision you guys come up with, and if you decide to not allow this then we'll need to push several times a week.
I want to revisit this. Let's set up a meeting for sometime this month to discuss requirements and commitments. I believe this is something we should work towards, and we're not going to overcome the hurdles if we don't at least start down the track.

The main hurdle to just making it work at all is getting chief set up. However, I believe this will be comparatively minor next to the job you (webdev) will have in order to make this really feasible for most/all use cases. IIRC the SUMO team did a brownbag a while back about how they implemented this... we should re-watch that and get an update from them on anything they've had to change since then.

In particular, I believe they invested a considerable amount of time/effort into automated testing in stage and prod, and in metrics/statistics gathering in prod... so they know before they push that it won't break, they know after the push that it didn't break, and they know all the time that it's not broken. :)
Assignee: server-ops → nmaul
Status: RESOLVED → REOPENED
Resolution: WONTFIX → ---
Thanks Jake, I appreciate the effort. Happy to help! CCing cmore, whose team is in charge of care and feeding of mozilla.org.
Blocks: bedrock-cd
This is up and running (finally), although until bug 775689 is resolved there are no pushbots for IRC yet. But the main app is done.

Note a couple of things:

1) This doesn't push the PHP side of the app. That's still done via cron, automatically.

2) Stage is actually currently auto-deployed. We probably want to stop doing that.

3) You need a password (and VPN access) to be able to use these. Let me know who should get the password and I'll send it to them.


http://bedrockadm.private.phx1.mozilla.com/chief/bedrock.prod
https://www.mozilla.org/media/revision.txt
http://bedrockadm.private.phx1.mozilla.com/chief/bedrock.prod/logs/


http://bedrockadm.private.phx1.mozilla.com/chief/bedrock.stage
https://www.allizom.org/media/revision.txt
http://bedrockadm.private.phx1.mozilla.com/chief/bedrock.stage/logs/
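After pressing the button, one way to confirm what actually went live is to compare the commit in revision.txt with the commit you asked chief to push. The sketch below assumes (this bug does not say so explicitly) that revision.txt holds the deployed git commit SHA as plain text; the helper names `fetch_revision` and `revision_matches` are hypothetical, and only the Python standard library is used.

```python
import urllib.request

# Assumption: each revision.txt contains the deployed commit SHA as plain
# text, possibly followed by a trailing newline.
PROD_REVISION_URL = "https://www.mozilla.org/media/revision.txt"
STAGE_REVISION_URL = "https://www.allizom.org/media/revision.txt"


def fetch_revision(url: str, timeout: float = 10.0) -> str:
    """Download a revision.txt file and return its contents, whitespace stripped."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return resp.read().decode("ascii", errors="replace").strip()


def revision_matches(deployed: str, expected: str) -> bool:
    """Return True if the deployed SHA matches the expected one.

    An abbreviated expected SHA (at least 7 hex characters) is accepted
    and compared by prefix; the comparison is case-insensitive.
    """
    deployed = deployed.strip().lower()
    expected = expected.strip().lower()
    if len(expected) < 7:
        raise ValueError("expected SHA is too short to compare safely")
    return deployed.startswith(expected)
```

For example, after a prod push you might run `revision_matches(fetch_revision(PROD_REVISION_URL), "<sha you pushed>")` from a machine that can reach the site. Keeping the network fetch separate from the comparison makes the comparison easy to test offline.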
Thanks. Can you send me the password please? (To anthony@moz)
Sent. Closing this out... let me know if anyone else should have it as well. I know you're the primary dev these days, but not sure if you have a backup or anything like that. :)
Status: REOPENED → RESOLVED
Closed: 12 years ago
Resolution: --- → FIXED
Reopening for passwords.

Can you send passwords to mkelly and pmac, hereby cced?

Let me know how we should handle other password requests in the future.
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
Passwords sent.
Status: REOPENED → RESOLVED
Closed: 12 years ago
Resolution: --- → FIXED
So, it looks like this doesn't work. When I look at the logs, the script only does a git fetch, not a git merge. So we have all the code, just not in the working directory.
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
This is fixed.

The issue is that chief currently isn't set up to do a "merge"... it does a fetch and then a checkout. This works fine if you use an explicit commit or tag, or origin/master... it doesn't work if you use just "master", because the checkout is already on the master branch. Checking out origin/master moves the working tree to the newly fetched commit, but checking out plain 'master' is a no-op.
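To make the failure mode concrete, here is a toy in-memory model of the fetch-then-checkout sequence described above. It is neither chief's code nor git itself: the `Repo` class and its methods are hypothetical, and it only tracks which commit each ref and the working tree point at. It shows why `checkout master` leaves the working tree on the old commit, while `checkout origin/master` (or an explicit commit) moves it to the fetched one.

```python
from dataclasses import dataclass, field


@dataclass
class Repo:
    """Toy model of a deploy checkout; refs map names to commit ids."""
    local_branches: dict = field(default_factory=dict)   # e.g. {"master": "c1"}
    remote_branches: dict = field(default_factory=dict)  # e.g. {"origin/master": "c1"}
    head: str = "master"      # current branch name, or a commit id if detached
    working_tree: str = ""    # commit currently checked out on disk

    def fetch(self, upstream: dict) -> None:
        """Update remote-tracking refs only; local branches are left alone."""
        for name, commit in upstream.items():
            self.remote_branches[f"origin/{name}"] = commit

    def checkout(self, ref: str) -> None:
        """Point the working tree at `ref`, loosely mimicking `git checkout <ref>`."""
        if ref in self.local_branches:
            # A local branch resolves to the *local* ref, which fetch never
            # moved, so checking out "master" keeps the old commit.
            self.head = ref
            self.working_tree = self.local_branches[ref]
        elif ref in self.remote_branches:
            # A remote-tracking ref detaches HEAD at the fetched commit.
            self.head = self.remote_branches[ref]
            self.working_tree = self.head
        else:
            # Anything else is treated as an explicit commit id.
            self.head = ref
            self.working_tree = ref
```

In real git, one option on the deploy host is to push `origin/master` or a specific SHA, as noted above; another is to follow the fetch with `git merge --ff-only origin/master` so the local branch actually advances.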
Status: REOPENED → RESOLVED
Closed: 12 years ago
Resolution: --- → FIXED
Component: Server Operations: Web Operations → WebOps: Other
Product: mozilla.org → Infrastructure & Operations
Product: Infrastructure & Operations → Infrastructure & Operations Graveyard