Closed Bug 883403 Opened 11 years ago Closed 11 years ago

Formalize and improve release process

Categories

(Firefox Health Report Graveyard :: Web: Health Report, defect)


Tracking

(Not tracked)

RESOLVED INCOMPLETE

People

(Reporter: rnewman, Unassigned)

Details

A push occurred this morning that broke desktop FHR. I'm not sure whether it was a push that went bad, or a push of bad code. Either way, it revealed some flaws in the current process:

* Code that affects production is deployed from master.
* Code can be merged to master without verification.
* There's no handbook for verifying a push without involving QA resources.

So my suggestions:


* Define a release branch. Merge only verified releases to it. Tag those verified releases. Push only tags to prod.
* Verify code on stage (maybe from a 'stage' branch), then merge it to release.
* Describe the bare minimum smoke tests for FHR. For example, "config.json should never 404". Write a trivial smoketest script and run it after every deployment.
* Write up a small handbook for the rest, so that a push can be manually verified.
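As a sketch of the smoke-test idea above: a trivial script like the following would be enough to catch "config.json 404s" after a push. (The hostnames and paths here are placeholders, not the real FHR endpoints.)

```python
"""Minimal post-deploy smoke test for FHR content -- a sketch.

The endpoint list is hypothetical; fill in the real stage/prod URLs.
The fetcher is injectable so the check logic can be exercised offline.
"""
import urllib.request

# Endpoints that must answer 200 after every deployment.
ENDPOINTS = [
    "https://fhr.example.mozilla.org/static/config.json",
    "https://fhr.example.mozilla.org/",
]

def check(url, fetch=urllib.request.urlopen):
    """Return (url, True) if the endpoint answers 200, (url, False) otherwise."""
    try:
        return url, fetch(url).getcode() == 200
    except Exception:
        # Connection errors and HTTP errors both count as failures.
        return url, False

def smoke_test(urls, fetch=urllib.request.urlopen):
    """Run every check; return the list of failing URLs (empty means pass)."""
    return [url for url, ok in (check(u, fetch) for u in urls) if not ok]
```

Run `smoke_test(ENDPOINTS)` against stage after a merge and against prod after every push; a non-empty result means roll back.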


The idea behind this is that:

* $some_branch should always be safe to push.
* You always know what code was last pushed.
* If everyone is on vacation and something goes wrong, *anyone with a pulse* can push out the last known good release and verify that it works.
* QA is not involved in pushes unless you want to be super duper thorough.

If we're not at that point, we're not ready to handle production traffic for release.
And I should point out: there's no blamestorming here; just taking advantage of an opportunity to improve.
Some notes on the current process:
We deploy from master, as does, for example, AMO. Master is intended to be stable.
Code should not be merged to master without a review. This rule has been notably broken in the last couple of days.
Code should be tested on the dev server before pushing to prod, which was notably absent in this case. 
Code should also be verified on prod after a push, again notably absent in this case.
(There is a "dev" branch for unstable code, which we haven't used much.)
We have some tests. They should be improved.
We observe no-change Fridays, a practice that was not followed in this case.

I do not plan to move to a new process as a knee-jerk reaction to mistakes that came from not following our existing process. I suggest we hold a postmortem next week. To be honest, yes, this bug feels like blamestorming to me. Let's talk in person and iterate on our process.
(In reply to Laura Thomson :laura from comment #2)

> Code should not be merged to master without a review. This rule has been
> notably broken in the last couple of days.

Much as I love code review, review is not enough. The bits that get deployed should also be tested live in a staging environment before they reach a point that they can be deployed by accident.

I don't know if today's broken code was reviewed or not: you seem to suggest that it wasn't, but I don't think it matters -- I've seen plenty of code get r+ and break. (And I've been both the reviewer and the author!)

If you count "review" as "code review and deployed to stage and tested", then we're saying the same thing, and that's all good. Nothing merges to master without r= and qa+, and then we can call 'master' 'release'. It won't catch everything, but it'll catch more than review alone.

Tagging releases is a different mechanism to achieve the same deploy-only-verified-code goal; only QA can tag a release. (And of course, tagging has the nice side effect of providing reproducible deployment targets for rollouts and rollbacks.)
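Once releases are date-stamped tags, picking a rollback target is mechanical; a sketch (the tag naming scheme is hypothetical):

```python
"""Sketch: choose a rollback target from verified release tags.

Assumes tags sort chronologically by name, e.g. release-2013-06-14;
the naming convention is an assumption, not an existing FHR scheme.
"""

def rollback_target(tags, current):
    """Return the tag to roll back to from `current`, or None if
    `current` is unknown or there is no earlier verified release."""
    ordered = sorted(tags)
    if current not in ordered:
        return None
    i = ordered.index(current)
    return ordered[i - 1] if i > 0 else None
```

This is exactly the "anyone with a pulse" property: the last known good release is always the previous tag, no judgment required.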

I'm sure you know all this; it was common knowledge when I was doing deploys at Tellme in 2006. I'm saying that now is the time to get serious about it for FHR content.


> Code should be tested on the dev server before pushing to prod, which was
> notably absent in this case. 

My primary assertion is that this should happen *before* merging to the release branch (or before tagging for deployment), which means that no code can ever be deployed before at least having been tested 'live'. In today's case, broken code that wasn't tested (or wasn't tested in a release environment) deployed alongside code that didn't even touch the same directory tree, simply because it got merged earlier in the week and nobody had deployed it yet.

It's a communication problem: if the left hand assumes that nobody will push master without QA (and thus lands untested code on the release branch), and the right hand assumes that nobody will merge to the production deployment source without QA, and "master is stable", you get untested code on the CDN. And of course if we only do full QA *after* pushing to prod, or only QA the change we think we've pushed, we have a guaranteed brief outage if there's a problem.

My goal in filing this bug is to put layers in place to prevent untested code from reaching production, given that human beings make mistakes. I think "make sure you test master before you deploy it" is prone to failure in a way that "master only contains QAed merges" is not.


> Code should also be verified on prod after a push, again notably absent in
> this case.

I think that's a great idea, but (perhaps obviously?) that's not the first time it should receive requests on a *.mozilla.* domain. The exact bits we push to prod should first have been pushed to stage, and once we're handling release traffic the exact bits we push to prod should first have been pushed, QAed, load tested, been subjected to a rollback exercise, and tagged. Otherwise you're rolling back under load.

So to reiterate a delta against Comment 2:

* Testing on a staging server should be a prerequisite for merging to a release branch, as well as review. If QA does that testing, then QA should be in charge of approving that merge. (And use a hook to verify it!)
* Code pushed to prod should be tagged, and that same revision should be code that was tested in stage.
* Perhaps it's time to add precommit hooks for r=, if people are merging without review? We do that on all main Mozilla trees, and it largely works.
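A minimal sketch of such an r= hook follows. The exemption rules (merges, backouts) are my assumption of what we'd want; Mozilla's production hooks are more elaborate.

```python
"""Sketch of a commit-message hook enforcing r= annotations.

The r= convention matches Mozilla commit-message style; how the hook
is wired up (commit-msg vs. server-side pre-receive) is left open.
"""
import re

# A message passes if it names a reviewer, e.g. "r=laura".
REVIEW_RE = re.compile(r"\br=\w+")
# Merges and backouts are exempt -- an assumed policy, adjust to taste.
EXEMPT_RE = re.compile(r"^(Merge|Backed out)", re.IGNORECASE)

def has_review(message):
    """True if the commit message carries an r= annotation or is exempt."""
    return bool(REVIEW_RE.search(message) or EXEMPT_RE.match(message))

def hook(message):
    """Hook body: return an error string to reject, or None to accept.
    (In a real git commit-msg hook, the message is read from the file
    named in argv[1]; that plumbing is omitted here.)"""
    if not has_review(message):
        return "rejected: no r= annotation in commit message"
    return None
```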


Happy to talk through this on Monday, if you think there's something about the current process that I'm not understanding; I'm very much open to the possibility that I'm simply not seeing some guards you have in place, and that today's issues were down to some rare collisions of policy ignorance. 

(Or, for that matter, you might be operating under different tradeoffs. Much of my expectations for release process were shaped by the telephony space, which I know is a lot more process-oriented than webby things -- everything we did for deployment was scripted for 24h NOC staff to perform during, e.g., DC failovers, so there wasn't much space for human judgment!)
(In reply to Richard Newman [:rnewman] from comment #3)
> (In reply to Laura Thomson :laura from comment #2)
> 
> > Code should not be merged to master without a review. This rule has been
> > notably broken in the last couple of days.
> 
> Much as I love code review, review is not enough. The bits that get deployed
> should also be tested live in a staging environment before they reach a
> point that they can be deployed by accident.
> 
> I don't know if today's broken code was reviewed or not: you seem to suggest
> that it wasn't, but I don't think it matters -- I've seen plenty of code get
> r+ and break. (And I've been both the reviewer and the author!)

I did note a couple of commits go by without review this week.

> 
> If you count "review" as "code review and deployed to stage and tested",
> then we're saying the same thing, and that's all good. Nothing merges to
> master without r= and qa+, and then we can call 'master' 'release'. It won't
> catch everything, but it'll catch more than review alone.
> 
> Tagging releases is a different mechanism to achieve the same
> deploy-only-verified-code goal; only QA can tag a release. (And of course,
> tagging has the nice side effect of providing reproducible deployment
> targets for rollouts and rollbacks.)
> 
> I'm sure you know all this; it was common knowledge when I was doing deploys
> at Tellme in 2006. I'm saying that now is the time to get serious about it
> for FHR content.
> 
> 

Let me give you some background. The process until recently for, say, crash-stats looked like this:
- master is unstable and automatically staged on dev
- stage/release is stable and automatically staged on stage
- production runs stage/release and pushes are done by hand.

In order to cut a release in this environment we test on dev, certify on stage, and verify on prod. Tagging and signing is done by devs, usually *after* a push. Our QA engineers generally haven't tagged because they don't have commit access.

We are in the process of changing this, for many reasons. One is that cherry-picking changes from master->release is absolutely fraught with errors, and has caused us many bad pushes in the last 12 months.

Another is that we are moving to a more CD/fail-forward/MTTR>MTBF environment. The goal is push-per-change, with different levels of QA depending on the risk. This is to increase agility, and is a common approach to webdev in 2013; it is used on multiple large projects at Mozilla as well as at many other companies.

Note that in every case *somebody*, QA or dev, must test on dev before a push and prod after.

For long running branches and big changes, we plan to use a different staging environment.

> > Code should be tested on the dev server before pushing to prod, which was
> > notably absent in this case. 
> 
> My primary assertion is that this should happen *before* merging to the
> release branch (or before tagging for deployment), which means that no code
> can ever be deployed before at least having been tested 'live'. 

Testing on a different branch is insufficient, in my experience.


>In today's
> case, broken code that wasn't tested (or wasn't tested in a release
> environment) deployed alongside code that didn't even touch the same
> directory tree, simply because it got merged earlier in the week and nobody
> had deployed it yet.
> 
> It's a communication problem: if the left hand assumes that nobody will push
> master without QA (and thus lands untested code on the release branch), and
> the right hand assumes that nobody will merge to the production deployment
> source without QA, and "master is stable", you get untested code on the CDN.
> And of course if we only do full QA *after* pushing to prod, or only QA the
> change we think we've pushed, we have a guaranteed brief outage if there's a
> problem.
> 

Part of the communication problem here is that nobody took you through the process - I take responsibility for that, because I was not expecting somebody outside the webdev/ops/qa team to ask for a release. 

We always run full QA on dev before a push, too.

> My goal in filing this bug is to put layers in place to prevent untested
> code from reaching production, given that human beings make mistakes. I
> think "make sure you test master before you deploy it" is prone to failure
> in a way that "master only contains QAed merges" is not.

I do not think filing a bug is the right way to tackle this, as I have said. 

We all make mistakes, but at the end of the day, everyone should at the bare minimum check the release in dev and verify the release in prod, especially if they choose to push without QA. 

Having said that, I am happy to iterate on our process. Some things I would like to add are:
1. more test automation, including setting up jenkins+leeroy
2. only pushing builds that have passed test automation
These two steps should give us the safety margin required. 
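The second step can be as simple as a fail-closed gate in the deploy script; a sketch, assuming a Jenkins-style JSON status payload (the field names are hypothetical, not the real leeroy API):

```python
"""Sketch of a deploy gate: refuse to push unless CI marked the exact
revision green. The JSON shape ("result", "revision") is an assumption
about what jenkins+leeroy would report, not a documented API.
"""
import json

def may_deploy(build, revision):
    """Deployable only if CI reported SUCCESS for exactly the revision
    we intend to push -- not merely the latest green build."""
    return build.get("result") == "SUCCESS" and build.get("revision") == revision

def gate(raw_json, revision):
    """Parse a CI status payload and decide whether the push may proceed."""
    try:
        build = json.loads(raw_json)
    except ValueError:
        return False  # unreadable status: fail closed, don't deploy
    return may_deploy(build, revision)
```

Pinning the decision to a specific revision also closes today's gap, where code merged earlier in the week rode along with an unrelated push.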


> 
> 
> > Code should also be verified on prod after a push, again notably absent in
> > this case.
> 
> I think that's a great idea, but (perhaps obviously?) that's not the first
> time it should receive requests on a *.mozilla.* domain. The exact bits we
> push to prod should first have been pushed to stage, and once we're handling
> release traffic the exact bits we push to prod should first have been
> pushed, QAed, load tested, been subjected to a rollback exercise, and
> tagged. Otherwise you're rolling back under load.
>

I agree with all of this, except that "rolling back under load" is not a problem given the CDN.
 
> So to reiterate a delta against Comment 2:
> 
> * Testing on a staging server should be a prerequisite for merging to a
> release branch, as well as review. If QA does that testing, then QA should
> be in charge of approving that merge. (And use a hook to verify it!)
> * Code pushed to prod should be tagged, and that same revision should be
> code that was tested in stage.

I have responded to this above.

> * Perhaps it's time to add precommit hooks for r=, if people are merging
> without review? We do that on all main Mozilla trees, and it largely works.
I'm good with this.

> 
> 
> Happy to talk through this on Monday, if you think there's something about
> the current process that I'm not understanding; I'm very much open to the
> possibility that I'm simply not seeing some guards you have in place, and
> that today's issues were down to some rare collisions of policy ignorance. 
> 
> (Or, for that matter, you might be operating under different tradeoffs. Much
> of my expectations for release process were shaped by the telephony space,
> which I know is a lot more process-oriented than webby things -- everything
> we did for deployment was scripted for 24h NOC staff to perform during,
> e.g., DC failovers, so there wasn't much space for human judgment!)

I will schedule a time to discuss this. I do believe that the issues resulted from unfamiliarity with the process. I also believe a bug is entirely the wrong place to continue this discussion, and I do not intend to discuss it further here. Let's take it to a better medium.
Status: NEW → RESOLVED
Closed: 11 years ago
Resolution: --- → INCOMPLETE
Product: Firefox Health Report → Firefox Health Report Graveyard