Closed Bug 1375444 Opened 8 years ago Closed 8 years ago

Stage statics deploy failed

Categories

(Cloud Services :: Operations: AMO, task)

task
Not set
normal

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: muffinresearch, Assigned: wezhou)

Details

The tagged addons-frontend deployment failed to fully deploy to stage on Tuesday, presumably due to the static deployment failing to complete or getting stuck. Looking on Jenkins it wasn't clear if any action could take place, the pipeline was shown as paused. The fix on our end was to deploy a new tag pointing at the same revision, as that was the only apparent way to instigate a stage deployment.

I have a few questions off the back of this:

* If statics don't deploy shouldn't the deployment fail?
* How can we know when this happens?
* Is there a way to restart a stage deployment (or part of a stage deploy) from jenkins?
Assignee: nobody → wezhou
Can we make the whole deployment fail if the static assets fail to deploy? Ideally, the entire deployment should be rolled back (if that's possible). Not having static assets introduces subtle breakage that can sometimes be hard to track down, so troubleshooting it costs a lot of engineering effort.
Jason and I looked at this yesterday, and we think the issue was caused by the instance responsible for uploading the static files to S3 failing to build the assets. Unfortunately, by the time we looked at it, that instance had been terminated by AWS due to a health check failure, so we don't have logs to prove the theory. Our current plan is to add a check on the number of files uploaded, which should always be greater than one (/src/code/dist/.gitkeep is always uploaded).

> Looking on Jenkins it wasn't clear if any action could take place, the pipeline was shown as paused.

The pause is expected. Once the stage deploy finishes, the pipeline pauses so that on Thursday we can manually confirm and continue the push to -prod using the same pipeline.

> If statics don't deploy shouldn't the deployment fail?

Yes.

> How can we know when this happens?

Either (a) the "aws s3 sync" command fails, or (b) "aws s3 sync" succeeds but the number of files uploaded is less than or equal to one. We already check for (a) today. We'll add a second check for (b).

> Is there a way to restart a stage deployment (or part of a stage deploy) from jenkins?

Yes, Jenkins allows us to do that. There is a "resume" button in the upper left corner of each stage of the pipeline. You'll see it when you hover your mouse over the stage. That said, you probably didn't see it because by default -dev has read-only access. For yesterday's issue, restarting part of the stage deploy probably wouldn't have fixed it, because one of the instances failed to build the assets in the first place. I'd have restarted the whole pipeline so that it spawns two brand new instances.

> Can we make the whole deployment fail if the static assets fail to deploy? Ideally, the entire deployment should be rolled back (if that's possible).

Yes, that's what the pipeline is supposed to do. The real cause of yesterday's problem is that the pipeline failed to detect the upload failure. By adding the second check mentioned above, we hope it will detect such failures next time. There would have been no need to roll back if the pipeline had detected the upload failure and failed properly yesterday, because at that point the new stack would not have been promoted (meaning the new instances would not have replaced the old instances in the ELB).

P.S.: The logs from yesterday's first push show that it uploaded only one file.

    "stdout_lines": [
        "upload: static.temp/dist/.gitkeep to s3://net-mozaws-stage-amo-amo-static-amostage1/.gitkeep",
        "Completed 1 file(s) with ~0 file(s) remaining (calculating...)",
        " "
    ],
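For illustration, the two failure conditions from this comment, (a) the sync command failing and (b) the sync succeeding with at most one file uploaded, could be sketched roughly as follows. This is not the pipeline's actual code: the function names, the `min_files` parameter, and the approach of counting lines that start with "upload:" are assumptions based only on the log excerpt in this comment.

```python
def count_uploads(stdout_lines):
    """Count the lines of sync output that report an uploaded file."""
    return sum(1 for line in stdout_lines if line.strip().startswith("upload:"))


def statics_deployed(return_code, stdout_lines, min_files=2):
    """Return True only if the sync succeeded and uploaded enough files.

    return_code: exit status of the sync command (hypothetical input).
    min_files:   assumed threshold; .gitkeep alone must not count as success.
    """
    # Check (a): the sync command itself failed.
    if return_code != 0:
        return False
    # Check (b): the sync "succeeded" but uploaded at most one file.
    return count_uploads(stdout_lines) >= min_files


# The log lines quoted in this comment: only .gitkeep was uploaded.
first_push = [
    "upload: static.temp/dist/.gitkeep to s3://net-mozaws-stage-amo-amo-static-amostage1/.gitkeep",
    "Completed 1 file(s) with ~0 file(s) remaining (calculating...)",
    " ",
]
print(statics_deployed(0, first_push))
```

With the quoted log as input, the check reports a failed deploy even though the sync command exited cleanly, which is the case that went undetected.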
Submitted a PR: https://github.com/mozilla-services/cloudops-deployment/pull/861

Once merged, the push pipeline will do a dry run of "aws s3 sync" to check that there is more than just .gitkeep to upload. Having only .gitkeep to upload means the asset build failed in the previous step, in which case the pipeline fails early.
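As a rough sketch of what such a dry-run pre-check might look like (this is not the code from the PR): the snippet below assumes that `aws s3 sync --dryrun` prints one line per pending file in the form "(dryrun) upload: <local path> to s3://...". The function names and the injectable `run` parameter are hypothetical and exist only to keep the example self-contained.

```python
import subprocess


def pending_uploads(dryrun_output):
    """Return the local paths that a dry run says it would upload."""
    prefix = "(dryrun) upload:"
    paths = []
    for line in dryrun_output.splitlines():
        line = line.strip()
        if line.startswith(prefix):
            # "(dryrun) upload: <local> to s3://..." -> "<local>"
            rest = line[len(prefix):]
            paths.append(rest.split(" to s3://", 1)[0].strip())
    return paths


def assets_were_built(source_dir, bucket_url, run=subprocess.run):
    """Return False when the dry run would upload nothing except .gitkeep."""
    result = run(
        ["aws", "s3", "sync", source_dir, bucket_url, "--dryrun"],
        capture_output=True,
        text=True,
        check=True,  # a failing sync command raises CalledProcessError
    )
    real_files = [
        path for path in pending_uploads(result.stdout)
        if not path.endswith(".gitkeep")
    ]
    return len(real_files) > 0
```

A pipeline step could call `assets_were_built(...)` before the real sync and abort when it returns False, so that a failed asset build stops the deploy before anything is promoted.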
Status: NEW → RESOLVED
Closed: 8 years ago
Resolution: --- → FIXED
:wezhou thanks for the fixes!