Closed
Bug 1429563
Opened 8 years ago
Closed 8 years ago
[ops infra socorro] figure out crontabber deploy steps
Categories
(Socorro :: Infra, task, P1)
Tracking
(Not tracked)
RESOLVED
FIXED
People
(Reporter: willkg, Assigned: willkg)
References
Details
Attachments
(1 file)
The -stage-new infrastructure is deploying a new crontabber node every time it does a deploy. However, we've had a few instances where we end up with multiple crontabber nodes in odd states.
This bug covers going through the deploy steps for the crontabber node and making sure they're correct with respect to crontabber's unique runtime requirements.
Assignee
Comment 1•8 years ago
Making this a P1 to do soon.
The subtitle of this bug is "Will needs to feel comfortable with crontabber in the new infra", so I think this is something I'm going to have to go through, think about, and verify. Given that, I'm going to grab it for now.
Assignee: nobody → willkg
Status: NEW → ASSIGNED
Priority: -- → P1
Comment 2•8 years ago
As a temporary measure against failed stacks, I've enabled CloudFormation rollback: if a stack fails, it will be deleted.
When a replacement stack succeeds, it should no longer have problems scaling down previous stacks. We were having issues updating failed stacks, but now failed stacks are removed automatically.
We'll see how this goes.
Comment 3•8 years ago
So far, my method of only allowing one crontabber has worked well: we haven't had any more multiple-crontabber issues.
The outstanding issue is what happens when a running crontabber gets shut down at random, and how we can avoid that causing further problems.
Assignee
Comment 4•8 years ago
If I had my druthers, crontabber nodes would act like processor nodes in that they could be killed/interrupted/whatever and whenever the next one starts up, it can figure things out and everything is fine.
I need to go through the crontabber bookkeeping code as well as all the remaining crontabber jobs, figure out how each handles interruption/death, and see if I can fix the ones that don't react well.
I've got that on my list of things to do. Maybe next week.
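The property I'm after can be sketched like this (a hypothetical illustration, not Socorro's actual bookkeeping code): a job derives its window of work from the last *successful* run, so a killed run leaves the marker untouched and the next node simply redoes the same window.

```python
from datetime import datetime, timedelta, timezone

def next_window(last_success, now, period=timedelta(hours=1)):
    """Compute the window of work for this run.

    Because the window is derived from the last successful run, a job
    that is killed mid-run leaves its marker untouched, and the next
    crontabber node reprocesses the same window instead of losing it.
    """
    if last_success is None:
        return now - period, now   # first ever run: one period of work
    return last_success, now       # redo everything since last success

now = datetime(2018, 3, 1, 12, 0, tzinfo=timezone.utc)
start, end = next_window(None, now)
```

The other half of the property is that redoing a window must be harmless, which is why the per-job audit below focuses on transactions and idempotency.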
Assignee
Comment 5•8 years ago
I went through everything. I did the following:
1. wrote up bugs to nix some unneeded cron apps
2. for daily/weekly cron apps, I changed the job spec to run them in the wee hours of the morning, when they won't affect deploys
3. for the remaining apps that run hourly, I'm pretty sure they can all be killed/interrupted and will start up the next time just fine
BugzillaCronApp -- runs in Python-land and does all its db changes in transactions
FTPScraperCronApp -- runs in Python-land and does its work with a stored procedure
FeaturedVersionsAutomaticCronApp -- runs in Python-land; doesn't use a transaction (but probably should); only affects the UI
ReportsCleanCronApp -- runs as a stored procedure
I want to find out if a stored procedure will complete even if the db connection is lost.
I have a PR for the cleanup work. I'll finish this up tomorrow.
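The transaction point above is what makes a job safe to kill. A small illustration with SQLite (hypothetical names, not the real BugzillaCronApp code): if the process dies or raises mid-run, the transaction rolls back and no partial rows survive.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE bug_associations (bug_id INTEGER, signature TEXT)")
conn.commit()

def sync_bugs(conn, rows):
    # "with conn" opens a transaction: it commits only if the block
    # finishes, and rolls back if the process dies or raises midway.
    with conn:
        for bug_id, signature in rows:
            conn.execute(
                "INSERT INTO bug_associations VALUES (?, ?)", (bug_id, signature)
            )
            if signature is None:        # simulate being killed mid-run
                raise RuntimeError("interrupted")

try:
    sync_bugs(conn, [(1429563, "OOM | small"), (1440745, None)])
except RuntimeError:
    pass

# The rollback means no half-finished state survives the interruption.
count = conn.execute("SELECT COUNT(*) FROM bug_associations").fetchone()[0]
```

The next run starts from a clean slate and can redo the whole batch.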
Assignee
Comment 6•8 years ago
Comment 7•8 years ago
This is great; thanks willkg!
Assignee
Comment 8•8 years ago
I have notes for the audit here:
https://docs.google.com/document/d/1GGyimKQLZdW0E8TGUlmA4uhInmLpYnV0fUJdNOu1FQ4/edit#
I've also tried to improve docstrings and other things in the code related to those notes. Notes in code are easier to discover later on.
Comment 9•8 years ago
Commits pushed to master at https://github.com/mozilla-services/socorro
https://github.com/mozilla-services/socorro/commit/0c1facdbfedd741a0788bf70d453b57080c0ab79
bug 1429563 - Change daily/weekly jobs to run early in the morning
This changes jobs that run daily or weekly to all run early in the morning
when they're less likely to be running during a deploy.
https://github.com/mozilla-services/socorro/commit/282aa8d192345652240cbad1ad6e596ebb8690e6
Merge pull request #4352 from willkg/1429563-crontabber
bug 1429563, 1440474, 1440471, 1440496 - clean up crontabber jobs
Assignee
Comment 10•8 years ago
I messed up the bug number in the commit summary, so the comment ended up in bug 1440745 comment 12.
Reprinting it here because it's relevant.
"""
Commit pushed to master at https://github.com/mozilla-services/socorro
https://github.com/mozilla-services/socorro/commit/cf2b44d9ca2ee2559fbccf8bab03bc241f602e60
bug 1440745 - change age lockout to 2 hours
The default lockout is 12 hours. That means if we kill a crontabber node
that's in the middle of doing something, then that job is locked for 12
hours before it can be done.
With the new infrastructure, it's much more likely we're going to kill a
crontabber node in the middle of doing something. We really don't want
anything to not run for 12 hours.
This changes that to 2 hours. That should be plenty long enough for any of
the jobs to run but a much more reasonable amount of time to wait for a
lock to time out.
"""
Assignee
Comment 11•8 years ago
I think we're all set here. I'm pretty confident that deploys should go fine without any special treatment. It's definitely worth trying that out now and seeing how things go.
Having said that, I'm not sure how "not going well" would manifest itself or what we could do to have it alert us when things went wrong. I think for now, we can rely on existing monitoring and alerting and if the time comes that we discover that's not sufficient, we'll have a better idea of why and what to do about it.
In the meantime, I'm marking this as FIXED.
Status: ASSIGNED → RESOLVED
Closed: 8 years ago
Resolution: --- → FIXED
Comment 12•7 years ago
[snip]
> ReportsCleanCronApp -- runs as a stored procedure
>
> I want to find out if a stored procedure will complete even if the db
> connection is lost.
>
ReportsCleanCronApp is backfill-based. So if crontabber on nodeX starts this stored procedure with argument DATE1 and you then kill that nodeX instance, the job will be "locked" for 2 hours [0], as if it had been running that whole time.
Then you start a new crontabber on nodeY, and after about 2 hours it's going to run the job again. But because it didn't finish last time, nodeY is going to send in DATE1 again.
The question is, how horrible is that?
I see some `INSERT INTO ...` that uses uuid [1] in that code. Does that mean it might throw an error that the inserted UUID already exists in reports_user_info, for example?
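One common way to make a backfill like that safe to re-run is to skip rows that already exist. A sketch using SQLite's INSERT OR IGNORE (Postgres would use ON CONFLICT DO NOTHING; I don't know whether the actual stored procedure does this):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE reports_user_info (uuid TEXT PRIMARY KEY, user_comments TEXT)"
)
conn.commit()

def backfill(conn, date_rows):
    # Re-running the same date skips rows that already exist instead
    # of raising a duplicate-key error on the uuid primary key.
    with conn:
        conn.executemany(
            "INSERT OR IGNORE INTO reports_user_info VALUES (?, ?)", date_rows
        )

rows = [("uuid-1", "it crashed"), ("uuid-2", "it crashed again")]
backfill(conn, rows)   # first run for DATE1
backfill(conn, rows)   # killed-and-retried run for the same DATE1
n = conn.execute("SELECT COUNT(*) FROM reports_user_info").fetchone()[0]
```

If the procedure does this, re-sending DATE1 is harmless; if it doesn't, the retry would fail loudly, which is at least visible.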
I'm lazy and just trying to help by dropping some vague comments and dim brain dumps, but my suggestion would be to avoid that [stored procedure repetition] problem by allowing old crontabber nodes time to finish instead of killing them just because new nodes are up and working.
Now, how you do that is up to Miles I guess. Also, my suggestions are crap because I don't understand EC2 automation like Miles does. You could do one of these when it's time to terminate an old admin node:
1) wait for the running crontabber to finish, then shut down: while pgrep -f crontabber > /dev/null; do sleep 10; done; shutdown -h now
2) stop scheduling new crontabber runs, give the current one time to finish, then shut down: echo "" > /etc/crontab/socorro && sleep 1000 && shutdown -h now
Sorry if this is causing you more thinking on a closed issue. Just trying to help.
[0] https://github.com/mozilla-services/socorro/commit/cf2b44d9ca2ee2559fbccf8bab03bc241f602e60
[1] https://github.com/mozilla-services/socorro/blob/991171dcf54fbe40b247ede4a72e4b77b7c64a29/socorro/external/postgresql/raw_sql/procs/001_update_reports_clean.sql#L377-L399
Assignee
Comment 13•7 years ago
The context for this is as follows:
1. assuming we stick to "normal deploy windows", the only jobs that could be running during a deploy are:
1.1. UploadCrashReportJSONSchemaCronApp -- going away soon
1.2. ReportsCleanCronApp -- going away soon
1.3. FTPScraperCronApp -- going away soon
1.4. FeaturedVersionsAutomaticCronApp -- runs really fast and is pretty straightforward
1.5. BugzillaCronApp -- runs pretty fast and can be interrupted
2. we deploy on average once a week
I did my audit with the intention of lowering the risk as much as possible such that we could kill off crontabber nodes during deploys because this is by far the simplest deploy mechanism.
Pretty sure we've been doing deploys in -new-stage for a month and haven't hit issues.
I'm pretty confident that if we have issues with crontabber jobs during or after a deploy, we'll see things show up in sentry and can unhork them at that point by hand. I don't think it's the case that any of these jobs are destructive such that if they go awry and no one notices, we're hosed.
Given that, while I agree with your concerns, I don't think it makes sense to spend the time and effort to implement, test, and verify a poison-pill-style deploy, or to deal with it in some other way--options which have their own set of complexities.
If I'm horribly wrong and we hit problems, we should know really quick and we can revisit this.