Closed
Bug 1429563
Opened 8 years ago
Closed 8 years ago
[ops infra socorro] figure out crontabber deploy steps
Categories
(Socorro :: Infra, task, P1)
Tracking
(Not tracked)
RESOLVED
FIXED
People
(Reporter: willkg, Assigned: willkg)
References
Details
Attachments
(1 file)
The -stage-new infrastructure is deploying a new crontabber node every time it does a deploy. However, we've had a few instances where we end up with multiple crontabber nodes in odd states.
This bug covers going through the deploy steps for the crontabber node and making sure they're correct with respect to crontabber's unique runtime requirements.
Assignee
Comment 1•8 years ago
Making this a P1 to do soon.
The subtitle of this bug is "Will needs to feel comfortable with crontabber in the new infra", so I think this is something I'm going to have to go through, think about, and verify. Given that, I'm going to grab it for now.
Assignee: nobody → willkg
Status: NEW → ASSIGNED
Priority: -- → P1
Comment 2•8 years ago
As a temporary measure against failed stacks, I've enabled CloudFormation rollback: if a stack fails, it will be deleted.
When a replacement stack succeeds, it should no longer have problems scaling down previous stacks. We were having issues updating failed stacks, but now failed stacks are removed automatically.
We'll see how this goes.
Comment 3•8 years ago
So far, my method of only allowing one crontabber has worked well: we haven't had any more multiple-crontabber issues.
The outstanding issue is what happens when a running crontabber gets shut down at random, and how we can avoid that causing further problems.
Assignee
Comment 4•8 years ago
If I had my druthers, crontabber nodes would act like processor nodes in that they could be killed/interrupted/whatever and whenever the next one starts up, it can figure things out and everything is fine.
I need to go through the crontabber bookkeeping code as well as all the remaining crontabber jobs, figure out how each handles interruption/death, and see if I can fix the ones that don't react well.
I've got that on my list of things to do. Maybe next week.
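The property I'm after can be sketched like this (a hypothetical illustration, not Socorro's actual bookkeeping code): a job derives its window of work from the last *successful* run, so a killed run leaves the marker untouched and the next node simply redoes the same window.

```python
from datetime import datetime, timedelta, timezone

def next_window(last_success, now, period=timedelta(hours=1)):
    """Compute the window of work for this run.

    Because the window is derived from the last successful run, a job
    that is killed mid-run leaves its marker untouched, and the next
    crontabber node reprocesses the same window instead of losing it.
    """
    if last_success is None:
        return now - period, now   # first ever run: one period of work
    return last_success, now       # redo everything since last success

now = datetime(2018, 3, 1, 12, 0, tzinfo=timezone.utc)
start, end = next_window(None, now)
```

The other half of the property is that redoing a window must be harmless, which is why the per-job audit below focuses on transactions and idempotency.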
Assignee
Comment 5•8 years ago
I went through everything. I did the following:
1. wrote up bugs to nix some unneeded cron apps
2. for daily/weekly cron apps, I changed the job spec to run them in the wee hours of the morning, when they won't affect deploys
3. for the remaining apps that run hourly, I'm pretty sure they can all be killed/interrupted and will start up the next time just fine
BugzillaCronApp -- runs in Python-land and does all its db changes in transactions
FTPScraperCronApp -- runs in Python-land and does its work with a stored procedure
FeaturedVersionsAutomaticCronApp -- runs in Python-land; doesn't use a transaction (but probably should); only affects the UI
ReportsCleanCronApp -- runs as a stored procedure
I want to find out if a stored procedure will complete even if the db connection is lost.
I have a PR for the cleanup work. I'll finish this up tomorrow.
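The transaction point above is what makes a job safe to kill. A small illustration with SQLite (hypothetical names, not the real BugzillaCronApp code): if the process dies or raises mid-run, the transaction rolls back and no partial rows survive.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE bug_associations (bug_id INTEGER, signature TEXT)")
conn.commit()

def sync_bugs(conn, rows):
    # "with conn" opens a transaction: it commits only if the block
    # finishes, and rolls back if the process dies or raises midway.
    with conn:
        for bug_id, signature in rows:
            conn.execute(
                "INSERT INTO bug_associations VALUES (?, ?)", (bug_id, signature)
            )
            if signature is None:        # simulate being killed mid-run
                raise RuntimeError("interrupted")

try:
    sync_bugs(conn, [(1429563, "OOM | small"), (1440745, None)])
except RuntimeError:
    pass

# The rollback means no half-finished state survives the interruption.
count = conn.execute("SELECT COUNT(*) FROM bug_associations").fetchone()[0]
```

The next run starts from a clean slate and can redo the whole batch.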
Assignee
Comment 6•8 years ago
Comment 7•8 years ago
This is great; thanks willkg!
Assignee
Comment 8•8 years ago
I have notes for the audit here:
https://docs.google.com/document/d/1GGyimKQLZdW0E8TGUlmA4uhInmLpYnV0fUJdNOu1FQ4/edit#
I've also tried to improve docstrings and other things in the code related to those notes. Notes in code are easier to discover later on.
Comment 9•8 years ago
Commits pushed to master at https://github.com/mozilla-services/socorro
https://github.com/mozilla-services/socorro/commit/0c1facdbfedd741a0788bf70d453b57080c0ab79
bug 1429563 - Change daily/weekly jobs to run early in the morning
This changes jobs that run daily or weekly to all run early in the morning
when they're less likely to be running during a deploy.
https://github.com/mozilla-services/socorro/commit/282aa8d192345652240cbad1ad6e596ebb8690e6
Merge pull request #4352 from willkg/1429563-crontabber
bug 1429563, 1440474, 1440471, 1440496 - clean up crontabber jobs
Assignee
Comment 10•8 years ago
I messed up the bug number in the commit summary, so the comment ended up in bug 1440745 comment 12.
Reprinting it here because it's relevant.
"""
Commit pushed to master at https://github.com/mozilla-services/socorro
https://github.com/mozilla-services/socorro/commit/cf2b44d9ca2ee2559fbccf8bab03bc241f602e60
bug 1440745 - change age lockout to 2 hours
The default lockout is 12 hours. That means if we kill a crontabber node
that's in the middle of doing something, then that job is locked for 12
hours before it can be done.
With the new infrastructure, it's much more likely we're going to kill a
crontabber node in the middle of doing something. We really don't want
anything to not run for 12 hours.
This changes that to 2 hours. That should be plenty long enough for any of
the jobs to run but a much more reasonable amount of time to wait for a
lock to time out.
"""
Assignee
Comment 11•8 years ago
I think we're all set here. I'm pretty confident that deploys should go fine without any special treatment. It's definitely worth trying that out now and seeing how things go.
Having said that, I'm not sure how "not going well" would manifest itself or what we could do to have it alert us when things went wrong. I think for now, we can rely on existing monitoring and alerting and if the time comes that we discover that's not sufficient, we'll have a better idea of why and what to do about it.
In the meantime, I'm marking this as FIXED.
Status: ASSIGNED → RESOLVED
Closed: 8 years ago
Resolution: --- → FIXED
Comment 12•7 years ago
[snip]
> ReportsCleanCronApp -- runs as a stored procedure
>
> I want to find out if a stored procedure will complete even if the db
> connection is lost.
>
ReportsCleanCronApp is backfill-based. So if crontabber on nodeX starts this stored procedure with argument DATE1 and you then kill that nodeX instance, the job will be "locked" for 2 hours [0], as if it had been running that whole time.
Then you start a new crontabber on nodeY, and after about 2 hours it's going to run the job again. But because it didn't finish last time, nodeY is going to send in DATE1 again.
The question is, how horrible is that?
I see some `INSERT INTO ...` that uses uuid [1] in that code. Does that mean it might throw an error that the inserted UUID already exists in reports_user_info, for example?
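One common way to make a backfill like that safe to re-run is to skip rows that already exist. A sketch using SQLite's INSERT OR IGNORE (Postgres would use ON CONFLICT DO NOTHING; I don't know whether the actual stored procedure does this):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE reports_user_info (uuid TEXT PRIMARY KEY, user_comments TEXT)"
)
conn.commit()

def backfill(conn, date_rows):
    # Re-running the same date skips rows that already exist instead
    # of raising a duplicate-key error on the uuid primary key.
    with conn:
        conn.executemany(
            "INSERT OR IGNORE INTO reports_user_info VALUES (?, ?)", date_rows
        )

rows = [("uuid-1", "it crashed"), ("uuid-2", "it crashed again")]
backfill(conn, rows)   # first run for DATE1
backfill(conn, rows)   # killed-and-retried run for the same DATE1
n = conn.execute("SELECT COUNT(*) FROM reports_user_info").fetchone()[0]
```

If the procedure does this, re-sending DATE1 is harmless; if it doesn't, the retry would fail loudly, which is at least visible.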
I'm lazy and just trying to help by dropping some vague comments and dim brain dumps, but my suggestion would be to avoid that [stored procedure repetition] problem by allowing old crontabber nodes time to finish instead of killing them just because new nodes are up and working.
Now, how you do that is up to Miles I guess. Also, my suggestions are crap because I don't understand EC2 automation like Miles does. You could do one of these when it's time to terminate an old admin node:
1) wait for the running crontabber to finish, then shut down: while pgrep -f crontabber > /dev/null; do sleep 10; done; shutdown -h now
2) stop scheduling new crontabber runs, give the current one time to finish, then shut down: echo "" > /etc/crontab/socorro && sleep 1000 && shutdown -h now
Sorry if this is causing you more thinking on a closed issue. Just trying to help.
[0] https://github.com/mozilla-services/socorro/commit/cf2b44d9ca2ee2559fbccf8bab03bc241f602e60
[1] https://github.com/mozilla-services/socorro/blob/991171dcf54fbe40b247ede4a72e4b77b7c64a29/socorro/external/postgresql/raw_sql/procs/001_update_reports_clean.sql#L377-L399
Assignee
Comment 13•7 years ago
The context for this is as follows:
1. assuming we stick to "normal deploy windows", the only jobs that could be running during a deploy are:
1.1. UploadCrashReportJSONSchemaCronApp -- going away soon
1.2. ReportsCleanCronApp -- going away soon
1.3. FTPScraperCronApp -- going away soon
1.4. FeaturedVersionsAutomaticCronApp -- runs really fast and is pretty straightforward
1.5. BugzillaCronApp -- runs pretty fast and can be interrupted
2. we deploy on average once a week
I did my audit with the intention of lowering the risk as much as possible such that we could kill off crontabber nodes during deploys because this is by far the simplest deploy mechanism.
Pretty sure we've been doing deploys in -new-stage for a month and haven't hit issues.
I'm pretty confident that if we have issues with crontabber jobs during or after a deploy, we'll see things show up in sentry and can unhork them at that point by hand. I don't think it's the case that any of these jobs are destructive such that if they go awry and no one notices, we're hosed.
Given that, while I agree with your concerns, I don't think it makes sense to spend the time and effort to implement, test, and verify a poison-pill-style deploy, or to deal with it in some other way--options which have their own set of complexities.
If I'm horribly wrong and we hit problems, we should know really quick and we can revisit this.