766627 - Deploy rabbitmq / celery for MDN dev/stage/prod

Reporter

Description

•

13 years ago

You're going to hate me, but I think we're going to need an offline queue for Kuma/MDN before launch - and that probably means rabbitmq + celeryd.

In doing more work with kumascript and templates, I think I'm at the point where I'm not comfortable letting the node.js service handle requests on demand. There are some long page build times and it seems dangerous to allow those to stack up.

So, I'm starting work on bug 766252, which should do 2 things with the queue:

* Defer wiki page build to the queue and show users a "rebuild in progress" page in the meantime.
* Only allow one build of a given page at a time. Currently, many requests could be attempted for one of those 30-seconds-to-build pages, and they'd all stack up in parallel. That sounds like bad news.
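The two queue behaviours described above can be sketched roughly as follows. This is a minimal, standard-library-only illustration, not the Kuma implementation: the `BuildScheduler` class, its method names, the status strings, and the example page slug are all hypothetical, and an in-process `queue.Queue` stands in for rabbitmq + celeryd.

```python
import queue
import threading


class BuildScheduler:
    """Hypothetical sketch: defer page builds and allow one build per page."""

    def __init__(self):
        self._pending = queue.Queue()   # stand-in for the rabbitmq/celery queue
        self._in_progress = set()       # page slugs queued or currently building
        self._lock = threading.Lock()   # guards _in_progress

    def request_build(self, slug):
        """Queue a build unless one is already queued or running for this page."""
        with self._lock:
            if slug in self._in_progress:
                return "rebuild in progress"
            self._in_progress.add(slug)
        self._pending.put(slug)
        return "build queued"

    def run_next(self, build_fn):
        """Worker side: build one queued page, then release its slot."""
        slug = self._pending.get()
        try:
            return build_fn(slug)
        finally:
            with self._lock:
                self._in_progress.discard(slug)


scheduler = BuildScheduler()
first = scheduler.request_build("en-US/docs/Example")    # queued
second = scheduler.request_build("en-US/docs/Example")   # duplicate is refused
result = scheduler.run_next(lambda slug: "built " + slug)
third = scheduler.request_build("en-US/docs/Example")    # slot is free again
```

A second request for the same page while a build is pending gets the "rebuild in progress" answer instead of starting a parallel build, which is the pile-up this bug is meant to prevent.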

Les Orchard [:lorchard]

Reporter

Updated

•

13 years ago

Blocks: 766252

Phong Tran [:phong]

Updated

•

13 years ago

Whiteboard: [webops meeting 2012-06-21]

Les Orchard [:lorchard]

Reporter

Comment 1

•

13 years ago

Also, FWIW, this should avoid any messing around with long timeouts in production as requested for developer-dev in bug 763829

Les Orchard [:lorchard]

Reporter

Comment 2

•

13 years ago

I'm not quite done with the feature that depends on this yet, but we will need it before launch. Any update on if/when this can happen?

Jake Maul [:jakem]

Comment 3

•

13 years ago

A node is online but with only a basic server config (no puppet modules). Our rabbitmq / celery puppet configs are all over the map.

TODO:

1) Sort out how to actually get stuff installed and running here. :solarce may be helpful (CC'ing him) as we've previously discussed how best to manage a "proper" rabbitmq/celery cluster, but without coming to any concrete plans.

2) Open ACLs between this node and the admin node... presumably if nothing else it will need to get the content so it can run the proper celery workers.

3) Open ACLs between this node and the web nodes, for actual rabbitmq/celery traffic.

4) #2 makes it clear to me that we'll need 3 nodes here, not 1... one for dev, one for stage, and one for new-prod. We aren't set up to deploy code from different environments onto the same node. Need to open a bug to get 2 more VMs spun up, identical to this one.

As part of #1 we might decide if it's worth having yet another VM to run rabbitmq for all 3 envs, since as far as I know rabbitmq is standalone, and won't need a copy of the site code to run.

In any case I'm certain this *can* happen, but I don't really know what would be a realistic timeline. Certainly not in time for tomorrow's soft launch, but possibly before the July 14/15 MT contract end date.

Les Orchard [:lorchard]

Reporter

Comment 4

•

13 years ago

FWIW, this is a late request for a last-minute feature add. With that in mind, I built it so that we can launch without a queue.

The rendering code will, at worst, still allow only one rendering of any given page at any given time. That should hopefully stop a pile-up on pages that take a long time to render in the request/response cycle.

Adding a queue will enhance that by letting us offload some page rendering out of the request/response cycle altogether. So, we definitely want it ASA(reasonably)P, but I don't think it should block launch by any means.

Corey Shields [:cshields]

Comment 5

•

13 years ago

Jake, do you need gear built for this? I can have SREs install a couple of nodes..

Assignee: server-ops → nmaul

Whiteboard: [webops meeting 2012-06-21]

Jake Maul [:jakem]

Updated

•

13 years ago

Assignee: nmaul → server-ops-webops

Group: infra

Whiteboard: [pending triage]

Jake Maul [:jakem]

Updated

•

13 years ago

Priority: -- → P3

Whiteboard: [pending triage] → [triaged 20120824]

Jake Maul [:jakem]

Comment 6

•

13 years ago

Renormalizing priority levels... P4 is "normal" now.

Priority: P3 → P4

Luke Crouch [:groovecoder]

Comment 7

•

13 years ago

Altering and bumping since we have another use-case for this now.

Blocks: 750240

Severity: normal → major

Priority: P4 → P3

Summary: Deploy rabbitmq / celery for developer-new.mozilla.org and developer-dev.allizom.org → Deploy rabbitmq / celery for MDN prod & stage

Adrian J Fernandez [:Aj]

Updated

•

13 years ago

Assignee: server-ops-webops → nmaul

Jake Maul [:jakem]

Comment 8

•

13 years ago

rabbitmq will be a pair of standalone nodes (mirrored queues)... VMs or seamicro xeons or HP blades (probably the seamicros, just based on available resources).

celery will probably be 3 VMs (open to discussion), one per environment.

Severity: major → normal

Summary: Deploy rabbitmq / celery for MDN prod & stage → Deploy rabbitmq / celery for MDN dev/stage/prod

Jake Maul [:jakem]

Comment 9

•

13 years ago

I have the rabbitmq nodes more or less alive in SCL3. They're properly clustered. I've played with it a little bit, and can send/receive properly... at least to one node. Still need to set up a load-balanced VIP to distribute the work across the two nodes.

Setting up the mirrored queues is something that happens on the app level, most likely based on stuff we'd put in settings_local.py. Relevant: https://github.com/celery/celery/issues/483

I'm not opposed to running the celery workers right on the web nodes, at least for dev and stage. That'll be quickest to get up and running. For prod we can debate having separate celery nodes.

There's likely a bit of work to restructure how we do celery, as most of our current stuff kinda assumes rabbitmq running *on* the celery node. Shouldn't be too hard to change, I think.

Jake Maul [:jakem]

Comment 10

•

13 years ago

Celery is alive and (I think) working on developer-dev.allizom.org.

Our RabbitMQ system is clustered and inherently fairly redundant. However, the contents of the individual queues are not mirrored between RabbitMQ nodes. This is something that needs to be handled at the app layer, in settings_local.py.

https://github.com/celery/celery/issues/483

There are implications to doing this (performance, how this affects non-idempotent jobs, etc), and we may not want to actually do it. But we can, if we want to.

Let me know how this works for you, and we can easily do this on stage and prod as well.
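One of the implications mentioned above, the effect on non-idempotent jobs, can be illustrated with a small sketch. It assumes that a message may occasionally be delivered more than once (for example, if a broker node fails before the message is acknowledged). The `send_notification` function, the `_handled` set, and the task IDs are hypothetical and are not taken from Kuma.

```python
_handled = set()   # hypothetical record of task IDs that already ran
sent = []          # stands in for the real side effect (e.g. sending email)


def send_notification(task_id, recipient):
    """Run the side effect once per task_id, even if delivered twice."""
    if task_id in _handled:
        return False          # duplicate delivery: skip the side effect
    _handled.add(task_id)
    sent.append(recipient)
    return True


first = send_notification("task-1", "editor@example.com")
duplicate = send_notification("task-1", "editor@example.com")
```

In a real deployment the record of handled IDs would have to live somewhere shared, such as a database or cache, rather than in process memory.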

Luke Crouch [:groovecoder]

Comment 11

•

13 years ago

I'll get back into arecibo to test this on dev and let you know.

Jake Maul [:jakem]

Comment 12

•

13 years ago

I've checked in a commented-out section to the puppet module for developer-stage for enabling celery on the stage node, and in the node manifest for the prod celery node.

I have also added the relevant config to the settings_local.py file for stage and prod, also commented out.

I have also confirmed that the Chief update script and settings are ready to go for stage and prod. The actual function call is commented out, but should work.

Once we're ready to go on this, it should be a simple matter of uncommenting the config in puppet and the matching lines in settings_local.py, deploying the puppet changes, editing deploy.py to have it update the celery node(s), and deploying the site.

Priority: P3 → --

Les Orchard [:lorchard]

Reporter

Updated

•

13 years ago

Blocks: 839605

Les Orchard [:lorchard]

Reporter

Comment 13

•

13 years ago

What's the next step to getting Celery available for MDN?

Luke Crouch [:groovecoder]

Updated

•

13 years ago

Blocks: 839214

James Socol [:jsocol, :james]

Comment 14

•

13 years ago

(In reply to Jake Maul [:jakem] from comment #12)
> Once we're ready to go on this, it should be a simple matter of uncommenting
> the config in puppet and the matching lines in settings_local.py, deploying
> the puppet changes, editing deploy.py to have it update the celery node(s),
> and deploying the site.

Can we go ahead and make the settings and puppet changes, or do the deploy.py changes need to happen at the same time? If so, that shouldn't be too complicated.

Luke Crouch [:groovecoder]

Comment 15

•

13 years ago

We will need this in the next week or so.

Brandon Burton [:solarce]

Assignee

Updated

•

13 years ago

Assignee: nmaul → bburton

Priority: -- → P1

Brandon Burton [:solarce]

Assignee

Comment 16

•

13 years ago

celery is live on stage. I pushed the puppet changes and the BROKER_ config changes.

Please test it and let me know how it looks.

Brandon Burton [:solarce]

Assignee

Comment 17

•

13 years ago

Is there a way I can trigger a job that'll test Celery on staging? thanks

Flags: needinfo?(lorchard)

Les Orchard [:lorchard]

Reporter

Comment 18

•

13 years ago

I don't know / I don't think so? I think we've been holding off deploying or enabling any Celery code until we had a way to run it. Kind of chicken-and-egg

Flags: needinfo?(lorchard)

Brandon Burton [:solarce]

Assignee

Comment 19

•

13 years ago

(In reply to Les Orchard [:lorchard] from comment #18)
> I don't know / I don't think so? I think we've been holding off deploying or
> enabling any Celery code until we had a way to run it. Kind of
> chicken-and-egg

Understood. Dev and stage are ready to test Celery on; prod is still pending flows in bug 846969, but I imagine it will be done this week.

Let me know if there is anything I can do to help test.

James Socol [:jsocol, :james]

Comment 20

•

13 years ago

(In reply to Brandon Burton [:solarce] from comment #19)
> Let me know if there is anything I can do to help test

We could run the ping() command to make sure the web heads and the admin box are communicating with the workers.

./manage.py shell
>>> from celery.task.control import ping
>>> ping()
[{'worker': 'pong'}]

Brandon Burton [:solarce]

Assignee

Comment 21

•

13 years ago

Dev looks ready, it took a couple pings though

[bburton@developer1.dev.webapp.scl3 kuma]$ ./manage.py shell
In [1]: from celery.task.control import ping
In [2]: ping()
Out[2]: []
In [3]: ping()
Out[3]: [{u'developer1.dev.webapp.scl3.mozilla.com': u'pong'}]
In [4]: ping()
Out[4]: [{u'developer1.dev.webapp.scl3.mozilla.com': u'pong'}]

I am poking at stage. Thanks James for the tip, that helps a lot!
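Since the first ping() in the session above came back empty, a check like this may want to retry before declaring the workers unreachable. The helper below is only a sketch: `wait_for_workers`, its parameters, and the fake ping used to exercise it are hypothetical. In a real shell the `ping` imported from `celery.task.control` above would be passed in as the callable.

```python
import time


def wait_for_workers(ping, attempts=5, delay=1.0):
    """Call ping() until it returns a non-empty reply or attempts run out."""
    for attempt in range(1, attempts + 1):
        replies = ping()
        if replies:
            return attempt, replies
        if attempt < attempts:
            time.sleep(delay)   # give the workers a moment before retrying
    return attempts, []


# Fake ping that mimics the session above: empty first, then a reply.
_responses = iter([[], [{"developer1.dev.webapp.scl3.mozilla.com": "pong"}]])
attempt, replies = wait_for_workers(lambda: next(_responses), delay=0)
```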

James Socol [:jsocol, :james]

Comment 22

•

13 years ago

>>> from celery.task.control import ping
>>> ping()

;)

John Karahalis [:openjck]

Updated

•

13 years ago

Blocks: 859875

Luke Crouch [:groovecoder]

Comment 23

•

13 years ago

Can we check on stage and prod for this? We finally have a feature that needs to use it so of course now we need it ASAP. :/

Brandon Burton [:solarce]

Assignee

Comment 24

•

13 years ago

(In reply to Luke Crouch [:groovecoder] from comment #23)
> Can we check on stage and prod for this? We finally have a feature that
> needs to use it so of course now we need it ASAP. :/

From what I can tell stage looks ready

[root@developeradm.private.scl3 kuma]# ./manage.py shell
Python 2.6.6 (r266:84292, Aug 28 2012, 10:55:56)
Type "copyright", "credits" or "license" for more information.

IPython 0.10 -- An enhanced Interactive Python.
?         -> Introduction and overview of IPython's features.
%quickref -> Quick reference.
help      -> Python's own help system.
object?   -> Details about 'object'. ?object also works, ?? prints more.

In [1]: from celery.task.control import ping
In [2]: ping()
Out[2]: [{u'developer1.stage.webapp.scl3.mozilla.com': u'pong'}]
In [3]: ping()
Out[3]: [{u'developer1.stage.webapp.scl3.mozilla.com': u'pong'}]
In [4]: ping()
Out[4]: [{u'developer1.stage.webapp.scl3.mozilla.com': u'pong'}]
In [5]: ping()
Out[5]: [{u'developer1.stage.webapp.scl3.mozilla.com': u'pong'}]

How can we test this out?

I'll review the prod configs and file any bugs, etc tomorrow

Brandon Burton [:solarce]

Assignee

Comment 25

•

13 years ago

(In reply to Brandon Burton [:solarce] from comment #24)
> I'll review the prod configs and file any bugs, etc tomorrow

I've confirmed that the flows are in place for both the prod webservers and the prod celery servers.

I've put the prod BROKER_ settings in settings_local.py, commented out.

Once we've confirmed -stage is happy we can push to prod.

Status: NEW → ASSIGNED

Luke Crouch [:groovecoder]

Comment 26

•

13 years ago

Checked stage - the celery process still isn't handling the email tasks. The BROKER_* stuff looks good, but settings.py has CELERY_ALWAYS_EAGER = True, so we need to override that to False in settings_local.py.
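The override described above might look something like the following in settings_local.py. This is a sketch only: the bug refers to the broker settings just as "BROKER_*", so the individual setting names below are an assumption based on Celery's classic BROKER_ settings, and every value is a placeholder rather than the real stage or prod configuration.

```python
# settings_local.py (sketch; all values are placeholders)

# Broker connection settings; the real values are not given in this bug.
BROKER_HOST = "rabbitmq.example.internal"
BROKER_PORT = 5672
BROKER_USER = "kuma"
BROKER_PASSWORD = "change-me"
BROKER_VHOST = "kuma"

# settings.py sets this to True, which runs tasks inline in the calling
# process instead of sending them to the broker. Override it here so that
# tasks are actually handed to the celery workers.
CELERY_ALWAYS_EAGER = False
```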

Brandon Burton [:solarce]

Assignee

Comment 27

•

13 years ago

(In reply to Luke Crouch [:groovecoder] from comment #26)
> Checked stage - the celery process still isn't handling the email tasks. The
> BROKER_* stuff looks good, but settings.py has CELERY_ALWAYS_EAGER = True,
> so we need to override that to False in settings_local.py.

I've added CELERY_ALWAYS_EAGER = False to stage and pushed it.

Ping me on IRC when you're ready to test things.

Luke Crouch [:groovecoder]

Comment 28

•

13 years ago

Tested on stage and it works! Non-blocking request on the edit, and I saw:

[2013-05-08 11:48:01,780: INFO/MainProcess] Task tidings.events._fire_task[12009560-ba67-40a0-a615-d69dfbf22ce7] succeeded in 0.520574092865s

in /var/log/celeryd-kuma.log

BUT - did we update the dev and stage servers with the EMAIL_HOST values like bug 869588? I think dev and stage are still using their localhost smtp? That could be bad if an edit crawler on the stage site spawns hundreds or thousands of emails from it.

Brandon Burton [:solarce]

Assignee

Comment 29

•

13 years ago

(In reply to Luke Crouch [:groovecoder] from comment #28)
> BUT - did we update the dev and stage servers with the EMAIL_HOST values
> like bug 869588? I think dev and stage are still using their localhost smtp?
> That could be bad if an edit crawler on the stage site spawns hundreds or
> thousands of emails from it.

Yes, that is live with https://bugzilla.mozilla.org/show_bug.cgi?id=869588#c7

When do you want to push to prod and test?

Brandon Burton [:solarce]

Assignee

Comment 30

•

13 years ago

(In reply to Brandon Burton [:solarce] from comment #29)
> (In reply to Luke Crouch [:groovecoder] from comment #28)
>
> > BUT - did we update the dev and stage servers with the EMAIL_HOST values
> > like bug 869588? I think dev and stage are still using their localhost smtp?
> > That could be bad if an edit crawler on the stage site spawns hundreds or
> > thousands of emails from it.
>
> Yes, that is live with https://bugzilla.mozilla.org/show_bug.cgi?id=869588#c7
>
> When do you want to push to prod and test?

I've pushed the broker and celery settings live

Brandon Burton [:solarce]

Assignee

Comment 31

•

13 years ago

Discovered developer-celery1 wasn't all the way set up :( I should have checked more thoroughly.

I'm doing the needful in puppet and have opened https://bugzilla.mozilla.org/show_bug.cgi?id=871824 for a needed flow to get the code deployed.

Will update the bug when things are ready to test again.

Brandon Burton [:solarce]

Assignee

Comment 32

•

13 years ago

I'm ready to test this, let me know if there is a good time today

Brandon Burton [:solarce]

Assignee

Comment 33

•

13 years ago

This has been successfully deployed and tested

12:50:32 @groovecoder | solarce: w00t!
12:50:49 solarce | [2013-05-16 09:48:37,783: DEBUG/PoolWorker-7] Sending edited notification email for document (id=4967)
12:50:49 solarce | [2013-05-16 09:48:38,091: INFO/MainProcess] Task tidings.events._fire_task[844ecf6d-ccbe-4dd4-b33e-55116bc2fe52] succeeded in 0.630956888199s: None
12:51:02 @groovecoder | yup, I got the email! interestingly, I also got the previous email notifications! :)

Status: ASSIGNED → RESOLVED

Closed: 13 years ago

Resolution: --- → FIXED

Nobody; OK to take it and work on it

Updated

•

12 years ago

Component: Server Operations: Web Operations → WebOps: Other

Product: mozilla.org → Infrastructure & Operations

BMO Automation

Updated

•

7 years ago

Product: Infrastructure & Operations → Infrastructure & Operations Graveyard