Deploy rabbitmq / celery for MDN dev/stage/prod

RESOLVED FIXED

Status

P1
normal
RESOLVED FIXED
7 years ago
5 years ago

People

(Reporter: lorchard, Assigned: bburton)

Tracking

Details

(Whiteboard: [triaged 20120824])

(Reporter)

Description

7 years ago
You're going to hate me, but I think we're going to need an offline queue for Kuma/MDN before launch - and that probably means rabbitmq + celeryd

In doing more work with kumascript and templates, I think I'm at the point where I'm not comfortable letting the node.js service handle requests on demand. There are some long page build times and it seems dangerous to allow those to stack up.

So, I'm starting work on bug 766252, which should do 2 things with the queue:

* Defer wiki page build to the queue and show users a "rebuild in progress" page in the meantime.

* Only allow one build of a given page at a time. Currently, many requests could be attempted for one of those 30-seconds-to-build pages, and they'd all stack up in parallel. That sounds like bad news.
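
For illustration only, the "one build of a given page at a time" rule could be sketched roughly as below. This is a minimal in-process sketch, not the Kuma implementation: `build_once`, `render_page`, and the `_building` registry are hypothetical names, and a real deployment with several web heads would presumably need a shared lock (for example in a cache) rather than a per-process set.

```python
import threading

# Hypothetical in-process registry of pages that are currently being built.
_building = set()
_registry_lock = threading.Lock()


def build_once(page_slug, render_page):
    """Run render_page(page_slug) unless a build of that page is already running.

    Returns the rendered result, or None when another build is in progress,
    in which case the caller would show a "rebuild in progress" page.
    """
    with _registry_lock:
        if page_slug in _building:
            return None
        _building.add(page_slug)
    try:
        # The registry lock is not held while rendering, so other pages
        # can still be built concurrently.
        return render_page(page_slug)
    finally:
        with _registry_lock:
            _building.discard(page_slug)
```

A second request for the same page gets `None` back instead of starting a parallel 30-second build; with a queue in place, the first request would enqueue the build rather than run it inline.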
(Reporter)

Updated

7 years ago
Blocks: 766252

Updated

7 years ago
Whiteboard: [webops meeting 2012-06-21]
(Reporter)

Comment 1

7 years ago
Also, FWIW, this should avoid any messing around with long timeouts in production as requested for developer-dev in bug 763829
(Reporter)

Comment 2

7 years ago
I'm not quite done with the feature that depends on this yet, but we will need it before launch. Any update on if/when this can happen?

Updated

7 years ago
Depends on: 768952
A node is online but with only a basic server config (no puppet modules). Our rabbitmq / celery puppet configs are all over the map. TODO:

1) Sort out how to actually get stuff installed and running here. :solarce may be helpful (CC'ing him) as we've previously discussed how best to manage a "proper" rabbitmq/celery cluster, but without coming to any concrete plans.

2) Open ACLs between this node and the admin node... presumably if nothing else it will need to get the content so it can run the proper celery workers.

3) Open ACLs between this node and the web nodes, for actual rabbitmq/celery traffic.

4) #2 makes it clear to me that we'll need 3 nodes here, not 1: one for dev, one for stage, and one for new-prod. We aren't set up to deploy code from different environments onto the same node. Need to open a bug to get 2 more VMs spun up, identical to this one. As part of #1 we might decide whether it's worth having yet another VM to run rabbitmq for all 3 environments, since as far as I know rabbitmq is standalone and won't need a copy of the site code to run.


In any case I'm certain this *can* happen, but I don't really know what would be a realistic timeline. Certainly not in time for tomorrow's soft launch, but possibly before the July 14/15 MT contract end date.
(Reporter)

Comment 4

7 years ago
FWIW, this is a late request for a last-minute feature add. With that in mind, I built it so that we can launch without a queue.

The rendering code will, at worst, still allow only one rendering of any given page at any given time. That should hopefully stop a pile-up on pages that take a long time to render in the request/response cycle.

Adding a queue will enhance that by letting us offload some page rendering out of the request/response cycle altogether. So, we definitely want it ASA(reasonably)P, but I don't think it should block launch by any means.
Jake, do you need gear built for this? I can have SREs install a couple of nodes.
Assignee: server-ops → nmaul
Whiteboard: [webops meeting 2012-06-21]

Updated

6 years ago
Assignee: nmaul → server-ops-webops
Group: infra
Whiteboard: [pending triage]

Updated

6 years ago
Priority: -- → P3
Whiteboard: [pending triage] → [triaged 20120824]
Renormalizing priority levels... P4 is "normal" now.
Priority: P3 → P4
Altering and bumping since we have another use-case for this now.
Blocks: 750240
Severity: normal → major
Priority: P4 → P3
Summary: Deploy rabbitmq / celery for developer-new.mozilla.org and developer-dev.allizom.org → Deploy rabbitmq / celery for MDN prod & stage
Assignee: server-ops-webops → nmaul
rabbitmq will be a pair of standalone nodes (mirrored queues): VMs, seamicro xeons, or HP blades (probably the seamicros, just based on available resources)

celery will probably be 3 VMs (open to discussion), one per environment
Severity: major → normal
Summary: Deploy rabbitmq / celery for MDN prod & stage → Deploy rabbitmq / celery for MDN dev/stage/prod
I have the rabbitmq nodes more or less alive in SCL3. They're properly clustered. I've played with it a little bit, and can send/receive properly... at least to one node. Still need to set up a load-balanced VIP to distribute the work across the two nodes.

Setting up the mirrored queues is something that happens on the app level, most likely based on stuff we'd put in settings_local.py. Relevant:

https://github.com/celery/celery/issues/483


I'm not opposed to running the celery workers right on the web nodes, at least for dev and stage. That'll be quickest to get up and running. For prod we can debate having separate celery nodes.

There's likely a bit of work to restructure how we do celery, as most of our current stuff kinda assumes rabbitmq running *on* the celery node. Shouldn't be too hard to change, I think.
Celery is alive and (I think) working on developer-dev.allizom.org.


Our RabbitMQ system is clustered and inherently fairly redundant. However, the contents of the individual queues are not mirrored between RabbitMQ nodes. This is something that needs to be handled at the app layer, in settings_local.py.

https://github.com/celery/celery/issues/483

There are implications to doing this (performance, how this affects non-idempotent jobs, etc), and we may not want to actually do it. But we can, if we want to.

Let me know how this works for you, and we can easily do this on stage and prod as well.
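
As a hedged sketch of what the app-level mirroring discussed above might look like in settings_local.py: this assumes a RabbitMQ release that honours the `x-ha-policy` queue argument (older 2.x releases; newer releases configure mirroring through broker-side policies instead) and a Celery version that accepts the dict form of `CELERY_QUEUES`. The queue name `default` is an assumption, not taken from the Kuma settings.

```python
# settings_local.py (sketch; queue and exchange names are assumptions)

# Declare the default queue with an argument asking RabbitMQ to mirror
# its contents across all nodes in the cluster.
CELERY_DEFAULT_QUEUE = "default"
CELERY_QUEUES = {
    "default": {
        "exchange": "default",
        "binding_key": "default",
        "queue_arguments": {"x-ha-policy": "all"},
    },
}
```

Queue arguments are fixed when a queue is first declared, so an existing non-mirrored queue would likely need to be deleted and redeclared for this to take effect.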
I'll get back into arecibo to test this on dev and let you know.
I've checked in a commented-out section to the puppet module for developer-stage for enabling celery on the stage node, and in the node manifest for the prod celery node.

I have also added the relevant config to the settings_local.py file for stage and prod, also commented out.

I have also confirmed that the Chief update script and settings are ready to go for stage and prod. The actual function call is commented out, but should work.

Once we're ready to go on this, it should be a simple matter of uncommenting the config in puppet and the matching lines in settings_local.py, deploying the puppet changes, editing deploy.py to have it update the celery node(s), and deploying the site.
Priority: P3 → --

Updated

6 years ago
Depends on: 821928
(Reporter)

Updated

6 years ago
Blocks: 839605
(Reporter)

Comment 13

6 years ago
What's the next step to getting Celery available for MDN?
(In reply to Jake Maul [:jakem] from comment #12)
> Once we're ready to go on this, it should be a simple matter of uncommenting
> the config in puppet and the matching lines in settings_local.py, deploying
> the puppet changes, editing deploy.py to have it update the celery node(s),
> and deploying the site.

Can we go ahead and make the settings and puppet changes, or do the deploy.py changes need to happen at the same time? If so, that shouldn't be too complicated.
We will need this in the next week or so.
(Assignee)

Updated

6 years ago
Assignee: nmaul → bburton
Priority: -- → P1
(Assignee)

Comment 16

6 years ago
celery is live on stage, I pushed the puppet changes and the BROKER_ config changes

Please test it and let me know how it looks
(Assignee)

Updated

6 years ago
Depends on: 846969
(Assignee)

Comment 17

6 years ago
Is there a way I can trigger a job that'll test Celery on staging? 

thanks
Flags: needinfo?(lorchard)
(Reporter)

Comment 18

6 years ago
I don't know / I don't think so? I think we've been holding off deploying or enabling any Celery code until we had a way to run it. Kind of chicken-and-egg
Flags: needinfo?(lorchard)
(Assignee)

Comment 19

6 years ago
(In reply to Les Orchard [:lorchard] from comment #18)
> I don't know / I don't think so? I think we've been holding off deploying or
> enabling any Celery code until we had a way to run it. Kind of
> chicken-and-egg

Understood. Dev and stage are ready to test Celery on; prod is still pending flows in bug 846969, but I imagine that will be done this week.

Let me know if there is anything I can do to help test
(In reply to Brandon Burton [:solarce] from comment #19)
> Let me know if there is anything I can do to help test

We could run the ping() command to make sure the web heads and the admin box are communicating with the workers.

    ./manage.py shell
    >>> from celery.task.control import ping
    >>> ping()
    [{'worker': 'pong'}]
(Assignee)

Updated

6 years ago
Depends on: 848514
(Assignee)

Comment 21

6 years ago
Dev looks ready, it took a couple pings though

[bburton@developer1.dev.webapp.scl3 kuma]$ ./manage.py shell
In [1]: from celery.task.control import ping
In [2]: ping()
Out[2]: []
In [3]: ping()
Out[3]: [{u'developer1.dev.webapp.scl3.mozilla.com': u'pong'}]
In [4]: ping()
Out[4]: [{u'developer1.dev.webapp.scl3.mozilla.com': u'pong'}]

I am poking at stage. 

Thanks James for the tip, that helps a lot!
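
Since the first ping() on dev came back empty, a small retry wrapper can make the check less flaky. This is only a sketch: `ping_until_reply` is a hypothetical helper, and it takes the ping callable as an argument so it does not depend on any particular Celery version.

```python
import time


def ping_until_reply(ping, attempts=5, delay=1.0):
    """Call ping() until it returns a non-empty reply list or attempts run out.

    Returns the first non-empty list of replies, or [] if no worker answered.
    """
    for attempt in range(attempts):
        replies = ping()
        if replies:
            return replies
        # Wait before retrying, but not after the final attempt.
        if attempt < attempts - 1:
            time.sleep(delay)
    return []
```

In a `./manage.py shell` session this would be used as `ping_until_reply(ping)` after the same `from celery.task.control import ping` import shown in the transcript.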
>>> from celery.task.control import ping
>>> ping()

;)
Blocks: 859875
Can we check on stage and prod for this? We finally have a feature that needs to use it so of course now we need it ASAP. :/
(Assignee)

Comment 24

6 years ago
(In reply to Luke Crouch [:groovecoder] from comment #23)
> Can we check on stage and prod for this? We finally have a feature that
> needs to use it so of course now we need it ASAP. :/

From what I can tell stage looks ready

[root@developeradm.private.scl3 kuma]# ./manage.py shell
Python 2.6.6 (r266:84292, Aug 28 2012, 10:55:56)
Type "copyright", "credits" or "license" for more information.

IPython 0.10 -- An enhanced Interactive Python.
?         -> Introduction and overview of IPython's features.
%quickref -> Quick reference.
help      -> Python's own help system.
object?   -> Details about 'object'. ?object also works, ?? prints more.

In [1]: from celery.task.control import ping

In [2]: ping()
Out[2]: [{u'developer1.stage.webapp.scl3.mozilla.com': u'pong'}]

In [3]: ping()
Out[3]: [{u'developer1.stage.webapp.scl3.mozilla.com': u'pong'}]

In [4]: ping()
Out[4]: [{u'developer1.stage.webapp.scl3.mozilla.com': u'pong'}]

In [5]: ping()
Out[5]: [{u'developer1.stage.webapp.scl3.mozilla.com': u'pong'}]

How can we test this out?

I'll review the prod configs and file any bugs, etc tomorrow
(Assignee)

Comment 25

6 years ago
(In reply to Brandon Burton [:solarce] from comment #24)

> I'll review the prod configs and file any bugs, etc tomorrow

I've confirmed that the flows are in place for both the prod webservers and the prod celery servers

I've put the prod BROKER_ settings in settings_local.py commented out

Once we've confirmed -stage is happy we can push to prod
Status: NEW → ASSIGNED
Checked stage - the celery process still isn't handling the email tasks. The BROKER_* stuff looks good, but settings.py has CELERY_ALWAYS_EAGER = True, so we need to override that to False in settings_local.py.
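
For reference, the override described here might look roughly like the following in settings_local.py. The hostname, credentials, and vhost are placeholders, not the real stage values; only the `BROKER_*` prefix and `CELERY_ALWAYS_EAGER` come from this bug.

```python
# settings_local.py (sketch; host and credentials are placeholders)

# Point Celery at the remote RabbitMQ endpoint instead of a local broker.
BROKER_HOST = "rabbitmq.example.com"  # hypothetical hostname
BROKER_PORT = 5672                    # default AMQP port
BROKER_USER = "kuma"                  # hypothetical
BROKER_PASSWORD = "change-me"         # hypothetical
BROKER_VHOST = "kuma"                 # hypothetical

# settings.py defaults this to True, which runs tasks inline in the
# calling process and never hands them to a worker.
CELERY_ALWAYS_EAGER = False
```

With `CELERY_ALWAYS_EAGER = True`, a task call executes synchronously inside the web request, which would explain why the celery process on stage never saw the email tasks.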
(Assignee)

Comment 27

6 years ago
(In reply to Luke Crouch [:groovecoder] from comment #26)
> Checked stage - the celery process still isn't handling the email tasks. The
> BROKER_* stuff looks good, but settings.py has CELERY_ALWAYS_EAGER = True,
> so we need to override that to False in settings_local.py.

I've added CELERY_ALWAYS_EAGER = False to stage and pushed it

Ping me on IRC when you're ready to test things
Tested on stage and it works! Non-blocking request on the edit, and I saw:

[2013-05-08 11:48:01,780: INFO/MainProcess] Task tidings.events._fire_task[12009560-ba67-40a0-a615-d69dfbf22ce7] succeeded in 0.520574092865s

in /var/log/celeryd-kuma.log

BUT - did we update the dev and stage servers with the EMAIL_HOST values like bug 869588? I think dev and stage are still using their localhost smtp? That could be bad if an edit crawler on the stage site spawns hundreds or thousands of emails from it.
(Assignee)

Comment 29

6 years ago
(In reply to Luke Crouch [:groovecoder] from comment #28)

> BUT - did we update the dev and stage servers with the EMAIL_HOST values
> like bug 869588? I think dev and stage are still using their localhost smtp?
> That could be bad if an edit crawler on the stage site spawns hundreds or
> thousands of emails from it.

Yes, that is live with https://bugzilla.mozilla.org/show_bug.cgi?id=869588#c7

When do you want to push to prod and test?
(Assignee)

Comment 30

6 years ago
(In reply to Brandon Burton [:solarce] from comment #29)
> (In reply to Luke Crouch [:groovecoder] from comment #28)
> 
> > BUT - did we update the dev and stage servers with the EMAIL_HOST values
> > like bug 869588? I think dev and stage are still using their localhost smtp?
> > That could be bad if an edit crawler on the stage site spawns hundreds or
> > thousands of emails from it.
> 
> Yes, that is live with https://bugzilla.mozilla.org/show_bug.cgi?id=869588#c7
> 
> When do you want to push to prod and test?

I've pushed the broker and celery settings live
(Assignee)

Updated

6 years ago
Depends on: 871824
(Assignee)

Comment 31

6 years ago
Discovered developer-celery1 wasn't all the way set up :(

I should have checked more thoroughly

I'm doing the needful in puppet and have opened https://bugzilla.mozilla.org/show_bug.cgi?id=871824 for a needed flow to get the code deployed

Will update the bug when things are ready to test again
(Assignee)

Comment 32

6 years ago
I'm ready to test this, let me know if there is a good time today
(Assignee)

Comment 33

6 years ago
This has been successfully deployed and tested

12:50:32     @groovecoder | solarce: w00t!
12:50:49          solarce | [2013-05-16 09:48:37,783: DEBUG/PoolWorker-7] Sending edited notification email for document (id=4967)
12:50:49          solarce | [2013-05-16 09:48:38,091: INFO/MainProcess] Task tidings.events._fire_task[844ecf6d-ccbe-4dd4-b33e-55116bc2fe52] succeeded in 0.630956888199s: None
12:51:02     @groovecoder | yup, I got the email! interestingly, I also got the previous email notifications! :)
Status: ASSIGNED → RESOLVED
Last Resolved: 6 years ago
Resolution: --- → FIXED
Component: Server Operations: Web Operations → WebOps: Other
Product: mozilla.org → Infrastructure & Operations