sumocelery1.webapp.phx1 is on out-of-warranty hardware. Assuming it's still needed, it looks like a candidate for virtualization, and we'd like to work out a time to take it down and convert.
I gathered some stats to suggest what virtual hardware to allocate, assuming that you still need/want it around: 1 CPU, 6G RAM, and disk reduced to the default VM size of 40G. These are just suggestions - let me know if you have concerns with these, and when a good time to take this down would be, so that we can get this off of old hardware. Thanks.
Poking for status: do the suggested specs look acceptable, and can we have a window to take this down and P2V, so we can get off the hardware as it comes off warranty?
From our email convo, this should take 1-2 hours. I'm game for doing it anytime you all prefer. Having celery tasks backlog for that long is no big deal. Our rabbitmq queues will still be up and running, correct?
Just give us a heads-up in #sumodev or in this bug, ideally the day before.
Per #sumodev conversation - did this P2V. Upon reboot things were working, but it was using EVERY bit of RAM and swap - surprised OOM_killer-man didn't swoop in. Took it back down and gave it more RAM; things seem much happier.
Whiteboard: [kanban:https://kanbanize.com/ctrl_board/4/513] → [kanban:https://kanbanize.com/ctrl_board/4/513] [vm-p2v:1]
email@example.com:~$ ps auxwww|grep "/data/www/support.mozilla.com/kitsune/virtualenv/bin/python /data/www/support.mozilla.com/kitsune/manage.py celeryd"|wc -l ; date ; free -m
130
Sun Sep 21 09:02:20 PDT 2014
             total       used       free     shared    buffers     cached
Mem:         11912      11646        265          0         37        176
-/+ buffers/cache:      11432        480
Swap:         4095       2097       1998

That seems like a lot of threads blocked.
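For anyone wanting to turn the `free -m` numbers above into percentages, here is a minimal sketch. The helper name `parse_free` is made up for illustration, and it assumes the older procps layout shown in this comment (with a `-/+ buffers/cache:` row); newer `free` versions print different columns.

```python
# Output of `free -m` as pasted in this bug (values in MB).
FREE_OUTPUT = """\
             total       used       free     shared    buffers     cached
Mem:         11912      11646        265          0         37        176
-/+ buffers/cache:      11432        480
Swap:         4095       2097       1998
"""


def parse_free(text):
    """Hypothetical helper: pull the key figures out of old-style `free -m` output."""
    stats = {}
    for line in text.splitlines():
        fields = line.split()
        if not fields:
            continue
        if fields[0] == "Mem:":
            stats["mem_total"] = int(fields[1])
        elif fields[0] == "-/+":
            # Row looks like: -/+ buffers/cache: <used> <free>
            stats["app_used"] = int(fields[2])
        elif fields[0] == "Swap:":
            stats["swap_total"] = int(fields[1])
            stats["swap_used"] = int(fields[2])
    return stats


stats = parse_free(FREE_OUTPUT)
# RAM used by processes, excluding buffers and page cache.
ram_pct = 100 * stats["app_used"] / stats["mem_total"]
swap_pct = 100 * stats["swap_used"] / stats["swap_total"]
print(f"RAM used by processes: {ram_pct:.1f}%  swap used: {swap_pct:.1f}%")
```

The point of using the `-/+ buffers/cache` row is that it excludes memory the kernel can reclaim, so it is a better indicator of real pressure than the raw `used` column.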
This should get better without any VM changes... we're dropping the number of celery workers. :cturra has the specifics.
(In reply to Jake Maul [:jakem] from comment #7)
> This should get better without any VM changes... we're dropping the number
> of celery workers. :cturra has the specifics.

i reduced the number of available celery workers from 128 -> 96. watching the rabbit queues on this node before making the change, there never seemed to be any unacknowledged jobs in the queue, which meant we were already overprovisioned with workers. will keep an eye on this to see if 96 results in the same.
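A rough back-of-the-envelope sketch of what going from 128 to 96 workers might do to memory, using the `free -m` figures from the earlier comment. It rests on two assumptions not confirmed anywhere in this bug: that nearly all non-cache memory and swap in use belongs to the celery workers, and that the footprint scales linearly with worker count. Treat the result as an estimate only.

```python
# Figures from the `free -m` output earlier in this bug (MB).
RAM_TOTAL_MB = 11912
APP_USED_MB = 11432   # "-/+ buffers/cache" used column
SWAP_USED_MB = 2097

OLD_WORKERS = 128
NEW_WORKERS = 96

# ASSUMPTION: all of this memory is attributable to celery workers.
footprint_before = APP_USED_MB + SWAP_USED_MB

# ASSUMPTION: memory scales linearly with the number of workers.
footprint_after = footprint_before * NEW_WORKERS / OLD_WORKERS

headroom = RAM_TOTAL_MB - footprint_after
print(f"estimated footprint before: {footprint_before} MB")
print(f"estimated footprint after:  {footprint_after:.0f} MB")
print(f"estimated RAM headroom:     {headroom:.0f} MB")
```

Under those assumptions the working set would fit in RAM without leaning on swap, which is consistent with the expectation that this should improve without VM changes.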
:cknowles finished this; it's been behaving. With reduced cores it occasionally has lit off load alarms because of transient load, but :ashish said he'll "bump the threshold or increase the delay for notification". Since this is just the p2v bug, and it's verified working correctly, closing this out.
Status: NEW → RESOLVED
Last Resolved: 4 years ago
Resolution: --- → FIXED
Product: Infrastructure & Operations → Infrastructure & Operations Graveyard