virtualize sumocelery1.webapp.phx1



4 years ago
4 months ago


(Reporter: gcox, Unassigned)




(Whiteboard: [kanban:] [vm-p2v:1])



4 years ago
sumocelery1.webapp.phx1 is on out-of-warranty hardware.  Assuming it's still needed, it looks like a candidate for virtualization, and we'd like to work out a time to take it down and convert.


4 years ago
Whiteboard: [kanban:]
Gathering some stats for suggestions as to virtual hardware to allocate.  Assuming that you still need/want it around.

disk - reduce to default vm of 40G.

These are just suggestions - let me know if you have concerns with these, and when a good time to take this down would be, so that we can get this off of old hardware.  Thanks.
Poking for status: do the suggested specs look acceptable, and can we have a window to take this down and P2V, so we can get off the hardware as it comes off warranty?
From our email convo, this should take 1-2 hours. I'm game for doing it anytime you all prefer. Having celery tasks backlog for that long is no big deal.

Our rabbitmq queues will still be up and running, correct?
Just give us a headsup in #sumodev or in this bug, ideally the day before.
Per the #sumodev conversation, the P2V is done. Upon reboot things were working, but the host was using EVERY bit of RAM and swap; I'm surprised the OOM killer didn't swoop in. I took it back down and gave it more RAM, and things seem much happier.
Whiteboard: [kanban:] → [kanban:] [vm-p2v:1]

Comment 6

4 years ago
gcox@sumocelery1.webapp.phx1:~$ ps auxwww|grep "/data/www/ /data/www/ celeryd"|wc -l ; date ; free -m
Sun Sep 21 09:02:20 PDT 2014
             total       used       free     shared    buffers     cached
Mem:         11912      11646        265          0         37        176
-/+ buffers/cache:      11432        480
Swap:         4095       2097       1998

That seems like a lot of threads blocked.
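
For anyone who wants to turn the `free -m` output above into percentages, here is a minimal Python sketch. The `parse_free` helper and its dictionary keys are made up for illustration; it assumes the older `free` layout shown in this comment (with the `-/+ buffers/cache` line) and just reuses the numbers pasted above.

```python
# Sketch only: parse the old-style `free -m` output pasted in this comment.
# parse_free() and its key names are hypothetical, not part of any tool.
FREE_OUTPUT = """\
             total       used       free     shared    buffers     cached
Mem:         11912      11646        265          0         37        176
-/+ buffers/cache:      11432        480
Swap:         4095       2097       1998
"""


def parse_free(text):
    """Return totals and usage in MB from old-style `free -m` output."""
    result = {}
    for line in text.splitlines():
        fields = line.split()
        if not fields:
            continue
        if fields[0] == "Mem:":
            result["mem_total"] = int(fields[1])
            result["mem_used"] = int(fields[2])
        elif fields[0] == "-/+":
            # Layout: "-/+ buffers/cache: <used> <free>"
            result["app_used"] = int(fields[2])
        elif fields[0] == "Swap:":
            result["swap_total"] = int(fields[1])
            result["swap_used"] = int(fields[2])
    return result


stats = parse_free(FREE_OUTPUT)
# Memory held by processes, excluding buffers and page cache.
app_pct = 100.0 * stats["app_used"] / stats["mem_total"]
swap_pct = 100.0 * stats["swap_used"] / stats["swap_total"]
print(f"RAM used by processes: {app_pct:.1f}%")
print(f"Swap used: {swap_pct:.1f}%")
```

The point of using the `-/+ buffers/cache` row rather than the `Mem:` row is that it excludes memory the kernel can reclaim, so it is closer to what the processes themselves are holding.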

Comment 7

4 years ago
This should get better without any VM changes... we're dropping the number of celery workers. :cturra has the specifics.
(In reply to Jake Maul [:jakem] from comment #7)
> This should get better without any VM changes... we're dropping the number
> of celery workers. :cturra has the specifics.

I reduced the number of available celery workers from 128 to 96. Watching the rabbit queues on this node before making the change, there never seemed to be any unacknowledged jobs in the queue, which suggests we were already over-provisioned with workers. I will keep an eye on this to see whether 96 gives the same result.
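
A rough sketch of the heuristic described above, in case it is useful for repeating the check at 96 workers. It assumes queue samples have already been collected somehow (for example from the RabbitMQ management API, which reports a `messages_unacknowledged` count per queue). The `looks_overprovisioned` function and the sample values are hypothetical, not measurements from this host.

```python
# Sketch only: flag a worker pool as possibly over-provisioned when no
# sample ever shows unacknowledged messages, as described in this comment.
# The function name and the sample data below are illustrative assumptions.

def looks_overprovisioned(samples):
    """Return True if every sample shows zero unacknowledged messages.

    `samples` is a list of dicts, each with a 'messages_unacknowledged'
    integer. An empty list gives no evidence, so it returns False.
    """
    if not samples:
        return False
    peak = max(s["messages_unacknowledged"] for s in samples)
    return peak == 0


# Made-up samples standing in for periodic queue observations.
quiet_samples = [{"messages_unacknowledged": 0} for _ in range(5)]
busy_samples = quiet_samples + [{"messages_unacknowledged": 12}]

print(looks_overprovisioned(quiet_samples))
print(looks_overprovisioned(busy_samples))
```

This only mirrors the reasoning in the comment; it says nothing about how many workers is the right number, only whether the current count ever appears to be saturated.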

Comment 9

4 years ago
:cknowles finished this; it's been behaving.  With reduced cores it has occasionally set off load alarms because of transient load, but :ashish said he'll "bump the threshold or increase the delay for notification".
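
To illustrate the "increase the delay for notification" option, here is a minimal sketch of one way such a delay can work: only alert when load stays above the threshold for several consecutive checks, so transient spikes are ignored. This is not the actual monitoring configuration; `should_notify`, the threshold, and the sample values are all assumptions for illustration.

```python
# Sketch only: suppress alerts for transient load by requiring the
# threshold to be exceeded on several consecutive checks.
# should_notify() and all values below are hypothetical.

def should_notify(load_samples, threshold, required_consecutive):
    """Return True if the latest `required_consecutive` samples all exceed `threshold`."""
    if required_consecutive < 1:
        raise ValueError("required_consecutive must be at least 1")
    if len(load_samples) < required_consecutive:
        return False
    recent = load_samples[-required_consecutive:]
    return all(sample > threshold for sample in recent)


# A single spike does not trigger; sustained load does.
transient = [1.2, 9.5, 1.4]
sustained = [6.1, 7.3, 8.0]
print(should_notify(transient, threshold=4.0, required_consecutive=3))
print(should_notify(sustained, threshold=4.0, required_consecutive=3))
```

Raising `threshold` corresponds to the other option mentioned ("bump the threshold"); the two can be combined.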

Since this is just the p2v bug, and it's verified working correctly, closing this out.
Last Resolved: 4 years ago
Resolution: --- → FIXED


4 months ago
Product: Infrastructure & Operations → Infrastructure & Operations Graveyard