Bug 548371 - production graph server swamped
Status: RESOLVED FIXED
Product: mozilla.org Graveyard
Classification: Graveyard
Component: Server Operations
Version: other
Platform: All All
Importance: -- normal
Target Milestone: ---
Assigned To: Phong Tran [:phong]
QA Contact: matthew zeier [:mrz]
Mentors:
Depends on: 550066
Blocks: 548320
Reported: 2010-02-24 11:47 PST by alice nodelman [:alice] [:anode]
Modified: 2015-03-12 08:17 PDT
mzeier: needs‑downtime-


Description alice nodelman [:alice] [:anode] 2010-02-24 11:47:52 PST
After some investigation by fox2mike, the production graph server was determined to be 'sluggish' and using up half its swap. This is a central piece of our build and performance testing infrastructure.

IT should figure out how to beef up this vm.
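
For anyone wanting to reproduce the "half its swap" observation, here is a minimal sketch of computing the swap-in-use fraction from text in the Linux /proc/meminfo format. The function name and the sample figures are illustrative assumptions; they are not taken from the graph server, and this is not necessarily how fox2mike measured it.

```python
def swap_used_fraction(meminfo_text):
    """Return the fraction of swap in use (0.0 to 1.0), or None if no swap is configured.

    Expects text in the /proc/meminfo format, e.g. "SwapTotal:  1048576 kB".
    """
    fields = {}
    for line in meminfo_text.splitlines():
        name, sep, rest = line.partition(":")
        if not sep:
            continue  # not a "Name: value" line
        parts = rest.split()
        if parts and parts[0].isdigit():
            fields[name.strip()] = int(parts[0])  # values are reported in kB
    total = fields.get("SwapTotal", 0)
    free = fields.get("SwapFree", 0)
    if total == 0:
        return None
    return (total - free) / total


# Hypothetical sample values, chosen only to illustrate a half-used swap.
SAMPLE_MEMINFO = (
    "MemTotal:        2097152 kB\n"
    "SwapTotal:       1048576 kB\n"
    "SwapFree:         524288 kB\n"
)

fraction = swap_used_fraction(SAMPLE_MEMINFO)
print("swap in use: {:.0%}".format(fraction))
```

On a real Linux host the same function could be fed the contents of /proc/meminfo instead of the sample string.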
Comment 1 Shyam Mani [:fox2mike] 2010-02-24 11:53:15 PST
Two-pronged approach:

1) Currently the VM has 2 processors and about 2 GB of RAM. I'd like to boost that to 4 processors and 4 GB and see how it performs.

2) If the above fails too, we might need to think of moving to hardware, but seeing how it's performed so far I don't think this is needed quite yet.

Alice, this would need downtime: the VM has to be shut down before these values can be bumped up. Can you please let us know when we can do this?
Comment 2 alice nodelman [:alice] [:anode] 2010-02-24 16:08:33 PST
As long as notification goes to dev.planning and dev.tree-management and waterfalls are closed you can do this whenever fits into your schedule.
Comment 3 John O'Duinn [:joduinn] (please use "needinfo?" flag) 2010-02-25 09:49:06 PST
When is the next regular downtime, and could this be included?

(I can't set the "needs-downtime" flag, but this does need advance downtime notice, as any builds attempting to post results while graphserver is offline will burn red.)
Comment 4 Shyam Mani [:fox2mike] 2010-02-25 10:07:47 PST
Tonight, but if that's too short notice, we could aim for the coming Tue or Thu.
Comment 5 alice nodelman [:alice] [:anode] 2010-03-02 10:57:43 PST
Noticed that this didn't end up on tonight's downtime notification - what are we blocking on here?

Anytime IT can get this on their downtime schedule releng is willing to do the necessary tree closures to make it happen.
Comment 6 John O'Duinn [:joduinn] (please use "needinfo?" flag) 2010-03-02 12:21:41 PST
Just talked with mrz; this is too late to happen tonight. IT will schedule this for Thursday downtime instead. RelEng still needs to send out notice to developers about tree closure.
Comment 7 John O'Duinn [:joduinn] (please use "needinfo?" flag) 2010-03-04 17:50:44 PST
(In reply to comment #6)
> Just talked with mrz; this is too late to happen tonight. IT will schedule this
> for Thursday downtime instead. RelEng still needs to send out notice to
> developers about tree closure.

Rescheduled to Friday 8-11am PST.
Comment 8 Shyam Mani [:fox2mike] 2010-03-04 18:50:00 PST
I see a downtime notice for this and an email that says to do it tonight instead of Friday, so which is it? Tonight or Friday?
Comment 9 Aki Sasaki [:aki] 2010-03-04 18:51:50 PST
Yes, joduinn replied to the email (on moz.dev.tree-management at least) saying it's been postponed to tomorrow (Friday).
Comment 10 Shyam Mani [:fox2mike] 2010-03-04 18:53:45 PST
Thanks Aki, the email said don't do it tonight ;) My bad, I misread.
Comment 11 Shyam Mani [:fox2mike] 2010-03-05 08:42:22 PST
All done, dm-graphs02 now has twice the processors and RAM it had before.
Comment 12 Shyam Mani [:fox2mike] 2010-03-05 08:54:05 PST
Picked up a kernel upgrade and a bunch of other fixes as well.
Comment 13 John O'Duinn [:joduinn] (please use "needinfo?" flag) 2010-03-08 12:55:47 PST
The VM has been swamped this morning, so aravind just rebooted it. VMware Tools were missing (maybe since the kernel upgrade?).

Reopening to track.
Comment 14 Aravind Gottipati [:aravind] 2010-03-08 12:57:20 PST
I installed VMware Tools on it, so if this helps and it holds up, this bug can be closed.
Comment 15 matthew zeier [:mrz] 2010-03-09 11:02:16 PST
Calling fixed.
Comment 16 Chris AtLee [:catlee] 2010-03-10 06:48:45 PST
How was the load this morning? We had a bunch of failed posts early this morning (2:30-3am or so).
Comment 17 Chris AtLee [:catlee] 2010-03-10 07:55:44 PST
We're still having problems here.  More failures around 7:30.
Comment 18 Shyam Mani [:fox2mike] 2010-03-10 09:41:32 PST
I can't even log in to the box :(

Phong, any ideas here? We picked up a kernel upgrade, bumped up the RAM and CPU, and Aravind updated VMware Tools, but it seems like the box just locks up after a while and is completely unresponsive.
Comment 19 John O'Duinn [:joduinn] (please use "needinfo?" flag) 2010-03-10 12:02:31 PST
(In reply to comment #18)
> I can't even login to the box :(
> 
> Phong any ideas here? We picked up a kernel upgrade, bumped up the RAM and CPU
> and Aravind updated VMWare tools, but it seems like the box just locks up after
> a while and is completely unresponsive.

By any chance, for the CPU upgrade, did you switch to multi-core CPUs? If so, can we try going back to single-core? I ask because we've hit problems with multi-core CPUs on win32 build VMs in the past, and even though graphserver is not a win32 VM, it would be good to eliminate that variable from the problem.
Comment 20 Robert Kaiser 2010-03-11 05:12:51 PST
...and next morning, it seems to be gone again :(
Comment 21 Phong Tran [:phong] 2010-03-11 09:28:41 PST
I thought this was scheduled for 9 AM.
Comment 22 John O'Duinn [:joduinn] (please use "needinfo?" flag) 2010-03-11 11:30:54 PST
Just talked with mrz: let's leave this closed, and track the latest fallout in bug 551532.
