Bug 548371 - production graph server swamped
Status: RESOLVED FIXED
Product: mozilla.org Graveyard
Classification: Graveyard
Component: Server Operations
Version: other
Platform: All All
Importance: -- normal
Target Milestone: ---
Assigned To: Phong Tran [:phong]
QA Contact: matthew zeier [:mrz]
Mentors:
Depends on: 550066
Blocks: 548320
Reported: 2010-02-24 11:47 PST by alice nodelman [:alice] [:anode]
Modified: 2015-03-12 08:17 PDT
CC List: 11 users
mzeier: needs‑downtime-
See Also:
QA Whiteboard:
Iteration: ---
Points: ---


Description alice nodelman [:alice] [:anode] 2010-02-24 11:47:52 PST
After some investigation by fox2mike, the production graph server was determined to be 'sluggish' and using up half its swap. This is a central piece of our build and performance testing infrastructure.

IT should figure out how to beef up this vm.
Comment 1 Shyam Mani [:fox2mike] (AFK until March 10) 2010-02-24 11:53:15 PST
Two-pronged approach:

1) Currently the VM has 2 processors and about 2 GB of RAM. I'd like to boost that to 4 + 4 and see how it performs.

2) If the above fails, we might need to think about moving to hardware, but seeing how it's performed so far I don't think this is needed quite yet.

Alice, this would need downtime: the VM has to be shut down before these values can be bumped up. Can you please let us know when we can do this?
Comment 2 alice nodelman [:alice] [:anode] 2010-02-24 16:08:33 PST
As long as notification goes to dev.planning and dev.tree-management and the waterfalls are closed, you can do this whenever it fits into your schedule.
Comment 3 John O'Duinn [:joduinn] (please use "needinfo?" flag) 2010-02-25 09:49:06 PST
When is the next regular downtime, and could this be included?

(I can't set the "needs-downtime" flag, but this does need advance downtime notice, as any builds attempting to post results while graphserver is offline will burn red.)
Comment 4 Shyam Mani [:fox2mike] (AFK until March 10) 2010-02-25 10:07:47 PST
Tonight, but if that's too short notice, we could aim for the coming Tue or Thu.
Comment 5 alice nodelman [:alice] [:anode] 2010-03-02 10:57:43 PST
Noticed that this didn't end up on tonight's downtime notification - what are we blocking on here?

Any time IT can get this on their downtime schedule, releng is willing to do the necessary tree closures to make it happen.
Comment 6 John O'Duinn [:joduinn] (please use "needinfo?" flag) 2010-03-02 12:21:41 PST
Just talked with mrz; this is too late to happen tonight. IT will schedule this for Thursday downtime instead. RelEng still needs to send out notice to developers about tree closure.
Comment 7 John O'Duinn [:joduinn] (please use "needinfo?" flag) 2010-03-04 17:50:44 PST
(In reply to comment #6)
> Just talked with mrz; this is too late to happen tonight. IT will schedule this
> for Thursday downtime instead. RelEng still needs to send out notice to
> developers about tree closure.

Rescheduled to Friday 8-11am PST.
Comment 8 Shyam Mani [:fox2mike] (AFK until March 10) 2010-03-04 18:50:00 PST
I see a downtime notice for this and an email that says to do it tonight instead of Friday, so which is it? Tonight or Friday?
Comment 9 Aki Sasaki [:aki] 2010-03-04 18:51:50 PST
Yes, joduinn replied to the email (on moz.dev.tree-management at least) saying it's been postponed to tomorrow (Friday).
Comment 10 Shyam Mani [:fox2mike] (AFK until March 10) 2010-03-04 18:53:45 PST
Thanks Aki, the email said don't do it tonight ;) My bad, I misread.
Comment 11 Shyam Mani [:fox2mike] (AFK until March 10) 2010-03-05 08:42:22 PST
All done, dm-graphs02 now has twice the processors and RAM it had before.
Comment 12 Shyam Mani [:fox2mike] (AFK until March 10) 2010-03-05 08:54:05 PST
Picked up a kernel upgrade and a bunch of other fixes as well.
Comment 13 John O'Duinn [:joduinn] (please use "needinfo?" flag) 2010-03-08 12:55:47 PST
The VM has been swamped this morning, so aravind just rebooted it. It was missing VMware tools (maybe after the kernel upgrade?).

Reopening to track.
Comment 14 Aravind Gottipati [:aravind] 2010-03-08 12:57:20 PST
I installed VMware tools on it, so if this helps and it holds up, this bug can be closed.
Comment 15 matthew zeier [:mrz] 2010-03-09 11:02:16 PST
Calling fixed.
Comment 16 Chris AtLee [:catlee] 2010-03-10 06:48:45 PST
How was the load this morning? We had a bunch of failed posts early this morning (2:30-3am or so).
Comment 17 Chris AtLee [:catlee] 2010-03-10 07:55:44 PST
We're still having problems here. More failures around 7:30.
Comment 18 Shyam Mani [:fox2mike] (AFK until March 10) 2010-03-10 09:41:32 PST
I can't even log in to the box :(

Phong, any ideas here? We picked up a kernel upgrade, bumped up the RAM and CPU, and Aravind updated VMware tools, but it seems like the box just locks up after a while and is completely unresponsive.
Comment 19 John O'Duinn [:joduinn] (please use "needinfo?" flag) 2010-03-10 12:02:31 PST
(In reply to comment #18)
> I can't even login to the box :(
> 
> Phong any ideas here? We picked up a kernel upgrade, bumped up the RAM and CPU
> and Aravind updated VMWare tools, but it seems like the box just locks up after
> a while and is completely unresponsive.

By any chance, for the CPU upgrade, did you switch to multi-core CPUs? If so, can we try going back to single-core? I ask because we've hit problems with multi-core CPUs on win32 build VMs in the past, and even if graphserver is not a win32 VM, it would be good to eliminate that variable from the problem.
Comment 20 Robert Kaiser 2010-03-11 05:12:51 PST
...and next morning, it seems to be gone again :(
Comment 21 Phong Tran [:phong] 2010-03-11 09:28:41 PST
I thought this was scheduled for 9 AM.
Comment 22 John O'Duinn [:joduinn] (please use "needinfo?" flag) 2010-03-11 11:30:54 PST
Just talked with mrz: let's leave this closed, and track the latest fallout in bug 551532.