production graph server swamped

RESOLVED FIXED

Status

mozilla.org Graveyard
Server Operations
RESOLVED FIXED
8 years ago
3 years ago

People

(Reporter: alice, Assigned: phong)

Tracking

Bug Flags:
needs-downtime -

Details

(Reporter)

Description

8 years ago
After some investigation by fox2mike, the production graph server was determined to be 'sluggish' and using up half its swap. This is a central piece of our build and performance testing infrastructure.

IT should figure out how to beef up this vm.
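For reference, a "half its swap" figure like the one above can be read from /proc/meminfo on a Linux guest. The following is only a minimal sketch, assuming the VM runs Linux; the function names and the SAMPLE values are hypothetical and are not measurements taken from the graph server.

```python
def parse_meminfo(text):
    """Parse /proc/meminfo-style text into a dict of integer kB values."""
    values = {}
    for line in text.splitlines():
        key, sep, rest = line.partition(":")
        if not sep:
            continue  # skip lines that are not "Key: value" pairs
        fields = rest.split()
        if fields and fields[0].isdigit():
            values[key.strip()] = int(fields[0])
    return values


def swap_used_fraction(text):
    """Return the fraction of swap in use (0.0 if no swap is configured)."""
    info = parse_meminfo(text)
    total = info.get("SwapTotal", 0)
    free = info.get("SwapFree", 0)
    if total == 0:
        return 0.0
    return (total - free) / total


# Hypothetical sample for illustration only (not data from the real server).
SAMPLE = """MemTotal:        2059552 kB
MemFree:           31244 kB
SwapTotal:       2097144 kB
SwapFree:        1048572 kB
"""

if __name__ == "__main__":
    try:
        with open("/proc/meminfo") as f:
            text = f.read()
    except OSError:
        text = SAMPLE  # fall back to the sample on non-Linux systems
    print("swap in use: {:.0%}".format(swap_used_fraction(text)))
```

A sustained value near or above 50% alongside low free memory would be consistent with the sluggishness described, though it does not by itself identify the cause.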
(Reporter)

Updated

8 years ago
Blocks: 548320
Two-pronged approach:

1) Currently the VM has 2 processors and about 2 GB of RAM. I'd like to boost that to 4 processors and 4 GB and see how it performs.

2) If the above fails too, we might need to think of moving to hardware, but seeing how it's performed so far I don't think this is needed quite yet.

Alice, this would need downtime: the VM has to be shut down before these values can be bumped up. Can you please let us know when we can do this?

Updated

8 years ago
Assignee: server-ops → shyam
(Reporter)

Comment 2

8 years ago
As long as notification goes to dev.planning and dev.tree-management and waterfalls are closed you can do this whenever fits into your schedule.
When is the next regular downtime, and could this be included?

(I can't set the "needs-downtime" flag, but this does need advance downtime notice, as any builds attempting to post results while graphserver is offline will burn red.)
Tonight, but if that's too short notice, we could aim for the coming Tue or Thu.
Flags: needs-downtime+
(Reporter)

Comment 5

8 years ago
Noticed that this didn't end up on tonight's downtime notification - what are we blocking on here?

Anytime IT can get this on their downtime schedule releng is willing to do the necessary tree closures to make it happen.
Just talked with mrz; this is too late to happen tonight. IT will schedule this for Thursday downtime instead. RelEng still needs to send out notice to developers about tree closure.
Whiteboard: 03/04/2010 @ 7pm
(In reply to comment #6)
> Just talked with mrz; this is too late to happen tonight. IT will schedule this
> for Thursday downtime instead. RelEng still needs to send out notice to
> developers about tree closure.

Rescheduled to Friday 8-11am PST.
Depends on: 550066
Whiteboard: 03/04/2010 @ 7pm → 03/05/2010 @ 8am
I see a downtime notice for this and an email that says do it tonight instead of Friday, so which is it? Tonight or Friday?

Comment 9

8 years ago
Yes, joduinn replied to the email (on moz.dev.tree-management at least) saying it's been postponed to tomorrow (Friday).
Thanks Aki, the email said don't do it tonight ;) My bad, I misread.
All done, dm-graphs02 has twice the processors and RAM it had before this.
Status: NEW → RESOLVED
Last Resolved: 8 years ago
Resolution: --- → FIXED
Picked up a kernel upgrade and a bunch of other fixes as well.
The VM has been swamped this morning, so aravind just rebooted it. VMware tools are missing (maybe after the kernel upgrade?).

Reopening to track.
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
I installed VMware tools on it, so if this helps and it holds up, this bug can be closed.
Calling fixed.
Status: REOPENED → RESOLVED
Last Resolved: 8 years ago
Flags: needs-downtime+ → needs-downtime-
Resolution: --- → FIXED
Whiteboard: 03/05/2010 @ 8am
How was the load this morning?  We had a bunch of failed posts early this morning (2:30-3am or so)
We're still having problems here.  More failures around 7:30.
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
I can't even log in to the box :(

Phong, any ideas here? We picked up a kernel upgrade, bumped up the RAM and CPU, and Aravind updated VMware tools, but it seems like the box just locks up after a while and is completely unresponsive.
Assignee: shyam → phong
(In reply to comment #18)
> I can't even login to the box :(
> 
> Phong any ideas here? We picked up a kernel upgrade, bumped up the RAM and CPU
> and Aravind updated VMWare tools, but it seems like the box just locks up after
> a while and is completely unresponsive.

By any chance, for the CPU upgrade, did you switch to multi-core CPUs? If so, can we try going back to single-core? I ask because we've hit problems with multi-core CPUs on win32 build VMs in the past, and even if graphserver is not a win32 VM, it would be good to eliminate that variable from the problem.

Comment 20

8 years ago
...and next morning, it seems to be gone again :(
(Assignee)

Comment 21

8 years ago
I thought this was scheduled for 9 AM.
Just talked with mrz: let's leave this closed and track the latest fallout in bug#551532
Status: REOPENED → RESOLVED
Last Resolved: 8 years ago
Resolution: --- → FIXED
Product: mozilla.org → mozilla.org Graveyard