Closed Bug 493623 Opened 11 years ago Closed 10 years ago

p-m having issues posting to graphs.m.o

Categories

(Release Engineering :: General, defect, P2, major)

x86
All

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: aki, Assigned: catlee)

References

Details

(Keywords: intermittent-failure)

Attachments

(2 files)

Production-master is timing out posting to graphs.m.o.
Talos looks like it's reporting there ok. Not sure why this is.

 - graphs.m.o is slow to load for me, though I'm not sure if that's more than normal or not.
 - production-master had issues this morning and had to be kicked.
 - odd DHCPREQUESTs in production-master:/var/log/messages
 - uh,

[13:42]	<nagios>	[51] dm-graphs01:avg load is WARNING: WARNING - load average: 18.12, 23.93, 22.51

That might possibly be the culprit.
Severity: normal → critical
We haven't hit this in the last few hours, lowering severity.
Severity: critical → major
This is still happening sporadically. E.g., we had a build [1] that failed to post from production-master to graphs.m.o in this window:
 Start	Wed May 20 02:35:46 2009
 End	Wed May 20 02:36:20 2009
At the same time as End, all our hg pollers fired, which might be related. Other graph server posts take just a few seconds, and this one is timing out at 30 seconds. Please check the load on the graph server and storage latency in that time window.

[1] http://tinderbox.mozilla.org/showlog.cgi?log=Firefox3.5/1242809760.1242812223.3642.gz
Assignee: server-ops → thardcastle
I think we're just stacking up processes a bit much. Adding another VCPU to take advantage of some CPU time on another physical core would probably clear that up. Needs some brief VM downtime to enable.
Status: NEW → ASSIGNED
Flags: needs-downtime+
Anyone care if we do this upgrade tomorrow (06/16) night?
Downtime of which? Graphs, or p-m?
Either way I'm pretty sure downtime tomorrow night is a no.
(In reply to comment #2)
> Please check the load on the graph server and storage latency in that time
> window.

Were there any alerts about load / storage latency at this time?... for either graph server, or production-master?
(In reply to comment #4)
> Anyone care if we do this upgrade tomorrow (06/16) night?

A downtime at this point in the release would be tough. 

Recall that power-cycling production-master, even for just a few seconds, will require a multi-hour downtime while we cycle new builds and unittests through to verify all is working. Depending on release schedule, we were hoping to arrange a downtime with Dev later this week / early next week, when things calm down, but a downtime before that will be tough. If you feel we *need* to do this, we should go through our options at the IT/RelEng portion of tomorrow's 1pm meeting.

0) Do we know if this is a problem with graph server or with production-master? Or both?

1) What exactly are you suggesting doing? Adding another VCPU to the ESX server running production-master? How confident are we that this extra VCPU would fix this problem?

2) Is there anything *else* that could be migrated off the ESX server instead, to reduce load?

3) Has anything changed recently on the ESX server causing this to start happening now?
0) dm-graphs01 is where the VCPU gets added, the VM has to be rebooted to make the change.

1) From watching the load on dm-graphs01 over a couple days, it's bound by multiple processes competing for CPU time. Most of the time, the ESX server has more CPU cores available, so a VCPU would split the load.

2) This ESX server isn't overloaded, just this VM.

3) The dm-graphs01 load appears to have been rising.
Talked about this yesterday... RelEng will tell us when we can take downtime to do this upgrade.  Reassign when you're ready.
Assignee: thardcastle → nobody
Component: Server Operations → Release Engineering
Flags: needs-downtime+
QA Contact: mrz → release
Upped VCPU from 1 to 2.

Upped RAM from 728K to 1G because 728K sounded low to me for a web application server.
Assignee: nobody → server-ops
Status: ASSIGNED → RESOLVED
Closed: 11 years ago
Component: Release Engineering → Server Operations
QA Contact: release → mrz
Resolution: --- → FIXED
er, I did mean 728M not 728K there ;)
Duplicate of this bug: 499680
oremj, can you get munin up on production-master (or help RE do so) and on graphs?  

Without understanding the cause it's tough to understand how to really fix :|
Assignee: server-ops → oremj
(In reply to comment #14)
> oremj, can you get munin up on production-master (or help RE do so) and on
> graphs?  

This is bug 500058.
Munin is running on production master now.
Assignee: oremj → server-ops
Since Munin, has there been any repeat occurrence?
Actually...we had a bunch of failures (6 or so) between 2:13 and 2:15pm today.
And Munin shows a load spike around that time (hard to see exactly when because of the scale):
http://nm-dash01.nms.mozilla.org/munin/build/production-master.build.mozilla.org-cpu.html
Can you correlate that to some process?
Whiteboard: [orange]
Blocks: 438871
(In reply to comment #20)
> Can you correlate that to some process?

I wasn't logged on to the machine at the time, but it was almost certainly the Buildbot process causing it.
What's the fix for that then?  If buildbot is chewing CPU do we need to look at moving this to real hardware?
(In reply to comment #22)
> What's the fix for that then?  If buildbot is chewing CPU do we need to look at
> moving this to real hardware?

I wish we knew!

Moving p-m to the new 3GHz cluster will probably help.  Adding another virtual CPU may also help.
(In reply to comment #24)
> Bug 501255 for that.

ETA for that fix?  Can I close this one or dup it to "upgrade p-m"?
"Failed graph server post" has been a top cause of red this week.
Please re-assign back to IT if there is something for us to do here.
Assignee: server-ops → nobody
Component: Server Operations → Release Engineering
QA Contact: mrz → release
Marking bug 501255 as a dependency and futuring.
Component: Release Engineering → Release Engineering: Future
Depends on: 501255
Duplicate of this bug: 505382
I believe the build machines are still posting results to the old graph server, so putting a dependency on bug 476208.
Depends on: 476208
Also, if bug 476208 doesn't solve this, then we should move the graph server post onto the build slaves instead of in the master.  Will file separate bug for that if required.
Pretty much all the m-c and 1.9.1 builds are currently red due to failed graphserver post.
This is now a major problem for keeping the tree green, moving out of the Future pool.
Assignee: nobody → catlee
Component: Release Engineering: Future → Release Engineering
This is a stopgap until Alice's patch to send the data to the new graph server is ready to go.
Attachment #389836 - Flags: review?
Attachment #389836 - Flags: review? → review?(aki)
Attachment #389836 - Flags: review?(aki) → review+
Summary: p-m having issues posting to graphs.m.o → p-m having issues posting to graphs-old.m.o
This is stopping us basically dead right now, due to red trees, and we're trying to get things finished up for branching 1.9.2.  Please let me know if I can help, or if there is news to share with developers.  I am running out of shiny things with which to distract them!
Severity: major → blocker
(In reply to comment #35)
> This is stopping us basically dead right now, due to red trees, and we're
> trying to get things finished up for branching 1.9.2.  Please let me know if I
> can help, or if there is news to share with developers.  I am running out of
> shiny things with which to distract them!

We're in the middle of a downtime right now trying to get this resolved.  If the current approach doesn't work we can have a different approach ready for the end of the week.

We could also make failed graph server posts non-fatal.  Is that a viable option?
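
As a rough illustration of the "non-fatal" option, here is a minimal Python sketch. It is not the Buildbot code in use on production-master: `post_results_non_fatal`, the `post` callable, and the "success"/"warnings" return values are all hypothetical names chosen for this example. The idea is only that a failed post gets logged and downgraded instead of turning the build red.

```python
from typing import Callable


def post_results_non_fatal(post: Callable[[], None]) -> str:
    """Run a graph server post and downgrade any failure to a warning.

    `post` is a hypothetical zero-argument callable that raises on failure.
    Returns "success" or "warnings" and never raises, so a slow graph
    server cannot fail the build on its own.
    """
    try:
        post()
    except Exception as exc:  # deliberately broad: any post failure is non-fatal
        print(f"graph server post failed (non-fatal): {exc}")
        return "warnings"
    return "success"
```

The trade-off is that a build can finish without its data point on the graph server, so the failure still needs to be visible in the log.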
Since the landings Wednesday morning have we seen any additional failures here?
Severity: blocker → major
Not that I can see on mozilla-central or mozilla-1.9.1.
Haven't seen any problems on mozilla-central or mozilla-1.9.1 since the 23rd.
Status: REOPENED → RESOLVED
Closed: 11 years ago → 11 years ago
Resolution: --- → FIXED
Had another failure today:

http://tinderbox.mozilla.org/showlog.cgi?log=Firefox/1249421520.1249422616.23476.gz
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
We had some infrastructure problems yesterday, cvs, hg, bugzilla, wiki all being down for a while.

Let's keep our eye out for more of these.
(In reply to comment #41)
> We had some infrastructure problems yesterday, cvs, hg, bugzilla, wiki all
> being down for a while.

CVS and Hg are both not behind the netscaler at all. You should not have experienced any issues with either of them yesterday when the netscaler failed over. Did you?
(In reply to comment #42)
> (In reply to comment #41)
> > We had some infrastructure problems yesterday, cvs, hg, bugzilla, wiki all
> > being down for a while.
> 
> CVS and Hg are both not behind the netscaler at all. You should not have
> experienced any issues with either of them yesterday when the netscaler failed
> over. Did you?

We had several talos builds fail to checkout code from CVS yesterday.
(In reply to comment #43)
> We had several talos builds fail to checkout code from CVS yesterday.

Checkout from cvs.mozilla.org or cvs-mirror.mozilla.org? cvs-mirror.mozilla.org is actually behind the netscaler, but cvs.mozilla.org is not.
(In reply to comment #44)
> (In reply to comment #43)
> > We had several talos builds fail to checkout code from CVS yesterday.
> 
> Checkout from cvs.mozilla.org or cvs-mirror.mozilla.org? cvs-mirror.mozilla.org
> is actually behind the netscaler, but cvs.mozilla.org is not.

From cvs-mirror.m.o.
Assignee: catlee → anodelman
Depends on: 509604
We're currently not getting much error reporting to tell where the fault lies; the work in bug 509604 should give us more to go on.
Now that error reporting is up and working we're just waiting around for another failure so that I can see what's going wrong.
Summary: p-m having issues posting to graphs-old.m.o → p-m having issues posting to graphs.m.o
A couple of post errors:
http://tinderbox.mozilla.org/showlog.cgi?log=Firefox/1250394830.1250398485.16955.gz

Encountered error when trying to post refcnt_leaks
[Failure instance: Traceback (failure with no frames): <class 'twisted.internet.defer.TimeoutError'>: Getting http://graphs.mozilla.org/server/collect.cgi took longer than 120 seconds.
]
at Sat Aug 15 21:48:32, and 
Encountered error when trying to post trace_malloc_allocs
[Failure instance: Traceback (failure with no frames): <class 'twisted.internet.defer.TimeoutError'>: Getting http://graphs.mozilla.org/server/collect.cgi took longer than 120 seconds.
]
(three times) at Sat Aug 15 21:51:24.

And also
http://tinderbox.mozilla.org/showlog.cgi?log=Firefox/1250397117.1250398199.13868.gz
Encountered error when trying to post codesighs
[Failure instance: Traceback (failure with no frames): <class 'twisted.internet.defer.TimeoutError'>: Getting http://graphs.mozilla.org/server/collect.cgi took longer than 120 seconds.
]
at Sat Aug 15 21:47:35 PDT.
Spot check of dm-graphs01 right now is showing no real issues, however I do note that it's running in a VM.  Maybe it's time for real hardware.
(In reply to comment #51)
> Spot check of dm-graphs01 right now is showing no real issues, however I do
> note that it's running in a VM.  Maybe it's time for real hardware.

On phone with alice right now: 


1) Before we start considering hardware, etc, did this VM hit any performance limits or trigger any alarms in VMware? 

2) One theory is that the talos slaves and cruncher are both intermittently spiking load on graphs.m.o. Changing how cruncher gathers historical data might reduce load significantly. Also, Alice has occasionally seen talos slaves fail to post data to graphs.m.o yet still report green/success, which could be caused by the same spike in load. The visible symptom is that the green talos build on the waterfall has no link to graphserver.
Depends on: 513960
I don't believe that we've been seeing any talos graph server posting errors come through, so we just need to implement the same graph sending code that talos uses and hopefully this will clear up.
Depends on: 516773
I guess this is mine now.  Working on dependent bug 516773 to fix this.
Assignee: anodelman → catlee
Any new occurrences?
Going to call this fixed, I haven't seen any new occurrences since bug 516773 landed.
Status: REOPENED → RESOLVED
Closed: 11 years ago → 10 years ago
Resolution: --- → FIXED
Had an occurrence a few days ago (I failed to grab the log).

Need long times between re-tries and possibly more re-tries.
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
(In reply to comment #61)
> Need long times between re-tries and possibly more re-tries.

Will that tie up the machine until it can submit? At some point we just have to say "graphs is too slow and it's not our problem".
This will try for up to 20 minutes to do the graph server post (was 5 minutes before).
Attachment #408007 - Flags: review?(bhearsum)
Attachment #408007 - Flags: review?(bhearsum) → review+
Attachment #408007 - Flags: review+ → checked-in+
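
For readers without access to the attachment, the retry approach can be sketched roughly as follows. This is not the checked-in patch: the real code runs inside Buildbot on Twisted, whereas this is a plain synchronous Python sketch. The function name, the `post` callable, the delay values, and the doubling backoff are assumptions made for illustration; only the 20-minute overall budget comes from the comment above.

```python
import time
from typing import Callable


def post_with_retries(
    post: Callable[[], None],
    max_total_seconds: float = 20 * 60,   # overall budget, per the comment above
    initial_delay: float = 5.0,           # assumed value
    max_delay: float = 120.0,             # assumed value
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> bool:
    """Retry `post` with growing waits until it succeeds or the budget runs out.

    `post` is a hypothetical zero-argument callable that raises on failure.
    Returns True on success, False once the time budget is exhausted.
    """
    deadline = clock() + max_total_seconds
    delay = initial_delay
    while True:
        try:
            post()
            return True
        except Exception as exc:
            remaining = deadline - clock()
            if remaining <= 0:
                print(f"giving up on graph server post: {exc}")
                return False
            # Never sleep past the deadline, and cap each individual wait.
            wait = min(delay, max_delay, remaining)
            print(f"post failed ({exc}); retrying in {wait:.0f}s")
            sleep(wait)
            delay *= 2
```

Bounding the total time rather than the number of attempts addresses the concern about tying up the machine: the caller knows the worst case up front. The `sleep` and `clock` parameters exist only so the loop can be exercised without real waiting.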
Going to call this Fixed.  Re-open if we hit it again.
Status: REOPENED → RESOLVED
Closed: 10 years ago → 10 years ago
Resolution: --- → FIXED
Whiteboard: [orange]
Product: mozilla.org → Release Engineering