p-m having issues posting to graphs.m.o

Status: RESOLVED FIXED

Product: Release Engineering
Component: General
Priority: P2
Severity: major
Reported: 9 years ago
Last modified: 4 years ago

People

(Reporter: aki, Assigned: catlee)

Tracking

Keywords: ({intermittent-failure})
Firefox Tracking Flags: (Not tracked)

Attachments

(2 attachments)

(Reporter)

Description

9 years ago
Production-master is timing out when posting to graphs.m.o.
Talos appears to be reporting there OK, so it's not clear why this is happening.

 - graphs.m.o is slow to load for me, though I'm not sure whether that's worse than usual.
 - production-master had issues this morning and had to be kicked.
 - odd DHCPREQUESTs in production-master:/var/log/messages
 - and then there's this:

[13:42]	<nagios>	[51] dm-graphs01:avg load is WARNING: WARNING - load average: 18.12, 23.93, 22.51

That might possibly be the culprit.
(Reporter)

Updated

9 years ago
Severity: normal → critical
We haven't hit this in the last few hours, lowering severity.
Severity: critical → major
This is still happening sporadically, e.g. we had a build [1] that failed to post from production-master to graphs.m.o in this window:
 Start	Wed May 20 02:35:46 2009
 End	Wed May 20 02:36:20 2009
At the same time as the End timestamp, all our hg pollers fired, which might be related. Other graph server posts take just a few seconds, but this one is timing out at 30. Please check the load on the graph server and the storage latency in that time window.

[1] http://tinderbox.mozilla.org/showlog.cgi?log=Firefox3.5/1242809760.1242812223.3642.gz

Updated

9 years ago
Assignee: server-ops → thardcastle

Comment 3

9 years ago
I think we're just stacking up processes a bit much. Adding another VCPU to take advantage of some CPU time on another physical core would probably clear that up. Needs some brief VM downtime to enable.
Status: NEW → ASSIGNED
Flags: needs-downtime+

Comment 4

9 years ago
Anyone care if we do this upgrade tomorrow (06/16) night?
(Reporter)

Comment 5

9 years ago
Downtime of which? Graphs, or p-m?
Either way I'm pretty sure downtime tomorrow night is a no.
(In reply to comment #2)
> Please check the load on the graph server and storage latency in that time
> window.

Were there any alerts about load or storage latency at that time, for either the graph server or production-master?
(In reply to comment #4)
> Anyone care if we do this upgrade tomorrow (06/16) night?

A downtime at this point in the release would be tough. 

Recall that power-cycling production-master, even for just a few seconds, will require a multi-hour downtime while we cycle new builds and unittests through to verify everything is working. Depending on the release schedule, we were hoping to arrange a downtime with Dev later this week / early next week, when things calm down, but a downtime before then will be tough. If you feel we *need* to do this, we should go through our options at the IT/RelEng portion of tomorrow's 1pm meeting.

0) Do we know if this is a problem with graph server or with production-master? Or both?

1) What exactly are you suggesting doing? Adding another VCPU to the ESX server running production-master? How confident are we that this extra VCPU would fix this problem?

2) Is there anything *else* that could be migrated off the ESX server instead, to reduce load?

3) Has anything changed recently on the ESX server causing this to start happening now?

Comment 8

9 years ago
0) dm-graphs01 is where the VCPU gets added; the VM has to be rebooted to make the change.

1) From watching the load on dm-graphs01 over a couple of days, it's bound by multiple processes competing for CPU time. Most of the time the ESX server has more CPU cores available, so another VCPU would split the load.

2) This ESX server isn't overloaded, just this VM.

3) The dm-graphs01 load appears to have been rising.

Comment 9

9 years ago
Talked about this yesterday... RelEng will tell us when we can take downtime to do this upgrade.  Reassign when you're ready.
Assignee: thardcastle → nobody
Component: Server Operations → Release Engineering
Flags: needs-downtime+
QA Contact: mrz → release
Upped VCPU from 1 to 2.

Upped RAM from 728K to 1G because 728K sounded low to me for a web application server.
Assignee: nobody → server-ops
Status: ASSIGNED → RESOLVED
Last Resolved: 8 years ago
Component: Release Engineering → Server Operations
QA Contact: release → mrz
Resolution: --- → FIXED
er, I did mean 728M not 728K there ;)

Updated

8 years ago
Duplicate of this bug: 499680
We saw more post failures on Monday after these changes had been made. Here are the logs linked in bug 499680:

http://tinderbox.mozilla.org/showlog.cgi?log=Firefox/1245676090.1245678112.23641.gz
http://tinderbox.mozilla.org/showlog.cgi?log=Firefox/1245676090.1245678153.23717.gz
http://tinderbox.mozilla.org/showlog.cgi?log=Firefox/1245676287.1245678528.24302.gz
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
oremj, can you get munin up on production-master (or help RE do so) and on graphs?  

Without understanding the cause it's tough to know how to really fix this :|
Assignee: server-ops → oremj
(Assignee)

Comment 15

8 years ago
(In reply to comment #14)
> oremj, can you get munin up on production-master (or help RE do so) and on
> graphs?  

This is bug 500058.
Munin is running on production master now.

Updated

8 years ago
Assignee: oremj → server-ops
Since Munin went in, have there been any repeat occurrences?
Actually...we had a bunch of failures (6 or so) between 2:13 and 2:15pm today.
And Munin shows a load spike around that time (hard to see exactly when because of the scale):
http://nm-dash01.nms.mozilla.org/munin/build/production-master.build.mozilla.org-cpu.html
Can you correlate that to some process?

Updated

8 years ago
Whiteboard: [orange]

Updated

8 years ago
Blocks: 438871
(In reply to comment #20)
> Can you correlate that to some process?

I wasn't logged on to the machine at the time, but it was almost certainly the Buildbot process causing it.
What's the fix for that then?  If buildbot is chewing CPU do we need to look at moving this to real hardware?
(Assignee)

Comment 23

8 years ago
(In reply to comment #22)
> What's the fix for that then?  If buildbot is chewing CPU do we need to look at
> moving this to real hardware?

I wish we knew!

Moving p-m to the new 3GHz cluster will probably help.  Adding another virtual CPU may also help.
Bug 501255 for that.
(In reply to comment #24)
> Bug 501255 for that.

ETA for that fix?  Can I close this one or dup it to "upgrade p-m"?

Comment 26

8 years ago
"Failed graph server post" has been a top cause of red this week.
Please re-assign back to IT if there is something for us to do here.
Assignee: server-ops → nobody
Component: Server Operations → Release Engineering
QA Contact: mrz → release
(Reporter)

Comment 28

8 years ago
Marking bug 501255 as a dependency and futuring.
Component: Release Engineering → Release Engineering: Future
Depends on: 501255
(Assignee)

Updated

8 years ago
Duplicate of this bug: 505382
(Assignee)

Comment 30

8 years ago
I believe the build machines are still posting results to the old graph server, so I'm adding a dependency on bug 476208.
Depends on: 476208
(Assignee)

Comment 31

8 years ago
Also, if bug 476208 doesn't solve this, then we should move the graph server post onto the build slaves instead of doing it in the master. I'll file a separate bug for that if required.
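For illustration of what that separate bug might propose: the post would move from a master-side HTTP call to a ShellCommand that runs a submission script on the slave. The script name and options below are placeholders, not an existing tool; a minimal sketch:

    from buildbot.process.factory import BuildFactory
    from buildbot.steps.shell import ShellCommand

    factory = BuildFactory()

    # Run the submission on the build slave instead of doing the HTTP post
    # from the master process. 'post_graph_results.py' and its options are
    # made-up placeholders for whatever talos-style sender gets written.
    factory.addStep(ShellCommand(
        name='graph_server_post',
        description=['posting', 'results'],
        command=['python', 'tools/post_graph_results.py',
                 '--server', 'graphs.mozilla.org',
                 '--results', 'graph_results.txt'],
    ))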

Comment 32

8 years ago
Pretty much all the m-c and 1.9.1 builds are currently red due to failed graphserver posts.
(Assignee)

Comment 33

8 years ago
This is now a major problem for keeping the tree green, moving out of the Future pool.
Assignee: nobody → catlee
Component: Release Engineering: Future → Release Engineering
Created attachment 389836 [details] [diff] [review]
Allow longer to submit to the old graph server

This is a stopgap until Alice's patch to send the data to the new graph server is ready to go.
Attachment #389836 - Flags: review?
Attachment #389836 - Flags: review? → review?(aki)
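The patch contents aren't quoted here, but as a rough sketch of the kind of change involved, assuming the master-side post goes through Twisted's getPage (whose per-request timeout is what turns a slow collect.cgi into a TimeoutError); the timeout value below is illustrative, not the patch's actual number:

    from twisted.web.client import getPage

    # Illustrative value only; the real number lives in the buildbotcustom patch.
    OLD_GRAPH_SERVER_TIMEOUT = 300  # seconds

    def post_to_graph_server(url, data):
        # getPage passes timeout= through to HTTPClientFactory, which is what
        # raises the TimeoutError when the graph server is too slow to respond.
        return getPage(url,
                       method='POST',
                       postdata=data,
                       headers={'Content-Type': 'application/x-www-form-urlencoded'},
                       timeout=OLD_GRAPH_SERVER_TIMEOUT)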
(Reporter)

Updated

8 years ago
Attachment #389836 - Flags: review?(aki) → review+
Summary: p-m having issues posting to graphs.m.o → p-m having issues posting to graphs-old.m.o
This is stopping us basically dead right now, due to red trees, and we're trying to get things finished up for branching 1.9.2.  Please let me know if I can help, or if there is news to share with developers.  I am running out of shiny things with which to distract them!
Severity: major → blocker
(Assignee)

Comment 36

8 years ago
(In reply to comment #35)
> This is stopping us basically dead right now, due to red trees, and we're
> trying to get things finished up for branching 1.9.2.  Please let me know if I
> can help, or if there is news to share with developers.  I am running out of
> shiny things with which to distract them!

We're in the middle of a downtime right now trying to get this resolved.  If the current approach doesn't work we can have a different approach ready for the end of the week.

We could also make failed graph server posts non-fatal.  Is that a viable option?
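For what it's worth, buildbot steps already have the knobs for that; a minimal sketch, where GraphServerPost is just a stand-in name for whichever step actually does the post on p-m:

    from buildbot.steps.shell import ShellCommand

    class GraphServerPost(ShellCommand):
        """Stand-in for whichever step actually does the graph server post."""
        name = 'graph_server_post'
        # Non-fatal: later steps still run, and a failed post shows up as a
        # warning (orange) rather than turning the whole build red.
        haltOnFailure = False
        flunkOnFailure = False
        warnOnFailure = True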
Since the landings on Wednesday morning, have we seen any additional failures here?
Severity: blocker → major
Not that I can see on mozilla-central or mozilla-1.9.1.
(Assignee)

Comment 39

8 years ago
Haven't seen any problems on mozilla-central or mozilla-1.9.1 since the 23rd.
Status: REOPENED → RESOLVED
Last Resolved: 8 years ago
Resolution: --- → FIXED
Had another failure today:

http://tinderbox.mozilla.org/showlog.cgi?log=Firefox/1249421520.1249422616.23476.gz
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
(Assignee)

Comment 41

8 years ago
We had some infrastructure problems yesterday, cvs, hg, bugzilla, wiki all being down for a while.

Let's keep our eye out for more of these.
(In reply to comment #41)
> We had some infrastructure problems yesterday, cvs, hg, bugzilla, wiki all
> being down for a while.

CVS and Hg are both not behind the netscaler at all. You should not have experienced any issues with either of them yesterday when the netscaler failed over. Did you?
(Assignee)

Comment 43

8 years ago
(In reply to comment #42)
> (In reply to comment #41)
> > We had some infrastructure problems yesterday, cvs, hg, bugzilla, wiki all
> > being down for a while.
> 
> CVS and Hg are both not behind the netscaler at all. You should not have
> experienced any issues with either of them yesterday when the netscaler failed
> over. Did you?

We had several talos builds fail to checkout code from CVS yesterday.
(In reply to comment #43)
> We had several talos builds fail to checkout code from CVS yesterday.

Checkout from cvs.mozilla.org or cvs-mirror.mozilla.org? cvs-mirror.mozilla.org is actually behind the netscaler, but cvs.mozilla.org is not.
(Assignee)

Comment 45

8 years ago
(In reply to comment #44)
> (In reply to comment #43)
> > We had several talos builds fail to checkout code from CVS yesterday.
> 
> Checkout from cvs.mozilla.org or cvs-mirror.mozilla.org? cvs-mirror.mozilla.org
> is actually behind the netscaler, but cvs.mozilla.org is not.

From cvs-mirror.m.o.

Comment 46

8 years ago
http://tinderbox.mozilla.org/showlog.cgi?log=Firefox/1249939966.1249940631.15087.gz&fulltext=1

Comment 47

8 years ago
another one: http://tinderbox.mozilla.org/showlog.cgi?log=Firefox/1249940454.1249941366.23198.gz
(Assignee)

Updated

8 years ago
Assignee: catlee → anodelman
Depends on: 509604
We're currently not getting much error reporting to tell where the fault lies; the work in bug 509604 should give us more to go on.
Priority: -- → P2
Now that error reporting is up and working, we're just waiting for another failure so that I can see what's going wrong.
Summary: p-m having issues posting to graphs-old.m.o → p-m having issues posting to graphs.m.o
A couple of post errors:
http://tinderbox.mozilla.org/showlog.cgi?log=Firefox/1250394830.1250398485.16955.gz

Encountered error when trying to post refcnt_leaks
[Failure instance: Traceback (failure with no frames): <class 'twisted.internet.defer.TimeoutError'>: Getting http://graphs.mozilla.org/server/collect.cgi took longer than 120 seconds.
]
at Sat Aug 15 21:48:32, and 
Encountered error when trying to post trace_malloc_allocs
[Failure instance: Traceback (failure with no frames): <class 'twisted.internet.defer.TimeoutError'>: Getting http://graphs.mozilla.org/server/collect.cgi took longer than 120 seconds.
]
(three times) at Sat Aug 15 21:51:24.

And also
http://tinderbox.mozilla.org/showlog.cgi?log=Firefox/1250397117.1250398199.13868.gz
Encountered error when trying to post codesighs
[Failure instance: Traceback (failure with no frames): <class 'twisted.internet.defer.TimeoutError'>: Getting http://graphs.mozilla.org/server/collect.cgi took longer than 120 seconds.
]
at Sat Aug 15 21:47:35 PDT.
Spot check of dm-graphs01 right now is showing no real issues, however I do note that it's running in a VM.  Maybe it's time for real hardware.
(In reply to comment #51)
> Spot check of dm-graphs01 right now is showing no real issues, however I do
> note that it's running in a VM.  Maybe it's time for real hardware.

On phone with alice right now: 


1) Before we start considering hardware, etc., did this VM hit any performance limits or trigger any alarms in VMware?

2) One theory is that the talos slaves and cruncher are both intermittently spiking load on graphs.m.o. Changing how cruncher gathers historical data might reduce load significantly. Also, Alice has occasionally seen talos slaves fail to post data to graphs.m.o yet still report green/success, which could be caused by the same spike in load. The visible symptom is that the green talos build on the waterfall has no link to the graph server.
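One cheap way to test that theory would be a probe on cruncher or the master that records collect.cgi response times, so they can be lined up against the Munin graphs and the talos/cruncher schedules. A minimal sketch (the interval and log path are arbitrary):

    import socket
    import time
    import urllib2

    URL = 'http://graphs.mozilla.org/server/collect.cgi'
    INTERVAL = 60                      # seconds between probes; arbitrary
    LOGFILE = '/tmp/graphs_probe.log'  # arbitrary location

    socket.setdefaulttimeout(120)      # match the 120s timeout in the build logs

    while True:
        start = time.time()
        try:
            urllib2.urlopen(URL).read()
            status = 'ok'
        except Exception, e:
            status = 'error: %s' % e
        elapsed = time.time() - start
        log = open(LOGFILE, 'a')
        log.write('%s %.1fs %s\n' % (time.strftime('%Y-%m-%d %H:%M:%S'), elapsed, status))
        log.close()
        time.sleep(INTERVAL)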

Comment 53

8 years ago
A couple more:
http://production-master.build.mozilla.org:8010/builders/OS%20X%2010.5.2%20mozilla-1.9.2%20leak%20test%20build/builds/27
http://production-master.build.mozilla.org:8010/builders/WINNT%205.2%20mozilla-central%20leak%20test%20build/builds/3133

Same errors as nthomas mentioned in comment 50
Depends on: 513960
Hit this on an m-c build machine:

http://tinderbox.mozilla.org/showlog.cgi?log=Firefox/1251886388.1251894000.9007.gz
http://tinderbox.mozilla.org/showlog.cgi?log=TraceMonkey/1252709267.1252710049.18373.gz&fulltext=1
http://tinderbox.mozilla.org/showlog.cgi?log=Firefox/1252888172.1252889239.7318.gz
OS X 10.5.2 mozilla-central build on 2009/09/13 17:29:32
I don't believe we've been seeing any talos graph server posting errors come through, so we just need to implement the same graph-sending code that talos uses, and hopefully this will clear things up.
Depends on: 516773
(Assignee)

Comment 58

8 years ago
I guess this is mine now.  Working on dependent bug 516773 to fix this.
Assignee: anodelman → catlee
Any new occurrences?
(Assignee)

Comment 60

8 years ago
Going to call this fixed; I haven't seen any new occurrences since bug 516773 landed.
Status: REOPENED → RESOLVED
Last Resolved: 8 years ago
Resolution: --- → FIXED
Had an occurrence a few days ago (I failed to grab the log).

Need longer times between retries and possibly more retries.
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
(In reply to comment #61)
> Need longer times between retries and possibly more retries.

Will that tie up the machine until it can submit? At some point we just have to say "graphs is too slow and it's not our problem".
http://tinderbox.mozilla.org/showlog.cgi?log=Firefox/1255416057.1255424088.9667.gz
WINNT 5.2 mozilla-central leak test build on 2009/10/12 23:40:57
(Assignee)

Comment 64

8 years ago
Created attachment 408007 [details] [diff] [review]
Bump up retries to 8

This will try for up to 20 minutes to do the graph server post (was 5 minutes before).
Attachment #408007 - Flags: review?(bhearsum)
Attachment #408007 - Flags: review?(bhearsum) → review+
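For the arithmetic: one reading that makes both numbers line up is eight attempts at the 120-second per-attempt timeout plus a short pause between attempts, i.e. roughly 8 × 150s ≈ 20 minutes, versus two such attempts ≈ 5 minutes before. A sketch of that shape in Twisted (the pause length and the exact retry structure are assumptions, not a description of the attached patch):

    from twisted.internet import defer, reactor
    from twisted.web.client import getPage

    MAX_ATTEMPTS = 8
    TIMEOUT = 120   # per-attempt timeout, matching the errors in comment 50
    PAUSE = 30      # assumed pause between attempts

    def post_with_retries(url, data, attempt=1):
        d = getPage(url, method='POST', postdata=data, timeout=TIMEOUT)

        def retry(failure):
            if attempt >= MAX_ATTEMPTS:
                return failure          # give up and let the step fail
            # wait PAUSE seconds, then kick off the next attempt
            paused = defer.Deferred()
            reactor.callLater(PAUSE, paused.callback, None)
            paused.addCallback(lambda _: post_with_retries(url, data, attempt + 1))
            return paused

        d.addErrback(retry)
        return d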
Comment on attachment 408007 [details] [diff] [review]
Bump up retries to 8

http://hg.mozilla.org/build/buildbotcustom/rev/54cef4dc3faf

pm & pm02 reconfig'd.
Attachment #408007 - Flags: review+
Attachment #408007 - Flags: review+ → checked-in+
(Assignee)

Comment 66

8 years ago
Going to call this Fixed.  Re-open if we hit it again.
Status: REOPENED → RESOLVED
Last Resolved: 8 years ago
Resolution: --- → FIXED
Keywords: intermittent-failure
Whiteboard: [orange]
Product: mozilla.org → Release Engineering