Bug 493623: p-m having issues posting to graphs.m.o
Opened 15 years ago · Closed 15 years ago
Categories: Release Engineering :: General, defect, P2
Tracking: (Not tracked)
Status: RESOLVED FIXED
People: Reporter: mozilla; Assigned: catlee
Keywords: intermittent-failure
Attachments (2 files):
- 989 bytes, patch (mozilla: review+)
- 1.21 KB, patch (bhearsum: review+; nthomas: checked-in+)
Description • 15 years ago

Production-master is timing out posting to graphs.m.o. Talos looks like it's reporting there ok. Not sure why this is.
- graphs.m.o is slow to load for me, though I'm not sure if that's more than normal or not.
- production-master had issues this morning and had to be kicked.
- odd DHCPREQUESTs in production-master:/var/log/messages
- uh, [13:42] <nagios> [51] dm-graphs01:avg load is WARNING: WARNING - load average: 18.12, 23.93, 22.51

That last one might possibly be the culprit.
Updated by reporter • 15 years ago
Severity: normal → critical

Comment 1 • 15 years ago
We haven't hit this in the last few hours, lowering severity.
Severity: critical → major
Comment 2 • 15 years ago

This is still happening sporadically. E.g. we had a build [1] that failed to post from production-master to graphs.m.o in this window:

Start: Wed May 20 02:35:46 2009
End:   Wed May 20 02:36:20 2009

At the same time as End, all our hg pollers fired, which might be related. Other graph server posts take just a few seconds, and this one is timing out in 30. Please check the load on the graph server and storage latency in that time window.

[1] http://tinderbox.mozilla.org/showlog.cgi?log=Firefox3.5/1242809760.1242812223.3642.gz
Updated • 15 years ago
Assignee: server-ops → thardcastle
Comment 3 • 15 years ago

I think we're just stacking up processes a bit much. Adding another VCPU to take advantage of some CPU time on another physical core would probably clear that up. This needs some brief VM downtime to enable.
Status: NEW → ASSIGNED
Flags: needs-downtime+
Comment 4 • 15 years ago
Anyone care if we do this upgrade tomorrow (06/16) night?
Comment 5 (reporter) • 15 years ago
Downtime of which? Graphs, or p-m? Either way I'm pretty sure downtime tomorrow night is a no.
Comment 6 • 15 years ago

(In reply to comment #2)
> Please check the load on the graph server and storage latency in that time
> window.

Were there any alerts about load or storage latency at this time, for either the graph server or production-master?
Comment 7 • 15 years ago

(In reply to comment #4)
> Anyone care if we do this upgrade tomorrow (06/16) night?

A downtime at this point in the release would be tough. Recall that power-cycling production-master, even for just a few seconds, will require a multi-hour downtime while we cycle new builds and unittests through to verify all is working. Depending on the release schedule, we were hoping to arrange a downtime with Dev later this week / early next week, when things calm down, but a downtime before that will be tough. If you feel we *need* to do this, we should go through our options at the IT/RelEng portion of tomorrow's 1pm meeting.

0) Do we know if this is a problem with the graph server or with production-master? Or both?
1) What exactly are you suggesting doing? Adding another VCPU to the ESX server running production-master? How confident are we that this extra VCPU would fix the problem?
2) Is there anything *else* that could be migrated off the ESX server instead, to reduce load?
3) Has anything changed recently on the ESX server that would cause this to start happening now?
Comment 8 • 15 years ago

0) dm-graphs01 is where the VCPU gets added; the VM has to be rebooted to make the change.
1) From watching the load on dm-graphs01 over a couple of days, it's bound by multiple processes competing for CPU time. Most of the time the ESX server has more CPU cores available, so a second VCPU would split the load.
2) The ESX server isn't overloaded, just this VM.
3) The dm-graphs01 load appears to have been rising.
Comment 9 • 15 years ago
Talked about this yesterday... RelEng will tell us when we can take downtime to do this upgrade. Reassign when you're ready.
Assignee: thardcastle → nobody
Component: Server Operations → Release Engineering
Flags: needs-downtime+
QA Contact: mrz → release
Comment 10 • 15 years ago
Upped VCPU from 1 to 2. Upped RAM from 728K to 1G because 728K sounded low to me for a web application server.
Assignee: nobody → server-ops
Status: ASSIGNED → RESOLVED
Closed: 15 years ago
Component: Release Engineering → Server Operations
QA Contact: release → mrz
Resolution: --- → FIXED
Comment 11 • 15 years ago
er, I did mean 728M not 728K there ;)
Comment 13 • 15 years ago

We saw more post failures on Monday after these changes had been made. Here are the logs linked in bug 499680:
http://tinderbox.mozilla.org/showlog.cgi?log=Firefox/1245676090.1245678112.23641.gz
http://tinderbox.mozilla.org/showlog.cgi?log=Firefox/1245676090.1245678153.23717.gz
http://tinderbox.mozilla.org/showlog.cgi?log=Firefox/1245676287.1245678528.24302.gz
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
Comment 14 • 15 years ago

oremj, can you get munin up on production-master (or help RE do so) and on graphs? Without understanding the cause, it's tough to know how to really fix it :|
Assignee: server-ops → oremj
Comment 15 (assignee) • 15 years ago

(In reply to comment #14)
> oremj, can you get munin up on production-master (or help RE do so) and on
> graphs?

This is bug 500058.
Comment 16 • 15 years ago
Munin is running on production master now.
Updated • 15 years ago
Assignee: oremj → server-ops
Comment 17 • 15 years ago

Since Munin went up, has there been any repeat occurrence?
Comment 18 • 15 years ago
Actually...we had a bunch of failures (6 or so) between 2:13 and 2:15pm today.
Comment 19 • 15 years ago

And Munin shows a load spike around that time (hard to see exactly when because of the scale):
http://nm-dash01.nms.mozilla.org/munin/build/production-master.build.mozilla.org-cpu.html
Comment 20 • 15 years ago
Can you correlate that to some process?
Updated • 15 years ago
Whiteboard: [orange]
Comment 21 • 15 years ago

(In reply to comment #20)
> Can you correlate that to some process?

I wasn't logged on to the machine at the time, but it was almost certainly the Buildbot process causing it.
Comment 22 • 15 years ago
What's the fix for that then? If buildbot is chewing CPU do we need to look at moving this to real hardware?
Comment 23 (assignee) • 15 years ago

(In reply to comment #22)
> What's the fix for that then? If buildbot is chewing CPU do we need to look at
> moving this to real hardware?

I wish we knew! Moving p-m to the new 3GHz cluster will probably help. Adding another virtual CPU may also help.
Comment 24 • 15 years ago
Bug 501255 for that.
Comment 25 • 15 years ago

(In reply to comment #24)
> Bug 501255 for that.

What's the ETA for that fix? Can I close this one or dup it to "upgrade p-m"?
Comment 26 • 15 years ago
"Failed graph server post" has been a top cause of red this week.
Comment 27 • 15 years ago
Please re-assign back to IT if there is something for us to do here.
Assignee: server-ops → nobody
Component: Server Operations → Release Engineering
QA Contact: mrz → release
Comment 28 (reporter) • 15 years ago
Marking bug 501255 as a dependency and futuring.
Component: Release Engineering → Release Engineering: Future
Depends on: 501255
Comment 30 (assignee) • 15 years ago
I believe the build machines are still posting results to the old graph server, so putting a dependency on bug 476208.
Depends on: 476208
Comment 31 (assignee) • 15 years ago

Also, if bug 476208 doesn't solve this, then we should move the graph server post onto the build slaves instead of doing it in the master. Will file a separate bug for that if required.
Comment 32 • 15 years ago
Pretty much all the m-c and 1.9.1 builds are currently red due to failed graphserver post.
Comment 33 (assignee) • 15 years ago
This is now a major problem for keeping the tree green, moving out of the Future pool.
Assignee: nobody → catlee
Component: Release Engineering: Future → Release Engineering
Comment 34 • 15 years ago
This is a stopgap until Alice's patch to send the data to the new graph server is ready to go.
Attachment #389836 - Flags: review?
Updated • 15 years ago
Attachment #389836 - Flags: review? → review?(aki)
Updated by reporter • 15 years ago
Attachment #389836 - Flags: review?(aki) → review+
Updated • 15 years ago
Summary: p-m having issues posting to graphs.m.o → p-m having issues posting to graphs-old.m.o
Comment 35 • 15 years ago

This is stopping us basically dead right now, due to red trees, and we're trying to get things finished up for branching 1.9.2. Please let me know if I can help, or if there is news to share with developers. I am running out of shiny things with which to distract them!
Severity: major → blocker
Comment 36 (assignee) • 15 years ago

(In reply to comment #35)
> This is stopping us basically dead right now, due to red trees, and we're
> trying to get things finished up for branching 1.9.2. Please let me know if I
> can help, or if there is news to share with developers. I am running out of
> shiny things with which to distract them!

We're in the middle of a downtime right now trying to get this resolved. If the current approach doesn't work we can have a different approach ready by the end of the week.

We could also make failed graph server posts non-fatal. Is that a viable option?
Comment 37 • 15 years ago
Since the landings Wednesday morning have we seen any additional failures here?
Severity: blocker → major
Comment 38 • 15 years ago
Not that I can see on mozilla-central or mozilla-1.9.1.
Comment 39 (assignee) • 15 years ago
Haven't seen any problems on mozilla-central or mozilla-1.9.1 since the 23rd.
Status: REOPENED → RESOLVED
Closed: 15 years ago → 15 years ago
Resolution: --- → FIXED
Comment 40 • 15 years ago

Had another failure today:
http://tinderbox.mozilla.org/showlog.cgi?log=Firefox/1249421520.1249422616.23476.gz
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
Comment 41 (assignee) • 15 years ago
We had some infrastructure problems yesterday, cvs, hg, bugzilla, wiki all being down for a while. Let's keep our eye out for more of these.
Comment 42 • 15 years ago

(In reply to comment #41)
> We had some infrastructure problems yesterday, cvs, hg, bugzilla, wiki all
> being down for a while.

CVS and Hg are both not behind the netscaler at all. You should not have experienced any issues with either of them yesterday when the netscaler failed over. Did you?
Comment 43 (assignee) • 15 years ago

(In reply to comment #42)
> (In reply to comment #41)
> > We had some infrastructure problems yesterday, cvs, hg, bugzilla, wiki all
> > being down for a while.
>
> CVS and Hg are both not behind the netscaler at all. You should not have
> experienced any issues with either of them yesterday when the netscaler
> failed over. Did you?

We had several talos builds fail to checkout code from CVS yesterday.
Comment 44 • 15 years ago

(In reply to comment #43)
> We had several talos builds fail to checkout code from CVS yesterday.

Checkout from cvs.mozilla.org or cvs-mirror.mozilla.org? cvs-mirror.mozilla.org is actually behind the netscaler, but cvs.mozilla.org is not.
Comment 45 (assignee) • 15 years ago

(In reply to comment #44)
> Checkout from cvs.mozilla.org or cvs-mirror.mozilla.org? cvs-mirror.mozilla.org
> is actually behind the netscaler, but cvs.mozilla.org is not.

From cvs-mirror.m.o.
Comment 46 • 15 years ago
http://tinderbox.mozilla.org/showlog.cgi?log=Firefox/1249939966.1249940631.15087.gz&fulltext=1
Comment 47 • 15 years ago

Another one:
http://tinderbox.mozilla.org/showlog.cgi?log=Firefox/1249940454.1249941366.23198.gz
Updated by assignee • 15 years ago
Assignee: catlee → anodelman
Comment 48 • 15 years ago

We're currently not getting much error reporting to tell where the fault lies; the work in bug 509604 should give us more to go on.
Updated • 15 years ago
Priority: -- → P2
Comment 49 • 15 years ago

Now that error reporting is up and working, we're just waiting around for another failure so that I can see what's going wrong.
Updated • 15 years ago
Summary: p-m having issues posting to graphs-old.m.o → p-m having issues posting to graphs.m.o
Comment 50 • 15 years ago

A couple of post errors:

http://tinderbox.mozilla.org/showlog.cgi?log=Firefox/1250394830.1250398485.16955.gz

Encountered error when trying to post refcnt_leaks
[Failure instance: Traceback (failure with no frames): <class 'twisted.internet.defer.TimeoutError'>: Getting http://graphs.mozilla.org/server/collect.cgi took longer than 120 seconds.]
at Sat Aug 15 21:48:32, and

Encountered error when trying to post trace_malloc_allocs
[Failure instance: Traceback (failure with no frames): <class 'twisted.internet.defer.TimeoutError'>: Getting http://graphs.mozilla.org/server/collect.cgi took longer than 120 seconds.]
(three times) at Sat Aug 15 21:51:24.

And also

http://tinderbox.mozilla.org/showlog.cgi?log=Firefox/1250397117.1250398199.13868.gz

Encountered error when trying to post codesighs
[Failure instance: Traceback (failure with no frames): <class 'twisted.internet.defer.TimeoutError'>: Getting http://graphs.mozilla.org/server/collect.cgi took longer than 120 seconds.]
at Sat Aug 15 21:47:35 PDT.
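The failures above are client-side timeouts: the master abandons its POST to collect.cgi after 120 seconds. As a minimal sketch of that failure mode (plain stdlib Python rather than the real Buildbot/Twisted code, with a made-up payload and a deliberately slow local server standing in for an overloaded collect.cgi):

```python
import http.server
import socket
import threading
import time
import urllib.error
import urllib.request

class SlowHandler(http.server.BaseHTTPRequestHandler):
    """Toy stand-in for an overloaded collect.cgi: it responds slower than the client waits."""
    def do_POST(self):
        time.sleep(1.0)  # slower than the 0.2 s client timeout below
        try:
            self.send_response(200)
            self.end_headers()
        except (BrokenPipeError, ConnectionResetError):
            pass  # client already gave up
    def log_message(self, *args):
        pass  # keep the demo quiet

def post_with_timeout(url, body, timeout):
    """POST and report a timeout instead of hanging, mirroring the master's log message."""
    try:
        with urllib.request.urlopen(url, body, timeout=timeout) as resp:
            return resp.status
    except (socket.timeout, urllib.error.URLError):
        return f"Getting {url} took longer than {timeout} seconds."

server = http.server.HTTPServer(("127.0.0.1", 0), SlowHandler)
threading.Thread(target=server.serve_forever, daemon=True).start()
url = f"http://127.0.0.1:{server.server_port}/collect.cgi"
result = post_with_timeout(url, b"branch=Firefox&value=42", timeout=0.2)
server.shutdown()
```

This reproduces the shape of the `twisted.internet.defer.TimeoutError` messages in the logs; the fix space is then either making the server respond faster or making the client retry or tolerate the failure.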
Comment 51 • 15 years ago

Spot check of dm-graphs01 right now is showing no real issues; however, I do note that it's running in a VM. Maybe it's time for real hardware.
Comment 52 • 15 years ago

(In reply to comment #51)
> Spot check of dm-graphs01 right now is showing no real issues, however I do
> note that it's running in a VM. Maybe it's time for real hardware.

On the phone with alice right now:

1) Before we start considering hardware, etc., did this VM hit any performance limits or trigger any alarms in VMware?
2) One theory is that the talos slaves and cruncher are both intermittently spiking load on graphs.m.o. Changing how cruncher gathers historical data might reduce load significantly. Also, Alice has seen occasions where talos slaves posting to graphs.m.o fail to post data yet still report green/success, which could be caused by the same spike in load. The visible symptom is that the green talos build on the waterfall has no link to the graph server.
Comment 53 • 15 years ago

A couple more:
http://production-master.build.mozilla.org:8010/builders/OS%20X%2010.5.2%20mozilla-1.9.2%20leak%20test%20build/builds/27
http://production-master.build.mozilla.org:8010/builders/WINNT%205.2%20mozilla-central%20leak%20test%20build/builds/3133

Same errors as nthomas mentioned in comment 50.
Comment 54 • 15 years ago

Hit this on an m-c build machine:
http://tinderbox.mozilla.org/showlog.cgi?log=Firefox/1251886388.1251894000.9007.gz
Comment 55 • 15 years ago
http://tinderbox.mozilla.org/showlog.cgi?log=TraceMonkey/1252709267.1252710049.18373.gz&fulltext=1
Comment 56 • 15 years ago

http://tinderbox.mozilla.org/showlog.cgi?log=Firefox/1252888172.1252889239.7318.gz
OS X 10.5.2 mozilla-central build on 2009/09/13 17:29:32
Comment 57 • 15 years ago

I don't believe that we've been seeing any talos graph server posting errors come through, so we just need to implement the same graph-sending code that talos uses, and hopefully this will clear up.
Comment 58 (assignee) • 15 years ago
I guess this is mine now. Working on dependent bug 516773 to fix this.
Assignee: anodelman → catlee
Comment 59 • 15 years ago
Any new occurrences?
Comment 60 (assignee) • 15 years ago
Going to call this fixed, I haven't seen any new occurrences since bug 516773 landed.
Status: REOPENED → RESOLVED
Closed: 15 years ago → 15 years ago
Resolution: --- → FIXED
Comment 61 • 15 years ago

Had an occurrence a few days ago (I failed to grab the log). We need longer delays between retries, and possibly more retries.
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
Comment 62 • 15 years ago

(In reply to comment #61)
> Need long times between re-tries and possibly more re-tries.

Will that tie up the machine until it can submit? At some point we just have to say "graphs is too slow and it's not our problem".
Comment 63 • 15 years ago

http://tinderbox.mozilla.org/showlog.cgi?log=Firefox/1255416057.1255424088.9667.gz
WINNT 5.2 mozilla-central leak test build on 2009/10/12 23:40:57
Comment 64 (assignee) • 15 years ago
This will try for up to 20 minutes to do the graph server post (was 5 minutes before).
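The actual schedule lives in the buildbotcustom patch ("Bump up retries to 8"); as a rough sketch of the idea, assuming a doubling back-off (the delay numbers below are illustrative, not taken from the patch), 8 attempts can cover roughly 20 minutes:

```python
import time

def post_with_retries(post_once, max_tries=8, first_delay=10, factor=2, sleep=time.sleep):
    """Call post_once() until it succeeds, doubling the delay between tries.

    With max_tries=8 and first_delay=10s the waits are
    10+20+40+80+160+320+640 = 1270 s, i.e. about 21 minutes of retrying.
    This schedule is an assumption; the real patch may use different numbers.
    """
    delay = first_delay
    for attempt in range(1, max_tries + 1):
        try:
            return post_once()
        except Exception:
            if attempt == max_tries:
                raise  # out of retries; let the failure turn the build red
            sleep(delay)
            delay *= factor

# Total back-off time for 8 tries (7 waits between them):
total = sum(10 * 2**i for i in range(7))  # 1270 seconds, about 21 minutes
```

With numbers like these, a post that keeps timing out ties up a build step for about 20 minutes before failing, which is exactly the trade-off raised in comment 62.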
Attachment #408007 - Flags: review?(bhearsum)
Updated • 15 years ago
Attachment #408007 - Flags: review?(bhearsum) → review+
Comment 65 • 15 years ago

Comment on attachment 408007 [details] [diff] [review]
Bump up retries to 8

http://hg.mozilla.org/build/buildbotcustom/rev/54cef4dc3faf

pm & pm02 reconfig'd.
Attachment #408007 - Flags: review+
Updated • 15 years ago
Attachment #408007 - Flags: review+ → checked-in+
Comment 66 (assignee) • 15 years ago
Going to call this Fixed. Re-open if we hit it again.
Status: REOPENED → RESOLVED
Closed: 15 years ago → 15 years ago
Resolution: --- → FIXED
Updated • 12 years ago
Keywords: intermittent-failure
Updated • 12 years ago
Whiteboard: [orange]
Updated • 11 years ago
Product: mozilla.org → Release Engineering