493623 - p-m having issues posting to graphs.m.o

Reporter

Description

•

16 years ago

Production-master is timing out posting to graphs.m.o. Talos looks like it's reporting there ok. Not sure why this is.
- graphs.m.o is slow to load for me, though I'm not sure if that's more than normal or not.
- production-master had issues this morning and had to be kicked.
- odd DHCPREQUESTs in production-master:/var/log/messages
- uh, [13:42] <nagios> [51] dm-graphs01:avg load is WARNING: WARNING - load average: 18.12, 23.93, 22.51
That might possibly be the culprit.

Aki Sasaki (not active)

Reporter

Updated

•

16 years ago

Severity: normal → critical

Nick Thomas [:nthomas] (UTC+12)

Comment 1

•

16 years ago

We haven't hit this in the last few hours, lowering severity.

Severity: critical → major

Nick Thomas [:nthomas] (UTC+12)

Comment 2

•

16 years ago

This is still happening sporadically. E.g. we had a build [1] that failed to post from production-master to graphs.m.o in this window:
Start Wed May 20 02:35:46 2009
End Wed May 20 02:36:20 2009
At the same time as End all our hg pollers fired, which might be related. Other graph server posts take just a few seconds and this one is timing out in 30. Please check the load on the graph server and storage latency in that time window.
[1] http://tinderbox.mozilla.org/showlog.cgi?log=Firefox3.5/1242809760.1242812223.3642.gz
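As a quick check on the window quoted in this comment, here is a minimal sketch (Python, standard library only) that computes its length from the two timestamps. The helper name `window_seconds` is hypothetical, and the parsing assumes an English locale for the day and month names.

```python
from datetime import datetime

# Format of the Start/End timestamps as they appear in the build log excerpt.
FMT = "%a %b %d %H:%M:%S %Y"


def window_seconds(start: str, end: str) -> float:
    """Return the length of the window between two log timestamps, in seconds."""
    delta = datetime.strptime(end, FMT) - datetime.strptime(start, FMT)
    return delta.total_seconds()


elapsed = window_seconds("Wed May 20 02:35:46 2009", "Wed May 20 02:36:20 2009")
print(elapsed)
```

The window is 34 seconds long, slightly more than the 30-second timeout. That would fit a post that ran until it timed out, assuming Start and End bracket the post step.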

matthew zeier [:mrz]

Updated

•

16 years ago

Assignee: server-ops → thardcastle

chizu

Comment 3

•

16 years ago

I think we're just stacking up processes a bit much. Adding another VCPU to take advantage of some CPU time on another physical core would probably clear that up. Needs some brief VM downtime to enable.

Status: NEW → ASSIGNED

Flags: needs-downtime+

matthew zeier [:mrz]

Comment 4

•

16 years ago

Anyone care if we do this upgrade tomorrow (06/16) night?

Aki Sasaki (not active)

Reporter

Comment 5

•

16 years ago

Downtime of which? Graphs, or p-m? Either way I'm pretty sure downtime tomorrow night is a no.

John O'Duinn [:joduinn] (please use "needinfo?" flag)

Comment 6

•

16 years ago

(In reply to comment #2)
> Please check the load on the graph server and storage latency in that time
> window.
Were there any alerts about load / storage latency at this time?... for either graph server, or production-master?

John O'Duinn [:joduinn] (please use "needinfo?" flag)

Comment 7

•

16 years ago

(In reply to comment #4)
> Anyone care if we do this upgrade tomorrow (06/16) night?
A downtime at this point in the release would be tough. Recall that power-cycling production-master, even for just a few seconds, will require a multi-hour downtime while we cycle new builds, unittests through to verify all is working. Depending on release schedule, we were hoping to arrange a downtime with Dev later this week / early next week, when things calm down, but a downtime before that will be tough.
If you feel we *need* to do this, we should go through our options at the IT/RelEng portion of tmrw's 1pm meeting.
0) Do we know if this is a problem with graph server or with production-master? Or both?
1) What exactly are you suggesting doing? Adding another VCPU to the ESX server running production-master? How confident are we that this extra VCPU would fix this problem?
2) Is there anything *else* that could be migrated off the ESX server instead, to reduce load?
3) Has anything changed recently on the ESX server causing this to start happening now?

chizu

Comment 8

•

16 years ago

0) dm-graphs01 is where the VCPU gets added, the VM has to be rebooted to make the change.
1) From watching the load on dm-graphs01 over a couple days, it's bound by multiple processes competing for CPU time. Most of the time, the ESX server has more CPU cores available, so a VCPU would split the load.
2) This ESX server isn't overloaded, just this VM.
3) The dm-graphs01 load appears to have been rising.

matthew zeier [:mrz]

Comment 9

•

16 years ago

Talked about this yesterday... RelEng will tell us when we can take downtime to do this upgrade. Reassign when you're ready.

Assignee: thardcastle → nobody

Component: Server Operations → Release Engineering

Flags: needs-downtime+

QA Contact: mrz → release

Dave Miller [:justdave]

Comment 10

•

16 years ago

Upped VCPU from 1 to 2. Upped RAM from 728K to 1G because 728K sounded low to me for a web application server.

Assignee: nobody → server-ops

Status: ASSIGNED → RESOLVED

Closed: 16 years ago

Component: Release Engineering → Server Operations

QA Contact: release → mrz

Resolution: --- → FIXED

Dave Miller [:justdave]

Comment 11

•

16 years ago

er, I did mean 728M not 728K there ;)

Chris Cooper [:coop] (he/him)

Comment 13

•

16 years ago

We saw more post failures on Monday after these changes had been made. Here are the logs linked in bug 499680:
http://tinderbox.mozilla.org/showlog.cgi?log=Firefox/1245676090.1245678112.23641.gz
http://tinderbox.mozilla.org/showlog.cgi?log=Firefox/1245676090.1245678153.23717.gz
http://tinderbox.mozilla.org/showlog.cgi?log=Firefox/1245676287.1245678528.24302.gz

Status: RESOLVED → REOPENED

Resolution: FIXED → ---

matthew zeier [:mrz]

Comment 14

•

16 years ago

oremj, can you get munin up on production-master (or help RE do so) and on graphs? Without understanding the cause it's tough to understand how to really fix :|

Assignee: server-ops → oremj

Chris AtLee [:catlee]

Assignee

Comment 15

•

16 years ago

(In reply to comment #14)
> oremj, can you get munin up on production-master (or help RE do so) and on
> graphs?
This is bug 500058.

Jeremy Orem [:oremj]

Comment 16

•

16 years ago

Munin is running on production master now.

Jeremy Orem [:oremj]

Updated

•

16 years ago

Assignee: oremj → server-ops

matthew zeier [:mrz]

Comment 17

•

16 years ago

Since Munin, has there been any repeat occurrence?

bhearsum@mozilla.com (:bhearsum)

Comment 18

•

16 years ago

Actually...we had a bunch of failures (6 or so) between 2:13 and 2:15pm today.

bhearsum@mozilla.com (:bhearsum)

Comment 19

•

16 years ago

And Munin shows a load spike around that time (hard to see exactly when because of the scale): http://nm-dash01.nms.mozilla.org/munin/build/production-master.build.mozilla.org-cpu.html

matthew zeier [:mrz]

Comment 20

•

16 years ago

Can you correlate that to some process?

Jesse Ruderman

Updated

•

16 years ago

Whiteboard: [orange]

Jesse Ruderman

Updated

•

16 years ago

Blocks: 438871

bhearsum@mozilla.com (:bhearsum)

Comment 21

•

16 years ago

(In reply to comment #20)
> Can you correlate that to some process?
I wasn't logged on to the machine at the time, but it was almost certainly the Buildbot process causing it.

matthew zeier [:mrz]

Comment 22

•

16 years ago

What's the fix for that then? If buildbot is chewing CPU do we need to look at moving this to real hardware?

Chris AtLee [:catlee]

Assignee

Comment 23

•

16 years ago

(In reply to comment #22)
> What's the fix for that then? If buildbot is chewing CPU do we need to look at
> moving this to real hardware?
I wish we knew! Moving p-m to the new 3GHz cluster will probably help. Adding another virtual CPU may also help.

Nick Thomas [:nthomas] (UTC+12)

Comment 24

•

16 years ago

Bug 501255 for that.

matthew zeier [:mrz]

Comment 25

•

16 years ago

(In reply to comment #24)
> Bug 501255 for that.
ETA for that fix? Can I close this one or dup it to "upgrade p-m"?

Jesse Ruderman

Comment 26

•

16 years ago

"Failed graph server post" has been a top cause of red this week.

Aravind Gottipati [:aravind]

Comment 27

•

16 years ago

Please re-assign back to IT if there is something for us to do here.

Assignee: server-ops → nobody

Component: Server Operations → Release Engineering

QA Contact: mrz → release

Aki Sasaki (not active)

Reporter

Comment 28

•

16 years ago

Marking bug 501255 as a dependency and futuring.

Component: Release Engineering → Release Engineering: Future

Depends on: 501255

Chris AtLee [:catlee]

Assignee

Comment 30

•

16 years ago

I believe the build machines are still posting results to the old graph server, so putting a dependency on bug 476208.

Depends on: 476208

Chris AtLee [:catlee]

Assignee

Comment 31

•

16 years ago

Also, if bug 476208 doesn't solve this, then we should move the graph server post onto the build slaves instead of in the master. Will file separate bug for that if required.

Benjamin Smedberg

Comment 32

•

16 years ago

Pretty much all the m-c and 1.9.1 builds are currently red due to failed graphserver post.

Chris AtLee [:catlee]

Assignee

Comment 33

•

16 years ago

This is now a major problem for keeping the tree green, moving out of the Future pool.

Assignee: nobody → catlee

Component: Release Engineering: Future → Release Engineering

Nick Thomas [:nthomas] (UTC+12)

Comment 34

•

16 years ago

Attached patch: Allow longer to submit to the old graph server

This is a stopgap until Alice's patch to send the data to the new graph server is ready to go.

Attachment #389836 - Flags: review?

Nick Thomas [:nthomas] (UTC+12)

Updated

•

16 years ago

Attachment #389836 - Flags: review? → review?(aki)

Aki Sasaki (not active)

Reporter

Updated

•

16 years ago

Attachment #389836 - Flags: review?(aki) → review+

Nick Thomas [:nthomas] (UTC+12)

Updated

•

16 years ago

Summary: p-m having issues posting to graphs.m.o → p-m having issues posting to graphs-old.m.o

Mike Shaver (:shaver emeritus)

Comment 35

•

16 years ago

This is stopping us basically dead right now, due to red trees, and we're trying to get things finished up for branching 1.9.2. Please let me know if I can help, or if there is news to share with developers. I am running out of shiny things with which to distract them!

Severity: major → blocker

Chris AtLee [:catlee]

Assignee

Comment 36

•

16 years ago

(In reply to comment #35)
> This is stopping us basically dead right now, due to red trees, and we're
> trying to get things finished up for branching 1.9.2. Please let me know if I
> can help, or if there is news to share with developers. I am running out of
> shiny things with which to distract them!
We're in the middle of a downtime right now trying to get this resolved. If the current approach doesn't work we can have a different approach ready for the end of the week.
We could also make failed graph server posts non-fatal. Is that a viable option?
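To make the "non-fatal" option concrete, here is a minimal sketch in plain Python of how a build's overall result could be derived when a failed graph server post is downgraded from a failure to a warning. It is not the Buildbot configuration used on production-master; the function `build_result`, its parameters, and the result strings are all hypothetical.

```python
# Hypothetical result labels, loosely mirroring the green/orange/red tree states.
SUCCESS, WARNINGS, FAILURE = "success", "warnings", "failure"


def build_result(build_ok: bool, post_ok: bool, post_is_fatal: bool) -> str:
    """Combine the build outcome and the graph server post outcome.

    When post_is_fatal is False, a failed post only downgrades the
    result to WARNINGS instead of turning the whole build red.
    """
    if not build_ok:
        return FAILURE
    if post_ok:
        return SUCCESS
    return FAILURE if post_is_fatal else WARNINGS


# A good build whose post timed out: red when fatal, a warning otherwise.
print(build_result(True, False, post_is_fatal=True))
print(build_result(True, False, post_is_fatal=False))
```

The trade-off in this sketch is that a build with a failed post no longer closes the tree, but its data point is still missing from the graphs, so the warning state has to stay visible.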

alice nodelman [:alice] [:anode]

Comment 37

•

16 years ago

Since the landings Wednesday morning have we seen any additional failures here?

Severity: blocker → major

Nick Thomas [:nthomas] (UTC+12)

Comment 38

•

16 years ago

Not that I can see on mozilla-central or mozilla-1.9.1.

Chris AtLee [:catlee]

Assignee

Comment 39

•

16 years ago

Haven't seen any problems on mozilla-central or mozilla-1.9.1 since the 23rd.

Status: REOPENED → RESOLVED

Closed: 16 years ago → 16 years ago

Resolution: --- → FIXED

Chris Cooper [:coop] (he/him)

Comment 40

•

16 years ago

Had another failure today: http://tinderbox.mozilla.org/showlog.cgi?log=Firefox/1249421520.1249422616.23476.gz

Status: RESOLVED → REOPENED

Resolution: FIXED → ---

Chris AtLee [:catlee]

Assignee

Comment 41

•

16 years ago

We had some infrastructure problems yesterday, cvs, hg, bugzilla, wiki all being down for a while. Let's keep our eye out for more of these.

Reed Loden [:reed]

Comment 42

•

16 years ago

(In reply to comment #41)
> We had some infrastructure problems yesterday, cvs, hg, bugzilla, wiki all
> being down for a while.
CVS and Hg are both not behind the netscaler at all. You should not have experienced any issues with either of them yesterday when the netscaler failed over. Did you?

Chris AtLee [:catlee]

Assignee

Comment 43

•

16 years ago

(In reply to comment #42)
> (In reply to comment #41)
> > We had some infrastructure problems yesterday, cvs, hg, bugzilla, wiki all
> > being down for a while.
>
> CVS and Hg are both not behind the netscaler at all. You should not have
> experienced any issues with either of them yesterday when the netscaler failed
> over. Did you?
We had several talos builds fail to checkout code from CVS yesterday.

Reed Loden [:reed]

Comment 44

•

16 years ago

(In reply to comment #43)
> We had several talos builds fail to checkout code from CVS yesterday.
Checkout from cvs.mozilla.org or cvs-mirror.mozilla.org? cvs-mirror.mozilla.org is actually behind the netscaler, but cvs.mozilla.org is not.

Chris AtLee [:catlee]

Assignee

Comment 45

•

16 years ago

(In reply to comment #44)
> (In reply to comment #43)
> > We had several talos builds fail to checkout code from CVS yesterday.
>
> Checkout from cvs.mozilla.org or cvs-mirror.mozilla.org? cvs-mirror.mozilla.org
> is actually behind the netscaler, but cvs.mozilla.org is not.
From cvs-mirror.m.o.

David Dahl :ddahl

Comment 46

•

16 years ago

http://tinderbox.mozilla.org/showlog.cgi?log=Firefox/1249939966.1249940631.15087.gz&fulltext=1

David Dahl :ddahl

Comment 47

•

16 years ago

another one: http://tinderbox.mozilla.org/showlog.cgi?log=Firefox/1249940454.1249941366.23198.gz

Chris AtLee [:catlee]

Assignee

Updated

•

16 years ago

Assignee: catlee → anodelman

alice nodelman [:alice] [:anode]

Updated

•

16 years ago

Depends on: 509604

alice nodelman [:alice] [:anode]

Comment 48

•

16 years ago

We're currently not getting much error reporting to tell where the fault lies, the work in bug 509604 should give us more to go on.

John O'Duinn [:joduinn] (please use "needinfo?" flag)

Updated

•

16 years ago

Priority: -- → P2

alice nodelman [:alice] [:anode]

Comment 49

•

16 years ago

Now that error reporting is up and working we're just waiting around for another failure so that I can see what's going wrong.

alice nodelman [:alice] [:anode]

Updated

•

16 years ago

Summary: p-m having issues posting to graphs-old.m.o → p-m having issues posting to graphs.m.o

Nick Thomas [:nthomas] (UTC+12)

Comment 50

•

16 years ago

A couple of post errors:
http://tinderbox.mozilla.org/showlog.cgi?log=Firefox/1250394830.1250398485.16955.gz
Encountered error when trying to post refcnt_leaks [Failure instance: Traceback (failure with no frames): <class 'twisted.internet.defer.TimeoutError'>: Getting http://graphs.mozilla.org/server/collect.cgi took longer than 120 seconds. ]
at Sat Aug 15 21:48:32, and
Encountered error when trying to post trace_malloc_allocs [Failure instance: Traceback (failure with no frames): <class 'twisted.internet.defer.TimeoutError'>: Getting http://graphs.mozilla.org/server/collect.cgi took longer than 120 seconds. ]
(three times) at Sat Aug 15 21:51:24.
And also
http://tinderbox.mozilla.org/showlog.cgi?log=Firefox/1250397117.1250398199.13868.gz
Encountered error when trying to post codesighs [Failure instance: Traceback (failure with no frames): <class 'twisted.internet.defer.TimeoutError'>: Getting http://graphs.mozilla.org/server/collect.cgi took longer than 120 seconds. ]
at Sat Aug 15 21:47:35 PDT.

Dave Miller [:justdave]

Comment 51

•

16 years ago

Spot check of dm-graphs01 right now is showing no real issues, however I do note that it's running in a VM. Maybe it's time for real hardware.

John O'Duinn [:joduinn] (please use "needinfo?" flag)

Comment 52

•

16 years ago

(In reply to comment #51)
> Spot check of dm-graphs01 right now is showing no real issues, however I do
> note that it's running in a VM. Maybe it's time for real hardware.
On phone with alice right now:
1) Before we start considering hardware, etc, did this VM hit any performance limits or trigger any alarms in VMware?
2) One theory is that the talos slaves and cruncher are both intermittently spiking load on graphs.m.o. Changing how cruncher gathers historical data might reduce load significantly. Also, Alice has on occasion seen talos slaves fail to post data to graphs.m.o, yet still report green/success, which could be caused by the same spike in load. The visible symptom is that the green talos build on waterfall has no link to graphserver.

Armen [:armenzg]

Comment 53

•

16 years ago

A couple more:
http://production-master.build.mozilla.org:8010/builders/OS%20X%2010.5.2%20mozilla-1.9.2%20leak%20test%20build/builds/27
http://production-master.build.mozilla.org:8010/builders/WINNT%205.2%20mozilla-central%20leak%20test%20build/builds/3133
Same errors as nthomas mentioned in comment 50.

alice nodelman [:alice] [:anode]

Updated

•

16 years ago

Depends on: 513960

Lukas Blakk [:lsblakk] use ?needinfo

Comment 54

•

16 years ago

hit this on a m-c build machine: http://tinderbox.mozilla.org/showlog.cgi?log=Firefox/1251886388.1251894000.9007.gz

Jeff Walden [:Waldo]

Comment 55

•

16 years ago

http://tinderbox.mozilla.org/showlog.cgi?log=TraceMonkey/1252709267.1252710049.18373.gz&fulltext=1

Daniel Holbert [:dholbert]

Comment 56

•

16 years ago

http://tinderbox.mozilla.org/showlog.cgi?log=Firefox/1252888172.1252889239.7318.gz
OS X 10.5.2 mozilla-central build on 2009/09/13 17:29:32

alice nodelman [:alice] [:anode]

Comment 57

•

16 years ago

I don't believe that we've been seeing any talos graph server posting errors come through, so we just need to implement the same graph sending code that talos uses and hopefully this will clear up.

alice nodelman [:alice] [:anode]

Updated

•

16 years ago

Depends on: 516773

Chris AtLee [:catlee]

Assignee

Comment 58

•

16 years ago

I guess this is mine now. Working on dependent bug 516773 to fix this.

Assignee: anodelman → catlee

alice nodelman [:alice] [:anode]

Comment 59

•

16 years ago

Any new occurrences?

Chris AtLee [:catlee]

Assignee

Comment 60

•

16 years ago

Going to call this fixed, I haven't seen any new occurrences since bug 516773 landed.

Status: REOPENED → RESOLVED

Closed: 16 years ago → 16 years ago

Resolution: --- → FIXED

alice nodelman [:alice] [:anode]

Comment 61

•

16 years ago

Had an occurrence a few days ago (I failed to grab the log). Need long times between re-tries and possibly more re-tries.

Status: RESOLVED → REOPENED

Resolution: FIXED → ---

Nick Thomas [:nthomas] (UTC+12)

Comment 62

•

16 years ago

(In reply to comment #61)
> Need long times between re-tries and possibly more re-tries.
Will that tie up the machine until it can submit? At some point we just have to say "graphs is too slow and it's not our problem".

Daniel Holbert [:dholbert]

Comment 63

•

16 years ago

http://tinderbox.mozilla.org/showlog.cgi?log=Firefox/1255416057.1255424088.9667.gz
WINNT 5.2 mozilla-central leak test build on 2009/10/12 23:40:57

Chris AtLee [:catlee]

Assignee

Comment 64

•

16 years ago

Attached patch: Bump up retries to 8

This will try for up to 20 minutes to do the graph server post (was 5 minutes before).
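The attached patch itself is not reproduced in this thread. As a rough illustration of the retry idea only, here is a minimal sketch in plain Python, not the buildbotcustom code: `post_with_retries`, `worst_case_seconds`, and the 30-second delay are hypothetical. The 120-second per-attempt timeout is taken from the error messages in comment 50, and treating it as the timeout of every attempt is an assumption.

```python
import time
from typing import Callable, List


def post_with_retries(post: Callable[[], bool], retries: int, delay: float,
                      sleep: Callable[[float], None] = time.sleep) -> bool:
    """Call post() up to `retries` times, sleeping `delay` seconds between failures."""
    for attempt in range(1, retries + 1):
        if post():
            return True
        if attempt < retries:  # no point sleeping after the final attempt
            sleep(delay)
    return False


def worst_case_seconds(retries: int, timeout: float, delay: float) -> float:
    """Upper bound on total time if every attempt runs into its timeout."""
    return retries * timeout + (retries - 1) * delay


# Usage with a fake post that fails twice and then succeeds; sleeps are
# recorded instead of performed so the example finishes immediately.
outcomes = iter([False, False, True])
slept: List[float] = []
ok = post_with_retries(lambda: next(outcomes), retries=8, delay=30.0,
                       sleep=slept.append)
print(ok, slept)
print(worst_case_seconds(retries=8, timeout=120.0, delay=30.0) / 60.0)
```

Under these assumptions, eight attempts at 120 seconds each already account for 16 minutes, so the rest of the 20-minute budget would be waiting time between attempts. The actual delays in the patch may differ.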

Attachment #408007 - Flags: review?(bhearsum)

bhearsum@mozilla.com (:bhearsum)

Updated

•

16 years ago

Attachment #408007 - Flags: review?(bhearsum) → review+

Nick Thomas [:nthomas] (UTC+12)

Comment 65

•

16 years ago

Comment on attachment 408007 [details] [diff] [review]
Bump up retries to 8
http://hg.mozilla.org/build/buildbotcustom/rev/54cef4dc3faf
pm & pm02 reconfig'd.

Attachment #408007 - Flags: review+

Nick Thomas [:nthomas] (UTC+12)

Updated

•

16 years ago

Attachment #408007 - Flags: review+ → checked-in+

Chris AtLee [:catlee]

Assignee

Comment 66

•

16 years ago

Going to call this Fixed. Re-open if we hit it again.

Status: REOPENED → RESOLVED

Closed: 16 years ago → 16 years ago

Resolution: --- → FIXED

Nobody; OK to take it and work on it

Updated

•

13 years ago

Keywords: intermittent-failure

Nobody; OK to take it and work on it

Updated

•

13 years ago

Whiteboard: [orange]

Nobody; OK to take it and work on it

Updated

•

12 years ago

Product: mozilla.org → Release Engineering

Allow longer to submit to the old graph server; 16 years ago; Nick Thomas [:nthomas] (UTC+12); 989 bytes, patch; mozilla: review+
Bump up retries to 8; 16 years ago; Chris AtLee [:catlee]; 1.21 KB, patch; bhearsum: review+, nthomas: checked-in+