Closed Bug 710427 Opened 13 years ago Closed 13 years ago

Download rates from stage.m.o are much slower than normal

Categories

(mozilla.org Graveyard :: Server Operations, task)

Hardware: x86
OS: All
Type: task
Priority: Not set
Severity: blocker

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: nthomas, Assigned: ravi)


This is currently blocking us from pushing 3.6.25 to the beta channel. If we make a request like http://stage-old.mozilla.org/pub/mozilla.org/firefox/nightly/3.6.25-candidates/build1/update/linux-i686/mk/firefox-3.6.25.complete.mar from a build slave like mv-moz2-linux-ix-slave18.build.m.o, it takes at least 60 seconds to complete, when historically it takes just a few seconds. I still need to confirm this, but I suspect it is not limited to our slaves in mtv1 and affects machines in sjc1 as well. RelEng is generating a lot of requests right now because of overlapping release processes, but looking at https://ganglia.mozilla.org/sjc1/?r=week&c=Webtools&h=surf.mozilla.org&mc=2 the load is not unusual by historical standards, yet the response time is.
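For reference, a minimal sketch of the kind of timing check described here (assumptions: Python 3 available on the build slave; the URL is the one from this report):

```python
import time
import urllib.request

URL = ("http://stage-old.mozilla.org/pub/mozilla.org/firefox/nightly/"
       "3.6.25-candidates/build1/update/linux-i686/mk/firefox-3.6.25.complete.mar")

start = time.time()
with urllib.request.urlopen(URL) as resp:
    data = resp.read()
elapsed = time.time() - start

# Report size and effective throughput; a few seconds is normal, 60+ s is not.
print("%d bytes in %.1f s (%.1f KB/s)" % (len(data), elapsed, len(data) / elapsed / 1024))
```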
(In reply to Nick Thomas [:nthomas] from comment #0)
> This is currently blocking us pushing 3.6.25 to the beta channel.

Currently this is blocking our testing prior to the push; the jobs are taking 3+ hours instead of 30 minutes. But if it's a netapp issue then ftp.m.o will be impacted too, and we won't be able to push to users either.
Assignee: server-ops → dgherman
Turns out the affected machines are in MTV, and the mtv1-sjc1 link failed over to a backup with 10x less capacity. ravi is investigating that, and I'm looking for ways to shift our traffic onto the sjc1-scl1 path.
Should we just go to beta tomorrow? It is almost 4pm in mv right now.
(In reply to Al Billings [:abillings] from comment #3)
> Should we just go to beta tomorrow? It is almost 4pm in mv right now.

Yep - let's do this tomorrow. Still on schedule, and it sounds like we'll be in better shape to push then.
Assignee: dgherman → ravi
Additional info from IRC discussions before this bug was filed, adding here for completeness:

1) At approx noon PT, hwine asked in #ops if there was a reason the Apple -> foopy18.mtv1 link would be blocked/slow. A 1.8GB transfer was given an ETC of 3 hours. At approx 2:20pm PST this was raised in #ops again.
2) There are two releases in progress, which is busy but not unusually high load from RelEng.
3) Release automation is seeing slow downloads from stage-old.m.o (aka stage, surf).
** One example of a file which had slow downloads is http://stage-old.mozilla.org/pub/mozilla.org/firefox/nightly/3.6.25-candidates/build1/update/win32/es-AR/firefox-3.6.25.complete.mar
** Approx 70KB/s averaged over a 10MB file.
** CPU load on stage-old seems OK.
** Questions about 10.253.0.11 (mpt-netapp-b).
** We believe the RelEng machines being impacted are in mtv1 and sjc1 (turned out to be mtv1 only).
4) Some (unrelated) complaints of slow sjc1 transfer speeds in #it.
5) CPU wio of stage is fairly high, although not as bad as it has been in the last week - https://ganglia.mozilla.org/sjc1/?r=day&c=Webtools&h=surf.mozilla.org&mc=2
6) Status of mpt-netapp-b:
   a) CPU isn't too terribly high.
   b) lerxst sees an NFS read latency of 35748 msec (wasn't reproducible on machines mounting partitions from the netapp; see the latency-check sketch after this list).
7) RelEng started using machines in scl1 to work around this problem, but the mtv1 link was swung back to full speed before that got far.
8) The release schedule for getting 3.6.25 on the beta channel was put back a day to adjust, but it had been running a day early before this episode.

Open questions:
** Is there sufficient monitoring for ftp.m.o? ganglia?
** Some discussions in IRC about whether the link between mtv1 and sjc1 hit capacity and is causing this delay. Is there sufficient monitoring on this link?
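As referenced in item 6b, a rough sketch of a client-side NFS read-latency check (assumptions: run on a host that has the filer mounted; the mount path is a placeholder, and the file should not already be in the page cache or the number is meaningless):

```python
import time

# Placeholder path on an NFS mount backed by mpt-netapp-b; point this at a real,
# uncached file before drawing any conclusions.
PATH = "/mnt/netapp/some-large-file"
CHUNK = 1024 * 1024  # read 1 MB

start = time.time()
with open(PATH, "rb") as f:
    f.read(CHUNK)
elapsed_ms = (time.time() - start) * 1000.0

# A healthy read of this size should take tens of milliseconds, not tens of seconds.
print("read %d KB in %.0f ms" % (CHUNK // 1024, elapsed_ms))
```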
> Open questions:
> ** Is there sufficient monitoring for ftp.m.o? ganglia?

Yes.

> ** Some discussions in IRC about whether the link between mtv1 and sjc1 hit
> capacity and is causing this delay. Is there sufficient monitoring on this
> link?

Yes. The issue here is that we have production-critical resources in a location that doesn't have the level of redundant capacity these systems require. What you're probably asking for is a notification mechanism that says "Mountain View suffered a failure of the 1GbE point-to-point and automatically failed over to the lower-capacity 100Mbps link, expect delays", not unlike what BART does when it experiences a service disruption.
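A minimal sketch of the kind of threshold alert described above, assuming a Linux host that can see the relevant interface counters (the interface name and capacity below are placeholders, not the actual mtv1-sjc1 configuration):

```python
import time

IFACE = "eth0"            # placeholder for the interface carrying mtv1-sjc1 traffic
LINK_BPS = 100 * 10**6    # assume the 100 Mbps backup link is the one in service
WARN_FRACTION = 0.9       # alert when sustained utilization passes 90% of capacity

def tx_bytes(iface):
    # Parse cumulative transmit bytes for one interface from /proc/net/dev.
    with open("/proc/net/dev") as f:
        for line in f:
            if line.strip().startswith(iface + ":"):
                fields = line.split(":", 1)[1].split()
                return int(fields[8])   # 9th field after the colon is TX bytes
    raise ValueError("interface not found: %s" % iface)

prev = tx_bytes(IFACE)
while True:
    time.sleep(10)
    cur = tx_bytes(IFACE)
    bps = (cur - prev) * 8 / 10.0       # bits per second over the sample window
    prev = cur
    if bps > WARN_FRACTION * LINK_BPS:
        print("WARNING: %s at %.0f Mbps, near %d Mbps capacity"
              % (IFACE, bps / 1e6, LINK_BPS // 10**6))
```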
The 1G link is showing >250M of sustained throughput, while the backup link is capped at 100M. It appears something on 6-Dec increased our egress traffic from MTV1, with the 7th being the previous peak. There was a 20M spike at or about 11:00 today, and the link remained at capacity until traffic was shifted back to the 1G.
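For a rough sense of scale, a back-of-the-envelope calculation using the figures above and the ~70KB/s / 10MB example from comment 5 (treating the link as the only bottleneck, which is a simplification):

```python
# Figures from this comment and comment 5; per-flow behavior is idealized.
demand_mbps = 250          # sustained egress seen on the 1 Gbps link
backup_mbps = 100          # cap on the backup link
file_bytes = 10 * 1024 * 1024
observed_kbps = 70         # per-file rate reported in comment 5 (KB/s)

oversubscription = demand_mbps / backup_mbps
transfer_s = file_bytes / (observed_kbps * 1024)
print("link oversubscribed %.1fx; 10 MB at %d KB/s takes ~%.0f s (~%.1f min)"
      % (oversubscription, observed_kbps, transfer_s, transfer_s / 60))
```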
Status: NEW → RESOLVED
Closed: 13 years ago
Resolution: --- → FIXED
Product: mozilla.org → mozilla.org Graveyard