Closed Bug 710427 Opened 13 years ago Closed 13 years ago

Download rates from stage.m.o are much slower than normal

Categories

(mozilla.org Graveyard :: Server Operations, task)

Hardware: x86
OS: All
Type: task
Priority: Not set
Severity: blocker

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: nthomas, Assigned: ravi)


This is currently blocking us from pushing 3.6.25 to the beta channel. If we make a request like http://stage-old.mozilla.org/pub/mozilla.org/firefox/nightly/3.6.25-candidates/build1/update/linux-i686/mk/firefox-3.6.25.complete.mar from a build slave like mv-moz2-linux-ix-slave18.build.m.o, it takes at least 60 seconds to complete, when historically it takes just a few seconds. I still need to confirm this, but I suspect it is not limited to our slaves in mtv1 and affects machines in sjc1 as well. RelEng is generating a lot of requests right now because of overlapping release processes, but looking at https://ganglia.mozilla.org/sjc1/?r=week&c=Webtools&h=surf.mozilla.org&mc=2 the load is not unusual by historical standards, yet the response time is.
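For reference, a minimal sketch of the kind of timing check described here (assumptions: Python 3 available on the build slave; the URL is the one from this report):

```python
import time
import urllib.request

URL = ("http://stage-old.mozilla.org/pub/mozilla.org/firefox/nightly/"
       "3.6.25-candidates/build1/update/linux-i686/mk/firefox-3.6.25.complete.mar")

start = time.time()
with urllib.request.urlopen(URL) as resp:
    data = resp.read()
elapsed = time.time() - start

# Report size and effective throughput; a few seconds is normal, 60+ s is not.
print("%d bytes in %.1f s (%.1f KB/s)" % (len(data), elapsed, len(data) / elapsed / 1024))
```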
(In reply to Nick Thomas [:nthomas] from comment #0)
> This is currently blocking us pushing 3.6.25 to the beta channel.

Currently this is blocking our testing prior to the push; the jobs are taking 3+ hours instead of 30 minutes. But if it's a netapp issue then ftp.m.o will be impacted too, and we won't be able to push to users either.
Assignee: server-ops → dgherman
Turns out the affected machines are in MTV, and the mtv1-sjc1 link failed over to a backup with 10x less capacity. ravi is investigating that, and I'm looking for ways to shift our traffic onto the sjc1-scl1 path.
Should we just go to beta tomorrow? It is almost 4pm in mv right now.
(In reply to Al Billings [:abillings] from comment #3)
> Should we just go to beta tomorrow? It is almost 4pm in mv right now.

Yep - let's do this tomorrow. Still on schedule, and it sounds like we'll be in better shape to push then.
Assignee: dgherman → ravi
Additional info from IRC discussions before this bug was filed, adding here for completeness:

1) At approx noon PT, hwine asked in #ops if there was a reason the Apple -> foopy18.mtv1 link would be blocked/slow. A 1.8GB transfer was given an ETC of 3 hours. At approx 2:20pm PST this was raised in #ops again.
2) There are two releases in progress, which is busy but not unusually high load from RelEng.
3) Release automation is seeing slow downloads from stage-old.m.o (aka stage, surf).
** One example of a file which had slow downloads is http://stage-old.mozilla.org/pub/mozilla.org/firefox/nightly/3.6.25-candidates/build1/update/win32/es-AR/firefox-3.6.25.complete.mar
** Approx 70KB/s averaged over a 10MB file.
** CPU load on stage-old seems OK.
** Questions about 10.253.0.11 (mpt-netapp-b).
** We believe the RelEng machines being impacted are in mtv1 and sjc1 (turned out to be mtv1 only).
4) Some (unrelated) complaints of slow sjc1 transfer speeds in #it.
5) CPU wio of stage is fairly high, although not as bad as it has been in the last week - https://ganglia.mozilla.org/sjc1/?r=day&c=Webtools&h=surf.mozilla.org&mc=2
6) Status of mpt-netapp-b:
   a) CPU isn't too terribly high.
   b) lerxst sees an NFS read latency of 35748 msec (wasn't reproducible on machines mounting partitions from the netapp; see the latency-check sketch after this list).
7) RelEng started using machines in scl1 to work around this problem, but the mtv1 link was swung back to full speed before that got far.
8) The release schedule for getting 3.6.25 on the beta channel was put back a day to adjust, but it had been running a day early before this episode.

Open questions:
** Is there sufficient monitoring for ftp.m.o? ganglia?
** Some discussions in IRC about whether the link between mtv1 and sjc1 hit capacity and is causing this delay. Is there sufficient monitoring on this link?
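As referenced in item 6b, a rough sketch of a client-side NFS read-latency check (assumptions: run on a host that has the filer mounted; the mount path is a placeholder, and the file should not already be in the page cache or the number is meaningless):

```python
import time

# Placeholder path on an NFS mount backed by mpt-netapp-b; point this at a real,
# uncached file before drawing any conclusions.
PATH = "/mnt/netapp/some-large-file"
CHUNK = 1024 * 1024  # read 1 MB

start = time.time()
with open(PATH, "rb") as f:
    f.read(CHUNK)
elapsed_ms = (time.time() - start) * 1000.0

# A healthy read of this size should take tens of milliseconds, not tens of seconds.
print("read %d KB in %.0f ms" % (CHUNK // 1024, elapsed_ms))
```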
> Open questions:
> ** Is there sufficient monitoring for ftp.m.o? ganglia?

Yes.

> ** Some discussions in IRC about whether the link between mtv1 and sjc1 hit
> capacity and is causing this delay. Is there sufficient monitoring on this
> link?

Yes. The issue here is that we have production-critical resources in a location that doesn't have the level of redundant capacity these systems require. What you're probably asking for is a notification mechanism that says "Mountain View suffered a failure of the 1GbE point-to-point and automatically failed over to the lower-capacity 100Mbps link, expect delays", not unlike what BART does when it experiences a service disruption.
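A minimal sketch of the kind of threshold alert described above, assuming a Linux host that can see the relevant interface counters (the interface name and capacity below are placeholders, not the actual mtv1-sjc1 configuration):

```python
import time

IFACE = "eth0"            # placeholder for the interface carrying mtv1-sjc1 traffic
LINK_BPS = 100 * 10**6    # assume the 100 Mbps backup link is the one in service
WARN_FRACTION = 0.9       # alert when sustained utilization passes 90% of capacity

def tx_bytes(iface):
    # Parse cumulative transmit bytes for one interface from /proc/net/dev.
    with open("/proc/net/dev") as f:
        for line in f:
            if line.strip().startswith(iface + ":"):
                fields = line.split(":", 1)[1].split()
                return int(fields[8])   # 9th field after the colon is TX bytes
    raise ValueError("interface not found: %s" % iface)

prev = tx_bytes(IFACE)
while True:
    time.sleep(10)
    cur = tx_bytes(IFACE)
    bps = (cur - prev) * 8 / 10.0       # bits per second over the sample window
    prev = cur
    if bps > WARN_FRACTION * LINK_BPS:
        print("WARNING: %s at %.0f Mbps, near %d Mbps capacity"
              % (IFACE, bps / 1e6, LINK_BPS // 10**6))
```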
The 1G link is showing >250M of sustained throughput, while the backup link is capped at 100M. It appears something on 6-Dec increased our egress traffic from MTV1, with the 7th being the previous peak. There was a 20M spike at or about 11:00 today, and the link remained at capacity until traffic was shifted back to the 1G.
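For a rough sense of scale, a back-of-the-envelope calculation using the figures above and the ~70KB/s / 10MB example from comment 5 (treating the link as the only bottleneck, which is a simplification):

```python
# Figures from this comment and comment 5; per-flow behavior is idealized.
demand_mbps = 250          # sustained egress seen on the 1 Gbps link
backup_mbps = 100          # cap on the backup link
file_bytes = 10 * 1024 * 1024
observed_kbps = 70         # per-file rate reported in comment 5 (KB/s)

oversubscription = demand_mbps / backup_mbps
transfer_s = file_bytes / (observed_kbps * 1024)
print("link oversubscribed %.1fx; 10 MB at %d KB/s takes ~%.0f s (~%.1f min)"
      % (oversubscription, observed_kbps, transfer_s, transfer_s / 60))
```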
Status: NEW → RESOLVED
Closed: 13 years ago
Resolution: --- → FIXED
Product: mozilla.org → mozilla.org Graveyard