Closed Bug 555794 Opened 10 years ago Closed 10 years ago

timeouts between castro and mpt

Categories: mozilla.org Graveyard :: Server Operations (task, critical)
Platform: x86 All
Tracking: (Not tracked)
Status: RESOLVED FIXED
People: (Reporter: jhford, Assigned: dmoore)
Attachments: (5 files)

This will close the tree if we get more failures like it, so I'm upping the severity
Severity: normal → critical
Assignee: server-ops → dmoore
So, I'm trying to find the relevant sections from the logs referenced above. I see some sections like this:

===

make[1]: Entering directory `/e/builds/moz2_slave/mozilla-central-win32-debug/build/obj-firefox/browser/installer'
d:/mozilla-build/python25/python2.5.exe /e/builds/moz2_slave/mozilla-central-win32-debug/build/build/upload.py --base-path ../../dist \
		"../../dist/firefox-3.7a4pre.en-US.win32.zip" \
		"../../dist/install/sea/firefox-3.7a4pre.en-US.win32.installer.exe" \
		 \
		 \
		"../../dist/firefox-3.7a4pre.en-US.win32.tests.tar.bz2" \
		"../../dist/firefox-3.7a4pre.en-US.win32.crashreporter-symbols.zip" \
		 \
		"../../dist//firefox-3.7a4pre.en-US.win32.txt" \
		  ../../dist/firefox-3.7a4pre.en-US.langpack.xpi

command timed out: 1200 seconds without output
program finished with exit code 1
elapsedTime=1200.453000

===


Can you tell me a little bit about what upload.py is trying to do? I'd like to understand which servers it is talking to and what method it is using to upload. Currently, we're not seeing the level of network congestion which had caused us problems in the recent past.
The code that does the upload is located at http://mxr.mozilla.org/mozilla-central/source/build/upload.py. It looks like it is using scp.
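For illustration only (the real upload.py at the link above differs in detail), the scp invocation pattern visible in the logs in this bug could be sketched like this; the helper names are invented:

```python
import subprocess

def build_scp_command(files, user, host, remote_dir,
                      identity="~/.ssh/ffxbld_dsa"):
    # Assemble an scp argv of the shape seen in the build logs above.
    cmd = ["scp", "-o", "User=" + user, "-o", "IdentityFile=" + identity]
    cmd.extend(files)
    cmd.append("%s:%s/" % (host, remote_dir))
    return cmd

def upload(files, user, host, remote_dir):
    # Run the upload and surface failures the way the build logs do.
    cmd = build_scp_command(files, user, host, remote_dir)
    rc = subprocess.call(cmd)
    if rc != 0:
        raise RuntimeError("Command %s returned non-zero exit code: %d"
                           % (cmd, rc))
```

A single scp process like this produces no output when stdout is not a terminal, which matters for the buildbot no-output timeouts discussed below.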
http://tinderbox.mozilla.org/showlog.cgi?log=Firefox/1269886313.1269894376.26142.gz manages to catch at least a little more helpfulness:

make[1]: Entering directory `/e/builds/moz2_slave/mozilla-central-win32/build/obj-firefox/browser/installer'
d:/mozilla-build/python25/python2.5.exe /e/builds/moz2_slave/mozilla-central-win32/build/build/upload.py --base-path ../../dist \
		"../../dist/firefox-3.7a4pre.en-US.win32.zip" \
		"../../dist/install/sea/firefox-3.7a4pre.en-US.win32.installer.exe" \
		 \
		 \
		"../../dist/firefox-3.7a4pre.en-US.win32.tests.tar.bz2" \
		"../../dist/firefox-3.7a4pre.en-US.win32.crashreporter-symbols.zip" \
		 \
		"../../dist//firefox-3.7a4pre.en-US.win32.txt" \
		  ../../dist/firefox-3.7a4pre.en-US.langpack.xpi
Connection closed by 10.2.74.116
Uploading e:\builds\moz2_slave\mozilla-central-win32\build\obj-firefox\dist\firefox-3.7a4pre.en-US.win32.zip
Uploading e:\builds\moz2_slave\mozilla-central-win32\build\obj-firefox\dist\install\sea\firefox-3.7a4pre.en-US.win32.installer.exe
Encountered error while uploading
Command ['ssh', '-o', 'IdentityFile=~/.ssh/ffxbld_dsa', 'ffxbld@stage.mozilla.org', 'mkdir -p /tmp/tmp.yNIPxh3596/'] returned non-zero exit code: 255
The 'Connection closed' message is interesting, that sounds like a different problem than the timeout.  I wonder if this is related to bug 492033, where connections to MPT get their TCP session reset.
Also slightly different, from http://tinderbox.mozilla.org/showlog.cgi?log=Firefox/1269885519.1269894417.26255.gz :

scp -o User=ffxbld -o IdentityFile=~/.ssh/ffxbld_dsa ../malloc.log ../sdleak.tree stage.mozilla.org:/home/ftp/pub/firefox/tinderbox-builds/mozilla-central-win32/
 in dir e:\builds\moz2_slave\mozilla-central-win32-debug\build (timeout 1200 secs)
 watching logfiles {}
 argv: ['scp', '-o', 'User=ffxbld', '-o', 'IdentityFile=~/.ssh/ffxbld_dsa', '../malloc.log', '../sdleak.tree', 'stage.mozilla.org:/home/ftp/pub/firefox/tinderbox-builds/mozilla-central-win32/']
...
Read from remote host stage.mozilla.org: Connection reset by peer
lost connection
program finished with exit code 1
elapsedTime=839.344000
Severity: critical → blocker
* we're hitting timeouts on uploads from castro to stage, but for win32 only
* it's normal for win32 to not have any output when uploading, and if it's running slowly then we'll hit timeouts
* linux and mac do spew output, and haven't been timing out
* we get 6-7MB/s upload to stage from MPT build slaves
* it's fluctuated around, but we've seen as low as 250KB/s upload from Castro-based ix hardware for linux
Severity: blocker → critical
Based on the number of checkins to various code repositories, today is a much busier day than most of last week would have been. Are we congesting the Castro <-> mpt link again ?

mozilla-central is closed to checkins at the moment.
Going to open the tree and hope enough people went home already.
Do we have any reporting on the mpt <--> castro link that can show us what is going on ? Upload/download throughput, latency, top users by accumulated bandwidth etc
Just to keep it from being a complete Windowsfest, http://tinderbox.mozilla.org/showlog.cgi?log=Firefox/1269898859.1269902706.16068.gz is a Mac talos box timing out downloading symbols.
Actually it's the slave losing its connection to the master in MPT:
316416K ........ ........ ........ ........ ........ ........ 79%   79.73 KB/s
319488K ........ .
remoteFailed: [Failure instance: Traceback (failure with no frames): <class 'twisted.internet.error.ConnectionLost'>: Connection to the other side was lost in a non-clean fashion.

Putting back the blocker status that I accidentally unset.
Severity: critical → blocker
Re: Comment #10

We do have bandwidth and latency monitoring in place:

  http://mvadm01/smokeping/?target=Castro.opteman
  http://mvadm01/cacti/graph.php?action=view&local_graph_id=250&rra_id=all

In short, latency is consistently under 3 ms and upload bandwidth, in particular, is under 10 Mbps (well below the 40Mbps cap where we'd start to see congestion)


I'm also concerned about comment #5 and comment #6, as I agree they are indicative of something other than a network timeout.
Severity: blocker → critical
Again, comment #12 is not a timeout. I'm investigating other possibilities.
http://tinderbox.mozilla.org/showlog.cgi?log=Firefox/1270059099.1270065204.15639.gz
Uploading /builds/slave/mozilla-central-macosx-debug/build/obj-firefox/dist/firefox-3.7a4pre.en-US.mac.dmg

firefox-3.7a4pre.en-US.mac.dmg                  0%    0     0.0KB/s   --:-- ETA
firefox-3.7a4pre.en-US.mac.dmg                100%   12MB  12.3MB/s   00:00    
Uploading /builds/slave/mozilla-central-macosx-debug/build/obj-firefox/dist/firefox-3.7a4pre.en-US.mac.tests.tar.bz2

firefox-3.7a4pre.en-US.mac.tests.tar.bz2        0%    0     0.0KB/s   --:-- ETA
firefox-3.7a4pre.en-US.mac.tests.tar.bz2      100%   11MB  11.4MB/s   00:01    
Uploading /builds/slave/mozilla-central-macosx-debug/build/obj-firefox/dist/firefox-3.7a4pre.en-US.mac.crashreporter-symbols.zip

firefox-3.7a4pre.en-US.mac.crashreporter-symb   0%    0     0.0KB/s   --:-- ETA
firefox-3.7a4pre.en-US.mac.crashreporter-symb  18%   20MB  19.7MB/s   00:04 ETA
Read from remote host stage.mozilla.org: Connection reset by peer

lost connection
Encountered error while uploading
Command '['scp', '-o', 'IdentityFile=~/.ssh/ffxbld_dsa', '/builds/slave/mozilla-central-macosx-debug/build/obj-firefox/dist/firefox-3.7a4pre.en-US.mac.crashreporter-symbols.zip', 'ffxbld@stage.mozilla.org:/tmp/tmp.sEoOs12328/']' returned non-zero exit status 1
http://tinderbox.mozilla.org/showlog.cgi?log=Firefox/1270092059.1270098581.31683.gz
ssh: connect to host stage.mozilla.org port 22: Bad file number
lost connection
Uploading e:\builds\moz2_slave\mozilla-central-win32\build\obj-firefox\dist\firefox-3.7a4pre.en-US.win32.zip
Uploading e:\builds\moz2_slave\mozilla-central-win32\build\obj-firefox\dist\install\sea\firefox-3.7a4pre.en-US.win32.installer.exe
Uploading e:\builds\moz2_slave\mozilla-central-win32\build\obj-firefox\dist\firefox-3.7a4pre.en-US.win32.tests.tar.bz2
Encountered error while uploading
Command '['scp', '-o', 'IdentityFile=~/.ssh/ffxbld_dsa', '/e/builds/moz2_slave/mozilla-central-win32/build/obj-firefox/dist/firefox-3.7a4pre.en-US.win32.tests.tar.bz2', 'ffxbld@stage.mozilla.org:/tmp/tmp.VDBua13845/']' returned non-zero exit status 1

http://tinderbox.mozilla.org/showlog.cgi?log=Firefox/1270092059.1270095507.25543.gz
command timed out: 1200 seconds without output
program finished with exit code 1

http://tinderbox.mozilla.org/showlog.cgi?log=Firefox/1270089559.1270095931.26619.gz
command timed out: 1200 seconds without output
program finished with exit code 1

http://tinderbox.mozilla.org/showlog.cgi?log=Firefox/1270089559.1270094975.24492.gz
command timed out: 1200 seconds without output
program finished with exit code 1

http://tinderbox.mozilla.org/showlog.cgi?log=Firefox/1270086473.1270093256.20541.gz
command timed out: 1200 seconds without output
program finished with exit code 1
This is a stopgap to double the timeout on upload,
 http://hg.mozilla.org/build/buildbotcustom/rev/e35cdf3012ae
pm01/02 reconfig'd to pick it up.

We could also make the win32 scp uploads output progress along the way (like linux and mac do). Even that doesn't deal with the underlying issues: ssh connections failing, slaves losing their socket connections to their master, and Castro slaves taking a long time to upload to stage, which is what we observe whenever there are several builds running. It's possible there are NFS issues between stage and ftp.m.o (or iSCSI from ftp to the EqualLogic array), but the ssh and buildbot issues point elsewhere.
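One hedged way to get periodic output from an otherwise silent win32 upload (this is a sketch, not the actual buildbotcustom stopgap above): wrap the command in a keepalive that prints a heartbeat, since scp only shows its progress meter when stdout is a terminal. The `KEEPALIVE_INTERVAL` knob is made up.

```shell
# Hypothetical keepalive wrapper: run the given command in the
# background and emit periodic output so buildbot's "1200 seconds
# without output" timer never fires during a slow transfer.
keepalive() {
    interval=${KEEPALIVE_INTERVAL:-60}
    "$@" &
    cmd_pid=$!
    while kill -0 "$cmd_pid" 2>/dev/null; do
        echo "still running: $*"
        sleep "$interval"
    done
    wait "$cmd_pid"
}
```

Invoked as e.g. `keepalive scp -o User=ffxbld -o IdentityFile=~/.ssh/ffxbld_dsa <files> stage.mozilla.org:<dir>/`, it exits with scp's own exit status.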

Did anything come of comment #14 ?
Attachment #436425 - Flags: review?(bhearsum)
The stopgap didn't end up helping, because we had progressed from "slow uploads being killed off by buildbot" to "machines losing contact with their buildbot master (pm01)" before they got through compiling. In the end I just stopped buildbot on mw32-ix-slave02 thru 16, since they were the only machines showing a problem. This is a regression for developers as they now have the slower VMs doing the compile work for them, but at least the builds are more reliable.

Also, we've been allowing 20 minutes to upload a completed build (and the symbols & tests) to stage.m.o, which other jobs then use (some via the castro proxy). For windows the total upload is 83MB for an optimized build, and 105MB for debug build (120 and 128MB for linux). For timely response for developers even 20 minutes is quite a long time, so the change to 40 minutes should only be a temporary one.
We really need a plan for diagnosing this problem. What information can RelEng provide to help IT, and vice versa ? All I have right now is a broad notion that lots of checkins from developers results in work in the Castro server room (because we bias the compile jobs onto the fast ix slaves there), which results in win32 ix slaves failing (in the multiple ways described above). Would you like times, to the nearest few seconds, when machines lose connection to the buildbot master ? Or periods of time where the upload speed becomes a problem for the windows slaves ? [1] I haven't noticed any correlation with individual slaves but we could look more seriously if that would be helpful. Perhaps we have a fault in the setup of the ix machines, and can spot a packet storm from them causing problems.

The ping times at the link in comment #13 do look good, but I can't see the cacti graph for the bandwidth (using ldap I get 'account is disabled'). Could there be a bottleneck in the network between the 2nd floor server room and the fiber gateway ? Say if we were saturating the outward connection from switches there. No offense intended, just throwing ideas out there.
Of note may be that we started a large rollout to the Linux ix machines yesterday. As each of them rebooted, they sucked down a 1.2GB file from bm-sun-xf01, which I believe is located in mpt. If link saturation triggers these issues, this could have aggravated things overnight.

Also interesting is that we haven't had any Linux machines in Castro lose their connection to the master mid-build.
Attachment #436425 - Flags: review?(bhearsum) → review+
if this is still happening after the deployment, could we move one of these machines to mpt and see if it is still timing out there?
We're getting horribly slow speeds cloning repositories between mpt <-> castro, which is interfering with the 3.6.3 release. Upping severity to blocker.
Severity: critical → blocker
dmoore: What does munin say for network/bandwidth approx between 11pm - 1am last night? There were few (none?) developers around doing checkins, and nightly builds don't start until 3-4am. This meant we didn't have much build load going on, and didn't expect to hit congestion problems.
also, this morning doing FF3.6.3 release builds, it took 58 minutes for a linux ix machine just to do "hg clone" for the release build. For comparison, a VM in MPT took <4 minutes.
We've got a whole swath of talos disconnects (see the worst on moz-central right now) that are most likely related to this bug.
dmoore de-prioritized NFS traffic from mpt->650castro at approx 11.45PST. This means that the one-time hit for scratchbox upgrades will take longer, but builds, and releases, will get more bandwidth.
(In reply to comment #26)
> See:
> 
> http://tinderbox.mozilla.org/showlog.cgi?log=Firefox/1270146822.1270147232.19728.gz
> http://tinderbox.mozilla.org/showlog.cgi?log=Firefox/1270144697.1270147842.21337.gz
> http://tinderbox.mozilla.org/showlog.cgi?log=Firefox/1270144259.1270147837.21313.gz
> http://tinderbox.mozilla.org/showlog.cgi?log=Firefox/1270144463.1270146419.17237.gz
> http://tinderbox.mozilla.org/showlog.cgi?log=Firefox/1270144601.1270146437.17354.gz
> 
> etc.

from dmoore: he thinks the file transfer errors in those logs, like "Connection to the other side was lost in a non-clean fashion.", are unrelated to bandwidth. He suspects that something else, maybe an HTTP proxy, is forcing the connection closed.

Also of note: the file transfer starts quickly (MBs) while it uses the local cache, and then slows (KBs) when it reaches a non-cached file.
(In reply to comment #28)
> (In reply to comment #26)
> > See:
> > 
> > http://tinderbox.mozilla.org/showlog.cgi?log=Firefox/1270146822.1270147232.19728.gz
> > http://tinderbox.mozilla.org/showlog.cgi?log=Firefox/1270144697.1270147842.21337.gz
> > http://tinderbox.mozilla.org/showlog.cgi?log=Firefox/1270144259.1270147837.21313.gz
> > http://tinderbox.mozilla.org/showlog.cgi?log=Firefox/1270144463.1270146419.17237.gz
> > http://tinderbox.mozilla.org/showlog.cgi?log=Firefox/1270144601.1270146437.17354.gz
> > 
> > etc.
> 
> from dmoore, he thinks the file transfer errors in those logs like "Connection
> to the other side was lost in a non-clean fashion." are unrelated to bandwidth.

Just to be clear, it's not the wget that lost its connection, it's the connection between the Buildbot slave and master.
Some of the machines that disconnected:

talos-r3-w7-034
talos-r3-w7-023
talos-r3-leopard-022
talos-r3-leopard-039
talos-r3-leopard-016
talos-r3-leopard-027
talos-r3-snow-018

etc.
I'm not sure if this is the same problem or not.
http://tinderbox.mozilla.org/showlog.cgi?log=Firefox3.5/1270586714.1270588917.10733.gz
Linux mozilla-1.9.1 build on 2010/04/06 13:45:14

s: moz2-linux-slave40
gmake[1]: *** [upload] Error 2
gmake[1]: Leaving directory `/builds/slave/mozilla-1.9.1-linux/build/obj-firefox/browser/installer'
make: *** [upload] Error 2


Uploading /builds/slave/mozilla-1.9.1-linux/build/obj-firefox/dist/firefox-3.5.10pre.en-US.linux-i686.tar.bz2
Connection closed by 10.2.74.116

lost connection
Encountered error while uploading
I'm currently experiencing very high slowdowns on traffic between stage.m.o (MPT) and production-mobile-master.build.m.o (also mw32-ix-slave1*.build.m.o)

Currently meaning for the past 3 hours or so, and now.

I suppose by "slowdown" I mean somewhere around 40kb/s?

Attached is one of the affected logs, with verbose wget output, from around 15:07-15:15 PDT; this took 8 minutes where it usually takes 1-2 minutes. (Not a huge deal, but I did time out slightly earlier on two release repacks around 13:22 and 13:27 PDT.)
This is currently keeping the tree closed (timeouts uploading builds means we have reduced test coverage).

Do we have any idea why this seems to have gotten worse today? Are there potential solutions in sight?
(In reply to comment #33)
> This is currently keeping the tree closed (timeouts uploading builds means we
> have reduced test coverage).
> 
> Do we have any idea why this seems to have gotten worse today? Are there
> potential solutions in sight?

Are you doing more "work" today?  The bandwidth graph is flat - you're hitting the limit and have been all day.

Between the office and San Jose we only have a 50Mbps line.  I just signed the paperwork to increase that to 100Mbps but that won't happen quickly (it took two weeks to get the papers).
[02:45]	<dmoore>	Hey all, we had a failure in the castro web proxy which meant all HTTP downloads were uncached
[02:46]	this pushed our utilization to 100% and resulted in the performance problems you've been seeing
	the proxy is restarted and performance should pick up again soon
Suck.  We'll make sure we have proper monitoring around that.  Going to mark this one as resolved.
Status: NEW → RESOLVED
Closed: 10 years ago
Resolution: --- → FIXED
This isn't really resolved, since timeouts are still happening (though maybe less frequently?). Is there another bug tracking the bandwidth upgrade?
No, no other bug tracking bandwidth upgrade.  Is there value in keeping this bug open?
I think outstanding infra issues should generally have open bugs (even if only as a reference) so that interested parties (i.e. everyone wondering what's up with the random red on the tree) can track their progress.

But that's secondary to actually getting the problem fixed, so if all efforts are being made to fix it, I won't complain too much!
This is still causing tree closures, and failed nightly builds, so reopening. 


(In reply to comment #34)
> (In reply to comment #33)
> Between the office and San Jose we only have a 50Mbps line.  I just signed the
> paperwork to increase that to 100Mbps but that won't happen quickly (it took
> two weeks to get the papers).
Do you have an ETA?

Also, in addition to bandwidth, is there anything else - like improving caching done in bug#551187 - that might help?
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
(pinged dmoore about those, and he said caching was busted, and then unbusted it)
Whiteboard: [tracking bandwidth]
Just so everyone's on the same page, there isn't any short term fix here.  Pasting more updates doesn't change that.

We are waiting for a couple things to happen:
1. Mountain View bandwidth increase - ~30 days
2. Move Castro Minis to new colo - no ETA
In the meantime, how do we tell the difference between times when the cache is working but we're maxed out anyway, and we should just close the tree and sit back and wait, and times when the cache is broken, and we should tell you about it?
If the cache is busted we'll get paged (we will now anyways) and will fix it.  RelEng has access to the cacti install at Castro and can see if they are running flat or not.
(In reply to comment #45)
> Just so everyone's on the same page, there isn't any short term fix here. 
> Pasting more updates doesn't change that.
> 
> We are waiting for a couple things to happen:
> 1. Mountain View bandwidth increase - ~30 days
> 2. Move Castro Minis to new colo - no ETA

Sorry, armenzg asked me to post failures on irc. I guess you don't need to see these.
(In reply to comment #47)
> RelEng has access to the cacti install at Castro and can see if they are
> running flat or not.

Actually, neither Ben nor I can log on to the second link of comment 13. I get 'Error: Access Denied, user account disabled.' Might affect others too.
That's expected - you have to auth first, tell us and we enable the account.  Did that for both of yours.
Thanks. Which one should I be looking at ? Mountain View > Host ?
Depends on: 561155
For posterity (for those who have access), here is a link to the graph which shows build usage of the Castro <-> mpt link - http://10.250.0.5/cacti/graph.php?action=view&local_graph_id=250&rra_id=all


Nick (nthomas) and John (joduinn) and I looked around the graphs for awhile, here's some analysis of the yearly graph:
We spiked to 30mbit for the first time at the end of January. This corresponds with the addition of some of the rev3 Talos machines. Then it dipped back down to 5-10mbit sustained at the start of February.
The next spike to 30mbit happens mid February. This *seems* to correspond with a lot of extra checkins in that part of the month.
The next spike is 45mbit in the last week of February. This appears to correspond with some Windows ix machines landing in production.
The next spike is at 50mbit, around the first two weeks of March. Around this time we added many more Talos machines, and Linux ix machines.
When the cache was introduced on March 15th we immediately saw a dramatic drop in link usage, going from 45-50mbit sustained down to 15-20mbit sustained, on average. This is an average over 24 hours, so it's hard to tell what the peaks and valleys are. I suspect the peaks are above our current 30mbit cap.

There were also changes to the cap for the build machines over the past couple of months. At the very beginning of March, they had no cap. Around March 5th, they were capped to 20mbit. Around March 19th, the cap was bumped to 30mbit.

So, the long story short here is that we appear to have genuinely, steadily increased the load going between Castro and MPT. We introduced a caching proxy, which helped, but our non-cacheable load spikes above our 30mbit limit regularly.

Some ideas we could try:
* turn on ssh compression for uploads -- might be of negligible value due to packages already being compressed
* prioritize master/slave connections, other things that cause burning
* somehow limit how much we can upload at one time, to avoid completely saturating the link with uploads
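The last idea (limiting how much we upload at one time) could be sketched with a semaphore around the upload commands. This is hypothetical, not anything in buildbotcustom, and the slot count is an arbitrary guess:

```python
import subprocess
import threading

# Hypothetical throttle: allow at most MAX_CONCURRENT upload processes
# at once so a burst of finished builds can't saturate the link by
# itself. (The ssh compression idea, scp -C, is left out here; the
# packages are already compressed, so it would likely gain little.)
MAX_CONCURRENT = 2
_upload_slots = threading.BoundedSemaphore(MAX_CONCURRENT)

def throttled_upload(argv):
    """Run one upload command, blocking until a slot is free."""
    with _upload_slots:
        return subprocess.call(argv)
```

Callers would pass the same scp argv they use today; excess uploads simply queue on the semaphore instead of competing for bandwidth.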


One other note:
Each checkin uploads 1.5GB across our 5 platforms (linux 32/64, mac 32/64, windows). That is then downloaded back to Castro by the test machines. So, each checkin causes roughly 3GB of traffic across the link, in a fairly short period of time.
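For scale against the 30mbit cap, a back-of-the-envelope transfer-time calculation (idealized: it ignores protocol overhead and any competing traffic, so real times are worse):

```python
def transfer_minutes(gigabytes, link_mbps):
    # Best case: the whole link dedicated to this one transfer.
    megabits = gigabytes * 8 * 1024  # 1 GB = 8192 megabits
    return megabits / float(link_mbps) / 60.0

# ~3 GB of checkin traffic over the 30 Mbps cap comes to roughly
# 13.7 minutes even with nothing else on the link.
```

At that rate a handful of closely spaced checkins is enough to keep the link pegged continuously.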
Attached image Le graph
Wot bhearsum was referring to.
> Some ideas we could try:
> * turn on ssh compression for uploads -- might be of negligible value due to
> packages already being compressed
> * prioritize master/slave connections, other things that cause burning
> * somehow limit how much we can upload at one time, to avoid completely
> saturating the link with uploads

One big win here could be making the OSX symbols files smaller. They currently weigh in at 400 MB or so, because they contain two copies of identical fat binaries. We really only need breakpad symbols, so maybe we could avoid uploading these symbols altogether?
What's the strategy going to be for the next month? Are we going to keep the tree closed and have sheriffs meter checkins? If so, what policy should the sheriff follow to keep things from blowing up?
<nthomas>: philor, rs - we think we have to leave m-c metered at 1 every 2 hours
What about other trees --- electrolysis, tracemonkey, try-server? Do they contribute to the load on the link?
(In reply to comment #58)
> What about other trees --- electrolysis, tracemonkey, try-server? Do they
> contribute to the load on the link?

All branches except for Try contribute to the problem.


We were chatting about this a bunch more tonight and have come up with a couple of ideas that would reduce load on the link by 50%. We'll be trying those out tomorrow. If they go well, we *may* have one of them in place before the end of the day.
(In reply to comments #56-58)

The try server doesn't use the MPT <--> Castro link at all, and Firefox 3.0 has a small impact (smaller number of talos runs, every hour or so, only downloading a few MB of build). 

Everything else preferentially tries to build Win32/Linux32 in Castro (on the faster hardware), may build Mac10.5 there, does all Talos there, and for m-c the majority of the unit tests (all except windows). We are looking at ways to minimize traffic on the link - eg prepopulating the cache in Castro, catlee's comment about mac symbols, bug 561235, looking into moving the fast builders - so that we can run as fast as possible checkin-wise. But in the meantime we need to consider all branches when metering, and if we do that will be able to sustain a higher checkin rate than 1 per 2hr. We'll have to figure out what that rate should be experimentally. Suggestions on how to achieve global metering are welcome.
We seem to be in a desperate situation here. Given we only just added Linux-64 and Mac-64, it seems to me turning them off would be a reasonable emergency measure.

And BTW, simply doubling the link capacity might help in the short term but presumably we're working on a better long term solution as well?
> And BTW, simply doubling the link capacity might help in the short term but
> presumably we're working on a better long term solution as well?

Who spilled the beans?!

Long term fix is total eviction from Castro.  No ETA yet.
dmoore, did the firewall upgrade happen at about 01:15 PDT ? See a stats break and some dropped builds then.
bug 561419 has caused us to download just under 2/3 less data (just over 500MB compared to 1.5G) per checkin. We also save a couple of hundred MB in uploading Linux and Windows symbols in many cases.
Depends on: 561419
Mozilla's backup provider for the Castro office is bumping up the RF link to 100Mbps.  Signed the order, ETA for 24 hours.
Whiteboard: [tracking bandwidth] → [tracking bandwidth] 04/26
turnup delay with WiLine, new date 04/28
Whiteboard: [tracking bandwidth] 04/26 → [tracking bandwidth] 04/28
Here's the traffic graph for the past 24 hours. Most of the time, traffic was below the 30mbit cap, averaged every 5 minutes. There were a few spikes yesterday, to 40mbit just after noon and 4pm, and a few other > 30mbit spikes later in the day. However, after looking through the build and talos results for the past 24 hours I don't see any failures related to this. The worst that I've found is an upload that happened just before 4pm yesterday went pretty slowly, but managed to finish.

So, we're in much better shape here, but it's certainly still possible to push it over the edge.
We had 3 more timeouts today when we hit our 30mbit outbound bandwidth cap. joduinn tells me the new link is imminent though, i.e. still on target for today (04/28).
According to IRC the WiLine connection was to go live on the PDT evening of the 28th. 

Between 2030 and 2145 PDT today there was a lot of traffic 'Outbound' on the new link (see attached graph), which caused several slaves in Castro to disconnect from their buildbot master in MPT. When I looked at the graphs for the individual ports on sxmac01/02/03 & buildsx01 there are spikes as high as 2Mbps, but they typically lasted for only one (5 minute) datapoint. I don't believe that combining all the ports would explain the flat 40Mbps for 85 minutes. What's going on here ?
This link, http://10.250.0.5/cacti/graph.php?action=view&local_graph_id=250&rra_id=all is all traffic from the firewall's perspective.
The traffic spike noted last night was not build traffic, but it did utilize the same VPN hardware as the build system. We'll be switching to new VPN hardware over the weekend which will remove the possibility of non-build traffic having this impact in the future.
Comment #73 and comment #74 are fallout from the web proxy reboot for Bug 562526. The proxy came back up with an incomplete configuration which actually caused HTTP connections to be refused. This has been fixed.
Another fistful at 21:43
And at 05:39
And at 13:38
And at 05:40
And at 13:41.

I'm almost, nearly, beginning to see a little pattern here.
(In reply to comment #80)
> I'm almost, nearly, beginning to see a little pattern here.

Pretty much sounds like "every 8 hours" to me....
IT: is anything (script/cron/etc) running on an 8-hour cycle that might be causing these timeouts?
I would also suggest contacting the service provider to see if their equipment is running any sort of routines. Being an old retired switched-network employee for the phone company, I know it's very possible some routine is running on the provided link and impacting performance. I can't suggest what routines those might be as I'm not familiar with wireless LAN connections, but they should be able to tell you if they are doing routine maintenance on the link.
FWIW, I believe most of the recent spat of test failures (on talos-r3 machines) is related to load on the talos master, not related to the mpt <-> castro link.

It's impossible to tell from the error log though.
I'm not convinced, since today's 13:40 episode featured four talos slaves but also one build slave. But since I don't have any logs to look at other than the tinderbox logs, which tell me that every day at 05:40 and 13:40 (and 21:40 when we have something running then) some but not all of the slaves that are doing something at that time disconnect, and have since last Friday, I don't really know.
Could we please have the logging improvements mentioned on IRC
* the two missing switches added
* improved labels on the graphs so we can understand the traffic flows (the network topology is a black box to us, so (eg) it's not clear what inbound/outbound translate to for traffic up to/down from the colo)
(In reply to comment #87)
> Could we please have the logging improvements mentioned on IRC
> * the two missing switches added
> * improved labels on the graphs so we can understand the traffic flows (the
> network topology is a black box to us, so (eg) it's not clear what
> inbound/outbound translate to for traffic up to/down from the colo)

Recap IRC for me since I wasn't there - which switches are missing?
Looking back at logs it appears I am perennially confused about where the machines are. (Note to self: the rev3 minis are on the 2nd floor, and the ix machines on the 3rd floor). And the 2nd floor switches were added to Cacti about 2nd April. 

I think the ones missing are sxmac04 and 05.
<dmoore> bhearsum: sxmac04 is linked into sxmac03 port 50, and sxmac05 is linked into sxmac04 port 49
And at 21:44 (because I jumped in front of the train to see if it would hit me), one Windows build, two Mac and one Linux rev3 test, and one Tracemonkey Windows non-Rev3 test.
Even easier, "buildsx" doesn't have any Macs attached.  sxmac* only has Macs.
(In reply to comment #90)
> And at 21:44 (because I jumped in front of the train to see if it would hit
> me), one Windows build, two Mac and one Linux rev3 test, and one Tracemonkey
> Windows non-Rev3 test.

So at 21:44 the link was fine.  The load on talos-master was also high, but that wouldn't affect the Windows build, nor the Tracemonkey non-rev3 test.

The 8 hour interval is also pretty clear now.  I'm looking into our systems to see if there's anything that happens on an 8 hour interval.

mrz, is there anything that could be affecting this link on an 8 hour interval?
(In reply to comment #91)
> Even easier, "buildsx" doesn't have any Macs attached.  sxmac* only has Macs.

Sorry, I don't follow. If there are talos minis attached to sxmac04/05 then shouldn't they be added to the RelEng part of that Cacti tree ?
Yes, those are missing.  My comment was in reference to your Note to self.  All the Minis are on the second floor off switches named "sxmac".
Whiteboard: [tracking bandwidth] 04/28
No longer blocks: 552680
(In reply to comment #92)
> (In reply to comment #90)
> The 8 hour interval is also pretty clear now.  I'm looking into our systems to
> see if there's anything that happens on an 8 hour interval.
> 
> mrz, is there anything that could be affecting this link on an 8 hour interval?
mrz/dmoore: we expect this to happen again at ~13.48 PST today, can you be watching to see if anything looks wrong around that time? 


(In reply to comment #95)
> Yes, those are missing.  My comment was in reference to your Note to self.  All
> the Minis are on the second floor off switches named "sxmac".

Have these been added to Cacti yet, so we can help with debugging in case it's something from RelEng?
* the enigmatic bug 563968 has been completed, 'replacing the Netscreen SSG in San Jose with a Juniper SRX240 to better handle the VPN traffic from the build cluster in Castro;'
* we saw 5 talos slaves disconnect in the couple of hours after that, in maybe 30 builds
* dmoore has set up traffic logging on talos-master, RelEng to advise failure times and hosts
Speak of the devil ....

Between talos-master:9012 and 
* talos-r3-fed-012 @ Thu May 6 22:08:34 2010
* talos-r3-fed-030 @ Thu May 6 22:08:26 2010
* talos-r3-fed-033 @ Thu May 6 22:08:04 2010
* talos-r3-fed-026 @ Thu May 6 22:08:01 2010
* talos-r3-fed-022 @ Thu May 6 22:07:59 2010
* talos-r3-fed-011 @ Thu May 6 22:07:57 2010
Current situation:
Since the network change we've not been able to complete a Windows build, and 30-40% of talos/unit-on-talos runs are also being interrupted. The main development trees (mozilla-central, mozilla-1.9.2 etc) are all closed. IT & RelEng are working to figure out the problem.

Derek, did you find anything useful in your investigations? Should we continue with the new VPN hardware?
I found some more failures over night:

22:07:59    mw32-ix-slave13
22:20:24    mw32-ix-slave13
22:20:25    mv-moz2-linux-ix-slave10
23:33:34    mw32-ix-slave04
01:02:34    mw32-ix-slave03
03:21:26    mw32-ix-slave16
03:36:05    mw32-ix-slave05
04:59:46    talos-r3-leopard-035
05:13:47    talos-r3-snow-016
05:16:20    talos-r3-snow-017
05:16:22    talos-r3-snow-009
06:03:42    talos-r3-fed64-016
06:03:44    talos-r3-fed64-017
06:17:17    talos-r3-fed64-005

*ALL* the failures so far have been between slaves in Castro and masters in MPT.
We'll do an emergency rollback if this can't be debugged rapidly over the next hour or two.
It feels like we've made a lot of progress on this today, in part due to the debugging capabilities of the new VPN configuration.

We've identified and removed a timeout problem with talos-master (and sm-try-master). Both hosts reside outside of the recognized "build" networks and, as such, were not included in existing TCP idle timeout exemptions.

We've also identified undesirable behavior in the new VPN hardware when it comes to aging out idle connections. This has been worked around.
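One client-side defence against middleboxes aging out idle connections is TCP keepalive, which makes the kernel probe an otherwise-silent connection so it never looks idle to the firewall or VPN. A minimal Python sketch (the interval values and the Linux-specific TCP_KEEP* options are illustrative; this is not what was actually configured for the fix described above):

```python
import socket

def enable_keepalive(sock, idle=300, interval=60, count=5):
    """Send TCP keepalive probes after `idle` seconds of silence,
    every `interval` seconds, giving up after `count` failed probes."""
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
    # The TCP_KEEP* options are Linux-specific; other platforms differ,
    # so guard each one with hasattr().
    if hasattr(socket, "TCP_KEEPIDLE"):
        sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPIDLE, idle)
    if hasattr(socket, "TCP_KEEPINTVL"):
        sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPINTVL, interval)
    if hasattr(socket, "TCP_KEEPCNT"):
        sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPCNT, count)
    return sock
```

For this to help against an aggressive middlebox, `idle` plus one probe interval must stay below the middlebox's idle-timeout threshold.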
We've also received a fix for the following Cisco bug, which has impacted us in the past:


BugID: CSCsj40681

ENH: PIX/ASA: Maintain TCP connections when VPN tunnels flap

This is an enhancement request to prevent disruption of TCP connections over IPsec tunnel when IPsec tunnel is dropped and reestablished by any reason (even for very short time).

Symptom:

With PIX/ASA 7.x and 8.x, when IPSec VPN tunnel flaps briefly, all TCP connections previously established over this tunnel are dropped too. This is disruptive for many TCP applications.

This happens by design in PIX 7.x and 8.x. This issue does not occur on other platforms: PIX 6.x, IOS, VPN3000.

Example of Syslog message generated on PIX/ASA:

%ASA-6-302014: Teardown TCP connection 57983 for outside:192.168.0.100/80 to inside:10.0.0.100/1135 duration 0:00:36 bytes 53947 Tunnel has been torn down

Condition:

- To trigger that issue, IPsec tunnels should be bouncing periodically and TCP sessions should be active this time. It is usually observed when IPsec lifetime or idle time is less than TCP session idle timeout, and there are long periods of no activity over TCP sessions.
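For anyone watching ASA/PIX logs for this failure mode, the teardown events caused by a tunnel flap can be picked out by their reason string. A small sketch based on the example syslog line quoted above (real deployments may format fields differently):

```python
import re

# %ASA-6-302014 is the TCP teardown message; the trailing
# "Tunnel has been torn down" reason marks connections that were
# killed by an IPsec tunnel flap rather than a normal close.
TEARDOWN = re.compile(
    r"%ASA-6-302014: Teardown TCP connection (\d+) .* "
    r"Tunnel has been torn down"
)

def tunnel_teardowns(lines):
    """Yield the connection IDs of TCP sessions dropped by a tunnel flap."""
    for line in lines:
        m = TEARDOWN.search(line)
        if m:
            yield int(m.group(1))
```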
Any status here?  Will the tree just end up staying closed for the weekend?
17:14:58 mw32-ix-slave04
17:33:22 talos-r3-xp-036
17:33:25 talos-r3-w7-003
17:33:27 talos-r3-w7-006
17:33:27 talos-r3-xp-013
17:33:30 talos-r3-w7-016
17:41:24 talos-r3-leopard-050
17:41:28 talos-r3-leopard-042
17:54:14 mw32-ix-slave04
17:54:20 talos-r3-xp-034
18:04:41 talos-r3-w7-015
IT is reverting back to the previous VPN setup with the more regular 8-hour disconnects.  We'll revisit the new VPN set up another time.

Might want to see how the set of nightlies go before re-opening the tree.
We tried a checkin (http://hg.mozilla.org/mozilla-central/rev/a214b695dc7b) to see if the network issues are resolved, but we're seeing a bunch of timeouts still, for example:

Started 05:16
r3-fed64-004
r3-fed64-007
r3-fed64-015
r3-fed64-019
...

Started 05:29
talos-r3-leopard-022
talos-r3-leopard-028
talos-r3-leopard-029
talos-r3-leopard-041
talos-r3-leopard-045
talos-r3-leopard-047
...
And while I thought that might have just hit the 8 hour thing, the next trial clearly wasn't: about 30 rev3 slaves and 1 non-talos slave, half of them failing right off by timing out cloning build tools, the rest disconnecting in one of the other early steps.
The talos machines were still going over the new (broken) VPN until about 10am PDT this morning. Now they're going over the old (8 hour bustage) VPN. I would suggest trying another push to m-c.
Derek, an unexpected crop of timeouts:
* talos-r3-xp-021, talos-r3-xp-039, talos-r3-w7-007, talos-r3-w7-012 @ 03:08
* talos-r3-leopard-042, talos-r3-leopard-049 @ 02:39
* talos-r3-xp-005, talos-r3-w7-004, talos-r3-xp-034, talos-r3-w7-007 @ 01:50
Doesn't match the 8 hour pattern (X, X+15m, X+8hr, etc).

Prior to those the last lost connection was Sat May 8 10:17 PDT.
Developers, we don't know the character of the current failure situation. If the nightly builds & tests turn out green then please set the tree to METERED - a rush of checkins would not be helpful while we're in a fragile state. We're expecting that the connection resets will come in a gaggle every 8 hours or so, but when during the day is not known. If the failure rate is worse than that then we should go back to CLOSED.
(In reply to comment #112)
> Developers, we don't know the character of the current failure situation. If
> the nightly builds & tests turn out green then please set the tree to METERED -
> a rush of checkins would not be helpful while we're in a fragile state. We're
> expecting that the connection resets will come in a gaggle every 8 hours or so,
> but when during the day is not known. If the failure rate is worse than that
> then we should go back to CLOSED.

It looks like the Windows nightlies failed, so I'm leaving it closed based on the above comment.
After a couple of trial pushes, looks like the current state is that everything's fine on Linux and Mac (other than bug 564694, where a test that hasn't significantly changed since last November, which relies on getting a DNS lookup failure for a made-up domain, has suddenly started failing), but Windows build slaves disconnect after 3 minutes (four times the same 3 minute disconnect, though the Windows nightlies disconnected after 1 and 9 minutes instead).

Tree's still closed, Generalissimo Francisco Franco is still dead.
Repeating my comment from bug 564694 here in case it is part of the cause of the later issues here:

The issue might be that as of May 6th, the root DNS servers have DNSSEC
implemented and are therefore returning query responses that might exceed the
default DNS UDP query buffersize.  In that case the request is retried using
TCP.

It could be that the firewalls are only allowing DNS queries to the Internet
using UDP.  They must permit both UDP port 53 and TCP port 53 in order to
function properly.
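Bill's theory is straightforward to probe: if UDP/53 is open but TCP/53 is blocked, large DNSSEC responses that fall back to TCP will fail. A minimal reachability check (this only verifies that a TCP connection to the port can be opened, not that DNS-over-TCP actually works; the server address you test against is up to you):

```python
import socket

def tcp_reachable(server, port=53, timeout=5):
    """Return True if a TCP connection to server:port can be opened.
    Tests firewall reachability only, not DNS service behind the port."""
    try:
        with socket.create_connection((server, port), timeout=timeout):
            return True
    except OSError:
        return False
```

Running this against the recursive resolvers the slaves use would confirm or rule out a TCP/53 block quickly.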
Depends on: 564726
Bill, nice try but the sort of problems we're seeing aren't related to resolving machines on the wider internet. It's all about connectivity between Mozilla locations.
Status update:
* win32 minis and build slaves in Castro are still having a great deal of difficulty in maintaining connections to MPT, most visibly a few minutes after starting a job
* bug 564726 is a workaround to get talos/unit-on-talos back in action, by adding a buildbot master in Castro. The mini slaves are still vulnerable to hg/cvs/ftp/graphs.m.o becoming temporarily inaccessible
* can't pull the same trick for the ix builders; the best we can do is temporarily disable them and fall back to the VMs at MPT. This results in significantly slower Windows & Linux builds, and should only be considered a very short-term solution to reopen the tree. I've done this in order to do testing for bug 564726
  http://hg.mozilla.org/build/buildbot-configs/rev/523476a592a4

But the real key here is figuring out what's happening with the network so we at least get back to a pre-Thursday state (where we 'only' lose some connections every 8 hours). From what I've observed today:
* occasionally connectivity between MPT and Castro appears to hang - I've had ssh sessions become unresponsive for 10-30 seconds, and pings to a Castro machine from build-vpn report packet loss
* over the 60 mins to 3:10 mtr found 11% packet loss, ie the link is hung 1 minute in every 9
* the nagios checks on pings are also reporting dropouts; this slave has been idle all day (ie no reboots)
https://nagios.mozilla.org/nagios/cgi-bin/history.cgi?host=mv-moz2-linux-ix-slave19.build&type=0&statetype=0&service=PING&archive=1
  (may not be a permalink, it's meant to be Sun May 9 00:00:00 PDT 2010 +24h)
* talos machines have been failing to resolve cvs-mirror, eg (PDT)
   talos-r3-w7-005, talos-r3-xp-006 @ Mon May 10 02:07
   talos-r3-leopard-003             @ Sun May  9 23:58
   talos-r3-fed-006                 @ Sun May  9 22:13
   talos-r3-w7-003/4/5              @ Sun May  9 20:18
   talos-r3-w7-003/4                @ Sun May  9 20:01
(In reply to comment #117)
> * over the 60 mins to 3:10 mtr found 11% packet loss, ie the link is hung 1
> minute in every 9
(In reply to comment #118)
> (In reply to comment #117)
> > * over the 60 mins to 3:10 mtr found 11% packet loss, ie the link is hung 1
> > minute in every 9

I re-ran this again, and found a very similar %age:
[bhearsum@bm-vpn01 ~]$ /usr/sbin/mtr -c 3600 -r mv-moz2-linux-ix-slave22
HOST: bm-vpn01.build.mozilla.org  Loss%   Snt   Last   Avg  Best  Wrst StDev
  1. ???                          100.0  3600    0.0   0.0   0.0   0.0   0.0
  2. mv-moz2-linux-ix-slave22.bui 11.9%  3600    9.4   8.0   2.6 3661.  65.2
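The "link is hung 1 minute in every N" phrasing used earlier is just the reciprocal of the loss rate, assuming the loss comes in whole-link hangs rather than uniform drops. A trivial helper for the conversion:

```python
def minutes_between_outages(loss_percent):
    """If loss is bursty (the link hangs outright rather than dropping
    packets uniformly), a loss rate of p% over an hour is roughly one
    bad minute in every 100/p minutes."""
    return 100.0 / loss_percent

# e.g. minutes_between_outages(11.9) is about 8.4
```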
dmoore is changing over the WAN connection between build systems in 650castro and MPT. We should know if this helps in the next 30mins or so.
A Linux ix machine in castro lost connection to production-master02 at Mon May 10 14:56:53 2010

remoteFailed: [Failure instance: Traceback (failure with no frames): <class 'twisted.internet.error.ConnectionLost'>: Connection to the other side was lost in a non-clean fashion.
]
Had a Linux ix machine time out when cloning http://hg.mozilla.org/build/tools at Mon May 10 15:21:37 2010
(In reply to comment #122)
> Had a Linux ix machine time out when cloning http://hg.mozilla.org/build/tools
> at Mon May 10 15:21:37 2010

15:21 was the start of the clone, and at 15:41:37 it timed out.

A second one happened at Mon May 10 15:41:39 2010, which was almost precisely when the first one timed out.
bhearsum,

Are the ix/builder failures something that *ever* happens for slaves located in MPT? I'm trying to get a feel for how unusual these events are.
(In reply to comment #124)
> bhearsum,
> 
> Are the ix/builder failures something that *ever* happens for slaves located in
> MPT? I'm trying to get a feel for how unusual these events are.

A failure like this, where the cloning starts but times out after 20 minutes, has only happened when we were bandwidth constrained as far as I can remember.

I guess this failure could've been a transient issue on hg.m.o, though.
Another timeout. Started cloning build/tools at Mon May 10 19:04:59 2010, timed out at Mon May 10 19:25:07 2010.

We also had a corrupt file get into the proxy, which may not be related, but I guess it could result from a connection reset. Nick said there was nothing obvious in the squid logs. This incident happened around Mon May 10 18:41:05 2010.
Has the corrupt file issue been resolved? If not, what was the invalid url?
The corrupt file was removed in bug 564964.
I looked over the build and test results from the past 12h and haven't seen even a *single* timeout / lost connection / etc.
Got another issue which may or may not be related here:
A slave failed to upload a build at Tue May 11 07:01:16 2010 with the following error:
ssh: connect to host stage.mozilla.org port 22: Bad file number

I *think* that this means that the connection closed at a weird time, but that's mostly a guess.
Corrupted file purged:

# squidclient -m PURGE http://stage.mozilla.org/pub/mozilla.org/firefox/tinderbox-builds/electrolysis-macosx64/1273582056/firefox-3.7a5pre.en-US.mac64.dmg

HTTP/1.0 200 OK
Server: squid/2.6.STABLE21
Date: Tue, 11 May 2010 15:05:34 GMT
Content-Length: 0
I'm doing log analysis on the failed uploads now.
(In reply to comment #122)
> Had a Linux ix machine time out when cloning http://hg.mozilla.org/build/tools
> at Mon May 10 15:21:37 2010

This one was on mv-moz2-linux-ix-slave17

(In reply to comment #126)
> Another timeout. Started cloning build/tools at Mon May 10 19:04:59 2010, timed
> out at Mon May 10 19:25:07 2010.

This one was on mv-moz2-linux-ix-slave12
Another failed upload:
mv-moz2-linux-ix-slave23 started uploading at Tue May 11 09:44:53 2010 and timed out at Tue May 11 10:00:45 2010 with the following error:
fennec-2.0a1pre.en-US.linux-gnueabi-arm.tar.b  90% 9360KB   0.0KB/s - stalled -Read from remote host stage.mozilla.org: Connection timed out


...after being stalled for 10 minutes or so.
Another cloning timeout. This time from moz2-darwin9-slave30, which is actually in the colo....
Started at Tue May 11 11:10:28 2010
Ended at Tue May 11 11:30:31 2010
Timeout in cloning http://hg.mozilla.org/projects/electrolysis:
mv-moz2-linux-ix-slave24 
started at Tue May 11 11:20:31 2010
timed out at Tue May 11 11:51:46 2010
I suspect the hg cloning timeouts are related to bug 492033, where a known Cisco firewall bug has regressed back into the firmware we are currently running.

I've implemented a workaround here at Castro which strips the offending TCP options from packets before forwarding them to MPT. Please continue to update this bug if you experience timeouts for connections which start after this comment.
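The bug doesn't record how the option-stripping was implemented on the Castro side. For illustration only, on a Linux router an equivalent workaround could look like the iptables rule below; the specific options stripped (window scaling and SACK) are guesses, since the offending options aren't named here:

```shell
# Illustrative sketch, not the actual fix: strip TCP options from SYN
# packets before forwarding them toward MPT, to work around a firewall
# that mishandles them. Requires the xt_TCPOPTSTRIP module (mangle table).
iptables -t mangle -A FORWARD -p tcp --syn \
  -j TCPOPTSTRIP --strip-options wscale,sack-permitted
```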
Thanks for the update Derek, I haven't seen any more timeouts since my last post.

Any idea why we only started hitting this on these machines now? Something to do with the network changes, I guess?
It's not a new thing at all for mobile-land... repository clones + cvs checkouts break more than half the test runs on n900s (tracked in bug 555892).
Another 2 lost connections:
moz2-darwin10-slave40, started at Tue May 11 17:19:32 2010 ended at Tue May 11 17:28:15 2010
mw32-ix-slave22, started at Tue May 11 17:25:44 2010 ended at Tue May 11 17:26:09 2010

Another corrupt file:
http://ftp.mozilla.org/pub/mozilla.org/firefox/tinderbox-builds/fx-linux-tbox-mozilla1.9.0/1273621440/firefox-3.0.19pre.en-US.linux-i686.tar.bz2

That file was uploaded from a slave in mpt, which seems to indicate that the corruption happens on the mpt -> castro trip.
In reply to comment #130: Don't know if it's related, but KaiRo's latest blog http://home.kairo.at/blog/2010-05/weekly_status_report_w18_2010 mentioned "bad file number" in relation to bug 563151 and a command-line which was too long (in excess of 33,000 characters).
DownloadFile(): FAILED: 0, output: --13:54:42--  http://stage-old.mozilla.org/pub/mozilla.org/firefox/nightly/3.6.3-candidates/build1/update/win32/ta/firefox-3.6.3.complete.mar

slave: mv-moz2-linux-ix-slave24.build.mozilla.org
Upload failure from mw32-ix-slave16 to stage.m.o at 21:23:41:
Read from remote host stage.mozilla.org: Connection reset by peer
lost connection
There weren't any failures overnight. That, combined with not having any the prior night, seems to indicate that these dropped connections only happen during times of higher load.

Derek, could any of the networking hardware/software be getting overloaded when too many requests come in? Any other theories?
(In reply to comment #143)
> Don't know if related but KaiRo's latest blog [...] mentioned "bad file
> number" [...]

This error message can mean any number of things and is generally not very helpful in telling anything else than that there is an error. I don't think there can be any relation here.
This isn't a blocker anymore since the tree has reopened.
Severity: blocker → critical
mv-moz2-linux-ix-slave03 timed out when cloning. started at Fri May 14 10:29:44 2010 ended at Fri May 14 11:29:45 2010
mw32-ix-slave08 timed out when cloning. started at Fri May 14 10:31:39 2010 ended at Fri May 14 11:31:40 2010
dmoore: We haven't seen this in a while, so can we close this now, or do you want to keep it open for tracking?
I believe these are occurrences of it.

http://tinderbox.mozilla.org/showlog.cgi?log=Firefox/1276020106.1276021338.19049.gz
s: talos-r3-fed-019
Start	Tue Jun 8 11:04:00 2010
End	Tue Jun 8 11:20:47 2010

And another one:
http://tinderbox.mozilla.org/showlog.cgi?log=Firefox/1276027166.1276027252.15585.gz
s:moz2-linux64-slave04
Start	Tue Jun 8 13:00:49 2010
End     Tue Jun 8 13:00:51 2010
(In reply to comment #152)
> s: talos-r3-fed-019

Castro talos slave connected to Castro talos master, so not related to the Castro/MPT link. Probably load on the master.

> And another one:
> s:moz2-linux64-slave04

I bet this was bug 566026.
I do think this bug has outlived its usefulness. We've covered a lot of ground with it, and I'd prefer to track any new (and more specific issues) with new bugs.

Thank you, and good night.
Status: REOPENED → RESOLVED
Closed: 10 years ago10 years ago
Resolution: --- → FIXED
(In reply to comment #45) matthew zeier [:mrz] 2010-04-21 12:42:56 PDT
> Just so everyone's on the same page, there isn't any short term fix here. 
> Pasting more updates doesn't change that.
> 
> We are waiting for a couple things to happen:
> 1. Mountain View bandwidth increase - ~30 days
> 2. Move Castro Minis to new colo - no ETA

Did these happen?
Yes and no, in that order.
Product: mozilla.org → mozilla.org Graveyard