Closed Bug 811063 Opened 8 years ago Closed 8 years ago

Intermittent periods when ftp.mozilla.org doesn't resolve

Categories

(Infrastructure & Operations :: Infrastructure: Other, task)

task
Not set
major

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: selenamarie, Assigned: bhourigan)

References

Details

(Keywords: intermittent-failure)

We're trying to root-cause some timeouts connecting to the FTP server at specific times. 

The problem seems to occur sometime between 5-9 minutes after the hour. The symptoms in the past have been "hang forever" on ftp-scraping scripts. Rumor was that tbpl (tinderbox push log) was having troubles last week that were similar, as that tool also uses an ftp scraper to update build/release ids.

If I run our scraper script at any other start time than 5 after the hour, I can't reproduce the hang. The scraper is running from sp-admin01.

Suggestion in #it was that perhaps there would be log entries that would show a resource exhaustion issue somewhere... but I don't have a great idea about how to root cause this. :/

We added a 2-minute timeout to the scraper and started it at 5 minutes after the hour, and the second time it ran we got this error log (the scraper does not always fail): 

2012-11-12 12:06:04,616 DEBUG - Connecting to http://ftp.mozilla.org/pub/mozilla.org/mobile/nightly/2012/11//2012-11-12-04-20-14-mozilla-aurora-android//fennec-18.0a2.mul
ti.android-arm.txt
2012-11-12 12:08:05,650 ERROR - Caught Error: <class 'urllib2.URLError'>
2012-11-12 12:08:05,650 ERROR - <urlopen error timed out>
2012-11-12 12:08:05,651 ERROR - trace back follows:
2012-11-12 12:08:05,662 ERROR - Traceback (most recent call last):
2012-11-12 12:08:05,662 ERROR - File "/home/sdeckelmann/socorro/socorro/cron/ftpscraper.py", line 208, in scrapeNightlies
    for info in getNightly(nightly, nightly_url):
2012-11-12 12:08:05,662 ERROR - File "/home/sdeckelmann/socorro/socorro/cron/ftpscraper.py", line 132, in getNightly
    kvpairs = parseInfoFile(info_url, nightly=True)
2012-11-12 12:08:05,662 ERROR - File "/home/sdeckelmann/socorro/socorro/cron/ftpscraper.py", line 66, in parseInfoFile
    infotxt = urllib.urlopen(url)
2012-11-12 12:08:05,663 ERROR - File "/usr/lib64/python2.6/urllib2.py", line 126, in urlopen
    return _opener.open(url, data, timeout)
2012-11-12 12:08:05,663 ERROR - File "/usr/lib64/python2.6/urllib2.py", line 391, in open
    response = self._open(req, data)
2012-11-12 12:08:05,663 ERROR - File "/usr/lib64/python2.6/urllib2.py", line 409, in _open
    '_open', req)
2012-11-12 12:08:05,664 ERROR - File "/usr/lib64/python2.6/urllib2.py", line 369, in _call_chain
    result = func(*args)
2012-11-12 12:08:05,664 ERROR - File "/usr/lib64/python2.6/urllib2.py", line 1190, in http_open
    return self.do_open(httplib.HTTPConnection, req)
2012-11-12 12:08:05,664 ERROR - File "/usr/lib64/python2.6/urllib2.py", line 1165, in do_open
    raise URLError(err)
2012-11-12 12:08:05,664 ERROR - URLError: <urlopen error timed out>
Here's additional logged timeouts: 

[sdeckelmann@sp-admin01.phx1 socorro]$ grep -B 2 'ERROR - <urlopen error timed out>' selena_temp_scraper_job.log
2012-11-12 12:06:04,616 DEBUG - Connecting to http://ftp.mozilla.org/pub/mozilla.org/mobile/nightly/2012/11//2012-11-12-04-20-14-mozilla-aurora-android//fennec-18.0a2.multi.android-arm.txt
2012-11-12 12:08:05,650 ERROR - Caught Error: <class 'urllib2.URLError'>
2012-11-12 12:08:05,650 ERROR - <urlopen error timed out>
--
2012-11-12 14:07:33,380 DEBUG - Connecting to http://ftp.mozilla.org/pub/mozilla.org/seamonkey/nightly//2.11b3-candidates//build1/
2012-11-12 14:09:34,505 ERROR - Caught Error: <class 'urllib2.URLError'>
2012-11-12 14:09:34,505 ERROR - <urlopen error timed out>
--
2012-11-13 06:07:12,034 DEBUG - Connecting to http://ftp.mozilla.org/pub/mozilla.org/seamonkey/nightly//2.8b1-candidates/
2012-11-13 06:09:13,153 ERROR - Caught Error: <class 'urllib2.URLError'>
2012-11-13 06:09:13,153 ERROR - <urlopen error timed out>
--
2012-11-13 07:08:15,612 DEBUG - Connecting to http://ftp.mozilla.org/pub/mozilla.org/seamonkey/nightly//2.12b5-candidates//build1//win32_info.txt
2012-11-13 07:10:16,681 ERROR - Caught Error: <class 'urllib2.URLError'>
2012-11-13 07:10:16,681 ERROR - <urlopen error timed out>
--
2012-11-13 10:06:27,994 DEBUG - Connecting to http://ftp.mozilla.org/pub/mozilla.org/mobile/nightly/2012/11//2012-11-13-03-06-58-mozilla-central-android-noion//fennec-19.0a1.en-US.android-arm.txt
2012-11-13 10:08:29,112 ERROR - Caught Error: <class 'urllib2.URLError'>
2012-11-13 10:08:29,112 ERROR - <urlopen error timed out>
--
2012-11-14 03:06:13,030 DEBUG - Connecting to http://ftp.mozilla.org/pub/mozilla.org/thunderbird/nightly//15.0b1-candidates/
2012-11-14 03:08:14,147 ERROR - Caught Error: <class 'urllib2.URLError'>
2012-11-14 03:08:14,148 ERROR - <urlopen error timed out>
You might want to print the URLError.reason, which I suspect will be something about not being able to resolve ftp.mozilla.org, and you are seeing the same thing that tbpl is showing.

We use "tbpl" too broadly, both for things that it hits and for things it displays. In this case, it's buildbot jobs whose logs it is displaying which are hitting that - we get frequent little spatterings of two or five or ten test jobs which try to download a build or a test.zip, and die with "wget: unable to resolve host address `ftp.mozilla.org'".
This is happening sporadically.
We're seeing the same issue on buildbot/tbpl (ERROR 503: Server Too Busy.).
https://tbpl.mozilla.org/php/getParsedLog.php?id=17070762&tree=Firefox

wget --progress=dot:mega --no-verbose -N http://ftp.mozilla.org/pub/mozilla.org/firefox/tinderbox-builds/mozilla-central-macosx64-debug/1352996332/firefox-19.0a1.en-US.mac64.tests.zip
 in dir /Users/cltbld/talos-slave/test/build (timeout 1200 secs)
 watching logfiles {}
 argv: ['wget', '--progress=dot:mega', '--no-verbose', '-N', u'http://ftp.mozilla.org/pub/mozilla.org/firefox/tinderbox-builds/mozilla-central-macosx64-debug/1352996332/firefox-19.0a1.en-US.mac64.tests.zip']
 environment:
  Apple_PubSub_Socket_Render=/tmp/launch-dLH7QC/Render
  CVS_RSH=ssh
  DISPLAY=/tmp/launch-1wPqUh/org.x:0
  HOME=/Users/cltbld
  LOGNAME=cltbld
  PATH=/usr/bin:/bin:/usr/sbin:/sbin:/usr/local/bin:/usr/X11/bin
  PWD=/Users/cltbld/talos-slave/test/build
  PYTHONPATH=/Library/Python/2.5/site-packages
  SHELL=/bin/bash
  SSH_AUTH_SOCK=/tmp/launch-fEWVQ4/Listeners
  TMPDIR=/var/folders/qd/srwd5f710sj0fcl9z464lkj00000gn/T/
  USER=cltbld
  VERSIONER_PYTHON_PREFER_32_BIT=no
  VERSIONER_PYTHON_VERSION=2.7
  __CF_USER_TEXT_ENCODING=0x1F5:0:0
 using PTY: False
http://ftp.mozilla.org/pub/mozilla.org/firefox/tinderbox-builds/mozilla-central-macosx64-debug/1352996332/firefox-19.0a1.en-US.mac64.tests.zip:
08:59:42 ERROR 503: Server Too Busy.
program finished with exit code 1
Examples from today:
{
https://tbpl.mozilla.org/php/getParsedLog.php?id=17164734&tree=Mozilla-Inbound
https://tbpl.mozilla.org/php/getParsedLog.php?id=17164745&tree=Mozilla-Inbound
https://tbpl.mozilla.org/php/getParsedLog.php?id=17164723&tree=Mozilla-Inbound
https://tbpl.mozilla.org/php/getParsedLog.php?id=17164730&tree=Mozilla-Inbound
https://tbpl.mozilla.org/php/getParsedLog.php?id=17164714&tree=Mozilla-Inbound
https://tbpl.mozilla.org/php/getParsedLog.php?id=17164741&tree=Mozilla-Inbound
https://tbpl.mozilla.org/php/getParsedLog.php?id=17164761&tree=Mozilla-Inbound
https://tbpl.mozilla.org/php/getParsedLog.php?id=17164706&tree=Mozilla-Inbound
https://tbpl.mozilla.org/php/getParsedLog.php?id=17164767&tree=Mozilla-Inbound
https://tbpl.mozilla.org/php/getParsedLog.php?id=17164720&tree=Mozilla-Inbound
https://tbpl.mozilla.org/php/getParsedLog.php?id=17164759&tree=Mozilla-Inbound
https://tbpl.mozilla.org/php/getParsedLog.php?id=17164750&tree=Mozilla-Inbound
}

From 16th November:
{
https://tbpl.mozilla.org/php/getParsedLog.php?id=17103614&tree=Mozilla-Inbound
https://tbpl.mozilla.org/php/getParsedLog.php?id=17103625&tree=Mozilla-Inbound
https://tbpl.mozilla.org/php/getParsedLog.php?id=17103575&tree=Mozilla-Inbound
https://tbpl.mozilla.org/php/getParsedLog.php?id=17103606&tree=Mozilla-Inbound
https://tbpl.mozilla.org/php/getParsedLog.php?id=17103632&tree=Mozilla-Inbound
https://tbpl.mozilla.org/php/getParsedLog.php?id=17103586&tree=Mozilla-Inbound
https://tbpl.mozilla.org/php/getParsedLog.php?id=17103597&tree=Mozilla-Inbound
https://tbpl.mozilla.org/php/getParsedLog.php?id=17103619&tree=Mozilla-Inbound
https://tbpl.mozilla.org/php/getParsedLog.php?id=17103615&tree=Mozilla-Inbound
https://tbpl.mozilla.org/php/getParsedLog.php?id=17103613&tree=Mozilla-Inbound
}

From 15th November:
{
https://tbpl.mozilla.org/php/getParsedLog.php?id=17071011&tree=Firefox
https://tbpl.mozilla.org/php/getParsedLog.php?id=17064103&tree=Mozilla-Inbound
}

From 13th November:
{
https://tbpl.mozilla.org/php/getParsedLog.php?id=16985789&tree=Mozilla-Inbound
}

(and I'm sure there are more)
Well, even if Selena's issue wasn't the same as releng's "can't resolve ftp.m.o," we're stealing this bug.

Interesting clue, if anyone knew what it was a clue to: today (or I finally noticed today) the two periods moved from being ~10 and ~40 minutes after the hour to being ~20 and ~50 after.
OS: Linux → All
Hardware: x86_64 → All
Summary: Timeout issue with FTP server → Intermittent periods when ftp.mozilla.org doesn't resolve
Severity: normal → major
Assignee: server-ops-infra → pradcliffe
:edmorley is there a way to get actual error messages for those tbpl results? Exit code 1 doesn't really tell me anything useful, just that it failed :/
(In reply to Peter Radcliffe [:pir] from comment #6)
> :edmorley is there a way to get actual error messages for those tbpl
> results? Exit code 1 doesn't really tell me anything useful, just that it
> failed :/

Some of the logs do contain more useful output (I think platform specific), eg from this morning:

https://tbpl.mozilla.org/php/getParsedLog.php?id=17198931&tree=Firefox
{
wget: unable to resolve host address `ftp.mozilla.org'
}
(And if we needed to add more output, we can backout bug 794467; though I would have thought in the error cases, using --no-verbose wouldn't hide anything useful).
Some of them having an error and other not would suggest there is more than one issue (or that it manifests in different ways) unless there are other differences between those runs that I'm unaware of.
The connection details in verbose output could be useful...
(In reply to Peter Radcliffe [:pir] from comment #9)
> Some of them having an error and other not would suggest there is more than
> one issue (or that it manifests in different ways) unless there are other
> differences between those runs that I'm unaware of.

I believe it's platform specific - the OS X runs seem to be the ones that only display "program finished with exit code 1".
Depends on: 813501
There are some that have the exit code 1 and also have the unable to resolve error and others that just have the exit code 1. That implies some other difference.
(In reply to Peter Radcliffe [:pir] from comment #10)
> The connection details in verbose output could be useful...

I've backed out bug 794467, but we'll need the reconfig (bug 813501) to see the changes in production.
Thanks.
Running some tests on sp-admin01.

:edmorley - do you have a host that is exhibiting the problem that I can log into to run tests?
(In reply to Peter Radcliffe [:pir] from comment #15)
> Running some tests on sp-admin01.
> 
> :edmorley - do you have a host that is exhibiting the problem that I can log
> into to run tests?

Unfortunately I don't have access to the build slaves, you'd need to speak to releng for that (#build).
Pir, I just pulled aside talos-r4-snow-025 for you

talos-r4-snow-025:~ cltbld$ facter ipaddress
10.12.51.53
talos-r4-snow-025:~ cltbld$ facter fqdn
talos-r4-snow-025.build.scl1.mozilla.com

user:cltbld PW as per releng current. Root pw should also be in the password file. Slave also has VPN access should you need it.

If you make any software or system config changes, please let releng know, so we can reimage when we bring it back.

Its behind the BuildVPN so you can either us our jumphost vpn1.releng.scl3.mozilla.com or connect through OpenVPN, if you don't have a route to it.
:digi - any insights on the DNS side of this issue.
Got a couple of failures on that build machine.

###################################################
; <<>> DiG 9.6.0-APPLE-P2 <<>> ftp.mozilla.org
;; global options: +cmd
;; connection timed out; no servers could be reached
###################################################
; <<>> DiG 9.6.0-APPLE-P2 <<>> ftp.mozilla.org
;; global options: +cmd
;; connection timed out; no servers could be reached
Do we have any ideas when this problem started? When did people start noticing it?
(In reply to Peter Radcliffe [:pir] from comment #21)
> Do we have any ideas when this problem started? When did people start
> noticing it?

I'm sorry I don't have a specific range - most of the times connectivity issues pop up they are so transient that they've resolved themselves before we have a chance to file a bug - which means unfortunately in this case, it wasn't until the frequency was seen to be abnormally high, that we started tracking more closely.

At a guess I would say > 6-7 days and < 2 weeks.
Even rough ranges are useful, thanks.
I've been running this script in places. Haven't caught any more errors since the two earlier. It's possible that testing will disguise the problem of course.

#!/bin/bash

touch dig-fail.txt

while true; do

  if ! dig=$(dig ftp.mozilla.org 2>&1) ; then
    date="$(date)"
    echo "###################################################" >> dig-fail.txt
    echo "$date" >> dig-fail.txt
    echo "$dig" >> dig-fail.txt
    echo "Failed $date"
  fi
done
Maybe this coincides with puppet runs on the DNS server.. times look similar.
:pir

I've looked through the debug log on admin1b.private.scl1 and I'm able to correlate some of the failures identified in tbpl to error messages on the recursive name server. I downloaded bind-9.8.2 source and checked line 6440, but it doesn't quite match up to anything meaningful, I suspect because of the RedHat patch set. I'm tracking down an SRPM now. Based on the debug log this problem has been occurring for some time, at least since 11/5. 

It might be worth mentioning that ftp.mozilla.org may not be the best hostname to connect to if connectivity to our local product delivery cluster is the goal because that name is a CNAME pointing to Dynect. Something like ftp-zlb.vips.scl3.mozilla.com would always resolve to our local product delivery cluster without recursing outside of Mozilla.

19-Nov-2012 23:19:48.728 query-errors: debug 1: client 10.12.51.176#54703: view mpt-dmz-in: query failed (SERVFAIL) for ftp.mozilla.org/IN/AAAA at query.c:6440
19-Nov-2012 23:19:48.728 query-errors: debug 1: client 10.12.51.51#59341: view mpt-dmz-in: query failed (SERVFAIL) for ftp.mozilla.org/IN/AAAA at query.c:6440
19-Nov-2012 23:19:48.728 query-errors: debug 1: client 10.12.49.171#59152: view mpt-dmz-in: query failed (SERVFAIL) for ftp.mozilla.org/IN/AAAA at query.c:6440
19-Nov-2012 23:19:48.728 query-errors: debug 1: client 10.12.51.51#56619: view mpt-dmz-in: query failed (SERVFAIL) for ftp.dynect.mozilla.net/IN/AAAA at query.c:6440
19-Nov-2012 23:19:48.728 query-errors: debug 1: client 10.12.51.176#55901: view mpt-dmz-in: query failed (SERVFAIL) for ftp.dynect.mozilla.net/IN/AAAA at query.c:6440
Interesting, I didn't spot those. Good catch.

I poked at the name that ftp.mozilla.org points through, through dynect, and it had a redirect back to ftp.mozilla.org once you get into /pub (root doesn't redirect, /pub does).

We could ask webops to give us a name that's local that doesn't redirect. Would make sense to me.
\o/ Thanks for tracking this down. :)
I was able to get a SRPM of the bind version that is used on admin1b.private.scl1, and unfortunately query.c:6440 is an error message for a 'default:' case in a switch(result) statement. By default we have a debug log level of 1 which won't identify the point at which the query failed (eg; root NS, Dynect NS, Dynect AAAA, Dynect A, etc). I bumped the debug log level to 2 and I'll keep an eye on the log for additional information.

Taken from the BIND documentation:

At the debug levels of 2 or higher, detailed context information of recursive resolutions that resulted in SERVFAIL is logged. The log message will look like as follows:

fetch completed at resolver.c:2970 for www.example.com/A in 30.000183: timed out/success [domain:example.com,referral:2,restart:7,qrysent:8,timeout:5,lame:0,neterr:0,badresp:1,adberr:0,findfail:0,valfail:0]


If anyone would like to poke at the BIND source it's in rhel6dev64.sandbox.phx1.mozilla.com:/root/rpmbuild/BUILD/bind-9.8.2rc1
If anything, let's add a new CNAME (and one that our internal NS's are authoritative for), rather than pointing directly at the zlb VIP.  That will let us move this service around more flexibly in the future if necessary.
It happened again since I bumped the debug level earlier this morning, here's a relevant excerpt from the debug log:

20-Nov-2012 10:20:25.005 query-errors: debug 1: client 10.12.50.232#41563: view mpt-dmz-in: query failed (SERVFAIL) for ftp.mozilla.org/IN/A at query.c:6440
20-Nov-2012 10:20:25.005 query-errors: debug 2: fetch completed at resolver.c:3178 for ftp.dynect.mozilla.net/A in 30.000246: timed out/success [domain:dynect.mozilla.net,referral:0,restart:5,qrysent:17,timeout:16,lame:0,neterr:0,badresp:0,adberr:0,findfail:0,valfail:0]

I'm seeing similar errors for other recursive queries, so I don't think this is a Dynect specific problem. After some further investigating I think this odd behavior could be attributed to the kernel version. It's running 2.6.32-71.el6.x86_64 which has been known to cause bizarre problems in the past.

I recommend we update and reboot this host. We'll be able to determine whether the problem has been fixed or not with about 12 hours of logs.
I just wanted to clarify - by "this host" I mean admin1[ab].infra.scl1
Does that kernel version appear in phx1 too, for Selena's error?

Let's do this kernel upgrade *right now*, or after the holidays, and not tomorrow.  And let's be extra-careful that one node is working before upgrading the other.
Most of a day late and dollar short, but the verbose wget failure is just

--2012-11-20 15:50:15--  http://ftp.mozilla.org/pub/mozilla.org/firefox/tinderbox-builds/mozilla-beta-win32/1353436282/firefox-18.0.en-US.win32.tests.zip
Resolving ftp.mozilla.org... failed: Unknown host.
wget: unable to resolve host address `ftp.mozilla.org'
program finished with exit code 1
I've completed the kernel upgrades on admin1[ab].infra.scl1, we're now running 2.6.32-279.14.1.el6.x86_64. I have not seen any errors, but I wouldn't expect to. I'll check back in on the debug log in a few hours after it's had some time to answer queries.

It should address Selena's original concern which I understand to be intermittent connection failures to ftp.m.o, traced back to intermittent DNS outages. However, it will not address Armen Zambrano G. [:armenzg]'s issue in #c3. It looks like those error messages indicated a successful connection but an undesirable response ("ERROR 503: Server Too Busy."), unless that error originated from some sort of intermediate proxy.
We just hit another bit of "unable to resolve," between 21:21:05 and 21:21:16.
If we want to change the domain used the majority of RelEng  test jobs are covered by the config for http://hg.mozilla.org/build/tools/file/tip/stage/post_upload.py, which is in svn at sysadmins/puppet/trunk/modules/productdelivery/files/etc/post_upload.ini. I'm not sure if that covers the newer compile jobs, which may not be using post_upload.py.
:dustin that's what I would suggest, yes, something like ftp-releng.mozilla.org or ftp-scl3.mozilla.org depending if it's better to do it by organisational unit or geographical location.
My checks got a bunch of failures overnight, on both hg.mozilla.com (fox2mike's suggestion to monitor more than just ftp.mozilla.org) and ftp.mozilla.org.

I'd assume all the 16:xx PST failures are :digi's work.

Timestamps:
Failed ftp.mozilla.org Tue Nov 20 10:20:10 PST 2012
Failed ftp.mozilla.org Tue Nov 20 10:20:40 PST 2012
Failed hg.mozilla.org Tue Nov 20 15:50:18 PST 2012
Failed hg.mozilla.org Tue Nov 20 15:51:21 PST 2012
Failed hg.mozilla.org Tue Nov 20 16:07:25 PST 2012
Failed hg.mozilla.org Tue Nov 20 16:19:36 PST 2012
Failed ftp.mozilla.org Tue Nov 20 16:19:51 PST 2012
Failed hg.mozilla.org Tue Nov 20 16:20:06 PST 2012
Failed ftp.mozilla.org Tue Nov 20 16:20:21 PST 2012
Failed hg.mozilla.org Tue Nov 20 16:20:36 PST 2012
Failed ftp.mozilla.org Tue Nov 20 16:20:51 PST 2012
Failed hg.mozilla.org Tue Nov 20 16:21:06 PST 2012
Failed ftp.mozilla.org Tue Nov 20 16:21:21 PST 2012
Failed hg.mozilla.org Tue Nov 20 16:21:36 PST 2012
Failed ftp.mozilla.org Tue Nov 20 16:21:51 PST 2012
Failed hg.mozilla.org Tue Nov 20 16:22:06 PST 2012
Failed ftp.mozilla.org Tue Nov 20 16:22:21 PST 2012
Failed hg.mozilla.org Tue Nov 20 16:22:36 PST 2012
Failed ftp.mozilla.org Tue Nov 20 16:22:51 PST 2012
Failed hg.mozilla.org Tue Nov 20 16:23:06 PST 2012
Failed ftp.mozilla.org Tue Nov 20 16:23:21 PST 2012
Failed hg.mozilla.org Tue Nov 20 16:23:36 PST 2012
Failed ftp.mozilla.org Tue Nov 20 16:23:51 PST 2012
Failed hg.mozilla.org Tue Nov 20 16:24:06 PST 2012
Failed ftp.mozilla.org Tue Nov 20 16:24:21 PST 2012
Failed hg.mozilla.org Tue Nov 20 16:24:36 PST 2012
Failed ftp.mozilla.org Tue Nov 20 16:24:51 PST 2012
Failed hg.mozilla.org Tue Nov 20 16:25:06 PST 2012
Failed ftp.mozilla.org Tue Nov 20 16:25:21 PST 2012
Failed hg.mozilla.org Tue Nov 20 21:21:07 PST 2012
Failed hg.mozilla.org Wed Nov 21 02:51:21 PST 2012
Failed ftp.mozilla.org Wed Nov 21 02:52:25 PST 2012
Failed ftp.mozilla.org Wed Nov 21 02:53:13 PST 2012
Testing on sp-admin01.phx1 has seen no failures at all. This would imply to me that :selena's issue is different.

:armenzg - are you still seeing 503s? Those are again, a different issue and need to be tracked through a separate chunk of infrastructure.
The last SERVFAILs I'm seeing in the debug logs are now:

21-Nov-2012 02:53:11.155 query-errors: debug 1: client 10.12.49.169#51544: view mpt-dmz-in: query failed (SERVFAIL) for ftp.mozilla.org/IN/AAAA at query.c:6440
21-Nov-2012 02:53:12.670 query-errors: debug 1: client 10.12.49.162#43157: view mpt-dmz-in: query failed (SERVFAIL) for videos-cdn.mozilla.net/IN/AAAA at query.c:6440
21-Nov-2012 02:53:12.670 query-errors: debug 1: client 10.12.49.162#46901: view mpt-dmz-in: query failed (SERVFAIL) for videos-cdn.mozilla.net/IN/AAAA at query.c:6440

Full log excerpt for the most recent failure:

21-Nov-2012 02:53:12.670 query-errors: debug 1: client 10.12.49.162#43157: view mpt-dmz-in: query failed (SERVFAIL) for videos-cdn.mozilla.net/IN/AAAA at query.c:6440
21-Nov-2012 02:53:12.670 query-errors: debug 1: client 10.12.49.162#46901: view mpt-dmz-in: query failed (SERVFAIL) for videos-cdn.mozilla.net/IN/AAAA at query.c:6440
21-Nov-2012 02:53:12.670 client: debug 3: client 10.12.49.162#46901: view mpt-dmz-in: error
21-Nov-2012 02:53:12.670 client: debug 3: client 10.12.49.162#43157: view mpt-dmz-in: error
21-Nov-2012 02:53:12.670 client: debug 3: client 10.12.49.162#46901: view mpt-dmz-in: send
21-Nov-2012 02:53:12.670 client: debug 3: client 10.12.49.162#43157: view mpt-dmz-in: send
21-Nov-2012 02:53:12.670 client: debug 3: client 10.12.49.162#46901: view mpt-dmz-in: sendto
21-Nov-2012 02:53:12.670 client: debug 3: client 10.12.49.162#43157: view mpt-dmz-in: sendto
21-Nov-2012 02:53:12.670 client: debug 3: client 10.12.49.162#46901: view mpt-dmz-in: senddone
21-Nov-2012 02:53:12.670 client: debug 3: client 10.12.49.162#46901: view mpt-dmz-in: next
21-Nov-2012 02:53:12.670 client: debug 3: client 10.12.49.162#46901: view mpt-dmz-in: endrequest
21-Nov-2012 02:53:12.670 client: debug 3: client 10.12.49.162#43157: view mpt-dmz-in: senddone


Matching up with my ftp.mozilla.org failures I see no SERVFAILs. I find a timeout but no corresponding failure.

21-Nov-2012 02:52:25.334 resolver: debug 3: fctx 0x7f4060123fa0(ftp.dynect.mozilla.net/A'): timeout
21-Nov-2012 02:52:25.334 resolver: debug 3: fctx 0x7f4060123fa0(ftp.dynect.mozilla.net/A'): cancelquery
21-Nov-2012 02:52:25.334 resolver: debug 3: fctx 0x7f4060123fa0(ftp.dynect.mozilla.net/A'): try
21-Nov-2012 02:52:25.334 resolver: debug 3: fctx 0x7f4060123fa0(ftp.dynect.mozilla.net/A'): query
21-Nov-2012 02:52:25.334 resolver: debug 3: resquery 0x7f4060129fa0 (fctx 0x7f4060123fa0(ftp.dynect.mozilla.net/A)): send
21-Nov-2012 02:52:25.335 resolver: debug 3: resquery 0x7f4060129fa0 (fctx 0x7f4060123fa0(ftp.dynect.mozilla.net/A)): sent
21-Nov-2012 02:52:25.335 resolver: debug 3: resquery 0x7f4060129fa0 (fctx 0x7f4060123fa0(ftp.dynect.mozilla.net/A)): udpconnected
21-Nov-2012 02:52:25.335 resolver: debug 3: resquery 0x7f4060129fa0 (fctx 0x7f4060123fa0(ftp.dynect.mozilla.net/A)): senddone

I can't find anything matching the hg.mozilla.org failure from 02:51:21 PST 2012. A failure there is more serious since it's authoritative for that data (TTL doesn't decrease) but at the current logging level I can't be sure that request made it to the nameserver.

My last ftp.mozilla.org error matches up with:

 21-Nov-2012 02:53:14.019 resolver: debug 3: fctx 0x7f405c27fbc0(addons.dynect.mozilla.net/AAAA'): timeout
21-Nov-2012 02:53:14.019 resolver: debug 3: fctx 0x7f405c27fbc0(addons.dynect.mozilla.net/AAAA'): cancelquery
21-Nov-2012 02:53:14.019 resolver: debug 3: fctx 0x7f405c27fbc0(addons.dynect.mozilla.net/AAAA'): try
21-Nov-2012 02:53:14.019 resolver: debug 3: fctx 0x7f405c27fbc0(addons.dynect.mozilla.net/AAAA'): query
21-Nov-2012 02:53:14.019 resolver: debug 3: resquery 0x7f405c286630 (fctx 0x7f405c27fbc0(addons.dynect.mozilla.net/AAAA)): send
21-Nov-2012 02:53:14.019 resolver: debug 3: resquery 0x7f405c286630 (fctx 0x7f405c27fbc0(addons.dynect.mozilla.net/AAAA)): sent
21-Nov-2012 02:53:14.020 resolver: debug 3: resquery 0x7f405c286630 (fctx 0x7f405c27fbc0(addons.dynect.mozilla.net/AAAA)): udpconnected
21-Nov-2012 02:53:14.020 resolver: debug 3: resquery 0x7f405c286630 (fctx 0x7f405c27fbc0(addons.dynect.mozilla.net/AAAA)): senddone
21-Nov-2012 02:53:14.020 resolver: debug 3: fctx 0x7f405c27f370(addons.dynect.mozilla.net/A'): timeout
21-Nov-2012 02:53:14.020 resolver: debug 3: fctx 0x7f405c27f370(addons.dynect.mozilla.net/A'): cancelquery
21-Nov-2012 02:53:14.020 resolver: debug 3: fctx 0x7f405c27f370(addons.dynect.mozilla.net/A'): try
21-Nov-2012 02:53:14.020 resolver: debug 3: fctx 0x7f405c27f370(addons.dynect.mozilla.net/A'): query
21-Nov-2012 02:53:14.020 resolver: debug 3: resquery 0x7f405c2868f0 (fctx 0x7f405c27f370(addons.dynect.mozilla.net/A)): send
21-Nov-2012 02:53:14.021 resolver: debug 3: resquery 0x7f405c2868f0 (fctx 0x7f405c27f370(addons.dynect.mozilla.net/A)): sent
21-Nov-2012 02:53:14.021 resolver: debug 3: resquery 0x7f405c2868f0 (fctx 0x7f405c27f370(addons.dynect.mozilla.net/A)): udpconnected
21-Nov-2012 02:53:14.021 resolver: debug 3: resquery 0x7f405c2868f0 (fctx 0x7f405c27f370(addons.dynect.mozilla.net/A)): senddone
21-Nov-2012 02:53:14.054 resolver: debug 3: fctx 0x7f40601243c8(ftp.dynect.mozilla.net/A'): timeout
21-Nov-2012 02:53:14.054 resolver: debug 3: fctx 0x7f40601243c8(ftp.dynect.mozilla.net/A'): cancelquery
21-Nov-2012 02:53:14.054 resolver: debug 3: fctx 0x7f40601243c8(ftp.dynect.mozilla.net/A'): try
21-Nov-2012 02:53:14.054 resolver: debug 3: fctx 0x7f40601243c8(ftp.dynect.mozilla.net/A'): query
21-Nov-2012 02:53:14.054 resolver: debug 3: resquery 0x7f406012a260 (fctx 0x7f40601243c8(ftp.dynect.mozilla.net/A)): send
21-Nov-2012 02:53:14.054 resolver: debug 3: resquery 0x7f406012a260 (fctx 0x7f40601243c8(ftp.dynect.mozilla.net/A)): sent
21-Nov-2012 02:53:14.054 resolver: debug 3: resquery 0x7f406012a260 (fctx 0x7f40601243c8(ftp.dynect.mozilla.net/A)): udpconnected
21-Nov-2012 02:53:14.055 resolver: debug 3: resquery 0x7f406012a260 (fctx 0x7f40601243c8(ftp.dynect.mozilla.net/A)): senddone

and another:

21-Nov-2012 02:51:36.002 resolver: debug 3: fctx 0x7f40601243c8(ftp.dynect.mozilla.net/A'): timeout
21-Nov-2012 02:51:36.002 resolver: debug 3: fctx 0x7f40601243c8(ftp.dynect.mozilla.net/A'): done
21-Nov-2012 02:51:36.002 resolver: debug 3: fctx 0x7f40601243c8(ftp.dynect.mozilla.net/A'): stopeverything
21-Nov-2012 02:51:36.002 resolver: debug 3: fctx 0x7f40601243c8(ftp.dynect.mozilla.net/A'): cancelqueries
21-Nov-2012 02:51:36.002 resolver: debug 3: fctx 0x7f40601243c8(ftp.dynect.mozilla.net/A'): cancelquery
21-Nov-2012 02:51:36.003 resolver: debug 3: fctx 0x7f40601243c8(ftp.dynect.mozilla.net/A'): sendevents
21-Nov-2012 02:51:36.003 query-errors: debug 1: client 10.12.51.53#49176: view mpt-dmz-in: query failed (SERVFAIL) for ftp.mozilla.org/IN/A at query.c:6440
21-Nov-2012 02:51:36.003 client: debug 3: client 10.12.51.53#49176: view mpt-dmz-in: error
21-Nov-2012 02:51:36.003 query-errors: debug 1: client 10.12.49.162#40832: view mpt-dmz-in: query failed (SERVFAIL) for ftp.mozilla.org/IN/A at query.c:6440
21-Nov-2012 02:51:36.003 client: debug 3: client 10.12.49.162#40832: view mpt-dmz-in: error
21-Nov-2012 02:51:36.003 query-errors: debug 1: client 10.12.49.144#45572: view mpt-dmz-in: query failed (SERVFAIL) for ftp.mozilla.org/IN/A at query.c:6440
21-Nov-2012 02:51:36.003 client: debug 3: client 10.12.49.144#45572: view mpt-dmz-in: error
21-Nov-2012 02:51:36.003 client: debug 3: client 10.12.49.144#45572: view mpt-dmz-in: send
21-Nov-2012 02:51:36.003 client: debug 3: client 10.12.49.162#40832: view mpt-dmz-in: send
21-Nov-2012 02:51:36.003 client: debug 3: client 10.12.51.53#49176: view mpt-dmz-in: send
21-Nov-2012 02:51:36.003 query-errors: debug 1: client 10.12.51.53#54344: view mpt-dmz-in: query failed (SERVFAIL) for ftp.mozilla.org/IN/A at query.c:6440
21-Nov-2012 02:51:36.003 client: debug 3: client 10.12.51.53#54344: view mpt-dmz-in: error
21-Nov-2012 02:51:36.003 client: debug 3: client 10.12.49.162#40832: view mpt-dmz-in: sendto
21-Nov-2012 02:51:36.004 client: debug 3: client 10.12.51.53#49176: view mpt-dmz-in: sendto
21-Nov-2012 02:51:36.004 client: debug 3: client 10.12.49.144#45572: view mpt-dmz-in: sendto
21-Nov-2012 02:51:36.004 client: debug 3: client 10.12.51.53#54344: view mpt-dmz-in: send
21-Nov-2012 02:51:36.004 client: debug 3: client 10.12.49.144#45572: view mpt-dmz-in: senddone
21-Nov-2012 02:51:36.004 client: debug 3: client 10.12.49.144#45572: view mpt-dmz-in: next
21-Nov-2012 02:51:36.004 client: debug 3: client 10.12.49.144#45572: view mpt-dmz-in: endrequest
21-Nov-2012 02:51:36.004 client: debug 3: client 10.12.51.53#54344: view mpt-dmz-in: sendto
21-Nov-2012 02:51:36.004 client: debug 3: client 10.12.51.53#49176: view mpt-dmz-in: senddone
21-Nov-2012 02:51:36.004 client: debug 3: client 10.12.49.162#40832: view mpt-dmz-in: senddone
21-Nov-2012 02:51:36.004 client: debug 3: client 10.12.49.162#40832: view mpt-dmz-in: next
21-Nov-2012 02:51:36.004 client: debug 3: client 10.12.51.53#54344: view mpt-dmz-in: senddone
21-Nov-2012 02:51:36.004 client: debug 3: client 10.12.51.53#54344: view mpt-dmz-in: next
21-Nov-2012 02:51:36.004 query-errors: debug 2: fetch completed at resolver.c:3178 for ftp.dynect.mozilla.net/A in 30.000451: timed out/success [domain:dynect.mozilla.net,referral:0,restart:5,qrysent:17,timeout:16,lame:0,neterr:0,badresp:0,adberr:0,findfail:0,valfail:0]
21-Nov-2012 02:51:36.005 resolver: debug 3: fetch 0x7f4069bc9000 (fctx 0x7f40601243c8(ftp.dynect.mozilla.net/A)): destroyfetch


It does seem like it would be good to get dynect out of the path for releng. makes me wonder how the end user experience is via them.

There are various other similar errors in the logs, mostly for redhat and redhat mirror domains and akamai.
(In reply to Peter Radcliffe [:pir] from comment #40)
> Testing on sp-admin01.phx1 has seen no failures at all. This would imply to
> me that :selena's issue is different.
> 
> :armenzg - are you still seeing 503s? Those are again, a different issue and
> need to be tracked through a separate chunk of infrastructure.

I had reported the failure on behalf of philor. I don't look into finding those errors myself.
Just hit this again. Not copying/pasting a bunch of redundant logs.
Just hit this again (not going to copy & paste links to 30 logs showing the same thing).
I think we have plenty of examples now, thanks.

The issue is debugging. The nameservers get far too many requests to log absolutely everything and it's hard to work out if a handful of requests in a few million gets to the nameserver or if there's a network issue, the process is not responding or one of a dozen other reasons these requests aren't getting responded to.

We'll need to work through some possibilities. Easiest to test is taking dynect out of the path. I've added ftp-scl3.mozilla.com as a direct ip, not a CNAME, can you switch these wget downloads to paths under
http://ftp-scl3.mozilla.com/pub/mozilla.org/ instead of ftp.mozilla.org, please?
Depends on: 815124
Is there a way from the results of buildbot to get a number per day of failures? It would help me establish if something is helping or not.
(In reply to Peter Radcliffe [:pir] from comment #52)
> Is there a way from the results of buildbot to get a number per day of
> failures? It would help me establish if something is helping or not.

Not that I can think of. I will ask.
For the most part, these should be getting starred on tbpl as this bug number, so you should be able to see the frequency on OrangeFactor.
Assignee: pradcliffe → bhourigan
The test releng box from comment 17 is running a test script in a root screen session.
Today I added admin1a and admin1b directly into /etc/resolv.conf and have had one failure already.
I just moved admin1a to be first instead of the virtual IP.
I'm taking this issue over from :pir as he is going on PTO starting today. After analyzing the detailed debug data collected over the weekend the root cause appears to be related to Dynect, a third party geo aware DNS provider. One of their name servers consistently times out. I've escalated this issue to their support, and I have provided them with our debug logs.

As a temporary workaround I may suggest connecting to the IP of ftp.mozilla.org directly (63.245.215.46), this will eliminate any DNS related failures while this issue is being resolved.
(In reply to Peter Radcliffe [:pir] from comment #51)
> We'll need to work through some possibilities. Easiest to test is taking
> dynect out of the path. I've added ftp-scl3.mozilla.com as a direct ip, not
> a CNAME, can you switch these wget downloads to paths under
> http://ftp-scl3.mozilla.com/pub/mozilla.org/ instead of ftp.mozilla.org,
> please?

ftp-scl3.mozilla.com resolves on the external nameservers (ie ns1.m.o and friends) but not on the internal ones the test machines are hitting (scl1: 10.12.48.19; mtv1: 10.250.48.19; scl3: 10.26.75.40, 10.26.75.41). Could we fix that up ?
(In reply to Nick Thomas [:nthomas] from comment #59)
> ftp-scl3.mozilla.com resolves on the external nameservers (ie ns1.m.o and
> friends) but not on the internal ones the test machines are hitting (scl1:
> 10.12.48.19; mtv1: 10.250.48.19; scl3: 10.26.75.40, 10.26.75.41). Could we
> fix that up ?

I just committed this change. Should be visible in a few minutes.


(In reply to Brian Hourigan [:digi] from comment #58)
> As a temporary workaround I may suggest connecting to the IP of
> ftp.mozilla.org directly (63.245.215.46), this will eliminate any DNS
> related failures while this issue is being resolved.

I doubt this will work conveniently. If you don't send a proper Host: header, our default Apache configs (set by Security Assurance) will give you a 403. Let's try the separate name :pir set up, although from other comments here I'm still not thoroughly convinced this is the right solution. It sounds to me like other stuff is failing too, and this is just one of many. Also if Dynect really is causing problems, that affects a lot more than just FTP access for Rel-Eng... it affects (for example) public access to www.mozilla.org.
At 4PM PST I made changes to admin1[ab].infra.scl1 which may address the intermittent failures. We suspect the issue is related to BIND thinking it can query Dynect via IPv6 falling back to IPv4 after a timeout. It's only a theory at this point, but we're hopeful this will address the intermittent failures. I started a tcpdump inside a screen so if it fails again we'll have detailed information beyond level 3 debug logs.
Just wanted to update everyone - after analyzing the packet capture we have identified that the failure is related to admin1b.infra.scl1 reaching Dynect's external authoritative name servers.

Bug 816224 is a netops bug to investigate whether its a local network problem or if it is indeed a failure with Dynect. If it is a Dynect failure I hope to gather supporting evidence so they will take my bug report seriously. I have yet to receive any response from Dynect thus far.

Thank you all for being patient, this has been a difficult to diagnose problem and we're making progress towards a final resolution.
:nthomas (sorry for the slow response, I've been on PTO).

The nameservers all seem to be resolving ftp-scl3.mozilla.com so either someone fixed it or it just needed more time to push out.

I'd agree that switching to this name isn't a solution for the overall problem but there's also no need for internal machines to be hitting Dynect nameservers. It increases traffic and failure modes.
Bug 816224 which referenced a networking issue has been closed I would not expect to see further failures for ftp.mozilla.org.
Thank you for everyone's work on this.

I haven't seen any instances for a fair few days now, so going to tentatively call this fixed; we can reopen if needed.

Thank you! :-D
Status: NEW → RESOLVED
Closed: 8 years ago
Resolution: --- → FIXED
:edmorley - it looks like using ftp-scl3.m.c got reverted in bug 815124?
Any chance we could get that un-reverted so these aren't hitting dynect (reduces failure points and traffic out to dynect).
(In reply to Peter Radcliffe [:pir] from comment #77)
> :edmorley - it looks like using ftp-scl3.m.c got reverted in bug 815124?
> Any chance we could get that un-reverted so these aren't hitting dynect
> (reduces failure points and traffic out to dynect).

It relanded (bug 815124 comment 7).
Ah, great. Thanks for the confirmation.
I know this bug is closed now and I don't expect a major reworking here, but I'd very much appreciate it if we could not let this fix become too permanent. I believe this sort of solution is ultimately poisonous to our overall infrastructure, in a "death by 1000 tiny cuts" sort of way.

This basically constitutes a split-horizon DNS system, or a split-brain scenario. I know that's not quite the right term, but essentially we have "Rel-Eng uses one thing, the rest of the world uses something else". Historically that type of split has gotten us into trouble, and we are generally undoing it everywhere we can. This is a step in the opposite direction... one that I feel was made under duress of a service interruption rather than careful consideration and planning.


My overall concern is mostly centered around maintainability:

I'm not sure "ftp-scl3.mozilla.com" is a good long-term name. I believe something in mozilla.org might be more appropriate. This is obviously minor.

I can't guarantee that "ftp-scl3.mozilla.com" will always work identically to ftp.mozilla.org. Apache VirtualHosts change over time, and unusual / unknown / unexpected things tend to break. There's a lot more conditional logic around "ftp.mozilla.org" than one might expect. Heck, I can't even guarantee they work identically *right now* without spending some time to validate the configs. :)

I'm skeptical that, in 2 years' time, anyone will remember why this exists or why we don't just use ftp.mozilla.org. That's not good for maintainability.

Let's say we migrate off of Dynect in 2 years and the concern/need for this goes away. Will anyone remember that we have this special case hanging around, or will it persist for years after as unnecessary cruft? If our current DNS/Apache configs are any indication, the latter is *far* more likely. Wandering around http://ftp.mozilla.org/ speaks volumes about how good we are at cleaning out cruft. :)

Stopgap decisions based on load and availability are (to an extent) how we got pv-mirror01, pv-mirror02, releases.mozilla.org, releases-rsync, stage-rsync, and various other moving parts to an overall layout... and how their responsibilities shifted over time. This can be good for performance, but tends to be bad for maintainability... it's hard to keep the whole layout in your head, especially if you weren't involved in the design. This feels like a similar crutch. In other words, if X isn't good enough, I think we should spend some time trying to make X better rather than abandon it for Y at the first sign of trouble. That way lies madness.

I can't guarantee that ftp-scl3.mozilla.com will always be *available*. The reason we go to Dynect is to intelligently balance between multiple IPs. This allows us to transparently add and remove IPs (or even whole locations) on the fly, from a single pane of glass... ftp-scl3.mozilla.com is a simple A record to one IP, period. If that IP is down or deprecated (say, due to a failed Zeus node), or the whole SCL3 FTP Cluster is offline (say, failed over to PHX1- someday), you're dead in the water while the rest of the world stays online. By not using Dynect, you lose the HA capabilities they do provide in favor of a single, hard-coded default. Essentially, you are trading one set of failure modes for another.

Lastly, if Dynect isn't good enough for Rel-Eng, I have to wonder if it's good enough for all our users. I say this partly in jest (as there are real concerns), but there's a nugget of truth here. *Any* network can have connectivity problems to specific other places... if Comcast's DNS servers can't reach Dynect (but can reach Mozilla's), then we're down for Comcast users (and for way more than just ftp.m.o). That's essentially no different than this scenario was, just replace "Comcast" with "Rel-Eng". We can't expect Comcast to rewrite "ftp.mozilla.org" into "ftp-scl3.mozilla.com" for us, and so it seems like a questionable approach for Rel-Eng as well. We should endeavor to make it work, in a way that works for everyone.



If you got this far, thanks for reading. :)
It's my understanding that ftp-scl3 is temporary. It was designed as a temporary fix to reduce the business impact of the DNS failures. We tracked the root cause of these DNS failures to a networking issue in scl1, which was fixed this weekend.

I expect that we can revert post_upload.ini to ftp.m.o and drop ftp-scl3 from DNS, going back to Dynect.
Component: Server Operations: Infrastructure → Infrastructure: Other
Product: mozilla.org → Infrastructure & Operations
You need to log in before you can comment on or make changes to this bug.