Closed Bug 1269464 Opened 9 years ago Closed 9 years ago

Investigate why proxxy crapped out in the wee hours of 2016-05-01, consider monitoring and autorecovery options

Categories

(Release Engineering :: General, defect)

Type: defect
Priority: Not set
Severity: normal

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: philor, Unassigned)


I retriggered on a push from a couple of weeks back, in https://treeherder.mozilla.org/#/jobs?repo=mozilla-central&revision=91115264629dfaacf2d60d52a3eff89c18c5af0d&filter-searchStr=2f1a61e71e091d8b93709edbb1b9bdfb4fa90921&selectedJob=3813256, to be sure I was well before any possibility of it being a code change; there are failures on aurora and beta as well. Starting on the first of the month would make me suspicious of a timebomb, except that the failure is just the whole job exceeding its maxtime of 1800 seconds, not a failure in any particular test. Also, although we don't have enough activity in the wee hours of Sunday to time things precisely, it appears to have started a fair bit after midnight, unlike a timebomb.
Blocks: 1269641
Linux32 debug mochitest-webgl has failed 100% of the time on mozilla-beta, and on all but one run on mozilla-aurora, since the whatever-happened. Jit-2 on mozilla-aurora and Jit-1 on trunk are pretty bad, maybe 20%, which is four times our visibility threshold, and I'd be pretty concerned by that if the permared didn't look so much worse. Apparently what happened was that everything got significantly slower, but the percentage increase varies among suites, and only some suites were running close enough to their maxtime for a doubled runtime to put them over the line.
None of my choices here make this anything other than a blocker.
Severity: critical → blocker
Summary: Starting the morning of 2016-05-01, Linux32 debug mochitest-a11y began exceeding maxtime around 80% of the time → All trees closed: Starting the morning of 2016-05-01, Linux32 tests began running at up to twice as slow, exceeding maxtime and creating permared on various suites
http://archive.mozilla.org/pub/firefox/tinderbox-builds/mozilla-central-linux-debug/1460626792/mozilla-central_ubuntu32_vm-debug_test-mochitest-a11y-bm05-tests1-linux32-build8.txt.gz
There are several attempts to download from taskcluster through proxxy. All of them fail, retries included, and each resource costs about 3 minutes (08:22:16 to 08:25:16 in the excerpt below), which pushes the runtime above the limit:
08:22:16 INFO - URL Candidate: http://queue.taskcluster.net.proxxy1.srv.releng.usw2.mozilla.com/v1/task/BzXNiYLgQ4C9IpAWxsTqRA/artifacts/public/build/firefox-48.0a1.en-US.linux-i686.test_packages.json
08:22:16 INFO - trying http://queue.taskcluster.net.proxxy1.srv.releng.usw2.mozilla.com/v1/task/BzXNiYLgQ4C9IpAWxsTqRA/artifacts/public/build/firefox-48.0a1.en-US.linux-i686.test_packages.json
08:22:16 INFO - Downloading http://queue.taskcluster.net.proxxy1.srv.releng.usw2.mozilla.com/v1/task/BzXNiYLgQ4C9IpAWxsTqRA/artifacts/public/build/firefox-48.0a1.en-US.linux-i686.test_packages.json to /builds/slave/test/build/firefox-48.0a1.en-US.linux-i686.test_packages.json
08:22:16 INFO - retry: Calling _download_file with args: (), kwargs: {'url': 'http://queue.taskcluster.net.proxxy1.srv.releng.usw2.mozilla.com/v1/task/BzXNiYLgQ4C9IpAWxsTqRA/artifacts/public/build/firefox-48.0a1.en-US.linux-i686.test_packages.json', 'file_name': '/builds/slave/test/build/firefox-48.0a1.en-US.linux-i686.test_packages.json'}, attempt #1
08:22:46 WARNING - Timed out accessing http://queue.taskcluster.net.proxxy1.srv.releng.usw2.mozilla.com/v1/task/BzXNiYLgQ4C9IpAWxsTqRA/artifacts/public/build/firefox-48.0a1.en-US.linux-i686.test_packages.json: timed out
08:22:46 INFO - retry: attempt #1 caught exception: timed out
08:22:46 INFO - retry: Failed, sleeping 30 seconds before retrying
08:23:16 INFO - retry: Calling _download_file with args: (), kwargs: {'url': 'http://queue.taskcluster.net.proxxy1.srv.releng.usw2.mozilla.com/v1/task/BzXNiYLgQ4C9IpAWxsTqRA/artifacts/public/build/firefox-48.0a1.en-US.linux-i686.test_packages.json', 'file_name': '/builds/slave/test/build/firefox-48.0a1.en-US.linux-i686.test_packages.json'}, attempt #2
08:23:46 WARNING - Timed out accessing http://queue.taskcluster.net.proxxy1.srv.releng.usw2.mozilla.com/v1/task/BzXNiYLgQ4C9IpAWxsTqRA/artifacts/public/build/firefox-48.0a1.en-US.linux-i686.test_packages.json: timed out
08:23:46 INFO - retry: attempt #2 caught exception: timed out
08:23:46 INFO - retry: Failed, sleeping 60 seconds before retrying
08:24:46 INFO - retry: Calling _download_file with args: (), kwargs: {'url': 'http://queue.taskcluster.net.proxxy1.srv.releng.usw2.mozilla.com/v1/task/BzXNiYLgQ4C9IpAWxsTqRA/artifacts/public/build/firefox-48.0a1.en-US.linux-i686.test_packages.json', 'file_name': '/builds/slave/test/build/firefox-48.0a1.en-US.linux-i686.test_packages.json'}, attempt #3
08:25:16 WARNING - Timed out accessing http://queue.taskcluster.net.proxxy1.srv.releng.usw2.mozilla.com/v1/task/BzXNiYLgQ4C9IpAWxsTqRA/artifacts/public/build/firefox-48.0a1.en-US.linux-i686.test_packages.json: timed out
08:25:16 INFO - retry: attempt #3 caught exception: timed out
08:25:16 INFO - Can't download from http://queue.taskcluster.net.proxxy1.srv.releng.usw2.mozilla.com/v1/task/BzXNiYLgQ4C9IpAWxsTqRA/artifacts/public/build/firefox-48.0a1.en-US.linux-i686.test_packages.json to /builds/slave/test/build/firefox-48.0a1.en-US.linux-i686.test_packages.json!
08:25:16 INFO - Caught exception: timed out
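To make the cost of one dead proxxy candidate concrete, here is a minimal sketch of the arithmetic, using only numbers read off the log excerpt above (a 30-second timeout per attempt, sleeps of 30 and 60 seconds, three attempts) and the 1800-second maxtime mentioned earlier in this bug. The function name `worst_case_seconds` is hypothetical and is not part of the test harness.

```python
def worst_case_seconds(timeout, sleeps, attempts):
    """Wall time spent on one resource if every attempt times out.

    timeout  -- per-attempt timeout in seconds
    sleeps   -- sleep after each failed attempt; no sleep follows the last one
    attempts -- total number of attempts
    """
    if len(sleeps) < attempts - 1:
        raise ValueError("need a sleep value for every retry gap")
    return attempts * timeout + sum(sleeps[:attempts - 1])


# Values taken from the log excerpt: three 30 s timeouts, sleeps of 30 s and 60 s.
per_resource = worst_case_seconds(timeout=30, sleeps=[30, 60], attempts=3)
print(per_resource)  # 180

# With a job maxtime of 1800 s, this many such failures would use the whole budget.
MAXTIME = 1800
print(MAXTIME // per_resource)  # 10
```

So a suite that already runs close to its maxtime needs only a few resources to go through this failure path before it tips over the limit.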
proxxy is timing out trying to fetch from queue.tc.net
I've restarted all the proxxy instances.
I'm assuming the restart fixed things (and won't unfix anytime soon) and reopening trees.
Are we still OK now? Lowering the severity, as trees are still open.
Severity: blocker → normal
Yup, we're okay for Linux32, and we're also starting to catch up on the insane backlog we built up for OS X 10.10 from the same problem. Those jobs weren't hitting maxtime because they have plenty of headroom, but sitting through the proxxy failures, taking 15 minutes for download-and-extract instead of 30 seconds, was as much as doubling their runtime. Now we just need to do the new summary.
Component: Buildduty → General Automation
QA Contact: bugspam.Callek → catlee
Summary: All trees closed: Starting the morning of 2016-05-01, Linux32 tests began running at up to twice as slow, exceeding maxtime and creating permared on various suites → Investigate why proxxy crapped out in the wee hours of 2016-05-01, consider monitoring and autorecovery options
I blame DNS / The Cloud. I've filed bug 1270337 to let us know if/when this happens next time.
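As an illustration of the kind of monitoring the summary asks for, here is a rough sketch of a probe that fetches one URL through a proxxy instance and flags sustained failure. This is not what bug 1270337 implements; the names `probe` and `should_alert`, the thresholds, and the choice of URL to probe are all assumptions made for the example.

```python
import time
import urllib.request


def probe(url, timeout=10.0, opener=urllib.request.urlopen):
    """Fetch url once; return (ok, elapsed_seconds, detail).

    opener is injectable so the check can be exercised without a network.
    """
    start = time.monotonic()
    try:
        with opener(url, timeout=timeout) as resp:
            resp.read(1)  # make sure a body byte actually arrives
            status = resp.status
        ok, detail = (200 <= status < 400), "HTTP %d" % status
    except OSError as exc:
        # URLError, HTTPError and socket timeouts are all OSError subclasses.
        ok, detail = False, "error: %s" % exc
    return ok, time.monotonic() - start, detail


def should_alert(history, threshold=3):
    """True when the last `threshold` probe results were all failures."""
    recent = history[-threshold:]
    return len(recent) == threshold and not any(recent)
```

A caller would run `probe` on a schedule against a small artifact URL on each proxxy host (the URL is up to the operator), append the `ok` values to a list, and alert or restart the instance when `should_alert` returns true. Requiring several consecutive failures avoids paging on a single slow request.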
Status: NEW → RESOLVED
Closed: 9 years ago
Resolution: --- → FIXED
Component: General Automation → General