Closed
Bug 1269464
Opened 9 years ago
Closed 9 years ago
Investigate why proxxy crapped out in the wee hours of 2016-05-01, consider monitoring and autorecovery options
Categories
(Release Engineering :: General, defect)
Tracking
(Not tracked)
RESOLVED
FIXED
People
(Reporter: philor, Unassigned)
Retriggered on a push from a couple of weeks back, in https://treeherder.mozilla.org/#/jobs?repo=mozilla-central&revision=91115264629dfaacf2d60d52a3eff89c18c5af0d&filter-searchStr=2f1a61e71e091d8b93709edbb1b9bdfb4fa90921&selectedJob=3813256, to be sure I was well before any possibility of this being a code change. There are failures on aurora and beta as well.
Starting on the first of the month would make me suspicious of a timebomb, except that it's just a "maxtime exceeded 1800 seconds" total failure, not a failure in a particular test. And although we don't have enough activity in the wee hours of Sunday to time things precisely, it appears to have started a fair bit after midnight, unlike a timebomb.
Reporter
Comment 2•9 years ago
Linux32 debug mochitest-webgl has failed 100% of the time on mozilla-beta, and all but one run on mozilla-aurora, since the whatever-happened. Jit-2 on mozilla-aurora and Jit-1 on trunk are pretty bad, maybe 20%, which is four times our visibility threshold, and I'd be pretty concerned by that if not for the permared looking much worse.
Apparently what happened was that everything got significantly slower, but the percentage increase varies among suites, and only some suites were running close enough to their maxtime for a doubled runtime to put them over the line.
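The headroom argument above can be sketched as a small check. This is only an illustration: the 1800-second maxtime comes from the failure message quoted in the description, while the suite names and baseline runtimes below are hypothetical, and other suites may well be configured with a different maxtime.

```python
# Sketch only: which suites cross maxtime when their runtime is multiplied?
MAXTIME = 1800  # seconds, from the "maxtime exceeded 1800 seconds" failure


def exceeds_maxtime(baseline_seconds, slowdown, maxtime=MAXTIME):
    """Return True if a suite slowed down by `slowdown` runs past `maxtime`."""
    return baseline_seconds * slowdown > maxtime


# Hypothetical baselines, not measured values from this bug.
suites = {"suite-near-limit": 1000, "suite-with-headroom": 600}

for name, baseline in suites.items():
    print(name, exceeds_maxtime(baseline, slowdown=2.0))
```

With a doubling, only a suite whose normal runtime is already above half of its maxtime goes over the line, which would match some suites going permared while others only got slower.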
Reporter
Comment 3•9 years ago
None of my choices here make this anything other than a blocker.
Severity: critical → blocker
Reporter
Updated•9 years ago
Summary: Starting the morning of 2016-05-01, Linux32 debug mochitest-a11y began exceeding maxtime around 80% of the time → All trees closed: Starting the morning of 2016-05-01, Linux32 tests began running at up to twice as slow, exceeding maxtime and creating permared on various suites
Comment 4•9 years ago
http://archive.mozilla.org/pub/firefox/tinderbox-builds/mozilla-central-linux-debug/1460626792/mozilla-central_ubuntu32_vm-debug_test-mochitest-a11y-bm05-tests1-linux32-build8.txt.gz
There are several attempts to download from taskcluster through proxxy. All of them, including the retries, fail, but each resource still costs about 3 minutes (08:22:16 to 08:25:16 in the excerpt below), which pushes the runtime above the limit:
08:22:16 INFO - URL Candidate: http://queue.taskcluster.net.proxxy1.srv.releng.usw2.mozilla.com/v1/task/BzXNiYLgQ4C9IpAWxsTqRA/artifacts/public/build/firefox-48.0a1.en-US.linux-i686.test_packages.json
08:22:16 INFO - trying http://queue.taskcluster.net.proxxy1.srv.releng.usw2.mozilla.com/v1/task/BzXNiYLgQ4C9IpAWxsTqRA/artifacts/public/build/firefox-48.0a1.en-US.linux-i686.test_packages.json
08:22:16 INFO - Downloading http://queue.taskcluster.net.proxxy1.srv.releng.usw2.mozilla.com/v1/task/BzXNiYLgQ4C9IpAWxsTqRA/artifacts/public/build/firefox-48.0a1.en-US.linux-i686.test_packages.json to /builds/slave/test/build/firefox-48.0a1.en-US.linux-i686.test_packages.json
08:22:16 INFO - retry: Calling _download_file with args: (), kwargs: {'url': 'http://queue.taskcluster.net.proxxy1.srv.releng.usw2.mozilla.com/v1/task/BzXNiYLgQ4C9IpAWxsTqRA/artifacts/public/build/firefox-48.0a1.en-US.linux-i686.test_packages.json', 'file_name': '/builds/slave/test/build/firefox-48.0a1.en-US.linux-i686.test_packages.json'}, attempt #1
08:22:46 WARNING - Timed out accessing http://queue.taskcluster.net.proxxy1.srv.releng.usw2.mozilla.com/v1/task/BzXNiYLgQ4C9IpAWxsTqRA/artifacts/public/build/firefox-48.0a1.en-US.linux-i686.test_packages.json: timed out
08:22:46 INFO - retry: attempt #1 caught exception: timed out
08:22:46 INFO - retry: Failed, sleeping 30 seconds before retrying
08:23:16 INFO - retry: Calling _download_file with args: (), kwargs: {'url': 'http://queue.taskcluster.net.proxxy1.srv.releng.usw2.mozilla.com/v1/task/BzXNiYLgQ4C9IpAWxsTqRA/artifacts/public/build/firefox-48.0a1.en-US.linux-i686.test_packages.json', 'file_name': '/builds/slave/test/build/firefox-48.0a1.en-US.linux-i686.test_packages.json'}, attempt #2
08:23:46 WARNING - Timed out accessing http://queue.taskcluster.net.proxxy1.srv.releng.usw2.mozilla.com/v1/task/BzXNiYLgQ4C9IpAWxsTqRA/artifacts/public/build/firefox-48.0a1.en-US.linux-i686.test_packages.json: timed out
08:23:46 INFO - retry: attempt #2 caught exception: timed out
08:23:46 INFO - retry: Failed, sleeping 60 seconds before retrying
08:24:46 INFO - retry: Calling _download_file with args: (), kwargs: {'url': 'http://queue.taskcluster.net.proxxy1.srv.releng.usw2.mozilla.com/v1/task/BzXNiYLgQ4C9IpAWxsTqRA/artifacts/public/build/firefox-48.0a1.en-US.linux-i686.test_packages.json', 'file_name': '/builds/slave/test/build/firefox-48.0a1.en-US.linux-i686.test_packages.json'}, attempt #3
08:25:16 WARNING - Timed out accessing http://queue.taskcluster.net.proxxy1.srv.releng.usw2.mozilla.com/v1/task/BzXNiYLgQ4C9IpAWxsTqRA/artifacts/public/build/firefox-48.0a1.en-US.linux-i686.test_packages.json: timed out
08:25:16 INFO - retry: attempt #3 caught exception: timed out
08:25:16 INFO - Can't download from http://queue.taskcluster.net.proxxy1.srv.releng.usw2.mozilla.com/v1/task/BzXNiYLgQ4C9IpAWxsTqRA/artifacts/public/build/firefox-48.0a1.en-US.linux-i686.test_packages.json to /builds/slave/test/build/firefox-48.0a1.en-US.linux-i686.test_packages.json!
08:25:16 INFO - Caught exception: timed out
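As a rough model of the cost shown in the log above: the timings suggest three attempts, a 30-second timeout per attempt, and sleeps of 30 and then 60 seconds between attempts. The function below is a hypothetical sketch that just adds those numbers up; the parameter names and the doubling factor are inferred from the log, not taken from the harness source.

```python
# Sketch only: wall-clock cost of one download when every attempt times out.
def worst_case_seconds(attempts=3, timeout=30, sleeptime=30, factor=2):
    """Sum the timeouts and the sleeps between attempts.

    Defaults are inferred from the log excerpt (30 s timeout, 30 s then
    60 s sleep); they are assumptions, not the harness's documented values.
    """
    total = 0
    sleep = sleeptime
    for attempt in range(1, attempts + 1):
        total += timeout          # the attempt itself times out
        if attempt < attempts:    # no sleep after the final attempt
            total += sleep
            sleep *= factor       # 30 s, then 60 s in the log
    return total


print(worst_case_seconds())  # 180
```

That is 3 minutes per URL candidate, so a job that has to sit through several such downloads can lose a large share of an 1800-second budget before any test runs.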
Comment 5•9 years ago
proxxy is timing out trying to fetch from queue.tc.net
Comment 6•9 years ago
I've restarted all the proxxy instances.
All of the retriggers that philor linked in #releng came back green.
https://treeherder.mozilla.org/#/jobs?repo=mozilla-inbound&revision=60ce6cc4da0665503d9f9f5f66c4ddd9474228ea&filter-searchStr=3a7186143fb372185a039f0ffe27f23e76d09cc0
https://treeherder.mozilla.org/#/jobs?repo=mozilla-aurora&revision=4b9c6eb81f85bc3474a284a831b6ff33344e33a4&filter-searchStr=f884bd7bec0a2519fae637b09c3213d4959a0fc3
https://treeherder.mozilla.org/#/jobs?repo=mozilla-beta&revision=410939eae11a9ef8432b3ca874b5e3262000dfea&filter-searchStr=8ac74cda3b8d9b74586336f1b64babf5e69f5072
https://treeherder.mozilla.org/#/jobs?repo=mozilla-beta&revision=410939eae11a9ef8432b3ca874b5e3262000dfea&filter-searchStr=73a30cdac109f6c2f764a57f7dc0e4e77258ed2b
I'm assuming the restart fixed things (and won't unfix anytime soon) and reopening trees.
Comment 9•9 years ago
Are we still OK now? Lowering severity, as the trees are still open.
Severity: blocker → normal
Reporter
Comment 10•9 years ago
Yup, we're both okay for Linux32 and starting to catch up on the insane backlog we built up for OS X 10.10 from the same problem - they weren't hitting maxtime because they have plenty of headroom, but sitting through the proxxy failures, taking 15 minutes for download-and-extract instead of 30 seconds, was as much as doubling their runtime.
Now we just need to do the new summary.
Component: Buildduty → General Automation
QA Contact: bugspam.Callek → catlee
Summary: All trees closed: Starting the morning of 2016-05-01, Linux32 tests began running at up to twice as slow, exceeding maxtime and creating permared on various suites → Investigate why proxxy crapped out in the wee hours of 2016-05-01, consider monitoring and autorecovery options
Comment 11•9 years ago
I blame DNS / The Cloud. I've filed bug 1270337 to let us know if/when this happens next time.
Status: NEW → RESOLVED
Closed: 9 years ago
Resolution: --- → FIXED
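For the monitoring side, a minimal probe might look like the sketch below. This is not what bug 1270337 implemented (this bug does not say); it is only an assumption about the simplest useful check: fetch a URL through proxxy with a short timeout and treat a timeout or connection error as unhealthy. The `probe` function and its `opener` parameter are hypothetical names; `opener` exists only so the check can be exercised without a network.

```python
# Sketch only: a timeout-based health probe, assuming plain HTTP access.
import socket
import urllib.error
import urllib.request


def probe(url, timeout=10, opener=urllib.request.urlopen):
    """Return True if `url` answers with a 2xx/3xx status within `timeout`."""
    try:
        with opener(url, timeout=timeout) as response:
            return 200 <= response.status < 400
    except (urllib.error.URLError, socket.timeout, OSError):
        # Timeouts, DNS failures, refused connections and HTTP errors
        # all count as "unhealthy" for this sketch.
        return False
```

An alerting or autorecovery layer would presumably want several consecutive failures before acting, so that a single slow response does not trigger a restart.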
Assignee
Updated•7 years ago
Component: General Automation → General