Closed Bug 1134790 Opened 10 years ago Closed 10 years ago

10.10 talos runs frequently fail with "FAIL: Graph server unreachable (5 attempts)" due to "RETURN:[Errno 8] nodename nor servname provided, or not known"

Categories

(Infrastructure & Operations Graveyard :: CIDuty, task)

Hardware: x86
OS: macOS
Type: task
Priority: Not set
Severity: normal

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: philor, Unassigned)

Unlike so many instances of "FAIL: Graph server unreachable (5 attempts)", this apparently really is an unreachable problem, because... DNS goes away? 10.10 caches a bad DNS lookup instead of trying again?
https://treeherder.mozilla.org/logviewer.html#?job_id=5045963&repo=try
11:49:09 CRITICAL - FAIL: Graph server unreachable (5 attempts)
11:49:09 INFO - RETURN:[Errno 8] nodename nor servname provided, or not known
11:49:09 ERROR - Traceback (most recent call last):
11:49:09 INFO - File "/builds/slave/talos-slave/test/build/venv/bin/talos", line 9, in <module>
11:49:09 INFO - load_entry_point('talos==0.0', 'console_scripts', 'talos')()
11:49:09 INFO - File "/builds/slave/talos-slave/test/build/venv/lib/python2.7/site-packages/talos/run_tests.py", line 302, in main
11:49:09 INFO - sys.exit(run_tests(parser))
11:49:09 INFO - File "/builds/slave/talos-slave/test/build/venv/lib/python2.7/site-packages/talos/run_tests.py", line 276, in run_tests
11:49:09 INFO - talos_results.output(results_urls, **results_options)
11:49:09 INFO - File "/builds/slave/talos-slave/test/build/venv/lib/python2.7/site-packages/talos/results.py", line 87, in output
11:49:09 INFO - raise e
11:49:09 INFO - talos.utils.TalosError: Graph server unreachable (5 attempts)
11:49:09 INFO - [Errno 8] nodename nor servname provided, or not known
11:49:09 ERROR - Return code: 1
11:49:09 CRITICAL - # TBPL WARNING #
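One way to check the "DNS goes away" guess on an affected machine is to retry the lookup by hand and record each failure. The following is only a minimal sketch and is not taken from talos: the function name resolve_with_retry and its parameters are hypothetical, and it goes through the normal system resolver, so it cannot tell a cached bad answer from a fresh failure.

```python
import socket
import time


def resolve_with_retry(host, attempts=5, delay=1.0,
                       resolver=socket.getaddrinfo, sleep=time.sleep):
    """Try to resolve `host` up to `attempts` times.

    Returns (addresses, errors): the sorted unique IP strings from the
    first successful lookup (empty if none succeeded) and the list of
    socket.gaierror exceptions seen along the way.
    `resolver` and `sleep` are injectable so the function can be tested
    without touching the network (hypothetical design choice).
    """
    errors = []
    for attempt in range(attempts):
        try:
            infos = resolver(host, 80)
        except socket.gaierror as exc:
            # Errno 8 on OS X is "nodename nor servname provided, or not known".
            errors.append(exc)
            if attempt < attempts - 1:
                sleep(delay)
            continue
        # Each entry is (family, type, proto, canonname, sockaddr);
        # sockaddr[0] is the IP address string.
        return sorted({info[4][0] for info in infos}), errors
    return [], errors
```

Calling resolve_with_retry("graphs.mozilla.org") on a 10.10 slave would show whether lookups fail intermittently (some errors, then addresses) or persistently (no addresses at all).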
I suspect this is part of the jobs that are getting starred into bug 1124697.
The current batch? No, those are graphserver returning "Service Unavailable", as of just recently, while this is what the 10.10 slaves have done for as long as they've existed: not contacting it at all. "Graph server unreachable" is nearly without meaning; it's the highlighted error for absolutely every single failure which results from anything happening other than graphserver returning whatever it is that it returns for success. If you send two values when you should send three, "GSU." If you are a slave that graphserver doesn't know by name, "GSU." If you are a new platform which periodically loses the ability to resolve graphs.mozilla.org, "GSU." If graphs.m.o actually really is unreachable, "GSU." If it's perfectly reachable, but returns an error of any sort including 503, "GSU."
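To illustrate the distinction drawn in the comment above, here is a rough sketch of how a client could report a DNS failure, a connection failure, and an HTTP error response separately instead of folding them into one message. This is not how talos reports errors; classify_failure is a hypothetical helper, written against the Python 3 standard library rather than the Python 2.7 shown in the traceback.

```python
import socket
import urllib.error


def classify_failure(exc):
    """Map an exception raised while posting results to a specific label."""
    # HTTPError subclasses URLError, so it has to be tested first.
    # It means the server was reached and answered with an error status.
    if isinstance(exc, urllib.error.HTTPError):
        return "server returned HTTP %d" % exc.code
    # urllib wraps low-level socket errors in URLError.reason.
    reason = exc.reason if isinstance(exc, urllib.error.URLError) else exc
    # gaierror: the hostname could not be resolved at all.
    if isinstance(reason, socket.gaierror):
        return "DNS lookup failed: %s" % reason
    # The name resolved, but no usable connection was established.
    if isinstance(reason, (socket.timeout, ConnectionError)):
        return "host resolved but connection failed: %s" % reason
    return "unclassified failure: %r" % (exc,)
```

With a split like this, the 10.10 case in this bug would surface as a DNS lookup failure, while the 503 case would surface as an HTTP error from a reachable server.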
There are reports of significant DNS issues in 10.10. Let's see if reimaging the slaves as 10.10.2 in bug 1134223 (once we get the issues with the image sorted) addresses the issue.
Blocks: 1134223
Kim, this seems to be pretty frequent now on 10.10.
Flags: needinfo?(kmoir)
The recent ones all seem to be code issues like this:
JavaScript error: resource:///modules/devtools/gDevTools.jsm, line 488: TypeError: this._telemetry is undefined
JavaScript error: resource:///modules/CustomizableUI.jsm, line 1552: TypeError: aWindowPalette is undefined
The end error message is "Graph server unreachable", but as philor remarks in comment 2, this is not the real cause, just the default error message.
Flags: needinfo?(kmoir)
Depends on: 1144206
Fixed by bug 1144206.
Status: NEW → RESOLVED
Closed: 10 years ago
Resolution: --- → FIXED
Component: Platform Support → Buildduty
Product: Release Engineering → Infrastructure & Operations
Product: Infrastructure & Operations → Infrastructure & Operations Graveyard