Closed Bug 1124697 Opened 10 years ago Closed 9 years ago

Intermittent tp talos.utils.TalosError: Graph server unreachable (5 attempts)

Categories

(Testing :: Talos, defect)

defect
Not set
normal

Tracking

(Not tracked)

RESOLVED WORKSFORME

People

(Reporter: RyanVM, Unassigned)

References

Details

(Keywords: intermittent-failure)

I think these started with yesterday's talos update and we've been just sort-of ignoring them as infra. But they're not going away on their own, so I'm filing it for more investigation.
Thanks, I saw a couple of instances of it. I will wait to see which platforms/tests this occurs on; it might be obvious what the problem is when we have 20 or 30 data points.
Summary: Intermittent tp TalosError: Graph server unreachable (5 attempts) → Intermittent tp talos.utils.TalosError: Graph server unreachable (5 attempts)
From one of the logs: 13:27:22 INFO - unable to insert new record into 'test_run_values': (1062, "Duplicate entry '45444192-0' for key 'PRIMARY'")
Also seeing "service unavailable" in the logs
Depends on: 772610, 808547
Q, this is the bug we discussed on IRC yesterday. Is it possible to make the GPO changes to the Windows test slaves that were made on the builders? I have a nagging suspicion that we have a similar root issue of slow transfers at play here.
Flags: needinfo?(q)
I am looking over the logs now. At first glance I see a couple of OSX machines were having trouble too. Let me look a bit deeper.
Flags: needinfo?(q)
"Graph server unreachable" is a bullshit message, given for absolutely any failure between the start of sending and finally not getting back exactly the desired response, so any bug about it will always gather up other failures. Those first 10 failures on OS X were one thing, what it was originally about, then there are probably three more things before the present. That said, so far I don't see any recent Windows logs that look like slow uploads, only talos failing to gather the data it needs, and then submitting anyway, and graphserver giving a slightly cryptic response that none the less does identify a particular failure to gather data ("No test_name called 'tp5o' can be found" apparently means "when you saw that 'No results collected for: tp5o_responsiveness' you should have halted and caught fire, not continued on to give me bad data", and "to determine geomean from 'test_run_values' for 48294338 - local variable 'values' referenced before assignment" means "when you saw that 'No results collected for: tp5n_main_startup_netio' you should have halted and caught fire, not continued on to give me bad data"), and then talos in its infinite insanity rather than passing along that message as the failure, claims "Graph server unreachable".
"No results collected for: tp5o_responsiveness:"
"No results collected for: tp5o_responsiveness:"
"No results collected for: tp5o_responsiveness:" "No results collected for: tp5o_responsiveness:" "No results collected for: tp5o_responsiveness:"
See Also: → 1204303
This intermittent hasn't been seen in 4+ months!
Status: NEW → RESOLVED
Closed: 9 years ago
Resolution: --- → WORKSFORME