Closed Bug 1137314 Opened 9 years ago Closed 9 years ago

Intermittent Linux talos command timed out: 3600 seconds without output running ['/tools/buildbot/bin/python', 'scripts/scripts/talos_script.py', '--suite', 'other_nol64', '--add-option', '--webServer,localhost', '--branch-name', 'Fx-Team-Non-PGO', '--sys

Categories

(Testing :: Talos, defect)

Hardware: x86
OS: Linux
Type: defect
Priority: Not set
Severity: normal

Tracking

(Not tracked)

RESOLVED DUPLICATE of bug 977306

People

(Reporter: RyanVM, Unassigned)

Details

(Keywords: intermittent-failure)

This has been happening for a couple of days now and doesn't seem to be going away. Affects Linux32 and Linux64.
Summary: Intermittent Linux talos ommand timed out: 3600 seconds without output running ['/tools/buildbot/bin/python', 'scripts/scripts/talos_script.py', '--suite', 'other_nol64', '--add-option', '--webServer,localhost', '--branch-name', 'Fx-Team-Non-PGO', '--syst → Intermittent Linux talos command timed out: 3600 seconds without output running ['/tools/buildbot/bin/python', 'scripts/scripts/talos_script.py', '--suite', 'other_nol64', '--add-option', '--webServer,localhost', '--branch-name', 'Fx-Team-Non-PGO', '--sys
The first instance of this is Feb 23rd: a55ca32e464f, then 20 revisions later we see it again.

We did a talos update on Feb 20th (50+ revisions in the past). I initially thought that was the reason, but now I am not so sure.

Dustin, do you know if any linux32 puppet changes took place this weekend or Monday?  I know we had a kernel update a while back, who knows if some other update took place.
Flags: needinfo?(dustin)
here is a treeherder view which includes the first two:
https://treeherder.mozilla.org/#/jobs?repo=mozilla-inbound&filter-searchStr=linux talos&fromchange=0f97b9f85516&tochange=931e6cba2ac5

The behavior follows a pattern: we configure talos, and the first test to run launches the browser on getInfo.html (a warmup run to initialize the profile). The browser then never outputs anything and never closes.
The puppet repository's there for anyone to see: http://hg.mozilla.org/build/puppet
Flags: needinfo?(dustin)
thanks dustin- i can use that to see whenever changes are made! I don't see anything there that appears to affect linux.
narrowed down the range a bit:
https://treeherder.mozilla.org/#/jobs?repo=mozilla-inbound&filter-searchStr=Ubuntu HW 12.04 mozilla-inbound talos chromez&fromchange=591c754ca95f&tochange=b0760f9002a9

so far it is looking like:
https://hg.mozilla.org/integration/mozilla-inbound/rev/0f97b9f85516

doing a lot of retriggers to help prove this.
ok, it is related to bug 1134021.

:wchen, I see you as the author of this patch.  It is causing issues on linux32 where talos fails to start about one out of 20 runs.  The link above shows the clear pattern and how this patch introduces a problem.

What we do in talos is create a fresh profile, then launch the browser to initialize the profile and collect some metrics with this page:
http://hg.mozilla.org/build/talos/file/tip/talos/getInfo.html

The intention is that we print out some metrics, then close the browser almost immediately.  In this case we don't see the browser print anything out or close; instead we time out.  This seems to happen on any given test suite, as it is independent of any talos test.
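
For illustration, here is a minimal sketch of a watchdog that turns a silent warmup run into a quick, explicit failure rather than a 3600-second harness timeout. This is not the talos or buildbot code: `run_with_output_timeout` and its `idle_timeout` parameter are hypothetical names, and the command to run is whatever the caller passes in.

```python
import queue
import subprocess
import sys
import threading


def run_with_output_timeout(cmd, idle_timeout):
    """Run cmd and kill it if it prints nothing for idle_timeout seconds.

    Hypothetical helper, not part of talos. Returns (lines, timed_out).
    """
    proc = subprocess.Popen(
        cmd, stdout=subprocess.PIPE, stderr=subprocess.STDOUT, text=True
    )
    out = queue.Queue()

    def pump():
        # Forward each output line to the queue, then signal end of output.
        for line in proc.stdout:
            out.put(line)
        out.put(None)

    threading.Thread(target=pump, daemon=True).start()

    lines = []
    timed_out = False
    while True:
        try:
            item = out.get(timeout=idle_timeout)
        except queue.Empty:
            # No output within the idle window: treat the process as hung.
            timed_out = True
            proc.kill()
            break
        if item is None:
            break
        lines.append(item.rstrip("\n"))
    proc.wait()
    return lines, timed_out
```

A caller could wrap the warmup launch in this helper with an idle window of a minute or two, so a hung browser is reported against the warmup step instead of stalling the whole job.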

I will leave this up to the sheriffs to determine how acceptable it is to have this in the tree.  Personally I would vote for fixing this or backing it out if we cannot fix it.
Flags: needinfo?(wchen)
:ryanvm, can you weigh in from a sheriff's perspective here?
Flags: needinfo?(ryanvm)
(In reply to Joel Maher (:jmaher) from comment #20)
> narrowed down the range a bit:
> https://treeherder.mozilla.org/#/jobs?repo=mozilla-inbound&filter-
> searchStr=Ubuntu HW 12.04 mozilla-inbound talos
> chromez&fromchange=591c754ca95f&tochange=b0760f9002a9

This link shows one instance of the failure on a push earlier than wchen's push as well. It's on RyanVM's backout push at b8f973e91242.
Oh, and to me it seems extremely unlikely that wchen's patch would cause this sort of error. The change he made is only exercised when APZ is enabled, which it isn't on Linux/talos.
apologies- i guess I saw no pending jobs and all green- I suspect it was a completed job that hadn't refreshed in treeherder.  ok, I will keep hunting- did a bunch of retriggers.  Thanks for taking a brief look kats!
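
As a rough guide to how many retriggers it takes before an all-green push can be trusted, here is a small sketch based on the "about one out of 20 runs" rate mentioned earlier. It assumes every run fails independently with the same probability, which is an assumption and may not hold in practice; the function names are hypothetical.

```python
import math


def chance_all_green(failure_rate, runs):
    """Probability that an affected push shows no failure in `runs` runs."""
    return (1 - failure_rate) ** runs


def runs_needed(failure_rate, confidence):
    """Smallest number of runs that catches the failure with the given confidence."""
    return math.ceil(math.log(1 - confidence) / math.log(1 - failure_rate))
```

At a 1-in-20 rate, 20 green runs on a push still leave roughly a 36% chance that the push is affected, and about 59 runs are needed to bring that below 5%. A handful of green retriggers is therefore weak evidence when bisecting an intermittent failure this rare.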
clearing needinfo for now
Flags: needinfo?(wchen)
Flags: needinfo?(ryanvm)
All of these failures have been on talos-linux32-ix-001 and talos-linux32-ix-026. A total coincidence, I'm sure. Both slaves disabled.
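
A check like the one behind this observation can be sketched as a per-machine tally of job results. This is only an illustration: the `(machine, result)` record shape and the `"timeout"` label are assumptions, not the format of any real treeherder or buildbot export.

```python
from collections import Counter


def failures_by_machine(jobs):
    """Map each machine name to (timeouts, total_runs).

    `jobs` is an iterable of (machine_name, result) pairs, where the
    hypothetical result string "timeout" marks a hung run.
    """
    totals = Counter()
    failed = Counter()
    for machine, result in jobs:
        totals[machine] += 1
        if result == "timeout":
            failed[machine] += 1
    return {machine: (failed[machine], totals[machine]) for machine in totals}
```

If the timeouts concentrate on one or two machines while others running the same pushes stay green, the hardware or its configuration is a more likely culprit than any changeset in the range.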
Status: NEW → RESOLVED
Closed: 9 years ago
Resolution: --- → DUPLICATE