Closed Bug 686084 Opened 13 years ago Closed 11 years ago

Intermittent Tegra "command timed out: 2400 seconds without output, killing pid ...", "command timed out: 1200 seconds without output, process killed by signal 9"

Categories

(Release Engineering :: General, defect, P3)

ARM
Android
defect

Tracking

(Not tracked)

RESOLVED WORKSFORME

People

(Reporter: philor, Unassigned)

References

Details

(Keywords: intermittent-failure, Whiteboard: [android_tier_∞])

https://tbpl.mozilla.org/php/getParsedLog.php?id=6357711
Android Tegra 250 mozilla-inbound opt test mochitest-6 on 2011-09-09 18:21:48 PDT for push 4f7f1840152b

========= Started Cleanup Device failed (results: 2, elapsed: 20 mins, 0 secs) ==========
python /builds/sut_tools/cleanup.py 10.250.49.12
 in dir /builds/tegra-025/test/. (timeout 1200 secs)
 watching logfiles {}
 argv: ['python', '/builds/sut_tools/cleanup.py', '10.250.49.12']
 environment:
  PATH=/opt/local/bin:/opt/local/sbin:/usr/bin:/bin:/usr/sbin:/sbin:/usr/local/bin:/usr/X11/bin
  PWD=/builds/tegra-025/test
  SUT_IP=10.250.49.12
  SUT_NAME=tegra-025
  VERSIONER_PYTHON_PREFER_32_BIT=no
  VERSIONER_PYTHON_VERSION=2.6
  __CF_USER_TEXT_ENCODING=0x1F6:0:0
 closing stdin
 using PTY: False

command timed out: 1200 seconds without output, killing pid 47920
process killed by signal 9
program finished with exit code -1
elapsedTime=1200.008258
======== Finished Cleanup Device failed (results: 2, elapsed: 20 mins, 0 secs) ========
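
For readers unfamiliar with the "N seconds without output" behaviour in the log above, here is a minimal sketch of an idle-output watchdog. It is not buildbot's implementation; the function name `run_with_idle_timeout` and its return shape are hypothetical, and it assumes Python 3. Note that Python's subprocess module reports a SIGKILLed child as return code -9 on POSIX, whereas the buildbot log above reports "exit code -1".

```python
import queue
import subprocess
import sys
import threading


def run_with_idle_timeout(argv, idle_timeout):
    """Run argv and kill it if it prints nothing for idle_timeout seconds.

    Hypothetical helper for illustration only.
    Returns (timed_out, returncode, output_lines).
    """
    proc = subprocess.Popen(
        argv, stdout=subprocess.PIPE, stderr=subprocess.STDOUT, text=True
    )
    lines = queue.Queue()

    def pump():
        # Forward each line of child output to the queue, then signal EOF.
        for line in proc.stdout:
            lines.put(line)
        lines.put(None)

    threading.Thread(target=pump, daemon=True).start()

    output = []
    timed_out = False
    while True:
        try:
            # The timer restarts on every line, so only silence triggers it.
            line = lines.get(timeout=idle_timeout)
        except queue.Empty:
            timed_out = True
            proc.kill()  # SIGKILL (signal 9) on POSIX
            break
        if line is None:
            break
        output.append(line)
    proc.wait()
    return timed_out, proc.returncode, output


if __name__ == "__main__":
    silent = [sys.executable, "-c", "import time; time.sleep(30)"]
    print(run_with_idle_timeout(silent, idle_timeout=0.5))
```

The timeout here measures silence rather than total runtime, which is why a step that hangs without printing anything is killed after exactly the configured interval.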
Priority: -- → P3
Whiteboard: [orange][android_tier_1]
Whiteboard: [orange]
https://tbpl.mozilla.org/php/getParsedLog.php?id=7032755&tree=Mozilla-Aurora

This is just the buildbot timeout, so there's no separate just-this thing that could insert an "Automation Error:" or "Remote Device Error:", is there?
https://tbpl.mozilla.org/php/getParsedLog.php?id=7563196&tree=Firefox
Android XUL Tegra 250 mozilla-central opt test reftest-3 on 2011-11-23 19:23:57 PST for push cf764be32bc3
https://tbpl.mozilla.org/php/getParsedLog.php?id=7571973&tree=Firefox
Android XUL Tegra 250 mozilla-central opt test mochitest-8 on 2011-11-24 09:54:54 PST for push 84117219ded0
slave: tegra-105


command timed out: 1200 seconds without output, killing pid 35702
process killed by signal 9
program finished with exit code -1
elapsedTime=1200.007302
======== Finished Cleanup Device failed (results: 2, elapsed: 20 mins, 0 secs) ========


...


command timed out: 1800 seconds without output, killing pid 36050
process killed by signal 9
program finished with exit code -1
elapsedTime=1800.007443
======== Finished Reboot Device failed (results: 2, elapsed: 30 mins, 0 secs) ========
https://tbpl.mozilla.org/php/getParsedLog.php?id=8062970&tree=Firefox
Status: NEW → RESOLVED
Closed: 13 years ago
Resolution: --- → FIXED
https://tbpl.mozilla.org/php/getParsedLog.php?id=8090851&tree=Mozilla-Inbound

"How long would it take me to notice that an Android intermittent failure bug had been mistakenly closed?" Yeah, about that long.
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
Whiteboard: [orange] → [orange][triagefollowup]
Assignee: nobody → coop
Whiteboard: [orange][triagefollowup] → [orange]
Depends on: 690311
Status: REOPENED → NEW
Assignee: coop → nobody
(In reply to Phil Ringnalda (:philor) from comment #668)
> https://tbpl.mozilla.org/php/getParsedLog.php?id=9631229&tree=Try

I don't think so:

======== Finished Cleanup Device (results: 0, elapsed: 8 mins, 1 secs) ========

Filed Bug 730634; looks very similar to Bug 681948.
Yep. In the hypothetical world where Tegra problems can be fixed, I would have made the distinction about 400 comments back; in the hypothetical world where the log_eval_func for each buildstep was going to have a unique set of regexes rather than one set common to everything, I would have too; in the hypothetical world where I thought that someone else would star them, or even look at Android logs, I would have adjusted the summary of this bug. As it is, I'm flat out astonished that someone other than me has apparently looked at not one, but *two* logs in the last 200 :)
(In reply to Phil Ringnalda (:philor) from comment #670)

> As it is, I'm flat out astonished that someone other than me has apparently
> looked at not one, but *two* logs in the last 200 :)

Heh :D

Yeah, this bug is pretty sad.

I do tend to look at the logs for my own pushes if the summary is at all ambiguous. I'm surprised more people don't…
Nobody believes that the Android tests are anything but a tired old joke, and they are correct. They are expected to treat the purple of killed by signal 15 and the red of killed by signal 9 and the orange of reconnecting socket/unable to connect socket as unique and individual events that might be related to them, and they are not. They are all one, a set of probably 20 different symptoms when you include all the ways Talos interprets tegra-went-away, all of which are a single infrastructure failure.

Based on other things we've let go too long, my guess is that once we switch most or all of them to auto-retry, it'll be about 2 years before most people start treating the Android tests as real tests, though we can maybe short-circuit it a little bit if we switch to more reliable hardware and change the name when we do.
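
The argument above, that several differently coloured symptoms are really one infrastructure failure, can be sketched as a single bucket of patterns. This is only an illustration: the symptom strings are the ones quoted in this bug, while the names `DEVICE_WENT_AWAY` and `classify` and the bucket labels are hypothetical and do not correspond to any existing buildbot or TBPL code.

```python
import re

# Symptom strings quoted in this bug. Grouping them under one label is the
# point being illustrated; the label itself is made up for this sketch.
DEVICE_WENT_AWAY = [
    re.compile(r"killed by signal 15"),
    re.compile(r"killed by signal 9"),
    re.compile(r"reconnecting socket"),
    re.compile(r"unable to connect socket"),
    re.compile(r"command timed out: \d+ seconds without output"),
]


def classify(log_text):
    """Return "infrastructure" if any known tegra-went-away symptom appears."""
    for pattern in DEVICE_WENT_AWAY:
        if pattern.search(log_text):
            return "infrastructure"
    return "unclassified"


if __name__ == "__main__":
    sample = (
        "command timed out: 1200 seconds without output, killing pid 47920\n"
        "process killed by signal 9\n"
    )
    print(classify(sample))
```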
Whiteboard: [orange] → [orange][android_tier_∞]
(In reply to Phil Ringnalda (:philor) from comment #1027)
> https://tbpl.mozilla.org/php/getParsedLog.php?id=10691261&tree=Mozilla-Inbound

reboot.py timeout

> https://tbpl.mozilla.org/php/getParsedLog.php?id=10691097&tree=Mozilla-Inbound

This was an installApp failure, not Cleanup Device...

I didn't check all the other logs from today, but I suspect the *cleanup device* ones are rarer than the last few comments seem to suggest.
(In reply to Justin Wood (:Callek) from comment #1028)
> reboot.py timeout

I presume you mean "updateSUT.py timeout" since it doesn't seem terribly interesting that after the tegra died, coincidentally during updateSUT.py, it did not come back to life before reboot.py.

But, if you want to have separate bugs for every single buildstep during which a tegra might die and buildbot might wind up reporting that as a "1200 seconds without output, process killed by signal 9," just let me know when you've filed the bugs for the ones which have never been filed separately before and you are ready to take over starring them, and I'll start leaving them for you.

I'm a little worried that you might not be quite that interested in the statistics, though, if you weren't willing to even look at all six logs handed to you in a single bug comment.

https://tbpl.mozilla.org/php/getParsedLog.php?id=10709962&tree=Mozilla-Inbound
(In reply to Phil Ringnalda (:philor) from comment #1030)
> I'm a little worried that you might not be quite that interested in the
> statistics, though, if you weren't willing to even look at all six logs
> handed to you in a single bug comment.

Didn't say unwilling, just that I didn't (was opening up on my bugmail for today, and had outstanding pings from people when I peeked here).
There, I fixed it.
Summary: Intermittent Tegra "Cleanup Device failed" from "command timed out: 1200 seconds without output, process killed by signal 9" → Intermittent Tegra "command timed out: 1200 seconds without output, process killed by signal 9"
Diagnostic thinking:

In Python, signal.SIGKILL is int value 9, which is used by many things in sut_lib
(Many of which were actually busted tegras, something I need to remember to check for any instances of this)
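
A quick way to check which signal corresponds to 9 is to ask Python directly. The sketch below assumes Python 3 on a POSIX system (Linux or macOS), where SIGKILL is 9, SIGTERM is 15 and SIGILL is 4; the helper names `describe_exit` and `kill_child_and_describe` are hypothetical and are not part of sut_lib.

```python
import signal
import subprocess
import sys


def describe_exit(returncode):
    """Translate a subprocess return code into a readable description.

    On POSIX, subprocess reports death by signal N as return code -N.
    """
    if returncode < 0:
        return "killed by " + signal.Signals(-returncode).name
    return "exited with code %d" % returncode


def kill_child_and_describe():
    """Start a sleeping child, send it SIGKILL, and describe how it ended."""
    proc = subprocess.Popen([sys.executable, "-c", "import time; time.sleep(30)"])
    proc.send_signal(signal.SIGKILL)
    proc.wait()
    return describe_exit(proc.returncode)


if __name__ == "__main__":
    print(int(signal.SIGKILL), int(signal.SIGTERM), int(signal.SIGILL))
    print(kill_child_and_describe())
```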

https://tbpl.mozilla.org/php/getParsedLog.php?id=14249979&tree=Mozilla-Inbound
https://tbpl.mozilla.org/php/getParsedLog.php?id=14423742&tree=Mozilla-Inbound
tegra-101
(in install app - can't really remember the last one I saw that verify didn't catch)
Depends on: 793091
Do we have any idea at what point of verify we're timing out?
https://tbpl.mozilla.org/php/getParsedLog.php?id=16710275&tree=Mozilla-Aurora
Summary: Intermittent Tegra "command timed out: 1200 seconds without output, process killed by signal 9" → Intermittent Tegra "buildbot.slave.commands.TimeoutError: command timed out: 2400 seconds without output, killing pid ...", "command timed out: 1200 seconds without output, process killed by signal 9"
Whiteboard: [orange][android_tier_∞] → [android_tier_∞]
Summary: Intermittent Tegra "buildbot.slave.commands.TimeoutError: command timed out: 2400 seconds without output, killing pid ...", "command timed out: 1200 seconds without output, process killed by signal 9" → Intermittent Tegra "command timed out: 2400 seconds without output, killing pid ...", "command timed out: 1200 seconds without output, process killed by signal 9"
Status: NEW → RESOLVED
Closed: 13 years ago → 11 years ago
Resolution: --- → WORKSFORME
Product: mozilla.org → Release Engineering