Closed Bug 916765 Opened 11 years ago Closed 9 years ago

Intermittent "command timed out: 600 seconds without output, attempting to kill" running expandlibs_exec.py in libgtest

Categories

(Release Engineering :: General, defect)

x86
macOS
defect
Not set
normal

Tracking

(Not tracked)

RESOLVED WORKSFORME

People

(Reporter: cbook, Unassigned)

Details

(Keywords: intermittent-failure)

Linux x86-64 fx-team debug asan build on 2013-09-16 03:23:27 PDT for push 7b5b8819ac56

slave: bld-centos6-hp-015

https://tbpl.mozilla.org/php/getParsedLog.php?id=27923647&tree=Fx-Team

command timed out: 600 seconds without output, attempting to kill
The 10 minutes from bug 890349 may not have been enough for debug+asan.
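For context, the message describes an idle (no-output) timeout rather than a cap on total run time: the command is killed once it has been silent for 600 seconds. The following is a minimal Python sketch of that idea, not buildbot's actual implementation; `run_with_output_timeout` and its parameters are hypothetical names used only for illustration. It counts complete lines as output, so a child that buffers its output would still look silent.

```python
import queue
import subprocess
import sys
import threading


def run_with_output_timeout(cmd, idle_seconds):
    """Run cmd and kill it if it prints no line for idle_seconds.

    Returns (returncode, timed_out, lines). Hypothetical helper for
    illustration only.
    """
    proc = subprocess.Popen(
        cmd, stdout=subprocess.PIPE, stderr=subprocess.STDOUT, text=True
    )
    lines = queue.Queue()

    def pump():
        # Forward each output line to the main thread as it arrives.
        for line in proc.stdout:
            lines.put(line)
        lines.put(None)  # sentinel: the pipe was closed

    threading.Thread(target=pump, daemon=True).start()

    output = []
    timed_out = False
    while True:
        try:
            # The timer restarts every time a line arrives.
            line = lines.get(timeout=idle_seconds)
        except queue.Empty:
            timed_out = True
            proc.kill()
            break
        if line is None:
            break
        output.append(line)
    proc.wait()
    return proc.returncode, timed_out, output


if __name__ == "__main__":
    # A silent child is killed after 1 second without output.
    silent = [sys.executable, "-c", "import time; time.sleep(30)"]
    print(run_with_output_timeout(silent, 1.0)[1])
```

Under this model a long link step that prints nothing and a hung test both end in the same message, so the message alone does not say which one happened.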
Blocks: 890349
No longer blocks: 890349
Depends on: 886079, 890349
Summary: Intermittent command timed out: 600 seconds without output, attempting to kill → Intermittent "command timed out: 600 seconds without output, attempting to kill" running expandlibs_exec.py in libgtest
(In reply to Nick Thomas [:nthomas] from comment #1)
> The 10 minutes from bug 890349 may not have been enough for debug+asan.

On my local Windows machine, the mozilla::pkix gtest tests (mach gtest "pkix*") take a little over 2 minutes on their own, and will probably soon take 3 minutes. All of these tests are relatively new (as of Firefox 31), so it seems likely that we need a timeout ~3 minutes higher than what we had before Firefox 31/32/33.
Nick, do you have cycles to look into this? :)
Flags: needinfo?(nthomas)
Not convinced a longer timeout will help here. Here is a 'Linux x86-64 mozilla-central debug asan build' I found in this state:
make[1]: Entering directory `/builds/slave/m-cen-l64-asan-d-0000000000000/build/obj-firefox/testing/xpcshell'
..........Can't trigger Breakpad, just killing process

For comparison, a log which passes is:
make[1]: Entering directory `/builds/slave/m-in-l64-asan-0000000000000000/build/obj-firefox/testing/xpcshell'
.....................Can't trigger Breakpad, just killing process
.............
----------------------------------------------------------------------
Ran 34 tests in 10.848s

The ps tree (for the hung case) looks like this:
 1401 ?        Sl     0:05 /tools/buildbot-0.8.4-pre-moz4/bin/python2.7 /tools/buildbot/bin/twistd --no_save --logfile /builds/slave/twistd.log --python /builds/slave/buildbot.tac
12432 ?        S      0:00  \_ /usr/bin/python -tt /usr/sbin/mock_mozilla -r mozilla-centos6-x86_64 --cwd /builds/slave/m-cen-l64-asan-d-0000000000000/build/obj-firefox --unpriv --shell /usr/bin/env HG_SHARE_BASE_DIR="/builds/
12458 ?        S      0:00      \_ make -k check
14013 ?        S      0:00          \_ make -C testing/xpcshell check
14014 ?        Sl     0:00              \_ /builds/slave/m-cen-l64-asan-d-0000000000000/build/obj-firefox/_virtualenv/bin/python /builds/slave/m-cen-l64-asan-d-0000000000000/build/testing/xpcshell/selftest.py
14188 ?        Z      0:00                  \_ [xpcshell] <defunct>
14202 ?        Sl     0:00 /builds/slave/m-cen-l64-asan-d-0000000000000/build/obj-firefox/dist/bin/plugin-container -appdir /builds/slave/m-cen-l64-asan-d-0000000000000/build/obj-firefox/dist/bin 14188 tab

There's no sign of a link process, even allowing for the mock process not running as cltbld. The Z means
        Z    Defunct ("zombie") process, terminated but not reaped by its parent.

Smells like an xpcshell crash.
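To check other hung slaves for the same pattern, a listing in the format above can be scanned for defunct entries. This is a rough Python sketch that assumes the five-column layout shown in this comment (PID, TTY, STAT, TIME, COMMAND); `find_defunct` is a hypothetical helper, and the sample lines are copied from the ps tree above.

```python
SAMPLE = r"""12458 ?        S      0:00      \_ make -k check
14013 ?        S      0:00          \_ make -C testing/xpcshell check
14188 ?        Z      0:00                  \_ [xpcshell] <defunct>
"""


def find_defunct(ps_output):
    """Return (pid, command) for each line whose STAT field starts with 'Z'."""
    defunct = []
    for line in ps_output.splitlines():
        # Assumed column order: PID TTY STAT TIME COMMAND
        fields = line.split(None, 4)
        if len(fields) < 5 or not fields[0].isdigit():
            continue  # skip headers, blank lines, and truncated rows
        pid, _tty, stat, _time, command = fields
        if stat.startswith("Z"):
            # Drop the "\_ " tree decoration from the forest view.
            defunct.append((int(pid), command.lstrip("\\_ ")))
    return defunct


if __name__ == "__main__":
    print(find_defunct(SAMPLE))
```

A zombie child under a test harness that is still sleeping is consistent with the child having exited (or crashed) while the parent never collected its status, which matches the tree above.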
Flags: needinfo?(nthomas)
(In reply to TBPL Robot from comment #101)
> submit_timestamp: 2014-11-06T19:15:12
> log:
> https://treeherder.mozilla.org/ui/logviewer.html#?repo=try&job_id=2977346
> repository: try
> who: rvandermeulen[at]mozilla[dot]com
> machine: b-2008-ix-0019
> buildname: WINNT 5.2 try leak test build
> revision: 435c42a53ff7

This looks like a genuine slow link in make check, but much of the rest looks like this bug has attracted many types of 600s timeouts. It would be helpful if someone could separate out the different situations.
Would help if they failed in unique ways.
Seems like all the recent instances are clustered on a pretty small set of Windows build slaves. Jordan, is there something special with how they're configured that might be causing them to run slower?
Flags: needinfo?(jlund)
Ah, my dear friends from bug 1115490 comment 40 are all gathered here, too.
(In reply to Nick Thomas [:nthomas] from comment #122)
> Not convinced a longer timeout will help here.

Agreed, it appears there is more than one smoking gun here.

(In reply to Nick Thomas [:nthomas] from comment #123)
> This looks like a genuine slow link in make check, but much of the rest looks like this bug has attracted many types of 600s timeouts. It would be helpful if someone could separate out the different situations.

looking at the last 20 or so of these (which are more frequent as of late) it appears like they are all doing win64 builds. I wonder if bumping the timeout will weed out the builders that require a longer 'make -k check'?

nthomas: worth me posting a timeout bump patch even on a trial basis? bonus here is that it will only affect win64 + release branches since everything else uses mozharness so we will be able to see the direct effect on win64.
Flags: needinfo?(jlund)
Depends on: 1122975
I'd rather fix the machines, so I'm glad to see bug 1122975 filed.
(In reply to Nick Thomas [:nthomas] from comment #195)
> I'd rather fix the machines, so I'm glad to see bug 1122975 filed.

okies, will wait for investigation results of bug 1122975 first
Inactive; closing (see bug 1180138).
Status: NEW → RESOLVED
Closed: 9 years ago
Resolution: --- → WORKSFORME
Component: General Automation → General