Closed Bug 554754 Opened 14 years ago Closed 12 years ago

automation.py apparently hanging after test run, getting killed by buildbot in Linux Mo5 on hardware -ix- slaves

Categories

(Testing :: Mochitest, defect)

x86
Linux
defect
Not set
normal

Tracking

(Not tracked)

RESOLVED WORKSFORME

People

(Reporter: cjones, Unassigned)

References

Details

(Keywords: intermittent-failure)

http://tinderbox.mozilla.org/showlog.cgi?log=Firefox/1269463018.1269465138.31051.gz
Linux mozilla-central opt test mochitests-5/5 on 2010/03/24 13:36:58
s: mv-moz2-linux-ix-slave08

17886 INFO TEST-PASS | /tests/widget/tests/test_plugin_scroll_invalidation.html | no paint on scroll
17888 INFO Passed: 17472
17889 INFO Failed: 0
17890 INFO Todo:   19
17891 INFO SimpleTest FINISHED
NPP_Destroy
NPP_Destroy
NPP_Destroy
NPP_Destroy
NPP_Destroy
NPP_Destroy
NOTE: child process received `Goodbye', closing down
INFO | automation.py | Application ran for: 0:02:28.358450
INFO | automation.py | Reading PID log: /tmp/tmpHmLDAepidlog
==> process 4466 launched child process 4491
==> process 4466 launched child process 4498
==> process 4466 launched child process 4502
==> process 4466 launched child process 4504
==> process 4466 launched child process 4509
INFO | automation.py | Checking for orphan process with PID: 4491
INFO | automation.py | Checking for orphan process with PID: 4498
INFO | automation.py | Checking for orphan process with PID: 4502
INFO | automation.py | Checking for orphan process with PID: 4504
INFO | automation.py | Checking for orphan process with PID: 4509
NEXT ERROR PROCESS-CRASH | automation.py | application crashed (minidump found)
[snip known minidump from crash tests]
NEXT ERROR PROCESS-CRASH | automation.py | application crashed (minidump found)
[snip known minidump from crash tests]
PROCESS-CRASH | automation.py | application crashed (minidump found)
[snip known minidump from crash tests]

command timed out: 1200 seconds without output, killing pid 4443
process killed by signal 9
program finished with exit code -1
elapsedTime=1350.394776
TinderboxPrint: mochitest-plain-5<br/>17472/0/19
buildbot.slave.commands.TimeoutError: command timed out: 1200 seconds without output, killing pid 4443
TinderboxPrint: mochitest-plain-5<br/><em class="testfail">timeout</em>


This log suggests to me that automation.py processed the leak log and (known) minidumps, but didn't exit within its allotted time.  Bug 538616 might be related, but I can't load those logs.
http://tinderbox.mozilla.org/showlog.cgi?log=Firefox/1269464162.1269466396.3052.gz
Linux mozilla-central opt test mochitests-5/5 on 2010/03/24 13:56:02  
s: mv-moz2-linux-ix-slave10
^^
Linux mozilla-central opt test mochitests-5/5 on 2010/03/24 15:55:49
s: mv-moz2-linux-ix-slave03

This would be one of those rare cases that are the reason why I always paste everything including the slave name: those three close together failures, looking like a regression from something pushed today, seem to be the only three runs of Linux Mo5 this week that have been on the -ix- real hardware slaves.
Version: unspecified → Trunk
Don't know why only the hardware linux runs are failing, but the lack of symbol resolution is because crashreporter-symbols.zip is essentially empty on linux. Fallout from a breakpad landing ? It's been like this for at least 24 hours; mac and win32 are OK. 

Will try to track down the first build where the symbols vanished.
Depends on: 554854
Presence or absence of symbols shouldn't make any difference to automation.py, it doesn't know anything about the symbols except for the directory where they live.

It's not clear where exactly things are hanging here. If this persists, we could stick some extra output in automation.py around here:
http://mxr.mozilla.org/mozilla-central/source/build/automation.py.in#791

Since it's clearly getting to the checkForCrashes call, but there isn't a whole lot more that it does. It's also possible that something in runtests.py itself is taking too long, but it doesn't do much after the call to runApp, you can see here:
http://mxr.mozilla.org/mozilla-central/source/testing/mochitest/runtests.py.in#587
http://tinderbox.mozilla.org/showlog.cgi?log=Firefox/1270593291.1270595643.2046.gz
Linux mozilla-central opt test mochitests-5/5 on 2010/04/06 15:34:51
s: mv-moz2-linux-ix-slave04

http://tinderbox.mozilla.org/showlog.cgi?log=Firefox/1270588979.1270590925.18211.gz
Linux mozilla-central opt test mochitests-5/5 on 2010/04/06 14:22:59
s: mv-moz2-linux-ix-slave08
Blocks: 438871
Whiteboard: [orange]
Summary: automation.py apparently hanging after test run, getting killed by buildbot → automation.py apparently hanging after test run, getting killed by buildbot in Linux Mo5 on hardware -ix- slaves
http://tinderbox.mozilla.org/showlog.cgi?log=Firefox/1270671292.1270672725.12522.gz
Linux mozilla-central opt test mochitests-5/5 on 2010/04/07 13:14:52
s: mv-moz2-linux-ix-slave04
http://tinderbox.mozilla.org/showlog.cgi?log=Firefox/1270757326.1270759043.3207.gz
Linux mozilla-central opt test mochitests-5/5 on 2010/04/08 13:08:46
s: mv-moz2-linux-ix-slave03
http://tinderbox.mozilla.org/showlog.cgi?log=Firefox/1272342489.1272343930.13035.gz
Linux mozilla-central opt test mochitests-5/5 on 2010/04/26 21:28:09  
s: mv-moz2-linux-ix-slave09
http://tinderbox.mozilla.org/showlog.cgi?log=Firefox/1272591074.1272592667.23582.gz#err10
Linux mozilla-central opt test mochitests-5/5 on 2010/04/29 18:31:14  
s: moz2-linux-slave09
http://tinderbox.mozilla.org/showlog.cgi?log=Firefox/1272664691.1272666244.4919.gz
Linux mozilla-central opt test mochitests-5/5 on 2010/04/30 14:58:11
s: mv-moz2-linux-ix-slave05
http://tinderbox.mozilla.org/showlog.cgi?log=Firefox/1272795488.1272797100.24468.gz
Linux mozilla-central opt test mochitests-5/5 on 2010/05/02 03:18:08
s: moz2-linux-slave09

http://tinderbox.mozilla.org/showlog.cgi?log=Firefox/1272795488.1272796995.24239.gz
Rev3 Fedora 12 mozilla-central opt test mochitests-5/5 on 2010/05/02 03:18:08
s: talos-r3-fed-016

And assuming I'm right about those two being this, and comment 11 was actually this, we should drop the "on hardware -ix- slaves" part of the summary, and switch to maybe something like "when things run faster than we expect." Or something.
http://tinderbox.mozilla.org/showlog.cgi?log=Firefox/1272881999.1272883586.24127.gz
Linux mozilla-central opt test mochitests-5/5 on 2010/05/03 03:19:59
s: moz2-linux-slave09
http://tinderbox.mozilla.org/showlog.cgi?log=TraceMonkey/1273031152.1273032576.10713.gz
Linux tracemonkey opt test mochitests-5/5 on 2010/05/04 20:45:52
s: mv-moz2-linux-ix-slave24
http://tinderbox.mozilla.org/showlog.cgi?log=Firefox/1273077590.1273079059.23806.gz
Linux mozilla-central opt test mochitests-5/5 on 2010/05/05 09:39:50
s: mv-moz2-linux-ix-slave03
http://tinderbox.mozilla.org/showlog.cgi?log=Firefox/1273109763.1273111412.20326.gz
Linux mozilla-central opt test mochitests-5/5 on 2010/05/05 18:36:03
s: moz2-linux-slave15
http://tinderbox.mozilla.org/showlog.cgi?log=Firefox/1273151839.1273153417.31672.gz
Linux mozilla-central opt test mochitests-5/5 on 2010/05/06 06:17:19
s: moz2-linux-slave10
http://tinderbox.mozilla.org/showlog.cgi?log=Firefox/1273173375.1273174971.14292.gz
Linux mozilla-central opt test mochitests-5/5 on 2010/05/06 12:16:15
s: moz2-linux-slave10
http://tinderbox.mozilla.org/showlog.cgi?log=Firefox/1273183798.1273185401.25246.gz
Linux mozilla-central opt test mochitests-5/5 on 2010/05/06 15:09:58
s: moz2-linux-slave15
http://tinderbox.mozilla.org/showlog.cgi?log=Firefox/1273252053.1273253649.10773.gz
Linux mozilla-central opt test mochitests-5/5 on 2010/05/07 10:07:33
s: moz2-linux-slave10
http://tinderbox.mozilla.org/showlog.cgi?log=Firefox/1273320860.1273322373.1387.gz
Linux mozilla-central opt test mochitests-5/5 on 2010/05/08 05:14:20
s: moz2-linux-slave23
http://tinderbox.mozilla.org/showlog.cgi?log=Firefox/1273330959.1273332746.2893.gz
Linux mozilla-central opt test mochitests-5/5 on 2010/05/08 08:02:39
s: moz2-linux-slave23
http://tinderbox.mozilla.org/showlog.cgi?log=Firefox/1273368437.1273369989.1406.gz
Linux mozilla-central opt test mochitests-5/5 on 2010/05/08 18:27:17
s: moz2-linux-slave23
http://tinderbox.mozilla.org/showlog.cgi?log=Firefox/1273400717.1273402229.27409.gz
Linux mozilla-central opt test mochitests-5/5 on 2010/05/09 03:25:17
s: moz2-linux-slave23
http://tinderbox.mozilla.org/showlog.cgi?log=Firefox/1273434153.1273435767.16059.gz
Linux mozilla-central opt test mochitests-5/5 on 2010/05/09 12:42:33
s: moz2-linux-slave13
http://tinderbox.mozilla.org/showlog.cgi?log=Firefox/1273465711.1273467334.32012.gz
Linux mozilla-central opt test mochitests-5/5 on 2010/05/09 21:28:31
s: moz2-linux-slave13
http://tinderbox.mozilla.org/showlog.cgi?log=Firefox/1273520735.1273522352.11044.gz
Linux mozilla-central opt test mochitests-5/5 on 2010/05/10 12:45:35
s: moz2-linux-slave13
http://tinderbox.mozilla.org/showlog.cgi?log=Firefox/1273523377.1273524934.21493.gz
Linux mozilla-central opt test mochitests-5/5 on 2010/05/10 13:29:37
s: moz2-linux-slave18
http://tinderbox.mozilla.org/showlog.cgi?log=Firefox/1273543779.1273545289.28056.gz
Linux mozilla-central opt test mochitests-5/5 on 2010/05/10 19:09:39
s: moz2-linux-slave18
http://tinderbox.mozilla.org/showlog.cgi?log=Firefox/1273651151.1273652560.22387.gz
Linux mozilla-central opt test mochitests-5/5 on 2010/05/12 00:59:11
s: mv-moz2-linux-ix-slave11
http://tinderbox.mozilla.org/showlog.cgi?log=Firefox/1273760169.1273761574.18371.gz
Linux mozilla-central opt test mochitests-5/5 on 2010/05/13 07:16:09
s: mv-moz2-linux-ix-slave06
http://tinderbox.mozilla.org/showlog.cgi?log=Firefox/1273870813.1273872286.8662.gz
Linux mozilla-central opt test mochitests-5/5 on 2010/05/14 14:00:13
s: mv-moz2-linux-ix-slave03
http://tinderbox.mozilla.org/showlog.cgi?log=Firefox/1274116139.1274119330.29801.gz
Rev3 Fedora 12 mozilla-central debug test mochitests-5/5 on 2010/05/17 10:08:59  
s: talos-r3-fed-037
It's possible that runtests.py is hanging waiting for xpcshell (running httpd.js) to shutdown here:
http://mxr.mozilla.org/mozilla-central/source/testing/mochitest/runtests.py.in#337

Since we just call proc.wait(), if the process doesn't exit we'll wait forever. We could modify that to .poll() / sleep() in a somewhat-busy loop (since wait() doesn't take a timeout) and just .kill() the process if it takes too long.
http://tinderbox.mozilla.org/showlog.cgi?log=Firefox/1274302755.1274304243.8202.gz
Rev3 Fedora 12 mozilla-central opt test mochitests-5/5 on 2010/05/19 13:59:15
s: talos-r3-fed-037
http://tinderbox.mozilla.org/showlog.cgi?log=Firefox/1274312918.1274314462.21683.gz
Rev3 Fedora 12 mozilla-central opt test mochitests-5/5 on 2010/05/19 16:48:38
s: talos-r3-fed-037
http://tinderbox.mozilla.org/showlog.cgi?log=Firefox/1274324615.1274326143.4165.gz
Rev3 Fedora 12 mozilla-central opt test mochitests-5/5 on 2010/05/19 20:03:35
s: talos-r3-fed-037
http://tinderbox.mozilla.org/showlog.cgi?log=Firefox/1274393151.1274394544.26494.gz
Linux mozilla-central opt test mochitests-5/5 on 2010/05/20 15:05:51
s: mv-moz2-linux-ix-slave09
Mass marking whiteboard:[orange] bugs WFM (to clean up TBPL bug suggestions) that:
* Haven't changed in > 6months
* Whose whiteboard contains none of the strings: {disabled,marked,random,fuzzy,todo,fails,failing,annotated,leave open,time-bomb}
* Passed a (quick) manual inspection of bug summary/whiteboard to ensure they weren't a false positive.

I've also gone through and searched for cases where the whiteboard wasn't labelled correctly after test disabling, by using attachment description & basic comment searches. However if the test for which this bug was about has in fact been disabled/annotated/..., please accept my apologies & reopen/mark the whiteboard appropriately so this doesn't get re-closed in the future (and please ping me via IRC or email so I can try to tweak the saved searches to avoid more edge cases).

Sorry for the spam! Filter on: #FFA500
Status: NEW → RESOLVED
Closed: 12 years ago
Resolution: --- → WORKSFORME
Whiteboard: [orange]
You need to log in before you can comment on or make changes to this bug.