Closed Bug 544891 Opened 14 years ago Closed 14 years ago

Intermittent timeout in reftests on debug builds (aborts the reftest suite)

Categories

(Core :: Layout, defect)

x86
All
defect
Not set
normal

Tracking

()

RESOLVED FIXED
mozilla1.9.3a2

People

(Reporter: ehsan.akhgari, Unassigned)

References

()

Details

(Keywords: intermittent-failure, Whiteboard: [fixed by bug 545784] [test which aborts the suite])

http://tinderbox.mozilla.org/showlog.cgi?log=Firefox/1265640712.1265644736.24211.gz#err0
Linux mozilla-central debug test reftest on 2010/02/08 06:51:52

REFTEST INFO | Loading a blank page
++DOMWINDOW == 13 (0x9a57b30) [serial = 10387] [outer = 0x99a3698]
REFTEST INFO | Loading file:///builds/moz2_slave/mozilla-central-linux-debug-unittest-reftest/build/reftest/tests/layout/reftests/table-bordercollapse/frame_box_rules_all.html
++DOMWINDOW == 14 (0x9c34658) [serial = 10388] [outer = 0x99a3698]
REFTEST INFO | Loading file:///builds/moz2_slave/mozilla-central-linux-debug-unittest-reftest/build/reftest/tests/layout/reftests/table-bordercollapse/frame_box_rules_all_ref.html
++DOMWINDOW == 15 (0x9db47b8) [serial = 10389] [outer = 0x99a3698]

command timed out: 3600 seconds elapsed, killing pid 7001
process killed by signal 9
program finished with exit code -1
elapsedTime=3600.066542
TinderboxPrint: reftest<br/><em class="testfail">T-FAIL</em>
buildbot.slave.commands.TimeoutError: command timed out: 3600 seconds elapsed, killing pid 7001
TinderboxPrint: reftest<br/><em class="testfail">timeout</em>


This timeout aborts the reftest suite.

This is intermittent, because it happened on changeset http://hg.mozilla.org/mozilla-central/rev/48fde4aa4334, which was specific to Windows.
This happened again on the same revision:

http://tinderbox.mozilla.org/showlog.cgi?log=Firefox/1265643895.1265647922.29942.gz
Linux mozilla-central debug test reftest on 2010/02/08 07:44:55
http://tinderbox.mozilla.org/showlog.cgi?log=Firefox/1265646662.1265650612.31868.gz
Linux mozilla-central debug test reftest on 2010/02/08 08:31:02

This time, the reftest suite timed out after loading layout/reftests/table-anonymous-boxes/white-space-7.html.
Summary: Intermittent timeout in reftest frame_box_rules_all (aborts the reftest suite) → Intermittent timeout in reftests on debug Linux builds (aborts the reftest suite)
This seems to be happening on Mac as well.

http://tinderbox.mozilla.org/showlog.cgi?log=Firefox/1265648292.1265652169.17595.gz
OS X 10.5.2 mozilla-central debug test reftest on 2010/02/08 08:58:12
OS: Linux → All
Summary: Intermittent timeout in reftests on debug Linux builds (aborts the reftest suite) → Intermittent timeout in reftests on debug builds (aborts the reftest suite)
http://tinderbox.mozilla.org/showlog.cgi?log=Firefox/1265651279.1265655149.20453.gz
Linux mozilla-central debug test reftest on 2010/02/08 09:47:59

REFTEST TEST-PASS | file:///builds/moz2_slave/mozilla-central-linux-debug-unittest-reftest/build/reftest/tests/layout/reftests/table-bordercollapse/frame_border_rules_none.html |
REFTEST INFO | Loading a blank page
--DOMWINDOW == 65 (0xaa1beb0) [serial = 10315] [outer = 0x9152270] [url = file:///builds/moz2_slave/mozilla-central-linux-debug-unittest-reftest/build/reftest/tests/layout/reftests/table-bordercollapse/bc_dyn_cg2.html]
--DOMWINDOW == 64 (0x9a17c10) [serial = 10313] [outer = 0x9152270] [url = file:///builds/moz2_slave/mozilla-central-linux-debug-unittest-reftest/build/reftest/tests/layout/reftests/table-bordercollapse/bc_dyn_cg1.html]
--DOMWINDOW == 63 (0xa4ec608) [serial = 10317] [outer = 0x9152270] [url = file:///builds/moz2_slave/mozilla-central-linux-debug-unittest-reftest/build/reftest/tests/layout/reftests/table-bordercollapse/bc_dyn_cg3.html]

command timed out: 3600 seconds elapsed, killing pid 3366
process killed by signal 9
program finished with exit code -1
elapsedTime=3600.029928
TinderboxPrint: reftest<br/><em class="testfail">T-FAIL</em>
buildbot.slave.commands.TimeoutError: command timed out: 3600 seconds elapsed, killing pid 3366
TinderboxPrint: reftest<br/><em class="testfail">timeout</em>
http://tinderbox.mozilla.org/showlog.cgi?log=Firefox/1265665019.1265669069.20046.gz
Linux mozilla-central debug test reftest on 2010/02/08 13:36:59
s: moz2-linux-slave13
REFTEST INFO | Loading file:///builds/moz2_slave/mozilla-central-linux-debug-unittest-reftest/build/reftest/tests/layout/reftests/table-bordercollapse/bc_dyn_rg3.html
http://tinderbox.mozilla.org/showlog.cgi?log=Firefox/1265666706.1265670744.7697.gz
Linux mozilla-central debug test reftest on 2010/02/08 14:05:06
s: moz2-linux-slave10
REFTEST INFO | Loading file:///builds/moz2_slave/mozilla-central-linux-debug-unittest-reftest/build/reftest/tests/layout/reftests/table-anonymous-boxes/white-space-6.html
http://tinderbox.mozilla.org/showlog.cgi?log=Firefox/1265668070.1265672686.30967.gz
Linux mozilla-central debug test reftest on 2010/02/08 14:27:50
s: moz2-linux-slave41
REFTEST INFO | Loading file:///builds/slave/mozilla-central-linux-debug-unittest-reftest/build/reftest/tests/layout/reftests/table-bordercollapse/frame_lhs_rules_cols_ref.html
http://tinderbox.mozilla.org/showlog.cgi?log=Firefox/1265671430.1265675777.2963.gz
Linux mozilla-central debug test reftest on 2010/02/08 15:23:50
s: moz2-linux-slave02
REFTEST TEST-PASS | file:///builds/moz2_slave/mozilla-central-linux-debug-unittest-reftest/build/reftest/tests/layout/reftests/table-width/colgroup-vs-column-3.html |
REFTEST INFO | Loading a blank page
Happened on 7 out of the 10 runs since and including changeset 48fde4aa4334.
http://tinderbox.mozilla.org/showlog.cgi?log=Firefox/1265679561.1265683513.27263.gz
Linux mozilla-central debug test reftest on 2010/02/08 17:39:21
s: moz2-linux-slave07
REFTEST INFO | Loading file:///builds/moz2_slave/mozilla-central-linux-debug-unittest-reftest/build/reftest/tests/layout/reftests/table-width/conflicting-percent-widths-2.html
If it's that frequent, could somebody try to reproduce the bug?
http://tinderbox.mozilla.org/showlog.cgi?log=Firefox/1265685906.1265689733.2288.gz
Linux mozilla-central debug test reftest on 2010/02/08 19:25:06
s: moz2-linux-slave05
http://tinderbox.mozilla.org/showlog.cgi?log=Firefox/1265693036.1265697085.22665.gz
Linux mozilla-central debug test reftest on 2010/02/08 21:23:56
s: moz2-linux-slave02
http://tinderbox.mozilla.org/showlog.cgi?log=Firefox/1265696711.1265700562.30034.gz
Linux mozilla-central debug test reftest on 2010/02/08 22:25:11  
s: moz2-linux-slave13

http://tinderbox.mozilla.org/showlog.cgi?log=Firefox/1265709258.1265713151.14211.gz
Linux mozilla-central debug test reftest on 2010/02/09 01:54:18
s: moz2-linux-slave14

http://tinderbox.mozilla.org/showlog.cgi?log=Firefox/1265715472.1265719293.22839.gz
Linux mozilla-central debug test reftest on 2010/02/09 03:37:52
s: moz2-linux-slave13
http://tinderbox.mozilla.org/showlog.cgi?log=Firefox/1265735419.1265739433.10475.gz
Linux mozilla-central debug test reftest on 2010/02/09 09:10:19
s: moz2-linux-slave15
http://tinderbox.mozilla.org/showlog.cgi?log=Firefox/1265737866.1265741715.5094.gz
Linux mozilla-central debug test reftest on 2010/02/09 09:51:06
s: moz2-linux-slave08
http://tinderbox.mozilla.org/showlog.cgi?log=Firefox/1265757022.1265760919.29456.gz
Linux mozilla-central debug test reftest on 2010/02/09 15:10:22
s: moz2-linux-slave09
http://tinderbox.mozilla.org/showlog.cgi?log=Firefox/1265759399.1265763387.24642.gz
Linux mozilla-central debug test reftest on 2010/02/09 15:49:59
s: moz2-linux-slave01
I think this is another one, but not linux - which seems to be the majority here:

OS X 10.5.2 mozilla-central debug test reftest [testfailed] Started 16:53, finished 18:03

http://tinderbox.mozilla.org/showlog.cgi?log=Firefox/1265763199.1265767363.5591.gz
http://tinderbox.mozilla.org/showlog.cgi?log=Firefox/1265767197.1265771789.23677.gz
Linux mozilla-central debug test reftest on 2010/02/09 17:59:57
s: moz2-linux-slave08

If that Mac one, which timed out after the whole suite was done, in shutdown, really is the same thing, then maybe we've just gotten too slow for the timeout - Linux debug is infamous for being the thing that fails whenever your test can't take being run in slow motion.
http://tinderbox.mozilla.org/showlog.cgi?log=Firefox/1265769737.1265774209.19006.gz
Linux mozilla-central debug test reftest on 2010/02/09 18:42:17
s: moz2-linux-slave07
http://tinderbox.mozilla.org/showlog.cgi?log=Firefox/1265768276.1265772611.318.gz
OS X 10.5.2 mozilla-central debug test reftest on 2010/02/09 18:17:56
s: moz2-darwin9-slave19

Another Mac timeout during shutdown after finishing the run.
http://tinderbox.mozilla.org/showlog.cgi?log=Firefox/1265772675.1265777121.19209.gz
Linux mozilla-central debug test reftest on 2010/02/09 19:31:15
s: moz2-linux-slave06
http://tinderbox.mozilla.org/showlog.cgi?log=Firefox/1265775463.1265779910.18349.gz
Linux mozilla-central debug test reftest on 2010/02/09 20:17:43
s: moz2-linux-slave24

http://tinderbox.mozilla.org/showlog.cgi?log=Firefox/1265777378.1265781399.2831.gz
Linux mozilla-central debug test reftest on 2010/02/09 20:49:38
s: moz2-linux-slave22
http://tinderbox.mozilla.org/showlog.cgi?log=Firefox/1265827194.1265831293.28939.gz
Linux mozilla-central debug test reftest on 2010/02/10 10:39:54
s: moz2-linux-slave09


http://tinderbox.mozilla.org/showlog.cgi?log=Firefox/1265828595.1265832921.15114.gz
Linux mozilla-central debug test reftest on 2010/02/10 11:03:15
s: moz2-linux-slave10
http://tinderbox.mozilla.org/showlog.cgi?log=Firefox/1265835890.1265839868.29597.gz
Linux mozilla-central debug test reftest on 2010/02/10 13:04:50
s: moz2-linux-slave02
What percentage complete is the run when it hits these timeouts?  Is it usually near the end?
(In reply to comment #28)
> What percentage complete is the run when it hits these timeouts?  Is it usually
> near the end?

I don't have precise numbers here, but judging from the logs, and the list files inside layout/reftest/reftest.list, I'd say it's usually about 60-70% percent complete when the failure occurs.

Does this shed any light on the problem?
http://tinderbox.mozilla.org/showlog.cgi?log=Firefox/1265840838.1265844705.20298.gz
OS X 10.5.2 mozilla-central debug test reftest on 2010/02/10 14:27:18
s: moz2-darwin9-slave19
http://tinderbox.mozilla.org/showlog.cgi?log=Firefox/1265842369.1265846303.6137.gz
Linux mozilla-central debug test reftest on 2010/02/10 14:52:49
s: moz2-linux-slave24
http://tinderbox.mozilla.org/showlog.cgi?log=Firefox/1265884121.1265888209.21134.gz
Linux mozilla-central debug test reftest on 2010/02/11 02:28:41
s: moz2-linux-slave24
http://tinderbox.mozilla.org/showlog.cgi?log=Firefox/1265890680.1265894640.1801.gz
Linux mozilla-central debug test reftest on 2010/02/11 04:18:00
s: moz2-linux-slave01

http://tinderbox.mozilla.org/showlog.cgi?log=Firefox/1265876440.1265880384.27770.gz
Linux mozilla-central debug test reftest on 2010/02/11 00:20:40
s: moz2-linux-slave24

http://tinderbox.mozilla.org/showlog.cgi?log=Firefox/1265905549.1265909866.22837.gz
Linux mozilla-central debug test reftest on 2010/02/11 08:25:49
s: moz2-linux-slave13

http://tinderbox.mozilla.org/showlog.cgi?log=Firefox/1265899812.1265904152.20618.gz
Linux mozilla-central debug test reftest on 2010/02/11 06:50:12
s: moz2-linux-slave02
http://tinderbox.mozilla.org/showlog.cgi?log=Firefox/1265913600.1265917662.15133.gz
Linux mozilla-central debug test reftest on 2010/02/11 10:40:00
s: moz2-linux-slave06
http://tinderbox.mozilla.org/showlog.cgi?log=Firefox/1265914141.1265917969.18573.gz
OS X 10.5.2 mozilla-central debug test reftest on 2010/02/11 10:49:01
s: moz2-darwin9-slave24
http://tinderbox.mozilla.org/showlog.cgi?log=Firefox/1265915599.1265919573.4312.gz
Linux mozilla-central debug test reftest on 2010/02/11 11:13:19
s: moz2-linux-slave13
http://tinderbox.mozilla.org/showlog.cgi?log=Firefox/1265918316.1265922346.3003.gz
Linux mozilla-central debug test reftest on 2010/02/11 11:58:36
s: moz2-linux-slave08
(In reply to comment #29)
> (In reply to comment #28)
> > What percentage complete is the run when it hits these timeouts?  Is it usually
> > near the end?
> 
> I don't have precise numbers here, but judging from the logs, and the list
> files inside layout/reftest/reftest.list, I'd say it's usually about 60-70%
> percent complete when the failure occurs.
> 
> Does this shed any light on the problem?

Which directories was it ending in?  the main layout/reftest/reftest.list is pretty heavily biased, since there's a large number of reftests in layout/reftests/bugs/.
http://tinderbox.mozilla.org/showlog.cgi?log=Firefox/1265923102.1265927645.30607.gz
Linux mozilla-central debug test reftest on 2010/02/11 13:18:22
s: moz2-linux-slave24
(In reply to comment #39)
> (In reply to comment #29)
> > (In reply to comment #28)
> > > What percentage complete is the run when it hits these timeouts?  Is it usually
> > > near the end?
> > 
> > I don't have precise numbers here, but judging from the logs, and the list
> > files inside layout/reftest/reftest.list, I'd say it's usually about 60-70%
> > percent complete when the failure occurs.
> > 
> > Does this shed any light on the problem?
> 
> Which directories was it ending in?  the main layout/reftest/reftest.list is
> pretty heavily biased, since there's a large number of reftests in
> layout/reftests/bugs/.

I think most (but not all) of the failures happen in the table-bordercollapse directory.
Did we have an increase in assertions recently?  That could slow things down.
http://tinderbox.mozilla.org/showlog.cgi?log=Firefox/1265926833.1265931665.11407.gz
WINNT 5.2 mozilla-central debug test reftest on 2010/02/11 14:20:33  
s: win32-slave13
/table-bordercollapse/

Even Windows finally joins the party.
http://tinderbox.mozilla.org/showlog.cgi?log=Firefox/1265930020.1265934342.10018.gz
WINNT 5.2 mozilla-central debug test reftest on 2010/02/11 15:13:40
s: win32-slave43
/table-width/
A passing Rd run takes nearly an hour, so a one-hour maximum is obviously going to cause problems.  We should increase the maximum.

On Linux, it might have gone from 40 minutes average to 50 minutes average with http://hg.mozilla.org/mozilla-central/rev/8dc04dca2a59. It's hard for me to tell.
(In reply to comment #42)
> Did we have an increase in assertions recently?  That could slow things down.

I investigated ten changesets before the first one on which this happened for the first time, and none had added a noticeable bunch of assertions.

What is the fix strategy here?  Should we increase the timeout value on buildbot?
Depends on: 545784
We're doing that in bug 545784 :)
http://tinderbox.mozilla.org/showlog.cgi?log=Firefox/1265959778.1265963682.11510.gz
Linux mozilla-central debug test reftest on 2010/02/11 23:29:38
s: moz2-linux-slave01

http://tinderbox.mozilla.org/showlog.cgi?log=Firefox/1265968145.1265972261.11157.gz
Linux mozilla-central debug test reftest on 2010/02/12 01:49:05
s: moz2-linux-slave13

http://tinderbox.mozilla.org/showlog.cgi?log=Firefox/1265970015.1265974415.3256.gz
Linux mozilla-central debug test reftest on 2010/02/12 02:20:15
s: moz2-linux-slave22
Blocks: 539516
Status: NEW → RESOLVED
Closed: 14 years ago
Resolution: --- → FIXED
Flags: in-testsuite-
Whiteboard: [orange][test which aborts the suite] → [fixed by bug 545784] [orange][test which aborts the suite]
Target Milestone: --- → mozilla1.9.3a2
http://tinderbox.mozilla.org/showlog.cgi?tree=Firefox&errorparser=unittest&logfile=1269251710.1269255098.18990.gz&buildtime=1269251710&buildname=WINNT%205.2%20mozilla-central%20debug%20test%20reftest&fulltext=1#err0
WINNT 5.2 mozilla-central debug test reftest on 2010/03/22 02:55:10

TEST-UNEXPECTED-FAIL | automation.py | application timed out after 330 seconds with no output

command timed out: 1200 seconds without output
program finished with exit code 1
elapsedTime=2933.906000
TinderboxPrint: reftest<br/>4267/0/277
buildbot.slave.commands.TimeoutError: command timed out: 1200 seconds without output
TinderboxPrint: reftest<br/><em class="testfail">timeout</em>


Hmm, it seems that the timeout in automation.py is 330 seconds here, not 1200 seconds.  I wonder why.
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
These are time-without-output timeouts.  Not this bug.
Status: REOPENED → RESOLVED
Closed: 14 years ago14 years ago
Resolution: --- → FIXED
(In reply to comment #50)
> These are time-without-output timeouts.  Not this bug.

Filed bug 554111
Whiteboard: [fixed by bug 545784] [orange][test which aborts the suite] → [fixed by bug 545784] [test which aborts the suite]
You need to log in before you can comment on or make changes to this bug.