Closed
Bug 482463
Opened 15 years ago
Closed 15 years ago
Consistent timeout failures on unit tests early in the mornings
Categories
(Release Engineering :: General, defect, P2)
Tracking
(Not tracked)
RESOLVED
FIXED
People
(Reporter: bzbarsky, Assigned: bhearsum)
References
Details
(Keywords: intermittent-failure)
Attachments
(1 file, 1 obsolete file)
1.63 KB, patch (catlee: review+, bhearsum: checked-in+)
I usually look at tinderbox at about 5-6am Pacific for the first time in the morning, and every day the last several cycles of unit tests are orange with various timeout-related oranges. A sampling from today (ignoring the orange caused by actual code changes):

buildbot.slave.commands.TimeoutError: command timed out: 300 seconds without output, killing pid 27909
buildbot.slave.commands.TimeoutError: command timed out: 300 seconds without output, killing pid 30167
buildbot.slave.commands.TimeoutError: command timed out: 300 seconds without output

(That's all three of the unit test cycles that tested rev 1d032a7fd91a on m-c.)

buildbot.slave.commands.TimeoutError: command timed out: 300 seconds without output, killing pid 15844
buildbot.slave.commands.TimeoutError: command timed out: 300 seconds without output, killing pid 50401

(Two of three unit test cycles testing rev 3b40510faeff on m-c.)

*** 42317 ERROR TEST-UNEXPECTED-FAIL | /tests/intl/uconv/tests/test_unicode_noncharacters_gb18030.html | Error thrown during test: $("display") is null - got 0, expected 1
*** 42320 ERROR TEST-UNEXPECTED-FAIL | /tests/intl/uconv/tests/test_unicode_noncharacters_utf8.html | Error thrown during test: $("display") is null - got 0, expected 1

(This is due to it taking the test machine more than 100ms to parse a </pre>, an <a> with a bit of text in it, and a <p> with that ID.)

Is there something going on that makes load go through the roof on the unit test machines sometime between 23:00 and 04:00 Pacific? It'd be really nice to fix it, whatever it is...
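The "without output" wording in those errors matters: the watchdog fires after five minutes of silence, not five minutes of total runtime, so a hung test is killed while a slow-but-chatty one survives. A minimal sketch of that behavior (not buildbot's actual implementation; the function and exception names are made up for illustration):

```python
# Sketch of an idle-output watchdog (hypothetical, not buildbot's code):
# kill the child if `idle_timeout` seconds pass with no new output.
import selectors
import subprocess


class OutputTimeoutError(Exception):
    pass


def run_with_idle_timeout(cmd, idle_timeout):
    """Run cmd; raise if idle_timeout seconds pass without any output."""
    proc = subprocess.Popen(cmd, stdout=subprocess.PIPE,
                            stderr=subprocess.STDOUT, text=True)
    sel = selectors.DefaultSelector()
    sel.register(proc.stdout, selectors.EVENT_READ)
    chunks = []
    while True:
        # Wait for output; an empty event list means the timer expired.
        if not sel.select(timeout=idle_timeout):
            proc.kill()
            proc.wait()
            raise OutputTimeoutError(
                "command timed out: %d seconds without output, "
                "killing pid %d" % (idle_timeout, proc.pid))
        line = proc.stdout.readline()
        if line:
            chunks.append(line)
        elif proc.poll() is not None:   # EOF and the process has exited
            return "".join(chunks)
```

This is why network stalls were a plausible suspect: anything that makes the child go quiet for 300 seconds trips the timeout, whether or not the child is making progress.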
Comment 1 • 15 years ago
cron runs all the daily scripts at 04:00 Pacific local time. Here are the scripts that are in /etc/cron.daily on moz2-linux-slave17:

0anacron 0logwatch cups logrotate makewhatis.cron mlocate.cron prelink rpm tmpwatch
Reporter
Comment 2 • 15 years ago
Yeah, that's actually a little later than the orange starts kicking in...
Reporter
Comment 3 • 15 years ago
Also, that would not affect the mac unit test boxes, right?
Comment 4 • 15 years ago
Nightlies started at 2am until 2009-03-10; now they start at 3:02am PST/PDT. That's 10 builds starting at once, perhaps a bit of a jump in load on the ESX servers, but nothing we don't do later in the day. And it doesn't explain Mac. Are the tests relying on anything on the network?
Reporter
Comment 5 • 15 years ago
They ought not to be, but sometimes we make mistakes, of course... That said, do they rely on the network for the process that has the 5-minute timeout in any way?
Reporter
Comment 6 • 15 years ago
Data from this morning: The Mac unit test testing rev 501d8a053dc9 timed out.
Blocks: 438871
Assignee
Comment 7 • 15 years ago
(In reply to comment #5)
> They ought not to be, but sometimes we make mistakes, of course...
>
> That said, do they rely on the network for the process that has the 5-minute
> timeout in any way?

The timeout is done all on the slave, actually, so prolonged network weirdness wouldn't affect it.
Assignee
Comment 8 • 15 years ago
Our VM hosts seem fine:

HOST        | CPU % | MEM %
bm-vmware01 |  57%  |  77%
bm-vmware02 |  53%  |  71%
bm-vmware03 |  64%  |  73%
bm-vmware04 |  32%  |  64%
bm-vmware05 |  52%  |  81%
bm-vmware06 |  77%  |  83%
bm-vmware07 |  58%  |  68%
bm-vmware08 |  54%  |  79%
bm-vmware09 |  72%  |  82%
bm-vmware10 |  71%  |  78%
bm-vmware11 |  62%  |  79%
bm-vmware12 |  74%  |  86%
bm-vmware13 |  46%  |  82%

Right now *all* of our Linux and Win32 VMs are building, too, so I suspect it doesn't spike much higher than this. (Also, this might be related to, or the same as, bug 483595.)
Assignee
Updated • 15 years ago
Assignee: nobody → bhearsum
Priority: -- → P2
Assignee
Comment 9 • 15 years ago
(In reply to comment #1)
> cron runs all the daily scripts at 04:00 pacific local time.
>
> Here are the scripts that are in /etc/cron.daily on moz2-linux-slave17:
> 0anacron 0logwatch cups logrotate makewhatis.cron mlocate.cron prelink
> rpm tmpwatch

The mlocate.cron probably eats our face a little. It runs 'updatedb', which, considering all of the file changes each day, must hurt.
Reporter
Comment 10 • 15 years ago
We could start by turning that off (can't think of a reason for "locate" to work on those machines anyway).
Assignee
Comment 11 • 15 years ago
Yeah. I'm going to do that. Also looking at disabling the console X session, which runs gnome. (Note: unittests run in Xvfb, so this doesn't affect them in any way.)
Assignee
Comment 12 • 15 years ago
So, here's my plan of attack:
1) Disable the updatedb cronjob
2) Turn off the non-unittest X/gnome session
3) Talk to IT about getting more RAM in these VMs

I'll do 1) and 2) today. I've pinged phong about 3) and he'll get back to me on how much room we have.
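Step 1) doesn't require deleting anything: run-parts only executes entries in /etc/cron.daily whose execute bit is set, so clearing that bit disables a job reversibly. A sketch of the idea, demonstrated on a throwaway directory standing in for /etc/cron.daily (the helper names are hypothetical):

```python
# Sketch: disable a cron.daily job by clearing its execute bits, so that
# run-parts skips it. Uses a temp directory as a stand-in for /etc/cron.daily.
import os
import stat
import tempfile


def disable_cron_script(path):
    """Clear all execute bits so run-parts skips this entry."""
    mode = os.stat(path).st_mode
    os.chmod(path, mode & ~(stat.S_IXUSR | stat.S_IXGRP | stat.S_IXOTH))


def runnable_entries(cron_dir):
    """Entries run-parts would still execute (owner-executable files)."""
    return sorted(
        name for name in os.listdir(cron_dir)
        if os.stat(os.path.join(cron_dir, name)).st_mode & stat.S_IXUSR
    )


cron_dir = tempfile.mkdtemp()
for name in ["logrotate", "mlocate.cron", "prelink", "tmpwatch"]:
    path = os.path.join(cron_dir, name)
    open(path, "w").close()
    os.chmod(path, 0o755)

# Disable the two heavy jobs named in this bug: updatedb and prelink.
disable_cron_script(os.path.join(cron_dir, "mlocate.cron"))
disable_cron_script(os.path.join(cron_dir, "prelink"))
print(runnable_entries(cron_dir))   # -> ['logrotate', 'tmpwatch']
```

The same effect on a real slave would be a one-line `chmod -x /etc/cron.daily/mlocate.cron`, which survives until the next package update reinstalls the script.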
Status: NEW → ASSIGNED
Assignee
Comment 14 • 15 years ago
Alright, the useless X session is stopped on moz2-linux-slave01 -> 19 and try-linux-slave01 -> 05. Looks like the leak test builds were using that X session; patch incoming to fix them.
Assignee
Comment 15 • 15 years ago
Attachment #367824 - Flags: review?(catlee)
Assignee
Comment 16 • 15 years ago
Attachment #367825 - Flags: review?(catlee)
Updated • 15 years ago
Attachment #367825 - Flags: review?(catlee) → review+
Assignee
Comment 17 • 15 years ago
Comment on attachment 367825 [details] [diff] [review]
this time, with a patch

changeset: 1019:782a43ac71dc
Attachment #367825 - Flags: checked-in+
Assignee
Updated • 15 years ago
Attachment #367824 - Attachment is obsolete: true
Attachment #367824 - Flags: review?(catlee)
Assignee
Comment 18 • 15 years ago
(In reply to comment #12)
> So, here's my plan of attack:
> 1) Disable updatedb cronjob
> 2) Turn off non-unittest X/gnome session

Both of these are done now.
Assignee
Comment 19 • 15 years ago
I've seen at least 2 libxul.so linkings go by without an issue. Interestingly, they both took less than 2 minutes to link. It really seems like some of them just hang, rather than take a long time swapping. It could be that during certain times the load on the datastores is higher and swapping becomes tremendously more expensive. Nevertheless, still looking at getting more RAM for these.
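One way to test the swapping theory is to snapshot swap usage from /proc/meminfo on either side of the link step and compare. A Linux-only sketch; the `sleep` is a stand-in for the real libxul.so link command, and the helper name is made up:

```python
# Sketch: measure the change in swap-in-use across a command, by parsing
# SwapTotal/SwapFree from /proc/meminfo (Linux-only).
import subprocess


def swap_used_kb():
    """Swap currently in use, in kB, parsed from /proc/meminfo."""
    fields = {}
    with open("/proc/meminfo") as f:
        for line in f:
            key, value = line.split(":", 1)
            fields[key] = int(value.split()[0])
    return fields["SwapTotal"] - fields["SwapFree"]


before = swap_used_kb()
subprocess.run(["sleep", "1"])   # stand-in for the libxul.so link command
after = swap_used_kb()
print("swap delta during link: %d kB" % (after - before))
```

A large positive delta during a slow link would point at memory pressure; a near-zero delta during a hang would point elsewhere (e.g. datastore I/O stalls).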
Assignee
Comment 20 • 15 years ago
I just went around and disabled the prelink cronjob as well.
Assignee
Comment 21 • 15 years ago
Doesn't look like we had any failures like this this morning. The only change is some shuffling of VMs on the datastores that aravind did in bug 483896. I did land a change that bumps timeouts, but that was only for the Compile and Mercurial steps, so I can't see that affecting test pass/fail. I'll leave this bug open, but if tomorrow morning is as problem-free I'm going to declare this FIXED.
Assignee
Comment 22 • 15 years ago
Still haven't seen any more failures due to timeout.
Status: ASSIGNED → RESOLVED
Closed: 15 years ago
Resolution: --- → FIXED
Updated • 14 years ago
Whiteboard: [orange]
Updated • 12 years ago
Keywords: intermittent-failure
Updated • 12 years ago
Whiteboard: [orange]
Updated • 11 years ago
Product: mozilla.org → Release Engineering