Bug 482463 (Closed): Opened 11 years ago, Closed 11 years ago

Consistent timeout failures on unit tests early in the mornings

Categories: Release Engineering :: General, defect, P2
Platform: x86 macOS
Tracking: (Not tracked)
Status: RESOLVED FIXED

People: Reporter: bzbarsky, Assigned: bhearsum
Keywords: intermittent-failure
Attachments: 1 file, 1 obsolete file

I usually look at tinderbox at about 5-6am Pacific for the first time in the morning, and every day the last several cycles of unit tests are orange with various timeout-related failures.  A sampling from today (ignoring the orange caused by actual code changes):

buildbot.slave.commands.TimeoutError: command timed out: 300 seconds without output, killing pid 27909
buildbot.slave.commands.TimeoutError: command timed out: 300 seconds without output, killing pid 30167
buildbot.slave.commands.TimeoutError: command timed out: 300 seconds without output

(That's all three of the unit test cycles that tested rev 1d032a7fd91a on m-c).

buildbot.slave.commands.TimeoutError: command timed out: 300 seconds without output, killing pid 15844
buildbot.slave.commands.TimeoutError: command timed out: 300 seconds without output, killing pid 50401

(two of three unit test cycles testing rev 3b40510faeff on m-c)

*** 42317 ERROR TEST-UNEXPECTED-FAIL | /tests/intl/uconv/tests/test_unicode_noncharacters_gb18030.html | Error thrown during test: $("display") is null - got 0, expected 1
*** 42320 ERROR TEST-UNEXPECTED-FAIL | /tests/intl/uconv/tests/test_unicode_noncharacters_utf8.html | Error thrown during test: $("display") is null - got 0, expected 1

(This is due to it taking the test machine more than 100ms to parse a </pre>, an <a> with a bit of text in it, and a <p> with that ID.)

Is there something going on that makes load go through the roof on the unit test machines sometime between 23:00 and 04:00 pacific?  It'd be really nice to fix it, whatever it is...
cron runs all the daily scripts at 04:00 pacific local time.

Here are the scripts that are in /etc/cron.daily on moz2-linux-slave17:
0anacron  0logwatch  cups  logrotate  makewhatis.cron  mlocate.cron  prelink  rpm  tmpwatch
Yeah, that's actually a little later than the orange starts kicking in...
Also, that would not affect the mac unit test boxes, right?
Nightlies started at 2am until 20090310, now at 3:02am PST/PDT. That's 10 builds starting at once, perhaps a bit of a jump in load on the ESX servers but nothing that we don't do later in the day. And it doesn't explain Mac. 

Are the tests relying on anything on the network ?
They ought not to be, but sometimes we make mistakes, of course...

That said, do they rely on the network for the process that has the 5-minute timeout in any way?
Data from this morning:

The Mac unit test testing rev 501d8a053dc9 timed out.
Blocks: 438871
(In reply to comment #5)
> They ought not to be, but sometimes we make mistakes, of course...
> 
> That said, do they rely on the network for the process that has the 5-minute
> timeout in any way?

The timeout is done all on the slave, actually, so prolonged network weirdness wouldn't affect it.
Our VM hosts seem fine:
HOST        | CPU % | MEM % |
bm-vmware01 |  57%  |  77%  |
bm-vmware02 |  53%  |  71%  |
bm-vmware03 |  64%  |  73%  |
bm-vmware04 |  32%  |  64%  |
bm-vmware05 |  52%  |  81%  |
bm-vmware06 |  77%  |  83%  |
bm-vmware07 |  58%  |  68%  |
bm-vmware08 |  54%  |  79%  |
bm-vmware09 |  72%  |  82%  |
bm-vmware10 |  71%  |  78%  |
bm-vmware11 |  62%  |  79%  |
bm-vmware12 |  74%  |  86%  |
bm-vmware13 |  46%  |  82%  |

Right now *all* of our Linux and Win32 VMs are building, too, so I suspect it doesn't spike much higher than this.

(Also, this might be related to, or the same as, bug 483595.)
Assignee: nobody → bhearsum
Priority: -- → P2
(In reply to comment #1)
> cron runs all the daily scripts at 04:00 pacific local time.
> 
> Here are the scripts that are in /etc/cron.daily on moz2-linux-slave17:
> 0anacron  0logwatch  cups  logrotate  makewhatis.cron  mlocate.cron  prelink 
> rpm  tmpwatch

The mlocate.cron probably eats our face a little. It runs 'updatedb' which, considering all of the file changes each day, must hurt.
We could start by turning that off (can't think of a reason for "locate" to work on those machines anyway).
Yeah. I'm going to do that. Also looking at disabling the console X session, which runs gnome. (Note: unittests run in Xvfb, so this doesn't affect them in any way.)
So, here's my plan of attack:
1) Disable updatedb cronjob
2) Turn off non-unittest X/gnome session
3) Talk to IT about getting more RAM in these VMs.

I'll do 1) and 2) today.
I've pinged phong about 3) and he'll get back to me on how much room we have.
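For steps 1) and 2), the usual mechanism on these RHEL-style slaves is that run-parts only executes cron.daily scripts that have the execute bit set, so clearing that bit disables a job without deleting it. A minimal sketch of that mechanism, using a scratch directory rather than touching the real /etc/cron.daily:

```shell
# Sketch only: mirrors the job names from /etc/cron.daily on
# moz2-linux-slave17 in a throwaway directory.
crondir=$(mktemp -d)
touch "$crondir/mlocate.cron" "$crondir/prelink"
chmod +x "$crondir/mlocate.cron" "$crondir/prelink"

# run-parts skips files without the execute bit, so this disables the
# jobs while leaving the scripts in place for easy re-enabling later.
chmod -x "$crondir/mlocate.cron" "$crondir/prelink"
```

(How exactly the jobs were disabled on the slaves isn't recorded in this bug; this is just the common approach.)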
Status: NEW → ASSIGNED
Duplicate of this bug: 483595
Alright, the useless X session is stopped on moz2-linux-slave01 -> 19 and try-linux-slave01 -> 05.

Looks like the leak test builds were using that X session, patch incoming to fix them.
Attachment #367825 - Flags: review?(catlee)
Attachment #367825 - Flags: review?(catlee) → review+
Comment on attachment 367825 [details] [diff] [review]
this time, with a patch

changeset:   1019:782a43ac71dc
Attachment #367825 - Flags: checked-in+
Attachment #367824 - Attachment is obsolete: true
Attachment #367824 - Flags: review?(catlee)
(In reply to comment #12)
> So, here's my plan of attack:
> 1) Disable updatedb cronjob
> 2) Turn off non-unittest X/gnome session

Both of these are done now.
I've seen at least 2 libxul.so linkings go by without an issue. Interestingly, they both took less than 2 minutes to link. It really seems like some of them just hang, rather than take a long time swapping. It could be that during certain times the load on the datastores is higher and swapping becomes tremendously more expensive.

Nevertheless, still looking at getting more RAM for these.
Depends on: 483896
I just went around and disabled the prelink cronjob as well.
Doesn't look like we had any failures like this this morning. The only change is some shuffling of VMs on the datastores aravind did in bug 483896. I did land a change that bumps timeouts, but that was only for the Compile and Mercurial steps, so I can't see that affecting test pass/fail.

I'll leave this bug open, but if tomorrow morning is as problem-free, I'm going to declare this FIXED.
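For reference, the "300 seconds without output" errors above come from buildbot's per-step no-output timeout, which is what the landed change bumps. A hedged sketch of what such a master.cfg fragment might look like in a buildbot 0.7-era config (the repo URL, commands, and timeout values here are illustrative, not the actual patch):

```python
# Illustrative master.cfg fragment, not the change that landed in this bug.
from buildbot.process.factory import BuildFactory
from buildbot.steps.source import Mercurial
from buildbot.steps.shell import Compile

f = BuildFactory()
# 'timeout' is the no-output timeout in seconds: buildbot kills the step
# if it produces no output for that long, raising the TimeoutError
# quoted in comment 0 (default is 1200 for most steps).
f.addStep(Mercurial(repourl='http://hg.mozilla.org/mozilla-central',
                    timeout=3600))
f.addStep(Compile(command=['make', '-f', 'client.mk', 'build'],
                  timeout=3600))
```

Note this only papers over slow steps; it doesn't help a step that has genuinely hung, which is why the cron and RAM work above still matters.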
Still haven't seen any more failures due to timeouts.
Status: ASSIGNED → RESOLVED
Closed: 11 years ago
Resolution: --- → FIXED
Whiteboard: [orange]
Product: mozilla.org → Release Engineering