Bug 482463 (Closed): Opened 11 years ago, Closed 11 years ago

Consistent timeout failures on unit tests early in the mornings

Categories: Release Engineering :: General, defect, P2
Platform: x86 macOS
Tracking: (Not tracked)
Status: RESOLVED FIXED

People: Reporter: bzbarsky, Assigned: bhearsum
Keywords: intermittent-failure
Attachments: 1 file, 1 obsolete file

I usually look at tinderbox at about 5-6am Pacific for the first time in the morning, and every day the last several cycles of unit tests are orange with various timeout-related failures.  A sampling from today (ignoring the orange caused by actual code changes):

buildbot.slave.commands.TimeoutError: command timed out: 300 seconds without output, killing pid 27909
buildbot.slave.commands.TimeoutError: command timed out: 300 seconds without output, killing pid 30167
buildbot.slave.commands.TimeoutError: command timed out: 300 seconds without output

(That's all three of the unit test cycles that tested rev 1d032a7fd91a on m-c).

buildbot.slave.commands.TimeoutError: command timed out: 300 seconds without output, killing pid 15844
buildbot.slave.commands.TimeoutError: command timed out: 300 seconds without output, killing pid 50401

(two of three unit test cycles testing rev 3b40510faeff on m-c)

*** 42317 ERROR TEST-UNEXPECTED-FAIL | /tests/intl/uconv/tests/test_unicode_noncharacters_gb18030.html | Error thrown during test: $("display") is null - got 0, expected 1
*** 42320 ERROR TEST-UNEXPECTED-FAIL | /tests/intl/uconv/tests/test_unicode_noncharacters_utf8.html | Error thrown during test: $("display") is null - got 0, expected 1

(This is due to it taking the test machine more than 100ms to parse a </pre>, an <a> with a bit of text in it, and a <p> with that ID.)

Is there something going on that makes load go through the roof on the unit test machines sometime between 23:00 and 04:00 pacific?  It'd be really nice to fix it, whatever it is...
cron runs all the daily scripts at 04:00 pacific local time.

Here are the scripts that are in /etc/cron.daily on moz2-linux-slave17:
0anacron  0logwatch  cups  logrotate  makewhatis.cron  mlocate.cron  prelink  rpm  tmpwatch
Yeah, that's actually a little later than the orange starts kicking in...
Also, that would not affect the mac unit test boxes, right?
Nightlies started at 2am until 20090310, now at 3:02am PST/PDT. That's 10 builds starting at once, perhaps a bit of a jump in load on the ESX servers but nothing that we don't do later in the day. And it doesn't explain Mac. 

Are the tests relying on anything on the network ?
They ought not to be, but sometimes we make mistakes, of course...

That said, do they rely on the network for the process that has the 5-minute timeout in any way?
Data from this morning:

The Mac unit test testing rev 501d8a053dc9 timed out.
Blocks: 438871
(In reply to comment #5)
> They ought not to be, but sometimes we make mistakes, of course...
> 
> That said, do they rely on the network for the process that has the 5-minute
> timeout in any way?

The timeout is done all on the slave, actually, so prolonged network weirdness wouldn't affect it.
Our VM hosts seem fine:
HOST        | CPU % | MEM % |
bm-vmware01 |  57%  |  77%  |
bm-vmware02 |  53%  |  71%  |
bm-vmware03 |  64%  |  73%  |
bm-vmware04 |  32%  |  64%  |
bm-vmware05 |  52%  |  81%  |
bm-vmware06 |  77%  |  83%  |
bm-vmware07 |  58%  |  68%  |
bm-vmware08 |  54%  |  79%  |
bm-vmware09 |  72%  |  82%  |
bm-vmware10 |  71%  |  78%  |
bm-vmware11 |  62%  |  79%  |
bm-vmware12 |  74%  |  86%  |
bm-vmware13 |  46%  |  82%  |

Right now *all* of our Linux and Win32 VMs are building, too, so I suspect it doesn't spike much higher than this.

(Also, this might be related to, or the same as, bug 483595.)
Assignee: nobody → bhearsum
Priority: -- → P2
(In reply to comment #1)
> cron runs all the daily scripts at 04:00 pacific local time.
> 
> Here are the scripts that are in /etc/cron.daily on moz2-linux-slave17:
> 0anacron  0logwatch  cups  logrotate  makewhatis.cron  mlocate.cron  prelink 
> rpm  tmpwatch

The mlocate.cron probably eats our face a little. It runs 'updatedb' which, considering all of the file changes each day, must hurt.
We could start by turning that off (can't think of a reason for "locate" to work on those machines anyway).
Yeah. I'm going to do that. Also looking at disabling the console X session, which runs gnome. (Note: unittests run in Xvfb, so this doesn't affect them in any way.)
So, here's my plan of attack:
1) Disable updatedb cronjob
2) Turn off non-unittest X/gnome session
3) Talk to IT about getting more RAM in these VMs.

I'll do 1) and 2) today.
I've pinged phong about 3) and he'll get back to me on how much room we have.
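For steps 1) and 2), the usual mechanism on these RHEL-style slaves is that run-parts only executes cron.daily scripts that have the execute bit set, so clearing that bit disables a job without deleting it. A minimal sketch of that mechanism, using a scratch directory rather than touching the real /etc/cron.daily:

```shell
# Sketch only: mirrors the job names from /etc/cron.daily on
# moz2-linux-slave17 in a throwaway directory.
crondir=$(mktemp -d)
touch "$crondir/mlocate.cron" "$crondir/prelink"
chmod +x "$crondir/mlocate.cron" "$crondir/prelink"

# run-parts skips files without the execute bit, so this disables the
# jobs while leaving the scripts in place for easy re-enabling later.
chmod -x "$crondir/mlocate.cron" "$crondir/prelink"
```

(How exactly the jobs were disabled on the slaves isn't recorded in this bug; this is just the common approach.)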
Status: NEW → ASSIGNED
Duplicate of this bug: 483595
Alright, the useless X session is stopped on moz2-linux-slave01 -> 19 and try-linux-slave01 -> 05.

Looks like the leak test builds were using that X session, patch incoming to fix them.
Attachment #367825 - Flags: review?(catlee)
Attachment #367825 - Flags: review?(catlee) → review+
Comment on attachment 367825 [details] [diff] [review]
this time, with a patch

changeset:   1019:782a43ac71dc
Attachment #367825 - Flags: checked-in+
Attachment #367824 - Attachment is obsolete: true
Attachment #367824 - Flags: review?(catlee)
(In reply to comment #12)
> So, here's my plan of attack:
> 1) Disable updatedb cronjob
> 2) Turn off non-unittest X/gnome session

Both of these are done now.
I've seen at least 2 libxul.so linkings go by without an issue. Interestingly, they both took less than 2 minutes to link. It really seems like some of them just hang, rather than take a long time swapping. It could be that during certain times the load on the datastores is higher and swapping becomes tremendously more expensive.

Nevertheless, still looking at getting more RAM for these.
Depends on: 483896
I just went around and disabled the prelink cronjob as well.
Doesn't look like we had any failures like this this morning. The only change is some shuffling of VMs on the datastores aravind did in bug 483896. I did land a change that bumps timeouts, but that was only for the Compile and Mercurial steps, so I can't see that affecting test pass/fail.

I'll leave this bug open, but if tomorrow morning is as problem-free, I'm going to declare this FIXED.
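For reference, the "300 seconds without output" errors above come from buildbot's per-step no-output timeout, which is what the landed change bumps. A hedged sketch of what such a master.cfg fragment might look like in a buildbot 0.7-era config (the repo URL, commands, and timeout values here are illustrative, not the actual patch):

```python
# Illustrative master.cfg fragment, not the change that landed in this bug.
from buildbot.process.factory import BuildFactory
from buildbot.steps.source import Mercurial
from buildbot.steps.shell import Compile

f = BuildFactory()
# 'timeout' is the no-output timeout in seconds: buildbot kills the step
# if it produces no output for that long, raising the TimeoutError
# quoted in comment 0 (default is 1200 for most steps).
f.addStep(Mercurial(repourl='http://hg.mozilla.org/mozilla-central',
                    timeout=3600))
f.addStep(Compile(command=['make', '-f', 'client.mk', 'build'],
                  timeout=3600))
```

Note this only papers over slow steps; it doesn't help a step that has genuinely hung, which is why the cron and RAM work above still matters.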
Still haven't seen any more failures due to timeouts.
Status: ASSIGNED → RESOLVED
Closed: 11 years ago
Resolution: --- → FIXED
Whiteboard: [orange]
Product: mozilla.org → Release Engineering