Bug 557440 (Closed) - Opened 14 years ago - Closed 14 years ago

Color depth on -ix- slaves is sometimes too low to run reftests, resulting in failing modules/libpr0n/test/reftest/colordepth.html (and 135 others)

Categories: Release Engineering :: General, defect
Hardware: x86
OS: Windows Server 2003
Type: defect
Priority: Not set
Severity: normal
Tracking: Not tracked
Status: RESOLVED WORKSFORME
Reporter: philor
Assignee: Unassigned
Keywords: intermittent-failure
Whiteboard: [badslave?]

The first failure in http://tinderbox.mozilla.org/showlog.cgi?log=Firefox/1270530540.1270530972.9262.gz is the one that was added to ensure that the tinderbox trying to run reftests has at least 24-bit color, because the others will fail if it doesn't. So apparently at least mw32-ix-slave14 does not - I didn't look back to see whether other -ix- slaves have successfully run reftests.
The colour depth is set to 24 bits on these slaves. Ben, do you have ideas on where else we should be looking?
The display size is set to 1024 x 640, though, instead of 1280 x 1024 like on the win32 VMs.
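For anyone poking at a slave by hand, here is a minimal Python/ctypes sketch of querying the current display mode on Windows. This is illustrative only, not a script RelEng actually runs; the DEVMODEW layout below is the standard display-fields subset, and the expected 1280 x 1024 / 24 bpp values come from the comments above.

  import ctypes
  from ctypes import wintypes

  # Minimal DEVMODEW layout covering the display fields we care about.
  class DEVMODEW(ctypes.Structure):
      _fields_ = [
          ("dmDeviceName", wintypes.WCHAR * 32),
          ("dmSpecVersion", wintypes.WORD),
          ("dmDriverVersion", wintypes.WORD),
          ("dmSize", wintypes.WORD),
          ("dmDriverExtra", wintypes.WORD),
          ("dmFields", wintypes.DWORD),
          ("dmPositionX", ctypes.c_long),
          ("dmPositionY", ctypes.c_long),
          ("dmDisplayOrientation", wintypes.DWORD),
          ("dmDisplayFixedOutput", wintypes.DWORD),
          ("dmColor", ctypes.c_short),
          ("dmDuplex", ctypes.c_short),
          ("dmYResolution", ctypes.c_short),
          ("dmTTOption", ctypes.c_short),
          ("dmCollate", ctypes.c_short),
          ("dmFormName", wintypes.WCHAR * 32),
          ("dmLogPixels", wintypes.WORD),
          ("dmBitsPerPel", wintypes.DWORD),
          ("dmPelsWidth", wintypes.DWORD),
          ("dmPelsHeight", wintypes.DWORD),
          ("dmDisplayFlags", wintypes.DWORD),
          ("dmDisplayFrequency", wintypes.DWORD),
      ]

  ENUM_CURRENT_SETTINGS = -1  # ((DWORD)-1) in the Win32 headers

  dm = DEVMODEW()
  dm.dmSize = ctypes.sizeof(DEVMODEW)
  if ctypes.windll.user32.EnumDisplaySettingsW(None, ENUM_CURRENT_SETTINGS,
                                               ctypes.byref(dm)):
      # Per the comments above, we expect 1280x1024 at >= 24 bpp.
      print("%dx%d @ %d bpp" % (dm.dmPelsWidth, dm.dmPelsHeight,
                                dm.dmBitsPerPel))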
(continued from previous comment... oops)
s: mw32-ix-slave15
http://tinderbox.mozilla.org/showlog.cgi?log=Firefox/1270599905.1270601418.18629.gz
WINNT 5.2 mozilla-central debug test reftest on 2010/04/06 17:25:05
s: mw32-ix-slave04
Summary: Color depth on mw32-ix-slave14 (and other -ix- slaves?) too low to run reftests → Color depth on -ix- slaves is sometimes too low to run reftests, resulting in failing modules/libpr0n/test/reftest/colordepth.html (and 135 others)
(In reply to comment #3)
> The colour depth is set to 24bits on these slaves, Ben do you have ideas on
> where else we should be looking into this?

Based on the comments, we should repro in staging, and then try changing the color depth to see if that fixes it.
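As a sketch of what "try changing the color depth" could look like in staging (again illustrative, not our actual tooling): it reuses the DEVMODEW definition and imports from the sketch in the earlier comment, and the 1280 x 1024 / 32 bpp targets are assumptions based on the comparison with the win32 VMs.

  # Assumes ctypes and the DEVMODEW definition from the earlier sketch.
  DM_BITSPERPEL = 0x00040000
  DM_PELSWIDTH  = 0x00080000
  DM_PELSHEIGHT = 0x00100000
  DISP_CHANGE_SUCCESSFUL = 0

  dm = DEVMODEW()
  dm.dmSize = ctypes.sizeof(DEVMODEW)
  dm.dmBitsPerPel = 32          # anything >= 24 should satisfy colordepth.html
  dm.dmPelsWidth = 1280
  dm.dmPelsHeight = 1024
  dm.dmFields = DM_BITSPERPEL | DM_PELSWIDTH | DM_PELSHEIGHT

  # Flags of 0 apply the mode dynamically, without writing it to the
  # registry, so a reboot would revert to whatever the machine is set to.
  rc = ctypes.windll.user32.ChangeDisplaySettingsW(ctypes.byref(dm), 0)
  if rc != DISP_CHANGE_SUCCESSFUL:
      raise RuntimeError("ChangeDisplaySettingsW failed: %d" % rc)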
Oh, I'm also wondering why this only came up now. There have been no configuration changes to the machines, and they've been running in production for close to a month. Did the test change in some way?
colordepth.html hasn't changed since December 2008. What has changed is that fairly recently the machines went from being prioritized for builds to not being allowed to do builds at all, since they were saturating the mpt-castro connection (and if they previously did reftests and failed once a week in the middle of the night, we might well have just ignored it). Judging by the nagios spam, there has also been a whole lot of restarting of them. I remember back when IT did the restarting of tinderboxes, there were explicit instructions about how you could and couldn't connect to them and what to do while restarting, precisely to avoid this problem. Is someone maybe, or maybe just sometimes, not following the current equivalent of those instructions?
These machines reboot after every job. They come back up automatically, no intervention involved. I'm not saying it's not possible that the screen depth is an issue; I'm simply wondering why it didn't come up, or wasn't noticed, until a month later. If the answer is just that it was missed, that's fine, but if the problem truly didn't occur until a few days ago then either the tests have changed, or the machine configuration has changed, or we have the strangest orange we've seen yet.
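Since the slaves reboot between jobs, one way to keep a bad mode from silently burning a reftest run would be a pre-job sanity check that fails fast. A hedged sketch, not anything that actually exists in the harness:

  import ctypes
  import sys

  BITSPIXEL = 12  # GetDeviceCaps index for bits per pixel

  user32 = ctypes.windll.user32
  gdi32 = ctypes.windll.gdi32

  hdc = user32.GetDC(0)  # device context for the whole screen
  depth = gdi32.GetDeviceCaps(hdc, BITSPIXEL)
  user32.ReleaseDC(0, hdc)

  if depth < 24:
      # Exit nonzero so the slave shows up busted instead of running
      # reftests on a known-bad display mode and going orange.
      sys.exit("screen depth is %d bpp, reftests need >= 24" % depth)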
(In reply to comment #13)
> ... I'm simply wondering why it didn't come up, or wasn't noticed
> until a month later. If the answer to that is just that it was missed, that's
> fine, but if the problem truly didn't occur until a few days ago then either
> the tests have changed, or the machine configuration has changed, or we have
> the strangest orange we've seen yet.

1) Good question, bhearsum. These machines have been in production since early March, if bug#545136 is to be believed. bhearsum/philor: do we have any examples of this orangeness before a few days ago? That would help us figure out what's going on here.

2) Until this is resolved, are all the win32 ix machines removed from the production pool to avoid orange, or are some machines still working correctly in production?
Is this happening? Any recent examples that I can jump in and debug?
No. It happened on Monday and Tuesday of that week, then stopped. It wasn't happening and being ignored before that, and it isn't happening and being ignored after that.
(In reply to comment #13)
> the tests have changed, or the machine configuration has changed

Given that neither of these changed and the situation hasn't recurred in 2 weeks, I'm marking this with "badslave?" for tracking and moving on.
Status: NEW → RESOLVED
Closed: 14 years ago
Resolution: --- → WORKSFORME
Whiteboard: [orange] → [orange][badslave?]
Whiteboard: [orange][badslave?] → [badslave?]
Product: mozilla.org → Release Engineering