Closed Bug 473005 Opened 16 years ago Closed 16 years ago

Multiple Failures on Thunderbird tinderboxes

Categories

(Mozilla Messaging Graveyard :: Server Operations, defect)

x86
macOS
defect
Not set
blocker

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: standard8, Assigned: gozer)

Details

I would raise multiple bugs, but these all seem to have happened at the same time. The list below is what we've got problems with, both on trunk and 1.9.1:

Linux Check
Linux Bloat
Windows Bloat
Mac Bloat (also covered by bug 472864)

On just trunk, mac check is potentially a problem - though it may go green again.

Also Windows check is taking a long time to report back from its trunk build.

Nightly builds are covered by bug 472970 which is a moco permissions problem.

At the moment I'm holding the tree closed as we have no bloat coverage and reduced check coverage.
All right, status update from the front:

Linux comm-central check : OK
Linux comm-1.9.1 check   : OK
Win2k comm-central check : OK
Win2k comm-central check : OK
Mac OS X 10.4 comm-central check : OK
Mac OS X 10.4 comm-1.9.1   check : OK
Win32 comm-central bloat : OK
Mac OS X 10.4 bloat      : OK
Linux comm-central bloat : OK
Linux on Thunderbird3.0 : GREEN
Windows on Thunderbird3.0 : GREEN
Please close once "MacOSX 10.4 comm-central bloat build" turns green; it should do so within an hour.
Current status:

- Most builds ok
- "Linux comm-central mozilla-central bloat build" busted. A clobber may fix but there's no irc bot around to do this.
- "MacOSX 10.4 comm-central bloat build" busted. However I think this is just a one-off "it took too long to compile" problem.

Given the mac 1.9.1 bloat issue, and the fact that mac & windows check have long recompiles to do on 1.9.1, and the fact that all these boxes are currently stuck building on trunk (due to non TB checkins), I've completely closed comm-central for a couple of hours to try and let these boxes get over to the 1.9.1 branch.
The IRC bot for comm-central mozilla-central builds is supposed to be : thunderbuild-trunk

Hrm, the problem is most likely that lots of builds have been accumulating in the build queues. On the good side, even though it might indicate 20 builds pending, it really means only 1-2 builds, as buildbot will/should merge these into single builds, skipping over the queued versions in the middle.
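
The merging described above can be sketched roughly as follows. This is a minimal illustration, not buildbot's actual implementation: `BuildRequest` and `merge_pending` are hypothetical names, and the rule used here (keep only the newest queued revision per builder) is an assumption about the behaviour being described.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BuildRequest:
    """Hypothetical stand-in for one queued build request."""
    builder: str
    revision: int


def merge_pending(requests):
    """Collapse queued requests so each builder only builds its newest revision.

    Assumption: intermediate revisions can be skipped, because the newest
    checkout already contains all of the earlier changes.
    """
    latest = {}
    for req in requests:
        current = latest.get(req.builder)
        if current is None or req.revision > current.revision:
            latest[req.builder] = req
    return list(latest.values())


# Twenty requests queued up for a single builder while it was stuck.
pending = [BuildRequest("Linux comm-central check", rev) for rev in range(1, 21)]
merged = merge_pending(pending)
print(len(pending), "pending ->", len(merged), "build(s) to run")
```

With twenty requests queued for one builder, the sketch leaves a single build to run, which is why a long pending list need not mean a long wait.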
Clobbered "Linux comm-central mozilla-central bloat build"
(In reply to comment #8)
> The IRC bot for comm-central mozilla-central builds is supposed to be :
> thunderbuild-trunk
> 
That's not on irc. Additionally, the buildbot config is possibly old:
http://hg.mozilla.org/build/buildbot-configs/file/a4fc7865f94d/thunderbird/config.py#l494

> Hrm, the problem is most likely because there has been lots of builds
> accumulating in the build queues, but on the good side, even though it might
> indicate 20 builds pending, it really mean only 1-2 builds, as buildbot
> will/should merge these into single builds, skipping over the queued versions
> in the middle.

Yeah we can cope now I think

Note that we've been seeing lots of drop-offs (ping timeouts) of the irc bots
today since you fixed the main issues. Linux bloat & build have also been
losing their connections in the middle of builds quite frequently.
I'm also uncertain about "Win2k3 comm-central check": it's been building for approx 5 hours 40 mins now, which is a little excessive even for a full rebuild.
Status update:

- "Win2k3 comm-central check" has failed to check in after lots of hours compiling. It hasn't dropped off the buildbot radar.

- "Linux comm-central bloat build" is frequently dropping connections. This seems to coincide with irc bots having a ping timeout and then reconnecting.

- "Linux comm-central build" also drops connection occasionally, but I think that's on the same VM so not surprising.

- "Linux comm-central mozilla-central bloat build" was still busted after the clobber. Currently Linux & Mac are busted as a result of bug 386676, I'm going to do a trunk build anyway to see if the original bustage was real or not.

There's nothing here that is a real show stopper at the moment. We can cope with the missing Windows check by keeping an eye on the SeaMonkey boxes.
(In reply to comment #12)
> - "Linux comm-central mozilla-central bloat build" was still busted after the
> clobber. Currently Linux & Mac are busted as a result of bug 386676, I'm going
> to do a trunk build anyway to see if the original bustage was real or not.

Local build worked fine. Let's wait till the current bustage is resolved to see what is happening on that box.
"Win2k3 comm-central check" is probably the result of confusion between the buildbot client and server; I've seen it happen before. The client needs restarting, most likely.

Linux comm-central build is inside the MoCo network
Linux comm-central bloat build is inside the MoMo network

That's a bit odd that they are experiencing connection issues, as well as with the IRC server. I know MoCo recently performed core router upgrades, so it might have something to do with it.
Just confirmed that the buildbot client isn't running on the Win32 check box; it should be. The master has simply managed to not notice and has become confused about the client's status.

Unfortunately, I can't restart it from my current network location; it'll have to wait for later this evening.
(In reply to comment #14)
> That`s a bit odd that they are experiencing connection issues, as well as with
> the IRC server. I know MoCo recently performed core router upgrades, so it
> might have something to do with it.

Something I have noticed: as America woke up and checkins increased, the responsiveness of build.mozillamessaging.com went down, and I think we're getting more timeouts (the timeouts part is more of an instinct).

Mac Check & Mac Bloat are also starting to look like they may have dropped out again (almost 3-hour build times; as they had both just completed a build, I think that's suspect).
Further to my previous comment, it appears MoCo has network issues (knocked out a switch and a few machines). This probably explains the extra problems.
That might indicate problems with load/network on the buildbot master, but I am not seeing anything probative from looking at the historical charts.

Looks like momo-xserve-01, our Apple X-Serve, was rebooted; it is currently reporting 6 hours, 51 minutes of uptime. Strange. MoCo?

And as I suspected, buildbot (and the rdp sessions) on the win32 check box were gone/dead. Not sure what's up there, can't remember how to find uptime on win32.
Win32 unittest builder restarted (comm-1.9.1 and mozilla-central)
OS X unittest builder restarted (comm-1.9.1 and mozilla-central)
OS X bloat    builder restarted (comm-1.9.1 and mozilla-central)
(In reply to comment #18)
> Looks like the momo-xserve-01, our Apple X-Serve, was rebooted, currenty
> reporting 6 hours, 51 minutes of uptime, strange. MoCo ?

They had a power outage which took out a main switch. Not sure why our xserve was rebooted.

General Update:

- Most builders seem steady and reporting the correct state of the tree.
- Linux * bloat build regularly busted due to connection timeout issues. I'm trying to give the 1.9.1 build a clobber, but on the next build it just drops the connection, which messes it up again. I think this is the same reason for the trunk build being messed up (which I can't clobber).
- irc bots are still dropping off irc.

I think the current state is reasonable and we can live with it until next week if there's no obvious fixes.
In addition to the current status in comment 21, "Win2k3 comm-central check" seems to have died again (8 hour build at the moment).

Not a significant problem at the moment as we have stable Linux/Mac coverage as well as SeaMonkey's boxes.
restarted "win2k * check", looks like the VM had crashed/rebooted, and buildbot doesn't start on boot.

restarted "linux * bloat" buildbot clients, just in case, but there is definitely something going on there.
(In reply to comment #23)
> restarted "win2k * check", looks like the VM had crashed/rebooted, and buildbot
> doesn't start on boot.

Looks fine at the moment :-)
 
> restarted "linux * bloat" buildbot clients, just in case, but there is
> definitely something going on there.

I get the impression that this is more related to the irc bots timing out and coming back on. When I've done clobber builds, it's typically around the time of the irc bot dropout that the build will fail.

This implies to me we've got a problem at the master or some connectivity issues somewhere.
(In reply to comment #24)
> I get the impression that this is more related to the irc bots timeout out and
> coming back on - when I've done clobber builds its typically around the time of
> the irc bot dropout that the build will fail.
> 
> This implies to me we've got a problem at the master or some connectivity
> issues somewhere.

I've just been looking at other bugs, including bug 470462 ("Setup VMWare reservations for buildbot master VMs"). I'm not sure what a reservation is, or whether the buildbot master runs in a VM, but it might help!
I suspect part of the problem was because of the master having been reconfigured a lot of times and not restarted.

I've restarted it cold this evening, and I am looking at it still.

On the funny side of things, the win2k * check box is apparently blue screening
during the builds...

http://imagebin.ca/view/ezMqDY5.html
(In reply to comment #26)
> I suspect part of the problem was because of the master having been
> reconfigured a lot of times and not restarted.
> 
> I've restarted it cold this evening, and I am looking at it still.

irc bots seem stable now.

Mac bloat timed out (both on trunk & 1.9.1) and messed up builds. I've queued up clobbers, so they should go green again.

> On the funny side of things, the win2k * check box is apparently blue screening
> during the builds...

I think it happened again today - it's failed to report in again.
That windows box needs to be replaced with a freshly imaged one. Will do on Monday.
Win32 * check box is up, running, building and checking.

For some reason, the test run in mozilla/toolkit/crashreporter/test returns a non-zero status, even though no failing tests are reported:

make[4]: Leaving directory `[objdir]/mozilla/toolkit/crashreporter/test'
make[3]: Leaving directory `[objdir]/mozilla/toolkit/crashreporter'
make[2]: Leaving directory `[objdir]/mozilla/toolkit'
make[1]: Leaving directory `[objdir]/mozilla'
make[4]: *** [check] Error 1
make[3]: *** [check] Error 2
make[2]: *** [check] Error 2
make[1]: *** [check] Error 2
make: *** [check] Error 2

[full buildbot log is here: <http://build.mozillamessaging.com/buildbot/production/builders/MacOSX 10.4 comm-central bloat build/builds/3068/steps/compile/logs/stdio>]
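
One way to narrow this down is to pull the `*** [target] Error N` lines out of the log and look at the deepest make level, since the outer `Error 2` lines are only make propagating the inner failure. A minimal sketch, assuming GNU make's usual message format; `deepest_failure` and the regular expressions are illustrative and not part of the build system:

```python
import re

# Assumed GNU make message formats, as seen in the excerpt above.
LEAVING = re.compile(r"^make\[(\d+)\]: Leaving directory [`'](.+)'$")
ERROR = re.compile(r"^make(?:\[(\d+)\])?: \*\*\* \[(.+?)\] Error (\d+)$")

SAMPLE = """\
make[4]: Leaving directory `[objdir]/mozilla/toolkit/crashreporter/test'
make[3]: Leaving directory `[objdir]/mozilla/toolkit/crashreporter'
make[2]: Leaving directory `[objdir]/mozilla/toolkit'
make[1]: Leaving directory `[objdir]/mozilla'
make[4]: *** [check] Error 1
make[3]: *** [check] Error 2
make[2]: *** [check] Error 2
make[1]: *** [check] Error 2
make: *** [check] Error 2
"""


def deepest_failure(log_text):
    """Return (level, target, status, directory) for the innermost make error.

    The directory is the last one that make level reported leaving, or None
    if the log does not mention it. Returns None if no error line is found.
    """
    last_dir = {}   # make level -> last directory it reported leaving
    errors = []     # (level, target, exit status)
    for raw in log_text.splitlines():
        line = raw.strip()
        m = LEAVING.match(line)
        if m:
            last_dir[int(m.group(1))] = m.group(2)
            continue
        m = ERROR.match(line)
        if m:
            level = int(m.group(1) or 0)  # top-level make prints no [N]
            errors.append((level, m.group(2), int(m.group(3))))
    if not errors:
        return None
    level, target, status = max(errors, key=lambda e: e[0])
    return level, target, status, last_dir.get(level)


result = deepest_failure(SAMPLE)
print(result)
```

For the excerpt above this points at the level-4 `check` target in the crashreporter test directory, where a command in the recipe exited with status 1.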
Wrong link in comment #29, should have been <http://build.mozillamessaging.com/buildbot/unittest/builders/Win2k3 comm-1.9.1 check/builds/161/steps/check/logs/stdio>
Further work on general stability will be happening in bug 474600.
Status: NEW → RESOLVED
Closed: 16 years ago
Resolution: --- → FIXED