Closed Bug 547131 Opened 14 years ago Closed 14 years ago

talos-r3 master or slaves unwell

Categories

(Release Engineering :: General, defect)

x86
All
defect
Not set
normal

Tracking

(Not tracked)

RESOLVED WORKSFORME

People

(Reporter: nthomas, Assigned: anodelman)

Attachments

(3 files)

We've had several bunches of talos runs fail out when the slaves lose contact with the master. In particular at 15:22 (5 slaves), 15:37 (17), 17:30-18:00 (50+).

Possibly fallout from bug 546731? The master is also gobbling up 1.5G right now, but CPU usage is OK.
There are lots of "talos dirty" runs that are stuck at "downloading to dirtyMaxDBs.zip". I feel like restarting the master.
These are all the builds today that ended in exception, meaning the slave lost its connection to the master.
dmoore, do the times correlate with any work you were doing?
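One way to check for that kind of correlation is to group the exception timestamps into bursts and compare each burst against the times of any network work. The following is only a minimal sketch: the function names, the 10-minute gap, and the sample timestamps (including the date) are hypothetical and are not taken from the attached build list.

```python
from datetime import datetime, timedelta


def group_bursts(timestamps, gap=timedelta(minutes=10)):
    """Group timestamps into bursts; a new burst starts when the
    time since the previous failure exceeds `gap` (an assumed threshold)."""
    bursts = []
    for ts in sorted(timestamps):
        if bursts and ts - bursts[-1][-1] <= gap:
            bursts[-1].append(ts)   # close enough: same burst
        else:
            bursts.append([ts])     # too far apart: start a new burst
    return bursts


def summarize(bursts):
    """Return (first failure, last failure, count) for each burst."""
    return [(b[0], b[-1], len(b)) for b in bursts]


# Hypothetical sample data; real values would come from the list of
# builds that ended in exception.
day = datetime(2010, 1, 1)
sample = [
    day.replace(hour=15, minute=22),
    day.replace(hour=15, minute=23),
    day.replace(hour=15, minute=37),
    day.replace(hour=17, minute=30),
    day.replace(hour=17, minute=38),
    day.replace(hour=17, minute=45),
]

bursts = group_bursts(sample)
for start, end, count in summarize(bursts):
    print(f"{start:%H:%M}-{end:%H:%M}: {count} failure(s)")
```

Each burst's start and end can then be lined up against a maintenance log by hand, which is usually enough to confirm or rule out a single event as the cause.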
talos-r3 master got a stop/start in a quiet period, with a purge_events for good measure. talos-pool has a similar problem with stalled builds, but I haven't touched that.
The last time we had machines fail while downloading dirtyDBs.zip, it was because the slaves had a bad version of Twisted, IIRC.
Depends on: 547602
There were 4 more premature disconnects at 22:11 PST on the 18th, two each of fedora and leopard boxes. Nothing else between then and now. It seems to me there could be a network congestion issue between the slaves in MV and the master in MPT, so I think we should leave this open to see what happens when MV arrives back at work. Alternatively, having twice as many slaves connecting to talos-master may be saturating the network connection used to communicate logs and download files. Filed bug 547602 to add munin monitoring.

Note that there are other issues causing problems for r3 jobs
 * timing out downloading symbols on mac - should be fixed by bug 546939
 * "talos dirty" jobs for XP and Leopard are timing out retrieving dirtyMaxDBs.zip from the master, bug 547600
alice: is this still happening?
Assignee: nobody → anodelman
Attached file More recent failures
Still some failures occurring. This is all the 'exception' results since those in attachment 427704, including the 4 I mentioned in comment 6. I verified that a couple are lost connections between master and slave.
Is this still occurring now that rev3 master is in production?
Attached file But wait there's more
Don't know which of these line up with our recent problems.
Nothing new here since the 9th - still an issue?
One yesterday (tracemonkey-xp-v8 218 at 2010-03-23 00:24:51 UTC), then lots a week ago, otherwise fine.
Status: NEW → RESOLVED
Closed: 14 years ago
Resolution: --- → WORKSFORME
Product: mozilla.org → Release Engineering