talos-r3 master or slaves unwell

RESOLVED WORKSFORME

Status

Release Engineering
General
9 years ago
5 years ago

People

(Reporter: nthomas, Assigned: alice)

Details

Attachments

(3 attachments)

(Reporter)

Description

9 years ago
We've had several batches of talos runs fail when the slaves lose contact with the master, in particular at 15:22 (5 slaves), 15:37 (17), and 17:30-18:00 (50+).

Possibly fallout from bug 546731? The master is also gobbling up 1.5G right now, but CPU usage is OK.
(Reporter)

Comment 1

9 years ago
There are lots of "talos dirty" runs that are stuck at "downloading to dirtyMaxDBs.zip". I feel like restarting the master.
(Reporter)

Comment 2

9 years ago
Created attachment 427704
Details of talos failures

These are all of today's builds that ended in exception, meaning the connection between slave and master was lost.
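
To check whether the exceptions really arrive in bursts, the failure times can be counted per fixed time window. This is only a rough sketch: it assumes the timestamps have already been parsed out of the attachment, and the function name `bucket_failures`, the 15-minute window, and the sample times are all made up for illustration.

```python
from collections import Counter
from datetime import datetime, timedelta


def bucket_failures(timestamps, window_minutes=15):
    """Count failures per fixed-size window, returned in time order.

    Hypothetical helper: not part of buildbot or talos.
    """
    window = timedelta(minutes=window_minutes)
    counts = Counter()
    for ts in timestamps:
        # Floor each timestamp to the start of the window it falls in.
        midnight = ts.replace(hour=0, minute=0, second=0, microsecond=0)
        index = (ts - midnight) // window
        counts[midnight + index * window] += 1
    return sorted(counts.items())


# Invented sample data, not taken from the attachment.
sample = [
    datetime(2010, 2, 18, 15, 22),
    datetime(2010, 2, 18, 15, 24),
    datetime(2010, 2, 18, 15, 37),
]

for start, count in bucket_failures(sample):
    print(start.strftime("%H:%M"), count)
```

A window with many failures on different slaves points at the master or the network rather than at any single slave.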
(Reporter)

Comment 3

9 years ago
dmoore, do the times correlate with any work you were doing?
(Reporter)

Comment 4

9 years ago
talos-r3 master got a stop/start in a quiet period, with a purge_events for good measure. talos-pool has a similar problem with stalled builds but I haven't touched that.
The last time we had machines fail in downloading dirtyDBs.zip it was because the slaves had a bad version of Twisted IIRC.
(Reporter)

Updated

9 years ago
Depends on: 547602
(Reporter)

Comment 6

9 years ago
There were 4 more premature disconnects at 22:11 PST on the 18th, two each of fedora and leopard boxes. Nothing else between then and now. It seems to me there could be a network congestion issue between the slaves in MV and the master in MPT, so I think we should leave this open to see what happens when MV arrives back at work. Alternatively, having twice as many slaves connecting to talos-master may be saturating the network connection used to send logs and download files. Filed bug 547602 to add munin monitoring.

Note that there are other issues causing problems for r3 jobs:
 * timing out downloading symbols on mac - should be fixed by bug 546939
 * "talos dirty" jobs for XP and Leopard are timing out retrieving dirtyMaxDBs.zip from the master, bug 547600
alice: is this still happening?
Assignee: nobody → anodelman
(Reporter)

Comment 8

9 years ago
Created attachment 428681
More recent failures

Still some failures occurring. These are all the 'exception' results since those in attachment 427704, including the 4 I mentioned in comment 6. I verified that a couple are lost connections between master and slave.
(Assignee)

Comment 9

9 years ago
Is this still occurring now that rev3 master is in production?
(Reporter)

Comment 10

9 years ago
Created attachment 431497
But wait there's more

Don't know which of these line up with our recent problems.
(Assignee)

Comment 11

8 years ago
Nothing new here since the 9th - still an issue?
(Reporter)

Comment 12

8 years ago
One yesterday (tracemonkey-xp-v8 218 at 2010-03-23 00:24:51 UTC), then lots a week ago, otherwise fine.
Status: NEW → RESOLVED
Last Resolved: 8 years ago
Resolution: --- → WORKSFORME
Product: mozilla.org → Release Engineering