Closed Bug 547131 Opened 14 years ago Closed 14 years ago

talos-r3 master or slaves unwell

Tracking

(Not tracked)

Status:

RESOLVED WORKSFORME

People

(Reporter: nthomas, Assigned: anodelman)

Attachments

(3 files)

Details of talos failures, 14 years ago, Nick Thomas [:nthomas] (UTC+12), 10.39 KB, text/plain
More recent failures, 14 years ago, Nick Thomas [:nthomas] (UTC+12), 4.28 KB, text/plain
But wait there's more, 14 years ago, Nick Thomas [:nthomas] (UTC+12), 14.01 KB, text/plain

Nick Thomas [:nthomas] (UTC+12)

Reporter

Description

•

14 years ago

We've had several bunches of talos runs fail out when the slaves lose contact with the master. In particular at 15:22 (5 slaves), 15:37 (17), 17:30-18:00 (50+).

Possibly fallout from bug 546731? The master is also gobbling up 1.5G of memory right now, but CPU usage is OK.

Nick Thomas [:nthomas] (UTC+12)

Reporter

Comment 1

•

14 years ago

There are lots of "talos dirty" runs that are stuck at "downloading to dirtyMaxDBs.zip". I feel like restarting the master.

Nick Thomas [:nthomas] (UTC+12)

Reporter

Comment 2

•

14 years ago

Attached file Details of talos failures — Details

This is all the builds today that ended in exception, meaning they lost connection between slave and master.

Nick Thomas [:nthomas] (UTC+12)

Reporter

Comment 3

•

14 years ago

dmoore, do the times correlate with any work you were doing?

Nick Thomas [:nthomas] (UTC+12)

Reporter

Comment 4

•

14 years ago

talos-r3 master got a stop/start in a quiet period, with a purge_events for good measure. talos-pool has a similar problem with stalled builds but I haven't touched that.

bhearsum@mozilla.com (:bhearsum)

Comment 5

•

14 years ago

The last time we had machines fail in downloading dirtyDBs.zip it was because the slaves had a bad version of Twisted IIRC.

Nick Thomas [:nthomas] (UTC+12)

Reporter

Updated

•

14 years ago

Depends on: 547602

Nick Thomas [:nthomas] (UTC+12)

Reporter

Comment 6

•

14 years ago

There were 4 more premature disconnects at 22:11 PST on the 18th, two each of fedora and leopard boxes. Nothing else between then and now. Seems to me there could be a network congestion issue between the slaves in MV and the master in MPT, so I think we should leave this open to see what happens when MV arrives back at work. Alternatively, twice as many slaves connecting to talos-master may be saturating the network connection used to communicate logs and download files. Filed bug 547602 to add munin monitoring.

Note that there are other issues causing problems for r3 jobs:
 * timing out downloading symbols on mac - should be fixed by bug 546939
 * "talos dirty" jobs for XP and Leopard are timing out retrieving dirtyMaxDBs.zip from the master, bug 547600

John O'Duinn [:joduinn] (please use "needinfo?" flag)

Comment 7

•

14 years ago

alice: is this still happening?

Assignee: nobody → anodelman

Nick Thomas [:nthomas] (UTC+12)

Reporter

Comment 8

•

14 years ago

Attached file More recent failures — Details

Still some failures occurring. This is all the 'exception' results since those in attachment 427704, including the 4 I mentioned in comment 6. I verified that a couple are lost connections between master and slave.

alice nodelman [:alice] [:anode]

Assignee

Comment 9

•

14 years ago

Is this still occurring now that rev3 master is in production?

Nick Thomas [:nthomas] (UTC+12)

Reporter

Comment 10

•

14 years ago

Attached file But wait there's more — Details

Don't know which of these line up with our recent problems.

alice nodelman [:alice] [:anode]

Assignee

Comment 11

•

14 years ago

Nothing new here since the 9th - still an issue?

Nick Thomas [:nthomas] (UTC+12)

Reporter

Comment 12

•

14 years ago

One yesterday (tracemonkey-xp-v8 218 at 2010-03-23 00:24:51 UTC), then lots a week ago, otherwise fine.

Status: NEW → RESOLVED

Closed: 14 years ago

Resolution: --- → WORKSFORME

Nobody; OK to take it and work on it

Updated

•

11 years ago

Product: mozilla.org → Release Engineering

Categories

(Release Engineering :: General, defect)
