Closed Bug 1238035 Opened 8 years ago Closed 8 years ago

possible network issues with buildbot-master115

Categories

(Infrastructure & Operations Graveyard :: CIDuty, task)

task
Not set
normal

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: arich, Unassigned)

Attachments

(1 file)

We've been seeing puppet timeout emails from buildbot-master115, and manual runs connecting to releng-puppet1.srv.releng.usw2.mozilla.com sometimes work and sometimes don't. This might indicate general network problems, or resource problems on the physical host it currently resides on.
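
One way to put a number on "sometimes work and sometimes don't" is to probe the puppet master repeatedly from the affected host and record how often the TCP connection fails. The following is a minimal sketch, not something taken from this bug: the helper names `probe_tcp` and `failure_rate` are hypothetical, and port 8140 in the usage note is Puppet's default master port, assumed here rather than confirmed for this deployment.

```python
import socket
import time


def probe_tcp(host, port, attempts=5, timeout=5.0, delay=1.0):
    """Open a TCP connection `attempts` times.

    Returns a list of (ok, detail) tuples, where detail is the connect
    time in seconds on success or the error message on failure.
    """
    results = []
    for i in range(attempts):
        start = time.monotonic()
        try:
            # create_connection resolves the name and connects in one step,
            # so DNS failures and connect timeouts both surface as OSError.
            with socket.create_connection((host, port), timeout=timeout):
                results.append((True, time.monotonic() - start))
        except OSError as exc:
            results.append((False, str(exc)))
        if i < attempts - 1:
            time.sleep(delay)
    return results


def failure_rate(results):
    """Fraction of probes that failed (0.0 for an empty result list)."""
    if not results:
        return 0.0
    failures = sum(1 for ok, _ in results if not ok)
    return failures / len(results)
```

Running, for example, `failure_rate(probe_tcp("releng-puppet1.srv.releng.usw2.mozilla.com", 8140, attempts=20))` on bm115 and on a healthy master would show whether the flakiness is specific to this instance.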

It might be worth shutting down buildbot-master115, destroying it, and recreating it so it resides on a different physical host.
hm, during the tree closure window bm115 never came back to life. I wonder if this bug is related. At any rate, it's down now, so it may be best to recreate it on Monday.
Attached is the AWS event for bm115.
To resolve the AWS event, I stopped and started the instance.
Everything started OK except buildbot, which has problems connecting to the MySQL server "buildbot-rw-vip.db.scl3.mozilla.com".

The exception from logs : _mysql_exceptions.OperationalError: (2005, "Unknown MySQL server host 'buildbot-rw-vip.db.scl3.mozilla.com' (2)")
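
Error 2005 is the MySQL client's unknown-host error (CR_UNKNOWN_HOST), which points at name resolution on the master rather than at the database itself. A minimal sketch of a check that could run before starting buildbot is shown here; the helper name `wait_for_dns` and its retry defaults are hypothetical, and nothing in this bug says such a check was actually used.

```python
import socket
import time


def wait_for_dns(hostname, attempts=10, delay=3.0, resolver=socket.getaddrinfo):
    """Retry name resolution until it succeeds.

    Returns the sorted list of resolved addresses. If every attempt
    fails, re-raises the last socket.gaierror. `resolver` is injectable
    so the retry logic can be exercised without a network.
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    last_error = None
    for i in range(attempts):
        try:
            infos = resolver(hostname, None)
            # Each entry is (family, type, proto, canonname, sockaddr);
            # sockaddr[0] is the address string.
            return sorted({info[4][0] for info in infos})
        except socket.gaierror as exc:
            last_error = exc
            if i < attempts - 1:
                time.sleep(delay)
    raise last_error
```

Calling `wait_for_dns("buildbot-rw-vip.db.scl3.mozilla.com")` right after boot would distinguish a resolver that is not ready yet (succeeds after a few retries) from a persistent DNS problem (raises after all attempts).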
(In reply to Vlad Ciobancai [:vladC] from comment #4)
> The exception from logs : _mysql_exceptions.OperationalError: (2005,
> "Unknown MySQL server host 'buildbot-rw-vip.db.scl3.mozilla.com' (2)")

We think the above error occurred when puppet ran on boot.


I started buildbot manually, and from what we can see everything is running as expected.
I monitored the buildbot master and haven't seen any network issues.
Status: NEW → RESOLVED
Closed: 8 years ago
Resolution: --- → FIXED
thanks!

We should keep an eye on https://bugzilla.mozilla.org/show_bug.cgi?id=1238035#c0 happening again and general performance of this master. I wouldn't be surprised if we need to recreate it.
Master tanked my restart script today and is currently inaccessible. I've disabled it in slavealloc. Next step is to terminate and recreate.
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
(In reply to Chris Cooper [:coop] from comment #8)
> Next step is to terminate and recreate.

Master has been terminated. I'm recreating it now.
Master is back up.
Status: REOPENED → RESOLVED
Closed: 8 years ago
Resolution: --- → FIXED
Product: Release Engineering → Infrastructure & Operations
Product: Infrastructure & Operations → Infrastructure & Operations Graveyard