possible network issues with buildbot-master115

Status

RESOLVED FIXED
3 years ago
5 months ago

People

(Reporter: arich, Unassigned)

Attachments

(1 attachment)

(Reporter)

Description

3 years ago
We've been seeing puppet timeout emails from buildbot-master115, and manual runs connecting to releng-puppet1.srv.releng.usw2.mozilla.com sometimes work and sometimes don't. This might indicate general network problems, or resource problems on the host it currently resides on.

It might be worth shutting down buildbot-master115, destroying it, and recreating it so it resides on a different physical host.
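
One way to put a number on "sometimes work and sometimes don't" is to probe the puppet master repeatedly from bm115 and count failures. This is only a rough sketch: the function name probe and its parameters are made up for illustration, and port 8140 (puppet's default master port) is an assumption about how releng-puppet1 is configured.

```python
import socket
import time


def probe(host, port, attempts=10, timeout=5.0, delay=1.0,
          connect=socket.create_connection):
    """Attempt a TCP connection `attempts` times.

    Returns a list of booleans, one per attempt (True = connected).
    `connect` is injectable so the loop can be exercised without a network.
    """
    results = []
    for i in range(attempts):
        try:
            conn = connect((host, port), timeout)
            conn.close()
            results.append(True)
        except OSError:
            # Covers refused connections, timeouts, and DNS failures.
            results.append(False)
        if i < attempts - 1:
            time.sleep(delay)
    return results


def failure_rate(results):
    """Fraction of failed attempts, or 0.0 for an empty list."""
    if not results:
        return 0.0
    return results.count(False) / len(results)
```

Run on the master as, for example, failure_rate(probe("releng-puppet1.srv.releng.usw2.mozilla.com", 8140)). A non-zero rate from bm115 but zero from a neighbouring master would point at the instance or its host rather than at the puppet server.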

Comment 1

3 years ago
Hm, during the tree closure window bm115 never came back to life. I wonder if this bug is related. At any rate, it's down now, so it may be best to recreate it on Monday.
Duplicate of this bug: 1238442
Created attachment 8706246 [details]
buildbot-master115-aws-event.png

Attached is the AWS event for bm115.
To resolve the AWS event, I stopped and started the instance.
Everything started OK except buildbot, which has problems connecting to the MySQL server "buildbot-rw-vip.db.scl3.mozilla.com".

The exception from logs : _mysql_exceptions.OperationalError: (2005, "Unknown MySQL server host 'buildbot-rw-vip.db.scl3.mozilla.com' (2)")
(In reply to Vlad Ciobancai [:vladC] from comment #4)
> The exception from logs : _mysql_exceptions.OperationalError: (2005,
> "Unknown MySQL server host 'buildbot-rw-vip.db.scl3.mozilla.com' (2)")

We think the above error occurred when puppet ran on boot.


I started buildbot manually, and from what we can see everything is running as expected.
I monitored the buildbot master and haven't seen any network issues.
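
MySQL client error 2005 means the client could not resolve the server's hostname, so the failure on boot looks like buildbot starting before DNS was usable. If that is what happened, a guard like the one below could delay startup until the name resolves. This is a sketch only: wait_for_dns and its parameters are hypothetical and not part of buildbot or our puppet manifests.

```python
import socket
import time


def wait_for_dns(host, timeout=60.0, interval=2.0,
                 resolve=socket.getaddrinfo):
    """Block until `host` resolves or `timeout` seconds have passed.

    Returns True once resolution succeeds, False if the deadline passes.
    `resolve` is injectable so the retry logic can be tested offline.
    """
    deadline = time.monotonic() + timeout
    while True:
        try:
            resolve(host, None)
            return True
        except socket.gaierror:
            # Name not resolvable yet; give up once the deadline has passed.
            if time.monotonic() >= deadline:
                return False
            time.sleep(interval)
```

A startup wrapper could call wait_for_dns("buildbot-rw-vip.db.scl3.mozilla.com") and only launch buildbot when it returns True.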
Status: NEW → RESOLVED
Last Resolved: 3 years ago
Resolution: --- → FIXED

Comment 7

3 years ago
thanks!

We should keep an eye on https://bugzilla.mozilla.org/show_bug.cgi?id=1238035#c0 happening again and general performance of this master. I wouldn't be surprised if we need to recreate it.
Master tanked my restart script today and is currently inaccessible. I've disabled it in slavealloc. Next step is to terminate and recreate.
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
(In reply to Chris Cooper [:coop] from comment #8)
> Next step is to terminate and recreate.

Master has been terminated. I'm recreating it now.
Master is back up.
Status: REOPENED → RESOLVED
Last Resolved: 3 years ago
Resolution: --- → FIXED

Updated

5 months ago
Product: Release Engineering → Infrastructure & Operations