issue with buildapi - no masters running as root, no related queue dir nagios alerts

Status: RESOLVED WORKSFORME
Product: Infrastructure & Operations
Component: CIDuty
Reporter: jlund
Assignee: Unassigned
Reported: 2 years ago
Last updated: a month ago

Description

2 years ago
Here is my vague bug summary while we diagnose what is going on.

Comment 1

2 years ago
Could be noteworthy, could be a red herring:

14:01:25 <relengbot> [sns alert] Tue 14:01:06 PST buildbot-master03.bb.releng.use1.mozilla.com maybe_reconfig.sh: ERROR - Reconfig lockfile is older than 120 minutes.
14:01:45 <Callek> well *thats* interesting
14:11:00 <jlund> bm03 last tried starting a reconfig at master/twistd.log.2:2016-01-25 21:01:14-0800 [-] loading configuration from /builds/buildbot/tests1-linux32/master/master.cfg
14:11:17 <jlund> seems it got confused.
14:12:01 <jlund> ah, seta
14:12:07 <jlund> generic exception: Traceback (most recent call last):
14:12:07 <jlund> 2016-01-25 21:01:16-0800 [-]   File "/builds/buildbot/tests1-linux32/master/config_seta.py", line 47, in get_seta_platforms
14:12:13 <jlund> not sure if that's still expected
14:13:18 <jlund> I'm going to remove the lock and manually reconfig now
14:13:28 <jlund> once checkconfig passes
14:15:10 <jlund> reconfig in progress

bm03 should be in better shape now.
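
For reference, the check that maybe_reconfig.sh alerts on boils down to comparing the lockfile's mtime against a 120-minute threshold. A minimal Python sketch of that check; the lockfile path and helper name are made up for illustration, only the threshold comes from the alert above:

# Illustrative sketch of a stale reconfig-lock check; the real logic
# lives in maybe_reconfig.sh. Lockfile path below is an assumption.
import os
import sys
import time

LOCKFILE = "/builds/buildbot/tests1-linux32/master/reconfig.lock"  # assumed path/name
MAX_AGE_MINUTES = 120  # threshold from the sns alert above

def lock_age_minutes(path):
    # Return the lockfile age in minutes, or None if it does not exist.
    try:
        return (time.time() - os.path.getmtime(path)) / 60.0
    except OSError:
        return None

age = lock_age_minutes(LOCKFILE)
if age is not None and age > MAX_AGE_MINUTES:
    print("ERROR - Reconfig lockfile is older than %d minutes." % MAX_AGE_MINUTES)
    sys.exit(1)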

Comment 2

2 years ago
bm03 finished reconfig cleanly:

master/twistd.log:2016-01-26 14:18:51-0800 [-] configuration update started
master/twistd.log:2016-01-26 14:20:24-0800 [-] configuration update complete
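
A quick way to confirm a reconfig actually finished is to pair up the "configuration update started"/"complete" lines in twistd.log. A small sketch, assuming the timestamp format quoted above:

# Sketch: pair the reconfig start/complete lines in twistd.log and
# report how long the reconfig took. Log path and message strings
# are taken from the grep output above.
import re
from datetime import datetime

LOG = "master/twistd.log"
TS = re.compile(r"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2})")

started = completed = None
with open(LOG) as fh:
    for line in fh:
        m = TS.match(line)
        if not m:
            continue
        ts = datetime.strptime(m.group(1), "%Y-%m-%d %H:%M:%S")
        if "configuration update started" in line:
            started = ts
        elif "configuration update complete" in line:
            completed = ts

if started and completed and completed >= started:
    print("reconfig took %s" % (completed - started))
else:
    print("no completed reconfig found in %s" % LOG)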

Comment 3

2 years ago
and it's taking jobs again \o/ (after a graceful restart): http://buildbot-master03.bb.releng.use1.mozilla.com:8201/one_line_per_build

I think all jobs that were 'claimed' by bm03 while this master was out of sync with the rest of the masters either never ran, or their results never made it into buildapi/status.

e.g. https://secure.pub.build.mozilla.org/buildapi/self-serve/mozilla-inbound/build/96636434

At this point, the evidence suggests this was a single-master issue. To actually get results for those 'ghost' jobs, we would need to either do some db hacking, or a re-trigger may be enough.
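
As a rough way to probe whether buildapi knows about one of those 'ghost' jobs, something like the snippet below could hit the self-serve URL above and check the HTTP status. This is only a sketch: the credentials are placeholders and I'm assuming self-serve accepts HTTP basic auth here.

# Rough probe of buildapi self-serve for a single build id.
# Credentials are placeholders; basic auth is an assumption.
import requests

BRANCH = "mozilla-inbound"
BUILD_ID = 96636434  # the example 'ghost' job linked above
URL = "https://secure.pub.build.mozilla.org/buildapi/self-serve/%s/build/%d" % (BRANCH, BUILD_ID)

resp = requests.get(URL, auth=("user@example.com", "password"))
if resp.status_code == 200:
    print("buildapi knows about build %d" % BUILD_ID)
else:
    print("buildapi returned HTTP %d for build %d" % (resp.status_code, BUILD_ID))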

Either way, we seem to be scheduling new jobs, so I think we are okay moving forward (opening trees if they were closed).

Comment 4

2 years ago
We added retries to work around this in bug 1247286.
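
(For context, the retries amount to something like the sketch below; this is illustrative only, assuming a simple fixed-delay retry, and is not the actual patch from bug 1247286.)

# Illustrative fixed-delay retry helper; not the actual patch.
import time

def retry(func, attempts=3, delay=30, exceptions=(Exception,)):
    # Call func(), retrying up to `attempts` times before re-raising.
    for i in range(attempts):
        try:
            return func()
        except exceptions:
            if i == attempts - 1:
                raise
            time.sleep(delay)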
Status: NEW → RESOLVED
Last Resolved: 2 years ago
Resolution: --- → WORKSFORME

Updated

a month ago
Product: Release Engineering → Infrastructure & Operations