Closed
Bug 1314130
Opened 8 years ago
Closed 8 years ago
Buildapi seems to be borked
Categories
(Infrastructure & Operations Graveyard :: CIDuty, task)
Infrastructure & Operations Graveyard
CIDuty
Tracking
(Not tracked)
RESOLVED
FIXED
People
(Reporter: armenzg, Assigned: nthomas)
References
Details
Buildapi has been failing for at least the last two hours. I might have caused it by increasing load on it. In any case, this could be preventing scheduling requests from developers.
Reporter
Updated•8 years ago
Severity: normal → major
Reporter
Comment 1•8 years ago
If I'm not around when this gets fixed, could you please file a bug in https://bugzilla.mozilla.org/enter_bug.cgi?product=mozilla.org&component=Heroku%3A%20Administration and ask them to enable the instances of the first worker? https://dashboard.heroku.com/apps/pulse-actions/resources I've shut down all the workers until this gets fixed.
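A minimal sketch of re-enabling the pulse-actions worker dynos via the Heroku Platform API, for whoever picks this up; the API token env var and the "worker" process type are assumptions for illustration, and flipping the dyno count in the dashboard linked above works just as well.

```python
import os
import requests

HEROKU_API = "https://api.heroku.com"
APP = "pulse-actions"  # app name from the dashboard URL above

def scale_worker(quantity, process_type="worker"):
    """Set the number of dynos for a process type (0 disables it)."""
    resp = requests.patch(
        f"{HEROKU_API}/apps/{APP}/formation/{process_type}",
        headers={
            "Accept": "application/vnd.heroku+json; version=3",
            # assumed: an admin API token exported as HEROKU_API_KEY
            "Authorization": f"Bearer {os.environ['HEROKU_API_KEY']}",
        },
        json={"quantity": quantity},
    )
    resp.raise_for_status()
    return resp.json()

if __name__ == "__main__":
    print(scale_worker(1))  # bring the first worker back up
```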
Assignee
Comment 2•8 years ago
Can't see anything obvious in newrelic, which is surprising. The alerts on pending count and backlog (pending age) are still running without timeouts, and the files in https://secure.pub.build.mozilla.org/builddata/buildjson/ like builds-4hr.js.gz and builds-running.js are recently modified.
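For reference, the freshness check described above can be scripted rather than eyeballed; this is a rough sketch that HEADs the two dump files named in this comment and flags staleness. The 15-minute threshold is an assumption, not the real alerting config.

```python
from datetime import datetime, timedelta, timezone
from email.utils import parsedate_to_datetime
import requests

BASE = "https://secure.pub.build.mozilla.org/builddata/buildjson/"
FILES = ["builds-4hr.js.gz", "builds-running.js"]
MAX_AGE = timedelta(minutes=15)  # assumed threshold

for name in FILES:
    resp = requests.head(BASE + name, timeout=30)
    resp.raise_for_status()
    modified = parsedate_to_datetime(resp.headers["Last-Modified"])
    age = datetime.now(timezone.utc) - modified
    status = "OK" if age < MAX_AGE else "STALE"
    print(f"{status}: {name} last modified {age} ago")
```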
Assignee
Comment 3•8 years ago
The nagios check "buildapi.pvt.build.mozilla.org;http - /buildapi/self-serve/jobs" continues to flap. Other nagios checks are OK, except we didn't get an IRC notification in #buildduty when backlog age transitioned from warning to critical - something may have regressed there. Both of web[12].releng.webapp.scl3 are relatively bored. The buildapi.pvt.build vhost is only serving the dumps of running/pending jobs, plus nagios's check on /buildapi/self-serve/jobs, so it's not that someone is hammering the expensive reports. secure.pub.build is calm too.

I did discover via newrelic that disk i/o utilisation on buildbot3.db.scl3 (the r/w part of the db cluster) has been much higher for the last three days, and is hitting 100% at times today. Attempting to contact pythian about that.
Assignee
Comment 4•8 years ago
Status update - I've opened bug 1314204 so that this is visible to sheriffs. We're still seeing intermittent failures, even though there is currently no heavy load on the db from builds/tests or from the db checksums test (r/w and r/o slave consistency). I haven't bounced buildapi yet, although it's tempting at this point. coop ran the master restart script today; it finished at Oct 31 23:41 Pacific after successfully restarting everything.
Assignee
Comment 5•8 years ago
I wish I'd just opened the browser console and made lots of requests earlier, 'cos I would have found that it's just web1.releng.webapp.scl3.mozilla.com which is being such a pitiful excuse for a webserver. Its sibling web2 is responding to requests properly.
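Something like this probe is all it took to isolate the bad host: hit buildapi on each web head directly and compare responses. The hostnames are the ones named in this bug; the path and the TLS handling are simplified assumptions.

```python
import requests

HEADS = [
    "web1.releng.webapp.scl3.mozilla.com",
    "web2.releng.webapp.scl3.mozilla.com",
]
PATH = "/buildapi/self-serve/jobs"  # same endpoint the nagios check flaps on

for host in HEADS:
    try:
        # verify=False only because internal hosts may serve internal certs
        resp = requests.get(f"https://{host}{PATH}", timeout=10, verify=False)
        print(f"{host}: HTTP {resp.status_code} "
              f"in {resp.elapsed.total_seconds():.1f}s")
    except requests.RequestException as exc:
        print(f"{host}: request failed ({exc})")
```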
Assignee
Comment 6•8 years ago
jlaz restarted apache on web1 (and aselagea told me how I could do that next time), so now it's much better.
Assignee: nobody → nthomas
Status: NEW → RESOLVED
Closed: 8 years ago
Resolution: --- → FIXED
Reporter
Comment 7•8 years ago
Thank you Nick. I've re-enabled pulse_actions.
Assignee
Comment 8•8 years ago
Oh whoops, I missed your request about re-enabling Heroku in the excitement of tracking this down for real.
Reporter
Comment 9•8 years ago
It's happening again. I've brought pulse_actions down from 4 workers to 2 workers.
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
Assignee
Comment 10•8 years ago
I see a bunch of alerts between 11-02-2016 05:49 and 06:55 Pacific, but nothing since. Per IRC, Alin restarted apache on both servers.
Status: REOPENED → RESOLVED
Closed: 8 years ago → 8 years ago
Resolution: --- → FIXED
Assignee
Comment 11•8 years ago
web2 was in this state again (i.e. 503 responses for buildapi requests), so I've restarted apache on both web1 and web2.releng.webapp.scl3.
Reporter
Comment 12•8 years ago
I noticed via my alerting. Anything I could do differently? Anything the logs tell us? Any way to stop directing traffic to a web head that gets into a bad state, or to automatically restart apache? I could try to queue scheduling requests (withholding scheduling if we get a 503) or to schedule directly via TC/BBB. I would not be surprised if this is because jmaher was looking for a talos regression and made a lot of requests.
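A sketch of the "withhold scheduling if 503" idea from the comment above: probe buildapi before submitting a scheduling request and back off while it returns errors, rather than hammering a web head that is already in a bad state. The health URL, retry budget, and sleep times are illustrative assumptions, not pulse_actions' actual behaviour.

```python
import time
import requests

BUILDAPI = "https://secure.pub.build.mozilla.org/buildapi/self-serve"  # assumed base URL

def buildapi_healthy():
    """Treat any 5xx (or connection failure) as unhealthy."""
    try:
        resp = requests.get(f"{BUILDAPI}/jobs", timeout=10)
        return resp.status_code < 500
    except requests.RequestException:
        return False

def submit_with_backoff(submit, max_wait=1800, base_delay=60):
    """Call submit() only once buildapi looks healthy; otherwise back off."""
    waited, delay = 0, base_delay
    while waited <= max_wait:
        if buildapi_healthy():
            return submit()
        time.sleep(delay)
        waited += delay
        delay = min(delay * 2, 600)  # exponential backoff, capped at 10 minutes
    raise RuntimeError("buildapi still unhealthy; leaving request queued")
```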
Updated•6 years ago
Product: Release Engineering → Infrastructure & Operations
Updated•4 years ago
Product: Infrastructure & Operations → Infrastructure & Operations Graveyard