Closed Bug 1314130 Opened 8 years ago Closed 8 years ago

Buildapi seems to be borked

Categories

(Infrastructure & Operations Graveyard :: CIDuty, task)

Type: task
Priority: Not set
Severity: major

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: armenzg, Assigned: nthomas)

References

Details

For at least the last two hours.
I might have caused it by increasing load on it.

In any case, this could be preventing scheduling requests from developers.
Severity: normal → major
If I'm not around when this gets fixed, could you please file a bug in https://bugzilla.mozilla.org/enter_bug.cgi?product=mozilla.org&component=Heroku%3A%20Administration and ask them to enable two instances of the first worker?
https://dashboard.heroku.com/apps/pulse-actions/resources

I've shut down all the workers until this gets fixed.
Can't see anything obvious in newrelic, which is surprising. The alerts on pending count and backlog (pending age) are still running without timeouts, and the files in https://secure.pub.build.mozilla.org/builddata/buildjson/ like builds-4hr.js.gz and builds-running.js are recently modified.
The buildapi.pvt.build.mozilla.org;http - /buildapi/self-serve/jobs check continues to flap. Other nagios checks are OK, except we didn't get an IRC notification in #buildduty when the backlog age transitioned from warning to critical - something may have regressed there.

Both of web[12].releng.webapp.scl3 are relatively bored. The buildapi.pvt.build vhost is only doing the dumps of running/pending, and nagios's check on /buildapi/self-serve/jobs, so it's not that someone is bashing on the expensive reports. secure.pub.build is calm too.

I did discover via newrelic that disk i/o utilisation on buildbot3.db.scl3 (the r/w part of the db cluster) has been much higher for the last three days, and is hitting 100% at times today. Attempting to contact pythian about that.
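
For reference, a minimal sketch (Python, stdlib only) of the two signals described above: the age of the buildjson dumps and whether the self-serve jobs endpoint answers. The URLs come from this bug; the internal buildapi.pvt.build host is only reachable from the releng network and the endpoint may require auth, so treat this as an illustration rather than the actual nagios check.

```python
import urllib.request
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime

DUMP_URL = "https://secure.pub.build.mozilla.org/builddata/buildjson/builds-4hr.js.gz"
JOBS_URL = "https://buildapi.pvt.build.mozilla.org/buildapi/self-serve/jobs"

def dump_age_minutes(url=DUMP_URL):
    """Minutes since the dump was last modified, per its Last-Modified header."""
    req = urllib.request.Request(url, method="HEAD")
    with urllib.request.urlopen(req, timeout=30) as resp:
        last_modified = parsedate_to_datetime(resp.headers["Last-Modified"])
    return (datetime.now(timezone.utc) - last_modified).total_seconds() / 60

def jobs_endpoint_ok(url=JOBS_URL):
    """True if the self-serve jobs endpoint answers with a 2xx within 30s."""
    try:
        with urllib.request.urlopen(url, timeout=30) as resp:
            return 200 <= resp.status < 300
    except Exception:
        return False

if __name__ == "__main__":
    print("builds-4hr.js.gz age (minutes):", round(dump_age_minutes(), 1))
    print("self-serve/jobs reachable:", jobs_endpoint_ok())
```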
Depends on: 1314204
Status update - I've opened up bug 1314204 so that this is visible to sheriffs. We're still seeing intermittent failures, even though there is currently no heavy load on the db from builds/tests or from the db checksums test (r/w and r/o slave consistency).

I haven't bounced buildapi yet, although it's tempting at this point. coop ran the master restart script today; it finished at Oct 31 23:41 Pacific after successfully restarting everything.
I wish I'd just opened the browser console and made lots of requests earlier, 'cos I would have found that it's just web1.releng.webapp.scl3.mozilla.com which is such a pitiful excuse for a webserver. Its sibling web2 is responding to requests properly.
jlaz restarted apache on web1 (and aselagea told me how I could do that next time), so now it's much better.
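
A rough sketch of that "make lots of requests" check: hit each web head directly and count failures, so a sick sibling stands out. The hostnames and path come from this bug; sending the public Host: header to the per-host address, and plain http, are assumptions about how the vhost is set up.

```python
import urllib.error
import urllib.request

WEB_HEADS = [
    "web1.releng.webapp.scl3.mozilla.com",
    "web2.releng.webapp.scl3.mozilla.com",
]
PATH = "/buildapi/self-serve/jobs"

def probe(host, attempts=10):
    """Return (ok, failed) counts for a handful of requests to one web head."""
    ok = failed = 0
    for _ in range(attempts):
        req = urllib.request.Request(
            f"http://{host}{PATH}",
            headers={"Host": "buildapi.pvt.build.mozilla.org"},
        )
        try:
            with urllib.request.urlopen(req, timeout=15):
                ok += 1
        except urllib.error.HTTPError:
            failed += 1  # a sick head shows up as a run of 503s here
        except urllib.error.URLError:
            failed += 1
    return ok, failed

for host in WEB_HEADS:
    print(host, probe(host))
```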
Assignee: nobody → nthomas
Status: NEW → RESOLVED
Closed: 8 years ago
Resolution: --- → FIXED
Thank you Nick. I've re-enabled pulse_actions.
Oh whoops, I missed your request about re-enabling Heroku in the excitement of tracking this down for real.
It's happening again.

I've brought pulse_actions down from 4 workers to 2 workers.
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
I see a bunch of alerts between 11-02-2016 05:49 and 06:55 Pacific, but nothing since. From IRC, Alin restarted apache on both servers.
Status: REOPENED → RESOLVED
Closed: 8 years ago → 8 years ago
Resolution: --- → FIXED
web2 was in this state again (i.e. 503 responses for buildapi requests), so I've restarted apache on web1 and web2.releng.webapp.scl3.
I noticed via my alerting.

Anything I could do differently? Anything the logs tell us?
Any way to stop directing traffic to a web head that gets into a bad state? Or to automatically restart apache?

I could try queueing scheduling requests (withholding them while buildapi returns 503 - sketched below) or scheduling directly via TC/BBB.

I would not be surprised if it is because jmaher was looking for a talos regression and made a lot of requests.
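
A minimal sketch of the "withhold scheduling if 503" idea: keep a local queue and only drain it while buildapi answers. The submit() function, endpoint, and payload format below are hypothetical stand-ins, not the real self-serve API that pulse_actions uses.

```python
import collections
import urllib.error
import urllib.request

# Placeholder endpoint; the real self-serve scheduling API differs.
BUILDAPI = "https://buildapi.pvt.build.mozilla.org/buildapi/self-serve/jobs"

pending = collections.deque()   # scheduling requests held back during an outage

def submit(payload):
    """Hypothetical submit; returns False on a 503 so the caller can re-queue."""
    req = urllib.request.Request(BUILDAPI, data=payload, method="POST")
    try:
        urllib.request.urlopen(req, timeout=30)
        return True
    except urllib.error.HTTPError as e:
        if e.code == 503:
            return False        # buildapi is unhealthy; hold on to the request
        raise                   # anything else is a real error

def schedule(payload):
    """Queue a request and try to flush the queue."""
    pending.append(payload)
    drain()

def drain():
    """Flush queued requests; stop at the first 503 and keep the rest queued."""
    while pending:
        if not submit(pending[0]):
            return              # still 503ing; call drain() again later
        pending.popleft()
```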
Product: Release Engineering → Infrastructure & Operations
Product: Infrastructure & Operations → Infrastructure & Operations Graveyard