Closed Bug 1314130 Opened 8 years ago Closed 8 years ago

Buildapi seems to be borked

Categories

(Infrastructure & Operations Graveyard :: CIDuty, task)

Type: task
Priority: Not set
Severity: major

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: armenzg, Assigned: nthomas)

References

Details

For at least the last two hours.
I might have caused it by increasing load on it.

In any case, this could be preventing scheduling requests from developers.
Severity: normal → major
If I'm not around when this gets fixed, could you please file a bug in https://bugzilla.mozilla.org/enter_bug.cgi?product=mozilla.org&component=Heroku%3A%20Administration and ask them to enable two instances of the first worker?
https://dashboard.heroku.com/apps/pulse-actions/resources

I've shut down all the workers until this gets fixed.
Can't see anything obvious in newrelic, which is surprising. The alerts on pending count and backlog (pending age) are still running without timeouts, and the files in https://secure.pub.build.mozilla.org/builddata/buildjson/ like builds-4hr.js.gz and builds-running.js are recently modified.
The buildapi.pvt.build.mozilla.org;http - /buildapi/self-serve/jobs check continues to flap. Other nagios checks are OK, except we didn't get an IRC notification in #buildduty when the backlog age transitioned from warning to critical - something may have regressed there.

Both of web[12].releng.webapp.scl3 are relatively bored. The buildapi.pvt.build vhost is only doing the dumps of running/pending, and nagios's check on /buildapi/self-serve/jobs, so it's not that someone is bashing on the expensive reports. secure.pub.build is calm too.

I did discover via newrelic that disk i/o utilisation on buildbot3.db.scl3 (the r/w part of the db cluster) has been much higher for the last three days, and is hitting 100% at times today. Attempting to contact pythian about that.
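
For reference, a minimal sketch (Python, stdlib only) of the two signals described above: the age of the buildjson dumps and whether the self-serve jobs endpoint answers. The URLs come from this bug; the internal buildapi.pvt.build host is only reachable from the releng network and the endpoint may require auth, so treat this as an illustration rather than the actual nagios check.

```python
import urllib.request
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime

DUMP_URL = "https://secure.pub.build.mozilla.org/builddata/buildjson/builds-4hr.js.gz"
JOBS_URL = "https://buildapi.pvt.build.mozilla.org/buildapi/self-serve/jobs"

def dump_age_minutes(url=DUMP_URL):
    """Minutes since the dump was last modified, per its Last-Modified header."""
    req = urllib.request.Request(url, method="HEAD")
    with urllib.request.urlopen(req, timeout=30) as resp:
        last_modified = parsedate_to_datetime(resp.headers["Last-Modified"])
    return (datetime.now(timezone.utc) - last_modified).total_seconds() / 60

def jobs_endpoint_ok(url=JOBS_URL):
    """True if the self-serve jobs endpoint answers with a 2xx within 30s."""
    try:
        with urllib.request.urlopen(url, timeout=30) as resp:
            return 200 <= resp.status < 300
    except Exception:
        return False

if __name__ == "__main__":
    print("builds-4hr.js.gz age (minutes):", round(dump_age_minutes(), 1))
    print("self-serve/jobs reachable:", jobs_endpoint_ok())
```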
Depends on: 1314204
Status update - I've opened up bug 1314204 so that this is visible to sheriffs. We're still seeing intermittent failures, even though there is currently no heavy load on the db from builds/tests or from the db checksums test (r/w and r/o slave consistency).

I haven't bounced buildapi yet, although it's tempting at this point. coop ran the master restart script today; it finished at Oct 31 23:41 Pacific after successfully restarting everything.
I wish I'd just opened the browser console and made lots of requests earlier, 'cos I would have found that it's just web1.releng.webapp.scl3.mozilla.com which is such a pitiful excuse for a webserver. Its sibling web2 is responding to requests properly.
jlaz restarted apache on web1 (and aselagea told me how I could do that next time), so now it's much better.
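
A rough sketch of that "make lots of requests" check: hit each web head directly and count failures, so a sick sibling stands out. The hostnames and path come from this bug; sending the public Host: header to the per-host address, and plain http, are assumptions about how the vhost is set up.

```python
import urllib.error
import urllib.request

WEB_HEADS = [
    "web1.releng.webapp.scl3.mozilla.com",
    "web2.releng.webapp.scl3.mozilla.com",
]
PATH = "/buildapi/self-serve/jobs"

def probe(host, attempts=10):
    """Return (ok, failed) counts for a handful of requests to one web head."""
    ok = failed = 0
    for _ in range(attempts):
        req = urllib.request.Request(
            f"http://{host}{PATH}",
            headers={"Host": "buildapi.pvt.build.mozilla.org"},
        )
        try:
            with urllib.request.urlopen(req, timeout=15):
                ok += 1
        except urllib.error.HTTPError:
            failed += 1  # a sick head shows up as a run of 503s here
        except urllib.error.URLError:
            failed += 1
    return ok, failed

for host in WEB_HEADS:
    print(host, probe(host))
```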
Assignee: nobody → nthomas
Status: NEW → RESOLVED
Closed: 8 years ago
Resolution: --- → FIXED
Thank you Nick. I've re-enabled pulse_actions.
Oh whoops, I missed your request about re-enabling Heroku in the excitement of tracking this down for real.
It's happening again.

I've brought pulse_actions down from 4 workers to 2 workers.
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
I see a bunch of alerts between 11-02-2016 05:49 and 06:55 Pacific, but nothing since. From IRC, Alin restarted apache on both servers.
Status: REOPENED → RESOLVED
Closed: 8 years ago → 8 years ago
Resolution: --- → FIXED
web2 was in this state again (i.e. 503 responses for buildapi requests), so I've restarted apache on web1 and web2.releng.webapp.scl3.
I noticed via my alerting.

Anything I could do differently? Anything the logs tell us?
Any way to stop directing traffic to a web head that gets into a bad state? Or to automatically restart apache?

I could try queueing scheduling requests (withholding them while buildapi returns 503 - sketched below) or scheduling directly via TC/BBB.

I would not be surprised if it is because jmaher was looking for a talos regression and made a lot of requests.
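
A minimal sketch of the "withhold scheduling if 503" idea: keep a local queue and only drain it while buildapi answers. The submit() function, endpoint, and payload format below are hypothetical stand-ins, not the real self-serve API that pulse_actions uses.

```python
import collections
import urllib.error
import urllib.request

# Placeholder endpoint; the real self-serve scheduling API differs.
BUILDAPI = "https://buildapi.pvt.build.mozilla.org/buildapi/self-serve/jobs"

pending = collections.deque()   # scheduling requests held back during an outage

def submit(payload):
    """Hypothetical submit; returns False on a 503 so the caller can re-queue."""
    req = urllib.request.Request(BUILDAPI, data=payload, method="POST")
    try:
        urllib.request.urlopen(req, timeout=30)
        return True
    except urllib.error.HTTPError as e:
        if e.code == 503:
            return False        # buildapi is unhealthy; hold on to the request
        raise                   # anything else is a real error

def schedule(payload):
    """Queue a request and try to flush the queue."""
    pending.append(payload)
    drain()

def drain():
    """Flush queued requests; stop at the first 503 and keep the rest queued."""
    while pending:
        if not submit(pending[0]):
            return              # still 503ing; call drain() again later
        pending.popleft()
```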
Product: Release Engineering → Infrastructure & Operations
Product: Infrastructure & Operations → Infrastructure & Operations Graveyard