Closed Bug 1286605 Opened 9 years ago Closed 8 years ago

Add nagios checks for buildbot bridge services

Categories

(Infrastructure & Operations Graveyard :: CIDuty, task)

task
Not set
normal

Tracking

(Not tracked)

RESOLVED INVALID

People

(Reporter: bhearsum, Assigned: aselagea)

References

Details

(Whiteboard: [bbb])

I just got these e-mails:
Warning: your queue "queue/buildbot-bridge/log_uploaded" on exchange "could not be determined" is overgrowing (7898 ready messages, 7898 total messages). The queue will be automatically deleted when it exceeds 16000 messages. Make sure your clients are running correctly and are cleaning up unused durable queues.
Warning: your queue "queue/buildbot-bridge/started" on exchange "could not be determined" is overgrowing (8410 ready messages, 8410 total messages). The queue will be automatically deleted when it exceeds 16000 messages. Make sure your clients are running correctly and are cleaning up unused durable queues.
Looks related to the DB failover. Restarted the dead services.
Do we need a nagios check for this? How would buildduty normally find out about these emails?
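For illustration, a minimal sketch of what a queue-depth check could look like, assuming the ready-message count is available (here it is simply parsed out of the warning text quoted above). The names parse_warning and check_queue and the warn/crit thresholds are hypothetical; only the 16000-message deletion limit comes from the e-mail. This is not an existing releng check.

```python
import re

# Standard Nagios plugin exit codes.
OK, WARNING, CRITICAL = 0, 1, 2

# Hypothetical thresholds. Only the 16000-message deletion limit is taken
# from the warning e-mail; the warn/crit levels are assumptions.
DELETE_LIMIT = 16000
WARN_AT = 4000
CRIT_AT = 8000

# Matches the queue name and ready-message count in one warning sentence.
WARNING_RE = re.compile(r'queue "(?P<queue>[^"]+)".*?\((?P<ready>\d+) ready messages')


def parse_warning(text):
    """Return a (queue, ready_count) pair for each warning found in text."""
    return [(m.group("queue"), int(m.group("ready"))) for m in WARNING_RE.finditer(text)]


def check_queue(queue, ready, warn_at=WARN_AT, crit_at=CRIT_AT):
    """Map a ready-message count to a Nagios-style (exit code, message) pair."""
    if ready >= crit_at:
        status, label = CRITICAL, "CRITICAL"
    elif ready >= warn_at:
        status, label = WARNING, "WARNING"
    else:
        status, label = OK, "OK"
    return status, f"QUEUE {label}: {queue} has {ready} ready messages (deleted at {DELETE_LIMIT})"
```

With these assumed thresholds, the log_uploaded queue (7898 ready) would report WARNING and the started queue (8410 ready) would report CRITICAL, both well before the 16000-message deletion limit.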
Whiteboard: [bb-database failover]
It'd be great to have nagios checks that alert when buildbot-bridge services are not running.
Buildduty should be able to help us get these checks set up.
Component: General Automation → Buildduty
QA Contact: catlee → bugspam.Callek
Summary: buildbot bridge queue is growing, bridge is possibly broken? → Add nagios checks for buildbot bridge services
Whiteboard: [bb-database failover] → [bbb]
Assignee: nobody → aselagea
We already have a check in place for the buildbot-bridge services, e.g.:
"nagios-releng> Thu 16:00:07 PDT [4087] buildbot-master82.bb.releng.scl3.mozilla.com:procs - buildbot-bridge is CRITICAL: PROCS CRITICAL: 0 processes with regex args /builds/bbb/bin/buildbot-bridge (http://m.mozilla.org/procs+-+buildbot-bridge)"
Judging by the time Ben received the e-mail, the alert fired before Andrei or I had come online; Rail then restarted the dead services to resolve the issue. @Rail: is there something else you think we'd need here?
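The logic behind an alert like the one quoted above can be sketched roughly as follows: count the processes whose command line matches a regex and go CRITICAL when too few are found. This is an illustrative sketch, not the actual Nagios plugin; the function name check_procs, the min_procs parameter, and the message format (modelled on the quoted alert) are assumptions.

```python
import re

# Standard Nagios plugin exit codes used by this sketch.
OK, CRITICAL = 0, 2


def check_procs(cmdlines, pattern, min_procs=1):
    """Count command lines matching pattern; CRITICAL if fewer than min_procs."""
    regex = re.compile(pattern)
    count = sum(1 for line in cmdlines if regex.search(line))
    if count >= min_procs:
        status, label = OK, "OK"
    else:
        status, label = CRITICAL, "CRITICAL"
    noun = "process" if count == 1 else "processes"
    return status, f"PROCS {label}: {count} {noun} with regex args {pattern}"
```

On a master, cmdlines could be the output lines of `ps -eo args`; with no buildbot-bridge process present, the sketch returns exit code 2 and a message of the same shape as the alert above.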
Flags: needinfo?(rail)
I think we are ok here!
Flags: needinfo?(rail)
Status: NEW → RESOLVED
Closed: 8 years ago
Resolution: --- → INVALID
Product: Release Engineering → Infrastructure & Operations
Product: Infrastructure & Operations → Infrastructure & Operations Graveyard