Need nagios monitoring on product availability in Bouncer

RESOLVED INCOMPLETE

Status

--
minor
RESOLVED INCOMPLETE
9 years ago
4 years ago

People

(Reporter: justdave, Unassigned)

Attachments

(1 attachment)

We had a situation this last week where an upgrade to bouncer broke sentry's monitoring of the EUBallot Firefox builds. Sentry concluded that no mirrors at all had the builds available for download, which in turn caused a 404 error when users clicked the download link. This went undetected for several days, which was a Really Bad Thing.

We should set up a nagios monitor to make an SQL query against the bouncer database and start paging any time any product with "checknow" enabled on it has no active mirrors.
(In reply to comment #0)
> We should set up a nagios monitor to make an SQL query against the bouncer
> database and start paging any time any product with "checknow" enabled on it
> has no active mirrors.

Or when we drop below a critical value.
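
A rough sketch of what threshold-based alerting could look like, as opposed to paging only at zero mirrors. The threshold values, the function names, and the product names in the usage line are all hypothetical; the counts are assumed to come from a per-product count of active mirrors in the bouncer database. The exit codes follow the usual Nagios plugin convention.

```python
# Standard Nagios plugin exit codes.
OK, WARNING, CRITICAL = 0, 1, 2


def classify(active_mirrors, warn_below=5, crit_below=2):
    """Map an active-mirror count to a Nagios state.

    The default thresholds are made-up placeholders, not values from this bug.
    """
    if active_mirrors < crit_below:
        return CRITICAL
    if active_mirrors < warn_below:
        return WARNING
    return OK


def worst_state(counts):
    """Return the most severe state across a {product_name: count} mapping."""
    return max((classify(n) for n in counts.values()), default=OK)
```

For example, `worst_state({"product-a": 40, "product-b": 1})` evaluates to `CRITICAL`, because one product has fallen below the critical threshold.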

Updated

9 years ago
Assignee: server-ops → ayounsi

Comment 3

9 years ago
arzhel, when do you think you'll be able to wrap this one up?
Frederic, what should the query look like?
I would say:

SELECT * FROM `mirror_products` AS p
INNER JOIN mirror_locations AS loc ON (loc.product_id = p.id)
LEFT JOIN mirror_location_mirror_map AS lmm ON (loc.id = lmm.location_id AND lmm.active = 1)
LEFT JOIN mirror_mirrors AS m ON (m.id = lmm.mirror_id AND m.active = 1)
WHERE p.checknow = 1 AND lmm.id IS NULL AND m.id IS NULL
GROUP BY p.id;

That returns every "checknow" product that, for any of its locations, either has no active mirror mapping or is mapped to a mirror that is marked inactive.

If you set up nagios so it cries when this query turns up non-empty, you'll be golden.
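
A minimal sketch of how such a check might be wired up as a Nagios-style plugin. Several things here are assumptions: `sqlite3` stands in for the real database driver, the `name` column on `mirror_products` is hypothetical, and the query is a NOT EXISTS variant rather than the LEFT JOIN form above, chosen so that a location whose only active mappings point at inactive mirrors is also reported.

```python
import sqlite3

# Standard Nagios plugin exit codes.
OK, CRITICAL = 0, 2

# NOT EXISTS variant: a location is reported when no active mapping
# leads to an active mirror. The `name` column is an assumption.
QUERY = """
SELECT DISTINCT p.name
FROM mirror_products AS p
INNER JOIN mirror_locations AS loc ON loc.product_id = p.id
WHERE p.checknow = 1
  AND NOT EXISTS (
    SELECT 1
    FROM mirror_location_mirror_map AS lmm
    INNER JOIN mirror_mirrors AS m
            ON m.id = lmm.mirror_id AND m.active = 1
    WHERE lmm.location_id = loc.id AND lmm.active = 1
  )
ORDER BY p.name
"""


def check(conn):
    """Return (exit_code, status_line) in Nagios plugin style."""
    broken = [row[0] for row in conn.execute(QUERY)]
    if broken:
        return CRITICAL, "CRITICAL - no active mirrors for: " + ", ".join(broken)
    return OK, "OK - all checknow products have active mirrors"
```

A wrapper script would print the status line and exit with the returned code, which is what makes nagios cry when the result is non-empty.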
Created attachment 446959 [details]
result of the sql command

Thanks for the query, here is its current result.
I'll add the check into nagios as soon as this is fixed.
Attachment #446959 - Attachment mime type: application/octet-stream → text/txt
Attachment #446959 - Attachment mime type: text/txt → text/plain
Neither Camino-2.0.3 nor Thunderbird-3.0.5 has shipped yet, but RelEng has fulfilled requests to add them to bouncer (bug 566237, bug 566572). It's normal for entries to be added several days before the actual release, so that the project can release without us blocking them. But I guess that makes this query much more difficult.

I don't know what the story is with that Fennec release, but it may be a remnant from Fennec-on-WinMo going into deep freeze.
Frederic, do you think a query could handle situations like the one in comment 7?
Had a thought - The error case is when a location is present on dm-download02 (or stage.m.o or ftp.m.o) but not other mirrors. But there are two exceptions to this:
* we're offering some files only by dm-download02
* bouncer config errors, eg a bogus location, would go undetected even if a release has been pushed
Does it make sense for (unshipped) products to be marked checknow if, indeed, we do not want to "check them now"? I am thinking that the query reporting them is expected behavior if they are marked like that.
*ping* Nick, did you have a reply regarding comment 10?
For upcoming Firefox & Mobile releases we could set up bouncer with Check Now turned off, and enable Check Now after pushing the files. But then we come back to the issues in comment #7: apps like Thunderbird, SeaMonkey, and Camino don't have access to flip the bit after they push, and shouldn't be gated on MoCo personnel. We could limit the scope of the nagios checks to MoCo apps if we can't see another way forward, but ideally we should try to do better for the community.

Alternatively, what if we query the products with Check Now set, but check against stage.m.o to see if the files are present there. If they're also missing on stage that would be fine, just unreleased. We'd leave out the 'releng screwed up the locations' class of problems but that's probably OK.
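
A sketch of that filtering step, under stated assumptions: the flagged locations come from a query like the one earlier in this bug, and `exists_on_stage` is a hypothetical callable that answers whether a path is present on stage (how it does so, over HTTP or on the filesystem, is left open). Only the "broken" list would be worth paging on.

```python
def split_flagged(flagged, exists_on_stage):
    """Split flagged locations into broken vs. merely unreleased.

    flagged: iterable of (product_name, location_path) pairs that currently
        have no active mirrors.
    exists_on_stage: callable(path) -> bool; hypothetical helper.
    """
    broken, unreleased = [], []
    for product, path in flagged:
        if exists_on_stage(path):
            # File is on stage but no mirror serves it: a real problem.
            broken.append((product, path))
        else:
            # Not on stage either: most likely just not released yet.
            unreleased.append((product, path))
    return broken, unreleased
```

As noted above, this would not catch a bogus location entered for a release that has already been pushed, since a wrong path is missing on stage too.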
This bug looks stalled - what has to happen to wrap it up?

Updated

8 years ago
Assignee: ayounsi → server-ops
Marking incomplete, reopen when there is some IT action needed
Status: NEW → RESOLVED
Last Resolved: 8 years ago
Resolution: --- → INCOMPLETE
Product: mozilla.org → mozilla.org Graveyard