Bug 1126619 (Closed) · Opened 9 years ago · Closed 8 years ago

Deploy ZooKeeper and Kafka on hg.mozilla.org

Categories

Product/Component: Developer Services :: Mercurial: hg.mozilla.org
Type: defect
Priority: Not set
Severity: normal

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: gps, Assigned: gps)

References

Details

Attachments

(2 files)

We'll be using Kafka for our log-based replication system being written in bug 1126153.

A prerequisite of Kafka is ZooKeeper.

We'll likely run Kafka servers on the same machines where the ZooKeeper servers are running.

Both ZooKeeper and Kafka run in the JVM. We'll want both processes to be hooked up to supervisor or some such so that they run at startup and restart if they fail.
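As a rough illustration of the supervisor hookup described above, here is a minimal supervisord sketch. The install paths, script names, and users are assumptions for illustration, not the deployed configuration (the actual start scripts ship in the upstream tarballs).

```ini
; Illustrative supervisord entries (paths and users are assumptions).
[program:zookeeper]
command=/opt/zookeeper/bin/zkServer.sh start-foreground
user=zookeeper
autostart=true
autorestart=true

[program:kafka]
command=/opt/kafka/bin/kafka-server-start.sh /opt/kafka/config/server.properties
user=kafka
autostart=true
autorestart=true
```

`autorestart=true` gives us the restart-on-failure behavior, and supervisord itself starting at boot covers run-at-startup.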

We'll need the following flows open for each node talking to ZooKeeper and Kafka (both the hgssh and hgweb machines will):

 tcp 2181 (ZooKeeper client connections)
 tcp 2888 (ZooKeeper server to server communication)
 tcp 3888 (ZooKeeper server to server communication)
 tcp 9092 (Kafka client connections)

The ports don't have to be exactly these, but these are the defaults. We don't need these ports open to hosts outside of the Mercurial "cluster."

The minimum ZooKeeper cluster size is 3 (you need enough hosts that a strict majority can still maintain quorum if a host dies). We should aim to deploy on more hosts if possible, though I don't think we absolutely need to deploy on every machine in the cluster; 5 or 6 would be fine. Let's say hgssh[12].dmz.scl3 and hgweb[1234].dmz.scl3.
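The quorum arithmetic behind those sizes can be sketched with a pair of hypothetical helpers (not from any deployed code):

```python
# Hypothetical helpers illustrating ZooKeeper quorum sizing: an
# ensemble of n servers stays available only while a strict majority
# of them is alive.

def quorum_size(n: int) -> int:
    """Smallest number of servers that forms a majority of n."""
    return n // 2 + 1

def tolerable_failures(n: int) -> int:
    """How many servers can die while the ensemble keeps quorum."""
    return n - quorum_size(n)
```

A 3-node ensemble tolerates 1 failure, a 5-node ensemble tolerates 2, and a 6-node ensemble still tolerates only 2, which is why odd ensemble sizes are usually recommended.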

The docs for ZooKeeper and Kafka aren't terrific. We should definitely run our configurations by someone who has deployed these systems before. Cloud Services has a few people.

We should probably aim for latest stable versions. 3.4.6 for ZK and 0.8.1.1 for K. These are Java packages, so you pretty much just download a tarball and run some scripts inside.

https://zookeeper.apache.org/doc/r3.4.6/zookeeperAdmin.html
https://kafka.apache.org/documentation.html
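Tying the ensemble size and default ports together, a zoo.cfg for the proposed six hosts might look like the sketch below. The timing values and dataDir are assumptions for illustration; the hostnames are the ones proposed above, and the ports are the defaults listed earlier.

```ini
# Illustrative zoo.cfg for the proposed ensemble (timings and dataDir
# are assumptions, not a reviewed configuration).
tickTime=2000
initLimit=10
syncLimit=5
dataDir=/var/lib/zookeeper
clientPort=2181
# server.N=host:peerPort:leaderElectionPort
server.1=hgssh1.dmz.scl3:2888:3888
server.2=hgssh2.dmz.scl3:2888:3888
server.3=hgweb1.dmz.scl3:2888:3888
server.4=hgweb2.dmz.scl3:2888:3888
server.5=hgweb3.dmz.scl3:2888:3888
server.6=hgweb4.dmz.scl3:2888:3888
```

Each host additionally needs a myid file in dataDir containing its server number.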
Depends on: 1203797
fubar: we'll want separate users to own processes and files for these daemons. IIRC Puppet or mig or something will delete unknown users from hgssh and hgweb unless they are listed in Puppet. Could you please create 2 system accounts, "kafka" and "zookeeper" in Puppet? It looks like there is already a zookeeper user with uid 2321 in sysadmins.
Flags: needinfo?(klibby)
puppet has a clean up function. I've added both users to all of the hg nodes.
Flags: needinfo?(klibby)
ansible/kafka-broker: add Nagios check for ZooKeeper (bug 1126619); r?fubar

We implement a check_zookeeper Nagios check. The check verifies the
health of an individual node and of a ZooKeeper ensemble (cluster).

We add an NRPE config file defining how to invoke the Nagios check.

A test of the new Nagios check is included.
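For flavor, a check like this can probe a node with ZooKeeper's "four letter word" admin commands: "ruok" should return "imok", and "mntr" returns tab-separated stats including zk_server_state. The sketch below is illustrative, not the actual check from the patch:

```python
# Sketch of a check_zookeeper-style probe (illustrative; not the
# check implemented in the attachment). Uses ZooKeeper's four-letter
# admin commands over the client port.
import socket

def four_letter_word(host, port, cmd, timeout=5.0):
    """Send a four-letter admin command (bytes) and return the raw reply."""
    with socket.create_connection((host, port), timeout=timeout) as s:
        s.sendall(cmd)
        chunks = []
        while True:
            data = s.recv(4096)
            if not data:
                break
            chunks.append(data)
    return b"".join(chunks)

def parse_mntr(payload):
    """Parse "mntr" output (key<TAB>value per line) into a dict."""
    stats = {}
    for line in payload.decode("utf-8", "replace").splitlines():
        key, _, value = line.partition("\t")
        if key:
            stats[key] = value
    return stats

def node_status(host, port=2181):
    """Return a (Nagios exit code, message) pair for one node."""
    try:
        if four_letter_word(host, port, b"ruok").strip() != b"imok":
            return 2, "CRITICAL: %s did not answer imok" % host
        state = parse_mntr(four_letter_word(host, port, b"mntr"))
        return 0, "OK: %s is a ZooKeeper %s" % (
            host, state.get("zk_server_state", "node"))
    except OSError as e:
        return 2, "CRITICAL: %s unreachable: %s" % (host, e)
```

An ensemble-level check would additionally assert that exactly one node reports zk_server_state of "leader" and that a quorum of nodes responds at all.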
Attachment #8685700 - Flags: review?(klibby)
Attachment #8685700 - Flags: review?(klibby) → review+
Comment on attachment 8685700 [details]
MozReview Request: ansible/kafka-broker: add Nagios check for ZooKeeper (bug 1126619); r?fubar

https://reviewboard.mozilla.org/r/24875/#review22451

I have slight concerns about the check and config being in v-c-t instead of puppet, as nagios and nrpe are the domain of the MOC; mixing two sources seems like it could cause confusion. Also, when mana documentation is written for this check (and that needs to happen before enabling it), it should include clear instructions on who to contact, and how, for issues with the check itself (and ofc how to respond to the check).
https://reviewboard.mozilla.org/r/24875/#review22451

I wasn't a huge fan of putting it in v-c-t either. But a version needs to live in v-c-t if we are to test it. AFAICT none of the existing checks in sysadmins/puppet have tests, and for something as complicated as this, I kinda insist on having tests.

Anyway, if you want to vendor this check in sysadmins, I understand completely. But then we'll have a synchronization issue. Sucks either way.

I /could/ make the check output more verbose to include follow-up instructions. Is that preferred to writing separate docs [which may get out of sync with reality]?
https://reviewboard.mozilla.org/r/24875/#review22451

Tests win, honestly. If we document well enough, having it in v-c-t really shouldn't be an issue.

Nearly all of the MOC's nagios checks report status with a link to the check's documentation on mana, e.g.: "host.mozilla.com:HTTP is CRITICAL: CRITICAL - Socket timeout after 10 seconds (http://m.mozilla.org/HTTP)". All information on how to respond to the check, or escalate, should be on that page OR linked to from it. A lot of those pages have some basic troubleshooting steps but then refer the user to other mana pages for more complicated (or site specific) troubleshooting, so sending people off to the v-c-t docs wouldn't be unusual (other than it not being on mana).
Comment on attachment 8685700 [details]
MozReview Request: ansible/kafka-broker: add Nagios check for ZooKeeper (bug 1126619); r?fubar

Review request updated; see interdiff: https://reviewboard.mozilla.org/r/24875/diff/1-2/
Comment on attachment 8685700 [details]
MozReview Request: ansible/kafka-broker: add Nagios check for ZooKeeper (bug 1126619); r?fubar

Review request updated; see interdiff: https://reviewboard.mozilla.org/r/24875/diff/2-3/
ansible/hg-ssh: check for writing events into Kafka; r?fubar

The replication log for hg has a no-op "heartbeat" message used for
testing whether the replication log is writable. This is used internally
so hg operations can fail fast when the replication log isn't available.
The message can also be used for monitoring. The sending of this message
is conveniently exposed via the `hg sendheartbeat` command.

This commit introduces a Nagios check that simply invokes `hg
sendheartbeat` and reports the results.

By running this check periodically in production, we'll verify the
replication log is online and writable so failures in the replication
log can hopefully be identified and fixed before a push comes in and
fails due to the log being unavailable.
Attachment #8686877 - Flags: review?(klibby)
Attachment #8686877 - Flags: review?(klibby) → review+
Comment on attachment 8686877 [details]
MozReview Request: ansible/hg-ssh: check for writing events into Kafka; r?fubar

https://reviewboard.mozilla.org/r/25097/#review22653

lgtm
Assignee: nobody → gps
Status: NEW → ASSIGNED
Things are deployed. Closing.
Status: ASSIGNED → RESOLVED
Closed: 8 years ago
Resolution: --- → FIXED