Closed Bug 1126619 · Opened 9 years ago · Closed 8 years ago
Deploy ZooKeeper and Kafka on hg.mozilla.org
Categories: Developer Services :: Mercurial: hg.mozilla.org (defect)
Tracking: (Not tracked)
Status: RESOLVED FIXED
People
(Reporter: gps, Assigned: gps)
Attachments
(2 files)
We'll be using Kafka for our log-based replication system being written in bug 1126153. A prerequisite of Kafka is ZooKeeper. We'll likely run Kafka servers on the same machines where the ZooKeeper servers are running. Both ZooKeeper and Kafka run in the JVM. We'll want both processes hooked up to supervisor or some such so that they run at startup and restart if they fail.

We'll need the following flows open from each node talking to ZK and K (both hgssh and hgweb will):

tcp 2181 (ZooKeeper client connections)
tcp 2888 (ZooKeeper server-to-server communication)
tcp 3888 (ZooKeeper server-to-server communication)
tcp 9092 (Kafka client connections)

The ports don't have to be exactly these, but these are the defaults. We don't need these ports open to hosts outside of the Mercurial "cluster."

The minimum ZooKeeper cluster size is 3 (you need a system that can maintain quorum if hosts die), but we should aim to deploy on more hosts if possible. I don't think we absolutely need to deploy on every machine in the cluster; 5 or 6 would be fine. Let's say hgssh[12].dmz.scl3 and hgweb[1234].dmz.scl3.

The docs for ZooKeeper and Kafka aren't terrific. We should definitely run our configurations by someone who has deployed these systems before. Cloud Services has a few people.

We should probably aim for the latest stable versions: 3.4.6 for ZK and 0.8.1.1 for K. These are Java packages, so you pretty much just download a tarball and run some scripts inside.

https://zookeeper.apache.org/doc/r3.4.6/zookeeperAdmin.html
https://kafka.apache.org/documentation.html
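For illustration, a six-node ensemble on the proposed hosts with those default ports might be configured with a zoo.cfg along these lines. This is a sketch only, not the deployed configuration; the tickTime/initLimit/syncLimit values are the examples from the ZooKeeper admin docs, and paths are assumptions. Each server would also write its id (1-6) into dataDir/myid.

```ini
# zoo.cfg - illustrative sketch, not the deployed configuration
tickTime=2000
initLimit=10
syncLimit=5
dataDir=/var/lib/zookeeper
clientPort=2181
# server.N=host:peer_port:leader_election_port
server.1=hgssh1.dmz.scl3:2888:3888
server.2=hgssh2.dmz.scl3:2888:3888
server.3=hgweb1.dmz.scl3:2888:3888
server.4=hgweb2.dmz.scl3:2888:3888
server.5=hgweb3.dmz.scl3:2888:3888
server.6=hgweb4.dmz.scl3:2888:3888
```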
Assignee
Comment 1 • 9 years ago
fubar: we'll want separate users to own processes and files for these daemons. IIRC Puppet or mig or something will delete unknown users from hgssh and hgweb unless they are listed in Puppet. Could you please create 2 system accounts, "kafka" and "zookeeper" in Puppet? It looks like there is already a zookeeper user with uid 2321 in sysadmins.
Flags: needinfo?(klibby)
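For illustration, the account creation in Puppet might look like the user resources below. This is a hedged sketch: only the zookeeper uid 2321 comes from the existing sysadmins entry; the shell, home directories, and leaving the kafka uid unpinned are assumptions.

```puppet
# Illustrative sketch; shells, homes, and the unpinned kafka uid are assumptions.
user { 'zookeeper':
  ensure => present,
  system => true,
  uid    => '2321',
  shell  => '/sbin/nologin',
  home   => '/var/lib/zookeeper',
}

user { 'kafka':
  ensure => present,
  system => true,
  shell  => '/sbin/nologin',
  home   => '/var/lib/kafka',
}
```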
Comment 2 • 9 years ago
Puppet has a clean-up function. I've added both users to all of the hg nodes.
Flags: needinfo?(klibby)
Assignee
Comment 3 • 9 years ago
ansible/kafka-broker: add Nagios check for ZooKeeper (bug 1126619); r?fubar

We implement a check_zookeeper Nagios check. The check verifies the health of an individual node and of a ZooKeeper ensemble (cluster). We add an NRPE config file defining how to invoke the Nagios check. A test of the new Nagios check has been implemented.
Attachment #8685700 -
Flags: review?(klibby)
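The actual check lives in version-control-tools; as a rough illustration of the idea, a node-health probe can be built on ZooKeeper's four-letter admin commands (`ruok` over the client port). Everything below — function names, messages, structure — is an illustrative sketch, not the committed check.

```python
# Sketch of a Nagios-style ZooKeeper node probe using the "ruok"
# four-letter command. Illustrative only; not the check from this bug.
import socket

# Standard Nagios exit codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3

def four_letter_word(host, port, command, timeout=5):
    """Send a four-letter admin command (e.g. b"ruok") and return the raw reply."""
    with socket.create_connection((host, port), timeout=timeout) as sock:
        sock.sendall(command)
        chunks = []
        while True:
            data = sock.recv(4096)
            if not data:
                break
            chunks.append(data)
    return b"".join(chunks)

def interpret_ruok(reply):
    """Map a "ruok" reply to a Nagios (exit_code, message) pair."""
    if reply == b"imok":
        return OK, "zookeeper node reports imok"
    return CRITICAL, "unexpected ruok reply: %r" % reply
```

A full ensemble check would additionally query each server (e.g. via `mntr` or `srvr`) and verify that exactly one node reports itself as leader.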
Updated • 9 years ago
Attachment #8685700 -
Flags: review?(klibby) → review+
Comment 4 • 9 years ago
Comment on attachment 8685700 [details]
MozReview Request: ansible/kafka-broker: add Nagios check for ZooKeeper (bug 1126619); r?fubar

https://reviewboard.mozilla.org/r/24875/#review22451

I have slight concerns about the check and config being in v-c-t instead of puppet, as nagios and nrpe are the domain of the MOC; mixing two sources seems like it could cause confusion. Also, when mana documentation is written for this check (and that needs to happen before enabling it), it should include clear instructions on who to contact, and how, for issues with the check itself (and ofc how to respond to the check).
Assignee
Comment 5 • 9 years ago
https://reviewboard.mozilla.org/r/24875/#review22451

I wasn't a huge fan of putting it in v-c-t either. But a version needs to live in v-c-t if we are to test it. AFAICT none of the existing checks in sysadmins/puppet have tests, and for something as complicated as this, I kinda insist on having tests. Anyway, if you want to vendor this check in sysadmins, I understand completely. But then we'll have a synchronization issue. Sucks either way.

I /could/ make the check output more verbose to include follow-up instructions. Is that preferred to writing separate docs [which may get out of sync with reality]?
Comment 6 • 9 years ago
https://reviewboard.mozilla.org/r/24875/#review22451

Tests win, honestly. If we document well enough, having it in v-c-t really shouldn't be an issue. Nearly all of the MOC's nagios checks report status with a link to the check's documentation on mana, e.g.: "host.mozilla.com:HTTP is CRITICAL: CRITICAL - Socket timeout after 10 seconds (http://m.mozilla.org/HTTP)". All information on how to respond to the check, or escalate, should be on that page OR linked to from it. A lot of those pages have some basic troubleshooting steps but then refer the user to other mana pages for more complicated (or site-specific) troubleshooting, so sending people off to the v-c-t docs wouldn't be unusual (other than it not being on mana).
Assignee
Comment 7 • 9 years ago
Comment on attachment 8685700 [details]
MozReview Request: ansible/kafka-broker: add Nagios check for ZooKeeper (bug 1126619); r?fubar

Review request updated; see interdiff: https://reviewboard.mozilla.org/r/24875/diff/1-2/
Assignee
Comment 8 • 9 years ago
Comment on attachment 8685700 [details]
MozReview Request: ansible/kafka-broker: add Nagios check for ZooKeeper (bug 1126619); r?fubar

Review request updated; see interdiff: https://reviewboard.mozilla.org/r/24875/diff/2-3/
Assignee
Comment 9 • 9 years ago
ansible/hg-ssh: check for writing events into Kakfa; r?fubar

The replication log for hg has a no-op "heartbeat" message used for testing whether the replication log is writable. This is used internally so hg operations can fail fast when the replication log isn't available. The message can also be used for monitoring. The sending of this message is conveniently exposed via the `hg sendheartbeat` command.

This commit introduces a Nagios check that simply invokes `hg sendheartbeat` and reports the results. By running this check periodically in production, we'll verify the replication log is online and writable, so failures in the replication log can hopefully be identified and fixed before a push comes in and fails due to the log being unavailable.
Attachment #8686877 -
Flags: review?(klibby)
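As a sketch of how such a wrapper can work, the code below shells out to `hg sendheartbeat` and translates the result into a Nagios status. The repository path, function names, and message strings are assumptions for illustration; the real check is the one attached to this bug.

```python
# Illustrative sketch of a Nagios check wrapping `hg sendheartbeat`.
# Paths and messages are assumptions, not the committed check.
import subprocess

# Standard Nagios exit codes (only the two we use here).
OK, CRITICAL = 0, 2

def run_heartbeat(repo="/repo/hg/mozilla", hg="hg"):
    """Invoke `hg sendheartbeat` and capture its exit status and output."""
    proc = subprocess.run(
        [hg, "-R", repo, "sendheartbeat"],
        stdout=subprocess.PIPE, stderr=subprocess.STDOUT,
    )
    return proc.returncode, proc.stdout.decode("utf-8", "replace")

def nagios_result(returncode, output):
    """Map the command result to a Nagios (exit_code, message) pair."""
    if returncode == 0:
        return OK, "OK - heartbeat message sent to replication log"
    return CRITICAL, "CRITICAL - sendheartbeat failed: %s" % output.strip()
```

The check's main entry point would then print the message and `sys.exit()` with the mapped code, which is how NRPE consumes plugin results.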
Updated • 9 years ago
Attachment #8686877 -
Flags: review?(klibby) → review+
Comment 10 • 9 years ago
Comment on attachment 8686877 [details]
MozReview Request: ansible/hg-ssh: check for writing events into Kakfa; r?fubar

https://reviewboard.mozilla.org/r/25097/#review22653

lgtm
Assignee
Updated • 9 years ago
Assignee: nobody → gps
Status: NEW → ASSIGNED
Assignee
Comment 11 • 9 years ago
https://hg.mozilla.org/hgcustom/version-control-tools/rev/a8eac2fb5c1a6e5c0cacf8afdb2f06287a1a87be
ansible/kafka-broker: add Nagios check for ZooKeeper (bug 1126619); r=fubar

https://hg.mozilla.org/hgcustom/version-control-tools/rev/c0c6165a47a1b27e1fe8e150d1c74ab46533f1d2
ansible/hg-ssh: check for writing events into Kakfa (bug 1126619); r=fubar
Assignee
Comment 12 • 8 years ago
Things are deployed. Closing.
Status: ASSIGNED → RESOLVED
Closed: 8 years ago
Resolution: --- → FIXED