Bug 1126619 (Closed) · Opened 9 years ago · Closed 8 years ago

Deploy ZooKeeper and Kafka on hg.mozilla.org

Categories

Product/Component: Developer Services :: Mercurial: hg.mozilla.org
Type: defect
Priority: Not set
Severity: normal

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: gps, Assigned: gps)

References

Details

Attachments

(2 files)

We'll be using Kafka for our log-based replication system being written in bug 1126153.

A prerequisite of Kafka is ZooKeeper.

We'll likely run Kafka servers on the same machines where the ZooKeeper servers are running.

Both ZooKeeper and Kafka run in the JVM. We'll want both processes to be hooked up to supervisor or some such so that they run at startup and restart if they fail.
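As a rough illustration of the supervisor hookup described above, here is a minimal supervisord sketch. The install paths, script names, and users are assumptions for illustration, not the deployed configuration (the actual start scripts ship in the upstream tarballs).

```ini
; Illustrative supervisord entries (paths and users are assumptions).
[program:zookeeper]
command=/opt/zookeeper/bin/zkServer.sh start-foreground
user=zookeeper
autostart=true
autorestart=true

[program:kafka]
command=/opt/kafka/bin/kafka-server-start.sh /opt/kafka/config/server.properties
user=kafka
autostart=true
autorestart=true
```

`autorestart=true` gives us the restart-on-failure behavior, and supervisord itself starting at boot covers run-at-startup.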

We'll need the following flows open for each node talking to ZooKeeper and Kafka (both the hgssh and hgweb machines will):

 tcp 2181 (ZooKeeper client connections)
 tcp 2888 (ZooKeeper server to server communication)
 tcp 3888 (ZooKeeper server to server communication)
 tcp 9092 (Kafka client connections)

The ports don't have to be exactly these, but these are the defaults. We don't need these ports open to hosts outside of the Mercurial "cluster."

The minimum ZooKeeper cluster size is 3 (you need enough hosts that a strict majority can still maintain quorum if a host dies). We should aim to deploy on more hosts if possible, though I don't think we absolutely need to deploy on every machine in the cluster; 5 or 6 would be fine. Let's say hgssh[12].dmz.scl3 and hgweb[1234].dmz.scl3.
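The quorum arithmetic behind those sizes can be sketched with a pair of hypothetical helpers (not from any deployed code):

```python
# Hypothetical helpers illustrating ZooKeeper quorum sizing: an
# ensemble of n servers stays available only while a strict majority
# of them is alive.

def quorum_size(n: int) -> int:
    """Smallest number of servers that forms a majority of n."""
    return n // 2 + 1

def tolerable_failures(n: int) -> int:
    """How many servers can die while the ensemble keeps quorum."""
    return n - quorum_size(n)
```

A 3-node ensemble tolerates 1 failure, a 5-node ensemble tolerates 2, and a 6-node ensemble still tolerates only 2, which is why odd ensemble sizes are usually recommended.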

The docs for ZooKeeper and Kafka aren't terrific. We should definitely run our configurations by someone who has deployed these systems before. Cloud Services has a few people.

We should probably aim for latest stable versions. 3.4.6 for ZK and 0.8.1.1 for K. These are Java packages, so you pretty much just download a tarball and run some scripts inside.

https://zookeeper.apache.org/doc/r3.4.6/zookeeperAdmin.html
https://kafka.apache.org/documentation.html
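Tying the ensemble size and default ports together, a zoo.cfg for the proposed six hosts might look like the sketch below. The timing values and dataDir are assumptions for illustration; the hostnames are the ones proposed above, and the ports are the defaults listed earlier.

```ini
# Illustrative zoo.cfg for the proposed ensemble (timings and dataDir
# are assumptions, not a reviewed configuration).
tickTime=2000
initLimit=10
syncLimit=5
dataDir=/var/lib/zookeeper
clientPort=2181
# server.N=host:peerPort:leaderElectionPort
server.1=hgssh1.dmz.scl3:2888:3888
server.2=hgssh2.dmz.scl3:2888:3888
server.3=hgweb1.dmz.scl3:2888:3888
server.4=hgweb2.dmz.scl3:2888:3888
server.5=hgweb3.dmz.scl3:2888:3888
server.6=hgweb4.dmz.scl3:2888:3888
```

Each host additionally needs a myid file in dataDir containing its server number.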
Depends on: 1203797
fubar: we'll want separate users to own processes and files for these daemons. IIRC Puppet or mig or something will delete unknown users from hgssh and hgweb unless they are listed in Puppet. Could you please create 2 system accounts, "kafka" and "zookeeper" in Puppet? It looks like there is already a zookeeper user with uid 2321 in sysadmins.
Flags: needinfo?(klibby)
puppet has a clean up function. I've added both users to all of the hg nodes.
Flags: needinfo?(klibby)
ansible/kafka-broker: add Nagios check for ZooKeeper (bug 1126619); r?fubar

We implement a check_zookeeper Nagios check. The check verifies the
health of an individual node and of a ZooKeeper ensemble (cluster).

We add an NRPE config file defining how to invoke the Nagios check.

A test of the new Nagios check is included.
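For flavor, a check like this can probe a node with ZooKeeper's "four letter word" admin commands: "ruok" should return "imok", and "mntr" returns tab-separated stats including zk_server_state. The sketch below is illustrative, not the actual check from the patch:

```python
# Sketch of a check_zookeeper-style probe (illustrative; not the
# check implemented in the attachment). Uses ZooKeeper's four-letter
# admin commands over the client port.
import socket

def four_letter_word(host, port, cmd, timeout=5.0):
    """Send a four-letter admin command (bytes) and return the raw reply."""
    with socket.create_connection((host, port), timeout=timeout) as s:
        s.sendall(cmd)
        chunks = []
        while True:
            data = s.recv(4096)
            if not data:
                break
            chunks.append(data)
    return b"".join(chunks)

def parse_mntr(payload):
    """Parse "mntr" output (key<TAB>value per line) into a dict."""
    stats = {}
    for line in payload.decode("utf-8", "replace").splitlines():
        key, _, value = line.partition("\t")
        if key:
            stats[key] = value
    return stats

def node_status(host, port=2181):
    """Return a (Nagios exit code, message) pair for one node."""
    try:
        if four_letter_word(host, port, b"ruok").strip() != b"imok":
            return 2, "CRITICAL: %s did not answer imok" % host
        state = parse_mntr(four_letter_word(host, port, b"mntr"))
        return 0, "OK: %s is a ZooKeeper %s" % (
            host, state.get("zk_server_state", "node"))
    except OSError as e:
        return 2, "CRITICAL: %s unreachable: %s" % (host, e)
```

An ensemble-level check would additionally assert that exactly one node reports zk_server_state of "leader" and that a quorum of nodes responds at all.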
Attachment #8685700 - Flags: review?(klibby)
Attachment #8685700 - Flags: review?(klibby) → review+
Comment on attachment 8685700 [details]
MozReview Request: ansible/kafka-broker: add Nagios check for ZooKeeper (bug 1126619); r?fubar

https://reviewboard.mozilla.org/r/24875/#review22451

I have slight concerns about the check and config being in v-c-t instead of puppet, as nagios and nrpe are the domain of the MOC; mixing two sources seems like it could cause confusion. Also, when mana documentation is written for this check (and that needs to happen before enabling it), it should include clear instructions on who to contact, and how, for issues with the check itself (and ofc how to respond to the check).
https://reviewboard.mozilla.org/r/24875/#review22451

I wasn't a huge fan of putting it in v-c-t either. But a version needs to live in v-c-t if we are to test it. AFAICT none of the existing checks in sysadmins/puppet have tests, and for something as complicated as this, I kinda insist on having tests.

Anyway, if you want to vendor this check in sysadmins, I understand completely. But then we'll have a synchronization issue. Sucks either way.

I /could/ make the check output more verbose to include follow-up instructions. Is that preferred to writing separate docs [which may get out of sync with reality]?
https://reviewboard.mozilla.org/r/24875/#review22451

Tests win, honestly. If we document well enough, having it in v-c-t really shouldn't be an issue.

Nearly all of the MOC's nagios checks report status with a link to the check's documentation on mana, e.g.: "host.mozilla.com:HTTP is CRITICAL: CRITICAL - Socket timeout after 10 seconds (http://m.mozilla.org/HTTP)". All information on how to respond to the check, or escalate, should be on that page OR linked to from it. A lot of those pages have some basic troubleshooting steps but then refer the user to other mana pages for more complicated (or site specific) troubleshooting, so sending people off to the v-c-t docs wouldn't be unusual (other than it not being on mana).
Comment on attachment 8685700 [details]
MozReview Request: ansible/kafka-broker: add Nagios check for ZooKeeper (bug 1126619); r?fubar

Review request updated; see interdiff: https://reviewboard.mozilla.org/r/24875/diff/1-2/
Comment on attachment 8685700 [details]
MozReview Request: ansible/kafka-broker: add Nagios check for ZooKeeper (bug 1126619); r?fubar

Review request updated; see interdiff: https://reviewboard.mozilla.org/r/24875/diff/2-3/
ansible/hg-ssh: check for writing events into Kafka; r?fubar

The replication log for hg has a no-op "heartbeat" message used for
testing whether the replication log is writable. This is used internally
so hg operations can fail fast when the replication log isn't available.
The message can also be used for monitoring. The sending of this message
is conveniently exposed via the `hg sendheartbeat` command.

This commit introduces a Nagios check that simply invokes `hg
sendheartbeat` and reports the results.

By running this check periodically in production, we'll verify the
replication log is online and writable so failures in the replication
log can hopefully be identified and fixed before a push comes in and
fails due to the log being unavailable.
Attachment #8686877 - Flags: review?(klibby)
Attachment #8686877 - Flags: review?(klibby) → review+
Comment on attachment 8686877 [details]
MozReview Request: ansible/hg-ssh: check for writing events into Kafka; r?fubar

https://reviewboard.mozilla.org/r/25097/#review22653

lgtm
Assignee: nobody → gps
Status: NEW → ASSIGNED
Things are deployed. Closing.
Status: ASSIGNED → RESOLVED
Closed: 8 years ago
Resolution: --- → FIXED