Closed Bug 1122964 Opened 9 years ago Closed 9 years ago

Opsify stack for pipeline deployment

Categories

(Cloud Services Graveyard :: Metrics: Pipeline, defect, P1)


Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: kparlante, Assigned: whd)

Details

No description provided.
This involves a bunch of things, which may eventually be split out into separate bugs.

Puppet and CFN work ongoing here:
https://github.com/mozilla-services/puppet-config/tree/kafka
https://github.com/mozilla-services/svcops/tree/kafka

Zookeeper:
CFN/Puppet (done)
Testing of various zookeeper failure scenarios (done)
Determine zookeeper cluster configuration (done, 3 m3.medium across AZs, ephemeral disk; see the config sketch after this list)
Determine disk space requirements for zookeeper (done)
  - Currently 4GB ephemeral per instance is more than enough for Kafka metadata, but as we add more projects we may need to switch to EBS-backed.
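
For reference, the ensemble config is roughly the following (a sketch only; hostnames and paths are placeholders, and the real values are templated by Puppet/CFN):

```bash
# Illustrative zoo.cfg for the 3-node ensemble described above; hostnames and
# the data dir are placeholders (the real values come from Puppet/CFN).
cat > /etc/zookeeper/zoo.cfg <<'EOF'
tickTime=2000
initLimit=10
syncLimit=5
# ephemeral disk, per the sizing note above
dataDir=/media/ephemeral0/zookeeper
clientPort=2181
server.1=zookeeper1.pipeline.us-west-2.prod.mozaws.net:2888:3888
server.2=zookeeper2.pipeline.us-west-2.prod.mozaws.net:2888:3888
server.3=zookeeper3.pipeline.us-west-2.prod.mozaws.net:2888:3888
EOF
# Each node also needs its unique id (1, 2, or 3) in $dataDir/myid.
echo 1 > /media/ephemeral0/zookeeper/myid
```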

Kafka:
CFN/Puppet (done)
Determine default partitioning and replication for the cluster (done, 6 partitions, replication factor 2; see the topic-creation sketch after this list)
Tweak config parameters to be similar to current Bagheera cluster (in progress)
Load test Kafka under expected usage profile (large payloads) (in progress)
  - figure out why G1 garbage collector leaks
  - test effect of snappy/gz compression on throughput
Test graceful recovery of Kafka after instance failure (in progress)
Package 0.8.2 after GA release (expected this week)
Determine whether to use 3-AZ as a single DC for redundancy purposes (may require smart partition allocation/"rack-aware" logic if we want to survive a full-AZ outage) or to create three separate clusters and use mirror-maker to replicate across AZs (in progress)
Test cluster expansion / re-partition logic (in progress)
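
For reference, the defaults above map onto the stock 0.8.x tooling roughly like this (topic name and ZooKeeper host are placeholders):

```bash
# Create a topic with the cluster defaults noted above (placeholder names).
kafka-topics.sh --create \
  --zookeeper zookeeper1.pipeline.us-west-2.prod.mozaws.net:2181 \
  --topic telemetry \
  --partitions 6 \
  --replication-factor 2

# Sanity-check how partitions/replicas landed across the brokers (and AZs).
kafka-topics.sh --describe \
  --zookeeper zookeeper1.pipeline.us-west-2.prod.mozaws.net:2181 \
  --topic telemetry
```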

General Heka:
Process for building custom Heka RPM (done)
RPM naming convention/Jenkins build process for custom RPM (-svc suffix)
Resolve any CentOS 7 issues running Heka (SIGINT issues, running with systemd as process supervisor)
Load testing Heka+Kafka (including a trial run on current data)
  - may need to expose more Sarama configuration to Heka Kafka producer
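
As a crude first pass at the large-payload profile (before driving it through Heka proper), something like the following works; payload size, broker, and topic are placeholders, and the compression flag needs to be verified against the console producer we package:

```bash
# Crude large-payload smoke test; a real load test will go through Heka's
# Kafka producer. Broker, topic, payload size, and the compression flag are
# placeholders/assumptions to check against the packaged tooling.
head -c 200000 /dev/urandom | base64 | tr -d '\n' > /tmp/payload.txt
for i in $(seq 1 10000); do cat /tmp/payload.txt; echo; done | \
  kafka-console-producer.sh \
    --broker-list kafka1.pipeline.us-west-2.prod.mozaws.net:9092 \
    --topic telemetry \
    --compression-codec snappy
```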

Ingestion Endpoint ("edge node"):
Standard CFN/Puppet for ELB (SSL) and ASG service
  - most probably running Heka HttpInput -> KafkaOutput
  - in case of custom server code, packaging and Puppet logic for same
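
Purely illustrative shape of the Heka config for the HttpInput -> KafkaOutput path; plugin and parameter names need to be checked against the docs for whatever Heka version we pin, and all values are placeholders:

```bash
# Illustrative only: the exact input plugin (HttpInput vs. HttpListenInput)
# and its parameters must be confirmed against the Heka docs; broker and
# topic values are placeholders.
cat > /etc/heka/edge.toml <<'EOF'
[HttpListenInput]
address = "0.0.0.0:8080"

[KafkaOutput]
message_matcher = "TRUE"
addrs = ["kafka1.pipeline.us-west-2.prod.mozaws.net:9092"]
topic = "telemetry"
EOF
```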

Data Warehouse Loader:
CFN/Puppet for Heka instance (KafkaInput -> S3Output)
  - CFN will need special IAM rules for S3 bucket(s)
  - consumer group or single instance?
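
The IAM piece is roughly a policy like the following attached to the loader's instance role (the bucket name is a placeholder; the real rules belong in the CFN template):

```bash
# Illustrative S3 policy for the warehouse loader's instance role; the bucket
# name is a placeholder and the actual policy lives in CFN.
cat > dwl-s3-policy.json <<'EOF'
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": ["s3:PutObject"],
      "Resource": "arn:aws:s3:::example-pipeline-warehouse/*"
    },
    {
      "Effect": "Allow",
      "Action": ["s3:ListBucket"],
      "Resource": "arn:aws:s3:::example-pipeline-warehouse"
    }
  ]
}
EOF
```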

CEP Heka(s):
CFN/Puppet for Heka instance (KafkaInput -> CEP filters)

General:
Additional monitoring (socket checks, etc.) for the various components; see the liveness-check sketch after this list
Hostname conventions (currently using TYPE.APP.REGION.ENV.mozaws.net for internal hosts, but CEP should probably have a friendly name)
Failover/redundancy of hosts (CEP Heka, DWL Heka)
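
The socket checks can start out as simple as the following (default ports assumed; hostnames are placeholders following the convention above):

```bash
# Basic liveness checks, assuming default ports and the hostname convention above.
echo ruok | nc zookeeper1.pipeline.us-west-2.prod.mozaws.net 2181   # expect "imok"
timeout 2 bash -c '< /dev/tcp/kafka1.pipeline.us-west-2.prod.mozaws.net/9092' \
  && echo "kafka broker reachable"
# (Heka-specific checks to follow once the Heka configs/ports settle.)
```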
(In reply to Wesley Dawson [:whd] from comment #1)
> Determine whether to use 3-AZ as a single DC for redundancy purposes (may
> require smart partition allocation/"rack-aware" logic if we want to survive
> a full-AZ outage) or to create three separate clusters and use mirror-maker
> to replicate across AZs (in progress)
FWIW, current telemetry backend lives only in us-west-2, and current FHR backend lives within Mozilla data center, so cross-AZ redundancy may not be necessary.

> General Heka:
> Process for building custom Heka RPM (done)
> RPM naming convention/Jenkins build process for custom RPM (-svc suffix)
Is this process going to be compatible with the build script over in Bug 1122754? (See `bin/build_pipeline_heka.sh` in that repo)
(In reply to Mark Reid [:mreid] from comment #2)
> FWIW, current telemetry backend lives only in us-west-2, and current FHR
> backend lives within Mozilla data center, so cross-AZ redundancy may not be
> necessary.

This is good to know. I'll work with Travis to formalize this if we decide not to be redundant across AZs.

> Is this process going to be compatible with the build script over in Bug
> 1122754? (See `bin/build_pipeline_heka.sh` in that repo)

Yes, this is an additional step to apply a vendor prefix (e.g. -svc) to the RPM generated by that build process. I need our custom-built Heka RPMs to be distinguishable from the official builds, since we use both. I can file a PR to add the rpmrebuild command to bin/build_pipeline_heka.sh (which I am about to review).
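
Roughly what I have in mind for the rename step (the exact invocation will be settled in the PR; the filename is a placeholder and the --change-spec-preamble hook is the approach I'm assuming):

```bash
# Sketch: rewrite the Name: field of the freshly built RPM to add the -svc
# suffix. Filename is a placeholder; flags to be finalized in the PR.
rpmrebuild --notest-install \
  --change-spec-preamble='sed -e "s/^Name:[[:space:]]*heka$/Name: heka-svc/"' \
  -p heka-0.9.0-1.x86_64.rpm
```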
Status: NEW → ASSIGNED
Comments from bug triage:

Split out into separate bugs:
- Edge nodes
- Warehouse loader
- CEP node

All just different ways of running heka...
Some updates before splitting things out into separate bugs.

Zookeeper and Kafka work is mostly done and is being tested with current FHR data using Kafka as both producer and consumer. I added Yahoo's kafka-manager to the Kafka nodes which makes cluster management nice and easy via a web interface.

Kafka 0.8.2.0 was released this week and has been packaged. I did a rolling upgrade of the cluster during both consumer and producer usage with no issues. Tomorrow I'm going to do a cluster expand to resize EBS (currently only 1.5TB), which is mostly an exercise in partition reassignment under "real" load. I'm deferring investigation of G1 GC issues because Kafka is running excellently without it.
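
For the record, the reassignment exercise will look roughly like this (topic name, ZooKeeper host, and broker ids are placeholders):

```bash
# Generate a candidate assignment over the enlarged broker set, then execute
# and verify it. Topic, ZK host, and broker ids are placeholders.
cat > topics-to-move.json <<'EOF'
{"topics": [{"topic": "telemetry"}], "version": 1}
EOF

kafka-reassign-partitions.sh \
  --zookeeper zookeeper1.pipeline.us-west-2.prod.mozaws.net:2181 \
  --topics-to-move-json-file topics-to-move.json \
  --broker-list "0,1,2,3,4,5" \
  --generate > proposed-reassignment.json
# (trim proposed-reassignment.json down to the "Proposed partition reassignment" JSON)

kafka-reassign-partitions.sh \
  --zookeeper zookeeper1.pipeline.us-west-2.prod.mozaws.net:2181 \
  --reassignment-json-file proposed-reassignment.json \
  --execute

kafka-reassign-partitions.sh \
  --zookeeper zookeeper1.pipeline.us-west-2.prod.mozaws.net:2181 \
  --reassignment-json-file proposed-reassignment.json \
  --verify
```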

No special logic is needed for partitioning across AZs because we don't expect to need to survive an AZ-wide outage.

I've filed https://github.com/mozilla-services/data-pipeline/pull/2 for some packaging changes.

CentOS 7 issues are still being worked on. I've added support for running heka via systemd in https://github.com/mozilla-services/puppet-config/commit/108d6f17efeda77a012926183fa11d0965380340 so next steps there are to start building up the configs+CFN (forthcoming separate bugs).
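
For reference, the unit is essentially the following shape; the authoritative version lives in that commit, paths are placeholders, and KillSignal=SIGINT is my assumption based on the SIGINT shutdown behavior mentioned earlier.

```bash
# Rough shape of the hekad systemd unit; the real one is in the puppet-config
# commit above. Paths are placeholders; KillSignal=SIGINT is an assumption.
cat > /etc/systemd/system/hekad.service <<'EOF'
[Unit]
Description=Heka daemon
After=network.target

[Service]
ExecStart=/usr/bin/hekad -config=/etc/hekad.toml
KillSignal=SIGINT
Restart=on-failure

[Install]
WantedBy=multi-user.target
EOF
systemctl daemon-reload
systemctl enable hekad && systemctl start hekad
```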
Priority: -- → P1
Status: ASSIGNED → RESOLVED
Closed: 9 years ago
Resolution: --- → FIXED
Product: Cloud Services → Cloud Services Graveyard