We have a plan for this from the work week. The short version is: N CEP servers, K Kafka partitions (partitioned by client ID), each reads K/N partitions (the first also reads records with no client ID). Each CEP sends its aggregates to a single CEP aggregator that acts as the current CEP does. This will be implemented by having N ASGs of 1, with user data + local yaml to inform the node which partitions it should consume. In this configuration, if we lose an instance for some reason, the data will end up looking inconsistent (especially if we lose the aggregator). Since this is only monitoring/ad-hoc analysis we're willing to accept this for GA. I worked out on the flight back a marginal improvement on the above involving periodic S3 state saves and kafka buffers (allowing us to lose much less state / recover more easily) but I've been writing a lot of bugmail and don't have the bandwidth to present it here. The CEP nodes will need to be scaled manually horizontally (if vertical scaling on N is not sufficient), but we should have alerting for that. Adding a CEP requires reshuffling the partitions across the nodes (and again, state becomes inconsistent), but as it's monitoring data we're OK with it.
Assignee: nobody → whd
Priority: -- → P2
Whiteboard: [rC] [unifiedTelemetry]
This is done in the following PRs: https://github.com/mozilla-services/puppet-config/pull/1517 https://github.com/mozilla-services/svcops/pull/653 I will be filing separate bugs for load testing and deployment of these changes.
Status: NEW → RESOLVED
Last Resolved: 3 years ago
Resolution: --- → FIXED
You need to log in before you can comment on or make changes to this bug.