Pipeline CEP nodes appear stuck

RESOLVED FIXED

Status

Cloud Services
Metrics: Pipeline
3 years ago
3 years ago

People

(Reporter: mreid, Unassigned)

Tracking

(Blocks: 1 bug)

Firefox Tracking Flags

(Not tracked)

Description

3 years ago
The message counters on the CEP nodes do not appear to be advancing, although the counters on the DWL node (the important one!) seem to be advancing as expected.
hekad-cep has been restarted on pipeline-cep2, and it is now throwing the following error message:

>  Error making runner for TelemetryKafkaInput21: Initialization failed for 'TelemetryKafkaInput21': kafka server: The requested offset is outside the range of offsets maintained by the server for the given topic/partition.

The input named in the error (TelemetryKafkaInput21 above) varies from message to message. hekad-cep fails to start after about 4 seconds, causing systemd to restart it indefinitely.

whd has been paged about this, because I don't know where to go from here.

Comment 2

3 years ago
This should be fixed now.

I fixed this last week (and again just now) by switching the offset method from Manual to Newest on the CEP nodes, which is probably acceptable for monitoring hosts. I need to investigate why the CEP nodes occasionally fall behind, but that will be a separate bug. These nodes are pretty much idle, so it's not a resource issue, and the DWL nodes (which use ZooKeeper-backed offsets) don't have the issue either.
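
For illustration only, here is a toy Python model of why a Manual offset can fail at startup while Newest cannot. It is a sketch under assumptions, not Heka or Kafka code: `PartitionLog`, `starting_offset`, and `OffsetOutOfRange` are hypothetical names, and the idea that the stored offset had aged out of the broker's retained range is an inference from the error message, not something confirmed in this bug.

```python
from dataclasses import dataclass


class OffsetOutOfRange(Exception):
    """Stand-in for the broker's 'requested offset is outside the range' error."""


@dataclass
class PartitionLog:
    """Toy model of one topic/partition as seen by a consumer."""
    oldest: int  # first offset the broker still retains
    newest: int  # offset the next produced message will receive


def starting_offset(log, method, checkpoint=None):
    """Return the offset a consumer would start from (hypothetical helper).

    "Newest" and "Oldest" always resolve to an offset the broker has.
    "Manual" resumes from a stored checkpoint, which may no longer be valid.
    """
    if method == "Newest":
        return log.newest
    if method == "Oldest":
        return log.oldest
    if method == "Manual":
        if checkpoint is None:
            # This toy model does not cover a first start without a checkpoint.
            raise ValueError("Manual requires a stored checkpoint in this sketch")
        if not (log.oldest <= checkpoint <= log.newest):
            raise OffsetOutOfRange(
                f"checkpoint {checkpoint} not in [{log.oldest}, {log.newest}]"
            )
        return checkpoint
    raise ValueError(f"unknown offset method: {method!r}")
```

In this model, a consumer that fell behind far enough for its checkpoint to drop below `oldest` fails on every start with Manual, which matches the restart loop described above. With Newest it always starts, at the cost of skipping whatever it missed, which is why the change is only "probably acceptable" for monitoring hosts.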

On a somewhat related note, bug #1208570 still appears to be a problem. I will update that bug later.
Status: NEW → RESOLVED
Last Resolved: 3 years ago
Resolution: --- → FIXED