Closed Bug 1381055 Opened 8 years ago Closed 8 years ago

Make separate validated Kafka topic for experiments

Categories

(Data Platform and Tools :: General, enhancement, P2)

Points:
2

Tracking

(Not tracked)

RESOLVED WONTFIX

People

(Reporter: frank, Unassigned)

References

Details

This topic would include all pings that have an experiment. We'll use it for the separate spark-streaming application that will write an experiments-only real-time dataset.
Can we just filter / explode the experiments array from the main kafka topic as we read from it?
Flags: needinfo?(fbertsch)
(In reply to Mark Reid [:mreid] from comment #1)
> Can we just filter / explode the experiments array from the main kafka topic
> as we read from it?

Sure, but that's extra work for little benefit. I figured since the DWL is already reading experiments (for the "telemetry-cohorts" source) it wouldn't be much overhead to push those same pings to a separate Kafka topic.
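The filter-as-you-read approach from comment #1 could be sketched roughly like this (a minimal plain-Python sketch of the predicate only, not the actual Spark Streaming consumer; the decoded-ping layout and the `environment`/`experiments` field names are assumptions based on telemetry conventions, not confirmed by this bug):

```python
def has_experiments(ping):
    """Return True if a decoded ping carries at least one active experiment.

    Assumes a (hypothetical) decoded-ping layout where experiments live
    under environment/experiments as a dict of {experiment_id: {...}}.
    """
    experiments = ping.get("environment", {}).get("experiments") or {}
    return len(experiments) > 0


def filter_experiment_pings(pings):
    """Keep only the pings that would belong in an experiments-only stream."""
    return [p for p in pings if has_experiments(p)]
```

In the streaming job this predicate would simply be applied to each ping read off the main topic, which is the "extra work" the reply refers to: every consumer that wants experiments-only data has to repeat it, whereas a dedicated topic does the filtering once at write time.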
Flags: needinfo?(fbertsch)
I want to nail down what the incoming data will look like. Would it be easier to have HS write every ping a single time, or write every ping once for each experiment_id, with additional `experiment_id` and `experiment_branch` entries, similar to how we're doing the "telemetry-cohorts" source? I would prefer the latter.
A proposal for implementation without topology change follows.

1. On the DWL, move the custom experiments logic from the current s3 output to a "kafka exploder" output, which writes the exploded output to a "telemetry-cohorts" kafka topic. We can probably go back to using the generic s3 output with this change. In addition to the explosion, the output should also set the Type to "telemetry.cohorts" or similar for hindsight routing purposes.
2. Add a kafka input to the same DWL for the "telemetry-cohorts" stream. This will re-inject "telemetry.cohorts" messages back into hindsight for additional processing.
3. Add a normal s3 output that matches only "telemetry.cohorts" messages to provide the output that was previously supplied with custom logic.

At this point we'd have "telemetry-cohorts" available in heka-framed s3 and as a kafka topic for e.g. spark streaming, with the minimum of infrastructure change.
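The explosion in step 1 above (and the per-experiment write shape asked about in comment 2) can be sketched in plain Python. This is only the transformation logic, not a Hindsight output plugin (those are Lua sandboxes); the `environment`/`experiments` layout and the `branch` key are assumptions, and the flat `Type` / `experiment_id` / `experiment_branch` fields are hypothetical stand-ins for the real message schema:

```python
def explode_by_experiment(ping):
    """Emit one copy of the ping per active experiment.

    Mirrors the "telemetry-cohorts" convention described above: each copy
    gains experiment_id / experiment_branch entries and is retyped to
    "telemetry.cohorts" so hindsight matchers can route it.
    """
    experiments = ping.get("environment", {}).get("experiments") or {}
    exploded = []
    for exp_id, meta in experiments.items():
        copy = dict(ping)  # shallow copy; enough for this sketch
        copy["Type"] = "telemetry.cohorts"
        copy["experiment_id"] = exp_id
        copy["experiment_branch"] = (meta or {}).get("branch")
        exploded.append(copy)
    return exploded
```

A ping enrolled in N experiments thus produces N messages on the "telemetry-cohorts" topic, and a ping with no experiments produces none, which doubles as the filter from comment 1.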
Points: --- → 2
OS: Mac OS X → Unspecified
Priority: -- → P2
Hardware: x86 → Unspecified
(In reply to Wesley Dawson [:whd] from comment #4)

I set up the Spark Streaming job to do the explosion, so for the moment we don't need this. Let's revisit this if another consumer comes up.
Status: NEW → RESOLVED
Closed: 8 years ago
Resolution: --- → WONTFIX
Component: Pipeline Ingestion → General