Bug 1381055: Make separate validated Kafka topic for experiments
Opened 8 years ago • Closed 8 years ago
Categories: Data Platform and Tools :: General, enhancement, P2
Tracking: Not tracked
Status: RESOLVED WONTFIX
People: Reporter: frank; Assignee: Unassigned
Description
This topic would include all pings that have an experiment. We'll use it for the separate spark-streaming application that will write an experiments-only real-time dataset.
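For illustration, a minimal Spark Structured Streaming sketch of the filtering such a topic would encapsulate: read the main telemetry topic and keep only pings that carry experiments, then republish them. Broker address, topic names, and the substring check on the raw JSON payload are placeholders, not the real pipeline configuration.

    import org.apache.spark.sql.SparkSession

    object ExperimentsStreamSketch {
      def main(args: Array[String]): Unit = {
        // Requires the spark-sql-kafka connector on the classpath.
        val spark = SparkSession.builder()
          .appName("experiments-stream-sketch")
          .getOrCreate()
        import spark.implicits._

        // Read raw pings from the main telemetry topic (names are placeholders).
        val pings = spark.readStream
          .format("kafka")
          .option("kafka.bootstrap.servers", "kafka:9092")
          .option("subscribe", "telemetry")
          .load()

        // Keep only pings whose payload mentions an experiments block; a real
        // job would parse the ping and inspect environment.experiments rather
        // than rely on a cheap substring check.
        val experimentPings = pings
          .selectExpr("key", "CAST(value AS STRING) AS value")
          .filter($"value".contains("\"experiments\""))

        // Republish the filtered pings to a hypothetical experiments-only topic.
        experimentPings.writeStream
          .format("kafka")
          .option("kafka.bootstrap.servers", "kafka:9092")
          .option("topic", "telemetry-experiments")
          .option("checkpointLocation", "/tmp/experiments-stream-checkpoint")
          .start()
          .awaitTermination()
      }
    }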
Mark Reid [:mreid]
Comment 1 • 8 years ago
Can we just filter / explode the experiments array from the main kafka topic as we read from it?
Flags: needinfo?(fbertsch)
Reporter
Comment 2 • 8 years ago
(In reply to Mark Reid [:mreid] from comment #1)
> Can we just filter / explode the experiments array from the main kafka topic
> as we read from it?
Sure, but that's extra work for little benefit. I figured since the DWL is already reading experiments (for the "telemetry-cohorts" source) it wouldn't be much overhead to push those same pings to a separate Kafka topic.
Flags: needinfo?(fbertsch)
Reporter
Comment 3 • 8 years ago
I want to nail down what the incoming data will look like. Would it be easier to have HS (Hindsight) write every ping a single time, or write each ping once per experiment_id, with additional `experiment_id` and `experiment_branch` entries, similar to how we're doing the "telemetry-cohorts" source?
I would prefer the latter.
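For concreteness, a toy Spark sketch of the second option. The column names and the map-of-experiments shape are assumptions standing in for the real ping schema: each ping comes out once per experiment, with `experiment_id` and `experiment_branch` promoted to top-level columns.

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.functions.explode

    object ExplodeExperimentsSketch {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder()
          .appName("explode-experiments-sketch")
          .master("local[*]")
          .getOrCreate()
        import spark.implicits._

        // Toy pings: a document id plus a map of experiment_id -> experiment_branch,
        // standing in for environment.experiments in a real telemetry ping.
        val pings = Seq(
          ("doc-1", Map("exp-a" -> "control", "exp-b" -> "treatment")),
          ("doc-2", Map("exp-a" -> "treatment"))
        ).toDF("document_id", "experiments")

        // One output row per (ping, experiment), with experiment_id and
        // experiment_branch as top-level columns.
        val exploded = pings.select(
          $"document_id",
          explode($"experiments").as(Seq("experiment_id", "experiment_branch")))

        exploded.show()
        // doc-1 appears twice (once per experiment), doc-2 once.

        spark.stop()
      }
    }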
Wesley Dawson [:whd]
Comment 4 • 8 years ago
A proposal for implementation without topology change follows.
1. On the DWL, move the custom experiments logic from the current s3 output to a "kafka exploder" output, which writes the exploded output to a "telemetry-cohorts" kafka topic.
We can probably go back to using the generic s3 output with this change. In addition to the explosion, the output should also set the Type to "telemetry.cohorts" or similar for hindsight routing purposes.
2. Add a kafka input to the same DWL for the "telemetry-cohorts" stream. This will re-inject "telemetry.cohorts" messages back into hindsight for additional processing.
3. Add a normal s3 output that matches only "telemetry.cohorts" messages to provide the output that was previously supplied with custom logic.
At this point we'd have "telemetry-cohorts" available in heka-framed s3 and as a kafka topic for e.g. spark streaming, with the minimum of infrastructure change.
Points: --- → 2
OS: Mac OS X → Unspecified
Priority: -- → P2
Hardware: x86 → Unspecified
Reporter
Comment 5 • 8 years ago
(In reply to Wesley Dawson [:whd] from comment #4)
I set up the Spark Streaming job to do the explosion, so for the moment we don't need this. Let's revisit this if another consumer comes up.
Status: NEW → RESOLVED
Closed: 8 years ago
Resolution: --- → WONTFIX
Updated • 3 years ago
Component: Pipeline Ingestion → General