Reprocessing and incremental processing architecture for reporting


Status

Product: Cloud Services
Component: Metrics: Pipeline
Priority: P2
Severity: normal
Status: RESOLVED WONTFIX
Opened: 3 years ago
Closed: a year ago

People

(Reporter: whd, Assigned: Katie Parlante, Mentored)

Tracking

Firefox Tracking Flags

(Not tracked)

Details

(Whiteboard: [unifiedTelemetry][40b9])

(Reporter)

Description

3 years ago
We want a way of running incremental reports, which we're calling "Reporting CEP" on https://mana.mozilla.org/wiki/display/CLOUDSERVICES/V2+Data+Pipeline. This is the "batch computing" portion of our data pipeline, which needs design. This bug is meant to be the starting point for discussion on architecture and implementation. Some considerations follow.

On the "old" system we always run reporting in real time, which has some disadvantages. When heka crashes without writing out its filter state, we lose all state accumulated since the last successful restart. Recovery requires a somewhat complex (and manual) cleanup procedure:

1. offline reprocess from a known state/time, using the "data warehouse" data
2. move the reprocessing to the live node and have it process the on-disk buffer (that has not been sent to the warehouse yet), until both live and backfill processes are processing the same data
3. stop the live and backfill processes
4. replace live state with backfill state
5. start the live process again
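The steps above can be sketched roughly as follows. This is an illustrative Python sketch, not the actual tooling: the filter state shape, the `process` stand-in, and the function name are all hypothetical, and stopping/restarting the live process (steps 3 and 5) is out of scope here.

```python
import json
import os
import tempfile

def recover_filter_state(warehouse_batches, live_buffer, live_state_path):
    """Sketch of the manual cleanup: reprocess from the data warehouse,
    catch up on the live node's on-disk buffer, then swap the rebuilt
    state in for the live process's lost state."""
    state = {"count": 0, "last_ts": 0}       # hypothetical filter state

    def process(msg, st):                    # stand-in for the filter logic
        st["count"] += 1
        st["last_ts"] = msg["ts"]

    # 1. offline reprocess from a known state/time, using warehouse data
    for msg in warehouse_batches:
        process(msg, state)

    # 2. process the on-disk buffer that has not reached the warehouse yet,
    #    so backfill and live are at the same stream position
    for msg in live_buffer:
        process(msg, state)

    # 4. replace the live state atomically (steps 3/5 happen around this)
    fd, tmp = tempfile.mkstemp(dir=os.path.dirname(live_state_path) or ".")
    with os.fdopen(fd, "w") as f:
        json.dump(state, f)
    os.replace(tmp, live_state_path)
    return state
```

The atomic `os.replace` stands in for step 4's state swap; in practice the live process must be stopped first, as the procedure notes.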

It is proposed that in the new pipeline, real-time CEP should be for monitoring/ad-hoc analysis only and thus backfill capability is not required. In addition, there may be resource constraints on running real-time monitoring/alerting alongside reporting (e.g. the "very large bloom", like loop recurring users) that we will avoid by having separate reporting infrastructure.

We need to design our filters to support the reporting use case. As we've seen, using os.time is rarely the right thing to do. I've had bad luck with cbuf deltas blowing out the plugin running backfill, as well as other "metrics output"-generating filters. I work around this by moving the contents of timer_event into process_message (typically a cbuf inject), though increasing the allowed memory or the ticker interval might also overcome these problems. Similarly, reconciling stream time vs. process time for timer events can be problematic. The goal is to make the transition from real-time analysis to report as smooth and easy as possible.
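To illustrate the stream-time vs. process-time point: a filter that buckets by the messages' own Timestamps produces identical windows whether it runs live or over a backfill, whereas one keyed on os.time does not. This is a minimal Python sketch, not Heka's API; the message shape and function name are hypothetical.

```python
def stream_time_windows(messages, window_secs=60):
    """Bucket message counts by the stream's own Timestamps ("stream
    time") rather than the wall clock ("process time"), so a backfill
    run over old data yields the same windows as a real-time run."""
    windows = {}
    for msg in messages:
        # Truncate the message timestamp down to its window boundary.
        bucket = msg["Timestamp"] // window_secs * window_secs
        windows[bucket] = windows.get(bucket, 0) + 1
    return windows
```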

We need to evaluate how this reporting architecture relates to data migration / reprocessing (which is itself a WIP), and in particular, the case of data showing up late or out of order and how that gets processed into our data warehouse and reports.

A few concrete things we need:

A master process that realizes reports from their definitions, e.g. by launching EC2 instances on the appropriate schedules with the right configuration to generate each report. This is similar to the current telemetry-analysis infrastructure. It is also the piece that will likely keep track of the time ranges / dimensions that have already been operated on in the incremental case.
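The incremental bookkeeping that master process would do might look like the following. This is a hypothetical sketch: `launch_worker` stands in for whatever actually starts an EC2 instance with the report's configuration, and the checkpoint store is an in-memory dict rather than durable storage.

```python
class ReportMaster:
    """Sketch of a master process that tracks, per report, the time
    range already processed, so each incremental run only covers the
    gap since the last checkpoint."""

    def __init__(self, launch_worker):
        self.launch_worker = launch_worker   # e.g. start an EC2 instance
        self.checkpoints = {}                # report name -> last processed ts

    def run_incremental(self, report, now):
        start = self.checkpoints.get(report["name"], report["epoch"])
        if start >= now:
            return None                      # nothing new to process
        self.launch_worker(report, start, now)
        self.checkpoints[report["name"]] = now
        return (start, now)
```

A real implementation would persist the checkpoints (the description suggests s3) so the master itself can recover from a crash.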

A mechanism and interface for creating / updating / modifying / re-running report definitions.

A specification for report definitions, properties of which might include:

   Filter code and filter configuration
   In the case of an incremental report: versioned lua state, stored and retrieved from some place (probably s3)
   Specification of time range / dimensions to operate on, possibly via a list of files or a checkpoint
      (we could likely use a hybrid of fine-grained message_matcher Timestamp and s3 splitfile input + filter message_matcher to achieve this)
   A schedule with which to run the report.
   A mechanism for taking the output of a report and sending it to the right place (s3 output for bespoke dashboards, Redshift, etc.). Presently this would be achieved with an additional heka output, but there are some subtleties here (you more or less want the output of the final timer event).
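Pulling the properties above together, a report definition might look something like this. All field names and values here are illustrative, not a settled schema; the "broken-session" name is borrowed from the triage notes below, and the paths are made up.

```python
# Hypothetical report definition covering the properties listed above.
REPORT_DEFINITION = {
    "name": "broken-session",
    "filter": {                      # filter code and configuration
        "code": "s3://pipeline-config/filters/broken_session.lua",
        "config": {"ticker_interval": 60},
    },
    "state": {                       # incremental case: versioned lua state
        "uri": "s3://pipeline-state/broken-session/",
        "version": 3,
    },
    "input": {                       # time range / dimensions to operate on
        "message_matcher": "Timestamp >= '2015-05-01' && Timestamp < '2015-05-02'",
        "s3_prefix": "telemetry-2/",
    },
    "schedule": "daily",             # schedule with which to run the report
    "output": {                      # where the final timer event output goes
        "type": "s3",
        "uri": "s3://pipeline-reports/broken-session/",
    },
}
```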

A convenient mechanism for converting ad-hoc analysis configuration to full report configuration would also be a useful thing to have.

Updated

3 years ago
Priority: -- → P2
Assignee: nobody → mreid
Mentor: trink
Priority: P2 → P1
(Assignee)

Updated

3 years ago
Blocks: 1142126

Updated

3 years ago
Assignee: mreid → mtrinkala

Comment 1

3 years ago
From triage, next steps: deploy telemetry-dash in new dev; trink to write a scheduled "broken-session" report.

Updated

3 years ago
See Also: → bug 1168412

Updated

3 years ago
Whiteboard: [unifiedTelemetry][b5]
Not being actively worked on, lowering to P2
Priority: P1 → P2

Updated

3 years ago
Assignee: mtrinkala → kparlante
Priority: P2 → P3

Updated

3 years ago
Whiteboard: [unifiedTelemetry][b5] → [unifiedTelemetry][40b9]

Updated

3 years ago
Iteration: --- → 43.1 - Aug 24
Priority: P3 → P2

Updated

3 years ago
Assignee: kparlante → nobody

Updated

3 years ago
Assignee: nobody → kparlante
(Assignee)

Comment 3

3 years ago
My next step: use Mark's instructions and examples to run a report.

Updated

a year ago
Status: NEW → RESOLVED
Last Resolved: a year ago
Resolution: --- → WONTFIX