Closed Bug 1146699 Opened 9 years ago Closed 7 years ago

Reprocessing and incremental processing architecture for reporting

Categories

(Cloud Services Graveyard :: Metrics: Pipeline, defect, P2)


Tracking

(Not tracked)

RESOLVED WONTFIX

People

(Reporter: whd, Assigned: kparlante, Mentored)

References

Details

(Whiteboard: [unifiedTelemetry][40b9])

We want a way of running incremental reports, which we're calling "Reporting CEP" on https://mana.mozilla.org/wiki/display/CLOUDSERVICES/V2+Data+Pipeline. This is the "batch computing" portion of our data pipeline, which needs design. This bug is meant to be the starting point for discussion on architecture and implementation. Some considerations follow.

On the "old" system we run reporting in real-time always, which has some disadvantages. When heka crashes without writing out its filter state, we lose all state since the last successful restart. This requires a somewhat complex (and manual) cleanup procedure:

1. offline reprocess from a known state/time, using the "data warehouse" data
2. move the reprocessing to the live node and have it process the on-disk buffer (that has not been sent to the warehouse yet), until both live and backfill processes are processing the same data
3. stop the live and backfill processes
4. replace live state with backfill state
5. start the live process again

It is proposed that in the new pipeline, real-time CEP should be used for monitoring and ad-hoc analysis only, so backfill capability would not be required there. In addition, there may be resource constraints on running real-time monitoring/alerting alongside reporting (e.g. a very large bloom filter, like the one tracking loop recurring users), which we will avoid by having separate reporting infrastructure.

We need to design our filters to support the reporting use case. As we've seen, using os.time() is rarely the right thing to do; aggregation should generally be keyed off the message Timestamp (stream time) rather than wall-clock time. I've had bad luck with cbuf deltas blowing out the plugin when running backfill, and with other filters that generate "metrics output". I work around this by duplicating the contents of timer_event inside process_message (typically a cbuf inject), though increasing the allowed memory or the ticker interval might also overcome these problems. Similarly, dealing with stream time vs. process time for timer events can be problematic. The idea is to make the transition from real-time analysis to scheduled report as smooth and easy as possible.
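
For illustration, a minimal sketch of the stream-time pattern as a Heka Lua sandbox filter, assuming the circular_buffer and inject_payload APIs from the Heka sandbox; the payload name, row count, and column are arbitrary placeholders:

    require "circular_buffer"

    -- One hour of per-minute rows; column 1 counts messages.
    local cb    = circular_buffer.new(60, 1, 60)
    local COUNT = cb:set_header(1, "Messages", "count")

    function process_message()
        -- Key aggregation off the message's own Timestamp (stream time, in
        -- nanoseconds), not os.time(), so backfill and real-time runs bucket
        -- the data identically.
        local ts = read_message("Timestamp")
        cb:add(ts, COUNT, 1)
        return 0
    end

    function timer_event(ns)
        -- ns is wall-clock process time; use it only to decide when to emit
        -- output, never for bucketing the data itself.
        inject_payload("cbuf", "Example Counts", cb)
    end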

We need to evaluate how this reporting architecture relates to data migration / reprocessing (which is itself a WIP), and in particular, the case of data showing up late or out of order and how that gets processed into our data warehouse and reports.

A few concrete things we need:

A master process that realizes reports from their definitions, e.g. by launching EC2 instances on the appropriate schedule with the right configuration to generate each report. This is similar to the current telemetry-analysis infrastructure. It is also the piece that will likely keep track of the time ranges / dimensions that have already been processed in the incremental case (see the checkpoint sketch below).
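
As a rough illustration of that incremental bookkeeping, a hypothetical checkpoint helper in plain Lua; the file path, window size, and matcher format are assumptions, not part of any existing system:

    -- Track the last processed time so each incremental run picks up where
    -- the previous one stopped.
    local CHECKPOINT = "/var/lib/reports/example_report.checkpoint"  -- hypothetical path
    local WINDOW_S   = 3600  -- one hour per incremental run

    local function read_checkpoint()
        local fh = io.open(CHECKPOINT, "r")
        if not fh then return nil end
        local ts = tonumber(fh:read("*l"))
        fh:close()
        return ts
    end

    local function write_checkpoint(ts)
        local fh = assert(io.open(CHECKPOINT, "w"))
        fh:write(string.format("%d\n", ts))
        fh:close()
    end

    -- Compute the next window (in seconds since epoch) and the corresponding
    -- message_matcher; pipeline Timestamps are nanoseconds, hence the nine
    -- trailing zeros in the matcher string.
    local function next_window(default_start_s)
        local start_s = read_checkpoint() or default_start_s
        local stop_s  = start_s + WINDOW_S
        local matcher = string.format(
            "Timestamp >= %d000000000 && Timestamp < %d000000000", start_s, stop_s)
        return start_s, stop_s, matcher
    end

    -- The master would call write_checkpoint(stop_s) only after a run succeeds.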

A mechanism and interface for creating / updating / modifying / re-running report definitions.

A specification for report definitions (a rough sketch follows the list), properties of which might include:

   Filter code and filter configuration
   In the case of an incremental report: versioned lua state, stored in and retrieved from some place (probably s3)
   Specification of the time range / dimensions to operate on, possibly via a list of files or a checkpoint
      (we could likely achieve this with a hybrid: an s3 splitfile input for coarse selection plus a fine-grained Timestamp message_matcher on the filter)
   A schedule with which to run the report.
   A mechanism for taking the output of a report and sending it to the right place (s3 output for bespoke dashboards, redshift etc.). Presently this would be achieved with an additional heka output, but there are some subtleties here (you more or less want the output of the final timer event).
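
To make the list above concrete, a rough sketch of what one report definition might contain, expressed as a Lua table; every field name, path, and value below is hypothetical:

    local report = {
        name            = "example_incremental_report",
        -- Filter code and configuration
        filter_source   = "s3://example-bucket/reports/example/filter.lua",
        filter_config   = { ticker_interval = 60, memory_limit = 256 * 1024 * 1024 },

        -- Versioned lua state for incremental runs, stored somewhere durable
        -- (probably s3) so a run can resume where the previous one stopped
        state_location  = "s3://example-bucket/reports/example/state/",
        state_version   = 3,

        -- Time range / dimensions to operate on: coarse s3 splitfile listing
        -- plus a fine-grained Timestamp message_matcher
        input_files     = "s3://example-bucket/data/2015-03-23/*",
        message_matcher = "Timestamp >= 1427068800000000000 && Timestamp < 1427155200000000000",

        -- Schedule and output destination
        schedule        = "daily",
        output          = { type = "s3", uri = "s3://example-bucket/reports/example/output/" },
    }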

A convenient mechanism for converting ad-hoc analysis configuration to full report configuration would also be a useful thing to have.
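
One way that conversion could look, sketched as a hypothetical Lua helper that fills in the reporting-only fields with defaults; nothing here corresponds to an existing API:

    local function to_report_definition(adhoc, defaults)
        local report = {}
        for k, v in pairs(adhoc) do report[k] = v end
        -- Fields an ad-hoc analysis configuration normally lacks
        report.schedule       = report.schedule or defaults.schedule or "daily"
        report.state_location = report.state_location or defaults.state_location
        report.output         = report.output or defaults.output
        return report
    end
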
Priority: -- → P2
Assignee: nobody → mreid
Mentor: trink
Priority: P2 → P1
Blocks: 1142126
Assignee: mreid → mtrinkala
From Triage:  Next steps: deploy telemetry-dash in new dev, trink to write a scheduled "broken-session" report.
See Also: → 1168412
Whiteboard: [unifiedTelemetry][b5]
Not being actively worked on, lowering to P2
Priority: P1 → P2
Assignee: mtrinkala → kparlante
Priority: P2 → P3
Whiteboard: [unifiedTelemetry][b5] → [unifiedTelemetry][40b9]
Iteration: --- → 43.1 - Aug 24
Priority: P3 → P2
Assignee: kparlante → nobody
Assignee: nobody → kparlante
My next step: use Mark's instructions and examples to run a report.
Status: NEW → RESOLVED
Closed: 7 years ago
Resolution: --- → WONTFIX
Product: Cloud Services → Cloud Services Graveyard