Closed Bug 1146699 Opened 9 years ago Closed 7 years ago

Reprocessing and incremental processing architecture for reporting

Categories

(Cloud Services Graveyard :: Metrics: Pipeline, defect, P2)


Tracking

(Not tracked)

RESOLVED WONTFIX

People

(Reporter: whd, Assigned: kparlante, Mentored)

References

Details

(Whiteboard: [unifiedTelemetry][40b9])

We want a way of running incremental reports, which we're calling "Reporting CEP" on https://mana.mozilla.org/wiki/display/CLOUDSERVICES/V2+Data+Pipeline. This is the "batch computing" portion of our data pipeline, which needs design. This bug is meant to be the starting point for discussion on architecture and implementation. Some considerations follow.

On the "old" system we run reporting in real-time always, which has some disadvantages. When heka crashes without writing out its filter state, we lose all state since the last successful restart. This requires a somewhat complex (and manual) cleanup procedure:

1. offline reprocess from a known state/time, using the "data warehouse" data
2. move the reprocessing to the live node and have it process the on-disk buffer (that has not been sent to the warehouse yet), until both live and backfill processes are processing the same data
3. stop the live and backfill processes
4. replace live state with backfill state
5. start the live process again

It is proposed that in the new pipeline, real-time CEP should be used for monitoring and ad-hoc analysis only, so backfill capability would not be required there. In addition, there may be resource constraints on running real-time monitoring/alerting alongside reporting (e.g. a very large bloom filter, like the one tracking loop recurring users), which we will avoid by having separate reporting infrastructure.

We need to design our filters to support the reporting use case. As we've seen, using os.time() is rarely the right thing to do; aggregation should generally be keyed off the message Timestamp (stream time) rather than wall-clock time. I've had bad luck with cbuf deltas blowing out the plugin when running backfill, and with other filters that generate "metrics output". I work around this by duplicating the contents of timer_event inside process_message (typically a cbuf inject), though increasing the allowed memory or the ticker interval might also overcome these problems. Similarly, dealing with stream time vs. process time for timer events can be problematic. The idea is to make the transition from real-time analysis to scheduled report as smooth and easy as possible.
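
For illustration, a minimal sketch of the stream-time pattern as a Heka Lua sandbox filter, assuming the circular_buffer and inject_payload APIs from the Heka sandbox; the payload name, row count, and column are arbitrary placeholders:

    require "circular_buffer"

    -- One hour of per-minute rows; column 1 counts messages.
    local cb    = circular_buffer.new(60, 1, 60)
    local COUNT = cb:set_header(1, "Messages", "count")

    function process_message()
        -- Key aggregation off the message's own Timestamp (stream time, in
        -- nanoseconds), not os.time(), so backfill and real-time runs bucket
        -- the data identically.
        local ts = read_message("Timestamp")
        cb:add(ts, COUNT, 1)
        return 0
    end

    function timer_event(ns)
        -- ns is wall-clock process time; use it only to decide when to emit
        -- output, never for bucketing the data itself.
        inject_payload("cbuf", "Example Counts", cb)
    end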

We need to evaluate how this reporting architecture relates to data migration / reprocessing (which is itself a WIP), and in particular, the case of data showing up late or out of order and how that gets processed into our data warehouse and reports.

A few concrete things we need:

A master process that realizes reports from their definitions, e.g. by launching EC2 instances on the appropriate schedule with the right configuration to generate each report. This is similar to the current telemetry-analysis infrastructure. It is also the piece that will likely keep track of the time ranges / dimensions that have already been processed in the incremental case (see the checkpoint sketch below).
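
As a rough illustration of that incremental bookkeeping, a hypothetical checkpoint helper in plain Lua; the file path, window size, and matcher format are assumptions, not part of any existing system:

    -- Track the last processed time so each incremental run picks up where
    -- the previous one stopped.
    local CHECKPOINT = "/var/lib/reports/example_report.checkpoint"  -- hypothetical path
    local WINDOW_S   = 3600  -- one hour per incremental run

    local function read_checkpoint()
        local fh = io.open(CHECKPOINT, "r")
        if not fh then return nil end
        local ts = tonumber(fh:read("*l"))
        fh:close()
        return ts
    end

    local function write_checkpoint(ts)
        local fh = assert(io.open(CHECKPOINT, "w"))
        fh:write(string.format("%d\n", ts))
        fh:close()
    end

    -- Compute the next window (in seconds since epoch) and the corresponding
    -- message_matcher; pipeline Timestamps are nanoseconds, hence the nine
    -- trailing zeros in the matcher string.
    local function next_window(default_start_s)
        local start_s = read_checkpoint() or default_start_s
        local stop_s  = start_s + WINDOW_S
        local matcher = string.format(
            "Timestamp >= %d000000000 && Timestamp < %d000000000", start_s, stop_s)
        return start_s, stop_s, matcher
    end

    -- The master would call write_checkpoint(stop_s) only after a run succeeds.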

A mechanism and interface for creating / updating / modifying / re-running report definitions.

A specification for report definitions (a rough sketch follows the list), properties of which might include:

   Filter code and filter configuration
   In the case of an incremental report: versioned lua state, stored in and retrieved from some place (probably s3)
   Specification of the time range / dimensions to operate on, possibly via a list of files or a checkpoint
      (we could likely achieve this with a hybrid: an s3 splitfile input for coarse selection plus a fine-grained Timestamp message_matcher on the filter)
   A schedule with which to run the report.
   A mechanism for taking the output of a report and sending it to the right place (s3 output for bespoke dashboards, redshift etc.). Presently this would be achieved with an additional heka output, but there are some subtleties here (you more or less want the output of the final timer event).
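
To make the list above concrete, a rough sketch of what one report definition might contain, expressed as a Lua table; every field name, path, and value below is hypothetical:

    local report = {
        name            = "example_incremental_report",
        -- Filter code and configuration
        filter_source   = "s3://example-bucket/reports/example/filter.lua",
        filter_config   = { ticker_interval = 60, memory_limit = 256 * 1024 * 1024 },

        -- Versioned lua state for incremental runs, stored somewhere durable
        -- (probably s3) so a run can resume where the previous one stopped
        state_location  = "s3://example-bucket/reports/example/state/",
        state_version   = 3,

        -- Time range / dimensions to operate on: coarse s3 splitfile listing
        -- plus a fine-grained Timestamp message_matcher
        input_files     = "s3://example-bucket/data/2015-03-23/*",
        message_matcher = "Timestamp >= 1427068800000000000 && Timestamp < 1427155200000000000",

        -- Schedule and output destination
        schedule        = "daily",
        output          = { type = "s3", uri = "s3://example-bucket/reports/example/output/" },
    }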

A convenient mechanism for converting ad-hoc analysis configuration to full report configuration would also be a useful thing to have.
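
One way that conversion could look, sketched as a hypothetical Lua helper that fills in the reporting-only fields with defaults; nothing here corresponds to an existing API:

    local function to_report_definition(adhoc, defaults)
        local report = {}
        for k, v in pairs(adhoc) do report[k] = v end
        -- Fields an ad-hoc analysis configuration normally lacks
        report.schedule       = report.schedule or defaults.schedule or "daily"
        report.state_location = report.state_location or defaults.state_location
        report.output         = report.output or defaults.output
        return report
    end
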
Priority: -- → P2
Assignee: nobody → mreid
Mentor: trink
Priority: P2 → P1
Blocks: 1142126
Assignee: mreid → mtrinkala
From Triage:  Next steps: deploy telemetry-dash in new dev, trink to write a scheduled "broken-session" report.
See Also: → 1168412
Whiteboard: [unifiedTelemetry][b5]
Not being actively worked on, lowering to P2
Priority: P1 → P2
Assignee: mtrinkala → kparlante
Priority: P2 → P3
Whiteboard: [unifiedTelemetry][b5] → [unifiedTelemetry][40b9]
Iteration: --- → 43.1 - Aug 24
Priority: P3 → P2
Assignee: kparlante → nobody
Assignee: nobody → kparlante
My next step: use Mark's instructions and examples to run a report.
Status: NEW → RESOLVED
Closed: 7 years ago
Resolution: --- → WONTFIX
Product: Cloud Services → Cloud Services Graveyard