1179751 - V2 aggregation pipeline is missing submissions

Reporter

Description

•

9 years ago

I started validating v4 saved_session aggregates against the v2 ones and I noticed the following so far:

- raw v2 and v4 saved_session pings match in terms of counts
- v2 aggregates are missing a substantial number of submissions, both in submission date and build-id based aggregates. As one can see in the evolution plot of [1], the number of submissions is not stable at all and drops near to 0 for some dates. Even dates that seem to have lots of submissions don't always match the raw counts. The last nightly version which seems to be OK is 39, see [2].

Mark, do you have any idea of what might be going on? We have had to "restart" our aggregation pipeline several times in the past months but the data loss seems to be too consistent for that alone to explain it.

[1] http://telemetry.mozilla.org/old/#filter=nightly%2F41%2FSIMPLE_MEASURES_UPTIME%2Fsaved_session%2FFirefox&aggregates=Submissions&evoOver=Time&locked=true&sanitize=true&renderhistogram=Graph
[2] http://telemetry.mozilla.org/old/#filter=nightly%2F39%2FSIMPLE_MEASURES_UPTIME%2Fsaved_session%2FFirefox&aggregates=Submissions&evoOver=Time&locked=true&sanitize=true&renderhistogram=Graph

Roberto Agostino Vitillo (:rvitillo)

Reporter

Updated

•

9 years ago

Flags: needinfo?(mreid)

Sam Penrose

Updated

•

9 years ago

Whiteboard: [data-validation]

Mark Reid [:mreid]

Assignee

Comment 1

•

9 years ago

I don't have a lot of insight into the aggregation code, but clearly you're right - something is not working properly.

It looks like the first 3 weeks of Nightly 40 are also OK (up to around April 15th).

I think we should set up a test case to run the old aggregation code against a single day's worth of nightly and try to debug this. Is that something Anthony can look into?

Flags: needinfo?(mreid) → needinfo?(rvitillo)

Roberto Agostino Vitillo (:rvitillo)

Reporter

Updated

•

9 years ago

Flags: needinfo?(azhang)

Roberto Agostino Vitillo (:rvitillo)

Reporter

Comment 2

•

9 years ago

Afaik Anthony tested the aggregator in isolation when he added the support for count histograms so maybe he could have a look at this.

Flags: needinfo?(rvitillo)

Katie Parlante

Updated

•

9 years ago

Whiteboard: [data-validation] → [data-validation][unifiedTelemetry]

Sam Penrose

Comment 3

•

9 years ago

Brendan notes that this issue doesn't affect the FHR side of unified telemetry, in case we should track differently.

Anthony Zhang [:azhang] (last day at Mozilla: 2016-04-29)

Comment 4

•

9 years ago

I ran the v2 aggregation pipeline locally for nightly 42, nightly 41, and release 39 data.

The intermediate results were verified using a Python script that independently counts the submissions for GC_MS at various points in the pipeline https://pastebin.mozilla.org/8839869.

I was not able to reproduce the submissions after all the above testing. For example, for release 39 on July 17th, we have 4811 submissions from one of the Node server's output files, and they virtually all make it safely to the end with 4810 submissions.

That means that it is likely that the submissions are getting dropped either at the Node server, or in transit between aggregator tasks and the master node (this part is excluded when testing locally).

Flags: needinfo?(azhang)

Thomas Huelbert

Updated

•

9 years ago

Iteration: --- → 43.1 - Aug 24

Vladan Djeric (:vladan)

Comment 5

•

9 years ago

Can we drop V2 and just switch to V4? If not, we need to staff this bug

Roberto Agostino Vitillo (:rvitillo)

Reporter

Comment 6

•

9 years ago

Let's drop V2. I already asked Anthony to make the v4 dashboards the default ones last week.

Roberto Agostino Vitillo (:rvitillo)

Reporter

Updated

•

9 years ago

Flags: needinfo?(mreid)

Vladan Djeric (:vladan)

Comment 7

•

9 years ago

We shouldn't prematurely delete the data already collected by the old pipeline (because the new pipeline doesn't have most of the data for older versions), but maybe we should stop feeding new pings into the old pipeline.

Any objections?

Flags: needinfo?(kparlante)

Flags: needinfo?(benjamin)

Benjamin Smedberg

Comment 8

•

9 years ago

No objections from me.

Flags: needinfo?(benjamin)

Katie Parlante

Comment 9

•

9 years ago

(In reply to Vladan Djeric (:vladan) -- please needinfo! from comment #7)
> We shouldn't prematurely delete the data already collected by the old
> pipeline (because the new pipeline doesn't have most of the data for older
> versions), but maybe we should stop feeding new pings into the old pipeline.

+1

Flags: needinfo?(kparlante)

Mark Reid [:mreid]

Assignee

Comment 10

•

9 years ago

Yes, let's stop updating the v2 aggregates.

Vladan, where would be the best place to send a notification for that in the hopes of reaching users of telemetry.js?  I'd rather give a bit of advance warning, so people have time to update any consumers of the v2 aggregates.

We should keep the historic v2 aggregate data in S3 until we have a sufficiently long v4-aggregate history that we don't need that data anymore.

Let's handle the "stop feeding data to the old pipeline" issue in a separate bug - there are still a few scheduled jobs that use the old data+pipeline.

Flags: needinfo?(mreid) → needinfo?(vdjeric)

Mark Reid [:mreid]

Assignee

Comment 11

•

9 years ago

I've filed Bug 1197823 for basically deprecating the entire v2 pipeline.

Vladan Djeric (:vladan)

Comment 12

•

9 years ago

I don't know how to find all the users of telemetry.js. Wasn't someone looking at adding logging to S3 for this purpose?

We could always add a lovely window.alert() box to telemetry.js v1 to warn anyone who pulls from the repo.

Roberto, are there any Mozilla services left on the old pipeline?

Flags: needinfo?(vdjeric)

Roberto Agostino Vitillo (:rvitillo)

Reporter

Comment 13

•

9 years ago

(In reply to Vladan Djeric (:vladan) -- please needinfo! from comment #12)

> We could always add a lovely window.alert() box to telemetry.js v1 to warn
> anyone who pulls from the repo.

That might work.

> Roberto, are there any Mozilla services left on the old pipeline?

We have still some scheduled map-reduce jobs running on the old pipeline that feed some of our dashboards.

Mark Reid [:mreid]

Assignee

Comment 14

•

9 years ago

(In reply to Roberto Agostino Vitillo (:rvitillo) from comment #13)
> We have still some scheduled map-reduce jobs running on the old pipeline
> that feed some of our dashboards.

These do not depend on the v2 aggregations - they can be addressed over in Bug 1197823 without blocking the decommissioning of the v2 aggregation code.

Mark Reid [:mreid]

Assignee

Comment 15

•

9 years ago

Vladan is going to take care of announcing the pending deprecation of v2 aggregates via:
- Blog post syndicated to planet mozilla
- Post to relevant newsgroups

I will announce same to @MozTelemetry.

We will target September 4th as the date to stop computing new v2 aggregates, unless there is outcry from the above comms channels.

Mark Reid [:mreid]

Assignee

Comment 16

•

9 years ago

https://twitter.com/MozTelemetry/status/636878871707652096

Vladan Djeric (:vladan)

Comment 17

•

9 years ago

We're going to stop updating v2 aggregates on Monday, September 14th

Vladan Djeric (:vladan)

Comment 18

•

9 years ago

I announced V2 stopping updates on Monday:

https://blog.mozilla.org/vdjeric/2015/09/08/update-your-custom-telemetry-dashes-telemetry-js-is-obsolete/
https://groups.google.com/d/msg/mozilla.dev.platform/3YZZ1azLKcY/bzZAQuijLwAJ

Katie Parlante

Updated

•

9 years ago

Whiteboard: [data-validation][unifiedTelemetry] → [unifiedTelemetry]

Mark Reid [:mreid]

Assignee

Comment 19

•

9 years ago

I have terminated the CloudFormation stack that runs the V2 aggregation.  For future reference, it used the CloudFormation template here:
https://github.com/mozilla/telemetry-server/blob/master/analysis/analysis-worker-stack.yaml

with the following parameters:
imageId       ami-4b939c7b
instanceType  m1.xlarge
keyName       mreid
maxWorkers    25
spotPrice     0.2

Assignee: nobody → mreid

Status: NEW → RESOLVED

Closed: 9 years ago

Resolution: --- → FIXED

BMO Automation

Updated

•

6 years ago

Product: Cloud Services → Cloud Services Graveyard