Bug 1179751 (Closed) - Opened 9 years ago, Closed 9 years ago

V2 aggregation pipeline is missing submissions

Categories

(Cloud Services Graveyard :: Metrics: Pipeline, defect, P1)


Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: rvitillo, Assigned: mreid)

Details

(Whiteboard: [unifiedTelemetry])

I started validating v4 saved_session aggregates against the v2 ones and I noticed the following so far:

- raw v2 and v4 saved_session pings match in terms of counts
- v2 aggregates are missing a substantial number of submissions, in both the submission-date and build-id based aggregates. As one can see in the evolution plot [1], the number of submissions is not stable at all and drops to nearly 0 on some dates. Even dates that seem to have lots of submissions don't always match the raw counts. The last nightly version that seems to be OK is 39, see [2]. A sketch of the comparison I'm running follows this list.
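
The comparison is essentially: for each submission date, take the raw ping count and the submission count reported by the v2 aggregates, and flag dates where the two diverge. A minimal sketch of that check (the input dicts and example values here are hypothetical, not the actual validation code):

    # Sketch of the v2-vs-raw validation: given two {date: submission_count}
    # maps -- one from raw saved_session pings, one from the v2 aggregates --
    # report dates where the aggregate count falls well short of the raw count.
    def find_missing_dates(raw_counts, aggregate_counts, tolerance=0.05):
        """Return [(date, raw, aggregate)] for dates where the v2 aggregate
        is missing more than `tolerance` of the raw submissions."""
        missing = []
        for date, raw in sorted(raw_counts.items()):
            agg = aggregate_counts.get(date, 0)
            if agg < raw * (1 - tolerance):
                missing.append((date, raw, agg))
        return missing

    if __name__ == "__main__":
        raw = {"2015-06-01": 4811, "2015-06-02": 4790}   # hypothetical counts
        agg = {"2015-06-01": 4810, "2015-06-02": 12}     # hypothetical counts
        for date, r, a in find_missing_dates(raw, agg):
            print("%s: raw=%d aggregate=%d" % (date, r, a))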

Mark, do you have any idea of what might be going on? We have had to "restart" our aggregation pipeline several times in the past few months, but the data loss seems too consistent for that alone to explain it.

[1] http://telemetry.mozilla.org/old/#filter=nightly%2F41%2FSIMPLE_MEASURES_UPTIME%2Fsaved_session%2FFirefox&aggregates=Submissions&evoOver=Time&locked=true&sanitize=true&renderhistogram=Graph
[2] http://telemetry.mozilla.org/old/#filter=nightly%2F39%2FSIMPLE_MEASURES_UPTIME%2Fsaved_session%2FFirefox&aggregates=Submissions&evoOver=Time&locked=true&sanitize=true&renderhistogram=Graph
Flags: needinfo?(mreid)
Whiteboard: [data-validation]
I don't have a lot of insight into the aggregation code, but clearly you're right - something is not working properly.

It looks like the first 3 weeks of Nightly 40 are also OK (up to around April 15th).

I think we should set up a test case to run the old aggregation code against a single day's worth of nightly data and try to debug this. Is that something Anthony can look into?
Flags: needinfo?(mreid) → needinfo?(rvitillo)
Flags: needinfo?(azhang)
AFAIK Anthony tested the aggregator in isolation when he added support for count histograms, so maybe he could have a look at this.
Flags: needinfo?(rvitillo)
Whiteboard: [data-validation] → [data-validation][unifiedTelemetry]
Brendan notes that this issue doesn't affect the FHR side of unified telemetry, in case that affects how we should track it.
I ran the v2 aggregation pipeline locally for nightly 42, nightly 41, and release 39 data.

The intermediate results were verified using a Python script that independently counts the submissions for GC_MS at various points in the pipeline (https://pastebin.mozilla.org/8839869).
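
The gist of the counting is roughly as follows (a sketch only; the linked pastebin is the authoritative script, and the assumption here that each output file holds one JSON ping per line with GC_MS nested under payload.histograms may not match the real v2 file layout):

    # Rough sketch of an independent GC_MS submission counter.
    # Assumes each input file holds one JSON-encoded ping per line;
    # a ping "counts" if it carries a GC_MS histogram.
    import json
    import sys

    def count_gc_ms_submissions(path):
        count = 0
        with open(path) as f:
            for line in f:
                try:
                    ping = json.loads(line)
                except ValueError:
                    continue  # skip malformed lines rather than aborting
                # the payload.histograms nesting is an assumption
                histograms = ping.get("payload", {}).get("histograms", {})
                if "GC_MS" in histograms:
                    count += 1
        return count

    if __name__ == "__main__":
        for path in sys.argv[1:]:
            print("%s: %d GC_MS submissions" % (path, count_gc_ms_submissions(path)))

Running this against the files at each stage of the pipeline makes it easy to see where a day's count first diverges.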

I was not able to reproduce the missing submissions in any of the above testing. For example, for release 39 on July 17th, we have 4811 submissions in one of the Node server's output files, and virtually all of them make it safely to the end of the pipeline, with 4810 remaining.

This suggests that the submissions are being dropped either at the Node server, or in transit between the aggregator tasks and the master node (a path that is excluded when testing locally).
Flags: needinfo?(azhang)
Iteration: --- → 43.1 - Aug 24
Can we drop V2 and just switch to V4? If not, we need to staff this bug.
Let's drop V2. I already asked Anthony to make the v4 dashboards the default ones last week.
Flags: needinfo?(mreid)
We shouldn't prematurely delete the data already collected by the old pipeline (because the new pipeline doesn't have most of the data for older versions), but maybe we should stop feeding new pings into the old pipeline.

Any objections?
Flags: needinfo?(kparlante)
Flags: needinfo?(benjamin)
No objections from me.
Flags: needinfo?(benjamin)
(In reply to Vladan Djeric (:vladan) -- please needinfo! from comment #7)
> We shouldn't prematurely delete the data already collected by the old
> pipeline (because the new pipeline doesn't have most of the data for older
> versions), but maybe we should stop feeding new pings into the old pipeline.

+1
Flags: needinfo?(kparlante)
Yes, let's stop updating the v2 aggregates.

Vladan, where would be the best place to send a notification about this, in the hopes of reaching users of telemetry.js? I'd rather give a bit of advance warning so people have time to update any consumers of the v2 aggregates.

We should keep the historic v2 aggregate data in S3 until we have a sufficiently long v4-aggregate history that we don't need that data anymore.

Let's handle the "stop feeding data to the old pipeline" issue in a separate bug - there are still a few scheduled jobs that use the old data+pipeline.
Flags: needinfo?(mreid) → needinfo?(vdjeric)
I've filed Bug 1197823 for basically deprecating the entire v2 pipeline.
I don't know how to find all the users of telemetry.js. Wasn't someone looking at adding logging to S3 for this purpose?
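
If we do go the logging route, S3 server access logging on the bucket serving the aggregates would cover it; something like the following boto3 sketch (the bucket names are placeholders, not our real infrastructure):

    # Sketch: enable S3 server access logging on the bucket serving the
    # v2 aggregates so we can see who is still fetching them.
    import boto3

    s3 = boto3.client("s3")
    s3.put_bucket_logging(
        Bucket="telemetry-v2-aggregates-example",      # placeholder name
        BucketLoggingStatus={
            "LoggingEnabled": {
                # the target bucket must grant write access to the
                # S3 log delivery group
                "TargetBucket": "telemetry-access-logs-example",
                "TargetPrefix": "v2-aggregates/",
            }
        },
    )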

We could always add a lovely window.alert() box to telemetry.js v1 to warn anyone who pulls from the repo.

Roberto, are there any Mozilla services left on the old pipeline?
Flags: needinfo?(vdjeric)
(In reply to Vladan Djeric (:vladan) -- please needinfo! from comment #12)

> We could always add a lovely window.alert() box to telemetry.js v1 to warn
> anyone who pulls from the repo.

That might work.

> Roberto, are there any Mozilla services left on the old pipeline?

We still have some scheduled map-reduce jobs running on the old pipeline that feed some of our dashboards.
(In reply to Roberto Agostino Vitillo (:rvitillo) from comment #13)
> We still have some scheduled map-reduce jobs running on the old pipeline
> that feed some of our dashboards.

These do not depend on the v2 aggregations - they can be addressed over in Bug 1197823 without blocking the decommissioning of the v2 aggregation code.
Vladan is going to take care of announcing the pending deprecation of v2 aggregates via:
- Blog post syndicated to planet mozilla
- Post to relevant newsgroups

I will announce the same on @MozTelemetry.

We will target September 4th as the date to stop computing new v2 aggregates, unless there is outcry from the above comms channels.
We're going to stop updating v2 aggregates on Monday, September 14th.
Whiteboard: [data-validation][unifiedTelemetry] → [unifiedTelemetry]
I have terminated the CloudFormation stack that ran the V2 aggregation. For future reference, it used the CloudFormation template here:
https://github.com/mozilla/telemetry-server/blob/master/analysis/analysis-worker-stack.yaml

with the following parameters:
imageId       ami-4b939c7b
instanceType  m1.xlarge
keyName       mreid
maxWorkers    25
spotPrice     0.2
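
For posterity, recreating that stack with the same parameters would look roughly like this (a boto3 sketch; the stack name is hypothetical, the parameter keys are taken from the values above, and this is not necessarily how the stack was originally launched):

    # Sketch: recreate the V2 aggregation stack from the template and
    # parameters recorded above, using boto3.
    import boto3

    cfn = boto3.client("cloudformation")
    with open("analysis-worker-stack.yaml") as f:
        template_body = f.read()

    cfn.create_stack(
        StackName="telemetry-v2-aggregation",  # hypothetical stack name
        TemplateBody=template_body,
        Parameters=[
            {"ParameterKey": "imageId", "ParameterValue": "ami-4b939c7b"},
            {"ParameterKey": "instanceType", "ParameterValue": "m1.xlarge"},
            {"ParameterKey": "keyName", "ParameterValue": "mreid"},
            {"ParameterKey": "maxWorkers", "ParameterValue": "25"},
            {"ParameterKey": "spotPrice", "ParameterValue": "0.2"},
        ],
    )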
Assignee: nobody → mreid
Status: NEW → RESOLVED
Closed: 9 years ago
Resolution: --- → FIXED
Product: Cloud Services → Cloud Services Graveyard