Closed
Bug 1179751
Opened 9 years ago
Closed 9 years ago
V2 aggregation pipeline is missing submissions
Categories
(Cloud Services Graveyard :: Metrics: Pipeline, defect, P1)
Cloud Services Graveyard
Metrics: Pipeline
Tracking
(Not tracked)
RESOLVED
FIXED
People
(Reporter: rvitillo, Assigned: mreid)
Details
(Whiteboard: [unifiedTelemetry])
I started validating the v4 saved_session aggregates against the v2 ones and I noticed the following so far:

- Raw v2 and v4 saved_session pings match in terms of counts.
- v2 aggregates are missing a substantial number of submissions, in both submission-date and build-id based aggregates.

As one can see in the evolution plot of [1], the number of submissions is not stable at all and drops close to 0 for some dates. Even dates that seem to have lots of submissions don't always match the raw counts. The last Nightly version that seems to be OK is 39; see [2].

Mark, do you have any idea of what might be going on? We have had to "restart" our aggregation pipeline several times in the past months, but the data loss seems to be too consistent for that alone to explain it.

[1] http://telemetry.mozilla.org/old/#filter=nightly%2F41%2FSIMPLE_MEASURES_UPTIME%2Fsaved_session%2FFirefox&aggregates=Submissions&evoOver=Time&locked=true&sanitize=true&renderhistogram=Graph
[2] http://telemetry.mozilla.org/old/#filter=nightly%2F39%2FSIMPLE_MEASURES_UPTIME%2Fsaved_session%2FFirefox&aggregates=Submissions&evoOver=Time&locked=true&sanitize=true&renderhistogram=Graph
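The per-date cross-check described above can be sketched as follows. This is a hypothetical illustration, not the actual validation code used for this bug: the dict-of-counts input shape and the 5% tolerance are assumptions.

```python
# Hypothetical sketch: flag dates where v2 aggregates report far fewer
# submissions than the counts derived from raw v4 pings. The input shape
# (date -> submission count) and the tolerance are assumptions.

def find_missing_dates(v2_counts, v4_counts, tolerance=0.05):
    """Return (date, v2_count, v4_count) tuples for suspicious dates."""
    suspect = []
    for date, v4_n in sorted(v4_counts.items()):
        v2_n = v2_counts.get(date, 0)
        # Flag the date if v2 is missing more than `tolerance` of the
        # submissions that the raw v4 pings say should be there.
        if v4_n > 0 and (v4_n - v2_n) / v4_n > tolerance:
            suspect.append((date, v2_n, v4_n))
    return suspect

# Example: v2 drops close to 0 on one date, as in the evolution plot.
v4 = {"20150701": 4811, "20150702": 4790, "20150703": 4805}
v2 = {"20150701": 4810, "20150702": 12, "20150703": 4801}
print(find_missing_dates(v2, v4))  # -> [('20150702', 12, 4790)]
```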
Reporter
Updated•9 years ago
Flags: needinfo?(mreid)
Updated•9 years ago
Whiteboard: [data-validation]
Assignee
Comment 1•9 years ago
I don't have a lot of insight into the aggregation code, but you're clearly right - something is not working properly. It looks like the first 3 weeks of Nightly 40 are also OK (up to around April 15th). I think we should set up a test case that runs the old aggregation code against a single day's worth of Nightly data and try to debug from there. Is that something Anthony can look into?
Flags: needinfo?(mreid) → needinfo?(rvitillo)
Reporter
Updated•9 years ago
Flags: needinfo?(azhang)
Reporter
Comment 2•9 years ago
AFAIK, Anthony tested the aggregator in isolation when he added support for count histograms, so maybe he could have a look at this.
Flags: needinfo?(rvitillo)
Updated•9 years ago
Whiteboard: [data-validation] → [data-validation][unifiedTelemetry]
Comment 3•9 years ago
Brendan notes that this issue doesn't affect the FHR side of unified telemetry, in case we should track differently.
I ran the v2 aggregation pipeline locally for nightly 42, nightly 41, and release 39 data. The intermediate results were verified using a Python script that independently counts the submissions for GC_MS at various points in the pipeline: https://pastebin.mozilla.org/8839869

I was not able to reproduce the missing submissions in any of the above testing. For example, for release 39 on July 17th, we have 4811 submissions from one of the Node server's output files, and virtually all of them make it safely to the end, with 4810 submissions. That suggests the submissions are getting dropped either at the Node server, or in transit between the aggregator tasks and the master node (this part is excluded when testing locally).
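An independent count at each pipeline stage, in the spirit of the Python script linked above (which is not reproduced here), could look like the following sketch. The record format (one JSON object per line with a "histograms" field) is an assumption, not the real intermediate format.

```python
# Hypothetical sketch: independently count submissions for one metric
# (e.g. GC_MS) in a pipeline stage's output, so counts can be compared
# across stages. The one-JSON-object-per-line record format is an
# assumption; the real script is at the pastebin link above.
import json

def count_submissions(lines, metric="GC_MS"):
    """Count records in a stage's output that contain the given metric."""
    n = 0
    for line in lines:
        try:
            record = json.loads(line)
        except ValueError:
            continue  # skip malformed lines rather than crash
        if metric in record.get("histograms", {}):
            n += 1
    return n

stage_output = [
    json.dumps({"histograms": {"GC_MS": [1, 2, 3]}}),
    json.dumps({"histograms": {"UPTIME": [4]}}),
    "not json",
]
print(count_submissions(stage_output))  # -> 1
```

Running the same counter over the Node server output and the aggregator output for the same day would localize where the drop happens.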
Flags: needinfo?(azhang)
Updated•9 years ago
Iteration: --- → 43.1 - Aug 24
Comment 5•9 years ago
Can we drop V2 and just switch to V4? If not, we need to staff this bug.
Reporter
Comment 6•9 years ago
Let's drop V2. I already asked Anthony last week to make the v4 dashboards the default ones.
Reporter
Updated•9 years ago
Flags: needinfo?(mreid)
Comment 7•9 years ago
We shouldn't prematurely delete the data already collected by the old pipeline (because the new pipeline doesn't have most of the data for older versions), but maybe we should stop feeding new pings into the old pipeline. Any objections?
Flags: needinfo?(kparlante)
Flags: needinfo?(benjamin)
Comment 9•9 years ago
(In reply to Vladan Djeric (:vladan) -- please needinfo! from comment #7)
> We shouldn't prematurely delete the data already collected by the old
> pipeline (because the new pipeline doesn't have most of the data for older
> versions), but maybe we should stop feeding new pings into the old pipeline.

+1
Flags: needinfo?(kparlante)
Assignee
Comment 10•9 years ago
Yes, let's stop updating the v2 aggregates. Vladan, where would be the best place to send a notification for that, in the hopes of reaching users of telemetry.js? I'd rather give a bit of advance warning, so people have time to update any consumers of the v2 aggregates.

We should keep the historic v2 aggregate data in S3 until we have a sufficiently long v4-aggregate history that we don't need that data anymore.

Let's handle the "stop feeding data to the old pipeline" issue in a separate bug - there are still a few scheduled jobs that use the old data+pipeline.
Flags: needinfo?(mreid) → needinfo?(vdjeric)
Assignee
Comment 11•9 years ago
I've filed Bug 1197823 to deprecate the entire v2 pipeline.
Comment 12•9 years ago
I don't know how to find all the users of telemetry.js. Wasn't someone looking at adding logging to S3 for this purpose? We could always add a lovely window.alert() box to telemetry.js v1 to warn anyone who pulls from the repo.

Roberto, are there any Mozilla services left on the old pipeline?
Flags: needinfo?(vdjeric)
Reporter
Comment 13•9 years ago
(In reply to Vladan Djeric (:vladan) -- please needinfo! from comment #12)
> We could always add a lovely window.alert() box to telemetry.js v1 to warn
> anyone who pulls from the repo.

That might work.

> Roberto, are there any Mozilla services left on the old pipeline?

We still have some scheduled map-reduce jobs running on the old pipeline that feed some of our dashboards.
Assignee
Comment 14•9 years ago
(In reply to Roberto Agostino Vitillo (:rvitillo) from comment #13)
> We have still some scheduled map-reduce jobs running on the old pipeline
> that feed some of our dashboards.

These do not depend on the v2 aggregations - they can be addressed over in Bug 1197823 without blocking the decommissioning of the v2 aggregation code.
Assignee
Comment 15•9 years ago
Vladan is going to take care of announcing the pending deprecation of v2 aggregates via:

- Blog post syndicated to Planet Mozilla
- Post to relevant newsgroups

I will announce the same to @MozTelemetry. We will target September 4th as the date to stop computing new v2 aggregates, unless there is an outcry through the above comms channels.
Assignee
Comment 16•9 years ago
https://twitter.com/MozTelemetry/status/636878871707652096
Comment 17•9 years ago
We're going to stop updating the v2 aggregates on Monday, September 14th.
Comment 18•9 years ago
I announced on Monday that V2 will stop updating:

https://blog.mozilla.org/vdjeric/2015/09/08/update-your-custom-telemetry-dashes-telemetry-js-is-obsolete/
https://groups.google.com/d/msg/mozilla.dev.platform/3YZZ1azLKcY/bzZAQuijLwAJ
Updated•9 years ago
Whiteboard: [data-validation][unifiedTelemetry] → [unifiedTelemetry]
Assignee
Comment 19•9 years ago
I have terminated the CloudFormation stack that runs the V2 aggregation. For future reference, it used the CloudFormation template here:

https://github.com/mozilla/telemetry-server/blob/master/analysis/analysis-worker-stack.yaml

with the following parameters:

imageId: ami-4b939c7b
instanceType: m1.xlarge
keyName: mreid
maxWorkers: 25
spotPrice: 0.2
Assignee: nobody → mreid
Status: NEW → RESOLVED
Closed: 9 years ago
Resolution: --- → FIXED
Updated•6 years ago
Product: Cloud Services → Cloud Services Graveyard