We should add a new dimension on S3 to discern between submissions that have extended telemetry enabled (environment.settings.telemetryEnabled). Without such a dimension, the aggregator job is going to take a very long time to run to completion once unified telemetry is enabled on release.
I propose that we store the "release" data in a separate prefix on S3, using a different set of dimensions than the pre-release data. This problem has been bugging me for a while: the set of dimensions we use is effective for filtering pre-release data, but on release, everything just ends up in a small handful of dimensions (basically, "latest build on latest version"). In other words, the version and build dimensions don't do much to actually filter the release channel. Adding extra dimensions that *do* split the release channel (OS, country, etc.) explodes the cardinality of the dimensions for the pre-release channels, so it's a lose-lose situation. If, instead, we store the release data in a different prefix on S3, we are free to use a different set of dimensions there, without impacting the pre-release dimensions, which already work effectively.

I propose the following for partitioning the release channel:
- submission date
- source name
- source version
- doc type
- app name
- update channel
- app version
- is extended? (t/f)
- bucketed os (win/mac/linux/other)
- bucketed country (for a set of N common country codes, others being bucketed into "Other")

Or, instead of OS/country, we could consider partitioning by "sampleId" to facilitate sampling on release. The downside is that it would increase complexity when querying across all channels. Thoughts?
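To make the proposal concrete, here is a minimal sketch of how the proposed dimensions could map to an S3 key prefix. The field names, bucketing sets, and prefix layout are assumptions for illustration, not the pipeline's actual code:

```python
# Hypothetical sketch of the proposed release-channel partitioning.
# Field names and the set of "common" countries are assumed, not real.
COMMON_COUNTRIES = {"US", "DE", "FR", "GB", "BR"}  # stand-in for the N common codes

def bucket_os(os_name):
    # Collapse the OS dimension into four coarse buckets.
    return os_name if os_name in ("win", "mac", "linux") else "other"

def bucket_country(code):
    # Keep common country codes, bucket the long tail into "Other".
    return code if code in COMMON_COUNTRIES else "Other"

def release_s3_prefix(ping):
    # Build the partition path from the proposed dimensions, in order.
    dims = [
        ping["submission_date"],
        ping["source_name"],
        ping["source_version"],
        ping["doc_type"],
        ping["app_name"],
        ping["update_channel"],
        ping["app_version"],
        "t" if ping["extended"] else "f",   # is extended? (t/f)
        bucket_os(ping["os"]),
        bucket_country(ping["country"]),
    ]
    return "/".join(dims)

prefix = release_s3_prefix({
    "submission_date": "20150901", "source_name": "telemetry",
    "source_version": "4", "doc_type": "main", "app_name": "Firefox",
    "update_channel": "release", "app_version": "40.0",
    "extended": True, "os": "win", "country": "US",
})
print(prefix)  # → 20150901/telemetry/4/main/Firefox/release/40.0/t/win/US
```

The bucketing functions are the key point: they cap the cardinality of the OS and country dimensions so the release prefix stays listable.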
As mentioned on IRC, I am generally ok with your proposal but I would prefer using sampleId instead of OS/Country as dimension. Someone from metrics should also comment on this as they tend to segment by country in their analyses.
That's correct, OS and country are quite common filters for us, and would probably be pretty useful.
Part 1, adding the extra fields and updating the code to support them: https://github.com/mozilla-services/data-pipeline/pull/133

Next up is to update the puppet config to partition on these new fields and do the necessary matching on the S3 output.
Part 2, adding the secondary S3 output, and matching on the "big" channels: https://github.com/mozilla-services/puppet-config/pull/1568
I've used sampleId in the PRs per Comment 2. Roberto, Sam (and others), do you think we should split all the data? or only data where appName=Firefox?
I am not aware of non-appName=Firefox needs, but that doesn't mean they don't exist.
I don't have a strong opinion about it. Splitting all the data would make it consistent though, so that's a plus.
I would prefer not to support two different partitioning schemes. As discussed on IRC, I propose to collect data with both partitioning schemes for the time being. We can then test our Spark libraries with the new scheme and, once we are satisfied with it, backfill a month or so of release data and finally move to the new scheme in production.
Ok, so the plan, as I understand it:
1. Continue storing all data using the current set of dimensions (set A).
2. Store a duplicate copy of release + beta using the new expanded set of dimensions (set B).
3. Change and test the Spark code to access the data across both partitioning schemes.
4. Stop storing release + beta data in set A.
5. Remove the release + beta data from set A accumulated during the "duplicate storage" period.
6. Backfill the release + beta stream in set B as needed.

Sound right?
It does, but steps 4 and 5 should be the last ones.
I've updated the puppet-config PR accordingly.
Over to Roberto for adding Spark support
Assignee: mreid → rvitillo
I've tested this in stage and it appears to work, so I'll be deploying it tomorrow. One change I'm going to make is to use separate working directories for the outputs (it works fine when using the same directory, this is just for operational clarity). s3://net-mozaws-stage-us-east-1-pipeline-data/ (accessible via old dev/stage) will have the data from the test until sometime next week.
As discussed on IRC, the telemetryEnabled dimension for release data on S3 should be either true or false. A missing setting implicitly has the default value, which in telemetryEnabled's case is false, so it shouldn't be set to UNKNOWN.
Ok, I spent most of today debugging this issue and I think I have found the cause. Here's the fix: https://github.com/mozilla-services/data-pipeline/pull/136. When we see false (the boolean), as in this case, we end up using it as a boolean value in an if expression instead of checking whether it was nil, so we subsequently weren't adding it to the Fields array. The cases where we do have data with telemetryEnabled set to false in S3 are when it is set to the string "false". I am not sure why some clients send data with "false" vs. false, but both should wind up going to the same place in S3. I assume the latter will be more common.
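The actual fix is in the pipeline's Lua code, but the pitfall is language-agnostic: treating a boolean field as truthy/falsy conflates "set to false" with "absent". A Python analogue of the bug and its fix (function and field names are hypothetical):

```python
# Python analogue of the telemetryEnabled bug. Names are hypothetical.
def add_field_buggy(fields, name, value):
    if value:  # BUG: this skips both None (missing) and False (legitimately set)
        fields[name] = value

def add_field_fixed(fields, name, value):
    if value is not None:  # only skip genuinely missing values
        fields[name] = value

buggy, fixed = {}, {}
add_field_buggy(buggy, "telemetryEnabled", False)
add_field_fixed(fixed, "telemetryEnabled", False)
print(buggy)  # → {}  (the False value was silently dropped)
print(fixed)  # → {'telemetryEnabled': False}
```

The string "false" is truthy, which is why only submissions carrying the string form made it into the telemetryEnabled=false partition before the fix.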
Thanks Wesley. Once your fix lands could you please make sure the data we already collected looks right on S3 so I can start populating the SimpleDB index?
The change has been deployed, so current submissions are being sent to the correct s3 location. I should have the release data backfilled before I sign off tonight (leaving the NI until then).
Update: the backfill script has been running for about an hour but is only about 10% done. I expect it will be done by afternoon Pacific.
Backfill complete, and should be ready to index.
The new release split is making operations that require working on all the data very slow, not surprisingly. Counting all pings for release on a single day with the new split takes about 3 times longer with Spark on a single machine, as the number of files for a day grew by about 100 times. And right now we are collecting only 5% of opt-out submissions, so that's only going to get slower. We are going to have to resize the Spark cluster that aggregates the telemetry submissions for t.m.o. The batch job that backfills S3 files into SimpleDB takes far too long in its current shape and is unusable. We are going to have to parallelize it.
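The flip side of the sampleId split is that sampled reads become cheap: a 1% sample of clients is a single partition listing instead of a scan over every file. A hypothetical sketch, assuming sampleId values 0-99 appear as a path component (the exact prefix layout here is an assumption):

```python
# Hypothetical: with sampleId (0-99) as a partition dimension, sampling
# release clients means listing a few prefixes rather than all files.
def sample_prefixes(base_prefix, sample_ids):
    # One S3 prefix per sampleId bucket; each bucket is ~1% of clients.
    return ["%s/sampleId=%d/" % (base_prefix, s) for s in sample_ids]

# Read roughly 5% of release clients by selecting five sampleId buckets:
prefixes = sample_prefixes("telemetry-release/20151201", range(5))
print(prefixes[0])  # → telemetry-release/20151201/sampleId=0/
```

Full-population jobs, on the other hand, now pay for 100x the file count, which is the slowdown described above.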
The updated lambda function was deployed yesterday and is working fine. I have sent a couple of PRs to deal with the new release split in the Spark API and to parallelize the batch job that updates the SimpleDB index.
The SDB index is not working properly with the "telemetry-release" stream due to too many files now, right? Can we use "get_records" to access these records until we figure out a more general solution to indexing S3?
(In reply to Mark Reid [:mreid] from comment #23)
> The SDB index is not working properly with the "telemetry-release" stream
> due to too many files now, right? Can we use "get_records" to access these
> records until we figure out a more general solution to indexing S3?

We have some scheduled jobs and existing analyses that depend on get_pings to fetch release data. We could call get_records within get_pings, but we would have to make sure that the performance hit is acceptable for common queries.
This is the next "bleeding" thing -- we'll focus on it when we come back from Orlando.
Priority: P1 → P2
This data set is being used to generate the longitudinal data set Roberto has been working on (See bug 1242039). Roberto, do we still need the SDB index for the longitudinal data set?
(In reply to Mark Reid [:mreid] from comment #26)
> This data set is being used to generate the longitudinal data set Roberto
> has been working on (See bug 1242039).
>
> Roberto, do we still need the SDB index for the longitudinal data set?

We don't.
Since this data set is now only being used to generate the longitudinal dataset, it will be revised per bug 1246137.
Status: NEW → RESOLVED
Last Resolved: 2 years ago
Resolution: --- → FIXED