We should add a new dimension on S3 to discern between submissions that have extended telemetry enabled (environment.settings.telemetryEnabled). Without such a dimension, the aggregator job is going to take a very long time to run to completion once unified telemetry is enabled on release.
I propose that we store the "release" data in a separate prefix on S3, using a different set of dimensions than the pre-release data. This problem has been bugging me for a while: the set of dimensions we use is effective for filtering pre-release data, but on release, everything just ends up in a small handful of dimensions (basically, "latest build on latest version"). In other words, the version and build dimensions don't do much to actually filter the release channel. Adding extra dimensions that *do* split the release channel (OS, country, etc.) explodes the cardinality of the dimensions for the pre-release channels, so it's a lose-lose situation. If, instead, we store the release data in a different prefix on S3, we are free to use a different set of dimensions there, without impacting the pre-release dimensions, which already work effectively.

I propose the following for partitioning the release channel:
- submission date
- source name
- source version
- doc type
- app name
- update channel
- app version
- is extended? (t/f)
- bucketed os (win/mac/linux/other)
- bucketed country (for a set of N common country codes, others being bucketed into "Other")

Or, instead of OS/country, we could consider partitioning by "sampleId" to facilitate sampling on release. The downside is that it would increase complexity when querying across all channels. Thoughts?
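To make the proposal concrete, here is a minimal sketch of how the proposed dimensions could map to an S3 key prefix. The field names, bucketing sets, and prefix layout are assumptions for illustration, not the pipeline's actual code:

```python
# Hypothetical sketch of the proposed release-channel partitioning.
# Field names and the set of "common" countries are assumed, not real.
COMMON_COUNTRIES = {"US", "DE", "FR", "GB", "BR"}  # stand-in for the N common codes

def bucket_os(os_name):
    # Collapse the OS dimension into four coarse buckets.
    return os_name if os_name in ("win", "mac", "linux") else "other"

def bucket_country(code):
    # Keep common country codes, bucket the long tail into "Other".
    return code if code in COMMON_COUNTRIES else "Other"

def release_s3_prefix(ping):
    # Build the partition path from the proposed dimensions, in order.
    dims = [
        ping["submission_date"],
        ping["source_name"],
        ping["source_version"],
        ping["doc_type"],
        ping["app_name"],
        ping["update_channel"],
        ping["app_version"],
        "t" if ping["extended"] else "f",   # is extended? (t/f)
        bucket_os(ping["os"]),
        bucket_country(ping["country"]),
    ]
    return "/".join(dims)

prefix = release_s3_prefix({
    "submission_date": "20150901", "source_name": "telemetry",
    "source_version": "4", "doc_type": "main", "app_name": "Firefox",
    "update_channel": "release", "app_version": "40.0",
    "extended": True, "os": "win", "country": "US",
})
print(prefix)  # → 20150901/telemetry/4/main/Firefox/release/40.0/t/win/US
```

The bucketing functions are the key point: they cap the cardinality of the OS and country dimensions so the release prefix stays listable.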
As mentioned on IRC, I am generally ok with your proposal but I would prefer using sampleId instead of OS/Country as dimension. Someone from metrics should also comment on this as they tend to segment by country in their analyses.
That's correct, OS and country are quite common filters for us, and would probably be pretty useful.
Part 1, adding the extra fields and updating the code to support them: https://github.com/mozilla-services/data-pipeline/pull/133

Next up is to update the puppet config to partition on these new fields and do the necessary matching on the S3 output.
Part 2, adding the secondary S3 output, and matching on the "big" channels: https://github.com/mozilla-services/puppet-config/pull/1568
I've used sampleId in the PRs per Comment 2. Roberto, Sam (and others), do you think we should split all the data? or only data where appName=Firefox?
I am not aware of non-appName=Firefox needs, but that doesn't mean they don't exist.
I don't have a strong opinion about it. Splitting all the data would make it consistent though, so that's a plus.
I would prefer not to support two different partitioning schemes. As discussed on IRC, I propose to collect data with both partitioning schemes for the time being. We can then test our Spark libraries with the new scheme and, once we are satisfied with it, backfill a month or so of release data and finally move to the new scheme in production.
Ok, so the plan, as I understand it:
1. Continue storing all data using the current set of dimensions (set A).
2. Store a duplicate copy of release + beta using the new expanded set of dimensions (set B).
3. Change and test the Spark code to access the data across both partitioning schemes.
4. Stop storing release + beta data in set A.
5. Remove the release + beta data from set A accumulated during the "duplicate storage" period.
6. Backfill the release + beta stream in set B as needed.

Sound right?
It does, but steps 4 and 5 should be the last ones.
I've updated the puppet-config PR accordingly.
Over to Roberto for adding Spark support
Assignee: mreid → rvitillo
I've tested this in stage and it appears to work, so I'll be deploying it tomorrow. One change I'm going to make is to use separate working directories for the outputs (it works fine when using the same directory, this is just for operational clarity). s3://net-mozaws-stage-us-east-1-pipeline-data/ (accessible via old dev/stage) will have the data from the test until sometime next week.
As discussed on IRC, the telemetryEnabled dimension for release data on S3 should be either true or false. A missing setting implicitly has the default value, which in telemetryEnabled's case is false, so it shouldn't be set to UNKNOWN.
Ok, I spent most of today debugging this issue and I think I have found the cause. Here's the fix: https://github.com/mozilla-services/data-pipeline/pull/136. When we see false (the boolean), as in this case, we end up using it as a boolean value in an if expression instead of checking whether it was nil, so we subsequently weren't adding it to the Fields array. The cases where we do have data with telemetryEnabled set to false in S3 are when it is set to the string "false". I am not sure why some clients send data with "false" vs. false, but both should wind up going to the same place in S3. I assume the latter will be more common.
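The actual fix is in the pipeline's Lua code, but the pitfall is language-agnostic: treating a boolean field as truthy/falsy conflates "set to false" with "absent". A Python analogue of the bug and its fix (function and field names are hypothetical):

```python
# Python analogue of the telemetryEnabled bug. Names are hypothetical.
def add_field_buggy(fields, name, value):
    if value:  # BUG: this skips both None (missing) and False (legitimately set)
        fields[name] = value

def add_field_fixed(fields, name, value):
    if value is not None:  # only skip genuinely missing values
        fields[name] = value

buggy, fixed = {}, {}
add_field_buggy(buggy, "telemetryEnabled", False)
add_field_fixed(fixed, "telemetryEnabled", False)
print(buggy)  # → {}  (the False value was silently dropped)
print(fixed)  # → {'telemetryEnabled': False}
```

The string "false" is truthy, which is why only submissions carrying the string form made it into the telemetryEnabled=false partition before the fix.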
Thanks Wesley. Once your fix lands could you please make sure the data we already collected looks right on S3 so I can start populating the SimpleDB index?
The change has been deployed, so current submissions are being sent to the correct s3 location. I should have the release data backfilled before I sign off tonight (leaving the NI until then).
Update: the backfill script has been running for about an hour but is only about 10% done. I expect it will be done by afternoon Pacific.
Backfill complete, and should be ready to index.
The new release split is making operations that require working on all the data very slow, not surprisingly. Counting all pings for release on a single day with the new split takes about 3 times longer with Spark on a single machine, as the number of files for a day grew by about 100 times. And right now we are collecting only 5% of opt-out submissions, so that's only going to get slower. We are going to have to resize the Spark cluster that aggregates the telemetry submissions for t.m.o. The batch job that backfills S3 files into SimpleDB takes far too long in its current shape and is unusable. We are going to have to parallelize it.
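The flip side of the sampleId split is that sampled reads become cheap: a 1% sample of clients is a single partition listing instead of a scan over every file. A hypothetical sketch, assuming sampleId values 0-99 appear as a path component (the exact prefix layout here is an assumption):

```python
# Hypothetical: with sampleId (0-99) as a partition dimension, sampling
# release clients means listing a few prefixes rather than all files.
def sample_prefixes(base_prefix, sample_ids):
    # One S3 prefix per sampleId bucket; each bucket is ~1% of clients.
    return ["%s/sampleId=%d/" % (base_prefix, s) for s in sample_ids]

# Read roughly 5% of release clients by selecting five sampleId buckets:
prefixes = sample_prefixes("telemetry-release/20151201", range(5))
print(prefixes[0])  # → telemetry-release/20151201/sampleId=0/
```

Full-population jobs, on the other hand, now pay for 100x the file count, which is the slowdown described above.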
The updated lambda function was deployed yesterday and is working fine. I have sent a couple of PRs to deal with the new release split in the Spark API and to parallelize the batch job that updates the SimpleDB index.
The SDB index is not working properly with the "telemetry-release" stream due to too many files now, right? Can we use "get_records" to access these records until we figure out a more general solution to indexing S3?
(In reply to Mark Reid [:mreid] from comment #23)
> The SDB index is not working properly with the "telemetry-release" stream
> due to too many files now, right? Can we use "get_records" to access these
> records until we figure out a more general solution to indexing S3?

We have some scheduled jobs and existing analyses that depend on get_pings to fetch release data. We could call get_records within get_pings, but we would have to make sure that the performance hit is acceptable for common queries.
This is the next "bleeding" thing -- we'll focus on it when we come back from Orlando.
Priority: P1 → P2
This data set is being used to generate the longitudinal data set Roberto has been working on (See bug 1242039). Roberto, do we still need the SDB index for the longitudinal data set?
(In reply to Mark Reid [:mreid] from comment #26)
> This data set is being used to generate the longitudinal data set Roberto
> has been working on (See bug 1242039).
>
> Roberto, do we still need the SDB index for the longitudinal data set?

We don't.
Since this data set is now only being used to generate the longitudinal dataset, it will be revised per bug 1246137.
Status: NEW → RESOLVED
Last Resolved: 2 years ago
Resolution: --- → FIXED